OpenIndex
We index the internet and publish it as Parquet. 17 datasets, 199K+ downloads, all queryable with DuckDB or load_dataset.
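As a rough sketch of what "queryable with DuckDB or load_dataset" can look like in practice, the snippet below builds an `hf://` Parquet glob for a dataset and shows both access paths. The organization namespace `open-index`, the `train` split name, and the helper names are assumptions made for illustration, not something this page confirms; check each dataset card for the real repo id and file layout.

```python
from itertools import islice

# ASSUMPTION: hypothetical Hub namespace; replace with the real organization id.
ORG = "open-index"


def repo_id(dataset: str, org: str = ORG) -> str:
    """Build a Hub repo id, e.g. 'open-index/open-arxiv'."""
    return f"{org}/{dataset}"


def parquet_glob(dataset: str, org: str = ORG) -> str:
    """Build an hf:// glob that matches every Parquet file in the repo."""
    return f"hf://datasets/{repo_id(dataset, org)}/**/*.parquet"


def count_rows_sql(dataset: str, org: str = ORG) -> str:
    """Build SQL that counts rows across all Parquet files of a dataset."""
    return f"SELECT count(*) AS n FROM read_parquet('{parquet_glob(dataset, org)}')"


def count_rows(dataset: str, org: str = ORG) -> int:
    """Run the count with DuckDB (needs network access and the httpfs extension)."""
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(count_rows_sql(dataset, org)).fetchone()[0]


def first_rows(dataset: str, n: int = 3, split: str = "train") -> list:
    """Stream the first n rows with load_dataset, without downloading the full dataset."""
    from datasets import load_dataset  # imported lazily, same reason as above

    # ASSUMPTION: the split is called 'train'; the dataset card may say otherwise.
    stream = load_dataset(repo_id(dataset), split=split, streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Only builds and prints the query; call count_rows() or first_rows() to hit the network.
    print(count_rows_sql("open-arxiv"))
```

Streaming is the safer default for the larger datasets listed below, since it reads rows lazily instead of materializing billions of records locally.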
Web
We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.
| Dataset | What's in it | Scale |
|---|---|---|
| open-markdown | Clean markdown from Common Crawl with URL, language, content metadata | Billions of pages |
| open-html | Same data but raw HTML, for when you need the original page source | Billions of pages |
| draft | Newest CC snapshot (CC-MAIN-2026-12) before it gets merged into open-markdown | Billions of pages |
Social
Archives of two of the internet's largest discussion communities, Hacker News and Reddit, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.
| Dataset | What's in it | Scale |
|---|---|---|
| hacker-news | Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min | 47M+ items |
| hacker-news-rss | RSS feeds discovered from links posted on HN | 623K feeds |
| arctic | The Arctic Shift Reddit archive. 6.5B comments, 1.9B submissions | 8.4B items |
Code
Full mirrors of the major package registries and GitHub's public event stream. If you want to study how open source actually works, start here.
| Dataset | What's in it | Scale |
|---|---|---|
| open-github | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
| open-github-issues | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
| open-npm | Every npm package with versions, dependencies, maintainers, download stats | 3.9M+ packages |
| open-pypi | Every PyPI package with releases, classifiers, dependencies, project URLs | 600K+ packages |
Academia
Structured dumps of the two biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.
| Dataset | What's in it | Scale |
|---|---|---|
| open-arxiv | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
| open-alex | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |
Knowledge
Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.
| Dataset | What's in it | Scale |
|---|---|---|
| open-wikipedia | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
| open-wikipedia-markdown | Same articles, converted to clean Markdown | All articles |
| open-wikipedia-text | Same articles, as plain text | All articles |
| open-library | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |
AI
The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.
| Dataset | What's in it | Scale |
|---|---|---|
| open-skills | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |