OpenIndex


We index the internet and publish it as Parquet. 17 datasets, 199K+ downloads, all queryable with DuckDB or load_dataset.
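As a rough sketch of the DuckDB route, the helpers below build a preview query over one dataset. They assume the datasets live under the `open-index` namespace on the Hugging Face Hub and that each repository stores `.parquet` files somewhere under its root; neither the file layout nor any column names are confirmed here, so the query selects every column. `parquet_glob`, `preview_query`, and `preview` are illustrative names, not part of any published API.

```python
def parquet_glob(dataset: str, org: str = "open-index") -> str:
    """Build an hf:// glob matching every Parquet file in a dataset repo.

    Assumption: the repo keeps *.parquet files somewhere under its root.
    """
    return f"hf://datasets/{org}/{dataset}/**/*.parquet"


def preview_query(dataset: str, limit: int = 5) -> str:
    """Return a SQL statement that previews the first rows of a dataset."""
    if limit < 1:
        raise ValueError("limit must be a positive integer")
    return (
        f"SELECT * FROM read_parquet('{parquet_glob(dataset)}') "
        f"LIMIT {int(limit)}"
    )


def preview(dataset: str, limit: int = 5):
    """Run the preview query. Needs `pip install duckdb` and network access."""
    import duckdb  # imported lazily so the helpers above work without it

    return duckdb.sql(preview_query(dataset, limit)).fetchall()


if __name__ == "__main__":
    # Print the SQL only; call preview("open-pypi") to actually fetch rows.
    print(preview_query("open-pypi"))
```

Running `DESCRIBE` on the same `read_parquet(...)` expression is a reasonable first step for discovering the real schema before writing filters.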

Web

We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

| Dataset | What's in it | Scale |
|---|---|---|
| open-markdown | Clean markdown from Common Crawl with URL, language, content metadata | Billions of pages |
| open-html | Same data but raw HTML, for when you need the original page source | Billions of pages |
| draft | Newest CC snapshot (CC-MAIN-2026-12) before it gets merged into open-markdown | Billions of pages |

Social

Archives of Hacker News and Reddit, two of the largest discussion communities on the internet, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

| Dataset | What's in it | Scale |
|---|---|---|
| hacker-news | Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min | 47M+ items |
| hacker-news-rss | RSS feeds discovered from links posted on HN | 623K feeds |
| arctic | The Arctic Shift Reddit archive. 6.5B comments, 1.9B submissions | 8.4B items |

Code

Full mirrors of the major package registries and GitHub's public event stream. If you want to study how open source actually works, start here.

| Dataset | What's in it | Scale |
|---|---|---|
| open-github | Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases | Continuous |
| open-github-issues | Issues, PRs, comments, reviews, commits for 17 major repos | 21.7M rows |
| open-npm | Every npm package with versions, dependencies, maintainers, download stats | 3.9M+ packages |
| open-pypi | Every PyPI package with releases, classifiers, dependencies, project URLs | 600K+ packages |
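For the `load_dataset` route, a minimal sketch of streaming a few rows from a registry mirror such as open-pypi might look like the following. It assumes the Hugging Face `datasets` library, the `open-index` Hub namespace, and a split named `train`; the split name and the absence of a required config name are guesses, and `repo_id`, `first_rows`, and `stream_rows` are hypothetical helper names.

```python
from itertools import islice
from typing import Any, Iterable, List


def repo_id(dataset: str, org: str = "open-index") -> str:
    """Join the organization namespace and dataset name into a Hub repo ID."""
    return f"{org}/{dataset}"


def first_rows(rows: Iterable[Any], n: int) -> List[Any]:
    """Take at most n items from any iterable without consuming the rest."""
    return list(islice(rows, n))


def stream_rows(dataset: str, n: int = 3, split: str = "train") -> List[Any]:
    """Stream the first n rows instead of downloading the whole dataset.

    Needs `pip install datasets` and network access.
    Assumption: the repo exposes a split called "train".
    """
    from datasets import load_dataset  # lazy import keeps the helpers light

    ds = load_dataset(repo_id(dataset), split=split, streaming=True)
    return first_rows(ds, n)


if __name__ == "__main__":
    # Show the repo ID only; call stream_rows("open-pypi") to fetch rows.
    print(repo_id("open-pypi"))
```

Streaming is the safer default for datasets of this size, since it reads rows on demand rather than materializing every Parquet file locally.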

Academia

Structured dumps of the two biggest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

| Dataset | What's in it | Scale |
|---|---|---|
| open-arxiv | Every arXiv paper since 1991 with abstracts, authors, categories, DOIs | 2.99M papers |
| open-alex | Full OpenAlex dump: works, authors, sources, institutions, topics, funders | 114M records |

Knowledge

Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

| Dataset | What's in it | Scale |
|---|---|---|
| open-wikipedia | Every Wikipedia article in original MediaWiki markup, all languages | All articles |
| open-wikipedia-markdown | Same articles, converted to clean Markdown | All articles |
| open-wikipedia-text | Same articles, as plain text | All articles |
| open-library | Full Open Library catalog: works, editions, authors, subjects, publishers | 150M+ records |

AI

The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

| Dataset | What's in it | Scale |
|---|---|---|
| open-skills | Agent skills with READMEs, install commands, security audits, weekly installs | 133K skills |
