OpenIndex



We index the internet and publish it as Parquet: 15 datasets, 199K+ downloads, all queryable with DuckDB or loadable with load_dataset from the Hugging Face datasets library.
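As a rough illustration of the two access paths, the sketch below streams a few rows with load_dataset and builds a DuckDB query over the same repository. The repository ID open-index/hacker-news appears in this organization's listing. Everything else is an assumption: the "train" split name, the idea that all Parquet files can be matched with one recursive glob, and the helper names, which are made up for this example. Check the dataset card for the real layout before relying on it.

```python
from itertools import islice


def preview_rows(repo_id: str, n: int = 5, split: str = "train") -> list:
    """Stream the first n rows of a Hub dataset without downloading all of it.

    Assumption: the dataset exposes a split called "train".
    """
    # Imported lazily so the pure helpers below work without the package.
    from datasets import load_dataset  # pip install datasets

    stream = load_dataset(repo_id, split=split, streaming=True)
    return list(islice(stream, n))


def hub_parquet_glob(repo_id: str) -> str:
    """Build an hf:// glob that matches every Parquet file in a dataset repo.

    Assumption: the repo stores its data as .parquet files at any depth.
    """
    return f"hf://datasets/{repo_id}/**/*.parquet"


def count_rows_sql(repo_id: str) -> str:
    """Return a DuckDB query that counts rows across the repo's Parquet files."""
    return f"SELECT count(*) AS n FROM read_parquet('{hub_parquet_glob(repo_id)}')"


def count_rows(repo_id: str) -> int:
    """Run the count query with DuckDB (needs network access to the Hub)."""
    import duckdb  # pip install duckdb

    return duckdb.sql(count_rows_sql(repo_id)).fetchone()[0]
```

Calling preview_rows("open-index/hacker-news") or count_rows("open-index/hacker-news") would hit the network. Streaming avoids pulling a full dataset to disk, which matters at the scales listed below.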

Web

We process every Common Crawl release and convert it to structured formats so you don't have to wrangle WARC files yourself.

Dataset	What's in it	Scale
open-markdown	Clean markdown from Common Crawl with URL, language, content metadata	Billions of pages

Social

Archives of Hacker News and Reddit, two of the largest online communities, continuously maintained. Good for training data, trend analysis, or just digging through 20 years of internet arguments.

Dataset	What's in it	Scale
hacker-news	Every HN story, comment, poll, and job post since 2006. Live-updated every 5 min	47M+ items
hacker-news-rss	RSS feeds discovered from links posted on HN	623K feeds
arctic	The Arctic Shift Reddit archive. 8.3B comments, 2.2B submissions	10.5B items

Code

Full mirrors of the npm and PyPI package registries, plus GitHub's public event stream. If you want to study how open source actually works, start here.

Dataset	What's in it	Scale
open-github	Every public GitHub event: pushes, PRs, issues, stars, forks, reviews, releases	Continuous
open-github-issues	Issues, PRs, comments, reviews, commits for 17 major repos	21.7M rows
open-npm	Every npm package with versions, dependencies, maintainers, download stats	35M+ rows
open-pypi	Every PyPI package with releases, classifiers, dependencies, project URLs	47M+ rows

Academia

Structured dumps of arXiv and OpenAlex, two of the largest open research databases. Useful for citation graphs, topic modeling, or finding who's working on what.

Dataset	What's in it	Scale
open-arxiv	Every arXiv paper since 1991 with abstracts, authors, categories, DOIs	2.99M papers
open-alex	Full OpenAlex dump: works, authors, sources, institutions, topics, funders	114M records

Knowledge

Wikipedia in three formats (pick the one that fits your pipeline) and the entire Open Library book catalog.

Dataset	What's in it	Scale
open-wikipedia	Every Wikipedia article in original MediaWiki markup, all languages	All articles
open-wikipedia-markdown	Same articles, converted to clean Markdown	All articles
open-wikipedia-text	Same articles, as plain text	All articles
open-library	Full Open Library catalog: works, editions, authors, subjects, publishers	150M+ records

AI

The agent ecosystem is growing fast. This is a snapshot of everything published on skills.sh.

Dataset	What's in it	Scale
open-skills	Agent skills with READMEs, install commands, security audits, weekly installs	133K skills

