OpenWebText and Wikipedia Dataset Torrents | Generated by AI


OpenWebText Dataset Torrents

OpenWebText is an open-source replication of OpenAI's WebText dataset: roughly 38GB of cleaned text extracted from web pages linked in Reddit submissions with at least 3 karma (a simple proxy for quality). It's a popular large-scale corpus for training language models. The full scraped text isn't always available as a single torrent, so here are reliable alternatives:

For the full raw text corpus, check the official site for direct downloads (not torrent-based), or regenerate it with the URL lists and scraping scripts in the OpenWebText GitHub repo. An enhanced version, OpenWebText2 (roughly 66GB of uncompressed text), is available from EleutherAI and is distributed as direct downloads rather than torrents.
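Once you have an extracted archive of plain-text documents, iterating over it might look like the sketch below. This is illustrative, not official tooling: it assumes a tar archive of UTF-8 text files, whereas the actual OpenWebText release nests .xz subset archives inside a top-level tarball, so a second extraction pass may be needed.

```python
import tarfile

def iter_documents(archive_path):
    """Yield (name, text) pairs from a tar archive of plain-text files.

    Assumes each member is a UTF-8 text document. The real OpenWebText
    release wraps .xz subset archives in a top-level tarball, so those
    members would need an extra lzma/tar extraction step first.
    """
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar:
            if not member.isfile():
                continue
            f = tar.extractfile(member)
            if f is None:
                continue
            yield member.name, f.read().decode("utf-8", errors="replace")
```

Streaming member-by-member keeps memory flat even for a ~38GB corpus, since only one document is decoded at a time.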

Wikipedia Dump Torrents

Wikipedia dumps are monthly XML exports of the entire database (articles, revisions, metadata). The English version is massive: roughly 22GB compressed for the current article text, and far larger for the full revision history. Torrents are community-maintained (unofficial, but verified against official checksums) and web-seeded from Wikimedia servers for reliability. Always verify downloads against the hashes published at dumps.wikimedia.org.
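Verifying a download amounts to hashing the file and comparing against the published per-file hash lists (e.g. enwiki-&lt;date&gt;-sha1sums.txt). A minimal sketch using only the standard library; the function name is illustrative:

```python
import hashlib

def sha1_of_file(path, chunk_size=1 << 20):
    """Stream a file through SHA-1 in 1MB chunks so multi-GB dumps
    never need to fit in memory. Compare the result against the
    matching line in the dump's published sha1sums file."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

Swap in hashlib.md5 if you are checking against an md5sums file instead; the streaming pattern is identical.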

The main hub for torrents is the Meta-Wiki Data Dump Torrents page, which lists the latest English Wikipedia dumps (e.g., enwiki-20251101). Here’s a summary of recent ones:

| Dump Date | File Type | Compressed Size | Torrent File | Notes |
|---|---|---|---|---|
| 2025-11-01 | Pages-Articles (XML, current revisions only) | ~22GB | enwiki-20251101-pages-articles-multistream.xml.bz2 | Multistream format; easiest for text extraction. |
| 2025-11-01 | Pages-Meta-History (XML, full revisions) | ~120GB | enwiki-20251101-pages-meta-history*.xml.bz2 | Includes all edits; split into numbered parts for easier handling. |
| 2025-10-01 | Pages-Articles (XML, current revisions only) | ~21GB | enwiki-20251001-pages-articles-multistream.xml.bz2 | Previous monthly dump; useful for historical comparison. |

For processing these into plain text, use tools like WikiExtractor (a Python script). Torrents reduce load on Wikimedia's servers and let you resume interrupted downloads; use a client such as qBittorrent. If you need dumps for other languages or subsets, the Meta-Wiki page has filters.
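The multistream format exists precisely so you don't have to decompress the whole dump: the companion -multistream-index.txt.bz2 file maps byte offsets to page IDs and titles (one `offset:page_id:title` line per page), and each offset marks the start of an independent bz2 stream. A minimal sketch of reading one stream with the standard library (the function name is an assumption, not part of any official tool):

```python
import bz2

def read_stream_at(dump_path, offset):
    """Decompress the single bz2 stream starting at `offset` inside a
    multistream dump, returning the raw XML fragment as bytes.
    Offsets come from the -multistream-index.txt.bz2 companion file."""
    decomp = bz2.BZ2Decompressor()
    out = bytearray()
    with open(dump_path, "rb") as f:
        f.seek(offset)
        # A BZ2Decompressor stops at the end of its stream and sets
        # .eof, so neighbouring streams in the file are left untouched.
        while not decomp.eof:
            chunk = f.read(1 << 16)
            if not chunk:
                break
            out.extend(decomp.decompress(chunk))
    return bytes(out)
```

Each stream holds a batch of `<page>` elements, so random access to any article costs one index lookup plus one small decompression instead of a full ~22GB pass.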
