English Wikipedia Download Guide

This guide walks through the directory listing on the official Wikimedia dump server (https://dumps.wikimedia.org/enwiki/latest/), which is also carried by several mirrors.
The listing contains every format in which the current English Wikipedia articles (as of early November 2025) can be downloaded.

Main formats and what they actually contain

| File name pattern | What it is | Approx. size | Best for training LLMs? | Notes |
| --- | --- | --- | --- | --- |
| enwiki-latest-pages-articles.xml.bz2 | One single giant file with the current revision of every article, plus templates, redirects and other auxiliary pages (no talk or user pages) | ~85–90 GB uncompressed | Yes, very commonly used | Easiest if you have the space and bandwidth |
| enwiki-latest-pages-articles1.xml-p1p41242.bz2 … up to … enwiki-latest-pages-articles27.xml-p…p….bz2 | The same data split into 27 smaller chunks by page-ID range (plain, non-multistream) | A few hundred MB to a few GB each compressed; still ~85–90 GB uncompressed in total | Yes | Lets you download in parallel and resume easily, but has no index files |
| enwiki-latest-pages-articles-multistreamN.xml-p…p….bz2 | The same 27 chunks in multistream format: each block of pages sits in its own bzip2 stream, so the index files below can point straight at it | A few hundred MB to a few GB each compressed | Yes, the most popular choice | These are the data files the quick-start commands below download |
| enwiki-latest-pages-articles-multistreamN.xml.bz2.md5 / .meta | Checksum and tiny metadata files | < 1 KB | Not needed for the text | Only for verifying downloads |
| enwiki-latest-pages-articles-multistream-indexN.txt-p…p….bz2 | Index files that map each page title and ID to the byte offset of its bzip2 stream in the matching multistream file | ~30–60 MB each compressed | Only needed for random access | Let you pull out single articles without decompressing everything; many processing scripts expect them |
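
The index files are what make the multistream format useful: each line has the form byte_offset:page_id:title, and the offset points at the start of an independently decompressible bzip2 stream inside the matching data file, so you can pull a single article out of a multi-gigabyte file without decompressing the rest. A minimal sketch against part 1 (the offset and page ID shown are illustrative):

# Look the page up in the per-part index (line format: byte_offset:page_id:title)
bzcat enwiki-latest-pages-articles-multistream-index1.txt-p1p41242.bz2 | grep ':Anarchism$'
# prints something like  654:12:Anarchism

# Decompress only the bzip2 stream starting at that offset (tail -c is 1-indexed, hence offset + 1)
tail -c +655 enwiki-latest-pages-articles-multistream1.xml-p1p41242.bz2 | bzcat 2>/dev/null | head -n 40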

Recommendation: What almost everyone uses to train models in 2025

The de-facto standard approach (this is the kind of Wikipedia data behind open model families such as LLaMA, Mistral, Qwen, DeepSeek, Gemma and Phi) is:

Download the 27 multistream data files + their 27 index files
(that is everything matching enwiki-latest-pages-articles-multistreamN.xml-p…p….bz2 plus the corresponding enwiki-latest-pages-articles-multistream-indexN.txt-p…p….bz2)

Or even simpler: just download the single giant file if your connection and disk can handle it:

enwiki-latest-pages-articles.xml.bz2    ← 24.6 GB compressed, ~85–90 GB uncompressed

Both contain exactly the same raw XML data (all main-namespace articles + a lot of extra pages).
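
Whichever route you take, it is worth checking the downloads against the checksum files mentioned in the table. A minimal sketch, assuming each data file has a neighbouring .md5 file as the table suggests (compare the two hashes by eye):

# Your locally computed hash should match the published one
md5sum enwiki-latest-pages-articles-multistream1.xml-p1p41242.bz2
cat enwiki-latest-pages-articles-multistream1.xml-p1p41242.bz2.md5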

Why not just grab one or two of the numbered part files (multistream3.xml-p…, multistream4.xml-p…, etc.)?

Each numbered part covers a fixed page-ID range, so on its own it contains only a slice of Wikipedia (roughly the articles created while that ID range was being assigned). The split exists to make parallel downloads and chunked processing easier, not to give you a usable subset: for a full training corpus you need all 27 parts, or the single combined file. The sketch below shows how the ranges are encoded in the file names.
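
If you are curious which page-ID range a given part covers, the range is encoded directly in the pXpY suffix of the file name. A minimal sketch that lists all ranges from the directory listing (each name appears twice in the HTML, as link target and link text, hence the sort -u):

# List the page-ID range covered by each multistream part
wget -q https://dumps.wikimedia.org/enwiki/latest/ -O - | \
  grep -oE 'enwiki-latest-pages-articles-multistream[0-9]+\.xml-p[0-9]+p[0-9]+\.bz2' | \
  sort -u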

Quick start commands (2025 dump)

Parallel download of the recommended 27+27 files (fastest):

# Get the list of multistream data files and download them, 10 at a time
wget -q https://dumps.wikimedia.org/enwiki/latest/ -O - | \
  grep -E 'href="enwiki-latest-pages-articles-multistream[0-9]+\.xml-p[0-9]+p[0-9]+\.bz2"' | \
  sed -e 's/.*href="//' -e 's/".*//' | \
  xargs -P10 -I{} wget https://dumps.wikimedia.org/enwiki/latest/{}

# Also download the matching per-part index files
wget -q https://dumps.wikimedia.org/enwiki/latest/ -O - | \
  grep -E 'href="enwiki-latest-pages-articles-multistream-index[0-9]+\.txt-p[0-9]+p[0-9]+\.bz2"' | \
  sed -e 's/.*href="//' -e 's/".*//' | \
  xargs -P10 -I{} wget https://dumps.wikimedia.org/enwiki/latest/{}
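
A quick sanity check once the transfers finish: every data file should have a matching index file, so the two counts below must be equal (the directory listing above determines the exact number).

# Count downloaded data files and index files; the numbers should match
ls enwiki-latest-pages-articles-multistream[0-9]*.xml-p*p*.bz2 | wc -l
ls enwiki-latest-pages-articles-multistream-index[0-9]*.txt-p*p*.bz2 | wc -l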

Or just the single file (simplest):

wget https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2

After that you normally strip the MediaWiki markup with a tool such as WikiExtractor or mwparserfromhell before tokenizing; a sketch follows.
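
A minimal sketch using the third-party wikiextractor package (flag names are taken from its README and may change between versions, so check --help):

# Turn the XML dump into directories of JSON lines, one object per article
pip install wikiextractor
python -m wikiextractor.WikiExtractor enwiki-latest-pages-articles.xml.bz2 \
  --json --processes 8 -o extracted/
# Output lands in extracted/AA/wiki_00, extracted/AA/wiki_01, … with "title", "url" and "text" fields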

So in short:
Download either the single ~25 GB enwiki-latest-pages-articles.xml.bz2 or the 27 multistream parts plus their index files. Everything else in that listing is checksums, tiny metadata, or auxiliary files you don't need for training.

