Wikipedia Dump File Structure Differences


The key difference is how the compressed data inside each .xml.bz2 file is organized: a non-multistream file compresses the entire XML as one long bzip2 stream, while a multistream file packs many small bzip2 streams (each holding a batch of pages) back to back in a single file, with a companion index file mapping each article to the offset of the stream that contains it.

1. Non-multistream dumps (older style, or the default “pages-articles” files)

Here the whole XML document is compressed as a single bzip2 stream, so the only way to read anything is to decompress from the beginning.

Example filename:
enwiki-20251101-pages-articles1.xml-p1p41242.bz2
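
Because everything sits in one stream, the natural way to consume this file is a single streaming pass over the decompressed XML. A minimal sketch in Python, assuming the file above is in the current directory (the filename and the print-the-title logic are placeholders):

import bz2
import xml.etree.ElementTree as ET

DUMP = "enwiki-20251101-pages-articles1.xml-p1p41242.bz2"  # assumed local copy

# bz2.open decompresses on the fly; iterparse keeps memory flat by letting us
# discard each <page> element as soon as we are done with it.
with bz2.open(DUMP, "rb") as f:
    for event, elem in ET.iterparse(f, events=("end",)):
        if elem.tag.split("}")[-1] == "page":   # tags are namespaced in the dump
            print(elem.findtext("{*}title"))    # "{*}" = any namespace (Python 3.8+)
            elem.clear()                        # free the subtree we just handled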

2. Multistream dumps (the files that contain “multistream” in the name)

Here the same XML is split into many small bzip2 streams (each holding a batch of pages, typically 100) that are concatenated into one file. A companion index file records, for every article, the byte offset of the stream that contains it, which is what makes fast single-article extraction possible.

Example filename (the one you linked):
enwiki-20251101-pages-articles-multistream1.xml-p1p41242.bz2
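
The extra streams exist to enable random access: look an article up in the companion index, seek to the byte offset of the stream that holds it, and decompress only that stream. A minimal sketch, assuming both files are local, that the index uses the usual offset:page_id:title line format, and that the index filename below (inferred from the naming pattern) is correct:

import bz2

DUMP  = "enwiki-20251101-pages-articles-multistream1.xml-p1p41242.bz2"
INDEX = "enwiki-20251101-pages-articles-multistream-index1.txt-p1p41242.bz2"  # assumed name
TITLE = "Anarchism"  # placeholder; any title within this page range

# 1. Find the byte offset of the bzip2 stream that contains the target title.
offset = None
with bz2.open(INDEX, "rt", encoding="utf-8") as idx:
    for line in idx:
        off, _page_id, title = line.rstrip("\n").split(":", 2)
        if title == TITLE:
            offset = int(off)
            break

# 2. Seek there and decompress only that one stream (a batch of pages),
#    instead of the whole multi-gigabyte file.
if offset is not None:
    with open(DUMP, "rb") as f:
        f.seek(offset)
        dec = bz2.BZ2Decompressor()
        chunks = []
        while not dec.eof:
            block = f.read(64 * 1024)
            if not block:
                break
            chunks.append(dec.decompress(block))
        xml_fragment = b"".join(chunks).decode("utf-8")
        print(xml_fragment[:500])  # a run of <page> elements, not a full XML document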

Why does this matter for wget -c?

In practice, both commands:

wget -c https://.../enwiki-20251101-pages-articles1.xml-p1p41242.bz2
wget -c https://.../enwiki-20251101-pages-articles-multistream1.xml-p1p41242.bz2

behave exactly the same from wget’s point of view: each is a single file served over HTTP, so a resumable download with -c (which relies on HTTP Range requests) works equally well for both.
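
If you are curious why resuming works, it is because the dump server honours HTTP Range requests, which is all wget -c relies on. A rough standard-library sketch of the same mechanism (the full URL below is a placeholder built from the usual dumps.wikimedia.org layout; substitute the exact link you are downloading):

import os
import urllib.request

URL = "https://dumps.wikimedia.org/enwiki/20251101/enwiki-20251101-pages-articles-multistream1.xml-p1p41242.bz2"  # placeholder URL
OUT = URL.rsplit("/", 1)[-1]

# Resume from however many bytes we already have on disk.
start = os.path.getsize(OUT) if os.path.exists(OUT) else 0
headers = {"Range": f"bytes={start}-"} if start else {}
req = urllib.request.Request(URL, headers=headers)

with urllib.request.urlopen(req) as resp, open(OUT, "ab") as out:
    if start and resp.status != 206:  # 206 = server honoured the Range header
        out.truncate(0)               # it did not, so start from scratch
    while True:
        chunk = resp.read(1 << 20)
        if not chunk:
            break
        out.write(chunk)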

The difference only appears when you start processing the downloaded file.
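
Even then, a plain start-to-finish pass barely notices it: Python’s bz2 module (3.3 and later) transparently reads a file made of several concatenated bzip2 streams, so the same decompress-and-parse code handles both downloads, and the multistream layout only pays off once you use the index for random access (as in the sketch in section 2 above). A quick sketch with assumed local filenames:

import bz2

FILES = (
    "enwiki-20251101-pages-articles1.xml-p1p41242.bz2",              # one stream
    "enwiki-20251101-pages-articles-multistream1.xml-p1p41242.bz2",  # many streams
)

for name in FILES:
    with bz2.open(name, "rt", encoding="utf-8") as f:
        # Print the first few decompressed XML lines; identical logic works
        # for both layouts when reading sequentially from the start.
        for _ in range(3):
            print(name, "|", f.readline().rstrip())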

Summary Table

| Feature | Non-multistream (regular) | Multistream |
| --- | --- | --- |
| Filename contains | -pages-articlesN.xml-...bz2 | -pages-articles-multistreamN.xml-...bz2 |
| Internal layout | Entire XML compressed as one bzip2 stream | Many small bzip2 streams (a batch of pages each) concatenated into one file |
| Article text location | Inside the XML: <page><revision><text>...</text></revision></page> | The same XML markup, just split across the separate streams |
| File size (same page range) | Slightly smaller | Slightly larger (each stream is compressed on its own) |
| Random access / single-page extraction | Very slow (must decompress from the start) | Very fast (seek to an offset from the index file) |
| Tools for reading it | Any streaming XML parser (SAX, iterparse, mwxml, ...) after decompression | The same parsers for a full pass, plus index-aware scripts for single-page lookup |
| Needs companion index file? | No | Yes (-multistream-index....txt.bz2) |
| wget -c resumability | Perfect | Perfect (still a single file over HTTP) |

So which one should you download?

You can resume either file with wget -c without any problem; the “multistream” part only changes what is inside the file, not how HTTP or wget sees it. If you just want to read every article once from start to finish, either file will do; if you want to pull out individual articles quickly, take the multistream file and download its companion index as well.

