# Downloading 100B Tokens of FineWeb-Edu for a Small-Scale GPT-3 Ablation
A walkthrough of `scripts/download/plan_and_download_fineweb_gpt3.py` — how it plans, downloads, and resumes a ~400 GB pull of FineWeb-Edu from China without a VPN.
## Why 100B tokens
100B tokens is the per-ablation token budget the GPT-3 paper reports in Table 2.1 — large enough to train interesting models, small enough to fit on one workstation. Chinchilla scaling (~20 tokens per parameter) makes 100B tokens compute-optimal for a ~5B-parameter model. A 1.3B-parameter Chinchilla-optimal run would want only ~26B tokens, so 100B is comfortably over-budget for that range, which is what makes it a useful ablation slice.
Cheatsheet (English BPE, ~4 bytes/token):
| budget | shards | disk | notes |
|---|---|---|---|
| 10B | ~20 | ~40 GB | GPT-2-class |
| 100B | ~200 | ~400 GB | GPT-3 ablation (this script) |
| 300B | ~600 | ~1.2 TB | GPT-3 paper’s full training set |
| 1T | ~2000 | ~4 TB | Chinchilla-optimal for ~50B params |
| 1.3T | full | ~5.4 TB | Full FineWeb-Edu |
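
The whole table comes from one heuristic (tokens × bytes-per-token, split into ~2 GB shards). Here is a throwaway sketch that re-derives it; the 4.0 bytes/token and the 2 GB shard size are this guide's assumptions, not measured values:

```python
# Re-derive the cheatsheet rows from the same heuristic the script uses.
# BYTES_PER_TOKEN and SHARD_BYTES are assumptions, not measurements.
BYTES_PER_TOKEN = 4.0   # rough average for English BPE text
SHARD_BYTES = 2e9       # one FineWeb-Edu parquet shard is ~2 GB

def plan_size(target_tokens: float) -> tuple[float, int]:
    total_bytes = target_tokens * BYTES_PER_TOKEN
    return total_bytes / 1e9, round(total_bytes / SHARD_BYTES)

for budget in (10e9, 100e9, 300e9, 1e12):
    gb, shards = plan_size(budget)
    print(f"{budget / 1e9:>6.0f}B tokens ≈ {shards:>4d} shards ≈ {gb:>6.0f} GB")
```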
## Why FineWeb-Edu
FineWeb-Edu filters FineWeb by an “educational value” classifier. It’s smaller and cleaner than raw FineWeb — each training token buys more signal on the benchmarks (MMLU, ARC, HellaSwag) that ablation experiments usually report.
If you want raw web text instead, swap `REPO_ID` to `HuggingFaceFW/fineweb`.
## Why hf-mirror.com
Direct access to huggingface.co is unreliable from mainland China. hf-mirror.com is a community mirror that serves the same parquet shards over CDN endpoints that aren’t blocked. The script:
- Sets `HF_ENDPOINT=https://hf-mirror.com` before importing `huggingface_hub`, so the `HfApi().repo_info()` listing call also goes through the mirror — not just the parquet downloads.
- Hardcodes `BASE_URL` for the actual file fetches.
If you’re outside China, the same script works against the mirror — it just won’t be faster than the official endpoint.
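
The import-order detail is the part that is easy to get wrong, so here is a minimal sketch of just that ordering. The `repo_info` call and its fields are standard `huggingface_hub`; the variable names are illustrative, not lifted from the script:

```python
import os

# Must be set before huggingface_hub is imported, otherwise the listing call
# still resolves against huggingface.co even though file downloads hit the mirror.
os.environ["HF_ENDPOINT"] = "https://hf-mirror.com"

from huggingface_hub import HfApi  # noqa: E402 (import deliberately after the env var)

info = HfApi().repo_info(
    "HuggingFaceFW/fineweb-edu", repo_type="dataset", files_metadata=True
)
parquet = [s for s in info.siblings if s.rfilename.endswith(".parquet")]
print(f"listed {len(parquet)} parquet shards via the mirror")
```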
## The architecture
Three phases, decoupled:
```text
list_parquet_shards()  →  select_shards()  →  download_one() per shard
    (HfApi, once)         (token budget)        (resumable)
```
State lives in `<output-dir>/progress.json`. The first run writes it; later runs load it and skip the HF API call entirely.
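
A sketch of the middle phase, the budget-driven selection, under the same assumptions as the cheatsheet. Each shard is a dict with `path` and `size` in bytes, matching the `progress.json` entries below; the real `select_shards()` may differ in signature and tie-breaking:

```python
def select_shards(shards, target_tokens, bytes_per_token=4.0):
    """Greedily take shards in dump-sorted order until the byte budget is met."""
    budget_bytes = target_tokens * bytes_per_token
    picked, total = [], 0
    for shard in sorted(shards, key=lambda s: s["path"]):  # deterministic ordering
        if total >= budget_bytes:
            break
        picked.append({**shard, "status": "pending"})
        total += shard["size"]
    return picked

# Fake listing: 300 shards of ~2 GB each, mirroring the progress.json paths.
listing = [
    {"path": f"data/CC-MAIN-2013-20/000_{i:05d}.parquet", "size": 2 * 1024**3}
    for i in range(300)
]
plan = select_shards(listing, target_tokens=100e9)
print(len(plan), "shards,", round(sum(s["size"] for s in plan) / 1e9), "GB")
```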
### `progress.json` shape
```json
{
  "repo_id": "HuggingFaceFW/fineweb-edu",
  "target_tokens": 100000000000,
  "bytes_per_token": 4.0,
  "shards": [
    {"path": "data/CC-MAIN-2013-20/000_00000.parquet", "size": 2147483648,
     "dump": "CC-MAIN-2013-20", "status": "done"},
    {"path": "data/CC-MAIN-2013-20/000_00001.parquet", "size": 2147483648,
     "dump": "CC-MAIN-2013-20", "status": "pending"}
  ]
}
```
Updated atomically (write to a temp file, then `os.replace`) after every shard finishes, so a SIGKILL or power cut never leaves it half-written.
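
A minimal sketch of that write pattern; the helper name and the `indent` choice are illustrative. The guarantee comes from writing to a temporary file in the same directory and then calling `os.replace`, which is atomic on POSIX and Windows:

```python
import json
import os
import tempfile

def write_progress_atomically(path: str, state: dict) -> None:
    """Never leave progress.json half-written, even across SIGKILL or power loss."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f, indent=2)
            f.flush()
            os.fsync(f.fileno())   # make sure the bytes hit disk before the rename
        os.replace(tmp_path, path)  # atomic swap: readers see old or new, never partial
    except BaseException:
        os.unlink(tmp_path)
        raise
```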
## How resumption actually works
Two independent layers:
- Per-shard (`.part` file + HTTP `Range: bytes=N-`): if you Ctrl-C while a 2 GB shard is half downloaded, the partial bytes stay in `shard.parquet.part`. The next run sees the `.part`, opens it in append mode, and asks the server for the rest.
- Across shards (`progress.json` status field): finished shards are marked `done`. The next run skips them in O(1) — no `os.path.exists` walk, no re-listing from HF.
Together: you can kill the process at any moment and lose at most the bytes buffered in memory (< 1 MB).
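
A sketch of the per-shard layer using `requests`; the real script may use a different HTTP client. The `.part` naming follows this guide, and the fallback for servers that ignore `Range` is an added safety check, not something the source describes:

```python
import os
import requests

def download_resumable(url: str, dest: str, chunk_size: int = 1 << 20) -> None:
    """Append to dest + '.part' and ask the server only for the missing bytes."""
    part = dest + ".part"
    offset = os.path.getsize(part) if os.path.exists(part) else 0
    headers = {"Range": f"bytes={offset}-"} if offset else {}

    with requests.get(url, headers=headers, stream=True, timeout=60) as r:
        if offset and r.status_code != 206:  # server ignored Range: restart from zero
            offset = 0
        r.raise_for_status()
        with open(part, "ab" if offset else "wb") as f:
            for chunk in r.iter_content(chunk_size):
                f.write(chunk)

    os.replace(part, dest)  # rename only once the shard is complete
```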
## Usage
```bash
# Dry run — list shards, write progress.json, but don't download
python scripts/download/plan_and_download_fineweb_gpt3.py --plan

# Download (or resume) ~400 GB into datasets/fineweb-edu/
python scripts/download/plan_and_download_fineweb_gpt3.py

# Custom output location
python scripts/download/plan_and_download_fineweb_gpt3.py \
    --output-dir /mnt/data/fineweb-edu

# Discard the saved plan and re-pick from a fresh HF listing
python scripts/download/plan_and_download_fineweb_gpt3.py --refresh-plan
```
On the first run you’ll see something like:
```text
Listing shards in HuggingFaceFW/fineweb-edu...
Wrote new plan to datasets/fineweb-edu/progress.json.
Plan: ~100,000,000,000 tokens @ 4.0 bytes/token (small-scale GPT-3 ablation):
  shards:         198 (0 done)
  download size:  398.4 GB (0.0 B already on disk)
  est. tokens:    107,000,000,000
  dumps covered:  1 (CC-MAIN-2013-20)
Downloading to datasets/fineweb-edu via https://hf-mirror.com ...
[1/198] data/CC-MAIN-2013-20/000_00000.parquet (2.0 GB)
  progress: 2.0 GB / 398.4 GB   18.3 MB/s   ETA 360.1 min
```
Re-running after Ctrl-C:
```text
Loaded plan from datasets/fineweb-edu/progress.json (198 shards).
Plan: ~100,000,000,000 tokens @ 4.0 bytes/token (small-scale GPT-3 ablation):
  shards:         198 (47 done)
  download size:  398.4 GB (94.0 GB already on disk)
  ...
[1/198] done  data/CC-MAIN-2013-20/000_00000.parquet (2.0 GB)
  ...
[48/198] data/CC-MAIN-2013-20/000_00047.parquet (2.0 GB)
  attempt 1/3 failed: ...        ← resumes from .part automatically
```
## Things to know before running
- Disk: 400 GB free. The shards are not compressed further on disk; one parquet file ≈ one shard ≈ ~2 GB.
- Bandwidth: at 20 MB/s sustained from hf-mirror.com, ~5.5 hours. Slower from residential connections.
- Token estimate is a heuristic: `bytes / 4.0`. Real tokenizer counts vary ±10–15% depending on whether you use GPT-2 BPE, tiktoken, or SentencePiece. Override `BYTES_PER_TOKEN` if you’ve measured it (see the sketch after this list).
- Dump coverage: FineWeb-Edu’s first shards live under `CC-MAIN-2013-20`. 198 shards is more than that snapshot contains, so the plan will span multiple dumps in dump-sorted order. If you want topically balanced data, randomize the shard selection instead — the current ordering is deterministic but not domain-balanced.
- Not idempotent against the dataset evolving: if Hugging Face re-shards FineWeb-Edu, `progress.json` paths can go stale. Use `--refresh-plan` to rebuild.
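
If you want to replace the 4.0 guess with a measured value, one quick way is to tokenize a single row group of a downloaded shard. This sketch assumes `tiktoken` and `pyarrow` are installed and that the column is named `text`, which matches FineWeb-Edu’s schema; swap in whatever tokenizer you will actually train with:

```python
import pyarrow.parquet as pq
import tiktoken

enc = tiktoken.get_encoding("gpt2")   # stand-in for your real tokenizer
table = pq.ParquetFile(
    "datasets/fineweb-edu/data/CC-MAIN-2013-20/000_00000.parquet"
).read_row_group(0, columns=["text"])

texts = table.column("text").to_pylist()
n_bytes = sum(len(t.encode("utf-8")) for t in texts)
n_tokens = sum(len(enc.encode_ordinary(t)) for t in texts)
print(f"measured {n_bytes / n_tokens:.2f} bytes/token over {len(texts):,} documents")
```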
## What’s next
The downloaded parquet files are raw text. To turn them into a training corpus, point `scripts/extract/extract_fineweb.py` at `datasets/fineweb-edu/`. That script streams the parquet `text` column through `pyarrow`’s `iter_batches` without loading whole files into memory.
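
A minimal sketch of that streaming pattern, useful for sanity-checking the download before running the real extraction; the directory layout and the `text` column follow this guide, and `extract_fineweb.py` itself does more than this:

```python
from pathlib import Path

import pyarrow.parquet as pq

def iter_documents(parquet_dir: str, batch_size: int = 2048):
    """Yield document strings one at a time without loading any shard fully."""
    for path in sorted(Path(parquet_dir).rglob("*.parquet")):
        pf = pq.ParquetFile(path)
        for batch in pf.iter_batches(columns=["text"], batch_size=batch_size):
            yield from batch.column(0).to_pylist()  # only "text" was requested

# Peek at the first few documents of the downloaded corpus.
for i, doc in enumerate(iter_documents("datasets/fineweb-edu")):
    if i >= 3:
        break
    print(doc[:80].replace("\n", " "), "...")
```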