Flexible Dataset Download and Planning Script | Generated by AI

2026.05

Added scripts/download/plan_and_download_fineweb.py. What it does that the existing scripts don't: plan a download against a token budget without fetching anything (--plan), pull through a mirror (--mirror), and restrict the download to a single CommonCrawl snapshot (--dump).

Quick examples:

# See the plan for 10B tokens of FineWeb-Edu, no download
python scripts/download/plan_and_download_fineweb.py \
    --dataset fineweb-edu --target-tokens 10B --plan

# Pull 100B tokens via the mirror into datasets/
python scripts/download/plan_and_download_fineweb.py \
    --dataset fineweb-edu --target-tokens 100B \
    --output-dir datasets/fineweb-edu --mirror hf-mirror

# Single snapshot only, GPT-2-class experiment
python scripts/download/plan_and_download_fineweb.py \
    --dataset fineweb --target-tokens 5B --dump CC-MAIN-2024-10
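
The --target-tokens budgets above reduce to simple arithmetic under the script's default of 4 bytes per token. A minimal sketch of that planning step, assuming the script works roughly like this (parse_target and plan_bytes are illustrative names, not the script's actual API):

```python
# Hypothetical sketch of the token-budget planning arithmetic.
UNITS = {"K": 10**3, "M": 10**6, "B": 10**9, "T": 10**12}

def parse_target(spec: str) -> int:
    """Parse a budget like '10B' or '5B' into a token count."""
    spec = spec.strip().upper()
    if spec and spec[-1] in UNITS:
        return int(float(spec[:-1]) * UNITS[spec[-1]])
    return int(spec)

def plan_bytes(target_tokens: int, bytes_per_token: float = 4.0) -> int:
    """Estimate the raw-text bytes needed to cover the token budget."""
    return int(target_tokens * bytes_per_token)

# 10B tokens at 4 bytes/token -> 40_000_000_000 bytes, i.e. ~40 GB of text.
print(plan_bytes(parse_target("10B")))
```

From an estimate like this, the planner only needs per-file sizes (available from the Hub's metadata) to decide how many parquet files to fetch.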

Note on the heuristic: token estimates assume 4 bytes per token, a typical figure for GPT-2 BPE on English text. Once you've tokenized a first batch with your actual tokenizer, pass the measured ratio via --bytes-per-token for a sharper plan.

Also added huggingface_hub and datasets to requirements.txt.
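
Measuring the value to pass as --bytes-per-token is just UTF-8 bytes divided by token count over a sample. A minimal sketch; the whitespace split here is a stand-in for your real tokenizer (e.g. a GPT-2 BPE encoder):

```python
def measured_bytes_per_token(texts, tokenize):
    """Average bytes per token over a sample of documents."""
    total_bytes = sum(len(t.encode("utf-8")) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_bytes / total_tokens

# str.split is a stand-in tokenizer for illustration only;
# in practice, use the tokenizer you'll train with.
sample = ["FineWeb planning example text.", "Second document in the sample."]
print(measured_bytes_per_token(sample, str.split))
```

Whitespace tokens run longer than BPE tokens, which is why a measured BPE ratio near 4 bytes/token is the useful default for English web text.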

