FineWeb Data Preparation for nanoGPT Training | Generated by AI
Here’s the breakdown of your repo and how to prep FineWeb data for nanoGPT:
📁 Repo Structure
- `requirements.txt` – gensim, mediawiki-dump, pandas, pyarrow
- `train/` – Scripts and logs for downloading, extracting, and training on FineWeb + Wikipedia
🔄 Data Pipeline (FineWeb → nanoGPT)
Your scripts show a clear flow: three data-prep steps, then training.
Step 1: Download FineWeb Parquet Shards
# From the repo, e.g.:
bash train/wget_fineweb_mirror_1.sh # Downloads shard 0 (~500MB-1GB)
bash train/wget_fineweb_mirror_2_5.sh # Downloads shards 1-10
bash train/wget_fineweb_mirror_11_20.sh # Downloads shards 11-20
Uses hf-mirror.com (faster for Asia). Shards go into fineweb_test_dump/.
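The download scripts themselves aren't reproduced here; a sketch of what one might look like is below. The dataset path (`HuggingFaceFW/fineweb`) and the shard file naming are assumptions, not taken from the repo, so check them against the actual scripts before running.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a FineWeb shard downloader via hf-mirror.com.
# Dataset path and shard naming are assumptions; verify against the repo scripts.
set -euo pipefail

MIRROR="https://hf-mirror.com/datasets/HuggingFaceFW/fineweb/resolve/main"
OUT_DIR="fineweb_test_dump"
mkdir -p "$OUT_DIR"

for i in $(seq -f "%05g" 0 4); do
  f="000_${i}.parquet"   # e.g. 000_00000.parquet (naming assumed)
  echo "would fetch: ${MIRROR}/sample/10BT/${f} -> ${OUT_DIR}/${f}"
  # Uncomment to actually download (-c resumes partial files):
  # wget -c "${MIRROR}/sample/10BT/${f}" -O "${OUT_DIR}/${f}"
done
```

The `wget -c` flag lets an interrupted shard download resume instead of restarting, which matters at ~500MB-1GB per shard.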
Step 2: Extract Text from Parquet → Plain Text
python train/extract_fineweb.py
- Reads all `fineweb_test_dump/000_*.parquet` files
- Extracts the `text` column
- Writes to `fineweb_test_dump/fineweb_extracted_all.txt`
- Each document separated by an `<|endoftext|>` token
Step 3: Tokenize with nanoGPT’s prepare.py
nanoGPT expects a single train.bin / val.bin of tokenized data. You need to:
- Copy `fineweb_extracted_all.txt` into your nanoGPT data dir, e.g.:
  mkdir -p ~/nanoGPT/data/fineweb
  cp fineweb_test_dump/fineweb_extracted_all.txt ~/nanoGPT/data/fineweb/
- Create a `prepare.py` (or adapt nanoGPT's existing one) that:
  - Reads the text file
  - Tokenizes with GPT-2 BPE (`tiktoken` or a `transformers` tokenizer)
  - Splits into train (90%) / val (10%)
  - Saves as `train.bin` and `val.bin` (numpy uint16 arrays)
- Run it:
  cd ~/nanoGPT
  python data/fineweb/prepare.py
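A minimal `prepare.py` along those lines might look like the following. To keep the sketch self-contained it uses raw bytes as a stand-in tokenizer; in practice you'd replace `encode` with `tiktoken.get_encoding("gpt2").encode_ordinary`, as nanoGPT's own prepare scripts do.

```python
# Hypothetical sketch of data/fineweb/prepare.py.
# Byte-level stand-in tokenizer so the example runs without tiktoken;
# swap in tiktoken.get_encoding("gpt2").encode_ordinary for real use.
import numpy as np


def encode(text: str) -> list[int]:
    # Stand-in: byte "tokens" (0-255). Real GPT-2 BPE ids (< 65536) also fit in uint16.
    return list(text.encode("utf-8"))


def prepare(txt_path: str, out_dir: str, val_frac: float = 0.1) -> tuple[int, int]:
    """Tokenize one text file and write nanoGPT-style train.bin / val.bin."""
    with open(txt_path, encoding="utf-8") as f:
        data = f.read()
    ids = np.array(encode(data), dtype=np.uint16)  # GPT-2's 50257-id vocab fits uint16
    split = int(len(ids) * (1 - val_frac))         # 90% train / 10% val
    ids[:split].tofile(f"{out_dir}/train.bin")
    ids[split:].tofile(f"{out_dir}/val.bin")
    return split, len(ids) - split
```

nanoGPT memory-maps these `.bin` files at train time, which is why a flat uint16 array with no header is the expected format.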
Step 4: Train
cd ~/nanoGPT
python train.py config/train_fineweb.py
Your train_log_fineweb3.txt shows the config used before:
- 125M params (12 layers, 12 heads, 768 embd)
- block_size=1024, batch_size=8, grad_accum=64
- 12K iters, ~3B tokens
- Trained on RTX 4070 at ~15% MFU, ~9.3s/iter
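Those numbers correspond to a nanoGPT config file roughly like this; it's a sketch assembled from the log, with field names following nanoGPT's config convention. `out_dir` and `dataset` are my assumptions, and exact values should be checked against `train_log_fineweb3.txt`:

```python
# config/train_fineweb.py (sketch reconstructed from the training log)
out_dir = 'out-fineweb'   # assumed name
dataset = 'fineweb'       # matches data/fineweb/ above

# ~125M-param GPT-2-small shape
n_layer = 12
n_head = 12
n_embd = 768

block_size = 1024
batch_size = 8
gradient_accumulation_steps = 64

max_iters = 12000
```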
⚠️ One issue: That server has no GPU (just 1 vCPU, 2GB RAM). Training nanoGPT there would be CPU-only and extremely slow. The logs suggest you trained on an RTX 4070 locally.
Want me to just run the download + extract steps on the server to prep the data? Or do you want to train locally on your GPU machine?