Optimizing the FineWeb Data Pipeline
🔍 Detailed Review: FineWeb Pipeline
What Went Wrong (and Why)
Attempt 1: datasets streaming library → OOM killed
The load_dataset(..., streaming=True) promise is misleading. Even in "streaming" mode, the HF datasets library:
- Loads the entire Arrow schema into memory
- Buffers decompressed parquet chunks in an internal queue
- Runs a Python iterator with significant per-object overhead
On 1.9GB RAM: wrote 115MB before dying. On 3.8GB RAM: wrote 223MB before dying.
Root cause: not the data streaming itself, but the HF library overhead. The Arrow table representation is memory-hungry even when "streaming."
Attempt 2: pd.read_parquet() → OOM killed
Reads the entire 2.1GB parquet into a pandas DataFrame. Pandas + Arrow conversion needs ~3-4x the file size in RAM. A 2GB parquet needs ~6-8GB RAM to load. Instant death on 3.8GB.
Attempt 3: pyarrow.ParquetFile.read_row_group() → OOM killed
Better than full pandas, but to_pandas() on a row group still converts the entire chunk to a DataFrame. Row groups in FineWeb are large (~100K+ rows), so same problem.
Attempt 4: pyarrow.iter_batches(batch_size=4096) → ✅ Works
This is the correct approach. It:
- Streams small RecordBatches (4096 rows at a time)
- Keeps each batch at a few MB in memory
- Never loads the full parquet into RAM
- Keeps peak memory under 500MB
Current State Assessment
| Metric | Value |
|---|---|
| Shards processed | 5 of 21 |
| Text extracted | 13 GB |
| Lines written | 76M |
| Disk used | 18 GB of 77 GB |
| RAM peak | ~500 MB |
| OOM kills | 0 (with iter_batches) |
Remaining shards (5-20): 16 × ~2GB = ~32GB parquet → ~42GB text. Total projected: ~55GB raw text. Disk after all shards: 77GB - 18GB used - 32GB download - 42GB text → tight but feasible (shards are deleted after extraction).
What Needs to Happen Next
1. Tokenization (critical missing piece)
You have 76M lines of raw text. Before training, this needs to become train.bin and val.bin:
```python
# Token counts from 5 shards:
#   ~13GB text → 3-4 billion tokens (GPT-2 BPE)
#   Full 21 shards → 12-15 billion tokens
```
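That projection is consistent with GPT-2 BPE averaging roughly 3-4 bytes per token on English web text; a quick sanity check on the implied ratio:

```python
# bytes-per-token implied by the 5-shard estimate (13GB text, 3-4B tokens)
text_bytes = 13e9
ratios = {tokens_b: text_bytes / (tokens_b * 1e9) for tokens_b in (3, 4)}
print(ratios)  # between ~3.3 and ~4.3 bytes per token
```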
The tokenization script needs to:
- Read the text line by line (memory-safe)
- Tokenize with the tiktoken GPT-2 encoder
- Write to binary uint16 arrays
- Split 90/10 train/val
Estimated tokenized size:
- 5 shards: ~6-8 GB (train.bin + val.bin)
- 21 shards: ~24-30 GB
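A sketch of that prepare step. The encoder is passed in as a callable so the split logic stands alone; with tiktoken you would supply enc.encode_ordinary and enc.eot_token. Routing every 10th line to val approximates the 90/10 split:

```python
import numpy as np

def tokenize_file(text_path, train_path, val_path, encode, eot_id):
    """Stream raw text line by line, appending uint16 tokens to train/val files.

    encode: callable str -> list[int] (e.g. tiktoken's encode_ordinary);
    eot_id: end-of-text token appended after each document as a separator.
    """
    n_train = n_val = 0
    with open(text_path, encoding="utf-8") as f, \
         open(train_path, "wb") as train, open(val_path, "wb") as val:
        for i, line in enumerate(f):
            ids = encode(line.rstrip("\n")) + [eot_id]
            if i % 10 == 9:                       # every 10th line -> val (~10%)
                np.asarray(ids, dtype=np.uint16).tofile(val)
                n_val += len(ids)
            else:
                np.asarray(ids, dtype=np.uint16).tofile(train)
                n_train += len(ids)
    return n_train, n_val
```

uint16 works because GPT-2's vocabulary (50,257 tokens including end-of-text) fits under 2^16; a larger tokenizer would force uint32 and double the file sizes.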
2. Disk space calculation
| File | 5 shards | 21 shards |
|---|---|---|
| Raw text | 13 GB | 55 GB |
| Tokenized | 7 GB | 28 GB |
| Parquet (temp) | 0 GB (deleted) | 0 GB |
| Total | 20 GB | 83 GB ⚠️ |
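The table's totals are just the sum of the persistent artifacts per scenario, which makes the capacity problem easy to check:

```python
DISK_GB = 77
# per-scenario sizes from the table above (GB); parquet is transient, so 0
budget = {
    "5 shards":  {"raw_text": 13, "tokenized": 7},
    "21 shards": {"raw_text": 55, "tokenized": 28},
}
totals = {name: sum(parts.values()) for name, parts in budget.items()}
for name, total in totals.items():
    print(f"{name}: {total} GB total, fits on {DISK_GB} GB disk: {total < DISK_GB}")
```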
21 shards will exceed the 77GB disk. You need to either:
- Tokenize shard-by-shard (download → extract → tokenize → delete text)
- Upgrade to a larger disk
- Use fewer shards (sample-10BT uses ~5 shards)
3. The sample-10BT subset is not what's being downloaded
The parquet shards are from CC-MAIN-2013-20, which is the full FineWeb dataset, not the curated sample-10BT. For GPT-2 124M training, sample-10BT is better:
- Curated quality filtering
- ~10B tokens = right-sized for the model
- No need to download 21 shards
4. No HF token → rate limiting
You're downloading anonymously. HF limits unauthenticated downloads to ~1GB/hour. Each 2GB shard takes 2+ hours. Full download = 40+ hours. With a token: ~10x faster.
Recommendations
Immediate (right now):
- You have 5 shards / ~13GB / ~3-4B tokens. That's enough to test the full pipeline.
- Don't download more shards yet. Tokenize what you have, verify training works.
Before full download:
- Set up HF token: huggingface-cli login
- Switch to the sample-10BT subset (curated, right-sized)
- Write a shard-by-shard pipeline (download → extract → tokenize → cleanup) to stay under the disk limit
- Transfer tokenized train.bin / val.bin to your RTX 4070 for actual training
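A sketch of that shard-by-shard loop, with the download step elided and the extract/tokenize stages injected as callables (the function and parameter names here are illustrative, not from your scripts):

```python
import os
from pathlib import Path

def process_shards(shard_paths, extract, tokenize, work_dir="work"):
    """Run extract -> tokenize -> cleanup one shard at a time.

    shard_paths: iterable of already-downloaded parquet paths;
    extract(parquet_path, txt_path) and tokenize(txt_path) are injected.
    Only one shard's parquet + text ever sit on disk together, which is
    what keeps the pipeline under the disk limit.
    """
    Path(work_dir).mkdir(exist_ok=True)
    for pq_path in shard_paths:
        txt = os.path.join(work_dir, Path(pq_path).stem + ".txt")
        extract(pq_path, txt)   # parquet -> raw text
        tokenize(txt)           # raw text -> appended train/val tokens
        os.remove(pq_path)      # free disk before the next shard
        os.remove(txt)
```

With tokens appended shard by shard, peak disk is one shard's parquet (~2GB) plus its text (~2.6GB) plus the growing .bin files, comfortably inside 77GB.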
For training:
- 5 shards (~3-4B tokens) with the GPT-2 124M config will produce a decent model
- max_iters=~12000 at effective batch 524K tokens/iter → 6.3B tokens seen (loops dataset once)
- Matches your previous train_log_fineweb3.txt config
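The 6.3B figure is simple arithmetic, assuming "524K" means the usual power-of-two effective batch of 2^19 = 524,288 tokens per iteration:

```python
tokens_per_iter = 524_288   # effective batch in tokens (assumed 2**19)
max_iters = 12_000
tokens_seen = tokens_per_iter * max_iters
print(f"{tokens_seen / 1e9:.2f}B tokens seen")  # ~6.29B
```

If the actual token count after tokenization differs, scale max_iters to match: max_iters = target_tokens / tokens_per_iter.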
Summary
| Issue | Severity | Fix |
|---|---|---|
| HF datasets OOM | ✅ Solved | Use pyarrow.iter_batches |
| Missing tokenizer | 🔴 Blocker | Write prepare.py |
| Wrong subset (full vs sample-10BT) | 🟡 Quality | Use sample-10BT for training |
| No HF token | 🟡 Speed | huggingface-cli login |
| Disk too small for all shards | 🟡 Capacity | Tokenize shard-by-shard or use fewer shards |
| Config mismatch | 🟡 Correctness | Set max_iters based on actual token count |
The hard part (memory-safe extraction) is solved. Everything else is plumbing. Want me to write the tokenizer and wire up the full pipeline?