Optimizing the FineWeb Data Pipeline | Generated by AI

2026.03

๐Ÿ” Detailed Review: FineWeb Pipeline


What Went Wrong (and Why)

Attempt 1: `datasets` streaming library → OOM killed

The `load_dataset(..., streaming=True)` promise is misleading. Even in "streaming" mode, the HF datasets library got OOM-killed almost immediately:

- On 1.9GB RAM: wrote 115MB before dying
- On 3.8GB RAM: wrote 223MB before dying

Root cause: not the data streaming itself, but the HF library overhead. The Arrow table representation is memory-hungry even when "streaming."

Attempt 2: `pd.read_parquet()` → OOM killed

Reads the entire 2.1GB parquet into a pandas DataFrame. Pandas + Arrow conversion needs ~3-4x the file size in RAM. A 2GB parquet needs ~6-8GB RAM to load. Instant death on 3.8GB.

Attempt 3: `pyarrow.ParquetFile.read_row_group()` → OOM killed

Better than full pandas, but to_pandas() on a row group still converts the entire chunk to a DataFrame. Row groups in FineWeb are large (~100K+ rows), so same problem.

Attempt 4: `pyarrow.ParquetFile.iter_batches(batch_size=4096)` → ✅ Works

This is the correct approach: it streams fixed-size record batches straight off disk, so only one ~4096-row batch is ever resident in memory at a time, regardless of row-group size.


Current State Assessment

| Metric | Value |
| --- | --- |
| Shards processed | 5 of 21 |
| Text extracted | 13 GB |
| Lines written | 76M |
| Disk used | 18 GB of 77 GB |
| RAM peak | ~500 MB |
| OOM kills | 0 (with iter_batches) |

Remaining shards (5-20): 16 × ~2GB ≈ 32GB of parquet → ~42GB of text.
Total projected: ~55GB of raw text.
Disk budget: 77GB total, minus the 18GB already used, against 32GB of downloads and 42GB of new text. That sums past the disk, but it stays feasible because each shard is deleted right after extraction, so the parquet and its text never coexist in full.


What Needs to Happen Next

1. Tokenization (critical missing piece)

You have 76M lines of raw text. Before training, this needs to become train.bin and val.bin:

```python
# Token counts from 5 shards:
# ~13GB text ≈ 3-4 billion tokens (GPT-2 BPE)
# Full 21 shards ≈ 12-15 billion tokens
```

The tokenization script needs to stream the extracted text line by line, encode it with the GPT-2 BPE tokenizer, and append the token IDs to `train.bin`/`val.bin` as `uint16` arrays (the GPT-2 vocab of 50,257 fits in 16 bits).

Estimated tokenized size: ~7GB for 5 shards, ~28GB for all 21 (uint16 token IDs at 2 bytes each).
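A sketch of what that `prepare.py` core could look like. Assumptions are flagged in the comments: the real script would call tiktoken's `gpt2` encoding; the byte-level stand-in here only keeps the sketch dependency-free, and 50256 is GPT-2's end-of-text token:

```python
import numpy as np


def encode_gpt2(text):
    # In the real script this would be tiktoken:
    #   enc = tiktoken.get_encoding("gpt2"); return enc.encode_ordinary(text)
    # Byte-level stand-in so this sketch runs without tiktoken installed.
    return list(text.encode("utf-8"))


def tokenize_to_bin(lines, out_path, eot=50256):
    """Append lines as uint16 token IDs; GPT-2's 50,257 vocab fits in 16 bits."""
    ids = []
    for line in lines:
        ids.extend(encode_gpt2(line))
        ids.append(eot)  # document separator between lines
    arr = np.array(ids, dtype=np.uint16)
    with open(out_path, "ab") as f:  # append mode: call once per shard
        arr.tofile(f)
    return len(ids)
```

Usage would split the stream roughly 99.9/0.1 into `train.bin`/`val.bin` and call this once per shard, so no shard's tokens ever sit in RAM alongside another's.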

2. Disk space calculation

| File | 5 shards | 21 shards |
| --- | --- | --- |
| Raw text | 13 GB | 55 GB |
| Tokenized | 7 GB | 28 GB |
| Parquet (temp) | 0 GB (deleted) | 0 GB |
| Total | 20 GB | 83 GB ⚠️ |

21 shards will exceed the 77GB disk. You need to either tokenize each shard as it arrives and delete its raw text, or train on fewer shards.

3. The sample-10BT subset is not what's being downloaded

The parquet shards are from CC-MAIN-2013-20, which is the full FineWeb dataset, not the curated sample-10BT. For GPT-2 124M training, sample-10BT is better: it is pre-sampled down to roughly 10B tokens of the same curated data, about the right scale for a 124M-parameter run and small enough to sidestep the disk gymnastics entirely.

4. No HF token → rate limiting

You're downloading anonymously. HF throttles unauthenticated downloads to roughly 1GB/hour, so each 2GB shard takes 2+ hours and the full download would run 40+ hours. With a token: ~10x faster.


Recommendations

Immediate (right now):

Before full download:

  1. Set up HF token: huggingface-cli login
  2. Switch to sample-10BT subset (curated, right-sized)
  3. Write a shard-by-shard pipeline (download → extract → tokenize → cleanup) to stay under the disk limit
  4. Transfer tokenized train.bin/val.bin to your RTX 4070 for actual training
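Step 3 can be sketched as a single loop. `download_shard`, `extract_text`, and `tokenize_append` are hypothetical helpers here: the extractor would be the `iter_batches` code above, and the tokenizer the `prepare.py` logic:

```python
import os


def process_shards(shard_ids, workdir, download_shard, extract_text, tokenize_append):
    """Process one shard at a time so peak disk usage stays near one shard."""
    bin_path = os.path.join(workdir, "train.bin")
    for sid in shard_ids:
        pq_path = download_shard(sid, workdir)   # ~2GB parquet lands on disk
        txt_path = extract_text(pq_path)         # iter_batches extraction
        os.remove(pq_path)                       # drop the parquet immediately
        tokenize_append(txt_path, bin_path)      # append uint16 token IDs
        os.remove(txt_path)                      # keep only the tokens on disk
    return bin_path
```

The transient peak is one shard's parquet plus its extracted text; only the growing `.bin` files persist, which is what keeps the whole run inside the 77GB disk.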

For training:


Summary

| Issue | Severity | Fix |
| --- | --- | --- |
| HF datasets OOM | ✅ Solved | Use `pyarrow` `iter_batches` |
| Missing tokenizer | 🔴 Blocker | Write `prepare.py` |
| Wrong subset (full vs sample-10BT) | 🟡 Quality | Use sample-10BT for training |
| No HF token | 🟡 Speed | `huggingface-cli login` |
| Disk too small for all shards | 🟡 Capacity | Tokenize shard-by-shard or use fewer shards |
| Config mismatch | 🟡 Correctness | Set `max_iters` based on actual token count |

The hard part (memory-safe extraction) is solved. Everything else is plumbing. Want me to write the tokenizer and wire up the full pipeline?

