Excess Data, Undertrained Model | Generated by AI

2026.06

Here’s the situation:

DATA ALREADY ON DISK (nanochat cache)
  Location: ~/.cache/nanochat/base_data_climbmix/
  Files:    176 parquet shards (fineweb-edu format)
  Size:     374 GB
  Docs:     ~138M documents
  Tokens:   ~164B tokens (estimated, ~4 chars/BPE token)
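
A rough sketch of how a token estimate like this can be produced: sample some documents, average their character length, and divide by the ~4 chars/token rule of thumb. The function name and the sample figures below are hypothetical, chosen only for illustration, so the output will not reproduce the 164B figure above.

```python
def estimate_tokens(sample_chars, sample_docs, total_docs, chars_per_token=4.0):
    """Extrapolate a corpus token count from a small document sample.

    chars_per_token=4.0 is the rule-of-thumb ratio quoted above, not a
    measured value for any particular tokenizer.
    """
    if sample_docs <= 0:
        raise ValueError("sample_docs must be positive")
    avg_chars_per_doc = sample_chars / sample_docs
    return total_docs * avg_chars_per_doc / chars_per_token


# Hypothetical sample: 1,000 documents totalling 4,000,000 characters,
# extrapolated to the ~138M documents reported on disk.
approx = estimate_tokens(sample_chars=4_000_000, sample_docs=1_000,
                         total_docs=138_000_000)
print(f"~{approx / 1e9:.0f}B tokens")
```

For a tighter number, you would run the actual tokenizer over the sample instead of assuming a fixed chars/token ratio.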

TRAINING vs DATA
  Used in training:    0.655B tokens  (10k steps x 65,536 tokens/step)
  Chinchilla optimal:  5.7B tokens    (20 tokens/param x 286M params)
  Already on disk:     164B tokens    (~29x Chinchilla, ~250x what was used)
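
If you want to re-check these ratios yourself, here is a minimal sketch. It assumes the 20 tokens/param rule of thumb and the figures quoted above; the function and key names are made up for illustration.

```python
def data_budget(params, tokens_used, tokens_on_disk, tokens_per_param=20):
    """Compare tokens used and tokens available against a Chinchilla-style budget."""
    optimal = params * tokens_per_param
    return {
        "optimal_tokens": optimal,
        "used_over_optimal": tokens_used / optimal,
        "disk_over_optimal": tokens_on_disk / optimal,
        "disk_over_used": tokens_on_disk / tokens_used,
        "used_fraction_of_disk": tokens_used / tokens_on_disk,
    }


budget = data_budget(
    params=286e6,                 # model size quoted above
    tokens_used=10_000 * 65_536,  # steps x tokens per step
    tokens_on_disk=164e9,         # estimated, see above
)
for name, value in budget.items():
    print(f"{name:>22}: {value:,.4g}")
```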

VERDICT: You don't need more data. You have 164B tokens sitting on disk
and only used 0.655B of them (0.4%).

The fix is to train longer, not download more. For Chinchilla-optimal training:

Target:       5.7B tokens
Batch size:   65,536 tokens/step
Steps needed: 5.7B / 65,536 = ~87,000 steps
Time:         87,000 x 1.18 s/step = ~28.5 hours

Or, if you want to be more aggressive and use more of the data (same batch size and step time):

10B tokens:  ~153k steps  (~50 hours)
20B tokens:  ~305k steps  (~100 hours)
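
The step counts and run times above come from the same two divisions, so a small helper makes it easy to try other targets. This is a sketch that assumes the 65,536 tokens/step batch and 1.18 s/step used above stay constant for the whole run; `plan_run` is a hypothetical name, not part of nanochat.

```python
import math

BATCH_TOKENS = 65_536     # tokens per optimizer step, from above
SECONDS_PER_STEP = 1.18   # step time used above; assumed constant


def plan_run(target_tokens, batch_tokens=BATCH_TOKENS,
             seconds_per_step=SECONDS_PER_STEP):
    """Return (steps, hours) needed to train on target_tokens."""
    steps = math.ceil(target_tokens / batch_tokens)
    hours = steps * seconds_per_step / 3600
    return steps, hours


for target in (5.7e9, 10e9, 20e9):
    steps, hours = plan_run(target)
    print(f"{target / 1e9:>5.1f}B tokens: {steps:>7,} steps  (~{hours:.1f} hours)")
```

Real wall-clock time will drift from this if step time changes, for example with evaluation passes or checkpointing.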

Want me to check the nanochat dataset.py to confirm it’ll correctly loop through all 176 shards with more iterations, or do you want to set up a longer training run?

