Training Plateau Causes and Solutions | Generated by AI
The training log you posted shows a classic plateau in a small-scale nanoGPT run on FineWeb-edu data. The validation loss drops nicely at first (from ~10.9 → ~3.57 around iter 14k), but then flattens completely and even starts to rise very slowly after ~13–14k iterations.
Here are the real reasons why your model stopped improving (ranked by importance for this exact setup):
1. You are massively overfitting a single Parquet file
- Your run name: `125M-single-parquet-4070`
- You are training a 40M–125M model (the log reports 40.56M params, so in practice a scaled-down GPT-2 config despite the 125M run name) on one single FineWeb parquet file (probably 10–15 GB → ~3–4 billion tokens at most, and likely much less after filtering).
- By iter 14 000 with `batch_size=16`, `grad_accum=32`, `block_size=1024` you have already seen 14 000 iters × 524 288 tokens/iter ≈ 7.3 billion tokens (see the quick check after this list).
- That means you have already done 2–3 full epochs over the exact same data.
- FineWeb-edu is high-quality but still has a lot of near-duplicates and boilerplate. After 1.5–2 epochs on the same file(s), a 40M–125M model will memorize almost everything useful and loss plateaus hard.
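A quick sanity check of that arithmetic (the tokens-per-iteration numbers come from the settings quoted above; the per-file token count is a rough assumption):

```python
# Back-of-the-envelope check of tokens seen vs. dataset size.
batch_size = 16
grad_accum = 32
block_size = 1024
iters = 14_000

tokens_per_iter = batch_size * grad_accum * block_size   # 524,288
tokens_seen = tokens_per_iter * iters                     # ≈ 7.3e9

dataset_tokens = 3.5e9   # rough guess for one FineWeb-edu parquet file (assumption)
epochs = tokens_seen / dataset_tokens

print(f"tokens/iter: {tokens_per_iter:,}")
print(f"tokens seen by iter {iters:,}: {tokens_seen / 1e9:.1f}B")
print(f"approximate epochs over the file: {epochs:.1f}")
```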
2. Learning rate is now too high after the plateau
- You use `learning_rate = 1e-3` with cosine decay to `min_lr = 1e-4` over 20 000 iters.
- At iter 14 000 the LR is still ≈2.9e-4 (cosine only reaches `min_lr` at the very end of the schedule; see the sketch below).
- Once the model has learned everything the data has to offer, keeping the LR at several times 1e-4 prevents it from fine-tuning further and actually starts hurting generalization → val loss slowly climbs.
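To see where the LR actually sits at 14k, here is a minimal re-implementation of a cosine schedule in the style of nanoGPT's `get_lr()` (the warmup length here is an assumption; plug in your own config values):

```python
import math

def cosine_lr(it, max_lr=1e-3, min_lr=1e-4, warmup_iters=200, decay_iters=20_000):
    """Cosine decay with linear warmup, in the style of nanoGPT's get_lr()."""
    if it < warmup_iters:
        return max_lr * (it + 1) / warmup_iters           # linear warmup
    if it > decay_iters:
        return min_lr                                     # past the schedule: hold at min_lr
    ratio = (it - warmup_iters) / (decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * ratio))       # goes 1 -> 0 over the decay window
    return min_lr + coeff * (max_lr - min_lr)

for step in (2_000, 8_000, 14_000, 19_000):
    print(step, f"{cosine_lr(step):.2e}")
# At ~14k the LR is still roughly 2.9e-4, i.e. almost 3x min_lr.
```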
3. Model size vs data diversity mismatch
A 125M (or, as here, the ~40M you actually initialized) model on a single parquet file is like giving a student a single textbook and asking them to keep studying it for years: after a while they can recite it perfectly, but performance on new text stops improving and even degrades slightly as the model overfits the repeated data.
4. Regularization is not strong enough
- `dropout = 0.1` is okay, but not enough when you are overfitting this hard
- `weight_decay = 0.1` is standard, but still not sufficient for 2+ epochs on a small dataset
- Gradient clipping is at the nanoGPT default of 1.0; some people tighten it to 0.5 or lower on small datasets, but clipping mainly stabilizes training rather than preventing overfitting (illustrative config overrides below)
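If you want to push regularization anyway, a minimal sketch of the relevant nanoGPT-style config overrides (the variable names follow nanoGPT's `train.py`; the stronger values are illustrative, not a tuned recipe):

```python
# Illustrative overrides for a nanoGPT-style config file (e.g. config/train_fineweb.py).
# Variable names follow nanoGPT's train.py; the stronger values here are guesses, not a tuned recipe.
dropout = 0.2        # up from 0.1; helps a little when repeating data, at some cost in fitting speed
weight_decay = 0.1   # nanoGPT default; raising it further has diminishing returns
grad_clip = 0.5      # default is 1.0; a tighter clip mainly stabilizes training, it does not fix overfitting
```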
5. Tokens seen vs real diversity
Even though you saw 7+ billion tokens, the effective diversity is maybe 1–1.5B unique high-quality tokens after dedup, boilerplate, low-edu-score samples, etc. That’s just not enough for a 125M model to keep improving past ~3.5–3.6 loss on this text distribution.
Evidence from your own log
| Step | Val loss | Comment |
|---|---|---|
| 0 | 10.87 | random |
| 2000 | 4.03 | fast progress |
| 5000 | 3.76 | still good |
| 8000 | 3.65 | slowing |
| 11000 | 3.61 | almost flat |
| 13500 | 3.57 | essentially at the floor |
| 14500 | 3.569 | no meaningful improvement → the curve has flattened |
What you should do instead to actually get lower loss
- Use way more data
  - Use the full 100+ GB FineWeb-edu sample (or at least 10–20 parquet files)
  - Or stream shards directly from HuggingFace datasets (see the streaming sketch after this list)
- Train for exactly 1 epoch (or slightly less) on diverse data
  - ~10B–15B tokens total is the sweet spot for 125M on FineWeb-edu
  - Llama-3/Phi-3 style: 15B tokens → ~3.0–3.2 val loss for 125M-class models
- Lower the final learning rate
  - Use `min_lr = 5e-5` or `3e-5` (common in recent small-model runs)
  - Or switch to linear decay with a longer hold near zero LR at the end
- Slightly larger model or better architecture tricks
  - The original nanoGPT is GPT-2 style: learned absolute position embeddings (no RoPE or ALiBi), GELU MLP, LayerNorm
  - Modern small models built on the Llama-3/Qwen2/Phi-3 recipe (SwiGLU, RMSNorm, rotary embeddings) scaled down to the 125M class reach ~2.8–3.0 on the same data (a sketch of these modules also follows below)
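For the streaming option above, a minimal sketch using HuggingFace `datasets` and `tiktoken` (the `HuggingFaceFW/fineweb-edu` dataset name and the `sample-10BT` config are assumptions based on the public FineWeb-edu release; swap in whatever subset you actually want):

```python
# Minimal sketch: stream FineWeb-edu shards instead of training on a single local parquet.
import numpy as np
import tiktoken
from datasets import load_dataset

enc = tiktoken.get_encoding("gpt2")
ds = load_dataset("HuggingFaceFW/fineweb-edu", name="sample-10BT",
                  split="train", streaming=True)

def token_stream(dataset, eot=enc.eot_token):
    """Yield GPT-2 token ids document by document, with an <|endoftext|> separator."""
    for example in dataset:
        ids = enc.encode_ordinary(example["text"])
        ids.append(eot)
        yield from ids

def blocks(stream, block_size=1024):
    """Pack the token stream into fixed-length training blocks."""
    buf = []
    for tok in stream:
        buf.append(tok)
        if len(buf) == block_size + 1:       # +1 so inputs/targets can be shifted by one
            yield np.array(buf, dtype=np.uint16)
            buf = buf[block_size:]           # keep the last token as the start of the next block

# Quick smoke test: pull a couple of blocks from the stream.
for i, block in enumerate(blocks(token_stream(ds))):
    if i >= 2:
        break
    print(block.shape, block[:8])
```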
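And for the architecture point, the modern ingredients are small drop-in modules. A rough PyTorch sketch of RMSNorm and a SwiGLU MLP (dimensions are illustrative; rotary embeddings are omitted for brevity):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RMSNorm(nn.Module):
    """RMSNorm: scale-only normalization, no mean subtraction and no bias (Llama-style)."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = x.pow(2).mean(dim=-1, keepdim=True).add(self.eps).rsqrt()
        return self.weight * (x * rms)

class SwiGLU(nn.Module):
    """SwiGLU MLP block: silu(w1(x)) * w3(x), projected back down with w2."""
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim, bias=False)  # gate
        self.w3 = nn.Linear(dim, hidden_dim, bias=False)  # value
        self.w2 = nn.Linear(hidden_dim, dim, bias=False)  # down-projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

x = torch.randn(2, 16, 768)                  # (batch, seq, dim) at GPT-2-small width
y = SwiGLU(768, 2048)(RMSNorm(768)(x))
print(y.shape)                               # torch.Size([2, 16, 768])
```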
TL;DR – Why your run plateaued at ~3.57 and will never go lower
You are training a 40–125M model for multiple epochs on one single FineWeb parquet file (~few billion tokens total). The model has already memorized everything useful by ~13k iters. Continued training with still-high LR just shuffles weights around → plateau + slight overfit degradation.
This is the #1 mistake people make when they first try nanoGPT on FineWeb: “I’ll just use one file to test” → ends up with exactly this loss curve.
Fix: either (a) use 10–50× more data and train for ~1 epoch total, or (b) accept that ~3.55–3.6 is the absolute floor for this model+data combination and stop training.
If you want a run that actually reaches ~3.0–3.2 loss with nanoGPT-style code, you need at least the full 100GB FineWeb-edu sample (or the newer FineWeb-Edu 2.0) and train for ~15B tokens max.