FineWeb Training Iteration Recommendation | Generated by AI


For your ~3.68B token FineWeb dataset and this 40M-parameter model, here’s the clear recommendation:

Keep max_iters = 20,000 (or at least 15,000+)

Don’t drop to 10,000 or 6,000 — you’ll leave a lot of performance on the table.

Why 20k is better than 10k or 6k

| max_iters | Total tokens seen | Epochs on your data | Expected final val loss (typical on FineWeb-Edu) | Relative quality |
|---|---|---|---|---|
| 6,000 | ~3.15 B | ~0.85× | ~2.9–3.1 | OK, but weak |
| 10,000 | ~5.24 B | ~1.42× | ~2.65–2.8 | Decent |
| 15,000 | ~7.86 B | ~2.13× | ~2.50–2.65 | Very good |
| 20,000 | ~10.5 B | ~2.85× | ~2.42–2.55 | Excellent (beats original GPT-2 124M on most benchmarks) |
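
For reference, the tokens and epochs columns follow directly from how many tokens each optimizer step processes. Here is a minimal sketch of the arithmetic, assuming ~524,288 tokens per iteration (batch_size × block_size × gradient_accumulation_steps in a nanoGPT-style setup; substitute your actual values):

```python
# ASSUMPTION: ~524,288 tokens processed per optimizer step, i.e.
# batch_size * block_size * gradient_accumulation_steps in a nanoGPT-style run.
TOKENS_PER_ITER = 524_288
DATASET_TOKENS = 3.68e9  # your FineWeb shard

for max_iters in (6_000, 10_000, 15_000, 20_000):
    total_tokens = max_iters * TOKENS_PER_ITER
    epochs = total_tokens / DATASET_TOKENS
    print(f"{max_iters:>6} iters -> {total_tokens / 1e9:.2f}B tokens, {epochs:.2f} epochs")
```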

Real-world runs with similarly sized models (40–125M parameters) on FineWeb-Edu show the loss still dropping noticeably even after 2–3 epochs over this data, because FineWeb-Edu is unusually high-quality and diverse. People now routinely train 124M-class models on 10–15B tokens and get significantly better coherence, knowledge, and reasoning.

Time difference isn’t huge on your 4070

Going from 10,000 to 20,000 iterations costs only ~8–9 extra hours of runtime for a model that will be noticeably smarter and more useful. Totally worth it unless you’re in a big hurry or electricity is extremely expensive where you are.
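
If you want to sanity-check those hours, here is a rough sketch of the arithmetic, assuming about 3 seconds per iteration on the 4070 (an assumption; read your real step time off the training log and plug it in):

```python
# ASSUMPTION: ~3 s per optimizer step on an RTX 4070; measure your own step
# time from the training log and substitute it here.
SECONDS_PER_ITER = 3.0

def extra_hours(base_iters: int, target_iters: int) -> float:
    """Additional wall-clock hours to go from base_iters to target_iters."""
    return (target_iters - base_iters) * SECONDS_PER_ITER / 3600

print(f"10k -> 20k adds ~{extra_hours(10_000, 20_000):.1f} h")         # ~8.3 h
print(f"15k run in total: ~{15_000 * SECONDS_PER_ITER / 3600:.1f} h")  # ~12.5 h
```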

Practical sweet-spot compromise (if you’re impatient)

Set max_iters = 15,000 → ~7.9 B tokens, ~2.1 epochs.
You’ll finish in 11–15 hours and still get roughly 90–95% of the achievable performance.
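
If your training script follows the common nanoGPT-style config pattern, the only values to touch are the iteration count and the learning-rate decay horizon. A minimal sketch with assumed variable names (max_iters, lr_decay_iters, eval_interval); adapt them to whatever your script actually reads:

```python
# Hypothetical nanoGPT-style config override; variable names are assumptions.
max_iters = 15_000          # or 20_000 if you can spare the extra hours
lr_decay_iters = max_iters  # keep the cosine LR schedule in sync with max_iters
eval_interval = 500         # evaluate val loss often enough to catch a plateau
```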

When you could stop early (monitor this!)

Run with max_iters = 20,000 but watch the validation loss curve: if it has clearly flattened out over the last couple of thousand iterations (or starts to rise), you can stop early and keep the best checkpoint without losing much.
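
As a rough illustration of what “flattened out” could mean in practice, here is a minimal sketch of a plateau check on your logged validation losses; the function, the window size, and the 1% threshold are all assumptions rather than part of any particular training script:

```python
def val_loss_plateaued(val_losses, window=4, min_rel_improvement=0.01):
    """Return True if the mean of the last `window` val-loss evals improved by
    less than `min_rel_improvement` (relative) over the `window` evals before.

    `val_losses` is the chronological list of validation losses logged at each
    eval; with eval_interval=500, window=4 looks back ~2,000 iterations.
    """
    if len(val_losses) < 2 * window:
        return False  # not enough history to judge yet
    prev = sum(val_losses[-2 * window:-window]) / window
    last = sum(val_losses[-window:]) / window
    return (prev - last) / prev < min_rel_improvement

# Example: a curve that has flattened out around ~2.5 triggers an early stop.
history = [3.20, 2.90, 2.70, 2.58, 2.54, 2.52, 2.51, 2.505, 2.502, 2.500, 2.499, 2.498]
print(val_loss_plateaued(history))  # True; safe to stop and keep the best checkpoint
```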

Bottom line

Just let it run overnight with 20k (or 15k). You’ll thank yourself tomorrow when you have a genuinely strong little model instead of a merely “okay” one.



x-ai/grok-4-fast
