Training Completed with Best Checkpoint
Training finished cleanly at iter 12000. Summary:
Final losses
- Train: 3.3565 / Val: 3.3308
- Best val: 3.2929 at iter 11500. Val ticked up slightly to 3.3308 at the final 12000 step, which could be noise or the very start of overfitting; either way, the iter-11500 checkpoint is your best (see the save-logic sketch after this list)
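For context, this is roughly the save logic nanoGPT's train.py runs at each eval; the function wrapper and the trimmed set of checkpoint keys are my sketch, not the exact script:

```python
import os
import torch

def maybe_save_checkpoint(raw_model, optimizer, losses, best_val_loss,
                          iter_num, out_dir, always_save_checkpoint=False):
    """Mirror of nanoGPT train.py's eval-time checkpointing.

    Caveat: nanoGPT overwrites a single ckpt.pt, so with
    always_save_checkpoint=True the last save wins, not the best-val one;
    with it False, ckpt.pt ends up holding the best-val state (iter 11500 here,
    since val got worse at iter 12000 and no save would fire).
    """
    if losses['val'] < best_val_loss or always_save_checkpoint:
        best_val_loss = losses['val']
        checkpoint = {
            'model': raw_model.state_dict(),
            'optimizer': optimizer.state_dict(),
            'iter_num': iter_num,
            'best_val_loss': best_val_loss,
        }
        torch.save(checkpoint, os.path.join(out_dir, 'ckpt.pt'))
    return best_val_loss
```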
Curve
- Started at 10.99, close to ln(50304) ≈ 10.83, the expected loss for near-uniform predictions at random init (quick check after this list)
- Fast drop: 10.99 → 4.39 by iter 1000
- Steady descent through iter ~7000 (val 3.42)
- The last ~5000 iters bought only ~0.1 of val-loss improvement; clear diminishing returns
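A one-liner confirms the starting point: cross-entropy under uniform predictions over a V-token vocabulary is ln(V), and nanoGPT pads GPT-2's 50,257-token vocab to 50,304 (a multiple of 64):

```python
import math

# Uniform predictions over V tokens give cross-entropy -ln(1/V) = ln(V).
print(math.log(50304))  # 10.826 -> the observed 10.99 start is right on target
```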
Health
- Train and val tracked very closely the whole run (gap ~0.05) → no meaningful overfitting
- ~9.34 s/iter at ~13–15% MFU
- ~6.3B tokens seen in total (12,000 iters × 524,288 tokens/iter; arithmetic and MFU estimate sketched after this list)
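To sanity-check those throughput numbers, here is the arithmetic plus an MFU estimate in the style of nanoGPT's estimate_mfu (the 6N + 12·L·H·Q·T flops-per-token approximation from the PaLM paper). The batch/block/accum split and the A100 bf16 peak are assumptions; only tokens/iter, s/iter, and the 124M GPT-2 shape come from the run:

```python
# Token count: 524,288 tokens/iter for 12,000 iters.
tokens_per_iter = 524_288  # e.g. batch 32 x block 1024 x 16 accum steps (a guess)
total_tokens = tokens_per_iter * 12_000
print(f"{total_tokens / 1e9:.2f}B tokens")  # 6.29B, i.e. the ~6.3B reported

# MFU = achieved flops / peak flops, using the PaLM-paper approximation
# flops_per_token = 6*N + 12*L*H*Q*T (nanoGPT's estimate_mfu does the same).
n_params = 124e6                      # GPT-2 small
L, H, Q, T = 12, 12, 64, 1024         # layers, heads, head dim, block size
flops_per_token = 6 * n_params + 12 * L * H * Q * T
flops_achieved = flops_per_token * tokens_per_iter / 9.34  # flops per second
peak_flops = 312e12                   # A100 bf16 peak (an assumption)
print(f"MFU ~ {flops_achieved / peak_flops:.1%}")  # ~15.4%, in line with 13-15%
```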
Comparison point: nanoGPT's reference GPT-2 (124M) reproduction reaches ~2.85 val on OpenWebText after ~600k iters. You stopped at 1/50th of that compute, so 3.29 is roughly where you'd expect: the model is undertrained, not broken.
Best checkpoint: out-fineweb/ (a checkpoint is written at every eval; the iter-11500 save is your best-val state, which you can verify with the sketch below).
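To confirm what is actually in out-fineweb/, a minimal load-and-inspect sketch, assuming the checkpoint dict uses nanoGPT train.py's keys ('model', 'iter_num', 'best_val_loss'):

```python
import os
import torch

# Inspect the saved checkpoint; key names follow nanoGPT's train.py,
# so adjust them if your training script saves a different layout.
ckpt = torch.load(os.path.join('out-fineweb', 'ckpt.pt'), map_location='cpu')
print('iter_num:', ckpt['iter_num'])            # expect 11500 for a best-val save
print('best_val_loss:', ckpt['best_val_loss'])  # expect ~3.2929

# nanoGPT prefixes weight names with '_orig_mod.' when torch.compile was used;
# strip it before load_state_dict, as sample.py does.
state_dict = {k.removeprefix('_orig_mod.'): v for k, v in ckpt['model'].items()}
```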