FineWeb GPT-2 Training Summary | Generated by AI


Status: Live and healthy; latest eval at step 8,500 of 12,000 (~71% complete). Started ~2026-04-29 13:46 (tmux session 0).

Configuration (config/train_fineweb.py)

| Param | Value |
| --- | --- |
| Model | 12 layers / 12 heads / 768 emb, no bias, dropout 0 (123.59M params) |
| Block size | 1024 |
| Batch | 8 micro-batch × 64 grad-accum = 512 effective seqs → 524,288 tokens/iter |
| Optimizer | AdamW (fused), lr 3e-4 → 3e-5 cosine, warmup 500 iters, β2 = 0.99, weight decay 0.1 |
| Schedule | max_iters = 12,000 (lr_decay_iters = 12,000) |
| Eval | every 500 iters, 200 batches; checkpoint at every eval (always_save_checkpoint=True) |
| Compile | on |
| Dataset | local FineWeb |
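The settings above correspond to a nanoGPT-style Python config. A hedged reconstruction follows; variable names use nanoGPT conventions and are assumptions, not the actual contents of `config/train_fineweb.py`. It also sketches the warmup-plus-cosine learning-rate schedule the optimizer row describes.

```python
import math

# --- Assumed reconstruction of config/train_fineweb.py (nanoGPT-style names) ---
n_layer = 12
n_head = 12
n_embd = 768
bias = False
dropout = 0.0
block_size = 1024

batch_size = 8                     # micro-batch (sequences per forward pass)
gradient_accumulation_steps = 64   # 8 * 64 = 512 effective sequences per step
tokens_per_iter = batch_size * gradient_accumulation_steps * block_size  # 524,288

learning_rate = 3e-4
min_lr = 3e-5
warmup_iters = 500
max_iters = 12_000
lr_decay_iters = 12_000
beta2 = 0.99
weight_decay = 0.1

eval_interval = 500
eval_iters = 200
always_save_checkpoint = True
compile = True
dataset = 'fineweb'

def get_lr(it):
    """Linear warmup, then cosine decay from learning_rate down to min_lr."""
    if it < warmup_iters:
        return learning_rate * (it + 1) / (warmup_iters + 1)
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)
```

The token arithmetic matches the table (8 × 64 × 1024 = 524,288 tokens per optimizer step), and `get_lr` hits 3e-4 at the end of warmup and 3e-5 at iter 12,000.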

Throughput

Loss curve (val every 500 iters)

step   train     val
 500   5.4508    5.5082
1000   4.3793    4.3908
2000   3.9059    3.8516
3000   3.7440    3.6908
4000   3.6282    3.5831
5000   3.5656    3.4789
6000   3.5130    3.4665   ← slight plateau
7000   3.4670    3.4154
7500   3.4410    3.3677
8000   3.4243    3.3951   ← small bump
8500   3.4017    3.3534   ← latest
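The "slight plateau" and "small bump" annotations above can be reproduced mechanically from the val column. A minimal sketch, with an illustrative plateau threshold of 0.02 (chosen here, not taken from the run):

```python
# Val losses transcribed from the table above.
vals = {
    500: 5.5082, 1000: 4.3908, 2000: 3.8516, 3000: 3.6908,
    4000: 3.5831, 5000: 3.4789, 6000: 3.4665, 7000: 3.4154,
    7500: 3.3677, 8000: 3.3951, 8500: 3.3534,
}

def flag_steps(vals, plateau_eps=0.02):
    """Flag steps where val loss rose ('bump') or barely improved ('plateau')."""
    steps = sorted(vals)
    flags = {}
    for prev, cur in zip(steps, steps[1:]):
        delta = vals[cur] - vals[prev]
        if delta > 0:
            flags[cur] = "bump"
        elif delta > -plateau_eps:
            flags[cur] = "plateau"
    return flags

print(flag_steps(vals))  # {6000: 'plateau', 8000: 'bump'}
```

With this threshold the only flagged steps are 6000 (improvement of just 0.0124) and 8000 (increase of 0.0274), matching the arrows in the table.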

Current state

Notes / risk flags

