Training Progress and ETA
Here’s an analysis of the running training job and its ETA.
Run state
- Process: `python3 train.py config/train_fineweb.py` in tmux 1 (started Wed Apr 29 13:46 HKT)
- Now: Apr 29 22:20 HKT → elapsed ≈ 8 h 34 min
- Last log line: `iter 3200: loss 3.4117, time 9332ms, mfu 13.57%`
- Last checkpoint: `out-fineweb/ckpt.pt` written at 21:35 (the step-3000 save). Size 1,492,570,892 B ≈ 1.39 GiB, which matches expectation for a 125M GPT-2 saved with optimizer state (fp32 model ≈ 0.5 GB + AdamW m/v ≈ 1.0 GB).
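The size expectation can be sanity-checked with quick byte arithmetic. A sketch, assuming ~124M fp32 parameters plus AdamW's two fp32 moment buffers per parameter; it ignores the small metadata overhead a real checkpoint file carries:

```python
# Back-of-envelope check of the checkpoint size. Assumes ~124M parameters
# stored in fp32 plus AdamW's two fp32 moment buffers per parameter;
# ignores the small metadata overhead in the real file.
n_params = 124e6
bytes_model = n_params * 4           # fp32 weights
bytes_optim = n_params * 4 * 2       # AdamW exp_avg + exp_avg_sq
total_gib = (bytes_model + bytes_optim) / 2**30
print(f"expected ckpt size ≈ {total_gib:.2f} GiB")  # → ≈ 1.39 GiB
```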
Config in use (`config/train_fineweb.py`)

| setting | value |
| --- | --- |
| dataset | fineweb |
| model | 12L / 12H / 768d, ~123.6M params |
| block_size | 1024 |
| batch / grad_accum | 8 × 64 → effective batch 512 seqs → 524,288 tokens/iter |
| max_iters / lr_decay_iters | 12000 |
| warmup | 500 |
| lr | 3e-4 → 3e-5 |
| eval_interval | 500 (always_save_checkpoint=True) |
| compile | True |
So total tokens at completion ≈ 12000 × 524,288 ≈ 6.29 B tokens (the comment’s “~3B” is stale).
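The tokens-per-iteration and total-token arithmetic can be reproduced directly from the config values above (a sanity-check sketch; nothing assumed beyond the table):

```python
# Reproduces the tokens/iter and total-token arithmetic from the config.
batch_size = 8          # sequences per micro-batch
grad_accum = 64         # micro-batches per optimizer step
block_size = 1024       # tokens per sequence
max_iters = 12_000

tokens_per_iter = batch_size * grad_accum * block_size
total_tokens = tokens_per_iter * max_iters
print(tokens_per_iter)       # 524288
print(total_tokens / 1e9)    # ≈ 6.29 (billions of tokens)
```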
Loss trajectory
| step | train | val |
| --- | --- | --- |
| 0 | 10.997 | 10.991 |
| 500 | 5.451 | 5.508 |
| 1000 | 4.379 | 4.391 |
| 1500 | 4.063 | 4.010 |
| 2000 | 3.906 | 3.852 |
| 2500 | 3.813 | 3.774 |
| 3000 | 3.744 | 3.691 |
Val loss is still dropping ~0.07–0.08 per 500 steps, and val sits below train: healthy, no sign of overfitting. Plenty of room left: a 125M GPT-2 on FineWeb-scale data should plateau roughly around 3.0–3.1 val.
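A crude way to verify the per-interval improvement claim from the table (val values copied straight from the log; no fitting, just differences):

```python
# Per-interval val-loss drops, computed from the logged values above.
val = {0: 10.991, 500: 5.508, 1000: 4.391, 1500: 4.010,
       2000: 3.852, 2500: 3.774, 3000: 3.691}
steps = sorted(val)
for a, b in zip(steps[:-1], steps[1:]):
    print(f"{a}->{b}: Δval = {val[a] - val[b]:.3f}")
```

The last two intervals print drops of 0.078 and 0.083, consistent with the "~0.07 per 500 steps" reading.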
When does it end?
- Per-iter time: 9.33 s (normal) + ~20 s eval every 500 iters → effective ~9.37 s/iter
- Full run: 12000 × 9.37 ≈ 112,400 s ≈ 31.2 h wall-clock
- Estimated current iter at 22:20: 30,840 s elapsed / 9.37 s/iter ≈ iter 3290
- Remaining: ~8700 iters × 9.37 s ≈ 22.6 h
- ETA: Thu Apr 30, ~20:50 HKT (±30 min depending on eval overhead)
Next checkpoint save is due at step 3500, ≈ iter 3290 + 210 → in ~33 min from now (around 22:53).
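The ETA arithmetic can be reconstructed as follows. A sketch: the 20 s eval cost per 500 iters is an estimate rather than a measurement, and the year in the `datetime` values is assumed (only the deltas matter); the result lands within the ±30 min window quoted above.

```python
from datetime import datetime, timedelta

# Reconstructs the ETA estimate. The 20 s eval cost per 500 iters is an
# estimate, and the year is assumed (only time deltas matter here).
iter_s = 9.33                    # per-iter time from the log
eval_s = 20 / 500                # amortized eval overhead per iter
eff_iter_s = iter_s + eval_s     # ≈ 9.37 s/iter
max_iters = 12_000

start = datetime(2026, 4, 29, 13, 46)
now = datetime(2026, 4, 29, 22, 20)
elapsed_s = (now - start).total_seconds()
cur_iter = int(elapsed_s / eff_iter_s)            # ≈ 3290
remaining_s = (max_iters - cur_iter) * eff_iter_s
eta = now + timedelta(seconds=remaining_s)
print(cur_iter, eta.strftime("%H:%M"))
```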
Things worth noting
- MFU is drifting down slightly (15.4% → 13.6%); not concerning, just a rolling-average artifact after eval steps.
- `gradient_accumulation_steps = 64` with `batch_size = 8` on a single GPU means each iter runs 64 micro-batches; that’s why 9.3 s/iter feels long even though throughput is fine (~56 K tok/s).
- The comment “~3B tokens seen” on the `max_iters` line is wrong; with block 1024 × effective batch 512 it’s ~6.3 B tokens at 12 K iters. Worth fixing if you care, but it doesn’t affect the run.
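The ~56 K tok/s figure follows directly from the batch arithmetic; a one-liner check using only numbers already stated above:

```python
# Throughput implied by the numbers above: ~524K tokens per 9.33 s iter.
tokens_per_iter = 8 * 64 * 1024   # batch × grad_accum × block_size
iter_s = 9.33                     # per-iter wall time from the log
tok_per_s = tokens_per_iter / iter_s
print(f"~{tok_per_s / 1e3:.0f}K tok/s")   # → ~56K tok/s
```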
Want me to keep monitoring and ping you at the next checkpoint, or sample from the current ckpt to sanity-check generations?