Finish Baseline Before Scaling Up
Let it finish. Reasons:
- You're ~76% through with ~14h left. The 46 GPU-hours already spent are sunk either way; killing now forfeits a complete run to save 14 more hours. Not worth it.
- You need a baseline. “How far can the 4070 go” only means something against a reference point. A clean 124M / 10B-token / val-3.0X run at this exact config is exactly the baseline you’ll compare every bigger run to. Stopping mid-cosine leaves you a worse baseline (LR not fully decayed, val loss noisier).
- No new info from stopping early. The loss curve flattened around step 12k; the remaining 4.5k iters are mostly LR-decay polish. Letting it run answers “where does 124M plateau on this config” — useful — without committing more than one extra day.
- The next experiment is multi-day anyway. A 350M (gpt2-medium) run at the same tokens/step is ~3–4× slower per step → 5–6 days. Spending 14h to lock in a baseline first is a rounding error.
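The “LR-decay polish” point can be made concrete with a sketch of a nanoGPT-style cosine schedule (linear warmup, cosine decay to a floor). The `warmup_iters`, `learning_rate`, `min_lr`, and `lr_decay_iters` values below are assumptions chosen to match a ~16.5k-iter run, not values read from your config:

```python
import math

# Assumed schedule parameters (nanoGPT-style); substitute your actual config.
learning_rate = 6e-4
min_lr = 6e-5
warmup_iters = 700
lr_decay_iters = 16_500

def get_lr(it: int) -> float:
    """LR at iteration `it`: linear warmup, then cosine decay to min_lr."""
    if it < warmup_iters:
        return learning_rate * it / warmup_iters
    if it > lr_decay_iters:
        return min_lr
    decay_ratio = (it - warmup_iters) / (lr_decay_iters - warmup_iters)
    coeff = 0.5 * (1.0 + math.cos(math.pi * decay_ratio))  # goes 1 -> 0
    return min_lr + coeff * (learning_rate - min_lr)

# Compare LR where the curve flattened (~12k) vs. the final step:
# the remaining iterations are spent annealing toward min_lr.
print(f"lr@12k = {get_lr(12_000):.2e}, lr@16.5k = {get_lr(16_500):.2e}")
```

By 12k the LR is already a fraction of its peak, which is why the tail of the run mostly sharpens the val number rather than changing the picture.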
The only reason to kill it would be if you’d discovered a bug in the run (you haven’t) or needed the GPU urgently for something else.
Suggested plan when it finishes tonight:
- Back up the final ckpt: `cp out-fineweb-gpt3/ckpt.pt out-fineweb-gpt3/ckpt_124M_10B_final.pt`. With `always_save_checkpoint=True`, the next run will otherwise stomp it.
- Note the final val loss and sample some generations (`python sample.py --out_dir=out-fineweb-gpt3`) as the baseline qualitative reference.
- Kick off the next rung. Two reasonable directions; pick one, not both:
- Scale params → gpt2-medium 350M, 10–20B tokens. Answers “can the 4070 train a 350M at all, and how does loss compare to 124M at the same token budget.”
- Scale tokens → keep 124M, bump `max_iters` to ~38k (≈20B tokens). Answers “is 124M token-starved at 10B, or saturated.”
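The ≈20B figure can be sanity-checked from tokens per optimizer step. The batch geometry below is nanoGPT's stock GPT-2 setting (12 micro-batch × 40 grad-accum × 1024 block), assumed here for the estimate; substitute your actual config:

```python
# Tokens per optimizer step = micro_batch * grad_accum * block_size.
# Assumed nanoGPT GPT-2 defaults, not values read from your run.
micro_batch = 12
grad_accum = 40
block_size = 1024

tokens_per_step = micro_batch * grad_accum * block_size  # 491,520
total_tokens = 38_000 * tokens_per_step
print(f"{tokens_per_step:,} tokens/step -> "
      f"{total_tokens / 1e9:.1f}B tokens at 38k iters")
```

At ~0.5M tokens/step, 38k iters lands just under 20B, so doubling `max_iters` from ~19k roughly doubles the token budget.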
The medium-scale run is the more informative one for your stated goal (find the 4070’s ceiling). I’d go that direction.
Want me to draft the train_fineweb_gpt2_medium.py config now so it’s ready to launch the moment this one finishes?
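For reference, a minimal sketch of what such a config might look like, in nanoGPT's config-override style (plain assignments that `train.py` picks up). Every value here is an assumption to be tuned, not a tested setting, and the micro-batch/grad-accum split in particular depends on what fits in the 4070's VRAM:

```python
# Hypothetical train_fineweb_gpt2_medium.py (nanoGPT config overrides).
# All values are illustrative assumptions, not tested settings.
out_dir = 'out-fineweb-gpt2-medium'

# gpt2-medium geometry: ~350M params
n_layer = 24
n_head = 16
n_embd = 1024
block_size = 1024

# Keep tokens/step comparable to the 124M run: shrink the micro-batch to
# fit VRAM, raise grad accumulation to compensate.
batch_size = 4
gradient_accumulation_steps = 120

max_iters = 19_000          # ~10B tokens at ~0.5M tokens/step
lr_decay_iters = 19_000     # decay over the full run, as in the 124M config
learning_rate = 3e-4        # lower peak LR for the larger model
min_lr = 3e-5
always_save_checkpoint = True
```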