Smoke Test and Training Plan Summary
Here are the smoke-test results, followed by the full training plan.
Smoke test results — /mnt/data/nanoGPT/out-gpt2-200m-smoke/smoke.log
| Metric | Value |
|---|---|
| Param count (printed by train.py) | 218.51M |
| Loss trajectory (25 iters) | 10.985 → 7.66 (normal cold-start descent) |
| Per-iter time at bs=2, grad_accum=4 | ~330 ms |
| Crashes / NaN / OOM | none |
VRAM probe (bf16, no compile)
| micro-batch | peak VRAM | ms / micro-step | ms / token |
|---|---|---|---|
| 2 | 5.55 GB | 117 | 0.0571 |
| 4 (chosen) | 8.01 GB | 167 | 0.0408 |
| 6 | 10.44 GB | 231 | 0.0376 |
| 8 | OOM | — | — |
bs=4 is the clear winner: ~28% less time per token than bs=2 (0.0571 → 0.0408 ms/token), with ~4 GB of headroom left for torch.compile and eval. I updated config/train_gpt2_200m.py to batch_size=4, gradient_accumulation_steps=128 (still 524,288 tokens/step).
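For reference, here's a minimal sketch of the relevant lines in config/train_gpt2_200m.py, assuming nanoGPT's standard config keys (the real file contains more settings than this):

```python
# config/train_gpt2_200m.py -- batch-size section only (sketch, not the full config)
# Standard nanoGPT config keys; values come from the VRAM probe above.
batch_size = 4                      # micro-batch size; peaks at ~8 GB VRAM in bf16
block_size = 1024                   # context length (GPT-2 default, implied by the token math)
gradient_accumulation_steps = 128   # 4 * 128 * 1024 = 524,288 tokens per optimizer step
```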
The plan
Architecture (custom, between GPT-2 small and medium):
n_layer = 18, n_head = 16, n_embd = 896 → 218.5M non-embedding params
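A quick back-of-the-envelope check on that figure (a sketch; nanoGPT's printed count excludes only the position embedding, since the token embedding is tied to the LM head):

```python
# Rough parameter count for n_layer=18, n_head=16, n_embd=896
# Assumes the GPT-2 BPE vocab padded to 50304, as in nanoGPT's default config.
n_layer, n_embd, vocab = 18, 896, 50304
blocks    = 12 * n_layer * n_embd**2   # attention (4*d^2) + MLP (8*d^2) per layer, ~173.4M
token_emb = vocab * n_embd             # tied with the LM head, ~45.1M
print(f"{(blocks + token_emb) / 1e6:.1f}M")   # ~218.5M, matching train.py's printout
```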
Why 4B tokens: the Chinchilla-optimal budget is ≈ 20 tokens per parameter → 20 × 219M ≈ 4.4B tokens, so your 4B target is right on the line.
Token budget: 524,288 tokens/step × 8,000 steps = 4.19B tokens.
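Spelling the budget arithmetic out (just the numbers above multiplied together, nothing new):

```python
# Chinchilla-style check: planned tokens vs. ~20 tokens per parameter
params          = 218.5e6
tokens_per_step = 4 * 128 * 1024           # batch_size * grad_accum * block_size = 524,288
planned         = tokens_per_step * 8_000  # ~4.19B tokens over the full run
optimal         = 20 * params              # ~4.37B "compute-optimal" tokens
print(f"{planned/1e9:.2f}B planned vs {optimal/1e9:.2f}B Chinchilla-optimal")
```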
Data: 1,408 FineWeb-Edu shards (~140B tokens) are already at data/fineweb/edu_fineweb100B/ — far more than we need, so no re-tokenization is required.
Time estimate on the RTX 4070:
| Setting | Time / iter | Wall-clock for 8,000 iters |
|---|---|---|
| bs=4, no compile | ~21.4 s | ~48 hours (~2 days) |
| bs=4, compile=True (≈25% speedup typical on Ada) | ~16 s | ~36 hours (~1.5 days) |
So budget roughly 1.5–2 days of uninterrupted GPU time. Add ~30 min for torch.compile warmup on the first iteration and ~30 min total for the 32 eval passes (one every 250 iters, roughly a minute each).
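A quick sanity check of the wall-clock arithmetic (the ~25% compile speedup is a typical Ada figure assumed above, not a measurement from this machine):

```python
# Wall-clock estimate: iterations * seconds per iteration, plus fixed overheads
iters        = 8_000
sec_per_iter = 128 * 0.167                 # grad_accum micro-steps * 167 ms each, ~21.4 s
compiled     = sec_per_iter * 0.75         # assuming the ~25% torch.compile speedup holds
overhead_h   = 0.5 + 0.5                   # ~30 min compile warmup + ~30 min of evals
for label, s in (("no compile", sec_per_iter), ("compile=True", compiled)):
    print(f"{label}: {iters * s / 3600 + overhead_h:.1f} hours")
# no compile: ~48.5 hours; compile=True: ~36.6 hours, i.e. the 1.5-2 day budget above
```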
Expected final val loss: ~2.92–3.00 (better than 124M’s ~3.0–3.1, just shy of 350M’s ~2.85–2.95).
To kick off the real run
python3 train.py config/train_gpt2_200m.py
If you want, I can also: (a) tee output to a log file, (b) wire up wandb, or (c) start it now in the background so you can monitor it.