Muon Momentum Crisis | Generated by AI
NOT GOOD. Here’s the comparison:
             Old run (200k)      New run (v2)
             val_bpb   loss      val_bpb   loss
Step 130k    0.780     2.602     0.810*    -
Step 140k    0.865     2.772     0.902     2.888
Current                ~2.56               ~2.91
*130k eval differs because eval_tokens changed (524288 vs 41943040)
val_bpb at 140k is 0.902, WORSE than the old run’s 0.865. min_val_bpb is still 0.780 from the 130k checkpoint, so the model hasn’t improved at all in 12k steps. Loss is flat at ~2.91, not decreasing.
ROOT CAUSE: It’s not the LR — it’s the Muon momentum.
Old run at step 130k: momentum = 0.90 (warmdown complete)
New run: momentum = 0.97 (flat, never decays)
The old run’s warmdown decayed momentum from 0.97→0.90. Your optimizer state was calibrated for momentum=0.90. Now it’s running at 0.97 — the velocity buffers are too aggressive, causing persistent instability.
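To make the difference between the two runs concrete, here is a minimal sketch of a momentum schedule that holds a peak value and then decays during warmdown. The function name `muon_momentum`, its parameters, and the linear decay shape are assumptions for illustration, not the actual nanochat code; only the 0.97 and 0.90 endpoints come from the runs above.

```python
def muon_momentum(step, num_iterations, warmdown_ratio, peak=0.97, final=0.90):
    """Hypothetical schedule: hold `peak`, then decay linearly to `final`
    over the last `warmdown_ratio` fraction of training (assumed shape)."""
    warmdown_steps = round(num_iterations * warmdown_ratio)
    warmdown_start = num_iterations - warmdown_steps
    # No warmdown configured, or warmdown not reached yet: momentum stays at peak.
    if warmdown_steps == 0 or step <= warmdown_start:
        return peak
    progress = min((step - warmdown_start) / warmdown_steps, 1.0)
    return peak + (final - peak) * progress


if __name__ == "__main__":
    for step in (130_000, 140_000, 167_000, 200_000):
        flat = muon_momentum(step, 200_000, warmdown_ratio=0.0)
        decayed = muon_momentum(step, 200_000, warmdown_ratio=0.33)
        print(f"step {step}: no warmdown {flat:.3f}, with warmdown {decayed:.3f}")
```

With no warmdown the sketch returns 0.97 at every step, which is the flat behaviour seen in the new run; with a warmdown it only reaches 0.90 at the final step.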
This run is unlikely to recover. I’d recommend killing it and restarting with a proper warmdown ratio so momentum decays naturally:
# Kill current run (Ctrl+C in tmux session 14)
# Then restart from the 130k checkpoint with warmdown-ratio=0.33
# so warmdown starts at step 134k and momentum decays 0.97→0.90 over the last 66k steps (ending at step 200k)
cd /mnt/data/nanochat && PYTORCH_ALLOC_CONF=expandable_segments:True .venv/bin/python -m scripts.base_train \
--run=rtx4070-d12-200k-v3 \
--depth=12 \
--aspect-ratio=64 \
--head-dim=128 \
--max-seq-len=2048 \
--window-pattern=L \
--device-batch-size=8 \
--total-batch-size=65536 \
--num-iterations=200000 \
--resume-from-step=130000 \
--warmdown-ratio=0.33 \
--final-lr-frac=0.05 \
--embedding-lr=0.015 \
--unembedding-lr=0.0004 \
--matrix-lr=0.001 \
--scalar-lr=0.025 \
--weight-decay=0.28 \
--warmup-steps=40 \
--eval-every=5000 \
--core-metric-every=10000 \
--sample-every=10000 \
--save-every=10000 \
--tracker=none
This gives:
- LR at resume (step 130k): lrm=1.0, effective matrix_lr=0.000354 (matches old run) ✓
- Momentum at resume: 0.97 (same as v2 — brief spike, but LR is low enough to absorb it)
- Momentum at step 200k: 0.90 (matches old run’s end state) ✓
- LR at step 200k: lrm=0.05, effective matrix_lr=0.000018 ✓
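The LR figures in this list can be sanity-checked with a few lines of arithmetic. This is a sketch under assumptions: `lr_multiplier` is a hypothetical helper with a linear warmdown, and warmup is ignored because the resume step is far past `--warmup-steps=40`. The 0.000354 base value is taken from the list as given; it happens to equal 0.001 × sqrt(65536 / 524288), which would be consistent with square-root batch-size scaling against a 524288-token reference batch, but that reading is a guess and not something stated here.

```python
import math


def lr_multiplier(step, num_iterations, warmdown_ratio, final_lr_frac):
    """Hypothetical LR multiplier: 1.0 until warmdown starts, then a linear
    ramp down to `final_lr_frac` at the last step (assumed shape)."""
    warmdown_steps = round(num_iterations * warmdown_ratio)
    warmdown_start = num_iterations - warmdown_steps
    if warmdown_steps == 0 or step <= warmdown_start:
        return 1.0
    progress = min((step - warmdown_start) / warmdown_steps, 1.0)
    return 1.0 + (final_lr_frac - 1.0) * progress


NUM_ITERATIONS = 200_000
WARMDOWN_RATIO = 0.33
FINAL_LR_FRAC = 0.05
EFFECTIVE_MATRIX_LR = 0.000354  # value quoted in the list above

# Step at which warmdown begins under these flags.
warmdown_start = NUM_ITERATIONS - round(NUM_ITERATIONS * WARMDOWN_RATIO)

lrm_resume = lr_multiplier(130_000, NUM_ITERATIONS, WARMDOWN_RATIO, FINAL_LR_FRAC)
lrm_final = lr_multiplier(NUM_ITERATIONS, NUM_ITERATIONS, WARMDOWN_RATIO, FINAL_LR_FRAC)

# Guess only: square-root batch scaling of --matrix-lr=0.001.
sqrt_scaled_guess = 0.001 * math.sqrt(65536 / 524288)

if __name__ == "__main__":
    print("warmdown starts at step", warmdown_start)
    print("lrm at resume:", lrm_resume)
    print("effective matrix_lr at 200k:", round(EFFECTIVE_MATRIX_LR * lrm_final, 6))
    print("sqrt-scaled guess:", round(sqrt_scaled_guess, 6))
```

Under these assumptions the resume step (130k) sits before the start of warmdown, so lrm=1.0 at resume and the multiplier only reaches 0.05 at step 200k.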
Want me to kill the current run and start this?