Training Run Undertrained per Scaling Laws | Generated by AI

2026.06

Question: How’s this nanochat training run?

Answer:

The good:

The problem — you’re massively undertrained:

The Chinchilla scaling law (Hoffmann et al., 2022) puts the compute-optimal tokens:param ratio at about 20. You’re at 2.29 (655M tokens / 286M params). That’s an 8.7x deficit.

# Chinchilla law: tokens ≈ 20 * params
# Your run:
params = 286e6
tokens = 655e6
ratio = tokens / params  # ≈ 2.29 tokens per parameter
optimal_tokens = 20 * params  # 5.72B tokens needed

print(f"You need {optimal_tokens / 1e9:.2f}B tokens, trained on {tokens / 1e9:.2f}B")
print(f"Deficit: {optimal_tokens / tokens:.1f}x")

Output: You need 5.72B tokens, trained on 0.66B. Deficit: 8.7x

This is consistent with the sample quality. The model has picked up some facts (it lists the planets correctly) but can’t anchor them reliably (it gets the capital of France wrong and loops on the symbol for gold). It’s memorizing weak patterns, not learning robust representations.

What you should do:

Option 1 (recommended given MI300X access): Scale down, train to Chinchilla ratio
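
As a rough sketch of what Option 1 implies, the same 20:1 rule of thumb can be inverted to estimate the model size that the existing token budget supports. The function name `optimal_params` is purely illustrative, and the ratio of 20 is the approximate figure quoted above, not an exact constant.

```python
# Invert the ~20 tokens-per-parameter rule of thumb: given a fixed token
# budget, estimate how large a model that budget supports.
CHINCHILLA_RATIO = 20  # tokens per parameter (approximate rule of thumb)


def optimal_params(tokens: float, ratio: float = CHINCHILLA_RATIO) -> float:
    """Return the roughly compute-optimal parameter count for a token budget."""
    return tokens / ratio


tokens = 655e6  # tokens seen in the current run
target = optimal_params(tokens)
print(f"{tokens / 1e6:.0f}M tokens supports roughly {target / 1e6:.0f}M params")
```

Under this rule, 655M tokens corresponds to a model of about 33M parameters, far smaller than the current 286M.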

Option 2 (if you want to push 286M): Train 10x longer on MI300X
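
A quick check of what Option 2 buys, using only the parameter and token counts from this run. The 10x multiplier is the one named in the option; everything else is plain arithmetic.

```python
# How far does "train 10x longer" move the tokens:param ratio?
params = 286e6       # model size in this run
tokens_now = 655e6   # tokens seen so far
multiplier = 10      # "10x longer" from Option 2

tokens_after = tokens_now * multiplier
ratio_after = tokens_after / params

print(f"Tokens after {multiplier}x: {tokens_after / 1e9:.2f}B")
print(f"Tokens per parameter: {ratio_after:.1f}")
```

Training 10x longer gives about 6.55B tokens, or roughly 22.9 tokens per parameter, which lands slightly past the ~20 target.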

The hardware reality:

One technical note: Your warmdown starts at 65% of training (step 6500). If you scale to 40k steps, the same 65% mark lands at step 26k. Consider tuning the warmdown schedule: if it starts too early, you’re throwing away learning signal in the final stretch.
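
To make the warmdown arithmetic concrete, here is a minimal sketch of a constant-then-linear-decay schedule. It assumes the learning rate stays flat and then decays linearly to zero over the final 35% of steps; the actual schedule in this run may differ in shape or floor. `lr_multiplier` and `warmdown_frac` are hypothetical names, and the 10,000-step total is inferred from "step 6500 at 65%".

```python
def lr_multiplier(step: int, total_steps: int, warmdown_frac: float = 0.35) -> float:
    """Constant LR, then linear decay to zero over the last warmdown_frac of steps.

    Assumed schedule shape for illustration only.
    """
    warmdown_start = round(total_steps * (1 - warmdown_frac))
    warmdown_steps = total_steps - warmdown_start
    if warmdown_steps == 0 or step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / warmdown_steps)


# Where does the warmdown begin for the current run and for a 40k-step run?
for total in (10_000, 40_000):
    start = round(total * (1 - 0.35))
    print(f"total={total}: warmdown starts at step {start}")
```

Keeping the fraction fixed moves the start from step 6500 to step 26000 as the run grows. Lowering `warmdown_frac` would keep the learning rate at full strength for longer.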

Bottom line: Don’t keep pushing 286M on 655M tokens. Either shrink the model or scale the compute. Right now you’re in a bad regime: big enough to be slow, but trained on too few tokens to benefit from that size. Pick one direction and commit.

