Loss decreases beyond Chinchilla optimum.

Generated by AI

2026.06

Question: Why does loss continue to decrease even though you’re training 45.8x past Chinchilla-optimal (13.1B tokens on 286M params vs. optimal 20:1)?

Answer

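You’re conflating two different things: compute-optimal scaling (Chinchilla) and the point at which loss actually plateaus. They’re not the same. One correction to the framing first: 13.1B tokens on 286M params is about 45.8 tokens per parameter, which is roughly 2.3x the 20:1 ratio, not 45.8x past it.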

The Core Dynamics

Chinchilla-optimal is about allocation, not saturation. It answers: “Given a fixed compute budget, what model/data split is most efficient?” The answer is ~20:1 tokens/param. But that doesn’t mean loss stops decreasing past that ratio. It means you’re spending compute less efficiently: for the same budget, you could have trained a bigger model on fewer tokens and reached a lower loss.
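To make the allocation point concrete, here is a small sketch of the arithmetic. It assumes the common approximation that training compute is about 6 × params × tokens, which is not stated in the question, and it uses the 20:1 figure as a rule of thumb rather than an exact law.

```python
# Numbers from the question; the 20:1 target is the usual rule of thumb.
params = 286e6
tokens = 13.1e9
chinchilla_ratio = 20.0

tokens_per_param = tokens / params               # about 45.8
overshoot = tokens_per_param / chinchilla_ratio  # about 2.3x, not 45.8x

# Assumption: training compute C ~= 6 * params * tokens (standard approximation).
compute = 6 * params * tokens

# Compute-matched split at 20 tokens/param: solve 6 * N * (20 * N) = C for N.
optimal_params = (compute / (6 * chinchilla_ratio)) ** 0.5
optimal_tokens = chinchilla_ratio * optimal_params

print(f"tokens/param: {tokens_per_param:.1f} ({overshoot:.2f}x the 20:1 ratio)")
print(f"same compute at 20:1: {optimal_params / 1e6:.0f}M params, "
      f"{optimal_tokens / 1e9:.2f}B tokens")
```

Under these assumptions, the same compute would roughly cover a model of about 430M parameters trained on about 8.7B tokens. That is the trade Chinchilla describes. It says nothing about whether the 286M model has stopped improving.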

Your model’s loss is still decreasing because:

1. Excess Model Capacity

Your 286M model with 12 layers × 6 heads hasn’t saturated. The learning curve looks roughly like:

loss = a / (tokens)^b + noise

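where b ≈ 0.08-0.10 for this scale (an empirical fit, not an exact analytic result). Under this form the loss decreases smoothly and never hits a hard ceiling: with b ≈ 0.09, each doubling of tokens lowers loss by about 6%. At 8.5B tokens (Phase 1), you’re still on a part of the curve where that adds up to a visible drop.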
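If you want to estimate b from your own logs instead of trusting the quoted range, a log-log least-squares fit is enough. The sketch below is an illustration only: the helper name `fit_power_law` is hypothetical, and the data points are synthetic values generated from the formula itself, not real training logs.

```python
import math

def fit_power_law(tokens, losses):
    """Least-squares fit of log(loss) = log(a) - b * log(tokens); returns (a, b)."""
    xs = [math.log(t) for t in tokens]
    ys = [math.log(l) for l in losses]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope

# Synthetic, noise-free points generated with b = 0.09 (NOT real training logs).
true_a, true_b = 20.0, 0.09
tokens = [1e9, 2e9, 4e9, 8.5e9]
losses = [true_a / t ** true_b for t in tokens]

a, b = fit_power_law(tokens, losses)
print(f"fitted b = {b:.3f}")
print(f"loss ratio per doubling of tokens = {2 ** -b:.3f}")
```

With real logs, replace `tokens` and `losses` with (tokens seen, smoothed train loss) pairs from a stretch where the learning rate changes slowly. The schedule itself moves the loss, and a pure power law does not model that.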

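The validation bpb at resume (0.810) suggests the model is still generalizing. Note that bpb and the training loss are in different units, so compare their trends rather than their values: if the model were overfitting, val bpb would rise while train loss kept falling. It hasn’t.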

2. Data Diversity >> Model Capacity

With a BPE vocab of 32K tokens and a 2K context window, the space of possible input sequences dwarfs the 8.5B tokens seen so far, even with full shuffling. As a loose upper bound:

possible sequences ≤ (vocab_size)^(context_length) >> 8.5B tokens

So the model is still exploring the data manifold. It’s not “running out of things to learn.”
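For a sense of scale, the bound can be evaluated in log space. This sketch assumes "32K" means 32,768 and "2K" means 2,048. The bound counts every possible token sequence, and real text occupies only a tiny fraction of that space, so it overstates the diversity the model actually sees.

```python
import math

vocab_size = 32_768      # assumption: "32K" means 2**15
context_length = 2_048   # assumption: "2K" means 2**11
tokens_seen = 8.5e9

# vocab_size ** context_length overflows a float, so compare exponents instead.
log10_possible = context_length * math.log10(vocab_size)
log10_seen = math.log10(tokens_seen)

print(f"possible sequences ~ 10^{log10_possible:.0f}")
print(f"tokens seen        ~ 10^{log10_seen:.1f}")
```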

3. Learning Rate Schedule is Still Active

Look at your schedule:

Phase 1: Warmup (40 steps), then cosine decay
Final LR multiplier at 130k: 0.20 (20% of the initial 0.02 for matrices, i.e. 0.004)
Phase 2: Entering warmdown at 65% of 200k = 130k steps

You’re still in the main cosine decay, not yet in the long tail. At a 0.20 multiplier the matrix LR is still 0.004, so each step makes meaningful progress. Check your actual LR curve to confirm.
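A minimal sketch of the schedule as described, assuming linear warmup over 40 steps followed by a cosine decay that reaches a 0.20 multiplier at step 130k. The function name `lr_multiplier` and the exact shape are assumptions. Your trainer's real schedule may differ, so treat this as a way to sanity-check the numbers, not as a copy of your code.

```python
import math

def lr_multiplier(step, warmup_steps=40, decay_end=130_000, final_frac=0.20):
    """Hypothetical schedule: linear warmup, then cosine decay from 1.0 to final_frac."""
    if step < warmup_steps:
        return step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (decay_end - warmup_steps))
    return final_frac + (1.0 - final_frac) * 0.5 * (1.0 + math.cos(math.pi * progress))

base_lr = 0.02  # initial matrix LR quoted in the schedule summary
for step in (0, 40, 65_000, 130_000):
    print(f"step {step:>7}: lr = {base_lr * lr_multiplier(step):.4f}")
```

Plotting this next to the LR your trainer actually logs is a quick way to see whether the run is where you think it is on the curve.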

4. Batch Size Masks Gradient Noise

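With batch_size = 65,536 tokens per step (32 sequences of 2K tokens, modest by LLM standards but evidently enough here), each gradient is averaged over enough tokens that noise does not dominate the update. At the same learning rate, a smaller batch would hit a higher noise floor and plateau earlier.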
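The sketch below checks the token bookkeeping and illustrates the usual 1/sqrt(batch size) scaling of gradient noise with a toy simulation. The "gradients" are random draws from a normal distribution, not real model gradients, and the 2,048-token context length is an assumption.

```python
import random
import statistics

tokens_per_step = 65_536
context_length = 2_048  # assumption: "2K" means 2048

sequences_per_step = tokens_per_step // context_length
print(f"sequences per step: {sequences_per_step}")
print(f"tokens at 130k steps: {tokens_per_step * 130_000:,}")
print(f"tokens at 200k steps: {tokens_per_step * 200_000:,}")

def batch_mean_std(batch_size, trials=2_000, seed=0):
    """Std of the batch mean of toy per-sample 'gradients' drawn from N(1, 1)."""
    rng = random.Random(seed)
    means = [
        sum(rng.gauss(1.0, 1.0) for _ in range(batch_size)) / batch_size
        for _ in range(trials)
    ]
    return statistics.pstdev(means)

noise_small = batch_mean_std(32)
noise_large = batch_mean_std(128)
# Theory predicts a ratio near sqrt(128 / 32) = 2 for independent samples.
print(f"noise ratio (B=32 vs B=128): {noise_small / noise_large:.2f}")
```

The step counts line up with the token totals quoted in the question: 130k steps is about 8.5B tokens and 200k steps is about 13.1B.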

Why Overfitting Hasn’t Kicked In Yet

Overfitting typically emerges when:

Val loss > train loss significantly
Validation metrics plateau while train loss keeps dropping
You reach a hard data repetition limit

None of that’s happening. Your CORE metric (0.1916 → 0.1932) and val bpb suggest generalization is still healthy.
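The first two criteria can be turned into a simple check on logged evals. This is a hedged sketch: `looks_overfit` is a hypothetical helper, the window and tolerance are arbitrary choices, and the example histories are made-up values for illustration, not your logs.

```python
def looks_overfit(train_losses, val_metrics, window=3, tol=0.0):
    """True if, across the last `window` evals, train loss fell while the val metric rose."""
    if len(train_losses) < window or len(val_metrics) < window:
        return False
    train_delta = train_losses[-1] - train_losses[-window]
    val_delta = val_metrics[-1] - val_metrics[-window]
    return train_delta < 0 and val_delta > tol

# Made-up histories for illustration only.
healthy = looks_overfit([2.60, 2.58, 2.57], [0.815, 0.812, 0.810])
diverging = looks_overfit([2.60, 2.58, 2.57], [0.810, 0.815, 0.822])
print(healthy, diverging)
```

The check compares trends rather than absolute values, so it works even when train loss and the validation metric are in different units.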

What You Should Expect

Loss will keep decreasing until one of these hits:

Learning rate becomes negligible (end of Phase 2, when LR → 0.001)
Data repetition saturation — you’ve seen most token n-grams from your dataset
Overfitting emerges — val metrics degrade relative to train
Batch size floor — gradient noise dominates

Given your setup, I’d expect:

Phase 2 loss: 2.57 → ~2.45-2.50 (diminishing returns, not dramatic drops)
Val bpb: ~0.78-0.82 (stable)
By step 200k: loss plateaus, not because you’re “overtrained,” but because LR is negligible
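The Phase 2 range above is roughly what the power-law form predicts if you extend it from 8.5B to 13.1B tokens. A minimal sketch, assuming a pure power law with no irreducible-loss floor and ignoring the extra drop that LR warmdown usually adds (the helper `project_loss` is hypothetical):

```python
def project_loss(loss_now, tokens_now, tokens_later, b):
    """Scale loss_now by (tokens_later / tokens_now) ** -b (pure power law, no floor)."""
    return loss_now * (tokens_later / tokens_now) ** -b

loss_at_resume = 2.57
projections = {}
for b in (0.08, 0.09, 0.10):
    projections[b] = project_loss(loss_at_resume, 8.5e9, 13.1e9, b)
    print(f"b = {b:.2f}: projected loss ~ {projections[b]:.2f}")
```

For b between 0.08 and 0.10 this lands between roughly 2.46 and 2.48, inside the 2.45-2.50 range, which is why the expectation is diminishing returns and not a dramatic drop.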

The Real Insight

Chinchilla says: “For a fixed compute budget, a bigger model trained on fewer tokens than yours would reach a lower loss.”

But it doesn’t say: “Stop training past 20:1; loss won’t improve.”

You’re training at 45:1 because:

You want a smaller, cheaper-to-serve model (286M is small enough for edge/mobile deployment)
You have GPU capacity to throw at it
Loss still decreasing = return on investment is positive

This is the right call for inference-optimized models (think distillation, LoRA bases, or edge deployment).

References:

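Chinchilla paper: “Training Compute-Optimal Large Language Models” (Hoffmann et al., 2022). Source of the ~20:1 tokens-per-parameter rule of thumb; it does not address when loss plateaus
Empirically: OpenAI’s scaling-law results (Kaplan et al., 2020, “Scaling Laws for Neural Language Models”) report loss ∝ tokens^(-0.095) in the data-limited regime, consistent with the 0.08 to 0.10 range used here: no hard saturation, just diminishing returns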
Your metrics: CORE 0.1916, val bpb 0.780 suggest healthy generalization, not overfitting
