Model Capacity Saturation Diagnosis | Generated by AI

This is a classic case of model capacity saturation disguised as a training issue.

I have gone through your logs and configuration carefully. The short answer: you are not training the model you think you are training.

Here is the “hard thinking” breakdown of why your loss has plateaued and how to fix it.


1. The Critical Configuration Error (Size Mismatch)

You commented in your config:

# Model (~125M parameters) – perfect for 12 GB VRAM

But your logs explicitly state:

number of parameters: 40.56M

Why this happened: You set n_embd = 384.

By halving the embedding dimension relative to GPT-2 small's 768, you cut the parameter count roughly threefold: the transformer blocks, whose linear layers scale quadratically with n_embd, shrink by a factor of 4, while the embedding tables, which scale linearly, shrink by a factor of 2.

The Result: You are training a 40M parameter model. A loss of ~3.57 is actually quite decent for a model this tiny. It has plateaued because it is saturated: it simply does not have the capacity to memorize or generalize the more complex patterns in the FineWeb dataset.
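As a sanity check, here is a rough parameter estimator for a GPT-2-style decoder. This is a hypothetical helper, assuming learned position embeddings and a weight-tied output head (as in nanoGPT), and ignoring biases and LayerNorm:

def gpt2_params(n_layer, n_embd, vocab=50257, block=1024):
    blocks = n_layer * 12 * n_embd ** 2   # per block: attention 4*d^2 + MLP 8*d^2
    wte = vocab * n_embd                  # token embedding (tied with the LM head)
    wpe = block * n_embd                  # learned position embedding
    return blocks + wte + wpe

print(f"{gpt2_params(12, 384) / 1e6:.1f}M")  # ~40.9M (~40.5M once wpe is excluded, matching your log)
print(f"{gpt2_params(12, 768) / 1e6:.1f}M")  # ~124.3M, the GPT-2 small you intended

Note how at n_embd = 384 the quadratic block term (~21M) is barely larger than the embedding table (~19M); at 768 the blocks dominate.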

2. The “Val < Train” Anomaly (Over-Regularization)

Look at your Step 14500: the validation loss comes in below the training loss.

Diagnosis: This usually happens when dropout is active during training but disabled during validation. You have dropout = 0.1.

The Fix: For a model this small (40M) trained on a dataset this massive (FineWeb), you do not need dropout. The model is effectively underfitting a near-infinite stream of data, and dropout is deleting 10% of your tiny model's brain power on every step, which only slows learning.
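A minimal sketch of the mechanism in plain PyTorch (a standalone toy model, not your training code): the same weights on the same data score worse in train() mode than in eval() mode, which is exactly the gap you are seeing:

import torch
import torch.nn as nn

torch.manual_seed(0)
net = nn.Sequential(nn.Linear(16, 16), nn.Dropout(p=0.1), nn.Linear(16, 1))
x, y = torch.randn(256, 16), torch.randn(256, 1)
loss_fn = nn.MSELoss()

net.train()  # dropout active: a random 10% of activations are zeroed each pass
with torch.no_grad():
    train_loss = sum(loss_fn(net(x), y).item() for _ in range(100)) / 100

net.eval()   # dropout disabled: the full network runs deterministically
with torch.no_grad():
    eval_loss = loss_fn(net(x), y).item()

print(train_loss > eval_loss)  # True: dropout noise inflates the train-mode loss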

3. You have passed the “Chinchilla Optimal” Point

Let’s do the math on your training volume:

Chinchilla Scaling Laws state: for a 40M-parameter model, the compute-optimal training volume is roughly $40\text{M params} \times 20\ \tfrac{\text{tokens}}{\text{param}} \approx 800\text{M tokens}$.

You have trained this model on 7.6 Billion tokens, nearly 10x the compute-optimal amount. Past that point, each additional token buys almost nothing, because the model, not the data, is the bottleneck.
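The arithmetic, spelled out (parameter and token counts taken from your logs; 20 tokens per parameter is the usual Chinchilla rule of thumb):

params = 40.56e6            # from your log
tokens_trained = 7.6e9      # from your log
tokens_optimal = 20 * params
print(f"{tokens_optimal / 1e9:.2f}B tokens optimal")           # ~0.81B
print(f"{tokens_trained / tokens_optimal:.1f}x past optimal")  # ~9.4x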

4. Performance Bottleneck (MFU is low)

Your Model FLOPs Utilization (MFU) is hovering around 11-12%. This is a symptom of the same root cause: a 40M-parameter model issues matrix multiplications too small to keep a modern GPU's tensor cores busy. Widening the model to n_embd = 768 quadruples the per-layer matmul sizes and should raise MFU on its own.
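If you want to track this yourself, MFU is simply achieved FLOP/s divided by the GPU's peak FLOP/s. A sketch using the standard ~6·N·D FLOPs-per-token approximation (the timing and peak_flops values below are placeholders, not measurements from your run):

def mfu(n_params, tokens_per_iter, iter_time_s, peak_flops):
    # ~6 FLOPs per parameter per training token (forward + backward)
    return (6 * n_params * tokens_per_iter) / (iter_time_s * peak_flops)

# placeholder numbers: batch 16 x 1024-token context, 0.5 s/iter, 71 TFLOP/s peak
print(mfu(40.56e6, 16 * 1024, 0.5, 71e12))  # ~0.11 with these placeholder inputs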


Solution: How to get Unstuck

To get the loss down to ~3.0 or lower (standard for GPT-2 small), you need to increase the brain size.

Step 1: Correct the Model Config

To get the actual 124M model you wanted:

n_layer = 12
n_head = 12
n_embd = 768  # CHANGED from 384 to 768
dropout = 0.0 # CHANGED from 0.1 (Data is large enough, don't throttle the model)
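Before launching, it is worth confirming the new count. A quick check, assuming you are on stock nanoGPT (GPTConfig and GPT live in model.py; adjust the import for your fork):

from model import GPT, GPTConfig  # nanoGPT's model.py

cfg = GPTConfig(block_size=1024, vocab_size=50304,
                n_layer=12, n_head=12, n_embd=768, dropout=0.0)
model = GPT(cfg)  # prints "number of parameters: ...", expect ~124M this time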

Step 2: Adjust Batch Size for VRAM

Since you are increasing the model width (n_embd), your VRAM usage will rise substantially: parameters and optimizer state roughly triple, and activations roughly double. You are currently using batch_size=16; expect to drop it (to 8, or even 4) and recover the effective batch with gradient_accumulation_steps so that tokens per optimizer step stay constant, as in the sketch below.
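For example (illustrative numbers, assuming you are currently at gradient_accumulation_steps = 1 with a 1024-token context):

batch_size = 8                    # halved to make room for the wider model
gradient_accumulation_steps = 2   # doubled: 8 * 2 * 1024 = 16,384 tokens/step, same as before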

Step 3: Reset and Retrain

You cannot resume from the current checkpoint: its weight tensors are 384 wide, so they will no longer match the new 768-wide architecture, and loading will fail with shape-mismatch errors. You must start from iter 0.
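In nanoGPT terms, that means setting this in your training config (assuming the stock train.py):

init_from = 'scratch'  # not 'resume': the old 384-wide checkpoint tensors cannot be loaded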

What to expect: With n_embd=768, you should see the loss drop significantly faster and break through the 3.5 barrier, likely settling around 3.2 or 3.0 by step 10,000.

Would you like me to calculate the exact gradient_accumulation_steps needed to maintain your training stability with the larger model size?


google/gemini-3-pro-preview
