Same Information, Different Scale

Generated by AI

2026.06

Yes, exactly.

PPL = exp(L) where L is the cross-entropy loss — they are the same information, different scale.

import torch
import torch.nn.functional as F

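vocab_size = 50257

# Zero logits give an exactly uniform distribution over the vocabulary,
# which is roughly what a freshly initialized model predicts.
# (Unit-variance torch.randn logits would add about 0.5 nats to the loss.)
logits = torch.zeros(8, 512, vocab_size)   # (batch, seq_len, vocab)
targets = torch.randint(0, vocab_size, (8, 512))

# cross_entropy expects (N, vocab) logits and (N,) integer targets
loss = F.cross_entropy(logits.view(-1, vocab_size), targets.view(-1))
ppl  = torch.exp(loss)

print(f"loss={loss:.4f}  ppl={ppl:.1f}")
# loss ≈ 10.8249, ppl ≈ 50257: uniform predictions give ppl = vocab_size ✓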

The relationship is exact:

L   = -1/N · Σ log P(xᵢ | x<ᵢ)    # nats (natural log)
PPL = exp(L)
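
To make the two formulas concrete, here is a minimal sketch in plain Python (no torch needed). The function name `loss_and_ppl` and the four probabilities are made up for illustration; they are not from any real model. It also checks a direct consequence of the definition: PPL is the inverse geometric mean of the probabilities the model assigned to the correct tokens.

```python
import math

def loss_and_ppl(token_probs):
    """token_probs[i] is the probability the model gave to the correct token at position i."""
    n = len(token_probs)
    # Mean negative log-likelihood, in nats (natural log)
    loss = -sum(math.log(p) for p in token_probs) / n
    return loss, math.exp(loss)

# Hypothetical per-token probabilities for a 4-token sequence
probs = [0.5, 0.25, 0.125, 0.125]
loss, ppl = loss_and_ppl(probs)

# Inverse geometric mean of the same probabilities
geo_mean = math.prod(probs) ** (1 / len(probs))

print(f"loss={loss:.4f}  ppl={ppl:.4f}  1/geo_mean={1 / geo_mean:.4f}")
```

Both printed PPL values agree, because exp of a mean of logs is a geometric mean.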

So:

L (loss)	PPL	Meaning
ln(50257) ≈ 10.82	~50257	untrained model (uniform guess)
3.0	~20	GPT-2 small on WebText
1.5	~4.5	strong model
0.0	1.0	perfect / memorized
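
The PPL column can be reproduced from the loss column with nothing but `math.exp`. This is a small sketch; the helper names `loss_to_ppl` and `ppl_to_loss` are illustrative, and the vocabulary size is the 50257 used in the first example.

```python
import math

def loss_to_ppl(loss_nats):
    # Loss must be in nats (natural log), as F.cross_entropy reports it
    return math.exp(loss_nats)

def ppl_to_loss(ppl):
    # Inverse direction: the loss a given perplexity corresponds to
    return math.log(ppl)

vocab_size = 50257

for loss in (math.log(vocab_size), 3.0, 1.5, 0.0):
    print(f"loss={loss:6.2f}  ppl={loss_to_ppl(loss):10.1f}")
```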

Why two names for the same thing?

L (loss) is what the optimizer sees — additive, differentiable, numerically stable
PPL is what humans read — “model is as uncertain as picking from N equally likely tokens”
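
The "picking from N equally likely tokens" reading can be checked directly. In this sketch (the function name `uniform_ppl` is hypothetical), a model that gives every correct token probability 1/N ends up with perplexity N, whatever the sequence length.

```python
import math

def uniform_ppl(n_choices, n_tokens=1000):
    # Every correct token receives probability 1 / n_choices
    per_token_nll = [-math.log(1.0 / n_choices) for _ in range(n_tokens)]
    loss = sum(per_token_nll) / n_tokens   # additive: just a mean of per-token terms
    return math.exp(loss)

for n in (2, 20, 50257):
    print(f"N={n:6d}  ppl={uniform_ppl(n):.3f}")
```

So a PPL of 20 means the model is, on average, as unsure as if it were choosing among 20 equally likely tokens.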

In your nanoGPT training loop, you’re already minimizing PPL: exp is monotonically increasing, so every step that lowers the loss lowers PPL too. You just see it as loss. Log it both ways:

print(f"iter {step} | loss {loss.item():.4f} | ppl {torch.exp(loss).item():.1f}")
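
When the logged number comes from several evaluation batches, one reasonable approach (an assumption here, not something the training loop above prescribes) is to average the losses first and exponentiate once. The sketch below uses made-up batch losses and assumes equal-sized batches; averaging per-batch perplexities instead gives a larger number, because exp is convex.

```python
import math

def ppl_from_batch_losses(batch_losses):
    # Average in loss space, then convert once
    mean_loss = sum(batch_losses) / len(batch_losses)
    return math.exp(mean_loss)

# Hypothetical mean losses from three equal-sized validation batches
batch_losses = [2.9, 3.0, 3.4]

ppl_of_mean = ppl_from_batch_losses(batch_losses)
mean_of_ppl = sum(math.exp(l) for l in batch_losses) / len(batch_losses)

print(f"exp(mean loss) = {ppl_of_mean:.2f}")
print(f"mean(exp loss) = {mean_of_ppl:.2f}")
```

Only the first value matches PPL = exp(L) with L taken over all tokens.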

On your MI300X GPT-2 760M run: what’s your current val loss settling at? Should be around 2.85–3.1 on OpenWebText after full training.
