Kullback-Leibler Divergence in Machine Learning | Generated by AI
Question: What does KL mean here?
Answer:
Here, KL stands for Kullback–Leibler divergence, a way to measure how different two probability distributions are.
1. Intuition first
Think of two models predicting actions:
- Teacher: “I think A is 70%, B is 30%”
- Student: “I think A is 40%, B is 60%”
KL divergence answers:
“How much information do I lose if I use one distribution to approximate the other?”
So it’s a distance-like measure between probability distributions, though not a true distance metric: it isn’t symmetric and doesn’t satisfy the triangle inequality.
2. Formal definition
For two distributions $P$ and $Q$:
$$\mathrm{KL}(P \,\|\, Q) = \sum_x P(x)\log \frac{P(x)}{Q(x)}$$
or, in continuous form:
$$\mathrm{KL}(P \,\|\, Q) = \int P(x)\log \frac{P(x)}{Q(x)} \, dx$$
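As a quick sanity check, here is a minimal NumPy sketch of the discrete formula, using the toy teacher/student numbers from section 1:

```python
import numpy as np

# Teacher and student from section 1 (toy numbers, two actions A and B)
p = np.array([0.7, 0.3])  # teacher: P(A), P(B)
q = np.array([0.4, 0.6])  # student: Q(A), Q(B)

# Discrete KL(P || Q) = sum_x P(x) * log(P(x) / Q(x))
kl_pq = np.sum(p * np.log(p / q))
print(kl_pq)  # ~0.184 nats
```

Using the natural log gives the result in nats; a base-2 log would give bits.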
3. Key properties
(1) Asymmetry matters
$$\mathrm{KL}(P \,\|\, Q) \neq \mathrm{KL}(Q \,\|\, P)$$
This is why people talk about forward KL vs reverse KL.
(2) Always ≥ 0
$$\mathrm{KL}(P \,\|\, Q) \ge 0$$
It equals zero if and only if the two distributions are identical.
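A tiny numeric check makes both properties concrete. This is a sketch assuming NumPy, reusing the toy distributions from above; it also assumes full support, since a zero in $Q$ where $P$ is positive makes the divergence infinite:

```python
import numpy as np

def kl(p, q):
    # KL(p || q) for discrete distributions; assumes q > 0 wherever p > 0
    p, q = np.asarray(p), np.asarray(q)
    return np.sum(p * np.log(p / q))

p = np.array([0.7, 0.3])
q = np.array([0.4, 0.6])

print(kl(p, q))  # ~0.184
print(kl(q, p))  # ~0.192 -> a different value: KL is asymmetric
print(kl(p, p))  # 0.0    -> zero exactly when the distributions match
```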
4. Intuition in ML / RL
In your context (distillation / policies):
- $P$ = teacher policy (what the expert would do)
- $Q$ = student policy (what the model currently does)
Then the KL divergence measures how badly the student fails to match the teacher’s action probabilities.
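To make this concrete, here is a PyTorch sketch that computes KL(teacher ‖ student) for a categorical policy over four actions. The logits are made-up placeholders, not from any particular model:

```python
import torch
import torch.nn.functional as F

# Hypothetical logits over 4 actions for one state; in practice these
# would come from the teacher and student networks.
teacher_logits = torch.tensor([2.0, 1.0, 0.5, -1.0])
student_logits = torch.tensor([1.0, 1.5, 0.0, 0.0])

p = F.softmax(teacher_logits, dim=-1)          # teacher distribution P
log_p = F.log_softmax(teacher_logits, dim=-1)  # log P
log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs, log Q

# KL(P || Q) = sum_x P(x) * (log P(x) - log Q(x))
kl = (p * (log_p - log_q)).sum()
print(kl.item())  # > 0 unless the student exactly matches the teacher
```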
5. Why it matters in distillation
When training models:
- Minimizing **KL(teacher ‖ student)** → the student learns to imitate the teacher’s full behavior distribution
- Minimizing **KL(student ‖ teacher)** → the student avoids actions the teacher considers unlikely (“sharper” behavior); both directions are sketched in code below
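Here is one way these two objectives might look as PyTorch training losses. This is a sketch with made-up batch shapes; note that `F.kl_div` expects the *log*-probabilities of the approximating distribution as its first argument and the target distribution as its second:

```python
import torch
import torch.nn.functional as F

# Hypothetical batch: 8 states, 4 actions; names are illustrative.
teacher_logits = torch.randn(8, 4)                      # frozen, no gradient
student_logits = torch.randn(8, 4, requires_grad=True)  # being trained

log_p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
log_q = F.log_softmax(student_logits, dim=-1)  # student log-probs

# Forward KL(teacher || student): target = teacher, input = student
forward_kl = F.kl_div(log_q, log_p, log_target=True, reduction="batchmean")

# Reverse KL(student || teacher): swap the two roles
reverse_kl = F.kl_div(log_p, log_q, log_target=True, reduction="batchmean")

forward_kl.backward()  # gradients flow into student_logits only
```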
6. Simple analogy
Imagine guessing a friend’s choices:
- If you model them poorly, KL is high (your guesses are “surprising” compared to reality)
- If your model matches well, KL is low
7. One-line summary
KL divergence measures how different two probability distributions are, and in machine learning it is used to quantify how much a student model differs from a teacher model.
If you want, I can also explain why KL shows up everywhere in RLHF, PPO, diffusion models, and variational inference—it’s actually one of the central building blocks in modern ML.