Kullback-Leibler Divergence in Machine Learning | Generated by AI

2026.05

Question: What does KL mean here?

Answer:

Here, KL stands for Kullback–Leibler divergence, a way to measure how different two probability distributions are.


1. Intuition first

Think of two models predicting actions:

- a *teacher* model that outputs a probability distribution P over actions
- a *student* model that outputs a distribution Q, trying to imitate the teacher

KL divergence answers:

“How much information do I lose if I use one distribution to approximate the other?”

So it’s a distance-like measure between probability distributions, though not a true metric: it is not symmetric and does not satisfy the triangle inequality.


2. Formal definition

For two distributions (P) and (Q):

\[ \mathrm{KL}(P \,\|\, Q) = \sum_x P(x)\log \frac{P(x)}{Q(x)} \]

or in continuous form:

\[ \mathrm{KL}(P \,\|\, Q) = \int P(x)\log \frac{P(x)}{Q(x)} \, dx \]
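The discrete sum above translates directly into code. Here is a minimal sketch in plain NumPy; the function name `kl_divergence` and the example vectors are illustrative, not from the original:

```python
import numpy as np

def kl_divergence(p, q):
    """Discrete KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)).

    Assumes p and q are valid probability vectors over the same
    support, with q(x) > 0 wherever p(x) > 0.
    """
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with P(x) = 0 contribute 0 by convention
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
print(kl_divergence(p, q))  # small positive number: P and Q are close
```

Note the natural log is used here, so the result is in nats; using `np.log2` instead would give bits.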


3. Key properties

(1) Asymmetry matters

\[ \mathrm{KL}(P \,\|\, Q) \neq \mathrm{KL}(Q \,\|\, P) \]

This is why people talk about forward KL vs reverse KL.


(2) Always ≥ 0

\[ \mathrm{KL}(P \,\|\, Q) \ge 0 \]

It is zero only when the two distributions are identical.
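Both properties are easy to check numerically. A small sketch (the distributions here are made up for illustration):

```python
import numpy as np

def kl(p, q):
    # Discrete KL(P || Q), natural log
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = p > 0
    return float(np.sum(p[m] * np.log(p[m] / q[m])))

p = [0.9, 0.1]
q = [0.5, 0.5]
print(kl(p, q))  # forward KL
print(kl(q, p))  # reverse KL: a different value (asymmetry)
print(kl(p, p))  # identical distributions: 0.0
```

Swapping the arguments changes which distribution's high-probability regions dominate the sum, which is exactly why forward and reverse KL behave differently in practice.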


4. Intuition in ML / RL

In your context (distillation / policies):

- P = the teacher policy's distribution over actions
- Q = the student policy's distribution over actions

Then KL(P ‖ Q) measures:

how badly the student fails to match the teacher’s action probabilities


5. Why it matters in distillation

When training models:

- the KL between the teacher's and student's output distributions is used as (part of) the training loss
- minimizing it pushes the student's probabilities toward the teacher's, so the student inherits the teacher's behavior


6. Simple analogy

Imagine guessing a friend’s choices:

- you carry a mental model (Q) of how your friend actually behaves (P)
- the better your model matches reality, the less often you are surprised
- KL divergence is the average extra “surprise” you pay for using Q instead of P


7. One-line summary

KL divergence measures how different two probability distributions are, and in machine learning it is used to quantify how much a student model differs from a teacher model.


If you want, I can also explain why KL shows up everywhere in RLHF, PPO, diffusion models, and variational inference—it’s actually one of the central building blocks in modern ML.

