Forward vs Reverse KL Distillation | Generated by AI

2026.05

Question: What does the comment “this is different from previous distillation, reverse KL” mean?

Answer:

Your friend is pointing out a technical difference in how the student learns from the teacher, specifically the direction of KL divergence used in distillation.

Let’s break it down simply.


1. What “distillation” usually means

In standard knowledge distillation, the student tries to match the teacher’s probability distribution:

\[ \mathrm{KL}\big(P_{\text{teacher}} \,\big\|\, P_{\text{student}}\big) \]

This is called forward KL (or KL teacher → student).

Intuition:

- The expectation is taken under the teacher's distribution.
- Wherever the teacher assigns probability, the student pays a large penalty for assigning near-zero probability there.
- So it encourages coverage of teacher behavior (mode-covering, sometimes called "mean-seeking").
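As a toy illustration (plain Python over hypothetical discrete distributions, not code from the original method), forward KL prefers a student that covers all of the teacher's modes over one that collapses onto a single mode:

```python
import math

def forward_kl(p_teacher, q_student, eps=1e-12):
    # KL(P_teacher || P_student): the expectation is under the TEACHER.
    # Each term is p * log(p / q); a near-zero q where p > 0 is heavily
    # penalized, which forces the student to "cover" everything the teacher does.
    return sum(p * math.log((p + eps) / (q + eps))
               for p, q in zip(p_teacher, q_student))

teacher   = [0.5, 0.5, 0.0]    # teacher spreads mass over two modes
covering  = [0.4, 0.4, 0.2]    # student covers both teacher modes
collapsed = [0.98, 0.01, 0.01] # student collapses onto one mode

print(forward_kl(teacher, covering) < forward_kl(teacher, collapsed))  # True
```

Under forward KL, the covering student scores better even though it also leaks some mass onto an outcome the teacher never produces.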


2. What “reverse KL” means

Reverse KL flips the direction:

\[ \mathrm{KL}\big(P_{\text{student}} \,\big\|\, P_{\text{teacher}}\big) \]

Now:

- The expectation is taken under the student's distribution.
- The student pays a large penalty for placing probability where the teacher assigns near-zero probability.

Intuition:

- The student avoids regions the teacher disallows, but is free to ignore some teacher modes entirely.
- This makes reverse KL mode-seeking: it concentrates on high-teacher-probability behavior rather than covering everything.
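The mirror-image toy illustration (again plain Python with hypothetical distributions): under reverse KL, a student that collapses onto one teacher-approved mode beats a student that spreads mass onto an outcome the teacher disallows:

```python
import math

def reverse_kl(q_student, p_teacher, eps=1e-12):
    # KL(P_student || P_teacher): the expectation is under the STUDENT.
    # Mass the student places where the teacher is near zero is heavily
    # penalized, so concentrating on a clearly teacher-approved mode is safest.
    return sum(q * math.log((q + eps) / (p + eps))
               for q, p in zip(q_student, p_teacher))

teacher   = [0.5, 0.5, 0.0]      # teacher allows outcomes 0 and 1 only
mode_seek = [0.99, 0.005, 0.005] # student concentrates on one allowed mode
spread    = [0.34, 0.33, 0.33]   # student leaks mass onto the disallowed bin

print(reverse_kl(mode_seek, teacher) < reverse_kl(spread, teacher))  # True
```

This is the opposite preference from the forward-KL case: sharpness beats coverage.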


3. Why your friend said “this is different”

They are likely noticing that:

- The method described (e.g. on-policy distillation or RL-style training) uses reverse KL instead of the usual forward KL distillation.
- The training signal is computed on the student's own samples, not on teacher (or dataset) samples.

So instead of:

- "match the teacher's distribution on the teacher's samples" (minimize KL(teacher ‖ student))

It becomes:

- "generate with the student, then penalize deviation from the teacher" (minimize KL(student ‖ teacher))
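A minimal sketch of what "on-policy" means here, with a hypothetical Monte Carlo estimator over toy discrete distributions (function and variable names are illustrative, not from the original post): sampling from the student and averaging log q − log p is exactly an estimate of reverse KL, the quantity an on-policy distillation loss minimizes.

```python
import math
import random

def on_policy_reverse_kl_estimate(sample_student, logp_student, logp_teacher,
                                  n=50_000, seed=0):
    # Monte Carlo estimate of KL(student || teacher): draw from the STUDENT's
    # own distribution (on-policy) and average log q(x) - log p(x).
    # Forward KL would instead require samples from the teacher.
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = sample_student(rng)
        total += logp_student(x) - logp_teacher(x)
    return total / n

# Toy discrete distributions over outcomes {0, 1, 2}
student = [0.7, 0.2, 0.1]
teacher = [0.5, 0.4, 0.1]

est = on_policy_reverse_kl_estimate(
    sample_student=lambda rng: rng.choices(range(3), weights=student)[0],
    logp_student=lambda x: math.log(student[x]),
    logp_teacher=lambda x: math.log(teacher[x]),
)
exact = sum(q * math.log(q / p) for q, p in zip(student, teacher))
print(abs(est - exact) < 0.02)  # the estimate matches the closed form
```

In real distillation, `logp_teacher` would come from running the teacher model on the student's sampled tokens; here both are just lookup tables.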


4. Key behavioral difference

| Type | Effect | Behavior |
| --- | --- | --- |
| Forward KL (teacher → student) | "cover everything the teacher likes" | more diverse, safer imitation |
| Reverse KL (student → teacher) | "avoid teacher-disallowed actions" | sharper, more aggressive, mode-seeking |

5. Why RL / on-policy methods often use reverse KL

In reinforcement learning / policy optimization:

- You can only evaluate log-probabilities on trajectories the policy itself generates.
- That makes KL(student ‖ teacher), whose expectation is under the student, the quantity you can actually compute on-policy.

Reverse KL naturally appears in:

- KL-regularized RL objectives (maximize reward minus a KL penalty to a reference policy)
- PPO-style training with a KL penalty or KL-based early stopping
- RLHF, where the policy is kept close to a frozen reference model

So it's less about "copying the teacher perfectly" and more about:

“stay close, but optimize your own trajectory”
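Concretely, the standard KL-regularized objective these methods optimize can be written as (with \(\beta\) the KL penalty coefficient and \(\pi_{\mathrm{ref}}\) the reference/teacher policy):

\[
\max_{\theta}\;
\mathbb{E}_{x \sim \pi_{\theta}}\!\left[ r(x) \right]
\;-\; \beta\, \mathrm{KL}\!\left( \pi_{\theta} \,\big\|\, \pi_{\mathrm{ref}} \right)
\]

Note that the learned policy \(\pi_{\theta}\) appears first in the KL term: the expectation runs over the policy's own samples, i.e. this is reverse KL relative to the reference.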


6. Simple intuition

- Forward KL: "be everywhere the teacher is" (cover all of its modes).
- Reverse KL: "only go where the teacher clearly approves" (pick a mode and stay sharp).
7. One-line summary

Your friend means:

This method is not classic imitation distillation (teacher → student forward KL), but a reinforcement-learning style training where the loss uses reverse KL (student → teacher), leading to different behavior: more mode-seeking and policy-optimization-like dynamics.


If you want, I can show you how this connects directly to PPO, KL-regularized RL, and why RLHF uses reverse KL implicitly; that's where this becomes very intuitive.

