On-Policy Distillation Explained | Generated by AI


Question: What is On-Policy Distillation?

Answer:

On-Policy Distillation is a reinforcement learning technique where a student policy is trained to imitate a teacher policy using data generated by the student’s own current behavior (on-policy data), rather than relying on a fixed offline dataset.


1. Core idea

In standard knowledge distillation, a student model learns from a teacher’s outputs (usually on a fixed dataset).

In on-policy distillation, the key difference is:

  • the training inputs are generated by the student's own current policy as it acts (its rollouts), and
  • the teacher's outputs are used as targets on exactly those inputs.

So the learning happens on the same distribution the model is currently experiencing.
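
One way to make the difference concrete is to compare where the inputs in the loss come from. As a schematic (informal notation, with \ell standing for whatever per-step distillation loss is used):

    \text{offline distillation:} \quad \min_\theta \; \mathbb{E}_{x \sim \mathcal{D}_{\text{fixed}}} \; \ell\big(\text{student}_\theta(x),\, \text{teacher}(x)\big)

    \text{on-policy distillation:} \quad \min_\theta \; \mathbb{E}_{x \sim \text{student's own rollouts}} \; \ell\big(\text{student}_\theta(x),\, \text{teacher}(x)\big)

The loss term is the same; only the distribution over inputs x changes.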


2. Why “on-policy” matters

In reinforcement learning:

  • on-policy methods learn from data generated by the policy currently being trained, while
  • off-policy methods learn from data generated by some other policy (for example an older snapshot or a fixed dataset).

On-policy distillation inherits this property: the student avoids distribution mismatch by always training on trajectories it actually generates.

This often leads to more stable learning in dynamic environments.
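
Written out, one common form of the objective (the exact loss and the direction of the KL vary between implementations, so treat this as a representative choice rather than the definition) is:

    \mathcal{L}(\theta) \;=\; \mathbb{E}_{s \sim d^{\pi_\theta}} \Big[\, D_{\mathrm{KL}}\big( \pi_{\text{teacher}}(\cdot \mid s) \;\|\; \pi_\theta(\cdot \mid s) \big) \Big]

where \pi_\theta is the student, \pi_{\text{teacher}} is the teacher, and d^{\pi_\theta} is the distribution of states visited by the current student. Because the expectation is taken under the student's own state distribution, the divergence is minimized exactly where the student will actually act.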


3. Typical setup

At each training iteration:

  1. The student policy interacts with the environment → collects trajectories
  2. The teacher policy provides:
    • action probabilities (soft labels), or
    • corrected actions, or
    • value estimates
  3. The student minimizes a loss such as:
    • the KL divergence between the student and teacher distributions, or
    • cross-entropy on the teacher's actions
  4. Update the student policy parameters
  5. Repeat with new on-policy rollouts
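
This loop can be made concrete in a few dozen lines. The sketch below is illustrative rather than a reference implementation: it assumes a toy random-walk environment, two small PyTorch networks standing in for the teacher and the student, and a KL(teacher ∥ student) distillation loss on the student's own rollouts; every class and name in it is hypothetical.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    OBS_DIM, N_ACTIONS = 4, 3

    class Policy(nn.Module):
        """A small policy network mapping an observation to action logits."""
        def __init__(self):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(OBS_DIM, 32), nn.Tanh(), nn.Linear(32, N_ACTIONS)
            )

        def forward(self, obs):
            return self.net(obs)  # unnormalized action logits

    class ToyEnv:
        """Hypothetical random-walk environment, a stand-in for the real task."""
        def reset(self):
            self.state = torch.zeros(OBS_DIM)
            return self.state

        def step(self, action):
            self.state = self.state + 0.1 * torch.randn(OBS_DIM)
            done = bool(torch.rand(()) < 0.05)  # episodes end at random
            return self.state, 0.0, done

    teacher, student = Policy(), Policy()
    teacher.requires_grad_(False)  # the teacher stays fixed
    optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
    env = ToyEnv()

    for iteration in range(100):
        # 1. Collect a rollout with the *current* student policy (on-policy data).
        obs_buf, obs, done = [], env.reset(), False
        while not done and len(obs_buf) < 64:
            obs_buf.append(obs)
            with torch.no_grad():
                dist = torch.distributions.Categorical(logits=student(obs.unsqueeze(0)))
                action = dist.sample()
            obs, _, done = env.step(action.item())
        batch = torch.stack(obs_buf)  # states the student actually visited

        # 2. The teacher provides soft labels (action log-probabilities) on those states.
        with torch.no_grad():
            teacher_logp = F.log_softmax(teacher(batch), dim=-1)

        # 3. The student minimizes KL(teacher || student) on its own rollout.
        student_logp = F.log_softmax(student(batch), dim=-1)
        loss = F.kl_div(student_logp, teacher_logp, log_target=True,
                        reduction="batchmean")

        # 4. Update the student, then 5. repeat with fresh on-policy rollouts.
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

In a real system the toy environment would be replaced by the actual task (or, for language models, by sampling continuations from the student), and rollout collection would normally be batched, but the shape of each iteration stays the same: roll out with the student, label with the teacher, minimize the divergence, repeat.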

4. Simple intuition

Think of it like:

A learner (student) practices driving in real time, and a driving instructor (teacher) gives feedback on each move, but only based on what the learner is currently doing, not on a recorded dataset of past learners.


5. Advantages

  • No train/deploy distribution mismatch: the teacher's feedback is applied to exactly the states the student visits, so errors do not compound on unfamiliar states the way they can with behavior cloning on a fixed dataset.
  • Dense supervision: the teacher provides a learning signal at every step, which is usually richer than a sparse environment reward.
  • The training data adapts automatically as the student improves, since each iteration uses the student's current behavior.

6. Limitations

  • The teacher must be available and queried throughout training, which adds compute cost compared with distilling from a pre-collected dataset.
  • The student's performance is effectively bounded by the quality of the teacher's guidance.
  • Fresh on-policy rollouts are needed at every iteration, so interaction or generation cost cannot be amortized the way it can with offline data.

7. Where it is used

  • Policy distillation in deep reinforcement learning, e.g. compressing a large trained policy into a smaller one or transferring behavior between agents.
  • Interactive imitation learning, where an expert is queried on the states the learner itself reaches (DAgger-style training follows the same idea).
  • Distilling large language models into smaller ones, where the student generates text and the teacher supplies per-token probabilities on those generations.

8. One-line summary

On-policy distillation is training a student policy using a teacher’s guidance, while the training data is continuously generated from the student’s own current interactions with the environment.


If you want, I can also compare it directly with offline distillation, imitation learning, and PPO-style RLHF, which is where the differences become very clear.

