On-Policy Distillation Explained | Generated by AI
Question: What is On-Policy Distillation?
Answer:
On-Policy Distillation is a reinforcement learning technique where a student policy is trained to imitate a teacher policy using data generated by the student’s own current behavior (on-policy data), rather than relying on a fixed offline dataset.
1. Core idea
In standard knowledge distillation, a student model learns from a teacher’s outputs (usually on a fixed dataset).
In on-policy distillation, the key difference is:
- The student interacts with the environment
- The data is generated from the student’s current policy
- The teacher provides guidance (soft targets, actions, or value signals)
- The student learns while continuously improving its own behavior distribution
So the learning always happens on the state distribution the student itself is currently generating.
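As a rough sketch of what "teacher guidance" looks like in practice, the core loss is often a divergence between the two policies evaluated on the student's own samples. The snippet below assumes PyTorch, a discrete action space, and the forward direction KL(teacher ‖ student); the function name and tensor shapes are illustrative, not part of any fixed recipe.

```python
import torch
import torch.nn.functional as F

def on_policy_distill_loss(student_logits, teacher_logits):
    """Distillation loss on a batch of states sampled by the *student*.

    Both tensors are (batch, num_actions) action logits evaluated at the
    same student-visited states. The forward KL(teacher || student) used
    here is one common choice, not the only one.
    """
    student_log_probs = F.log_softmax(student_logits, dim=-1)
    teacher_probs = F.softmax(teacher_logits, dim=-1)
    # F.kl_div expects log-probs as input and probs as target, and computes
    # KL(target || input), i.e. KL(teacher || student) in this arrangement.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean")
```

Some variants minimize the reverse direction, KL(student ‖ teacher), which is mode-seeking and tends to keep the student inside the teacher's high-probability behavior.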
2. Why “on-policy” matters
In reinforcement learning:
- On-policy methods: learn from data generated by the current policy (e.g., PPO)
- Off-policy methods: learn from replay buffers or past policies (e.g., DQN)
On-policy distillation inherits this property:
The student avoids distribution mismatch by always training on trajectories it actually generates.
This often leads to more stable learning in dynamic environments.
3. Typical setup
At each training iteration (sketched in code after this list):
- Student policy interacts with the environment → collects trajectories
- Teacher policy provides:
  - action probabilities (soft labels), or
  - corrected actions, or
  - value estimates
- Student minimizes a loss such as:
  - KL divergence between student and teacher distributions, or
  - cross-entropy on teacher actions
- Update student policy parameters
- Repeat with fresh on-policy rollouts
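Here is a minimal, hypothetical sketch of that loop, assuming PyTorch and a discrete-action Gymnasium environment. The tiny MLP policies, CartPole-v1, and the cross-entropy-on-teacher-actions variant are illustrative choices only; in practice the teacher would be a pretrained or larger model rather than a freshly initialized network.

```python
import gymnasium as gym
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

env = gym.make("CartPole-v1")
obs_dim = env.observation_space.shape[0]
n_actions = env.action_space.n

def make_policy():
    return nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh(), nn.Linear(64, n_actions))

student = make_policy()
teacher = make_policy()  # stand-in for a pretrained / larger teacher policy
optimizer = torch.optim.Adam(student.parameters(), lr=3e-4)

for iteration in range(100):
    # 1. Student collects an on-policy rollout by sampling its own actions.
    visited_states = []
    obs, _ = env.reset()
    done = False
    while not done:
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        with torch.no_grad():
            action = Categorical(logits=student(obs_t)).sample()
        visited_states.append(obs_t)
        obs, _, terminated, truncated, _ = env.step(action.item())
        done = terminated or truncated
    states = torch.stack(visited_states)

    # 2. Teacher relabels exactly the states the student just visited.
    with torch.no_grad():
        teacher_actions = teacher(states).argmax(dim=-1)

    # 3. Student moves toward the teacher's actions on its own state distribution.
    loss = F.cross_entropy(student(states), teacher_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The structurally important part is step 1: every state the loss touches comes from the student's own rollout, so the teacher's corrections always land on the distribution the student currently visits.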
4. Simple intuition
Think of it like:
A learner (student) practices driving in real-time, and a driving instructor (teacher) gives feedback on each move, but only based on what the learner is currently doing—not from a recorded dataset of past learners.
5. Advantages
- Reduces distribution shift (student always trains on its own states)
- More stable than offline distillation in sequential decision tasks
- Works well in large-scale RL training pipelines
- Can improve learning efficiency, since the teacher gives dense feedback on every state the student visits
6. Limitations
- More expensive (must generate fresh rollouts continuously)
- Teacher must be available during training
- Can still suffer if teacher is imperfect or biased
- Slower than pure offline distillation in some settings
7. Where it is used
- Large-scale reinforcement learning systems
- Policy compression (big model → smaller model)
- Robotics control learning
- Game-playing agents
- LLM alignment / agent training pipelines (in some variants)
8. One-line summary
On-policy distillation is training a student policy using a teacher’s guidance, while the training data is continuously generated from the student’s own current interactions with the environment.
If you want, I can also compare it directly with offline distillation, imitation learning, and PPO-style RLHF, which is where the differences become very clear.