On-policy reinforcement learning explained | Generated by AI
Question: In RL, what does “on-policy” mean — specifically that you collect data immediately from the current policy, not from stored/offline data?
Answer:
Yes, exactly. Here’s the precise definition:
On-policy means the data you train on was generated by the same policy you’re currently updating. Every gradient step consumes fresh rollouts from the current π_θ, and those samples are discarded after the update.
Off-policy means you can train on data generated by any policy — including old versions of yourself, or a completely different behavior policy. This is what replay buffers (DQN, SAC) enable.
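To make the distinction concrete, here is a minimal sketch of the two data flows. The `collect_rollout` helper and the `policy_version` tag are hypothetical stand-ins for real environment interaction and parameter updates; no actual learning happens here.

```python
import random
from collections import deque

def collect_rollout(policy_version, n=4):
    """Hypothetical stand-in for a rollout: each sample records which
    policy version generated it."""
    return [{"policy_version": policy_version, "reward": random.random()}
            for _ in range(n)]

def on_policy_training(num_updates):
    """Every update uses only data produced by the current policy version."""
    used = []
    for version in range(num_updates):
        batch = collect_rollout(version)  # fresh data from the current policy
        used.append((version, [s["policy_version"] for s in batch]))
        # A gradient step would go here; the batch is not kept afterwards.
    return used

def off_policy_training(num_updates, buffer_size=8, batch_size=4):
    """Updates draw from a replay buffer that can hold older policy versions."""
    buffer = deque(maxlen=buffer_size)
    used = []
    for version in range(num_updates):
        buffer.extend(collect_rollout(version))          # store new data
        batch = random.sample(list(buffer), batch_size)  # may include stale data
        used.append((version, [s["policy_version"] for s in batch]))
    return used

if __name__ == "__main__":
    print(on_policy_training(3))
    print(off_policy_training(3))
```

In the on-policy loop, every sample's version matches the update it feeds. In the off-policy loop, a batch can mix samples from the current version with older ones.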
Why it matters mechanically
The core RL objective is:
J(θ) = E_{τ ~ π_θ} [R(τ)]
The expectation is under π_θ. When you estimate the gradient ∇J(θ) from samples, those samples must come from π_θ for the estimate to be unbiased. If you reuse old samples from π_{θ_old}, you’ve introduced a distribution mismatch: the gradient estimate is biased unless you correct it with importance sampling:
∇J(θ) ≈ E_{(s,a) ~ π_old} [ (π_θ(a|s) / π_old(a|s)) · ∇log π_θ(a|s) · R ]
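The effect of the importance weight can be checked numerically on a toy problem. The sketch below assumes a three-action bandit with made-up rewards and two fixed policies. For simplicity it estimates the expected reward rather than the gradient, but the same reweighting applies to the gradient term.

```python
import random

rewards = [1.0, 0.0, 2.0]  # R(a) for three actions (illustrative values)
pi_old = [0.5, 0.3, 0.2]   # behavior policy that generated the samples
pi_new = [0.2, 0.3, 0.5]   # current policy whose expected reward we want

def exact_value(policy):
    """Exact expected reward under a policy, by enumerating actions."""
    return sum(p * r for p, r in zip(policy, rewards))

def estimate(samples, use_is):
    """Monte Carlo estimate from samples drawn under pi_old,
    optionally reweighted by pi_new(a) / pi_old(a)."""
    total = 0.0
    for a in samples:
        weight = pi_new[a] / pi_old[a] if use_is else 1.0
        total += weight * rewards[a]
    return total / len(samples)

rng = random.Random(0)
samples = rng.choices(range(3), weights=pi_old, k=100_000)

naive = estimate(samples, use_is=False)      # approaches exact_value(pi_old)
corrected = estimate(samples, use_is=True)   # approaches exact_value(pi_new)

if __name__ == "__main__":
    print("target (pi_new):", exact_value(pi_new))
    print("naive estimate: ", naive)
    print("IS-corrected:   ", corrected)
```

Without the weight, the estimate converges to the value of the old policy. With it, the estimate converges to the value of the current policy, at the cost of higher variance when the two policies differ a lot.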
PPO builds on this correction. It’s technically on-policy with a small off-policy tolerance window: it clips the importance-sampling (IS) ratio so that the updated policy doesn’t stray too far from the one that collected the data:
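import torch
# Probability ratio pi_theta / pi_old, computed from log-probabilities
ratio = (new_log_probs - old_log_probs).exp()
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
# Elementwise pessimistic surrogate, negated because optimizers minimize
loss = -torch.min(ratio * advantage, clipped * advantage).mean()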
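A few hand-checkable cases show what the clipping does to a single sample. This is a plain-Python sketch of the same per-sample objective with an assumed `eps` of 0.2, not PPO's full loss.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO objective (to be maximized): min(r * A, clip(r) * A)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)

cases = [
    (1.5, +1.0),  # ratio above the window, good action: gain is capped
    (1.5, -1.0),  # ratio above the window, bad action: penalty is not capped
    (0.5, +1.0),  # ratio below the window, good action: not capped
    (0.5, -1.0),  # ratio below the window, bad action: penalty is capped
]

if __name__ == "__main__":
    for ratio, advantage in cases:
        print(ratio, advantage, clipped_surrogate(ratio, advantage))
```

The `min` makes the objective pessimistic: clipping only removes the incentive to push the ratio further in the direction that already improved the objective, which is what limits how far one batch can move the policy.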
In the LLM RLHF context (GRPO/PPO)
In GRPO (what DeepSeek-R1 uses), you:
- Sample a batch of prompts
- Generate G completions from current π_θ — this is the on-policy rollout
- Score them with a reward model (or a rule-based reward function)
- Compute group-relative advantages, then update θ
- Throw away those completions — next step needs fresh ones from updated θ
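The advantage step in the list above can be sketched as follows: each completion's reward is normalized against the other completions sampled for the same prompt, which is what lets GRPO drop the critic. This is a minimal sketch; the function name, the use of the population standard deviation, and the `eps` term are assumptions, and real implementations differ in such details.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """Normalize rewards within one prompt's group of G completions.

    Assumption: population std plus a small eps to avoid division by zero.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

if __name__ == "__main__":
    # Hypothetical rewards for G = 4 completions of the same prompt.
    rewards = [1.0, 0.0, 0.0, 1.0]
    print(group_relative_advantages(rewards))
    # If every completion gets the same reward, all advantages are zero
    # and that prompt contributes no learning signal.
    print(group_relative_advantages([2.0, 2.0, 2.0]))
```

Because the baseline is the group mean, the advantages only carry information about which completions were better than their siblings, not about the absolute reward level.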
This is expensive, but it keeps the gradient estimate consistent with the current policy. It’s a major reason RLHF training is slow: generation is autoregressive and sequential, and it sits inside the training loop.
- REINFORCE / vanilla PG = strictly on-policy, one update per rollout batch
- PPO = on-policy with a small reuse window (few epochs over same batch, clipped)
- DPO = technically off-policy (trains on pre-collected preference pairs, no live rollouts)
- GRPO = on-policy like PPO, but without a separate critic network
So yes — on-policy = “I must generate fresh data from my current self before each update.” The moment you store data in a replay buffer and sample it later, you’ve gone off-policy.