On-policy reinforcement learning explained

Generated by AI

2026.05

Question: In RL, what does “on-policy” mean — specifically that you collect data immediately from the current policy, not from stored/offline data?

Answer:

Yes, exactly. Here’s the precise definition:

On-policy means the data you train on was generated by the same policy you’re currently updating. In the strict form, each update consumes fresh rollouts from the current π_θ, and those samples are discarded once the update is done.

Off-policy means you can train on data generated by any policy — including old versions of yourself, or a completely different behavior policy. This is what replay buffers (DQN, SAC) enable.
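To make the difference concrete, here is a minimal sketch in plain Python. Everything in it is hypothetical and for illustration only: `rollout`, the `version` counter standing in for θ, and the random rewards are assumptions, not part of any library. It counts how many training samples came from an older policy version under each scheme.

```python
import random
from collections import deque


def rollout(policy_version, n, rng):
    """Hypothetical rollout: tag each sample with the policy version that produced it."""
    return [{"version": policy_version, "reward": rng.random()} for _ in range(n)]


def on_policy_training(num_updates, batch_size, rng):
    """Each update consumes a fresh batch from the current policy, then drops it."""
    version = 0
    stale_samples_used = 0
    for _ in range(num_updates):
        batch = rollout(version, batch_size, rng)  # fresh data from the current policy
        stale_samples_used += sum(s["version"] != version for s in batch)
        version += 1  # stands in for one parameter update; the batch is not kept
    return stale_samples_used


def off_policy_training(num_updates, batch_size, rng, capacity=1000):
    """Each update samples from a replay buffer that mixes old and new data."""
    buffer = deque(maxlen=capacity)
    version = 0
    stale_samples_used = 0
    for _ in range(num_updates):
        buffer.extend(rollout(version, batch_size, rng))  # store, do not discard
        batch = rng.sample(list(buffer), batch_size)      # may include old versions
        stale_samples_used += sum(s["version"] != version for s in batch)
        version += 1
    return stale_samples_used


rng = random.Random(0)
on_stale = on_policy_training(num_updates=20, batch_size=32, rng=rng)
off_stale = off_policy_training(num_updates=20, batch_size=32, rng=rng)
print("stale samples, on-policy loop: ", on_stale)
print("stale samples, replay buffer:  ", off_stale)
```

In the on-policy loop the stale count is zero by construction. In the replay-buffer loop it turns positive as soon as the buffer holds data from more than one policy version, which is the situation off-policy algorithms are designed to handle.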

Why it matters mechanically

The core RL objective is:

J(θ) = E_{τ ~ π_θ} [R(τ)]

The expectation is under π_θ. When you estimate the gradient of this objective from samples, those samples must come from π_θ for the estimate to be unbiased. If you reuse old samples from π_{θ_old} (written π_old from here on), you’ve introduced a distribution mismatch, and the gradient estimate is biased unless you correct it with importance sampling:

∇J(θ) ≈ E_{(s,a) ~ π_old} [ (π_θ(a|s) / π_old(a|s)) · ∇log π_θ(a|s) · R ]
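A small worked example may help here. The sketch below assumes a toy two-action bandit with a sigmoid policy; the names `policy_probs`, `grad_log_prob`, `expected_gradient`, and the reward table are all made up for illustration. It computes the expectations exactly by enumerating both actions, so no sampling noise is involved.

```python
import math

REWARD = [0.0, 1.0]  # hypothetical rewards for action 0 and action 1


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def policy_probs(theta):
    """Two-action policy: P(a=1) = sigmoid(theta)."""
    p1 = sigmoid(theta)
    return [1.0 - p1, p1]


def grad_log_prob(theta, action):
    """d/dtheta of log pi_theta(action) for the sigmoid policy."""
    p1 = sigmoid(theta)
    return (1.0 - p1) if action == 1 else -p1


def expected_gradient(theta, behavior_theta, use_importance_weights):
    """Exact expectation of the gradient estimator when actions are drawn
    from the behavior policy (parameter behavior_theta)."""
    pi = policy_probs(theta)
    mu = policy_probs(behavior_theta)
    total = 0.0
    for a in (0, 1):
        weight = pi[a] / mu[a] if use_importance_weights else 1.0
        total += mu[a] * weight * grad_log_prob(theta, a) * REWARD[a]
    return total


theta = math.log(4.0)  # current policy: P(a=1) = 0.8
theta_old = 0.0        # old policy:     P(a=1) = 0.5

true_grad = expected_gradient(theta, theta, use_importance_weights=False)
naive = expected_gradient(theta, theta_old, use_importance_weights=False)
corrected = expected_gradient(theta, theta_old, use_importance_weights=True)
print(f"on-policy gradient:        {true_grad:.3f}")
print(f"old samples, uncorrected:  {naive:.3f}")
print(f"old samples, IS-corrected: {corrected:.3f}")
```

In this toy setup the true gradient is 0.8 · 0.2 = 0.16. Reusing old samples without the ratio gives 0.10, and multiplying by π_θ / π_old recovers 0.16.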

PPO does exactly this — it’s technically on-policy with a small off-policy tolerance window, clipping the IS ratio so you don’t stray too far:

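import torch
# Assumes new_log_probs, old_log_probs, advantage are same-shape tensors; eps is the clip range (e.g. 0.2)
ratio = (new_log_probs - old_log_probs).exp()  # IS ratio π_θ(a|s) / π_old(a|s)
clipped = torch.clamp(ratio, 1 - eps, 1 + eps)  # keep the ratio inside [1 - eps, 1 + eps]
# Take the pessimistic branch per sample, then negate because optimizers minimize
loss = -torch.min(ratio * advantage, clipped * advantage).mean()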
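To see what the clipping actually does, here is a per-sample version of the same objective in plain Python (no tensors, so it runs without PyTorch). The function name `clipped_surrogate` and the four test cases are illustrative assumptions; the value returned is the quantity being maximized, i.e. the loss above without the minus sign and the mean.

```python
def clipped_surrogate(ratio, advantage, eps=0.2):
    """Per-sample PPO clipped objective (to be maximized)."""
    clipped_ratio = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped_ratio * advantage)


cases = [
    (1.5, 1.0),   # good action, ratio already grew past 1 + eps: gain is capped
    (0.5, -1.0),  # bad action, ratio already fell below 1 - eps: gain is capped
    (1.5, -1.0),  # bad action made MORE likely: full penalty, no clipping
    (1.0, 2.0),   # ratio of 1 (fresh on-policy sample): plain policy-gradient term
]
results = [clipped_surrogate(r, a) for r, a in cases]
for (r, a), value in zip(cases, results):
    print(f"ratio={r:.1f} advantage={a:+.1f} -> objective={value:+.2f}")
```

The first two cases land on the clipped branch, which is constant in the ratio, so those samples stop pushing the policy further from π_old. The third case shows the asymmetry of the `min`: a move in the wrong direction is never clipped away.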

In the LLM RLHF context (GRPO/PPO)

In GRPO (what DeepSeek-R1 uses), you:

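1. Sample a batch of prompts
2. Generate G completions per prompt from the current π_θ (this is the on-policy rollout)
3. Score them with a reward model or a rule-based reward function
4. Compute group-relative advantages (each reward compared against the other completions for the same prompt), then update θ
5. Throw away those completions; the next step needs fresh ones from the updated θ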
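The advantage step is where GRPO gets by without a critic: each completion is compared with the other completions for the same prompt. A minimal sketch, assuming the common formulation that subtracts the group mean and divides by the group standard deviation; the function name, the `eps` guard, and the choice of population standard deviation are illustrative assumptions, and real implementations may differ in those details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of G completions sampled for ONE prompt.

    The group mean plays the role of the baseline that a learned critic
    would otherwise provide.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical scores for G = 4 completions of a single prompt
# (e.g. 1.0 = correct final answer, 0.0 = incorrect).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
print([round(a, 3) for a in advantages])

# If every completion in the group scores the same, there is nothing to prefer.
flat = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(flat)
```

Completions that beat their group average get a positive advantage and the rest get a negative one. A group whose completions all score the same yields all-zero advantages and so contributes no gradient signal for that prompt.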

This is expensive but necessary for correctness. It’s why RLHF training is slow — you’re doing generation (autoregressive, sequential) inside the training loop.

REINFORCE / vanilla PG = strictly on-policy, one update per rollout batch
PPO = on-policy with a small reuse window (few epochs over same batch, clipped)
DPO = technically off-policy (trains on pre-collected preference pairs, no live rollouts)
GRPO = on-policy like PPO, but without a separate critic network

So yes — on-policy = “I must generate fresh data from my current self before each update.” The moment you store data in a replay buffer and sample it later, you’ve gone off-policy.
