Group Relative Policy Optimization Overview
Question: What is GRPO in machine learning?
Answer:
Group Relative Policy Optimization (GRPO) is a reinforcement learning (RL) algorithm used primarily to fine-tune large language models (LLMs), improving their reasoning on tasks such as mathematics, coding, and complex problem-solving.
Key Features and How It Works
- GRPO is a variant of Proximal Policy Optimization (PPO), a popular RL algorithm.
- The main innovation in GRPO is that it eliminates the separate “critic” (value function) model that PPO requires to estimate expected future rewards. Because the critic is typically comparable in size to the policy model, dropping it substantially cuts memory usage and compute (roughly half the training-time footprint in common setups).
- Instead of a critic, GRPO uses group sampling: for each input prompt, the model generates multiple candidate responses (e.g., 64 per prompt in the DeepSeekMath setup). A reward model, or a verifiable reward function, scores each response.
- The advantage (how much better or worse a response is than its siblings) is computed relative to the group: subtract the group’s mean reward and divide by the group’s standard deviation (see the formula after this list). Responses above the group average get positive advantages (encouraged); those below get negative advantages (discouraged).
- The policy (the LLM itself) is then updated to increase the probability of high-advantage responses, with PPO-style ratio clipping and a KL-divergence penalty toward a reference model keeping training stable (see the code sketch after this list).
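In symbols, for a group of G sampled responses with rewards r_1, …, r_G, the commonly cited group-relative advantage is each reward’s z-score within its group (a standard rendering of the description above; the notation here is ours, not quoted from the paper):

```latex
A_i = \frac{r_i - \operatorname{mean}(r_1, \ldots, r_G)}{\operatorname{std}(r_1, \ldots, r_G)}
```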
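As a rough illustration of the update step, here is a minimal, simplified GRPO-style loss in PyTorch. This is a sketch under stated assumptions, not DeepSeek’s implementation: it assumes one log-probability per sampled response (real implementations sum over tokens), the function name and default coefficients are illustrative, and the KL term uses a standard unbiased per-sample estimator.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, advantages,
              clip_eps=0.2, kl_beta=0.04):
    """Simplified GRPO objective for one group of responses.

    All inputs are 1-D tensors with one entry per sampled response
    (a simplification: real implementations work token by token).
    """
    ratio = torch.exp(logp_new - logp_old)                    # importance ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * advantages
    surrogate = torch.min(unclipped, clipped)                 # PPO-style clipped term
    # Unbiased per-sample KL estimate toward the frozen reference policy.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(surrogate - kl_beta * kl).mean()                 # negate: optimizers minimize
```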
Advantages Over Traditional Methods
- More memory-efficient than PPO, since there is no critic to train or store, which makes it practical for large-scale LLM training.
- Works well with verifiable rewards (e.g., a binary correct/incorrect check on math answers), reducing reliance on human preference labels or learned reward models; a sketch of such a reward follows this list.
- With verifiable rewards, it is less susceptible to reward hacking than a learned reward model, and it has been reported to improve generalization on reasoning tasks.
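To make the verifiable-reward idea concrete, here is a small, self-contained Python sketch. The exact-match check and function names are hypothetical illustrations, but the advantage computation follows the group normalization described above:

```python
import statistics

def verifiable_reward(response: str, reference_answer: str) -> float:
    """Binary reward: 1.0 if the response matches the reference exactly,
    else 0.0 (a deliberately simple, hypothetical check)."""
    return 1.0 if response.strip() == reference_answer.strip() else 0.0

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the group's mean and standard deviation;
    eps guards against division by zero when all rewards are identical."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Toy example: four sampled responses to one prompt, scored against "42".
responses = ["42", "forty-two", "42", "24"]
rewards = [verifiable_reward(r, "42") for r in responses]  # [1.0, 0.0, 1.0, 0.0]
print(group_relative_advantages(rewards))                  # ≈ [+1.0, -1.0, +1.0, -1.0]
```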
Origins and Applications
- First introduced in the DeepSeekMath paper (2024) for enhancing mathematical reasoning in open-source LLMs.
- Gained prominence with the DeepSeek-R1 models, where it was used for reinforcement fine-tuning (alternated with supervised fine-tuning stages) to reach state-of-the-art results on reasoning benchmarks.
GRPO represents an efficient evolution of RLHF-style (Reinforcement Learning from Human Feedback) post-training, especially for domains with objective, automatically checkable evaluation criteria.
References:
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models (Shao et al., 2024, arXiv)
- DataCamp: What is GRPO?
- Oxen.ai: Why GRPO is Important
- Medium: The Math Behind DeepSeek GRPO
- Hugging Face Blog: PPO & GRPO Explanation