It Takes Two: Your GRPO Is Secretly DPO
TL;DR Summary
By reframing GRPO as contrastive learning and uncovering its link to DPO, this paper challenges the necessity of large group sizes for LLM training. Theoretical analysis and empirical results show that "2-GRPO" performs comparably to "16-GRPO" while generating 8x fewer rollouts and cutting training time by over 70%.
Abstract
Group Relative Policy Optimization (GRPO) is a prominent reinforcement learning algorithm for post-training Large Language Models (LLMs). It is commonly believed that GRPO necessitates a large group size to ensure stable training via precise statistical estimation, which incurs substantial computational overhead. In this work, we challenge this assumption by reframing GRPO as a form of contrastive learning, which reveals a fundamental connection to Direct Preference Optimization (DPO). Motivated by DPO's empirical success, we investigate the minimal two-rollout case (2-GRPO), a configuration previously deemed infeasible. We provide a rigorous theoretical analysis to validate 2-GRPO and demonstrate empirically that it achieves performance on par with 16-GRPO, despite using only 1/8 of the rollouts and reducing training time by over 70%.
English Analysis
1. Bibliographic Information
- Title: It Takes Two: Your GRPO Is Secretly DPO
- Authors: Yihong Wu, Liheng Ma, Lei Ding, Muzhi Li, Xinyu Wang, Kejia Chen, Zhan Su, Zhanguang Zhang, Chenyang Huang, Yingxue Zhang, Mark Coates, Jian-Yun Nie. The authors are affiliated with several prominent academic institutions and research labs, including Université de Montréal, McGill University, Mila, and Huawei Noah's Ark Lab.
- Journal/Conference: This paper is available as a preprint on arXiv. The arXiv identifier 2510.00977 indicates an October 2025 submission; it has not yet been peer-reviewed.
- Publication Year: 2025 (preprint).
- Abstract: The paper investigates Group Relative Policy Optimization (GRPO), a popular reinforcement learning algorithm for training Large Language Models (LLMs). It challenges the common belief that GRPO requires a large group of generated responses (rollouts) to be effective. By reframing GRPO as a form of contrastive learning, the authors uncover a fundamental link to Direct Preference Optimization (DPO). This insight motivates them to study 2-GRPO, the minimal case with only two rollouts per prompt, which was previously thought to be unstable. Through theoretical analysis and empirical experiments, they show that 2-GRPO performs as well as 16-GRPO (using 16 rollouts) while using only 1/8th of the rollouts and reducing training time by over 70%.
- Original Source Link:
- arXiv Page: https://arxiv.org/abs/2510.00977
- PDF Link: http://arxiv.org/pdf/2510.00977v1
- Publication Status: Preprint.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Post-training Large Language Models (LLMs) with reinforcement learning (RL) is computationally expensive. Group Relative Policy Optimization (GRPO) is a state-of-the-art RL algorithm that has shown great success, but it is believed to require generating a large number of responses (a "group") for each training prompt to stabilize the learning process. This large group size is a major computational bottleneck, significantly increasing training time and cost.
- Gap in Prior Work: The theoretical underpinnings of GRPO, especially the necessity of a large group size, were not well understood. The prevailing assumption was that a large group is essential for accurate statistical estimation of rewards, which is needed for stable training.
- Fresh Angle: This paper challenges that assumption by providing a new theoretical perspective. It reframes GRPO as a contrastive learning problem, which reveals an unexpected and deep connection to another popular alignment algorithm, Direct Preference Optimization (DPO). Since DPO works effectively with just a single pair of preferred/dispreferred responses, the authors question if GRPO could also work with a minimal group.
- Main Contributions / Findings (What):
- A Contrastive Reinterpretation of GRPO: The paper formally shows that the GRPO objective functions like a contrastive loss, aiming to distinguish "positive" (high-reward) rollouts from "negative" (low-reward) ones.
- Theoretical Link to DPO: This reinterpretation establishes a clear theoretical bridge between GRPO and DPO, two seemingly different but conceptually related preference-based optimization methods.
- Validation of 2-GRPO: The paper introduces and validates 2-GRPO, a variant using the minimal group size of two. They provide theoretical guarantees showing that 2-GRPO maintains unbiased gradient estimates, debunking the idea that large groups are necessary for stability.
- Massive Efficiency Gains: Empirically, 2-GRPO is shown to match the performance of the standard 16-GRPO on challenging mathematical reasoning tasks. Crucially, it achieves this with a 70-80% reduction in training time and by generating only 12.5% (1/8th) of the rollouts, making high-performance RL post-training much more accessible.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data, capable of understanding and generating human-like language (e.g., GPT-4, Llama). "Post-training" refers to a second stage of training where a pre-trained LLM is fine-tuned for specific capabilities, such as following instructions or improving its reasoning skills.
- Reinforcement Learning (RL): A paradigm of machine learning where an "agent" (the LLM) learns to make decisions by performing actions (generating text) in an environment to maximize a cumulative "reward."
- RL with Verifiable Rewards (RLVR): A specific application of RL for LLMs where the correctness of a generated response can be automatically checked. For example, in mathematical problem-solving, a final answer can be verified against the known solution, yielding a clear reward signal (e.g., 1 for correct, 0 for incorrect).
- Proximal Policy Optimization (PPO): A popular and robust RL algorithm. It tries to improve the model's policy (how it generates text) without making updates that are too large, which could destabilize training. It typically uses a "value function" to estimate the quality of states, which helps in calculating how much better or worse an action was than expected.
- Direct Preference Optimization (DPO): An alternative to PPO for aligning LLMs with human preferences. Instead of learning a reward model and then using RL, DPO directly optimizes the LLM on a dataset of preference pairs (e.g., "response A is better than response B"). It is known for its simplicity and strong performance.
- Group Relative Policy Optimization (GRPO): The central algorithm of this paper. It is a variant of PPO that replaces the learnable value function with a simpler, on-the-fly normalization scheme. For each prompt, it generates a group of responses, calculates the mean and standard deviation of their rewards, and uses these statistics to normalize the rewards and compute an "advantage" for each response. This advantage tells the model how much better or worse a specific response was compared to the group average. (A minimal code sketch of this computation appears at the end of this section.)
- Differentiation:
- GRPO vs. PPO: GRPO avoids the need to train a separate, often large, value model, which simplifies the training pipeline and reduces computational overhead. It relies on group statistics instead.
- GRPO vs. DPO: Conventionally, GRPO is an online RL algorithm that generates its own data, while DPO is trained offline on a fixed preference dataset. GRPO uses group-wise normalization, while DPO uses pairwise comparison. This paper's key insight is to show that when GRPO's group size is reduced to two, it behaves very similarly to DPO, effectively bridging these two methods.
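To make the group-relative advantage concrete, here is a minimal sketch in NumPy. It is a hedged illustration, not the paper's code: it assumes the RLVR setting (binary 0/1 rewards from an automatic answer check) and shows only the advantage computation, omitting the PPO-style ratio clipping and KL penalty that the full GRPO objective also includes. The function names (`verifiable_reward`, `grpo_advantages`) are illustrative.

```python
import numpy as np

def verifiable_reward(response_answer: str, reference_answer: str) -> float:
    """RLVR-style reward: 1.0 if the final answer matches the reference, else 0.0."""
    return 1.0 if response_answer.strip() == reference_answer.strip() else 0.0

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Group-relative advantages: standardize rewards within one prompt's group.

    Responses scoring above the group mean get a positive advantage, those
    below the mean a negative one. If all rewards in the group are equal,
    the standard deviation is ~0 and every advantage is 0 (no learning signal).
    """
    mean, std = rewards.mean(), rewards.std()
    return (rewards - mean) / (std + eps)

# Example: a group of 4 rollouts for one prompt, 3 correct and 1 incorrect.
rewards = np.array([1.0, 1.0, 0.0, 1.0])
print(grpo_advantages(rewards))  # correct rollouts get > 0, the incorrect one < 0
```

With a group of two, the same computation collapses to advantages of +1 and -1 when exactly one rollout is correct, and 0 when the pair ties; this is the 2-GRPO case the paper analyzes.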
4. Methodology (Core Technology & Implementation)
The paper's core argument is built on a theoretical re-evaluation of the GRPO algorithm.
- Principles: GRPO as Contrastive Learning. The central idea is that the "advantage" in GRPO, which guides the learning, naturally separates responses into two camps:
- Positive Examples: Responses with a reward higher than the group average get a positive advantage.
- Negative Examples: Responses with a reward lower than the group average get a negative advantage.
The GRPO objective then implicitly tries to increase the probability of positive examples and decrease the probability of negative examples. This is the exact principle behind contrastive learning.
- Steps & Procedures: Mathematical Formulation
- General Contrastive Loss: The authors first define a general form for the gradient of any contrastive loss function.
- Explanation: This general form states that a contrastive loss gradient has two parts: one that pushes up the probability of the positive sample ($y^{+}$) and one that pushes down the probability of the negative sample ($y^{-}$).
- Showing GRPO is Contrastive (Proposition 3.2): The paper derives the gradient of the GRPO objective and shows it fits the general contrastive form. In the RLVR setting (rewards are 0 or 1), the gradient simplifies to a two-term contrastive update (a hedged sketch of its shape is given after this list).
- Explanation: This gradient pushes up the average probability of correct responses and pushes down the average probability of incorrect responses for a given prompt $x$. A prompt-level weighting term, which depends on the model's current success rate on that prompt, weights prompts by how uncertain the model is about them, focusing learning on harder cases. This structure perfectly matches the contrastive learning definition.
- Showing DPO is also Contrastive (Proposition 3.3): They perform a similar analysis on the DPO loss function and show its gradient also has the same contrastive structure, solidifying the conceptual link.
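For readers who want the shape of these gradients on the page, the LaTeX block below sketches them. It is a reconstruction under stated assumptions, not a verbatim transcription of the paper's equations: the GRPO line assumes binary rewards with group mean/std normalization (from which the prompt weight $\sqrt{\hat p(1-\hat p)}$ emerges) and ignores clipping, the KL penalty, and token-level factors; the DPO line is the standard DPO gradient with reference policy $\pi_{\mathrm{ref}}$ and temperature $\beta$. The sets $\mathcal{Y}^{+}$ and $\mathcal{Y}^{-}$ denote the correct and incorrect rollouts in a group.

```latex
% Generic contrastive gradient: push up positives, push down negatives.
\nabla_\theta \mathcal{L}_{\mathrm{contrastive}}
  \;\propto\; -\Big( w^{+}\,\nabla_\theta \log \pi_\theta(y^{+}\mid x)
  \;-\; w^{-}\,\nabla_\theta \log \pi_\theta(y^{-}\mid x) \Big)

% GRPO in the RLVR (0/1 reward) setting, with \hat p the group's empirical
% success rate on prompt x (reconstruction, assuming mean/std normalization):
\nabla_\theta \mathcal{J}_{\mathrm{GRPO}}
  \;\propto\; \sqrt{\hat p\,(1-\hat p)}\,
  \Big( \tfrac{1}{|\mathcal{Y}^{+}|}\sum_{y \in \mathcal{Y}^{+}} \nabla_\theta \log \pi_\theta(y\mid x)
  \;-\; \tfrac{1}{|\mathcal{Y}^{-}|}\sum_{y \in \mathcal{Y}^{-}} \nabla_\theta \log \pi_\theta(y\mid x) \Big)

% Standard DPO gradient for a preference pair (y_w preferred over y_l):
\nabla_\theta \mathcal{L}_{\mathrm{DPO}}
  = -\beta\, \sigma\!\big(\beta\,[\,r_\theta(x,y_l) - r_\theta(x,y_w)\,]\big)
  \Big( \nabla_\theta \log \pi_\theta(y_w\mid x) - \nabla_\theta \log \pi_\theta(y_l\mid x) \Big),
\quad r_\theta(x,y) := \log \tfrac{\pi_\theta(y\mid x)}{\pi_{\mathrm{ref}}(y\mid x)}
```

All three gradients share the same two-term "push up the positive, push down the negative" structure, which is the contrastive reading the paper builds on.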
- Introducing and Justifying 2-GRPO: Since DPO works with pairs, the authors propose 2-GRPO, where the group size is $G = 2$. They address three key concerns:
- Advantage Estimation (Proposition 4.1):
- Concern: With $G = 2$, if one response is correct (reward = 1) and one is incorrect (reward = 0), the advantages are simply +1 and -1. If both are the same, the advantage is 0. This seems too crude compared to the fine-grained advantages from a large group.
- Justification: The paper proves that while the instantaneous advantage is discrete, the expected advantage estimated by 2-GRPO over many training steps is an unbiased estimator of the true advantage $x - p$, where $x$ is the reward and $p$ is the model's average success rate on the prompt. Large-group GRPO, which also divides by the group's reward standard deviation, estimates a scaled version of this same quantity. Since they differ only by a scaling factor, 2-GRPO's optimization signal is fundamentally sound. (A small simulation illustrating this scaling relationship is sketched after this list.)
- Gradient Variance (Lemma 4.3):
- Concern: A smaller group size $G$ should lead to higher variance in the gradient estimate for a single prompt, potentially destabilizing training.
- Justification: The total number of rollouts in a training batch is the number of prompts times the group size $G$. The authors argue that one can compensate for a smaller $G$ by increasing the number of prompts. For instance, instead of using 32 prompts with 16 rollouts each (total 512), one can use 256 prompts with 2 rollouts each (total 512). Since the rollouts are spread across different prompts, the overall gradient variance stays under control, while training speeds up drastically because generating responses is the main bottleneck.
- Exploration on Hard Questions (Proposition 4.4):
- Concern: For difficult problems where the model rarely succeeds, a small group of 2 might almost never contain a correct answer, starving the model of positive learning signals.
- Justification: 2-GRPO allows for far more training updates for the same computational budget. The paper shows that the probability of finding at least one correct answer over many 2-GRPO updates is comparable to, or even better than, the probability of finding one in fewer 16-GRPO updates. Intuitively, if the model solves a hard prompt with probability $p$ (and rollouts are drawn independently), eight groups of two rollouts contain at least one correct answer with probability $1 - (1 - p)^{16}$, the same as a single group of sixteen, so nothing is lost per unit of generation.
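The scaling relationship behind Proposition 4.1 can be checked numerically. The sketch below is a hedged illustration rather than the paper's code: it assumes binary rewards and the group mean/std normalization described above, simulates a prompt the model solves with probability p, and compares the expected 2-GRPO advantage of a correct rollout with the standardized advantage from one large group. The function names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def two_grpo_advantage_of_correct(p: float, trials: int = 200_000) -> float:
    """Expected 2-GRPO advantage of a correct rollout when the model solves
    the prompt with probability p. With G=2 the pair is (1, partner):
    advantage is +1 if the partner is wrong, 0 if the partner is also right."""
    partner_correct = rng.random(trials) < p
    advantages = np.where(partner_correct, 0.0, 1.0)
    return float(advantages.mean())

def large_group_advantage_of_correct(p: float, G: int = 10_000) -> float:
    """Standardized advantage of a correct rollout in one large group:
    (1 - p_hat) / sqrt(p_hat * (1 - p_hat)), with p_hat close to p for large G."""
    rewards = (rng.random(G) < p).astype(float)
    p_hat = rewards.mean()
    return float((1.0 - p_hat) / np.sqrt(p_hat * (1.0 - p_hat)))

for p in (0.2, 0.5, 0.8):
    a2 = two_grpo_advantage_of_correct(p)      # ~= 1 - p  (unbiased for x - p)
    aG = large_group_advantage_of_correct(p)   # ~= (1 - p) / sqrt(p * (1 - p))
    print(f"p={p}:  E[A | 2-GRPO]={a2:.3f}   A(large G)={aG:.3f}   ratio={a2/aG:.3f}")
    # The ratio is ~ sqrt(p * (1 - p)): the two signals differ only by a scale factor.
```

Because the two estimators agree up to a positive prompt-dependent scale, they drive the policy in the same direction, which is the core of the paper's argument that two rollouts suffice.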
5. Experimental Setup
- Datasets:
- Training: MATH (a benchmark of math competition problems) and DAPO-Math-Sub (a 7.5k subset of a larger math dataset).
- Evaluation: A suite of five challenging, out-of-distribution math reasoning benchmarks: MATH-500, AMC 2023, Minerva Math, AIME 2025, and OlympiadBench. This setup rigorously tests the generalization ability of the trained models.
- Evaluation Metrics:
- Mean@32: The average accuracy across 32 independently generated responses for each problem. This measures the overall reliability of the model.
- Pass@32: The percentage of problems for which at least one of the 32 responses is correct. This measures the model's peak capability. (A small computation sketch follows this list.)
- Baselines:
- The primary baseline is the standard 16-GRPO, which uses a group size of 16.
- The paper also reports scores for the base models without any RL post-training (w/o).
- Models: Three different models were used to ensure the findings are not model-specific: Qwen-1.5B, Qwen-7B, and DS-1.5B.
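To make the two metrics precise, here is a small sketch (not the paper's evaluation code) that computes them from a boolean matrix of per-response correctness, with problems as rows and sampled responses as columns.

```python
import numpy as np

def mean_at_k(correct: np.ndarray) -> float:
    """Mean@k: average accuracy over all k sampled responses and all problems."""
    return float(correct.mean())

def pass_at_k(correct: np.ndarray) -> float:
    """Pass@k: fraction of problems with at least one correct response among the k samples."""
    return float(correct.any(axis=1).mean())

# Toy example: 3 problems x 4 samples (the paper uses 32 samples per problem).
correct = np.array([
    [True,  False, True,  True ],   # mostly solved
    [False, False, False, True ],   # solved once
    [False, False, False, False],   # never solved
])
print(mean_at_k(correct))  # ~0.417 -> Mean@4
print(pass_at_k(correct))  # ~0.667 -> Pass@4
```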
6. Results & Analysis
- Core Results: The main results are presented in Table 1. The key finding is that 2-GRPO consistently achieves performance on par with 16-GRPO across all models, datasets, and evaluation benchmarks, while being dramatically more efficient.
- Efficiency: 2-GRPO reduces wall-clock training time by 73-84%. For example, on the MATH dataset, training Qwen-7B took 9.3 hours with 16-GRPO but only 2.43 hours with 2-GRPO.
- Performance: The performance difference between 2-GRPO and 16-GRPO is negligible and inconsistent (sometimes slightly better, sometimes slightly worse). For instance, on MATH-500, Qwen-7B trained with 2-GRPO scored 75.23 Mean@32, while 16-GRPO scored 75.90. This tiny difference is insignificant compared to the massive speedup.
- Resource Usage: 2-GRPO uses only 1/8th of the generated rollouts compared to 16-GRPO, directly translating to lower computational costs.
- Visualizations: The training curves in Figures 1 and 2 provide a clear visual confirmation of the results.
- [Figure 1: Reward (left) and validation score (right) versus wall-clock training time for group sizes G=2 and G=16 on Qwen-1.5B.]
- Figure 1 (Qwen-1.5B): The green line (G=2) shows that both the reward and validation score rise much more steeply with respect to wall-clock time compared to the orange line (G=16). 2-GRPO reaches the peak performance of 16-GRPO in a fraction of the time.
- [Figure 2: Reward (left) and validation score (right) versus wall-clock training time for group sizes G=2 and G=16 on Qwen-7B.]
- Figure 2 (Qwen-7B): This figure shows the same trend for the larger 7B model. 2-GRPO is significantly faster and achieves a final performance that is competitive with, and in this case slightly better than, 16-GRPO.
- Ablations / Parameter Sensitivity: The appendix includes an ablation study (Table 2) over several intermediate group sizes. The results show that performance is remarkably stable across these group sizes, while training time increases almost linearly with the group size $G$. This provides strong evidence for the paper's central claim: large group sizes are not necessary for GRPO's success.
7. Conclusion & Reflections
- Conclusion Summary: This paper delivers a clear and impactful message: the widely held belief that GRPO needs a large group size is incorrect. By reframing GRPO through the lens of contrastive learning, the authors establish a theoretical connection to DPO and demonstrate that a minimal group size of two (2-GRPO) is not only viable but highly effective. 2-GRPO matches the performance of 16-GRPO while drastically cutting down on training time and computational costs, making powerful RL techniques more practical and accessible.
- Limitations & Future Work: The authors thoughtfully discuss several nuances and future directions:
- Data Efficiency: 2-GRPO is computationally efficient but might be less data efficient. When both rollouts in a pair receive the same reward (both correct or both incorrect), the advantage is zero and no learning signal is generated, so many generated rollouts are effectively discarded.
- Potential for Adaptive Group Sizes: This limitation suggests a promising future direction: designing algorithms that dynamically adjust the group size. For instance, one could use a small group size by default for efficiency but increase it for "hard" prompts where the model needs more exploration to find a correct response.
- Quantization Perspective: The paper also suggests viewing 2-GRPO as a form of "quantization" in which the continuous advantage values of standard GRPO are simplified to just {-1, 0, 1} (a short worked example follows this list).
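As a quick worked example of this quantization view (my arithmetic, assuming the group mean/std normalization described earlier in this analysis), only three advantage patterns can occur with $G = 2$ and binary rewards:

```latex
\text{Rewards } (1,0):\quad \bar r = 0.5,\ \sigma = 0.5
  \;\Rightarrow\; A_1 = \frac{1-0.5}{0.5} = +1,\qquad A_2 = \frac{0-0.5}{0.5} = -1.

\text{Rewards } (1,1) \text{ or } (0,0):\quad x_i = \bar r
  \;\Rightarrow\; A_1 = A_2 = 0 \quad (\text{no learning signal from this pair}).
```

The second case is exactly the discarded-pair situation described under Data Efficiency above.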
- Personal Insights & Critique:
- Elegance and Simplicity: The paper's strength lies in its simple yet profound insight. The connection between GRPO and DPO is elegant and provides a unified view of preference-based learning in LLMs.
- Practical Impact: The findings have immediate and significant practical implications. Development teams can adopt 2-GRPO to dramatically accelerate their RL post-training pipelines, enabling faster iteration and experimentation at lower cost.
- Strong Validation: The claims are rigorously supported by a combination of clear theoretical arguments and comprehensive experiments across multiple models and difficult benchmarks.
- A New Direction: This work opens up a new way of thinking about RL algorithms for LLMs. Instead of focusing on more complex methods for variance reduction, it shows that rethinking the core objective and batching strategy can lead to huge efficiency wins. The idea of trading a large group size for a larger number of prompts is a powerful and practical heuristic.