Understanding R1-Zero-Like Training: A Critical Perspective
TL;DR Summary
This paper critically analyzes R1-Zero-like training, revealing pretraining biases in base LLMs and an optimization bias in GRPO that inflates response length. It introduces Dr. GRPO, an unbiased method, and a minimalist recipe that achieves a state-of-the-art 43.3% on AIME 2024 with a 7B model.
Abstract
DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits the "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.
Analysis
1. Bibliographic Information
- Title: Understanding R1-Zero-Like Training: A Critical Perspective
- Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin.
- Affiliations: The authors are from Sea AI Lab, National University of Singapore, and Singapore Management University. This indicates a collaboration between a corporate research lab and academic institutions.
- Journal/Conference: This paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a conference or journal but is shared to disseminate findings quickly. The submission date suggests it is very recent work.
- Publication Year: 2025 (arXiv preprint).
- Abstract: The paper critically investigates the "R1-Zero-like" training paradigm, popularized by DeepSeek-R1-Zero, where Reinforcement Learning (RL) is applied directly to base Large Language Models (LLMs) to improve reasoning without prior Supervised Fine-Tuning (SFT). The authors deconstruct this paradigm by analyzing its two main components: the base models and the RL algorithm. They find that some base models (like Qwen2.5) already have strong reasoning skills, and the "Aha moment" of self-reflection is present even in the original DeepSeek-V3-Base model before RL. They identify a significant flaw in the Group Relative Policy Optimization (GRPO) algorithm, an optimization bias that causes response lengths to increase, especially for incorrect answers. To fix this, they introduce Dr. GRPO, an unbiased version that improves token efficiency. Using these insights, they create a minimalist training recipe that achieves a new state-of-the-art result (43.3% on AIME 2024) with a 7B parameter model.
- Original Source Link: The paper is available on arXiv as a preprint; the accompanying code is at https://github.com/sail-sg/understand-r1-zero.
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: The AI community has been highly interested in the DeepSeek-R1-Zero model, which demonstrated that applying Reinforcement Learning (RL) at scale directly to a base LLM could significantly boost its complex reasoning abilities, bypassing the standard Supervised Fine-Tuning (SFT) step. This led to a "scaling phenomenon" where both model performance and response length increased together, along with the emergence of skills like self-reflection (the "Aha moment"). However, the underlying reasons for this success were not well understood.
  - Gap in Prior Work: Many researchers attempted to replicate R1-Zero's success, but it was unclear whether the gains came from the RL process itself, the choice of base model, or the specific RL algorithm used. There was a need for a critical, systematic analysis to separate these factors.
  - Innovation: This paper provides that critical analysis. Instead of just replicating the results, it dissects the process and questions fundamental assumptions. The authors investigate whether the "emergent" abilities were truly emergent and whether the observed increase in response length was a genuine sign of improved reasoning or an artifact of the training algorithm.
- Main Contributions / Findings (What):
  - Base Model Analysis: The paper reveals that many popular base models used for R1-Zero replications are not "pure." Qwen2.5 models show strong reasoning even without prompt templates, suggesting they may have been pretrained on question-answer data. Crucially, the authors find that the "Aha moment" of self-reflection already exists in base models, including the original DeepSeek-V3-Base, before any RL is applied.
  - Identification of Optimization Bias in GRPO: The authors discover that the Group Relative Policy Optimization (GRPO) algorithm, a key component in R1-Zero training, contains a length bias. This bias unintentionally encourages the model to produce longer responses when it is wrong and shorter responses when it is right, artificially inflating the average response length during training.
  - Proposal of Dr. GRPO: To correct this flaw, they propose Dr. GRPO (GRPO Done Right), a simple yet effective modification that removes the biasing terms from the GRPO objective function. The new method maintains reasoning performance while significantly improving token efficiency by preventing the model from generating overly long incorrect answers.
  - A Minimalist State-of-the-Art Recipe: By combining these insights, the authors develop a highly efficient training recipe. Using a 7B Qwen2.5-Math base model, their unbiased Dr. GRPO algorithm, and a focused dataset, they achieve a new state-of-the-art accuracy of 43.3% on the AIME 2024 benchmark in only 27 hours on 8 A100 GPUs.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., the internet). A base model is the initial model after this pretraining; it's good at predicting the next word but not necessarily at following instructions.
- Supervised Fine-Tuning (SFT): This is the process of training a base model on a smaller, high-quality dataset of instruction-response pairs (e.g., question-answer pairs) to teach it how to follow instructions and act like a helpful assistant.
- Reinforcement Learning (RL): A machine learning paradigm where an "agent" (the LLM) learns to make decisions (generate tokens) to maximize a cumulative "reward." In this context, the reward is given for generating a correct answer to a reasoning problem.
- R1-Zero-like Training: A novel training paradigm introduced by DeepSeek-R1-Zero. It skips the SFT step and applies RL directly to the base model. This was revolutionary because it suggested that complex reasoning could be unlocked through RL alone, without needing curated SFT data.
- "Aha Moment": A term used to describe the phenomenon where a model, during RL training, appears to develop emergent reasoning skills like self-correction and reflection (e.g., generating phrases like "Wait, let me recheck that...").
- Group Relative Policy Optimization (GRPO): An RL algorithm designed for LLMs. For a given prompt, it samples a group of responses, compares their rewards (e.g., correct vs. incorrect), and updates the model to increase the probability of generating high-reward responses.
- Technological Evolution & Differentiation:
  - The standard pipeline for creating instruction-following LLMs was Pretraining -> SFT -> RL from Human Feedback (RLHF). DeepSeek-R1-Zero challenged this by showing that Pretraining -> RL could work for complex reasoning tasks, which is simpler and potentially more scalable.
  - Many subsequent works (SimpleRL-Zero, Open-Reasoner-Zero) tried to replicate this success, often using Qwen2.5 models.
  - This paper differentiates itself by not just replicating but critically analyzing the paradigm. It questions whether the "Aha moment" truly comes from RL and whether the observed "double increase" (in performance and length) is a desirable outcome or a methodological flaw. The proposed Dr. GRPO is a direct response to a flaw identified in the existing approach.
4. Methodology (Core Technology & Implementation)
The paper's methodology is twofold: a critical analysis of existing components and the proposal of an improved algorithm.
Part 1: Analysis of Base Models and RL
The authors systematically investigate the two core components of R1-Zero-like training.
- Base Model Investigation:
  - They analyze a wide range of models (the Qwen2.5, Llama-3.1, and DeepSeek series) on 500 questions from the MATH dataset.
  - Templates: They test three prompt templates to see how they affect model behavior:
    - Template 1 (R1 template): A verbose template instructing the model to reason inside <think> ... </think> tags.
    - Template 2 (Qwen-Math template): A more standard chat-based template using special tokens.
    - Template 3 (No template): Just the raw question.
  - Evaluated Abilities:
    - Question-Answering Ability: Does the model answer the question or just continue the text?
    - Exploration Ability: Can the model generate correct answers at all when sampling multiple times? Measured by pass@8 accuracy.
    - Self-Reflection ("Aha Moment"): Do base models already generate self-reflection keywords (e.g., "recheck", "Aha!")? This is checked using both keyword matching and a stronger LLM (GPT-4o-mini) for verification (a minimal keyword-matching sketch follows this list).
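A minimal sketch of the keyword-matching half of this self-reflection check, assuming an illustrative keyword list (the paper's exact list may differ); in the paper, flagged responses are additionally verified by an LLM judge (GPT-4o-mini):

```python
# Illustrative self-reflection keywords; the authors' exact list may differ.
REFLECTION_KEYWORDS = ["recheck", "re-check", "rethink", "re-evaluate",
                       "let me check", "wait,", "aha"]

def has_self_reflection(response: str) -> bool:
    """First-pass keyword filter for self-reflection behaviour; in the paper,
    candidate responses are then verified by a stronger LLM (GPT-4o-mini)."""
    text = response.lower()
    return any(kw in text for kw in REFLECTION_KEYWORDS)

responses = ["Wait, I'm overthinking this. The answer is 12.",
             "The answer is 7."]
print([has_self_reflection(r) for r in responses])   # [True, False]
```

Counting the fraction of sampled base-model responses flagged this way is how one can show, as the paper does, that self-reflection phrases already appear before any RL is applied.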
Part 2: Analysis of GRPO and the Proposal of Dr. GRPO
- RL Formulation:
  - LLM generation is modeled as a Markov Decision Process (MDP). The goal is to maximize the expected reward, which is 1 for a correct final answer and 0 otherwise.
  - The standard RL objective is given by:
    $$\max_{\pi_\theta}\ \mathbb{E}_{q \sim p_{\mathcal{Q}},\, o \sim \pi_\theta(\cdot \mid q)}\big[R(q, o)\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]$$
  - Here, $\pi_\theta$ is the policy (the LLM), $R(q, o)$ is the reward for output $o$ given question $q$, and the KL term is a penalty that keeps the policy from changing too much from a reference policy $\pi_{\mathrm{ref}}$. The paper sets $\beta = 0$, removing this penalty, as the reward is based on a fixed rule (correctness) and not a learned reward model.
- Identifying Bias in GRPO:
  - The authors present the GRPO objective function and highlight its "advantage" estimator, which determines the direction of the policy update. The advantage in GRPO is defined as:
    $$\hat{A}_{i,t} = \frac{R(q, o_i) - \mathrm{mean}\big(\{R(q, o_1), \ldots, R(q, o_G)\}\big)}{\mathrm{std}\big(\{R(q, o_1), \ldots, R(q, o_G)\}\big)}$$
  - The overall update for a response $o_i$ is also scaled by $1/|o_i|$, the inverse of its length. This leads to two biases:
    - Response-level length bias: The $1/|o_i|$ term means that for a correct answer (positive advantage), shorter responses get a larger update, encouraging brevity. For an incorrect answer (negative advantage), longer responses get a smaller penalty, unintentionally encouraging the model to generate longer incorrect responses.
    - Question-level difficulty bias: The normalization by the standard deviation (std) of the group's rewards gives higher weight to questions where the model is either consistently right or consistently wrong (low variance in rewards), biasing training towards very easy or very hard examples.
- Dr. GRPO: The Unbiased Solution:
  - Dr. GRPO is a simple fix: remove the two biasing terms. The new, unbiased advantage estimator is just the centered reward:
    $$\hat{A}_{i,t} = R(q, o_i) - \mathrm{mean}\big(\{R(q, o_1), \ldots, R(q, o_G)\}\big)$$
  - The per-response length normalization $1/|o_i|$ is also removed from the loss calculation. As shown in Appendix A, this modified objective aligns correctly with the standard REINFORCE policy gradient algorithm with a baseline, making it theoretically sound and unbiased (a code sketch contrasting the two objectives follows this list).
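To make the two biases and their removal concrete, here is a minimal PyTorch-style sketch, not the authors' implementation: it assumes a rule-based 0/1 reward, omits the clipped importance-sampling ratio used in the full GRPO/Dr. GRPO objectives, and the function names, tensor shapes, and the `max_len` constant are illustrative.

```python
import torch

def rule_based_reward(is_correct: bool) -> float:
    """R(q, o): 1 for a correct final answer, 0 otherwise (fixed rule, no reward model)."""
    return 1.0 if is_correct else 0.0

def group_advantages(rewards: torch.Tensor, use_std_norm: bool, eps: float = 1e-4) -> torch.Tensor:
    """Per-response advantages from a group of rewards (shape [G])."""
    adv = rewards - rewards.mean()          # centered reward: all that Dr. GRPO keeps
    if use_std_norm:
        adv = adv / (rewards.std() + eps)   # GRPO's std scaling -> difficulty bias
    return adv

def pg_loss(token_logps: torch.Tensor, mask: torch.Tensor, rewards: torch.Tensor,
            variant: str = "dr_grpo", max_len: int = 3000) -> torch.Tensor:
    """Simplified policy-gradient surrogate for one question.

    token_logps: [G, T] per-token log-probs of the G sampled responses
    mask:        [G, T] 1 for generated tokens, 0 for padding
    rewards:     [G]    rule-based rewards
    """
    assert variant in {"grpo", "dr_grpo"}
    adv = group_advantages(rewards, use_std_norm=(variant == "grpo"))
    per_token = -adv[:, None] * token_logps * mask      # REINFORCE-with-baseline surrogate
    if variant == "grpo":
        # Per-response 1/|o_i| normalization -> response-level length bias.
        return (per_token.sum(dim=1) / mask.sum(dim=1).clamp(min=1)).mean()
    # Dr. GRPO: no per-response length term; normalize by a constant instead.
    return per_token.sum() / (mask.size(0) * max_len)
```

With `variant="grpo"` both biasing terms are active (per-question std scaling and per-response length division); with `variant="dr_grpo"` only the centered-reward advantage and a constant normalizer remain, matching the REINFORCE-with-baseline form described above.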
5. Experimental Setup
- Datasets:
  - Training: Various math question sets were used to test different conditions, including MATH (high-school competition math), GSM (grade-school math), ASDiv (basic algebra), and ORZ (a large, diverse combination).
  - Evaluation: A standard suite of challenging math reasoning benchmarks: AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench.
- Evaluation Metrics:
  - Accuracy: The primary metric for reasoning performance.
  - Response Length: Used to track the effect of the length bias.
  - pass@8: Measures the model's ability to find a correct solution within 8 sampled attempts, indicating its exploration capability (a pass@k computation sketch follows the Baselines list below).
- Baselines:
  - The main comparison is between GRPO and the proposed Dr. GRPO.
  - The final model, Oat-Zero-7B, is compared against other SOTA open-source models of similar size that also follow the R1-Zero paradigm, such as SimpleRL-Zero-7B, PRIME-Zero-7B, and OpenReasoner-Zero-7B.
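As a concrete reading of pass@8, the sketch below uses the standard unbiased pass@k estimator over n sampled solutions per question; the paper does not spell out its exact computation, so the estimator choice and the toy data are assumptions.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate: chance that at least one of k samples drawn
    from n generated solutions (c of them correct) is correct.
    With n = k = 8 this reduces to "at least one of the 8 samples is correct"."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy example: 8 samples per question, graded 1/0 by a rule-based answer checker.
per_question_grades = [[1, 0, 0, 1, 0, 0, 0, 0],   # 2 of 8 correct
                       [0, 0, 0, 0, 0, 0, 0, 0]]   # none correct
scores = [pass_at_k(n=8, c=sum(g), k=8) for g in per_question_grades]
print(sum(scores) / len(scores))                   # dataset-level pass@8 = 0.5
```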
6. Results & Analysis
Core Results on Base Models (Section 2)
- Templates and Pretraining Bias:
  - Figure 7 shows that for the Llama and DeepSeek models, using a template is crucial to get them to answer questions; Qwen2.5 models, however, perform best with no template.
  - Table 1 quantifies this: for Qwen2.5-Math-7B, using no template yields 38.2% average accuracy, while standard 4-shot prompting reaches only 23.8%. This strongly suggests the Qwen2.5 models were pretrained on concatenated question-answer text, making them SFT-like from the start.
- "Aha Moment" Pre-exists in Base Models:
  - The right plot in Figure 7 shows that nearly all base models, including DeepSeek-V3-Base, already exhibit self-reflection behaviors. This finding challenges the claim that the ability emerges purely from RL training.
  - Figure 13 (in the Appendix) provides concrete examples of DeepSeek-V3-Base generating phrases like "Wait, I'm overthinking" and "Aha! ... This gives me an idea." before any RL tuning.
Core Results on RL Algorithm (Section 3)
- Dr. GRPO Fixes the Length Bias:
  - Figure 8 provides a compelling comparison:
    - Plots 1 & 5 show that Dr. GRPO achieves reward and final benchmark scores comparable to GRPO; performance is not sacrificed.
    - Plot 2 shows that while GRPO's average output length increases continuously throughout training, Dr. GRPO's length stabilizes.
    - Plot 4 is the most damning evidence: GRPO causes the length of incorrect responses to skyrocket, confirming the length bias, whereas Dr. GRPO keeps incorrect responses short, leading to much better token efficiency.
- Templates and Question Sets Interact:
  - Figure 9 shows that when there is a mismatch between the base model and the template (e.g., Qwen2.5-Math with the R1 template), performance is initially destroyed and RL must reconstruct the reasoning ability. In this case, a large, diverse training set (ORZ-57K) is needed.
  - However, with a better-suited template (the Qwen-Math template), even a simpler, out-of-distribution dataset (GSM-8K) is sufficient to achieve high performance. This suggests RL is reinforcing existing behaviors rather than teaching new knowledge.
- Domain Pretraining is Key for Weaker Models:
  - Figure 10 shows experiments with Llama-3.2-3B, a model not specialized for math.
    - The left plot shows that RL on the vanilla Llama base model yields minimal gains. However, after continual pretraining on math data (FineMath and NuminaQA), the model's potential for improvement with RL (its "RL ceiling") is significantly higher.
    - The right plot re-confirms that GRPO creates the misleading "double-increase" phenomenon on this model, while Dr. GRPO does not.
Final Model Performance
- Figure 6 showcases the final results. The authors' Oat-Zero-7B model, trained with their minimalist recipe, achieves the highest average accuracy (51.4%) among all compared 7B models. It particularly excels on the difficult AIME 2024 benchmark with 43.3% accuracy, establishing a new state-of-the-art.
7. Conclusion & Reflections
- Conclusion Summary:
  - The paper successfully demystifies key aspects of R1-Zero-like training. It demonstrates that the choice of base model and its pretraining history are critical, and that some "emergent" abilities like self-reflection may already be present before RL.
  - It identifies and corrects a significant optimization bias in the GRPO algorithm that leads to inefficient, long responses. The proposed Dr. GRPO maintains performance while improving token efficiency.
  - By leveraging these insights, the authors provide a simple, efficient, and reproducible recipe that achieves SOTA results, proving that effective RL for reasoning does not require uncontrolled growth in response length.
- Limitations & Future Work:
  - The authors do not explicitly state limitations. However, the analysis is focused on mathematical reasoning, so the findings might not generalize perfectly to other domains such as creative writing or general conversation.
  - The analysis of the "Aha moment" is correlational. While the authors show it exists in base models, they do not fully explain its causal role in the RL process (though Appendix F suggests it does not correlate with higher accuracy at inference time).
  - Future work could involve applying Dr. GRPO to larger models and different reasoning domains to verify its benefits at scale.
- Personal Insights & Critique:
  - This is an excellent example of critical and rigorous scientific work in AI. Instead of just chasing higher benchmark scores, the authors delve into the "why" and uncover a fundamental flaw in a popular method.
  - The discovery of the length bias in GRPO is a significant contribution. It serves as a cautionary tale: what looks like an emergent capability (long, complex reasoning) can sometimes be an algorithmic artifact. This insight promotes more efficient and principled approaches to RL for LLMs.
  - Dr. GRPO is a model of a good scientific fix: it is simple, theoretically grounded, and empirically effective.
  - The paper's overall message is powerful: understanding the components of a complex system is more valuable than treating it as a black box. By doing so, we can achieve better results with greater efficiency. The work is a major step forward for the open-source community trying to build powerful reasoning models.