
Understanding R1-Zero-Like Training: A Critical Perspective

Reinforcement Learning for Math Reasoning · Sequence Policy Optimization · RL Training for Large Language Models · Optimization Bias Analysis · Pretraining Impact on Base Models
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper critically analyzes R1-Zero-like training, revealing pretraining biases in base LLMs and an optimization bias in GRPO that inflates response length. It introduces Dr. GRPO, an unbiased method, and a minimalist training recipe that achieves a state-of-the-art 43.3% on AIME 2024 with a 7B model.

Abstract

DeepSeek-R1-Zero has shown that reinforcement learning (RL) at scale can directly enhance the reasoning capabilities of LLMs without supervised fine-tuning. In this work, we critically examine R1-Zero-like training by analyzing its two core components: base models and RL. We investigate a wide range of base models, including DeepSeek-V3-Base, to understand how pretraining characteristics influence RL performance. Our analysis reveals that DeepSeek-V3-Base already exhibits an "Aha moment", while Qwen2.5 base models demonstrate strong reasoning capabilities even without prompt templates, suggesting potential pretraining biases. Additionally, we identify an optimization bias in Group Relative Policy Optimization (GRPO), which artificially increases response length (especially for incorrect outputs) during training. To address this, we introduce Dr. GRPO, an unbiased optimization method that improves token efficiency while maintaining reasoning performance. Leveraging these insights, we present a minimalist R1-Zero recipe that achieves 43.3% accuracy on AIME 2024 with a 7B base model, establishing a new state-of-the-art. Our code is available at https://github.com/sail-sg/understand-r1-zero.

English Analysis

1. Bibliographic Information

  • Title: Understanding R1-Zero-Like Training: A Critical Perspective
  • Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin.
  • Affiliations: The authors are from Sea AI Lab, National University of Singapore, and Singapore Management University. This indicates a collaboration between a corporate research lab and academic institutions.
  • Journal/Conference: This paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a conference or journal but is shared to disseminate findings quickly. The submission date suggests it is very recent work.
  • Publication Year: 2025 (as listed on arXiv).
  • Abstract: The paper critically investigates the "R1-Zero-like" training paradigm, popularized by DeepSeek-R1-Zero, where Reinforcement Learning (RL) is applied directly to base Large Language Models (LLMs) to improve reasoning without prior Supervised Fine-Tuning (SFT). The authors deconstruct this paradigm by analyzing its two main components: the base models and the RL algorithm. They find that some base models (like Qwen2.5) already have strong reasoning skills, and the "Aha moment" of self-reflection is present even in the original DeepSeek-V3-Base model before RL. They identify a significant flaw in the Group Relative Policy Optimization (GRPO) algorithm, an optimization bias that causes response lengths to increase, especially for incorrect answers. To fix this, they introduce Dr. GRPO, an unbiased version that improves token efficiency. Using these insights, they create a minimalist training recipe that achieves a new state-of-the-art result (43.3% on AIME 2024) with a 7B parameter model.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: The AI community has been highly interested in the DeepSeek-R1-Zero model, which demonstrated that applying Reinforcement Learning (RL) at scale directly to a base LLM could significantly boost its complex reasoning abilities, bypassing the standard Supervised Fine-Tuning (SFT) step. This led to a "scaling phenomenon" where both model performance and response length increased together, along with the emergence of skills like self-reflection (the "Aha moment"). However, the underlying reasons for this success were not well understood.
    • Gap in Prior Work: Many researchers attempted to replicate R1-Zero's success, but it was unclear whether the gains came from the RL process itself, the choice of base model, or the specific RL algorithm used. There was a need for a critical, systematic analysis to separate these factors.
    • Innovation: This paper provides that critical analysis. Instead of just replicating the results, it dissects the process, questioning fundamental assumptions. The authors investigate whether the "emergent" abilities were truly emergent and if the observed increase in response length was a genuine sign of improved reasoning or an artifact of the training algorithm.
  • Main Contributions / Findings (What):

    1. Base Model Analysis: The paper reveals that many popular base models used for R1-Zero replications are not "pure." Qwen2.5 models show strong reasoning even without prompt templates, suggesting they may have been pretrained on question-answer data. Crucially, they find that the "Aha moment" of self-reflection already exists in base models, including the original DeepSeek-V3-Base, before any RL is applied.
    2. Identification of Optimization Bias in GRPO: The authors discover that the Group Relative Policy Optimization (GRPO) algorithm, a key component in R1-Zero training, contains a length bias. This bias unintentionally encourages the model to produce longer responses when it is wrong and shorter responses when it is right, artificially inflating the average response length during training.
    3. Proposal of Dr. GRPO: To correct this flaw, they propose Dr. GRPO (GRPO Done Right), a simple yet effective modification that removes the biasing terms from the GRPO objective function. This new method maintains reasoning performance while significantly improving token efficiency by preventing the model from generating overly long incorrect answers.
    4. A Minimalist State-of-the-Art Recipe: By combining their insights, the authors develop a highly efficient training recipe. Using a 7B Qwen2.5-Math model, their unbiased Dr. GRPO algorithm, and a focused dataset, they achieve a new state-of-the-art accuracy of 43.3% on the AIME 2024 benchmark, using only 27 hours on 8 A100 GPUs.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., the internet). A base model is the initial model after this pretraining; it's good at predicting the next word but not necessarily at following instructions.
    • Supervised Fine-Tuning (SFT): This is the process of training a base model on a smaller, high-quality dataset of instruction-response pairs (e.g., question-answer pairs) to teach it how to follow instructions and act like a helpful assistant.
    • Reinforcement Learning (RL): A machine learning paradigm where an "agent" (the LLM) learns to make decisions (generate tokens) to maximize a cumulative "reward." In this context, the reward is given for generating a correct answer to a reasoning problem.
    • R1-Zero-like Training: A novel training paradigm introduced by DeepSeek-R1-Zero. It skips the SFT step and applies RL directly to the base model. This was revolutionary because it suggested that complex reasoning could be unlocked through RL alone, without needing curated SFT data.
    • "Aha Moment": A term used to describe the phenomenon where a model, during RL training, appears to develop emergent reasoning skills like self-correction and reflection (e.g., generating phrases like "Wait, let me recheck that...").
    • Group Relative Policy Optimization (GRPO): An RL algorithm designed for LLMs. For a given prompt, it samples a group of responses, compares their rewards (e.g., correct vs. incorrect), and updates the model to increase the probability of generating high-reward responses.
  • Technological Evolution & Differentiation:

    • The standard pipeline for creating instruction-following LLMs was Pretraining -> SFT -> RL from Human Feedback (RLHF).
    • DeepSeek-R1-Zero challenged this by showing Pretraining -> RL could work for complex reasoning tasks, which was simpler and potentially more scalable.
    • Many subsequent works (SimpleRL-Zero, Open-Reasoner-Zero) tried to replicate this success, often using Qwen2.5 models.
    • This paper differentiates itself by not just replicating but critically analyzing the paradigm. It questions whether the "Aha moment" is truly from RL and whether the observed "double increase" (in performance and length) is a desirable outcome or a methodological flaw. Their proposed Dr. GRPO is a direct response to a flaw they identified in the existing approach.

4. Methodology (Core Technology & Implementation)

The paper's methodology is twofold: a critical analysis of existing components and the proposal of an improved algorithm.

Part 1: Analysis of Base Models and RL

The authors systematically investigate the two core components of R1-Zero-like training.

  • Base Model Investigation:
    • They analyze a wide range of models (Qwen-2.5, Llama-3.1, DeepSeek series) on 500 questions from the MATH dataset.
    • Templates: They test three prompt templates to see how they affect model behavior:
      1. Template 1 (R1 template): A verbose template instructing the model to place its reasoning inside <think> </think> tags before giving the final answer.
      2. Template 2 (Qwen-Math template): A more standard chat-based template using special tokens.
      3. Template 3 (No template): Just the raw question.
    • Evaluated Abilities:
      • Question-Answering Ability: Does the model answer the question or just continue the text?
      • Exploration Ability: Can the model generate correct answers at all when sampling multiple times? Measured by pass@8 accuracy.
      • Self-Reflection ("Aha Moment"): Do base models already generate self-reflection keywords (e.g., "recheck", "Aha!")? This was checked using both keyword matching and a stronger LLM (GPT-4o-mini) for verification.
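To make the keyword-matching side of this check concrete, here is a minimal sketch; the keyword list and function names are illustrative assumptions rather than the paper's exact implementation, and the LLM-based verification step is not shown:

```python
# Illustrative self-reflection keyword check over sampled completions.
# The keyword list below is an assumption, not the paper's exact list.
SELF_REFLECTION_KEYWORDS = [
    "recheck", "re-check", "rethink", "reconsider",
    "let me verify", "wait,", "aha",
]

def has_self_reflection(completion: str) -> bool:
    """True if the completion contains at least one reflection keyword."""
    text = completion.lower()
    return any(kw in text for kw in SELF_REFLECTION_KEYWORDS)

def reflection_rate(completions: list[str]) -> float:
    """Fraction of sampled completions flagged as containing self-reflection."""
    if not completions:
        return 0.0
    return sum(has_self_reflection(c) for c in completions) / len(completions)
```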

Part 2: Analysis of GRPO and the Proposal of Dr. GRPO

  • RL Formulation:

    • LLM generation is modeled as a Markov Decision Process (MDP). The goal is to maximize the expected reward, which is 1 for a correct final answer and 0 otherwise.
    • The standard RL objective is given by: $\mathcal{I}(\pi_\theta) = \mathbb{E}_{\mathbf{q} \sim p_{\mathcal{Q}}} \left[ \mathbb{E}_{\mathbf{o} \sim \pi_\theta(\cdot \mid \mathbf{q})} \left[ R(\mathbf{q}, \mathbf{o}) \right] - \beta\, \mathbb{D}_{KL} \left[ \pi_\theta(\cdot \mid \mathbf{q}) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid \mathbf{q}) \right] \right]$
    • Here, $\pi_\theta$ is the policy (the LLM), $R(\mathbf{q}, \mathbf{o})$ is the reward for output $\mathbf{o}$ given question $\mathbf{q}$, and the $\mathbb{D}_{KL}$ term is a penalty that keeps the policy from drifting too far from a reference policy $\pi_{\mathrm{ref}}$. The paper sets $\beta = 0$, removing this penalty, because the reward comes from a fixed rule (answer correctness) rather than a learned reward model.
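A minimal sketch of this setup, assuming answers are marked with \boxed{...} and that a sampling function for the policy is supplied by the caller; with $\beta = 0$ the objective is simply the average rule-based reward over sampled responses:

```python
import re

def extract_final_answer(output: str) -> str | None:
    """Toy parser: take the contents of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", output)
    return matches[-1].strip() if matches else None

def rule_based_reward(output: str, reference: str) -> float:
    """R(q, o) = 1 for a correct final answer, 0 otherwise."""
    return 1.0 if extract_final_answer(output) == reference.strip() else 0.0

def estimate_objective(questions, references, sample_fn, num_samples=8):
    """Monte Carlo estimate of E_q E_{o ~ pi_theta(.|q)}[R(q, o)] with beta = 0.
    sample_fn(question, n) is assumed to return n completions from the policy."""
    scores = []
    for q, ref in zip(questions, references):
        outputs = sample_fn(q, num_samples)
        scores.append(sum(rule_based_reward(o, ref) for o in outputs) / num_samples)
    return sum(scores) / len(scores)
```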
  • Identifying Bias in GRPO:

    • The authors present the GRPO objective function and highlight its "advantage" estimator, which determines the direction of the policy update. The advantage $\hat{A}_{i,t}$ in GRPO is defined as: $\hat{A}_{i,t} = \dfrac{R(\mathbf{q}, \mathbf{o}_i) - \mathrm{mean}\left(\{R(\mathbf{q}, \mathbf{o}_1), \dots, R(\mathbf{q}, \mathbf{o}_G)\}\right)}{\mathrm{std}\left(\{R(\mathbf{q}, \mathbf{o}_1), \dots, R(\mathbf{q}, \mathbf{o}_G)\}\right)}$
    • The overall update for a response $\mathbf{o}_i$ is also scaled by $1/|\mathbf{o}_i|$. This leads to two biases (a toy numerical illustration follows this list):
      1. Response-level length bias: The $1/|\mathbf{o}_i|$ term means that for a correct answer (positive advantage), shorter responses get a larger update, encouraging brevity. For an incorrect answer (negative advantage), longer responses get a smaller penalty, unintentionally encouraging the model to generate longer incorrect responses.
      2. Question-level difficulty bias: The normalization by standard deviation (std) gives higher weight to questions where the model is either consistently right or consistently wrong (low variance in rewards), biasing training towards very easy or very hard examples.
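A toy numerical illustration of these two terms (rewards and lengths are made-up values, and the PPO-style ratio and clipping of the full objective are ignored; only the advantage and per-token scaling factors are shown):

```python
# Toy group of G = 4 sampled responses for one question: two correct, two incorrect.
from statistics import mean, pstdev

rewards = [1.0, 0.0, 0.0, 1.0]   # 1 = correct, 0 = incorrect
lengths = [120, 200, 800, 90]    # |o_i|, response lengths in tokens (made up)

mu, sigma = mean(rewards), pstdev(rewards)

for r, n_tokens in zip(rewards, lengths):
    advantage = (r - mu) / sigma            # question-level std normalization (bias 2)
    per_token_scale = advantage / n_tokens  # response-level 1/|o_i| term (bias 1)
    print(f"reward={r:.0f}  len={n_tokens:4d}  adv={advantage:+.2f}  "
          f"per-token scale={per_token_scale:+.5f}")

# The 800-token incorrect response receives a per-token penalty of only -0.00125,
# versus -0.00500 for the 200-token incorrect one: long wrong answers are punished less.
```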
  • Dr. GRPO: The Unbiased Solution:

    • Dr. GRPO is a simple fix: remove the two biasing terms. The new, unbiased advantage estimator $\tilde{A}_{i,t}$ is just the centered reward: $\tilde{A}_{i,t} = R(\mathbf{q}, \mathbf{o}_i) - \mathrm{mean}\left(\{R(\mathbf{q}, \mathbf{o}_1), \dots, R(\mathbf{q}, \mathbf{o}_G)\}\right)$
    • The per-response length normalization is also removed from the loss calculation. As shown in Appendix A, this modified objective aligns correctly with the standard REINFORCE policy gradient algorithm with a baseline, making it theoretically sound and unbiased.
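A minimal sketch of the resulting loss for one group of sampled responses, assuming token_logps[i] already holds the per-token log-probabilities of response o_i under the current policy (random tensors stand in below); the PPO-style clipped ratio of the full objective is omitted:

```python
import torch

def dr_grpo_loss(token_logps: list[torch.Tensor], rewards: list[float]) -> torch.Tensor:
    """REINFORCE-with-baseline style loss: rewards are mean-centered (no std division)
    and token terms are summed per response (no 1/|o_i| division), then averaged
    over the group with a constant normalizer."""
    baseline = sum(rewards) / len(rewards)
    advantages = [r - baseline for r in rewards]   # Dr. GRPO advantage: centered reward
    per_response = [adv * lp.sum() for adv, lp in zip(advantages, token_logps)]
    return -torch.stack(per_response).mean()

# Toy usage: three responses of different lengths, two correct and one incorrect.
token_logps = [torch.randn(50, requires_grad=True),
               torch.randn(400, requires_grad=True),
               torch.randn(120, requires_grad=True)]
loss = dr_grpo_loss(token_logps, rewards=[1.0, 0.0, 1.0])
loss.backward()   # per-token gradient scale no longer depends on each response's length
```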

5. Experimental Setup

  • Datasets:

    • Training: Various math question sets were used to test different conditions, including MATH (high-school competition math), GSM (grade-school math), ASDiv (basic algebra), and ORZ (the large, diverse question collection used by Open-Reasoner-Zero).
    • Evaluation: A standard suite of challenging math reasoning benchmarks: AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench.
  • Evaluation Metrics:

    • Accuracy: The primary metric for reasoning performance.
    • Response Length: Used to track the effect of the length bias.
    • pass@8: Measures the model's ability to find a correct solution within 8 attempts, indicating its exploration capability.
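For example, the empirical pass@8 over a benchmark can be computed as the fraction of questions with at least one correct answer among 8 samples (a minimal sketch with made-up data):

```python
# Empirical pass@8: a question counts as solved if any of its 8 sampled answers is correct.
def empirical_pass_at_k(results: dict[str, list[bool]]) -> float:
    """results maps each question id to a list of per-attempt correctness flags."""
    return sum(any(attempts) for attempts in results.values()) / len(results)

# Made-up data: q1 and q3 are solved at least once within 8 attempts, q2 never is.
results = {
    "q1": [False] * 7 + [True],
    "q2": [False] * 8,
    "q3": [True] + [False] * 7,
}
print(empirical_pass_at_k(results))   # 2/3 ≈ 0.667
```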
  • Baselines:

    • The main comparison is between GRPO and the proposed Dr. GRPO.
    • The final model, Oat-Zero-7B, is compared against other SOTA open-source models of similar size that also follow the R1-Zero paradigm, such as SimpleRL-Zero-7B, PRIME-Zero-7B, and OpenReasoner-Zero-7B.

6. Results & Analysis

Core Results on Base Models (Section 2)

  • Templates and Pretraining Bias:

    • Figure 7 shows that for Llama and DeepSeek models, using a template is crucial to get them to answer questions. However, Qwen2.5 models perform best with no template.
    • Table 1 quantifies this: for Qwen2.5-Math-7B, using no template yields 38.2% average accuracy, while standard 4-shot prompting gets only 23.8%. This strongly suggests that Qwen2.5 models were pretrained on concatenated question-answer text, making them SFT-like from the start.
  • "Aha Moment" Pre-exists in Base Models:

    • The right plot in Figure 7 shows that nearly all base models, including DeepSeek-V3-Base, already exhibit self-reflection behaviors. This finding challenges the claim that this ability emerges purely from RL training.
    • Figure 13 (in Appendix) provides concrete examples of DeepSeek-V3-Base generating phrases like "Wait, I'm overthinking" and "Aha! ... This gives me an idea." before any RL tuning.

Core Results on RL Algorithm (Section 3)

  • Dr. GRPO Fixes the Length Bias:

    • Figure 8 provides a compelling comparison.
      • Plots 1 & 5 show that Dr. GRPO achieves comparable reward and final benchmark scores to GRPO. Performance is not sacrificed.
      • Plot 2 shows that while GRPO's average output length increases continuously throughout training, Dr. GRPO's length stabilizes.
      • Plot 4 is the most damning evidence: GRPO causes the length of incorrect responses to skyrocket, confirming the length bias. Dr. GRPO keeps incorrect responses short, leading to much better token efficiency.
  • Templates and Question Sets Interact:

    • Figure 9 shows that when there is a mismatch between the base model and the template (e.g., Qwen2.5-Math with the R1 template), performance is initially destroyed and RL must reconstruct the reasoning ability. In this case, a large, diverse training set (ORZ-57K) is needed.
    • However, with a better-suited template (Qwen-Math template), even a simpler, out-of-distribution dataset (GSM-8K) is sufficient to achieve high performance. This suggests RL is reinforcing existing behaviors rather than teaching new knowledge.
  • Domain Pretraining is Key for Weaker Models:

    • Figure 10 shows experiments with Llama-3.2-3B, a model not specialized for math.
      • The left plot shows that RL on the vanilla Llama base model yields minimal gains. However, after continual pretraining on math data (FineMath and NuminaQA), the model's potential for improvement with RL (RL ceiling) is significantly higher.
      • The right plot re-confirms that GRPO creates the misleading "double-increase" phenomenon on this model, while Dr. GRPO does not.

Final Model Performance

  • Figure 6 showcases the final results. The authors' Oat-Zero-7B model, trained with their minimalist recipe, achieves the highest average accuracy (51.4%) among all compared 7B models. It particularly excels on the difficult AIME 2024 benchmark with 43.3% accuracy, establishing a new state-of-the-art.

7. Conclusion & Reflections

  • Conclusion Summary:

    • The paper successfully demystifies key aspects of R1-Zero-like training. It demonstrates that the choice of base model and its pretraining history are critical, and that some "emergent" abilities like self-reflection may already be present pre-RL.
    • It identifies and corrects a significant optimization bias in the GRPO algorithm that leads to inefficient, long responses. The proposed Dr. GRPO maintains performance while improving token efficiency.
    • By leveraging these insights, the authors provide a simple, efficient, and reproducible recipe that achieves SOTA results, proving that effective RL for reasoning does not require uncontrolled growth in response length.
  • Limitations & Future Work:

    • The authors do not explicitly state limitations. However, one could consider that the analysis is focused on mathematical reasoning; the findings might not generalize perfectly to other domains like creative writing or general conversation.
    • The analysis of the "Aha moment" is correlational. While they show it exists in base models, they don't fully explain its causal role in the RL process (though Appendix F suggests it doesn't correlate with higher accuracy at inference time).
    • Future work could involve applying Dr. GRPO to larger models and different reasoning domains to verify its benefits at scale.
  • Personal Insights & Critique:

    • This is an excellent example of critical and rigorous scientific work in AI. Instead of just chasing higher benchmark scores, the authors delved into the "why" and uncovered a fundamental flaw in a popular method.
    • The discovery of the length bias in GRPO is a significant contribution. It serves as a cautionary tale: what looks like an emergent capability (long, complex reasoning) can sometimes be an algorithmic artifact. This insight promotes more efficient and principled approaches to RL for LLMs.
    • Dr. GRPO is a model of a good scientific fix: it is simple, theoretically grounded, and empirically effective.
    • The paper's overall message is powerful: understanding the components of a complex system is more valuable than treating it as a black box. By doing so, we can achieve better results with greater efficiency. The work is a major step forward for the open-source community trying to build powerful reasoning models.
