- Title: Understanding R1-Zero-Like Training: A Critical Perspective
- Authors: Zichen Liu, Changyu Chen, Wenjun Li, Penghui Qi, Tianyu Pang, Chao Du, Wee Sun Lee, Min Lin.
- Affiliations: The authors are from Sea AI Lab, National University of Singapore, and Singapore Management University. This indicates a collaboration between a corporate research lab and academic institutions.
- Journal/Conference: This paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a conference or journal but is shared to disseminate findings quickly. The submission date suggests it is very recent work.
- Publication Year: 2025 (as listed on arXiv).
- Abstract: The paper critically investigates the "R1-Zero-like" training paradigm, popularized by DeepSeek-R1-Zero, where Reinforcement Learning (RL) is applied directly to base Large Language Models (LLMs) to improve reasoning without prior Supervised Fine-Tuning (SFT). The authors deconstruct this paradigm by analyzing its two main components: the base models and the RL algorithm. They find that some base models (like Qwen2.5) already have strong reasoning skills, and the "Aha moment" of self-reflection is present even in the original DeepSeek-V3-Base model before RL. They identify a significant flaw in the Group Relative Policy Optimization (GRPO) algorithm, an optimization bias that causes response lengths to increase, especially for incorrect answers. To fix this, they introduce Dr. GRPO, an unbiased version that improves token efficiency. Using these insights, they create a minimalist training recipe that achieves a new state-of-the-art result (43.3% on AIME 2024) with a 7B parameter model.
- Original Source Link:
2. Executive Summary
4. Methodology (Core Technology & Implementation)
The paper's methodology is twofold: a critical analysis of existing components and the proposal of an improved algorithm.
Part 1: Analysis of Base Models and RL
The authors systematically investigate the two core components of R1-Zero-like training.
- Base Model Investigation:
  - They analyze a wide range of models (the Qwen2.5, Llama-3.1, and DeepSeek series) on 500 questions from the MATH dataset.
  - Templates: They test three prompt templates to see how they affect model behavior:
    - Template 1 (R1 template): A verbose template instructing the model to think within designated tags before answering.
    - Template 2 (Qwen-Math template): A more standard chat-based template using special tokens.
    - Template 3 (No template): Just the raw question.
  - Evaluated Abilities:
    - Question-Answering Ability: Does the model answer the question, or does it merely continue the text?
    - Exploration Ability: Can the model generate correct answers at all when sampling multiple times? Measured by pass@8 accuracy.
    - Self-Reflection ("Aha Moment"): Do base models already generate self-reflection keywords (e.g., "recheck", "Aha!")? This was checked using both keyword matching (see the sketch below) and a stronger LLM (GPT-4o-mini) for verification.
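The keyword screening can be approximated as follows. This is a minimal sketch: the keyword list and function names are illustrative assumptions rather than the authors' exact implementation, and the GPT-4o-mini verification step is not shown.

```python
# Minimal sketch of keyword-based self-reflection detection.
# The keyword list is an illustrative assumption, not the authors' exact list.
SELF_REFLECTION_KEYWORDS = [
    "recheck", "re-check", "reevaluate", "re-evaluate",
    "double-check", "verify again", "wait", "aha",
]

def contains_self_reflection(response: str) -> bool:
    """Return True if the response contains any self-reflection keyword."""
    text = response.lower()
    return any(kw in text for kw in SELF_REFLECTION_KEYWORDS)

def reflection_rate(responses: list[str]) -> float:
    """Fraction of sampled responses showing at least one reflection keyword."""
    if not responses:
        return 0.0
    return sum(contains_self_reflection(r) for r in responses) / len(responses)
```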
Part 2: Analysis of GRPO and the Proposal of Dr. GRPO
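Although the detailed analysis is not reproduced here, the core finding (per the abstract and the results discussed below) is that GRPO's objective divides each response's token losses by that response's own length and scales group-relative advantages by the per-question reward standard deviation; the length term in particular gives long incorrect responses a smaller per-token penalty, inflating average length. Dr. GRPO ("GRPO Done Right") removes these normalizations. The sketch below illustrates the difference; the variable names, tensor shapes, and the omission of PPO-style clipping are simplifications, not the authors' implementation.

```python
import torch


def group_advantages(rewards: torch.Tensor, unbiased: bool = True) -> torch.Tensor:
    """Group-relative advantages for one question with G sampled responses.

    rewards: shape (G,), scalar reward per sampled response.
    GRPO additionally divides by the group's reward std (a question-difficulty
    dependent scale); Dr. GRPO (unbiased=True) drops that term.
    """
    adv = rewards - rewards.mean()
    if not unbiased:
        adv = adv / (rewards.std() + 1e-6)
    return adv


def surrogate_loss(logps: torch.Tensor, mask: torch.Tensor,
                   advantages: torch.Tensor, unbiased: bool = True) -> torch.Tensor:
    """REINFORCE-style surrogate for one group of responses (PPO clipping omitted).

    logps: (G, T) per-token log-probs of the sampled tokens under the current policy.
    mask:  (G, T) 1.0 for real response tokens, 0.0 for padding up to max length T.
    """
    per_token = -advantages[:, None] * logps * mask  # (G, T)
    if unbiased:
        # Dr. GRPO: sum token losses and divide by a constant normalizer
        # (here G * T), so every token carries the same weight regardless of
        # how long its response is.
        return per_token.sum() / (mask.shape[0] * mask.shape[1])
    # GRPO: each response's token losses are averaged over that response's own
    # length, so long (often incorrect) responses receive a smaller per-token
    # penalty, which is the source of the length bias.
    lengths = mask.sum(dim=1).clamp(min=1.0)
    return (per_token.sum(dim=1) / lengths).mean()
```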
5. Experimental Setup
- Datasets:
  - Training: Various math question sets were used to test different conditions, including MATH (high-school competition math), GSM-8K (grade-school math), ASDiv (basic algebra), and ORZ (a large, diverse combination).
  - Evaluation: A standard suite of challenging math reasoning benchmarks: AIME 2024, AMC, MATH500, Minerva Math, and OlympiadBench.
- Evaluation Metrics:
  - Accuracy: The primary metric for reasoning performance.
  - Response Length: Used to track the effect of the length bias.
  - pass@8: Measures the model's ability to find a correct solution within 8 attempts, indicating its exploration capability (see the sketch at the end of this section).
- Baselines:
  - The main comparison is between GRPO and the proposed Dr. GRPO.
  - The final model, Oat-Zero-7B, is compared against other SOTA open-source models of similar size that also follow the R1-Zero paradigm, such as SimpleRL-Zero-7B, PRIME-Zero-7B, and OpenReasoner-Zero-7B.
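A pass@8 computation can be sketched as follows, assuming a hypothetical `is_correct` answer checker (e.g., one that compares the extracted final answer against the reference):

```python
def pass_at_k(samples_per_question: list[list[str]], is_correct, k: int = 8) -> float:
    """Fraction of questions solved by at least one of the first k sampled responses.

    samples_per_question: for each question, a list of sampled model responses.
    is_correct: callable (question_index, response) -> bool; a placeholder for the
    benchmark's answer checker (hypothetical here).
    """
    if not samples_per_question:
        return 0.0
    solved = sum(
        any(is_correct(qi, r) for r in samples[:k])
        for qi, samples in enumerate(samples_per_question)
    )
    return solved / len(samples_per_question)
```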
6. Results & Analysis
Core Results on Base Models (Section 2)
Core Results on RL Algorithm (Section 3)
- Dr. GRPO Fixes the Length Bias:
  - Figure 8 (images/5.jpg) provides a compelling comparison.
  - Plots 1 & 5 show that Dr. GRPO achieves comparable reward and final benchmark scores to GRPO. Performance is not sacrificed.
  - Plot 2 shows that while GRPO's average output length increases continuously throughout training, Dr. GRPO's length stabilizes.
  - Plot 4 is the most damning evidence: GRPO causes the length of incorrect responses to skyrocket, confirming the length bias, while Dr. GRPO keeps incorrect responses short, leading to much better token efficiency.
- Templates and Question Sets Interact:
  - Figure 9 (images/6.jpg) shows that when there is a mismatch between the base model and the template (e.g., Qwen2.5-Math with the R1 template), performance is initially destroyed and RL must reconstruct the reasoning ability. In this case, a large, diverse training set (ORZ-57K) is needed.
  - However, with a better-suited template (the Qwen-Math template), even a simpler, out-of-distribution dataset (GSM-8K) is sufficient to achieve high performance. This suggests RL is reinforcing existing behaviors rather than teaching new knowledge.
- Domain Pretraining is Key for Weaker Models:
  - Figure 10 (images/7.jpg) shows experiments with Llama-3.2-3B, a model not specialized for math.
  - The left plot shows that RL on the vanilla Llama base model yields minimal gains. However, after continual pretraining on math data (FineMath and NuminaQA), the model's potential for improvement with RL (the RL ceiling) is significantly higher.
  - The right plot re-confirms that GRPO creates the misleading "double-increase" phenomenon (reward and response length rising together) on this model, while Dr. GRPO does not.
- Figure 6 (images/2.jpg) showcases the final results. The authors' Oat-Zero-7B model, trained with their minimalist recipe, achieves the highest average accuracy (51.4%) among all compared 7B models. It particularly excels on the difficult AIME 2024 benchmark with 43.3% accuracy, establishing a new state-of-the-art.
7. Conclusion & Reflections