DAPO: An Open-Source LLM Reinforcement Learning System at Scale


TL;DR Summary

To boost reproducibility in LLM reinforcement learning, DAPO introduces an open-source system featuring a novel Decoupled Clip and Dynamic Sampling Policy Optimization algorithm. Using a Qwen2.5-32B model, the system achieves a state-of-the-art 50 points on AIME 2024 and openly releases its training code and curated dataset.

Abstract

Inference scaling empowers LLMs with unprecedented reasoning ability, with reinforcement learning as the core technique to elicit complex reasoning. However, key technical details of state-of-the-art reasoning LLMs are concealed (such as in OpenAI o1 blog and DeepSeek R1 technical report), thus the community still struggles to reproduce their RL training results. We propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, and fully open-source a state-of-the-art large-scale RL system that achieves 50 points on AIME 2024 using Qwen2.5-32B base model. Unlike previous works that withhold training details, we introduce four key techniques of our algorithm that make large-scale LLM RL a success. In addition, we open-source our training code, which is built on the verl framework, along with a carefully curated and processed dataset. These components of our open-source system enhance reproducibility and support future research in large-scale LLM RL.

English Analysis

1. Bibliographic Information

  • Title: DAPO: An Open-Source LLM Reinforcement Learning System at Scale
  • Authors: The paper lists a large team from ByteDance Seed, the Institute for AI Industry Research (AIR) at Tsinghua University, The University of Hong Kong, and the SIA-Lab of Tsinghua AIR and ByteDance Seed. Key project leads and supervisors include Qiying Yu, Hao Zhou, Jingjing Liu, and Mingxuan Wang.
  • Journal/Conference: This paper is a preprint available on arXiv. It has not yet been published in a peer-reviewed journal or conference. arXiv is a common platform for researchers to share cutting-edge work quickly.
  • Publication Year: 2025. The preprint is dated March 17, 2025, which corresponds to its submission to arXiv.
  • Abstract: The authors address the challenge of reproducing state-of-the-art reasoning in Large Language Models (LLMs), which is primarily achieved through Reinforcement Learning (RL) but often involves concealed technical details. They introduce DAPO (Decoupled Clip and Dynamic sAmpling Policy Optimization), an algorithm and a fully open-source RL system. Using a Qwen2.5-32B base model, their system achieves a score of 50 on the AIME 2024 benchmark, surpassing previous results. The paper highlights four key techniques that are critical for this success. By open-sourcing their code, curated dataset, and methodology, they aim to enhance reproducibility and foster future research in large-scale LLM RL.
  • Original Source Link:

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Recent breakthroughs in LLM reasoning, exemplified by systems like OpenAI's o1 and DeepSeek's R1, rely heavily on large-scale Reinforcement Learning to elicit complex, multi-step thought processes (Chain-of-Thought). However, the specific algorithms and "secret sauce" behind these systems are not publicly disclosed, making it extremely difficult for the research community to reproduce their results or build upon them.
    • Importance: This lack of transparency hinders academic progress and democratized access to powerful AI reasoning capabilities. The community has struggled to replicate the performance of these models, suggesting that standard RL algorithms like PPO or GRPO are insufficient without crucial, unpublished modifications.
    • Innovation: This paper's primary innovation is not just a new algorithm but a complete, open-source system that successfully trains a powerful reasoning LLM. It demystifies the process by identifying and explaining four specific technical solutions to common pitfalls in large-scale LLM RL, such as training instability, inefficient learning, and loss of model diversity.
  • Main Contributions / Findings (What):

    1. DAPO Algorithm: They propose the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm, a variant of policy optimization tailored for long-form reasoning tasks.
    2. Four Key Techniques: The paper introduces and validates four critical techniques that collectively solve major challenges in LLM RL:
      • Clip-Higher: Prevents the model from becoming too deterministic (entropy collapse) by encouraging exploration.
      • Dynamic Sampling: Improves training efficiency by filtering out uninformative training data that provides zero gradient signals.
      • Token-Level Policy Gradient Loss: Stabilizes training for long-form text generation by ensuring every token contributes fairly to the learning process.
      • Overlong Reward Shaping: Reduces noise and confusion in the reward signal by handling overly long generated responses more intelligently.
    3. State-of-the-Art Open-Source Result: They achieve a score of 50 on AIME 2024 using the Qwen2.5-32B model, outperforming the previous state-of-the-art (DeepSeek-R1-Zero-Qwen-32B, which scored 47) while using only 50% of the training steps.
    4. Fully Open-Sourced System: They release their training code (built on the verl framework), a carefully curated and processed dataset (DAPO-Math-17K), and all algorithmic details, promoting transparency and reproducibility in the field.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In the context of LLMs, the "agent" is the language model, the "action" is generating the next token, and the "reward" is a score indicating the quality or correctness of the final generated text.
    • Policy Optimization: A family of RL algorithms that directly learn a policy (in this case, the LLM's probability distribution over the next token) to maximize expected rewards.
    • Proximal Policy Optimization (PPO): A popular and stable policy optimization algorithm. It prevents drastic changes to the policy during training by "clipping" the objective function, ensuring updates stay within a trusted region of the previous policy.
    • Group Relative Policy Optimization (GRPO): An adaptation of PPO for LLMs that simplifies the reward calculation. Instead of learning a separate value function to estimate future rewards, it samples a group of responses for a given prompt, and the "advantage" (how much better a specific response is) is calculated relative to the average reward of the group (a minimal code sketch of this idea appears after this list).
    • Chain-of-Thought (CoT): A technique where LLMs are prompted to generate intermediate reasoning steps before giving a final answer. RL is used to train models to generate longer, more accurate CoTs.
  • Previous Works:

    • OpenAI o1 & DeepSeek R1: These are state-of-the-art reasoning models that demonstrated remarkable performance on complex tasks like competitive math (AIME). Their technical reports attribute this success to large-scale RL but omit crucial implementation details, motivating this work.
    • GRPO: The paper builds directly on GRPO, using it as a baseline. However, they find that a "naive" implementation of GRPO only achieves 30 points on AIME, far below the state-of-the-art, indicating that significant modifications are needed.
    • KL Divergence Penalty: Standard RL for LLMs often includes a KL divergence penalty to prevent the trained policy from straying too far from the original pre-trained model. The authors argue this is unnecessary for long-CoT reasoning, where the goal is to significantly transform the model's behavior, and thus they remove it from their algorithm.
  • Differentiation: The paper differentiates itself by being explicit and open. While prior works announced high-performance reasoning models, DAPO provides a complete blueprint for how to achieve those results. Its novelty lies in identifying and solving specific, practical engineering challenges that arise during large-scale RL, which were previously unaddressed in the literature.
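To ground the GRPO advantage referenced above, here is a minimal, dependency-free sketch of the group-relative computation. The standard-deviation normalization and the function name are illustrative assumptions based on the common GRPO formulation, not the paper's released code.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Advantages for one prompt's group of sampled responses: each reward is
    normalized against the group's mean (and spread), so no value model is needed."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one math prompt, scored +1 (correct) / -1 (wrong).
print(group_relative_advantages([1, -1, -1, 1]))  # mixed group -> informative advantages
print(group_relative_advantages([1, 1, 1, 1]))    # uniform group -> all-zero advantages
```

Note how a group whose answers are all correct (or all incorrect) yields zero advantages and hence no gradient signal, which is exactly the inefficiency that DAPO's Dynamic Sampling later addresses.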

4. Methodology (Core Technology & Implementation)

The core of the paper is the DAPO algorithm, which is essentially a refined version of GRPO incorporating four key techniques.

The overall objective function for DAPO is:

$$
\mathcal{J}_{\mathrm{DAPO}}(\theta) = \mathbb{E}_{(q,a)\sim\mathcal{D},\,\{o_i\}_{i=1}^{G}\sim\pi_{\theta_{\mathrm{old}}}(\cdot\mid q)}\left[\frac{1}{\sum_{i=1}^{G}|o_i|}\sum_{i=1}^{G}\sum_{t=1}^{|o_i|}\min\Big(r_{i,t}(\theta)\,\hat{A}_{i,t},\ \mathrm{clip}\big(r_{i,t}(\theta),\,1-\varepsilon_{\mathrm{low}},\,1+\varepsilon_{\mathrm{high}}\big)\,\hat{A}_{i,t}\Big)\right]
$$

$$
\text{s.t.}\quad 0 < \big|\{o_i \mid \texttt{is\_equivalent}(a, o_i)\}\big| < G,
$$

Where:

  • $\theta$: The parameters of the policy LLM.

  • $q, a$: A question-answer pair from the dataset $\mathcal{D}$.

  • $\{o_i\}_{i=1}^G$: A group of $G$ output sequences sampled from the old policy $\pi_{\theta_{\mathrm{old}}}$.

  • $|o_i|$: The length (number of tokens) of the $i$-th output.

  • $r_{i,t}(\theta) = \frac{\pi_{\theta}(o_{i,t}\mid q,\, o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\, o_{i,<t})}$: The importance sampling ratio for the $t$-th token of the $i$-th output. It measures how much more likely the current policy is to generate this token compared to the old policy.

  • $\hat{A}_{i,t}$: The estimated advantage for the $i$-th output at token $t$. In GRPO/DAPO, this is constant for all tokens in a sequence and is calculated by normalizing the sequence's final reward $R_i$ against the group's average reward.

  • The key components of DAPO are reflected in this formula: the token-level normalization ($\frac{1}{\sum_i |o_i|}$), the decoupled clipping ($\varepsilon_{\mathrm{low}}, \varepsilon_{\mathrm{high}}$), and the dynamic sampling constraint ($0 < |\{\dots\}| < G$); a code sketch of this objective follows the list.
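To make the objective concrete, here is a minimal plain-Python sketch of the token-level clipped surrogate. The data layout (per-token log-probabilities plus one shared sequence advantage) and all names are illustrative assumptions, not the authors' verl implementation.

```python
import math

def dapo_surrogate_loss(group, eps_low=0.2, eps_high=0.28):
    """Negative DAPO objective for one group of sampled outputs.

    Each element of `group` is a dict with:
      - 'logp_new':  per-token log-probs under the current policy pi_theta
      - 'logp_old':  per-token log-probs under the old policy pi_theta_old
      - 'advantage': the sequence-level advantage A_i, shared by all its tokens
    """
    total_tokens = sum(len(o["logp_new"]) for o in group)        # 1 / sum_i |o_i|
    objective = 0.0
    for o in group:
        A = o["advantage"]
        for lp_new, lp_old in zip(o["logp_new"], o["logp_old"]):
            ratio = math.exp(lp_new - lp_old)                     # r_{i,t}(theta)
            clipped = min(max(ratio, 1 - eps_low), 1 + eps_high)  # decoupled clip
            objective += min(ratio * A, clipped * A)              # pessimistic min
    return -objective / total_tokens                              # minimize the negative

# Toy usage: one group with a positively- and a negatively-advantaged output.
group = [
    {"logp_new": [-0.9, -1.1], "logp_old": [-1.0, -1.0], "advantage": +0.8},
    {"logp_new": [-2.0, -1.5], "logp_old": [-1.8, -1.6], "advantage": -0.8},
]
print(dapo_surrogate_loss(group))
```

The dynamic sampling constraint in the formula is not enforced here; it is handled at the data-selection stage, as described under "Dynamic Sampling" below.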

    Let's break down each of the four core techniques.

1. Raise the Ceiling: Clip-Higher

  • Problem: The authors observed entropy collapse during training. Entropy is a measure of randomness or diversity in the model's output probabilities. When entropy collapses, the model becomes overly confident and generates very similar, deterministic responses, which stops it from exploring new and potentially better reasoning paths.

  • Cause: Standard PPO/GRPO uses a symmetric clipping range (e.g., $[1-\varepsilon, 1+\varepsilon]$ with $\varepsilon=0.2$). This heavily restricts the probability increase of low-probability "exploration" tokens. For a token with a probability of 0.01, its probability can only increase to $0.01 \times 1.2 = 0.012$. In contrast, a high-probability "exploitation" token (e.g., probability 0.9) can easily be pushed higher. This imbalance stifles exploration.

  • Solution: They propose decoupling the clipping range into a lower bound $1 - \varepsilon_{\mathrm{low}}$ and a higher bound $1 + \varepsilon_{\mathrm{high}}$. By setting $\varepsilon_{\mathrm{high}}$ to a larger value than $\varepsilon_{\mathrm{low}}$ (e.g., $\varepsilon_{\mathrm{low}}=0.2$, $\varepsilon_{\mathrm{high}}=0.28$), they "raise the ceiling" for probability increases. This allows the model to more aggressively reward and explore unlikely but promising tokens, thus maintaining higher entropy and diversity.

    [Image 2] Left: AIME avg@32 accuracy over training steps, with Clip-Higher (purple) versus without it (light blue). Right: generation entropy, which stays at a healthy, fluctuating level with Clip-Higher but quickly collapses toward zero without it.

As shown in Image 2, the Clip-Higher strategy (purple line) maintains a much higher generation entropy throughout training (right plot) compared to the standard approach where entropy collapses to near zero. This leads to better and more stable performance on the AIME benchmark (left plot).
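As a quick worked example of this "raised ceiling", the snippet below compares the largest post-update probability a symmetric clip permits with what the decoupled upper bound permits; the numbers mirror the example above and are purely illustrative.

```python
# Illustrative arithmetic only: the highest probability a token can reach before
# the importance ratio exceeds the clip range and the gradient is cut off.
def ceiling(p_old, eps_upper):
    return min(1.0, p_old * (1 + eps_upper))

for p_old in (0.01, 0.90):                 # exploration vs. exploitation token
    sym = ceiling(p_old, 0.20)             # symmetric PPO/GRPO clip
    dec = ceiling(p_old, 0.28)             # DAPO's raised upper bound
    print(f"p_old={p_old:.2f}  symmetric -> {sym:.4f}  clip-higher -> {dec:.4f}")
```

The absolute headroom for the rare token is still small (0.0128 instead of 0.012), but that extra room is what Clip-Higher relies on to keep exploration alive.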

2. The More the Merrier: Dynamic Sampling

  • Problem: As the model improves, for many prompts, it will generate a group of responses that are all correct (or all incorrect). In GRPO, the advantage is calculated by normalizing the reward within the group. If all rewards are the same (e.g., all +1), the advantage for every response becomes zero. A zero advantage leads to a zero gradient, meaning the model learns nothing from that prompt. This is called the "gradient-decreasing problem" and becomes more severe as training progresses, making learning inefficient and noisy.

  • Solution: They implement Dynamic Sampling. After generating responses for a prompt, they check if there is a mix of correct and incorrect answers. If not (i.e., accuracy is 0% or 100%), they discard that prompt and its generations. They continue sampling new prompts until the training batch is filled only with prompts that have "effective gradients" (i.e., a mix of outcomes).

  • Impact: Although this seems to increase the cost of data generation, the authors find it actually speeds up convergence. As shown in Image 6, the model trained with Dynamic Sampling reaches its peak performance much faster (around 2,000 steps) than the one without it (around 6,000 steps).

    [Image 6] AIME avg@32 over training steps with versus without Dynamic Sampling: the dynamically sampled run improves faster and peaks around step 2,000, while the run without it rises more slowly and peaks after roughly 6,000 steps at a lower level (peaks marked with dashed lines).
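A minimal sketch of the over-sample-and-filter loop described above, assuming hypothetical `prompt_stream`, `sample_group`, and `is_correct` helpers that supply prompts, model rollouts, and answer checking (none of these is part of the paper's released code):

```python
def fill_effective_batch(prompt_stream, sample_group, is_correct,
                         batch_size, group_size=16):
    """Keep only prompts whose sampled group mixes correct and incorrect answers,
    i.e. the constraint 0 < |{correct o_i}| < G, and keep sampling until the
    batch is full of prompts with non-zero gradient signal."""
    batch = []
    for question, ground_truth in prompt_stream:
        answers = sample_group(question, group_size)
        n_correct = sum(is_correct(a, ground_truth) for a in answers)
        if 0 < n_correct < group_size:      # discard all-right / all-wrong groups
            batch.append((question, answers))
        if len(batch) == batch_size:
            break
    return batch
```

In practice this trades extra rollout generation for a much denser learning signal, which is why convergence ends up faster despite the higher sampling cost.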

3. Rebalancing Act: Token-Level Policy Gradient Loss

  • Problem: The original GRPO algorithm uses a sample-level loss, where the loss is first averaged over tokens within each sequence, and then averaged over all sequences in the batch. This gives every sequence equal weight, regardless of its length. In long-CoT reasoning, this is problematic:

    1. Tokens in a very long, high-quality reasoning path are undervalued, as their individual contribution to the loss is diluted.
    2. Long, low-quality responses (e.g., containing gibberish or repetition) are not penalized effectively, leading to unhealthy increases in response length and entropy.
  • Solution: They switch to a Token-level Policy Gradient Loss. The loss is calculated over all tokens in the batch and then normalized by the total number of tokens. This ensures that every token has an equal influence on the gradient update, regardless of the length of the sequence it belongs to. This better rewards useful patterns in long reasoning chains and penalizes undesirable ones.

    [Image 4] Left: generation entropy versus training steps with and without the token-level loss; without it, entropy climbs sharply, while with it, entropy stays lower and more stable. Right: mean response length; without the token-level loss it rises quickly and then decays, whereas with it, length grows steadily.

    Image 4 illustrates this effect. Without token-level loss (light blue line), both generation entropy and mean response length become unstable and explode. With token-level loss (purple line), both metrics exhibit much healthier, more stable growth.
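The difference between the two aggregation schemes fits in a few lines of plain Python; the numbers below are toy values chosen only to show how a long sequence's tokens get diluted under sample-level averaging.

```python
def sample_level_mean(token_losses_per_seq):
    """GRPO default: average within each sequence, then across sequences."""
    per_seq = [sum(toks) / len(toks) for toks in token_losses_per_seq]
    return sum(per_seq) / len(per_seq)

def token_level_mean(token_losses_per_seq):
    """DAPO: average over every token in the batch, so a sequence's weight is
    proportional to its length."""
    all_tokens = [t for toks in token_losses_per_seq for t in toks]
    return sum(all_tokens) / len(all_tokens)

# A short response (2 tokens) and a long reasoning chain (8 tokens):
batch = [[1.0, 1.0], [0.1] * 8]
print(sample_level_mean(batch))   # 0.55 -- the long chain's tokens are diluted
print(token_level_mean(batch))    # 0.28 -- every token counts equally
```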

4. Hide and Seek: Overlong Reward Shaping

  • Problem: When generated responses exceed a maximum length, they are truncated. Naively assigning a large negative reward to these truncated samples introduces significant noise. A correct reasoning process might be penalized simply for being too long, confusing the model about what constitutes good reasoning.
  • Solution: They propose a two-part strategy:
    1. Overlong Filtering: As a simple but effective baseline, they simply mask the loss for truncated samples, ignoring them during the gradient update. This alone stabilizes training and improves performance.

    2. Soft Overlong Punishment: A more refined approach. They define a "soft punishment" interval (e.g., between 16,384 and 20,480 tokens). Within this interval, the reward is gradually decreased as the length increases. Responses longer than the absolute maximum receive a full penalty. This provides a smoother signal to the model to avoid generating excessively long text.

      [Image 5] Left: AIME avg@32 over training steps with versus without overlong filtering. Right: generation entropy of the actor model; without filtering, entropy rises sharply after roughly 3,500 steps, while with filtering it stays stable.

      Image 5 shows that overlong filtering (light blue line) leads to more stable performance on AIME (left) and prevents the entropy from exploding (right), demonstrating its effectiveness in reducing reward noise.
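Below is a sketch of a soft length penalty matching the interval quoted above (no penalty up to 16,384 tokens, a linearly growing penalty up to the 20,480-token cap, and a full penalty beyond it). The piecewise-linear form and the idea that it is added on top of the correctness reward are my reading of the description, so treat the exact shape as an assumption.

```python
def soft_overlong_penalty(length, max_len=20480, soft_window=4096):
    """Length-based penalty term: 0 in the normal range, decreasing linearly to -1
    across the soft window, and -1 once the hard cap is exceeded."""
    soft_start = max_len - soft_window               # 16,384 tokens here
    if length <= soft_start:
        return 0.0
    if length <= max_len:
        return (soft_start - length) / soft_window   # in (-1, 0]
    return -1.0

for n_tokens in (12000, 16384, 18432, 20480, 21000):
    print(n_tokens, soft_overlong_penalty(n_tokens))
```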

5. Experimental Setup

  • Datasets:
    • Training: DAPO-Math-17K, a custom dataset of 17,000 math problems. The problems were curated and transformed such that all answers are integers. This simplifies the reward function, as it only needs to check for integer equality, avoiding the complexity and errors of parsing symbolic math expressions (a sketch of such a rule-based reward check follows this setup list).
    • Evaluation: AIME 2024, a standard and challenging benchmark for mathematical reasoning.
  • Evaluation Metrics:
    • avg@32: To ensure stable and reliable results, they evaluate each problem 32 times with different random seeds and report the average accuracy.
  • Baselines:
    • Naive GRPO: A direct implementation of the Group Relative Policy Optimization algorithm, which serves as their starting point.
    • DeepSeek-R1-Zero-Qwen-32B: The previous state-of-the-art result on the same base model, reported by DeepSeek AI.
  • Training Configuration:
    • Model: Qwen2.5-32B.
    • Optimizer: AdamW with a learning rate of $1 \times 10^{-6}$.
    • Rollout: For each prompt, they sample $G=16$ responses.
    • Hyperparameters: $\varepsilon_{\mathrm{low}} = 0.2$, $\varepsilon_{\mathrm{high}} = 0.28$.
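To connect the setup above to code, here is a minimal sketch of the integer-equality reward and the avg@32 metric; the +1/−1 reward values, the parsing logic, and the function names are illustrative assumptions rather than the paper's exact implementation.

```python
def outcome_reward(predicted_answer: str, ground_truth: str) -> float:
    """Rule-based reward: positive if the final integer answer matches, negative otherwise."""
    try:
        return 1.0 if int(predicted_answer.strip()) == int(ground_truth.strip()) else -1.0
    except ValueError:
        return -1.0   # unparsable output is treated as incorrect

def avg_at_k(per_run_correct: list) -> float:
    """avg@k: mean accuracy over k independent evaluation runs (k = 32 in the paper)."""
    return sum(per_run_correct) / len(per_run_correct)

print(outcome_reward(" 204 ", "204"))           # 1.0
print(avg_at_k([True] * 16 + [False] * 16))     # 0.5
```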

6. Results & Analysis

  • Core Results:

    [Image 1] AIME 2024 accuracy of DAPO over training steps, plotting the avg@32 accuracy (purple dots) alongside pass and consensus rates (light blue markers); a dashed line marks the 50% accuracy benchmark (with DeepSeek-R1-Zero-Qwen-32B as the reference), which DAPO reaches at roughly step 5,600.

As shown in Image 1, DAPO achieves a 50% accuracy on AIME 2024. This surpasses the 47% achieved by DeepSeek-R1-Zero-Qwen-32B. Notably, DAPO reaches this performance in approximately 5,500 training steps, which the authors claim is 50% of the steps used by DeepSeek. This demonstrates both superior final performance and improved training efficiency.

  • Ablations / Progressive Improvements: The paper presents a clear ablation study in Table 1, showing how each technique contributes to the final result.

    | Model | AIME 2024 avg@32 |
    | --- | --- |
    | DeepSeek-R1-Zero-Qwen-32B | 47 |
    | Naive GRPO | 30 |
    | + Overlong Filtering | 36 |
    | + Clip-Higher | 38 |
    | + Soft Overlong Punishment | 41 |
    | + Token-level Loss | 42 |
    | + Dynamic Sampling (DAPO) | 50 |

    This table is a powerful illustration of the paper's core message: success in large-scale RL comes from a combination of carefully designed techniques. Dynamic Sampling provides the largest single boost (+8 points), highlighting the critical importance of gradient quality. The other techniques each contribute between 1 and 6 points, demonstrating their cumulative value.

  • Training Dynamics:

    [Image 7] Four training-dynamics curves over training steps: (a) mean response length increases steadily; (b) reward score rises quickly and then stabilizes at a high level; (c) generation entropy first drops and then recovers; (d) mean probability first rises and then declines.

The authors emphasize the importance of monitoring key metrics during the complex RL training process. Image 7 shows typical training curves for DAPO:

  • (a) Mean Response Length: Steadily increases, allowing for more complex reasoning.

  • (b) Reward Score: Rises quickly and plateaus, indicating the model is learning to solve the training problems. However, this can also signal overfitting if not validated against an external set.

  • (c) Generation Entropy: After an initial drop, it is maintained at a healthy level, showing that exploration is preserved (thanks to Clip-Higher).

  • (d) Mean Probability: The average probability the policy assigns to its sampled tokens; it moves inversely to entropy and reflects the model's confidence. (A sketch of how these two monitoring signals could be computed follows the case study below.)

  • Case Study: Emergence of Reflective Behavior: The paper provides an interesting qualitative result. Early in training, the model's responses are straightforward. As training progresses, more complex reasoning patterns emerge, including self-correction and reflection. For instance, the model might generate a phrase like "However, wait a moment, let's rethink about...", indicating it is actively evaluating and refining its own thought process. This suggests that RL doesn't just reinforce existing patterns but can also discover entirely new, more sophisticated reasoning strategies.
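For readers who want to reproduce these monitoring curves, here is a minimal sketch of how the signals in (c) and (d) could be computed from data logged during sampling; the data layout is an assumption, not the paper's logging code.

```python
import math

def generation_entropy(next_token_dists):
    """Mean per-step entropy (in nats) of the next-token distributions recorded
    while sampling; `next_token_dists` is a list of probability vectors."""
    total = 0.0
    for dist in next_token_dists:
        total += -sum(p * math.log(p) for p in dist if p > 0)
    return total / len(next_token_dists)

def mean_chosen_probability(chosen_probs):
    """Average probability the policy assigned to the tokens it actually emitted."""
    return sum(chosen_probs) / len(chosen_probs)

# Toy example: two sampled steps with 3-way next-token distributions.
dists = [[0.7, 0.2, 0.1], [0.4, 0.4, 0.2]]
print(generation_entropy(dists))          # higher value = more exploration
print(mean_chosen_probability([0.7, 0.4]))
```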

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully demystifies the process of training high-performance reasoning LLMs with reinforcement learning. It introduces DAPO, an open-source system composed of a refined algorithm and four key practical techniques that address critical issues like entropy collapse, gradient inefficiency, and reward noise. By achieving state-of-the-art results on the AIME benchmark and, most importantly, open-sourcing their entire stack (code, data, methods), the authors provide an invaluable resource to the research community, enabling broader access, reproducibility, and future innovation in LLM reasoning.

  • Limitations & Future Work:

    • The authors themselves point to future work in better understanding and interpreting the emergence of complex reasoning abilities during RL.
    • The experiments are focused on the mathematical domain. While the techniques are likely generalizable to other reasoning tasks (like coding or science), this has not been demonstrated in the paper.
    • The computational cost of large-scale RL remains very high, which could still be a barrier for researchers with limited resources, even with the open-source code.
  • Personal Insights & Critique:

    • This paper is an excellent example of high-impact research that combines algorithmic novelty with engineering pragmatism. The focus on reproducibility and open-sourcing is commendable and directly addresses a major pain point in the AI community.
    • The four techniques are not revolutionary in isolation, but their combination and application to the specific problem of long-CoT RL are what make the system effective. This highlights that progress in AI is often about meticulous system-building and problem-solving, not just single breakthrough ideas.
    • The dataset transformation to integer answers is a clever and practical trick that simplifies the reward mechanism, a notoriously difficult part of applying RL. This is a valuable lesson for practitioners.
    • Overall, DAPO provides a strong, transparent, and reproducible baseline that will likely become a cornerstone for future research in LLM alignment and reasoning.
