DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization

Published:05/18/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DisCO is a new framework for Large Reasoning Models, addressing limitations of Group Relative Policy Optimization. By using a discriminative objective and non-clipping scoring functions, it eliminates difficulty bias and achieves stable long-term training, enhancing mathematical reasoning capabilities.

Abstract

The recent success and openness of DeepSeek-R1 have brought widespread attention to Group Relative Policy Optimization (GRPO) as a reinforcement learning method for large reasoning models (LRMs). In this work, we analyze the GRPO objective under a binary reward setting and reveal an inherent limitation of question-level difficulty bias. We also identify a connection between GRPO and traditional discriminative methods in supervised learning. Motivated by these insights, we introduce a new Discriminative Constrained Optimization (DisCO) framework for reinforcing LRMs, grounded in the principle of discriminative learning. The main differences between DisCO and GRPO and its recent variants are: (1) it replaces the group relative objective with a discriminative objective defined by a scoring function; (2) it abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives used as scoring functions; (3) it employs a simple yet effective constrained optimization approach to enforce the KL divergence constraint. As a result, DisCO offers notable advantages over GRPO and its variants: (i) it completely eliminates difficulty bias by adopting discriminative objectives; (ii) it addresses the entropy instability in GRPO and its variants through the use of non-clipping scoring functions and a constrained optimization approach, yielding long and stable training dynamics; (iii) it allows the incorporation of advanced discriminative learning techniques to address data imbalance, where a significant number of questions have more negative than positive generated answers during training. Our experiments on enhancing the mathematical reasoning capabilities of SFT-finetuned models show that DisCO significantly outperforms GRPO and its improved variants such as DAPO, achieving average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks for an 1.5B model.

1. Bibliographic Information

1.1. Title

The central topic of the paper is "DisCO: Reinforcing Large Reasoning Models with Discriminative Constrained Optimization."

1.2. Authors

The authors are Gang Li, Ming Lin, Tomer Galanti, Zhengzhong Tu, and Tianbao Yang. Their affiliations are:

Texas A&M University: Gang Li, Tomer Galanti, Zhengzhong Tu, Tianbao Yang.
Ming Lin's affiliation is given as linming04@gmail.com, indicating a potential independent researcher or affiliation not explicitly listed.

1.3. Journal/Conference

The paper is published as a preprint on arXiv. Given the publication date (2025-05-18T11:08:32.000Z), it is likely intended for a future conference or journal, or it is a very recent preprint. arXiv is a well-respected open-access archive for preprints of scientific papers, particularly in mathematics, physics, computer science, and related fields. It allows researchers to disseminate their work quickly and receive feedback before formal peer review and publication.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces Discriminative Constrained Optimization (DisCO), a novel reinforcement learning framework for reinforcing Large Reasoning Models (LRMs). The authors analyze the existing Group Relative Policy Optimization (GRPO) method, popular with DeepSeek-R1, identifying an inherent limitation known as "question-level difficulty bias" in its objective function. They also draw a connection between GRPO and traditional discriminative methods. Motivated by these insights, DisCO is designed to eliminate this bias by replacing GRPO's group relative objective with a discriminative objective based on a scoring function. It abandons clipping-based surrogates in favor of non-clipping RL surrogate objectives as scoring functions and employs a constrained optimization approach to enforce KL divergence constraints for stability. DisCO addresses entropy instability, ensures stable training, and can incorporate advanced discriminative learning techniques to handle data imbalance (more negative than positive generated answers). Experiments on mathematical reasoning tasks demonstrate that DisCO significantly outperforms GRPO and its variants like DAPO, achieving average gains of 7% over GRPO and 6% over DAPO for a 1.5B model across six benchmarks.

1.6. Original Source Link

Official Source: https://arxiv.org/abs/2505.12366v3
PDF Link: https://arxiv.org/pdf/2505.12366v3.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the critical need for more effective reinforcement learning (RL) methods to fine-tune Large Reasoning Models (LRMs). The recent success of models like DeepSeek-R1 has highlighted the potential of RL, specifically Group Relative Policy Optimization (GRPO), in enhancing LRM capabilities, particularly in complex domains like mathematics.

However, the authors identify several challenges and gaps in current GRPO-based approaches:

Difficulty Bias: GRPO's group relative advantage function inherently introduces a bias that disproportionately weights questions of intermediate difficulty, neglecting those that are either very easy or very hard. This can hinder learning efficiency, as the model might not adequately learn from challenging problems or continue to refine its performance on simpler ones.
Training Instability & Entropy Issues: Existing GRPO variants often suffer from entropy collapse (leading to deterministic and less exploratory policies) or excessive entropy growth (resulting in highly random and inefficient policies). Many proposed remedies are heuristic and lack a principled foundation.
Clipping-based Surrogates: GRPO's reliance on clipping-based surrogates (similar to PPO) can cause vanishing gradients and contribute to entropy instability.
Data Imbalance: In the context of RL fine-tuning for reasoning, sparse rewards often lead to imbalanced rollouts, where there are significantly more negative (incorrect) generated answers than positive (correct) ones for certain questions. Traditional objectives may not effectively handle this imbalance.

The paper's entry point is a principled redesign of the objective function for LRM reinforcement, moving away from GRPO's limitations and grounding the new framework in discriminative learning principles.

2.2. Main Contributions / Findings

The paper's primary contributions are:

Analysis of GRPO's Limitations: A rigorous analysis of the GRPO objective under a binary reward setting, explicitly revealing the question-level difficulty bias caused by its group relative advantage function and establishing a conceptual connection to AUC maximization in discriminative learning.
Introduction of DisCO Framework: Proposing Discriminative Constrained Optimization (DisCO), a novel and principled framework for reinforcing LRMs. DisCO:
- Eliminates difficulty bias by adopting a discriminative objective that aims to increase scores for positive outputs and decrease scores for negative ones, without question-level weighting.
- Addresses entropy instability by using non-clipping RL surrogate objectives as scoring functions.
- Ensures training stability through a constrained optimization approach that enforces a KL divergence constraint, dynamically regulating policy updates.
- Leverages advanced discriminative learning techniques, specifically Distributionally Robust Optimization (DRO), to effectively tackle data imbalance in generated rollouts.
Empirical Superiority: Demonstrating that DisCO significantly outperforms GRPO and its improved variants (e.g., DAPO, Dr. GRPO, TRPA) in enhancing mathematical reasoning capabilities of SFT-finetuned models. For a 1.5B model, DisCO achieved average gains of 7% over GRPO and 6% over DAPO across six benchmark tasks, even outperforming models trained with much longer response lengths. DisCO also exhibits superior training dynamics, maintaining stable generation entropy.

These findings address the identified issues by providing a theoretically grounded and empirically validated method that improves learning efficiency, stability, and robustness for reinforcing large reasoning models.

3.1. Foundational Concepts

To understand this paper, a beginner needs to grasp several core concepts from machine learning, particularly in the domain of large language models and reinforcement learning:

Large Reasoning Models (LRMs): These are large language models (LLMs) specifically fine-tuned or designed to excel at complex reasoning tasks, often involving multi-step problem-solving like mathematics, scientific inquiry, or logical deduction. They go beyond simple text generation to produce coherent, logically sound chains of thought.
Reinforcement Learning (RL): An area of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent isn't explicitly told what actions to take; instead, it discovers which actions yield the most reward through trial and error. Key components include:
- Policy ( $\pi$ ): The agent's strategy, mapping observed states to actions. In LLMs, this is the model that generates text.
- Reward ( $r$ ): A scalar feedback signal indicating the desirability of an action taken by the agent. In this paper, rule-based binary rewards (1 for correct, 0 for incorrect) are used for reasoning tasks.
- Environment: The external system with which the agent interacts. For LRMs, this includes the input question and the process of verifying the generated answer.
- State: The current situation or observation from the environment. For LRMs, this could be the input question and the generated tokens so far.
Policy Optimization: A class of RL algorithms that directly optimize the policy function.
- Policy Gradient Methods: Algorithms that update the policy parameters in the direction of the gradient of the expected reward.
- Trust Region Methods (e.g., TRPO, PPO): These methods constrain the policy update step to a "trust region" to prevent overly aggressive updates that could destabilize training. They typically use a Kullback-Leibler (KL) divergence constraint or penalty to measure the difference between the new and old policies.
Kullback-Leibler (KL) Divergence ( $\mathbb{D}_{\mathrm{KL}}$ ): A measure of how one probability distribution diverges from a second, expected probability distribution. In RL, it's often used to quantify how much a new policy deviates from an old policy or a reference policy, ensuring that updates are not too drastic and training remains stable. A small KL divergence means the new policy is very similar to the old one.
- Formula: For discrete probability distributions $P$ and $Q$ , the KL divergence is: $ \mathbb{D}_{\mathrm{KL}}(P || Q) = \sum_x P(x) \log\left(\frac{P(x)}{Q(x)}\right) $
- Symbol Explanation:
  - $P(x)$ : The probability of event $x$ according to distribution $P$ .
  - $Q(x)$ : The probability of event $x$ according to distribution $Q$ .
  - $\sum_x$ : Sum over all possible events $x$ .
  - $\log$ : Natural logarithm.
Entropy (in RL context): A measure of the randomness or predictability of a policy's actions. High entropy means the policy explores more diverse actions; low entropy means it tends to choose very specific actions. In RL, maintaining appropriate entropy is crucial: too low can lead to entropy collapse (premature convergence to suboptimal solutions), and too high can lead to excessive exploration (random, inefficient behavior).
Discriminative Learning: A paradigm in machine learning focused on learning a mapping from input data to output labels (or scores) by directly modeling the conditional probability of the output given the input, or by learning a decision boundary. Unlike generative models that try to model the full joint distribution of inputs and outputs, discriminative models focus on distinguishing between classes.
- AUC Maximization: Area Under the Receiver Operating Characteristic (ROC) Curve (AUC) is a common metric for evaluating the performance of binary classifiers. Maximizing AUC aims to improve the ranking of positive samples above negative samples. In a pairwise sense, it means increasing the score of a positive sample while decreasing the score of a negative sample.
Supervised Fine-Tuning (SFT): A common technique for adapting pre-trained large language models to specific tasks or domains. It involves training the model on a dataset of input-output pairs (e.g., question-answer pairs) using supervised learning objectives (e.g., cross-entropy loss) to make the model follow instructions or generate desired outputs. The models in this paper are initially SFT-finetuned before RL is applied.
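
The KL divergence formula and the notion of policy entropy from the list above can be made concrete with a small numerical sketch. The two distributions below are hypothetical next-token distributions chosen for illustration; they do not come from the paper.

```python
import math

def kl_divergence(p, q):
    """D_KL(P || Q) for two discrete distributions given as lists of probabilities."""
    # Terms with P(x) = 0 contribute nothing (0 * log 0 is taken as 0).
    return sum(px * math.log(px / qx) for px, qx in zip(p, q) if px > 0)

def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(px * math.log(px) for px in p if px > 0)

# Hypothetical next-token distributions over a 3-token vocabulary.
old_policy = [0.5, 0.3, 0.2]
new_policy = [0.6, 0.25, 0.15]  # slightly sharper than old_policy

print("KL(old || new):", kl_divergence(old_policy, new_policy))
print("entropy old:", entropy(old_policy))
print("entropy new:", entropy(new_policy))
```

A small KL value together with a modest drop in entropy is the kind of controlled update that trust-region methods aim for; entropy collapse corresponds to the entropy shrinking toward zero over training.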

3.2. Previous Works

The paper discusses several prior studies, primarily focusing on RL methods for fine-tuning LLMs, especially in reasoning tasks:

Group Relative Policy Optimization (GRPO) [23, 59]: The core algorithm of DeepSeek-R1, which is the main baseline and the subject of analysis in this paper. GRPO generates multiple outputs for an input question and uses a group relative advantage function to update the policy. It calculates an advantage for each output based on how its reward compares to the average reward for that question within the group of generated outputs.
- Objective (simplified expectation formulation from the paper, Eq. 2): $ \mathcal { I } _ { \mathrm { G R P O } } ( \theta ) = \mathbb { E } _ { q } \mathbb { E } _ { o \sim \pi _ { \mathrm { o l d } } ( \cdot | q ) } \left[ \frac { 1 } { | o | } \sum _ { t = 1 } ^ { | o | } f \left( \frac { \pi _ { \theta } ( o _ { t } | q , o _ { < t } ) } { \pi _ { \mathrm { o l d } } ( o _ { t } | q , o _ { < t } ) } , A ( o | q ) \right) \right] - \beta \mathbb { D } _ { \mathrm { K L } } ( \pi _ { \theta } | | \pi _ { \mathrm { r e f } } ) $
- Symbol Explanation:
  - $\theta$ : Parameters of the current policy model $\pi_\theta$ .
  - $\pi_{\mathrm{old}}$ : The policy model from the previous iteration, used for generating outputs.
  - $\pi_{\mathrm{ref}}$ : A frozen reference model, usually the SFT model, used for KL regularization.
  - $q$ : An input question.
  - $o$ : A generated output (answer) for question $q$ .
  - $o_t$ : The $t$ -th token in the output $o$ .
  - $o_{<t}$ : The sequence of tokens before $o_t$ .
  - $\pi_\theta(o_t | q, o_{<t})$ : The probability of generating token $o_t$ given question $q$ and preceding tokens $o_{<t}$ by the current policy $\pi_\theta$ .
  - $|o|$ : Length of the output sequence.
  - $f(x, y) = \min(xy, \mathrm{clip}(x, 1-\epsilon, 1+\epsilon)y)$ : A clipping-based surrogate function similar to PPO's clipped objective.
    - $x = \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{\mathrm{old}}(o_t | q, o_{<t})}$ : The likelihood ratio of the current policy to the old policy for a token.
    - $y = A(o|q)$ : The advantage function.
    - $\epsilon$ : A hyperparameter defining the clipping range.
  - $A(o|q) = \frac{r(o|q) - \mathbb{E}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)}{\sqrt{\mathrm{var}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)}}$ : The group relative advantage function. It measures how much better the reward of output $o$ is compared to the average reward for question $q$ , normalized by the standard deviation of rewards.
    - $r(o|q) \in \{1, 0\}$ : Binary reward for output $o$ given question $q$ .
    - $\mathbb{E}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)$ : Expected reward for question $q$ under $\pi_{\mathrm{old}}$ .
    - $\mathrm{var}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)$ : Variance of rewards for question $q$ under $\pi_{\mathrm{old}}$ .
  - $\beta \mathbb{D}_{\mathrm{KL}}(\pi_\theta || \pi_{\mathrm{ref}})$ : A KL divergence regularization term to prevent the policy from drifting too far from a reference model.
    - $\beta$ : A hyperparameter for the regularization strength.
Dr. GRPO [40]: A variant that removes variance normalization in the advantage function and length normalization, aiming to mitigate difficulty bias and response-level length bias. The paper's analysis shows it only mitigates but does not eliminate the difficulty bias.
DAPO [79]: Addresses GRPO's limitations like entropy collapse, training instability, and biased loss through techniques such as decoupled clipping (two distinct clipping hyperparameters, $\epsilon_{\mathrm{low}}$ and $\epsilon_{\mathrm{high}}$ ), dynamic sampling, and a token-level policy loss. The paper notes DAPO may induce excessive entropy growth.
GPG [13]: Introduces a simplified REINFORCE-based objective that removes the need for a critic and reference models, enhancing scalability.
TRPA [61]: Uses the Direct Preference Optimization (DPO) objective combined with a KL divergence regularization to fine-tune LRMs. The paper notes it can be recovered from DisCO-b under specific conditions.
RL from Human Feedback (RLHF): A broader category of RL methods for LLMs, where human preferences are used to train a reward model, which then guides policy optimization (e.g., using PPO). Examples include OpenAI's work [85] and models like InstructGPT [52].
Direct Preference Optimization (DPO) [54]: An off-policy RLHF method that simplifies the training process by directly optimizing a policy against a dataset of human preferences without needing to explicitly train a separate reward model. Variants like KTO [16] and SimPO [44] exist.
REINFORCE [68]: A foundational policy gradient method that updates policy parameters using Monte Carlo samples of rewards. GPG is a REINFORCE-based objective.
TRPO [56] and PPO [57]: Trust Region Policy Optimization and Proximal Policy Optimization are classic policy gradient methods that use trust regions or clipped objectives to stabilize training by limiting policy updates. GRPO is inspired by these.
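
As a rough illustration of the two GRPO ingredients described above, the sketch below computes group relative advantages from a group of binary rewards and evaluates the clipped surrogate $f(x, y)$ for a single token. The function names, the use of the population variance, and the value `eps=0.2` are assumptions made for this example, not details taken from the paper.

```python
import math

def group_relative_advantages(rewards):
    """Normalize each reward by the mean and standard deviation of its group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)  # population variance (assumption)
    if var == 0:
        # All outputs are equally right or wrong: the group carries no learning signal.
        return [0.0 for _ in rewards]
    std = math.sqrt(var)
    return [(r - mean) / std for r in rewards]

def clipped_surrogate(ratio, advantage, eps=0.2):
    """f(x, y) = min(x * y, clip(x, 1 - eps, 1 + eps) * y)."""
    clipped = max(1.0 - eps, min(ratio, 1.0 + eps))
    return min(ratio * advantage, clipped * advantage)

# Hypothetical group: one correct answer out of four sampled outputs.
rewards = [1, 0, 0, 0]
advantages = group_relative_advantages(rewards)
print(advantages)

# A positive-advantage token whose ratio exceeds 1 + eps is capped by the clip.
print(clipped_surrogate(1.5, 1.0))
# A negative-advantage token whose ratio falls below 1 - eps is capped as well.
print(clipped_surrogate(0.5, -1.0))
```

Once the ratio leaves the clipping range in the direction favored by the advantage, the surrogate becomes constant in the ratio, which is the source of the vanishing gradients mentioned earlier.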

3.3. Technological Evolution

The journey of fine-tuning LLMs, especially for complex reasoning, has evolved significantly:

Early LLMs & Prompting (Pre-RL): Initially, LLMs relied heavily on techniques like Chain-of-Thought (CoT) prompting [66, 48, 81] or Tree-of-Thought [78] to elicit reasoning capabilities without explicit fine-tuning for reasoning. These methods focused on structuring prompts to guide the model's generation process.
RL for General Alignment (RLHF): The advent of RLHF [85, 52] marked a shift, where RL was used to align LLMs with human preferences (helpfulness, harmlessness) on general text generation tasks, often employing PPO.
Off-policy RL (DPO): To address the complexities and costs of RLHF (e.g., training a reward model), methods like DPO emerged, allowing direct optimization from preference data.
RL for Reasoning (GRPO & Variants): A major breakthrough for reasoning tasks specifically was the scaling of RL training using verifiable rewards (e.g., exact match, formal verification) to teach LLMs to learn through self-exploration. DeepSeek-R1's GRPO [23, 59] pioneered this, achieving state-of-the-art performance. This spurred a wave of research focusing on improving GRPO, including efforts like Dr. GRPO [40], DAPO [79], GPG [13], and TRPA [61]. These works primarily focused on tweaking GRPO's advantage function, clipping mechanisms, or regularization.

This paper's work, DisCO, fits into the latest stage of this evolution. It moves beyond incremental tweaks to GRPO by performing a fundamental redesign of the objective function, drawing on principles from discriminative learning, to address GRPO's inherent limitations more comprehensively.

3.4. Differentiation Analysis

Compared to the main methods in related work, DisCO offers several core differences and innovations:

| Feature | GRPO & Variants (e.g., Dr. GRPO, DAPO) | TRPA (DPO-based) | DisCO |
| --- | --- | --- | --- |
| Objective Function | Group relative objective with clipping-based surrogates | DPO objective (log-likelihood ratio) | Discriminative objective with non-clipping scoring functions |
| Difficulty Bias | Yes (inherent in advantage function weighting) | No (objective is not weighted by `p(q)`) | No (explicitly removed by design) |
| Scoring Function Type | Clipped likelihood ratio | Log-likelihood ratio (from DPO) | Flexible: Log-likelihood or Likelihood ratio (non-clipping) |
| Clipping | Yes | No (DPO is clipping-free) | No (explicitly abandons clipping for stability) |
| KL Divergence Handling | Regularization ( $\mathbb{D}_{\mathrm{KL}}(\pi_\theta \Vert \pi_{\mathrm{ref}})$ ) | Regularization ( $\mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta)$ ) | Constrained Optimization ( $\mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} \Vert \pi_\theta) \leq \delta$ ) with squared-hinge penalty |
| Entropy Stability | Prone to collapse or excessive growth (due to clipping/regularization) | Can be unstable with standard regularization | Addressed via non-clipping scores and constrained optimization |
| Data Imbalance | No explicit mechanism | No explicit mechanism | Incorporates advanced discriminative learning (DRO for partial AUC) |
| Theoretical Foundation | Policy gradient with heuristic advantage/clipping | Preference learning based on reward modeling equivalence | Discriminative learning, AUC maximization, DRO |

In essence, DisCO stands out by fundamentally re-thinking the optimization problem from a discriminative learning perspective, directly addressing identified shortcomings of GRPO with principled solutions rather than heuristic patches. Its use of non-clipping scoring functions and a dynamic constrained optimization approach for KL divergence offers more stable training, while DRO explicitly tackles data imbalance, which is a common issue in sparse reward settings.

4. Methodology

4.1. Principles

The core idea of DisCO is to reframe the problem of reinforcing Large Reasoning Models (LRMs) as a discriminative learning task. Instead of using a complex group relative advantage function with clipping, DisCO directly optimizes a discriminative objective that aims to:

Increase scores for positive answers: Make the model more likely to generate correct solutions.
Decrease scores for negative answers: Make the model less likely to generate incorrect solutions.

This approach is grounded in the intuition of AUC maximization, a well-established concept in discriminative learning. Key principles guiding DisCO's design include:

Elimination of Difficulty Bias: By removing the question-level weighting observed in GRPO, DisCO ensures that learning benefits equally from all questions, regardless of their initial correctness probability.
Non-Clipping Scoring Functions: To avoid issues like entropy collapse and vanishing gradients associated with clipping operations (like in PPO/GRPO), DisCO employs smooth, non-clipping score functions.
Stable Training via Constrained Optimization: Rather than a fixed KL divergence regularization, DisCO uses a trust region constraint on the KL divergence between the new and old policies. This dynamically adjusts the penalty on policy divergence, engaging only when the constraint is violated, leading to more stable and efficient learning.
Robustness to Data Imbalance: Recognizing that reasoning tasks often yield many more incorrect answers than correct ones, DisCO incorporates techniques from Distributionally Robust Optimization (DRO) to specifically handle this imbalanced rollout challenge, ensuring that valuable learning signals from rare positive examples are not overshadowed.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Analysis of GRPO's Difficulty Bias

The paper begins by analyzing the Group Relative Policy Optimization (GRPO) objective under a binary reward setting, where $r(o|q) \in \{1, 0\}$ . For a given question $q$ , let $p(q) = \mathbb{E}_{o \sim \pi_{\mathrm{old}}(\cdot|q)}[r(o|q)]$ be the probability of generating a correct answer under the old model $\pi_{\mathrm{old}}$ . The advantage function $A(o|q)$ in GRPO is defined as: $ A(o|q) = \frac{r(o|q) - \mathbb{E}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)}{\sqrt{\mathrm{var}_{o' \sim \pi_{\mathrm{old}}(\cdot|q)} r(o'|q)}} $ Given binary rewards, the expected reward is $p(q)$ and the variance is $p(q)(1-p(q))$ . Thus, $A(o|q)$ simplifies to: $ A(o|q) = \begin{cases} \sqrt{\frac{1-p(q)}{p(q)}}, & \text{if } r(o|q) = 1 \\ -\sqrt{\frac{p(q)}{1-p(q)}}, & \text{if } r(o|q) = 0 \end{cases} $ This formulation leads to a key insight presented in Proposition 1, which re-expresses a core part of the GRPO objective (excluding the KL regularization) as a discriminative objective with a question-level weight. Let $\mathcal{I}_0(\theta)$ be the GRPO objective without the KL term: $ \mathcal{I}_0(\theta) = \mathbb{E}_q \mathbb{E}_{o \sim \pi_{\mathrm{old}}(\cdot|q)} \left[ \frac{1}{|o|} \sum_{t=1}^{|o|} f\left( \frac{\pi_\theta(o_t | q, o_{<t})}{\pi_{\mathrm{old}}(o_t | q, o_{<t})}, A(o|q) \right) \right] $ where $f(x, y) = \min(xy, \mathrm{clip}(x, 1-\epsilon, 1+\epsilon)y)$ .
The proposition states that under certain conditions on $f(x,y)$ , $\mathcal{I}_0(\theta)$ can be rewritten as: $ \mathcal{I}_0(\theta) = \mathbb{E}_q \sqrt{p(q)(1-p(q))} \, \mathbb{E}_{o \sim \pi_{\mathrm{old}}^{+}(\cdot|q), o' \sim \pi_{\mathrm{old}}^{-}(\cdot|q)} [ s_\theta^{+}(o,q) - s_\theta^{-}(o',q) ] $ with the GRPO-specific scoring functions for positive and negative answers defined through: $ f^{+}(x, 1) = \min(x, 1+\epsilon), \quad f^{-}(x, 1) = \max(x, 1-\epsilon). $ Here, $s_{\theta}^+(o,q)$ and $s_{\theta}^-(o,q)$ are constructed from $\frac{1}{|o|}\sum_{t=1}^{|o|} f^+(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\mathrm{old}}(o_t|q,o_{<t})}, 1)$ and $\frac{1}{|o|}\sum_{t=1}^{|o|} f^-(\frac{\pi_\theta(o_t|q,o_{<t})}{\pi_{\mathrm{old}}(o_t|q,o_{<t})}, 1)$ respectively.

Symbol Explanation for Proposition 1:
- $\mathcal{I}_0(\theta)$ : The GRPO objective without the KL regularization term.
- $\mathbb{E}_q$ : Expectation over input questions $q$ .
- $\sqrt{p(q)(1-p(q))}$ : This is the question-level weight, denoted as $\omega(q)$ .
- $\pi_{\mathrm{old}}^+(\cdot|q)$ : Conditional distribution of positive outputs (reward = 1) given $q$ under $\pi_{\mathrm{old}}$ .
- $\pi_{\mathrm{old}}^-(\cdot|q)$ : Conditional distribution of negative outputs (reward = 0) given $q$ under $\pi_{\mathrm{old}}$ .
- $s_{\theta}^+(o,q)$ : Scoring function for a positive output $o$ .
- $s_{\theta}^-(o',q)$ : Scoring function for a negative output $o'$ .
- $f^+(x,1)$ : The non-decreasing function defining the positive scoring part, derived from f(x,y) when $y>0$ .
- $f^-(x,1)$ : The non-decreasing function defining the negative scoring part, derived from f(x,y) when $y \le 0$ .
  
  This analysis reveals two key insights:

Discriminative Nature: The term $\mathbb{E}_{o \sim \pi_{\mathrm{old}}^+, o' \sim \pi_{\mathrm{old}}^-} [s_{\theta}^+(o,q) - s_{\theta}^-(o',q)]$ is a discriminative objective, aiming to increase scores of positive answers and decrease scores of negative answers, similar to AUC maximization.
Difficulty Bias: The weighting factor $\omega(q) = \sqrt{p(q)(1-p(q))}$ causes difficulty bias. As shown in Figure 1(a), this weight is small for questions where p(q) is close to 0 (very hard) or 1 (very easy), meaning the optimization focuses less on these questions. This is problematic because learning from very hard questions is crucial for capability improvement, and even easy questions still present opportunities for refinement. The paper empirically validates this by showing that removing this weight (GRPO-RW) leads to faster improvements in correctness ratios (Figure 1(c) and 1(d)).
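
The difficulty bias can be checked numerically. The sketch below evaluates the simplified binary-reward advantage and the question-level weight $\omega(q) = \sqrt{p(q)(1-p(q))}$ for a few hypothetical values of $p(q)$. It also illustrates the step behind Proposition 1: a positive output appears with probability $p(q)$ and a negative one with probability $1-p(q)$, so both terms end up carrying the same weight $\omega(q)$. The function names are illustrative.

```python
import math

def binary_advantage(reward, p):
    """GRPO advantage under binary rewards, for a question with success rate p in (0, 1)."""
    if reward == 1:
        return math.sqrt((1.0 - p) / p)
    return -math.sqrt(p / (1.0 - p))

def difficulty_weight(p):
    """Question-level weight omega(q) = sqrt(p(q) * (1 - p(q)))."""
    return math.sqrt(p * (1.0 - p))

for p in (0.05, 0.25, 0.5, 0.75, 0.95):
    pos_term = p * binary_advantage(1, p)            # weight on positive outputs
    neg_term = (1.0 - p) * -binary_advantage(0, p)   # weight on negative outputs
    print(f"p={p:.2f}  omega={difficulty_weight(p):.4f}  "
          f"pos={pos_term:.4f}  neg={neg_term:.4f}")
```

The weight peaks at $p(q) = 0.5$ and shrinks toward zero as $p(q)$ approaches 0 or 1, which is the behavior the text attributes to Figure 1(a).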

4.2.2. DisCO's Basic Approach (`DisCO-b`)

Motivated by the discriminative nature revealed in GRPO, DisCO directly optimizes a discriminative objective without the problematic question-level weighting.

Discriminative Objective: DisCO defines a single scoring function $s_\theta(o,q)$ for both positive and negative outputs. The basic objective for maximization is: $ \mathcal { I } _ { 1 } ( \theta ) = \mathbb { E } _ { q } \mathbb { E } _ { o \sim \pi _ { \mathrm { o l d } } ^ { + } ( \cdot | q ) , o ^ { \prime } \sim \pi _ { \mathrm { o l d } } ^ { - } ( \cdot | q ) } \ell ( s _ { \theta } ( o , q ) - s _ { \theta } ( o ^ { \prime } , q ) ) . $

Symbol Explanation:
- $\mathcal{I}_1(\theta)$ : DisCO's basic objective.
- $\ell(\cdot)$ : A surrogate function. For comparison with GRPO, the identity function $\ell(s)=s$ is used.
- $s_\theta(o,q)$ : A scoring function for an output $o$ given question $q$ by the model $\pi_\theta$ .
  
  This objective directly aims to maximize the difference between the scores of positive and negative answers. A key departure from GRPO is the use of a single scoring function $s_\theta(o,q)$ for both positive and negative outputs, and specifically, the use of non-clipping scoring functions to avoid entropy collapse and vanishing gradients.
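
A minimal sketch of how $\mathcal{I}_1$ could be estimated for one question from a sampled group follows, assuming each output already has a scalar score. The function name, the handling of questions without both a positive and a negative output, and the example scores are assumptions made for illustration.

```python
def disco_basic_objective(pos_scores, neg_scores, surrogate=lambda s: s):
    """Average surrogate of (positive score - negative score) over all pairs.

    pos_scores / neg_scores hold the scores s_theta(o, q) of the correct and
    incorrect outputs sampled for a single question. The default surrogate is
    the identity function, matching the choice used for comparison with GRPO.
    """
    if not pos_scores or not neg_scores:
        return None  # no positive-negative pair can be formed for this question
    total = sum(surrogate(sp - sn) for sp in pos_scores for sn in neg_scores)
    return total / (len(pos_scores) * len(neg_scores))

# Hypothetical scores (e.g., length-normalized log-likelihoods).
pos = [-0.4, -0.6]
neg = [-0.9, -1.1, -0.5]
print(disco_basic_objective(pos, neg))
```

With the identity surrogate the estimate reduces to the mean positive score minus the mean negative score, and no question-level weight appears anywhere in the computation.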

Scoring Functions: Two non-clipping scoring functions are considered:

Log-likelihood: $ s _ { \theta } ( o , q ) = \frac { 1 } { | o | } \sum _ { t = 1 } ^ { | o | } \log \pi _ { \theta } ( o _ { t } | q , o _ { < t } ) $
Likelihood Ratio (L-ratio): $ s _ { \theta } ( o , q ) = \frac { 1 } { | o | } \sum _ { t = 1 } ^ { | o | } \frac { \pi _ { \theta } ( o _ { t } | q , o _ { < t } ) } { \pi _ { \mathrm { o l d } } ( o _ { t } | q , o _ { < t } ) } $

Symbol Explanation:
- $\pi_\theta(o_t | q, o_{<t})$ : Probability of token $o_t$ under current policy.
- $\pi_{\mathrm{old}}(o_t | q, o_{<t})$ : Probability of token $o_t$ under old policy.
- $|o|$ : Length of the output.
  
  These scoring functions have connections to Vanilla Policy Gradient (VPG) and TRPO objectives, respectively, but are used within a discriminative framework.
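
The two scoring functions can be sketched from per-token log-probabilities, which is how language-model libraries usually expose token likelihoods. The function names and the example values below are hypothetical.

```python
import math

def log_likelihood_score(logp_new):
    """Length-normalized log-likelihood: mean of log pi_theta(o_t | q, o_<t)."""
    return sum(logp_new) / len(logp_new)

def likelihood_ratio_score(logp_new, logp_old):
    """Length-normalized likelihood ratio: mean of pi_theta / pi_old per token."""
    ratios = [math.exp(n - o) for n, o in zip(logp_new, logp_old)]
    return sum(ratios) / len(ratios)

# Hypothetical per-token log-probabilities for a 4-token output.
logp_old = [-1.2, -0.7, -2.0, -0.3]
logp_new = [-1.0, -0.8, -1.5, -0.3]

print(log_likelihood_score(logp_new))
print(likelihood_ratio_score(logp_new, logp_old))
```

Neither function contains a `min`, `max`, or `clip` operation, so both stay sensitive to every token's probability regardless of how far the policy has moved from $\pi_{\mathrm{old}}$.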

Stabilizing Training with Constrained Optimization: To address the long-standing issue of training instability in RL, DisCO adopts a trust region constraint approach, inspired by TRPO. Instead of a regularization term with a fixed coefficient, it formulates the problem as a constrained optimization: $ \begin{array}{rl} & \max_{\theta} \; \mathcal{I}_1(\theta) := \mathbb{E}_q \mathbb{E}_{o \sim \pi_{\mathrm{old}}^{+}(\cdot|q), o' \sim \pi_{\mathrm{old}}^{-}(\cdot|q)} \ell(s_\theta(o,q) - s_\theta(o',q)) \\ & \mathrm{s.t.} \quad \mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} || \pi_\theta) \leq \delta. \end{array} $

Symbol Explanation:
- $\mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} || \pi_\theta)$ : KL divergence from the old policy to the current policy.
- $\delta$ : The maximum allowed KL divergence, defining the trust region.
  
  This constrained problem is solved using a squared-hinge penalty function, transforming it into an unconstrained optimization problem: $ \operatorname* { m a x } _ { \theta } \mathbb { E } _ { q } \mathbb { E } _ { o \sim \pi _ { \mathrm { o l d } } ^ { + } ( \cdot | q ) , o ^ { \prime } \sim \pi _ { \mathrm { o l d } } ^ { - } ( \cdot | q ) } \ell ( s _ { \theta } ( o , q ) - s _ { \theta } ( o ^ { \prime } , q ) ) - \beta [ \mathbb { D } _ { \mathrm { K L } } ( \pi _ { \mathrm { o l d } } | | \pi _ { \theta } ) - \delta ] _ { + } ^ { 2 } , $
Symbol Explanation:
- $\beta$ : Penalty parameter for the constraint.
- $[\cdot]_+ = \max\{\cdot, 0\}$ : The positive part function. The penalty is only applied when the KL divergence constraint is violated (i.e., $\mathbb{D}_{\mathrm{KL}} > \delta$ ).
  
  The squared-hinge penalty differs from regular KL regularization ( $\beta \mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} || \pi_\theta)$ ) in that its gradient contribution (and thus its impact on policy updates) becomes zero when the KL divergence is within the allowed trust region ( $\mathbb{D}_{\mathrm{KL}} \leq \delta$ ). This dynamic weighting makes it more effective in stabilizing training without unnecessarily restricting learning when the policy is already close to the old one.
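The difference between the two penalties can be sketched numerically. This is a minimal illustration that treats the KL divergence as a given scalar; the function names are hypothetical, and the values of $\delta$ and $\beta$ are the ones this document reports for DisCO in its training details.

```python
def squared_hinge_penalty(kl, delta, beta):
    """Return beta * [kl - delta]_+^2 and its derivative with respect to kl."""
    excess = max(kl - delta, 0.0)
    return beta * excess ** 2, 2.0 * beta * excess

def kl_regularizer(kl, beta):
    """Return the fixed-coefficient penalty beta * kl and its derivative."""
    return beta * kl, beta

delta, beta = 1e-4, 1e3  # values reported for DisCO in the training details
inside = squared_hinge_penalty(2e-5, delta, beta)   # within the trust region
outside = squared_hinge_penalty(2e-4, delta, beta)  # exceeds delta by delta
regular = kl_regularizer(2e-5, 0.001)               # always has slope beta
```

Inside the trust region both the penalty and its slope are exactly zero, whereas the fixed-coefficient regularizer pulls the policy back with the same strength no matter how small the KL divergence is.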

4.2.3. DisCO's Improved Approach for Tackling Imbalanced Rollouts

A significant challenge in RL fine-tuning for reasoning is sparse rewards, leading to imbalanced rollouts where the number of negative outputs far outweighs positive ones for many questions. The basic AUC objective ( $\mathcal{I}_1$ ) can be ineffective in such scenarios because it averages over all positive-negative pairs: when negatives vastly outnumber positives, a high AUC can still be achieved while a few high-scoring negative outputs are ranked above the positive ones.

To address this, DisCO incorporates Distributionally Robust Optimization (DRO) [84], specifically a formulation related to partial AUC maximization. This approach aims to maximize the scores of positive samples against top-ranked negative samples, rather than all negative samples equally. For a given question $q$ and a positive output $o$ , the DRO formulation for partial AUC maximization minimizes a robust risk over a set of probability measures $Q$ on negative data: $ \begin{array}{l} \displaystyle \inf_{Q \in \mathcal{Q}} \; \tau \mathbb{D}_{\mathrm{KL}}(Q, \pi_{\mathrm{old}}^-(\cdot|q)) + \mathbb{E}_{o' \sim Q} [ s_\theta(o,q) - s_\theta(o',q) ] \\ \displaystyle \qquad = -\tau \log \bigg( \mathbb{E}_{o' \sim \pi_{\mathrm{old}}^-(\cdot|q)} \exp \bigg( \frac{s_\theta(o',q) - s_\theta(o,q)}{\tau} \bigg) \bigg). \end{array} $

Symbol Explanation:
- $\mathcal{Q}$ : Set of probability measures $Q$ on negative data, absolutely continuous with respect to $\pi_{\mathrm{old}}^-(\cdot|q)$ .
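- $\tau$ : A hyperparameter that weights the KL term and thereby controls the robustness/risk-sensitivity. A smaller $\tau$ lets $Q$ move further from $\pi_{\mathrm{old}}^-(\cdot|q)$ and concentrate on the highest-scoring negatives; a larger $\tau$ keeps $Q$ close to it.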
- $\mathbb{D}_{\mathrm{KL}}(Q, \pi_{\mathrm{old}}^-(\cdot|q))$ : KL divergence between the robust distribution $Q$ and the empirical negative distribution $\pi_{\mathrm{old}}^-(\cdot|q)$ .
  
  This infimum formulation gives rise to the final DRO-based objective for maximization in DisCO: $ \mathcal{I}_2(\theta) = -\mathbb{E}_q \mathbb{E}_{o \sim \pi_{\mathrm{old}}^+(\cdot|q)} \tau \log \left( \mathbb{E}_{o' \sim \pi_{\mathrm{old}}^-(\cdot|q)} \exp \left( \frac{s_\theta(o',q) - s_\theta(o,q)}{\tau} \right) \right). $
Symbol Explanation:
- $\mathcal{I}_2(\theta)$ : DisCO's improved objective, based on DRO.
  
  When $\ell$ is the identity ( $\ell(s) = s$ ), this objective is always less than or equal to $\mathcal{I}_1(\theta)$ by Jensen's inequality for the convex function $-\log$ . Maximizing $\mathcal{I}_2(\theta)$ could be more effective than maximizing $\mathcal{I}_1(\theta)$ because it implicitly focuses on improving the scores of positive answers relative to the worst-case (highest-scoring) negative answers, making it robust to imbalance.
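The behaviour of the DRO term for a single positive output can be checked with a small numerical sketch. It assumes a finite set of negative scores, so the expectation becomes a plain mean; the function names and score values are hypothetical.

```python
import math

def dro_term(s_pos, s_negs, tau):
    """-tau * log(mean_j exp((s_negs[j] - s_pos) / tau)), computed stably."""
    z = [(s - s_pos) / tau for s in s_negs]
    m = max(z)  # subtract the maximum before exponentiating to avoid overflow
    log_mean_exp = m + math.log(sum(math.exp(v - m) for v in z) / len(z))
    return -tau * log_mean_exp

def mean_gap(s_pos, s_negs):
    """Plain average of s_pos - s_neg (pairwise term with identity loss)."""
    return sum(s_pos - s for s in s_negs) / len(s_negs)

s_pos, s_negs = 1.0, [0.2, 0.3, 0.1, 1.4]  # hypothetical; one hard negative
avg = mean_gap(s_pos, s_negs)              # 0.5
robust_large_tau = dro_term(s_pos, s_negs, tau=100.0)
robust_small_tau = dro_term(s_pos, s_negs, tau=0.01)
```

For large $\tau$ the term approaches the plain average gap from below, in line with the Jensen bound. For small $\tau$ it approaches the gap to the highest-scoring negative, which is negative in this example even though the average gap is positive.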

Finally, the full DisCO framework (DisCO as opposed to DisCO-b) solves the DRO-based discriminative objective under the KL divergence constraint, using the same squared-hinge penalty method: $ \begin{array}{rl} & \max_{\theta} \; \mathcal{I}_2(\theta) := -\mathbb{E}_q \mathbb{E}_{o \sim \pi_{\mathrm{old}}^+(\cdot|q)} \tau \log \bigg( \mathbb{E}_{o' \sim \pi_{\mathrm{old}}^-(\cdot|q)} \exp \bigg( \frac{s_\theta(o',q) - s_\theta(o,q)}{\tau} \bigg) \bigg), \\ & \text{s.t.} \quad \mathbb{D}_{\mathrm{KL}}(\pi_{\mathrm{old}} || \pi_\theta) \leq \delta. \end{array} $

4.2.4. Algorithm Flow (Algorithm 1)

In practice, all expectations are replaced by empirical averages over sampled data. The KL divergence is also estimated using sampled data at each iteration. The overall training process for DisCO (using $\mathcal{I}_2(\theta)$ ) is outlined as follows:

Algorithm 1: DisCO

Input: Initial policy model $\pi_0$ , reward function $r$ , question set $\mathcal{D}$ , hyperparameters $\delta, \beta, \tau$ .
Initialization: Policy model $\pi_\theta = \pi_0$ .
For each training step (Step = $1, \ldots, T$ ):
a. Sample a batch of questions $\mathcal{B}$ from $\mathcal{D}$ .
b. Update the old policy model: $\pi_{\mathrm{old}} = \pi_\theta$ .
c. For each question $q \in \mathcal{B}$ :
   i. Sample $n$ responses $\{o_i\}_{i=1}^n \sim \pi_{\mathrm{old}}(\cdot|q)$ . Denote this set as $S_q$ .
   ii. Partition $S_q$ into $S_q^+$ (positive answers, $r(o_i|q)=1$ ) and $S_q^-$ (negative answers, $r(o_i|q)=0$ ) based on rewards.
d. For each minibatch $\mathcal{B}_m \subseteq \mathcal{B}$ :
   i. Compute KL divergence estimator: $ \hat{\mathbb{D}}_{\mathrm{KL}} = \frac{1}{\sum_{q \in \mathcal{B}_m} \sum_{o \in S_q} |o|} \sum_{q \in \mathcal{B}_m} \sum_{o \in S_q} \sum_{t=1}^{|o|} \log \frac{\pi_{\mathrm{old}}(o_t | q, o_{<t})}{\pi_\theta(o_t | q, o_{<t})} $
      Symbol Explanation:
      - $\hat{\mathbb{D}}_{\mathrm{KL}}$ : Empirical estimator of the KL divergence.
      - $\sum_{q \in \mathcal{B}_m} \sum_{o \in S_q} |o|$ : Total number of tokens across all sampled outputs in the minibatch. This acts as a normalization factor.
      - $\log \frac{\pi_{\mathrm{old}}(o_t|q, o_{<t})}{\pi_\theta(o_t|q, o_{<t})}$ : The log-ratio of probabilities for each token, contributing to the KL divergence estimate.
   ii. Compute gradient $G_1$ for the discriminative objective $\mathcal{I}_2(\theta)$ : $ G_1 = \frac{1}{|\mathcal{B}_m|} \sum_{q \in \mathcal{B}_m} \frac{1}{|S_q^+|} \sum_{o \in S_q^+} \left( \nabla s_\theta(o,q) - \nabla \Big( \tau \log \sum_{o' \in S_q^-} \exp \big( \frac{s_\theta(o',q)}{\tau} \big) \Big) \right) $
      Symbol Explanation:
      - $G_1$ : Gradient component from the discriminative objective.
      - $|\mathcal{B}_m|$ : Number of questions in the minibatch.
      - $|S_q^+|$ : Number of positive outputs for question $q$ .
      - $\nabla s_\theta(o,q)$ : Gradient of the scoring function for a positive output $o$ .
      - $\nabla \Big( \tau \log \sum_{o' \in S_q^-} \exp (\frac{s_\theta(o',q)}{\tau}) \Big)$ : Gradient component related to robustly minimizing scores of negative outputs.
   iii. Compute gradient $G_2$ for the squared-hinge penalty term: $ G_2 = 2 \beta [ \hat{\mathbb{D}}_{\mathrm{KL}} - \delta ]_+ \nabla \hat{\mathbb{D}}_{\mathrm{KL}} $
      Symbol Explanation:
      - $G_2$ : Gradient component from the KL constraint penalty.
      - $2\beta [\hat{\mathbb{D}}_{\mathrm{KL}} - \delta]_+$ : The dynamic weighting factor for the KL gradient. It is 0 if $\hat{\mathbb{D}}_{\mathrm{KL}} \le \delta$ , otherwise $2\beta (\hat{\mathbb{D}}_{\mathrm{KL}} - \delta)$ .
      - $\nabla \hat{\mathbb{D}}_{\mathrm{KL}}$ : Gradient of the KL divergence estimator.
   iv. Update policy parameters: $\pi_\theta$ is updated using an optimizer (e.g., Adam-W) with the combined gradient $G = G_1 + G_2$ .
End for.
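Two quantities in the minibatch step can be sketched without an autograd framework: the token-averaged KL estimator, and the weights that the log-sum-exp term in $G_1$ assigns to each negative output. The weights follow from differentiating $\tau \log \sum_j \exp(s_j/\tau)$ with respect to $s_j$, which gives a softmax over $s/\tau$. This is a minimal illustration with hypothetical function names and inputs, not the authors' implementation.

```python
import math

def kl_estimate(logp_old, logp_new):
    """Token-averaged log(pi_old / pi_theta) over all outputs in a minibatch.

    logp_old, logp_new: lists of per-output lists of token log-probabilities.
    """
    total, n_tokens = 0.0, 0
    for old_seq, new_seq in zip(logp_old, logp_new):
        total += sum(a - b for a, b in zip(old_seq, new_seq))
        n_tokens += len(old_seq)
    return total / n_tokens

def negative_weights(neg_scores, tau):
    """Softmax(s / tau): derivative of tau * log(sum_j exp(s_j / tau)) in s_j."""
    m = max(neg_scores)  # shift by the maximum for numerical stability
    e = [math.exp((s - m) / tau) for s in neg_scores]
    z = sum(e)
    return [v / z for v in e]

# Hypothetical minibatch: two outputs with 2 and 3 tokens.
logp_old = [[-1.0, -2.0], [-0.5, -0.5, -1.0]]
logp_new = [[-1.1, -2.0], [-0.5, -0.7, -1.2]]
kl_hat = kl_estimate(logp_old, logp_new)  # (0.1 + 0 + 0 + 0.2 + 0.2) / 5
weights = negative_weights([0.9, 1.0, 1.3], tau=0.1)
```

The weights sum to 1 and concentrate on the highest-scoring negative, which is how the DRO term focuses the update on top-ranked negative answers.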

5. Experimental Setup

5.1. Datasets

The experiments are conducted on mathematical reasoning tasks using two primary datasets:

DeepScaleR-Preview-Dataset [42]: This is the main training dataset.
- Source & Characteristics: It aggregates AIME problems (from 1984 to 2023), AMC problems (before 2023), and questions from Omni-MATH [21] and Still [45] datasets.
- Scale: Approximately 40.3k unique problem-answer pairs.
- Domain: Focused on high-school and olympiad-level mathematical reasoning.
- Example Data Sample: While the paper does not provide a direct example of a problem-answer pair, these datasets typically consist of mathematical word problems, geometric proofs, or algebraic equations, along with their step-by-step solutions and final answers. For instance, an AIME problem might be: "What is the smallest positive integer $n$ such that the sum of the digits of $n$ is 2024 and $n$ is divisible by 2024?" The expected output would be a detailed chain of thought leading to the correct numerical answer.
- Choice Rationale: This dataset is chosen for its comprehensive coverage of diverse mathematical reasoning problems, making it suitable for training and evaluating LRMs' capabilities in this domain.
DAPO-Math-17k [79]: Used for additional experiments to verify the generalizability of DisCO beyond the primary dataset.
- Source & Characteristics: A smaller dataset (17k problems) specifically designed for mathematical reasoning.
- Domain: Mathematical reasoning.
- Choice Rationale: Used to demonstrate that DisCO's improvements are fundamental and not dataset-specific.
  
  Models are evaluated on six benchmark datasets:

AIME 2024
AIME 2025
MATH 500 [28, 37]
AMC 2023
Minerva [34]
Olympiad Bench (O-Bench) [26]

These benchmarks cover various levels of mathematical difficulty and problem types, providing a robust evaluation of reasoning capabilities.

5.2. Evaluation Metrics

The primary evaluation metric used is Pass@1 averaged over $k=16$ responses.

Conceptual Definition: Pass@1 is a metric used in code generation and mathematical reasoning tasks to assess how often a single sampled response to a question is correct. Because one sample is a noisy estimate, the paper generates $k$ independent responses per question and reports the fraction of them that are correct. This differs from pass@$k$, which counts a question as solved if at least one of the $k$ responses is correct. The metric quantifies the reliability and accuracy of the model's reasoning capabilities.
Mathematical Formula: The metric for each question is calculated as: $ \textstyle { \frac { 1 } { k } } \sum _ { i = 1 } ^ { k } \mathbb { I } ( o _ { i } \text{ is correct}) $ The reported Pass@1 is then the average of these question-level scores across all questions in the benchmark.
Symbol Explanation:
- $k$ : The number of independent responses generated for each question (set to 16 in this paper).
- $o_i$ : The $i$ -th generated response for a question.
- $\mathbb{I}(\cdot)$ : The indicator function, which returns 1 if the condition inside the parenthesis is true, and 0 otherwise.
- "is correct": This condition is determined by a rule-based reward mechanism, typically using exact match against an extracted answer or a formal verification tool [23, 33, 55].
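The metric can be computed as in the following minimal sketch, which assumes the correctness of each response has already been determined by the rule-based checker. The function name `pass_at_1` and the grading results are hypothetical.

```python
def pass_at_1(correct_flags):
    """Average fraction of correct responses per question.

    correct_flags[q][i] is True if the i-th response to question q is correct.
    """
    per_question = [sum(flags) / len(flags) for flags in correct_flags]
    return sum(per_question) / len(per_question)

# Hypothetical grading results: 2 questions, k = 4 responses each.
flags = [[True, False, True, True], [False, False, True, False]]
score = pass_at_1(flags)  # (0.75 + 0.25) / 2 = 0.5
```

Under pass@$k$ both questions in this example would count as solved, which shows why the averaged Pass@1 is the stricter measure.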

5.3. Baselines

The paper compares DisCO against a comprehensive set of recent state-of-the-art reinforcement learning methods and existing reasoning models:

GRPO [23]: The original Group Relative Policy Optimization method, which DeepSeek-R1 is based on. It is a foundational baseline for RL fine-tuning of LRMs.
GRPO with Entropy Regularization (GRPO-ER): A variant of GRPO that adds an entropy regularization term to the objective. This is used by DeepScaleR [42] to encourage exploration and prevent entropy collapse.
Dr. GRPO [40]: A variant that aims to mitigate response-level length bias and question-level difficulty bias by removing length and advantage normalization.
DAPO [79]: An RL system designed to address GRPO's limitations such as entropy collapse, training instability, and biased loss through techniques like decoupled clipping and dynamic sampling.
TRPA [61]: Trust Region Preference Approximation, which uses a Direct Preference Optimization (DPO) objective with a KL divergence regularization for fine-tuning LRMs.
STILL-3-1.5B-preview [12]: A model that adapts GRPO by periodically replacing the reference model after a fixed number of training steps. This is an existing model from another study.
DeepScaleR(DSR)-1.5B-Preview [42]: A model trained with GRPO using a maximum response length of 24k for training and 32k for inference. This demonstrates the impact of longer context windows. This is an existing model from another study.
GRPO-LEAD-7B [82]: An extension of GRPO that incorporates length-dependent rewards, explicit penalty terms, and difficulty-based advantage reweighting to promote concise and precise reasoning. This is an existing model from another study.

These baselines represent the current landscape of RL methods for fine-tuning reasoning models, covering different strategies to improve GRPO or alternative RL paradigms (like DPO in TRPA). They are chosen to provide a thorough comparison against the state-of-the-art.

5.4. Training Details

Base Models: Experiments are conducted by fine-tuning three SFT-finetuned models:
- DeepSeek-R1-Distill-Qwen-1.5B (Q1.5B)
- DeepSeek-R1-Distill-Qwen-7B (Q7B)
- DeepSeek-R1-Distill-Llama-8B (L8B)
All three are distilled reasoning models, i.e., smaller models trained to mimic the behavior of a larger, more powerful reasoning model (DeepSeek-R1).
Optimizer: AdamW optimizer with a weight decay of 0.0.
Learning Rates: Tuned from $\{5 \times 10^{-7}, 1 \times 10^{-6}, 2 \times 10^{-6}\}$ .
- Q1.5B: $2 \times 10^{-6}$
- Q7B: $1 \times 10^{-6}$
- L8B: $5 \times 10^{-7}$
Batch Sizes:
- Training batch size: 128
- Mini-batch size: 32
Responses per Question: 8 responses generated for each question during training.
Temperature: Set to 0.6 for both training and evaluation, following recommendations from [23].
Maximum Response Length (MRL): Limited to 8k tokens for both training and inference for DisCO and most baselines, to ensure fair comparison and manage computational costs. Some reference models (e.g., DeepScaleR-1.5B-Preview) use longer MRLs (24k/32k).
Hyperparameters for Baselines:
- GRPO: $\beta = 0.001$ (for KL regularization).
- GRPO-ER: Entropy regularization coefficient = 0.001.
- DAPO: $\epsilon_{\mathrm{low}} = 0.2$ , $\epsilon_{\mathrm{high}} = 0.28$ (for decoupled clipping). Dynamic Sampling is not implemented for DAPO or other methods to ensure fair comparison of algorithmic core, as it significantly increases sampling cost.
Hyperparameters for DisCO:
- $\delta = 10^{-4}$ (KL divergence constraint threshold). Chosen based on empirical observation that average KL divergence is around $2 \times 10^{-5}$ .
- $\beta = 10^3$ (penalty parameter for squared-hinge loss). Chosen so that the effective KL regularization weight (when violated by $\delta$ ) is $\beta \times \delta = 0.1$ .
- $\tau$ (robustness parameter for DRO): 1 for L-ratio scoring function, 10 for log-L scoring function. Tuned from $\{0.5, 1, 5, 10\}$ .
Training Steps:
- Q1.5B models: 1,400 steps.
- Q7B/L8B models: 1,000 steps.
Evaluation Frequency: Evaluations are conducted every 200 training steps, and the best performance achieved by each method is reported.
Computational Resources:
- 1.5B models: $4 \times$ 40G A100 GPUs, ~6 minutes per training step.
- 7B models: $1 \times$ 80G H100 GPUs, ~6.5 minutes per training step.

6. Results & Analysis

6.1. Core Results Analysis

The experiments demonstrate that DisCO consistently and significantly outperforms GRPO and its variants across various model sizes and benchmark tasks.

6.1.1. DeepSeek-R1-Distill-Qwen-1.5B Model (Q1.5B)

The following are the results from Table 2 of the original paper:

Model/Method	MRL(Train/Test)	AIME 2024	AIME 2025	MATH 500	AMC 2023	Minerva	O-Bench	Avg.
OpenAI-o1-Preview	-	0.4	-	0.814	-	-	-	-
DS-Distill-Qwen-1.5B	32k+ / 32k	0.288	0.263	0.828	0.629	0.265	0.433	0.451
DS-Distill-Qwen-1.5B	32k+/8k	0.181	0.215	0.758	0.515	0.237	0.353	0.376
STILL-3-1.5B-preview	29k / 32k	0.325	0.248	0.844	0.667	0.290	0.454	0.471
DSR-1.5B-Preview	24k / 32k	0.431	0.304	0.878	0.736	0.302	0.500	0.525
DSR-1.5B-Preview	24k / 8k	0.358	0.258	0.860	0.679	0.297	0.473	0.488
GRPO	8k / 8k	0.277	0.242	0.838	0.647	0.276	0.462	0.457
GRPO-ER	8k / 8k	0.298	0.242	0.839	0.649	0.279	0.452	0.460
Dr. GRPO	8k / 8k	0.252	0.238	0.831	0.631	0.268	0.440	0.443
DAPO	8k / 8k	0.310	0.252	0.848	0.675	0.296	0.456	0.473
TRPA	8k / 8k	0.354	0.235	0.835	0.653	0.283	0.458	0.470
DisCO (L-ratio)	8k / 8k	0.381	0.306	0.878	0.746	0.319	0.512	0.524
DisCO (log-L)	8k / 8k	0.404	0.317	0.876	0.758	0.333	0.509	0.533

Analysis: DisCO (log-L) achieves the highest average Pass@1 of 0.533, significantly outperforming all other baselines trained and evaluated with the same 8k MRL.
- It shows an average gain of about 7 percentage points over GRPO (0.533 vs. 0.457) and 6 percentage points over DAPO (0.533 vs. 0.473).
- Notably, DisCO (log-L) (0.533) even surpasses DeepScaleR-1.5B-Preview (0.525), which was trained with a much longer 24k MRL and evaluated with 32k MRL. This highlights DisCO's efficiency in learning with shorter context lengths.
- Both DisCO variants (L-ratio and log-L) outperform every other method trained and evaluated at 8k MRL on all six benchmarks, with the largest margins on challenging benchmarks like AIME 2024/2025 and AMC 2023.
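The averages and gains quoted above can be recomputed from the per-benchmark values in the Table 2 rows. The sketch below copies those rows and shows that the gains are absolute differences in average Pass@1; the variable names are arbitrary.

```python
# Per-benchmark Pass@1 from the Table 2 rows, in the order
# AIME 2024, AIME 2025, MATH 500, AMC 2023, Minerva, O-Bench.
rows = {
    "GRPO":          [0.277, 0.242, 0.838, 0.647, 0.276, 0.462],
    "DAPO":          [0.310, 0.252, 0.848, 0.675, 0.296, 0.456],
    "DisCO (log-L)": [0.404, 0.317, 0.876, 0.758, 0.333, 0.509],
}
avg = {name: sum(v) / len(v) for name, v in rows.items()}
gain_over_grpo = avg["DisCO (log-L)"] - avg["GRPO"]  # about 0.076
gain_over_dapo = avg["DisCO (log-L)"] - avg["DAPO"]  # about 0.060
```

Expressed relative to the baseline instead, the same differences would be larger, so the 7% and 6% figures are best read as percentage points.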

6.1.2. DeepSeek-R1-Distill-Qwen-7B Model (Q7B)

The following are the results from Table 3 of the original paper:

Model/Method	MRL(Train/Test)	AIME 2024	AIME 2025	MATH 500	AMC 2023	Minerva	O-Bench	Avg.
DS-Distill-Qwen-7B	32k+ / 32k	0.560	0.396	0.923	0.825	0.380	0.568	0.609
DS-Distill-Qwen-7B	32k+ / 8k	0.402	0.292	0.873	0.688	0.355	0.471	0.513
GRPO-LEAD-7B	8k / 8k	0.470	0.345	0.893	0.748	0.372	0.500	0.555
TRPA	8k / 8k	0.570	-	0.870	0.780	0.360	0.550	-
GRPO	8k / 8k	0.498	0.394	0.916	0.807	0.381	0.555	0.592
GRPO-ER	8k / 8k	0.515	0.381	0.916	0.825	0.376	0.544	0.593
Dr. GRPO	8k / 8k	0.488	0.346	0.910	0.792	0.368	0.546	0.575
DAPO	8k / 8k	0.454	0.335	0.907	0.799	0.388	0.535	0.570
TRPA	8k / 8k	0.510	0.367	0.898	0.779	0.379	0.534	0.578
DisCO (L-ratio)	8k / 8k	0.583	0.421	0.923	0.852	0.399	0.585	0.627
DisCO (log-L)	8k / 8k	0.558	0.410	0.927	0.854	0.410	0.592	0.625

Analysis: For the larger 7B model, DisCO also consistently outperforms all competing approaches. DisCO (L-ratio) achieves the highest average Pass@1 of 0.627, closely followed by DisCO (log-L) at 0.625. These scores are significantly higher than GRPO (0.592), GRPO-ER (0.593), and DAPO (0.570). DisCO even exceeds the performance of DS-Distill-Qwen-7B trained with 32k+ MRL (0.609) when using only 8k MRL, further underscoring its efficiency and effectiveness.

6.1.3. DeepSeek-R1-Distill-Llama-8B Model (L8B)

The following are the results from Table 4 of the original paper:

Model/Method	MRL(Train/Test)	AIME 2024	AIME 2025	MATH 500	AMC 2023	Minerva	O-Bench	Avg.
DS-Distill-Llama-8B	32k+/32k	0.506	0.346	0.896	0.815	0.295	0.541	0.566
DS-Distill-Llama-8B	32k+ /8k	0.348	0.238	0.825	0.652	0.267	0.440	0.462
GRPO	8k / 8k	0.410	0.240	0.873	0.759	0.307	0.506	0.516
GRPO-ER	8k / 8k	0.408	0.277	0.882	0.785	0.311	0.511	0.529
Dr. GRPO	8k / 8k	0.423	0.285	0.867	0.786	0.300	0.497	0.526
DAPO	8k / 8k	0.333	0.308	0.879	0.794	0.325	0.522	0.527
TRPA	8k / 8k	0.454	0.279	0.864	0.756	0.289	0.518	0.527
DisCO (L-ratio)	8k / 8k	0.506	0.356	0.900	0.831	0.326	0.553	0.579
DisCO (log-L)	8k / 8k	0.523	0.354	0.896	0.843	0.331	0.560	0.584

Analysis: The results for the Llama-8B model further corroborate DisCO's superiority. DisCO (log-L) achieves the highest average Pass@1 of 0.584, and DisCO (L-ratio) is close behind at 0.579. Both significantly outperform all GRPO variants and TRPA, again demonstrating robust performance across different model architectures. The gains are particularly noticeable on AIME and AMC benchmarks.

6.1.4. Training Dynamics

The paper analyzes the training dynamics in terms of training rewards and generation entropy over training steps.

The following are the training dynamics from Figure 2 of the original paper:

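Figure 2: Training dynamics of different methods: left two are for fine-tuning DeepSeek-R1-Distill-Qwen-1.5B model and right two are for fine-tuning DeepSeek-R1-Distill-Qwen-7B model. (a), (c) plot th… The image is a chart showing the training dynamics of different methods. The two left panels (a, b) show training reward and generation entropy versus training steps for the DeepSeek-R1-Distill-Qwen-1.5B model, and the two right panels (c, d) show the same quantities for the DeepSeek-R1-Distill-Qwen-7B model.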

Analysis for Q1.5B and Q7B Models (Figure 2):
- GRPO, GRPO-ER, Dr. GRPO: These methods suffer from entropy collapse, where the generation entropy quickly drops to very low levels (approaching 0). This indicates a highly deterministic policy, which leads to premature saturation in training rewards. The model stops exploring and gets stuck in local optima.
- DAPO: While designed to mitigate entropy collapse, DAPO exhibits excessive entropy growth, where the entropy increases rapidly and becomes very high. This suggests a highly random policy, leading to inefficient learning and also premature saturation in training rewards.
- TRPA: Shows instability in generation entropy during later steps (around 1100 for Q1.5B and 800 for Q7B), indicating that its KL regularization is not always sufficient to maintain stable training.
- DisCO (both L-ratio and log-L): In stark contrast, DisCO methods show the most stable training dynamics. Training rewards continue to increase steadily, and generation entropy is maintained around 0.22, indicating a balanced exploration-exploitation trade-off and robust learning.
  
  The following are the training dynamics from Figure 6 of the original paper:
  
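  The image is a chart showing the training dynamics of different methods when fine-tuning the DeepSeek-R1-Distill-Llama-8B model. Panel (a) shows training reward (averaged over the outputs generated at each step) versus training steps; panel (b) shows generation entropy versus training steps.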
Analysis for L8B Model (Figure 6):
- Similar trends are observed for the L8B model, with GRPO variants suffering from entropy collapse, DAPO from excessive entropy growth, and TRPA showing instability. DisCO again demonstrates superior stability in both training reward and generation entropy, which remains around 0.2.
  
  These dynamics confirm that DisCO's design choices (non-clipping scoring functions and constrained optimization) effectively address the stability issues inherent in other GRPO-based methods.

6.2. Ablation Studies

6.2.1. DisCO vs DisCO-b

The following are the results from Figure 3 (left) of the original paper:

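Figure 3: Ablation studies: left for comparing DisCO vs DisCO-b; middle and right for comparing clipping with non-clipping scoring functions. The image is a chart comparing different models and scoring functions. The left panel compares DisCO with its variant DisCO-b on five tasks and their average performance (Pass@1); the middle panel shows the performance of different scoring functions (such as clipped L-ratio and L-ratio) under the same conditions; the right panel shows generation entropy versus training steps, with an inset showing generation entropy for different clipping ratios.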

Analysis (Figure 3, left): DisCO (the full framework with DRO) is compared to DisCO-b (the basic approach without DRO for imbalanced rollouts) using the L-ratio scoring function for training Q7B models. DisCO consistently demonstrates significant improvements over DisCO-b, especially on the more difficult AIME datasets. This highlights the effectiveness of the DRO component in handling imbalanced rollouts and boosting performance. Appendix A.1 (Figure 5) further confirms this, showing DisCO's consistent superiority across various settings.

The following are the results from Figure 5 of the original paper:

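The image is a chart comparing DisCO-b and DisCO under different models and scoring functions. It contains four subplots, labeled (a) L-ratio on 1.5B, (b) log-L on 1.5B, (c) L-ratio on 7B, and (d) log-L on 7B, with Pass@1 on the vertical axis. Blue bars denote DisCO-b and green bars denote DisCO, showing the performance differences across benchmarks.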
Analysis (Figure 5): This figure provides a detailed comparison between DisCO-b and DisCO across different models (1.5B, 7B) and scoring functions (L-ratio, log-L). In all four subplots, DisCO (green bars) consistently achieves higher average Pass@1 scores than DisCO-b (blue bars). This reinforces that the advanced feature of DisCO for tackling imbalanced rollouts (DRO) contributes significantly to its overall performance. The paper also notes that even DisCO-b variants perform better than other baselines, indicating the strength of the core discriminative objective and constrained optimization.

6.2.2. Clipping vs Non-Clipping Scoring Functions

The following are the results from Figure 3 (middle and right) of the original paper:

Analysis (Figure 3, middle and right): This ablation compares the non-clipping L-ratio and log-L scoring functions with clipped L-ratio (similar to GRPO/DAPO objectives) within the DisCO-b framework for Q1.5B models.
- Clipped L-ratio with $\epsilon_{\mathrm{high}} = 0.2$ (similar to GRPO) leads to entropy collapse, where the generation entropy drops sharply, resulting in poor performance.
- Clipped L-ratio with $\epsilon_{\mathrm{high}} = 0.28$ (similar to DAPO) causes excessively high entropy, indicating a very random policy, which also yields worse performance.
- In contrast, non-clipping L-ratio and log-L scoring functions (used in DisCO) maintain stable entropy and achieve better performance. This strongly supports DisCO's design choice to abandon clipping.
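Why clipping interacts with training signal can be seen from the slope of a clipped likelihood ratio. The sketch below assumes a generic two-sided PPO-style clip to $[1 - \epsilon_{\mathrm{low}}, 1 + \epsilon_{\mathrm{high}}]$; the exact clipped scoring function used in the ablation is not spelled out in this section, so the form and the function names here are illustrative assumptions.

```python
def clipped_ratio(r, eps_low=0.2, eps_high=0.2):
    """Assumed PPO-style clip of a ratio to [1 - eps_low, 1 + eps_high]."""
    return min(max(r, 1.0 - eps_low), 1.0 + eps_high)

def unclipped_ratio(r):
    """Non-clipping L-ratio: the ratio is used as-is."""
    return r

def slope(f, r, h=1e-6):
    """Central finite-difference estimate of df/dr at r."""
    return (f(r + h) - f(r - h)) / (2.0 * h)

inside = slope(clipped_ratio, 1.1)      # within the clip range
above = slope(clipped_ratio, 1.5)       # ratio beyond 1 + eps_high
below = slope(clipped_ratio, 0.5)       # ratio beneath 1 - eps_low
unclipped = slope(unclipped_ratio, 1.5)
```

Outside the clip range the slope is zero, so tokens whose ratio has left the range contribute no gradient through this term, while the non-clipped ratio keeps a slope of 1 everywhere.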

6.2.3. KL Regularization vs Constrained Optimization

The following are the results from Figure 4 (left) of the original paper:

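Figure 4: Ablation studies: left for comparing KL regularization vs constrained optimization; middle for sensitivity of DisCO w.r.t. the hyperparameter $\tau$ ; right for contribution of each compone… The image is a chart showing ablation results: the left panel compares KL regularization with constrained optimization, the middle panel shows DisCO's sensitivity to the hyperparameter $\tau$ , and the right panel shows the contribution of each component.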

Analysis (Figure 4, left): This study compares the effectiveness of DisCO's constrained optimization approach (using the squared-hinge penalty) against a traditional KL regularization approach (where the KL divergence is simply added to the loss with a fixed coefficient, e.g., 0.001) for both Q1.5B and Q7B models. Constrained optimization consistently outperforms KL regularization. Furthermore, KL regularization was observed to lead to instability during training for Q7B models, similar to TRPA, suggesting it is insufficient for stabilizing training in these complex scenarios. This validates DisCO's use of a dynamic, constraint-based approach for KL divergence.

6.2.4. Sensitivity of Hyperparameter $\tau$

The following are the results from Figure 4 (middle) of the original paper:

Analysis (Figure 4, middle): The sensitivity of DisCO to the hyperparameter $\tau$ (the robustness parameter in DRO) is evaluated by running DisCO on Q1.5B models for 1400 steps with $\tau \in \{0.5, 1, 5, 10\}$ . The results indicate that DisCO is not highly sensitive to $\tau$ within this tested range, showing stable performance. This suggests that the method is robust to the choice of this specific hyperparameter.

6.2.5. Effect of Each Design Choice

The following are the results from Figure 4 (right) of the original paper:

Analysis (Figure 4, right): This ablation study analyzes the individual contribution of DisCO's key components by progressively modifying the DisCO-b approach on Q1.5B models:
1. DisCO-b: Basic discriminative objective + constrained KL.
2. DisCO-b + Question-level weight bias: Reintroducing the $\sqrt{p(q)(1-p(q))}$ weight from GRPO.
3. DisCO-b + KL regularization: Replacing the KL constraint with standard regularization.
4. DisCO-b + Clipping: Using a clipped scoring function with $\epsilon_{\mathrm{high}} = 0.2$ .
  
  The results show that each proposed component is important for DisCO's overall improvement:
- Adding question-level weight bias to DisCO-b ( $-w_q$ ) significantly degrades performance, confirming the detrimental effect of difficulty bias.
- Replacing the KL constraint with regularization (-kl_reg) also lowers performance, supporting the use of constrained optimization for stability.
- Using a clipping scoring function (-clip) leads to the most substantial drop in performance, emphasizing the vital importance of non-clipping scoring functions.

This comprehensive ablation confirms that DisCO's design choices collectively contribute to its strong performance.
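The question-level weight $\sqrt{p(q)(1-p(q))}$ reintroduced in the second ablation variant can be tabulated to see which questions it emphasizes. The sketch assumes $p(q)$ is the fraction of sampled answers to $q$ that are correct and uses the 8 responses per question from the training details; the function name is hypothetical.

```python
import math

def difficulty_weight(p):
    """Question-level weight sqrt(p * (1 - p)) for per-question accuracy p."""
    return math.sqrt(p * (1.0 - p))

# With 8 responses per question, p takes the values j / 8 for j = 0..8.
weights = {j: difficulty_weight(j / 8) for j in range(9)}
```

The weight peaks at 0.5 for questions answered correctly half of the time and shrinks to about 0.33 when only one of eight answers is correct (or only one is wrong), so very hard and very easy questions are down-weighted. This is the difficulty bias that the discriminative objective removes.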

6.2.6. Experiments on DAPO-Math-17K Dataset

The following are the results from Table 5 of the original paper:

Model/Method	MRL(Train/Test)	AIME 2024	AIME 2025	MATH 500	AMC 2023	Minerva	O-Bench	Avg.
DS-Distill-Qwen-1.5B	32k+ / 32k	0.288	0.263	0.828	0.629	0.265	0.433	0.451
DS-Distill-Qwen-1.5B	32k+ /8k	0.181	0.215	0.758	0.515	0.237	0.353	0.376
GRPO	8k / 8k	0.342	0.256	0.842	0.672	0.267	0.458	0.473
GRPO-ER	8k / 8k	0.290	0.260	0.852	0.681	0.287	0.463	0.472
Dr. GRPO	8k / 8k	0.300	0.250	0.849	0.705	0.292	0.464	0.477
DAPO	8k / 8k	0.275	0.229	0.812	0.653	0.256	0.441	0.444
TRPA	8k / 8k	0.346	0.279	0.836	0.683	0.281	0.450	0.479
DisCO (L-ratio)	8k / 8k	0.413	0.310	0.874	0.775	0.307	0.495	0.529
DisCO (log-L)	8k / 8k	0.460	0.317	0.873	0.775	0.320	0.502	0.541

Analysis: Experiments on the DAPO-Math-17K dataset with Q1.5B models for 1400 steps confirm that DisCO's improvements are fundamental and generalizable across different datasets. DisCO (log-L) achieves an average Pass@1 of 0.541, and DisCO (L-ratio) achieves 0.529. Both variants significantly outperform other baselines, maintaining their strong lead even on a different training dataset. This robust performance across datasets further validates the efficacy of the DisCO framework.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work successfully introduced DisCO (Discriminative Constrained Optimization), a novel and principled framework for reinforcing Large Reasoning Models (LRMs). Through a rigorous analysis of Group Relative Policy Optimization (GRPO), the authors identified critical limitations, including question-level difficulty bias and issues with clipping-based surrogates leading to entropy instability. DisCO addresses these by:

Redesigning the objective: Employing a pure discriminative objective that eliminates difficulty bias.
Utilizing non-clipping scoring functions: Enhancing training stability and avoiding vanishing gradients.
Implementing constrained optimization: Enforcing KL divergence limits with a squared-hinge penalty for dynamic and stable policy updates.
Incorporating Distributionally Robust Optimization (DRO): Effectively handling data imbalance in generated rollouts, a common problem with sparse rewards.

Empirical evaluations on mathematical reasoning tasks demonstrated DisCO's significant superiority over GRPO and its contemporary variants (DAPO, Dr. GRPO, TRPA). DisCO achieved substantial average Pass@1 gains (e.g., 7% over GRPO, 6% over DAPO for a 1.5B model) and exhibited remarkably stable training dynamics with controlled generation entropy. Its effectiveness was consistent across different model sizes (1.5B, 7B, 8B) and even outperformed models trained with significantly longer context lengths, highlighting its learning efficiency and generalizability.

7.2. Limitations & Future Work

The paper does not explicitly detail a "Limitations" or "Future Work" section. However, based on the context and focus, some implicit points can be inferred:

Domain Specificity: The experiments are primarily conducted on mathematical reasoning tasks. While the improvements are significant in this domain, further validation on other reasoning tasks (e.g., scientific reasoning, logical deduction, code generation) would strengthen the claim of generalizability.
Hyperparameter Tuning: While the paper states DisCO is not sensitive to $\tau$ in the tested range, the selection of $\delta$ and $\beta$ for the constrained optimization still involves tuning. The efficiency of this tuning process in more complex or diverse environments could be a practical consideration.
Computational Cost: Although DisCO improves learning efficiency, the underlying RL fine-tuning process for LRMs, involving extensive data generation and verification, remains computationally intensive. Optimizing the sampling strategies (which DAPO discusses but DisCO explicitly excludes from comparison) could be a future direction.
Scaling to Larger Models: The experiments are on models up to 8B parameters. Scaling DisCO to much larger, more powerful LRMs (e.g., 70B+) might introduce new computational or optimization challenges.

7.3. Personal Insights & Critique

Novelty and Principled Approach: DisCO's greatest strength lies in its principled approach. Instead of merely patching GRPO, it steps back to re-evaluate the objective from a discriminative learning perspective. This shift allows for the integration of well-established techniques from that field (like AUC maximization and DRO) to tackle fundamental issues in RL for LRMs. The explicit elimination of difficulty bias and the robust handling of data imbalance are significant theoretical and practical contributions.
Training Stability: The move from fixed KL regularization to constrained optimization with a squared-hinge penalty is a clever and effective design choice. This dynamic approach to constraining policy updates is more robust than a static regularization term, which often requires careful tuning and can lead to instability. The empirical results on entropy stability strongly support this design.
Applicability Beyond Reasoning: The principles underlying DisCO – discriminative objectives, non-clipping surrogates, and robust constrained optimization – are not unique to mathematical reasoning. They could potentially be applied to other LLM fine-tuning tasks where binary or scalar rewards are available, such as code generation, factual QA, or certain forms of dialogue policy learning, especially in sparse reward environments.
Complexity of DRO: While Distributionally Robust Optimization (DRO) for handling imbalanced rollouts is theoretically sound and empirically effective here, it introduces additional computational complexity (the $\tau \log(\mathbb{E}[\exp(\dots)])$ term) and a new hyperparameter $\tau$. For a beginner, understanding and implementing this component might be more challenging than simpler re-weighting schemes. However, the paper demonstrates its value.
Interpretability: The discriminative nature of DisCO is quite intuitive: push positive scores up, negative scores down. This clear objective might offer better interpretability of what the model is learning compared to the more convoluted advantage functions of some RL algorithms.
Implicit Reward Shaping: The success of DisCO still heavily relies on the quality of the rule-based reward mechanism. While the paper focuses on how to learn from these rewards, the design of effective and nuanced reward functions (which remains a challenge for complex, subjective reasoning) is still a prerequisite. Future work could explore how DisCO interacts with more complex, non-binary reward signals or learned reward models.
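
To make the "Complexity of DRO" point more concrete, here is a minimal sketch of a $\tau \log(\mathbb{E}[\exp(\cdot)])$ aggregation over the scores of negative answers. It assumes the argument of the exponential is the score divided by $\tau$, and that the discriminative objective is the mean positive score minus this term. Both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def dro_negative_term(neg_scores: list[float], tau: float) -> float:
    """Assumed form: tau * log(mean(exp(s / tau))) over negative scores.

    The maximum score is subtracted before exponentiating so that
    math.exp cannot overflow (the usual log-sum-exp stabilization).
    """
    m = max(neg_scores)
    mean_exp = sum(math.exp((s - m) / tau) for s in neg_scores) / len(neg_scores)
    return m + tau * math.log(mean_exp)


def discriminative_objective(pos_scores: list[float],
                             neg_scores: list[float],
                             tau: float) -> float:
    """Mean positive score minus the DRO aggregate of negative scores."""
    mean_pos = sum(pos_scores) / len(pos_scores)
    return mean_pos - dro_negative_term(neg_scores, tau)


# Illustrative scores only: one positive answer, three negatives.
value = discriminative_objective([0.0], [-3.0, -2.5, -0.5], tau=1.0)
```

In this sketch, a small $\tau$ pushes the aggregate toward the highest-scoring negative, while a large $\tau$ pulls it toward the plain average. That is one way to see why the term puts more weight on hard negatives than uniform averaging does, at the cost of an extra hyperparameter.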
