
S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

Published: May 12, 2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces S-GRPO, a novel reinforcement learning method that allows reasoning models to exit early during chain-of-thought generation, improving efficiency by evaluating intermediate reasoning steps and reducing redundancy compared to existing approaches.

Abstract

As Test-Time Scaling emerges as an active research focus in the large language model community, advanced post-training methods increasingly emphasize extending chain-of-thought (CoT) generation length, thereby enhancing reasoning capabilities to approach Deepseek R1-like reasoning models. However, recent studies reveal that reasoning models (even Qwen3) consistently exhibit excessive thought redundancy in CoT generation. This overthinking issue arises from the inherent limitations of conventional outcome-reward reinforcement learning, which systematically overlooks the regulation of intermediate reasoning processes. This paper introduces Serial-Group Decaying-Reward Policy Optimization (S-GRPO), a novel reinforcement learning paradigm that enables models to implicitly evaluate the sufficiency of intermediate reasoning steps, thereby facilitating early exit in CoT generation. Unlike GRPO, which samples multiple possible reasoning paths in parallel (parallel group), S-GRPO only samples one reasoning path and serially selects multiple temporal positions from the path to exit thinking and directly generate answers (serial group). For correct answers within a serial group, rewards gradually decrease based on the exit positions along the reasoning path from front to back. This design encourages the model to produce more accurate and concise thoughts, while also incentivizing early thinking termination when appropriate. Empirical evaluations demonstrate that S-GRPO is compatible with state-of-the-art reasoning models, including Qwen3 and Deepseek-distill. Across diverse benchmarks such as GSM8K, AIME 2024, AMC 2023, MATH-500, and GPQA Diamond, S-GRPO achieves a substantial reduction in sequence length (35.4% - 61.1%) while simultaneously improving accuracy (absolute 0.72% - 6.08%).


1. Bibliographic Information

1.1. Title

S-GRPO: Early Exit via Reinforcement Learning in Reasoning Models

The title clearly states the paper's core contributions:

  • S-GRPO: The name of the proposed method, which stands for Serial-Group Decaying-Reward Policy Optimization. This suggests a modification of a previous method, likely GRPO.
  • Early Exit: This points to the primary goal of the method—enabling the model to terminate its reasoning process early.
  • Reinforcement Learning in Reasoning Models: This specifies the technical approach (Reinforcement Learning) and the target application (improving reasoning in Large Language Models).

1.2. Authors

  • Muzhi Dai

  • Chenxu Yang

  • Qingyi Si

    The authors are affiliated with Huawei Technologies Co., Ltd., and the Institute of Information Engineering, Chinese Academy of Sciences. These are prominent industrial and academic research institutions, respectively, indicating a strong background in applied and fundamental AI research.

1.3. Journal/Conference

The paper is available on arXiv, which is a preprint server for academic papers. This means it has not yet undergone formal peer review for publication in a conference or journal. The provided publication date of May 12, 2025, and version number v2 suggest this is a very recent submission, likely intended for a top-tier AI conference in 2025 (e.g., NeurIPS, ICLR, ICML). While preprints are not peer-reviewed, they are a standard way for researchers to quickly disseminate cutting-edge work.

1.4. Publication Year

2025 (as per the preprint metadata).

1.5. Abstract

The abstract introduces the problem of "overthinking" in large language models (LLMs), where they generate excessively long and redundant chains-of-thought (CoT) during reasoning tasks. This inefficiency is attributed to standard reinforcement learning (RL) methods that only reward the final outcome, ignoring the quality of the intermediate reasoning steps.

To address this, the paper proposes Serial-Group Decaying-Reward Policy Optimization (S-GRPO). Unlike existing methods like GRPO that sample multiple reasoning paths in parallel, S-GRPO generates a single path and creates a "serial group" by sampling multiple early exit points along this path. A "decaying reward" mechanism is introduced: correct answers from earlier exits receive higher rewards. This incentivizes the model to learn to identify the point of "sufficient reasoning" and terminate early.

Empirical results on state-of-the-art models (like Qwen3) across various reasoning benchmarks (GSM8K, MATH-500, etc.) show that S-GRPO significantly reduces sequence length by 35.4% - 61.1% while simultaneously improving accuracy by 0.72% - 6.08%.

2. Executive Summary

2.1. Background & Motivation

Modern large language models (LLMs) have shown remarkable reasoning abilities, often enhanced by generating a step-by-step chain-of-thought (CoT). A common strategy to boost performance, known as Test-Time Scaling, involves prompting the model to produce longer and more detailed reasoning paths. However, this has led to a significant problem: overthinking. Models like Qwen3 and Deepseek-R1 often generate thoughts that are redundant, irrelevant, or unnecessarily long, which increases computational costs during inference and can even hurt accuracy by leading the model astray.

The root of this problem lies in how these models are trained. Conventional post-training methods using Reinforcement Learning (RL), such as GRPO, typically focus only on the final answer's correctness. They use an outcome-reward system: if the final answer is correct, the entire reasoning path gets a positive reward; otherwise, it gets zero. This approach fails to provide any feedback on the efficiency or sufficiency of the intermediate reasoning steps. The model is rewarded for a correct answer, regardless of whether it took 10 steps or 100, encouraging it to generate long thoughts "just in case."

This paper identifies a critical gap: the lack of a mechanism to regulate the intermediate reasoning process. The authors' innovative idea is to shift the RL paradigm from comparing different parallel reasoning chains to comparing different completion points within a single serial reasoning chain. This allows the model to learn not just how to reason, but also when to stop reasoning.

2.2. Main Contributions / Findings

The paper presents three main contributions:

  1. A Novel Serial-Group RL Paradigm: The authors pioneer a new RL framework that focuses on regulating the intermediate steps of reasoning. Instead of a parallel group of multiple independent reasoning paths, they introduce a serial group created by truncating a single reasoning path at various points. This allows the RL algorithm to compare the quality of shorter vs. longer thought processes derived from the same initial trajectory.

  2. The S-GRPO Algorithm: They propose Serial-Group Decaying-Reward Policy Optimization (S-GRPO), an algorithm designed to implement this new paradigm. S-GRPO encourages models to produce high-quality reasoning in the early stages of CoT generation. Its key features are a two-stage rollout process to preserve the model's existing reasoning capabilities and a decaying reward function that incentivizes correct and early exits.

  3. Synergistic Improvement in Efficiency and Accuracy: Through extensive experiments on various mathematical and scientific reasoning benchmarks, the paper demonstrates that S-GRPO is highly effective. It achieves a substantial reduction in the average number of generated tokens (a proxy for computational cost) by 35.4% to 61.1%. Crucially, this efficiency gain does not come at the cost of performance; instead, accuracy simultaneously improves by an absolute 0.72% to 6.08%. This establishes S-GRPO as a method that creates a win-win scenario for reasoning models.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Chain-of-Thought (CoT) Reasoning

Chain-of-Thought (CoT) is a prompting technique that encourages LLMs to solve complex problems by thinking step-by-step. Instead of directly outputting the final answer, the model is prompted to generate a sequence of intermediate reasoning steps that lead to the solution. This process mimics human problem-solving and has been shown to significantly improve performance on tasks requiring arithmetic, commonsense, or logical reasoning. For example, when asked "The cafeteria had 23 apples. If they used 20 to make lunch and bought 6 more, how many apples do they have?", a CoT-enabled model would first reason "Start with 23 apples. Used 20, so 23 - 20 = 3. Bought 6 more, so 3 + 6 = 9." before giving the final answer.

3.1.2. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a machine learning paradigm where an agent learns to make decisions by interacting with an environment. The agent takes actions, the environment transitions to a new state, and the agent receives a reward or penalty. The agent's goal is to learn a policy (a strategy for choosing actions) that maximizes the cumulative reward over time. In the context of LLMs, the model is the agent, the generated text is the action, and the reward is typically based on the quality of the final output (e.g., correctness of the answer).

3.1.3. Proximal Policy Optimization (PPO)

Proximal Policy Optimization (PPO) is a popular and effective policy gradient RL algorithm. Policy gradient methods work by directly adjusting the policy's parameters to favor actions that lead to higher rewards. A key challenge is ensuring that policy updates are not too large, which could destabilize training. PPO addresses this by introducing a "clipped" objective function that discourages the new policy from deviating too far from the old one. This makes training more stable and reliable. S-GRPO's optimization objective is based on the PPO-clip objective. The general form of the PPO-clip objective is: $ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min\left( r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t \right) \right] $ Where:

  • $r_t(\theta) = \frac{\pi_{\theta}(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio between the new and old policies.
  • $\hat{A}_t$ is the estimated advantage at timestep $t$.
  • $\epsilon$ is a hyperparameter that defines the clipping range.
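
To make the clipped objective concrete, here is a minimal sketch of the standard PPO-clip surrogate in PyTorch. This is an illustration of the general formula above, not code from the paper; the tensor names (`logp_new`, `logp_old`, `advantages`) are assumptions made for the example.

```python
import torch

def ppo_clip_loss(logp_new: torch.Tensor,
                  logp_old: torch.Tensor,
                  advantages: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Negative PPO-clip surrogate (a loss to minimize)."""
    # Probability ratio r_t(theta) = pi_new / pi_old, computed in log space.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO maximizes E[min(unclipped, clipped)]; negating turns it into a loss.
    return -torch.min(unclipped, clipped).mean()
```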

3.2. Previous Works

3.2.1. GRPO (Group Relative Policy Optimization)

GRPO is a key method that S-GRPO builds upon and contrasts with. GRPO is designed to distill the pass@k reasoning ability of a model into its pass@1 performance. This means making the model's single best guess as good as the best result from many guesses.

  • Mechanism: For a given problem, GRPO samples $k$ different reasoning paths in parallel, forming a parallel group.

  • Reward: Each path in the group is evaluated. If a path leads to a correct answer, it receives a reward of 1; otherwise, it receives 0.

  • Optimization: The rewards within the group are then used to calculate relative advantages. Paths with correct answers get positive advantages, while incorrect ones get negative advantages. The model is then updated to increase the probability of generating the "winning" paths.

  • Limitation: As highlighted by the S-GRPO paper, GRPO only cares about the final outcome. It doesn't penalize a correct but extremely long and inefficient reasoning path, leading to the "overthinking" problem.

    The following figure from the paper visually contrasts GRPO's parallel group with S-GRPO's serial group.

    Figure 1: Comparison of parallel-group-relative GRPO and the serial-group-relative S-GRPO. GRPO samples multiple complete reasoning paths in parallel, whereas S-GRPO selects multiple exit positions along a single reasoning path and assigns decaying rewards $r^i = \frac{1}{2^{i-1}}$ across the different exit positions.
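
For contrast with the serial-group mechanism described later, GRPO's outcome-only reward and parallel-group-relative advantage can be sketched in a few lines. This is a simplified illustration that assumes the usual mean/standard-deviation normalization within the parallel group, not the authors' implementation.

```python
def grpo_parallel_rewards(is_correct):
    # Outcome-only reward: 1 for a correct final answer, 0 otherwise,
    # regardless of how long the reasoning path was.
    return [1.0 if c else 0.0 for c in is_correct]

def grpo_group_advantages(rewards, eps=1e-6):
    # Group-relative advantage: each path's reward is normalized against
    # the parallel group's mean and standard deviation.
    mean_r = sum(rewards) / len(rewards)
    std_r = (sum((r - mean_r) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean_r) / (std_r + eps) for r in rewards]
```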

3.2.2. Other Efficient Reasoning Methods (Baselines)

The paper compares S-GRPO against several other methods aimed at making reasoning more efficient:

  • DEER: A training-free approach that makes early-exit decisions during inference based on the confidence scores of intermediate answers.
  • ConCISE: An off-policy method that first generates concise CoT data (using special prompts and early-exit) and then fine-tunes the model on this data using SFT (Supervised Fine-Tuning) or SimPO (Simple Preference Optimization).
  • RL + Length Penalty: A modification of RL where the reward function is adjusted to penalize longer correct responses.
  • ShorterBetter: A GRPO-based method that explicitly assigns higher rewards to shorter correct paths within the parallel group.

3.3. Technological Evolution

The field has evolved from simply enabling reasoning to optimizing it for both accuracy and efficiency.

  1. Basic Prompting: Early work focused on zero-shot and few-shot prompting.
  2. Chain-of-Thought (CoT): The introduction of CoT unlocked more complex reasoning by having models "show their work."
  3. RL for Outcome-Based Improvement (e.g., GRPO): Researchers used RL to fine-tune models to be better reasoners, focusing on maximizing the rate of correct final answers. This improved accuracy but often led to longer, more costly computations.
  4. RL for Efficient Reasoning (e.g., S-GRPO, ShorterBetter): The current wave of research, including this paper, addresses the "overthinking" side effect. The focus is now on finding a balance—achieving high accuracy without wasteful computation. S-GRPO sits at this frontier, proposing a novel way to teach models the concept of "reasoning sufficiency."

3.4. Differentiation Analysis

The core innovation of S-GRPO lies in its serial-group and decaying-reward mechanisms.

  • S-GRPO vs. GRPO/ShorterBetter: GRPO and ShorterBetter operate on a parallel group of independent reasoning paths. They compare which path is better. In contrast, S-GRPO operates on a serial group derived from a single path. It compares which exit point is better. This provides a much more direct and fine-grained signal for learning when to stop thinking.
  • S-GRPO vs. RL + Length Penalty: A simple length penalty can be naive. It might excessively punish a problem that genuinely requires a long reasoning chain, potentially harming accuracy. S-GRPO's decaying reward is more nuanced: it does not blindly punish length but rewards earlier correctness. A long correct path still receives a positive reward, and it even receives the full reward when all earlier exits produce wrong answers.
  • S-GRPO vs. DEER: DEER is a training-free inference-time strategy that requires calculating confidence scores, adding overhead. S-GRPO is a training-time method that bakes the early-exit capability directly into the model's policy, requiring no extra computation during inference.
  • S-GRPO vs. ConCISE: ConCISE uses off-policy optimization (SFT/SimPO) to fit the model to a new distribution of concise data. The S-GRPO authors argue this can be disruptive. S-GRPO uses on-policy RL and a two-stage rollout that preserves the model's original reasoning process, making it a less disruptive "final stage" optimization.

4. Methodology

4.1. Principles

The central idea of Serial-Group Decaying-Reward Policy Optimization (S-GRPO) is to teach a language model to implicitly assess whether its current chain-of-thought is sufficient to answer a question, and if so, to exit the reasoning process early. This is achieved by reformulating the reinforcement learning problem. Instead of comparing multiple independent solutions (as in GRPO), S-GRPO compares multiple potential exit points along a single, serially generated reasoning path.

The method is built on two key principles:

  1. Serial-Group Comparison: By forcing the model to generate answers at different stages of its thought process, it can learn to distinguish between incomplete, sufficient, and redundant reasoning.

  2. Decaying Rewards: By assigning higher rewards to correct answers that are generated earlier, the model is explicitly incentivized to find the most concise and efficient reasoning path to a correct solution.

    The overall framework of S-GRPO, as shown in the figure below, involves three main stages: Serial-Group Generation, Decaying Reward Strategy, and Advantage Computation & Parameter Update.

    Figure 2: The framework of S-GRPO, illustrating the full thought rollout, the early-exit thought rollouts, and the decaying reward computation. The complete answer inducer is omitted in the figure and represented by </think> instead; its full text is "Time is limited, stop thinking and start answering.\n</think>\n\n".

4.2. Core Methodology In-depth (Layer by Layer)

The entire process can be broken down according to Algorithm 1 in the paper.

4.2.1. Stage 1: Serial-Group Generation

This stage constructs the serial group of responses for a single query. It is a two-phase rollout process.

4.2.1.1. Full Thought Rollout

First, for a given query $q$, the policy model $\pi_{\theta}$ generates a complete, uninterrupted reasoning path. This path consists of a sequence of thought tokens followed by a final conclusion.

  • Process: The model generates the full output $O^0 = (T_1, T_2, \dots, T_n, \texttt{</think>}, C^0)$, where $(T_1, \dots, T_n)$ is the CoT, </think> is a special token signaling the end of reasoning, and $C^0$ is the final answer.
  • Truncation Point Selection: After generating the full path, $m$ temporal positions $\{P_1, P_2, \dots, P_m\}$ are randomly and uniformly sampled from the thought sequence $(T_1, \dots, T_n)$. These positions serve as the early-exit points; the randomness ensures the model learns to evaluate reasoning sufficiency at various depths.

4.2.1.2. Early-exit Thought Rollout

Next, for each sampled position $P_i$, an early-exit path is constructed.

  • Process: The original full thought path is truncated at position $P_i$. A special instructional prompt, "Time is limited, stop thinking and start answering.\n</think>\n\n", is inserted at this point. This prompt explicitly instructs the model to stop its reasoning and generate an answer based on the partial thought process it has so far.
  • Generation: The model $\pi_{\theta}$ then generates an intermediate answer $C^i$ conditioned on this truncated and prompted path.
  • Serial Group Formation: The serial group consists of the original full response $O^0$ and the $m$ early-exit responses $\{O^1, O^2, \dots, O^m\}$, where each $O^i$ comprises the truncated thought path and its corresponding generated answer $C^i$.
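
A minimal sketch of this two-phase rollout is shown below. The `generate(prompt)` callable is a hypothetical stand-in for sampling a continuation from the current policy, and truncation is done at character offsets purely for illustration (the paper selects temporal positions over thought tokens).

```python
import random

ANSWER_INDUCER = "Time is limited, stop thinking and start answering.\n</think>\n\n"

def build_serial_group(generate, query, m=8, seed=0):
    """Return the full response O^0 plus up to m early-exit responses O^i."""
    rng = random.Random(seed)
    full = generate(query)                       # full thought + </think> + answer C^0
    thought = full.split("</think>")[0]          # keep only the thought portion
    # Sample up to m truncation positions uniformly along the thought.
    k = min(m, max(len(thought) - 1, 0))
    positions = sorted(rng.sample(range(1, len(thought)), k=k)) if k > 0 else []
    group = [full]                               # O^0
    for p in positions:
        prefix = thought[:p] + ANSWER_INDUCER    # truncated thought + answer inducer
        answer = generate(query + prefix)        # early-exit answer C^i
        group.append(prefix + answer)            # O^i
    return group, positions
```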

4.2.2. Stage 2: Decaying Reward Strategy

Once the serial group is generated, each response in the group is assigned a reward. This is where the "decaying" mechanism comes into play. The responses are ordered based on their exit position (i.e., the length of their thought process).

The reward $r^i$ for a response $O^i$ with answer $C^i$ is calculated using the following formula: $ r^i = \begin{cases} \frac{1}{2^{N_{\mathrm{right}}-1}}, & \text{if } C^i \text{ is correct}, \\ 0, & \text{if } C^i \text{ is incorrect}. \end{cases} $

  • Symbol Explanation:

    • $r^i$: The reward assigned to the $i$-th response in the ordered serial group.
    • $C^i$: The answer generated by the $i$-th response.
    • $N_{\mathrm{right}}$: The accumulated count of correct answers up to and including the current position $i$ in the ordered group.
  • Example: Imagine a serial group with 4 exit points, ordered by length.

    • Exit 1 (shortest): Answer is incorrect. Reward = 0 ($N_{\mathrm{right}}$ is 0).

    • Exit 2: Answer is correct; this is the first correct answer, so $N_{\mathrm{right}}$ becomes 1. Reward = $1/2^{(1-1)} = 1$.

    • Exit 3: Answer is correct; $N_{\mathrm{right}}$ becomes 2. Reward = $1/2^{(2-1)} = 0.5$.

    • Exit 4: Answer is incorrect. Reward = 0 ($N_{\mathrm{right}}$ remains 2).

    • Full path: Answer is correct; $N_{\mathrm{right}}$ becomes 3. Reward = $1/2^{(3-1)} = 0.25$.

      This strategy has two objectives:

  1. Correctness First: Incorrect answers always receive a reward of 0, ensuring the model does not sacrifice accuracy.
  2. Efficiency Incentive: For correct answers, the reward decays exponentially. This creates a strong preference for the shortest path that yields a correct answer, directly encouraging the model to learn early exits.
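
The reward rule itself takes only a few lines; the sketch below reproduces the worked example above, assuming the serial group is ordered by exit position (shortest first).

```python
def decaying_rewards(is_correct):
    """S-GRPO decaying rewards for a serial group ordered by exit position."""
    rewards, n_right = [], 0
    for correct in is_correct:
        if correct:
            n_right += 1                 # N_right: running count of correct answers
            rewards.append(1.0 / 2 ** (n_right - 1))
        else:
            rewards.append(0.0)          # incorrect answers always receive 0
    return rewards

# The worked example above: exits 1-4 followed by the full path.
assert decaying_rewards([False, True, True, False, True]) == [0.0, 1.0, 0.5, 0.0, 0.25]
```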

4.2.3. Stage 3: Advantage Computation and Parameter Update

With rewards assigned, the final step is to update the model's parameters using a policy gradient method.

4.2.3.1. Advantage Computation

The advantage for each response in the serial group is calculated. The advantage function measures how much better a given response is than the average of all responses in the group. The advantage $\hat{A}_i$ of response $i$ is: $ \hat{A}_i = r^i - \operatorname{mean}(\{r^1, \dots, r^G\}) $ Here, $\operatorname{mean}(\{r^1, \dots, r^G\})$ is the average reward across all responses in the serial group. This is a simplified version of the standard advantage calculation: the authors remove the standard-deviation normalization for training stability. A positive advantage means the response was better than average, while a negative advantage means it was worse. The advantage value $\hat{A}_i$ is then broadcast to all tokens in the response sequence $O^i$.
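
Continuing the sketch from the reward step, this mean-centered advantage (without standard-deviation normalization) is a one-liner:

```python
def serial_group_advantages(rewards):
    # A_i = r^i minus the mean reward of the serial group; this value is then
    # broadcast to every token of the corresponding response O^i.
    mean_r = sum(rewards) / len(rewards)
    return [r - mean_r for r in rewards]

# With the rewards from the previous example, [0.0, 1.0, 0.5, 0.0, 0.25],
# the mean is 0.35, giving advantages of roughly [-0.35, 0.65, 0.15, -0.35, -0.10].
```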

4.2.3.2. Parameter Update

The model's parameters θ\theta are updated using the computed advantages. The optimization objective follows the PPO-clip style to ensure stable training updates.

The objective function $\mathcal{T}_{\mathrm{S\text{-}GRPO}}(\theta)$ is: $ \mathcal{T}_{\mathrm{S\text{-}GRPO}}(\theta) = \mathbb{E}_{q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \min\!\left( \frac{\pi_{\theta}^{i,t}}{\pi_{\theta_{\mathrm{old}}}^{i,t}} \hat{A}_{i,t},\ \operatorname{clip}\!\left(\frac{\pi_{\theta}^{i,t}}{\pi_{\theta_{\mathrm{old}}}^{i,t}},\, 1-\epsilon,\, 1+\epsilon\right) \hat{A}_{i,t} \right) \right] $

  • Symbol Explanation:
    • $\mathbb{E}$: The expectation, meaning the value is averaged over many samples.
    • $q \sim P(Q)$: A query $q$ is sampled from the distribution of all queries $P(Q)$.
    • $\{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)$: The serial group of $G$ responses is generated using the "old" policy $\pi_{\theta_{\mathrm{old}}}$ (the policy from before the update).
    • $G$: The size of the serial group (i.e., $m+1$).
    • $|o_i|$: The length (number of tokens) of response $o_i$.
    • $\pi_{\theta}^{i,t}$: The probability of generating token $t$ of response $i$ under the new (current) policy $\pi_{\theta}$.
    • $\pi_{\theta_{\mathrm{old}}}^{i,t}$: The probability of generating token $t$ of response $i$ under the old (sampling) policy $\pi_{\theta_{\mathrm{old}}}$.
    • $\frac{\pi_{\theta}^{i,t}}{\pi_{\theta_{\mathrm{old}}}^{i,t}}$: The importance sampling ratio, which re-weights the advantage so it applies to the new policy.
    • $\hat{A}_{i,t}$: The token-level advantage, equal to the sequence-level advantage $\hat{A}_i$.
    • $\operatorname{clip}(\cdot, 1-\epsilon, 1+\epsilon)$: The clipping function, which constrains the importance sampling ratio to the range $[1-\epsilon, 1+\epsilon]$, preventing destabilizingly large policy updates.
    • $\min\{\cdot, \cdot\}$: The objective takes the minimum of the unclipped and clipped terms. With a positive advantage this encourages increasing the probability of the response while capping the update; with a negative advantage it encourages decreasing the probability, also with a capped update size.
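
Putting the pieces together, a deliberately simplified PyTorch sketch of this surrogate for one serial group might look as follows. The padded log-probability tensors and the token mask are assumptions made for the example, not the paper's implementation.

```python
import torch

def s_grpo_objective(logp_new, logp_old, seq_advantages, mask, eps=0.2):
    """Clipped surrogate over one serial group (to be maximized).

    logp_new, logp_old: (G, T) per-token log-probs under the current / old policy.
    seq_advantages:     (G,)   sequence-level advantages A_i, broadcast to tokens.
    mask:               (G, T) 1.0 for real response tokens, 0.0 for padding.
    """
    adv = seq_advantages.unsqueeze(-1)                         # A_{i,t} = A_i
    ratio = torch.exp(logp_new - logp_old)                     # importance ratio
    surrogate = torch.min(ratio * adv,
                          torch.clamp(ratio, 1 - eps, 1 + eps) * adv)
    per_response = (surrogate * mask).sum(-1) / mask.sum(-1)   # (1/|o_i|) * sum over t
    return per_response.mean()                                 # (1/G) * sum over i
```

In practice one would maximize this quantity (or minimize its negation) with gradient steps on $\theta$.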

5. Experimental Setup

5.1. Datasets

  • Training Dataset:

    • DeepMath-103K: The training data was sourced from this large-scale dataset of mathematics problems. It contains approximately 103,000 problems with difficulty ranging from grade 5 to grade 10. The paper highlights that this dataset is challenging and has been decontaminated to ensure no overlap with common evaluation benchmarks.
  • Evaluation Benchmarks: To test the method's effectiveness and generalization, a diverse set of five benchmarks was used:

    1. GSM8K: A popular benchmark of 1,319 grade-school math word problems that require 2-8 steps of reasoning using basic arithmetic.
    2. AIME 2024: A set of 30 challenging problems from the American Invitational Mathematics Examination, covering secondary-level algebra, combinatorics, geometry, etc.
    3. AMC 2023: 40 questions from the American Mathematics Competitions, designed to test problem-solving skills in high school mathematics.
    4. MATH-500: A challenging set of 500 problems from high-school math competitions, curated by OpenAI.
    5. GPQA Diamond: A graduate-level scientific Q&A benchmark. The "Diamond" subset contains 198 high-quality questions in physics, chemistry, and biology that are difficult even for domain experts.
  • Data Example: The appendix of the paper provides a visual example of how training data is constructed. For a single problem, it shows the full thought process and then multiple truncated versions, each with an early-exit answer and a corresponding decaying reward. The following image from the paper illustrates this process.

    Figure 5 from the original paper shows an example of training data truncation and the decaying reward assignment for a math problem. Incorrect answers receive a reward of 0, while subsequent correct answers receive exponentially decaying rewards (1, 1/2, 1/4, etc.).

5.2. Evaluation Metrics

The paper uses two primary metrics to evaluate performance, focusing on the trade-off between accuracy and efficiency.

  1. Accuracy (Acc / pass@1):

    • Conceptual Definition: This metric measures the percentage of problems for which the model generates the correct final answer in a single attempt (pass@1). It is the primary measure of the model's reasoning correctness.
    • Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of correctly solved problems}}{\text{Total number of problems}} $
    • Symbol Explanation:
      • Number of correctly solved problems: The count of questions where the model's first generated output contained the right answer.
      • Total number of problems: The total number of questions in the benchmark set.
  2. Token Count (Tokens):

    • Conceptual Definition: This metric measures the average number of tokens in the model's generated output (both thought process and final answer) for each problem in a benchmark. It serves as a direct proxy for computational cost and inference latency—fewer tokens mean a more efficient model.
    • Mathematical Formula: $ \text{Token Count} = \frac{\sum_{i=1}^{N} \text{length}(O_i)}{N} $
    • Symbol Explanation:
      • $N$: The total number of problems in the benchmark.
      • $O_i$: The full output generated by the model for the $i$-th problem.
      • $\text{length}(O_i)$: The number of tokens in the output $O_i$.
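
Both metrics are straightforward to compute from the raw generations; a small sketch follows, where the `tokenizer` object with an `encode` method is an assumption made for illustration.

```python
def evaluate_benchmark(outputs, is_correct, tokenizer):
    """Return pass@1 accuracy (%) and average output length in tokens."""
    accuracy = 100.0 * sum(is_correct) / len(outputs)
    avg_tokens = sum(len(tokenizer.encode(o)) for o in outputs) / len(outputs)
    return accuracy, avg_tokens
```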

5.3. Baselines

S-GRPO was compared against a comprehensive set of baselines representing different approaches to reasoning and efficiency:

  • Vanilla: The base reasoning model without any additional fine-tuning. This serves as the fundamental point of comparison.

  • DEER: A training-free, inference-time early-exit method.

  • ConCISE (SFT/SimPO): A training-based method using off-policy optimization to teach conciseness.

  • GRPO: The direct predecessor, using on-policy RL with a parallel-group mechanism.

  • RL + Length Penalty: An on-policy RL method that adds an explicit penalty for length to the reward function.

  • ShorterBetter: Another GRPO variant that gives higher rewards to shorter correct paths in the parallel group.

    These baselines cover the spectrum from no training to inference-time tricks to various forms of RL-based optimization, providing a robust evaluation of S-GRPO's relative performance.

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results are presented in Table 1, which compares S-GRPO against the baselines across four different models and five benchmarks. The results strongly validate the effectiveness of the proposed method.

The following is the complete data from Table 1 of the original paper:

| Method | GSM8K Acc | GSM8K Tokens | AIME 2024 Acc | AIME 2024 Tokens | AMC 2023 Acc | AMC 2023 Tokens | MATH-500 Acc | MATH-500 Tokens | GPQA Acc | GPQA Tokens | Overall Acc | Overall Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeepSeek-R1-Distill-Qwen-7B | | | | | | | | | | | | |
| Vanilla | 92.4 | 1,833 | 55.4 | 13,232 | 77.2 | 9,693 | 85.8 | 5,590 | 50.1 | 15,385 | 70.02 | 9,147 |
| DEER | 88.8 | 917 | 53.3 | 10,971 | 87.5 | 4,142 | 91.8 | 2,431 | 47.5 | 5,280 | 73.78 (+3.76) | 4,748 (-48.1%) |
| ConCISE (SFT) | 92.9 | 832 | 52.1 | 9,751 | – | – | 92.0 | 2,244 | 50.0 | 5,892 | – | – |
| ConCISE (SimPO) | 92.1 | 715 | 48.3 | 7,745 | – | – | 91.0 | 1,946 | 48.0 | 4,859 | – | – |
| GRPO | 93.2 | 1,767 | 55.0 | 13,451 | 87.5 | 9,887 | 93.6 | 5,317 | 50.7 | 15,817 | 76.00 (+5.98) | 9,248 (+1.1%) |
| RL + Length Penalty | 92.4 | 1,062 | 51.9 | 7,464 | 86.9 | 3,540 | 92.2 | 2,451 | 49.1 | 3,984 | 74.50 (+4.48) | 3,700 (-59.5%) |
| ShorterBetter | – | – | 53.3 | 5,288 | 75.9 | 2,580 | – | – | – | – | – | – |
| S-GRPO | 93.8 | 906 | 56.0 | 7,377 | 87.5 | 3,494 | 92.4 | 2,252 | 50.8 | 3,751 | 76.10 (+6.08) | 3,556 (-61.1%) |
| DeepSeek-R1-Distill-Qwen-14B | | | | | | | | | | | | |
| Vanilla | 94.2 | 2,129 | 64.4 | 11,099 | 90.5 | 5,527 | 93.5 | 3,844 | 59.2 | 6,034 | 80.36 | 5,727 |
| DEER | 93.3 | 982 | 70.0 | 10,335 | 90.0 | 4,349 | 91.4 | 2,753 | 57.1 | 4,767 | 80.36 (+0.0) | 4,637 (-19.0%) |
| GRPO | 95.3 | 2,120 | 65.8 | 13,504 | 91.9 | 6,595 | 84.0 | 4,471 | 58.9 | 7,354 | 79.18 (-1.18) | 6,809 (+18.9%) |
| RL + Length Penalty | 94.7 | 775 | 55.0 | 7,950 | 88.1 | 3,396 | 92.4 | 1,993 | 56.0 | 4,380 | 77.24 (-3.12) | 3,699 (-34%) |
| S-GRPO | 96.2 | 724 | 64.4 | 6,712 | 91.9 | 3,352 | 93.6 | 2,146 | 59.3 | 3,334 | 81.08 (+0.72) | 3,253 (-35.4%) |
| Qwen3-8B | | | | | | | | | | | | |
| Vanilla | 95.4 | 2,370 | 74.1 | 15,326 | 91.3 | 9,452 | 93.4 | 5,577 | 55.6 | 8,741 | 81.90 | 8,293 |
| DEER | 95.5 | 981 | 76.7 | 11,287 | 95.0 | 6,198 | 93.4 | 3,208 | 52.5 | 3,104 | 82.62 (+0.72) | 4,956 (-40.2%) |
| GRPO | 95.8 | 2,355 | 72.7 | 15,154 | 92.8 | 8,983 | 94.4 | 5,440 | 55.8 | 8,819 | 82.30 (+0.4) | 8,150 (-1.7%) |
| RL + Length Penalty | 95.4 | 1,323 | 73.8 | 9,666 | 93.4 | 5,042 | 94.2 | 3,247 | 56.2 | 5,293 | 82.60 (+0.7) | 4,914 (-40.7%) |
| S-GRPO | 96.1 | 1,292 | 77.3 | 8,810 | 95.0 | 5,962 | 95.2 | 3,166 | 57.7 | 5,271 | 84.26 (+2.36) | 4,922 (-40.6%) |
| Qwen3-14B | | | | | | | | | | | | |
| Vanilla | 95.5 | 1,909 | 75.4 | 14,116 | 96.9 | 7,576 | 95.2 | 5,078 | 58.8 | 7,576 | 84.36 | 7,251 |
| DEER | 95.5 | 908 | 76.7 | 10,333 | 95.0 | 5,099 | 94.8 | 2,987 | 57.1 | 2,435 | 83.82 (-0.54) | 4,352 (-40.0%) |
| GRPO | 96.1 | 1,956 | 77.7 | 14,544 | 98.4 | 8,000 | 95.8 | 5,140 | 59.3 | 7,966 | 85.46 (+1.1) | 7,521 (+3.7%) |
| RL + Length Penalty | 95.8 | 1,090 | 74.8 | 9,056 | 96.6 | 5,059 | 95.8 | 2,866 | 59.4 | 4,949 | 84.46 (+0.1) | 4,604 (-36.5%) |
| S-GRPO | 96.3 | 952 | 77.9 | 8,932 | 97.8 | 4,537 | 96.4 | 2,652 | 60.6 | 4,537 | 85.80 (+1.14) | 4,322 (-40.4%) |

Key Observations:

  • Synergistic Improvement: S-GRPO is the only method that consistently achieves the best of both worlds: it improves accuracy while simultaneously making the largest or near-largest reductions in token count. For example, on DeepSeek-R1-Distill-Qwen-7B, S-GRPO boosts overall accuracy by an absolute 6.08% over Vanilla while cutting token usage by a massive 61.1%.
  • Comparison with GRPO: GRPO, which focuses only on outcome, slightly improves accuracy but at the cost of increasing token count in most cases. This clearly demonstrates the "overthinking" problem that S-GRPO is designed to solve. S-GRPO achieves comparable or better accuracy than GRPO with significantly fewer tokens.
  • Comparison with Other Efficiency Methods: Methods like RL + Length Penalty and ShorterBetter successfully reduce token count but often at the expense of accuracy (e.g., RL + Length Penalty drops accuracy by 3.12% on DeepSeek-14B). DEER and ConCISE show mixed results, sometimes improving accuracy on simpler tasks but failing or even hurting performance on more complex ones. S-GRPO stands out by delivering robust gains across both simple (GSM8K) and complex (AIME, GPQA) tasks.
  • Scalability: The benefits of S-GRPO hold across different model sizes (7B to 14B) and families (DeepSeek-distill to Qwen3), indicating that it is a broadly applicable and scalable technique.

6.2. Performance with Different Token Budgets

Figure 3 explores how model performance changes under different maximum generation length constraints.

Figure 3: Performance of DeepSeek-R1-Distill-Qwen-7B and Qwen3-8B without or with S-GRPO training on GSM8K and AIME 2024 under different generation-length budgets.

Analysis:

  • Low Budget Dominance: Under tight token budgets (e.g., max length of 1000-2000), S-GRPO-trained models achieve significantly higher accuracy than their vanilla counterparts. This shows that S-GRPO teaches the models to generate concise and correct reasoning paths, making them highly effective even when computational resources are limited.
  • High Budget Efficiency: With generous token budgets, the vanilla models' accuracy improves but they generate very long sequences. In contrast, the S-GRPO models achieve slightly better accuracy while naturally generating much shorter sequences. They have learned to stop once reasoning is sufficient and do not use the extra budget unnecessarily.
  • Robustness: The performance curves for S-GRPO are generally smoother than for the vanilla models, indicating that its performance is more stable and less sensitive to the specific length budget imposed at inference time.

6.3. Ablation Studies

The ablation study in Table 2 investigates the contribution of each key component of S-GRPO on the Qwen3-8B model.

The following are the results from Table 2 of the original paper:

| Method | GSM8K Acc | GSM8K Tokens | AIME 2024 Acc | AIME 2024 Tokens | AMC 2023 Acc | AMC 2023 Tokens | MATH-500 Acc | MATH-500 Tokens | GPQA Acc | GPQA Tokens | Overall Acc | Overall Tokens |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3-8B | | | | | | | | | | | | |
| S-GRPO | 96.1 | 1,292 | 77.3 | 8,810 | 95.0 | 5,962 | 95.2 | 3,166 | 57.7 | 5,271 | 84.26 | 4,922 |
| w/o. Decaying (Shortest 1) | 95.9 | 1,175 | 69.6 | 8,721 | 92.5 | 4,581 | 94.8 | 2,740 | 55.7 | 4,734 | 81.70 (-2.56) | 4,390 (-10.8%) |
| w/o. Decaying (All 1) | 96.0 | 2,385 | 74.4 | 14,940 | 94.7 | 9,000 | 95.0 | 5,614 | 54.9 | 8,955 | 83.00 (-1.26) | 8,179 (+66.2%) |
| w/o. Serial | 95.8 | 2,355 | 72.7 | 15,154 | 92.8 | 8,983 | 94.4 | 5,440 | 55.8 | 8,819 | 82.30 (-1.96) | 8,150 (+65.6%) |

Analysis:

  • w/o. Decaying (Shortest 1): In this setting, only the single shortest correct response gets a reward of 1. This is an overly strict penalty on length. While it does lead to the shortest token counts (-10.8% vs. S-GRPO), it significantly harms overall accuracy (-2.56%), showing that a more nuanced reward is needed.
  • w/o. Decaying (All 1): Here, all correct responses get a reward of 1, regardless of length. This removes the incentive for conciseness. As a result, the token count explodes (+66.2% vs. S-GRPO) to levels similar to the vanilla model, and accuracy also drops. This proves that the decaying reward is essential for encouraging efficiency.
  • - w/o. Serial: This setting removes the serial-group mechanism, effectively degenerating the method into the original GRPO. The results are very similar to the w/o. Decaying (All 1) setting: high token counts and lower accuracy than S-GRPO. This confirms that the serial-group generation is a critical innovation that provides the fine-grained feedback needed for learning efficient reasoning.

6.4. Case Study

Figure 4 provides a compelling qualitative example on a GSM8K problem.

Figure 4: Comparison of a generated content sample on GSM8K.

Analysis:

  • Left (Vanilla CoT): The baseline Qwen3-8B model produces a very long, convoluted reasoning path. It correctly solves the problem but engages in significant "overthinking," taking 111 tokens.
  • Center (Hard Truncation): If we simply cut off the vanilla CoT's reasoning at the same length as the S-GRPO output (48 tokens), the reasoning is incomplete, and the model fails to reach the correct conclusion. This shows that simple truncation is not enough; the model must learn to structure its thoughts concisely.
  • Right (S-GRPO): The S-GRPO-trained model produces a direct, concise, and correct reasoning path. It uses less than half the tokens of the vanilla model (48 vs. 111) while achieving the correct answer. This visually demonstrates that S-GRPO has successfully taught the model to avoid redundant steps and "get to the point."

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces and validates S-GRPO, a novel reinforcement learning paradigm for training efficient and accurate reasoning models. By shifting from the conventional parallel-group comparison to an innovative serial-group approach with a decaying-reward function, S-GRPO directly addresses the "overthinking" problem prevalent in modern LLMs.

The key finding is that it's possible to achieve a synergistic improvement in both efficiency and accuracy. Empirical results across a wide range of models and difficult benchmarks show that S-GRPO can drastically reduce inference cost (by up to 61.1%) while simultaneously improving reasoning accuracy (by up to 6.08%). The method is presented as a practical and effective final optimization stage for post-training reasoning models.

7.2. Limitations & Future Work

The authors do not explicitly state limitations, but we can infer some potential areas for future research:

  • Prompt Sensitivity: The method relies on a hardcoded instructional prompt ("Time is limited, stop thinking...") to induce early exits during training. The effectiveness might be sensitive to the wording of this prompt, and a more adaptive or learned mechanism could be more robust.
  • Sampling Hyperparameters: The performance might depend on the choice of $m$, the number of temporal positions sampled for early exit. The paper uses $m = 8$, but the sensitivity to this parameter is not explored. An optimal, perhaps dynamic, choice of $m$ could further improve results.
  • Generalization to Other Domains: The experiments are focused on mathematical and scientific reasoning. While these are excellent testbeds for CoT, it remains to be seen how well S-GRPO generalizes to other sequential generation tasks like creative writing, summarization, or dialogue, where the notion of a single "correct" answer and "sufficient" thought is more ambiguous.
  • Computational Cost of Training: The method requires an "over-sampling" technique to ensure enough correct early-exit samples for training. This, combined with the multiple rollouts per query to form the serial group, could make the training process computationally expensive compared to standard fine-tuning.

7.3. Personal Insights & Critique

  • Key Insight: The paper's most brilliant contribution is the conceptual shift from inter-path comparison (parallel-group) to intra-path comparison (serial-group). This is a simple yet powerful idea that provides a much more direct learning signal for reasoning efficiency. It reframes the problem from "which path is best?" to "at what point is this path good enough?".
  • Potential for Broader Application: The core principle of S-GRPO—rewarding early sufficiency—is highly transferable. It could be adapted for:
    • Iterative Summarization: A model could learn to stop reading a long document once it has gathered enough information to write a good summary.
    • Efficient Code Generation: A model could learn to write a function after generating just the necessary helper functions, without adding superfluous code.
    • Resource-aware Agents: In agentic systems, this could translate to learning to act with the minimum necessary information gathering, saving API calls or other computational resources.
  • Critique and Areas for Improvement:
    1. Reward Mechanism: The exponential decay ($1/2^{N-1}$) is a strong heuristic. However, a more adaptive reward scheme could be even better. For instance, the reward could be a function of both the exit position and the problem's inherent difficulty, preventing over-penalization on complex problems that genuinely require longer thought.

    2. Binary Correctness: The reward is based on a binary notion of correctness. For problems with partial credit or nuanced answers, a more continuous reward signal could be beneficial.

    3. On-Policy Nature: As an on-policy RL algorithm, S-GRPO can be sample-inefficient. Exploring off-policy variants could potentially reduce the computational cost of training, although the authors argue that the on-policy nature is a strength that preserves the model's abilities.

      Overall, S-GRPO is a significant step forward in making powerful reasoning models practical for real-world deployment, where both accuracy and computational efficiency are paramount. It offers an elegant solution to a pressing problem in the LLM community.
