- Core Results: Training Efficiency and Performance
- Image 1 (figure description): A multi-panel chart comparing GSPO and GRPO across training compute. The main panel shows GSPO consistently ahead of GRPO on training reward, and the three sub-panels show GSPO clearly leading GRPO on AIME'24, LiveCodeBench, and CodeForces, reflecting higher training efficiency and stronger performance.
- Analysis of Image 1: This figure presents the main experimental results. The top plot shows the training reward over time (measured in training compute). The three bottom plots show performance on the AIME'24, LiveCodeBench, and CodeForces benchmarks.
- Key Takeaway: In all four plots, the red line (GSPO) is consistently above the blue line (GRPO). This demonstrates that GSPO achieves both higher final performance and better sample efficiency—it reaches a given performance level with less training compute. The GSPO training curves are also visibly smoother, indicating greater stability.
- Curious Observation on Clipping Fractions
- Image 2 (figure description): A bar chart comparing the clipping fractions of GSPO and GRPO. GSPO's clipping fraction is roughly 0.15, far higher than GRPO's 0.0013, showing that GSPO clips a much larger share of data at the sequence level while still optimizing more stably and effectively during training.
- Analysis of Image 2: This bar chart shows the average fraction of data points (tokens for GRPO, sequences for GSPO) that are clipped during training.
- Key Takeaway: GSPO clips approximately 15% of its samples, while GRPO clips only 0.13%. This is a difference of over two orders of magnitude. The counter-intuitive finding is that GSPO, despite discarding a much larger fraction of its data from the primary gradient update path, learns more efficiently. The authors interpret this as strong evidence that GRPO's token-level gradient estimates are inherently noisy. GSPO's sequence-level clipping acts as a more effective filter, retaining only the most reliable sequences for training, which results in a higher-quality learning signal.
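To make the two clipping schemes concrete, here is a minimal sketch (plain NumPy arrays of per-token log-probabilities and illustrative clip ranges, not the paper's hyperparameters) of how the clipping fraction would be measured under GRPO-style token clipping versus GSPO-style sequence clipping:

```python
import numpy as np

def clipping_fractions(logp_new, logp_old, eps_token=0.2, eps_seq=0.003):
    """Estimate how often PPO-style clipping fires under the two schemes.

    logp_new, logp_old: lists of 1-D arrays, one array of per-token
        log-probabilities per sampled response, under the current and
        old policies respectively.
    eps_token, eps_seq: illustrative clip half-widths (hypothetical values,
        not those used in the paper).
    """
    # GRPO-style: every token carries its own importance ratio and is
    # clipped independently.
    token_ratios = np.concatenate(
        [np.exp(n - o) for n, o in zip(logp_new, logp_old)]
    )
    token_clip_frac = np.mean(
        (token_ratios < 1 - eps_token) | (token_ratios > 1 + eps_token)
    )

    # GSPO-style: one length-normalized ratio per sequence,
    # s_i = exp(mean_t(logp_new - logp_old)); the response is kept or
    # clipped as a whole.
    seq_ratios = np.array(
        [np.exp(np.mean(n - o)) for n, o in zip(logp_new, logp_old)]
    )
    seq_clip_frac = np.mean(
        (seq_ratios < 1 - eps_seq) | (seq_ratios > 1 + eps_seq)
    )
    return token_clip_frac, seq_clip_frac
```

Because the sequence-level ratio concentrates tightly around 1, GSPO can use a much narrower clip range, which is part of why its measured clipping fraction ends up so much higher than GRPO's.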
- Benefit of GSPO for MoE Training
- Image 3 (figure description): A line chart with training compute on the x-axis and training reward on the y-axis, comparing GRPO with and without Routing Replay. With Routing Replay the training reward climbs steadily and remains stable, whereas without it the reward trends downward overall, highlighting the positive effect of Routing Replay on training stability and performance.
- Background on MoE Instability: In MoE models, different inputs activate different "experts." RL updates can shift these activation patterns (or "routing decisions"). With GRPO, this expert-activation volatility causes the token-level importance ratios to fluctuate wildly, because after an update the experts used to compute the numerator (πθ) can differ from those that computed the denominator (πθold), so the two likelihoods effectively come from different sub-networks. This violates the assumptions of importance sampling and leads to training collapse (a toy numerical illustration follows this list).
- GRPO's Workaround: Routing Replay is a technique that forces the updated policy πθ to use the same expert routing paths as the old policy πθold when calculating importance ratios.
- Analysis of Image 3: This figure shows that GRPO with Routing Replay (purple line) converges, while GRPO without it (orange line) diverges, with the training reward collapsing. This proves that Routing Replay is essential for GRPO to work on MoE models.
- GSPO's Advantage: GSPO does not need this workaround. Because it relies on the overall sequence likelihood, it is robust to small changes in underlying expert activations. The sequence likelihood remains relatively stable even if the routing paths change slightly. This simplifies the training process, reduces memory and communication overhead, and allows the MoE model to learn and adapt its routing decisions freely.
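Here is the promised toy numerical sketch of this robustness argument (simulated log-probabilities, not the paper's measurements): a handful of tokens whose experts get re-routed can push their token-level ratios far outside any reasonable clip range, while GSPO's length-normalized sequence ratio barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 512  # response length (illustrative)

# Per-token log-probs of one sampled response under the old policy, and
# under the updated policy where most tokens barely move...
logp_old = rng.normal(-2.0, 0.5, size=T)
logp_new = logp_old + rng.normal(0.0, 0.01, size=T)

# ...but a few tokens get routed to different experts after the update,
# shifting their likelihoods sharply (a stand-in for expert-activation
# volatility in an MoE model).
rerouted = rng.choice(T, size=5, replace=False)
logp_new[rerouted] += rng.normal(0.0, 1.5, size=5)
logp_new = np.minimum(logp_new, 0.0)  # keep them valid log-probabilities

# GRPO-style token-level ratios vs. GSPO-style length-normalized sequence ratio
token_ratios = np.exp(logp_new - logp_old)
seq_ratio = np.exp(np.mean(logp_new - logp_old))

print("extreme token ratios:", token_ratios.min(), token_ratios.max())
print("sequence-level ratio:", seq_ratio)  # stays very close to 1
```

Because a flipped routing decision only perturbs a few terms of the sequence-level average, the sequence likelihood stays stable, which is why GSPO can drop Routing Replay altogether.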
- Benefit of GSPO for RL Infrastructure: The paper suggests that because sequence-level likelihoods are more robust to minor numerical-precision differences than token-level likelihoods, GSPO may allow developers to directly use the likelihoods computed by an efficient inference engine (like vLLM) for training. This would eliminate the need for a separate, resource-intensive recomputation step using the training engine, streamlining the entire RL pipeline.
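As a rough illustration of that proposed simplification (hypothetical variable names, not any specific inference engine's API): the old-policy term of GSPO's sequence ratio could be read straight off the per-token log-probabilities the rollout engine already reports at generation time, instead of being recomputed by the training engine.

```python
import math

def sequence_ratio_from_rollout(rollout_token_logps, train_token_logps):
    """GSPO-style length-normalized sequence importance ratio where the
    old-policy term comes directly from the inference engine's rollout record.

    rollout_token_logps: per-token log-probs reported by the inference engine
        when the response was sampled (standing in for pi_theta_old; in the
        conventional pipeline this would be recomputed by the training engine).
    train_token_logps: per-token log-probs of the same response under the
        current policy, computed by the training engine (pi_theta).
    """
    assert len(rollout_token_logps) == len(train_token_logps)
    length = len(rollout_token_logps)
    log_ratio = (sum(train_token_logps) - sum(rollout_token_logps)) / length
    return math.exp(log_ratio)
```

Token-level ratios would amplify any numerical mismatch between the two engines at individual positions; averaging over the whole sequence washes much of that mismatch out, which is what makes this shortcut plausible.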