- Core Results: Training Efficiency and Performance
- Image 1 (figure description): A multi-panel chart comparing GSPO and GRPO across training compute. The main panel shows GSPO consistently ahead of GRPO on training reward, and the three sub-panels show GSPO clearly leading GRPO on AIME'24, LiveCodeBench, and CodeForces, reflecting higher training efficiency and stronger performance.
- Analysis of Image 1: This figure presents the main experimental results. The top plot shows the training reward over time (measured in training compute). The three bottom plots show performance on the AIME'24, LiveCodeBench, and CodeForces benchmarks.
- Key Takeaway: In all four plots, the red line (GSPO) is consistently above the blue line (GRPO). This demonstrates that GSPO achieves both higher final performance and better sample efficiency—it reaches a given performance level with less training compute. The GSPO training curves are also visibly smoother, indicating greater stability.
- Curious Observation on Clipping Fractions
- Image 2 (figure description): A bar chart comparing the clipping fractions of GSPO and GRPO. GSPO's clipping fraction is roughly 0.15, far higher than GRPO's 0.0013, showing that GSPO clips a much larger share of data at the sequence level while still optimizing more stably and effectively during training.
- Analysis of Image 2: This bar chart shows the average fraction of data points (tokens for GRPO, sequences for GSPO) that are clipped during training.
- Key Takeaway: GSPO clips approximately 15% of its samples, while GRPO clips only 0.13%. This is a difference of over two orders of magnitude. The counter-intuitive finding is that GSPO, despite discarding a much larger fraction of its data from the primary gradient update path, learns more efficiently. The authors interpret this as strong evidence that GRPO's token-level gradient estimates are inherently noisy. GSPO's sequence-level clipping acts as a more effective filter, retaining only the most reliable sequences for training, which results in a higher-quality learning signal.
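To make the two clipping schemes concrete, here is a minimal sketch (plain NumPy arrays of per-token log-probabilities and illustrative clip ranges, not the paper's hyperparameters) of how the clipping fraction would be measured under GRPO-style token clipping versus GSPO-style sequence clipping:

```python
import numpy as np

def clipping_fractions(logp_new, logp_old, eps_token=0.2, eps_seq=0.003):
    """Estimate how often PPO-style clipping fires under the two schemes.

    logp_new, logp_old: lists of 1-D arrays, one array of per-token
        log-probabilities per sampled response, under the current and
        old policies respectively.
    eps_token, eps_seq: illustrative clip half-widths (hypothetical values,
        not those used in the paper).
    """
    # GRPO-style: every token carries its own importance ratio and is
    # clipped independently.
    token_ratios = np.concatenate(
        [np.exp(n - o) for n, o in zip(logp_new, logp_old)]
    )
    token_clip_frac = np.mean(
        (token_ratios < 1 - eps_token) | (token_ratios > 1 + eps_token)
    )

    # GSPO-style: one length-normalized ratio per sequence,
    # s_i = exp(mean_t(logp_new - logp_old)); the response is kept or
    # clipped as a whole.
    seq_ratios = np.array(
        [np.exp(np.mean(n - o)) for n, o in zip(logp_new, logp_old)]
    )
    seq_clip_frac = np.mean(
        (seq_ratios < 1 - eps_seq) | (seq_ratios > 1 + eps_seq)
    )
    return token_clip_frac, seq_clip_frac
```

Because the sequence-level ratio concentrates tightly around 1, GSPO can use a much narrower clip range, which is part of why its measured clipping fraction ends up so much higher than GRPO's.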
- Benefit of GSPO for MoE Training
- Image 3 (figure description): A line chart with training compute on the x-axis and training reward on the y-axis, comparing GRPO with and without Routing Replay. With Routing Replay the training reward climbs steadily and remains stable, whereas without it the reward trends downward overall, highlighting the positive effect of Routing Replay on training stability and performance.
- Background on MoE Instability: In MoE models, different inputs activate different "experts." RL updates can shift these activation patterns (or "routing decisions"). With GRPO, this expert-activation volatility causes the token-level importance ratios to fluctuate wildly, because after an update the experts used to compute the numerator (πθ) can differ from those that computed the denominator (πθold), so the two likelihoods effectively come from different sub-networks. This violates the assumptions of importance sampling and leads to training collapse (a toy numerical illustration follows this list).
- GRPO's Workaround: Routing Replay is a technique that forces the updated policy πθ to use the same expert routing paths as the old policy πθold when calculating importance ratios.
- Analysis of Image 3: This figure shows that GRPO with Routing Replay (purple line) converges, while GRPO without it (orange line) diverges, with the training reward collapsing. This proves that Routing Replay is essential for GRPO to work on MoE models.
- GSPO's Advantage: GSPO does not need this workaround. Because it relies on the overall sequence likelihood, it is robust to small changes in underlying expert activations. The sequence likelihood remains relatively stable even if the routing paths change slightly. This simplifies the training process, reduces memory and communication overhead, and allows the MoE model to learn and adapt its routing decisions freely.
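Here is the promised toy numerical sketch of this robustness argument (simulated log-probabilities, not the paper's measurements): a handful of tokens whose experts get re-routed can push their token-level ratios far outside any reasonable clip range, while GSPO's length-normalized sequence ratio barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 512  # response length (illustrative)

# Per-token log-probs of one sampled response under the old policy, and
# under the updated policy where most tokens barely move...
logp_old = rng.normal(-2.0, 0.5, size=T)
logp_new = logp_old + rng.normal(0.0, 0.01, size=T)

# ...but a few tokens get routed to different experts after the update,
# shifting their likelihoods sharply (a stand-in for expert-activation
# volatility in an MoE model).
rerouted = rng.choice(T, size=5, replace=False)
logp_new[rerouted] += rng.normal(0.0, 1.5, size=5)
logp_new = np.minimum(logp_new, 0.0)  # keep them valid log-probabilities

# GRPO-style token-level ratios vs. GSPO-style length-normalized sequence ratio
token_ratios = np.exp(logp_new - logp_old)
seq_ratio = np.exp(np.mean(logp_new - logp_old))

print("extreme token ratios:", token_ratios.min(), token_ratios.max())
print("sequence-level ratio:", seq_ratio)  # stays very close to 1
```

Because a flipped routing decision only perturbs a few terms of the sequence-level average, the sequence likelihood stays stable, which is why GSPO can drop Routing Replay altogether.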
- Benefit of GSPO for RL Infrastructure: The paper suggests that because sequence-level likelihoods are more robust to minor numerical-precision differences than token-level likelihoods, GSPO may allow developers to directly use the likelihoods computed by an efficient inference engine (like vLLM) for training. This would eliminate the need for a separate, resource-intensive recomputation step using the training engine, streamlining the entire RL pipeline.
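As a rough illustration of that proposed simplification (hypothetical variable names, not any specific inference engine's API): the old-policy term of GSPO's sequence ratio could be read straight off the per-token log-probabilities the rollout engine already reports at generation time, instead of being recomputed by the training engine.

```python
import math

def sequence_ratio_from_rollout(rollout_token_logps, train_token_logps):
    """GSPO-style length-normalized sequence importance ratio where the
    old-policy term comes directly from the inference engine's rollout record.

    rollout_token_logps: per-token log-probs reported by the inference engine
        when the response was sampled (standing in for pi_theta_old; in the
        conventional pipeline this would be recomputed by the training engine).
    train_token_logps: per-token log-probs of the same response under the
        current policy, computed by the training engine (pi_theta).
    """
    assert len(rollout_token_logps) == len(train_token_logps)
    length = len(rollout_token_logps)
    log_ratio = (sum(train_token_logps) - sum(rollout_token_logps)) / length
    return math.exp(log_ratio)
```

Token-level ratios would amplify any numerical mismatch between the two engines at individual positions; averaging over the whole sequence washes much of that mismatch out, which is what makes this shortcut plausible.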