ASPO: Asymmetric Importance Sampling Policy Optimization
TL;DR Summary
ASPO resolves a critical flaw in LLM Reinforcement Learning by flipping Importance Sampling ratios for positive-advantage tokens, coupled with dual-clipping. This method significantly improves training stability, mitigates premature convergence, and enhances performance on coding and mathematical reasoning benchmarks.
Abstract
Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.
Analysis
1. Bibliographic Information
- Title: ASPO: Asymmetric Importance Sampling Policy Optimization
- Authors: Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai.
- Affiliations: The authors are from Kuaishou Technology and Tsinghua University.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic papers. This means it has not yet undergone formal peer review for a conference or journal, but it is a common way to disseminate research quickly in the AI community.
- Publication Year: 2025. The arXiv identifier (2510.06062) corresponds to an October 2025 submission, consistent with the 2025 works the paper cites.
- Abstract: The paper identifies a critical flaw in recent Reinforcement Learning (RL) methods used for training Large Language Models (LLMs), specifically in the Outcome-Supervised RL (OSRL) paradigm. The authors argue that the Importance Sampling (IS) ratio, a key component of these methods, creates an unbalanced weighting for tokens with positive and negative advantages. This mismatch over-rewards high-probability tokens and penalizes low-probability ones, leading to poor training dynamics like premature convergence. To fix this, they propose Asymmetric Importance Sampling Policy Optimization (ASPO), a method that flips the IS ratio for positive-advantage tokens and uses a soft dual-clipping mechanism for stability. Experiments on coding and math benchmarks show ASPO significantly improves training stability and final model performance over strong baselines.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2510.06062
- PDF Link: http://arxiv.org/pdf/2510.06062v1
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern methods for improving LLMs after initial training often use Reinforcement Learning (RL), particularly a paradigm called Outcome-Supervised RL (OSRL). A popular algorithm in this space is Group Relative Policy Optimization (GRPO), which is based on Proximal Policy Optimization (PPO). The paper argues that GRPO and similar methods have a hidden flaw in how they use Importance Sampling (IS) ratios.
- The Gap: In OSRL, all tokens in a generated response (e.g., the solution to a math problem) receive the same reward (correct or incorrect). The IS ratio, which is meant to correct for differences between the current and old model policies, ends up acting as a token-level learning weight. The paper discovers a critical "mismatch":
- For negative-advantage tokens (from incorrect responses), the weighting works as expected.
- For positive-advantage tokens (from correct responses), the weighting is counterproductive. It gives larger updates to tokens the model is already confident about (high probability) and smaller updates to tokens it struggles with (low probability).
- Innovation: This "weight mismatch" causes the model to overfit on what it already knows, leading to a collapse in creativity (entropy collapse) and getting stuck in suboptimal solutions. The paper's key innovation is to identify and correct this fundamental flaw by proposing a simple yet effective modification to the weighting scheme.
- Main Contributions / Findings (What):
- Identified a Flaw: The paper is the first to identify and analyze the IS ratio mismatch for positive-advantage tokens in GRPO-based OSRL methods.
- Proposed ASPO: They introduce Asymmetric Importance Sampling Policy Optimization (ASPO). Its core mechanism is to flip the IS ratio for positive-advantage tokens (using the reciprocal). This ensures that "good" but low-probability tokens receive stronger updates, promoting more balanced learning.
- Enhanced Stability: ASPO includes a soft dual-clipping mechanism to prevent extreme updates that might arise from the flipped ratio, ensuring training remains stable.
- Demonstrated Superior Performance: Through comprehensive experiments on challenging mathematical reasoning and coding benchmarks, ASPO is shown to mitigate entropy collapse, improve training stability, and achieve significantly better final performance than strong existing methods like DAPO.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In the context of LLMs, the "agent" is the LLM, the "action" is generating a token, and the "reward" is based on the quality of the final generated text.
- Policy: In RL, the policy ($\pi_\theta$) is the agent's strategy. For an LLM, the policy is the probability distribution over the next token given the preceding text.
- Advantage ($A$): A measure of how much better a specific action (generating a token) is compared to the average action in a given state. A positive advantage means the action was better than average; a negative one means it was worse.
- Outcome-Supervised RL (OSRL): A specific type of RL for LLMs where the reward is given only at the end based on the final outcome (e.g., the math problem's solution is 100% correct or not). All tokens in the response share this same final reward.
- Importance Sampling (IS): A technique used in off-policy RL to reuse data generated by an old policy ($\pi_{\theta_{\text{old}}}$) to train the current policy ($\pi_\theta$). It does this by weighting the rewards with the IS ratio $r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$, where $q$ is the prompt and $o_{<t}$ the tokens generated so far. An IS ratio greater than 1 means the current policy is more likely to generate that token than the old policy.
- Proximal Policy Optimization (PPO): A popular RL algorithm known for its stability. Its key feature is the PPO-Clip mechanism, which "clips" the IS ratio to prevent the new policy from straying too far from the old one in a single update, helping stabilize training.
- Group Relative Policy Optimization (GRPO): An adaptation of PPO for LLMs. It generates a group of responses for each prompt, and the reward of each response is calculated relative to the others in the group (e.g., by normalizing the scores). This is the baseline algorithm that ASPO aims to improve.
- Previous Works:
- PPO-Clip (Schulman et al., 2017): The origin of the clipping mechanism that GRPO and ASPO build upon.
- GRPO (Shao et al., 2024): The widely adopted OSRL algorithm that established the effectiveness of this paradigm for LLM reasoning tasks. ASPO is a direct modification of GRPO.
- CISPO (Chen et al., 2025): Identified that PPO's "hard" clipping masks gradients entirely for clipped tokens. It proposed a "soft" clipping that clips the update magnitude but preserves the gradient direction, allowing the model to continue learning from these tokens.
- GSPO (Zheng et al., 2025): Argued that token-level clipping is mismatched with sequence-level rewards in OSRL and proposed using a sequence-level IS ratio instead.
- Differentiation: While previous works like CISPO and GSPO focused on how to clip (soft vs. hard) or at what level to clip (token vs. sequence), ASPO addresses a more fundamental issue: what the IS ratio itself represents. ASPO argues that the ratio acts as a token-level learning weight and that its standard formulation is flawed for positive-advantage tokens. The core innovation is flipping the ratio for these specific tokens, a change no prior work had proposed. A minimal sketch of the baseline GRPO objective that ASPO modifies is given below.
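To make the baseline concrete, here is a minimal PyTorch sketch of the token-level clipped surrogate that GRPO inherits from PPO-Clip. Tensor names, shapes, and the `eps` value are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO/GRPO clipped surrogate (illustrative sketch).

    logp_new:   log-probs of sampled tokens under the current policy (with grad)
    logp_old:   log-probs of the same tokens under the rollout (old) policy
    advantages: group-normalized advantage per token; in OSRL every token of a
                response shares that response's outcome-based advantage
    """
    ratio = torch.exp(logp_new - logp_old)                   # token-level IS ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO-Clip keeps the pessimistic (minimum) objective; negate it to get a loss.
    return -torch.minimum(unclipped, clipped).mean()
```

Because all tokens of a response share one outcome-based advantage, the ratio is effectively the only per-token weight, which is exactly the quantity ASPO re-examines.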
4. Methodology (Core Technology & Implementation)
The paper's methodology unfolds in three parts: first, it provides evidence that IS is not serving its intended purpose; second, it diagnoses the exact problem with IS weighting; and third, it presents ASPO as the solution.
4.1. The Problem: IS is a Misguided Weight, Not a Corrector
The authors first question the role of IS in OSRL. Since the reward assigned to each token is already an inaccurate, coarse signal (the same for all tokens in a response), they ask: does applying a fine-grained IS correction make sense?
- Experiment: They compare standard GRPO with a variant where the IS ratio is fixed to 1.0 (GRPO w/o IS); a minimal sketch of this ablation appears at the end of this subsection.
- Analysis (Image 1): [Image 1: six line charts comparing GRPO with its variant without importance sampling across training steps, covering LiveCodeBench v5 score, entropy, logits, repetition rate, clip ratio, and KL divergence.]
As shown in Image 1, removing IS (GRPO w/o IS, orange line) does not harm final performance (a) and leads to more stable training dynamics: entropy (b) drops more slowly, and metrics such as repetition rate (d), clip ratio (e), and KL loss (f) grow more gradually. This suggests the IS mechanism in GRPO is not acting as a necessary distribution corrector but rather as a token-level weight that contributes to instability. They also found that positive samples (from correct responses) tend to have higher average IS ratios than negative samples (see Image 2), which accelerates the fitting of positive samples and leads to faster entropy collapse.
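As referenced above, a minimal sketch of the "GRPO w/o IS" ablation: fixing the ratio to 1 reduces the surrogate to a plain policy-gradient term weighted only by the shared, response-level advantage (names are illustrative assumptions).

```python
import torch

def grpo_no_is_token_loss(logp_new, advantages):
    """'GRPO w/o IS' ablation sketch: the token-level IS ratio is fixed to 1.0,
    so each token is weighted only by its response-level advantage (and the
    clipping in the full objective becomes a no-op)."""
    return -(advantages * logp_new).mean()
```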
4.2. The Diagnosis: Weight Misallocation for Positive Tokens
The paper reframes the IS ratio not as a statistical correction term but as a token-level training weight. An ideal weighting scheme should give larger weights to tokens the model is struggling with (low probability) to help them improve.
- The Flaw Visualized (Image 3): [Image 3: (a) a 3D visualization of the IS ratio over the (old probability, current probability) plane, distinguishing updated from masked regions; (b) and (c) planar views of the IS ratio as a function of old and current probability for positive and negative advantages, marking the original, lowered, raised, and dual-clipped regions and how different clip ratios shift the region boundaries.]
Image 3 visualizes the IS ratio ($\pi_\theta / \pi_{\theta_{\text{old}}}$) as a weight.
- For Negative-Advantage Tokens (c): The weighting is reasonable. A high-probability "bad" token gets a large weight, leading to a strong penalty that reduces its probability.
- For Positive-Advantage Tokens (b): The weighting is mismatched. A "good" token that already has a high probability (top-left region) gets an even larger weight, causing overfitting, while a "good" token with a low probability (bottom-right region) gets a tiny weight, suppressing its learning.
- Experimental Validation (Image 4): To test this hypothesis, they ran an experiment in which the problematic token-level IS ratios for positive samples were replaced with a smoothed, response-level average; a minimal sketch of this smoothing follows at the end of this subsection.
[Image 4: line charts comparing DAPO with DAPO w/ Pos Response-Level IS Mean over training: (a) LCB v5 avg@8, (b) LCB v5 pass@8, (c) entropy, (d) repetition rate, (e) clip ratio, (f) KL loss. DAPO fluctuates more on the latter four metrics, while the response-level IS mean variant is noticeably more stable.]
As seen in Image 4, this simple change (DAPO w/ Pos Response-Level IS Mean, orange line) significantly stabilizes training dynamics (c, d, e, f) and even improves performance on the pass@8 metric (b), confirming that the original token-level weighting for positive samples is indeed harmful.
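A minimal sketch of that smoothing, assuming a flat token layout with a per-token response index; the tensor names and grouping scheme are assumptions, not the authors' code.

```python
import torch

def smooth_positive_is(ratio, advantages, response_ids):
    """Replace token-level IS ratios of positive-advantage responses with the
    mean ratio of their response ('Pos Response-Level IS Mean' sketch)."""
    smoothed = ratio.clone()
    for rid in response_ids.unique():
        mask = response_ids == rid
        # OSRL: all tokens of a response share the same outcome-based advantage.
        if advantages[mask][0] > 0:
            smoothed[mask] = ratio[mask].mean()
    return smoothed
```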
4.3. The Solution: Asymmetric Importance Sampling (ASPO)
Based on the analysis, the authors propose ASPO, which corrects the weight mismatch with a three-step process.
- Step 1: Token Masking (Hard Clipping): This is the standard PPO-Clip mechanism. It prevents overly aggressive updates by masking the gradients of tokens whose IS ratios are already far from 1.0 in the desired update direction.
- Step 2: Weight Flipping (The Core Innovation): This is the central idea of ASPO. The IS ratio, now denoted $\hat{r}_t(\theta)$, is computed asymmetrically:
- For tokens with negative advantage ($\hat{A}_t < 0$): the ratio remains the same as in GRPO, $\hat{r}_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$.
- For tokens with positive advantage ($\hat{A}_t > 0$): the ratio is flipped (i.e., the reciprocal is used), $\hat{r}_t(\theta) = \frac{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}$.
The paper implements the flip with a stop-gradient operation, sg(·), applied to the denominator. This prevents the denominator from affecting the gradient calculation, so the gradient behaves as if the token were simply weighted by $\frac{\pi_{\theta_{\text{old}}}}{\pi_\theta}$ rather than having its update direction reversed, which a naive reciprocal would cause. This flip ensures low-probability "good" tokens get high weights and high-probability "good" tokens get low weights, correcting the mismatch.
- Step 3: Dual Clipping (Soft Clipping): Flipping the ratio for positive tokens introduces a new risk: if a token's current probability is near zero, its flipped weight could become huge and destabilize training. To prevent this, ASPO applies a dual-clip (an upper bound on the weight) to positive-advantage tokens. The clipping is done in a soft manner (à la CISPO), which constrains the magnitude of the update but preserves the gradient, allowing these lagging tokens to still contribute to learning. A minimal sketch combining the three steps is given below.
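Putting the three steps together, here is a minimal PyTorch sketch of the asymmetric weighting as described above. It realizes the stop-gradient trick with a detached, CISPO-style weight multiplied by the token log-probability; the clip thresholds, tensor layout, and exact formulation are assumptions rather than the paper's implementation.

```python
import torch

def aspo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.2, dual_clip=3.0):
    """Sketch of ASPO's asymmetric importance weighting (illustrative only)."""
    ratio = torch.exp(logp_new - logp_old).detach()   # IS ratio, used as a weight

    # Step 1: hard token masking (standard PPO-Clip): drop tokens whose ratio
    # is already far from 1.0 in the desired update direction.
    keep = torch.where(advantages > 0,
                       ratio <= 1.0 + eps_high,
                       ratio >= 1.0 - eps_low).float()

    # Step 2: weight flipping: positive-advantage tokens use the reciprocal
    # ratio, so low-probability "good" tokens receive the larger weights.
    weight = torch.where(advantages > 0, 1.0 / ratio, ratio)

    # Step 3: soft dual clip: cap the flipped weight, which explodes when the
    # current probability is tiny. Because the weight is detached, the clip
    # bounds the update magnitude while the gradient of logp_new keeps flowing.
    weight = torch.where(advantages > 0, weight.clamp(max=dual_clip), weight)

    # Per-token gradient is proportional to weight * advantage * grad log pi.
    return -(keep * weight * advantages * logp_new).mean()
```

For negative-advantage tokens this is, to first order, the same update as GRPO; for positive-advantage tokens the effective weight becomes $\pi_{\theta_{\text{old}}}/\pi_\theta$, matching the gradient analysis in Section 4.4.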
4.4. Gradient Analysis
The per-token gradient of the GRPO objective is proportional to the IS ratio:
$$\nabla_\theta J_t^{\text{GRPO}} \propto \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$
The gradient for ASPO's positive-advantage tokens is proportional to the flipped ratio:
$$\nabla_\theta J_t^{\text{ASPO}} \propto \frac{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$
Substituting $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$ and abbreviating $\pi_\theta = \pi_\theta(o_t \mid q, o_{<t})$ and $\pi_{\theta_{\text{old}}} = \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})$, the gradient update term for ASPO becomes proportional to $\frac{\pi_{\theta_{\text{old}}}}{\pi_\theta^{2}}\, \hat{A}_t\, \nabla_\theta \pi_\theta$. This explicitly shows that the gradient update is inversely proportional to the token's current probability, giving larger updates to less confident tokens.
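A toy autograd check of this scaling (all numbers are made up): two positive-advantage tokens, one the model is already confident about and one it struggles with, receive opposite relative gradient magnitudes under the standard and flipped weights.

```python
import torch

# Two positive-advantage tokens: confident (p = 0.9) and struggling (p = 0.1),
# both with old-policy probability 0.5 and advantage +1.
logp = torch.log(torch.tensor([0.9, 0.1]))
logp.requires_grad_(True)
p, p_old, adv = logp.exp(), torch.tensor([0.5, 0.5]), torch.tensor([1.0, 1.0])

grpo_w = (p / p_old).detach()      # standard IS weight: [1.8, 0.2]
aspo_w = (p_old / p).detach()      # flipped weight:     [0.56, 5.0]

(-(grpo_w * adv * logp).sum()).backward()
print(logp.grad)                   # ~[-1.8, -0.2]: the confident token dominates
logp.grad = None
(-(aspo_w * adv * logp).sum()).backward()
print(logp.grad)                   # ~[-0.56, -5.0]: the struggling token dominates
```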
5. Experimental Setup
- Datasets:
- Mathematical Reasoning: A mixture of challenging datasets including AIME24/25, AMC23, MATH-500, Minerva Math, and OlympiadBench.
- Coding: LiveCodeBench (v5 and v6), a benchmark that evaluates models on recent competitive programming problems, alongside datasets like CodeContests and CodeForces.
- Evaluation Metrics:
- pass@K: The probability that at least one of K generated solutions to a problem is correct. It measures the model's ability to find a correct answer given multiple attempts.
- avg@K: The average reward over K generated solutions. It measures the overall quality of the model's generations (a minimal sketch of both metrics is given at the end of this section).
- Baselines:
- DeepSeek-R1-Distill-Qwen-1.5B: The base model before RL fine-tuning.
- DAPO: A strong, open-source implementation of a GRPO-based OSRL system, serving as the primary baseline.
- Other strong 1.5B models in the field: DeepScaleR-1.5B, DeepCoder-1.5B, FastCuRL-1.5B-V3, and Nemotron-1.5B.
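For reference, a minimal sketch of the simple empirical versions of these metrics, computed from a boolean correctness matrix; published results often use the unbiased pass@k estimator when more samples than K are drawn, so treat this as an approximation.

```python
import numpy as np

def avg_at_k(correct):
    """avg@K: mean per-sample score over K attempts (correct: bool [problems, K])."""
    return correct.mean()

def pass_at_k(correct):
    """pass@K: fraction of problems solved by at least one of the K attempts."""
    return correct.any(axis=1).mean()

# Example: 3 problems, 4 samples each.
correct = np.array([[0, 1, 0, 0],
                    [0, 0, 0, 0],
                    [1, 1, 1, 0]], dtype=bool)
print(avg_at_k(correct))   # ~0.33
print(pass_at_k(correct))  # ~0.67
```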
6. Results & Analysis
- Core Results:
Table 1: Mathematical Benchmarks

| Method | AIME24 avg@64 | AIME24 pass@64 | AIME25 avg@64 | AIME25 pass@64 | AMC23 avg@64 | AMC23 pass@64 | MATH-500 avg@4 | MATH-500 pass@4 | Minerva avg@8 | Minerva pass@8 | Olympiad avg@4 | Olympiad pass@4 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-1.5B | 30.6 | 80.0 | 23.5 | 63.3 | 70.7 | 100.0 | 83.6 | 92.4 | 27.6 | 48.2 | 44.6 | 59.4 | 46.8 |
| DAPO | 42.1 | 80.0 | 28.6 | 56.7 | 80.3 | 97.5 | 87.6 | 94.6 | 29.2 | 46.3 | 53.2 | 65.8 | 53.5 |
| DeepScaleR-1.5B | 42.0 | 83.3 | 29.0 | 63.3 | 81.3 | 100.0 | 87.7 | 93.6 | 30.3 | 51.1 | 50.7 | 61.0 | 53.5 |
| FastCuRL-1.5B-V3 | 48.1 | 80.0 | 32.7 | 60.0 | 86.4 | 95.0 | 89.8 | 94.0 | 33.6 | 50.0 | 55.3 | 64.3 | 57.7 |
| Nemotron-1.5B | 48.0 | 76.7 | 33.1 | 60.0 | 86.1 | 97.5 | 90.6 | 93.6 | 35.3 | 47.8 | 59.2 | 66.8 | 58.7 |
| ASPO-Math-1.5B | 49.0 | 80.0 | 35.1 | 70.0 | 87.2 | 95.0 | 90.5 | 94.4 | 35.1 | 50.4 | 58.8 | 66.9 | 59.3 |

Table 2: Code Benchmarks (LCB v5: 2024.08.01-2025.02.01; LCB v6: 2025.02.01-2025.05.01)

| Method | LCB v5 avg@8 | LCB v5 pass@8 | LCB v6 avg@16 | LCB v6 pass@16 | Avg. |
|---|---|---|---|---|---|
| DeepSeek-R1-1.5B | 16.7 | 29.0 | 17.2 | 34.4 | 17.0 |
| DAPO | 26.0 | 40.5 | 27.6 | 43.5 | 26.8 |
| DeepCoder-1.5B | 23.3 | 39.1 | 22.6 | 42.0 | 23.0 |
| Nemotron-1.5B | 26.1 | 35.5 | 29.5 | 42.8 | 27.8 |
| ASPO-Code-1.5B | 31.5 | 47.0 | 30.5 | 46.0 | 31.0 |

The results in both tables show that ASPO consistently outperforms the base model and all strong baselines, including DAPO, across both mathematical and coding domains. The average-score improvements are substantial, demonstrating the effectiveness of the proposed method.
- Training Dynamics Analysis (Image 5): [Image 5: training curves comparing DAPO, DAPO w/ Pos Response-Level IS Mean, and ASPO: (a) LCB v5 avg@8, (b) LCB v5 pass@8, (c) entropy, (d) repetition rate, (e) clip ratio, (f) KL loss. ASPO achieves the best scores, retains higher entropy, and keeps repetition, clip ratio, and KL loss lowest and most stable.]
Image 5 provides a clear comparison of training dynamics.
- Performance (a, b): While ASPO (green line) starts slightly slower than DAPO, it avoids stagnating and ultimately achieves a much higher final performance.
- Entropy (c): DAPO's entropy collapses rapidly, indicating the model is becoming overly deterministic and getting stuck in a local optimum. ASPO's entropy decreases much more gradually and stabilizes at a higher level, which is characteristic of "healthy convergence" and allows for continued exploration and learning.
- Stability (d, e, f): ASPO shows significantly lower and more stable repetition rates, clip ratios, and KL loss throughout training, confirming that it mitigates the unstable, self-reinforcing updates caused by the weight mismatch in DAPO.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies a fundamental flaw (the IS ratio mismatch for positive-advantage tokens) in widely used GRPO-based OSRL methods for LLMs. It proposes ASPO, an elegant solution that flips the IS ratios for these tokens and incorporates a soft dual-clip for stability. This correction leads to more stable training, mitigates entropy collapse, and results in state-of-the-art performance on difficult reasoning and coding tasks.
- Limitations & Future Work: The authors acknowledge two main limitations:
- The experiments were conducted on 1.5B-scale models. Further research is needed to see if the findings generalize to much larger models.
- The conclusions are specific to GRPO-based OSRL. It remains an open question whether the observation that "importance sampling is not important" holds for other RL paradigms, such as those with more fine-grained, token-level rewards (process supervision).
- Personal Insights & Critique:
- Novelty and Impact: The paper's core insight is simple, powerful, and very well-supported by evidence. It highlights the importance of re-examining foundational algorithmic components when applying them to new domains like LLM fine-tuning. The discovery that IS acts as a harmful weight rather than a helpful corrector is a significant contribution to the field.
- Clarity and Rigor: The paper is exceptionally well-structured. It builds its case logically, using controlled experiments at each step to motivate and validate its claims. The visualizations are clear and effectively communicate the core problem and solution.
- Future Implications: ASPO provides a practical and easy-to-implement improvement over existing methods. It is likely to be widely adopted by researchers and practitioners working on RL for LLMs. The paper also encourages a more critical look at other "standard" RL components that may not behave as expected in the context of LLMs. It opens up new avenues for research into more suitable weighting and update schemes for OSRL.