ASPO: Asymmetric Importance Sampling Policy Optimization
TL;DR Summary
ASPO resolves a critical flaw in LLM Reinforcement Learning by flipping Importance Sampling ratios for positive-advantage tokens, coupled with dual-clipping. This method significantly improves training stability, mitigates premature convergence, and enhances performance on coding and mathematical reasoning benchmarks.
Abstract
Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.
Analysis
1. Bibliographic Information
- Title: ASPO: Asymmetric Importance Sampling Policy Optimization
- Authors: Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, and Kun Gai.
- Affiliations: The authors are from Kuaishou Technology and Tsinghua University.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic papers. This means it has not yet undergone formal peer review for a conference or journal, but it is a common way to disseminate research quickly in the AI community.
- Publication Year: 2025. The arXiv identifier (2510.06062) corresponds to an October 2025 submission, consistent with the 2025 works the paper cites.
- Abstract: The paper identifies a critical flaw in recent Reinforcement Learning (RL) methods used for training Large Language Models (LLMs), specifically in the Outcome-Supervised RL (OSRL) paradigm. The authors argue that the Importance Sampling (IS) ratio, a key component of these methods, creates an unbalanced weighting for tokens with positive and negative advantages. This mismatch over-rewards high-probability tokens and penalizes low-probability ones, leading to poor training dynamics like premature convergence. To fix this, they propose Asymmetric Importance Sampling Policy Optimization (ASPO), a method that flips the IS ratio for positive-advantage tokens and uses a soft dual-clipping mechanism for stability. Experiments on coding and math benchmarks show ASPO significantly improves training stability and final model performance over strong baselines.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2510.06062
- PDF Link: http://arxiv.org/pdf/2510.06062v1
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern methods for improving LLMs after initial training often use Reinforcement Learning (RL), particularly a paradigm called Outcome-Supervised RL (OSRL). A popular algorithm in this space is Group Relative Policy Optimization (GRPO), which is based on Proximal Policy Optimization (PPO). The paper argues that GRPO and similar methods have a hidden flaw in how they use Importance Sampling (IS) ratios.
- The Gap: In OSRL, all tokens in a generated response (e.g., the solution to a math problem) receive the same reward (correct or incorrect). The IS ratio, which is meant to correct for differences between the current and old model policies, ends up acting as a token-level learning weight. The paper discovers a critical "mismatch":
- For negative-advantage tokens (from incorrect responses), the weighting works as expected.
- For positive-advantage tokens (from correct responses), the weighting is counterproductive. It gives larger updates to tokens the model is already confident about (high probability) and smaller updates to tokens it struggles with (low probability).
- Innovation: This "weight mismatch" causes the model to overfit on what it already knows, leading to a collapse in creativity (entropy collapse) and getting stuck in suboptimal solutions. The paper's key innovation is to identify and correct this fundamental flaw by proposing a simple yet effective modification to the weighting scheme.
- Main Contributions / Findings (What):
- Identified a Flaw: The paper is the first to identify and analyze the IS ratio mismatch for positive-advantage tokens in GRPO-based OSRL methods.
- Proposed ASPO: They introduce Asymmetric Importance Sampling Policy Optimization (ASPO). Its core mechanism is to flip the IS ratio for positive-advantage tokens (using the reciprocal). This ensures that "good" but low-probability tokens receive stronger updates, promoting more balanced learning.
- Enhanced Stability: ASPO includes a soft dual-clipping mechanism to prevent extreme updates that might arise from the flipped ratio, ensuring training remains stable.
- Demonstrated Superior Performance: Through comprehensive experiments on challenging mathematical reasoning and coding benchmarks, ASPO is shown to mitigate entropy collapse, improve training stability, and achieve significantly better final performance than strong existing methods like DAPO.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. In the context of LLMs, the "agent" is the LLM, the "action" is generating a token, and the "reward" is based on the quality of the final generated text.
- Policy: In RL, the policy ($\pi_\theta$) is the agent's strategy. For an LLM, the policy is the probability distribution over the next token given the preceding text.
- Advantage ($A$): A measure of how much better a specific action (generating a token) is compared to the average action in a given state. A positive advantage means the action was better than average; a negative one means it was worse.
- Outcome-Supervised RL (OSRL): A specific type of RL for LLMs where the reward is given only at the end based on the final outcome (e.g., the math problem's solution is 100% correct or not). All tokens in the response share this same final reward.
- Importance Sampling (IS): A technique used in off-policy RL to reuse data generated by an old policy ($\pi_{\theta_{\text{old}}}$) to train the current policy ($\pi_\theta$). It does this by weighting the rewards with the IS ratio $r_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$, where $q$ is the prompt and $o_{<t}$ the tokens generated so far. An IS ratio greater than 1 means the current policy is more likely to generate that token than the old policy.
- Proximal Policy Optimization (PPO): A popular RL algorithm known for its stability. Its key feature is the PPO-Clip mechanism, which "clips" the IS ratio to prevent the new policy from straying too far from the old one in a single update, helping stabilize training.
- Group Relative Policy Optimization (GRPO): An adaptation of PPO for LLMs. It generates a group of responses for each prompt, and the reward of each response is calculated relative to the others in the group (e.g., by normalizing the scores). This is the baseline algorithm that ASPO aims to improve.
- Previous Works:
- PPO-Clip (Schulman et al., 2017): The origin of the clipping mechanism that GRPO and ASPO build upon.
- GRPO (Shao et al., 2024): The widely adopted OSRL algorithm that established the effectiveness of this paradigm for LLM reasoning tasks. ASPO is a direct modification of GRPO.
- CISPO (Chen et al., 2025): Identified that PPO's "hard" clipping masks gradients entirely for clipped tokens. It proposed a "soft" clipping that clips the update magnitude but preserves the gradient direction, allowing the model to continue learning from these tokens.
- GSPO (Zheng et al., 2025): Argued that token-level clipping is mismatched with sequence-level rewards in OSRL and proposed using a sequence-level IS ratio instead.
- Differentiation: While previous works like CISPO and GSPO focused on how to clip (soft vs. hard) or at what level to clip (token vs. sequence), ASPO addresses a more fundamental issue: what the IS ratio itself represents. ASPO argues that the ratio acts as a token-level learning weight and that its standard formulation is flawed for positive-advantage tokens. The core innovation is flipping the ratio for these specific tokens, a change no prior work had proposed. A minimal sketch of the baseline GRPO objective that ASPO modifies is given below.
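To make the baseline concrete, here is a minimal PyTorch sketch of the token-level clipped surrogate that GRPO inherits from PPO-Clip. Tensor names, shapes, and the `eps` value are illustrative assumptions, not the paper's implementation.

```python
import torch

def grpo_token_loss(logp_new, logp_old, advantages, eps=0.2):
    """Token-level PPO/GRPO clipped surrogate (illustrative sketch).

    logp_new:   log-probs of sampled tokens under the current policy (with grad)
    logp_old:   log-probs of the same tokens under the rollout (old) policy
    advantages: group-normalized advantage per token; in OSRL every token of a
                response shares that response's outcome-based advantage
    """
    ratio = torch.exp(logp_new - logp_old)                   # token-level IS ratio
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # PPO-Clip keeps the pessimistic (minimum) objective; negate it to get a loss.
    return -torch.minimum(unclipped, clipped).mean()
```

Because all tokens of a response share one outcome-based advantage, the ratio is effectively the only per-token weight, which is exactly the quantity ASPO re-examines.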
4. Methodology (Core Technology & Implementation)
The paper's methodology unfolds in three parts: first, it provides evidence that IS is not serving its intended purpose; second, it diagnoses the exact problem with IS weighting; and third, it presents ASPO as the solution.
4.1. The Problem: IS is a Misguided Weight, Not a Corrector
The authors first question the role of IS in OSRL. Since the reward assigned to each token is already an inaccurate, coarse signal (the same for all tokens in a response), they ask: does applying a fine-grained IS correction make sense?
- Experiment: They compare standard GRPO with a variant where the IS ratio is fixed to 1.0 (GRPO w/o IS); a minimal sketch of this ablation appears at the end of this subsection.
- Analysis (Image 1): [Image 1: six line charts comparing GRPO with its variant without importance sampling across training steps, covering LiveCodeBench v5 score, entropy, logits, repetition rate, clip ratio, and KL divergence.]
As shown in Image 1, removing IS (GRPO w/o IS, orange line) does not harm final performance (a) and leads to more stable training dynamics: entropy (b) drops more slowly, and metrics such as repetition rate (d), clip ratio (e), and KL loss (f) grow more gradually. This suggests the IS mechanism in GRPO is not acting as a necessary distribution corrector but rather as a token-level weight that contributes to instability. They also found that positive samples (from correct responses) tend to have higher average IS ratios than negative samples (see Image 2), which accelerates the fitting of positive samples and leads to faster entropy collapse.
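As referenced above, a minimal sketch of the "GRPO w/o IS" ablation: fixing the ratio to 1 reduces the surrogate to a plain policy-gradient term weighted only by the shared, response-level advantage (names are illustrative assumptions).

```python
import torch

def grpo_no_is_token_loss(logp_new, advantages):
    """'GRPO w/o IS' ablation sketch: the token-level IS ratio is fixed to 1.0,
    so each token is weighted only by its response-level advantage (and the
    clipping in the full objective becomes a no-op)."""
    return -(advantages * logp_new).mean()
```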
4.2. The Diagnosis: Weight Misallocation for Positive Tokens
The paper reframes the IS ratio not as a statistical correction term but as a token-level training weight. An ideal weighting scheme should give larger weights to tokens the model is struggling with (low probability) to help them improve.
- The Flaw Visualized (Image 3): [Image 3: (a) a 3D visualization of the IS ratio over the (old probability, current probability) plane, distinguishing updated from masked regions; (b) and (c) planar views of the IS ratio as a function of old and current probability for positive and negative advantages, marking the original, lowered, raised, and dual-clipped regions and how different clip ratios shift the region boundaries.]
Image 3 visualizes the IS ratio ($\pi_\theta / \pi_{\theta_{\text{old}}}$) as a weight.
- For Negative-Advantage Tokens (c): The weighting is reasonable. A high-probability "bad" token gets a large weight, leading to a strong penalty that reduces its probability.
- For Positive-Advantage Tokens (b): The weighting is mismatched. A "good" token that already has a high probability (top-left region) gets an even larger weight, causing overfitting, while a "good" token with a low probability (bottom-right region) gets a tiny weight, suppressing its learning.
- Experimental Validation (Image 4): To test this hypothesis, they ran an experiment in which the problematic token-level IS ratios for positive samples were replaced with a smoothed, response-level average; a minimal sketch of this smoothing follows at the end of this subsection.
[Image 4: line charts comparing DAPO with DAPO w/ Pos Response-Level IS Mean over training: (a) LCB v5 avg@8, (b) LCB v5 pass@8, (c) entropy, (d) repetition rate, (e) clip ratio, (f) KL loss. DAPO fluctuates more on the latter four metrics, while the response-level IS mean variant is noticeably more stable.]
As seen in Image 4, this simple change (DAPO w/ Pos Response-Level IS Mean, orange line) significantly stabilizes training dynamics (c, d, e, f) and even improves performance on the pass@8 metric (b), confirming that the original token-level weighting for positive samples is indeed harmful.
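A minimal sketch of that smoothing, assuming a flat token layout with a per-token response index; the tensor names and grouping scheme are assumptions, not the authors' code.

```python
import torch

def smooth_positive_is(ratio, advantages, response_ids):
    """Replace token-level IS ratios of positive-advantage responses with the
    mean ratio of their response ('Pos Response-Level IS Mean' sketch)."""
    smoothed = ratio.clone()
    for rid in response_ids.unique():
        mask = response_ids == rid
        # OSRL: all tokens of a response share the same outcome-based advantage.
        if advantages[mask][0] > 0:
            smoothed[mask] = ratio[mask].mean()
    return smoothed
```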
4.3. The Solution: Asymmetric Importance Sampling (ASPO)
Based on the analysis, the authors propose ASPO, which corrects the weight mismatch with a three-step process.
- Step 1: Token Masking (Hard Clipping): This is the standard PPO-Clip mechanism. It prevents overly aggressive updates by masking the gradients of tokens whose IS ratios are already far from 1.0 in the desired update direction.
- Step 2: Weight Flipping (The Core Innovation): This is the central idea of ASPO. The IS ratio, now denoted $\hat{r}_t(\theta)$, is computed asymmetrically:
- For tokens with negative advantage ($\hat{A}_t < 0$): the ratio remains the same as in GRPO, $\hat{r}_t(\theta) = \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}$.
- For tokens with positive advantage ($\hat{A}_t > 0$): the ratio is flipped (i.e., the reciprocal is used), $\hat{r}_t(\theta) = \frac{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}$.
The paper implements the flip with a stop-gradient operation, sg(·), applied to the denominator. This prevents the denominator from affecting the gradient calculation, so the gradient behaves as if the token were simply weighted by $\frac{\pi_{\theta_{\text{old}}}}{\pi_\theta}$ rather than having its update direction reversed, which a naive reciprocal would cause. This flip ensures low-probability "good" tokens get high weights and high-probability "good" tokens get low weights, correcting the mismatch.
- Step 3: Dual Clipping (Soft Clipping): Flipping the ratio for positive tokens introduces a new risk: if a token's current probability is near zero, its flipped weight could become huge and destabilize training. To prevent this, ASPO applies a dual-clip (an upper bound on the weight) to positive-advantage tokens. The clipping is done in a soft manner (à la CISPO), which constrains the magnitude of the update but preserves the gradient, allowing these lagging tokens to still contribute to learning. A minimal sketch combining the three steps is given below.
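Putting the three steps together, here is a minimal PyTorch sketch of the asymmetric weighting as described above. It realizes the stop-gradient trick with a detached, CISPO-style weight multiplied by the token log-probability; the clip thresholds, tensor layout, and exact formulation are assumptions rather than the paper's implementation.

```python
import torch

def aspo_token_loss(logp_new, logp_old, advantages,
                    eps_low=0.2, eps_high=0.2, dual_clip=3.0):
    """Sketch of ASPO's asymmetric importance weighting (illustrative only)."""
    ratio = torch.exp(logp_new - logp_old).detach()   # IS ratio, used as a weight

    # Step 1: hard token masking (standard PPO-Clip): drop tokens whose ratio
    # is already far from 1.0 in the desired update direction.
    keep = torch.where(advantages > 0,
                       ratio <= 1.0 + eps_high,
                       ratio >= 1.0 - eps_low).float()

    # Step 2: weight flipping: positive-advantage tokens use the reciprocal
    # ratio, so low-probability "good" tokens receive the larger weights.
    weight = torch.where(advantages > 0, 1.0 / ratio, ratio)

    # Step 3: soft dual clip: cap the flipped weight, which explodes when the
    # current probability is tiny. Because the weight is detached, the clip
    # bounds the update magnitude while the gradient of logp_new keeps flowing.
    weight = torch.where(advantages > 0, weight.clamp(max=dual_clip), weight)

    # Per-token gradient is proportional to weight * advantage * grad log pi.
    return -(keep * weight * advantages * logp_new).mean()
```

For negative-advantage tokens this is, to first order, the same update as GRPO; for positive-advantage tokens the effective weight becomes $\pi_{\theta_{\text{old}}}/\pi_\theta$, matching the gradient analysis in Section 4.4.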
4.4. Gradient Analysis
The per-token gradient of the GRPO objective is proportional to the IS ratio:
$$\nabla_\theta J_t^{\text{GRPO}} \propto \frac{\pi_\theta(o_t \mid q, o_{<t})}{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$
The gradient for ASPO's positive-advantage tokens is proportional to the flipped ratio:
$$\nabla_\theta J_t^{\text{ASPO}} \propto \frac{\pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})}{\pi_\theta(o_t \mid q, o_{<t})}\, \hat{A}_t\, \nabla_\theta \log \pi_\theta(o_t \mid q, o_{<t}).$$
Substituting $\nabla_\theta \log \pi_\theta = \nabla_\theta \pi_\theta / \pi_\theta$ and abbreviating $\pi_\theta = \pi_\theta(o_t \mid q, o_{<t})$ and $\pi_{\theta_{\text{old}}} = \pi_{\theta_{\text{old}}}(o_t \mid q, o_{<t})$, the gradient update term for ASPO becomes proportional to $\frac{\pi_{\theta_{\text{old}}}}{\pi_\theta^{2}}\, \hat{A}_t\, \nabla_\theta \pi_\theta$. This explicitly shows that the gradient update is inversely proportional to the token's current probability, giving larger updates to less confident tokens.
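A toy autograd check of this scaling (all numbers are made up): two positive-advantage tokens, one the model is already confident about and one it struggles with, receive opposite relative gradient magnitudes under the standard and flipped weights.

```python
import torch

# Two positive-advantage tokens: confident (p = 0.9) and struggling (p = 0.1),
# both with old-policy probability 0.5 and advantage +1.
logp = torch.log(torch.tensor([0.9, 0.1]))
logp.requires_grad_(True)
p, p_old, adv = logp.exp(), torch.tensor([0.5, 0.5]), torch.tensor([1.0, 1.0])

grpo_w = (p / p_old).detach()      # standard IS weight: [1.8, 0.2]
aspo_w = (p_old / p).detach()      # flipped weight:     [0.56, 5.0]

(-(grpo_w * adv * logp).sum()).backward()
print(logp.grad)                   # ~[-1.8, -0.2]: the confident token dominates
logp.grad = None
(-(aspo_w * adv * logp).sum()).backward()
print(logp.grad)                   # ~[-0.56, -5.0]: the struggling token dominates
```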
5. Experimental Setup
- Datasets:
- Mathematical Reasoning: A mixture of challenging datasets including AIME24/25, AMC23, MATH-500, Minerva Math, and OlympiadBench.
- Coding: LiveCodeBench (v5 and v6), a benchmark that evaluates models on recent competitive programming problems, alongside datasets like CodeContests and CodeForces.
- Evaluation Metrics:
- pass@K: The probability that at least one of K generated solutions to a problem is correct. It measures the model's ability to find a correct answer given multiple attempts.
- avg@K: The average reward over K generated solutions. It measures the overall quality of the model's generations (a minimal sketch of both metrics is given at the end of this section).
- Baselines:
- DeepSeek-R1-Distill-Qwen-1.5B: The base model before RL fine-tuning.
- DAPO: A strong, open-source implementation of a GRPO-based OSRL system, serving as the primary baseline.
- Other strong 1.5B models in the field: DeepScaleR-1.5B, DeepCoder-1.5B, FastCuRL-1.5B-V3, and Nemotron-1.5B.
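For reference, a minimal sketch of the simple empirical versions of these metrics, computed from a boolean correctness matrix; published results often use the unbiased pass@k estimator when more samples than K are drawn, so treat this as an approximation.

```python
import numpy as np

def avg_at_k(correct):
    """avg@K: mean per-sample score over K attempts (correct: bool [problems, K])."""
    return correct.mean()

def pass_at_k(correct):
    """pass@K: fraction of problems solved by at least one of the K attempts."""
    return correct.any(axis=1).mean()

# Example: 3 problems, 4 samples each.
correct = np.array([[0, 1, 0, 0],
                    [0, 0, 0, 0],
                    [1, 1, 1, 0]], dtype=bool)
print(avg_at_k(correct))   # ~0.33
print(pass_at_k(correct))  # ~0.67
```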
6. Results & Analysis
- Core Results:
Table 1: Mathematical Benchmarks

| Method | AIME24 avg@64 | AIME24 pass@64 | AIME25 avg@64 | AIME25 pass@64 | AMC23 avg@64 | AMC23 pass@64 | MATH-500 avg@4 | MATH-500 pass@4 | Minerva avg@8 | Minerva pass@8 | Olympiad avg@4 | Olympiad pass@4 | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| DeepSeek-R1-1.5B | 30.6 | 80.0 | 23.5 | 63.3 | 70.7 | 100.0 | 83.6 | 92.4 | 27.6 | 48.2 | 44.6 | 59.4 | 46.8 |
| DAPO | 42.1 | 80.0 | 28.6 | 56.7 | 80.3 | 97.5 | 87.6 | 94.6 | 29.2 | 46.3 | 53.2 | 65.8 | 53.5 |
| DeepScaleR-1.5B | 42.0 | 83.3 | 29.0 | 63.3 | 81.3 | 100.0 | 87.7 | 93.6 | 30.3 | 51.1 | 50.7 | 61.0 | 53.5 |
| FastCuRL-1.5B-V3 | 48.1 | 80.0 | 32.7 | 60.0 | 86.4 | 95.0 | 89.8 | 94.0 | 33.6 | 50.0 | 55.3 | 64.3 | 57.7 |
| Nemotron-1.5B | 48.0 | 76.7 | 33.1 | 60.0 | 86.1 | 97.5 | 90.6 | 93.6 | 35.3 | 47.8 | 59.2 | 66.8 | 58.7 |
| ASPO-Math-1.5B | 49.0 | 80.0 | 35.1 | 70.0 | 87.2 | 95.0 | 90.5 | 94.4 | 35.1 | 50.4 | 58.8 | 66.9 | 59.3 |

Table 2: Code Benchmarks (LCB v5: 2024.08.01-2025.02.01; LCB v6: 2025.02.01-2025.05.01)

| Method | LCB v5 avg@8 | LCB v5 pass@8 | LCB v6 avg@16 | LCB v6 pass@16 | Avg. |
|---|---|---|---|---|---|
| DeepSeek-R1-1.5B | 16.7 | 29.0 | 17.2 | 34.4 | 17.0 |
| DAPO | 26.0 | 40.5 | 27.6 | 43.5 | 26.8 |
| DeepCoder-1.5B | 23.3 | 39.1 | 22.6 | 42.0 | 23.0 |
| Nemotron-1.5B | 26.1 | 35.5 | 29.5 | 42.8 | 27.8 |
| ASPO-Code-1.5B | 31.5 | 47.0 | 30.5 | 46.0 | 31.0 |

The results in both tables show that ASPO consistently outperforms the base model and all strong baselines, including DAPO, across both mathematical and coding domains. The average-score improvements are substantial, demonstrating the effectiveness of the proposed method.
- Training Dynamics Analysis (Image 5): [Image 5: training curves comparing DAPO, DAPO w/ Pos Response-Level IS Mean, and ASPO: (a) LCB v5 avg@8, (b) LCB v5 pass@8, (c) entropy, (d) repetition rate, (e) clip ratio, (f) KL loss. ASPO achieves the best scores, retains higher entropy, and keeps repetition, clip ratio, and KL loss lowest and most stable.]
Image 5 provides a clear comparison of training dynamics.
- Performance (a, b): While ASPO (green line) starts slightly slower than DAPO, it avoids stagnating and ultimately achieves a much higher final performance.
- Entropy (c): DAPO's entropy collapses rapidly, indicating the model is becoming overly deterministic and getting stuck in a local optimum. ASPO's entropy decreases much more gradually and stabilizes at a higher level, which is characteristic of "healthy convergence" and allows for continued exploration and learning.
- Stability (d, e, f): ASPO shows significantly lower and more stable repetition rates, clip ratios, and KL loss throughout training, confirming that it mitigates the unstable, self-reinforcing updates caused by the weight mismatch in DAPO.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies a fundamental flaw (the IS ratio mismatch for positive-advantage tokens) in widely used GRPO-based OSRL methods for LLMs. It proposes ASPO, an elegant solution that flips the IS ratios for these tokens and incorporates a soft dual-clip for stability. This correction leads to more stable training, mitigates entropy collapse, and results in state-of-the-art performance on difficult reasoning and coding tasks.
- Limitations & Future Work: The authors acknowledge two main limitations:
- The experiments were conducted on 1.5B-scale models. Further research is needed to see if the findings generalize to much larger models.
- The conclusions are specific to GRPO-based OSRL. It remains an open question whether the observation that "importance sampling is not important" holds for other RL paradigms, such as those with more fine-grained, token-level rewards (process supervision).
- Personal Insights & Critique:
- Novelty and Impact: The paper's core insight is simple, powerful, and very well-supported by evidence. It highlights the importance of re-examining foundational algorithmic components when applying them to new domains like LLM fine-tuning. The discovery that IS acts as a harmful weight rather than a helpful corrector is a significant contribution to the field.
- Clarity and Rigor: The paper is exceptionally well-structured. It builds its case logically, using controlled experiments at each step to motivate and validate its claims. The visualizations are clear and effectively communicate the core problem and solution.
- Future Implications: ASPO provides a practical and easy-to-implement improvement over existing methods. It is likely to be widely adopted by researchers and practitioners working on RL for LLMs. The paper also encourages a more critical look at other "standard" RL components that may not behave as expected in the context of LLMs. It opens up new avenues for research into more suitable weighting and update schemes for OSRL.