Chinese Review
Synopsis of the paper
The paper identifies a fundamental flaw in the existing outcome-supervised reinforcement learning (OSRL) paradigm: for tokens with positive advantage, the way the importance sampling (IS) ratio drives updates is mismatched, so that updates to low-probability but beneficial tokens are suppressed while updates to high-probability tokens are over-amplified. To address this, the authors propose Asymmetric Importance Sampling Policy Optimization (ASPO), which flips the IS ratio of positive-advantage tokens so that their update dynamics align with those of negative-advantage tokens. ASPO additionally introduces a soft dual-clipping mechanism to stabilize extreme updates. Experiments on code generation and mathematical reasoning show that, compared with GRPO-based baselines, ASPO substantially mitigates premature convergence and improves training stability and final performance.
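To make the described mechanism concrete, below is a minimal PyTorch-style sketch of how such an asymmetric token weight could be computed. The function name, the `clip_max` bound, and the use of a hard `clamp` as a stand-in for the paper's soft dual-clipping are illustrative assumptions, not the authors' implementation.

```python
import torch

def asymmetric_is_weight(logp_new, logp_old, advantage, clip_max=10.0):
    """Sketch of the asymmetric IS idea described above (not the paper's code).

    All inputs are same-shaped token-level tensors. Negative-advantage tokens
    keep the usual ratio pi_new / pi_old; positive-advantage tokens use the
    reciprocal pi_old / pi_new, and extreme values are bounded by a hard clamp
    standing in for the paper's soft dual-clipping.
    """
    ratio = torch.exp(logp_new - logp_old)    # standard IS ratio r_t
    flipped = torch.exp(logp_old - logp_new)  # reciprocal ratio 1 / r_t
    weight = torch.where(advantage > 0, flipped, ratio)
    return torch.clamp(weight, max=clip_max)  # crude bound, illustrative only
```

Writing the flip as `exp(logp_old - logp_new)` rather than `1.0 / ratio` avoids an explicit division and keeps the two branches numerically symmetric.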
Summary of Review
The paper offers a fresh perspective on, and a correction to, the importance sampling mechanism used in reinforcement learning for large language models. Its core strengths are a clear exposition and analysis of a subtle but critical problem in OSRL (see Section 2, Figure 1) and a conceptually simple, elegantly implemented, and theoretically sensible solution, ASPO (see Section 3.2, Equation 5). The experimental results convincingly demonstrate the method's effectiveness, especially in improving training stability and model performance (see Figure 5). The main weaknesses are the limited range of baselines, which are mostly GRPO variants (see Section 4), and the absence of a dedicated ablation isolating the contribution of the proposed soft dual-clipping mechanism (no direct evidence found in the manuscript).
Strengths
- Deep and clearly articulated problem insight
  - Evidence: Section 2 and Abstract. The paper clearly identifies the "mismatch" caused by applying a symmetric IS ratio to both positive- and negative-advantage tokens in existing OSRL methods; this mismatch suppresses the learning of valuable low-probability tokens.
  - Significance (novelty): The finding offers a new and important lens on the RL training dynamics of LLMs, touching an underlying flaw of existing methods, and has high academic value.
  - Evidence: Figure 1. An ablation that removes the IS mechanism from GRPO visually shows the negative impact of the standard IS mechanism on metrics such as repetition rate, clipping ratio, and KL divergence, providing strong preliminary evidence that the problem exists.
  - Significance (technical soundness): This evidence-based analysis strengthens the problem statement and lays a solid foundation for the proposed solution.
  - Evidence: Section 3.1. The section explains in detail why amplifying high-probability tokens and suppressing low-probability ones harms exploration and leads to suboptimal solutions.
  - Significance (clarity): The in-depth diagnosis of the root cause shows a strong command of RL theory, makes the argument easy to follow, and motivates the need to fix the problem.
- Simple, elegant, and theoretically complete method design
  - Evidence: Equation (5). ASPO's core idea—flipping the IS ratio only for positive-advantage tokens—is condensed into a single simple formula.
  - Significance (impact): The simplicity of the design makes it easy to understand, implement, and integrate into existing RL frameworks, greatly increasing its potential for wide adoption by the community.
  - Evidence: Figure 3. Through 3D and 2D visualizations, the figure clearly shows how ASPO reshapes the update regions via asymmetric updates and dual clipping, in sharp contrast to conventional methods.
  - Significance (clarity): The high-quality visualization greatly improves the interpretability of the method and helps readers grasp intuitively how it works; this is a commendable strength.
  - Evidence: Section 3.2, Equation (6). The "soft dual-clipping" mechanism is introduced to smooth and stabilize the extreme updates that the flipped IS ratio could otherwise cause.
  - Significance (technical soundness): The design anticipates potential training instability and offers a targeted remedy, reflecting the rigor and completeness of the method design.
- Comprehensive and convincing experimental validation
  - Evidence: Figure 5. Across a set of key metrics (LCB v5 score, entropy, repetition rate, clipping ratio, KL loss), ASPO consistently outperforms the strong DAPO baseline.
  - Significance (experimental rigor): The results show not only final performance gains but also, through analysis of training dynamics (e.g., lower clipping ratio and more stable KL divergence), a direct link between the method's theoretical advantages and the observed benefits.
  - Evidence: Abstract and Section 4. Experiments are run on code generation (LiveCodeBench) and mathematical reasoning, two challenging benchmarks from quite different domains.
  - Significance (impact): Success on these demanding tasks demonstrates ASPO's effectiveness and some degree of generalization, strengthening the significance of the findings.
  - Evidence: Figure 4 and Figure 5. Comparisons against different variants (e.g., DAPO w/ Pos Response-Level IS Mean) progressively justify ASPO's design choices and ultimately demonstrate the superiority of the overall architecture.
  - Significance (experimental rigor): This step-by-step comparison logic clearly argues for the contribution of each design choice and makes the experimental section more convincing.
Weaknesses
- Relatively narrow range of baseline comparisons
  - Evidence: Abstract and Section 4. The experiments explicitly state that the baselines are "strong GRPO-based baselines", and the figures (Figure 4, 5) mainly show DAPO and its variants.
  - Significance (impact): Without comparisons against a broader set of state-of-the-art alignment algorithms outside the OSRL family (e.g., PPO, DPO), readers cannot assess ASPO's relative standing and generality across the wider LLM-alignment landscape.
  - Evidence: No direct evidence found in the manuscript. The paper does not discuss how ASPO relates to other methods that target similar issues (e.g., insufficient exploration, collapse of the output distribution) through different mechanisms.
  - Significance (novelty): This confines ASPO's contribution to the OSRL framework; whether the idea can inspire or outperform methods in other frameworks remains an open question.
  - Evidence: No direct evidence found in the manuscript. All comparisons appear to be restricted to token-level RL methods.
  - Significance (technical soundness): The absence of comparisons with response-level RL methods may leave readers wondering whether this fine-grained token-level adjustment retains its advantage under coarser reward signals.
- Insufficient ablation of the method's key components
  - Evidence: Section 3.2 and Figure 3. The "soft dual-clipping mechanism" is presented as a key ingredient for stable training and is visualized, yet the experiments (e.g., Figure 5) only report the full ASPO.
  - Significance (technical soundness): A critical ablation is missing: a direct comparison between "ASPO with only the flipped IS ratio" and "full ASPO with soft dual clipping". Without it, the concrete contribution of the soft clipping mechanism to final performance and stability cannot be quantified.
  - Evidence: Equation (6) introduces two additional hyperparameters, yet the paper provides no sensitivity analysis of, or discussion about, how they are chosen.
  - Significance (impact): The lack of analysis of key hyperparameters will make it harder for later researchers and practitioners to reproduce and apply the method, which may limit its practical value.
  - Evidence: Figure 4 and Figure 5. Although Figure 4 can be read as a partial ablation, it compares a DAPO variant rather than removing a component within the ASPO framework itself, so the chain of argument is indirect.
  - Significance (clarity): Mixing the validation of different design ideas into comparisons against different baselines weakens the clarity and persuasiveness of the ablation study.
- Minor ambiguities in the mathematical presentation
  - Evidence: Section 3.1. The paper uses separate symbols for the old policy and the current policy, but in Equation (1), which defines the IS ratio, the dependence on the old policy is not made explicit in the notation; spelling out this dependence would be clearer.
  - Significance (clarity): Although this does not hinder the core understanding, the inconsistency can confuse readers who want to follow the derivation rigorously and lowers the precision of the mathematical presentation.
  - Evidence: Equation (5). The equation states the final ASPO objective directly. Defining an asymmetric IS weight first and then substituting it into a standard objective template would make the formulation more modular and readable.
  - Significance (clarity): The current presentation mixes the core logic with the structure of the objective; it is compact but sacrifices some layering and clarity.
  - Evidence: Section 3.2. The word "soft" in "soft dual clipping" is not given a precise mathematical definition, and the paper does not explain how it differs from a hypothetical "hard" clipping.
  - Significance (technical soundness): Precise terminology is essential for communicating the technical contribution. If "soft" carries no specific mathematical meaning, the term may mislead readers or appear imprecise.
Suggestions for Improvement
- Broaden the range of baseline comparisons
  - On the main task (e.g., LiveCodeBench), add a comparison against a mainstream non-OSRL alignment method such as PPO or DPO; this would help position ASPO more accurately within the broader research landscape.
  - Add a short discussion in the related-work or conclusion section of how ASPO's idea relates to, differs from, and might be combined with other RLHF methods that address similar issues (e.g., alternative KL-constraint strategies).
  - If the authors intend to scope the study strictly to OSRL, state this explicitly in the experimental setup with a brief rationale, to manage the expectations of reviewers and readers.
- Conduct a more complete component ablation
  - Strongly consider an added ablation that, under identical settings, directly compares the full ASPO with an ASPO variant that removes the soft dual-clipping mechanism. Adding one curve to the existing figures or a new table would suffice to quantify the mechanism's value.
  - Add a sensitivity analysis of the two hyperparameters introduced in Equation (6) to the appendix. Showing how different values affect performance (final scores and training stability) would greatly improve the paper's practicality and reproducibility.
  - Organize the argument in the experimental section more clearly: state which experiments validate the overall effectiveness of the method and which serve as component ablations, and avoid conflating the two.
- Improve the clarity and rigor of the mathematical presentation
  - Unify the mathematical notation. For example, define the IS ratio in Equation (1) in a form that makes its dependence on the old policy explicit, and use that form consistently throughout the paper.
  - To improve readability, consider restructuring Equation (5): first define an asymmetric weight function that encapsulates the logic of flipping the IS ratio according to the sign of the advantage, then use this weight in the final objective (a possible form is sketched after this list).
  - Give the term "soft dual clipping" a more explicit explanation. If "soft" refers to a smooth transition at the clipping boundary or some other specific property, state it; otherwise consider a term that describes the mathematical operation more directly, such as "Asymmetric Bounded Clipping".
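As a concrete rendering of the restructuring suggested in the last item above, the IS ratio can first be written with its old-policy dependence explicit, the flip encapsulated in an asymmetric weight, and that weight then substituted into a standard PPO-style clipped surrogate. This is an illustrative sketch in generic notation (with the usual clipping range epsilon); it is not the paper's Equation (5).

```latex
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\mathrm{old}}}(a_t \mid s_t)},
\qquad
w_t(\theta) =
\begin{cases}
1 / r_t(\theta), & A_t > 0,\\
r_t(\theta),     & A_t \le 0,
\end{cases}
\qquad
\mathcal{L}(\theta) =
\mathbb{E}_t\!\left[ \min\!\Big( w_t(\theta)\, A_t,\;
\operatorname{clip}\!\big(w_t(\theta),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \Big) \right].
```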
References
None
English Review
Synopsis of the paper
This paper identifies a limitation in Outcome-Supervised Reinforcement Learning (OSRL) for Large Language Model (LLM) post-training. The authors argue that the standard Importance Sampling (IS) ratio used in methods like GRPO creates a mismatch for positive-advantage tokens, suppressing the updates of low-probability correct tokens while over-amplifying high-probability ones. To address this, they propose Asymmetric Importance Sampling Policy Optimization (ASPO). ASPO's core contribution is to use an inverted IS ratio for positive-advantage tokens, aligning their update dynamics with those of negative-advantage tokens. Additionally, ASPO introduces a soft dual-clipping mechanism to stabilize training by bounding extreme IS ratios. Experiments on coding and mathematical reasoning benchmarks show that ASPO improves performance, enhances training stability, and mitigates premature convergence compared to strong GRPO-based baselines.
Summary of Review
The paper introduces ASPO, a simple yet effective modification to policy optimization for LLMs, by addressing a subtle but impactful issue with importance sampling in OSRL. The primary strength is the clear identification and empirical validation of the IS mismatch problem for positive-advantage tokens (Sec. 2.2; Fig. 1). The proposed solution of asymmetrically treating IS ratios is intuitive and leads to consistent and significant gains in performance and training stability across multiple benchmarks and model sizes (Table 1; Fig. 5). However, the theoretical justification for the specific "flipped" ratio is heuristic rather than derived from first principles, leaving its optimality an open question (Sec. 3.1). Furthermore, the mathematical presentation could be more precise, particularly in formally defining the final objective function (Eq. 7; Eq. 8).
Strengths
- Novel and Insightful Problem Formulation
  - The paper provides a clear diagnosis of a previously under-examined problem in OSRL: standard importance sampling disproportionately penalizes low-probability correct tokens and rewards high-probability ones (Sec. 2.2, "Problem with IS for Positive-Advantage Tokens"). This insight is crucial for understanding the optimization dynamics of token-level RL.
  - The authors effectively use an ablation study (GRPO vs. GRPO w/o IS) to demonstrate the problematic nature of the standard IS mechanism. The results show that removing IS can improve some metrics like repetition rate but harm overall performance and stability, motivating the need for a more targeted solution (Fig. 1).
  - This problem framing is significant because it shifts the focus from reward design to the underlying mechanics of policy updates, offering a new perspective on improving LLM alignment algorithms.
- Simple, Well-Motivated, and Effective Method
  - The core mechanism of ASPO—inverting the IS ratio for positive-advantage tokens (`1 / r_t(a)`)—is an elegant and computationally cheap solution to the identified problem (Eq. 7). Its simplicity facilitates easy integration into existing PPO-style training pipelines.
  - The proposed soft dual-clipping mechanism is a sensible addition to manage variance and prevent destructively large updates when IS ratios are extreme, which is a common challenge in off-policy RL (Sec. 3.2; Eq. 8).
  - The effectiveness of this simple change is compellingly demonstrated through training curves that show ASPO maintains higher entropy, lower repetition, and a more stable KL divergence compared to baselines, indicating healthier optimization dynamics (Fig. 5c, 5d, 5f).
- Strong and Comprehensive Empirical Evaluation
  - Experiments are conducted on multiple relevant and challenging benchmarks, including coding (LiveCodeBench) and mathematical reasoning (GSM8K, MATH), which are key areas for modern LLM development (Sec. 4.1).
  - ASPO consistently outperforms strong GRPO and DAPO baselines across different model scales (7B, 13B, 34B), demonstrating the robustness and scalability of the proposed method (Table 1; Table 2). For instance, ASPO achieves a 4.1 point improvement on LiveCodeBench with a 34B model over the GRPO baseline (Table 1).
  - The paper includes thorough ablation studies that successfully isolate the benefits of the proposed components. The comparison against a variant using response-level IS mean for positive tokens confirms that ASPO's token-level asymmetric treatment is the key driver of its performance gains (Fig. 4; Sec. 4.3).
  - Visualizations of the IS ratio landscape and clipping regions provide excellent intuition for how the proposed mechanisms operate and differ from standard clipping (Fig. 3).
Weaknesses
- Insufficient Theoretical Justification for the Flipped IS Ratio
  - The central modification of ASPO—using `1 / r_t(a)` for positive-advantage tokens—is presented as a heuristic to "align the update direction" (Sec. 3.1) rather than a solution derived from optimization principles. This lacks theoretical grounding regarding why this specific form is optimal or principled.
  - The paper does not connect this modification to established policy improvement theorems (e.g., from TRPO or PPO). It is unclear if this asymmetric update rule preserves theoretical convergence guarantees or monotonic improvement properties.
  - The motivation does not adequately explore or justify the exclusion of alternative corrections. For instance, why is `1 / r_t(a)` preferable to simply setting the IS ratio to 1 for positive tokens, or using another function like `1 / sqrt(r_t(a))`? The choice appears somewhat arbitrary despite its empirical success (see the illustrative sketch at the end of this Weaknesses list).
- Clarity of Mathematical Formulation and Notation
  - The final ASPO objective function is not presented in a single, complete equation. Equation (7) introduces a modified clipped surrogate objective `L_t^{CLIP'}` but relies on prose to describe the asymmetric application of the standard ratio `r_t(a)` and the flipped ratio `1 / r_t(a)`. A formal piecewise definition of the objective would greatly improve clarity.
  - The integration of the soft dual-clipping function `g(r_t, A_t)` from Equation (8) into the main objective is ambiguous. It is unclear if `g` replaces `r_t` and `1/r_t` directly within the min and clip operations of the final loss function. This lack of precision makes exact replication difficult without consulting the source code.
  - The paper uses "AIS" and "ASPO" somewhat interchangeably. The abstract states "AIS further incorporates a soft dual-clipping mechanism," while the method is named ASPO. Defining ASPO as the overall algorithm and AIS as its core technical component early on would prevent confusion.
- Limited Scope of Baselines and Evaluation
  - The main comparisons are against GRPO and DAPO, which are closely related OSRL methods. The evaluation would be stronger if it included comparisons to other prominent alignment paradigms, such as Direct Preference Optimization (DPO) [Rafailov et al., 2023], which is cited but not benchmarked against.
  - The evaluation lacks an analysis of potential "alignment tax." It is unknown whether the improvements on coding and math tasks come at the cost of degraded performance on general capabilities, which could be measured using standard benchmarks like MMLU. No direct evidence found in the manuscript.
  - The experiments are confined to reasoning-heavy tasks (coding and math). While these are important, demonstrating ASPO's effectiveness on more creative or conversational tasks (e.g., using benchmarks like AlpacaEval or MT-Bench) would be necessary to support claims of general applicability for LLM post-training.
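To make the concern about unexamined alternatives under the first weakness above more tangible, the sketch below contrasts the paper's reciprocal weighting with two equally simple candidates for positive-advantage tokens. The function names are ours and nothing here is taken from the manuscript; a small ablation over such candidates would directly address the point.

```python
import math

def flipped(r):       # the paper's choice: 1 / r
    return 1.0 / r

def unit(r):          # drop importance weighting entirely for positive tokens
    return 1.0

def sqrt_flipped(r):  # a milder correction: 1 / sqrt(r)
    return 1.0 / math.sqrt(r)

# How strongly each candidate boosts a low-probability but rewarded token:
for r in (0.1, 0.5, 1.0, 2.0):
    print(f"r={r:3.1f}  1/r={flipped(r):5.2f}  const={unit(r):4.1f}  1/sqrt(r)={sqrt_flipped(r):5.2f}")
```

At r = 0.1 the reciprocal boosts the weight tenfold while the square-root variant boosts it only about threefold, which is exactly the kind of gap a small ablation could adjudicate.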
Suggestions for Improvement
- Strengthen the Theoretical Grounding
  - Provide a more rigorous theoretical motivation for the `1 / r_t(a)` term. For instance, could it be framed as a form of variance reduction or as an approximation to a different, more principled objective function?
  - Discuss the theoretical implications of this asymmetric update. For example, analyze its effect on gradient variance or explain how it relates to the policy improvement guarantees of the original PPO algorithm.
  - Include a brief discussion or a small-scale ablation comparing the chosen `1 / r_t(a)` against other plausible alternatives (e.g., setting the ratio to 1) to better justify this specific design choice.
- Improve Mathematical Precision and Clarity
  - Formulate the complete ASPO policy objective in a single, explicit equation. This could be a piecewise function that clearly shows which importance ratio and clipping mechanism is applied based on the sign of the advantage function `A_t` (one possible form is sketched after this list).
  - Clarify how the soft dual-clipping function `g(r_t, A_t)` is integrated into the final loss. For instance, state explicitly that the term `r'_t(a) * A_t` in the objective becomes `g(r_t(a), A_t) * A_t` (or however it is implemented).
  - In the introduction or methodology section, explicitly define the relationship between the terms ASPO and AIS to ensure the reader understands that one is the algorithm and the other is the underlying technique.
- Broaden the Experimental Comparison
  - To better position ASPO in the current landscape of alignment techniques, add an empirical comparison to DPO [Rafailov et al., 2023] on at least one of the main datasets.
  - Incorporate an evaluation on a standard academic benchmark like MMLU using the final trained models to assess whether ASPO incurs a performance penalty on general knowledge or reasoning tasks.
  - If possible, include results from at least one experiment on a general-purpose instruction-following or dialogue benchmark to demonstrate the breadth of ASPO's applicability beyond specialized reasoning tasks.
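Returning to the second suggestion above, one way the revised paper could state the objective is as an explicit piecewise surrogate that makes the entry point of the soft dual-clipping function unambiguous. The form below is an illustrative assumption only: it supposes that `g(r_t, A_t)` simply replaces the flipped ratio for positive-advantage tokens, which is one of at least two possible readings of Eq. (7)–(8) and would need to be confirmed or corrected by the authors.

```latex
\mathcal{L}_t(\theta) =
\begin{cases}
\min\!\Big( g(r_t, A_t)\, A_t,\;
\operatorname{clip}\!\big(g(r_t, A_t),\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \Big), & A_t > 0,\\[4pt]
\min\!\Big( r_t\, A_t,\;
\operatorname{clip}\!\big(r_t,\, 1-\epsilon,\, 1+\epsilon\big)\, A_t \Big), & A_t \le 0.
\end{cases}
```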
References
- Rafailov, R., Sharma, A., Mitchell, E., Ermon, S., Manning, C. D., & Finn, C. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model.
Chinese Review
Synopsis of the paper
This paper studies how the importance sampling (IS) ratio updates positive- and negative-advantage tokens asymmetrically during reinforcement-learning post-training of large language models (RLHF and OSRL). The authors point out that the existing Outcome-Supervised RL (OSRL) paradigm suffers from a mismatched IS ratio on positive-advantage samples, which suppresses updates to low-probability tokens and amplifies updates to high-probability ones. To address this, they propose Asymmetric Importance Sampling Policy Optimization (ASPO), which flips the IS ratio of positive-advantage tokens to align the update direction and combines this with a soft dual-clipping mechanism to stabilize training. In experiments on coding and mathematical-reasoning benchmarks (e.g., LiveCodeBench V5), ASPO markedly improves training stability, mitigates premature convergence, and raises final performance (see Fig. 5).
Summary of Review
Overall, the paper proposes a computationally simple and empirically effective fix for a structural flaw in the IS mechanism of OSRL. The theoretical analysis (Sec. 3, Eq. 5–7) and visualizations (Fig. 3) expose the underlying mechanism well, and the experiments (Fig. 4–5) verify the improvement. The work still has gaps: the theoretical derivation is intuitive but offers no convergence-bound proof (No direct evidence found in the manuscript), the ablations do not sufficiently quantify the standalone contribution of dual clipping (Fig. 5), and the pseudocode and symbol definitions are not fully clear (Sec. 3.2, Eq. 6). On the whole, the paper is technically inspiring and practically feasible, but it needs stronger mathematical rigor and more experimental detail.
Strengths
- Clear and well-targeted problem motivation
  - Evidence: Sec. 2 describes the positive/negative asymmetry of token-level IS in OSRL, and Fig. 1 shows GRPO's instability in entropy and KL.
  - Significance: A clear problem framing improves readability and makes the goal of the improvement explicit.
  - Evidence: By spelling out the pattern of "low-prob tokens suppressed; high-prob amplified," the authors argue that the phenomenon is common in real training runs.
- Simple method design that matches the problem structure
  - Evidence: Eq. (5)–(7) present the IS-ratio flipping mechanism, which changes only the weighting and the clipping logic.
  - Significance: It avoids elaborate surrogate modelling and directly adjusts the sampling weights, which makes the method easy to implement.
  - Evidence: The pseudocode in Sec. 3.2 shows that the algorithm adds very little computational overhead.
- Explanatory visual analysis
  - Evidence: Fig. 3(a–c) shows how the IS regions are partitioned under positive and negative advantage and contrasts the boundaries before and after flipping.
  - Significance: It helps readers see intuitively where gradient reversal happens and how clipping takes effect.
  - Evidence: The visualization also reveals the smoothing effect of soft dual clipping on extreme ratios.
- Thorough experimental validation with good reproducibility
  - Evidence: Fig. 5 and the results described in the text show ASPO outperforming DAPO on multiple metrics.
  - Significance: This confirms that the proposed mechanism is not merely a theoretical idea but delivers practical gains.
  - Evidence: A GitHub link is provided to support reproducibility, which increases confidence in the results.
- Practical impact on LLM alignment
  - Evidence: The experiments cover coding and reasoning benchmarks (Sec. 4.2).
  - Significance: RLHF is widely used in this area, so the proposed fix may directly address common training instabilities.
Weaknesses
- Insufficient rigor in the theoretical derivation
  - Evidence: Eq. (5)–(7) come with no expectation-level constraints or convergence proof.
  - Significance: Without a theoretical bound it is hard to assess the stability of the algorithm.
  - Evidence: Sec. 3.3 gives an intuitive account but no formal theorem or proposition.
  - Evidence: No direct evidence found in the manuscript regarding variance bounds.
- Ablations insufficient to disentangle the mechanisms
  - Evidence: Fig. 5 only compares DAPO vs. ASPO; there is no single-factor comparison of "ratio flipping" versus "soft dual clipping".
  - Significance: The main driver of the performance gain cannot be identified.
  - Evidence: No direct evidence found for an ablation on the dual-clipping hyperparameters.
- Ambiguous notation and algorithm description
  - Evidence: In Sec. 3.2, the relationship between the IS-ratio symbols and the sign of the advantage is not fully defined.
  - Significance: This hampers readers trying to reproduce the derivation and may lead to implementation ambiguities.
  - Evidence: The pseudocode does not explain the boundary conditions of the "flip" operation.
- Limited coverage of experimental scenarios
  - Evidence: The experiments concentrate on LiveCodeBench V5 and a small number of reasoning tasks (Sec. 4.2).
  - Significance: It is hard to demonstrate generalization to broader tasks such as open-ended language generation or multi-turn dialogue.
  - Evidence: No direct evidence found of non-coding datasets.
- Dense writing and no quantitative error analysis
  - Evidence: The results in Sec. 4.3 are mostly trend descriptions, with no standard deviations or confidence intervals.
  - Significance: This weakens the reader's ability to judge the reliability of the results.
  - Evidence: The figures lack error bars or statements of statistical significance.
Suggestions for Improvement
- Strengthen the theoretical and convergence analysis
  - Add a formal theorem in Sec. 3.3 stating a convergence bound under a bounded-advantage or bounded-ratio assumption.
  - Derive an upper bound on the variance, or provide comparative reasoning against PPO/GRPO.
  - An appendix could give the details of the gradient-expectation and variance calculations.
- Add mechanism-level ablations
  - Test three configurations separately: flipping only, soft dual clipping only, and the two combined (see the sketch after this list).
  - Report the corresponding metrics (entropy, KL, repetition) for each configuration in detail.
  - A column of standalone result curves could also be added in the style of Fig. 5.
- Complete the symbol definitions and pseudocode description
  - Add a table in Sec. 3.2 that defines the symbols and logical conditions.
  - Make the flip condition explicit, including its threshold and how boundary cases are handled.
  - Unify the notation across Eq. (5)–(7) so that readers do not confuse the closely related symbols.
- Broaden the tasks and datasets
  - Add non-coding evaluations, such as natural-language writing tasks or more demanding mathematical-reasoning data.
  - Report stability and convergence speed across these diverse tasks.
  - If resources are limited, preliminary results or links to external benchmarks would still help.
- Improve the credibility and readability of the results
  - Add standard deviations or confidence intervals for each metric.
  - State the number of experimental repetitions in the figure captions.
  - Improve the structure of the text by discussing the trend and meaning of each metric in its own subsection.
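As a concrete form of the ablation proposed in the second suggestion above, the three runs could be organized as simple configuration toggles trained under otherwise identical settings. The dataclass and field names below are hypothetical and only illustrate the factorial design; they are not taken from the manuscript.

```python
from dataclasses import dataclass

@dataclass
class AblationConfig:
    flip_positive_ratio: bool  # use 1/r for positive-advantage tokens
    soft_dual_clipping: bool   # bound extreme (flipped) ratios

ablations = {
    "flip_only":      AblationConfig(flip_positive_ratio=True,  soft_dual_clipping=False),
    "soft_clip_only": AblationConfig(flip_positive_ratio=False, soft_dual_clipping=True),
    "full_aspo":      AblationConfig(flip_positive_ratio=True,  soft_dual_clipping=True),
}
# Each configuration would then be reported with the same metrics as Fig. 5
# (entropy, KL, repetition) so the contribution of each mechanism is isolated.
```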
References
None
English Review
Synopsis of the paper
The paper introduces ASPO (Asymmetric Importance Sampling Policy Optimization), a reinforcement-learning-based post-training algorithm for large language models (LLMs). ASPO aims to correct a bias in outcome-supervised RL (OSRL) methods such as GRPO, where the importance-sampling (IS) ratios for positive-advantage tokens are misaligned, causing over-updates to common tokens and under-updates to rare ones. The core proposal is an asymmetric IS correction that flips the IS ratio for positive-advantage tokens and combines it with a soft dual-clipping mechanism to stabilize gradients. Extensive experiments on coding and mathematical reasoning benchmarks (e.g., LiveCodeBench v5) demonstrate performance and stability improvements over several GRPO-based baselines (see Fig. 5). Visualization analyses (Fig. 3) illustrate how flipping and clipping reshape update regions and alleviate premature convergence.
Summary of Review
The paper offers a clear motivation for treating token-level IS ratios asymmetrically, correcting a mismatch that standard OSRL methods introduce for positive-advantage tokens (see Sec. 1; Eq. (3)–(5)). The proposed ASPO formulation is conceptually simple but addresses a critical training imbalance, yielding improved stability and KL control in empirical results (Fig. 5a–f). Its analytical support, including geometric interpretation (Fig. 3), is insightful and helps explain observed gains. However, the theoretical justification lacks rigorous convergence or variance analysis (No direct evidence found in the manuscript). Some experimental and ablation details are either missing or briefly described (Sec. 4.2). The exposition—particularly the derivation of the "flipped" IS correction—is occasionally terse and could be clarified with step-by-step rationale (Sec. 3.2).
Strengths
• Clear Problem Formulation and Motivation
– The authors explicitly identify the token-level IS asymmetry in outcome-supervised RL as the cause of training imbalance (Sec. 2.2).
– Fig. 1 quantitatively supports this issue: removing IS disproportionately affects repetition rate and clipping ratio.
– The clear link between theoretical diagnosis and empirical manifestation enhances conceptual coherence.
• Simple Yet Effective Algorithmic Modification
– ASPO requires minimal modification to existing GRPO-style implementations, primarily flipping the IS ratio for positive-advantage tokens (Eq. 6; Sec. 3.1).
– This design choice maximizes reproducibility and interpretability, emphasizing a principle-based fix rather than complex architectural changes.
– The “soft dual-clipping” mechanism further ensures stable learning dynamics (Sec. 3.3; Fig. 2), highlighting practical stability benefits.
• Thorough Empirical Evaluation with Diverse Metrics
– Experiments span coding and reasoning benchmarks, mainly LiveCodeBench v5 (Sec. 4.1; Fig. 5a,b).
– The performance across multiple curves (entropy, repetition, clip ratio, KL loss) shows clear and consistent gains for ASPO over DAPO and DAPO w/Pos Response-Level IS Mean (Fig. 5c–f).
– The inclusion of both quantitative and qualitative stability indicators substantiates the operational impact of the proposed method.
• Insightful 3D and 2D Visual Analysis of IS Regions
– Fig. 3a–c decomposes IS dynamics by old vs. current probabilities, visually distinguishing the modified and original update regions.
– These figures demonstrate conceptual novelty by connecting policy geometry to optimization behavior.
– The graphical intuition contributes to interpretability and pedagogical clarity uncommon in RL fine-tuning work.
• Evidence of Training Stability and Reduced Premature Convergence
– The retention of higher entropy in ASPO during fine-tuning (Fig. 5c) empirically validates the claim of mitigated collapse.
– ASPO’s lower clip ratio (Fig. 5e) suggests smoother gradient propagation.
– These improvements demonstrate not only performance impact but also better learning dynamics, aligning with theoretical motivation (Sec. 3.2).
Weaknesses
• Limited Theoretical Analysis of Convergence and Variance
– The paper does not provide mathematical guarantees for bias correction or variance reduction (No direct evidence found in the manuscript).
– Section 3.2 states the method “aligns update direction” but does not quantify residual bias or sampling variance.
– Without formal proof or sensitivity analysis, theoretical soundness remains incomplete.
• Insufficient Detail in Experimental Setup
– Section 4.1 lacks clarity on reward normalization, sampling temperature, and the baseline models used.
– The benchmark description for mathematical reasoning tasks is sparse; “comprehensive experiments” are claimed, yet only LiveCodeBench v5 curves are shown (Fig. 5).
– Absence of variance estimates (e.g., standard deviations across runs) limits reproducibility and confidence in statistical significance.
• Incomplete Ablations and Missing Comparisons
– While DAPO and DAPO w/Pos Response-Level IS Mean appear (Fig. 4–5), ablations on the separate effects of “flipping” and “dual-clipping” are not shown.
– Section 4.2 briefly mentions “soft dual-clipping stabilizes updates” without isolating its contribution.
– Comparison with modern OSRL baselines beyond GRPO variants is not presented (e.g., PPO-Clip + KL control).
• Occasional Clarity Gaps in Mathematical Formulations
– The notations in Eq. (3)–(6) lack definitions of certain variables (e.g., the clipping thresholds).
– Eq. (5) refers to the “dual-clip region,” but no explicit formula for these boundaries is provided in text; only visual hints exist (Fig. 3c).
– This obscures reproducibility and might confuse readers about how to implement the exact update rule.
• Limited Generalization Discussion Beyond Coding Benchmarks
– The experiments focus heavily on coding tasks; transfer to other instruction datasets or dialogue domains is not tested.
– No analysis of cross-domain robustness, token distribution shift, or reward sparsity is reported.
– Without these, the claimed general applicability to “LLM post-training” (Abstract) is not entirely substantiated.
Suggestions for Improvement
• Strengthen Theoretical Foundations
– Derive a formal bias and variance analysis comparing original GRPO and ASPO objectives (Sec. 3.2); include an appendix with asymptotic behavior as IS ratios grow.
– Provide a short proof sketch demonstrating that ASPO’s flipped ratios retain unbiasedness under mild conditions.
– If theoretical formalization is infeasible, offer empirical variance diagnostics akin to Fig. 2 but with confidence intervals (a minimal example of such a diagnostic is sketched after this list).
• Clarify and Expand Experimental Methodology
– Specify hyperparameters such as learning rate, batch size, sampling temperature, and reward normalization in Sec. 4.1.
– Report run-to-run variance (mean ± std) for key metrics in Fig. 5 to support statistical validity.
– If possible, include at least one reasoning or dialogue dataset to validate domain robustness.
• Provide Comprehensive Ablations and Broader Baselines
– Introduce an ablation table isolating (a) flipping only, (b) dual-clipping only, and (c) their combination.
– Compare to additional OSRL or PPO-style baselines to contextualize gains beyond GRPO derivatives.
– Discuss sensitivity to clipping thresholds or temperature parameters (Sec. 4.2).
• Improve Clarity and Completeness of Mathematical Exposition
– Define all symbols used in Eq. (3)–(6), particularly the clipping boundaries.
– Explicitly write out the piecewise form of the “soft dual-clipping” operation referenced in Fig. 3.
– Add pseudocode or algorithmic steps summarizing ASPO to bridge equations with implementation details.
• Extend Empirical Discussion on Generalization
– Assess ASPO on at least one non-coding benchmark to verify domain transferability.
– Investigate how the flipped IS ratios behave under rare-token or long-context scenarios.
– Include a discussion (Sec. 5) on potential limitations or adaptation strategies for instruction-tuning or preference-optimization contexts.
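On the fallback of empirical variance diagnostics mentioned in the first suggestion above, a minimal version could simply log the spread of the token weights that would actually be applied to positive-advantage tokens under the two schemes. Everything below (array names, the synthetic log-probabilities) is illustrative, not data or code from the manuscript.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-ins for per-token log-probabilities of positive-advantage tokens.
logp_old = rng.normal(-2.0, 1.0, size=10_000)
logp_new = logp_old + rng.normal(0.1, 0.3, size=10_000)

standard = np.exp(logp_new - logp_old)  # r_t, as in GRPO-style objectives
flipped = np.exp(logp_old - logp_new)   # 1 / r_t, as proposed by ASPO

for name, w in [("standard r", standard), ("flipped 1/r", flipped)]:
    print(f"{name}: mean={w.mean():.3f}  std={w.std():.3f}  p99={np.percentile(w, 99):.3f}")
```

Reporting such spread statistics (mean, standard deviation, tail percentiles) alongside Fig. 2-style curves would address the variance concern without requiring a formal proof.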
References
None