
Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models

Published: 05/29/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Segment Policy Optimization (SPO) introduces segment-level advantage estimation for RL in large language models, balancing precision and stability without a critic. SPO outperforms PPO and GRPO on GSM8K and MATH500, enhancing reasoning abilities effectively.

Abstract

Enhancing the reasoning capabilities of large language models effectively using reinforcement learning (RL) remains a crucial challenge. Existing approaches primarily adopt two contrasting advantage estimation granularities: token-level methods (e.g., PPO) aim to provide fine-grained advantage signals but suffer from inaccurate estimation due to difficulties in training an accurate critic model. On the other extreme, trajectory-level methods (e.g., GRPO) solely rely on a coarse-grained advantage signal from the final reward, leading to imprecise credit assignment. To address these limitations, we propose Segment Policy Optimization (SPO), a novel RL framework that leverages segment-level advantage estimation at an intermediate granularity, achieving a better balance by offering more precise credit assignment than trajectory-level methods and requiring fewer estimation points than token-level methods, enabling accurate advantage estimation based on Monte Carlo (MC) without a critic model. SPO features three components with novel strategies: (1) flexible segment partition; (2) accurate segment advantage estimation; and (3) policy optimization using segment advantages, including a novel probability-mask strategy. We further instantiate SPO for two specific scenarios: (1) SPO-chain for short chain-of-thought (CoT), featuring novel cutpoint-based partition and chain-based advantage estimation, achieving 6-12 percentage point improvements in accuracy over PPO and GRPO on GSM8K. (2) SPO-tree for long CoT, featuring novel tree-based advantage estimation, which significantly reduces the cost of MC estimation, achieving 7-11 percentage point improvements over GRPO on MATH500 under 2K and 4K context evaluation. We make our code publicly available at https://github.com/AIFrameResearch/SPO.


In-depth Reading


Bibliographic Information

  • Title: Segment Policy Optimization: Effective Segment-Level Credit Assignment in RL for Large Language Models
  • Authors: Yiran Guo, Lijie Xu, Jie Liu, Dan Ye, Shuang Qiu
  • Affiliations: Institute of Software, Chinese Academy of Sciences; University of Chinese Academy of Sciences; City University of Hong Kong.
  • Journal/Conference: This paper is a preprint available on arXiv; the provided metadata lists a publication date of May 29, 2025, and no peer-reviewed venue is indicated.
  • Publication Year: 2025 (as per provided metadata).
  • Abstract: The paper tackles the challenge of using Reinforcement Learning (RL) to enhance the reasoning abilities of Large Language Models (LLMs). The authors identify limitations in existing RL methods: token-level approaches like PPO suffer from inaccurate advantage estimation due to unstable critic models, while trajectory-level approaches like GRPO provide coarse and imprecise credit assignment. To solve this, they propose Segment Policy Optimization (SPO), a new RL framework that operates at an intermediate, segment-level granularity. This approach balances precision in credit assignment with the ability to accurately estimate advantages using critic-free Monte Carlo (MC) sampling. SPO consists of three components: flexible segment partitioning, accurate segment advantage estimation, and policy optimization using these advantages. The paper introduces two specific implementations: SPO-chain for short chain-of-thought (CoT) reasoning and SPO-tree for long CoT. SPO-chain demonstrated 6-12 percentage point improvements over PPO and GRPO on the GSM8K benchmark. SPO-tree showed 7-11 percentage point improvements over GRPO on the MATH500 benchmark.
  • Original Source Link:

Executive Summary

Background & Motivation (Why)

  • Core Problem: A fundamental challenge in training powerful reasoning LLMs with Reinforcement Learning (RL) is credit assignment. In tasks like mathematical reasoning, the model generates a long sequence of tokens (a "trajectory"), but the reward (correct or incorrect) is only given at the very end. The problem is how to determine which specific tokens (actions) in the sequence were responsible for the final outcome.
  • Gaps in Prior Work:
    1. Token-level methods (e.g., PPO): These methods try to assign credit to every single token. To do this, they require a "critic" model to estimate the value of being in a particular state (i.e., having generated a certain prefix). However, training an accurate critic for LLMs is notoriously difficult, leading to unreliable credit assignment.
    2. Trajectory-level methods (e.g., GRPO): These methods avoid the critic by assigning a single credit score (advantage) to the entire sequence of tokens. While computationally simple, this is imprecise. It treats all tokens equally, failing to reward good intermediate steps in a failed solution or penalize redundant steps in a successful one. This can also lead to overfitting.
  • Paper's Innovation: The paper proposes a middle-ground solution, Segment Policy Optimization (SPO). Instead of assigning credit at the token or trajectory level, SPO divides the generated sequence into a few contiguous segments and assigns credit at this intermediate granularity. This approach is fine-grained enough to provide more precise feedback than trajectory-level methods but coarse-grained enough to allow for accurate, critic-free advantage estimation using Monte Carlo sampling.

Main Contributions / Findings (What)

  1. A Novel RL Framework (SPO): The paper introduces Segment Policy Optimization (SPO), a modular RL framework for LLMs that performs credit assignment at the segment level, effectively balancing the trade-off between estimation accuracy and credit precision.
  2. Innovative Techniques: The framework incorporates several new strategies:
    • Flexible Segment Partition: A method to divide token sequences into segments, including a novel cutpoint-based strategy that identifies key decision points.
    • Tree-based Segment Advantage Estimation: A highly efficient method for estimating segment advantages in long reasoning tasks, which significantly reduces computational cost by reusing samples.
    • Probability-Mask Policy Optimization: A technique that focuses policy updates on the most critical tokens within a segment, further improving credit assignment.
  3. Specialized Implementations: The authors create two tailored versions of SPO:
    • SPO-chain: Designed for short-form reasoning (e.g., on the GSM8K dataset), it uses cutpoint-based partitioning and a straightforward sampling method.
    • SPO-tree: Designed for long-form reasoning (e.g., on the MATH dataset), it uses the efficient tree-based sampling method to handle long sequences.
  4. State-of-the-Art Results: Experiments show that SPO significantly outperforms established baselines. SPO-chain improves accuracy by 6-12 points on GSM8K, and SPO-tree improves accuracy by 7-11 points on MATH500, demonstrating the effectiveness of segment-level credit assignment.

Prerequisite Knowledge & Related Work

Foundational Concepts

  • Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data. They can generate human-like text, answer questions, and perform complex reasoning tasks by predicting the next token in a sequence.
  • Reinforcement Learning (RL): A machine learning paradigm where an "agent" (the LLM) learns to make decisions by performing "actions" (generating tokens) in an "environment" to maximize a cumulative "reward". In the context of LLM reasoning, the reward is typically sparse, given only after the full solution is generated.
  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making. For LLMs, it's defined as:
    • State ($s_t$): The prompt and all tokens generated so far: $s_t = [\mathbf{x}, \mathbf{y}_{<t}]$.
    • Action ($a_t$): The next token to be generated, $y_t$.
    • Policy ($\pi_{\theta}(a_t \mid s_t)$): The LLM itself, parameterized by $\theta$, which gives the probability of generating token $a_t$ given the current state $s_t$.
    • Reward ($\mathcal{R}$): A signal indicating success. For reasoning tasks, it is often a binary reward (1 for a correct final answer, 0 otherwise), making it sparse and delayed.
    • Value Function ($V(s)$): The expected total future reward starting from state $s$. It tells you "how good" a state is.
    • Advantage Function ($A(s, a)$): The extra reward gained by taking action $a$ in state $s$ compared to the average action. It is defined as $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the value of the state-action pair. It helps identify better-than-average actions.
  • Monte Carlo (MC) Estimation: A method to estimate expected values by averaging results from repeated random sampling. In this paper, to estimate the value $V(s)$ of a state, the model generates multiple complete sequences starting from $s$ and averages their final rewards. This is "critic-free" because it does not require a separate model to predict the value.
  • Chain-of-Thought (CoT) Reasoning: A technique where LLMs are prompted to generate intermediate reasoning steps before giving a final answer. This improves performance on complex tasks like math and logic problems.

Previous Works

  • Proximal Policy Optimization (PPO): A popular RL algorithm for training LLMs. PPO tries to make stable updates to the policy: its objective function prevents the new policy from straying too far from the old one. The core PPO objective is:
$$\mathcal{J}_{\mathrm{PPO}}(\theta) = \mathbb{E}_{t} \left[ \min\left(r_t(\theta)\hat{A}_t,\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right) \right]$$

    • $r_t(\theta) = \frac{\pi_{\theta}(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio between the new and old policies.
    • $\hat{A}_t$ is the estimated advantage for the action at time step $t$.
    • $\epsilon$ is a small hyperparameter that defines the clipping range. Limitation: PPO traditionally estimates $\hat{A}_t$ for every token using a critic model, which is often inaccurate for LLMs.
  • Group Relative Policy Optimization (GRPO): An alternative RL algorithm that simplifies credit assignment. It samples multiple trajectories for a given prompt, and the advantage for each trajectory is calculated from its final reward relative to the average reward of the group. This single trajectory-level advantage is then assigned to all tokens within that trajectory (a minimal sketch of this computation follows after this list). Limitation: This "one-size-fits-all" credit assignment is imprecise for long sequences, as it does not distinguish between good and bad steps within a trajectory.

  • VinePPO: A recent work that also attempts intermediate-granularity credit assignment. It partitions trajectories into "steps" based on heuristic rules (like line breaks) and uses MC estimation for step-level advantages.
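
To make the contrast concrete, here is a minimal Python sketch (names and the use of NumPy are illustrative assumptions) of the group-relative advantage computation that GRPO-style methods use: each sampled solution's advantage is its final reward standardized against the other solutions in the group, and that single scalar is applied to every token of the solution.

```python
import numpy as np

def grpo_trajectory_advantages(final_rewards, eps=1e-8):
    """Group-relative advantages: standardize each trajectory's final reward
    against the group sampled for the same prompt. The resulting scalar is
    broadcast to every token of that trajectory."""
    r = np.asarray(final_rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: four sampled solutions for one prompt, binary final rewards.
print(grpo_trajectory_advantages([1.0, 0.0, 0.0, 1.0]))  # ~[ 1, -1, -1,  1]
```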

Differentiation

SPO differentiates itself from these key alternatives:

  • SPO vs. PPO: SPO is critic-free, using MC estimation for advantages, which the paper argues is more accurate and stable than training a critic. It also operates at a coarser (segment) level, requiring fewer advantage estimation points.
  • SPO vs. GRPO: SPO provides more precise credit assignment. Instead of one advantage for the whole trajectory, it provides multiple advantages for different segments, allowing the model to better identify which parts of its reasoning were productive.
  • SPO vs. VinePPO: SPO is more general and efficient.
    • Generality: VinePPO relies on heuristic rules (like line breaks) to define segments, which may not apply to all tasks. SPO allows for arbitrary partitioning, including the novel cutpoint-based method that is task-agnostic.
    • Efficiency: For long reasoning, SPO-tree introduces a tree-based sampling method that reuses computations, making it far more efficient than VinePPO's approach, which requires costly independent resampling for each step. The paper notes that VinePPO can be seen as a special instance of the more general SPO framework.

Methodology (Core Technology & Implementation Details)

The core of the paper is the Segment Policy Optimization (SPO) framework. It is designed to be modular and flexible, consisting of three main components.

Figure 1: Overview of the SPO framework. The framework consists of three components: flexible segment partition, segment advantage estimation, and policy optimization using segment advantages, each of which can be implemented with different strategies. The figure illustrates the cutpoint-based partition method and the probability-mask strategy that assigns segment advantages to cutpoints.

1. Flexible Segment Partition

The first step in SPO is to divide a generated token sequence $\mathbf{y} = [y_1, y_2, \ldots, y_T]$ into $K$ contiguous segments: $\mathbf{y} = [\mathbf{seg}_1, \mathbf{seg}_2, \ldots, \mathbf{seg}_K]$. A segment is simply a contiguous subsequence of tokens, $\mathbf{seg}_k = [y_{t_k}, \ldots, y_{t_{k+1}-1}]$. The framework supports various strategies for this partitioning:

  • Fixed Token Count Partition: The simplest strategy, where the sequence is split into segments of a fixed length (e.g., every 50 tokens).
  • Adaptive Cutpoint-based Partition: A more sophisticated strategy. A "cutpoint" is defined as a token generated with low probability, i.e., where $\pi_{\theta}(y_t \mid s_t) < \rho$ for some threshold $\rho$. These are points of high uncertainty where the model's reasoning is likely to diverge. The trajectory is then partitioned so that each segment contains a roughly equal number of cutpoints, which focuses the analysis on the most critical parts of the reasoning process.
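
As an illustration of this strategy, the following is a minimal sketch, assuming per-token sampling probabilities are available; the threshold value, the "equal number of cutpoints per segment" grouping rule, and the function name are assumptions for illustration, not the paper's exact implementation.

```python
from typing import List

def cutpoint_segment_starts(token_probs: List[float], rho: float = 0.9,
                            cutpoints_per_segment: int = 5) -> List[int]:
    """Return the start indices of segments for one sampled trajectory.

    A token is treated as a cutpoint when the policy generated it with
    probability below `rho`; a new segment starts after every
    `cutpoints_per_segment` cutpoints.
    """
    cutpoints = [t for t, p in enumerate(token_probs) if p < rho]
    starts = [0]  # the first segment always starts at token 0
    for i in range(cutpoints_per_segment, len(cutpoints), cutpoints_per_segment):
        starts.append(cutpoints[i])
    return starts

# Example: probabilities of the tokens actually sampled by the policy.
probs = [0.99, 0.42, 0.97, 0.31, 0.88, 0.15, 0.95, 0.27, 0.64, 0.99]
print(cutpoint_segment_starts(probs, rho=0.9, cutpoints_per_segment=3))  # [0, 5]
```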

2. Segment Advantage Estimation via Monte Carlo

After partitioning, SPO estimates the advantage of each segment. The advantage for segment $k$, denoted $A_k^{\mathrm{seg}}$, measures the improvement in expected reward from generating that segment. It is defined as the change in the value function across the segment's boundaries:

$$A_{k}^{\mathrm{seg}} := V(s_{t_{k+1}}) - V(s_{t_k})$$

  • $s_{t_k}$ is the state just before generating segment $k$.

  • $s_{t_{k+1}}$ is the state just after generating segment $k$.

    Instead of using a critic model to estimate $V(s)$, SPO uses Monte Carlo (MC) estimation. Two MC strategies are proposed:

  • (a) Chain-based Advantage Estimation: For each segment boundary state $s_{t_k}$, the model generates $N$ full trajectories from that point onwards. The value estimate $\hat{V}(s_{t_k})$ is the average of the final rewards of these $N$ trajectories. This is done for the start and end of each segment to calculate the segment advantage. The method is simple but can be computationally expensive, as the sampled trajectories are discarded after estimation (a sketch follows below).

  • (b) Tree-based Advantage Estimation: A more sample-efficient method, especially for long sequences. Trajectories are generated in a tree structure. Each node in the tree represents a segment. From a parent node, multiple child segments are sampled, forming a group of "siblings". The value of a node is calculated recursively from the leaves (which have the final reward) up to the root. The advantage of a node (segment) is then calculated by comparing its value to the average value of its siblings. This structure allows samples to be reused for both value estimation and policy optimization, dramatically improving efficiency.

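
The chain-based estimator (strategy (a) above) can be sketched as follows. This is a minimal illustration: `policy.sample` and `reward_fn` are hypothetical placeholders for the actual rollout and answer-checking code, and the rollout count is arbitrary.

```python
def mc_value(policy, prefix_tokens, reward_fn, n_rollouts=4):
    """Monte Carlo estimate of V(s): roll out several completions from the
    given prefix and average their final (e.g., 0/1 correctness) rewards."""
    total = 0.0
    for _ in range(n_rollouts):
        completion = policy.sample(prefix_tokens)      # full rollout to EOS
        total += reward_fn(prefix_tokens, completion)  # 1 if final answer correct
    return total / n_rollouts

def chain_segment_advantages(policy, prompt, segments, reward_fn, n_rollouts=4):
    """A_k = V_hat(s_{t_{k+1}}) - V_hat(s_{t_k}), estimated at every boundary."""
    boundaries = [prompt]                    # s_{t_1}: the prompt alone
    for seg in segments:
        boundaries.append(boundaries[-1] + seg)
    values = [mc_value(policy, b, reward_fn, n_rollouts) for b in boundaries]
    return [values[k + 1] - values[k] for k in range(len(segments))]
```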

3. Policy Optimization Using Segment Advantages

Once the segment advantages $\hat{A}_k^{\mathrm{seg}}$ are estimated, they are used to update the policy $\pi_{\theta}$. The paper highlights a novel strategy for this:

  • Policy Gradient with Token-Probability Masks: A standard approach would be to assign the segment advantage $\hat{A}_k^{\mathrm{seg}}$ to all tokens within that segment. However, SPO proposes a more precise method: the advantage is applied only to the cutpoints (low-probability tokens) within the segment, and the loss from the other (high-probability) tokens is masked out. This focuses the learning signal on the most uncertain and decisive actions, further refining credit assignment.
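
A minimal sketch of this mask, assuming the per-token probabilities of the sampled tokens under the old policy are available as a tensor (the threshold value is illustrative):

```python
import torch

def cutpoint_mask(old_token_probs: torch.Tensor, rho: float = 0.9) -> torch.Tensor:
    """M_t = 1 for low-probability (cutpoint) tokens under the old policy,
    0 otherwise; only masked-in tokens contribute to the policy loss."""
    return (old_token_probs < rho).float()

# Tokens sampled with probability >= rho are masked out of the update.
print(cutpoint_mask(torch.tensor([0.98, 0.35, 0.91, 0.12])))  # tensor([0., 1., 0., 1.])
```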

SPO-chain for Short CoT

SPO-chain is an instance of SPO tailored for short reasoning tasks (e.g., GSM8K). It combines:

  1. Partition: Adaptive Cutpoint-based Segment Partition.

  2. Estimation: Chain-based Segment Advantage Estimation.

  3. Optimization: Policy Gradient with Token-Probability Masks.

    The loss function for SPO-chain is:
$$\mathcal{J}_{\mathrm{SPO}}^{\mathrm{chain}}(\theta) = \mathbb{E} \Bigg\{ \frac{1}{Z} \sum_{k=1}^{K} \sum_{t=t_k}^{t_{k+1}-1} \Big[ M_t \cdot \min\big(r_t(\theta)\hat{A}_k^{\mathrm{seg}},\ \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_k^{\mathrm{seg}}\big) - \beta D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}) \Big] \Bigg\}$$

  • $\hat{A}_k^{\mathrm{seg}}$ is the advantage for the $k$-th segment.
  • $M_t$ is the probability mask: $M_t = 1$ if the token at time $t$ is a cutpoint (i.e., $\pi_{\theta_{\text{old}}}(a_t \mid s_t) < \rho$), and 0 otherwise. This ensures the loss is only computed for critical tokens.
  • $Z$ is a normalization term: the total count of masked-in tokens.
  • The rest of the terms are similar to the PPO objective.
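
Under these definitions, a PyTorch-style sketch of the objective might look as follows. This is a hedged illustration rather than the paper's released implementation: the per-token KL estimator, the sign convention (returning a loss to minimize), and all tensor names are assumptions.

```python
import torch

def spo_chain_loss(logp_new, logp_old, logp_ref, seg_adv, mask, eps=0.2, beta=0.01):
    """Masked, clipped surrogate with segment-level advantages.

    logp_new / logp_old / logp_ref: per-token log-probs of the sampled tokens
        under the current, behavior, and reference policies (shape [T]).
    seg_adv: the segment advantage A_k broadcast to every token of segment k.
    mask:    probability mask M_t (1 at cutpoints, 0 elsewhere).
    """
    ratio = torch.exp(logp_new - logp_old)
    surrogate = torch.minimum(ratio * seg_adv,
                              torch.clamp(ratio, 1 - eps, 1 + eps) * seg_adv)
    # A common per-token estimator of KL(pi_theta || pi_ref) (assumption).
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    per_token = mask * surrogate - beta * kl
    z = mask.sum().clamp(min=1.0)   # Z: number of masked-in tokens
    return -(per_token.sum() / z)   # negate: the objective is maximized
```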

SPO-tree for Long CoT

SPO-tree is designed for long reasoning tasks (e.g., MATH) where SPO-chain's sampling is too expensive. It combines:

  1. Partition: Fixed Token Count Segment Partition.

  2. Estimation: Tree-based Segment Advantage Estimation.

  3. Optimization: Policy Gradient with Token-Probability Masks.

    The loss function for SPO-tree is similar but adapted for the tree structure. It only considers segments (nodes) with non-zero advantage for training, further improving efficiency:
$$\mathcal{J}_{\mathrm{SPO}}^{\mathrm{tree}}(\theta) = \mathbb{E} \Bigg\{ \frac{1}{Z} \sum_{n \in \mathrm{tree},\, \hat{A}(n) \neq 0} \sum_{t=1}^{|\mathrm{seg}(n)|} \Big[ M_{n,t} \cdot \min\big(r_{n,t}(\theta)\hat{A}(n),\ \text{clip}(r_{n,t}(\theta), 1-\epsilon, 1+\epsilon)\hat{A}(n)\big) - \beta D_{\mathrm{KL}}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}) \Big] \Bigg\}$$

  • $\hat{A}(n)$ is the advantage of the segment represented by node $n$ in the tree.
  • The sum is over all nodes $n$ in the tree with a non-zero advantage.
  • $M_{n,t}$ is the probability mask for tokens within the segment of node $n$.
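
To make the tree-based estimation that feeds this loss concrete, here is a minimal sketch, assuming that leaves carry the binary final reward, that an internal node's value is the mean of its children's values, and that a node's advantage is its value minus the mean value of its sibling group (the unnormalized variant discussed in the ablations); the data structure and names are illustrative.

```python
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    """One segment in the sampling tree; leaves carry the final 0/1 reward."""
    children: List["Node"] = field(default_factory=list)
    reward: Optional[float] = None   # set on leaf nodes only
    value: float = 0.0
    advantage: float = 0.0

def propagate_values(node: Node) -> float:
    """Bottom-up value estimate: a leaf's value is its reward; an internal
    node's value is the mean of its children's values."""
    if not node.children:
        node.value = float(node.reward)
        return node.value
    node.value = sum(propagate_values(c) for c in node.children) / len(node.children)
    return node.value

def assign_sibling_advantages(node: Node) -> None:
    """Each child's advantage is its value minus the mean value of its siblings;
    nodes with zero advantage can then be skipped during policy optimization."""
    if not node.children:
        return
    group_mean = sum(c.value for c in node.children) / len(node.children)
    for c in node.children:
        c.advantage = c.value - group_mean
        assign_sibling_advantages(c)
```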

Experimental Setup

Datasets

  • GSM8K: A dataset of 8,500 high-quality, grade school-level math word problems. It requires multi-step reasoning to solve.
    • Example Data Sample (from Appendix I):

      Problem: The city of Gading is planting trees. There are 24 maple trees and 15 times as many oak trees. If each maple tree costs $300 and each oak tree costs $50, what is the total cost of all trees? Model's Reasoning Steps (from Figure 12): The model attempts to calculate the number of oak trees, the cost of maple trees, the cost of oak trees, and then the total cost. The provided figure shows where the model makes mistakes and has low confidence.

  • MATH: A dataset of 12,500 challenging competition mathematics problems from subjects like Algebra, Geometry, and Number Theory. It is significantly more difficult than GSM8K.
    • Example Data Sample (from Appendix J):

      Problem: What is the smallest positive integer $n$ such that $17n \equiv 1234 \pmod{7}$? Model's Reasoning Steps: The model needs to simplify the congruence by reducing the coefficients modulo 7, find the multiplicative inverse of 3 modulo 7, and solve for $n$.

  • Knights-and-Knaves: A benchmark for logical reasoning puzzles. Characters are either knights (always truthful) or knaves (always liars), and the goal is to determine their identities from their statements. The paper uses the 3-person subset (3ppl).

Evaluation Metrics

  • Pass@1 (Accuracy): This is the primary metric.
    • Conceptual Definition: It measures the percentage of problems for which the model generates a perfectly correct final answer in a single attempt, using greedy decoding (i.e., always picking the token with the highest probability at each step, with temperature=0). It's a strict measure of the model's reasoning capability.
    • Mathematical Formula: $\text{Pass@1} = \frac{1}{|D_{\text{test}}|} \sum_{i=1}^{|D_{\text{test}}|} \mathbb{I}\big(\text{Check}(\text{Generate}(p_i)) = \text{True}\big)$
    • Symbol Explanation:
      • $D_{\text{test}}$: The set of test problems.
      • $p_i$: The $i$-th problem in the test set.
      • $\text{Generate}(p_i)$: The solution generated by the model for problem $p_i$.
      • $\text{Check}(\cdot)$: A function that verifies whether the generated solution is correct.
      • $\mathbb{I}(\cdot)$: An indicator function that is 1 if the condition inside is true, and 0 otherwise.
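
A minimal sketch of this metric, with `generate` (greedy decoding) and `check` (answer verification) as hypothetical stand-ins for the actual model call and verifier:

```python
from typing import Callable, Iterable

def pass_at_1(problems: Iterable[str],
              generate: Callable[[str], str],
              check: Callable[[str, str], bool]) -> float:
    """Fraction of test problems whose single greedy completion is verified correct."""
    problems = list(problems)
    correct = sum(1 for p in problems if check(p, generate(p)))
    return correct / len(problems)
```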

Baselines

  • PPO: The standard token-level RL baseline, which uses a critic model.
  • GRPO: The standard trajectory-level RL baseline, which is critic-free.
  • VinePPO: A recent and strong baseline that also uses intermediate-level credit assignment.
  • Other Baselines: RestEM (iterative self-training), DPO (Direct Preference Optimization), and RLOO (RL with outcome-only feedback).
  • Long CoT Baselines: For long context experiments, the paper also compares against recent state-of-the-art models DeepScaleR and STILL-3, which are based on GRPO.

Results & Analysis

Core Results: SPO-chain for Short CoT (on GSM8K)

The experiments demonstrate SPO-chain's superiority in short-form reasoning tasks.

  • Accuracy Comparison: As shown in Figure 3(a), SPO-chain (int5) achieves the highest test accuracy (56.7%), significantly outperforming PPO (44.6%) and GRPO (45.7%) by 12.1 and 11.0 percentage points, respectively. It also outperforms the strong VinePPO baseline (55.4%).

    Figure 3: (a) Test accuracy comparison of different methods on GSM8K; baseline results are from [12]. (b) Episode generation time comparison between SPO-chain (int5) and VinePPO during training. (c) Validation accuracy over training time for SPO-chain versus GRPO.

  • Efficiency: Figure 3(b) shows that SPO-chain is more time-efficient in generating episodes during training compared to VinePPO. This is because its cutpoint-based partitioning results in fewer advantage estimation points per trajectory.

  • Training Dynamics: Figure 3(c) illustrates that SPO-chain not only converges to a much better solution than GRPO but also learns more effectively over the same wall-clock time.

Ablation Studies for SPO-chain

  • Segmentation Granularity: Figure 4(a) explores how the number of cutpoints per segment ("interval") affects performance. A moderate interval (5) achieves the best trade-off between performance and training time. Very coarse intervals (100, approaching trajectory-level) perform poorly, while very fine intervals (2) offer marginal gains at a higher computational cost. This validates the benefit of a mid-level granularity.

    Figure 4: (a) Variations of segment partition granularity (different cutpoint intervals). (b) Variations of segment partition strategies. (c) Ablation on the probability-mask policy optimization strategy.

  • Partitioning Strategy: Figure 4(b) compares different partitioning methods. The cutpoint-based strategy used in SPO-chain (int5) achieves the best accuracy, even with a smaller sampling budget than a fixed token count partition, proving its effectiveness.

  • Probability-Mask Optimization: Figure 4(c) shows the impact of the probability mask. Removing the mask drops SPO-chain's accuracy from 56.7% to 55.0%. More strikingly, adding the mask to GRPO boosts its accuracy from 45.7% to 53.6%. This highlights that focusing updates on low-probability tokens is a powerful and broadly applicable technique for improving credit assignment.

Core Results: SPO-tree for Long CoT (on MATH)

SPO-tree is evaluated on the more challenging MATH dataset, which requires long-form reasoning.

  • Performance vs. GRPO: Figure 5(a) shows SPO-tree significantly outperforming both vanilla GRPO and GRPO with the probability mask during training on MATH500 with a 2K context size. This confirms that segment-level advantage signals are crucial for learning in complex, long-CoT scenarios.

    Figure 5: (a) Comparison of SPO-tree (6-6-6) and GRPO on MATH500 with a context size of 2K. (b) Variations of tree structures on GSM8K. (c) SPO-tree with different advantage computation methods on GSM8K.

  • Performance at Different Context Lengths: Table 1 presents a detailed comparison across various context sizes. SPO-tree consistently outperforms GRPO. Notably, at smaller context sizes (2K and 4K), SPO-tree's lead is massive (e.g., 0.736 vs. 0.62 on MATH500 at 2K). This suggests SPO's precise credit assignment improves token efficiency, enabling the model to find correct solutions more concisely. Interestingly, other GRPO-based models like DeepScaleR that were trained on much larger contexts perform worse at smaller evaluation contexts, potentially because GRPO encourages redundant, lengthy solutions.

    (Manual transcription of Table 1 from the paper)

    Dataset   Eval Context Size   Base    GRPO    SPO-tree   DeepScaleR*   STILL-3*
    MATH500   2K                  0.566   0.62    0.736      0.538         0.662
    MATH500   4K                  0.74    0.752   0.828      0.744         0.794
    MATH500   32K                 0.838   0.84    0.848      0.878         0.846
    AIME24    2K                  0.067   0.033   0.1        0             0.067
    AIME24    4K                  0.167   0.2     0.2        0.167         0.133
    AIME24    32K                 0.267   0.333   0.333      0.333         0.233

    Note: The paper states that DeepScaleR uses 8K-24K context and STILL-3 uses a very large generation length during training, while the authors' GRPO and SPO-tree models use a maximum of 4K context.

Ablation Studies for SPO-tree

  • Tree Structure: Figure 5(b) compares different tree structures (e.g., 4-4-4, 6-6-6). While performance differences are not substantial, larger trees (which allow for more accurate value estimation) tend to perform better later in training, demonstrating the robustness of the approach.
  • Advantage Normalization: Figure 5(c) shows that computing the advantage with or without standard deviation normalization yields similar performance, with the unnormalized version being slightly better in their tests.

Conclusion & Personal Thoughts

Conclusion Summary

The paper introduces Segment Policy Optimization (SPO), a novel and effective RL framework for training LLMs. By operating at an intermediate segment-level granularity, SPO successfully bridges the gap between imprecise trajectory-level methods (like GRPO) and unstable token-level methods (like PPO). Its key innovations—including flexible partitioning, critic-free MC advantage estimation, and probability-masked optimization—lead to more precise credit assignment and improved sample efficiency. The specialized instantiations, SPO-chain and SPO-tree, demonstrate state-of-the-art performance on both short and long chain-of-thought reasoning tasks, validating the central hypothesis that a mid-grained approach to credit assignment is superior for complex reasoning.

Limitations & Future Work

The authors acknowledge the following limitations and outline future directions:

  • Context Size: The long-CoT experiments were limited to a 4K context window due to computational constraints. Future work will explore scaling SPO to even larger contexts.
  • Task Diversity: The experiments primarily focused on mathematical reasoning. The authors plan to test SPO's effectiveness in other domains, such as code generation and general instruction-following with RL from Human Feedback (RLHF).
  • Sample Efficiency: While SPO-tree is highly efficient, it still operates within an on-policy framework. The authors suggest exploring more off-policy algorithms to further improve sample reuse and training efficiency.

Personal Insights & Critique

  • Significance and Novelty: The paper's core contribution is not just a new algorithm, but a new paradigm for credit assignment in RL for LLMs. The concept of a tunable, "segment-level" granularity is powerful and intuitive. It elegantly resolves the dilemma between PPO and GRPO. The cutpoint-based partitioning and probability-mask are particularly clever, providing a data-driven way to focus on "what matters" in a reasoning trace.
  • Practical Impact: SPO appears to be a highly practical and impactful contribution. It offers a robust, critic-free alternative that delivers superior performance. The provided code further increases its potential for broad adoption by the research community. The finding that the probability-mask technique can significantly boost even a simple baseline like GRPO is a valuable takeaway on its own.
  • Critical Reflections and Open Questions:
    • Hyperparameter Sensitivity: The cutpoint-based method relies on a probability threshold $\rho$. The sensitivity of the method to this hyperparameter is not deeply explored, and its optimal value might vary across models and tasks.

    • Tree Sampling Overhead: While SPO-tree is efficient, constructing a large tree for a single prompt still requires substantial initial sampling. The replay buffer mitigates this, but it could remain a bottleneck for very large-scale, distributed training.

    • Generality of "Cutpoints": The idea that low-probability tokens are the most important "turning points" is compelling for reasoning. It would be interesting to see if this holds true for more creative or stylistic generation tasks, where high-entropy (low-probability) choices might be desirable for different reasons.

      Overall, this is an exceptionally well-executed paper that identifies a clear problem, proposes an elegant and well-motivated solution, and validates it with strong empirical results. SPO has the potential to become a new standard for fine-tuning reasoning LLMs with reinforcement learning.
