
Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

Published: 10/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Stratified GRPO with Stratified Advantage Normalization eliminates cross-stratum bias in heterogeneous LLM search agent trajectories, yielding unbiased, stable credit assignment and superior multi-step RL performance.

Abstract

Large language model (LLM) agents increasingly rely on external tools such as search engines to solve complex, multi-step problems, and reinforcement learning (RL) has become a key paradigm for training them. However, the trajectories of search agents are structurally heterogeneous, where variations in the number, placement, and outcomes of search calls lead to fundamentally different answer directions and reward distributions. Standard policy gradient methods, which use a single global baseline, suffer from what we identify and formalize as cross-stratum bias—an "apples-to-oranges" comparison of heterogeneous trajectories. This cross-stratum bias distorts credit assignment and hinders exploration of complex, multi-step search strategies.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents

1.2. Authors

The paper lists "Anonymous authors," indicating that it is currently under double-blind review, a common practice at academic conferences to ensure impartial evaluation. Specific author names, research backgrounds, and affiliations are therefore not disclosed at this stage.

1.3. Journal/Conference

The paper is posted on OpenReview.net, a platform commonly used for conference submissions (e.g., ICLR, NeurIPS) before final acceptance. The PDF header indicates it is under double-blind review as a conference paper at ICLR 2026. The platform facilitates open peer review, making it a common venue for cutting-edge research in these fields.

1.4. Publication Year

The publication timestamp, 2025-10-08T00:00:00.000Z (UTC), corresponds to a publication date of October 8, 2025.

1.5. Abstract

This paper addresses a fundamental challenge in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents that utilize search tools: the structural heterogeneity of agent trajectories. Variations in search calls (number, placement, outcomes) lead to diverse reward distributions and answer directions, making trajectories inherently incomparable. Standard policy gradient methods, which use a single global baseline for computing advantages, suffer from what the authors term cross-stratum bias—an "apples-to-oranges" comparison. This bias distorts credit assignment and hinders the exploration of complex multi-step search strategies.

To overcome this, the authors propose Stratified GRPO. Its central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on structural properties and computes advantages locally within each stratum, ensuring fair comparison among true peers. The paper theoretically proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates within each stratum, and maintains global unbiasedness and unit-variance properties. For practical stability in finite-sample regimes, SAN is linearly blended with a global estimator. Extensive experiments on single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater stability, and more effective search policies. These results establish stratification as a principled solution for structural heterogeneity in RL for LLM search agents.

The original source link is https://openreview.net/forum?id=hqnGfzQQfa. It is currently under double-blind review. The PDF link is https://openreview.net/pdf?id=hqnGfzQQfa.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the structural heterogeneity inherent in the trajectories of Large Language Model (LLM) agents that interact with external tools like search engines, particularly when these agents are trained using Reinforcement Learning (RL).

LLM agents are increasingly powerful, capable of tackling complex, multi-step problems by leveraging external tools. RL has emerged as a promising paradigm for training these agents to learn sophisticated reasoning and tool-use strategies directly from outcome-based rewards. However, the trajectories generated by LLM search agents are not uniform; they exhibit significant structural differences. For example, some trajectories might involve zero search calls, while others might involve multiple, strategically placed search calls, each yielding different information and leading to distinct answer directions and reward distributions.

The challenge arises because standard policy gradient methods, which are commonly used in RL, typically employ a single global baseline to compute advantages for all trajectories. This implicitly assumes that all trajectories are comparable, regardless of their underlying structure. The authors identify and formalize this flawed assumption as cross-stratum bias. This "apples-to-oranges" comparison distorts credit assignment—the process of determining which actions were responsible for a given reward—and consequently hinders the agent's ability to effectively explore and learn complex, multi-step search strategies. This leads to suboptimal policies and training instability, as illustrated by issues like training collapse in existing methods. The problem is important because it prevents RL from fully realizing its potential in training sophisticated LLM agents for real-world tool-use scenarios.

The paper's entry point is recognizing this structural heterogeneity as a fundamental, often overlooked, issue for RL in LLM search agents. Its innovative idea is to introduce stratification—dividing trajectories into homogeneous groups or strata based on their structural properties—to ensure that comparisons and advantage calculations are made among true peers.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Identification and Formalization of Cross-Stratum Bias: The paper rigorously identifies and formalizes cross-stratum bias as a fundamental challenge in policy gradient methods for LLM search agents. It provides a theoretical decomposition (Proposition 1 and Theorem 1) demonstrating that this bias arises from using a global baseline across structurally heterogeneous agent trajectories, leading to inflated variance and distorted credit assignment.

  • Proposal of Stratified GRPO with Stratified Advantage Normalization (SAN): The authors propose Stratified GRPO, a principled RL algorithm designed to eliminate cross-stratum bias. Its core component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on structural properties (e.g., search count) and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers, leading to a fair and stable credit assignment.

  • Rigorous Theoretical Analysis of SAN: The paper provides comprehensive theoretical guarantees for SAN. It proves that SAN eliminates cross-stratum bias, is conditionally unbiased, and achieves unit variance within each stratum (Theorem 4). Crucially, SAN achieves these superior conditional properties while retaining the global unbiasedness and unit variance of standard normalization (Theorem 5), yielding a more pure and scale-stable learning signal.

  • Introduction of Blended Advantage for Practical Stability: To address practical stability concerns in finite-sample regimes where some strata might be small, the paper introduces a Blended Advantage (Definition 2). This robustly combines SAN with the global estimator through a linear interpolation, balancing local purity with global stability.

  • Empirical Validation and Superior Performance: Extensive experiments on seven diverse single-hop and multi-hop question-answering (QA) benchmarks demonstrate that Stratified GRPO substantially outperforms the standard GRPO baseline by up to 11.3 points. It achieves higher training rewards, greater training stability (e.g., preventing training collapse observed in GRPO for the instruct model), and learns more effective search policies (Figure 1), particularly on complex multi-hop tasks.

    These findings collectively establish stratification as a principled and effective remedy for structural heterogeneity in Reinforcement Learning for LLM search agents, leading to improved learning and agent performance.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of this paper, a beginner needs to understand several core concepts from Reinforcement Learning (RL) and Large Language Models (LLMs).

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, capable of understanding, generating, and processing human language. They can perform various tasks like question answering, summarization, and translation. In this paper, LLMs are used as agents that interact with environments.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent observes the state of the environment, takes an action, receives a reward, and transitions to a new state. The goal is to learn a policy.

    • Agent: The LLM itself, making decisions (e.g., generating tokens, issuing search queries).
    • Environment: The context in which the agent operates, including the prompt, external tools like a search engine, and the knowledge base.
    • State: The current situation or information available to the agent at any given time (e.g., the current prompt, generated text so far, retrieved search results).
    • Action: A decision made by the agent (e.g., generating the next token, formulating and executing a search query).
    • Reward: A numerical signal indicating the quality of the agent's actions or final outcome (e.g., Exact Match (EM) score for a question-answering task). The agent aims to maximize this reward.
    • Policy (pθp_{\theta}): The agent's strategy or a probability distribution over actions given a state. It is parameterized by θ\theta, which the RL algorithm aims to optimize.
  • Policy Gradient Methods: A class of RL algorithms that directly optimize the policy by estimating the gradient of the expected reward with respect to the policy parameters θ\theta. The general update rule involves advantages.

    • Trajectory (τ\tau): A sequence of states, actions, and rewards generated by the agent from the beginning to the end of an interaction with the environment. For example, in a search agent, a trajectory might include initial prompt, generated text, search query, search results, more generated text, another search query, and the final answer.
    • Advantage Function (A(τ)A(\tau)): In policy gradient methods, instead of using the raw reward R(τ)R(\tau) to update the policy, an advantage function is often used. It represents how much better an action (or trajectory) is compared to an expected or baseline value. This reduces variance in the gradient estimates and helps in credit assignment. The basic form is A(τ)=R(τ)bA(\tau) = R(\tau) - b, where bb is a baseline.
    • Baseline (bb): A reference value subtracted from the reward in the advantage function. The baseline itself should not depend on the action being evaluated to maintain unbiasedness of the gradient estimate. Common baselines include the average reward of a batch of trajectories or a learned value function (critic). A good baseline reduces the variance of the gradient estimates without introducing bias.
    • Normalization: A technique used to standardize values (e.g., rewards or advantages) to a common scale, typically zero mean and unit variance. This can stabilize training by preventing rewards from having vastly different magnitudes or distributions across different trajectories or strata.
    • Credit Assignment: The problem of determining which actions or sequences of actions in a trajectory were responsible for the observed reward. In multi-step tasks, this can be challenging, as a final reward might be the result of many intermediate actions.
    • Structural Heterogeneity: This refers to the significant qualitative differences in the structure of trajectories. For LLM search agents, this includes variations in the number of search calls, where they are placed in the reasoning chain, and what information they retrieve. Such differences can lead to fundamentally different answer directions and reward distributions.
    • Stratification: The process of dividing a dataset or a batch of trajectories into distinct, more homogeneous subgroups (called strata) based on some shared structural property. The goal is to analyze or process data within these strata separately, ensuring that comparisons are made between similar items.
    • Cross-Stratum Bias: The central problem identified in this paper. It occurs when a global baseline is used to compute advantages across structurally heterogeneous trajectories. Because trajectories from different strata (e.g., with different numbers of search calls) have inherently different reward distributions, comparing them with a single global baseline creates a systematic bias. This bias distorts credit assignment, unfairly penalizing trajectories from low-reward strata or favoring those from high-reward strata, even if the individual actions within those trajectories were optimal for their respective strata.

3.2. Previous Works

The paper contextualizes its work within various existing paradigms in LLM research and RL.

  • Reinforcement Learning from Human Feedback (RLHF): A popular method for aligning LLMs with human preferences. It typically involves three steps: supervised fine-tuning (SFT) of an LLM, training a reward model (RM) using human preference data, and then optimizing the LLM's policy using RL (often PPO) with the RM as the reward signal. This approach is effective but can be computationally expensive and unstable due to RM training and distribution shift.
  • Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm in RLHF. PPO aims to keep new policies close to old policies to prevent excessively large policy updates that could lead to poor performance. It typically uses a learned value function (critic) as a baseline to estimate advantages. The paper notes that similar structural bias to cross-stratum bias can occur in PPO if the critic does not adequately condition on the structural heterogeneity.
  • Direct Preference Optimization (DPO) (Rafailov et al., 2023): An alternative to RLHF that optimizes preference data directly without training a separate reward model. It reframes the RLHF objective as a simple classification problem, making it more stable and computationally efficient.
  • Reinforcement Learning with Verifiable Rewards (RLVR): A category of RL applications where the reward can be directly computed based on verifiable outcomes (e.g., correctness of an answer in QA, performance in a mathematical task), rather than relying on a learned reward model from human preferences. This paper operates within the RLVR setting for LLM search agents.
    • Group Relative Policy Optimization (GRPO) (Shao et al., 2024): A prominent RLVR approach that the current paper builds upon. GRPO simplifies PPO by removing its dependency on a learned value function (critic), instead using group-based baselines (e.g., mean reward of a batch) to compute advantages. While more stable than PPO in some LLM contexts, it still suffers from cross-stratum bias if these groups are not structurally homogeneous.
    • RLOO (Ahmadian et al., 2024): Another RLVR method that revisits the foundational REINFORCE algorithm with simplifications tailored for LLM training. It also uses a global baseline and thus would be susceptible to cross-stratum bias.
  • REINFORCE (Williams, 1992): The foundational policy gradient algorithm. It updates the policy directly using reward-weighted policy gradients. It can have high variance, which is often reduced by subtracting a baseline.
  • LLM Search Agents: A category of LLM-based agents that are augmented with external tools, particularly search engines. These agents can interleave token generation with search queries to gather information and solve complex tasks.
    • Retrieval-Augmented Generation (RAG) (Lewis et al., 2020): A popular technique where LLMs retrieve relevant documents or information from a knowledge base before generating a response. This helps LLMs access up-to-date or specific knowledge that they might not have been trained on.
    • Search-o1 (Li et al., 2025) and IRCoT (Trivedi et al., 2023): Examples of non-RL methods that use LLMs with search. IRCoT (Interleaving Retrieval with Chain-of-Thought) focuses on integrating retrieval into chain-of-thought reasoning through prompting. Search-o1 is another agentic search-enhanced model.
    • Search-R1 (Jin et al., 2025) and ReSearch (Chen et al., 2025): Recent RL-based methods for training LLM search agents. Search-R1 explicitly trains LLMs to reason and leverage search engines using RL. ReSearch also focuses on learning to reason with search via RL. These methods typically use general-purpose RL algorithms like PPO or GRPO and, according to this paper, would face the cross-stratum bias problem.

3.3. Technological Evolution

The field of LLMs has rapidly evolved from initial impressive generative capabilities to becoming agents capable of interacting with the real world through tools.

  • Early LLMs: Primarily focused on text generation, summarization, and translation, based on supervised fine-tuning (SFT) on massive text corpora.

  • Instruction Following & Alignment (RLHF/DPO): The advent of RLHF (and later DPO) enabled LLMs to better align with human instructions and preferences, making them more useful in conversational settings. This marked a shift towards outcome-based learning and optimization of user-aligned objectives.

  • Tool-Augmented LLMs (LLM Agents): Recognizing the limitations of LLMs' internal knowledge and reasoning, researchers began equipping them with external tools like search engines, calculators, and APIs. This transformed LLMs into agents that can solve more complex, multi-step tasks by intelligently interleaving generation and tool-use. Techniques like RAG and Chain-of-Thought (CoT) prompting were crucial here.

  • RL for Tool-Augmented LLMs: The current frontier involves using RL to train these LLM agents to learn optimal tool-use strategies directly from outcome-based rewards. This is challenging because tool-use introduces new complexities, such as long-horizon reasoning and dynamic environment interactions.

    This paper fits into the latter part of this timeline. While RL has shown promise for training LLM agents, this work identifies a crucial, often overlooked, challenge specific to LLM search agents: the structural heterogeneity of trajectories generated during tool-use. Prior RL methods for LLMs, even those using group-based baselines like GRPO, were designed for more homogeneous RL environments or did not explicitly account for the profound structural differences introduced by variable tool-use. This paper's Stratified GRPO addresses this specific gap, aiming to make RL training more effective and stable for these complex LLM agents.

3.4. Differentiation Analysis

Compared to the main methods in related work, especially GRPO, the core differences and innovations of this paper's approach lie in its explicit and principled handling of structural heterogeneity.

  • Standard Policy Gradient Methods (e.g., REINFORCE with global baseline, RLOO, GRPO):

    • Core Assumption: Implicitly assumes all trajectories are comparable when computing advantages using a single global baseline.
    • Credit Assignment: Suffers from cross-stratum bias, leading to distorted credit assignment. Trajectories with inherently different reward distributions (e.g., those with 0 search calls vs. 3 search calls) are compared against the same global average, unfairly penalizing or rewarding them.
    • Exploration: The cross-stratum bias can hinder exploration of complex multi-step strategies. For example, trajectories involving multiple search calls might initially have lower average rewards, causing GRPO to under-sample them, even if they hold potential for higher rewards after further exploration.
    • Stability: Can suffer from training instability (e.g., training collapse), especially with models like the Instruct model, as seen in the experiments.
  • Stratified GRPO (Proposed Method):

    • Core Innovation: Explicitly addresses structural heterogeneity by introducing Stratified Advantage Normalization (SAN). This involves partitioning trajectories into homogeneous strata based on structural properties (e.g., number of search calls).

    • Credit Assignment: Computes advantages locally within each stratum, ensuring that trajectories are evaluated only against their "true peers." This fundamentally eliminates cross-stratum bias (as proven by Proposition 1 and Theorem 1) and provides fairer credit assignment.

    • Exploration: By removing the cross-stratum bias, Stratified GRPO encourages more effective exploration of different search strategies (e.g., policies involving multiple search calls) by evaluating them fairly within their own context, even if their initial rewards are not globally competitive.

    • Stability: Exhibits higher training stability and converges to more effective search policies, as demonstrated by consistently achieving higher rewards and preventing training collapse compared to GRPO.

    • Theoretical Guarantees: Provides strong theoretical backing for SAN's properties, including conditional unbiasedness and unit variance within strata (Theorem 4), while preserving global unbiasedness and unit variance (Theorem 5).

    • Practical Stability: Incorporates a Blended Advantage (Definition 2) to mitigate finite-sample instability for small strata, by blending SAN with Global Normalization (GN).

      In essence, while GRPO made strides by using group-based baselines to improve RL for LLMs, Stratified GRPO takes this a step further by recognizing that these groups themselves might be heterogeneous, particularly in the context of LLM agents performing tool-use. By applying stratification and local normalization, Stratified GRPO provides a more refined and robust learning signal, unlocking better performance and stability for LLM search agents. The paper also implicitly critiques PPO-based methods for LLM search agents (like Search-R1), suggesting that they too would suffer from similar structural bias if their value function (critic) does not adequately condition on trajectory structure.

4. Methodology

4.1. Principles

The core idea behind Stratified GRPO is to explicitly account for the structural heterogeneity of LLM search agent trajectories in Reinforcement Learning (RL). Instead of treating all trajectories as comparable and using a single global baseline for advantage estimation, the method proposes to:

  1. Partition trajectories into homogeneous strata: Group trajectories that share similar structural properties (e.g., the same number of search calls) into distinct strata.

  2. Compute advantages locally within each stratum: Evaluate rewards and estimate baselines and variances only within these homogeneous groups.

    The intuition is that an "apples-to-oranges" comparison, where a trajectory with many search calls is evaluated against one with zero search calls using a single global average reward, is inherently unfair and distorting. By comparing trajectories only against their "true peers" (i.e., those within the same stratum), Stratified GRPO aims to eliminate cross-stratum bias, provide more accurate credit assignment, and yield a purer, more stable learning signal for the policy gradient. This enables the agent to better understand the value of different multi-step search strategies and explore them more effectively.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. RL for Multi-Turn Search Agents

The problem of training a multi-turn search agent is framed as a Reinforcement Learning (RL) problem. The agent, parameterized by a policy pθp_{\theta}, interacts with a search engine. This interaction involves interleaving token generation (e.g., generating parts of an answer) with search queries (e.g., asking the search engine for information).

For a given prompt xx, sampled from a distribution D\mathcal{D}, the agent generates a sequence of actions and observations, forming a trajectory τ\tau. This trajectory τ\tau is sampled according to the policy pθ(x)p_{\theta}(\cdot \mid x). Once the trajectory is complete (e.g., the agent produces a final answer), it receives a scalar reward R(τ)R(\tau), which quantifies the quality of the final response.

The objective of the RL problem is to maximize the expected reward:

maxθJ(θ)=ExD,τpθ(τx)[R(τ)] \max _{\theta} J(\theta)=\mathbb{E}_{x \sim \mathcal{D}, \tau \sim p_{\theta}(\tau \mid x)}[R(\tau)]

Here:

  • J(θ)J(\theta) is the objective function that the RL algorithm aims to maximize.

  • θ\theta represents the parameters of the agent's policy (e.g., the weights of the LLM).

  • E\mathbb{E} denotes the expectation operator.

  • xDx \sim \mathcal{D} means that prompts xx are sampled from a distribution of prompts D\mathcal{D}.

  • τpθ(τx)\tau \sim p_{\theta}(\tau \mid x) indicates that trajectories τ\tau are generated by the policy pθp_{\theta} conditioned on the given prompt xx.

  • R(τ)R(\tau) is the scalar reward received for completing the trajectory τ\tau.

    This optimization problem is typically solved using policy gradient methods, which estimate the gradient of J(θ)J(\theta) with respect to θ\theta and update θ\theta in the direction of increasing expected reward.
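To make the policy-gradient machinery concrete, here is a minimal score-function (REINFORCE-style) Monte Carlo estimate of the gradient of J(θ) on a toy categorical "policy" over four hypothetical trajectories with fixed rewards. The policy, rewards, and sample size are illustrative stand-ins, not the paper's search agent.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for p_theta(tau | x): a categorical "policy" over 4 hypothetical
# trajectories, parameterized by logits theta; R assigns a fixed reward to each.
theta = np.zeros(4)
R = np.array([0.0, 0.2, 0.5, 1.0])

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def policy_gradient_estimate(theta, num_samples=4096):
    """Monte Carlo estimate of grad J(theta) = E[R(tau) * grad log p_theta(tau)]."""
    p = softmax(theta)
    taus = rng.choice(len(p), size=num_samples, p=p)
    grad = np.zeros_like(theta)
    for tau in taus:
        score = -p.copy()      # grad_theta log softmax(theta)[tau] = onehot(tau) - p
        score[tau] += 1.0
        grad += R[tau] * score
    return grad / num_samples

print(policy_gradient_estimate(theta))  # points toward the highest-reward trajectory
```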

4.2.2. Cross-Stratum Bias in Policy-Gradient Baselines

The paper identifies cross-stratum bias as a fundamental issue when using global baselines in policy gradient methods for LLM search agents. This is because trajectories from search agents exhibit significant structural heterogeneity (e.g., varying number of search calls, content, outcomes), leading to strata with systematically different answer directions and reward distributions.

To formalize this, the following notation is introduced:

  • B={τ1,,τK}B = \{\tau_1, \ldots, \tau_K\}: A batch of KK trajectories sampled independently and identically distributed (i.i.d.) from the policy pθp_{\theta} for a fixed prompt xx.

  • The batch BB is partitioned into II non-empty strata B0,,BI1B_0, \ldots, B_{I-1} based on a predefined structural property (e.g., the number of search calls).

  • nk=Bkn_k = |B_k|: The number of trajectories in stratum BkB_k.

  • Ri=R(τi)R_i = R(\tau_i): The reward for trajectory τi\tau_i.

  • Rˉglobal =1Kj=1KR(τj)\bar{R}_{\text {global }} = \frac{1}{K} \sum_{j=1}^{K} R(\tau_j): The global mean reward of the entire batch.

  • Rˉk=1nkτiBkRi\bar{R}_k = \frac{1}{n_k} \sum_{\tau_i \in B_k} R_i: The stratum-specific mean reward for stratum BkB_k.

    Based on these definitions, two natural advantage estimators are considered:

  • Global Advantage (A^G\hat{A}_{G}): Uses the global mean reward as a baseline: $\hat{A}_{G}(\tau_{i}) = R_{i} - \bar{R}_{\text{global}}$

  • Stratified Advantage (A^S\hat{A}_{S}): Uses the stratum-specific mean reward as a baseline: $\hat{A}_{S}(\tau_{i}) = R_{i} - \bar{R}_{k} \quad (\text{for } \tau_{i} \in B_{k})$

Proposition 1 (Advantage Decomposition)

For any trajectory τiBk\tau_i \in B_k, the global advantage can be decomposed as: A^G(τi)=A^S(τi)+(RˉkRˉglobal )cross-stratum bias  \hat{A}_{G}\left(\tau_{i}\right)=\hat{A}_{S}\left(\tau_{i}\right)+\underbrace{\left(\bar{R}_{k}-\bar{R}_{\text {global }}\right)}_{\text {cross-stratum bias }} This proposition reveals that the global advantage for a trajectory is not just its stratified advantage but also includes an additional term: the difference between its stratum's mean reward and the global mean reward. This term, (RˉkRˉglobal )(\bar{R}_{k}-\bar{R}_{\text {global }}), is identified as the cross-stratum bias. It's a deterministic offset applied uniformly to all trajectories within a given stratum. If a stratum has a mean reward lower than the global mean, trajectories from that stratum will have their advantages artificially decreased, and vice versa. This effectively creates an "apples-to-oranges" comparison across structurally different trajectories.

Theorem 1 (Variance Reduction via Stratified Baselines)

The empirical variances of the stratified and global advantage estimators satisfy: Var[A^S]Var[A^G] \operatorname{Var}\left[\hat{A}_{S}\right] \leq \operatorname{Var}\left[\hat{A}_{G}\right] Moreover, the reduction in variance is exactly the variance induced by the cross-stratum bias: Var[A^G]Var[A^S]=1Kk=0I1nk(RˉkRˉglobal )2 \operatorname{Var}\left[\hat{A}_{G}\right]-\operatorname{Var}\left[\hat{A}_{S}\right]=\frac{1}{K} \sum_{k=0}^{I-1} n_{k}\left(\bar{R}_{k}-\bar{R}_{\text {global }}\right)^{2} This theorem mathematically proves that stratification not only removes cross-stratum bias but also strictly reduces the variance of the advantage estimator whenever the mean rewards of the strata are different (Rˉ0Rˉ1RˉI1\bar{R}_0 \neq \bar{R}_1 \neq \cdots \neq \bar{R}_{I-1}). The term 1Kk=0I1nk(RˉkRˉglobal )2\frac{1}{K} \sum_{k=0}^{I-1} n_{k}\left(\bar{R}_{k}-\bar{R}_{\text {global }}\right)^{2} represents the "between-stratum variance," which is essentially the variance explained by the differences in mean rewards across strata. By centering rewards within each stratum, Stratified Advantage (and later SAN) eliminates this component, leading to a lower overall variance for the learning signal. This means Stratified GRPO provides a more stable and accurate gradient estimate.
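The decomposition in Proposition 1 and the variance identity in Theorem 1 hold for any batch, so they can be checked numerically. The following NumPy sketch uses made-up rewards and stratum labels (standing in for search counts) to verify both.

```python
import numpy as np

# Made-up batch: one reward per trajectory plus a structural label (e.g., search count).
R = np.array([0.1, 0.0, 0.2, 0.6, 0.8, 0.7, 0.9, 1.0])
strata = np.array([0, 0, 0, 1, 1, 1, 2, 2])
K = len(R)

R_global = R.mean()
A_G = R - R_global                                        # global advantage
stratum_mean = {k: R[strata == k].mean() for k in np.unique(strata)}
A_S = R - np.array([stratum_mean[k] for k in strata])     # stratified advantage

# Proposition 1: A_G = A_S + (R_bar_k - R_bar_global) for every trajectory.
offset = np.array([stratum_mean[k] for k in strata]) - R_global
assert np.allclose(A_G, A_S + offset)

# Theorem 1: Var[A_G] - Var[A_S] equals the between-stratum variance.
between = sum((strata == k).sum() * (stratum_mean[k] - R_global) ** 2
              for k in np.unique(strata)) / K
assert np.isclose(A_G.var() - A_S.var(), between)
print("between-stratum variance removed by stratification:", between)
```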

4.2.3. Stratified Advantage Normalization: Definition and Theoretical Guarantees

Building on the concept of stratification, the paper introduces Stratified Advantage Normalization (SAN). SAN combines stratification with per-stratum normalization to create a stable and scale-invariant learning signal.

Definition 1 (SAN Advantage)

For a given prompt xx, partition the batch of trajectories into strata {Bk(x)}\{B_k(x)\} based on a chosen partitioning function (e.g., the search count for search agents). The SAN advantage for a trajectory τiBk(x)\tau_i \in B_k(x) is defined as: ASAN(τi)=R(τi)μ^k(x)σ^k(x)+ε A_{\mathrm{SAN}}\left(\tau_{i}\right)=\frac{R\left(\tau_{i}\right)-\widehat{\mu}_{k}(x)}{\widehat{\sigma}_{k}(x)+\varepsilon} Here:

  • R(τi)R(\tau_i) is the reward of trajectory τi\tau_i.

  • μ^k(x)\widehat{\mu}_{k}(x) is the empirical mean reward of stratum Bk(x)B_k(x) for the given prompt xx.

  • σ^k(x)\widehat{\sigma}_{k}(x) is the empirical standard deviation of rewards in stratum Bk(x)B_k(x) for the given prompt xx.

  • ε\varepsilon is a small positive constant added for numerical stability (to prevent division by zero if σ^k(x)\widehat{\sigma}_{k}(x) is zero).

    SAN thus computes a Z-score-like value for each trajectory within its own stratum, effectively standardizing the rewards to have a mean of 0 and a standard deviation of 1 (ignoring ε\varepsilon for ideal cases) for each stratum.

Proposition 2 (Invariance to Positive Affine Reward Transforms)

Suppose ε=0\varepsilon=0. The SAN advantage ASAN (τ)A_{\text {SAN }}(\tau) is invariant under any positive affine transformation of the rewards, R(τ)=aR(τ)+bR'(\tau)=aR(\tau)+b with a>0a>0. That is, ASAN (τ)=ASAN (τ)A_{\text {SAN }}^{\prime}(\tau)=A_{\text {SAN }}(\tau). This proposition states that if you linearly scale and shift all rewards (e.g., changing the unit of currency or adding a fixed bonus), the computed SAN advantage remains the same. This is a highly desirable property as it makes the learning signal robust to arbitrary reward scaling, meaning the RL algorithm doesn't need to retune if the reward function's scale changes.
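Below is a minimal sketch of the SAN advantage from Definition 1, together with a numerical check of the affine-invariance property in Proposition 2. The rewards and stratum labels are invented, and the invariance is exact only for ε = 0, so it holds here up to a tiny ε-induced error.

```python
import numpy as np

def san_advantage(R, strata, eps=1e-6):
    """Stratified Advantage Normalization (Definition 1): z-score within each stratum."""
    R = np.asarray(R, dtype=float)
    A = np.empty_like(R)
    for k in np.unique(strata):
        mask = strata == k
        mu, sigma = R[mask].mean(), R[mask].std()
        A[mask] = (R[mask] - mu) / (sigma + eps)
    return A

R = np.array([0.1, 0.0, 0.2, 0.6, 0.8, 0.7])
strata = np.array([0, 0, 0, 2, 2, 2])   # e.g., number of search calls

A = san_advantage(R, strata)
# Proposition 2: invariance under a positive affine transform R' = a*R + b with a > 0,
# exact when eps = 0 and accurate up to the eps stabilizer here.
A_transformed = san_advantage(5.0 * R + 3.0, strata)
print(np.allclose(A, A_transformed, atol=1e-4))   # True
```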

Theorem 2 (Variance Decomposition for Normalized Stratified Advantage)

Let ASAN(τi)A_{\mathrm{SAN}}\left(\tau_{i}\right) be the stratified and normalized advantage. It is related to the global advantage A^G\hat{A}_{G} by the following exact decomposition: Var[A^G]Var[ASAN]=1Kk=0I1nk(RˉkRˉglobal )2Term A: Between-Stratum Variance +1Kk=0I1nkσk2(11(σk+ε)2)Term B: Normalization Effect  \operatorname{Var}\left[\hat{A}_{G}\right]-\operatorname{Var}\left[A_{\mathrm{SAN}}\right]=\underbrace{\frac{1}{K} \sum_{k=0}^{I-1} n_{k}\left(\bar{R}_{k}-\bar{R}_{\text {global }}\right)^{2}}_{ \text {Term A: Between-Stratum Variance }}+\underbrace{\frac{1}{K} \sum_{k=0}^{I-1} n_{k} \sigma_{k}^{2}\left(1-\frac{1}{\left(\sigma_{k}+\varepsilon\right)^{2}}\right)}_{ \text {Term B: Normalization Effect }} This theorem extends Theorem 1 by showing the total variance reduction from A^G\hat{A}_{G} to ASANA_{\mathrm{SAN}}.

  • Term A (Between-Stratum Variance): This is the same term as in Theorem 1. It quantifies the variance caused by differences in mean rewards across strata. SAN completely removes this structural bias by centering rewards within each stratum.
  • Term B (Normalization Effect): This term captures the additional effect of per-stratum normalization (dividing by σ^k+ε\widehat{\sigma}_k + \varepsilon). This term primarily works to stabilize the scale of rewards within each stratum, ensuring a consistent and numerically robust learning signal. While it can be positive or negative depending on the values of σk\sigma_k and ε\varepsilon, its main role is scale stabilization.

Theorem 3 (Population SAN Expectation)

Let S=S(τ;θ)S=S(\tau ; \theta) be a discrete stratum assignment that may depend on θ\theta, and define the per-trajectory SAN advantage: ASAN(τ):=R(τ)μS(θ)σS(θ)+ε A_{\mathrm{SAN}}(\tau):=\frac{R(\tau)-\mu_{S}(\theta)}{\sigma_{S}(\theta)+\varepsilon} where μk(θ)=Eθ[RS=k]\mu_{k}(\theta)=\mathbb{E}_{\theta}[R \mid S=k] and σk(θ)>0\sigma_{k}(\theta)>0 are the stratum-wise mean and standard deviation, and ε>0\varepsilon>0 is a small regularizer. Then, under standard regularity conditions, Eθ[ASAN(τ)θlogpθ(τ)]=kpk(θ)σk(θ)+εθμk(θ),pk(θ):=Prθ(S=k) \mathbb{E}_{\theta}\left[A_{\mathrm{SAN}}(\tau) \nabla_{\theta} \log p_{\theta}(\tau)\right]=\sum_{k} \frac{p_{k}(\theta)}{\sigma_{k}(\theta)+\varepsilon} \nabla_{\theta} \mu_{k}(\theta), \quad p_{k}(\theta):=\operatorname{Pr}_{\theta}(S=k) This theorem shows that the population SAN gradient estimator exactly targets a weighted sum of within-stratum gradients. The term θμk(θ)\nabla_{\theta} \mu_{k}(\theta) is the true per-stratum policy gradient, representing the gradient of the expected reward within stratum kk. Specifically: θμk(θ)=Eθ[(R(τ)μk(θ))θlogpθ(τS=k)S=k] \nabla_{\theta} \mu_{k}(\theta)=\mathbb{E}_{\theta}\left[\left(R(\tau)-\mu_{k}(\theta)\right) \nabla_{\theta} \log p_{\theta}(\tau \mid S=k) \mid S=k\right] This population SAN estimator is a weighted sum of these per-stratum policy gradients, where each stratum's gradient is weighted by its probability pk(θ)p_k(\theta) and inversely scaled by its standard deviation σk(θ)+ε\sigma_k(\theta)+\varepsilon. This means SAN is an asymptotically unbiased estimator that correctly pushes the policy to improve rewards in each stratum, effectively eliminating the structural bias present in global baselines.

4.2.4. Structural Comparison of SAN and Global Normalization

This section provides a direct comparison between SAN and Global Normalization (GN), highlighting SAN's advantages. GN is defined as: AGN(τi):=R(τi)Rˉglobal σˉglobal +ε A_{\mathrm{GN}}\left(\tau_{i}\right):=\frac{R\left(\tau_{i}\right)-\bar{R}_{\text {global }}}{\bar{\sigma}_{\text {global }}+\varepsilon} Here:

  • R(τi)R(\tau_i) is the reward of trajectory τi\tau_i.

  • Rˉglobal \bar{R}_{\text {global }} is the global mean reward of the batch.

  • σˉglobal \bar{\sigma}_{\text {global }} is the global standard deviation of rewards in the batch.

  • ε\varepsilon is for numerical stability.

    GN normalizes all rewards using global statistics (mean and standard deviation) across the entire batch, without regard for stratum structure.

Proposition 3 (Exact Advantage Decomposition)

For any fixed batch partition {Bk(x)}k=0I1\{B_k(x)\}_{k=0}^{I-1} and any τiBk(x)\tau_i \in B_k(x): AGN(τi)=σ^k(x)+εσ^global (x)+ε:=αk(x)ASAN(τi)+μ^k(x)Rˉglobal (x)σˉglobal (x)+ε:=Δk(x) A_{\mathrm{GN}}\left(\tau_{i}\right)=\underbrace{\frac{\widehat{\sigma}_{k}(x)+\varepsilon}{\widehat{\sigma}_{\text {global }}(x)+\varepsilon}}_{:=\alpha_{k}(x)} A_{\mathrm{SAN}}\left(\tau_{i}\right)+\underbrace{\frac{\widehat{\mu}_{k}(x)-\bar{R}_{\text {global }}(x)}{\bar{\sigma}_{\text {global }}(x)+\varepsilon}}_{:=\Delta_{k}(x)} This proposition precisely shows how GN relates to SAN. The GN advantage for a trajectory is a rescaled SAN advantage (scaled by αk(x)\alpha_k(x)) plus an additional term Δk(x)\Delta_k(x).

  • αk(x)\alpha_k(x) is a scaling factor that varies by stratum, representing the ratio of stratum-specific standard deviation to global standard deviation.
  • Δk(x)\Delta_k(x) is the cross-stratum offset bias. This term is crucial because it represents a systematic bias in GN that depends on how much the mean reward of stratum kk (μ^k(x)\widehat{\mu}_{k}(x)) deviates from the global mean reward (Rˉglobal(x)\bar{R}_{\text {global}}(x)), scaled by the global standard deviation. This bias is the source of GN's structural flaw.

Gradient Bias from the Cross-Stratum Bias: The cross-stratum offset Δk\Delta_k directly translates into a structural flaw in the GN gradient estimator (g^GN\widehat{g}_{\mathrm{GN}}): g^GN(x)=1KkτiBkαkASAN(τi)θlogpθ(τix)+1KkτiBkΔkθlogpθ(τix)Bias from Cross-Stratum Offset  \widehat{g}_{\mathrm{GN}}(x)=\frac{1}{K} \sum_{k} \sum_{\tau_{i} \in B_{k}} \alpha_{k} A_{\mathrm{SAN}}\left(\tau_{i}\right) \nabla_{\theta} \log p_{\theta}\left(\tau_{i} \mid x\right)+\underbrace{\frac{1}{K} \sum_{k} \sum_{\tau_{i} \in B_{k}} \Delta_{k} \nabla_{\theta} \log p_{\theta}\left(\tau_{i} \mid x\right)}_{\text {Bias from Cross-Stratum Offset }} The second term, Bias from Cross-Stratum Offset, shows that GN's gradient is systematically biased. This bias couples reward differences across strata with the policy's score vectors, meaning that the learning signal is distorted by global statistics rather than focusing purely on improving within-stratum performance. This can hinder exploration, as strata with mean rewards below the global mean will be systematically downweighted, potentially preventing the agent from exploring complex, multi-step search strategies that might be high-reward in their own strata.
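The exact decomposition in Proposition 3 can also be verified on a toy batch. The sketch below computes GN and SAN advantages along with the per-stratum scale ratio α_k and cross-stratum offset Δ_k, then checks that A_GN = α_k · A_SAN + Δ_k; all numbers are illustrative.

```python
import numpy as np

R = np.array([0.1, 0.0, 0.2, 0.6, 0.8, 0.7, 0.9, 1.0])
strata = np.array([0, 0, 0, 1, 1, 1, 2, 2])
eps = 1e-6

mu_g, sd_g = R.mean(), R.std()
A_GN = (R - mu_g) / (sd_g + eps)

A_SAN = np.empty_like(R)
alpha_k = np.empty_like(R)   # per-trajectory scale ratio (sigma_k + eps) / (sigma_global + eps)
delta_k = np.empty_like(R)   # per-trajectory cross-stratum offset
for k in np.unique(strata):
    m = strata == k
    mu_k, sd_k = R[m].mean(), R[m].std()
    A_SAN[m] = (R[m] - mu_k) / (sd_k + eps)
    alpha_k[m] = (sd_k + eps) / (sd_g + eps)
    delta_k[m] = (mu_k - mu_g) / (sd_g + eps)

# Proposition 3: GN is a rescaled SAN plus a stratum-dependent offset.
assert np.allclose(A_GN, alpha_k * A_SAN + delta_k)
print("cross-stratum offsets Delta_k:", np.unique(np.round(delta_k, 4)))
```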

Theorem 4 (Conditional Properties of SAN and GN Advantages)

This theorem analyzes the conditional properties (within a stratum) of SAN and GN in the large-sample limit (population statistics), assuming ε=0\varepsilon=0. Population reward statistics for a given prompt xx and stratum kk:

  • μk(x):=E[R(τ)k,x]\mu_k(x) := \mathbb{E}[R(\tau) \mid k, x]: Conditional mean reward for stratum kk.
  • σk2(x):=Var(R(τ)k,x)\sigma_k^2(x) := \operatorname{Var}(R(\tau) \mid k, x): Conditional variance of rewards for stratum kk.
  • μ(x):=E[R(τ)x]\mu(x) := \mathbb{E}[R(\tau) \mid x]: Global mean reward.
  • σ2(x):=Var(R(τ)x)\sigma^2(x) := \operatorname{Var}(R(\tau) \mid x): Global variance of rewards.
  1. Conditional Expectation (Bias):

    • E[ASANk,x]=0\mathbb{E}\left[A_{\mathrm{SAN}} \mid k, x\right]=0: The SAN advantage is unbiased within each stratum. This means SAN provides a pure signal, indicating how a trajectory performs relative to the average of its peers.
    • E[AGNk,x]=μk(x)μ(x)σ(x)\mathbb{E}\left[A_{\mathrm{GN}} \mid k, x\right]=\frac{\mu_{k}(x)-\mu(x)}{\sigma(x)}: The GN advantage carries a systematic bias proportional to the difference between the stratum's mean reward and the global mean reward. This is the quantitative expression of cross-stratum bias.
  2. Conditional Variance:

    • Var(ASANk,x)=1\operatorname{Var}\left(A_{\mathrm{SAN}} \mid k, x\right)=1: The SAN advantage provides a consistent unit variance within each stratum. This ensures a stable and uniformly scaled learning signal across all strata.

    • Var(AGNk,x)=σk2(x)σ2(x)\operatorname{Var}\left(A_{\mathrm{GN}} \mid k, x\right)=\frac{\sigma_{k}^{2}(x)}{\sigma^{2}(x)}: The GN advantage's variance scales with the ratio of stratum-to-global variance. This means the learning signal from GN has an inconsistent scale across strata, making credit assignment less reliable.

      In summary, SAN acts as a pure and scale-stable signal carrier at the conditional level, while GN introduces cross-stratum bias and inconsistent scaling.

Theorem 5 (Global Moments of SAN and GN)

This theorem analyzes the global (marginal) moments of SAN and GN advantages, again in the large-sample limit (ε=0\varepsilon=0). Consider the population SAN and GN advantages:

  • ASAN=RμS(x)σS(x)A_{\mathrm{SAN}}=\frac{R-\mu_{S}(x)}{\sigma_{S}(x)}

  • AGN=Rμ(x)σ(x)A_{\mathrm{GN}}=\frac{R-\mu(x)}{\sigma(x)}

    (a) Global Means:

  • E[ASANx]=0\mathbb{E}\left[A_{\mathrm{SAN}} \mid x\right]=0

  • E[AGNx]=0\mathbb{E}\left[A_{\mathrm{GN}} \mid x\right]=0 Both SAN and GN are globally unbiased (have a mean of zero) when averaged over all strata.

(b) Global Variances:

  • Var(ASANx)=1\operatorname{Var}\left(A_{\mathrm{SAN}} \mid x\right)=1
  • Var(AGNx)=1\operatorname{Var}\left(A_{\mathrm{GN}} \mid x\right)=1 Both SAN and GN achieve the exact same unit variance globally.

The key takeaway from Theorems 4 and 5 is that while SAN and GN appear equivalent at the global level (both being globally unbiased with unit variance), they differ fundamentally at the conditional level (within strata). SAN is conditionally pure (zero mean, unit variance within each stratum), providing a consistent learning signal. GN, however, is conditionally biased and has inconsistent scaling, distorting the learning signal within strata. This conditional purity of SAN is what truly governs more accurate credit assignment and effective learning dynamics.
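A quick Monte Carlo check of the conditional moments in Theorem 4: simulating two strata with different reward distributions (Gaussian rewards are assumed purely for illustration), SAN is zero-mean with unit variance inside each stratum, while GN shows the predicted conditional bias and non-unit variance.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000

# Two strata with different reward distributions (illustrative, not from the paper).
strata = rng.integers(0, 2, size=n)
R = np.where(strata == 0, rng.normal(0.2, 0.1, n), rng.normal(0.8, 0.3, n))

A_GN = (R - R.mean()) / R.std()
A_SAN = np.empty_like(R)
for k in (0, 1):
    m = strata == k
    A_SAN[m] = (R[m] - R[m].mean()) / R[m].std()

for k in (0, 1):
    m = strata == k
    print(f"stratum {k}: E[SAN]={A_SAN[m].mean():+.3f}, Var[SAN]={A_SAN[m].var():.3f}, "
          f"E[GN]={A_GN[m].mean():+.3f}, Var[GN]={A_GN[m].var():.3f}")
# SAN: ~zero mean and unit variance in each stratum; GN: biased mean, non-unit variance.
```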

4.2.5. Blended Advantage for Finite-Sample Stability

While SAN offers theoretical advantages, in practical finite-sample regimes (when dealing with limited data in a batch), some strata might contain very few trajectories. In such cases, the empirical mean and standard deviation estimates (μ^k,σ^k\widehat{\mu}_k, \widehat{\sigma}_k) for those small strata can be noisy, leading to potentially unstable SAN advantage estimates.

To mitigate this finite-sample instability, the paper proposes a Blended Advantage that combines the local purity of SAN with the global stability of GN.

Definition 2 (Blended Advantage)

For a trajectory τBk(x)\tau \in B_k(x), the Blended Advantage is defined as: Ablend (τ)=αASAN(τ)+(1α)AGN(τ),α[0,1] A_{\text {blend }}(\tau)=\alpha A_{\mathrm{SAN}}(\tau)+(1-\alpha) A_{\mathrm{GN}}(\tau), \quad \alpha \in[0,1] Here:

  • Ablend (τ)A_{\text {blend }}(\tau) is the final advantage used for policy updates.
  • ASAN(τ)A_{\mathrm{SAN}}(\tau) is the Stratified Advantage Normalization (from Definition 1).
  • AGN(τ)A_{\mathrm{GN}}(\tau) is the Global Normalized advantage.
  • α\alpha is a hyperparameter between 0 and 1 that controls the blending ratio.
    • If α=1\alpha=1, the Blended Advantage becomes purely SAN.

    • If α=0\alpha=0, the Blended Advantage reverts to GN.

      This blending mechanism allows Stratified GRPO to leverage the benefits of SAN (eliminating cross-stratum bias) while using GN to stabilize advantage estimates for strata with small sample sizes by borrowing information from the global distribution. The parameter α\alpha can be tuned to balance these two objectives.

The Stratified GRPO algorithm incorporates this blended advantage.

Algorithm 1: Stratified GRPO

Require: Policy pθp_{\theta}, batch B={τ1,…,τK}B=\left\{\tau_{1}, \ldots, \tau_{K}\right\} with rewards {Ri}\left\{R_{i}\right\}, blending α∈[0,1]\alpha \in[0,1], stabilizer ε>0\varepsilon>0.
    Compute global stats: Rˉglobal 1KiRi\bar{R}_{\text {global }} \leftarrow \frac{1}{K} \sum_{i} R_{i};
    σ^global 1Ki(RiRˉglobal )2\widehat{\sigma}_{\text {global }} \leftarrow \sqrt{\frac{1}{K} \sum_{i}\left(R_{i}-\bar{R}_{\text {global }}\right)^{2}}.
For all ii, set AGN(τi)(RiRˉglobal )/(σ^global +ε)A_{\mathrm{GN}}\left(\tau_{i}\right) \leftarrow\left(R_{i}-\bar{R}_{\text {global }}\right) /\left(\widehat{\sigma}_{\text {global }}+\varepsilon\right).
Partition indices into per-prompt, per-stratum groups Ik(x)I_{k}(x) (e.g., by search count).
for each prompt xx do
        for each stratum kk with index set Ik(x)I_{k}(x) do
            nkIk(x);Rˉk1nkiIk(x)Ri;σ^k1nkiIk(x)(RiRˉk)2n_{k} \leftarrow\left|I_{k}(x)\right| ; \quad \bar{R}_{k} \leftarrow \frac{1}{n_{k}} \sum_{i \in I_{k}(x)} R_{i} ; \quad \widehat{\sigma}_{k} \leftarrow \sqrt{\frac{1}{n_{k}} \sum_{i \in I_{k}(x)}\left(R_{i}-\bar{R}_{k}\right)^{2}}.
            for iIk(x)i \in I_{k}(x) do
                ASAN(τi)(RiRˉk)/(σ^k+ε)A_{\mathrm{SAN}}\left(\tau_{i}\right) \leftarrow\left(R_{i}-\bar{R}_{k}\right) /\left(\widehat{\sigma}_{k}+\varepsilon\right).
                Ablend(τi)αASAN(τi)+(1α)AGN(τi)A_{\mathrm{blend}}\left(\tau_{i}\right) \leftarrow \alpha A_{\mathrm{SAN}}\left(\tau_{i}\right)+(1-\alpha) A_{\mathrm{GN}}\left(\tau_{i}\right).
            end for
        end for
    end for
    Return gradient estimate g^blend 1Ki=1KAblend (τi)θlogpθ(τix)\widehat{g}_{\text {blend }} \leftarrow \frac{1}{K} \sum_{i=1}^{K} A_{\text {blend }}\left(\tau_{i}\right) \nabla_{\theta} \log p_{\theta}\left(\tau_{i} \mid x\right).

Step-by-step breakdown of Algorithm 1:

  1. Initialization:

    • Policy pθp_{\theta}: The LLM's policy to be optimized.
    • Batch B={τ1,,τK}B=\{\tau_1, \ldots, \tau_K\}: A collection of KK trajectories sampled from pθp_{\theta}.
    • Rewards {Ri}\{R_i\}: The corresponding rewards for each trajectory τi\tau_i.
    • Blending hyperparameter α[0,1]\alpha \in [0,1]: Controls the mix between SAN and GN.
    • Stabilizer ε>0\varepsilon > 0: A small constant for numerical stability.
  2. Compute Global Statistics:

    • First, the global mean reward (Rˉglobal \bar{R}_{\text {global }}) is calculated across all KK trajectories in the batch: Rˉglobal 1Ki=1KRi \bar{R}_{\text {global }} \leftarrow \frac{1}{K} \sum_{i=1}^{K} R_{i}
    • Then, the global standard deviation (σ^global \widehat{\sigma}_{\text {global }}) is calculated for all rewards in the batch: σ^global 1Ki=1K(RiRˉglobal )2 \widehat{\sigma}_{\text {global }} \leftarrow \sqrt{\frac{1}{K} \sum_{i=1}^{K}\left(R_{i}-\bar{R}_{\text {global }}\right)^{2}}
  3. Compute Global Normalized (GN) Advantages:

    • For every trajectory τi\tau_i in the batch, its Global Normalized advantage AGN(τi)A_{\mathrm{GN}}(\tau_i) is computed using the global statistics: AGN(τi)(RiRˉglobal )/(σ^global +ε) A_{\mathrm{GN}}\left(\tau_{i}\right) \leftarrow\left(R_{i}-\bar{R}_{\text {global }}\right) /\left(\widehat{\sigma}_{\text {global }}+\varepsilon\right)
  4. Partition Trajectories into Strata:

    • The trajectories are partitioned into per-prompt, per-stratum groups Ik(x)I_k(x). This means that for each unique prompt xx encountered in the batch, trajectories originating from that prompt are further grouped into strata kk based on a predefined structural property. For LLM search agents, this property is typically the search count (number of search calls). Ik(x)I_k(x) is the set of indices of trajectories belonging to stratum kk for prompt xx.
  5. Iterate through Prompts and Strata:

    • The algorithm then loops through each unique prompt xx in the batch.
    • Inside this, it loops through each stratum kk identified for that prompt xx.
  6. Compute Stratum-Specific Statistics:

    • For each stratum kk (identified by its index set Ik(x)I_k(x)), the number of trajectories nkn_k is counted: nkIk(x)n_k \leftarrow |I_k(x)|.
    • The stratum-specific mean reward (Rˉk\bar{R}_k) is calculated for trajectories within that stratum: Rˉk1nkiIk(x)Ri \bar{R}_{k} \leftarrow \frac{1}{n_{k}} \sum_{i \in I_{k}(x)} R_{i}
    • The stratum-specific standard deviation (σ^k\widehat{\sigma}_k) is calculated for rewards within that stratum: σ^k1nkiIk(x)(RiRˉk)2 \widehat{\sigma}_{k} \leftarrow \sqrt{\frac{1}{n_{k}} \sum_{i \in I_{k}(x)}\left(R_{i}-\bar{R}_{k}\right)^{2}}
  7. Compute Stratified Advantage Normalization (SAN) and Blended Advantages:

    • For each trajectory τi\tau_i belonging to the current stratum kk:
      • Its Stratified Advantage Normalization ASAN(τi)A_{\mathrm{SAN}}(\tau_i) is computed using the stratum-specific statistics: ASAN(τi)(RiRˉk)/(σ^k+ε) A_{\mathrm{SAN}}\left(\tau_{i}\right) \leftarrow\left(R_{i}-\bar{R}_{k}\right) /\left(\widehat{\sigma}_{k}+\varepsilon\right)
      • The Blended Advantage Ablend (τi)A_{\text {blend }}(\tau_i) is then computed by linearly combining ASAN(τi)A_{\mathrm{SAN}}(\tau_i) and the previously calculated AGN(τi)A_{\mathrm{GN}}(\tau_i) using the blending ratio α\alpha: Ablend (τi)αASAN(τi)+(1α)AGN(τi) A_{\text {blend }}\left(\tau_{i}\right) \leftarrow \alpha A_{\mathrm{SAN}}\left(\tau_{i}\right)+(1-\alpha) A_{\mathrm{GN}}\left(\tau_{i}\right)
  8. Return Gradient Estimate:

    • After computing Blended Advantages for all trajectories in the batch, the algorithm returns the final policy gradient estimate g^blend \widehat{g}_{\text {blend }}: g^blend 1Ki=1KAblend (τi)θlogpθ(τix) \widehat{g}_{\text {blend }} \leftarrow \frac{1}{K} \sum_{i=1}^{K} A_{\text {blend }}\left(\tau_{i}\right) \nabla_{\theta} \log p_{\theta}\left(\tau_{i} \mid x\right) This gradient estimate is used to update the policy parameters θ\theta. The term θlogpθ(τix)\nabla_{\theta} \log p_{\theta}\left(\tau_{i} \mid x\right) is the score function for trajectory τi\tau_i, indicating how changing the policy parameters would affect the probability of generating that trajectory. The advantage term Ablend (τi)A_{\text {blend }}(\tau_i) weights this score function, ensuring that trajectories with higher advantages (better than expected for their stratum) contribute more to increasing their probability.
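The advantage-computation portion of Algorithm 1 is straightforward to express in NumPy. The sketch below assumes rewards, prompt identifiers, and stratum labels (e.g., search counts) are already available as arrays; rollout generation and the gradient update on the LLM are outside the sketch.

```python
import numpy as np

def stratified_grpo_advantages(rewards, prompt_ids, strata, alpha=0.8, eps=1e-6):
    """Blended advantages from Algorithm 1 (rollouts and the gradient step are omitted).

    rewards    : (K,) scalar rewards R_i
    prompt_ids : (K,) id of the prompt x each trajectory was sampled for
    strata     : (K,) structural label, e.g. number of search calls
    alpha      : blending weight between SAN (local) and GN (global)
    """
    rewards = np.asarray(rewards, dtype=float)
    prompt_ids = np.asarray(prompt_ids)
    strata = np.asarray(strata)

    # Global statistics and GN advantages over the whole batch, as in the pseudocode.
    mu_g, sd_g = rewards.mean(), rewards.std()
    a_gn = (rewards - mu_g) / (sd_g + eps)

    # Per-prompt, per-stratum SAN advantages.
    a_san = np.empty_like(rewards)
    for x in np.unique(prompt_ids):
        for k in np.unique(strata[prompt_ids == x]):
            m = (prompt_ids == x) & (strata == k)
            mu_k, sd_k = rewards[m].mean(), rewards[m].std()
            a_san[m] = (rewards[m] - mu_k) / (sd_k + eps)

    return alpha * a_san + (1.0 - alpha) * a_gn

# Example: 8 rollouts for a single prompt, stratified by search count.
adv = stratified_grpo_advantages(
    rewards=[0.0, 0.1, 0.0, 1.0, 0.9, 1.0, 0.2, 0.8],
    prompt_ids=[0] * 8,
    strata=[0, 0, 0, 1, 1, 1, 2, 2],
    alpha=0.8,
)
print(adv)
```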

5. Experimental Setup

5.1. Datasets

The experiments evaluate Stratified GRPO on seven diverse question-answering (QA) benchmarks, covering both single-hop and multi-hop reasoning tasks.

  • Training Set:

    • The training set is constructed by merging the training splits of two well-known QA datasets:
      • Natural Questions (NQ) (Kwiatkowski et al., 2019): A single-hop QA dataset where questions can typically be answered by finding a short passage in Wikipedia. It's known for its factual, Google-search-like questions.
      • HotpotQA (Yang et al., 2018): A multi-hop QA dataset that requires reasoning over multiple pieces of evidence (documents) to answer a question. This necessitates more complex information retrieval and synthesis.
    • The use of both NQ and HotpotQA for training aims to expose the LLM agent to a broad range of QA complexities, from direct factual retrieval to more involved multi-step reasoning.
  • Evaluation Benchmarks:

    • Single-Hop QA Benchmarks:
      • Natural Questions (NQ) (Kwiatkowski et al., 2019): As described above, focuses on factual questions answerable from a single passage.
      • TriviaQA (Joshi et al., 2017): A large-scale reading comprehension dataset, often requiring retrieval of specific entities or facts.
      • PopQA (Mallen et al., 2023): A dataset designed to evaluate LLMs' ability to recall factual knowledge, often highlighting the effectiveness of parametric vs. non-parametric memories.
    • Multi-Hop QA Benchmarks:
      • HotpotQA (Yang et al., 2018): As described above, requires reasoning over multiple documents.

      • 2WikiMultiHopQA (Ho et al., 2020): A dataset explicitly designed to evaluate multi-hop reasoning by constructing questions that span information across two Wikipedia documents.

      • MuSiQue (Trivedi et al., 2022): (Multi-hop Questions via Single-hop Question Composition) This dataset focuses on multi-hop questions that can be decomposed into a series of simpler single-hop questions, challenging models to perform sequential information retrieval and synthesis.

      • Bamboogle (Press et al., 2023): A dataset used to measure and narrow the compositionality gap in language models, likely requiring multi-step reasoning and tool-use.

        These datasets were chosen because they represent a diverse set of question-answering challenges, ranging from simple information recall to complex multi-step reasoning and information synthesis. They are well-established benchmarks for evaluating LLM agents and their ability to leverage external knowledge.

5.2. Evaluation Metrics

For all evaluation benchmarks, the Exact Match (EM) metric is used.

Exact Match (EM)

  1. Conceptual Definition: Exact Match is a strict evaluation metric commonly used in question-answering and machine comprehension tasks. It measures whether the model's generated answer matches one of the ground-truth answers exactly. It focuses on the precision and correctness of the final output, giving no partial credit for partially correct answers or semantic similarity. Its design goal is to assess whether the model can produce the precise correct answer.

  2. Mathematical Formula: The Exact Match score for a single question-answer pair is typically defined as: EM={1if predicted_answerground_truth_answers0otherwise \mathrm{EM} = \begin{cases} 1 & \text{if } \text{predicted\_answer} \in \text{ground\_truth\_answers} \\ 0 & \text{otherwise} \end{cases} For a dataset of NN questions, the overall EM score is the average of the EM scores for individual questions: EMoverall=1Nj=1NEMj \mathrm{EM}_{\text{overall}} = \frac{1}{N} \sum_{j=1}^{N} \mathrm{EM}_j

  3. Symbol Explanation:

    • predicted_answer\text{predicted\_answer}: The string generated by the LLM agent as its answer to a question.
    • ground_truth_answers\text{ground\_truth\_answers}: A set of one or more reference answers provided for a question. Answers in this set are usually normalized (e.g., punctuation removed, case folded, articles removed) to account for minor stylistic variations.
    • EMj\mathrm{EM}_j: The Exact Match score for the jj-th question.
    • NN: The total number of questions in the dataset.
    • EMoverall\mathrm{EM}_{\text{overall}}: The average Exact Match score across the entire dataset, usually expressed as a percentage.
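A small reference implementation of Exact Match is sketched below. The paper does not spell out its normalization scheme, so the usual SQuAD-style normalization (lowercasing, stripping punctuation and articles, collapsing whitespace) is assumed here.

```python
import re
import string

def normalize_answer(s: str) -> str:
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(predicted: str, ground_truths: list[str]) -> int:
    """EM_j: 1 if the normalized prediction matches any normalized reference, else 0."""
    return int(any(normalize_answer(predicted) == normalize_answer(gt) for gt in ground_truths))

def em_overall(predictions: list[str], references: list[list[str]]) -> float:
    """Average EM over a dataset, as a fraction in [0, 1]."""
    return sum(exact_match(p, gts) for p, gts in zip(predictions, references)) / len(predictions)

print(exact_match("The Eiffel Tower.", ["Eiffel Tower"]))  # 1 after normalization
```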

5.3. Baselines

Stratified GRPO is compared against a comprehensive set of non-RL and RL methods, categorized as follows:

  • Non-RL Baselines: These methods do not involve Reinforcement Learning for policy optimization.

    • Direct Generation: The LLM directly generates an answer without using any external tools or retrieval. This serves as a lower bound for performance, showing the LLM's inherent knowledge.
    • Supervised Fine-Tuning (SFT): The LLM is fine-tuned on a dataset of question-answer pairs. While SFT can improve performance, it doesn't typically involve dynamic tool-use or RL.
    • RAG (Retrieval-Augmented Generation) (Lewis et al., 2020): A common technique where the LLM retrieves relevant documents before generating an answer. It's a static form of retrieval, usually one-shot.
    • Search-o1 (Li et al., 2025): An agentic search-enhanced large reasoning model. This method likely involves some form of agentic behavior with search but is not based on RL for policy optimization.
    • IRCoT (Interleaving Retrieval with Chain-of-Thought) (Trivedi et al., 2023): This method uses prompts to instruct LLMs to interleave reasoning steps with retrieval calls, enabling multi-step information gathering.
  • RL Methods: These methods use Reinforcement Learning to train the LLM agent.

    • Search-R1 (Jin et al., 2025): An RL-based method specifically designed for training LLMs to reason and leverage search engines. It's a PPO-based approach.

    • R1 (DeepSeek-AI et al., 2025): Refers to RL training without search. This baseline highlights the performance gain specifically attributable to search capabilities. It might represent a REINFORCE-style or PPO-based method focused solely on generation given an initial context.

    • ReSearch (Chen et al., 2025): Another RL-based approach focused on learning to reason with search for LLMs.

    • GRPO (Group Relative Policy Optimization) (Shao et al., 2024): The direct baseline for Stratified GRPO. GRPO is an RLVR method that uses group-based baselines (e.g., mean reward of a batch) to compute advantages, avoiding the need for a learned value function like PPO. It's robust but, according to this paper, suffers from cross-stratum bias.

      These baselines were chosen to provide a comprehensive comparison, including models with no external tools, static retrieval, heuristic search, and various RL approaches, allowing the paper to demonstrate the specific advantages of Stratified GRPO in handling structural heterogeneity.

5.4. Models and Training

  • LLM Models: The experiments use two variants of the Qwen-2.5-3B model (Yang et al., 2024):

    • Qwen-2.5-3B Base: A foundational LLM.
    • Qwen-2.5-3B Instruct: A version of the LLM already fine-tuned to follow instructions, which often exhibits different RL training dynamics.
  • Retrieval Setup:

    • Knowledge Source: The 2018 Wikipedia dump (Karpukhin et al., 2020) is used as the external knowledge source.
    • Retriever: E5 (Wang et al., 2022) is used as the retriever. It fetches the top-3 passages per search query, providing relevant context to the LLM.
  • Training Details:

    • Hardware: Training is conducted on 8 GPUs.

    • Batch Sizes: Global batch size of 256, mini-batch size of 256.

    • Sequence Lengths: Maximum sequence length is 4096 tokens. Maximum response length and retrieved content length are 500 tokens in each interaction turn.

    • Rollout Sampling:

      • Temperature: 1.0 (encourages more diverse generation).
      • Top-p: 1.0 (nucleus sampling threshold of 1.0, i.e., effectively no top-p filtering).
    • Optimization:

      • Learning Rate: 1e-6.
      • Warm-up Ratio: 0.1 (gradually increases the learning rate at the beginning of training).
      • Training Steps: 200 steps.
    • RL-Specific Parameters (for GRPO and Stratified GRPO):

      • KL Divergence Coefficient ($\beta$): 0.001. This weights a KL penalty that keeps the current policy close to a reference policy (typically the initial model), a common stabilization term in GRPO- and PPO-style training.
      • Clipping Ratio ($\epsilon$): 0.2. As in PPO and GRPO, this clips the policy probability ratio to prevent overly aggressive policy updates.
      • Responses per Prompt: 8 responses are sampled per prompt, forming the group of trajectories over which advantages are computed.
    • Stratified GRPO Specifics:

      • Blending Parameter ($\alpha$): 0.8 for Qwen-2.5-3B Instruct and 0.6 for Qwen-2.5-3B Base, indicating that the optimal blend between stratum-local and global advantages differs across base LLMs (see the sketch at the end of this subsection).
    • Search Agent Interaction:

      • Maximum Interaction Turns: 4 (the agent can perform up to 4 sequential steps of generation or search).
      • Passages per Search Call: Top 3 passages are retrieved for each search query.
    • Implementation Framework: The implementation is based on the Verl framework (Sheng et al., 2025).

      All of these settings follow previous work, particularly Search-R1 (Jin et al., 2025), to ensure a fair and consistent experimental comparison.
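To make the Stratified GRPO-specific settings concrete, here is a minimal sketch, not the authors' implementation, of stratified advantage normalization with blending. It assumes that strata are defined by the number of search calls in a trajectory, that rewards are standardized both within each stratum and over the whole group, and that the blending parameter α mixes the two as α·A_SAN + (1 − α)·A_global; the handling of single-trajectory strata and all function names are illustrative assumptions.

```python
from collections import defaultdict
import math


def standardize(values, eps=1e-8):
    """Z-score a list of rewards (zero mean, unit variance; eps guards against zero std)."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / (std + eps) for v in values]


def blended_stratified_advantages(rewards, search_counts, alpha=0.8):
    """Sketch of SAN + blending for one group of rollouts sampled from the same prompt."""
    # Global (GRPO-style) standardized advantages over the whole group.
    a_global = standardize(rewards)

    # Partition rollout indices into strata by number of search calls.
    strata = defaultdict(list)
    for i, count in enumerate(search_counts):
        strata[count].append(i)

    # Stratum-local standardization: each rollout is compared only to its structural peers.
    a_san = [0.0] * len(rewards)
    for indices in strata.values():
        if len(indices) == 1:
            continue  # degenerate single-rollout stratum: keep a zero advantage (design assumption)
        for i, a in zip(indices, standardize([rewards[i] for i in indices])):
            a_san[i] = a

    # Linear blend of the stratified and global estimators.
    return [alpha * s + (1 - alpha) * g for s, g in zip(a_san, a_global)]


# Example: 8 rollouts per prompt (as in the setup above), binary rewards,
# and the number of search calls each rollout made.
rewards = [1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 0.0]
search_counts = [2, 2, 1, 1, 3, 3, 2, 1]
print(blended_stratified_advantages(rewards, search_counts, alpha=0.8))
```

Under the configuration above, each group would hold the 8 responses sampled for a prompt, with α set to 0.8 for the Instruct model or 0.6 for the Base model.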

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate the clear superiority of Stratified GRPO over all baselines, particularly on multi-hop question-answering tasks.

The following are the results from Table 1 of the original paper:

| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Non-RL Baselines | | | | | | | | |
| Direct Generation | 10.6 | 28.8 | 10.8 | 14.9 | 24.4 | 2.0 | 2.4 | 13.4 |
| SFT | 24.9 | 29.2 | 10.4 | 18.6 | 24.8 | 4.4 | 11.2 | 17.6 |
| RAG | 34.8 | 54.4 | 38.7 | 25.5 | 22.6 | 4.7 | 8.0 | 27.0 |
| Search-o1 | 23.8 | 47.2 | 26.2 | 22.1 | 21.8 | 5.4 | 32.0 | 25.5 |
| IRCoT | 11.1 | 31.2 | 20.0 | 16.4 | 17.1 | 6.7 | 24.0 | 18.1 |
| Qwen2.5-3B-Base | | | | | | | | |
| Search-R1 | 40.6 | 58.7 | 43.5 | 28.4 | 27.3 | 4.9 | 8.8 | 30.3 |
| R1 | 22.6 | 45.5 | 17.3 | 20.1 | 26.8 | 5.5 | 22.4 | 22.9 |
| ReSearch | 42.7 | 59.7 | 43.0 | 30.5 | 27.2 | 7.4 | 12.8 | 31.9 |
| GRPO | 45.2 | 61.2 | 43.8 | 32.6 | 29.7 | 7.8 | 12.9 | 33.3 |
| Stratified GRPO | 45.9 | 61.4 | 43.0 | 40.8 | 39.9 | 17.7 | 42.7 | 41.6 |
| Qwen2.5-3B-Instruct | | | | | | | | |
| Search-R1 | 34.1 | 54.5 | 37.8 | 32.4 | 31.9 | 10.3 | 26.4 | 32.5 |
| R1 | 21.0 | 44.9 | 17.1 | 20.8 | 27.5 | 6.0 | 19.2 | 22.4 |
| ReSearch | 36.5 | 57.1 | 39.5 | 35.1 | 27.2 | 9.5 | 26.6 | 33.1 |
| GRPO | 33.4 | 52.9 | 36.7 | 26.5 | 27.4 | 6.4 | 21.0 | 29.2 |
| Stratified GRPO | 44.5 | 60.9 | 44.3 | 41.0 | 37.3 | 16.9 | 38.7 | 40.5 |

Key Observations and Analysis:

  • Overall Superiority: Stratified GRPO consistently achieves the best performance across almost all seven QA benchmarks for both the Qwen2.5-3B-Base and Qwen2.5-3B-Instruct models.

    • For Qwen2.5-3B-Base, Stratified GRPO achieves an average of 41.6 EM, outperforming GRPO (33.3 EM) by 8.3 points and the next-best baseline (ReSearch, 31.9 EM) by 9.7 points.
    • For Qwen2.5-3B-Instruct, Stratified GRPO achieves an average of 40.5 EM, outperforming GRPO (29.2 EM) by 11.3 points, which is the source of the abstract's headline claim of up to 11.3 points.
  • Pronounced Advantage on Multi-Hop Tasks: The performance gains are particularly significant on multi-hop QA benchmarks (HotpotQA, 2Wiki, Musique, Bamboogle).

    • For Qwen2.5-3B-Base, on Musique, Stratified GRPO scores 17.7 EM, more than double GRPO's 7.8 EM. On Bamboogle, it jumps from 12.9 EM (GRPO) to 42.7 EM, more than a threefold gain.
    • Similar patterns are seen with Qwen2.5-3B-Instruct. For example, on HotpotQA, Stratified GRPO achieves 41.0 EM versus GRPO's 26.5 EM.
    • This strong performance on multi-hop tasks suggests that Stratified GRPO's ability to handle structural heterogeneity is crucial for learning complex, sequential information retrieval and reasoning strategies. These tasks inherently require varied search strategies, making them highly susceptible to cross-stratum bias.
  • Comparison with Non-RL Baselines: All RL-based methods (Search-R1, ReSearch, GRPO, Stratified GRPO) generally outperform non-RL methods that rely on static RAG or prompting (Direct Generation, SFT, RAG, Search-o1, IRCoT), especially Stratified GRPO. This reinforces the value of RL for training LLM agents to use tools effectively.

  • Implications for PPO-based Methods: The consistent outperformance of Stratified GRPO over PPO-based Search-R1 suggests that the issue of cross-stratum bias is not unique to GRPO (which is a policy gradient method without a learned value function). The paper hypothesizes that this bias likely manifests in PPO-like algorithms through the difficulty of training an accurate value function (critic) for structurally diverse trajectories. This implies that principled handling of trajectory structure is a key factor for robust RL training across various algorithms for LLM search agents.

6.2. Ablation Studies / Parameter Analysis

The paper conducts an ablation study to evaluate the contribution of each component of Stratified GRPO: Stratified Advantage Normalization (SAN) alone and the full Stratified GRPO (which includes SAN and Blended Advantage).

The following are the results from Table 2 of the original paper:

| Model Variants | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Base | | | | | | | | |
| GRPO | 45.2 | 61.2 | 43.8 | 32.6 | 29.7 | 7.8 | 12.9 | 33.3 |
| w/ SAN | 43.7 | 59.3 | 41.1 | 36.6 | 38.4 | 12.6 | 25.0 | 36.7 |
| Stratified GRPO | 45.9 | 61.4 | 43.0 | 40.8 | 39.9 | 17.7 | 42.7 | 41.6 |
| Qwen2.5-3B-Instruct | | | | | | | | |
| GRPO | 33.4 | 52.9 | 36.7 | 26.5 | 27.4 | 6.4 | 21.0 | 29.2 |
| w/ SAN | 42.5 | 60.1 | 44.2 | 39.4 | 41.0 | 16.0 | 36.3 | 39.9 |
| Stratified GRPO | 44.5 | 60.9 | 44.3 | 41.0 | 37.3 | 16.9 | 38.7 | 40.5 |

Analysis of Ablation Study:

  • Contribution of SAN: For both Qwen2.5-3B-Base and Qwen2.5-3B-Instruct, GRPO w/ SAN significantly outperforms the baseline GRPO.

    • For Qwen2.5-3B-Base, GRPO w/ SAN improves the average EM from 33.3 to 36.7 (a 3.4-point gain).
    • For Qwen2.5-3B-Instruct, the gain is even larger, from 29.2 to 39.9 (a 10.7-point gain), suggesting that SAN is particularly effective when baseline GRPO struggles, as it does on the Instruct model, where GRPO suffers a training collapse (Section 6.3).
    • This confirms that Stratified Advantage Normalization itself provides substantial benefits by eliminating cross-stratum bias and stabilizing the learning signal.
  • Contribution of Blending (Full Stratified GRPO): The addition of advantage blending (i.e., the full Stratified GRPO algorithm) further improves performance over GRPO w/ SAN; the assumed form of the blend is written out after this list.

    • For Qwen2.5-3B-Base, the average EM increases from 36.7 (w/ SAN) to 41.6 (full Stratified GRPO), an additional 4.9-point gain.
    • For Qwen2.5-3B-Instruct, the average EM increases from 39.9 (w/ SAN) to 40.5 (full Stratified GRPO), an additional 0.6-point gain.
    • This demonstrates that the blending mechanism is not just a theoretical add-on but provides practical stability and performance improvements, especially in finite-sample regimes where stratum-specific statistics might be noisy.
  • Synergistic Effect: The results clearly show a synergistic effect where both SAN and the blended advantage contribute to the overall superior performance of Stratified GRPO. SAN addresses the fundamental bias, and blending ensures practical robustness. The effectiveness is particularly evident on complex multi-hop QA tasks, where Stratified GRPO (full version) shows the largest improvements over GRPO w/ SAN.
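For reference, the blend compared in this ablation is assumed (from the description of a linear blend with a global estimator) to take the per-trajectory form

$$A^{\text{blend}}(\tau) = \alpha \, A^{\text{SAN}}(\tau) + (1 - \alpha) \, A^{\text{global}}(\tau),$$

where $A^{\text{SAN}}$ is the stratum-local standardized advantage and $A^{\text{global}}$ is the GRPO-style group-level standardized advantage. Under this assumed form, "w/ SAN" corresponds to $\alpha = 1$ and standard GRPO to $\alpha = 0$.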

6.3. Training Dynamics

The paper also provides an analysis of the training dynamics, including training rewards and the number of search calls, comparing Stratified GRPO with GRPO.

The following figure (Figure 1 from the original paper) illustrates the training dynamics:

Figure 1 (multi-panel line chart): comparison of Stratified GRPO and GRPO on the Qwen-2.5-3B Base and Instruct models. The left panels plot training reward against training steps; the right panels plot the average number of search calls per question against training steps. Overall, Stratified GRPO attains higher training rewards and learns a more effective search policy.

Analysis of Training Dynamics (Figure 1):

  • Improved Reward and Training Stability (Left Plots):

    • Qwen2.5-3B-Base: Stratified GRPO (blue line) consistently achieves higher training rewards than GRPO (orange line) throughout the training process. Both methods show an increasing trend, but Stratified GRPO reaches a higher plateau.
    • Qwen2.5-3B-Instruct: This is where the difference is starkest. Standard GRPO (orange line) exhibits a training collapse, where its reward drops drastically and fails to recover. This is a known instability issue in RL training. In contrast, Stratified GRPO (blue line) maintains a stable and monotonically increasing reward signal, demonstrating its superior stability and learning efficiency. This prevents the LLM from learning a degraded policy.
  • Learning an Effective Search Policy (Right Plots):

    • The average number of search calls per question during training provides insight into the search policy learned by the agent.

    • Qwen2.5-3B-Base: Stratified GRPO (blue line) successfully learns a policy that converges to approximately 2.5 search calls per question. This indicates that it has learned to perform iterative searches and leverage the search tool effectively, suggesting a multi-step reasoning capability. Conversely, the baseline GRPO (orange line) stagnates at around one search call, failing to significantly explore or utilize more complex multi-step search strategies.

    • Qwen2.5-3B-Instruct: For this model, GRPO's training collapse (as seen in the left plot) is mirrored in its search behavior, where it fails to learn any meaningful search policy. Stratified GRPO, however, learns a stable search policy that utilizes approximately 2.5 search calls, similar to the base model.

    • This finding supports the paper's argument that GRPO's cross-stratum bias prevents it from effectively exploring potentially better search policies that involve more tool-use. By removing this bias, Stratified GRPO enables the agent to discover and exploit complex multi-step search strategies, which directly translates to its superior performance on multi-hop benchmarks that require sequential information retrieval.

      In essence, the empirical results confirm the theoretical claims: Stratified GRPO provides a purer learning signal, leading to more stable RL training, higher rewards, and the ability to learn more sophisticated tool-use policies for LLM search agents.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work addresses a critical, yet often overlooked, challenge in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents that use search tools: the structural heterogeneity of their trajectories. The paper rigorously identifies and formalizes cross-stratum bias, demonstrating how standard policy gradient methods using a global baseline lead to distorted credit assignment and hinder the exploration of complex multi-step search strategies.

To overcome this, the authors propose Stratified GRPO, featuring Stratified Advantage Normalization (SAN). SAN partitions trajectories into homogeneous strata based on structural properties (like search count) and computes advantages locally within these groups. The theoretical analysis proves that SAN effectively eliminates cross-stratum bias, ensures conditional unbiasedness, and achieves unit variance within each stratum, while maintaining global unbiasedness and unit variance. For practical robustness in finite-sample regimes, SAN is linearly blended with a global estimator.

Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks empirically validate the approach. Stratified GRPO consistently and substantially outperforms the GRPO baseline by up to 11.3 points, achieving higher training rewards, greater training stability (preventing training collapse), and learning more effective search policies that effectively utilize multi-step search. These results firmly establish stratification as a principled and effective remedy for structural heterogeneity in RL for LLM search agents.

7.2. Limitations & Future Work

The paper does not explicitly dedicate a section to "Limitations" or "Future Work." However, based on the problem formulation and proposed solution, some potential limitations and avenues for future research can be inferred:

  • Definition of Strata: The current work primarily uses search count as the basis for stratification. While intuitive for search agents, defining homogeneous strata might be more complex for other types of LLM agents that use different tools or exhibit other forms of structural heterogeneity (e.g., varying tool types, depth of reasoning chains, success/failure of API calls). Future work could explore more dynamic or learned stratification criteria.
  • Computational Overhead: Calculating stratum-specific statistics (mean and standard deviation) for every stratum in every training step might introduce some computational overhead, especially if the number of strata or prompts is very large. Optimizing the efficiency of stratum computation could be a direction.
  • Blending Hyperparameter $\alpha$: The blending hyperparameter $\alpha$ needs to be tuned (e.g., 0.8 for the Instruct model, 0.6 for the Base model). This introduces an additional hyperparameter, and its optimal value might vary across tasks and models. Research into adaptive or learned blending mechanisms could be beneficial.
  • Small Strata Robustness: While blending addresses finite-sample instability for small strata, extremely sparse strata (e.g., a stratum with only one trajectory) might still pose challenges for robust statistic estimation. Further investigation into more sophisticated handling of very small strata could be valuable.
  • Generalization to Other RL Algorithms: While the paper hypothesizes that cross-stratum bias also affects PPO-based methods, Stratified GRPO is specifically designed for GRPO. Adapting Stratified Advantage Normalization to PPO (e.g., by incorporating it into the value function estimation or advantage calculation for actor-critic methods) would be a natural next step for broader applicability.
  • Beyond QA Tasks: The experiments are focused on question-answering. Exploring the effectiveness of Stratified GRPO on other LLM agent tasks that involve multi-step planning, code generation, or complex reasoning with various tools could demonstrate its wider applicability.

7.3. Personal Insights & Critique

This paper offers several valuable insights and provokes critical reflection on RL applications for LLMs.

  • Insight 1: The Criticality of Heterogeneity: The paper powerfully highlights that not all trajectories are created equal, especially when LLM agents use tools. The formalization of cross-stratum bias as an "apples-to-oranges" problem is an elegant and intuitive explanation for observed RL instability in complex tool-use settings. This idea of structural heterogeneity is likely pervasive in many LLM agent scenarios beyond just search count and could apply to agents interacting with diverse APIs, planning tasks, or even complex conversational flows. This forces a re-evaluation of standard RL baselines and normalization techniques in such contexts.
  • Insight 2: Conditional Purity vs. Global Equivalence: The distinction drawn between conditional (within-stratum) and global (marginal) moments of advantage estimators (Theorems 4 and 5) is a profound theoretical contribution. It shows that while two estimators might appear equally good globally, their behavior at a finer-grained, conditional level can drastically impact learning dynamics. SAN's ability to provide a zero-mean, unit-variance signal within each stratum ensures pure and consistent credit assignment, which is essential for discovering nuanced multi-step strategies; a short derivation sketch of this point follows this list. This principle could inspire similar conditional analyses in other RL domains where diverse behavior modes exist.
  • Insight 3: The Power of Simplification with Rigor: The GRPO framework itself, by eschewing a learned value function, offers a simpler yet robust RL approach. Stratified GRPO builds on this by introducing stratification as a relatively simple, yet theoretically rigorous, modification. The resulting performance gains and stability improvements are significant, demonstrating that principled statistical adjustments can yield substantial practical benefits without needing overly complex architectural changes.
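To make this conditional-versus-global distinction concrete, here is a short derivation sketch. It assumes SAN standardizes the reward with the true stratum moments, $A^{\text{SAN}}(\tau) = (R(\tau) - \mu_s)/\sigma_s$ for a trajectory $\tau$ in stratum $s$; the paper's finite-sample estimates add noise but do not change the shape of the argument.

$$\mathbb{E}\big[A^{\text{SAN}} \mid s\big] = \frac{\mathbb{E}[R \mid s] - \mu_s}{\sigma_s} = 0, \qquad \mathbb{E}\big[(A^{\text{SAN}})^2 \mid s\big] = \frac{\operatorname{Var}(R \mid s)}{\sigma_s^2} = 1.$$

By the law of total expectation, the global moments follow immediately:

$$\mathbb{E}\big[A^{\text{SAN}}\big] = \mathbb{E}_s\big[\mathbb{E}[A^{\text{SAN}} \mid s]\big] = 0, \qquad \operatorname{Var}\big(A^{\text{SAN}}\big) = \mathbb{E}_s\big[\mathbb{E}[(A^{\text{SAN}})^2 \mid s]\big] - 0^2 = 1.$$

A purely global estimator can match these marginal moments while still carrying a nonzero conditional mean inside individual strata; that residual is exactly the cross-stratum bias the paper formalizes.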

Critique:

  • Strata Definition and Universality: While search count is a clear and effective stratification variable for the QA search agent task, generalizing this concept to other LLM agent tasks (e.g., coding assistants, scientific discovery agents) might be challenging. What constitutes a "homogeneous stratum" could be ambiguous or require domain-specific knowledge. A more adaptive or learned stratification method might be necessary for broader applicability, which could increase complexity.

  • Interpretability of Blending Parameter: The blending parameter $\alpha$ is tuned per model. While practical, this suggests that the optimal balance between local purity and global stability is not universal. Further work could explore how to dynamically adjust $\alpha$ during training or even learn it, potentially making the method more robust across diverse settings without manual tuning.

  • Edge Cases with Very Small Strata: Although blending helps, the paper acknowledges the potential for noisy advantage estimates when strata contain very few trajectories. This might still be an issue in highly sparse or exploratory environments where certain tool-use patterns are rare but crucial. This could lead to a chicken-and-egg problem where rare but potentially high-reward strata struggle to get accurate advantage estimates and thus are not explored sufficiently.

    Overall, Stratified GRPO presents a significant step forward in making Reinforcement Learning more effective and stable for complex LLM agents that engage in tool-use. Its core idea of addressing structural heterogeneity with stratification is principled and applicable beyond the specific context of search agents, offering a valuable new perspective for RL research in the era of LLMs.
