Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
TL;DR Summary
Stratified GRPO with Stratified Advantage Normalization eliminates cross-stratum bias in heterogeneous LLM search agent trajectories, yielding unbiased, stable credit assignment and superior multi-step RL performance.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Stratified GRPO: Handling Structural Heterogeneity in Reinforcement Learning of LLM Search Agents
1.2. Authors
The paper lists "Anonymous authors" indicating it is currently under double-blind review, a common practice in academic conferences to ensure impartial evaluation. Therefore, specific author names, research backgrounds, and affiliations are not disclosed at this stage.
1.3. Journal/Conference
The paper is hosted on OpenReview.net, a platform commonly used for conference submissions (e.g., ICLR, NeurIPS) before final acceptance; the manuscript header indicates it is under review as a conference paper at ICLR 2026. This places it at a top-tier machine learning venue. The platform facilitates open peer review, making it a key venue for cutting-edge research in these fields.
1.4. Publication Year
The OpenReview metadata (Published at (UTC): 2025-10-08) indicates that the submission was posted in October 2025.
1.5. Abstract
This paper addresses a fundamental challenge in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents that utilize search tools: the structural heterogeneity of agent trajectories. Variations in search calls (number, placement, outcomes) lead to diverse reward distributions and answer directions, making trajectories inherently incomparable. Standard policy gradient methods, which use a single global baseline for computing advantages, suffer from what the authors term cross-stratum bias—an "apples-to-oranges" comparison. This bias distorts credit assignment and hinders the exploration of complex multi-step search strategies.
To overcome this, the authors propose Stratified GRPO. Its central component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on structural properties and computes advantages locally within each stratum, ensuring fair comparison among true peers. The paper theoretically proves that SAN eliminates cross-stratum bias, yields conditionally unbiased unit-variance estimates within each stratum, and maintains global unbiasedness and unit-variance properties. For practical stability in finite-sample regimes, SAN is linearly blended with a global estimator. Extensive experiments on single-hop and multi-hop question-answering benchmarks demonstrate that Stratified GRPO consistently and substantially outperforms GRPO by up to 11.3 points, achieving higher training rewards, greater stability, and more effective search policies. These results establish stratification as a principled solution for structural heterogeneity in RL for LLM search agents.
1.6. Original Source Link
The original source link is https://openreview.net/forum?id=hqnGfzQQfa. It is currently under double-blind review.
The PDF link is https://openreview.net/pdf?id=hqnGfzQQfa.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the structural heterogeneity inherent in the trajectories of Large Language Model (LLM) agents that interact with external tools like search engines, particularly when these agents are trained using Reinforcement Learning (RL).
LLM agents are increasingly powerful, capable of tackling complex, multi-step problems by leveraging external tools. RL has emerged as a promising paradigm for training these agents to learn sophisticated reasoning and tool-use strategies directly from outcome-based rewards. However, the trajectories generated by LLM search agents are not uniform; they exhibit significant structural differences. For example, some trajectories might involve zero search calls, while others might involve multiple, strategically placed search calls, each yielding different information and leading to distinct answer directions and reward distributions.
The challenge arises because standard policy gradient methods, which are commonly used in RL, typically employ a single global baseline to compute advantages for all trajectories. This implicitly assumes that all trajectories are comparable, regardless of their underlying structure. The authors identify and formalize this flawed assumption as cross-stratum bias. This "apples-to-oranges" comparison distorts credit assignment—the process of determining which actions were responsible for a given reward—and consequently hinders the agent's ability to effectively explore and learn complex, multi-step search strategies. This leads to suboptimal policies and training instability, as illustrated by issues like training collapse in existing methods. The problem is important because it prevents RL from fully realizing its potential in training sophisticated LLM agents for real-world tool-use scenarios.
The paper's entry point is recognizing this structural heterogeneity as a fundamental, often overlooked, issue for RL in LLM search agents. Its innovative idea is to introduce stratification—dividing trajectories into homogeneous groups or strata based on their structural properties—to ensure that comparisons and advantage calculations are made among true peers.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Identification and Formalization of Cross-Stratum Bias: The paper rigorously identifies and formalizes cross-stratum bias as a fundamental challenge in policy gradient methods for LLM search agents. It provides a theoretical decomposition (Proposition 1 and Theorem 1) demonstrating that this bias arises from using a global baseline across structurally heterogeneous agent trajectories, leading to inflated variance and distorted credit assignment.
- Proposal of Stratified GRPO with Stratified Advantage Normalization (SAN): The authors propose Stratified GRPO, a principled RL algorithm designed to eliminate cross-stratum bias. Its core component, Stratified Advantage Normalization (SAN), partitions trajectories into homogeneous strata based on structural properties (e.g., search count) and computes advantages locally within each stratum. This ensures that trajectories are evaluated only against their true peers, leading to fair and stable credit assignment.
- Rigorous Theoretical Analysis of SAN: The paper provides comprehensive theoretical guarantees for SAN. It proves that SAN eliminates cross-stratum bias, is conditionally unbiased, and achieves unit variance within each stratum (Theorem 4). Crucially, SAN achieves these superior conditional properties while retaining the global unbiasedness and unit variance of standard normalization (Theorem 5), yielding a purer and more scale-stable learning signal.
- Introduction of Blended Advantage for Practical Stability: To address practical stability concerns in finite-sample regimes where some strata might be small, the paper introduces a Blended Advantage (Definition 2), which robustly combines SAN with the global estimator through linear interpolation, balancing local purity with global stability.
- Empirical Validation and Superior Performance: Extensive experiments on seven diverse single-hop and multi-hop question-answering (QA) benchmarks demonstrate that Stratified GRPO substantially outperforms the standard GRPO baseline by up to 11.3 points. It achieves higher training rewards, greater training stability (e.g., preventing the training collapse observed with GRPO on the instruct model), and learns more effective search policies (Figure 1), particularly on complex multi-hop tasks.

These findings collectively establish stratification as a principled and effective remedy for structural heterogeneity in Reinforcement Learning for LLM search agents, leading to improved learning and agent performance.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contributions of this paper, a beginner needs to understand several core concepts from Reinforcement Learning (RL) and Large Language Models (LLMs).
- Large Language Models (LLMs): Advanced artificial intelligence models trained on vast amounts of text data, capable of understanding, generating, and processing human language. They can perform various tasks such as question answering, summarization, and translation. In this paper, LLMs are used as agents that interact with environments.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent observes the state of the environment, takes an action, receives a reward, and transitions to a new state. The goal is to learn a policy.
  - Agent: The LLM itself, making decisions (e.g., generating tokens, issuing search queries).
  - Environment: The context in which the agent operates, including the prompt, external tools like a search engine, and the knowledge base.
  - State: The current situation or information available to the agent at any given time (e.g., the current prompt, the generated text so far, retrieved search results).
  - Action: A decision made by the agent (e.g., generating the next token, formulating and executing a search query).
  - Reward: A numerical signal indicating the quality of the agent's actions or final outcome (e.g., the Exact Match (EM) score for a question-answering task). The agent aims to maximize this reward.
  - Policy ($\pi_\theta$): The agent's strategy, a probability distribution over actions given a state. It is parameterized by $\theta$, which the RL algorithm aims to optimize.
- Policy Gradient Methods: A class of RL algorithms that directly optimize the policy by estimating the gradient of the expected reward with respect to the policy parameters. The general update rule involves advantages.
  - Trajectory ($\tau$): A sequence of states, actions, and rewards generated by the agent from the beginning to the end of an interaction with the environment. For example, in a search agent, a trajectory might include the initial prompt, generated text, a search query, search results, more generated text, another search query, and the final answer.
  - Advantage Function ($A(\tau)$): In policy gradient methods, instead of using the raw reward to update the policy, an advantage function is often used. It represents how much better an action (or trajectory) is compared to an expected or baseline value. This reduces variance in the gradient estimates and helps with credit assignment. The basic form is $A(\tau) = R(\tau) - b$, where $b$ is a baseline.
  - Baseline ($b$): A reference value subtracted from the reward in the advantage function. The baseline itself should not depend on the action being evaluated, so as to maintain unbiasedness of the gradient estimate. Common baselines include the average reward of a batch of trajectories or a learned value function (critic). A good baseline reduces the variance of the gradient estimates without introducing bias.
  - Normalization: A technique used to standardize values (e.g., rewards or advantages) to a common scale, typically zero mean and unit variance. This can stabilize training by preventing rewards from having vastly different magnitudes or distributions across different trajectories or strata.
  - Credit Assignment: The problem of determining which actions or sequences of actions in a trajectory were responsible for the observed reward. In multi-step tasks this can be challenging, as a final reward may be the result of many intermediate actions.
  - Structural Heterogeneity: The significant qualitative differences in the structure of trajectories. For LLM search agents, this includes variations in the number of search calls, where they are placed in the reasoning chain, and what information they retrieve. Such differences can lead to fundamentally different answer directions and reward distributions.
  - Stratification: The process of dividing a dataset or a batch of trajectories into distinct, more homogeneous subgroups (called strata) based on some shared structural property. The goal is to analyze or process data within these strata separately, ensuring that comparisons are made between similar items.
  - Cross-Stratum Bias: The central problem identified in this paper. It occurs when a global baseline is used to compute advantages across structurally heterogeneous trajectories. Because trajectories from different strata (e.g., with different numbers of search calls) have inherently different reward distributions, comparing them against a single global baseline creates a systematic bias. This bias distorts credit assignment, unfairly penalizing trajectories from low-reward strata or favoring those from high-reward strata, even if the individual actions within those trajectories were optimal for their respective strata.
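To make cross-stratum bias concrete, here is a minimal numeric sketch (the reward values and strata are invented for illustration and are not from the paper). It contrasts a single global baseline with per-stratum baselines and checks the decomposition that the paper later formalizes as Proposition 1.

```python
import numpy as np

# Toy batch: rewards grouped by a structural property (number of search calls).
rewards = np.array([0.1, 0.2, 0.15,   # stratum A: 0 search calls (low-reward regime)
                    0.7, 0.8, 0.75])  # stratum B: 2 search calls (high-reward regime)
strata = np.array([0, 0, 0, 1, 1, 1])

global_adv = rewards - rewards.mean()          # single global baseline (GRPO-style)

strat_adv = np.empty_like(rewards)             # per-stratum baseline (stratified)
for k in np.unique(strata):
    mask = strata == k
    strat_adv[mask] = rewards[mask] - rewards[mask].mean()

# Proposition 1: global advantage = stratified advantage + (stratum mean - global mean).
offset = np.array([rewards[strata == k].mean() - rewards.mean() for k in strata])
assert np.allclose(global_adv, strat_adv + offset)

print(global_adv)  # stratum A trajectories are uniformly pushed down, stratum B pushed up
print(strat_adv)   # each trajectory is judged only against its structural peers
```

The printout makes the "apples-to-oranges" effect visible: under the global baseline, every zero-search trajectory receives a negative advantage regardless of how good it was within its own stratum.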
3.2. Previous Works
The paper contextualizes its work within various existing paradigms in LLM research and RL.
- Reinforcement Learning from Human Feedback (RLHF): A popular method for aligning LLMs with human preferences. It typically involves three steps: supervised fine-tuning (SFT) of an LLM, training a reward model (RM) on human preference data, and then optimizing the LLM's policy with RL (often PPO) using the RM as the reward signal. This approach is effective but can be computationally expensive and unstable due to RM training and distribution shift.
- Proximal Policy Optimization (PPO) (Schulman et al., 2017): A widely used policy gradient algorithm in RLHF. PPO keeps new policies close to old policies to prevent excessively large policy updates that could degrade performance. It typically uses a learned value function (critic) as a baseline to estimate advantages. The paper notes that a structural bias similar to cross-stratum bias can occur in PPO if the critic does not adequately condition on the structural heterogeneity.
- Direct Preference Optimization (DPO) (Rafailov et al., 2023): An alternative to RLHF that optimizes on preference data directly without training a separate reward model. It reframes the RLHF objective as a simple classification problem, making it more stable and computationally efficient.
- Reinforcement Learning with Verifiable Rewards (RLVR): A category of RL applications where the reward can be computed directly from verifiable outcomes (e.g., correctness of an answer in QA, performance on a mathematical task) rather than from a learned reward model of human preferences. This paper operates in the RLVR setting for LLM search agents.
  - Group Relative Policy Optimization (GRPO) (Shao et al., 2024): A prominent RLVR approach that the current paper builds upon. GRPO simplifies PPO by removing the dependency on a learned value function (critic), instead using group-based baselines (e.g., the mean reward of a group of sampled responses) to compute advantages. While often more stable than PPO in LLM contexts, it still suffers from cross-stratum bias if these groups are not structurally homogeneous.
  - RLOO (Ahmadian et al., 2024): Another RLVR method that revisits the foundational REINFORCE algorithm with simplifications tailored for LLM training. It also uses a global baseline and is thus susceptible to cross-stratum bias.
- REINFORCE (Williams, 1992): The foundational policy gradient algorithm. It updates the policy directly using reward-weighted policy gradients. It can have high variance, which is often reduced by subtracting a baseline.
- LLM Search Agents: A category of LLM-based agents augmented with external tools, particularly search engines. These agents can interleave token generation with search queries to gather information and solve complex tasks.
  - Retrieval-Augmented Generation (RAG) (Lewis et al., 2020): A popular technique where LLMs retrieve relevant documents from a knowledge base before generating a response, giving them access to up-to-date or specific knowledge they were not trained on.
  - Search-o1 (Li et al., 2025) and IRCoT (Trivedi et al., 2023): Examples of non-RL methods that use LLMs with search. IRCoT (Interleaving Retrieval with Chain-of-Thought) integrates retrieval into chain-of-thought reasoning through prompting; Search-o1 is an agentic search-enhanced reasoning model.
  - Search-R1 (Jin et al., 2025) and ReSearch (Chen et al., 2025): Recent RL-based methods for training LLM search agents. Search-R1 explicitly trains LLMs to reason and leverage search engines using RL; ReSearch likewise focuses on learning to reason with search via RL. These methods typically use general-purpose RL algorithms such as PPO or GRPO and, according to this paper, would face the cross-stratum bias problem.
3.3. Technological Evolution
The field of LLMs has rapidly evolved from initial impressive generative capabilities to becoming agents capable of interacting with the real world through tools.
- Early LLMs: Primarily focused on text generation, summarization, and translation, based on supervised fine-tuning (SFT) on massive text corpora.
- Instruction Following & Alignment (RLHF/DPO): The advent of RLHF (and later DPO) enabled LLMs to better align with human instructions and preferences, making them more useful in conversational settings. This marked a shift towards outcome-based learning and optimization of user-aligned objectives.
- Tool-Augmented LLMs (LLM Agents): Recognizing the limitations of LLMs' internal knowledge and reasoning, researchers began equipping them with external tools such as search engines, calculators, and APIs. This transformed LLMs into agents that can solve more complex, multi-step tasks by intelligently interleaving generation and tool use. Techniques such as RAG and Chain-of-Thought (CoT) prompting were crucial here.
- RL for Tool-Augmented LLMs: The current frontier involves using RL to train these LLM agents to learn optimal tool-use strategies directly from outcome-based rewards. This is challenging because tool use introduces new complexities, such as long-horizon reasoning and dynamic environment interactions.

This paper fits into the latter part of this timeline. While RL has shown promise for training LLM agents, this work identifies a crucial, often overlooked challenge specific to LLM search agents: the structural heterogeneity of trajectories generated during tool use. Prior RL methods for LLMs, even those using group-based baselines like GRPO, were designed for more homogeneous RL settings or did not explicitly account for the profound structural differences introduced by variable tool use. Stratified GRPO addresses this specific gap, aiming to make RL training more effective and stable for these complex LLM agents.
3.4. Differentiation Analysis
Compared to the main methods in related work, especially GRPO, the core differences and innovations of this paper's approach lie in its explicit and principled handling of structural heterogeneity.
- Standard Policy Gradient Methods (e.g., REINFORCE with a global baseline, RLOO, GRPO):
  - Core Assumption: Implicitly assume all trajectories are comparable when computing advantages with a single global baseline.
  - Credit Assignment: Suffer from cross-stratum bias, leading to distorted credit assignment. Trajectories with inherently different reward distributions (e.g., those with 0 search calls vs. 3 search calls) are compared against the same global average, unfairly penalizing or rewarding them.
  - Exploration: The cross-stratum bias can hinder exploration of complex multi-step strategies. For example, trajectories involving multiple search calls may initially have lower average rewards, causing GRPO to under-sample them even if they hold potential for higher rewards after further exploration.
  - Stability: Can suffer from training instability (e.g., training collapse), especially with models like the Instruct model, as seen in the experiments.
- Stratified GRPO (Proposed Method):
  - Core Innovation: Explicitly addresses structural heterogeneity by introducing Stratified Advantage Normalization (SAN), which partitions trajectories into homogeneous strata based on structural properties (e.g., number of search calls).
  - Credit Assignment: Computes advantages locally within each stratum, ensuring that trajectories are evaluated only against their true peers. This eliminates cross-stratum bias (as proven by Proposition 1 and Theorem 1) and provides fairer credit assignment.
  - Exploration: By removing the cross-stratum bias, Stratified GRPO encourages more effective exploration of different search strategies (e.g., policies involving multiple search calls) by evaluating them fairly within their own context, even if their initial rewards are not globally competitive.
  - Stability: Exhibits higher training stability and converges to more effective search policies, consistently achieving higher rewards and avoiding the training collapse observed with GRPO.
  - Theoretical Guarantees: Provides strong theoretical backing for SAN's properties, including conditional unbiasedness and unit variance within strata (Theorem 4), while preserving global unbiasedness and unit variance (Theorem 5).
  - Practical Stability: Incorporates a Blended Advantage (Definition 2) that blends SAN with Global Normalization (GN) to mitigate finite-sample instability for small strata.

In essence, while GRPO made strides by using group-based baselines to improve RL for LLMs, Stratified GRPO takes this a step further by recognizing that these groups themselves may be heterogeneous, particularly when LLM agents perform tool use. By applying stratification and local normalization, Stratified GRPO provides a more refined and robust learning signal, unlocking better performance and stability for LLM search agents. The paper also implicitly critiques PPO-based methods for LLM search agents (such as Search-R1), suggesting that they too would suffer from a similar structural bias if their value function (critic) does not adequately condition on trajectory structure.
4. Methodology
4.1. Principles
The core idea behind Stratified GRPO is to explicitly account for the structural heterogeneity of LLM search agent trajectories in Reinforcement Learning (RL). Instead of treating all trajectories as comparable and using a single global baseline for advantage estimation, the method proposes to:
- Partition trajectories into homogeneous strata: Group trajectories that share similar structural properties (e.g., the same number of search calls) into distinct strata.
- Compute advantages locally within each stratum: Evaluate rewards and estimate baselines and variances only within these homogeneous groups.

The intuition is that an "apples-to-oranges" comparison, in which a trajectory with many search calls is evaluated against one with zero search calls using a single global average reward, is inherently unfair and distorting. By comparing trajectories only against their "true peers" (i.e., those within the same stratum), Stratified GRPO aims to eliminate cross-stratum bias, provide more accurate credit assignment, and yield a purer, more stable learning signal for the policy gradient. This enables the agent to better understand the value of different multi-step search strategies and explore them more effectively.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. RL for Multi-Turn Search Agents
The problem of training a multi-turn search agent is framed as a Reinforcement Learning (RL) problem. The agent, parameterized by a policy $\pi_\theta$, interacts with a search engine. This interaction interleaves token generation (e.g., generating parts of an answer) with search queries (e.g., asking the search engine for information).

For a given prompt $x$, sampled from a distribution $\mathcal{D}$, the agent generates a sequence of actions and observations, forming a trajectory $\tau \sim \pi_\theta(\cdot \mid x)$. Once the trajectory is complete (e.g., the agent produces a final answer), it receives a scalar reward $R(\tau)$, which quantifies the quality of the final response.

The objective of the RL problem is to maximize the expected reward:

$$J(\theta) = \mathbb{E}_{x \sim \mathcal{D},\ \tau \sim \pi_\theta(\cdot \mid x)}\left[ R(\tau) \right]$$

Here:
- $J(\theta)$ is the objective function that the RL algorithm aims to maximize.
- $\theta$ represents the parameters of the agent's policy (e.g., the weights of the LLM).
- $\mathbb{E}$ denotes the expectation operator.
- $x \sim \mathcal{D}$ means that prompts are sampled from a distribution of prompts $\mathcal{D}$.
- $\tau \sim \pi_\theta(\cdot \mid x)$ indicates that trajectories are generated by the policy conditioned on the given prompt $x$.
- $R(\tau)$ is the scalar reward received for completing the trajectory $\tau$.

This optimization problem is typically solved using policy gradient methods, which estimate the gradient of $J(\theta)$ with respect to $\theta$ and update $\theta$ in the direction of increasing expected reward.
4.2.2. Cross-Stratum Bias in Policy-Gradient Baselines
The paper identifies cross-stratum bias as a fundamental issue when using global baselines in policy gradient methods for LLM search agents. This is because trajectories from search agents exhibit significant structural heterogeneity (e.g., varying number of search calls, content, outcomes), leading to strata with systematically different answer directions and reward distributions.
To formalize this, the following notation is introduced:

- $\{\tau_i\}_{i=1}^{n}$: A batch of $n$ trajectories sampled independently and identically distributed (i.i.d.) from the policy $\pi_\theta$ for a fixed prompt $x$.
- The batch is partitioned into $K$ non-empty strata $B_1, \dots, B_K$ based on a predefined structural property (e.g., the number of search calls).
- $n_k = |B_k|$: The number of trajectories in stratum $B_k$.
- $R_i = R(\tau_i)$: The reward for trajectory $\tau_i$.
- $\bar{R}_{\text{global}} = \frac{1}{n}\sum_{i=1}^{n} R_i$: The global mean reward of the entire batch.
- $\bar{R}_k = \frac{1}{n_k}\sum_{\tau_i \in B_k} R_i$: The stratum-specific mean reward for stratum $B_k$.

Based on these definitions, two natural advantage estimators are considered:

- Global Advantage ($\hat{A}_G$): Uses the global mean reward as a baseline.
  $$\hat{A}_{G}(\tau_{i}) = R_{i} - \bar{R}_{\text{global}}$$
- Stratified Advantage ($\hat{A}_S$): Uses the stratum-specific mean reward as a baseline.
  $$\hat{A}_{S}(\tau_{i}) = R_{i} - \bar{R}_{k} \quad (\text{for } \tau_{i} \in B_{k})$$
Proposition 1 (Advantage Decomposition)
For any trajectory $\tau_i \in B_k$, the global advantage can be decomposed as:

$$\hat{A}_G(\tau_i) = \hat{A}_S(\tau_i) + \left(\bar{R}_k - \bar{R}_{\text{global}}\right)$$

This proposition reveals that the global advantage for a trajectory is not just its stratified advantage; it also includes an additional term, the difference between its stratum's mean reward and the global mean reward. This term, $\bar{R}_k - \bar{R}_{\text{global}}$, is identified as the cross-stratum bias. It is a deterministic offset applied uniformly to all trajectories within a given stratum. If a stratum has a mean reward lower than the global mean, trajectories from that stratum have their advantages artificially decreased, and vice versa. This effectively creates an "apples-to-oranges" comparison across structurally different trajectories.
Theorem 1 (Variance Reduction via Stratified Baselines)
The empirical variances of the stratified and global advantage estimators satisfy:

$$\widehat{\operatorname{Var}}\left(\hat{A}_S\right) \leq \widehat{\operatorname{Var}}\left(\hat{A}_G\right)$$

Moreover, the reduction in variance is exactly the variance induced by the cross-stratum bias:

$$\widehat{\operatorname{Var}}\left(\hat{A}_G\right) - \widehat{\operatorname{Var}}\left(\hat{A}_S\right) = \sum_{k=1}^{K} \frac{n_k}{n}\left(\bar{R}_k - \bar{R}_{\text{global}}\right)^2$$

This theorem mathematically proves that stratification not only removes cross-stratum bias but also strictly reduces the variance of the advantage estimator whenever the mean rewards of the strata differ (i.e., $\bar{R}_k \neq \bar{R}_{\text{global}}$ for some $k$). The term $\sum_{k} \frac{n_k}{n}(\bar{R}_k - \bar{R}_{\text{global}})^2$ is the "between-stratum variance": the variance explained by the differences in mean rewards across strata. By centering rewards within each stratum, the Stratified Advantage (and later SAN) eliminates this component, leading to a lower overall variance for the learning signal. This means Stratified GRPO provides a more stable and accurate gradient estimate.
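The variance identity can be checked numerically on the same kind of toy batch used above (values invented, not from the paper):

```python
import numpy as np

rewards = np.array([0.1, 0.2, 0.15, 0.7, 0.8, 0.75])
strata = np.array([0, 0, 0, 1, 1, 1])
n = len(rewards)

global_adv = rewards - rewards.mean()
strat_adv = np.concatenate([rewards[strata == k] - rewards[strata == k].mean()
                            for k in np.unique(strata)])

# Between-stratum variance: sum over strata of (n_k / n) * (stratum mean - global mean)^2.
between = sum((strata == k).sum() / n * (rewards[strata == k].mean() - rewards.mean()) ** 2
              for k in np.unique(strata))

# Theorem 1: Var(A_G) - Var(A_S) equals the between-stratum variance.
assert np.isclose(global_adv.var() - strat_adv.var(), between)
```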
4.2.3. Stratified Advantage Normalization: Definition and Theoretical Guarantees
Building on the concept of stratification, the paper introduces Stratified Advantage Normalization (SAN). SAN combines stratification with per-stratum normalization to create a stable and scale-invariant learning signal.
Definition 1 (SAN Advantage)
For a given prompt $x$, partition the batch of trajectories into strata $\{B_k\}_{k=1}^{K}$ based on a chosen partitioning function (e.g., the search count for search agents). The SAN advantage for a trajectory $\tau_i \in B_k$ is defined as:

$$\hat{A}_{\text{SAN}}(\tau_i) = \frac{R_i - \bar{R}_k}{\sigma_k + \epsilon}$$

Here:
- $R_i$ is the reward of trajectory $\tau_i$.
- $\bar{R}_k$ is the empirical mean reward of stratum $B_k$ for the given prompt $x$.
- $\sigma_k$ is the empirical standard deviation of rewards in stratum $B_k$ for the given prompt $x$.
- $\epsilon$ is a small positive constant added for numerical stability (to prevent division by zero when $\sigma_k$ is zero).

SAN thus computes a Z-score-like value for each trajectory within its own stratum, effectively standardizing the rewards of each stratum to have a mean of 0 and a standard deviation of 1 (ignoring $\epsilon$ in the ideal case).
Proposition 2 (Invariance to Positive Affine Reward Transforms)
Suppose $\epsilon = 0$ and $\sigma_k > 0$ for every stratum. The SAN advantage is invariant under any positive affine transformation of the rewards, $R_i' = a R_i + b$ with $a > 0$. That is, $\hat{A}_{\text{SAN}}'(\tau_i) = \hat{A}_{\text{SAN}}(\tau_i)$.
This proposition states that if you linearly scale and shift all rewards (e.g., changing the unit of currency or adding a fixed bonus), the computed SAN advantage remains the same. This is a highly desirable property as it makes the learning signal robust to arbitrary reward scaling, meaning the RL algorithm doesn't need to retune if the reward function's scale changes.
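A quick numerical check of this invariance on a toy batch (values invented; the idealized case $\epsilon = 0$ is assumed):

```python
import numpy as np

def san(rewards, strata, eps=0.0):
    """Per-stratum z-score (SAN advantage); eps=0 matches the idealized setting."""
    adv = np.empty_like(rewards, dtype=float)
    for k in np.unique(strata):
        m = strata == k
        adv[m] = (rewards[m] - rewards[m].mean()) / (rewards[m].std() + eps)
    return adv

rewards = np.array([0.1, 0.3, 0.2, 0.9, 0.7, 0.8])
strata = np.array([0, 0, 0, 1, 1, 1])

a, b = 5.0, 2.0                          # positive affine transform R' = a*R + b
original = san(rewards, strata)
transformed = san(a * rewards + b, strata)

assert np.allclose(original, transformed)  # invariance holds (Proposition 2, eps = 0)
```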
Theorem 2 (Variance Decomposition for Normalized Stratified Advantage)
Let $\hat{A}_{\text{SAN}}$ be the stratified and normalized advantage. It is related to the global advantage by the following exact decomposition of the empirical variances (which follows directly from the definitions above):

$$\widehat{\operatorname{Var}}\left(\hat{A}_G\right) - \widehat{\operatorname{Var}}\left(\hat{A}_{\text{SAN}}\right) = \underbrace{\sum_{k=1}^{K} \frac{n_k}{n}\left(\bar{R}_k - \bar{R}_{\text{global}}\right)^2}_{\text{Term A}} + \underbrace{\sum_{k=1}^{K} \frac{n_k}{n}\,\sigma_k^2\left(1 - \frac{1}{(\sigma_k + \epsilon)^2}\right)}_{\text{Term B}}$$

This theorem extends Theorem 1 by quantifying the total variance reduction from $\hat{A}_G$ to $\hat{A}_{\text{SAN}}$.

- Term A (Between-Stratum Variance): The same term as in Theorem 1. It quantifies the variance caused by differences in mean rewards across strata; SAN completely removes this structural component by centering rewards within each stratum.
- Term B (Normalization Effect): Captures the additional effect of per-stratum normalization (dividing by $\sigma_k + \epsilon$). It primarily stabilizes the scale of rewards within each stratum, ensuring a consistent and numerically robust learning signal. While it can be positive or negative depending on the values of $\sigma_k$ and $\epsilon$, its main role is scale stabilization.
Theorem 3 (Population SAN Expectation)
Let $S(\tau)$ be a discrete stratum assignment that may depend on $\tau$, and define the per-trajectory (population) SAN advantage:

$$A_{\text{SAN}}(\tau) = \frac{R(\tau) - \mu_{S(\tau)}}{\sigma_{S(\tau)} + \epsilon}$$

where $\mu_k$ and $\sigma_k$ are the stratum-wise mean and standard deviation of the reward, and $\epsilon$ is a small regularizer. Then, under standard regularity conditions,

$$\mathbb{E}_{\tau \sim \pi_\theta}\left[A_{\text{SAN}}(\tau)\, \nabla_\theta \log \pi_\theta(\tau)\right] = \sum_{k} \frac{P_\theta\left(S(\tau)=k\right)}{\sigma_k + \epsilon}\, g_k$$

This theorem shows that the population SAN gradient estimator exactly targets a weighted sum of within-stratum gradients.

The term $g_k$ is the true per-stratum policy gradient, representing the gradient of the expected reward within stratum $k$. Specifically:

$$g_k = \mathbb{E}_{\tau \sim \pi_\theta}\left[\left(R(\tau) - \mu_k\right)\nabla_\theta \log \pi_\theta(\tau) \,\middle|\, S(\tau) = k\right]$$

The population SAN estimator is thus a weighted sum of these per-stratum policy gradients, where each stratum's gradient is weighted by its probability $P_\theta(S(\tau)=k)$ and inversely scaled by its standard deviation $\sigma_k + \epsilon$. This means SAN is an asymptotically unbiased estimator that correctly pushes the policy to improve rewards in each stratum, effectively eliminating the structural bias present in global baselines.
4.2.4. Structural Comparison of SAN and Global Normalization
This section provides a direct comparison between SAN and Global Normalization (GN), highlighting SAN's advantages. GN is defined as:

$$\hat{A}_{\text{GN}}(\tau_i) = \frac{R_i - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}$$

Here:
- $R_i$ is the reward of trajectory $\tau_i$.
- $\bar{R}_{\text{global}}$ is the global mean reward of the batch.
- $\sigma_{\text{global}}$ is the global standard deviation of rewards in the batch.
- $\epsilon$ is a small constant for numerical stability.

GN normalizes all rewards using global statistics (mean and standard deviation) across the entire batch, without regard for stratum structure.
Proposition 3 (Exact Advantage Decomposition)
For any fixed batch partition and any $\tau_i \in B_k$:

$$\hat{A}_{\text{GN}}(\tau_i) = \frac{\sigma_k + \epsilon}{\sigma_{\text{global}} + \epsilon}\, \hat{A}_{\text{SAN}}(\tau_i) + \frac{\bar{R}_k - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}$$

This proposition precisely shows how GN relates to SAN. The GN advantage for a trajectory in stratum $B_k$ is a rescaled SAN advantage plus an additional offset term.

- $\frac{\sigma_k + \epsilon}{\sigma_{\text{global}} + \epsilon}$ is a scaling factor that varies by stratum, representing the ratio of the stratum-specific standard deviation to the global standard deviation.
- $\frac{\bar{R}_k - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}$ is the cross-stratum offset bias. This term is crucial because it represents a systematic bias in GN that depends on how much the mean reward of stratum $B_k$ deviates from the global mean reward, scaled by the global standard deviation. This bias is the source of GN's structural flaw.
Gradient Bias from the Cross-Stratum Bias:
The cross-stratum offset directly translates into a structural flaw in the GN gradient estimator. Plugging the decomposition of Proposition 3 into the batch gradient estimate gives:

$$\hat{g}_{\text{GN}} = \underbrace{\frac{1}{n}\sum_{k}\sum_{\tau_i \in B_k}\frac{\sigma_k + \epsilon}{\sigma_{\text{global}} + \epsilon}\,\hat{A}_{\text{SAN}}(\tau_i)\,\nabla_\theta \log \pi_\theta(\tau_i)}_{\text{rescaled within-stratum signal}} \; + \; \underbrace{\frac{1}{n}\sum_{k}\frac{\bar{R}_k - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}\sum_{\tau_i \in B_k}\nabla_\theta \log \pi_\theta(\tau_i)}_{\text{bias from cross-stratum offset}}$$

The second term, the bias from the cross-stratum offset, shows that GN's gradient is systematically biased. This bias couples reward differences across strata with the policy's score vectors, meaning that the learning signal is distorted by global statistics rather than focusing purely on improving within-stratum performance. This can hinder exploration: strata with mean rewards below the global mean are systematically downweighted, potentially preventing the agent from exploring complex, multi-step search strategies that might be high-reward within their own strata.
Theorem 4 (Conditional Properties of SAN and GN Advantages)
This theorem analyzes the conditional properties (within a stratum) of SAN and GN in the large-sample limit (population statistics), assuming the stabilizer $\epsilon \to 0$.

Population reward statistics for a given prompt and stratum $k$:
- $\mu_k = \mathbb{E}[R(\tau) \mid S(\tau) = k]$: Conditional mean reward for stratum $k$.
- $\sigma_k^2 = \operatorname{Var}(R(\tau) \mid S(\tau) = k)$: Conditional variance of rewards for stratum $k$.
- $\mu = \mathbb{E}[R(\tau)]$: Global mean reward.
- $\sigma^2 = \operatorname{Var}(R(\tau))$: Global variance of rewards.

- Conditional Expectation (Bias):
  - $\mathbb{E}\left[A_{\text{SAN}} \mid S = k\right] = 0$: The SAN advantage is unbiased within each stratum. SAN therefore provides a pure signal, indicating how a trajectory performs relative to the average of its peers.
  - $\mathbb{E}\left[A_{\text{GN}} \mid S = k\right] = \frac{\mu_k - \mu}{\sigma}$: The GN advantage carries a systematic bias proportional to the difference between the stratum's mean reward and the global mean reward. This is the quantitative expression of cross-stratum bias.
- Conditional Variance:
  - $\operatorname{Var}\left(A_{\text{SAN}} \mid S = k\right) = 1$: The SAN advantage has a consistent unit variance within each stratum, ensuring a stable and uniformly scaled learning signal across all strata.
  - $\operatorname{Var}\left(A_{\text{GN}} \mid S = k\right) = \frac{\sigma_k^2}{\sigma^2}$: The GN advantage's variance scales with the ratio of stratum-to-global variance, so the learning signal from GN has an inconsistent scale across strata, making credit assignment less reliable.

In summary, SAN acts as a pure and scale-stable signal carrier at the conditional level, whereas GN introduces cross-stratum bias and inconsistent scaling.
Theorem 5 (Global Moments of SAN and GN)
This theorem analyzes the global (marginal) moments of the SAN and GN advantages, again in the large-sample limit (with $\epsilon \to 0$).

Consider the population SAN and GN advantages:
- $A_{\text{SAN}}(\tau) = \dfrac{R(\tau) - \mu_{S(\tau)}}{\sigma_{S(\tau)}}$
- $A_{\text{GN}}(\tau) = \dfrac{R(\tau) - \mu}{\sigma}$

(a) Global Means:
- $\mathbb{E}\left[A_{\text{SAN}}\right] = 0$
- $\mathbb{E}\left[A_{\text{GN}}\right] = 0$

Both SAN and GN are globally unbiased (have a mean of zero) when averaged over all strata.

(b) Global Variances:
- $\operatorname{Var}\left(A_{\text{SAN}}\right) = \operatorname{Var}\left(A_{\text{GN}}\right) = 1$

Both SAN and GN achieve exactly the same unit variance globally.
The key takeaway from Theorems 4 and 5 is that while SAN and GN appear equivalent at the global level (both being globally unbiased with unit variance), they differ fundamentally at the conditional level (within strata). SAN is conditionally pure (zero mean, unit variance within each stratum), providing a consistent learning signal. GN, however, is conditionally biased and has inconsistent scaling, distorting the learning signal within strata. This conditional purity of SAN is what truly governs more accurate credit assignment and effective learning dynamics.
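This contrast between conditional and global moments can be verified empirically. The sketch below uses synthetic rewards (not the paper's data) and estimates the per-stratum and global mean/variance of the SAN and GN advantages:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: two strata with different reward means and variances.
n = 200_000
strata = rng.integers(0, 2, size=n)
rewards = np.where(strata == 0,
                   rng.normal(0.2, 0.1, size=n),   # stratum 0: low-reward regime
                   rng.normal(0.8, 0.3, size=n))   # stratum 1: high-reward regime

gn = (rewards - rewards.mean()) / rewards.std()    # global normalization (GN)
san = np.empty_like(rewards)                       # stratified normalization (SAN)
for k in (0, 1):
    m = strata == k
    san[m] = (rewards[m] - rewards[m].mean()) / rewards[m].std()

for k in (0, 1):
    m = strata == k
    print(f"stratum {k}: SAN mean={san[m].mean():+.3f} var={san[m].var():.3f} | "
          f"GN mean={gn[m].mean():+.3f} var={gn[m].var():.3f}")
print(f"global:    SAN mean={san.mean():+.3f} var={san.var():.3f} | "
      f"GN mean={gn.mean():+.3f} var={gn.var():.3f}")
# SAN: zero mean / unit variance in every stratum; GN: biased and mis-scaled per stratum,
# even though both have zero mean and unit variance globally (Theorems 4 and 5).
```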
4.2.5. Blended Advantage for Finite-Sample Stability
While SAN offers theoretical advantages, in practical finite-sample regimes (when dealing with limited data in a batch) some strata may contain very few trajectories. In such cases, the empirical mean and standard deviation estimates ($\bar{R}_k$, $\sigma_k$) for those small strata can be noisy, leading to potentially unstable SAN advantage estimates.
To mitigate this finite-sample instability, the paper proposes a Blended Advantage that combines the local purity of SAN with the global stability of GN.
Definition 2 (Blended Advantage)
For a trajectory $\tau_i$, the Blended Advantage is defined as:

$$\hat{A}_{\text{blend}}(\tau_i) = \alpha\, \hat{A}_{\text{SAN}}(\tau_i) + (1 - \alpha)\, \hat{A}_{\text{GN}}(\tau_i)$$

Here:
- $\hat{A}_{\text{blend}}(\tau_i)$ is the final advantage used for policy updates.
- $\hat{A}_{\text{SAN}}(\tau_i)$ is the Stratified Advantage Normalization advantage (from Definition 1).
- $\hat{A}_{\text{GN}}(\tau_i)$ is the Global Normalized advantage.
- $\alpha \in [0, 1]$ is a hyperparameter that controls the blending ratio:
  - If $\alpha = 1$, the Blended Advantage becomes purely SAN.
  - If $\alpha = 0$, the Blended Advantage reverts to GN.

This blending mechanism allows Stratified GRPO to retain the benefits of SAN (eliminating cross-stratum bias) while using GN to stabilize advantage estimates for strata with small sample sizes by borrowing information from the global distribution. The parameter $\alpha$ can be tuned to balance these two objectives.
The Stratified GRPO algorithm incorporates this blended advantage.
Algorithm 1: Stratified GRPO

Require: Policy $\pi_\theta$; batch $\{\tau_i\}_{i=1}^{n}$ with rewards $\{R_i\}_{i=1}^{n}$; blending $\alpha \in [0,1]$; stabilizer $\epsilon > 0$.
- Compute global stats: $\bar{R}_{\text{global}} = \frac{1}{n}\sum_{i=1}^{n} R_i$; $\sigma_{\text{global}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(R_i - \bar{R}_{\text{global}}\right)^2}$.
- For all $i$, set $\hat{A}_{\text{GN}}(\tau_i) = \frac{R_i - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}$.
- Partition indices into per-prompt, per-stratum groups $I_{x,k}$ (e.g., by search count).
- for each prompt $x$ do
  - for each stratum $k$ with index set $I_{x,k}$ do
    - $n_k = |I_{x,k}|$, $\bar{R}_k = \frac{1}{n_k}\sum_{i \in I_{x,k}} R_i$, $\sigma_k = \sqrt{\frac{1}{n_k}\sum_{i \in I_{x,k}}\left(R_i - \bar{R}_k\right)^2}$.
    - for $i \in I_{x,k}$ do
      - $\hat{A}_{\text{SAN}}(\tau_i) = \frac{R_i - \bar{R}_k}{\sigma_k + \epsilon}$.
      - $\hat{A}(\tau_i) = \alpha\,\hat{A}_{\text{SAN}}(\tau_i) + (1 - \alpha)\,\hat{A}_{\text{GN}}(\tau_i)$.
    - end for
  - end for
- end for
- Return gradient estimate $\hat{g} = \frac{1}{n}\sum_{i=1}^{n}\hat{A}(\tau_i)\,\nabla_\theta \log \pi_\theta(\tau_i)$.
Step-by-step breakdown of Algorithm 1:
- Initialization:
  - Policy $\pi_\theta$: The LLM's policy to be optimized.
  - Batch $\{\tau_i\}_{i=1}^{n}$: A collection of $n$ trajectories sampled from $\pi_\theta$.
  - Rewards $\{R_i\}_{i=1}^{n}$: The corresponding rewards for each trajectory.
  - Blending hyperparameter $\alpha$: Controls the mix between SAN and GN.
  - Stabilizer $\epsilon$: A small constant for numerical stability.
- Compute Global Statistics:
  - First, the global mean reward $\bar{R}_{\text{global}} = \frac{1}{n}\sum_{i=1}^{n} R_i$ is calculated across all trajectories in the batch.
  - Then, the global standard deviation $\sigma_{\text{global}} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(R_i - \bar{R}_{\text{global}}\right)^2}$ is calculated for all rewards in the batch.
- Compute Global Normalized (GN) Advantages:
  - For every trajectory $\tau_i$ in the batch, its Global Normalized advantage is computed using the global statistics: $\hat{A}_{\text{GN}}(\tau_i) = \frac{R_i - \bar{R}_{\text{global}}}{\sigma_{\text{global}} + \epsilon}$.
- Partition Trajectories into Strata:
  - The trajectories are partitioned into per-prompt, per-stratum groups. For each unique prompt $x$ in the batch, the trajectories generated for that prompt are further grouped into strata based on a predefined structural property; for LLM search agents, this property is typically the search count (number of search calls). $I_{x,k}$ is the set of indices of trajectories belonging to stratum $k$ for prompt $x$.
- Iterate through Prompts and Strata:
  - The algorithm loops through each unique prompt $x$ in the batch and, within it, through each stratum $k$ identified for that prompt.
- Compute Stratum-Specific Statistics:
  - For each stratum (identified by its index set $I_{x,k}$), the number of trajectories is counted: $n_k = |I_{x,k}|$.
  - The stratum-specific mean reward $\bar{R}_k = \frac{1}{n_k}\sum_{i \in I_{x,k}} R_i$ is calculated for trajectories within that stratum.
  - The stratum-specific standard deviation $\sigma_k = \sqrt{\frac{1}{n_k}\sum_{i \in I_{x,k}}\left(R_i - \bar{R}_k\right)^2}$ is calculated for rewards within that stratum.
- Compute Stratified Advantage Normalization (SAN) and Blended Advantages:
  - For each trajectory $\tau_i$ belonging to the current stratum:
    - Its Stratified Advantage Normalization is computed using the stratum-specific statistics: $\hat{A}_{\text{SAN}}(\tau_i) = \frac{R_i - \bar{R}_k}{\sigma_k + \epsilon}$.
    - The Blended Advantage is then computed by linearly combining $\hat{A}_{\text{SAN}}(\tau_i)$ and the previously calculated $\hat{A}_{\text{GN}}(\tau_i)$ using the blending ratio $\alpha$: $\hat{A}(\tau_i) = \alpha\,\hat{A}_{\text{SAN}}(\tau_i) + (1 - \alpha)\,\hat{A}_{\text{GN}}(\tau_i)$.
- Return Gradient Estimate:
  - After computing Blended Advantages for all trajectories in the batch, the algorithm returns the final policy gradient estimate $\hat{g} = \frac{1}{n}\sum_{i=1}^{n}\hat{A}(\tau_i)\,\nabla_\theta \log \pi_\theta(\tau_i)$, which is used to update the policy parameters $\theta$. The term $\nabla_\theta \log \pi_\theta(\tau_i)$ is the score function for trajectory $\tau_i$, indicating how changing the policy parameters would affect the probability of generating that trajectory. The advantage term weights this score function, ensuring that trajectories with higher advantages (better than expected for their stratum) contribute more to increasing their probability.
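The advantage computation at the heart of Algorithm 1 is straightforward to implement. Below is a minimal sketch (function and variable names are ours, not from the paper or the Verl framework); it stratifies rollouts by (prompt, search count) and blends SAN with GN:

```python
import numpy as np

def blended_advantages(rewards, prompt_ids, search_counts, alpha=0.8, eps=1e-6):
    """Core of Algorithm 1: GN advantages, per-prompt/per-stratum SAN advantages,
    and their linear blend. Strata are defined by the number of search calls."""
    rewards = np.asarray(rewards, dtype=float)

    # Global normalization (GN) over the whole batch.
    gn = (rewards - rewards.mean()) / (rewards.std() + eps)

    # Stratified advantage normalization (SAN): group by (prompt, search count).
    san = np.empty_like(rewards)
    keys = list(zip(prompt_ids, search_counts))
    for key in set(keys):
        idx = np.array([i for i, k in enumerate(keys) if k == key])
        r = rewards[idx]
        san[idx] = (r - r.mean()) / (r.std() + eps)

    # Blended advantage (Definition 2).
    return alpha * san + (1.0 - alpha) * gn

# Example: 8 rollouts for one prompt, stratified by search count (values invented).
rewards = [0.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
prompt_ids = ["q1"] * 8
search_counts = [0, 0, 0, 1, 1, 2, 2, 2]
print(blended_advantages(rewards, prompt_ids, search_counts, alpha=0.8))
```

The resulting advantages would then weight the per-trajectory score functions exactly as in the final gradient estimate above.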
5. Experimental Setup
5.1. Datasets
The experiments evaluate Stratified GRPO on seven diverse question-answering (QA) benchmarks, covering both single-hop and multi-hop reasoning tasks.
- Training Set:
  - The training set is constructed by merging the training splits of two well-known QA datasets:
    - Natural Questions (NQ) (Kwiatkowski et al., 2019): A single-hop QA dataset where questions can typically be answered by finding a short passage in Wikipedia. It is known for its factual, Google-search-like questions.
    - HotpotQA (Yang et al., 2018): A multi-hop QA dataset that requires reasoning over multiple pieces of evidence (documents) to answer a question, necessitating more complex information retrieval and synthesis.
  - Using both NQ and HotpotQA for training exposes the LLM agent to a broad range of QA complexities, from direct factual retrieval to more involved multi-step reasoning.
- Evaluation Benchmarks:
  - Single-Hop QA Benchmarks:
    - Natural Questions (NQ) (Kwiatkowski et al., 2019): As described above, focuses on factual questions answerable from a single passage.
    - TriviaQA (Joshi et al., 2017): A large-scale reading comprehension dataset, often requiring retrieval of specific entities or facts.
    - PopQA (Mallen et al., 2023): A dataset designed to evaluate LLMs' ability to recall factual knowledge, often highlighting the trade-off between parametric and non-parametric memories.
  - Multi-Hop QA Benchmarks:
    - HotpotQA (Yang et al., 2018): As described above, requires reasoning over multiple documents.
    - 2WikiMultiHopQA (Ho et al., 2020): A dataset explicitly designed to evaluate multi-hop reasoning by constructing questions that span information across two Wikipedia documents.
    - MuSiQue (Trivedi et al., 2022): Multi-hop Questions via Single-hop Question Composition. Focuses on multi-hop questions that can be decomposed into a series of simpler single-hop questions, challenging models to perform sequential information retrieval and synthesis.
    - Bamboogle (Press et al., 2023): A dataset used to measure and narrow the compositionality gap in language models, requiring multi-step reasoning and tool use.

These datasets were chosen because they represent a diverse set of question-answering challenges, ranging from simple information recall to complex multi-step reasoning and information synthesis. They are well-established benchmarks for evaluating LLM agents and their ability to leverage external knowledge.
5.2. Evaluation Metrics
For all evaluation benchmarks, the Exact Match (EM) metric is used.
Exact Match (EM)
- Conceptual Definition: Exact Match is a strict evaluation metric commonly used in question-answering and machine comprehension tasks. It measures whether the model's generated answer matches one of the ground-truth answers exactly. It focuses on the precision and correctness of the final output, giving no partial credit for partially correct answers or semantic similarity. Its design goal is to assess whether the model can produce the precise correct answer.
- Mathematical Formula: The Exact Match score for a single question-answer pair is typically defined as:

  $$\mathrm{EM}_i = \begin{cases} 1, & \text{if the normalized prediction } \hat{a}_i \text{ exactly matches any reference answer in } \mathcal{A}_i \\ 0, & \text{otherwise} \end{cases}$$

  For a dataset of $N$ questions, the overall EM score is the average of the per-question scores:

  $$\mathrm{EM} = \frac{1}{N}\sum_{i=1}^{N} \mathrm{EM}_i$$

- Symbol Explanation:
  - $\hat{a}_i$: The string generated by the LLM agent as its answer to question $i$.
  - $\mathcal{A}_i$: A set of one or more reference answers provided for question $i$. Answers in this set are usually normalized (e.g., punctuation removed, case folded, articles removed) to account for minor stylistic variations.
  - $\mathrm{EM}_i$: The Exact Match score for the $i$-th question.
  - $N$: The total number of questions in the dataset.
  - $\mathrm{EM}$: The average Exact Match score across the entire dataset, usually expressed as a percentage.
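For reference, a common (SQuAD-style) implementation of EM looks like the following; the paper does not spell out its exact normalization rules, so treat this as an illustrative sketch:

```python
import re
import string

def normalize(text: str) -> str:
    """SQuAD-style normalization: lowercase, drop punctuation and articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, references: list[str]) -> int:
    """1 if the normalized prediction matches any normalized reference, else 0."""
    return int(any(normalize(prediction) == normalize(ref) for ref in references))

def em_score(predictions: list[str], references: list[list[str]]) -> float:
    """Average EM over a dataset, expressed as a percentage."""
    scores = [exact_match(p, refs) for p, refs in zip(predictions, references)]
    return 100.0 * sum(scores) / len(scores)

print(em_score(["The Eiffel Tower", "1969"], [["Eiffel Tower"], ["1968"]]))  # 50.0
```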
5.3. Baselines
Stratified GRPO is compared against a comprehensive set of non-RL and RL methods, categorized as follows:
- Non-RL Baselines: These methods do not involve Reinforcement Learning for policy optimization.
  - Direct Generation: The LLM directly generates an answer without using any external tools or retrieval. This serves as a lower bound on performance, showing the LLM's inherent knowledge.
  - Supervised Fine-Tuning (SFT): The LLM is fine-tuned on a dataset of question-answer pairs. While SFT can improve performance, it does not involve dynamic tool use or RL.
  - RAG (Retrieval-Augmented Generation) (Lewis et al., 2020): A common technique where the LLM retrieves relevant documents before generating an answer. It is a static, usually one-shot, form of retrieval.
  - Search-o1 (Li et al., 2025): An agentic search-enhanced large reasoning model. This method involves some form of agentic behavior with search but is not based on RL for policy optimization.
  - IRCoT (Interleaving Retrieval with Chain-of-Thought) (Trivedi et al., 2023): Uses prompts to instruct LLMs to interleave reasoning steps with retrieval calls, enabling multi-step information gathering.
- RL Methods: These methods use Reinforcement Learning to train the LLM agent.
  - Search-R1 (Jin et al., 2025): An RL-based method specifically designed for training LLMs to reason and leverage search engines. It is a PPO-based approach.
  - R1 (DeepSeek-AI et al., 2025): Refers to RL training without search. This baseline highlights the performance gain specifically attributable to search capabilities. It may represent a REINFORCE-style or PPO-based method focused solely on generation given an initial context.
  - ReSearch (Chen et al., 2025): Another RL-based approach focused on learning to reason with search for LLMs.
  - GRPO (Group Relative Policy Optimization) (Shao et al., 2024): The direct baseline for Stratified GRPO. GRPO is an RLVR method that uses group-based baselines (e.g., the mean reward of a group of sampled responses) to compute advantages, avoiding the need for a learned value function as in PPO. It is robust but, according to this paper, suffers from cross-stratum bias.

These baselines were chosen to provide a comprehensive comparison, spanning models with no external tools, static retrieval, prompted/heuristic search, and various RL approaches, allowing the paper to demonstrate the specific advantages of Stratified GRPO in handling structural heterogeneity.
5.4. Models and Training
- LLM Models: The experiments use two variants of the Qwen-2.5-3B model (Yang et al., 2024):
  - Qwen-2.5-3B Base: A foundational LLM.
  - Qwen-2.5-3B Instruct: A version of the LLM already fine-tuned to follow instructions, which often exhibits different RL training dynamics.
- Retrieval Setup:
  - Knowledge Source: The 2018 Wikipedia dump (Karpukhin et al., 2020) is used as the external knowledge source.
  - Retriever: E5 (Wang et al., 2022) is used as the retriever. It fetches the top-3 passages per search query, providing relevant context to the LLM.
- Training Details:
  - Hardware: Training is conducted on 8 GPUs.
  - Batch Sizes: Global batch size of 256, mini-batch size of 256.
  - Sequence Lengths: Maximum sequence length is 4096 tokens; maximum response length and retrieved content length are 500 tokens per interaction turn.
  - Rollout Sampling: Temperature 1.0 (encourages more diverse generation); top-p 1.0 (no nucleus filtering, i.e., the full token distribution is kept).
  - Optimization: Learning rate (value missing in the source text); warm-up ratio 0.1 (the learning rate is gradually increased at the start of training); 200 training steps.
- RL-Specific Parameters (for GRPO and Stratified GRPO):
  - KL divergence coefficient β: 0.001. This hyperparameter, common in PPO-like algorithms, penalizes divergence between the current policy and a reference policy, helping to stabilize training.
  - Clipping ratio ε: 0.2. Also common in PPO and GRPO, this clips the policy probability ratio to prevent overly aggressive policy updates.
  - Responses per Prompt: 8 responses are sampled per prompt, forming the group of trajectories used for advantage calculation.
- Stratified GRPO Specifics:
  - Blending parameter α: 0.8 for Qwen 2.5 3B Instruct and 0.6 for Qwen 2.5 3B Base, indicating that different base LLMs favor different blending ratios.
- Search Agent Interaction:
  - Maximum Interaction Turns: 4 (the agent can perform up to 4 sequential steps of generation or search).
  - Passages per Search Call: The top 3 passages are retrieved for each search query.
- Implementation Framework: The implementation is based on the Verl framework (Sheng et al., 2025).

All these settings aim to provide a fair and consistent comparison with previous work, particularly Search-R1 (Jin et al., 2025), which the authors cite as having a consistent experimental setup.
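For convenience, the reported settings can be collected into a single configuration sketch. The key names below are illustrative (they are not actual Verl configuration fields), and the learning rate is omitted because its value is not given above:

```python
# Illustrative summary of the reported training configuration; key names are ours.
training_config = {
    "models": ["Qwen2.5-3B-Base", "Qwen2.5-3B-Instruct"],
    "retriever": {"name": "E5", "corpus": "wikipedia-2018", "top_k_passages": 3},
    "global_batch_size": 256,
    "mini_batch_size": 256,
    "max_sequence_length": 4096,
    "max_response_length_per_turn": 500,
    "rollout": {"temperature": 1.0, "top_p": 1.0, "responses_per_prompt": 8},
    "optimization": {"warmup_ratio": 0.1, "training_steps": 200},
    "rl": {"kl_coefficient_beta": 0.001, "clip_ratio_epsilon": 0.2},
    "stratified_grpo": {
        "stratify_by": "search_count",
        "alpha_blend": {"Qwen2.5-3B-Instruct": 0.8, "Qwen2.5-3B-Base": 0.6},
    },
    "agent": {"max_interaction_turns": 4},
}
```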
6. Results & Analysis
6.1. Core Results Analysis
The experimental results consistently demonstrate the clear superiority of Stratified GRPO over all baselines, particularly on multi-hop question-answering tasks.
The following are the results from Table 1 of the original paper:
| Methods | NQ | TriviaQA | PopQA | HotpotQA | 2Wiki | Musique | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Non-RL Baselines | ||||||||
| Direct Generation | 10.6 | 28.8 | 10.8 | 14.9 | 24.4 | 2.0 | 2.4 | 13.4 |
| SFT | 24.9 | 29.2 | 10.4 | 18.6 | 24.8 | 4.4 | 11.2 | 17.6 |
| RAG | 34.8 | 54.4 | 38.7 | 25.5 | 22.6 | 4.7 | 8.0 | 27.0 |
| Search-o1 | 23.8 | 47.2 | 26.2 | 22.1 | 21.8 | 5.4 | 32.0 | 25.5 |
| IRCoT | 11.1 | 31.2 | 20.0 | 16.4 | 17.1 | 6.7 | 24.0 | 18.1 |
| Qwen2.5-3B-Base | ||||||||
| Search-R1 | 40.6 | 58.7 | 43.5 | 28.4 | 27.3 | 4.9 | 8.8 | 30.3 |
| R1 | 22.6 | 45.5 | 17.3 | 20.1 | 26.8 | 5.5 | 22.4 | 22.9 |
| ReSearch | 42.7 | 59.7 | 43.0 | 30.5 | 27.2 | 7.4 | 12.8 | 31.9 |
| GRPO | 45.2 | 61.2 | 43.8 | 32.6 | 29.7 | 7.8 | 12.9 | 33.3 |
| Stratified GRPO | 45.9 | 61.4 | 43.0 | 40.8 | 39.9 | 17.7 | 42.7 | 41.6 |
| Qwen2.5-3B-Instruct | ||||||||
| Search-R1 | 34.1 | 54.5 | 37.8 | 32.4 | 31.9 | 10.3 | 26.4 | 32.5 |
| R1 | 21.0 | 44.9 | 17.1 | 20.8 | 27.5 | 6.0 | 19.2 | 22.4 |
| ReSearch | 36.5 | 57.1 | 39.5 | 35.1 | 27.2 | 9.5 | 26.6 | 33.1 |
| GRPO | 33.4 | 52.9 | 36.7 | 26.5 | 27.4 | 6.4 | 21.0 | 29.2 |
| Stratified GRPO | 44.5 | 60.9 | 44.3 | 41.0 | 37.3 | 16.9 | 38.7 | 40.5 |
Key Observations and Analysis:
- Overall Superiority: Stratified GRPO achieves the best performance on almost all seven QA benchmarks for both the Qwen2.5-3B-Base and Qwen2.5-3B-Instruct models.
  - For Qwen2.5-3B-Base, Stratified GRPO achieves an average of 41.6 EM, outperforming GRPO (33.3 EM) by 8.3 points and the next best baseline (ReSearch at 31.9 EM) by 9.7 points.
  - For Qwen2.5-3B-Instruct, Stratified GRPO achieves an average of 40.5 EM, outperforming GRPO (29.2 EM) by 11.3 points; this is the "up to 11.3 points" improvement claimed in the abstract.
- Pronounced Advantage on Multi-Hop Tasks: The performance gains are particularly significant on the multi-hop QA benchmarks (HotpotQA, 2Wiki, Musique, Bamboogle).
  - For Qwen2.5-3B-Base, on Musique, Stratified GRPO scores 17.7 EM, more than double GRPO's 7.8 EM. On Bamboogle, it jumps from 12.9 EM (GRPO) to 42.7 EM, a substantial relative gain.
  - Similar patterns hold for Qwen2.5-3B-Instruct. For example, on HotpotQA, Stratified GRPO achieves 41.0 EM versus GRPO's 26.5 EM.
  - This strong performance on multi-hop tasks suggests that Stratified GRPO's handling of structural heterogeneity is crucial for learning complex, sequential information retrieval and reasoning strategies. These tasks inherently require varied search strategies, making them highly susceptible to cross-stratum bias.
- Comparison with Non-RL Baselines: All RL-based methods (Search-R1, ReSearch, GRPO, Stratified GRPO) generally outperform the non-RL methods that rely on static RAG or prompting (Direct Generation, SFT, RAG, Search-o1, IRCoT), with Stratified GRPO showing the largest margin. This reinforces the value of RL for training LLM agents to use tools effectively.
- Implications for PPO-based Methods: The consistent outperformance of Stratified GRPO over the PPO-based Search-R1 suggests that cross-stratum bias is not unique to GRPO (a policy gradient method without a learned value function). The paper hypothesizes that this bias likely manifests in PPO-like algorithms through the difficulty of training an accurate value function (critic) on structurally diverse trajectories. This implies that principled handling of trajectory structure is a key factor for robust RL training across various algorithms for LLM search agents.
6.2. Ablation Studies / Parameter Analysis
The paper conducts an ablation study to evaluate the contribution of each component of Stratified GRPO: Stratified Advantage Normalization (SAN) alone and the full Stratified GRPO (which includes SAN and Blended Advantage).
The following are the results from Table 2 of the original paper:
| Model Variants | NQ | TriviaQA | PopQA | HotpotQA | 2wiki | Musique | Bamboogle | Avg. |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-3B-Base | ||||||||
| GRPO | 45.2 | 61.2 | 43.8 | 32.6 | 29.7 | 7.8 | 12.9 | 33.3 |
| w/ SAN | 43.7 | 59.3 | 41.1 | 36.6 | 38.4 | 12.6 | 25.0 | 36.7 |
| Stratified GRPO | 45.9 | 61.4 | 43.0 | 40.8 | 39.9 | 17.7 | 42.7 | 41.6 |
| Qwen2.5-3B-Instruct | ||||||||
| GRPO | 33.4 | 52.9 | 36.7 | 26.5 | 27.4 | 6.4 | 21.0 | 29.2 |
| w/ SAN | 42.5 | 60.1 | 44.2 | 39.4 | 41.0 | 16.0 | 36.3 | 39.9 |
| Stratified GRPO | 44.5 | 60.9 | 44.3 | 41.0 | 37.3 | 16.9 | 38.7 | 40.5 |
Analysis of Ablation Study:
- Contribution of SAN: For both Qwen2.5-3B-Base and Qwen2.5-3B-Instruct, GRPO w/ SAN significantly outperforms the baseline GRPO.
  - For Qwen2.5-3B-Base, GRPO w/ SAN improves the average EM from 33.3 to 36.7 (a 3.4-point gain).
  - For Qwen2.5-3B-Instruct, the gain is even more substantial, from 29.2 to 39.9 (a 10.7-point gain), suggesting SAN is particularly effective when the baseline struggles (GRPO on the Instruct model exhibited training collapse).
  - This confirms that Stratified Advantage Normalization itself provides substantial benefits by eliminating cross-stratum bias and stabilizing the learning signal.
- Contribution of Blending (Full Stratified GRPO): The addition of advantage blending (i.e., the full Stratified GRPO algorithm) further improves performance over GRPO w/ SAN.
  - For Qwen2.5-3B-Base, the average EM increases from 36.7 (w/ SAN) to 41.6 (full Stratified GRPO), an additional 4.9-point gain.
  - For Qwen2.5-3B-Instruct, the average EM increases from 39.9 (w/ SAN) to 40.5 (full Stratified GRPO), an additional 0.6-point gain.
  - This demonstrates that the blending mechanism is not just a theoretical add-on but provides practical stability and performance improvements, especially in finite-sample regimes where stratum-specific statistics may be noisy.
- Synergistic Effect: The results show a synergistic effect in which both SAN and the blended advantage contribute to the overall superior performance of Stratified GRPO: SAN addresses the fundamental bias, and blending ensures practical robustness. The effect is particularly evident on complex multi-hop QA tasks, where the full Stratified GRPO shows the largest improvements over GRPO w/ SAN.
6.3. Training Dynamics
The paper also provides an analysis of the training dynamics, including training rewards and the number of search calls, comparing Stratified GRPO with GRPO.
The following figure (Figure 1 from the original paper) illustrates the training dynamics:
Figure 1 (from the original paper): A multi-subplot line chart comparing Stratified GRPO and GRPO on the Qwen 2.5 3B Base and Instruct models. The left panels plot training reward against training steps; the right panels plot the average number of search calls per question against training steps. Overall, Stratified GRPO attains higher training rewards and learns a more active search policy.
Analysis of Training Dynamics (Figure 1):

- Improved Reward and Training Stability (Left Plots):
  - Qwen2.5-3B-Base: Stratified GRPO (blue line) consistently achieves higher training rewards than GRPO (orange line) throughout training. Both methods trend upward, but Stratified GRPO reaches a higher plateau.
  - Qwen2.5-3B-Instruct: This is where the difference is starkest. Standard GRPO (orange line) exhibits a training collapse: its reward drops drastically and fails to recover, a known instability issue in RL training. In contrast, Stratified GRPO (blue line) maintains a stable, monotonically increasing reward signal, demonstrating superior stability and learning efficiency and preventing the LLM from learning a degraded policy.
- Learning an Effective Search Policy (Right Plots):
  - The average number of search calls per question during training provides insight into the search policy learned by the agent.
  - Qwen2.5-3B-Base: Stratified GRPO (blue line) learns a policy that converges to approximately 2.5 search calls per question, indicating that it has learned to perform iterative searches and leverage the search tool effectively, which suggests a multi-step reasoning capability. The baseline GRPO (orange line), by contrast, stagnates at around one search call and fails to explore more complex multi-step search strategies.
  - Qwen2.5-3B-Instruct: GRPO's training collapse (left plot) is mirrored in its search behavior; it fails to learn any meaningful search policy. Stratified GRPO, however, learns a stable search policy that uses approximately 2.5 search calls, similar to the base model.
  - This finding supports the paper's argument that GRPO's cross-stratum bias prevents it from effectively exploring potentially better search policies that involve more tool use. By removing this bias, Stratified GRPO enables the agent to discover and exploit complex multi-step search strategies, which translates directly into its superior performance on multi-hop benchmarks that require sequential information retrieval. (A small numeric illustration of this effect is given below.)

In essence, the empirical results confirm the theoretical claims: Stratified GRPO provides a purer learning signal, leading to more stable RL training, higher rewards, and the ability to learn more sophisticated tool-use policies for LLM search agents.
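To see why a single global baseline can suppress exploratory tool use, consider a deliberately simplified, purely hypothetical example; the numbers are illustrative only and are not taken from the paper, and plain baseline subtraction is used in place of full normalization.

$$
R_{\text{0 searches}} = \{0.6,\ 0.5\}, \qquad R_{\text{2 searches}} = \{0.4,\ 0.1\}
$$

$$
b_{\text{global}} = \frac{0.6 + 0.5 + 0.4 + 0.1}{4} = 0.4
\;\Rightarrow\;
A(\text{best 2-search rollout}) = 0.4 - 0.4 = 0
$$

$$
b_{\text{2 searches}} = \frac{0.4 + 0.1}{2} = 0.25
\;\Rightarrow\;
A_{\text{stratified}}(\text{best 2-search rollout}) = 0.4 - 0.25 = 0.15 > 0
$$

Under the global baseline, the best exploratory rollout receives no positive credit, whereas the stratified baseline still rewards it relative to its true peers, which is consistent with the search-call trends observed in Figure 1.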
7. Conclusion & Reflections
7.1. Conclusion Summary
This work addresses a critical, yet often overlooked, challenge in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents that use search tools: the structural heterogeneity of their trajectories. The paper rigorously identifies and formalizes cross-stratum bias, demonstrating how standard policy gradient methods using a global baseline lead to distorted credit assignment and hinder the exploration of complex multi-step search strategies.
To overcome this, the authors propose Stratified GRPO, featuring Stratified Advantage Normalization (SAN). SAN partitions trajectories into homogeneous strata based on structural properties (like search count) and computes advantages locally within these groups. The theoretical analysis proves that SAN effectively eliminates cross-stratum bias, ensures conditional unbiasedness, and achieves unit variance within each stratum, while maintaining global unbiasedness and unit variance. For practical robustness in finite-sample regimes, SAN is linearly blended with a global estimator.
Extensive experiments on diverse single-hop and multi-hop question-answering benchmarks empirically validate the approach. Stratified GRPO consistently and substantially outperforms the GRPO baseline by up to 11.3 points, achieving higher training rewards, greater training stability (avoiding the training collapse seen with GRPO), and more effective search policies that make genuine use of multi-step search. These results firmly establish stratification as a principled and effective remedy for structural heterogeneity in RL for LLM search agents.
7.2. Limitations & Future Work
The paper does not explicitly dedicate a section to "Limitations" or "Future Work." However, based on the problem formulation and proposed solution, some potential limitations and avenues for future research can be inferred:
- Definition of Strata: The current work primarily uses search count as the basis for stratification. While intuitive for search agents, defining homogeneous strata may be more complex for other types of LLM agents that use different tools or exhibit other forms of structural heterogeneity (e.g., varying tool types, depth of reasoning chains, success/failure of API calls). Future work could explore more dynamic or learned stratification criteria (see the hypothetical sketch after this list).
- Computational Overhead: Calculating stratum-specific statistics (mean and standard deviation) for every stratum in every training step may introduce some computational overhead, especially when the number of strata or prompts is very large. Optimizing the efficiency of stratum computation could be a direction.
- Blending Hyperparameter: The blending hyperparameter needs to be tuned (e.g., 0.8 for the Instruct model, 0.6 for the Base model). This introduces an additional hyperparameter whose optimal value may vary across tasks and models. Research into adaptive or learned blending mechanisms could be beneficial.
- Small Strata Robustness: While blending addresses finite-sample instability for small strata, extremely sparse strata (e.g., a stratum with only one trajectory) might still pose challenges for robust statistic estimation. Further investigation into more sophisticated handling of very small strata could be valuable.
- Generalization to Other RL Algorithms: While the paper hypothesizes that cross-stratum bias also affects PPO-based methods, Stratified GRPO is specifically designed for GRPO. Adapting Stratified Advantage Normalization to PPO (e.g., by incorporating it into the value-function estimation or advantage calculation of actor-critic methods) would be a natural next step for broader applicability.
- Beyond QA Tasks: The experiments focus on question answering. Exploring the effectiveness of Stratified GRPO on other LLM agent tasks that involve multi-step planning, code generation, or complex reasoning with various tools could demonstrate its wider applicability.
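As a purely hypothetical illustration of the first point above, a stratification key for a more general tool-using agent might combine several structural features rather than the search count alone. The trajectory schema and the chosen features below are assumptions made for this sketch, not anything specified in the paper.

```python
from collections import Counter

def stratum_key(trajectory):
    """Hypothetical stratification key for a general tool-using agent.

    `trajectory` is assumed to be a list of step dicts with fields
    "tool" (str or None) and "ok" (bool); neither this schema nor the
    chosen features come from the paper.
    """
    calls = [step for step in trajectory if step.get("tool") is not None]
    # Which tools were used and how often, in a canonical order.
    tool_counts = tuple(sorted(Counter(step["tool"] for step in calls).items()))
    # How many tool calls failed.
    n_failures = sum(1 for step in calls if not step.get("ok", True))
    return (len(calls), tool_counts, n_failures)
```

Rollouts sharing the same key could then be normalized together in the spirit of SAN, while rare keys could lean more heavily on the blended global estimate.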
7.3. Personal Insights & Critique
This paper offers several valuable insights and provokes critical reflection on RL applications for LLMs.
- Insight 1: The Criticality of Heterogeneity: The paper powerfully highlights that not all trajectories are created equal, especially when LLM agents use tools. The formalization of cross-stratum bias as an "apples-to-oranges" problem is an elegant and intuitive explanation for the RL instability observed in complex tool-use settings. This notion of structural heterogeneity is likely pervasive in many LLM agent scenarios beyond search count and could apply to agents interacting with diverse APIs, planning tasks, or even complex conversational flows. It forces a re-evaluation of standard RL baselines and normalization techniques in such contexts.
- Insight 2: Conditional Purity vs. Global Equivalence: The distinction drawn between conditional (within-stratum) and global (marginal) moments of advantage estimators (Theorems 4 and 5) is a profound theoretical contribution. It shows that while two estimators may appear equally good globally, their behavior at a finer-grained, conditional level can drastically affect learning dynamics. SAN's ability to provide a zero-mean, unit-variance signal within each stratum ensures pure and consistent credit assignment, which is essential for discovering nuanced multi-step strategies. This principle could inspire similar conditional analyses in other RL domains where diverse behavior modes exist.
- Insight 3: The Power of Simplification with Rigor: The GRPO framework itself, by eschewing a learned value function, offers a simpler yet robust RL approach. Stratified GRPO builds on this by introducing stratification as a relatively simple, yet theoretically rigorous, modification. The resulting performance gains and stability improvements are significant, demonstrating that principled statistical adjustments can yield substantial practical benefits without overly complex architectural changes.
Critique:
- Strata Definition and Universality: While search count is a clear and effective stratification variable for the QA search-agent task, generalizing the concept to other LLM agent tasks (e.g., coding assistants, scientific-discovery agents) may be challenging. What constitutes a "homogeneous stratum" could be ambiguous or require domain-specific knowledge. A more adaptive or learned stratification method might be necessary for broader applicability, which could increase complexity.
- Interpretability of the Blending Parameter: The blending parameter is tuned per model. While practical, this suggests that the optimal balance between local purity and global stability is not universal. Further work could explore how to adjust it dynamically during training, or even learn it, potentially making the method more robust across diverse settings without manual tuning.
- Edge Cases with Very Small Strata: Although blending helps, the paper acknowledges the potential for noisy advantage estimates when strata contain very few trajectories. This may still be an issue in highly sparse or exploratory environments where certain tool-use patterns are rare but crucial, leading to a chicken-and-egg problem in which rare but potentially high-reward strata struggle to obtain accurate advantage estimates and thus are not explored sufficiently.

Overall, Stratified GRPO presents a significant step forward in making Reinforcement Learning more effective and stable for complex LLM agents that engage in tool use. Its core idea of addressing structural heterogeneity with stratification is principled and applicable beyond the specific context of search agents, offering a valuable new perspective for RL research in the era of LLMs.