Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks
TL;DR Summary
This paper proposes HGPO to address context inconsistency in long-horizon tasks by hierarchical grouping and adaptive advantage aggregation, improving bias-variance trade-offs and outperforming existing RL methods without extra models.
Abstract
Under review as a conference paper at ICLR 2026. Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks. Anonymous authors, paper under double-blind review. Abstract: Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO).
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is "Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks". It focuses on developing a novel reinforcement learning algorithm to improve the performance of Large Language Models (LLMs) in complex, multi-step environments.
1.2. Authors
The authors are listed as "Anonymous authors", indicating that the paper is currently under double-blind review. This practice is common in academic conferences to ensure impartial evaluation.
1.3. Journal/Conference
The paper is posted on OpenReview, with a listed date of 2025-10-08. This platform is widely used for submissions to top-tier machine learning conferences such as ICLR, NeurIPS, and ICML, where papers undergo rigorous peer review. The double-blind review status further confirms its submission to such a venue.
1.4. Publication Year
The publication year is 2025.
1.5. Abstract
The paper introduces Hierarchy-of-Groups Policy Optimization (HGPO), a new reinforcement learning (RL) algorithm designed to enhance the capabilities of Large Language Models (LLMs) in long-horizon agentic tasks. It identifies a key problem in existing stepwise group-based policy optimization methods, termed context inconsistency, where steps within the same group may have different historical contexts, leading to biased advantage estimation and degraded policy optimization. HGPO addresses this by assigning each step to multiple hierarchical groups based on the consistency of historical contexts. It then computes distinct advantages within each group and combines them using an adaptive weighting scheme. This approach aims to achieve a favorable bias-variance trade-off in stepwise advantage estimation without requiring additional models or rollouts. Empirical evaluations on challenging agentic tasks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, demonstrate that HGPO significantly outperforms current agentic RL methods under the same computational constraints.
1.6. Original Source Link
The original source link is: https://openreview.net/forum?id=T8Dev99qnz The PDF link is: https://openreview.net/pdf?id=T8Dev99qnz The paper is currently under double-blind review, meaning it is a submission to a conference or journal and has not yet been formally accepted or published.
2. Executive Summary
2.1. Background & Motivation
The proliferation of Large Language Models (LLMs) has opened new avenues for developing versatile agentic systems capable of perceiving, reasoning, and acting in complex environments. These LLM agents are increasingly applied to long-horizon tasks, such as embodied navigation, web browsing, and interactive gaming, which demand robust planning and decision-making over extended sequences of interactions.
The core problem this paper aims to solve lies within the post-training stage of enhancing these LLM agents using reinforcement learning (RL). While group-based RL methods (e.g., GRPO) have shown promise in single-turn tasks due to their computational efficiency, extending them to multi-turn long-horizon tasks presents significant challenges. Traditional trajectory-wise policy optimization (where the entire interaction history is concatenated) suffers from context explosion, as the input context length grows rapidly, limiting scalability.
To mitigate this, recent research has shifted towards stepwise policy optimization, which treats each step independently while using a memory module to retain historical context. However, this paper identifies a critical issue in stepwise group-based RL: context inconsistency. This occurs when steps grouped together for advantage estimation (because they share the same current state) originate from trajectories with different historical contexts. This inconsistency leads to severely biased advantage estimation, which in turn degrades the effectiveness of policy optimization. Existing solutions, such as using only "Oracle" steps (steps with identical current state and historical context), are highly inefficient due to their scarcity and introduce high variance due to small group sizes.
The paper's entry point is to directly tackle this context inconsistency problem within stepwise group-based RL, aiming to develop a more fine-grained and reliable advantage estimator that can balance bias and variance efficiently.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Revealing Context Inconsistency: The paper identifies and empirically demonstrates the issue of context inconsistency in stepwise group-based RL. It shows that this inconsistency leads to significant bias in advantage estimation, directly hindering policy optimization. Through pilot studies (Figure 2), it quantifies the substantial estimation bias in both trajectory-level and step-level advantages compared to Oracle advantages, emphasizing the detrimental impact of differing historical contexts.
- Proposing Hierarchy-of-Groups Policy Optimization (HGPO): The paper introduces a novel RL algorithm, HGPO, specifically designed for long-horizon agentic tasks. HGPO employs two key components:
  - Context-aware Hierarchical Grouping: It organizes steps into multiple hierarchical groups based on the consistency of their historical contexts. This allows for a more nuanced comparison of steps, improving data utilization and reducing variance, particularly by allowing steps with varying degrees of context consistency to contribute to estimation.
  - Adaptive Weighting Advantage Estimation: It computes distinct advantages for each hierarchical group and then aggregates them using an adaptive weighting scheme. This scheme prioritizes groups with more consistent historical contexts by assigning them larger weights, thereby effectively reducing estimation bias while maintaining a balanced variance.
- Achieving Strong Empirical Performance: HGPO achieves state-of-the-art results on two challenging agentic benchmarks, ALFWorld and WebShop. Using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct as base models, HGPO consistently and significantly outperforms existing agentic RL methods (including PPO, RLOO, GRPO, and GiGPO) under identical computational constraints (GPU memory, LLM rollouts, and minimal additional time cost). The results also highlight HGPO's better generalization on out-of-distribution tasks.
3. Prerequisite Knowledge & Related Work
This section provides foundational knowledge and contextualizes the paper within the broader research landscape, explaining key concepts and prior works necessary for a comprehensive understanding.
3.1. Foundational Concepts
- Large Language Models (LLMs): LLMs are advanced artificial intelligence models, typically based on the transformer architecture and trained on vast amounts of text data. They can understand, generate, and process human language, making them capable of a wide range of tasks including conversation, summarization, translation, and code generation. In agentic tasks, LLMs serve as the "brain" of an agent, interpreting observations, reasoning about situations, and generating actions. Examples include GPT-4o, Gemini-2.5-Pro, and Qwen2.5-Instruct.
- Reinforcement Learning (RL): Reinforcement learning is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions in different states of the environment and receives rewards or penalties, with the goal of learning a policy (a mapping from states to actions) that maximizes cumulative reward over time. Key components include:
  - Agent: The learner and decision-maker. In this paper, it is an LLM.
  - Environment: The world the agent interacts with. Examples are ALFWorld and WebShop.
  - State ($s_t$): A description of the current situation in the environment at time $t$.
  - Action ($a_t$): A decision made by the agent at time $t$ to interact with the environment.
  - Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of the agent's action. In long-horizon tasks, rewards are often sparse (given only at the end) and delayed.
  - Policy ($\pi_\theta$): The agent's strategy, parameterized by $\theta$, that dictates how it chooses actions given states.
  - Trajectory ($\tau$): A sequence of states, actions, and rewards from an interaction episode, e.g., $\tau = (s_1, a_1, r_1, \dots, s_T, a_T, r_T)$.
- Long-horizon Agentic Tasks: Complex tasks that require an LLM agent to perform multiple sequential steps or turns to achieve a goal. Unlike single-turn tasks (e.g., generating a single response), long-horizon tasks involve sustained interaction, planning, reasoning, and memory to navigate dynamic environments. Examples include controlling a robot, browsing the web, or solving multi-step puzzles.
- Policy Optimization: The process in RL of iteratively updating the agent's policy ($\pi_\theta$) to improve its performance (i.e., maximize expected cumulative reward). This typically involves collecting trajectories with the current policy and using these experiences to compute gradients that guide the update of the policy parameters.
- Advantage Estimation: In policy gradient methods, advantage functions are used to reduce the variance of gradient estimates, making RL training more stable and efficient. The advantage of a state-action pair $(s, a)$ tells us how much better or worse taking action $a$ in state $s$ is compared to the average action taken from $s$. Formally, it is often defined as $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the action-value function (expected cumulative reward when starting from state $s$, taking action $a$, and then following the policy) and $V(s)$ is the state-value function (expected cumulative reward when starting from state $s$ and following the policy). In group-based RL, advantages are estimated directly from sampled rewards within groups, often without explicitly learning value functions.
- Group-based Reinforcement Learning: A class of RL algorithms that estimate advantages (or relative returns) by comparing the rewards of samples within a group of collected trajectories. Instead of using a separate value network (as in PPO), these methods leverage statistical comparisons across group members to determine which actions were relatively better or worse, which often brings computational efficiency and robustness. A minimal code sketch of this idea follows this list.
- Context: In sequential decision-making, context refers to the historical information (past observations and actions) that shapes the agent's understanding of the current state and its future decisions. Maintaining consistent and relevant context is critical for effective planning and reasoning in long-horizon tasks.
- Bias-Variance Trade-off: A fundamental concept in machine learning and statistics. Bias is the error introduced by approximating a real-world problem with a simplified model; high bias causes the model to miss relevant relations between inputs and targets (underfitting). Variance is the model's sensitivity to small fluctuations in the training data; high variance causes the model to fit noise rather than the intended signal (overfitting). In advantage estimation, methods aim to reduce bias (accuracy of the estimate) while keeping variance (stability of the estimate) low.
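To make the group-based idea concrete, here is a minimal Python sketch (not from the paper) of group-relative advantage estimation: each sample's reward is standardized against the mean and standard deviation of its group, so no value network is needed.

```python
# Minimal sketch of group-relative advantage estimation (GRPO-style normalization).
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its group's mean and standard deviation."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four trajectories rolled out from the same initial state: two succeed, two fail.
print(group_relative_advantages([10.0, 0.0, 10.0, 0.0]))  # ~[1.0, -1.0, 1.0, -1.0]
```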
3.2. Previous Works
The paper contextualizes HGPO by discussing several relevant lines of research:
- LLM-based Decision-Making Agents: Early approaches relied on prompting strategies without further training.
  - ReAct (Yao et al., 2023): Integrates reasoning (using chain-of-thought) and acting (selecting actions) in an interleaved manner. The LLM generates thoughts and then actions based on observations.
  - Reflexion (Shinn et al., 2024): Extends ReAct by incorporating self-reflection and iterative improvement. The agent generates a trajectory, reflects on its outcome, and uses this reflection to improve future trajectories.
  - These methods are simple but often limited by the pre-trained model's knowledge and lack of domain-specific adaptation.
- Reinforcement Learning for LLM-based Agents: RL offers a powerful paradigm for adapting LLMs to dynamic environments post-training.
  - DQN (Mnih et al., 2015): A value-based RL algorithm that uses deep neural networks to approximate the Q-value function; applied to text games.
  - PPO (Schulman et al., 2017): Proximal Policy Optimization is a popular policy gradient algorithm that balances policy improvement and stability through a clipped objective function; it typically requires a value network to estimate advantages.
  - RLHF (Ziegler et al., 2019; Ouyang et al., 2022): Reinforcement Learning from Human Feedback is a widely used technique for aligning LLMs with human preferences, often involving training a reward model from human comparisons and then optimizing the LLM with PPO.
  - Group-based RL Algorithms: These methods avoid the need for value networks by estimating advantages directly from sampled groups.
    - RLOO (Kool et al., 2019): REINFORCE Leave-One-Out, a group-based approach that estimates each sample's advantage by comparing its reward against the rewards of the other samples in its group.
    - GRPO (Shao et al., 2024): Group Relative Policy Optimization. Originally designed for single-turn tasks, it computes advantages based on the rewards of a group of trajectories.
      - Trajectory-level Advantage (adapted to the stepwise setting for long-horizon tasks): For a given trajectory, its advantage is computed by normalizing its total reward against the mean and standard deviation of the rewards of all trajectories in its group (e.g., all trajectories generated from the same initial state). This advantage is assigned to every step in the trajectory, which can be too coarse-grained for multi-turn tasks.
    - GiGPO (Feng et al., 2025b): Group-in-Group Policy Optimization. This method extends GRPO for LLM agents by providing finer-grained credit assignment at the step level. It clusters steps with identical current states across trajectories into step-level groups.
      - Step-level Advantage: For a given step with current state $s_t$, its advantage is computed by normalizing its stepwise reward (derived from the full trajectory reward) against the mean and standard deviation of the rewards of the other steps in its step-level group, i.e., the steps sharing the same current state. This improves granularity but, as the paper points out, suffers from context inconsistency.
- Long-Horizon Agentic Reinforcement Learning: This area focuses on equipping LLMs with planning, reasoning, and memory capabilities for sustained interaction.
  - Trajectory-wise Policy Optimization: Methods like RAGEN (Wang et al., 2025d) and SearchR1 (Jin et al., 2025a) optimize over full multi-turn rollouts by concatenating states and model outputs. Their limitation is context explosion.
  - Stepwise Policy Optimization: Methods such as those of Feng et al. (2025b) and Luo et al. (2025c) treat each step independently, using a memory module to manage historical context. This is where context inconsistency becomes a problem.
3.3. Technological Evolution
The evolution of LLM agents for long-horizon tasks has progressed from simple prompting to sophisticated RL-based fine-tuning. Initially, LLMs were guided by structured prompts (e.g., ReAct, Reflexion) to perform multi-step tasks. While effective, these methods were constrained by the inherent knowledge of the pre-trained model. Reinforcement Learning emerged as a powerful tool to adapt LLMs to specific environments and tasks by learning from feedback. Early RL applications used standard algorithms like DQN and PPO.
For LLM agents, group-based RL gained traction due to its efficiency, avoiding the need for value networks. However, these methods were often designed for single-turn interactions. Adapting them to long-horizon tasks led to two main paradigms: trajectory-wise (which faced context explosion) and stepwise (which introduced context inconsistency). This paper fits into the stepwise policy optimization paradigm, aiming to refine advantage estimation by resolving the context inconsistency issue that GiGPO and similar methods encounter. HGPO represents an advancement in finer-grained credit assignment by leveraging a hierarchical context structure.
3.4. Differentiation Analysis
Compared to prior group-based RL methods for long-horizon agentic tasks, HGPO introduces core innovations:
- Addressing Context Inconsistency: This is the primary differentiation. While GiGPO attempts step-level advantage estimation, it still groups steps based only on the current state, ignoring the preceding historical context in the memory module. HGPO explicitly recognizes and addresses this context inconsistency as a source of bias.
- Context-aware Hierarchical Grouping: Instead of a single step-level group, HGPO assigns each step to multiple hierarchical groups, where each level of the hierarchy corresponds to a different degree of historical context consistency. This allows for a more nuanced and accurate comparison of steps.
- Adaptive Weighting Advantage Estimation: HGPO does not just form hierarchical groups; it combines their advantages using an adaptive weighting scheme that prioritizes higher-level groups (those with more consistent historical contexts) by assigning them larger weights, thereby actively reducing bias in the final advantage estimate. This contrasts with simpler aggregation or with ignoring inconsistent contexts.
- Improved Bias-Variance Trade-off: By explicitly accounting for context consistency and adaptively weighting advantages, HGPO systematically achieves a better bias-variance trade-off than trajectory-level (high bias, low variance) or step-level (potentially high bias due to inconsistency, variable variance) methods. It leverages Oracle-like accuracy when available, while still using less consistent (but more abundant) steps to reduce variance.
- Efficiency without Extra Models: HGPO achieves these improvements without introducing extra models or requiring additional data collection or rollouts. The hierarchical grouping process operates offline using only existing rollouts and hashmap lookups, maintaining computational efficiency comparable to its predecessors.

The pilot study results in Figure 2 clearly differentiate HGPO's motivation: both trajectory-level and step-level advantages (as computed by GRPO and GiGPO, respectively) show significant bias compared to Oracle advantages, and Oracle steps are too scarce for efficient training. HGPO aims to bridge this gap.
4. Methodology
4.1. Principles
The core principle of Hierarchy-of-Groups Policy Optimization (HGPO) is to achieve a more accurate and stable advantage estimation for Large Language Model (LLM) agents in long-horizon agentic tasks by addressing the issue of context inconsistency inherent in existing stepwise group-based reinforcement learning (RL). The intuition is that for a given state-action pair, its value (or advantage) should be compared with other state-action pairs that have not only the same current state but also a similar preceding historical context. By constructing hierarchical groups based on the degree of contextual consistency and adaptively weighting their respective advantage estimates, HGPO can effectively reduce bias (by prioritizing more consistent contexts) while maintaining low variance (by utilizing a broader range of comparisons).
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Setup of Long-Horizon Agentic Tasks
In long-horizon agentic tasks, an LLM agent (the policy $\pi_\theta$, parameterized by $\theta$) interacts with an environment over multiple turns to achieve a goal. For a task example $x$, at each turn $t$ the agent observes an environment state $s_t$ and generates a textual action $a_t \in \mathcal{V}^L$, where $\mathcal{V}$ is the token vocabulary, $L$ is the maximum generation length, and $t$ ranges from 1 to $T$ (the maximum number of interaction turns). The paper focuses on a sparse, delayed reward setting, where a scalar reward is provided only at the final step of a trajectory $\tau = (s_1, a_1, \dots, s_T, a_T)$.
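For reference, a couple of lightweight data structures (hypothetical names, not from the paper's codebase) capture this setup: one textual action per turn, with the scalar task reward attached only to the final step.

```python
# Illustrative containers for the sparse, delayed-reward agentic setup.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Step:
    state: str           # textual observation s_t
    action: str          # textual action a_t generated by the LLM policy
    reward: float = 0.0  # 0.0 at intermediate turns; the task reward at the final step

@dataclass
class Trajectory:
    task: str                                    # task description x
    steps: List[Step] = field(default_factory=list)

    @property
    def total_reward(self) -> float:
        # With sparse delayed rewards this equals the final step's reward.
        return sum(s.reward for s in self.steps)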
4.2.2. Trajectory-wise vs. Stepwise Policy Optimization
- Trajectory-wise Policy Optimization: This framework concatenates the full interaction history of a trajectory for policy optimization, so the policy conditions its actions on the entire history: $a_t \sim \pi_\theta(\cdot \mid x, s_1, a_1, \dots, s_t)$. The main drawback is context explosion: as the number of turns increases, the context length grows rapidly, making RL training computationally expensive and hard to scale.
- Stepwise Policy Optimization: To address context explosion, this framework treats each step within a trajectory independently. It uses a memory module to retain a fixed-length historical context of interactions, which keeps the prompt length relatively stable and enables more scalable RL training. The sketch below contrasts the two framings.
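The following minimal sketch (an assumption about prompt construction, not the paper's exact format) contrasts the two framings: the trajectory-wise prompt grows with every turn, while the stepwise prompt keeps only a fixed-length memory of recent turns.

```python
# Contrast of trajectory-wise vs. stepwise prompt construction.
def trajectory_wise_prompt(task, history):
    """history: list of (state, action) pairs for all past turns; grows unboundedly."""
    past = "\n".join(f"State: {s}\nAction: {a}" for s, a in history)
    return f"Task: {task}\n{past}\nCurrent state -> choose the next action:"

def stepwise_prompt(task, history, memory_len=4):
    """Keep only the last `memory_len` turns as the memory module, bounding prompt length."""
    past = "\n".join(f"State: {s}\nAction: {a}" for s, a in history[-memory_len:])
    return f"Task: {task}\n{past}\nCurrent state -> choose the next action:"
```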
4.2.3. Group-based Reinforcement Learning and the Issue of Context Inconsistency
Unlike Proximal Policy Optimization (PPO), which uses a value function to estimate advantages, group-based RL algorithms (like GRPO) compute advantages directly from the statistics of a sampled group of trajectories $\{\tau^{(i)}\}_{i=1}^{N}$.
- Trajectory-level Advantage (GRPO adaptation): When GRPO is adapted to the stepwise setting for long-horizon tasks, it calculates a trajectory-level advantage as
  $$\hat{A}^{(i)} = \frac{R^{(i)} - \operatorname{mean}\big(\{R^{(j)}\}_{j=1}^{N}\big)}{\operatorname{std}\big(\{R^{(j)}\}_{j=1}^{N}\big)},$$
  where:
  - $\hat{A}^{(i)}$ is the trajectory-level advantage for the $i$-th trajectory $\tau^{(i)}$.
  - $R^{(i)}$ is the total reward obtained for trajectory $\tau^{(i)}$.
  - $\{\tau^{(j)}\}_{j=1}^{N}$ is the group of sampled trajectories (e.g., those starting from the same initial state), and $N$ is the number of trajectories in the group.
  - $\operatorname{mean}(\cdot)$ averages the rewards of all trajectories in the group, and $\operatorname{std}(\cdot)$ is their standard deviation.
  This method assigns the same advantage value to every step in $\tau^{(i)}$, which is coarse-grained and overlooks finer credit assignment.
- Step-level Advantage (GiGPO adaptation): To address the coarse-grained nature, step-level relative advantage estimators (as in GiGPO) are used. Steps with identical current states across group trajectories are clustered into step-level groups, and their advantages are computed as
  $$\hat{A}_t^{(i)} = \frac{r_t^{(i)} - \operatorname{mean}\big(\{r \mid r \in G(s_t^{(i)})\}\big)}{\operatorname{std}\big(\{r \mid r \in G(s_t^{(i)})\}\big)},$$
  where:
  - $\hat{A}_t^{(i)}$ is the step-level advantage for a step with current state $s_t^{(i)}$.
  - $r_t^{(i)}$ is the reward (specifically, the stepwise reward defined later in the paper) associated with that step.
  - $G(s_t^{(i)})$ is the step-level group containing all steps that share the identical current state, and $|G(s_t^{(i)})|$ is the number of steps in that group.
  - $\operatorname{mean}(\cdot)$ and $\operatorname{std}(\cdot)$ are the average and standard deviation of the stepwise rewards within $G(s_t^{(i)})$.
  This provides finer-grained credit assignment but introduces the issue of context inconsistency.
- The Issue of Context Inconsistency: The paper highlights that step-level groups such as $G(s_t^{(i)})$ may contain steps that share the same current state but have distinct historical contexts in their memory modules, as illustrated in Figure 1(b). For example, if two trajectories pass through the same state but arrived there via different sequences of preceding states and actions, their historical contexts are inconsistent. When advantages are estimated from such an inconsistent group, the estimate becomes biased, failing to accurately reflect the true impact of the current state and action given its specific historical context. Empirical evidence in Figure 2(a) and (b) confirms this bias. A naive alternative of using only Oracle groups (steps with identical current state and historical context) is inefficient due to their scarcity and leads to high variance because the groups are small (Figure 2(c) and (d)). The short sketch below illustrates the grouping mismatch.
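The toy example below (hypothetical states and rewards, not taken from the paper) shows how a 0-context, step-level group mixes steps whose histories differ, while the Oracle grouping that also matches the full history becomes too small to compare against.

```python
# Toy illustration of context inconsistency in step-level (0-context) grouping.
from collections import defaultdict

# (trajectory_id, turn, history_of_states, current_state, stepwise_reward)
steps = [
    (0, 2, ("hallway", "pantry"),  "kitchen", 8.1),
    (1, 2, ("garden",  "hallway"), "kitchen", 0.0),
]

step_level_groups = defaultdict(list)   # keyed by the current state only
oracle_groups = defaultdict(list)       # keyed by current state + full history

for traj_id, turn, history, state, reward in steps:
    step_level_groups[state].append((traj_id, turn, reward))
    oracle_groups[(history, state)].append((traj_id, turn, reward))

print(len(step_level_groups["kitchen"]))                       # 2: different histories are mixed
print(len(oracle_groups[(("hallway", "pantry"), "kitchen")]))  # 1: too scarce for a comparison
```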
4.2.4. Hierarchy-of-Groups Policy Optimization (HGPO)
HGPO is proposed to address these challenges, comprising two main components: context-aware hierarchical grouping and adaptive weighting advantage estimation. Figure 3 provides an overview.
The following figure (Figure 3 from the original paper) illustrates the HGPO method:
The figure is a schematic overview of the HGPO method, consisting of three parts: the left shows the rollout group trajectories, with the states, actions, and rewards of the different trajectory steps; the middle illustrates context-aware hierarchical grouping, which partitions steps into 0-, 1-, and 2-context groups according to their historical contexts; the right shows the adaptive weighted advantage estimation and its weighted aggregation. The figure illustrates HGPO's bias-variance trade-off strategy in advantage estimation.
Figure 3: Overview of HGPO. The LLM-based agent interacts with a set of environments initialized from the same state, producing four group trajectories (states with the same color are identical). HGPO comprises two key components: context-aware hierarchical grouping and adaptive weighted advantage computation. For illustration, consider the purple state. First, HGPO assigns it into three hierarchical groups according to its historical contexts. Then, it computes the final advantage estimate by adaptively aggregating the weighted advantages from these groups.
4.2.4.1. Context-aware Hierarchical Grouping
This component organizes steps into multi-level groups based on their historical contexts. The intuition is to evaluate the advantage of each step relative to different levels of historical context consistency for more accurate estimates.
- Defining the k-step context operator: For the $t$-th step in the $i$-th trajectory, and given a maximum context length $K$, a $k$-step context operator is defined as (Eq. 3)
  $$c_k\big(s_t^{(i)}\big) = \big(s_{t-k}^{(i)}, s_{t-k+1}^{(i)}, \dots, s_{t-1}^{(i)}\big), \qquad k = 0, 1, \dots, K,$$
  where:
  - $c_k(s_t^{(i)})$ is the $k$-step historical context ending at state $s_t^{(i)}$.
  - $s_t^{(i)}$ is the state at turn $t$ in trajectory $\tau^{(i)}$.
  - $k$ is the context depth parameter, ranging from 0 to $K$, and $K$ is the maximum context length used for grouping.
  The operator returns the sequence of historical states preceding the current state. If $k \geq t$, it returns all available states from the start of the trajectory up to $s_{t-1}^{(i)}$, with $s_0^{(i)}$ representing the initial state.
- Defining the k-th hierarchical group: Based on this operator, the $k$-th hierarchical group for a given step is defined as (Eq. 4)
  $$G_k\big(s_t^{(i)}\big) = \Big\{(j, n) \in \mathcal{I} \;\Big|\; s_n^{(j)} = s_t^{(i)} \ \text{and}\ c_k\big(s_n^{(j)}\big) = c_k\big(s_t^{(i)}\big)\Big\},$$
  where:
  - $G_k(s_t^{(i)})$ is the $k$-th hierarchical group for the step with current state $s_t^{(i)}$.
  - $(j, n)$ indexes a step at turn $n$ of trajectory $\tau^{(j)}$, and $\mathcal{I}$ is the index set of all steps across the group trajectories.
  - The condition $c_k(s_n^{(j)}) = c_k(s_t^{(i)})$ requires the $k$-step historical context of step $(j, n)$ to be identical to that of step $(i, t)$.
  In simpler terms, $G_k(s_t^{(i)})$ contains all steps (from potentially different trajectories) that have exactly the same current state and exactly the same $k$-step history preceding it.
- Hierarchy Structure: The resulting hierarchy-of-groups structure satisfies (Eq. 5)
  $$G_K\big(s_t^{(i)}\big) \subseteq G_{K-1}\big(s_t^{(i)}\big) \subseteq \cdots \subseteq G_1\big(s_t^{(i)}\big) \subseteq G_0\big(s_t^{(i)}\big).$$
  This means:
  - $G_0$ includes all steps that only share the same current state (equivalent to step-level grouping in GiGPO).
  - $G_1$ includes steps that share the same current state and the same immediately preceding state.
  - $G_K$ includes steps that share the same current state and the same $K$-step history; these are the Oracle steps.
  - The group sizes are nested: $G_0$ is the largest and $G_K$ is the smallest, since requiring more context consistency reduces the number of matching steps. This hierarchy improves step utilization (even steps with only partial context consistency are used) and reduces variance (the groups at lower $k$ are larger). The process is fully offline and relies on hashmap lookups; a minimal grouping sketch follows this list.
4.2.4.2. Adaptive Weighting Advantage Estimation
The core idea here is that higher-level hierarchical groups (larger $k$) provide more accurate advantage comparisons due to stronger context consistency, but they are scarcer. Conversely, lower-level groups (smaller $k$) are more abundant but less consistent. HGPO integrates information across all hierarchical groups with adaptive weights to balance bias and variance.
- Advantage for the k-th hierarchical group: The advantage estimate for the $k$-th hierarchical group is defined analogously to the step-level advantage (Eq. 6):
  $$\hat{A}_k\big(s_t^{(i)}\big) = \frac{r_t^{(i)} - \operatorname{mean}\big(\{r_n^{(j)} \mid (j, n) \in G_k(s_t^{(i)})\}\big)}{\operatorname{std}\big(\{r_n^{(j)} \mid (j, n) \in G_k(s_t^{(i)})\}\big)},$$
  where:
  - $\hat{A}_k(s_t^{(i)})$ is the advantage estimate for the step based on its $k$-th hierarchical group.
  - $r_t^{(i)}$ is the stepwise reward (defined below) for that step.
  - $G_k(s_t^{(i)})$ is the $k$-th hierarchical group (Eq. 4) that the step belongs to, and $|G_k(s_t^{(i)})|$ is its size.
  - $\operatorname{std}(\cdot)$ is the standard deviation of the stepwise rewards within $G_k(s_t^{(i)})$.
- Stepwise Reward Calculation: For each step, its stepwise reward is the discounted sum of future rewards from that step onward (implied by Eq. 7 and standard in RL):
  $$r_t^{(i)} = \sum_{t'=t}^{T} \gamma^{\,t'-t}\, R_{t'}^{(i)},$$
  where:
  - $r_t^{(i)}$ is the stepwise reward for step $t$ in trajectory $\tau^{(i)}$.
  - $\gamma$ is the discount factor, which weights immediate rewards more heavily than future rewards.
  - $R_{t'}^{(i)}$ is the instantaneous reward at turn $t'$ of trajectory $\tau^{(i)}$ (sparse and delayed in this problem, typically 0 until the final step).
- Aggregated Advantage: The final advantage for a step is aggregated from its hierarchical groups using an adaptive weighting scheme (Eq. 7):
  $$\hat{A}\big(s_t^{(i)}\big) = \sum_{k=0}^{K} w_k\, \hat{A}_k\big(s_t^{(i)}\big),$$
  where:
  - $\hat{A}(s_t^{(i)})$ is the final aggregated advantage estimate for the step.
  - $w_k$ is the adaptive weight for the $k$-th hierarchical group. The weights are normalized to sum to 1 over $k = 0, \dots, K$ and grow with $k$ at a rate controlled by a weighting coefficient $\alpha$: a larger $\alpha$ assigns proportionally larger weights to higher-level groups (larger $k$), which have more consistent historical contexts and are therefore expected to yield less biased advantage estimates. With $\alpha = 0$ all weights are equal (uniform weighting); with $\alpha = 1$ the weight increases linearly with $k$. This scheme prioritizes accuracy from consistent contexts while still leveraging data from less consistent contexts to control variance. A code sketch of this pipeline follows the list.
4.2.4.3. Objective for Policy Optimization
The policy optimization objective for HGPO is a PPO-like clipped objective with a KL divergence penalty (Eq. 8):

$$J_{\mathrm{HGPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{N}\sum_{i=1}^{N}\frac{1}{T}\sum_{t=1}^{T}\Big(\min\big(\rho_t^{(i)} \hat{A}_t^{(i)},\ \operatorname{clip}\big(\rho_t^{(i)}, 1-\epsilon, 1+\epsilon\big)\hat{A}_t^{(i)}\big) - \beta\, D_{\mathrm{KL}}\big[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\big]\Big)\right], \qquad \rho_t^{(i)} = \frac{\pi_\theta\big(a_t^{(i)} \mid s_t^{(i)}, x\big)}{\pi_{\theta_{\mathrm{old}}}\big(a_t^{(i)} \mid s_t^{(i)}, x\big)},$$

where:
- $J_{\mathrm{HGPO}}(\theta)$ is the objective to be maximized when updating the policy parameters $\theta$.
- $\mathbb{E}[\cdot]$ denotes the expectation over collected trajectories, $N$ is the number of trajectories in a group, and $T$ is the maximum length of a trajectory.
- $\rho_t^{(i)}$ is the importance sampling ratio, comparing the probability of action $a_t^{(i)}$ under the current policy $\pi_\theta$ to that under the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $\hat{A}_t^{(i)}$ is the aggregated advantage computed by HGPO (Eq. 7).
- $\operatorname{clip}(\cdot, 1-\epsilon, 1+\epsilon)$ is the PPO clipping function, which limits the importance ratio to the interval $[1-\epsilon, 1+\epsilon]$ to prevent excessively large policy updates; $\epsilon$ is the clipping parameter.
- $\beta$ is the coefficient of the KL divergence penalty, and $D_{\mathrm{KL}}[\pi_\theta \,\|\, \pi_{\mathrm{ref}}]$ is the Kullback-Leibler divergence between the current policy and a reference policy (often the old policy, or the pre-trained LLM). This term regularizes policy updates so they do not deviate too far from the reference policy, ensuring stability and preventing catastrophic forgetting.

A minimal code sketch of this objective follows the symbol list.
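A compact PyTorch sketch of this objective is shown below. Precomputed per-step log-probabilities and advantages are assumed, and the KL term uses one common estimator rather than necessarily the paper's exact choice.

```python
# PPO-style clipped surrogate with a KL penalty, applied to HGPO's aggregated advantages.
import torch

def hgpo_loss(logp_new, logp_old, logp_ref, advantages, clip_eps=0.2, beta=0.01):
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    surrogate = torch.min(ratio * advantages, clipped * advantages)
    # Unbiased KL(pi_new || pi_ref) estimator commonly used in GRPO-style implementations.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1.0
    return -(surrogate - beta * kl).mean()                       # negate: optimizers minimize
```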
4.2.4.4. Bias-Variance Trade-off in HGPO
Proposition 4.1 formally analyzes how HGPO achieves a favorable bias-variance trade-off under certain conditions.
Let $b_k$ and $v_k$ denote the bias and variance of the estimated advantage within the $k$-th group $G_k$, and let $b_{\mathrm{traj}}$ and $v_{\mathrm{traj}}$ denote those of the trajectory-level estimator.

The conditions are:

- Bias decreases monotonically: $b_{\mathrm{traj}} \geq b_0 \geq b_1 \geq \cdots \geq b_K$. As context consistency increases (higher $k$), the bias of the advantage estimate decreases: $b_{\mathrm{traj}}$ is the bias of the trajectory-level advantage, $b_0$ the bias of the step-level (0-context) advantage, and $b_K$ the bias of the Oracle (full $K$-context) advantage.
- Variance increases monotonically: $v_{\mathrm{traj}} \leq v_0 \leq v_1 \leq \cdots \leq v_K$. As context consistency increases, the group size generally decreases, leading to higher variance: $v_{\mathrm{traj}}$, $v_0$, and $v_K$ are the variances of the trajectory-level, step-level, and Oracle estimators, respectively. The per-level estimates $\hat{A}_k$ and $\hat{A}_{k'}$ ($k \neq k'$) are assumed independent for the variance calculation.

The bias and variance of the aggregated estimator $\hat{A}$ are:

- Bias of $\hat{A}$ (Eq. 9): $\mathrm{Bias}(\hat{A}) = \sum_{k=0}^{K} w_k\, b_k$, i.e., the bias of the combined advantage is a weighted sum of the individual biases. Since $b_k$ decreases with $k$, weighting higher-level groups more heavily lowers the overall bias. Specifically, the bias satisfies (Eq. 11) $b_K \leq \mathrm{Bias}(\hat{A}) \leq b_0 \leq b_{\mathrm{traj}}$: HGPO's estimator is never more biased than the step-level estimator (itself no more biased than the trajectory-level one) and never less biased than the Oracle estimator, so it trades off between these extremes while leaning toward lower bias.
- Variance of $\hat{A}$ (Eq. 10): Assuming independence between the advantage estimates from different hierarchical groups, $\mathrm{Var}(\hat{A}) = \sum_{k=0}^{K} w_k^2\, v_k$. The variance satisfies (Eq. 12) $v_0 \sum_{k} w_k^2 \leq \mathrm{Var}(\hat{A}) \leq v_K \sum_{k} w_k^2$, i.e., it is bounded between the (weight-scaled) variance of the step-level estimator and that of the Oracle estimator, depending on the weights and group sizes. By weighting, HGPO reduces the impact of high-variance Oracle estimates (small groups) while still benefiting from their low bias.
In summary, HGPO provides a principled framework for advantage estimation that systematically leverages historical context to lower bias while maintaining statistical efficiency (balancing variance) through weighted aggregation.
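To make the trade-off concrete, here is a small worked example with illustrative numbers (not taken from the paper), using K = 2, weights w = (1/6, 2/6, 3/6), per-group biases b = (0.3, 0.1, 0.0), and variances v = (0.01, 0.04, 0.25):

```latex
\begin{align*}
\mathrm{Bias}(\hat{A}) &= \tfrac{1}{6}(0.3) + \tfrac{2}{6}(0.1) + \tfrac{3}{6}(0.0) \approx 0.083,
&& b_K = 0 \;\le\; 0.083 \;\le\; b_0 = 0.3,\\
\mathrm{Var}(\hat{A}) &= \tfrac{1}{36}(0.01) + \tfrac{4}{36}(0.04) + \tfrac{9}{36}(0.25) \approx 0.067,
&& v_0 \textstyle\sum_k w_k^2 \approx 0.004 \;\le\; 0.067 \;\le\; v_K \sum_k w_k^2 \approx 0.097.
\end{align*}
```

Under these illustrative numbers, the aggregated estimator is far less biased than the step-level estimate while avoiding most of the Oracle estimator's variance.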
4.2.5. Algorithm Pseudo-code (Algorithm 1)
The following describes the HGPO training process as outlined in Algorithm 1 in Appendix A:
Algorithm 1: The pseudo-code of HGPO
Require: Initial policy π_θ_old, task distribution p(X), discount factor γ, weighting ω, clipping
parameter ε, KL penalty β, group size N, the length of historical context K, parameter α
for each training iteration do
Update the old policy model: θ_old ← θ
// Multi-step rollout phase
Sample task x ~ p(X) and initialize N identical environments
for t=1 to T do
Sample actions {a_t^(i) ~ π_θ_old(⋅ | s_t^(i), x)}_i=1^N
Execute actions, observe rewards {r_t^(i)}_i=1^N and next states {s_t+1^(i)}_i=1^N
end for
// Grouping phase
Context-aware hierarchical grouping by Eq. (5)
// Advantage computation phase
Compute multiple advantages within each group by Eq. (7)
// Policy update phase
Update policy θ by maximizing objective J_HGPO(θ)
end for
Step-by-step Explanation:
- Initialization: The algorithm starts with an initial policy π_θ_old. It requires several hyperparameters: a task distribution p(X), discount factor γ, weighting ω, clipping parameter ε, KL penalty β, group size N, maximum historical context length K, and parameter α.
- Training Loop: The core process repeats for a fixed number of training iterations.
- Update Old Policy: At the beginning of each iteration, the old policy is set to the current policy (θ_old ← θ). This is crucial for PPO-like algorithms so that the importance sampling ratio (Eq. 8) is calculated against a fixed reference.
- Multi-step Rollout Phase:
  - A task x is sampled from the task distribution p(X), and N identical environments are initialized for this task, so all trajectories start from the same initial state and pursue the same goal.
  - For each turn t from 1 to T (the maximum number of turns):
    - The agent (using the old policy) samples N actions, one for each of the parallel environments, conditioned on the current state and the task.
    - These actions are executed in their respective environments, and the rewards and next states are observed. This yields N rollout trajectories.
- Grouping Phase: After all rollouts are collected, context-aware hierarchical grouping is performed for all steps within these trajectories according to Eq. (5). For each step, this identifies all steps (from any of the N trajectories) that share the same current state and also match different lengths of historical context (from 0 to K), forming the hierarchical groups.
- Advantage Computation Phase: For every step in the collected rollouts, per-group advantages are computed within each of its hierarchical groups using Eq. (6), and these group advantages are then aggregated with the adaptive weighting scheme (Eq. 7) to obtain the final advantage estimate for that step.
- Policy Update Phase: The policy parameters θ are updated by maximizing the HGPO objective J_HGPO(θ) from Eq. (8), typically via gradient ascent with an optimizer such as Adam.

This process iteratively refines the LLM agent's policy by leveraging hierarchical, context-aware advantage estimation to make more informed and stable updates. A high-level Python skeleton of this loop follows.
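The skeleton below mirrors Algorithm 1 at a high level. The phase functions are passed in as parameters, since the concrete rollout, grouping, and update routines are environment- and model-specific; the names here are hypothetical placeholders, not APIs from the paper's codebase.

```python
# High-level HGPO training skeleton mirroring Algorithm 1 (phase functions injected).
from typing import Any, Callable, Sequence

def train_hgpo(policy: Any,
               sample_task: Callable[[], Any],
               rollout_group: Callable[[Any, Any, int], Sequence[Any]],
               build_groups: Callable[[Sequence[Any], int], Any],
               compute_advantages: Callable[[Sequence[Any], Any, float], Any],
               update_policy: Callable[[Any, Sequence[Any], Any, Any], None],
               iterations: int, N: int = 8, K: int = 4, alpha: float = 1.0) -> Any:
    for _ in range(iterations):
        old_policy = policy                                    # theta_old <- theta (a real run would copy weights)
        task = sample_task()                                   # x ~ p(X)
        trajectories = rollout_group(old_policy, task, N)      # multi-step rollout phase (N identical envs)
        groups = build_groups(trajectories, K)                 # context-aware hierarchical grouping
        advantages = compute_advantages(trajectories, groups, alpha)  # adaptive weighted aggregation
        update_policy(policy, trajectories, advantages, old_policy)   # maximize J_HGPO(theta)
    return policy
```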
5. Experimental Setup
5.1. Datasets
The paper evaluates HGPO on two challenging agentic benchmarks designed to assess LLM agents' ability to perform multi-step decision-making.
- ALFWorld (Shridhar et al., 2021):
  - Description: ALFWorld is a text-based embodied environment where an agent receives a text goal and must accomplish it through multi-turn interaction. It simulates household activities.
  - Scale and Characteristics: It includes 4,639 task instances across six categories of common household activities: Pick & Place (Pick), Examine in Light (Look), Clean & Place (Clean), Heat & Place (Heat), Cool & Place (Cool), and Pick Two & Place (Pick2). The environment involves navigation, object interaction, and logical reasoning.
  - Example of a data sample: A task might be "Go to the kitchen, find a potato, heat it in the microwave, and then place it in the fridge." The agent would perceive textual descriptions of its surroundings and generate textual actions such as go to kitchen, look for potato, take potato from countertop, put potato in microwave, heat potato, open fridge, put potato in fridge.
  - Choice Justification: ALFWorld is widely used to test long-horizon planning and robust decision-making in a textual embodied setting, making it suitable for evaluating LLM agents. It includes both in-distribution and out-of-distribution tasks, allowing generalization to be assessed.
- WebShop (Yao et al., 2022):
  - Description: WebShop is a complex, web-based interactive environment designed to test LLM agents in realistic online shopping scenarios. The agent interacts with a simulated HTML-based shopping website to search for, navigate to, and ultimately purchase a suitable item.
  - Scale and Characteristics: It contains over 1.1 million products and 12,000 user instructions, providing a rich and diverse action space and observation space (web page content). The environment requires understanding natural language instructions, interpreting web page layouts, making navigation decisions, and formulating search queries.
  - Example of a data sample: A task might be "Find a pair of headphones under $50 with noise cancellation and add it to your cart." The agent would observe HTML elements and execute actions such as click search bar, type "headphones", click search button, click filter "noise cancellation", set price range "0-50", click product "XYZ Headphones", click "add to cart".
  - Choice Justification: WebShop presents a highly challenging long-horizon agentic task in a realistic web environment, demanding strong reasoning, planning, and interaction capabilities from LLM agents.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the performance of LLM agents on ALFWorld and WebShop, as well as internal RL training dynamics.
- ALFWorld Metrics:
  - Overall Success Rate (↑):
    - Conceptual Definition: This metric measures the percentage of all tasks (across both in-distribution and out-of-distribution categories) where the agent successfully completes the given goal within the environment. A higher percentage indicates better overall performance.
    - Mathematical Formula: $\text{Success Rate} = \dfrac{\text{Number of Successful Tasks}}{\text{Total Number of Tasks Attempted}} \times 100\%$
    - Symbol Explanation: Number of Successful Tasks is the count of task instances where the agent achieved its stated goal; Total Number of Tasks Attempted is the total count of task instances the agent was evaluated on.
  - In-Success (↑):
    - Conceptual Definition: Measures the success rate specifically for task instances that are in-distribution, i.e., similar in structure or content to tasks seen during training. This assesses the agent's performance on familiar scenarios.
    - Mathematical Formula: Same as Overall Success Rate, but restricted to in-distribution tasks.
    - Symbol Explanation: Same as Overall Success Rate, with the counts referring specifically to in-distribution tasks.
  - Out-Success (↑):
    - Conceptual Definition: Measures the success rate specifically for task instances that are out-of-distribution, i.e., significantly different from tasks seen during training. This is a crucial metric for evaluating the agent's generalization to novel or unseen situations.
    - Mathematical Formula: Same as Overall Success Rate, but restricted to out-of-distribution tasks.
    - Symbol Explanation: Same as Overall Success Rate, with the counts referring specifically to out-of-distribution tasks.
- WebShop Metrics:
  - Average Task Score (↑):
    - Conceptual Definition: This metric quantifies the average degree of task completion in the WebShop environment. WebShop tasks can be partially completed, and the task score (normalized between 0 and 1) reflects how many sub-goals or requirements of a task were met; a score of 1 indicates full success. The partial-credit calculation typically checks for correct item selection, filters applied, price range, and so on. The paper does not provide a specific formula, implying the standard WebShop scoring mechanism. A tiny code sketch of these two WebShop metrics appears at the end of this subsection.
    - Mathematical Formula (generic form, not given explicitly in the paper): for $M$ evaluated tasks, $\text{Average Task Score} = \dfrac{1}{M}\sum_{i=1}^{M} s_i$.
    - Symbol Explanation: $s_i$ is the score (from 0 to 1) achieved on task $i$, and $M$ is the total number of tasks evaluated.
  - Average Task Success Rate (↑):
    - Conceptual Definition: Measures the percentage of WebShop tasks where the agent achieved full success (e.g., a task score of 1). This is a binary success metric.
    - Mathematical Formula: Same as the ALFWorld Overall Success Rate, but specific to WebShop tasks.
    - Symbol Explanation: Same as Overall Success Rate, with the counts referring specifically to WebShop tasks.
- Training Dynamics Metrics (for further analysis): These metrics provide insight into the stability and efficiency of the RL training process; the paper describes them conceptually in Appendix C.4:
  - Mean Advantages: Average value of the advantage estimates over a batch of steps. A positive and stable value indicates effective credit assignment.
  - Policy Gradient Loss: The value of the objective (Eq. 8) that the policy optimizes. A smooth decrease typically indicates stable learning.
  - KL Divergence: Measures the difference between the new policy and the old policy. A moderate value indicates steady learning without overly aggressive updates.
  - Policy Gradient Clip Fraction: The proportion of updates that are clipped during PPO-style optimization. A moderate fraction suggests stable training; a high fraction implies instability.
  - Mean Reward: The average cumulative reward obtained per episode; a direct indicator of learning progress.
  - Episode Success Rate: The percentage of episodes in which the agent completes its task, similar to the main success metrics but tracked during training.
5.3. Baselines
HGPO is compared against a comprehensive set of competitive baselines, categorized into closed-source LLMs, prompting agents, and RL training methods.
- 1. Closed-source LLMs (Prompting-based): These models are used directly with structured prompts, without any RL fine-tuning, representing the performance ceiling of large proprietary models using only prompt engineering.
  - GPT-4o (Achiam et al., 2023): A state-of-the-art multimodal LLM from OpenAI.
  - Gemini-2.5-Pro (Team et al., 2023): A powerful LLM from Google, comparable to GPT-4o.
- 2. Prompting Agents (Open-source LLMs with Prompting): These methods use open-source LLMs (Qwen2.5) combined with specific prompting strategies to enable agentic behavior, without RL fine-tuning.
  - Qwen2.5 (Pure Prompting): The base LLM used for RL training is also evaluated with simple direct prompting to establish a lower bound.
  - ReAct (Yao et al., 2023): An agent that interleaves reasoning (chain-of-thought) and acting steps to solve tasks.
  - Reflexion (Shinn et al., 2024): An agent that improves over multiple tries by reflecting on past failures and refining its reasoning and actions.
- 3. RL Training Methods (Fine-tuning Open-source LLMs): These methods fine-tune the base LLM (Qwen2.5) using reinforcement learning.
  - PPO (with critic) (Schulman et al., 2017): Proximal Policy Optimization, a widely used policy gradient algorithm that requires an additional value network (critic) to estimate advantages. It serves as a strong traditional RL baseline.
  - RLOO (Kool et al., 2019; Ahmadian et al., 2024): REINFORCE Leave-One-Out, a group-based RL approach that estimates advantages without an explicit value network.
  - GRPO (Shao et al., 2024): Group Relative Policy Optimization, the base group-based RL algorithm, adapted to the stepwise setting for long-horizon tasks as described in Section 3.2, using trajectory-level advantage estimation.
  - GiGPO (Feng et al., 2025b): Group-in-Group Policy Optimization, a prior hierarchical RL method that extends GRPO by performing step-level (0-context) advantage estimation for LLM-based agents but suffers from context inconsistency. This is the most direct group-based RL baseline for HGPO.
5.4. Implementation Details
For fairness and comparability, all RL training methods share common configurations:
- Base Models: Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct (Yang et al., 2024).
- Prompting Strategy: Each LLM agent is prompted to generate a chain-of-thought inside a dedicated reasoning tag, followed by an action inside an action tag (as shown in Figures 6 and 7 in Appendix C.5).
- ALFWorld Specifics:
- Max prompt length: 2048 tokens.
- Max response length: 512 tokens.
- Max environment steps per episode: 50.
  - Learning rate: configured separately for the actor (policy) and the critic (PPO only).
  - Reward: 10 for success, 0 for failure, -0.1 penalty for invalid actions.
  - Group size (N): 8 for group-based RL.
  - Groups per rollout: 16, resulting in 16 × 8 = 128 environments in total.
- PPO: Uses 128 separate environments for rollouts.
- Rollout temperature: 1.0. Validation temperature: 0.4.
- Mini-batch size: 256.
  - KL-divergence loss coefficient (β): 0.01.
  - Discount factor (γ): 0.95.
- WebShop Specifics:
- Max prompt length: 4096 tokens.
- Max response length: 512 tokens.
- Max environment steps per episode: 15.
  - Learning rate: configured separately for the actor and the critic (PPO only).
  - Reward: 10 for success, 0 for failure, -0.1 penalty for invalid actions.
  - Group size (N): 8 for group-based RL.
  - Groups per rollout: 16, resulting in 16 × 8 = 128 environments in total.
- PPO: Uses 128 separate environments for rollouts.
- Rollout temperature: 1.0. Validation temperature: 0.4.
- Mini-batch size: 64.
  - KL-divergence loss coefficient (β): 0.01.
  - Discount factor (γ): 0.95.
- HGPO Specifics: The weighting coefficient α (from Eq. 7) is set to 1 by default for the main experiments. When computing weights, groups with zero advantage (e.g., because all members of a group share the same reward, making the group standard deviation zero) are omitted to avoid undefined estimates.
- Computing Details:
- Qwen2.5-1.5B-Instruct: 2 NVIDIA H100 GPUs.
- Qwen2.5-7B-Instruct: 4 NVIDIA H100 GPUs.
- Training iterations: 160 for each experiment.
- Evaluation: Both GiGPO and HGPO are tested with three random seeds, reporting mean and standard deviation.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate HGPO's superior performance across both ALFWorld and WebShop benchmarks. The tables and figures provided in the paper support the claims regarding HGPO's effectiveness, stability, and improved generalization.
The following are the results from Table 1 of the original paper:
| Model | Type | Method | ALFWorld In-Success | ALFWorld Out-Success | WebShop Task Score | WebShop Task Success Rate |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | Prompting | GPT-4o | 48.0 | 46.0 | 31.8 | 23.7 |
| | | Gemini-2.5-Pro | 60.3 | 50.5 | 42.5 | 35.9 |
| | | Qwen2.5 | 4.1 | - | 23.1 | 5.2 |
| | | ReAct | 12.8 | - | 40.1 | 11.3 |
| | | Reflexion | 21.8 | - | 55.8 | 21.9 |
| | RL Training | PPO (with critic) | 54.4 ± 3.1 | - | 73.8 ± 3.0 | 51.5 ± 2.9 |
| | | RLOO | 69.7 ± 2.5 | 68.7 ± 10.7 | 73.9 ± 5.6 | 52.1 ± 6.7 |
| | | GRPO | 72.8 ± 3.6 | 70.1 ± 2.5 | 75.8 ± 3.5 | 56.8 ± 3.8 |
| | | GiGPO (K=2) | 85.42 ± 1.32 | 80.72 ± 1.62 | 84.52 ± 0.98 | 69.79 ± 0.59 |
| | | HGPO (K=2) | 89.58 ± 0.45 | 80.73 ± 2.38 | 87.53 ± 0.77 | 72.66 ± 1.78 |
| | | GiGPO (K=4) | 85.15 ± 2.81 | 80.98 ± 0.45 | 88.5 ± 0.49 | 74.08 ± 0.98 |
| | | HGPO (K=4) | 92.45 ± 0.81 | 89.06 ± 2.34 | 88.90 ± 0.90 | 75.91 ± 1.19 |
| Qwen2.5-7B-Instruct | Prompting | Qwen2.5 | 14.8 | - | 26.4 | 7.8 |
| | | ReAct | 31.2 | - | 46.2 | 19.5 |
| | | Reflexion | 42.7 | - | 58.1 | 28.8 |
| | RL Training | PPO (with critic) | 77.08 ± 1.12 | 76.23 ± 1.46 | 81.4 ± 3.1 | 68.7 ± 5.1 |
| | | RLOO | 77.86 ± 0.03 | 73.95 ± 0.05 | 80.3 ± 3.2 | 65.7 ± 4.0 |
| | | GRPO | 78.64 ± 0.73 | 76.82 ± 1.47 | 79.3 ± 2.8 | 66.1 ± 3.7 |
| | | GiGPO (K=2) | 89.84 ± 2.20 | 82.81 ± 5.46 | 86.23 ± 1.43 | 75.13 ± 1.37 |
| | | HGPO (K=2) | 91.15 ± 1.19 | 84.89 ± 4.30 | 88.93 ± 0.84 | 76.43 ± 1.47 |
| | | GiGPO (K=4) | 90.88 ± 0.90 | 87.76 ± 0.45 | 87.25 ± 1.02 | 76.18 ± 1.25 |
| | | HGPO (K=4) | 94.79 ± 0.90 | 93.22 ± 1.62 | 87.88 ± 0.41 | 77.21 ± 0.22 |
6.1.1. HGPO Achieves Overall Superior Performance
As evident from Table 1, HGPO consistently outperforms all baselines across both ALFWorld and WebShop, using both Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models.
- RL vs. Prompting: All RL training methods significantly outperform prompting-based methods. For example, on ALFWorld (Qwen2.5-1.5B), Reflexion achieves 21.8% In-Success, while PPO reaches 54.4% and HGPO (K=4) achieves 92.45%. This underscores the substantial benefit of RL fine-tuning for agentic reasoning and decision-making compared to static prompting. Even sophisticated closed-source LLMs such as GPT-4o and Gemini-2.5-Pro are surpassed by the RL-trained Qwen2.5 models, highlighting the importance of domain-specific adaptation.
- HGPO vs. Other RL Methods: Among the RL-based approaches, HGPO consistently achieves the highest success rates and task scores. For instance, with Qwen2.5-1.5B-Instruct on ALFWorld, HGPO (K=4) reaches 92.45% In-Success, significantly higher than GiGPO (K=4) at 85.15%, GRPO at 72.8%, and PPO at 54.4%. Similar trends hold on WebShop and with the larger Qwen2.5-7B-Instruct model. This validates HGPO's ability to provide more reliable advantage estimates for policy optimization.
6.1.2. HGPO Benefits More from a Larger K
A notable observation is that HGPO exhibits a more pronounced performance improvement than GiGPO as K (the maximum context length for hierarchical grouping) increases from 2 to 4.
- ALFWorld (Qwen2.5-1.5B): GiGPO In-Success goes from 85.42% (K=2) to 85.15% (K=4), a slight decrease. In contrast, HGPO In-Success improves from 89.58% (K=2) to 92.45% (K=4).
- WebShop (Qwen2.5-1.5B): GiGPO Task Scores improve from 84.52% (K=2) to 88.5% (K=4), and HGPO Task Scores improve from 87.53% (K=2) to 88.90% (K=4); both improve, but HGPO maintains its lead. The paper attributes this to GiGPO's limitation: as K increases, prompt inconsistency becomes more severe within step-level (0-context) groups, leading to increasingly biased advantage estimates and limiting performance gains. HGPO, by contrast, directly mitigates this inconsistency through its hierarchical grouping mechanism and adaptive weighting, which emphasizes steps with consistent contexts. This result strongly supports HGPO's core hypothesis that addressing context inconsistency is crucial for leveraging longer contexts effectively.
6.1.3. HGPO Exhibits Better Generalization on Out-of-Distribution Tasks
The Out-Success metric on ALFWorld provides insight into generalization capabilities.
- All baseline methods experience significant performance degradation on out-of-distribution tasks. For Qwen2.5-1.5B, GRPO drops from 72.8% In-Success to 70.1% Out-Success, and GiGPO (K=4) drops from 85.15% to 80.98%.
- HGPO maintains superior performance with less degradation: for Qwen2.5-1.5B, HGPO (K=4) achieves 92.45% In-Success and 89.06% Out-Success. This smaller drop, compared with GiGPO's larger gap, suggests that HGPO's more robust and stable advantage estimation (by handling context inconsistency) yields policy updates that generalize better to unseen task variations.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Ablation Study
The paper conducts an ablation study (Table 2) to evaluate the effectiveness of HGPO's core components: hierarchical grouping and adaptive weighting. The base configuration for this study is HGPO (K=2) with Qwen2.5-1.5B-Instruct.
The following are the results from Table 2 of the original paper:
| Ablation | ALFWorld (%) | WebShop (%) |
|---|---|---|
| HGPO | 89.58 ± 0.45 | 72.66 ± 1.78 |
| W/o HoG-1 | 13.50 ± 0.58 | 10.13 ± 1.42 |
| W/o HoG-2 | 86.47 ± 1.89 | 57.94 ± 1.02 |
| W/o Ada. Weighting | 87.23 ± 1.80 | 68.48 ± 0.45 |
- W/o HoG-1 (Without Hierarchical Grouping, using only Oracle steps):
  - Description: This setting removes hierarchical grouping entirely and optimizes the policy using only Oracle steps (i.e., steps that match perfectly in current state and full historical context).
  - Result: Performance drops dramatically to 13.50% on ALFWorld and 10.13% on WebShop, essentially failed policy learning.
  - Analysis: This confirms the paper's initial empirical finding that Oracle steps are too scarce (as shown in Figure 2(c) and (d) and Table 5) to form enough groups for effective policy optimization; relying solely on them is extremely data-inefficient.
- W/o HoG-2 (Oracle Advantages for Oracle Steps, Step-level for Others):
  - Description: In this configuration, Oracle steps are identified and their advantages are computed within their small Oracle groups; for all other steps (the majority), step-level advantages are computed using 0-context groups (similar to GiGPO).
  - Result: Performance drops to 86.47% on ALFWorld (from 89.58%) and 57.94% on WebShop (from 72.66%), with a particularly severe drop on WebShop.
  - Analysis: Despite using more data than W/o HoG-1, this variant suffers from the high variance introduced by the small group sizes of Oracle steps. Oracle steps have low bias, but their scarcity makes their advantage estimates highly unstable, which degrades the overall optimization. This validates the need for hierarchical grouping to balance bias and variance by incorporating larger, less consistent groups when Oracle steps are too few.
- W/o Ada. Weighting (Without Adaptive Weighting):
  - Description: This ablation replaces adaptive weighting with uniform weights (equivalent to setting α = 0 in Eq. 7), so all hierarchical groups contribute equally to the final advantage estimate.
  - Result: Performance declines to 87.23% on ALFWorld (from 89.58%) and 68.48% on WebShop (from 72.66%).
  - Analysis: This demonstrates the importance of adaptive weighting. Without prioritizing higher-level groups (those with more consistent historical contexts and thus lower bias), uniform weighting dilutes the more accurate advantage information and increases bias in the aggregated estimate. Adaptive weighting balances the bias-variance trade-off by giving more credence to the less biased estimates from high-consistency contexts.

Collectively, these ablation studies confirm that both context-aware hierarchical grouping and adaptive weighting advantage estimation are critical and synergistic components of HGPO, essential for its superior performance.
6.2.2. Parameter Analysis
The paper investigates the effect of the weighting coefficient α in Eq. (7), which controls the sharpness of the weight distribution across the hierarchical groups.
The following are the results from Table 4 of the original paper:
| Parameter | α=0 | α=1 | α=2 |
|---|---|---|---|
| ALFWorld | 87.23 ± 1.80 | 89.58 ± 0.45 | 84.76 ± 1.17 |
| WebShop | 68.48 ± 0.45 | 72.66 ± 1.78 | 72.65 ± 1.77 |
- Results:
  - Setting α = 0 (which corresponds to uniform weighting, as in the W/o Ada. Weighting ablation) results in lower performance (87.23% on ALFWorld, 68.48% on WebShop) compared to α = 1.
  - Setting α = 1 yields the best performance (89.58% on ALFWorld, 72.66% on WebShop).
  - Increasing α further to 2 leads to a drop on ALFWorld (84.76%) and similar performance on WebShop (72.65%) compared to α = 1.
- Analysis: This finding suggests that adaptive weighting is beneficial and that an intermediate value of α works best. A higher α (e.g., α = 2) puts too much emphasis on higher-level groups (greater context depth), which, while less biased, are smaller and thus higher-variance. A lower α (e.g., α = 0) gives too much weight to lower-level groups (smaller context depth), which carry more bias due to context inconsistency. The sweet spot at α = 1 indicates a good balance, prioritizing less biased estimates without excessively inflating variance. The paper notes that extensive parameter tuning is not required and that α = 1 provides robust performance across tasks.
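Since Eq. (7) is not reproduced in this summary, the sketch below only illustrates one plausible form of such adaptive weighting: a softmax over context levels governed by α, which reduces to uniform weights at α = 0 and concentrates on deeper, more consistent levels as α grows. The function names and the exp(α·ℓ) form are assumptions for illustration, not the paper's exact definition.

```python
import numpy as np

def adaptive_weights(levels: np.ndarray, alpha: float) -> np.ndarray:
    """Softmax-style weights over hierarchy levels.

    Assumed form for illustration: w_l ∝ exp(alpha * l), so alpha = 0 yields
    uniform weights and larger alpha emphasizes deeper (more consistent) levels.
    """
    logits = alpha * levels.astype(float)
    logits -= logits.max()            # subtract the max for numerical stability
    w = np.exp(logits)
    return w / w.sum()

def aggregate_advantage(level_advantages: np.ndarray, alpha: float) -> float:
    """Combine one step's per-level group advantages into a single estimate."""
    levels = np.arange(len(level_advantages))
    return float(adaptive_weights(levels, alpha) @ level_advantages)

# Example: advantages of one step under its 0-, 1-, and 2-context groups (made-up values)
adv_per_level = np.array([0.10, 0.35, 0.60])
for alpha in (0.0, 1.0, 2.0):
    w = adaptive_weights(np.arange(3), alpha)
    print(f"alpha={alpha}: weights={w.round(3)}, aggregated={aggregate_advantage(adv_per_level, alpha):.3f}")
```

Running the toy loop shows the weights moving from (0.333, 0.333, 0.333) at α = 0 toward the deepest level at α = 2, matching the qualitative bias-variance behavior described in the analysis above.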
6.2.3. Training Dynamics
Figures 4 and 8 (Appendix D.3) illustrate the training dynamics of GRPO, GiGPO, and HGPO across several metrics. The paper claims HGPO achieves more stable and efficient policy optimization.
The following figure (Figure 4 from the original paper) shows the training dynamics of HGPO, GiGPO, and GRPO on ALFWorld:
(Image description: charts showing how KL Loss, Policy Gradient Clip, Policy Gradient Loss, Mean Advantage, Mean Reward, and Episode Success Rate evolve during training for HGPO, GiGPO, and GRPO, reflecting HGPO's superior performance.)
Figure 4: Training dynamics of HGPO (Red), GiGPO (Yellow), and GRPO (Purple) on ALFWorld using Qwen2.5-1.5B-Instruct. The details of these metrics are shown in Appendix D.3.
The following figure (Figure 8 from the original paper) shows the training dynamics of HGPO, GiGPO, and GRPO on WebShop:
(Image description: charts showing how KL Loss, Policy Gradient Clip, Policy Gradient Loss, Mean Advantage, Mean Reward, and Episode Success Rate change over training epochs for HGPO, GiGPO, and GRPO.)
Figure 8: Training dynamics of HGPO (Red), GiGPO (Yellow), and GRPO (Blue) on WebShop using Qwen2.5-1.5B-Instruct. Best viewed in color.
- Policy Gradient Clip Fraction: HGPO (red curve) maintains a moderate level, indicating stable training. In contrast, GiGPO (yellow) and GRPO (purple/blue) display higher clip fractions, suggesting instability and frequent constraint violations due to aggressive or noisy updates.
- KL Loss: GRPO's curve is too low, implying slow learning and insufficient policy exploration. GiGPO's curve is relatively high, suggesting an overly aggressive learning process that might lead to policy divergence. HGPO achieves a balanced trajectory, demonstrating steady and stable policy learning.
- Mean Advantages, Policy Gradient Loss, Mean Reward, Episode Success Rate: While specific numerical values vary, the trends generally show HGPO maintaining more consistent positive mean advantages, smoother loss curves, and higher mean rewards and success rates, both earlier in training and at convergence, reflecting its more effective advantage estimation and policy updates.
6.2.4. Distribution of Hierarchical Group Sizes
Figure 5 and Figure 9 (Appendix D.4) visualize the distributions of hierarchical group sizes.
The following figure (Figure 5 from the original paper) shows the distribution of hierarchical group sizes (context depths 0-2) on ALFWorld and WebShop:
(Image description: a bar chart with six subplots showing the group-size distributions of 0-Context, 1-Context, and 2-Context groups in the ALFWorld and WebShop environments; group sizes are binned from 0-2 up to ≥30, reflecting the relationship between historical-context consistency and group size.)
Figure 5: The distributions of hierarchical group sizes on ALFWorld and WebShop using Qwen2.5-1.5B-Instruct. The Y-axis denotes the ratio.
The following figure (Figure 9 from the original paper) shows the distribution of hierarchical group sizes (context depths 0-4) on ALFWorld and WebShop:
(Image description: histograms showing, for each of the 0- to 4-context groups, the distribution of group sizes on ALFWorld and WebShop; the x-axis is group size and the y-axis is the ratio, illustrating how historical-context consistency shapes the hierarchical groups.)
Figure 9: The distributions of hierarchical group sizes on ALFWorld and WebShop using Qwen2.5-1.5B-Instruct.
- Observations:
  - 0-context groups (steps sharing only the current state) tend to have a higher proportion of large group sizes. This is expected, since they have the least stringent matching criteria.
  - As the context depth increases (e.g., from 0-context to 1-context to 2-context to 4-context), the proportion of large groups decreases and smaller groups become more frequent.
  - For 2-context and especially 4-context groups, a significant portion of steps fall into very small group sizes (e.g., 0-2 members).
- Analysis: This empirically confirms the paper's claim that Oracle steps (those with identical current states and full historical contexts) are generally scarce and form smaller groups. This scarcity would lead to high variance if relied upon exclusively (the W/o HoG-1 ablation). HGPO addresses this by leveraging the abundance of lower-context groups to reduce variance, while using adaptive weighting to reduce bias.
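As a rough illustration of how such group-size statistics can be gathered, the sketch below buckets steps into depth-ℓ context groups with a dictionary keyed on the current state plus the last ℓ history entries, then tallies a group-size histogram. The Step structure and the key format are assumptions made for illustration; the paper's actual grouping implementation is not shown in this summary.

```python
from collections import Counter, defaultdict
from dataclasses import dataclass, field

@dataclass
class Step:
    state: str                                    # current observation/state (serialized)
    history: list = field(default_factory=list)   # prior (action, observation) pairs, oldest first

def context_key(step: Step, depth: int) -> tuple:
    """Group key: the current state plus the last `depth` history entries (assumed format)."""
    suffix = tuple(step.history[-depth:]) if depth > 0 else ()
    return (step.state, suffix)

def group_size_distribution(steps: list, depth: int) -> Counter:
    """Histogram of group sizes when steps are grouped by their depth-context key."""
    groups = defaultdict(list)
    for s in steps:
        groups[context_key(s, depth)].append(s)
    return Counter(len(members) for members in groups.values())

# Toy usage: deeper contexts split steps sharing a current state into smaller groups
steps = [
    Step("kitchen", [("go", "hall"), ("open", "fridge")]),
    Step("kitchen", [("look", "table"), ("open", "fridge")]),
    Step("kitchen", [("go", "hall"), ("take", "apple")]),
]
for depth in range(3):
    print(depth, dict(group_size_distribution(steps, depth)))
# depth 0 -> one group of 3; depth 1 -> groups of 2 and 1; depth 2 -> three singletons
```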
6.2.5. Step Utilization Ratio
Table 5 in Appendix D.2 reports the average proportion of steps allocated to different context groups per rollout.
The following are the results from Table 5 of the original paper:
| Dataset | 0-Context | 1-Context | 2-Context | 3-Context | 4-Context |
|---|---|---|---|---|---|
| ALFWorld (max context depth 2) | 0.97 | 0.75 | 0.52 | - | - |
| ALFWorld (max context depth 4) | 0.98 | 0.77 | 0.54 | 0.34 | 0.19 |
| WebShop (max context depth 2) | 0.92 | 0.64 | 0.44 | - | - |
| WebShop (max context depth 4) | 0.90 | 0.59 | 0.40 | 0.21 | 0.09 |
- Observations:
  - Nearly all steps (90-98%) fall into 0-context groups, meaning most steps share the same current state with at least one other step from the collected rollouts.
  - As the number of historical contexts required for matching increases, the utilization ratio steadily decreases. With a maximum context depth of 4, only 19% of steps in ALFWorld and 9% in WebShop belong to 4-context groups.
- Analysis: This table quantitatively confirms the scarcity of Oracle steps: steps with highly consistent historical contexts are rare. This scarcity necessitates HGPO's approach of using a hierarchy and adaptive weighting to effectively utilize all available steps for advantage estimation, rather than discarding the majority of steps due to context inconsistency or accepting the high variance of sparse Oracle groups.
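For completeness, here is a minimal sketch of how such utilization ratios could be computed, assuming the ratio counts the fraction of steps whose ℓ-context group contains at least two members (consistent with the "shares the same current state with at least one other step" reading above). The key strings are stand-ins for hashed context keys.

```python
from collections import Counter

def utilization_ratio(step_keys_per_depth: dict) -> dict:
    """For each context depth, the fraction of steps whose group key is shared
    with at least one other step (i.e., the step lands in a group of size >= 2)."""
    ratios = {}
    for depth, keys in step_keys_per_depth.items():
        counts = Counter(keys)
        usable = sum(1 for k in keys if counts[k] >= 2)
        ratios[depth] = usable / len(keys) if keys else 0.0
    return ratios

# Toy usage: context keys of five steps at three depths (strings stand in for hashed keys)
keys = {
    0: ["s1", "s1", "s1", "s2", "s2"],             # almost all steps share a 0-context group
    1: ["s1a", "s1a", "s1b", "s2a", "s2b"],        # fewer matches once one history entry must agree
    2: ["s1aX", "s1aY", "s1bX", "s2aX", "s2bY"],   # oracle-like keys rarely repeat
}
print(utilization_ratio(keys))   # -> {0: 1.0, 1: 0.4, 2: 0.0}
```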
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces Hierarchy-of-Groups Policy Optimization (HGPO), a novel group-based Reinforcement Learning (RL) algorithm specifically designed to overcome the context inconsistency problem in long-horizon Large Language Model (LLM) agent training. By proposing context-aware hierarchical advantage estimation and an adaptive weighting scheme, HGPO enables more fine-grained per-step credit assignment while maintaining the efficiency and stability characteristic of group-based RL. The empirical evaluations on ALFWorld and WebShop benchmarks, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, unequivocally demonstrate that HGPO significantly outperforms existing prompt-based agents and prior RL approaches under identical computational constraints. The ablation studies further confirm the crucial role of both hierarchical grouping and adaptive weighting in HGPO's superior performance, attributing it to a favorable bias-variance trade-off in advantage estimation.
7.2. Limitations & Future Work
The authors highlight one specific direction for future work:
- Alternative Strategies for Context Inconsistency: They suggest exploring other methods to handle context inconsistency, such as conditionally controlling trajectories during the rollout stage. This implies a potential shift from post-hoc grouping during advantage estimation to proactive management of trajectories during data collection, which could generate more consistent contexts from the outset.
7.3. Personal Insights & Critique
7.3.1. Strengths
- Clear Problem Identification: The paper clearly identifies context inconsistency as a critical and previously underexplored issue in stepwise group-based RL for long-horizon tasks. The empirical evidence (Figure 2) effectively motivates the proposed solution.
- Elegant Solution: HGPO's two-pronged approach (context-aware hierarchical grouping and adaptive weighting) is elegant and intuitively addresses the bias-variance trade-off. It leverages the strengths of both fine-grained (low bias, high variance) and coarse-grained (high bias, low variance) advantage estimates.
- Computational Efficiency: The offline nature of the hierarchical grouping (using hashmap lookups) is a significant advantage, ensuring that HGPO incurs no substantial extra computational cost during the rollout phase and requires no additional models.
- Strong Empirical Validation: The comprehensive experiments on two challenging benchmarks (ALFWorld and WebShop) with different LLM sizes (1.5B and 7B) provide robust evidence of HGPO's effectiveness. The consistent outperformance over strong baselines (including PPO and GiGPO) is compelling.
- Improved Generalization: HGPO's better performance on out-of-distribution tasks in ALFWorld is a crucial indicator of the quality and robustness of its policy learning.
7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement
- Assumption Justification in Proposition 4.1: While the paper justifies the assumptions that bias decreases and variance increases monotonically with context depth, these remain theoretical conditions. Further empirical validation, or a more rigorous derivation of these monotonicities in diverse long-horizon LLM agent settings, would strengthen the theoretical foundation. In particular, the independence assumption used in the variance calculation may be strong when groups at different levels are not entirely disjoint.
- Sensitivity to the Maximum Context Depth for Grouping: The paper compares hierarchies with maximum context depths of 2 and 4 for HGPO, but already at depth 4 the step utilization ratio for the deepest groups becomes very low (e.g., 9% for 4-context on WebShop). It would be interesting to analyze how performance changes at even larger depths; at some point the variance from extremely small higher-context groups might outweigh the bias reduction, even with adaptive weighting.
- Dynamic or Adaptive Hyperparameters: Currently, the maximum context depth is a fixed hyperparameter and the weighting coefficient α is chosen empirically. Could the depth itself be dynamically determined from the observed context consistency within a batch? Or could α be made adaptive (e.g., annealed or learned) during training to fine-tune the bias-variance trade-off?
- Impact of Sparse Delayed Rewards: The paper focuses on sparse delayed rewards. How would HGPO perform in environments with dense rewards or different reward shaping strategies? The impact of context inconsistency might differ when more frequent reward signals are available.
- Alternative Policy Optimization Algorithms: HGPO integrates with a PPO-like objective. Investigating its compatibility and benefits when combined with other policy optimization algorithms (e.g., SAC, or DQN for discrete actions, if adapted) could broaden its applicability.
- Interpretability of Hierarchical Groups: While the concept is clear, further analysis of which types of historical contexts are most commonly shared at different hierarchical levels could provide deeper insight into the nature of context consistency in agentic tasks.
- Memory Footprint and Speed: While offline grouping using hashmaps is efficient, for extremely long trajectories or very large batch sizes, the memory footprint and lookup times for constructing and querying all hierarchical groups could still become a factor. A detailed analysis of the computational complexity as a function of trajectory length, batch size, and maximum context depth would be valuable.
7.3.3. Transferability and Future Value
The core idea of handling context inconsistency through hierarchical grouping and adaptive weighting is highly transferable. This framework could be applied to:
- Other Sequential Decision-Making Problems: Beyond LLM agents, any RL task where historical context matters and group-based advantage estimation is used could potentially benefit from HGPO's approach.
- Multi-Agent RL: In multi-agent systems, context consistency across agents' observations and actions could be crucial, and HGPO's principles might be adapted accordingly.
- Long-Context Models in General: The general principle of "weighting by context consistency" could be applied to other long-context models beyond RL, wherever information from diverse historical windows must be aggregated.

Overall, HGPO is a significant step forward in making group-based RL more robust and effective for complex long-horizon agentic tasks, laying a strong foundation for future research in context-aware policy optimization.