
Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks

Published: 10/08/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper proposes HGPO, which addresses context inconsistency in long-horizon agentic tasks through hierarchical grouping and adaptive advantage aggregation, achieving a better bias-variance trade-off and outperforming existing RL methods without extra models or rollouts.

Abstract

Under review as a conference paper at ICLR 2026 (anonymous authors, double-blind review).

Group-based reinforcement learning (RL), such as GRPO, has advanced the capabilities of large language models on long-horizon agentic tasks. To enable more fine-grained policy updates, recent research has increasingly shifted toward stepwise group-based policy optimization, which treats each step in a rollout trajectory independently while using a memory module to retain historical context. However, we find a key issue in estimating stepwise relative advantages, namely context inconsistency, where steps within the same group may differ in their historical contexts. Empirically, we reveal that this issue can lead to severely biased advantage estimation, thereby degrading policy optimization significantly. To address the issue, in this paper, we propose Hierarchy-of-Groups Policy Optimization (HGPO).

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Hierarchy-of-Groups Policy Optimization for Long-Horizon Agentic Tasks". It focuses on developing a novel reinforcement learning algorithm to improve the performance of Large Language Models (LLMs) in complex, multi-step environments.

1.2. Authors

The authors are listed as "Anonymous authors", indicating that the paper is currently under double-blind review. This practice is common in academic conferences to ensure impartial evaluation.

1.3. Journal/Conference

The paper is hosted on OpenReview, with a listed publication date of 2025-10-08. This platform is widely used for submissions to top-tier machine learning conferences such as ICLR, NeurIPS, and ICML, where papers undergo rigorous peer review. The "double-blind review" status further confirms that it is a submission to such a venue; the extracted header indicates it is under review for ICLR 2026.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

The paper introduces Hierarchy-of-Groups Policy Optimization (HGPO), a new reinforcement learning (RL) algorithm designed to enhance the capabilities of Large Language Models (LLMs) in long-horizon agentic tasks. It identifies a key problem in existing stepwise group-based policy optimization methods, termed context inconsistency, where steps within the same group may have different historical contexts, leading to biased advantage estimation and degraded policy optimization. HGPO addresses this by assigning each step to multiple hierarchical groups based on the consistency of historical contexts. It then computes distinct advantages within each group and combines them using an adaptive weighting scheme. This approach aims to achieve a favorable bias-variance trade-off in stepwise advantage estimation without requiring additional models or rollouts. Empirical evaluations on challenging agentic tasks, ALFWorld and WebShop, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, demonstrate that HGPO significantly outperforms current agentic RL methods under the same computational constraints.

The original source link is https://openreview.net/forum?id=T8Dev99qnz, and the PDF link is https://openreview.net/pdf?id=T8Dev99qnz. The paper is currently under double-blind review, meaning it has been submitted to a conference or journal and has not yet been formally accepted or published.

2. Executive Summary

2.1. Background & Motivation

The proliferation of Large Language Models (LLMs) has opened new avenues for developing versatile agentic systems capable of perceiving, reasoning, and acting in complex environments. These LLM agents are increasingly applied to long-horizon tasks, such as embodied navigation, web browsing, and interactive gaming, which demand robust planning and decision-making over extended sequences of interactions.

The core problem this paper aims to solve lies within the post-training stage of enhancing these LLM agents using reinforcement learning (RL). While group-based RL methods (e.g., GRPO) have shown promise in single-turn tasks due to their computational efficiency, extending them to multi-turn long-horizon tasks presents significant challenges. Traditional trajectory-wise policy optimization (where the entire interaction history is concatenated) suffers from context explosion, as the input context length grows rapidly, limiting scalability.

To mitigate this, recent research has shifted towards stepwise policy optimization, which treats each step independently while using a memory module to retain historical context. However, this paper identifies a critical issue in stepwise group-based RL: context inconsistency. This occurs when steps grouped together for advantage estimation (because they share the same current state) originate from trajectories with different historical contexts. This inconsistency leads to severely biased advantage estimation, which in turn degrades the effectiveness of policy optimization. Existing solutions, such as using only "Oracle" steps (steps with identical current state and historical context), are highly inefficient due to their scarcity and introduce high variance due to small group sizes.

The paper's entry point is to directly tackle this context inconsistency problem within stepwise group-based RL, aiming to develop a more fine-grained and reliable advantage estimator that can balance bias and variance efficiently.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  1. Revealing Context Inconsistency: The paper rigorously identifies and empirically demonstrates the issue of context inconsistency in stepwise group-based RL. It shows that this inconsistency leads to significant bias in advantage estimation, directly hindering policy optimization. Through pilot studies (Figure 2), it quantifies the substantial estimation bias in both trajectory-level and step-level advantages compared to Oracle advantages, emphasizing the detrimental impact of differing historical contexts.
  2. Proposing Hierarchy-of-Groups Policy Optimization (HGPO): The paper introduces a novel RL algorithm, HGPO, specifically designed for long-horizon agentic tasks. HGPO employs two key components:
    • Context-aware Hierarchical Grouping: It organizes steps into multiple hierarchical groups based on the consistency of their historical contexts. This allows for a more nuanced comparison of steps, improving data utilization and reducing variance, particularly by allowing steps with varying degrees of context consistency to contribute to estimation.
    • Adaptive Weighting Advantage Estimation: It computes distinct advantages for each hierarchical group and then aggregates them using an adaptive weighting scheme. This scheme prioritizes groups with more consistent historical contexts by assigning them larger weights, thereby effectively reducing estimation bias while maintaining a balanced variance.
  3. Achieving Strong Empirical Performance: HGPO achieves state-of-the-art results on two challenging agentic benchmarks: ALFWorld and WebShop. Using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct as base models, HGPO consistently and significantly outperforms existing agentic RL methods (including PPO, RLOO, GRPO, and GiGPO) under identical computational constraints (GPU memory, LLM rollouts, and minimal additional time cost). The results also highlight HGPO's better generalization capabilities on out-of-distribution tasks.

3. Prerequisite Knowledge & Related Work

This section provides foundational knowledge and contextualizes the paper within the broader research landscape, explaining key concepts and prior works necessary for a comprehensive understanding.

3.1. Foundational Concepts

  • Large Language Models (LLMs): LLMs are advanced artificial intelligence models, typically based on the transformer architecture, trained on vast amounts of text data. They can understand, generate, and process human language, making them capable of a wide range of tasks, including conversation, summarization, translation, and code generation. In the context of agentic tasks, LLMs serve as the "brain" of an agent, interpreting observations, reasoning about situations, and generating actions. Examples include GPT-4o, Gemini-2.5-Pro, and Qwen2.5-Instruct.

  • Reinforcement Learning (RL): Reinforcement Learning is a paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent takes actions in different states of the environment and receives rewards or penalties. The goal of the agent is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. Key components include:

    • Agent: The learner and decision-maker. In this paper, it's an LLM.
    • Environment: The world the agent interacts with. Examples are ALFWorld and WebShop.
    • State ($s_t$): A description of the current situation in the environment at time $t$.
    • Action ($a_t$): A decision made by the agent at time $t$ to interact with the environment.
    • Reward ($r_t$): A scalar feedback signal from the environment indicating the desirability of the agent's action. In long-horizon tasks, rewards are often sparse (given only at the end) and delayed.
    • Policy ($\pi_\theta$): The agent's strategy, parameterized by $\theta$, that dictates how it chooses actions given states.
    • Trajectory ($\tau$): A sequence of states, actions, and rewards from an interaction episode, e.g., $\{(\boldsymbol{s}_1, \boldsymbol{a}_1, r_1), \ldots, (\boldsymbol{s}_T, \boldsymbol{a}_T, r_T)\}$.
  • Long-horizon Agentic Tasks: These are complex tasks that require an LLM agent to perform multiple sequential steps or turns to achieve a goal. Unlike single-turn tasks (e.g., generating a single response), long-horizon tasks involve sustained interaction, planning, reasoning, and memory to navigate dynamic environments. Examples include controlling a robot, browsing the web, or solving multi-step puzzles.

  • Policy Optimization: The process in RL of iteratively updating the agent's policy ($\pi_\theta$) to improve its performance (i.e., maximize expected cumulative reward). This typically involves collecting trajectories by interacting with the environment using the current policy, then using these experiences to calculate gradients that guide the update of the policy parameters $\theta$.

  • Advantage Estimation: In policy gradient methods, advantage functions are used to reduce the variance of gradient estimates, which makes RL training more stable and efficient. The advantage function for a state-action pair $(s, a)$ tells us how much better or worse taking action $a$ in state $s$ is compared to the average action taken from state $s$. Formally, the advantage function $A(s, a)$ is often defined as $Q(s, a) - V(s)$, where $Q(s, a)$ is the action-value function (expected cumulative reward starting from state $s$, taking action $a$, and then following the policy), and $V(s)$ is the state-value function (expected cumulative reward starting from state $s$ and following the policy). In group-based RL, advantages are estimated directly from sampled rewards within groups, often without explicitly learning value functions.

  • Group-based Reinforcement Learning: A class of RL algorithms that estimate advantages (or relative returns) by comparing the rewards of samples within a group of collected trajectories. Instead of using a separate value network (like in PPO), these methods leverage statistical comparisons across group members to determine which actions were relatively better or worse. This can often lead to computational efficiencies and robustness.

  • Context: In sequential decision-making, context refers to the historical information or past observations and actions that influence the agent's understanding of the current state and its future decisions. Maintaining consistent and relevant context is critical for effective planning and reasoning in long-horizon tasks.

  • Bias-Variance Trade-off: A fundamental concept in machine learning and statistics. Bias refers to the error introduced by approximating a real-world problem, or by using a simplified model. High bias can cause the model to miss the relevant relations between features and target outputs (underfitting). Variance refers to the model's sensitivity to small fluctuations in the training data. High variance can cause the model to model the random noise in the training data rather than the intended outputs (overfitting). In advantage estimation, methods aim to reduce bias (accuracy of estimation) while keeping variance (stability of estimation) low.

3.2. Previous Works

The paper contextualizes HGPO by discussing several relevant lines of research:

  • LLM-based Decision-Making Agents: Early approaches relied on prompting strategies without further training.

    • ReAct (Yao et al., 2023): Integrates reasoning (using chain-of-thought) and acting (selecting actions) in an interleaved manner. The LLM generates thoughts and then actions based on observations.
    • Reflexion (Shinn et al., 2024): Extends ReAct by incorporating self-reflection and iterative improvement. The agent generates a trajectory, reflects on its outcome, and uses this reflection to improve future trajectories.
    • These methods are simple but often limited by the pre-trained model's knowledge and lack domain-specific adaptation.
  • Reinforcement Learning for LLM-based Agents: RL offers a powerful paradigm for adapting LLMs to dynamic environments post-training.

    • DQN (Mnih et al., 2015): A value-based RL algorithm that uses deep neural networks to approximate the Q-value function. Applied to text games.
    • PPO (Schulman et al., 2017): Proximal Policy Optimization is a popular policy gradient algorithm that aims to find a balance between policy optimization and stability by using a clipped objective function and often requires a value network to estimate advantages.
    • RLHF (Ziegler et al., 2019; Ouyang et al., 2022): Reinforcement Learning from Human Feedback is a widely used technique for aligning LLMs with human preferences, often involving training a reward model from human comparisons and then optimizing the LLM with PPO.
    • Group-based RL Algorithms: These methods avoid the need for value networks by estimating advantages directly from sampled groups.
      • RLOO (Kool et al., 2019): REINFORCE Leave-One-Out, a group-based RL approach that estimates each sample's advantage by using the mean reward of the other samples in the group as a baseline.
      • GRPO (Shao et al., 2024): Group Relative Policy Optimization. Originally designed for single-turn tasks, it computes advantages based on the rewards of a group of trajectories.
        • Trajectory-level Advantage (adapted to stepwise for long-horizon tasks): For a given trajectory $\tau_i$, its advantage $A^T(\tau_i)$ is computed relative to the rewards of all trajectories in its group $G_\tau$:
          $$A^{T}(\tau_{i})=\left(R(\tau_{i})-\frac{1}{|G_{\tau}|} \sum_{j \in G_{\tau}} R(\tau_{j})\right) / \sigma_{G_{\tau}}$$
          Here, $R(\tau_i)$ is the total reward of trajectory $\tau_i$, $G_\tau$ is the group of trajectories (e.g., all trajectories generated from the same initial state), $|G_\tau|$ is the number of trajectories in that group, and $\sigma_{G_\tau}$ is the standard deviation of rewards within $G_\tau$. This advantage is assigned to every step in $\tau_i$, which can be too coarse-grained for multi-turn tasks.
      • GiGPO (Feng et al., 2025b): Group-in-Group Policy Optimization. This method extends GRPO for LLM agents by providing finer-grained credit assignment at the step level. It clusters steps with identical current states across trajectories into step-level groups $G_{\tilde{\boldsymbol{s}}_i}$.
        • Step-level Advantage: For a given step with current state $\tilde{\boldsymbol{s}}_i$, its advantage $A^S(\tilde{\boldsymbol{s}}_i)$ is computed relative to the rewards of other steps in its step-level group $G_{\tilde{\boldsymbol{s}}_i}$:
          $$A^{S}(\tilde{\boldsymbol{s}}_{i})=\left(R(\tilde{\boldsymbol{s}}_{i})-\frac{1}{|G_{\tilde{\boldsymbol{s}}_{i}}|} \sum_{j \in G_{\tilde{\boldsymbol{s}}_{i}}} R(\tilde{\boldsymbol{s}}_{j})\right) / \sigma_{G_{\tilde{\boldsymbol{s}}_{i}}}$$
          Here, $R(\tilde{\boldsymbol{s}}_i)$ is the reward associated with step $\tilde{\boldsymbol{s}}_i$ (often a stepwise reward derived from the full trajectory reward), $G_{\tilde{\boldsymbol{s}}_i}$ is the group of steps sharing the same current state $\tilde{\boldsymbol{s}}_i$, $|G_{\tilde{\boldsymbol{s}}_i}|$ is the number of steps in that group, and $\sigma_{G_{\tilde{\boldsymbol{s}}_i}}$ is the standard deviation of rewards within $G_{\tilde{\boldsymbol{s}}_i}$. This method improves granularity but, as the paper points out, suffers from context inconsistency.
  • Long-Horizon Agentic Reinforcement Learning: This area focuses on equipping LLMs with planning, reasoning, and memory capabilities for sustained interaction.

    • Trajectory-wise Policy Optimization: Methods like RAGEN (Wang et al., 2025d) and SearchR1 (Jin et al., 2025a) optimize over full multi-turn rollouts by concatenating states and model outputs. Their limitation is context explosion.
    • Stepwise Policy Optimization: Methods like those by Feng et al. (2025b) and Luo et al. (2025c) treat each step independently, using a memory module to manage historical context. This is where context inconsistency becomes a problem.

3.3. Technological Evolution

The evolution of LLM agents for long-horizon tasks has progressed from simple prompting to sophisticated RL-based fine-tuning. Initially, LLMs were guided by structured prompts (e.g., ReAct, Reflexion) to perform multi-step tasks. While effective, these methods were constrained by the inherent knowledge of the pre-trained model. Reinforcement Learning emerged as a powerful tool to adapt LLMs to specific environments and tasks by learning from feedback. Early RL applications used standard algorithms like DQN and PPO.

For LLM agents, group-based RL gained traction due to its efficiency, avoiding the need for value networks. However, these methods were often designed for single-turn interactions. Adapting them to long-horizon tasks led to two main paradigms: trajectory-wise (which faced context explosion) and stepwise (which introduced context inconsistency). This paper fits into the stepwise policy optimization paradigm, aiming to refine advantage estimation by resolving the context inconsistency issue that GiGPO and similar methods encounter. HGPO represents an advancement in finer-grained credit assignment by leveraging a hierarchical context structure.

3.4. Differentiation Analysis

Compared to prior group-based RL methods for long-horizon agentic tasks, HGPO introduces core innovations:

  1. Addressing Context Inconsistency: This is the primary differentiation. While GiGPO attempts step-level advantage estimation, it still groups steps based only on the current state, ignoring the preceding historical context in the memory module. HGPO explicitly recognizes and addresses this context inconsistency as a source of bias.

  2. Context-aware Hierarchical Grouping: Instead of a single step-level group, HGPO assigns each step to multiple hierarchical groups, where each level of the hierarchy corresponds to a different degree of historical context consistency. This allows for a more nuanced and accurate comparison of steps.

  3. Adaptive Weighting Advantage Estimation: HGPO doesn't just form hierarchical groups; it intelligently combines their advantages using an adaptive weighting scheme. This scheme prioritizes advantages from higher-level groups (those with more consistent historical contexts) by assigning them larger weights, thereby actively reducing bias in the final advantage estimate. This contrasts with simpler aggregation or ignoring inconsistent contexts.

  4. Improved Bias-Variance Trade-off: By explicitly accounting for context consistency and adaptively weighting advantages, HGPO systematically achieves a better bias-variance trade-off than trajectory-level (high bias, low variance) or step-level (potentially high bias due to inconsistency, variable variance) methods. It leverages Oracle-like accuracy when available, while still using less consistent (but more abundant) steps to reduce variance.

  5. Efficiency without Extra Models: HGPO achieves these improvements without introducing extra models or requiring additional data collection/rollouts. The hierarchical grouping process operates offline using only existing rollouts and hashmap lookups, maintaining computational efficiency comparable to its predecessors.

    The pilot study results in Figure 2 clearly illustrate HGPO's motivation: both trajectory-level and step-level advantages (as computed by GRPO and GiGPO, respectively) show significant bias compared to Oracle advantages, and Oracle steps are too scarce for efficient training. HGPO aims to bridge this gap.

4. Methodology

4.1. Principles

The core principle of Hierarchy-of-Groups Policy Optimization (HGPO) is to achieve a more accurate and stable advantage estimation for Large Language Model (LLM) agents in long-horizon agentic tasks by addressing the issue of context inconsistency inherent in existing stepwise group-based reinforcement learning (RL). The intuition is that for a given state-action pair, its value (or advantage) should be compared with other state-action pairs that have not only the same current state but also a similar preceding historical context. By constructing hierarchical groups based on the degree of contextual consistency and adaptively weighting their respective advantage estimates, HGPO can effectively reduce bias (by prioritizing more consistent contexts) while maintaining low variance (by utilizing a broader range of comparisons).

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Setup of Long-Horizon Agentic Tasks

In long-horizon agentic tasks, an LLM agent $\pi_{\theta}$ (parameterized by $\theta$) interacts with an environment over multiple turns to achieve a goal. For a task example $\boldsymbol{x}$, at each turn $t$, the agent observes an environment state $\boldsymbol{s}_t \in \mathcal{S}$ and generates a textual action $\boldsymbol{a}_t \in \mathcal{V}^n$. Here, $\mathcal{V}$ is the token vocabulary, $n$ is the maximum generation length, and $t$ ranges from 1 to $T$ (maximum interaction turns). The paper focuses on a sparse delayed reward setting, where a scalar reward $r_t \in \mathcal{R}$ is provided only at the final step of a trajectory $\tau = \{(\boldsymbol{s}_1, \boldsymbol{a}_1), \ldots, (\boldsymbol{s}_T, \boldsymbol{a}_T)\}$.

4.2.2. Trajectory-wise vs. Stepwise Policy Optimization

  • Trajectory-wise Policy Optimization: This framework concatenates the full interaction history of a trajectory $\tau$ for policy optimization, meaning the policy $\pi_\theta$ conditions its actions on the entire history: $\pi_{\theta}(\boldsymbol{a}_t \mid \boldsymbol{s}_{0:t}, \boldsymbol{x})$. The main drawback is context explosion: as the number of turns $T$ increases, the context length grows rapidly, making RL training computationally expensive and unscalable.
  • Stepwise Policy Optimization: To address context explosion, this framework treats each step within a trajectory independently. It uses a memory module to retain a fixed-length historical context of $K \ll T$ interactions. This keeps the prompt length relatively stable, enabling more scalable RL training.

4.2.3. Group-based Reinforcement Learning and the Issue of Context Inconsistency

Unlike Proximal Policy Optimization (PPO), which uses a value function to estimate advantages, group-based RL algorithms (like GRPO) compute advantages directly from the statistics of a sampled group of trajectories $G_\tau$.

  • Trajectory-level Advantage (GRPO adaptation): When GRPO is adapted to the stepwise setting for long-horizon tasks, it calculates a trajectory-level advantage as:
    $$A^{T}(\tau_{i})=\left(R(\tau_{i})-\frac{1}{|G_{\tau}|} \sum_{j \in G_{\tau}} R(\tau_{j})\right) / \sigma_{G_{\tau}}$$
    Where:

    • $A^{T}(\tau_{i})$ is the trajectory-level advantage for the $i$-th trajectory.
    • $R(\tau_{i})$ is the total reward obtained for trajectory $\tau_i$.
    • $G_{\tau}$ is the group of sampled trajectories (e.g., those starting from the same initial state).
    • $|G_{\tau}|$ is the number of trajectories in the group $G_\tau$.
    • $\sum_{j \in G_{\tau}} R(\tau_{j})$ is the sum of rewards of all trajectories in the group.
    • $\sigma_{G_{\tau}}$ is the standard deviation of rewards within the group $G_\tau$. This method assigns the same advantage value to every step in $\tau_i$, which is coarse-grained and overlooks finer credit assignment.
  • Step-level Advantage (GiGPO adaptation): To address the coarse-grained nature, step-level relative advantage estimators (like in GiGPO) are used. Here, steps with identical current states $\tilde{\boldsymbol{s}}_i$ across group trajectories are clustered into step-level groups $G_{\tilde{\boldsymbol{s}}_i}$, and their advantages are computed as:
    $$A^{S}(\tilde{\boldsymbol{s}}_{i})=\left(R(\tilde{\boldsymbol{s}}_{i})-\frac{1}{|G_{\tilde{\boldsymbol{s}}_{i}}|} \sum_{j \in G_{\tilde{\boldsymbol{s}}_{i}}} R(\tilde{\boldsymbol{s}}_{j})\right) / \sigma_{G_{\tilde{\boldsymbol{s}}_{i}}}$$
    Where:

    • $A^{S}(\tilde{\boldsymbol{s}}_{i})$ is the step-level advantage for a step associated with current state $\tilde{\boldsymbol{s}}_i$.
    • $R(\tilde{\boldsymbol{s}}_{i})$ is the reward (specifically, the stepwise reward $r_t^{(i)}$ as defined later in the paper) associated with step $\tilde{\boldsymbol{s}}_i$.
    • $G_{\tilde{\boldsymbol{s}}_i}$ is the step-level group containing all steps that share the identical current state $\tilde{\boldsymbol{s}}_i$.
    • $|G_{\tilde{\boldsymbol{s}}_i}|$ is the number of steps in that group.
    • $\sum_{j \in G_{\tilde{\boldsymbol{s}}_{i}}} R(\tilde{\boldsymbol{s}}_{j})$ is the sum of rewards of all steps in the group.
    • $\sigma_{G_{\tilde{\boldsymbol{s}}_{i}}}$ is the standard deviation of rewards among steps in $G_{\tilde{\boldsymbol{s}}_i}$. This provides finer-grained credit assignment but introduces the issue of context inconsistency.
  • The Issue of Context Inconsistency: The paper highlights that step-level groups (e.g., $G_{\tilde{\boldsymbol{s}}_i}$) may contain steps that share the same current state but have distinct historical contexts in their memory modules. This is illustrated in Figure 1(b). For example, if two trajectories pass through the same state $\boldsymbol{s}_2$ but arrived there via different sequences of preceding states and actions, their historical contexts are inconsistent. When advantages are estimated from such an inconsistent group, the estimate becomes biased, failing to accurately reflect the true impact of the current state and action given its specific historical context. Empirical evidence in Figure 2 (a) and (b) confirms this bias. A naive solution of using only Oracle groups (steps with identical current state and historical context) is inefficient due to their scarcity and leads to high variance due to small group sizes (Figure 2 (c) and (d)). A small numerical sketch of these group-normalized estimators follows below.
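
To make the two baseline estimators concrete, here is a minimal NumPy sketch of the group-normalized advantage computation; the rollout returns are made-up toy numbers, and the function name is illustrative rather than taken from any released code.

    import numpy as np

    def group_normalized_advantage(rewards, eps=1e-8):
        """(r - mean(group)) / std(group), the GRPO-style group-relative advantage."""
        rewards = np.asarray(rewards, dtype=float)
        return (rewards - rewards.mean()) / (rewards.std() + eps)

    # Trajectory-level advantage (GRPO adaptation): one scalar per trajectory in the
    # group, broadcast to every step of that trajectory.
    trajectory_returns = [10.0, 0.0, 10.0, 0.0]    # toy R(tau_i) for a group of 4 rollouts
    print(group_normalized_advantage(trajectory_returns))

    # Step-level advantage (GiGPO adaptation): stepwise rewards of steps that share the
    # same current state are normalized together, regardless of their histories.
    step_rewards_same_state = [9.0, 0.0, 8.1]      # toy stepwise rewards
    print(group_normalized_advantage(step_rewards_same_state))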

4.2.4. Hierarchy-of-Groups Policy Optimization (HGPO)

HGPO is proposed to address these challenges, comprising two main components: context-aware hierarchical grouping and adaptive weighting advantage estimation. Figure 3 provides an overview.

The following figure (Figure 3 from the original paper) illustrates the HGPO method:

img-2.jpeg: A schematic of the HGPO method consisting of three parts. The left shows a group of rollout trajectories with the states, actions, and rewards at each step; the middle illustrates context-aware hierarchical grouping, where steps are assigned to 0-, 1-, and 2-context groups according to their historical contexts; the right shows adaptive weighted advantage estimation, depicting the weighted computation $A^H(s_2^{(1)})=\sum_{k=0}^2 w_k A_k^H(s_2^{(1)})$. The figure illustrates HGPO's bias-variance trade-off strategy in advantage estimation.

Figure 3: Overview of HGPO. The LLM-based agent interacts with a set of environments initialized from the same state $\boldsymbol{s}_{0}$, producing four group trajectories (states with the same color are identical). HGPO comprises two key components: context-aware hierarchical grouping and adaptive weighted advantage computation. For illustration, consider the state $\boldsymbol{s}_{2}$ (purple). First, HGPO assigns $\boldsymbol{s}_{2}$ to three hierarchical groups according to its historical contexts. Then, it computes the final advantage estimate by adaptively aggregating the weighted advantages from these groups.

4.2.4.1. Context-aware Hierarchical Grouping

This component organizes steps into multi-level groups based on their historical contexts. The intuition is to evaluate the advantage of each step relative to different levels of historical context consistency for more accurate estimates.

  • Defining the k-step context operator: For the $t$-th step in the $i$-th trajectory $\tau_i$, which is given by $\tau_{i}=\{(\boldsymbol{s}_{1}^{(i)}, \boldsymbol{a}_{1}^{(i)}), (\boldsymbol{s}_{2}^{(i)}, \boldsymbol{a}_{2}^{(i)}), \ldots, (\boldsymbol{s}_{T}^{(i)}, \boldsymbol{a}_{T}^{(i)})\}$, and given a maximum context length $K$, a k-step context operator $\mathcal{C}_k$ is defined (Eq. 3) as:
    $$\mathcal{C}_{k}(\boldsymbol{s}_{t}^{(i)})= \begin{cases}(\boldsymbol{s}_{t-k}^{(i)}, \boldsymbol{s}_{t-k+1}^{(i)}, \cdots, \boldsymbol{s}_{t}^{(i)}), & t \geq k, \\ (\boldsymbol{s}_{0}^{(i)}, \boldsymbol{s}_{1}^{(i)}, \cdots, \boldsymbol{s}_{t}^{(i)}), & t<k. \end{cases}$$
    Where:

    • $\mathcal{C}_{k}(\boldsymbol{s}_{t}^{(i)})$ is the k-step historical context ending at state $\boldsymbol{s}_t^{(i)}$.
    • $\boldsymbol{s}_t^{(i)}$ is the state at turn $t$ in trajectory $i$.
    • $k$ is the context depth parameter, ranging from 0 to $K$.
    • $K$ is the maximum context length for grouping. This operator returns the current state $\boldsymbol{s}_t^{(i)}$ together with the $k$ historical states preceding it. If $t < k$, it returns all available states from the start of the trajectory up to $\boldsymbol{s}_t^{(i)}$, where $\boldsymbol{s}_0^{(i)}$ might represent an initial state.
  • Defining the k-th hierarchical group: Based on this operator, the k-th hierarchical group for a given step $(\boldsymbol{s}_t^{(i)}, \boldsymbol{a}_t^{(i)})$ is defined as (Eq. 4):
    $$G_{k}^{H}(\boldsymbol{s}_{t}^{(i)})=\left\{(j, n) \in \mathcal{I}: \mathcal{C}_{k}(\boldsymbol{s}_{t}^{(i)})=\mathcal{C}_{k}(\boldsymbol{s}_{n}^{(j)})\right\}$$
    Where:

    • $G_{k}^{H}(\boldsymbol{s}_{t}^{(i)})$ is the k-th hierarchical group for the step with current state $\boldsymbol{s}_t^{(i)}$.
    • $(j, n)$ represents a step in trajectory $j$ at turn $n$.
    • $\mathcal{I}=\{(i, t) \mid 1 \leq i \leq N, 1 \leq t \leq T\}$ is the index set of all steps across all $N$ trajectories of length $T$.
    • $\mathcal{C}_{k}(\boldsymbol{s}_{t}^{(i)})=\mathcal{C}_{k}(\boldsymbol{s}_{n}^{(j)})$ means that the k-step historical context for step $\boldsymbol{s}_t^{(i)}$ is identical to the k-step historical context for step $\boldsymbol{s}_n^{(j)}$. In simpler terms, $G_k^H(\boldsymbol{s}_t^{(i)})$ contains all steps (from potentially different trajectories) that have the exact same current state $\boldsymbol{s}_t^{(i)}$ AND the exact same k-step history preceding it.
  • Hierarchy Structure: The resulting hierarchy-of-groups structure satisfies (Eq. 5):
    $$G_{0}^{H}(\boldsymbol{s}_{t}^{(i)}) \supseteq G_{1}^{H}(\boldsymbol{s}_{t}^{(i)}) \supseteq \cdots \supseteq G_{K}^{H}(\boldsymbol{s}_{t}^{(i)}), \quad |G_{0}^{H}(\boldsymbol{s}_{t}^{(i)})| \geq \cdots \geq |G_{K}^{H}(\boldsymbol{s}_{t}^{(i)})|$$
    This means:

    • $G_{0}^{H}(\boldsymbol{s}_{t}^{(i)})$ includes all steps that only share the same current state (equivalent to step-level grouping in GiGPO).
    • $G_{1}^{H}(\boldsymbol{s}_{t}^{(i)})$ includes steps that share the same current state AND the same immediately preceding state ($k=1$).
    • $G_{K}^{H}(\boldsymbol{s}_{t}^{(i)})$ includes steps that share the same current state AND the same K-step history. These are the Oracle steps.
    • The group sizes are nested: $|G_{0}^{H}|$ is the largest and $|G_{K}^{H}|$ is the smallest, as requiring more context consistency reduces the number of matching steps. This hierarchy improves step utilization (since even steps with only partial context consistency are used) and reduces variance (by having larger groups at lower $k$ values). The process is fully offline and relies on hashmap lookups; a minimal grouping sketch in code is given after this list.
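
As a concrete illustration of the hashmap-based grouping described above, the following is a minimal sketch, assuming states can be represented as hashable objects (e.g., strings); the function names and data layout are illustrative and not taken from the paper's implementation.

    from collections import defaultdict

    def k_step_context(states, t, k):
        """C_k(s_t): the tuple (s_{t-k}, ..., s_t), truncated at the start of the trajectory."""
        start = max(0, t - k)
        return tuple(states[start:t + 1])

    def build_hierarchical_groups(trajectories, K):
        """trajectories: list of state sequences (hashable states), one per rollout.
        Returns groups[k][context] -> list of (traj_idx, step_idx) sharing that k-step context."""
        groups = [defaultdict(list) for _ in range(K + 1)]
        for i, states in enumerate(trajectories):
            for t in range(len(states)):
                for k in range(K + 1):
                    groups[k][k_step_context(states, t, k)].append((i, t))
        return groups

    # Two toy rollouts that reach the same state "s2" through different histories.
    trajs = [["s0", "s1", "s2"], ["s0", "s3", "s2"]]
    groups = build_hierarchical_groups(trajs, K=2)
    print(groups[0][("s2",)])       # k=0: same current state only -> [(0, 2), (1, 2)]
    print(groups[1][("s1", "s2")])  # k=1: same preceding state too -> [(0, 2)]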

4.2.4.2. Adaptive Weighting Advantage Estimation

The core idea here is that higher-level hierarchical groups (larger kk) provide more accurate advantage comparisons due to stronger context consistency, but they are scarcer. Conversely, lower-level groups (smaller kk) are more abundant but less consistent. HGPO integrates information across all hierarchical groups with adaptive weights to balance bias and variance.

  • Advantage for the k-th hierarchical group: The advantage estimation for the k-th hierarchical group is defined analogously to the step-level advantage (Eq. 6):
    $$A_{k}^{H}(\boldsymbol{s}_{t}^{(i)})=\left(R(\boldsymbol{s}_{t}^{(i)})-\frac{1}{|G_{k}^{H}|} \sum_{(j, n) \in G_{k}^{H}} R(\boldsymbol{s}_{n}^{(j)})\right) / \sigma_{G_{k}^{H}}$$
    Where:

    • $A_{k}^{H}(\boldsymbol{s}_{t}^{(i)})$ is the advantage estimate for step $\boldsymbol{s}_t^{(i)}$ based on its k-th hierarchical group.
    • $R(\boldsymbol{s}_{t}^{(i)})$ is the stepwise reward (defined below) for step $\boldsymbol{s}_t^{(i)}$.
    • $G_{k}^{H}$ is the k-th hierarchical group (as defined by Eq. 4) that $\boldsymbol{s}_t^{(i)}$ belongs to; the mean in the equation is taken over the stepwise rewards of its members $(j, n)$.
    • $|G_{k}^{H}|$ is the size of this k-th hierarchical group.
    • $\sigma_{G_{k}^{H}}$ is the standard deviation of stepwise rewards within $G_{k}^{H}$.
  • Stepwise Reward Calculation: For each step $(\boldsymbol{s}_t^{(i)}, \boldsymbol{a}_t^{(i)})$, its stepwise reward $r_t^{(i)}$ is computed as the discounted sum of future rewards from that step onward (implied by Eq. 7 and common practice in RL):
    $$r_{t}^{(i)}=\sum_{j=t}^{T} \gamma^{j-t} r_{j}^{(i)}$$
    Where:

    • $r_t^{(i)}$ is the stepwise reward for step $t$ in trajectory $i$.
    • $\gamma \in (0,1]$ is the discount factor, which weights immediate rewards more heavily than future rewards.
    • $r_j^{(i)}$ is the instantaneous reward at turn $j$ in trajectory $i$ (which is sparse and delayed in this problem, typically 0 until the final step).
  • Aggregated Advantage: The final advantage for a step $(\boldsymbol{s}_t^{(i)}, \boldsymbol{a}_t^{(i)})$ is aggregated across the $K+1$ hierarchical groups using an adaptive weighting scheme (Eq. 7):
    $$A^{H}(\boldsymbol{s}_{t}^{(i)})=\sum_{k=0}^{K} w_{k} A_{k}^{H}(\boldsymbol{s}_{t}^{(i)})$$
    Where:

    • $A^{H}(\boldsymbol{s}_{t}^{(i)})$ is the final aggregated advantage estimate for the step.
    • $w_k$ is the adaptive weight for the k-th hierarchical group, defined as $w_{k}=\frac{(k+1)^{\alpha}}{\sum_{k'}(k'+1)^{\alpha}}$ with $\alpha \geq 0$. Here, $\alpha$ is a weighting coefficient. A larger $\alpha$ assigns proportionally larger weights to higher-level groups (larger $k$), which have more consistent historical contexts and thus are expected to yield more accurate (less biased) advantage estimates. For example, if $\alpha=0$, all weights $w_k$ are equal (uniform weighting); if $\alpha=1$, $w_k$ increases linearly with $k$. The normalizer $\sum_{k'}(k'+1)^{\alpha}$ in the denominator ensures that the weights sum to 1. This scheme prioritizes accuracy from consistent contexts while still leveraging data from less consistent contexts to control variance. A short code sketch of this weighting and aggregation is given after this list.
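
The following is a minimal sketch of the stepwise discounted reward and the adaptive weighting and aggregation described above; group memberships are assumed to be precomputed (e.g., with the grouping sketch earlier), and all numbers and function names are illustrative.

    import numpy as np

    def discounted_stepwise_rewards(rewards, gamma=0.95):
        """r_t = sum_{j >= t} gamma^(j - t) * r_j, computed backwards over one trajectory."""
        out, running = [], 0.0
        for r in reversed(rewards):
            running = r + gamma * running
            out.append(running)
        return list(reversed(out))

    def adaptive_weights(K, alpha=1.0):
        """w_k proportional to (k + 1)^alpha, normalized to sum to 1."""
        raw = np.array([(k + 1) ** alpha for k in range(K + 1)], dtype=float)
        return raw / raw.sum()

    def aggregated_advantage(step_reward, group_rewards_per_level, alpha=1.0, eps=1e-8):
        """group_rewards_per_level[k]: stepwise rewards of all members of the k-th
        hierarchical group containing this step (including the step itself)."""
        K = len(group_rewards_per_level) - 1
        w = adaptive_weights(K, alpha)
        advantage = 0.0
        for k, rewards in enumerate(group_rewards_per_level):
            r = np.asarray(rewards, dtype=float)
            # Degenerate groups whose members all share the same reward contribute zero,
            # matching the paper's note that zero-advantage groups are omitted.
            advantage += w[k] * (step_reward - r.mean()) / (r.std() + eps)
        return advantage

    print(discounted_stepwise_rewards([0.0, 0.0, 10.0]))   # approximately [9.025, 9.5, 10.0]
    groups = [[9.5, 0.0, 8.0, 0.0], [9.5, 8.0], [9.5]]     # nested groups for one step
    print(aggregated_advantage(9.5, groups, alpha=1.0))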

4.2.4.3. Objective for Policy Optimization

The policy optimization objective for HGPO is a PPO-like objective with a KL divergence penalty (Eq. 8):

$$\mathcal{J}_{\mathrm{HGPO}}(\theta)= \mathbb{E}\left[\frac{1}{N T} \sum_{i=1}^{N} \sum_{t=1}^{T} \min \left(\rho_{\theta}(\boldsymbol{a}_{t}^{(i)}) A^{H}(\boldsymbol{s}_{t}^{(i)}),\ \operatorname{clip}\left(\rho_{\theta}(\boldsymbol{a}_{t}^{(i)}),\, 1-\epsilon,\, 1+\epsilon\right) A^{H}(\boldsymbol{s}_{t}^{(i)})\right)\right] - \beta\, \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}(\cdot \mid x)\, \|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)$$

Where:

  • $\mathcal{J}_{\mathrm{HGPO}}(\theta)$ is the objective function to be maximized for updating the policy parameters $\theta$.
  • $\mathbb{E}[\cdot]$ denotes the expectation over collected trajectories.
  • $N$ is the number of trajectories in a group.
  • $T$ is the maximum length of a trajectory.
  • $\rho_{\theta}(\boldsymbol{a}_{t}^{(i)})=\frac{\pi_{\theta}(\boldsymbol{a}_{t}^{(i)} \mid \boldsymbol{s}_{t}^{(i)}, x)}{\pi_{\theta_{\text{old}}}(\boldsymbol{a}_{t}^{(i)} \mid \boldsymbol{s}_{t}^{(i)}, x)}$ is the importance sampling ratio, comparing the probability of action $\boldsymbol{a}_t^{(i)}$ under the current policy $\pi_{\theta}$ to the old policy $\pi_{\theta_{\text{old}}}$.
  • $A^{H}(\boldsymbol{s}_{t}^{(i)})$ is the aggregated advantage computed by HGPO (from Eq. 7).
  • $\operatorname{clip}(\cdot, 1-\epsilon, 1+\epsilon)$ is the clipping function from PPO, which limits the importance ratio to stay within $[1-\epsilon, 1+\epsilon]$ to prevent excessively large policy updates; $\epsilon$ is the clipping parameter.
  • $\beta$ is the coefficient for the KL divergence penalty.
  • $\mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\right)$ is the Kullback-Leibler (KL) divergence between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\mathrm{ref}}$ (often the old policy $\pi_{\theta_{\text{old}}}$ or a pre-trained LLM). This term regularizes policy updates to prevent them from deviating too far from the reference policy, ensuring stability and preventing catastrophic forgetting.
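
For concreteness, here is a minimal NumPy sketch of the clipped surrogate with a KL penalty for a batch of sampled steps; the per-sample KL term is approximated by a simple log-probability difference against the reference policy, which is an assumption for illustration rather than the paper's exact implementation.

    import numpy as np

    def hgpo_surrogate(logp_new, logp_old, logp_ref, advantages, eps=0.2, beta=0.01):
        """Clipped PPO-style surrogate with a KL penalty toward a reference policy.
        Inputs are 1-D arrays with one entry per sampled step (or per generated token)."""
        logp_new, logp_old, logp_ref, advantages = map(
            lambda x: np.asarray(x, dtype=float), (logp_new, logp_old, logp_ref, advantages))
        ratio = np.exp(logp_new - logp_old)                        # rho_theta
        unclipped = ratio * advantages
        clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantages
        surrogate = np.minimum(unclipped, clipped).mean()
        kl_penalty = np.mean(logp_new - logp_ref)                  # crude sample-based KL estimate
        return surrogate - beta * kl_penalty

    # Toy numbers: two steps with their log-probabilities and aggregated advantages A^H.
    print(hgpo_surrogate(logp_new=[-1.0, -0.5], logp_old=[-1.1, -0.6],
                         logp_ref=[-1.2, -0.7], advantages=[0.8, -0.3]))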

4.2.4.4. Bias-Variance Trade-off in HGPO

Proposition 4.1 formally analyzes how HGPO achieves a favorable bias-variance trade-off under certain conditions. Let $b_k$ and $v_k$ denote the bias and variance of the estimated advantage $A_k^H$ within the $k$-th group $G_k^H$.

The conditions are:

  1. Bias decreases monotonically: $B_T \geq b_0 \geq b_1 \geq \cdots \geq b_K \geq 0$. This means that as context consistency increases (higher $k$), the bias of the advantage estimate decreases. $B_T$ is the bias of the trajectory-level advantage, $b_0$ is the bias of the step-level advantage (0-context), and $b_K$ is the bias of the Oracle advantage (full $K$-context).

  2. Variance increases monotonically and estimates are independent: $v_0 \leq v_1 \leq \cdots \leq v_K \leq V_T$. This means that as context consistency increases (higher $k$), the group size generally decreases, leading to higher variance in the advantage estimate. $V_T$ is the variance of the trajectory-level advantage, $v_0$ is the variance of the step-level advantage, and $v_K$ is the variance of the Oracle advantage. Independence of $A_k^H$ and $A_{k'}^H$ for $k \neq k'$ is assumed for the variance calculation.

    The bias and variance of the aggregated estimator $A^H$ are:

  • Bias of $A^H$ (Eq. 9):
    $$\operatorname{Bias}\left[A^{H}\right]=\operatorname{Bias}\left[\sum_{k=0}^{K} w_{k} A_{k}^{H}\right]=\sum_{k=0}^{K} w_{k} b_{k}$$
    This shows that the bias of the combined advantage is a weighted sum of the individual biases. Since $b_k$ decreases with $k$, assigning larger weights $w_k$ to higher levels yields a lower overall bias. Specifically, the bias satisfies (Eq. 11):
    $$b_{K}=\sum_{k=0}^{K} w_{k} b_{K} \leq \operatorname{Bias}\left[A^{H}\right] \leq \sum_{k=0}^{K} w_{k} b_{0}=b_{0} \leq B_{T}$$
    This inequality shows that the bias of HGPO's advantage estimator is always at most the step-level bias $b_0$ (which is itself at most the trajectory-level bias $B_T$) and at least the Oracle bias $b_K$. HGPO thus interpolates between these extremes, leaning toward lower bias.

  • Variance of $A^H$ (Eq. 10):
    $$\operatorname{Var}\left[A^{H}\right]=\operatorname{Var}\left[\sum_{k=0}^{K} w_{k} A_{k}^{H}\right]=\sum_{k=0}^{K} w_{k}^{2} \operatorname{Var}\left[A_{k}^{H}\right]=\sum_{k=0}^{K} w_{k}^{2} v_{k}$$
    Assuming independence between the advantage estimates from different hierarchical groups, the variance of the combined advantage is the weighted sum of the individual variances. The variance satisfies (Eq. 12):
    $$\frac{1}{K(K+1)^{2\alpha}} v_{0} \leq \sum_{k=0}^{K} w_{k}^{2} v_{0} \leq \operatorname{Var}\left[A^{H}\right] \leq \sum_{k=0}^{K} w_{k}^{2} v_{K} \leq \frac{(K+1)^{2\alpha}}{K} v_{K}$$
    This indicates that HGPO's variance is bounded in terms of the step-level variance $v_0$ and the Oracle variance $v_K$, depending on $K$ and $\alpha$. Through weighting, HGPO reduces the impact of high-variance Oracle estimates (from small groups) while still benefiting from their low bias.

In summary, HGPO provides a principled framework for advantage estimation that systematically leverages historical context to lower bias while maintaining statistical efficiency (balancing variance) through weighted aggregation.
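
A small numerical illustration of Eqs. (9)-(12) may help: the bias and variance values below are hypothetical, chosen only to satisfy the monotonicity conditions of Proposition 4.1, and show how the weighted aggregate interpolates between the step-level and Oracle extremes.

    import numpy as np

    K, alpha = 2, 1.0
    w = np.array([(k + 1) ** alpha for k in range(K + 1)], dtype=float)
    w /= w.sum()                     # weights [1/6, 2/6, 3/6]

    b = np.array([0.6, 0.3, 0.1])    # assumed biases: b_0 >= b_1 >= b_K
    v = np.array([0.2, 0.5, 1.0])    # assumed variances: v_0 <= v_1 <= v_K

    bias_agg = float(w @ b)          # sum_k w_k * b_k = 0.25, between b_K = 0.1 and b_0 = 0.6
    var_agg = float((w ** 2) @ v)    # sum_k w_k^2 * v_k ~ 0.31, well below the Oracle-only v_K = 1.0
    print(bias_agg, var_agg)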

4.2.5. Algorithm Pseudo-code (Algorithm 1)

The following describes the HGPO training process as outlined in Algorithm 1 in Appendix A:

Algorithm 1: The pseudo-code of HGPO

    Require: Initial policy π_θ_old, task distribution p(X), discount factor γ, weighting ω, clipping
        parameter ε, KL penalty β, group size N, the length of historical context K, parameter α
    for each training iteration do
        Update the old policy model: θ_old ← θ
        // Multi-step rollout phase
        Sample task x ~ p(X) and initialize N identical environments
        for t=1 to T do
            Sample actions {a_t^(i) ~ π_θ_old(⋅ | s_t^(i), x)}_i=1^N
            Execute actions, observe rewards {r_t^(i)}_i=1^N and next states {s_t+1^(i)}_i=1^N
        end for
        // Grouping phase
        Context-aware hierarchical grouping by Eq. (5)
        // Advantage computation phase
        Compute multiple advantages within each group by Eq. (7)
        // Policy update phase
        Update policy θ by maximizing objective J_HGPO(θ)
    end for

Step-by-step Explanation:

  1. Initialization: The algorithm starts with an initial policy $\pi_{\theta_{\text{old}}}$. It requires several hyperparameters: a task distribution p(X), discount factor $\gamma$, weighting coefficient $\alpha$, clipping parameter $\epsilon$, KL penalty coefficient $\beta$, group size $N$, and maximum historical context length $K$.

  2. Training Loop: The core process repeats for a fixed number of training iterations.

  3. Update Old Policy: At the beginning of each iteration, the old policy $\theta_{\text{old}}$ is updated to the current policy $\theta$. This is crucial for PPO-like algorithms to ensure the importance sampling ratio (Eq. 8) is calculated against a fixed reference.

  4. Multi-step Rollout Phase:

    • A task $x$ is sampled from the task distribution p(X).
    • $N$ identical environments are initialized for this task. This means all $N$ trajectories start from the same initial state and are aimed at the same goal.
    • For each turn $t$ from 1 to $T$ (maximum turns):
      • The agent (using the old policy $\pi_{\theta_{\text{old}}}$) samples $N$ actions $\{\boldsymbol{a}_t^{(i)}\}_{i=1}^N$, one for each of the $N$ parallel environments, conditioned on the current state $\boldsymbol{s}_t^{(i)}$ and the task $x$.
      • These actions are executed in their respective environments.
      • The rewards $\{r_t^{(i)}\}_{i=1}^N$ and the next states $\{\boldsymbol{s}_{t+1}^{(i)}\}_{i=1}^N$ are observed. This generates $N$ rollout trajectories.
  5. Grouping Phase:

    • After all rollouts are collected, context-aware hierarchical grouping is performed for all steps within these $N$ trajectories according to Eq. (5). This means that for each step $(\boldsymbol{s}_t^{(i)}, \boldsymbol{a}_t^{(i)})$, it identifies all steps (from any of the $N$ trajectories) that share the same current state $\boldsymbol{s}_t^{(i)}$ and also match different lengths of historical context (from $k=0$ to $K$). This process forms the hierarchical groups $G_k^H(\boldsymbol{s}_t^{(i)})$.
  6. Advantage Computation Phase:

    • For every step $(\boldsymbol{s}_t^{(i)}, \boldsymbol{a}_t^{(i)})$ in the collected rollouts, multiple advantages $A_k^H(\boldsymbol{s}_t^{(i)})$ are computed within each of its corresponding hierarchical groups using Eq. (6).
    • Then, these group advantages are aggregated using the adaptive weighting scheme (Eq. 7) to obtain the final advantage estimate $A^H(\boldsymbol{s}_t^{(i)})$ for that step.
  7. Policy Update Phase:

    • The policy parameters $\theta$ are updated by maximizing the HGPO objective function $\mathcal{J}_{\mathrm{HGPO}}(\theta)$ as defined in Eq. (8). This optimization typically involves gradient ascent using an optimizer (e.g., Adam).

      This process iteratively refines the LLM agent's policy by leveraging hierarchical context-aware advantage estimation to make more informed and stable updates.
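
To show how these phases fit together, here is a heavily simplified, runnable sketch of the rollout loop with toy stand-ins: `sample_action` and `env_step` below are hypothetical placeholders for the LLM policy and the ALFWorld/WebShop environments, and the grouping, advantage, and update phases are only indicated by comments referring to the helper sketches above.

    import random

    def sample_action(state):                  # placeholder for a ~ pi_theta_old(. | s, x)
        return random.choice(["look", "take", "put"])

    def env_step(state, action):               # placeholder environment transition
        next_state = f"{state}|{action}"
        reward = 10.0 if action == "put" else 0.0
        done = action == "put"
        return next_state, reward, done

    def rollout_group(N=4, T=5, initial_state="s0"):
        """Multi-step rollout phase: N parallel episodes from the same initial state."""
        trajectories = []
        for _ in range(N):
            state, states, rewards = initial_state, [], []
            for _ in range(T):
                action = sample_action(state)
                next_state, reward, done = env_step(state, action)
                states.append(state)
                rewards.append(reward)
                state = next_state
                if done:
                    break
            trajectories.append((states, rewards))
        return trajectories

    trajs = rollout_group()
    # Grouping phase: build_hierarchical_groups([s for s, _ in trajs], K=...)
    # Advantage phase: aggregated_advantage(...) per step, using discounted stepwise rewards.
    # Policy update phase: maximize hgpo_surrogate(...) with gradient ascent.
    print([len(states) for states, _ in trajs])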

5. Experimental Setup

5.1. Datasets

The paper evaluates HGPO on two challenging agentic benchmarks designed to assess LLM agents' ability to perform multi-step decision-making.

  • ALFWorld (Shridhar et al., 2021):

    • Description: ALFWorld is a text-based embodied environment where an agent receives a text goal and must accomplish it through multi-turn interaction. It simulates household activities.
    • Scale and Characteristics: It includes 4,639 task instances across six categories of common household activities: Pick & Place (Pick), Examine in Light (Look), Clean & Place (Clean), Heat & Place (Heat), Cool & Place (Cool), and Pick Two & Place (Pick2). The environment involves navigation, object interaction, and logical reasoning.
    • Example of a data sample: A task might be "Go to the kitchen, find a potato, put it in the microwave, and then place it in the fridge." The agent would perceive textual descriptions of its surroundings and generate textual actions like go to kitchen, look for potato, take potato from countertop, put potato in microwave, heat potato, open fridge, put potato in fridge.
    • Choice Justification: ALFWorld is widely used to test long-horizon planning and robust decision-making in a textual embodied setting, making it suitable for evaluating LLM agents. It includes both in-distribution and out-of-distribution tasks, allowing for generalization assessment.
  • WebShop (Yao et al., 2022):

    • Description: WebShop is a complex, web-based interactive environment designed to test LLM agents in realistic online shopping scenarios. The agent interacts with a simulated HTML-based shopping website to search for, navigate to, and ultimately purchase a suitable item.
    • Scale and Characteristics: It contains over 1.1 million products and 12,000 user instructions, providing a rich and diverse action space and observation space (web page content). This environment requires understanding natural language instructions, interpreting web page layouts, making navigation decisions, and formulating search queries.
    • Example of a data sample: A task might be "Find a pair of headphones under $50 with noise cancellation and add it to your cart." The agent would observe HTML elements, execute actions like click search bar, type "headphones", click search button, click filter "noise cancellation", set price range "0-50", click product "XYZ Headphones", click "add to cart".
    • Choice Justification: WebShop presents a highly challenging long-horizon agentic task in a real-world-like web environment, demanding strong reasoning, planning, and interaction capabilities from LLM agents.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the performance of LLM agents on ALFWorld and WebShop, as well as internal RL training dynamics.

  • ALFWorld Metrics:

    • Overall Success Rate (↑):
      • Conceptual Definition: This metric measures the percentage of all tasks (across both in-distribution and out-of-distribution categories) where the agent successfully completes the given goal within the environment. A higher percentage indicates better overall performance.
      • Mathematical Formula: $\text{Overall Success Rate} = \frac{\text{Number of Successful Tasks}}{\text{Total Number of Tasks Attempted}} \times 100\%$
      • Symbol Explanation:
        • Number of Successful Tasks: The count of task instances where the agent successfully achieved its stated goal.
        • Total Number of Tasks Attempted: The total count of all task instances that the agent was evaluated on.
    • In-Success (↑):
      • Conceptual Definition: Measures the success rate specifically for task instances that are in-distribution, meaning they are similar in structure or content to tasks seen during training. This assesses the agent's performance on familiar scenarios.
      • Mathematical Formula: Same as Overall Success Rate, but restricted to in-distribution tasks.
      • Symbol Explanation: Same as Overall Success Rate, but Number of Successful Tasks and Total Number of Tasks Attempted refer specifically to in-distribution tasks.
    • Out-Success (↑):
      • Conceptual Definition: Measures the success rate specifically for task instances that are out-of-distribution, meaning they differ significantly from tasks seen during training. This is a crucial metric for evaluating the generalization capability of the agent to novel or unseen situations.
      • Mathematical Formula: Same as Overall Success Rate, but restricted to out-of-distribution tasks.
      • Symbol Explanation: Same as Overall Success Rate, but Number of Successful Tasks and Total Number of Tasks Attempted refer specifically to out-of-distribution tasks.
  • WebShop Metrics:

    • Average Task Score (↑):
      • Conceptual Definition: This metric quantifies the average degree of task completion for the WebShop environment. WebShop tasks can sometimes be partially completed, and the task score (typically normalized between 0 and 1) reflects how many sub-goals or requirements of a task were met. A score of 1 indicates full success. The specific calculation method for partial credit can vary but usually involves checking for correct item selection, filters applied, price range, etc. The paper does not provide a specific formula for this, but implies it's a standard metric used in WebShop evaluations.
      • Mathematical Formula: (Not explicitly provided in the paper; assumed to be an environment-specific scoring mechanism.) Generally, for $M$ tasks: $\text{Average Task Score} = \frac{1}{M} \sum_{m=1}^{M} \text{Score}_m$
      • Symbol Explanation:
        • $\text{Score}_m$: The score (e.g., from 0 to 1) achieved on task $m$.
        • $M$: The total number of tasks evaluated.
    • Average Task Success Rate (↑):
      • Conceptual Definition: Measures the percentage of WebShop tasks where the agent achieved full success (e.g., a task score of 1). This is a binary success metric.
      • Mathematical Formula: Same as Overall Success Rate for ALFWorld, but specific to WebShop tasks.
      • Symbol Explanation: Same as Overall Success Rate, but Number of Successful Tasks and Total Number of Tasks Attempted refer specifically to WebShop tasks.
  • Training Dynamics Metrics (for further analysis): These metrics provide insights into the stability and efficiency of the RL training process. The paper describes them conceptually in Appendix C.4:

    • Mean Advantages: Average value of the advantage estimates over a batch of steps. A positive and stable value indicates effective credit assignment.
    • Policy Gradient Loss: The value of the objective function (Eq. 8) that the policy is optimizing. A smooth decrease typically indicates stable learning.
    • KL Divergence: Measures the difference between the new policy and the old policy. A moderate value shows steady learning without overly aggressive updates.
    • Policy Gradient Clip Fraction: The proportion of gradients that are clipped during PPO-style optimization. A moderate fraction suggests stable training; too high implies instability.
    • Mean Reward: The average cumulative reward obtained per episode. A direct indicator of learning progress.
    • Episode Success Rate: The percentage of episodes where the agent successfully completes its task, similar to the main success metrics but tracked during training.
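
As a small sketch of how the headline metrics could be computed from per-task outcomes (the per-task WebShop scoring itself is environment-defined and only assumed here):

    def success_rate(successes):
        """Percentage of tasks fully completed (successes are 0/1 flags)."""
        return 100.0 * sum(successes) / len(successes)

    def average_task_score(scores):
        """Mean of per-task scores in [0, 1], allowing partial credit (WebShop)."""
        return sum(scores) / len(scores)

    # Toy outcomes for four evaluation tasks.
    print(success_rate([1, 0, 1, 1]))                 # 75.0
    print(average_task_score([1.0, 0.4, 1.0, 0.7]))   # 0.775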

5.3. Baselines

HGPO is compared against a comprehensive set of competitive baselines, categorized into closed-source LLMs, prompting agents, and RL training methods.

  • 1. Closed-source LLMs (Prompting-based): These models are used directly with structured prompts without any RL fine-tuning, representing the performance ceiling of large, proprietary models using only prompt engineering.

    • GPT-4o (Achiam et al., 2023): A state-of-the-art multimodal LLM from OpenAI.
    • Gemini-2.5-Pro (Team et al., 2023): A powerful LLM from Google, comparable to GPT-4o.
  • 2. Prompting Agents (Open-source LLMs with Prompting): These methods use open-source LLMs (Qwen2.5) combined with specific prompting strategies to enable agentic behavior, without RL fine-tuning.

    • Qwen2.5 (Pure Prompting): The base LLM used for RL training is also evaluated with simple direct prompting to establish a lower bound.
    • ReAct (Yao et al., 2023): An agent that interleaves reasoning (Chain-of-Thought) and acting steps to solve tasks.
    • Reflexion (Shinn et al., 2024): An agent that improves over multiple tries by reflecting on past failures and refining its reasoning and actions.
  • 3. RL Training Methods (Fine-tuning Open-source LLMs): These methods fine-tune the base LLM (Qwen2.5) using Reinforcement Learning.

    • PPO (with critic) (Schulman et al., 2017): Proximal Policy Optimization, a widely used policy gradient algorithm that requires an additional value network (critic) to estimate advantages. It serves as a strong traditional RL baseline.
    • RLOO (Kool et al., 2019; Ahmadian et al., 2024): REINFORCE Leave-One-Out, a group-based RL approach that estimates advantages without an explicit value network, using the mean reward of the other samples in a group as a baseline.
    • GRPO (Shao et al., 2024): Group Relative Policy Optimization. The base group-based RL algorithm, adapted to long-horizon agentic tasks as described in Section 3.2, using trajectory-level advantage estimation.
    • GiGPO (Feng et al., 2025b): Group-in-Group Policy Optimization. A prior hierarchical RL method that extends GRPO by performing step-level advantage estimation (0-context grouping) for LLM-based agents, but suffers from context inconsistency. This is the most direct group-based RL baseline for HGPO.
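
For reference, the snippet below sketches the trajectory-level, group-relative advantage used by the GRPO-style baselines: each of the $N$ rollouts for the same task receives a single z-scored return, which is broadcast to every step of that rollout. This is a minimal illustration with made-up variable names rather than the authors' implementation; GiGPO and HGPO additionally introduce step-level groupings to refine credit assignment beyond this trajectory-level signal.

```python
import numpy as np

# Minimal sketch of a GRPO-style trajectory-level advantage: z-score the episode
# returns within a group of N rollouts for the same task, then broadcast each
# rollout's scalar advantage to every step of that rollout. Names are illustrative.
def trajectory_level_advantages(episode_returns, eps=1e-8):
    r = np.asarray(episode_returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of N = 4 rollouts with sparse terminal rewards (10 = success).
print(trajectory_level_advantages([10.0, 0.0, 10.0, 0.0]))  # approximately [1, -1, 1, -1]
```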

5.4. Implementation Details

For fairness and comparability, all RL training methods share common configurations:

  • Base Models: Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct (Yang et al., 2024).
  • Prompting Strategy: Each LLM agent is prompted to generate a chain-of-thought within `<think></think>` tags, followed by an action within `<action></action>` tags (as shown in Figures 6 and 7 in Appendix C.5).
  • ALFWorld Specifics:
    • Max prompt length: 2048 tokens.
    • Max response length: 512 tokens.
    • Max environment steps per episode: 50.
    • Learning rate: 1e-6 for the actor (policy), 1e-5 for the critic (PPO only).
    • Reward: 10 for success, 0 for failure, -0.1 penalty for invalid actions.
    • Group size ($N$): 8 for group-based RL.
    • Groups per rollout: 16, resulting in $16 \times 8 = 128$ environments in total.
    • PPO: Uses 128 separate environments for rollouts.
    • Rollout temperature: 1.0. Validation temperature: 0.4.
    • Mini-batch size: 256.
    • KL-divergence loss coefficient ($\beta$): 0.01.
    • Discount factor ($\gamma$): 0.95.
  • WebShop Specifics:
    • Max prompt length: 4096 tokens.
    • Max response length: 512 tokens.
    • Max environment steps per episode: 15.
    • Learning rate: 1e-6 for the actor, 1e-5 for the critic (PPO only).
    • Reward: 10 for success, 0 for failure, -0.1 penalty for invalid actions.
    • Group size ($N$): 8 for group-based RL.
    • Groups per rollout: 16, resulting in $16 \times 8 = 128$ environments in total.
    • PPO: Uses 128 separate environments for rollouts.
    • Rollout temperature: 1.0. Validation temperature: 0.4.
    • Mini-batch size: 64.
    • KL-divergence loss coefficient ($\beta$): 0.01.
    • Discount factor ($\gamma$): 0.95.
  • HGPO Specifics: The weighting coefficient $\alpha$ (from Eq. 7) is set to 1 by default for the main experiments. When computing weights, groups with zero advantage (typically because all members of a homogeneous group share the same reward, so $\sigma = 0$ or $R(\boldsymbol{s}_t^{(i)}) - \text{mean}(R) = 0$) are omitted to avoid undefined estimates; a hedged sketch of this aggregation is given at the end of this subsection.
  • Computing Details:
    • Qwen2.5-1.5B-Instruct: 2 NVIDIA H100 GPUs.
    • Qwen2.5-7B-Instruct: 4 NVIDIA H100 GPUs.
    • Training iterations: 160 for each experiment.
  • Evaluation: Both GiGPO and HGPO are tested with three random seeds, reporting mean and standard deviation.
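
As a rough illustration of how the HGPO-specific settings above fit together, the sketch below aggregates one step's per-level advantages with weights proportional to $\exp(\alpha k)$ and skips degenerate groups. The exponential form is an assumption chosen to match the described behavior ($\alpha = 0$ gives uniform weights, larger $\alpha$ favors higher-consistency levels); it is not claimed to be the paper's exact Eq. (7), and all names are illustrative.

```python
import math

# Hedged sketch (assumed weight form, not the paper's exact Eq. 7): combine one
# step's advantages across its hierarchical context groups. Weights are proportional
# to exp(alpha * k), so alpha = 0 is uniform and larger alpha emphasizes higher,
# more context-consistent levels. Groups whose advantage is undefined (e.g., every
# member shares the same reward) are passed as None and skipped, as described above.
def aggregate_step_advantage(level_advantages, alpha=1.0):
    valid = {k: a for k, a in level_advantages.items() if a is not None}
    if not valid:
        return 0.0
    weights = {k: math.exp(alpha * k) for k in valid}
    total = sum(weights.values())
    return sum((weights[k] / total) * valid[k] for k in valid)

# Example with K = 2: the step's 2-context group is degenerate and therefore omitted.
print(aggregate_step_advantage({0: 0.4, 1: 0.9, 2: None}, alpha=1.0))  # approximately 0.77
```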

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate HGPO's superior performance across both ALFWorld and WebShop benchmarks. The tables and figures provided in the paper support the claims regarding HGPO's effectiveness, stability, and improved generalization.

The following are the results from Table 1 of the original paper:

| Model | Type | Method | ALFWorld In-Success | ALFWorld Out-Success | WebShop Task Scores | WebShop Task Success Rates |
|---|---|---|---|---|---|---|
| Qwen2.5-1.5B-Instruct | Prompting | GPT-4o | 48.0 | 46.0 | 31.8 | 23.7 |
| | | Gemini-2.5-Pro | 60.3 | 50.5 | 42.5 | 35.9 |
| | | Qwen2.5 | 4.1 | - | 23.1 | 5.2 |
| | | ReAct | 12.8 | - | 40.1 | 11.3 |
| | | Reflexion | 21.8 | - | 55.8 | 21.9 |
| | RL Training | PPO (with critic) | 54.4 ± 3.1 | - | 73.8 ± 3.0 | 51.5 ± 2.9 |
| | | RLOO | 69.7 ± 2.5 | 68.7 ± 10.7 | 73.9 ± 5.6 | 52.1 ± 6.7 |
| | | GRPO | 72.8 ± 3.6 | 70.1 ± 2.5 | 75.8 ± 3.5 | 56.8 ± 3.8 |
| | | GiGPO (K=2) | 85.42 ± 1.32 | 80.72 ± 1.62 | 84.52 ± 0.98 | 69.79 ± 0.59 |
| | | HGPO (K=2) | 89.58 ± 0.45 | 80.73 ± 2.38 | 87.53 ± 0.77 | 72.66 ± 1.78 |
| | | GiGPO (K=4) | 85.15 ± 2.81 | 80.98 ± 0.45 | 88.5 ± 0.49 | 74.08 ± 0.98 |
| | | HGPO (K=4) | 92.45 ± 0.81 | 89.06 ± 2.34 | 88.90 ± 0.90 | 75.91 ± 1.19 |
| Qwen2.5-7B-Instruct | Prompting | Qwen2.5 | 14.8 | - | 26.4 | 7.8 |
| | | ReAct | 31.2 | - | 46.2 | 19.5 |
| | | Reflexion | 42.7 | - | 58.1 | 28.8 |
| | RL Training | PPO (with critic) | 77.08 ± 1.12 | 76.23 ± 1.46 | 81.4 ± 3.1 | 68.7 ± 5.1 |
| | | RLOO | 77.86 ± 0.03 | 73.95 ± 0.05 | 80.3 ± 3.2 | 65.7 ± 4.0 |
| | | GRPO | 78.64 ± 0.73 | 76.82 ± 1.47 | 79.3 ± 2.8 | 66.1 ± 3.7 |
| | | GiGPO (K=2) | 89.84 ± 2.20 | 82.81 ± 5.46 | 86.23 ± 1.43 | 75.13 ± 1.37 |
| | | HGPO (K=2) | 91.15 ± 1.19 | 84.89 ± 4.30 | 88.93 ± 0.84 | 76.43 ± 1.47 |
| | | GiGPO (K=4) | 90.88 ± 0.90 | 87.76 ± 0.45 | 87.25 ± 1.02 | 76.18 ± 1.25 |
| | | HGPO (K=4) | 94.79 ± 0.90 | 93.22 ± 1.62 | 87.88 ± 0.41 | 77.21 ± 0.22 |

6.1.1. HGPO Achieves Overall Superior Performance

As evident from Table 1, HGPO consistently outperforms all baselines across both ALFWorld and WebShop, using both Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct models.

  • RL vs. Prompting: All RL training methods significantly outperform prompting-based methods. For example, on ALFWorld (Qwen2.5-1.5B), Reflexion achieves 21.8% In-Success, while PPO reaches 54.4% and HGPO (K=4) achieves 92.45%. This underscores the substantial benefits of RL fine-tuning for agentic reasoning and decision-making compared to static prompting. Even sophisticated closed-source LLMs like GPT-4o and Gemini-2.5-Pro are surpassed by RL-trained Qwen2.5 models, highlighting the importance of domain-specific adaptation.
  • HGPO vs. Other RL Methods: Among the RL-based approaches, HGPO consistently achieves the highest success rates and task scores. For instance, with Qwen2.5-1.5B-Instruct on ALFWorld, HGPO (K=4) reaches 92.45% In-Success, significantly higher than GiGPO (K=4) at 85.15%, GRPO at 72.8%, and PPO at 54.4%. Similar trends are observed on WebShop and with the larger Qwen2.5-7B-Instruct model. This validates HGPO's effectiveness in providing more reliable advantage estimates for policy optimization.

6.1.2. HGPO Benefits More from Larger KK

A notable observation is that HGPO exhibits a more pronounced performance improvement compared to GiGPO as $K$ (the maximum context length for hierarchical grouping) increases from 2 to 4.

  • ALFWorld (Qwen2.5-1.5B): GiGPO In-Success goes from 85.42% (K=2) to 85.15% (K=4), showing a slight decrease. In contrast, HGPO In-Success improves from 89.58% (K=2) to 92.45% (K=4).
  • WebShop (Qwen2.5-1.5B): GiGPO Task Scores improve from 84.52% (K=2) to 88.5% (K=4), and HGPO Task Scores improve from 87.53% (K=2) to 88.90% (K=4); while both improve, HGPO maintains a lead.

The paper attributes this to GiGPO's limitation: as $K$ increases, prompt inconsistency becomes more severe within step-level groups (0-context grouping), leading to increasingly biased advantage estimates and limiting performance gains. HGPO, by contrast, directly mitigates this inconsistency through its hierarchical grouping mechanism and adaptive weighting, which emphasizes steps with consistent contexts. This result strongly supports the core hypothesis of HGPO that addressing context inconsistency is crucial for leveraging longer contexts effectively.

6.1.3. HGPO Exhibits Better Generalization on Out-of-Distribution Tasks

The Out-Success metric on ALFWorld provides insight into generalization capabilities.

  • All baseline methods experience significant performance degradation on out-of-distribution tasks. For Qwen2.5-1.5B, GRPO drops from 72.8% In-Success to 70.1% Out-Success, and GiGPO (K=4) drops from 85.15% to 80.98%.
  • HGPO maintains superior performance with less degradation. For Qwen2.5-1.5B, HGPO (K=4) achieves 92.45% In-Success and 89.06% Out-Success. This minimal drop compared to GiGPO's larger gap suggests that HGPO's more robust and stable advantage estimation (by handling context inconsistency) leads to policy updates that generalize better to unseen task variations.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Ablation Study

The paper conducts an ablation study (Table 2) to evaluate the effectiveness of HGPO's core components: hierarchical grouping and adaptive weighting. The base model for this study is HGPO (K=2) with Qwen2.5-1.5B-Instruct.

The following are the results from Table 2 of the original paper:

| Ablation | ALFWorld (%) | WebShop (%) |
|---|---|---|
| HGPO | 89.58 ± 0.45 | 72.66 ± 1.78 |
| W/o HoG-1 | 13.50 ± 0.58 | 10.13 ± 1.42 |
| W/o HoG-2 | 86.47 ± 1.89 | 57.94 ± 1.02 |
| W/o Ada. Weighting | 87.23 ± 1.80 | 68.48 ± 0.45 |

  • W/o HoG-1 (Without Hierarchical Grouping, using only Oracle steps):

    • Description: This setting removes hierarchical grouping entirely and attempts to optimize the policy using only Oracle steps (i.e., steps that match perfectly in current state and full historical context).
    • Result: Performance drops dramatically to 13.50% on ALFWorld and 10.13% on WebShop, indicating that policy learning essentially fails.
    • Analysis: This confirms the paper's initial empirical finding that Oracle steps are too scarce (as shown in Figure 2(c)-(d) and Table 5) to form sufficient groups for effective policy optimization. Relying solely on them leads to extreme data inefficiency.
  • W/o HoG-2 (Oracle Advantages for Oracle Steps, Step-level for Others):

    • Description: In this configuration, Oracle steps are identified and their advantages are computed using their small Oracle groups. For all other steps (the majority), step-level advantages are computed using 0-context groups (similar to GiGPO).
    • Result: Performance significantly drops to 86.47% on ALFWorld (from 89.58%) and 57.94% on WebShop (from 72.66%). The drop is particularly severe on WebShop.
    • Analysis: This drop, despite using more data than W/o HoG-1, is attributed to the high variance introduced by the small group sizes of Oracle steps. While Oracle steps have low bias, their scarcity means their advantage estimates are highly unstable, which degrades the overall optimization process. This validates the necessity of using hierarchical grouping to balance bias and variance by incorporating larger, less consistent groups when Oracle steps are too few.
  • W/o Ada. Weighting (Without Adaptive Weighting):

    • Description: This ablation replaces adaptive weighting (i.e., setting $\alpha = 1$) with uniform weights (equivalent to setting $\alpha = 0$ in Eq. 7). This means all hierarchical groups contribute equally to the final advantage estimate.

    • Result: Performance declines to 87.23% on ALFWorld (from 89.58%) and 68.48% on WebShop (from 72.66%).

    • Analysis: This demonstrates the importance of adaptive weighting. By not prioritizing higher-level groups (those with more consistent historical contexts and thus lower bias), uniform weighting dilutes the more accurate advantage information, leading to increased bias in the aggregated estimate. Adaptive weighting effectively balances the bias-variance trade-off by giving more credence to less biased estimates from higher consistency contexts.

      Collectively, these ablation studies confirm that context-aware hierarchical grouping and adaptively weighted advantage estimation are critical and synergistic components of HGPO, essential for its superior performance.

6.2.2. Parameter Analysis

The paper investigates the effect of the weighting coefficient $\alpha$ in Eq. (7), which controls the sharpness of the weight distribution across hierarchical groups.

The following are the results from Table 4 of the original paper:

| Parameter | $\alpha = 0$ | $\alpha = 1$ | $\alpha = 2$ |
|---|---|---|---|
| ALFWorld | 87.23 ± 1.80 | 89.58 ± 0.45 | 84.76 ± 1.17 |
| WebShop | 68.48 ± 0.45 | 72.66 ± 1.78 | 72.65 ± 1.77 |

  • Results:
    • Setting $\alpha = 0$ (which corresponds to uniform weighting, as seen in the W/o Ada. Weighting ablation) results in lower performance (87.23% on ALFWorld, 68.48% on WebShop) compared to $\alpha = 1$.
    • Setting $\alpha = 1$ yields the best performance (89.58% on ALFWorld, 72.66% on WebShop).
    • Increasing $\alpha$ further to 2 leads to a slight decrease in performance on ALFWorld (84.76%) and similar performance on WebShop (72.65%) compared to $\alpha = 1$.
  • Analysis: This finding suggests that adaptive weighting is beneficial and that an intermediate value of $\alpha = 1$ works best. A higher $\alpha$ (e.g., $\alpha = 2$) puts too much emphasis on higher-level groups (larger $k$), which, while less biased, have smaller sizes and thus higher variance. A lower $\alpha$ (e.g., $\alpha = 0$) gives too much weight to lower-level groups (smaller $k$), which have more bias due to context inconsistency. The sweet spot at $\alpha = 1$ indicates a good balance in prioritizing less biased estimates without excessively inflating variance. The paper notes that extensive parameter tuning is not required, and $\alpha = 1$ provides robust performance across tasks.
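
To see why an intermediate $\alpha$ balances the two effects, consider the same assumed exponential weighting used in the sketch at the end of Section 5.4 (again an assumption, not the paper's exact Eq. 7). For $K = 2$, the normalized level weights over $k = 0, 1, 2$ are approximately

$$(w_0, w_1, w_2) \approx (0.33, 0.33, 0.33) \ \text{at } \alpha = 0, \quad (0.09, 0.24, 0.67) \ \text{at } \alpha = 1, \quad (0.02, 0.12, 0.87) \ \text{at } \alpha = 2,$$

so under this form $\alpha = 2$ places almost all of the weight on the scarce, high-variance 2-context groups, while $\alpha = 0$ leans on the abundant but more biased 0-context groups.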

6.2.3. Training Dynamics

Figures 4 and 8 (Appendix D.3) illustrate the training dynamics of GRPO, GiGPO, and HGPO across several metrics. The paper claims HGPO achieves more stable and efficient policy optimization.

The following figure (Figure 4 from the original paper) shows the training dynamics of HGPO, GiGPO, and GRPO on ALFWorld:

img-3.jpeg: This chart shows how KL Loss, Policy Gradient Clip, Policy Gradient Loss, Mean Advantage, Mean Reward, and Episode Success Rate evolve during training for HGPO, GiGPO, and GRPO, reflecting HGPO's superior performance.

Figure 4: Training dynamics of HGPO (Red), GiGPO (Yellow), and GRPO (Purple) on ALFWorld using Qwen2.5-1.5B-Instruct. The details of these metrics are shown in Appendix D.3.

The following figure (Figure 8 from the original paper) shows the training dynamics of HGPO, GiGPO, and GRPO on WebShop:

img-5.jpeg: This chart shows how KL Loss, Policy Gradient Clip, Policy Gradient Loss, Mean Advantage, Mean Reward, and Episode Success Rate change over training iterations for HGPO, GiGPO, and GRPO.

Figure 8: Training dynamics of HGPO (Red), GiGPO (Yellow), and GRPO (Blue) on WebShop using Qwen2.5-1.5B-Instruct. Best viewed in color.

  • Policy Gradient Clip Fraction: HGPO (red curve) maintains a moderate level, indicating stable training. In contrast, GiGPO (yellow) and GRPO (purple/blue) display higher fractions, suggesting instability and frequent constraint violations due to aggressive or noisy updates.
  • KL Loss: GRPO's curve is too low, implying slow learning and insufficient policy exploration. GiGPO's curve is relatively high, suggesting an overly aggressive learning process that might lead to policy divergence. HGPO achieves a balanced trajectory, demonstrating steady and stable policy learning.
  • Mean Advantages, Policy Gradient Loss, Mean Reward, Episode Success Rate: While specific numerical values vary, the trends in these plots generally show HGPO maintaining more consistent positive mean advantages, smoother loss curves, and higher mean rewards and success rates earlier in training and at convergence, reflecting its more effective advantage estimation and policy updates.

6.2.4. Distribution of Hierarchical Group Sizes

Figure 5 and Figure 9 (Appendix D.4) visualize the distributions of hierarchical group sizes.

The following figure (Figure 5 from the original paper) shows the distribution of hierarchical group sizes (K=2K=2) on ALFWorld and WebShop:

img-4.jpeg: A set of six bar charts showing the group-size distributions of the 0-context, 1-context, and 2-context groups in the ALFWorld and WebShop environments; group sizes are binned from 0-2 up to ≥30, reflecting the relationship between historical-context consistency and group size.

Figure 5: The distributions of hierarchical group sizes ($K = 2$) on ALFWorld and WebShop using Qwen2.5-1.5B-Instruct. The Y-axis denotes the ratio.

The following figure (Figure 9 from the original paper) shows the distribution of hierarchical group sizes (K=4K=4) on ALFWorld and WebShop:

img-6.jpeg: A panel of histograms showing the group-size distributions of the 0- to 4-context groups on the ALFWorld and WebShop tasks; the x-axis is group size and the y-axis is the ratio, illustrating how historical-context consistency is distributed under hierarchy-of-groups policy optimization.

Figure 9: The distributions of hierarchical group sizes ($K = 4$) on ALFWorld and WebShop using Qwen2.5-1.5B-Instruct.

  • Observations:
    • 0-context groups (steps sharing only the current state) tend to have a higher proportion of large group sizes. This is expected as they have the least stringent matching criteria.
    • As the context depth $k$ increases (e.g., from 0-context to 1-context to 2-context to 4-context), the proportion of large groups decreases, and smaller groups become more frequent.
    • For 2-context and especially 4-context groups, a significant portion of steps fall into very small group sizes (e.g., 0-2 members).
  • Analysis: This empirically confirms the paper's claim that Oracle steps (those with identical current states and full historical contexts) are generally scarce and form smaller groups. This scarcity would lead to high variance if relied upon exclusively (W/o HoG-1 ablation). HGPO addresses this by leveraging the abundance of lower-context groups to reduce variance while using adaptive weighting to reduce bias.
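
A rough sketch of how such hierarchical groups can be formed offline with hashmap lookups is shown below. It assumes textual (hashable) states and defines a step's level-$k$ key as its current state plus the $k$ most recent (state, action) pairs of its trajectory; function and variable names are illustrative, not taken from the paper's code.

```python
from collections import defaultdict

# Hedged sketch (illustrative, not the paper's code): build k-context groups for
# k = 0..K by hashing each step's current state together with the k most recent
# (state, action) pairs of its trajectory. Steps sharing a key at level k form that
# step's k-context group; level 0 groups by the current state alone.
def build_hierarchical_groups(trajectories, K):
    # trajectories: list of rollouts, each a list of (state, action) pairs.
    groups = {k: defaultdict(list) for k in range(K + 1)}
    for traj_id, traj in enumerate(trajectories):
        for t, (state, _) in enumerate(traj):
            for k in range(K + 1):
                ctx = tuple(traj[max(0, t - k):t])  # k preceding steps (fewer near t = 0)
                groups[k][(state, ctx)].append((traj_id, t))
    return groups  # groups[k][key] -> indices of steps sharing a k-consistent context
```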

6.2.5. Step Utilization Ratio

Table 5 in Appendix D.2 reports the average proportion of steps allocated to different context groups per rollout.

The following are the results from Table 5 of the original paper:

| Dataset | 0-Context | 1-Context | 2-Context | 3-Context | 4-Context |
|---|---|---|---|---|---|
| ALFWorld ($K = 2$) | 0.97 | 0.75 | 0.52 | - | - |
| ALFWorld ($K = 4$) | 0.98 | 0.77 | 0.54 | 0.34 | 0.19 |
| WebShop ($K = 2$) | 0.92 | 0.64 | 0.44 | - | - |
| WebShop ($K = 4$) | 0.90 | 0.59 | 0.40 | 0.21 | 0.09 |

  • Observations:
    • Nearly all steps (90-98%) fall into 0-context groups. This means most steps share the same current state with at least one other step from the collected rollouts.
    • As the number of historical contexts required for matching increases (i.e., as $k$ increases), the utilization ratio steadily decreases. For $K = 4$, only 19% of steps in ALFWorld and 9% in WebShop belong to 4-context groups.
  • Analysis: This table quantitatively confirms the challenge of scarcity of Oracle steps. It highlights that steps with highly consistent historical contexts are rare. This scarcity necessitates HGPO's approach of using a hierarchy and adaptive weighting to effectively utilize all available steps for advantage estimation, rather than discarding the majority of steps due to high context inconsistency or accepting high variance from sparse Oracle groups.
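
Building on the grouping sketch above, the short helper below reproduces such a utilization statistic under the assumption that a step counts as utilized at level $k$ when its $k$-context group contains at least one other step (otherwise no relative advantage can be formed at that level); the paper's exact computation may differ.

```python
# Hedged sketch: per-level step utilization, assuming a step is "utilized" at level k
# when its k-context group has at least two members (so a relative advantage exists).
def utilization_ratios(groups):
    ratios = {}
    for k, buckets in groups.items():
        total = sum(len(members) for members in buckets.values())
        utilized = sum(len(members) for members in buckets.values() if len(members) >= 2)
        ratios[k] = utilized / total if total else 0.0
    return ratios
```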

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully introduces Hierarchy-of-Groups Policy Optimization (HGPO), a novel group-based Reinforcement Learning (RL) algorithm specifically designed to overcome the context inconsistency problem in long-horizon Large Language Model (LLM) agent training. By proposing context-aware hierarchical advantage estimation and an adaptive weighting scheme, HGPO enables more fine-grained per-step credit assignment while maintaining the efficiency and stability characteristic of group-based RL. The empirical evaluations on ALFWorld and WebShop benchmarks, using Qwen2.5-1.5B-Instruct and Qwen2.5-7B-Instruct, unequivocally demonstrate that HGPO significantly outperforms existing prompt-based agents and prior RL approaches under identical computational constraints. The ablation studies further confirm the crucial role of both hierarchical grouping and adaptive weighting in HGPO's superior performance, attributing it to a favorable bias-variance trade-off in advantage estimation.

7.2. Limitations & Future Work

The authors highlight one specific direction for future work:

  • Alternative Strategies for Context Inconsistency: They suggest exploring other methods to handle context inconsistency, such as conditionally controlling trajectories during the rollout stage. This implies a potential shift from post-hoc grouping during advantage estimation to proactive management of trajectories during data collection, which could generate more consistent contexts from the outset.

7.3. Personal Insights & Critique

7.3.1. Strengths

  • Clear Problem Identification: The paper clearly identifies context inconsistency as a critical and previously underexplored issue in stepwise group-based RL for long-horizon tasks. The empirical evidence (Figure 2) effectively motivates the proposed solution.
  • Elegant Solution: HGPO's two-pronged approach (context-aware hierarchical grouping and adaptive weighting) is elegant and intuitively addresses the bias-variance trade-off. It leverages the strengths of both fine-grained (low bias, high variance) and coarse-grained (high bias, low variance) advantage estimates.
  • Computational Efficiency: The offline nature of the hierarchical grouping (using hashmap lookups) is a significant advantage, ensuring that HGPO does not incur substantial extra computational costs during the rollout phase or require additional models.
  • Strong Empirical Validation: The comprehensive experiments on two challenging benchmarks (ALFWorld and WebShop) with different LLM sizes (1.5B and 7B) provide robust evidence of HGPO's effectiveness. The consistent outperformance over strong baselines (including PPO and GiGPO) is compelling.
  • Improved Generalization: HGPO's better performance on out-of-distribution tasks in ALFWorld is a crucial indicator of the quality of its policy learning and robustness.

7.3.2. Potential Issues, Unverified Assumptions, or Areas for Improvement

  • Assumption Justification in Proposition 4.1: While the paper justifies the monotonic bias-decrease and variance-increase assumptions (i.e., $B_T \geq b_0 \geq \cdots \geq b_K \geq 0$ and $v_0 \leq \cdots \leq v_K \leq V_T$), these remain theoretical conditions. Further empirical validation or a more rigorous theoretical derivation of these specific monotonicities in diverse long-horizon LLM agent settings would strengthen the theoretical foundation. In particular, the independence assumption used in the variance calculation may be strong if groups at different $k$ levels are not entirely disjoint.
  • Sensitivity to $K$ (Maximum Context Length for Grouping): The paper shows $K = 4$ performing better than $K = 2$ for HGPO, but beyond that, the step utilization ratio for higher-context groups becomes very low (e.g., 9% for 4-context groups in WebShop). It would be interesting to analyze how performance changes for larger $K$ values (e.g., $K = 8$ or $K = 16$). At some point, the variance from extremely small higher-context groups might outweigh the bias reduction, even with adaptive weighting.
  • Dynamic $K$ or Adaptive $\alpha$: Currently, $K$ is a fixed hyperparameter, and $\alpha$ is chosen empirically. Could $K$ itself be dynamically determined based on the observed context consistency within a batch? Or could $\alpha$ be made adaptive (e.g., annealed or learned) during training to fine-tune the bias-variance trade-off?
  • Impact of Sparse Delayed Rewards: The paper focuses on sparse delayed rewards. How would HGPO perform in environments with dense rewards or different reward shaping strategies? The impact of context inconsistency might differ when more frequent reward signals are available.
  • Alternative Policy Optimization Algorithms: HGPO integrates with a PPO-like objective. Investigating its compatibility and benefits when integrated with other policy optimization algorithms (e.g., SAC, DQN for discrete actions if adapted) could broaden its applicability.
  • Interpretability of Hierarchical Groups: While the concept is clear, further analysis into what specific types of historical contexts are most commonly shared at different hierarchical levels could provide deeper insights into the nature of context consistency in agentic tasks.
  • Memory Footprint and Speed: While offline grouping using hashmaps is efficient, for extremely long trajectories or very large batch sizes, the memory footprint and lookup times for constructing and querying all hierarchical groups could still become a factor. A detailed analysis of the computational complexity as a function of $T$, $N$, and $K$ would be valuable.

7.3.3. Transferability and Future Value

The core idea of handling context inconsistency through hierarchical grouping and adaptive weighting is highly transferable. This framework could be applied to:

  • Other Sequential Decision-Making Problems: Beyond LLM agents, any RL task where historical context is important and group-based advantage estimation is used could potentially benefit from HGPO's approach.

  • Multi-Agent RL: In multi-agent systems, context consistency across agents' observations and actions could be crucial, and HGPO's principles might be adapted.

  • Long-Context Models in General: The general principle of "weighting by context consistency" could be applied to other long-context models beyond RL, where aggregating information from diverse historical windows is needed.

    Overall, HGPO is a significant step forward in making group-based RL more robust and effective for complex long-horizon agentic tasks, laying a strong foundation for future research in context-aware policy optimization.
