Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
TL;DR Summary
GLIDER employs offline hierarchical reinforcement learning to turn LLMs into efficient decision-making agents by decomposing long-horizon tasks into structured subtasks, improving exploration and adaptability with strong transferability, as validated on the ScienceWorld and ALFWorld benchmarks.
Abstract
While showing sophisticated reasoning abilities, large language models (LLMs) still struggle with long-horizon decision-making tasks due to deficient exploration and long-term credit assignment, especially in sparse-reward scenarios. Inspired by the divide-and-conquer principle, we propose an innovative framework GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning) that introduces a parameter-efficient and generally applicable hierarchy to LLM policies. We develop a scheme where the low-level controller is supervised with abstract, step-by-step plans that are learned and instructed by the high-level policy. This design decomposes complicated problems into a series of coherent chain-of-thought reasoning sub-tasks, providing flexible temporal abstraction to significantly enhance exploration and learning for long-horizon tasks. Furthermore, GLIDER facilitates fast online adaptation to non-stationary environments owing to the strong transferability of its task-agnostic low-level skills. Experiments on ScienceWorld and ALFWorld benchmarks show that GLIDER achieves consistent performance gains, along with enhanced generalization capabilities.
English Analysis
1. Bibliographic Information
- Title: Divide and Conquer: Grounding LLMs as Efficient Decision-Making Agents via Offline Hierarchical Reinforcement Learning
- Authors: Zican Hu, Wei Liu, Xiaoye Qu, Xiangyu Yue, Chunlin Chen, Zhi Wang, Yu Cheng. The authors are affiliated with various institutions, including Nanjing University, Microsoft Research Asia, and other research entities. Their backgrounds span machine learning, artificial intelligence, and large language models.
- Journal/Conference: This paper is a preprint available on arXiv. It has not yet been published in a peer-reviewed journal or conference.
- Publication Year: 2025 (the first version was submitted in May 2025, consistent with the arXiv identifier 2505.19761; the related work it builds on is largely from 2024 and earlier).
- Abstract: Large language models (LLMs) struggle with long-horizon decision-making, particularly in tasks with sparse rewards, due to issues with exploration and long-term credit assignment. To address this, the paper proposes GLIDER (Grounding Language Models as EffIcient Decision-Making Agents via Offline HiErarchical Reinforcement Learning). GLIDER introduces a parameter-efficient hierarchical structure to LLM policies. A high-level policy learns to generate abstract, step-by-step plans, which then supervise a low-level controller that executes primitive actions. This "divide-and-conquer" approach decomposes complex problems into manageable sub-tasks, improving exploration and learning. The framework also enables fast online adaptation to new environments due to the transferability of its low-level skills. Experiments on the ScienceWorld and ALFWorld benchmarks show that GLIDER significantly outperforms existing methods.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2505.19761
- PDF Link: https://arxiv.org/pdf/2505.19761v1.pdf
- Publication Status: This is a preprint and has not undergone formal peer review.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Standard Large Language Models (LLMs), despite their impressive reasoning abilities, are not inherently good at complex, multi-step decision-making tasks (i.e., "long-horizon" tasks). They fail when rewards are infrequent (sparse-reward scenarios) because it is hard to determine which of the many actions in a long sequence led to success (long-term credit assignment). They also tend to explore the environment inefficiently (deficient exploration).
- Gaps in Prior Work:
- Prompt-based methods (e.g., ReAct) stuff interaction history into the LLM's context window, which quickly becomes too long for complex tasks.
- Supervised Fine-Tuning (SFT) methods train LLMs on expert demonstrations, but this is expensive and limits the agent to only what it has seen, preventing novel exploration.
- Standard ("flat") Reinforcement Learning (RL) methods applied to LLMs are often sample-inefficient, requiring massive amounts of trial-and-error interaction with the environment, which is slow and costly.
- Fresh Angle: The paper is inspired by the human "divide-and-conquer" strategy. It proposes that instead of trying to solve a complex task in one go, an LLM agent should first break it down into a sequence of simpler sub-goals and then solve each one. This is achieved through Hierarchical Reinforcement Learning (HRL), where a high-level "manager" LLM sets goals and a low-level "worker" LLM executes them.
- Main Contributions / Findings (What):
- GLIDER Framework: The primary contribution is a novel framework named GLIDER that grounds LLMs for decision-making using offline HRL. It features a two-level policy structure within a single, parameter-efficient model.
- Autonomous Hierarchy: The high-level policy uses the LLM's reasoning to autonomously decompose tasks into a coherent chain-of-thought of sub-tasks. This avoids the need for humans to manually define the hierarchy, a common limitation of classical HRL.
- Sample-Efficient Offline Learning: GLIDER is trained primarily on a pre-existing, static dataset of interactions (offline RL), which is much more sample-efficient than learning from scratch through live trial-and-error (online RL).
- Fast Online Adaptation: The framework trains general-purpose, task-agnostic low-level skills. When faced with a new, unseen task, the agent can freeze these skills and only fine-tune the high-level planner, allowing for rapid adaptation to new environments.
- State-of-the-Art Performance: Experimental results show that GLIDER substantially outperforms previous methods on complex benchmarks like ScienceWorld and ALFWorld, demonstrating superior performance and generalization.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs) as Agents: This refers to using an LLM as the "brain" of an autonomous agent. The agent receives observations from an environment (e.g., text describing a room), and the LLM generates a text-based action (e.g., "go to the kitchen," "pick up the apple").
- Reinforcement Learning (RL): A machine learning paradigm where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. Key components are the state (a description of the environment), action (a choice the agent can make), reward (feedback from the environment), and policy (the agent's strategy or "brain").
- Hierarchical Reinforcement Learning (HRL): A subfield of RL that structures the agent's policy into multiple levels. A high-level policy (manager) sets abstract goals or sub-tasks, and a low-level policy (worker) learns to achieve these goals by executing primitive actions. This allows for temporal abstraction, as the manager makes decisions less frequently, simplifying long-term planning.
- Offline Reinforcement Learning: A variant of RL where the agent learns from a fixed, pre-collected dataset of interactions (states, actions, rewards) without any further interaction with the live environment. This is highly data-efficient but poses challenges, such as handling the mismatch between the actions in the dataset and the actions the agent wants to take (distribution shift).
- Supervised Fine-Tuning (SFT) / Behavior Cloning (BC): A simple way to train an agent by directly imitating expert behavior. The model is trained to predict the expert's action given a certain state, similar to a standard supervised learning problem.
- LoRA (Low-Rank Adaptation): A Parameter-Efficient Fine-Tuning (PEFT) technique. Instead of retraining all the billions of parameters in an LLM, LoRA freezes the original model and injects small, trainable "adapter" matrices into its layers. This dramatically reduces the computational cost and memory required for fine-tuning.
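To make this concrete, here is a minimal sketch of attaching LoRA adapters with the Hugging Face `peft` library; the backbone name and hyperparameters are illustrative placeholders, not the paper's configuration:

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Load a base LLM and attach small trainable low-rank adapters; the base weights stay frozen.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
lora_cfg = LoraConfig(
    r=16,                                  # rank of the adapter matrices
    lora_alpha=32,                         # scaling factor for the adapter update
    target_modules=["q_proj", "v_proj"],   # which attention projections receive adapters
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()         # typically well under 1% of total parameters
```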
- Previous Works:
- Prompt-based Methods:
  - ReAct: Combines reasoning (Thought) and action-taking (Action) in a single prompt loop.
  - Reflexion: Allows an agent to verbally reflect on its past failures and use that reflection to improve its next attempt.
  - Limitation: These methods are constrained by the LLM's context window length and can struggle with very long tasks.
- Supervised Fine-Tuning (SFT) Methods:
  - SwiftSage: Integrates SFT with prompting to improve interactive reasoning.
  - NAT: Fine-tunes LLMs on trajectories, including failed ones, to learn from mistakes.
  - Limitation: Performance is capped by the quality and diversity of the demonstration data. These agents lack the ability to explore and discover better strategies not present in the dataset.
- "Flat" (Non-Hierarchical) RL Methods:
  - ETO: Uses Direct Preference Optimization (DPO) to fine-tune an LLM policy based on pairs of successful and unsuccessful trajectories.
  - Other methods use standard RL algorithms like PPO or Q-learning.
  - Limitation: These methods struggle with long-horizon planning and sparse rewards, as a single reward signal at the end of a long task is not enough to guide learning effectively.
- Classical HRL Methods:
  - Options Framework, MAX-Q, HiRO: These are seminal HRL frameworks that provide temporal abstraction.
  - Limitation: They typically require a human expert to manually define the hierarchy of sub-tasks, which is not scalable and requires domain-specific knowledge.
- Differentiation: GLIDER stands out by combining the strengths of these different paradigms while mitigating their weaknesses:
- vs. Flat RL: GLIDER uses a hierarchy to break down long-horizon problems, making credit assignment and exploration more tractable.
- vs. Classical HRL: GLIDER's high-level LLM autonomously discovers and generates sub-goals, eliminating the need for manual hierarchy design.
- vs. SFT: GLIDER moves beyond simple imitation by using RL to refine its policy and explore the environment, allowing it to discover better solutions than those in the initial dataset.
- vs. Prompting: GLIDER fine-tunes the model's parameters, embedding decision-making capability directly into the model rather than relying solely on in-context learning, making it more robust and efficient.
- vs. Online RL: GLIDER primarily uses offline RL, making it far more sample-efficient and practical.
4. Methodology (Core Technology & Implementation)
GLIDER's methodology can be broken down into its architecture and its three-stage training pipeline.
Figure 2: A schematic of the GLIDER framework, detailing the hierarchical control structure, the actor-critic model, and how the hierarchy's components interact, with accompanying formulas. The figure covers the cooperation between the low-level and high-level policies and the data sampling pipeline.
As shown in Figure 2, the framework consists of a hierarchical policy architecture, a three-stage training process (SFT, Offline RL, Offline-to-Online RL), and a structured data format for training.
- Principles: The core idea is "divide and conquer." A complex task is decomposed by a high-level planner into a sequence of simpler sub-tasks, which are then executed by a low-level controller. This mirrors human problem-solving and simplifies learning.
- Hierarchical Architecture of the LLM Agent:
- High-Level Policy ($\pi^{\mathrm{hi}}$): The "planner" or "manager".
  - Input: Task description and current environment observation $s_t$.
  - Output: A natural language sub-task goal, $g_t$.
  - Operation: It makes a decision only once every $k$ primitive steps (e.g., every 5 or 10 actions). This is temporal abstraction.
- Low-Level Policy ($\pi^{\mathrm{lo}}$): The "controller" or "worker".
  - Input: The sub-task goal $g_t$ from the high-level policy and the current observation $s_t$.
  - Output: A primitive, executable action, $a_t$.
  - Operation: It acts at every timestep to achieve the current goal $g_t$.
- Rewards (see the rollout sketch below):
  - High-Level Reward ($r^{\mathrm{hi}}$): The high-level policy receives the sum of environment rewards accumulated over the $k$ steps its sub-goal was active: $r^{\mathrm{hi}}_t = \sum_{i=t}^{t+k-1} r_i$. This reward guides it to choose effective sub-goals.
  - Low-Level Reward ($r^{\mathrm{lo}}$): The low-level policy receives an intrinsic reward: a binary signal (1 if the sub-task is completed, 0 otherwise). This dense, immediate feedback helps it learn to execute sub-goals effectively, regardless of the final task outcome.
- Parameter-Efficient Model:
- A single frozen LLM backbone is shared by both the actor (policy) and critic (value function).
- The actor is created by adding trainable LoRA adapters to the LLM.
- The critic is created by adding a trainable MLP head to the final layer of the LLM.
- Crucially, the same actor-critic model is used for both the high and low levels. A special hierarchy prompt is added to the input to tell the model whether it should act as a planner (high-level) or an executor (low-level). This is highly parameter-efficient compared to training separate models for each level.
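To make the high-/low-level interplay and the reward split concrete, below is a minimal rollout sketch under the notation above. The `agent(...)` call, the `mode` hierarchy prompt, and `subtask_completed` are illustrative assumptions rather than the paper's actual interface:

```python
def subtask_completed(obs, goal):
    """Placeholder check; GLIDER reads sub-task completion from the environment observation."""
    return goal.lower() in str(obs).lower()


def hierarchical_rollout(env, agent, k=5):
    """One episode: the planner decides every k steps, the executor acts at every step."""
    s, done = env.reset(), False
    high_data, low_data = [], []
    while not done:
        s_start = s
        g = agent(s_start, mode="planner")              # hierarchy prompt: act as high-level planner
        r_hi = 0.0
        for _ in range(k):
            a = agent(s, goal=g, mode="executor")       # hierarchy prompt: act as low-level executor
            s_next, r, done, _ = env.step(a)
            r_hi += r                                   # env reward accumulates for the planner
            r_lo = float(subtask_completed(s_next, g))  # binary intrinsic reward for the executor
            low_data.append((s, g, a, r_lo, s_next))
            s = s_next
            if done or r_lo == 1.0:
                break
        high_data.append((s_start, g, r_hi, s))         # one temporally abstract high-level transition
    return high_data, low_data
```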
- Steps & Procedures (The 3-Stage Training Pipeline):
1. Base Agent Construction via Behavior Cloning (SFT):
- Goal: To give the agent a good starting point by teaching it to imitate expert trajectories. This prevents the agent from starting with completely random, nonsensical actions.
- Process: The hierarchical actor is fine-tuned on a dataset of expert demonstrations. The model learns to predict the expert's sub-goal given a state (for the high-level) and the expert's action given a state and sub-goal (for the low-level).
- Loss Function: The training minimizes the negative log-likelihood of the expert data, with an added penalty for generating overly long outputs.
  - $D^{\mathrm{hi}}, D^{\mathrm{lo}}$: The high-level and low-level demonstration datasets.
  - $\pi^{\mathrm{hi}}_{\theta}, \pi^{\mathrm{lo}}_{\theta}$: The high-level and low-level policies (actors).
  - $\log \pi_{\theta}(\cdot)$: The log-probability of generating the expert's output (sub-goal $g$ or action $a$). Maximizing this term makes the agent's behavior closer to the expert's.
  - $\alpha$: A hyperparameter controlling the strength of the length penalty.
  - $|g|, |a|$: The token lengths of the generated sub-goal and action, respectively. This term encourages conciseness.
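A minimal sketch of what this behavior-cloning objective could look like for a Hugging Face-style causal LM, assuming prompt tokens are masked with -100 in `labels`; `alpha` names the length-penalty coefficient described above and is not taken from the paper's code:

```python
import torch
import torch.nn.functional as F

def bc_loss(model, input_ids, labels, alpha=0.01):
    """Negative log-likelihood of the expert output plus a length-penalty term.

    `labels` copy `input_ids` for response (sub-goal/action) tokens and hold -100
    for prompt tokens, so only the expert's response contributes to the NLL.
    """
    logits = model(input_ids=input_ids).logits          # (batch, seq_len, vocab)
    shift_logits = logits[:, :-1, :]                    # predict token t from tokens < t
    shift_labels = labels[:, 1:]
    nll = F.cross_entropy(
        shift_logits.reshape(-1, shift_logits.size(-1)),
        shift_labels.reshape(-1),
        ignore_index=-100,
    )
    # Length-penalty term (a scalar regularizer on response length; in this simplified
    # sketch it is computed from the reference labels and mainly documents the objective).
    resp_len = (shift_labels != -100).sum(dim=1).float().mean()
    return nll + alpha * resp_len
```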
2. Offline Hierarchical Policy Refinement:
- Goal: To improve upon the SFT policy using RL on a mixed-quality offline dataset (expert + non-expert trajectories), allowing the agent to learn from sub-optimal data and explore beyond pure imitation.
- Sentence-Level Critic Training: The critic learns to estimate the value of states and state-action pairs.
- Q-function Loss: The Q-function $Q_{\phi}$ is trained to predict the expected future return of taking action $u$ in state $s$:
  $$\mathcal{L}_{Q}(\phi) = \mathbb{E}_{(s, u, r, s') \sim D_{r}}\left[\left(r + \gamma V_{\bar{\psi}}(s') - Q_{\phi}(s, u)\right)^{2}\right]$$
  - $s, u, r, s'$: State, action, reward, and next state from the offline dataset $D_{r}$.
  - $\gamma$: Discount factor.
  - $V_{\bar{\psi}}(s')$: The value of the next state, estimated by a target value network (a delayed copy of the main value network) for stability.
- Value-function Loss: The value function $V_{\psi}$ is trained to estimate the value of a state under the current policy. It uses an asymmetric loss to be conservative and avoid overestimating values, which is a common problem in offline RL.
  - $L_{2}^{\tau}$: This is the expectile regression loss. With $\tau > 0.5$, it penalizes cases where $V_{\psi}$ underestimates the corresponding Q-value more heavily than cases where it overestimates, pushing $V_{\psi}$ towards the upper end of the Q-value distribution and leading to a conservative but effective value estimate.
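A compact sketch of how these two critic losses might be implemented, assuming scalar outputs from the Q- and V-heads; the tensor names and the expectile parameter `tau` are illustrative assumptions:

```python
import torch

def critic_losses(q, v, v_next_target, q_target, reward, gamma=0.99, tau=0.7):
    """TD loss for Q and expectile-regression loss for V (IQL-style).

    q:             Q_phi(s, u) for the dataset action.
    v:             V_psi(s).
    v_next_target: value of s' from the target (delayed) value network.
    q_target:      Q estimate at (s, u) from the target Q-network, used to fit V.
    """
    # Q-function: regress onto the one-step TD target r + gamma * V(s').
    td_target = reward + gamma * v_next_target.detach()
    q_loss = ((td_target - q) ** 2).mean()

    # V-function: asymmetric (expectile) regression toward Q.
    diff = q_target.detach() - v
    weight = torch.where(diff > 0, tau, 1.0 - tau)   # tau > 0.5 => conservative upper expectile
    v_loss = (weight * diff ** 2).mean()
    return q_loss, v_loss
```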
- Token-Level Actor Training: The actor (policy) is updated to generate actions that lead to higher Q-values.
- The actor generates an action (which is a sentence) token by token: $\log \pi_{\theta}(u \mid s) = \sum_{j=1}^{|u|} \log \pi_{\theta}(w^{j} \mid s, w^{<j})$, where $w^{j}$ is the $j$-th token of $u$.
- The actor loss uses an advantage-weighted behavior cloning objective, similar to the AWAC algorithm.
- $A(s, u) = Q_{\phi}(s, u) - V_{\psi}(s)$: This is the advantage function, which measures how much better taking action $u$ is compared to the average action in state $s$.
- The exponential term $\exp(A(s, u) / \beta)$, where $\beta$ is a temperature hyperparameter, acts as a weight. Actions with a high advantage are given a much higher weight in the loss, pushing the policy to imitate good actions from the offline dataset more strongly. This implicitly optimizes the policy while constraining it to stay close to the data distribution, mitigating distribution shift.
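A rough sketch of the advantage-weighted, token-level actor objective (an AWAC-style weighting under assumed names; `beta` is a temperature hyperparameter introduced here for illustration):

```python
import torch

def actor_loss(logp_tokens, mask, q, v, beta=1.0, clip=100.0):
    """Advantage-weighted behavior cloning over the tokens of each action.

    logp_tokens: (B, T) per-token log-probs of the dataset action under pi_theta.
    mask:        (B, T) 1 for action tokens, 0 for prompt/padding tokens.
    q, v:        (B,) critic estimates Q_phi(s, u) and V_psi(s).
    """
    # Sentence-level log-probability of the action: sum over its tokens.
    logp_action = (logp_tokens * mask).sum(dim=1)
    # Advantage and its exponential weight (clipped for numerical stability).
    advantage = (q - v).detach()
    weight = torch.clamp(torch.exp(advantage / beta), max=clip)
    # Weighted negative log-likelihood: imitate high-advantage actions more strongly.
    return -(weight * logp_action).mean()
```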
3. Offline-to-Online Adaptation:
- Goal: To quickly adapt the pre-trained agent to a new, unseen task or a non-stationary environment.
- Process: The low-level skills are general ("open the door," "pick up an object") and task-agnostic. Therefore, the low-level policy is frozen. Only the high-level policy is fine-tuned with online interactions in the new environment. The agent collects new high-level transitions and uses them to update the high-level actor and critic using the same RL losses as in Stage 2. This is extremely efficient as it leverages the pre-trained skills and only adapts the high-level planning logic.
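A schematic sketch of this offline-to-online stage under the same hypothetical interface as the rollout sketch above: only high-level transitions are collected in the new task, and only the high-level actor-critic is updated while the low-level skills stay frozen. `agent.update_high_level` is an assumed helper, not the paper's API:

```python
def adapt_online(env, agent, k=5, num_episodes=50):
    """Offline-to-online adaptation: keep the low-level skills frozen, tune only the planner."""
    high_buffer = []
    for _ in range(num_episodes):
        s, done = env.reset(), False
        while not done:
            s_start = s
            g = agent(s_start, mode="planner")         # trainable high-level policy
            r_hi = 0.0
            for _ in range(k):
                a = agent(s, goal=g, mode="executor")  # frozen, task-agnostic low-level skills
                s, r, done, _ = env.step(a)
                r_hi += r
                if done:
                    break
            high_buffer.append((s_start, g, r_hi, s))  # only high-level transitions are stored
        # Update only the high-level actor and critic, reusing the Stage-2 offline-RL losses.
        agent.update_high_level(high_buffer)
```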
5. Experimental Setup
- Datasets:
- ScienceWorld: A text-based simulation of elementary science experiments. It has a large combinatorial state-action space and requires complex, multi-step reasoning. The benchmark contains 30 distinct tasks (e.g., boiling water, testing conductivity).
- ALFWorld: A benchmark that combines text-based instructions with a simulated 3D household environment. Agents must navigate and interact with objects to complete tasks like "put a clean plate in the microwave." It features sparse, binary rewards (success or failure).
- Offline Dataset Construction: The authors create a training dataset by mixing expert demonstrations (optimal trajectories provided by the benchmarks) with medium-quality trajectories. The medium trajectories are generated by other agents. A mixture ratio of 1:2 (expert:medium) is used, as this was found to be optimal. This mix provides both high-quality examples and a wider, more diverse coverage of the state-action space.
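As a trivial illustration of assembling such a dataset (list and function names are hypothetical, not from the paper):

```python
import random

def build_offline_dataset(expert_trajs, medium_trajs, ratio=(1, 2), seed=0):
    """Mix expert and medium-quality trajectories at the given expert:medium ratio."""
    rng = random.Random(seed)
    n_expert = min(len(expert_trajs), len(medium_trajs) * ratio[0] // ratio[1])
    n_medium = n_expert * ratio[1] // ratio[0]
    mixed = rng.sample(expert_trajs, n_expert) + rng.sample(medium_trajs, n_medium)
    rng.shuffle(mixed)
    return mixed
```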
- Evaluation Metrics: The paper evaluates performance using the standard "score" for each benchmark.
- Conceptual Definition:
- For ScienceWorld, the score is the percentage of required sub-goals completed for a given task, averaged over all test tasks. A score of 100 means the agent successfully completed all parts of the task. It measures partial success and is a fine-grained metric.
- For ALFWorld, the primary metric is Success Rate. This is the percentage of tasks the agent completes successfully. Since the reward is binary (1 for success, 0 for failure), this is equivalent to the average final reward.
- Mathematical Formula: The paper does not provide explicit formulas, as these are standard benchmark metrics, but they can be stated as follows.
  - Average Score (ScienceWorld): Let $s_{i,j}$ be the score for task $i$ on run $j$, out of $K$ runs and $N$ tasks: $\text{Average Score} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{j=1}^{K} s_{i,j}$.
  - Success Rate (ALFWorld): Let $\mathbb{1}_{i,j}$ be an indicator function that is 1 if task $i$ is solved on run $j$ and 0 otherwise: $\text{Success Rate} = \frac{1}{NK}\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbb{1}_{i,j}$.
- Symbol Explanation:
  - $N$: Total number of tasks.
  - $K$: Total number of evaluation runs per task.
  - $s_{i,j}$: The score (0-100) on a single run.
  - $\mathbb{1}_{i,j}$: Success indicator (0 or 1) on a single run.
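For completeness, a small sketch of computing these aggregates from a results matrix (shape conventions assumed):

```python
import numpy as np

def average_score(scores):
    """scores: array of shape (N_tasks, K_runs) with per-run scores in [0, 100]."""
    return float(np.mean(scores))

def success_rate(rewards):
    """rewards: array of shape (N_tasks, K_runs) with binary final rewards (1 = solved)."""
    return float(np.mean(rewards)) * 100.0  # expressed as a percentage
```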
- Baselines: The paper compares GLIDER against a strong set of baselines covering different paradigms:
- Prompt-based: ReAct, Reflexion.
- SFT-based: SwiftSage.
- Fine-tuning (SFT + RL): NAT, ETO.
These baselines represent the state-of-the-art in LLM agents at the time, making for a robust comparison.
6. Results & Analysis
- Core Results:
The main results are presented in Table 1, which compares GLIDER against baselines on both seen (training) and unseen (test) tasks across three different LLM backbones.

Manual Transcription of Table 1:

| Backbone | Method | ScienceWorld (Seen) | ScienceWorld (Unseen) | ALFWorld (Seen) | ALFWorld (Unseen) |
| --- | --- | --- | --- | --- | --- |
| Mistral-7B | Φ ReAct | 20.72 | 17.65 | 7.86 | 5.22 |
| Mistral-7B | Φ Reflexion | 21.07 | 18.11 | 11.56 | 6.00 |
| Mistral-7B | Φ SwiftSage | 48.40 | 45.25 | 30.29 | 26.52 |
| Mistral-7B | ● NAT | 57.12 | 50.79 | 64.43 | 68.96 |
| Mistral-7B | ● ETO | 58.17 | 51.85 | 66.84 | 71.43 |
| Mistral-7B | ● GLIDER | 67.31 (↑15.71%) | 65.14 (↑25.63%) | 70.02 (↑4.76%) | 74.83 (↑4.76%) |
| Gemma-7B | Φ ReAct | 3.58 | 3.51 | 6.43 | 2.24 |
| Gemma-7B | Φ Reflexion | 4.94 | 3.93 | 7.14 | 2.99 |
| Gemma-7B | Φ SwiftSage | 33.43 | 30.90 | 8.23 | 5.72 |
| Gemma-7B | ● NAT | 47.63 | 44.98 | 67.86 | 65.88 |
| Gemma-7B | ● ETO | 50.44 | 47.84 | 6.43 | 68.66 |
| Gemma-7B | ● GLIDER | 63.67 (↑26.23%) | 58.50 (↑22.28%) | 72.12 (↑6.28%) | 70.88 (↑3.23%) |
| Llama-3-8B | Φ ReAct | 24.76 | 22.66 | 2.86 | 3.73 |
| Llama-3-8B | Φ Reflexion | 27.23 | 25.41 | 4.29 | 4.48 |
| Llama-3-8B | Φ SwiftSage | 42.22 | 40.58 | 20.39 | 10.78 |
| Llama-3-8B | ● NAT | 55.24 | 48.76 | 60.71 | 59.70 |
| Llama-3-8B | ● ETO | 57.90 | 52.33 | 64.29 | 64.18 |
| Llama-3-8B | ● GLIDER | 77.43 (↑33.73%) | 68.34 (↑30.59%) | 71.56 (↑11.31%) | 75.38 (↑17.45%) |

- Analysis: GLIDER consistently and significantly outperforms all baselines across all three LLM backbones and on both benchmarks. The performance gains are particularly large on unseen tasks (e.g., +30.59% with Llama-3-8B on ScienceWorld), which strongly validates GLIDER's superior generalization capability. This confirms that the hierarchical structure is highly effective.
- Ablations / Parameter Sensitivity:
1. Contribution of Hierarchy and Training Stages (Figure 3):
Figure 3: A bar chart comparing the ablated performance of different model architectures on unseen ScienceWorld tasks. Solid bars represent the hierarchical models, shaded bars the variants without the hierarchy, and purple/yellow/green correspond to the SFT, ORL, and SFT+ORL training stages, respectively.
This figure analyzes the impact of GLIDER's key components on unseen tasks in ScienceWorld.
- Hierarchical vs. Non-Hierarchical: In every comparison, the hierarchical models (solid bars) dramatically outperform their non-hierarchical counterparts (shaded bars). This is the clearest evidence that the "divide-and-conquer" strategy is the main driver of performance.
- Training Stages:
- The full pipeline (green bars) consistently yields the best results. This shows that starting with imitation learning and then refining with RL is the most effective strategy.
- ORL only (yellow bars) performs better than SFT only (purple bars). This suggests that reinforcement learning is more powerful than simple imitation, as it allows the agent to learn from sub-optimal data and explore.
2. Impact of Model Scale (Table 2):
Manual Transcription of Table 2:
| Model | w/o Hier (SFT) | w/o Hier (ORL) | w/o Hier (SFT+ORL) | w/ Hier (SFT) | w/ Hier (ORL) | w/ Hier (SFT+ORL) |
| --- | --- | --- | --- | --- | --- | --- |
| Llama-1B | 37.24 | 45.31 | 48.48 | 44.50 | 50.43 | 53.62 |
| Llama-3B | 38.19 | 52.47 | 56.93 | 48.11 | 55.98 | 61.29 |
| Llama-8B | 41.88 | 50.16 | 53.94 | 50.17 | 57.12 | 68.34 |

- Analysis: The benefits of the hierarchical structure (w/ Hier) and the full pipeline hold true across different model sizes (1B, 3B, 8B). Interestingly, the hierarchical Llama-3B model (score 61.29) outperforms even larger non-hierarchical models (e.g., it is better than the non-hierarchical Llama-8B's 53.94). This highlights the efficiency of GLIDER's architecture: a better structure can be more important than simply increasing model size.
3. Generalization via Online Fine-tuning (Figure 4):
Figure 4: Plots of online fine-tuning performance (score out of 100) for GLIDER versus the AWAC and AC baselines on ScienceWorld. Across the three task groups (test-conductivity, find-animal, boil), GLIDER clearly outperforms the other methods, and its test score rises markedly as training steps increase.
This experiment tests how well a pre-trained GLIDER agent adapts to a completely new task with online fine-tuning.
- Zero-shot Generalization: At step 0 (before any online fine-tuning), GLIDER's score is already much higher than the baselines (AC and AWAC). This indicates that the pre-trained agent has better "zero-shot" knowledge transfer to new tasks.
- Fast Adaptation: During online fine-tuning, GLIDER's performance curve rises much more steeply and reaches a higher final score than the baselines. This confirms that freezing the low-level skills and only tuning the high-level planner is a highly effective and efficient strategy for adapting to new environments.
4. Impact of Data Mixture Ratios (Figure 5):
Figure 5: A chart comparing performance on ScienceWorld under different mixtures of expert and medium-quality data, using Llama-3-8B as the LLM backbone. The solid red line represents the hierarchical policy (w/ Hier), the dashed green line the non-hierarchical variant (w/o Hier); performance varies noticeably with the mixture ratio.
This figure explores how the mix of expert and medium-quality data affects performance.
- Mixture is Best: The best performance is achieved with a mixture of expert and medium data (peaking at a 1:2 ratio).
- Diversity over Perfection: Interestingly, training only on medium data (score ~36.0) yields better results than training only on expert data (score ~29.7). This suggests that the diversity and broader state-space coverage of the medium-quality data are more valuable for learning a generalizable policy than having a smaller set of perfect-only trajectories. This strongly motivates the use of RL, which is designed to learn from such imperfect data.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces GLIDER, a novel framework that leverages offline hierarchical reinforcement learning to create highly capable LLM agents. By decomposing complex, long-horizon tasks into manageable sub-tasks, GLIDER significantly improves exploration, credit assignment, and overall performance. Its parameter-efficient design and ability to learn from mixed-quality offline data make it a practical and powerful approach. The framework's strong generalization and fast online adaptation capabilities further establish it as a significant step forward for building autonomous AI agents.
- Limitations & Future Work:
- Limitations: The authors acknowledge that the multi-stage training pipeline (SFT followed by offline RL) is somewhat complex.
- Future Work:
- Streamline Training: They propose simplifying the training process, potentially into a single stage, inspired by recent work like DeepSeek-R1.
- Broader Applications: The authors suggest that GLIDER's framework could be extended beyond traditional agent tasks to other sequential decision-making problems, such as mathematical reasoning and code generation, where a problem can be solved step-by-step.
- Personal Insights & Critique:
- Strengths:
- The core contribution—using an LLM's own reasoning to autonomously create the hierarchy—is a brilliant solution to a long-standing problem in HRL. It elegantly combines the structured planning of HRL with the flexible reasoning of LLMs.
- The parameter-sharing architecture is very clever, making the approach scalable and computationally feasible, which is often a major hurdle for complex agent architectures.
- The thorough experimental validation, including multiple backbones, ablations, and generalization studies, provides very strong evidence for the framework's effectiveness.
- Potential Weaknesses and Open Questions:
- The framework relies on a fixed temporal abstraction factor $k$. The optimal value of $k$ might vary significantly across tasks or even within a single task. A dynamic or adaptive mechanism for choosing $k$ could be a powerful extension.
- The paper claims that sub-task completion is "easily accessible from the environment observation." While true for benchmarks like ScienceWorld and ALFWorld, this might not hold in more complex, open-world environments with ambiguous observations. The robustness of this sub-goal checker is critical.
- The quality of the autonomously generated sub-goals is crucial. While the LLM planner seems effective, an analysis of common failure modes (e.g., generating illogical or unachievable sub-goals) would be insightful.