Tree Search for LLM Agent Reinforcement Learning
TL;DR Summary
This work introduces Tree-GRPO, a tree search method improving rollout efficiency and generating step-wise supervision to enhance multi-turn RL for LLM agents, outperforming chain-based approaches across diverse QA datasets.
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Tree Search for LLM Agent Reinforcement Learning
1.2. Authors
The authors are anonymous, as the paper is under double-blind review.
1.3. Journal/Conference
The paper is under review as a conference paper at ICLR 2026. ICLR (International Conference on Learning Representations) is a prestigious conference in the field of artificial intelligence, particularly focusing on deep learning, representation learning, and related areas. Its reputation and influence are high, making it a significant venue for publishing cutting-edge research in machine learning.
1.4. Publication Year
2025 (the submission was posted on 2025-10-08 UTC, as part of the ICLR 2026 review cycle).
1.5. Abstract
Recent advancements in reinforcement learning (RL) have significantly improved the agent capabilities of large language models (LLMs). However, in complex long-term and multi-turn agent tasks, existing approaches that rely solely on outcome rewards often face the challenge of sparse supervision. To tackle this, the authors propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method that employs a tree search strategy where each tree node represents a complete agent interaction step (Thought-Action-Observation). This tree search sampling method enhances the number of achievable rollouts within a fixed budget of tokens or tool calls by sharing common prefixes. Furthermore, the paper highlights that this tree-structured trajectory naturally facilitates the construction of step-wise process supervised signals, even when using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both intra-tree and inter-tree levels. Theoretical analysis demonstrates that the intra-tree level group relative policy optimization objective is equivalent to that of step-level direct preference learning. Experimental evaluations across 11 datasets and three types of QA tasks confirm the superiority of the proposed tree-based RL over traditional chain-based RL methods.
1.6. Original Source Link
https://openreview.net/forum?id=ZpQwAFhU13 PDF Link: https://openreview.net/pdf?id=ZpQwAFhU13 Publication Status: Under review as a conference paper.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the inefficiency and sparse supervision inherent in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents performing long-term and multi-turn tasks.
In the current field, RL has significantly boosted LLM capabilities, especially in single-turn tasks. However, extending this to complex agent settings with dynamic, multi-turn interactions introduces two key challenges:
- Heavy Budget in LLM Rollouts: Agent tasks involve multi-turn Thought-Action-Observation cycles, leading to very long trajectories (thousands of tokens) and numerous tool calls. Existing chain-based RL methods independently sample complete trajectories, leading to considerable redundancy in sampling and substantial costs (both in tokens and in expensive tool calls like search APIs). The rollout phase often dominates the overall training time and budget.
- Sparse Supervision in Long-Horizon, Multi-Turn Trajectories: Most current agent RL approaches rely primarily on outcome rewards: a single scalar reward delivered at the end of an entire trajectory. This sparse signal makes it difficult for the LLM to learn which specific steps or actions within a long, interdependent sequence contributed to success or failure. The limited training signal can lead to unstable learning, training collapse, and an inability to learn fine-grained decision-making.

The paper's entry point is to address these challenges by replacing traditional chain-based sampling with an online tree search, where each node represents a complete agent interaction step. This structural change aims to simultaneously increase sampling efficiency and generate richer, more fine-grained supervision signals from the same outcome rewards.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of LLM agent reinforcement learning:
- Novel Tree-based Rollout Strategy: The introduction of a tree-based rollout strategy for multi-turn agentic RL, using agent step-level nodes instead of independent chain-based trajectories. This method significantly reduces the rollout budget (tokens and tool calls) by sharing common prefixes among trajectories.
- Tree-based Group Relative Advantage Estimation: The proposal of a group-relative advantage estimation method that operates at both intra-tree and inter-tree levels. This approach leverages the tree structure to construct implicit step-level preference learning objectives, providing more fine-grained process supervision purely from outcome rewards, with a more stable baseline estimate.
- Theoretical and Empirical Validation: The paper provides theoretical analysis demonstrating that the intra-tree level group relative policy optimization is structurally equivalent to step-level Direct Preference Optimization (DPO). This is complemented by empirical evidence across 11 datasets and three types of QA tasks, showing Tree-GRPO's superior performance over chain-based methods, especially for smaller models and in multi-hop QA tasks, often achieving better results with substantially less rollout budget. Notably, Tree-GRPO can enable base models to adopt complex multi-turn interaction paradigms without supervised fine-tuning (SFT) under limited budgets.

The key conclusions are that Tree-GRPO effectively addresses the sparse supervision and high rollout cost challenges in LLM agent RL. By using agent step-level nodes in a tree search, it not only improves sampling efficiency but also naturally extracts richer process-level supervision from outcome rewards, leading to more stable and higher-performing RL training for LLM agents.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with the following foundational concepts:
- Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on transformer architectures, trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide range of natural language processing (NLP) tasks, such as text generation, summarization, translation, and question answering. Examples include the GPT series, Llama, and Qwen.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to learn a policy that maximizes its cumulative reward over time.
  - Agent: The entity that performs actions and learns from interactions. In this paper, the LLM is the agent.
  - Environment: The external system with which the agent interacts.
  - State ($s$): A representation of the current situation of the environment, providing the agent with information to make decisions. In LLM agent RL, this typically includes the prompt and interaction history.
  - Action ($a$): A decision made by the agent at a given state. For LLMs, this could be generating text (a thought) or calling a tool.
  - Reward ($r$): A scalar feedback signal from the environment that indicates the desirability of the agent's actions.
  - Policy ($\pi$): A function or distribution that maps states to actions, dictating the agent's behavior. The goal of RL is to find an optimal policy.
  - Trajectory: A sequence of states, actions, and rewards resulting from an agent's interaction with the environment over time, e.g., $(s_0, a_0, r_0, s_1, a_1, r_1, \ldots)$.
- LLM Agents: An LLM equipped with tools and the ability to autonomously plan and execute a sequence of actions (often in multi-turn interactions) to achieve a goal. They often involve a Thought-Action-Observation loop.
- ReAct Framework: A specific LLM agent framework that combines Reasoning (Thought) and Acting (Action) to solve tasks. At each step, the LLM first produces a Thought (reasoning about the task and planning next steps), then takes an Action (e.g., using a tool like a search engine), and finally processes the Observation (the result of the action) to inform its next Thought. This iterative process allows LLMs to interact dynamically with environments.
- Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by a tuple $(S, A, P, R, \gamma)$:
  - $S$: A set of states.
  - $A$: A set of actions.
  - $P$: A state transition probability function $P(s' \mid s, a)$, which gives the probability of moving to state $s'$ from state $s$ after taking action $a$.
  - $R$: A reward function $R(s, a, s')$, which defines the reward received for transitioning from state $s$ to $s'$ via action $a$.
  - $\gamma$: A discount factor, indicating the present value of future rewards.
- Policy Optimization: In RL, this refers to the process of iteratively updating the agent's policy to maximize the expected cumulative reward. Methods like Proximal Policy Optimization (PPO) and Group Relative Policy Optimization (GRPO) are examples.
- Advantage Estimation: A technique used in policy optimization to reduce variance in gradient estimates. The advantage function $A(s, a)$ measures how much better an action is than the average action at state $s$. It is typically defined as $A(s, a) = Q(s, a) - V(s)$, where $Q(s, a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s)$ is the state-value function (expected return from state $s$). In group-based RL, advantages are often estimated relative to other trajectories within a group.
- Group-based RL: A class of RL algorithms that estimate advantages by comparing the performance of multiple candidate rollouts (trajectories) within a group. Instead of relying on a learned value function, it uses the average or relative performance of other samples as a baseline. This is designed to stabilize gradient updates and improve sample efficiency. GRPO and GSPO are examples.
- Direct Preference Optimization (DPO): An offline RL method used to align LLMs with human preferences. Instead of learning a reward model, DPO directly optimizes the policy to maximize the likelihood of preferred responses over dispreferred ones, using a simple binary preference dataset (e.g., "chosen" vs. "rejected" responses). This makes it computationally simpler than Reinforcement Learning from Human Feedback (RLHF).
- Monte Carlo Tree Search (MCTS): A heuristic search algorithm used for decision-making in complex domains, famously used in game AI (e.g., AlphaGo). It builds a search tree by repeatedly performing four steps: Selection, Expansion, Simulation (rollout), and Backpropagation. It explores promising branches more deeply.
- KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In RL, it is often used as a regularization term to prevent the learned policy from deviating too much from a reference policy (e.g., the original LLM's policy) or a previous policy, ensuring stable training (a short numerical sketch follows this list). $ D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $ Here, $P(x)$ and $Q(x)$ are the probabilities of event $x$ under distributions $P$ and $Q$, respectively. It quantifies the information lost when $Q$ is used to approximate $P$.
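As a quick illustration of the definition above, here is a minimal sketch of the discrete KL divergence; the distributions are illustrative, not taken from the paper.

```python
import math

def kl_divergence(p, q):
    """Discrete KL divergence D_KL(P || Q) for two distributions given as
    lists of probabilities over the same support (terms with p_i = 0 vanish)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Example: a slightly drifted policy vs. a reference policy over three actions.
policy = [0.7, 0.2, 0.1]
reference = [0.6, 0.3, 0.1]
print(round(kl_divergence(policy, reference), 4))  # ~0.0268
```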
3.2. Previous Works
The paper contextualizes its work by referencing several prior approaches:
- RL for LLMs (Single-Turn):
  - Kimi Team (2025b), Yu et al. (2025), Chu et al. (2025a), Shao et al. (2024), Xin et al. (2024): These works demonstrate LLMs learning complex reasoning abilities and achieving significant gains in single-turn tasks (e.g., mathematical proof, code generation) when trained with outcome rewards. This establishes the foundation for applying RL to LLMs.
  - DeepSeek-AI Team (2025), Yang et al. (2025a), OpenAI (2024): These are frontier models and general RL paradigms for LLMs. DeepSeek-AI Team (2025) specifically introduces GRPO (Group Relative Policy Optimization), which is a direct baseline for Tree-GRPO. GRPO estimates advantages by sampling a group of candidate rollouts to form an in-group baseline, aiming to stabilize gradient updates without requiring extra value functions.
- RL for LLM Agents (Multi-Turn):
  - Feng et al. (2025b), Singh et al. (2025), Wang et al. (2025a), Qian et al. (2025), Feng et al. (2025a): These works extend the RL paradigm to more complex agent settings involving dynamic, multi-turn interactions, which is the direct problem domain of Tree-GRPO. Wang et al. (2025a) is specifically cited for formalizing the ReAct-based process as an MDP.
  - Dong et al. (2025), Zhang et al. (2025b): Further works applying RL to optimize LLM policies for agent tasks.
  - Wang et al. (2025b; a), Jin et al. (2025b): These works highlight the problem of sparse supervision and potential training collapse in agent RL due to trajectory-level sparse signals, which Tree-GRPO explicitly aims to solve. Search-R1 (Jin et al., 2025b) is the codebase on which Tree-GRPO's implementation is built, including prompt templates and agent-environment interaction.
- Baselines Compared in the Paper:
  - ReAct (Yao et al., 2023b): A foundational agent framework that combines Reasoning and Acting to interact with environments using Thought-Action-Observation cycles. It is used as a direct baseline.
  - Search-o1 (Li et al., 2025a): An advanced Retrieval Augmented Generation (RAG) method, representing a strong non-RL baseline for QA tasks.
  - GRPO (DeepSeek-AI Team, 2025): The group-based RL algorithm that Tree-GRPO builds upon and directly compares against, representing a chain-based RL approach.
  - GSPO (Zheng et al., 2025): Another group-based RL method, also serving as a chain-based baseline.
- Offline RL / Preference Learning for LLMs:
  - Wang et al. (2025b), Xie et al. (2024), Xiong et al. (2024): These works construct step-level DPO data in an offline manner to achieve more fine-grained optimization objectives, similar to Tree-GRPO's goal of step-level preference learning, but Tree-GRPO does it online.
- Tree Search for LLMs:
  - Test-Time Scaling: Yao et al. (2023a), Long (2023), Snell et al. (2024) propose Tree-of-Thought to allow LLMs to consider multiple reasoning paths; Xin et al. (2024; 2025) employ MCTS for generating diverse proof paths.
  - Offline Data Generation: He et al. (2024), Feng et al. (2024), Wu et al. (2024), Xie et al. (2024), Zhang et al. (2025c), Lai et al. (2024) utilize tree-search structures for constructing step-level preference learning data for DPO or SFT.
  - Online RL (Token/Sentence Level): Hou et al. (2025), Zhang et al. (2024), Yang et al. (2025b) are similar to Tree-GRPO in using tree search for sampling in LLM online RL, but these methods typically operate at the token or sentence level and are not designed for the agent step-level structure. Tree-GRPO differentiates itself by using a complete Thought-Action-Observation step as the tree node, which is more natural for agent tasks.
3.3. Technological Evolution
The evolution of LLMs has moved from basic text generation to complex reasoning, and now to agentic capabilities. Initially, LLMs were fine-tuned for specific tasks via Supervised Fine-Tuning (SFT). Reinforcement Learning from Human Feedback (RLHF) then emerged as a powerful paradigm for aligning LLMs with human preferences, enabling models to learn complex behaviors by maximizing a reward signal. This led to significant improvements in single-turn tasks like mathematical reasoning and code generation.
The next frontier involved enabling LLMs to act as agents that can interact with dynamic environments over multi-turn sessions, often by using external tools. This agentic RL opened up possibilities for more autonomous and complex problem-solving. However, multi-turn interactions exacerbate the challenges of sparse rewards and high computational costs (many tokens and tool calls per trajectory).
This paper's work fits within this timeline by pushing the boundaries of agentic RL. While tree search has been explored for LLMs in test-time inference and offline DPO data generation, its application to online RL training for LLM agents with step-level nodes is a novel step. It directly addresses the limitations of prior chain-based RL for agents, which struggle with sparse supervision and high rollout budgets. By leveraging the inherent structure of agentic Thought-Action-Observation steps within a tree search, Tree-GRPO represents an advancement towards more efficient and effective RL training for complex LLM agents.
3.4. Differentiation Analysis
The core differences and innovations of Tree-GRPO compared to related work are:
- Chain-based RL (GRPO, GSPO):
  - Sampling Strategy: Traditional chain-based RL samples multiple independent complete trajectories for each task. This leads to considerable redundancy, especially when many trajectories share common initial steps.
  - Supervision Signal: It relies primarily on trajectory-level outcome rewards, which are sparse and provide coarse credit assignment, making it hard to pinpoint effective steps in long multi-turn interactions.
  - Tree-GRPO's Innovation: Tree-GRPO replaces this with tree-search-based sampling. By sharing common prefixes, it significantly increases the number of effective rollouts achievable within the same token/tool-call budget. More importantly, the tree structure naturally allows the construction of step-wise process supervised signals from the outcome reward, introducing implicit step-level preference learning. This makes the supervision much richer and less sparse.
- Token/Sentence-level Tree Search in Online RL:
  - Node Granularity: Previous online RL tree-search methods (e.g., Hou et al., Zhang et al., Yang et al.) often use token-level or sentence-level units as tree nodes. While this offers finer granularity, it does not align well with the inherent structure of LLM agent interactions.
  - Tree-GRPO's Innovation: Tree-GRPO defines each tree node as a complete agent interaction step (a Thought-Action-Observation tuple). This design is more semantically aligned with how LLM agents operate, providing clearer contextual segmentation and explicit constraints on the rollout budget in terms of both tokens and tool calls, which are critical for agent tasks. This agent step-level granularity is shown to be more suitable and effective.
- Offline Step-Level DPO Data Generation:
  - Online vs. Offline: Some works use tree search to generate offline step-level preference data for DPO or SFT. This involves a separate data collection and training phase.
  - Tree-GRPO's Innovation: Tree-GRPO integrates step-level preference learning implicitly within its online RL framework. It directly uses the tree-structured rollouts to estimate relative advantages that mimic DPO's gradient structure during online learning, avoiding the complexity of explicit offline data construction.

In essence, Tree-GRPO innovates by combining an efficient tree-search sampling strategy tailored for LLM agents (using agent step-level nodes) with a novel tree-based advantage estimation that extracts fine-grained process supervision from sparse outcome rewards within an online RL setting, addressing core limitations of prior methods.
4. Methodology
4.1. Principles
The core idea behind Tree-GRPO (Tree-based Group Relative Policy Optimization) is to leverage a tree-search mechanism for generating agent trajectories in Reinforcement Learning (RL). This approach is driven by two main principles:
- Efficient Rollout Sampling: By structuring agent interactions as a tree whose nodes represent complete Thought-Action-Observation steps and sharing common prefixes, the method aims to generate a larger number of diverse rollouts (trajectories) within the same computational budget (tokens and tool calls) compared to independent chain-based sampling. This addresses the high cost of LLM agent interactions.
- Deriving Fine-Grained Process Supervision: The inherent tree structure allows the back-propagation of outcome rewards from the leaves to branching points. The differences in rewards between sibling branches then naturally form preference-learning objectives. This effectively transforms sparse trajectory-level outcome rewards into richer, step-wise process supervised signals without requiring additional supervision or a separate reward model, addressing the problem of sparse supervision in long-horizon, multi-turn agent tasks.

The theoretical basis is that this tree-based advantage estimation can be shown to be structurally equivalent to step-level Direct Preference Optimization (DPO), thereby providing a principled way to incorporate fine-grained preference learning into an online RL framework.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Multi-turn Agent Framework
The paper adopts the widely used ReAct (Reasoning and Acting) framework for agent interactions. Unlike static, single-turn interactions, an LLM agent using ReAct engages in multi-turn Thought-Action-Observation cycles with an environment to solve a given task.
At each time step $t$, ranging from $0$ to $T-1$:

- The LLM generates a thought ($\tau_t$) and a parsable textual action ($\alpha_t$) based on the current context ($s_t$).
- The action ($\alpha_t$) typically involves using a tool (e.g., a search engine).
- Through the tool, the agent interacts with the environment to obtain a new observation ($o_t$).

A complete $T$-step agent episode consists of an interleaved trajectory, denoted as $\mathcal{H}$: $ \mathcal{H}=\left\{\left(\tau_{0}, \alpha_{0}, o_{0}\right),\left(\tau_{1}, \alpha_{1}, o_{1}\right), \ldots,\left(\tau_{T-1}, \alpha_{T-1}, o_{T-1}\right)\right\} $ These trajectories can become very long, reaching tens of thousands of tokens for complex tasks.
This dynamic process is formalized as a Markov Decision Process (MDP) $(\mathcal{S}, \mathcal{A}, \mathcal{P})$:

- $\mathcal{S}$: Represents the states, which correspond to the complete interaction context up to a given time step, $s_t$.
- $\mathcal{A}$: Denotes the compound action space, where each action consists of a thought-action pair ($\tau_t, \alpha_t$).
- $\mathcal{P}$: Denotes the transition dynamics, which include both the external environment's dynamics ($P_{\mathrm{env}}$) and the concatenation of the full context over time steps.

The complete process, based on an LLM policy model $\pi_\theta$, can be formulated probabilistically as:
$
p_{\theta}\left(s_{0: T}, \tau_{0: T}, \alpha_{0: T}, o_{0: T}\right)=p\left(s_{0}\right) \prod_{t=0}^{T-1}\left[\pi_{\theta}\left(\tau_{t} \mid s_{t}\right) \pi_{\theta}\left(\alpha_{t} \mid s_{t}, \tau_{t}\right) P_{\mathrm{env}}\left(o_{t+1} \mid \alpha_{t}\right)\right]
$
Where:

- $p_{\theta}(s_{0:T}, \tau_{0:T}, \alpha_{0:T}, o_{0:T})$: The joint probability of the entire trajectory under policy $\pi_\theta$.
- $s_{0:T}$: The sequence of states from time 0 to $T$.
- $\tau_{0:T}$: The sequence of thoughts from time 0 to $T$.
- $\alpha_{0:T}$: The sequence of actions from time 0 to $T$. 
- $o_{0:T}$: The sequence of observations from time 0 to $T$. Note that $o_{t+1}$ is the observation returned after action $\alpha_t$ is taken, leading to state $s_{t+1}$.
- $p(s_0)$: The probability distribution of the initial state.
- $\pi_{\theta}(\tau_t \mid s_t)$: The probability of generating thought $\tau_t$ given state $s_t$ under the LLM policy $\pi_\theta$.
- $\pi_{\theta}(\alpha_t \mid s_t, \tau_t)$: The probability of generating action $\alpha_t$ given state $s_t$ and thought $\tau_t$ under the LLM policy $\pi_\theta$.
- $P_{\mathrm{env}}(o_{t+1} \mid \alpha_t)$: The probability of receiving observation $o_{t+1}$ from the environment after taking action $\alpha_t$.
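To make the formalization above concrete, the following is a minimal sketch of the multi-turn Thought-Action-Observation loop. The helpers `llm_generate` and `search_tool` are hypothetical stand-ins for the policy model and the tool environment, not the paper's actual APIs.

```python
# A minimal sketch of the ReAct-style rollout loop formalized above.
# `llm_generate` and `search_tool` are hypothetical stand-ins for the policy
# model pi_theta and the tool environment P_env.

def run_agent_episode(llm_generate, search_tool, prompt, max_turns=8):
    context = prompt           # s_0: the task prompt
    trajectory = []            # H = [(thought, action, observation), ...]
    for _ in range(max_turns):
        thought, action = llm_generate(context)           # pi_theta(tau_t, alpha_t | s_t)
        if action.startswith("answer:"):                  # terminal action ends the episode
            trajectory.append((thought, action, None))
            break
        observation = search_tool(action)                 # o_{t+1} ~ P_env(. | alpha_t)
        trajectory.append((thought, action, observation))
        # s_{t+1}: the next state is the concatenated interaction history
        context = f"{context}\n{thought}\n{action}\n{observation}"
    return trajectory
```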
4.2.2. Agentic Reinforcement Learning
The goal of Agentic Reinforcement Learning is to optimize the LLM policy by maximizing the expected return (total reward) of the full state-action trajectory. This is typically formulated as:
$
J(\theta)=\mathbb{E}_{\mathcal{H} \sim p_{\theta}}[R(\mathcal{H})]
$
Where:

- $J(\theta)$: The objective function to be maximized, representing the expected return for policy parameters $\theta$.
- $\mathbb{E}_{\mathcal{H} \sim p_{\theta}}$: The expectation taken over trajectories $\mathcal{H}$ sampled according to the policy $\pi_\theta$.
- $R(\mathcal{H})$: The scalar reward obtained for the entire trajectory $\mathcal{H}$.

In practice, this optimization is performed using a variance-reduced advantage estimator to stabilize gradient updates. Most existing agentic RL systems use an outcome-based reward, meaning $R(\mathcal{H})$ is a single scalar delivered at the end of the trajectory, based on predefined rules or a model-based scoring function.
The Tree-GRPO method is built upon group-based RL algorithms. Unlike PPO which often uses a separate value function to estimate advantages, group-based RL methods estimate advantages by sampling a group of candidate rollouts and using an in-group baseline to guide the optimization direction.
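As a rough illustration of this group-based idea (not the paper's exact implementation), the in-group baseline can be computed by standardizing the outcome rewards of the rollouts sampled for one prompt:

```python
import statistics

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each rollout's outcome reward
    against the other rollouts sampled for the same prompt."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Four rollouts for one prompt, scored by a rule-based outcome reward.
print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # approximately [1, -1, 1, -1]
```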
4.2.3. Tree Search for Agent Rollout
The paper proposes an online rollout strategy based on tree search to overcome the limitations of chain-based sampling. The main challenge for tree search in online RL for LLMs is the multi-turn sequential nature of rollouts, which usually doesn't parallelize well. To mitigate this, Tree-GRPO adopts an initialize-then-expand approach and uses agent step-level nodes.
Figure 2 illustrates the difference between chain-based, token/sentence-level tree search, and the proposed agent step-level tree search.
As shown in Image 2, the chain-based rollout independently generates full trajectories. Token/sentence-level tree search expands at a very fine granularity. The proposed method (Right, Ours) uses complete Thought-Action-Observation steps as nodes, aligning with the agent's interaction structure.
The overall tree-search sampling process involves these steps:
- Initialization: For each given prompt $x$, $M$ independent chain-based trajectories are first generated by the current policy model. These trajectories form the initial trees, denoted as $\{\mathcal{T}_m\}_{m=1}^{M}$.
- Sampling: From the initialized trees, non-leaf nodes are randomly selected for expansion. A non-leaf node is an intermediate step, not the final agent answer.
- Expansion: For each selected node, the context from the root of the tree up to that node, combined with the original prompt $x$, serves as input. The policy model then generates the remaining part of the response. This new branch is then inserted into the source tree.

This process of Sampling and Expansion is iteratively repeated $L$ times, with $N$ expansions per round. This results in a total of $M + N \cdot L$ rollouts, which form the final group size $G$ for a single prompt. These rollouts are distributed across the $M$ trees (a minimal code sketch of this initialize-then-expand procedure is given below).
A key benefit is budget efficiency. Let $B$ be the expected rollout budget (tokens and tool calls) for a single complete agent trajectory. For each random tree expansion, the expected depth of the selected node is half of the maximum depth, so the expected cost of an expansion is $B/2$.
The total expected budget for tree-search sampling is:
$
\mathbb{E}\left[B_{\text {tree }}\right]=M \cdot B+L \cdot N \cdot B / 2
$
Where:
- $\mathbb{E}[B_{\text{tree}}]$: The expected total budget for tree-search sampling.
- $M \cdot B$: The budget for the $M$ initial complete trajectories.
- $L \cdot N \cdot B / 2$: The budget for $L$ rounds of expansion, where each round involves $N$ expansions and each expansion costs, on average, $B/2$.

This formula shows that tree search can obtain a larger number of agent rollouts for the same budget compared to chain-based methods, especially when the number of expansions $L \cdot N$ is large relative to $M$. Decreasing $M$ while increasing $N$ and $L$ can raise the number of rollouts but narrows the exploration scope, as more trajectories share prefixes.
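A worked example of this budget comparison, using illustrative values for M, N, and L (not necessarily the paper's defaults):

```python
B = 1.0                                  # expected cost of one complete trajectory
M, N, L = 2, 2, 3                        # illustrative tree parameters

num_rollouts = M + L * N                 # 8 trajectories in both settings
chain_budget = num_rollouts * B          # chain-based: 8.0 * B
tree_budget = M * B + L * N * B / 2      # tree-based:  2B + 3B = 5.0 * B

print(num_rollouts, chain_budget, tree_budget)   # 8 8.0 5.0
```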
Image 3 provides an overview of the Tree-GRPO training pipeline. It illustrates how the rollout is conducted in a tree-search manner, with each node corresponding to a complete thought-action-observation step. It shows the grouped relative advantages being estimated at both intra-tree and inter-tree levels, and emphasizes that Tree-GRPO constructs step-level process supervision signals with a reduced rollout budget.
(Figure description: a schematic of the Tree-GRPO framework, showing the agent's step-level tree search and the interplay among the policy model, tool environment, and reward model, highlighting the reduced rollout budget and the finer-grained process signals.)
Figure 3: The overview of the Tree-GRPO training pipeline. The rollout is conducted in a tree-search manner, where each node corresponds to a complete thought-action-observation step. The group relative advantages are estimated at both intra-tree and inter-tree levels. Tree-GRPO constructs step-level process supervision signals through a tree structure with a smaller rollout budget.
4.2.4. Tree-based Group Relative Advantages
In traditional chain-based RL for multi-turn agents, the reward is typically computed only at the outcome of a complete trajectory. This means the advantage estimation is also at the trajectory level, assigning the same credit to all steps:
$
A(\mathcal{H})=A\left(\left\{\left(\tau_{0}, \alpha_{0}, o_{0}\right), \ldots,\left(\tau_{T}, \alpha_{T}, o_{T}\right)\right\}\right)=A\left(\left\{\tau_{0}, \alpha_{0}, o_{0}\right\}\right)=\ldots=A\left(\left\{\tau_{T}, \alpha_{T}, o_{T}\right\}\right)
$
This coarse credit assignment due to sparse rewards severely impacts the stability of RL training for long-horizon agents.
Tree-GRPO addresses this by leveraging the tree structure to derive process credit signals. As depicted in Figure 4, when multiple branches originate from a common node, the difference between the back-propagated outcome rewards from their respective leaves naturally forms a preference-learning objective for those subtrees. This offers process signals of varying granularity depending on the subtree depth.
To implement this, Tree-GRPO performs grouped advantage estimation using two levels of grouping:
- Intra-tree Grouping ($G_{\text{Intra-tree}}$): This calculates advantages by comparing rollouts within the same tree $\mathcal{T}_i$. At any branching point, the outcomes of sibling branches are compared, allowing for step-level preference learning.
- Inter-tree Grouping ($G_{\text{Inter-tree}}$): To stabilize training, particularly given potentially limited rollouts within a single tree, advantages are also estimated across all rollouts from all trees generated for a given prompt. This provides a more robust baseline.

The group relative advantage for a trajectory $\mathcal{H}^i$ is calculated as:
$
\hat{A}_{\text{Intra/Inter-tree}}\left(\mathcal{H}^{i}\right)=\left[R\left(\mathcal{H}^{i}\right)-\operatorname{mean}\left(\left\{R\left(\mathcal{H}^{j}\right)\right\}_{j}^{G_{\text{Intra/Inter-tree}}\left(\mathcal{T}_{i}\right)}\right)\right] \Big/ \operatorname{std}\left(\left\{R\left(\mathcal{H}^{j}\right)\right\}_{j}^{G_{\text{Intra/Inter-tree}}\left(\mathcal{T}_{i}\right)}\right)
$
Where:

- $\hat{A}_{\text{Intra/Inter-tree}}(\mathcal{H}^i)$: The estimated advantage for trajectory $\mathcal{H}^i$ based on either intra-tree or inter-tree grouping.
- $R(\mathcal{H}^i)$: The outcome reward of trajectory $\mathcal{H}^i$.
- $\operatorname{mean}(\{R(\mathcal{H}^j)\})$: The mean reward of all trajectories within the specified group (either intra-tree or inter-tree).
- $\operatorname{std}(\{R(\mathcal{H}^j)\})$: The standard deviation of rewards within the specified group, used for normalization.

The final advantage estimate combines both intra-tree and inter-tree components:
$
\hat{A}_{\text{tree}}\left(\mathcal{H}^{i}\right)=\hat{A}_{\text{Intra-tree}}\left(\mathcal{H}^{i}\right)+\hat{A}_{\text{Inter-tree}}\left(\mathcal{H}^{i}\right)
$
This combination provides the benefits of step-level preference learning from intra-tree comparisons while maintaining stability from the broader inter-tree baseline.
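A minimal sketch of this combined advantage computation, assuming the trajectories of each tree have already been scored with an outcome reward (edge cases such as zero variance are handled by the `eps` term here, not necessarily as in the paper):

```python
import statistics

def standardize(rewards, eps=1e-8):
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def tree_grpo_advantages(trees_rewards):
    """trees_rewards: one list of outcome rewards per tree, all for the same prompt.
    Returns A_tree = A_intra-tree + A_inter-tree for every trajectory, in order."""
    flat = [r for tree in trees_rewards for r in tree]
    inter = standardize(flat)                # baseline over all rollouts for the prompt
    advantages, k = [], 0
    for tree in trees_rewards:
        intra = standardize(tree)            # comparison against siblings in the same tree
        for a in intra:
            advantages.append(a + inter[k])
            k += 1
    return advantages

# Two trees with four trajectories each (binary outcome rewards).
print(tree_grpo_advantages([[1.0, 0.0, 1.0, 0.0], [0.0, 0.0, 1.0, 1.0]]))
```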
Image 4 visually compares chain-based and tree-based rollouts, highlighting how the tree structure (right side) allows process signals from different branches sharing a common prefix to be combined, making the signal more effective than isolated chain-based signals (left side).
(Figure description: a schematic contrasting chain-based and tree-based RL policy optimization. On the left, the chain structure limits how signals propagate to each action; on the right, the tree structure combines and shares prefixes so that reward differences between sibling branches act as preference signals, supporting more effective process-level supervision.)
Figure 4: Comparison between chain-based and tree-based rollouts.
The final Tree-based Group Relative Policy Optimization objective is:
$
\begin{aligned}
J_{\text{Tree-GRPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\, \mathcal{H}^{\text{tree}} \sim \pi_{\text{old}}(\cdot \mid x)} & \left[\frac{1}{G} \sum_{i=1}^{G} \frac{1}{\left|\mathcal{H}^{i}\right|} \sum_{t=1}^{\left|\mathcal{H}^{i}\right|} \min \left(r_{i, t}(\theta) \hat{A}_{\text{tree}}\left(\mathcal{H}^{i}\right),\right.\right. \\
& \left.\left.\operatorname{clip}\left(r_{i, t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{\text{tree}}\left(\mathcal{H}^{i}\right)\right)-\beta \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}(\mathcal{H} \mid x) \,\|\, \pi_{\text{ref}}(\mathcal{H} \mid x)\right)\right]
\end{aligned}
$
This objective function is a variation of PPO (Proximal Policy Optimization) or GRPO, incorporating the tree-based advantage and a KL divergence regularization term.
Where:
- $J_{\text{Tree-GRPO}}(\theta)$: The objective function for Tree-GRPO that the policy aims to maximize.
- $\mathbb{E}_{x \sim \mathcal{D},\, \mathcal{H}^{\text{tree}} \sim \pi_{\text{old}}(\cdot \mid x)}$: Expectation over prompts $x$ from dataset $\mathcal{D}$ and trajectories sampled using the old policy $\pi_{\text{old}}$.
- $G$: The total number of rollouts in the group for a given prompt.
- $|\mathcal{H}^i|$: The length (number of steps) of trajectory $\mathcal{H}^i$.
- $r_{i,t}(\theta)$: The importance sampling ratio at token level for trajectory $\mathcal{H}^i$, i.e., the ratio of the token's probability under the current policy $\pi_\theta$ to that under the old policy $\pi_{\text{old}}$.
- $\hat{A}_{\text{tree}}(\mathcal{H}^i)$: The combined tree-based group relative advantage for trajectory $\mathcal{H}^i$.
- $\min(\cdot, \cdot)$: The PPO clipping mechanism, taking the minimum of the unclipped objective term and its clipped version. This prevents excessively large policy updates.
- $\operatorname{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)$: Clips the importance ratio within the range $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a small hyperparameter (e.g., 0.2). This ensures updates are not too aggressive.
- $\mathbb{D}_{\mathrm{KL}}(\pi_\theta \| \pi_{\text{ref}})$: A KL divergence regularization term that penalizes the current policy for deviating too much from a reference LLM $\pi_{\text{ref}}$.
- $\beta$: A hyperparameter controlling the strength of the KL regularization.
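A token-level sketch of the clipped surrogate term in this objective; the KL regularization and the averaging over the $G$ rollouts of the group are omitted, and this is not the paper's training code:

```python
import math

def clipped_term(logp_new, logp_old, advantage, eps=0.2):
    """PPO-style clipped surrogate for one token, with the trajectory-level
    tree-based advantage shared by every token of that trajectory."""
    ratio = math.exp(logp_new - logp_old)                  # r_{i,t}(theta)
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)                         # maximized; negate for a loss

def trajectory_objective(logps_new, logps_old, advantage, eps=0.2):
    # (1 / |H^i|) * sum over the trajectory's tokens.
    terms = [clipped_term(n, o, advantage, eps) for n, o in zip(logps_new, logps_old)]
    return sum(terms) / len(terms)
```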
4.2.5. Implicit Step-Level Preference Learning
To provide a deeper understanding of how Tree-GRPO achieves fine-grained supervision, the paper draws a theoretical connection between intra-tree GRPO and step-level Direct Preference Optimization (DPO).
Assumption 3.1 (Binary Preference Setting):
For any intermediate tree node represented by the context $\mathcal{H}_{<t}$, the subsequent trajectory (from step $t$ onwards) is assumed to fall into two categories based on its reward: $\mathcal{H}_{\geq t}^{\text{win}}$ (winning trajectory) and $\mathcal{H}_{\geq t}^{\text{loss}}$ (losing trajectory), associated with simplified rewards of 1 and 0 respectively.
The probabilities of these subsequent trajectories under policy are defined as:
$
p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)=1-p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)=\prod_{\tau=t}^{T} \pi_{\theta}\left(\mathcal{H}_{\tau}^{\text{win}} \mid x, \mathcal{H}_{<\tau}\right)
$
This assumption simplifies the reward landscape to a binary preference, making the connection to DPO clearer.
Step-Level DPO Objective Gradient:
Under this binary preference assumption, the gradient of the step-level DPO objective (which aims to maximize the Bradley-Terry likelihood of preferred outcomes) can be expressed as:
$
\begin{aligned}
\nabla_{\theta} J_{\text{step-DPO}}(\theta)=\mathbb{E}_{\left(x, \mathcal{H}_{<t}, \mathcal{H}_{\geq t}^{\text{win}}, \mathcal{H}_{\geq t}^{\text{loss}}\right) \sim \mathcal{D}} & \left[\sigma\left(\beta \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)-\beta \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)\right)\right. \\
& \left.\cdot\left(\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right)\right]
\end{aligned}
$
Where:
- $\nabla_{\theta} J_{\text{step-DPO}}(\theta)$: The gradient of the step-level DPO objective with respect to policy parameters $\theta$.
- $\mathbb{E}_{(x, \mathcal{H}_{<t}, \mathcal{H}_{\geq t}^{\text{win}}, \mathcal{H}_{\geq t}^{\text{loss}}) \sim \mathcal{D}}$: Expectation over preference pairs from the dataset $\mathcal{D}$, where each pair consists of a context and a winning and a losing subsequent trajectory.
- $\sigma(\cdot)$: The sigmoid function, which scales the preference score between 0 and 1.
- $\beta$: A hyperparameter from DPO, often related to the inverse temperature.
- $\log p_{\theta}(\cdot)$: The log-probability of the trajectory under the current policy $\pi_\theta$.
- $\nabla_{\theta} \log p_{\theta}(\cdot)$: The gradient of the log-probability, which is proportional to the policy gradient.
Intra-tree GRPO Gradient:
The paper shows that the gradient of the intra-tree GRPO objective can also be derived into a similar form:
$
\nabla_{\theta} J_{\text{Intra-tree}}(\theta)=p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right) \cdot p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right) \cdot\left[\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right]
$
Where:
- $\nabla_{\theta} J_{\text{Intra-tree}}(\theta)$: The gradient of the intra-tree GRPO objective.
- $p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}})$ and $p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}})$: The probabilities of the winning and losing subsequent trajectories under the current policy.
Proposition 3.1 (Structural Equivalence):
The paper states that under Assumption 3.1, both step-level DPO and intra-tree GRPO admit gradient estimators of the form:
$
\nabla_{\theta} J_{\text{unified}}(\theta)=\underbrace{w}_{\text{Weight}} \cdot \underbrace{\left(\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right)}_{\text{Preference Advantage Gradient}}
$
Where:
- $w$: The Weight term, which is $\sigma\left(\beta \log p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}})-\beta \log p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}})\right)$ for step-level DPO, and $p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}}) \cdot p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}})$ for intra-tree GRPO.

This proposition indicates that intra-tree GRPO implicitly performs step-level preference optimization, just like DPO, but does so in an online rollout setting by comparing sibling branches within the tree structure. The difference lies only in the specific weighting function applied to the preference advantage gradient. This theoretical equivalence underscores how Tree-GRPO successfully extracts fine-grained process supervision from simple outcome rewards through its tree-based design. A small numerical comparison of the two weighting functions is sketched below.
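Under Assumption 3.1 ($p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}}) = 1 - p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}})$), the two weights can be compared numerically; the probability values below are illustrative only.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dpo_weight(p_win, beta=1.0):
    # Step-level DPO weight: sigma(beta * (log p_loss - log p_win)), with p_loss = 1 - p_win.
    return sigmoid(beta * (math.log(1.0 - p_win) - math.log(p_win)))

def intra_tree_weight(p_win):
    # Intra-tree GRPO weight under the binary-preference assumption: p_win * p_loss.
    return p_win * (1.0 - p_win)

for p in (0.3, 0.5, 0.7, 0.9):
    print(p, round(dpo_weight(p), 3), round(intra_tree_weight(p), 3))
# Both weights shrink as the policy becomes confident about the winning branch,
# but with different shapes; the gradient direction they scale is identical.
```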
5. Experimental Setup
5.1. Datasets
To evaluate Tree-GRPO, experiments were conducted on 11 benchmarks categorized into three types of QA tasks:
- Multi-Hop QA: These tasks require multiple reasoning steps and interactions to gather information from different sources before arriving at an answer.
  - HotpotQA (Yang et al., 2018): Requires finding and combining information from multiple documents.
  - 2WikiMultiHopQA (Ho et al., 2020): Similar to HotpotQA, focused on Wikipedia.
  - Musique (Trivedi et al., 2022): Designed for multi-hop reasoning over complex compositional questions.
  - Bamboogle (Press et al., 2023): A dataset that may involve multi-step reasoning and potentially tool use for information gathering.
- Single-Hop QA: These tasks typically require retrieving a single piece of information or performing a single reasoning step to answer the question.
  - NQ (Natural Questions) (Kwiatkowski et al., 2019): Questions that can be answered directly from a Wikipedia page.
  - TriviaQA (Joshi et al., 2017): Challenging trivia questions that require broad factual knowledge.
  - PopQA (Mallen et al., 2023): A dataset focused on popular knowledge questions.
- Web-Agent QA: These are more complex tasks that require an agent to interact with a web environment, often involving browsing, searching, and extracting information from web pages.
  - SimpleQA (Press et al., 2023): A web-based QA task.
  - GAIA (Mialon et al., 2023): A benchmark for general AI agents, often involving multi-tool use and web interaction.
  - WebWalkerQA (Wu et al., 2025b): A dataset designed for web agents.
  - BrowseComp (Wei et al., 2025): A benchmark for LLM agents that need to browse and comprehend web content.
Tool Usage:
For Multi-Hop QA and Single-Hop QA, an E5-based local retrieval server (Wang et al., 2024) built on a Wikipedia dump (Karpukhin et al., 2020) was used as the search tool. This simulates a controlled and consistent information retrieval environment.
For Web-Agent QA, a real web search API was employed for retrieval, reflecting the open-ended and dynamic nature of web interaction tasks.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below:
5.2.1. Exact Match (EM)
- Conceptual Definition: Exact Match (EM) is a stringent metric commonly used in question answering tasks. It quantifies the proportion of predictions that exactly match the ground truth answer. If a prediction is even slightly different from the reference answer (e.g., missing punctuation, different capitalization, extra words), it scores zero. Otherwise, it scores one. It measures the precision of the model's answer.
- Mathematical Formula: Given a set of $N$ questions, with ground truth answers $G_i$ and predicted answers $P_i$: $ \mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(P_i = G_i) $
- Symbol Explanation:
- $N$: The total number of questions in the dataset.
- $i$: An index iterating through each question.
- $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the condition inside the parentheses is true, and 0 otherwise.
- $P_i$: The predicted answer for the $i$-th question.
- $G_i$: The ground truth answer for the $i$-th question.
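A common implementation sketch of EM with the usual answer normalization (lowercasing, stripping punctuation and articles); the paper's exact normalization rules are not specified here.

```python
import re
import string

def normalize_answer(s):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    s = s.lower()
    s = "".join(ch for ch in s if ch not in set(string.punctuation))
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(prediction, ground_truth):
    return int(normalize_answer(prediction) == normalize_answer(ground_truth))

print(exact_match("The Eiffel Tower.", "eiffel tower"))  # 1
```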
5.2.2. F1 Score
- Conceptual Definition: The F1 score is a measure of a model's accuracy, particularly useful when there is an uneven class distribution or when false positives and false negatives have different costs. In QA, it's often calculated at the token level, measuring the overlap between the predicted answer and the ground truth answer. It is the harmonic mean of precision and recall.
- Precision: The proportion of correctly identified positive cases among all positive cases identified by the model. In QA, it's the fraction of predicted tokens that are correct.
- Recall: The proportion of actual positive cases that are correctly identified by the model. In QA, it's the fraction of ground truth tokens that are captured by the prediction.
- Mathematical Formula:
First, Precision (P) and Recall (R) are calculated: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ and $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $. Then, the F1 score is calculated as: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
- Symbol Explanation:
  - TP (True Positives): The number of tokens correctly identified as part of the answer.
  - FP (False Positives): The number of tokens incorrectly identified as part of the answer.
  - FN (False Negatives): The number of tokens that are part of the answer but were not identified by the model.
- The F1 score for a dataset is typically the average F1 score across all questions.
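A sketch of the token-level F1 described above, using a simple bag-of-tokens overlap (real evaluators typically also strip punctuation and articles):

```python
from collections import Counter

def token_f1(prediction, ground_truth):
    """Token-level F1 between a predicted and a gold answer string
    (lowercase whitespace tokenization)."""
    pred_tokens = prediction.lower().split()
    gold_tokens = ground_truth.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    num_same = sum(common.values())               # TP: overlapping tokens
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)       # TP / (TP + FP)
    recall = num_same / len(gold_tokens)          # TP / (TP + FN)
    return 2 * precision * recall / (precision + recall)

print(round(token_f1("the eiffel tower in paris", "eiffel tower"), 3))  # 0.571
```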
5.3. Baselines
The proposed Tree-GRPO method was compared against a comprehensive set of baselines:
- Direct Prompting Methods:
  - Direct Inference: The LLM answers the question without any explicit tool use or multi-turn interaction. This serves as a foundational baseline for the model's inherent knowledge.
  - ReAct agent framework (Yao et al., 2023b): The LLM uses the Thought-Action-Observation loop but without RL fine-tuning. This shows the effectiveness of the agent framework itself before RL optimization.
- Advanced RAG Method:
  - Search-o1 (Li et al., 2025a): A strong Retrieval Augmented Generation (RAG) baseline that leverages search capabilities, providing a robust comparison for information retrieval tasks.
- RL-based Methods (Chain-based):
  - GRPO (DeepSeek-AI Team, 2025): Group Relative Policy Optimization, a chain-based group-based RL algorithm that Tree-GRPO directly improves upon. This is a crucial baseline to demonstrate the advantages of tree-based sampling and supervision.
  - GSPO (Zheng et al., 2025): Group Sequence Policy Optimization, another chain-based group-based RL method.

These baselines are representative because they cover a spectrum from basic LLM inference, to LLM agents without RL, to advanced RAG, and finally to state-of-the-art chain-based RL methods for LLMs. This allows for a thorough evaluation of Tree-GRPO's unique contributions. The implementation was built upon the Search-R1 (Jin et al., 2025b) repository, ensuring a consistent environment for comparison.
5.4. Models
Experiments were conducted using two prominent series of LLMs at various parameter scales:
- Qwen-2.5 (Base/Instruct) (Qwen et al., 2025)
- Llama-3.2 (Base/Instruct) (Llama Team, 2024)

The parameter scales included:

- 1.5 billion (1.5b)
- 3 billion (3b)
- 7 billion (7b)
- 14 billion (14b)

This allows for evaluating the scalability and effectiveness of Tree-GRPO across different model sizes and architectures. The default rollout budget was set to 4 complete rollouts per prompt during training, unless specified otherwise.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results across 11 datasets and three task categories consistently demonstrate the superiority of Tree-GRPO over chain-based RL methods, particularly for smaller models and multi-hop tasks.
6.1.1. Multi-Hop QA Analysis
As shown in Table 1, Multi-Hop QA tasks inherently require multi-turn interactions for information gathering and synthesis.
- Small Models (<7b): Direct Inference and ReAct (without RL) show minimal improvements, indicating that prompting alone is insufficient for these models to handle complex long-horizon agent tasks.
- Tree-GRPO's Impact: Tree-GRPO achieves substantial improvements over chain-based GRPO for models at the 3b scale and below. For instance, on Qwen2.5-1.5b, Qwen2.5-3b, and Llama3.2-3b-it, Tree-GRPO yields relative improvements ranging from 16% to 69% in average EM scores on Multi-Hop QA. This highlights its ability to effectively teach complex multi-turn tool-use behavior to smaller models, where chain-based methods often struggle or fail to stimulate such behavior.
- Larger Models (e.g., Qwen2.5-14b): While RL offers more limited benefits for larger models that already possess strong agentic capabilities, Tree-GRPO still achieves an average relative improvement of 8.4% over GRPO for Qwen2.5-14b in Multi-Hop QA.
- Overall: The results strongly suggest that the process signals provided by the tree-based method are highly effective in guiding LLM agents through complex, multi-step reasoning tasks.
6.1.2. Single-Hop QA Analysis
In Single-Hop QA settings, fewer interaction turns are typically required.
- 14b Models: Larger models (e.g., 14b) already exhibit strong agentic capabilities under the ReAct framework (without RL tuning) due to their inherent knowledge and reasoning.
- Tree-GRPO's Impact: Tree-GRPO still shows stable improvements over chain-based RL methods, particularly for smaller models like Qwen2.5-1.5b and Qwen2.5-3b.
- Limited Gains: However, the gains from process-level signals are naturally more limited here because many single-hop questions can be solved with just one round of retrieval and one round of answering, leading to shallow tree depths (typically 2). This means the differentiation between process-level and trajectory-level signals is less pronounced.
6.1.3. Web-Agent QA Analysis
Web-Agent QA tasks are highly challenging, often requiring numerous web interactions.
- Challenging Domain: The paper notes that existing open-source web-agent QA benchmarks are predominantly test sets, with limited training data matching the required difficulty.
- Tree-GRPO's Impact: Despite these constraints, Tree-GRPO consistently outperforms chain-based GRPO across the four test datasets, with a notable 28% average improvement on GAIA. This indicates its robustness even in complex, noisy environments.
- Limitations: For extremely challenging benchmarks like BrowseComp, RL yields only marginal gains, which the authors attribute primarily to the limited training data available for such complex tasks.

The following are the results from Table 1 of the original paper:
| Method | Single-Hop QA | | | | Multi-Hop QA | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | NQ | Trivia | PopQA | Avg. / Δ | Hotpot | 2wiki | Musiq | Bamb | Avg. / Δ |
| Qwen2.5-1.5b | | | | | | | | | |
| Direct Inference | 7.1 | 22.4 | 9.9 | 13.1 | 5.9 | 4.3 | 2.6 | 8.0 | 5.2 |
| Search-o1 | 10.2 | 30.9 | 15.0 | 15.4 | 11.6 | 12.2 | 3.1 | 13.0 | 10.0 |
| ReAct | 9.5 | 22.1 | 13.8 | 15.1 | 7.3 | 8.0 | 1.9 | 11.2 | 7.1 |
| + GRPO | 39.4 | 51.0 | 39.7 | 43.4 | 14.6 | 24.4 | 2.2 | 4.0 | 11.3 |
| + GSPO | 36.8 | 48.9 | 37.3 | 41.0 (-5.5%) | 15.8 | 23.7 | 2.5 | 4.8 | 11.7 (+0.5%) |
| + Tree-GRPO | 43.6 | 57.3 | 41.6 | 47.5 (+0.5%) | 29.5 | 26.8 | 6.6 | 13.6 | 19.1 (+69%) |
| Qwen2.5-3b | | | | | | | | | |
| Direct Inference | 10.6 | 28.8 | 10.8 | 16.7 | 14.9 | 24.4 | 2.0 | 2.4 | 10.9 |
| Search-o1 | 15.1 | 44.3 | 13.1 | 24.2 | 18.7 | 17.6 | 5.8 | 29.6 | 17.9 |
| ReAct | 21.1 | 43.5 | 28.3 | 31.0 | 19.2 | 19.1 | 4.8 | 20.0 | 15.8 |
| + GRPO | 44.4 | 58.0 | 42.0 | 48.1 | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 |
| + GSPO | 43.0 | 58.8 | 42.5 | 48.1 (+0.0%) | 40.2 | 39.8 | 17.0 | 36.8 | 33.5 (+0.3%) |
| + Tree-GRPO | 46.8 | 59.7 | 43.6 | 50.0 (+4.0%) | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 (+16%) |
| Llama3.2-3b-it | | | | | | | | | |
| Direct Inference | 16.2 | 29.6 | 7.4 | 17.7 | 12.6 | 9.2 | 2.0 | 8.0 | 8.0 |
| Search-o1 | 24.2 | 48.4 | 8.8 | 27.1 | 19.4 | 17.4 | 6.0 | 32.0 | 14.1 |
| ReAct | 23.9 | 42.4 | 21.7 | 29.3 | 16.2 | 10.4 | 3.5 | 23.2 | 13.3 |
| + GRPO | 45.5 | 58.2 | 42.4 | 48.7 | 36.0 | 26.9 | 11.8 | 32.0 | 26.7 |
| + GSPO | 41.2 | 57.8 | 40.8 | 46.6 (-4.3%) | 28.1 | 24.5 | 8.6 | 32.0 | 23.3 (-13%) |
| + Tree-GRPO | 47.7 | 59.9 | 42.3 | 50.0 (+2.7%) | 44.6 | 38.4 | 17.6 | 46.4 | 36.8 (+36%) |
| Qwen2.5-7b | | | | | | | | | |
| Direct Inference | 13.4 | 40.8 | 14.0 | 22.7 | 18.3 | 25.0 | 3.1 | 12.0 | 14.6 |
| Search-o1 | 23.8 | 47.2 | 26.2 | 32.4 | 22.1 | 21.8 | 5.4 | 32.0 | 20.3 |
| ReAct | 30.6 | 56.3 | 34.6 | 40.5 | 27.9 | 25.3 | 11.3 | 28.8 | 23.3 |
| + GRPO | 45.8 | 61.5 | 44.3 | 50.5 | 42.5 | 40.7 | 19.1 | 43.2 | 36.4 |
| + GSPO | 47.0 | 64.5 | 46.1 | 52.5 (+4.0%) | 40.0 | 38.2 | 19.2 | 44.0 | 35.4 (-2.8%) |
| + Tree-GRPO | 48.1 | 63.3 | 45.2 | 52.2 (+3.4%) | 44.6 | 42.3 | 20.2 | 44.0 | 37.8 (+3.0%) |
| Qwen2.5-14b | | | | | | | | | |
| Direct Inference | 19.8 | 53.1 | 18.4 | 30.4 | 21.7 | 25.3 | 4.5 | 16.0 | 16.9 |
| Search-o1 | 34.7 | 63.5 | 24.1 | 40.8 | 26.8 | 16.1 | 9.9 | 41.6 | 23.6 |
| ReAct | 36.1 | 64.2 | 39.3 | 46.5 | 39.1 | 33.8 | 15.0 | 43.2 | 32.8 |
| + GRPO | 51.3 | 67.2 | 46.7 | 55.1 | 47.7 | 42.6 | 23.2 | 53.6 | 41.8 |
| + GSPO | 50.7 | 67.4 | 47.1 | 55.1 (+0.0%) | 50.1 | 50.2 | 23.8 | 52.8 | 44.2 (+5.7%) |
| + Tree-GRPO | 51.7 | 68.1 | 47.3 | 55.7 (+1.1%) | 50.2 | 50.5 | 25.9 | 54.4 | 45.3 (+6.4%) |
Table 1: Overall performance on single-hop QA and multi-hop QA, with EM scores for each dataset. The best results are indicated in bold.
The following are the results from Table 2 of the original paper:
| Method | SimpleQA | GAIA | | | | WebWalkerQA | | | | BrowseComp |
|---|---|---|---|---|---|---|---|---|---|---|
| | Avg. | Lv. 1 | Lv. 2 | Lv. 3 | Avg. | Easy | Med. | Hard | Avg. | Avg. |
| Qwen2.5-3b | | | | | | | | | | |
| DeepSeek-R1-Distill-3b | 7.7 | 8.8 | 7.7 | 3.0 | 7.6 | 6.2 | 9.4 | 5.8 | 7.4 | 2.2 |
| ReAct | 12.6 | 19.2 | 7.8 | 4.1 | 11.7 | 9.4 | 13.3 | 9.4 | 11.0 | 2.4 |
| + GRPO | 25.1 | 6.2 | 3.5 | 1.1 | 4.2 | 8.0 | 9.2 | 5.6 | 7.6 | 1.3 |
| + Tree-GRPO | 61.5 | 17.7 | 14.9 | 4.5 | 14.7 | 8.9 | 11.4 | 11.6 | 10.9 | 2.3 |
| Qwen2.5-7b | | | | | | | | | | |
| ReAct | 62.4 | 19.3 | 17.5 | 5.7 | 16.8 | 9.3 | 11.8 | 11.9 | 11.2 | 2.7 |
| + GRPO | 43.3 | 11.4 | 7.1 | 0.9 | 8.0 | 9.5 | 11.3 | 7.4 | 9.5 | 1.2 |
| + Tree-GRPO | 65.4 | 21.6 | 15.0 | 5.5 | 16.4 | 11.4 | 14.8 | 10.3 | 12.4 | 2.4 |
Table 2: Overall performance on web-agent QA, with F1 scores for each dataset. The best results are indicated in bold.
6.1.4. Summary of Overall Performance
- Tree-GRPO consistently delivers improvements across various model sizes and task types.
- It is particularly impactful for smaller models, enabling them to learn complex multi-turn interactions more effectively than chain-based RL.
- The benefits are most pronounced in multi-hop and web-agent QA, where multi-turn reasoning and tool use are critical and sparse supervision is a major hindrance.
- This validates the core hypothesis that tree-based sampling and process-level supervision are crucial for advancing LLM agentic RL.
6.2. Ablation Studies / Parameter Analysis
The paper further analyzes the method's robustness and efficiency through various ablation studies and configurations.
6.2.1. Different Training Budget
The cost of LLM rollouts (tokens and tool calls) is a significant concern in agent RL. This study assesses Tree-GRPO's performance under different budget constraints.
The following are the results from Table 3 of the original paper:
| Method | Single-Hop QA | | | | Multi-Hop QA | | | | |
|---|---|---|---|---|---|---|---|---|---|
| | NQ | Trivia | PopQA | Avg. | Hotpot | 2wiki | Musiq | Bamb | Avg. / Δ |
| Rollout Token/Tool Budget 2/per prompt | |||||||||
| Chain-based | 42.0 | 56.7 | 40.8 | 46.5 | 17.9 | 25.6 | 3.3 | 12.8 | 14.9 |
| Tree-based | 46.1 | 59.4 | 43.6 | 49.7 | 39.5 | 40.2 | 13.7 | 32.8 | 31.6 125% |
| Rollout Token/Tool Budget 4/per prompt | |||||||||
| Chain-based | 44.4 | 58.0 | 42.0 | 48.1 | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 |
| Tree-based | 46.8 | 59.7 | 43.6 | 50.0 | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 105% |
| Rollout Token/Tool Budget 8/per prompt | |||||||||
| Chain-based | 46.5 | 59.2 | 44.3 | 50.0 | 39.4 | 36.4 | 16.1 | 33.6 | 31.4 |
| Tree-based | 47.6 | 60.8 | 44.2 | 50.8 | 42.0 | 42.9 | 19.5 | 36.0 | 35.1 135% |
| Rollout Token/Tool Budget 16/per prompt | |||||||||
| Chain-based | 47.8 | 61.1 | 44.7 | 51.2 | 40.1 | 38.8 | 17.5 | 39.2 | 33.9 |
| Tree-based | 48.6 | 61.7 | 44.9 | 51.7 | 44.6 | 43.2 | 18.2 | 38.4 | 36.1 85% |
| Tree-based | 48.5 | 61.6 | 45.0 | 51.7 | 45.3 | 44.1 | 18.8 | 37.6 | 36.5 75% |
| Tree-based | 48.4 | 61.3 | 43.8 | 51.2 | 45.0 | 43.9 | 18.5 | 41.6 | 37.3 105% |
Table 3: Performance with different training budget (defined as the cost of several complete agent trajectories per prompt). The base model is Qwen2.5-3b. The best results are indicated in bold.
- Consistent Improvements: Tree-based methods consistently demonstrate improvements across all budget settings.
- Highly Constrained Budgets: Under a highly constrained rollout budget (2 per prompt), chain-based RL struggles significantly to learn multi-turn interactions. In contrast, Tree-GRPO achieves a substantial 112% relative improvement in Multi-Hop QA, showcasing its superior efficiency and ability to learn from limited data.
- Increasing Budget: As the rollout budget increases, the advantage of tree-based methods in terms of generating more training trajectories becomes less critical in single-hop settings (where the problem is simpler and less multi-turn by nature). However, the benefit of finer process supervision signals remains significant in multi-hop settings, leading to sustained improvements.
- Cost-Effectiveness: Tree-GRPO achieves superior performance using only one quarter of the rollout budget compared to chain-based methods, highlighting its efficiency.
- Flexibility: With larger budgets, tree-based sampling offers more flexibility in parameter choices (M, N, L), allowing for different trade-offs between exploration scope and number of rollouts.
6.2.2. Chain-based vs. Tree-based Beyond Performance
Beyond raw performance metrics, the paper investigates qualitative differences.
Sparse outcome rewards in multi-turn agentic RL can lead models to favor shorter interaction paths or even unreasonable shortcuts, rather than engaging in extensive exploration necessary for complex tasks.
As shown in Figure 5, for multi-hop QA:
- Tree-based methods encourage the LLM agent to engage in longer interactions (i.e., making more tool calls), increasing the average number of actions from 2.4 to 3.0 when solving a multi-hop QA.
- This suggests that Tree-GRPO not only improves performance but also fosters more robust agent behavior, which is crucial for solving truly complex, long-horizon tasks that inherently require multiple steps.
(Figure description: a comparison of RL training behavior for Qwen2.5-3b (left) and Llama3.2-3b-it (right), plotting reward and the number of actions over training steps for tree-based RL versus chain-based RL.)
Figure 5: Comparison between tree-based and chain-based RL on reward and action number.
6.2.3. Learning Rate Warmup Analysis
The learning rate (LR) warmup ratio is a sensitive hyperparameter, especially for smaller models.
As shown in Figure 6:
- Tree-based methods consistently outperform chain-based methods across all LR warmup ratio settings, for both Qwen2.5-3b and Llama3.2-3b-it. This indicates Tree-GRPO's robustness to hyperparameter choices and a more stable learning process, even in sensitive configurations. (A minimal sketch of what the warmup ratio controls is given after Figure 6.)
(Figure: two line charts of average test scores under different LR warmup ratios for Qwen2.5-3b (left) and Llama3.2-3b-it (right), comparing tree-based and chain-based RL.)
Figure 6: Ablation study on LR warmup ratio.
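For readers unfamiliar with the hyperparameter being ablated here: the warmup ratio is the fraction of total training steps over which the learning rate ramps up linearly from zero to its target value before the main schedule takes over. The paper's training framework is not specified in this section, so the PyTorch snippet below is only a generic illustration of that definition, not the authors' training code.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

# Generic linear-warmup schedule: the LR ramps from 0 to its target value
# over the first `warmup_ratio` fraction of training, then stays constant.
def linear_warmup(optimizer, total_steps, warmup_ratio=0.1):
    warmup_steps = max(1, int(total_steps * warmup_ratio))

    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps   # linear ramp-up
        return 1.0                       # hold at the target LR afterwards

    return LambdaLR(optimizer, lr_lambda)

# Toy usage: a dummy parameter, AdamW optimizer, 1000 training steps.
params = [torch.nn.Parameter(torch.zeros(1))]
optimizer = torch.optim.AdamW(params, lr=1e-6)
scheduler = linear_warmup(optimizer, total_steps=1000, warmup_ratio=0.1)
```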
6.2.4. Tree-based Advantage Ablation
This study (Table 4) evaluates the contribution of different components of the tree-based advantage estimation.
The following are the results from Table 4 of the original paper:
| Advantage | Hotpot | 2Wiki | Musiq | Bamb | Avg. |
|---|---|---|---|---|---|
| Qwen2.5-3b w. Chain-based | | | | | |
| GRPO | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 |
| Qwen2.5-3b w. Tree-based | | | | | |
| Intra-tree only | 1.1 | 1.7 | 0.2 | 1.6 | 1.2 |
| Inter-tree only | 40.6 | 41.3 | 16.5 | 36.8 | 33.8 |
| Intra-tree + inter-tree | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 |
Table 4: Ablation study on tree-based advantages.
- Intra-tree advantage alone: Training with only intra-tree advantages leads to unstable learning and training collapse (average score of 1.2). The limited number of rollouts within each tree yields unreliable baseline estimates, making the learning signal too noisy.
- Inter-tree advantage alone: Using only inter-tree advantages (which average rewards across all trees for a given prompt, akin to a global baseline) is far more stable (average score of 33.8). Even with this component alone, tree-based sampling inherently provides more rollouts under the same budget, leading to better results than chain-based GRPO (31.8).
- Combined advantage (intra-tree + inter-tree): The combination of both levels yields the best performance (average score of 36.8). It introduces the benefits of step-level preference learning (from the intra-tree term) while leveraging the stability of a broader baseline (from the inter-tree term), validating the design choice of grouping at both levels. A minimal sketch of this two-level computation follows the summary below.

In summary, the experiments confirm that Tree-GRPO is a more effective and efficient RL approach for LLM agents, particularly excelling in multi-turn tasks and constrained-budget scenarios. Its ability to provide finer process-level supervision and to encourage deeper multi-step interactions positions it as a significant advancement in agentic RL.
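To make the two-level grouping concrete, here is a minimal sketch of how intra-tree and inter-tree relative advantages could be combined for the rollouts of a single prompt. The normalization details (e.g., whether each term is divided by a standard deviation, as in standard GRPO) are simplified assumptions here, not the paper's exact formulation.

```python
import numpy as np

def tree_grpo_advantages(rewards, tree_ids):
    """Combine intra-tree and inter-tree group-relative advantages.

    rewards  : outcome reward of each rollout for one prompt, shape (R,)
    tree_ids : index of the tree each rollout belongs to, shape (R,)
    Returns one advantage per rollout (simplified: mean-centering only).
    """
    rewards = np.asarray(rewards, dtype=float)
    tree_ids = np.asarray(tree_ids)

    # Inter-tree term: baseline is the mean reward over ALL rollouts of the prompt.
    inter_adv = rewards - rewards.mean()

    # Intra-tree term: baseline is the mean reward within each rollout's own tree;
    # comparing siblings that share a prefix is what yields a step-wise signal.
    intra_adv = np.zeros_like(rewards)
    for t in np.unique(tree_ids):
        mask = tree_ids == t
        intra_adv[mask] = rewards[mask] - rewards[mask].mean()

    return intra_adv + inter_adv

# Toy example: 2 trees with 2 rollouts each, binary outcome rewards.
print(tree_grpo_advantages(rewards=[1, 0, 0, 0], tree_ids=[0, 0, 1, 1]))
```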
7. Conclusion & Reflections
7.1. Conclusion Summary
This work introduces Tree-based Group Relative Policy Optimization (Tree-GRPO), a novel approach to Reinforcement Learning (RL) for LLM agents that replaces traditional chain-based rollouts with a tree-search rollout strategy. The core innovation lies in defining each tree node as a complete agent interaction step (Thought-Action-Observation), which offers semantically rich contextual segmentation.
Tree-GRPO achieves two primary benefits:
- Enhanced Rollout Efficiency: By sharing common prefixes among trajectories within the tree structure, it significantly increases the number of effective rollouts that can be generated within a fixed budget of tokens and tool calls, directly addressing the high computational cost of LLM agent training (a small illustration of this prefix-sharing saving appears at the end of this subsection).
- Fine-Grained Process Supervision: The tree structure naturally enables the construction of step-wise process supervised signals from sparse outcome rewards. This is achieved by tree-based grouping for advantage estimation at both intra-tree and inter-tree levels. The theoretical analysis further confirms that the intra-tree component implicitly performs step-level preference learning, similar to Direct Preference Optimization (DPO).

Empirical evaluations across 11 diverse datasets, covering single-hop QA, multi-hop QA, and challenging web-agent QA tasks, consistently demonstrate the superiority of Tree-GRPO over chain-based RL methods. It is particularly effective for smaller LLMs and in complex multi-turn scenarios, where it not only boosts performance but also encourages agents to engage in longer, more thorough interactions.
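To illustrate the prefix-sharing argument in the first point above (this is a general illustration, not the paper's exact sampling procedure), one can count the tokens needed to materialize the same set of complete trajectories once as independent chains and once as paths through a shared tree. The step-token counts and tree shape below are hypothetical values chosen for the example.

```python
# Hypothetical tree of agent steps: each node is one Thought-Action-Observation
# step with a token count; children share their parent's prefix.
# Shape: one root step, two expansions, each expansion ending in two leaves.
node_tokens = {
    "root": 300,
    "a": 250, "b": 250,      # two expansions of the root
    "a1": 200, "a2": 200,    # leaves under "a"
    "b1": 200, "b2": 200,    # leaves under "b"
}
paths = [  # the four complete trajectories (root-to-leaf paths)
    ["root", "a", "a1"], ["root", "a", "a2"],
    ["root", "b", "b1"], ["root", "b", "b2"],
]

# Chain-based sampling: every trajectory regenerates its full prefix.
chain_cost = sum(sum(node_tokens[n] for n in path) for path in paths)

# Tree-based sampling: each shared step is generated only once.
tree_cost = sum(node_tokens.values())

print(f"4 trajectories as chains: {chain_cost} tokens")  # 3000
print(f"4 trajectories as a tree: {tree_cost} tokens")   # 1600
```

Under these assumed counts, the tree materializes the same four trajectories for roughly half the generation cost, which is the intuition behind the higher rollout counts per budget reported in Table 3.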
7.2. Limitations & Future Work
The paper implicitly points to some limitations:
- Web-Agent QA Performance: The authors note that the performance improvement from RL on Web-Agent QA benchmarks is relatively limited, primarily constrained by the availability and quality of training data. This suggests that while Tree-GRPO is effective, its full potential can still be bottlenecked by dataset limitations in highly complex, open-ended environments.
- Scalability to Very Deep Trees: Although Tree-GRPO uses agent step-level nodes to manage the budget, the complexity of tree search can still grow with very deep trees and broad branching factors. The paper notes that increasing the number of expansions while decreasing the number of initial trees narrows the exploration scope; balancing this trade-off is an ongoing challenge in tree search.
- Implicit vs. Explicit Supervision: Although Tree-GRPO derives process supervision implicitly from outcome rewards, it still relies on the quality of that final outcome reward. Explicit process reward models or human feedback at intermediate steps could potentially provide even richer signals, but at a higher cost.

Potential future research directions could include:

- Dynamic Tree Search Strategies: Exploring more adaptive or learned strategies for node sampling and expansion within the tree, to optimize exploration-exploitation trade-offs dynamically.
- Integration with Advanced Reward Models: Investigating how Tree-GRPO could benefit from, or be combined with, process reward models when the budget allows, to further enhance the quality of step-level supervision.
- Application to Broader Agent Tasks: Applying Tree-GRPO to a wider array of LLM agent tasks beyond QA, such as complex planning, code generation in interactive environments, or robotic control, to assess its generalizability.
- Theoretical Guarantees: Further theoretical analysis of the convergence properties and sample complexity of Tree-GRPO in different settings.
7.3. Personal Insights & Critique
This paper presents an elegant and practical solution to two critical challenges in LLM agent RL: rollout budget efficiency and sparse supervision. The idea of treating a complete Thought-Action-Observation cycle as a single node in a search tree is particularly insightful. It bridges the gap between token-level granularity (often too fine and inefficient for agents) and trajectory-level granularity (too coarse for effective learning).
The theoretical link between intra-tree GRPO and step-level DPO is a strong point, providing a principled justification for why the method generates fine-grained supervision from outcome rewards. This is a significant contribution because it offers the benefits of preference learning at a sub-trajectory level without the overhead of explicit human feedback on intermediate steps, which is often infeasible for long agent interactions.
A potential area for further exploration could be the sensitivity of the method to the quality and consistency of the outcome reward. While it derives process signals from outcome rewards, a noisy or poorly defined outcome reward could still propagate misleading process signals. Additionally, exploring more complex tree-traversal or MCTS-like strategies within the online learning loop (beyond random expansion) could potentially yield even greater exploration efficiency and performance gains. The paper's empirical success, especially with smaller models and under limited budgets, makes Tree-GRPO a highly valuable and practical contribution to the field of LLM agent development. Its methods could be transferable to any multi-step decision-making system where sparse outcome rewards are common and intermediate process supervision is beneficial but costly to obtain explicitly.