
Tree Search for LLM Agent Reinforcement Learning

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces Tree-GRPO, a tree search method improving rollout efficiency and generating step-wise supervision to enhance multi-turn RL for LLM agents, outperforming chain-based approaches across diverse QA datasets.

Abstract

Under review as a conference paper at ICLR 2026: "Tree Search for LLM Agent Reinforcement Learning" (anonymous authors, double-blind review). Abstract excerpt: Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-term and multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from the problem of sparse supervision. To address the challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents the complete agent interaction step. By sharing common prefixes, the tree search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that the tree-structured trajectory naturally allows the construction of step-wise process supervised signals even using only the outcome reward.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Tree Search for LLM Agent Reinforcement Learning

1.2. Authors

The authors are anonymous, as the paper is under double-blind review.

1.3. Journal/Conference

The paper is under review as a conference paper at ICLR 2026. ICLR (International Conference on Learning Representations) is a prestigious conference in the field of artificial intelligence, particularly focusing on deep learning, representation learning, and related areas. Its reputation and influence are high, making it a significant venue for publishing cutting-edge research in machine learning.

1.4. Publication Year

2025 (posted 2025-10-08 UTC; the paper is a submission for the ICLR 2026 conference cycle).

1.5. Abstract

Recent advancements in reinforcement learning (RL) have significantly improved the agent capabilities of large language models (LLMs). However, in complex long-term and multi-turn agent tasks, existing approaches that rely solely on outcome rewards often face the challenge of sparse supervision. To tackle this, the authors propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method that employs a tree search strategy where each tree node represents a complete agent interaction step (Thought-Action-Observation). This tree search sampling method enhances the number of achievable rollouts within a fixed budget of tokens or tool calls by sharing common prefixes. Furthermore, the paper highlights that this tree-structured trajectory naturally facilitates the construction of step-wise process supervised signals, even when using only the outcome reward. Based on this, Tree-GRPO estimates grouped relative advantages at both intra-tree and inter-tree levels. Theoretical analysis demonstrates that the intra-tree level group relative policy optimization objective is equivalent to that of step-level direct preference learning. Experimental evaluations across 11 datasets and three types of QA tasks confirm the superiority of the proposed tree-based RL over traditional chain-based RL methods.

OpenReview: https://openreview.net/forum?id=ZpQwAFhU13
PDF Link: https://openreview.net/pdf?id=ZpQwAFhU13
Publication Status: Under review as a conference paper.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency and sparse supervision inherent in applying Reinforcement Learning (RL) to Large Language Model (LLM) agents performing long-term and multi-turn tasks.

In the current field, RL has significantly boosted LLM capabilities, especially in single-turn tasks. However, extending this to complex agent settings with dynamic, multi-turn interactions introduces two key challenges:

  1. Heavy Budget in LLM Rollouts: Agent tasks involve multi-turn Thought-Action-Observation cycles, leading to very long trajectories (thousands of tokens) and numerous tool calls. Existing chain-based RL methods independently sample complete trajectories, leading to considerable redundancy in sampling and substantial costs (both in tokens and expensive tool calls like search APIs). The rollout phase often dominates the overall training time and budget.

  2. Sparse Supervision in Long-Horizon, Multi-Turn Trajectories: Most current agent RL approaches rely primarily on outcome rewards—a single scalar reward delivered at the end of an entire trajectory. This sparse signal makes it difficult for the LLM to learn which specific steps or actions within a long, interdependent sequence contributed to success or failure. This limited training signal can lead to unstable learning, training collapse, and an inability to learn fine-grained decision-making.

    The paper's entry point or innovative idea is to address these challenges by replacing the traditional chain-based sampling with an online tree-search approach, where each node represents a complete agent interaction step. This structural change aims to simultaneously increase sampling efficiency and generate richer, more fine-grained supervision signals from the same outcome rewards.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of LLM agent reinforcement learning:

  • Novel Tree-based Rollout Strategy: The introduction of a tree-based rollout strategy for multi-turn agentic RL, using agent step-level nodes instead of independent chain-based trajectories. This method significantly reduces the rollout budget (tokens and tool calls) by sharing common prefixes among trajectories.

  • Tree-based Group Relative Advantage Estimation: The proposal of a group-relative advantage estimation method that operates at both intra-tree and inter-tree levels. This approach leverages the tree structure to construct implicit step-level preference learning objectives, providing more fine-grained process supervision purely from outcome rewards, with a more stable baseline estimate.

  • Theoretical and Empirical Validation: The paper provides theoretical analysis demonstrating that the intra-tree level group relative policy optimization is structurally equivalent to step-level Direct Preference Optimization (DPO). This is complemented by empirical evidence across 11 datasets and three types of QA tasks, showing Tree-GRPO's superior performance over chain-based methods, especially for smaller models and in multi-hop QA tasks, often achieving better results with substantially less rollout budget. Notably, Tree-GRPO can enable base models to adopt complex multi-turn interaction paradigms without supervised fine-tuning (SFT) under limited budgets.

    The key conclusions are that Tree-GRPO effectively addresses the sparse supervision and high rollout cost challenges in LLM agent RL. By using agent step-level nodes in a tree search, it not only improves sampling efficiency but also naturally extracts richer process-level supervision from outcome rewards, leading to more stable and higher-performing RL training for LLM agents.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with the following foundational concepts:

  • Large Language Models (LLMs): These are advanced artificial intelligence models, typically based on transformer architectures, trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide range of natural language processing (NLP) tasks, such as text generation, summarization, translation, and question answering. Examples include GPT series, Llama, and Qwen.

  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards for desired actions and penalties for undesired ones, aiming to learn a policy that maximizes its cumulative reward over time.

    • Agent: The entity that performs actions and learns from interactions. In this paper, the LLM is the agent.
    • Environment: The external system with which the agent interacts.
    • State ($s$): A representation of the current situation of the environment, providing the agent with information to make decisions. In LLM agent RL, this typically includes the prompt and interaction history.
    • Action ($a$): A decision made by the agent at a given state. For LLMs, this could be generating text (thought) or calling a tool.
    • Reward ($R$): A scalar feedback signal from the environment that indicates the desirability of the agent's actions.
    • Policy ($\pi$): A function or distribution that maps states to actions, dictating the agent's behavior. The goal of RL is to find an optimal policy.
    • Trajectory ($\mathcal{H}$): A sequence of states, actions, and rewards resulting from an agent's interaction with the environment over time: $\mathcal{H} = (s_0, a_0, r_0, s_1, a_1, r_1, \ldots, s_T, a_T, r_T)$.
  • LLM Agents: An LLM equipped with tools and the ability to autonomously plan and execute a sequence of actions (often in multi-turn interactions) to achieve a goal. They often involve a Thought-Action-Observation loop.

  • ReAct Framework: A specific LLM agent framework that combines Reasoning (Thought) and Acting (Action) to solve tasks. At each step, the LLM first produces a Thought (reasoning about the task and planning next steps), then takes an Action (e.g., using a tool like a search engine), and finally processes the Observation (the result of the action) to inform its next Thought. This iterative process allows LLMs to interact dynamically with environments.

  • Markov Decision Process (MDP): A mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision-maker. An MDP is defined by a tuple $\{S, A, P, R, \gamma\}$:

    • $S$: A set of states.
    • $A$: A set of actions.
    • $P$: A state transition probability function $P(s' \mid s, a)$, which gives the probability of moving to state $s'$ from state $s$ after taking action $a$.
    • $R$: A reward function $R(s, a, s')$, which defines the reward received for transitioning from state $s$ to $s'$ via action $a$.
    • $\gamma$: A discount factor, indicating the present value of future rewards.
  • Policy Optimization: In RL, this refers to the process of iteratively updating the agent's policy to maximize the expected cumulative reward. Methods like Proximal Policy Optimization (PPO) or Group Relative Policy Optimization (GRPO) are examples.

  • Advantage Estimation: A technique used in policy optimization to reduce variance in gradient estimates. The advantage function $A(s,a)$ measures how much better an action $a$ is than the average action at state $s$. It is typically defined as $A(s,a) = Q(s,a) - V(s)$, where $Q(s,a)$ is the action-value function (expected return from taking action $a$ in state $s$) and $V(s)$ is the state-value function (expected return from state $s$). In group-based RL, advantages are often estimated relative to other trajectories within a group.

  • Group-based RL: A class of RL algorithms that estimate advantages by comparing the performance of multiple candidate rollouts (trajectories) within a group. Instead of relying on a learned value function, it uses the average or relative performance of other samples as a baseline. This is designed to stabilize gradient updates and improve sample efficiency. GRPO and GSPO are examples.

  • Direct Preference Optimization (DPO): An offline RL method used to align LLMs with human preferences. Instead of learning a reward model, DPO directly optimizes the policy to maximize the likelihood of preferred responses over dispreferred ones, using a simple binary preference dataset (e.g., "chosen" vs. "rejected" responses). This makes it computationally simpler than Reinforcement Learning from Human Feedback (RLHF).

  • Monte Carlo Tree Search (MCTS): A heuristic search algorithm used for decision-making in complex domains, famously used in game AI (e.g., AlphaGo). It builds a search tree by repeatedly performing four steps: Selection, Expansion, Simulation (rollout), and Backpropagation. It explores promising branches more deeply.

  • KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In RL, it's often used as a regularization term to prevent the learned policy from deviating too much from a reference policy (e.g., the original LLM's policy) or a previous policy, ensuring stable training. $ D_{\mathrm{KL}}(P \| Q) = \sum_{x \in \mathcal{X}} P(x) \log\left(\frac{P(x)}{Q(x)}\right) $ Here, $P(x)$ and $Q(x)$ are the probabilities of event $x$ under distributions $P$ and $Q$, respectively. It quantifies the information lost when $Q$ is used to approximate $P$ (a small worked example follows).
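To make the formula concrete, here is a tiny worked example in Python; the two distributions are made up for illustration and do not come from the paper.

```python
import math

def kl_divergence(p: list[float], q: list[float]) -> float:
    """D_KL(P || Q) = sum_x P(x) * log(P(x) / Q(x)); assumes proper distributions with q(x) > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]   # e.g., the current policy's distribution over three tokens (illustrative)
q = [0.5, 0.3, 0.2]   # e.g., the reference policy's distribution (illustrative)
print(round(kl_divergence(p, q), 4))  # ~0.0851; note D_KL(P||Q) != D_KL(Q||P) in general
```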

3.2. Previous Works

The paper contextualizes its work by referencing several prior approaches:

  • RL for LLMs (Single-Turn):

    • Kimi Team (2025b), Yu et al. (2025), Chu et al. (2025a), Shao et al. (2024), Xin et al. (2024): These works demonstrate LLMs learning complex reasoning abilities and achieving significant gains in single-turn tasks (e.g., mathematical proof, code generation) when trained with outcome rewards. This establishes the foundation for applying RL to LLMs.
    • DeepSeek-AI Team (2025), Yang et al. (2025a), OpenAI (2024): These are frontier models and general RL paradigms for LLMs. DeepSeek-AI Team (2025) specifically introduces GRPO (Group Relative Policy Optimization), which is a direct baseline for Tree-GRPO. GRPO estimates advantages by sampling a group of $N$ candidate rollouts to estimate an in-group baseline, aiming to stabilize gradient updates without requiring extra value functions.
  • RL for LLM Agents (Multi-Turn):

    • Feng et al. (2025b), Singh et al. (2025), Wang et al. (2025a), Qian et al. (2025), Feng et al. (2025a): These works extend the RL paradigm to more complex agent settings involving dynamic, multi-turn interactions, which is the direct problem domain of Tree-GRPO. Wang et al. (2025a) is specifically cited for formalizing the ReAct-based process as an MDP.
    • Dong et al. (2025), Zhang et al. (2025b): Further works applying RL to optimize LLM policies for agent tasks.
    • Wang et al. (2025b; a), Jin et al. (2025b): These works highlight the problem of sparse supervision and potential training collapse in agent RL due to trajectory-level sparse signals, which Tree-GRPO explicitly aims to solve. Search-R1 (Jin et al., 2025b) is the codebase on which Tree-GRPO's implementation is built, including prompt templates and agent-environment interaction.
  • Baselines Compared in the Paper:

    • ReAct (Yao et al., 2023b): A foundational agent framework that combines Reasoning and Acting to interact with environments using Thought-Action-Observation cycles. It's used as a direct baseline.
    • Search-o1 (Li et al., 2025a): An advanced Retrieval Augmented Generation (RAG) method, representing a strong non-RL baseline for QA tasks.
    • GRPO (DeepSeek-AI Team, 2025): The group-based RL algorithm that Tree-GRPO builds upon and directly compares against, representing a chain-based RL approach.
    • GSPO (Zheng et al., 2025): Another group-based RL method, also serving as a chain-based baseline.
  • Offline RL / Preference Learning for LLMs:

    • Wang et al. (2025b), Xie et al. (2024), Xiong et al. (2024): These works construct step-level DPO data in an offline manner to achieve more fine-grained optimization objectives, similar to Tree-GRPO's goal of step-level preference learning, but Tree-GRPO does it online.
  • Tree Search for LLMs:

    • Test-Time Scaling:
      • Yao et al. (2023a), Long (2023), Snell et al. (2024): Propose Tree-of-Thought to allow LLMs to consider multiple reasoning paths.
      • Xin et al. (2024; 2025): Employ MCTS for generating diverse proof paths.
    • Offline Data Generation:
      • He et al. (2024), Feng et al. (2024), Wu et al. (2024), Xie et al. (2024), Zhang et al. (2025c), Lai et al. (2024): Utilize tree-search structures for constructing step-level preference learning data for DPO or SFT.
    • Online RL (Token/Sentence Level):
      • Hou et al. (2025), Zhang et al. (2024), Yang et al. (2025b): Similar to Tree-GRPO in using tree search for sampling in LLM online RL, but these methods typically operate at the token or sentence level and are not directly designed for the agent step-level structure. Tree-GRPO differentiates itself by using a complete Thought-Action-Observation step as the tree node, which is more natural for agent tasks.

3.3. Technological Evolution

The evolution of LLMs has moved from basic text generation to complex reasoning, and now to agentic capabilities. Initially, LLMs were fine-tuned for specific tasks via Supervised Fine-Tuning (SFT). Reinforcement Learning from Human Feedback (RLHF) then emerged as a powerful paradigm for aligning LLMs with human preferences, enabling models to learn complex behaviors by maximizing a reward signal. This led to significant improvements in single-turn tasks like mathematical reasoning and code generation.

The next frontier involved enabling LLMs to act as agents that can interact with dynamic environments over multi-turn sessions, often by using external tools. This agentic RL opened up possibilities for more autonomous and complex problem-solving. However, multi-turn interactions exacerbate the challenges of sparse rewards and high computational costs (many tokens and tool calls per trajectory).

This paper's work fits within this timeline by pushing the boundaries of agentic RL. While tree search has been explored for LLMs in test-time inference and offline DPO data generation, its application to online RL training for LLM agents with step-level nodes is a novel step. It directly addresses the limitations of prior chain-based RL for agents, which struggle with sparse supervision and high rollout budgets. By leveraging the inherent structure of agentic Thought-Action-Observation steps within a tree search, Tree-GRPO represents an advancement towards more efficient and effective RL training for complex LLM agents.

3.4. Differentiation Analysis

The core differences and innovations of Tree-GRPO compared to related work are:

  • Chain-based RL (GRPO, GSPO):

    • Sampling Strategy: Traditional chain-based RL samples multiple independent complete trajectories for each task. This leads to considerable redundancy, especially when many trajectories share common initial steps.
    • Supervision Signal: Primarily relies on trajectory-level outcome rewards, which are sparse and provide coarse credit assignment, making it hard to pinpoint effective steps in long multi-turn interactions.
    • Tree-GRPO's Innovation: Tree-GRPO replaces this with a tree-search-based sampling. By sharing common prefixes, it significantly increases the number of effective rollouts achievable within the same token/tool-call budget. More importantly, the tree structure naturally allows for the construction of step-wise process supervised signals from the outcome reward, introducing implicit step-level preference learning. This makes the supervision much richer and less sparse.
  • Token/Sentence-level Tree Search in Online RL:

    • Node Granularity: Previous online RL tree search methods (e.g., Hou et al., Zhang et al., Yang et al.) often use token-level or sentence-level units as tree nodes. While this offers finer granularity, it might not align optimally with the inherent structure of LLM agent interactions.
    • Tree-GRPO's Innovation: Tree-GRPO specifically defines each tree node as a complete agent interaction step (a Thought-Action-Observation tuple). This design is more semantically aligned with how LLM agents operate, providing clearer contextual segmentation and explicit constraints on the rollout budget in terms of both tokens and tool calls, which are critical for agent tasks. This agent step-level granularity is shown to be more suitable and effective.
  • Offline Step-Level DPO Data Generation:

    • Online vs. Offline: Some works use tree search to generate offline step-level preference data for DPO or SFT. This involves a separate data collection and training phase.

    • Tree-GRPO's Innovation: Tree-GRPO integrates step-level preference learning implicitly within its online RL framework. It directly uses the tree-structured rollouts to estimate relative advantages that mimic DPO's gradient structure during the online learning process, avoiding the complexity of explicit offline data construction.

      In essence, Tree-GRPO innovates by combining an efficient tree-search sampling strategy tailored for LLM agents (using agent step-level nodes) with a novel tree-based advantage estimation that extracts fine-grained process supervision from sparse outcome rewards within an online RL setting, addressing core limitations of prior methods.

4. Methodology

4.1. Principles

The core idea behind Tree-GRPO (Tree-based Group Relative Policy Optimization) is to leverage a tree-search mechanism for generating agent trajectories in Reinforcement Learning (RL). This approach is driven by two main principles:

  1. Efficient Rollout Sampling: By structuring agent interactions as a tree where nodes represent complete Thought-Action-Observation steps and sharing common prefixes, the method aims to generate a larger number of diverse rollouts (trajectories) within the same computational budget (tokens and tool calls) compared to independent chain-based sampling. This addresses the high cost of LLM agent interactions.

  2. Deriving Fine-Grained Process Supervision: The inherent tree structure allows for the back-propagation of outcome rewards from the leaves to branching points. The differences in rewards between sibling branches then naturally form preference-learning objectives. This effectively transforms sparse trajectory-level outcome rewards into richer, step-wise process supervised signals without requiring additional supervision or a separate reward model. This addresses the problem of sparse supervision in long-horizon, multi-turn agent tasks.

    The theoretical basis is that this tree-based advantage estimation can be shown to be structurally equivalent to step-level Direct Preference Optimization (DPO), thereby providing a principled way to incorporate fine-grained preference learning into an online RL framework.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Multi-turn Agent Framework

The paper adopts the widely used ReAct (Reasoning and Acting) framework for agent interactions. Unlike static, single-turn interactions, an LLM agent using ReAct engages in multi-turn Thought-Action-Observation cycles with an environment to solve a given task.

At each time step $t$, ranging from $0$ to $T-1$:

  1. The LLM generates a thought ($\tau_t$) and a parsable textual action ($\alpha_t$) based on the current context ($s_t$).

  2. The action ($\alpha_t$) typically involves using a tool (e.g., a search engine).

  3. Through the tool, the agent interacts with the environment to obtain a new observation ($o_t$).

    A complete $T$-step agent episode consists of an interleaved trajectory, denoted as $\mathcal{H}$: $ \mathcal{H}=\{(\tau_{0}, \alpha_{0}, o_{0}), (\tau_{1}, \alpha_{1}, o_{1}), \ldots, (\tau_{T-1}, \alpha_{T-1}, o_{T-1})\} $ These trajectories can become very long, reaching tens of thousands of tokens for complex tasks.

This dynamic process is formalized as a Markov Decision Process (MDP) $\mathcal{M}=\{S, A, P\}$:

  • $S$: Represents the states, which correspond to the complete interaction context up to a given time step, $\mathcal{H}_{<t}$.

  • $A$: Denotes the compound action space, where each action consists of a thought-action pair ($\tau_t, \alpha_t$).

  • $P$: Denotes the transition dynamics, which includes both the external environment's dynamics ($P_{\text{env}}$) and the concatenation of the full context over time steps.

    The complete process, based on an LLM policy model $\pi_{\theta}$, can be formulated probabilistically as (a minimal rollout sketch follows the symbol list below): $ p_{\theta}\left(s_{0: T}, \tau_{0: T}, \alpha_{0: T}, o_{0: T}\right)=p\left(s_{0}\right) \prod_{t=0}^{T-1}\left[\pi_{\theta}\left(\tau_{t} \mid s_{t}\right) \pi_{\theta}\left(\alpha_{t} \mid s_{t}, \tau_{t}\right) P_{\mathrm{env}}\left(o_{t+1} \mid \alpha_{t}\right)\right] $ Where:

  • $p_{\theta}(\cdot)$: The joint probability of the entire trajectory under policy $\pi_{\theta}$.

  • $s_{0:T}$: The sequence of states from time $0$ to $T$.

  • $\tau_{0:T}$: The sequence of thoughts from time $0$ to $T$.

  • $\alpha_{0:T}$: The sequence of actions from time $0$ to $T$.

  • $o_{0:T}$: The sequence of observations from time $0$ to $T$. Note that $o_{t+1}$ is the observation after action $\alpha_t$ is taken, leading to $s_{t+1}$.

  • $p(s_0)$: The probability distribution of the initial state.

  • $\pi_{\theta}(\tau_t \mid s_t)$: The probability of generating thought $\tau_t$ given state $s_t$ by the LLM policy $\pi_{\theta}$.

  • $\pi_{\theta}(\alpha_t \mid s_t, \tau_t)$: The probability of generating action $\alpha_t$ given state $s_t$ and thought $\tau_t$ by the LLM policy $\pi_{\theta}$.

  • $P_{\text{env}}(o_{t+1} \mid \alpha_t)$: The probability of receiving observation $o_{t+1}$ from the environment after taking action $\alpha_t$.
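To make the Thought-Action-Observation factorization concrete, the following is a minimal Python sketch of one ReAct-style rollout. `llm_generate` and `search_tool` are hypothetical stand-ins (stubbed here so the snippet runs), not functions from the paper or the Search-R1 codebase; each loop iteration corresponds to what Tree-GRPO later treats as one tree node.

```python
from typing import List, Tuple

def llm_generate(context: str) -> Tuple[str, str]:
    """Hypothetical stand-in for the policy LLM: returns one (thought, action) pair."""
    if "Observation:" in context:
        return "The observation already contains the answer.", "answer: Paris"
    return "I should look this up with the search tool.", "search: capital of France"

def search_tool(query: str) -> str:
    """Hypothetical stand-in for the retrieval / web-search tool."""
    return "Paris is the capital and largest city of France."

def react_rollout(prompt: str, max_turns: int = 8) -> List[Tuple[str, str, str]]:
    """One multi-turn Thought-Action-Observation episode; each iteration is one agent step."""
    trajectory: List[Tuple[str, str, str]] = []   # H = [(tau_t, alpha_t, o_t), ...]
    context = prompt                              # s_t: the prompt plus the full interaction history
    for _ in range(max_turns):
        thought, action = llm_generate(context)   # pi_theta(tau_t | s_t), pi_theta(alpha_t | s_t, tau_t)
        if action.startswith("answer:"):          # a terminal answer action ends the episode
            trajectory.append((thought, action, ""))
            break
        observation = search_tool(action)          # P_env(o_{t+1} | alpha_t)
        trajectory.append((thought, action, observation))
        context += f"\nThought: {thought}\nAction: {action}\nObservation: {observation}"
    return trajectory

print(react_rollout("Question: What is the capital of France?"))
```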

4.2.2. Agentic Reinforcement Learning

The goal of Agentic Reinforcement Learning is to optimize the LLM policy $\pi_{\theta}$ by maximizing the expected return (total reward) of the full state-action trajectory. This is typically formulated as: $ J(\theta)=\mathbb{E}_{\mathcal{H} \sim p_{\theta}}[R(\mathcal{H})] $ Where:

  • $J(\theta)$: The objective function to be maximized, representing the expected return for policy parameters $\theta$.

  • $\mathbb{E}_{\mathcal{H} \sim p_{\theta}}[\cdot]$: The expectation taken over trajectories $\mathcal{H}$ sampled according to the policy $p_{\theta}$.

  • $R(\mathcal{H})$: The scalar reward obtained for the entire trajectory $\mathcal{H}$.

    In practice, this optimization is performed using a variance-reduced advantage estimator $\hat{A}(\mathcal{H})$ to stabilize gradient updates. Most existing agentic RL systems use an outcome-based reward, meaning $R(\cdot)$ is a single scalar delivered at the end of the trajectory, based on predefined rules or a model-based scoring function.

The Tree-GRPO method is built upon group-based RL algorithms. Unlike PPO, which often uses a separate value function to estimate advantages, group-based RL methods estimate advantages $\hat{A}$ by sampling a group of $N$ candidate rollouts and using an in-group baseline to guide the optimization direction.

4.2.3. Tree Search for Agent Rollout

The paper proposes an online rollout strategy based on tree search to overcome the limitations of chain-based sampling. The main challenge for tree search in online RL for LLMs is the multi-turn sequential nature of rollouts, which usually doesn't parallelize well. To mitigate this, Tree-GRPO adopts an initialize-then-expand approach and uses agent step-level nodes.

Figure 2 illustrates the difference between chain-based, token/sentence-level tree search, and the proposed agent step-level tree search. As shown in Image 2, the chain-based rollout independently generates full trajectories. Token/sentence-level tree search expands at a very fine granularity. The proposed method (Right, Ours) uses complete Thought-Action-Observation steps as nodes, aligning with the agent's interaction structure.

The overall tree-search sampling process involves these steps:

  1. Initialization: For each given prompt $x_i$, $M$ independent chain-based trajectories are first generated by the current policy model $\pi_{\theta}$. These $M$ trajectories form the initial $M$ trees, denoted as $Y=\{\mathcal{H}^{i} \sim \pi_{\theta}(\cdot \mid x_{i})\}^{M}$.

  2. Sampling: From each initialized tree $\mathcal{T}_i$, $N$ non-leaf nodes $p_{i,j}$ are randomly selected for expansion. A non-leaf node means an intermediate step, not the final agent answer.

  3. Expansion: For each selected node $p_{i,j}$, the context from the root of the tree up to that node, $\mathcal{H}_{<t}^{i}=\{p_{i,j}^{\text{root}}, \ldots, p_{i,j}^{\text{father}}, p_{i,j}\}$, combined with the original prompt $x_i$, serves as input. The policy model $\pi_{\theta}$ then generates the remaining part of the response ($Y_{\text{new}}=\{\mathcal{H}_{\geq t}^{i} \sim \pi_{\theta}(\cdot \mid x_{i}, \mathcal{H}_{<t}^{i})\}^{N}$). This new branch is then inserted into the source tree: $\mathcal{T}_i \leftarrow \mathcal{T}_i \cup Y_{\text{new}}$.

    This process of Sampling and Expansion is iteratively repeated $L$ times. This results in a total of $M \times (L \times N + 1)$ rollouts, which form the final group size $G$ for a single prompt. These rollouts are distributed across the $M$ trees.

A key benefit is the budget efficiency. Let $B$ be the expected rollout budget (tokens and tool calls) for a single complete agent trajectory. For each single random tree expansion, the expected depth of the selected node is half of the maximum depth. Therefore, the expected cost for an expansion is $B/2$. The total expected budget for tree-search sampling is: $ \mathbb{E}\left[B_{\text{tree}}\right]=M \cdot B+L \cdot N \cdot B / 2 $ Where:

  • $\mathbb{E}[B_{\text{tree}}]$: The expected total budget for tree-search sampling.

  • $M \cdot B$: The budget for the initial $M$ complete trajectories.

  • $L \cdot N \cdot B / 2$: The budget for $L$ rounds of expansion, where each round involves $N$ expansions, and each expansion costs, on average, $B/2$.

    This formula shows that tree search can obtain a larger number of agent rollouts for the same budget compared to chain-based methods, especially when $L \cdot N$ is large relative to $M$. Decreasing $M$ while increasing $N$ and $L$ can raise the number of rollouts but narrows the exploration scope, as more trajectories share prefixes (see the budget sketch below).
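A small sketch of the rollout-count and expected-budget arithmetic under the simplified cost model above (one full trajectory costs $B$, one random expansion costs $B/2$ on average); the function names are illustrative and not taken from the paper's code.

```python
def num_rollouts(M: int, N: int, L: int) -> int:
    """Total rollouts per prompt: M initial chains plus N expansions in each of L rounds per tree."""
    return M * (L * N + 1)

def expected_tree_budget(M: int, N: int, L: int, B: float = 1.0) -> float:
    """E[B_tree] = M*B + L*N*(B/2): full cost for the initial chains, half cost per expansion."""
    return M * B + L * N * B / 2.0

# Configuration listed under the ~4-trajectory budget row of Table 3: M=2, N=2, L=1.
M, N, L = 2, 2, 1
print(num_rollouts(M, N, L))           # 6 rollouts per prompt ...
print(expected_tree_budget(M, N, L))   # ... for an expected cost of 3.0 trajectory-equivalents
# A chain-based sampler would need 6 full trajectories (cost 6.0) to reach the same group size.
```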

Image 3 provides an overview of the Tree-GRPO training pipeline. It illustrates how the rollout is conducted in a tree-search manner, with each node corresponding to a complete thought-action-observation step. It shows the grouped relative advantages being estimated at both intra-tree and inter-tree levels, and emphasizes that Tree-GRPO constructs step-level process supervision signals with a reduced rollout budget.

img-2.jpeg: This figure is a schematic of the Tree-GRPO framework based on tree search. It shows the agent's step-level tree-search process and the interplay among the policy model, the tool environment, and the reward model, highlighting the reduced rollout budget and the construction of finer-grained process signals.

Figure 3: The overview of the Tree-GRPO training pipeline. The rollout is conducted in a tree-search manner, where each node corresponds to a complete thought-action-observation step. The group relative advantages are estimated at both intra-tree and inter-tree levels. Tree-GRPO constructs step-level process supervision signals through a tree structure with a smaller rollout budget.

4.2.4. Tree-based Group Relative Advantages

In traditional chain-based RL for multi-turn agents, the reward $R(\cdot)$ is typically computed only at the outcome of a complete trajectory. This means the advantage estimation is also at the trajectory level, assigning the same credit to all steps: $ A(\mathcal{H})=A\left(\{(\tau_{0}, \alpha_{0}, o_{0}), \ldots, (\tau_{T}, \alpha_{T}, o_{T})\}\right)=A\left(\{\tau_{0}, \alpha_{0}, o_{0}\}\right)=\ldots=A\left(\{\tau_{T}, \alpha_{T}, o_{T}\}\right) $ This coarse credit assignment due to sparse rewards severely impacts the stability of RL training for long-horizon agents.

Tree-GRPO addresses this by leveraging the tree structure to derive process credit signals. As depicted in Figure 4, when multiple branches originate from a common node, the difference between the back-propagated outcome rewards from their respective leaves naturally forms a preference-learning objective for those subtrees. This offers process signals of varying granularity depending on the subtree depth.

To implement this, Tree-GRPO performs grouped advantage estimation using two levels of grouping:

  1. Intra-tree Grouping ($G_{\text{intra-tree}}(\mathcal{T}_i)$): This calculates advantages by comparing rollouts within the same tree $\mathcal{T}_i$. At any branching point, the outcomes of sibling branches are compared, allowing for step-level preference learning.

  2. Inter-tree Grouping ($G_{\text{inter-tree}}(\mathcal{T}_i)$): To stabilize training, particularly given potentially limited rollouts within a single tree, advantages are also estimated across all rollouts from all trees generated for a given prompt. This provides a more robust baseline.

    The group relative advantage for a trajectory $\mathcal{H}^i$ is calculated as: $ \hat{A}_{\text{Intra/Inter-tree}}\left(\mathcal{H}^{i}\right)=\left[R\left(\mathcal{H}^{i}\right)-\operatorname{mean}\left(\{R(\mathcal{H}^{j})\}_{j}^{G_{\text{Intra/Inter-tree}}(\mathcal{T}_{i})}\right)\right] / \operatorname{std}\left(\{R(\mathcal{H}^{j})\}_{j}^{G_{\text{Intra/Inter-tree}}(\mathcal{T}_{i})}\right) $ Where:

  • $\hat{A}_{\text{Intra/Inter-tree}}(\mathcal{H}^i)$: The estimated advantage for trajectory $\mathcal{H}^i$ based on either intra-tree or inter-tree grouping.

  • $R(\mathcal{H}^i)$: The outcome reward of trajectory $\mathcal{H}^i$.

  • $\operatorname{mean}\left(\{R(\mathcal{H}^{j})\}_{j}^{G_{\text{Intra/Inter-tree}}(\mathcal{T}_{i})}\right)$: The mean reward of all trajectories $\mathcal{H}^j$ within the specified group (either intra-tree or inter-tree).

  • $\operatorname{std}\left(\{R(\mathcal{H}^{j})\}_{j}^{G_{\text{Intra/Inter-tree}}(\mathcal{T}_{i})}\right)$: The standard deviation of rewards within the specified group, used for normalization.

    The final advantage estimate combines both intra-tree and inter-tree components: $ \hat{A}_{\text{tree}}\left(\mathcal{H}^{i}\right)=\hat{A}_{\text{Intra-tree}}\left(\mathcal{H}^{i}\right)+\hat{A}_{\text{Inter-tree}}\left(\mathcal{H}^{i}\right) $ This combination provides the benefits of step-level preference learning from intra-tree comparisons while maintaining stability from the broader inter-tree baseline (a small computational sketch follows below).
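A minimal NumPy sketch of how the two groupings could be combined (illustrative variable names, not the authors' implementation): each rollout's reward is normalized once against its own tree and once against all rollouts for the prompt, and the two normalized scores are summed. The small epsilon is an added safeguard against zero variance.

```python
import numpy as np

def tree_grpo_advantages(rewards: list[float], tree_ids: list[int], eps: float = 1e-6) -> np.ndarray:
    """Combine intra-tree and inter-tree group-relative advantages.

    rewards[i]  : outcome reward R(H^i) of rollout i
    tree_ids[i] : index of the tree that rollout i belongs to
    """
    r = np.asarray(rewards, dtype=np.float64)
    t = np.asarray(tree_ids)

    # Inter-tree: normalize against every rollout sampled for this prompt.
    inter = (r - r.mean()) / (r.std() + eps)

    # Intra-tree: normalize against the rollouts that share the same tree.
    intra = np.zeros_like(r)
    for tree in np.unique(t):
        mask = t == tree
        intra[mask] = (r[mask] - r[mask].mean()) / (r[mask].std() + eps)

    return intra + inter  # A_tree = A_intra + A_inter

# Example: two trees with three rollouts each, binary outcome rewards.
adv = tree_grpo_advantages([1, 0, 0, 1, 1, 0], tree_ids=[0, 0, 0, 1, 1, 1])
print(adv.round(2))
```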

Image 4 visually compares chain-based and tree-based rollouts, highlighting how the tree structure (right side) allows for combining process signals ($A_1 + A_2 + A_3$) from different branches that share a common prefix, making the signal more effective than isolated chain-based signals (left side).

img-3.jpeg: This schematic contrasts chain-based and tree-based structures for RL policy optimization. On the left, the chain structure limits how each action's signal propagates; on the right, the tree structure combines signals and shares prefixes, with $A_1 + A_2 + A_3$ acting as a preference signal, supporting more effective propagation of process signals.

Figure 4: Comparison between chain-based and tree-based rollouts.

The final Tree-based Group Relative Policy Optimization objective is: $ J_{\text{Tree-GRPO}}(\theta)=\mathbb{E}_{x \sim \mathcal{D},\, \mathcal{H}^{\text{tree}} \sim \pi_{\text{old}}(\cdot \mid x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\mathcal{H}^{i}|} \sum_{t=1}^{|\mathcal{H}^{i}|} \min\left( r_{i,t}(\theta) \hat{A}_{\text{tree}}(\mathcal{H}^{i}),\ \operatorname{clip}\left(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_{\text{tree}}(\mathcal{H}^{i}) \right) - \beta \mathbb{D}_{\mathrm{KL}}\left(\pi_{\theta}(\mathcal{H} \mid x) \,\|\, \pi_{\text{ref}}(\mathcal{H} \mid x)\right) \right] $ This objective function is a variation of PPO (Proximal Policy Optimization) or GRPO, incorporating the tree-based advantage and a KL divergence regularization term (a simplified loss sketch follows the symbol list below). Where:

  • $J_{\text{Tree-GRPO}}(\theta)$: The objective function for Tree-GRPO that the policy $\pi_{\theta}$ aims to maximize.
  • $\mathbb{E}_{x \sim \mathcal{D}, \mathcal{H}^{\text{tree}} \sim \pi_{\text{old}}(\cdot \mid x)}[\cdot]$: Expectation over prompts $x$ from dataset $\mathcal{D}$ and trajectories $\mathcal{H}^{\text{tree}}$ sampled using the old policy $\pi_{\text{old}}$.
  • $G$: The total number of rollouts in the group for a given prompt.
  • $|\mathcal{H}^i|$: The length (number of tokens) of trajectory $\mathcal{H}^i$.
  • $r_{i,t}(\theta)$: The importance sampling ratio at token $t$ of trajectory $i$, i.e., the ratio of the token's probability under the current policy $\pi_{\theta}$ to its probability under the old policy $\pi_{\text{old}}$: $\frac{\pi_{\theta}(\text{token}_t \mid \text{context}_t)}{\pi_{\text{old}}(\text{token}_t \mid \text{context}_t)}$.
  • $\hat{A}_{\text{tree}}(\mathcal{H}^i)$: The combined tree-based group relative advantage for trajectory $\mathcal{H}^i$.
  • $\min(\cdot, \cdot)$: The PPO clipping mechanism is used here. It takes the minimum of the unclipped objective term and a clipped version of the objective term. This prevents excessively large policy updates.
  • $\operatorname{clip}(r_{i,t}(\theta), 1-\epsilon, 1+\epsilon)$: Clips the importance ratio $r_{i,t}(\theta)$ within the range $[1-\epsilon, 1+\epsilon]$, where $\epsilon$ is a small hyperparameter (e.g., 0.2). This ensures updates are not too aggressive.
  • $-\beta \mathbb{D}_{\mathrm{KL}}(\pi_{\theta}(\mathcal{H} \mid x) \| \pi_{\text{ref}}(\mathcal{H} \mid x))$: A KL divergence regularization term. It penalizes the current policy $\pi_{\theta}$ for deviating too much from a reference LLM $\pi_{\text{ref}}$.
  • $\beta$: A hyperparameter controlling the strength of the KL regularization.
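The following PyTorch-style sketch illustrates the clipped surrogate with a trajectory-level tree advantage broadcast to tokens. It is a simplified reading of the formula above, not the paper's implementation; in particular, the per-token KL surrogate and the observation-token mask are common agent-RL conventions assumed here rather than details taken from the paper.

```python
import torch

def tree_grpo_loss(
    logp_new: torch.Tensor,   # (G, T) per-token log-probs under pi_theta
    logp_old: torch.Tensor,   # (G, T) per-token log-probs under pi_old (detached)
    logp_ref: torch.Tensor,   # (G, T) per-token log-probs under pi_ref (detached)
    adv_tree: torch.Tensor,   # (G,)   A_tree per trajectory
    mask: torch.Tensor,       # (G, T) 1.0 for policy-generated tokens, 0.0 for padding/observations
    eps: float = 0.2,
    beta: float = 1e-3,
) -> torch.Tensor:
    """Clipped surrogate objective with a trajectory-level advantage shared by all its tokens."""
    ratio = torch.exp(logp_new - logp_old)                     # r_{i,t}(theta)
    adv = adv_tree[:, None]                                    # broadcast A_tree(H^i) to every token
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv)
    kl = logp_new - logp_ref                                   # simple per-token KL surrogate (assumption)
    per_token = surrogate - beta * kl
    per_traj = (per_token * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)
    return -per_traj.mean()                                    # negate: optimizers minimize

# Tiny smoke test with random tensors (shapes only; not meaningful training data).
G, T = 4, 16
logp = torch.randn(G, T, requires_grad=True)
loss = tree_grpo_loss(logp, logp.detach(), logp.detach(), torch.randn(G), torch.ones(G, T))
loss.backward()
```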

4.2.5. Implicit Step-Level Preference Learning

To provide a deeper understanding of how Tree-GRPO achieves fine-grained supervision, the paper draws a theoretical connection between intra-tree GRPO and step-level Direct Preference Optimization (DPO).

Assumption 3.1 (Binary Preference Setting): For any intermediate tree node represented by the context $(x, \mathcal{H}_{<t})$, the subsequent trajectory $\mathcal{H}_{\geq t}$ (from step $t$ onwards) is assumed to fall into two categories based on its reward: $\mathcal{H}_{\geq t}^{\text{win}}$ (winning trajectory) and $\mathcal{H}_{\geq t}^{\text{loss}}$ (losing trajectory). These are associated with simplified rewards of [1, 0] respectively. The probabilities of these subsequent trajectories under policy $\pi_{\theta}$ are defined as: $ p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)=1-p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)=\prod_{\tau=t}^{T} \pi_{\theta}\left(\mathcal{H}_{\tau}^{\text{win}} \mid x, \mathcal{H}_{<\tau}\right) $ This assumption simplifies the reward landscape to a binary preference, making the connection to DPO clearer.

Step-Level DPO Objective Gradient: Under this binary preference assumption, the gradient of the step-level DPO objective (which aims to maximize the Bradley-Terry likelihood of preferred outcomes) can be expressed as: $ \nabla_{\theta} J_{\text{step-DPO}}(\theta)=\mathbb{E}_{\left(x, \mathcal{H}_{<t}, \mathcal{H}_{\geq t}^{\text{win}}, \mathcal{H}_{\geq t}^{\text{loss}}\right) \sim \mathcal{D}} \left[ \sigma\left(\beta \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)-\beta \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)\right) \cdot \left(\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right) \right] $ Where:

  • $\nabla_{\theta} J_{\text{step-DPO}}(\theta)$: The gradient of the step-level DPO objective with respect to policy parameters $\theta$.
  • $\mathbb{E}_{\left(x, \mathcal{H}_{<t}, \mathcal{H}_{\geq t}^{\text{win}}, \mathcal{H}_{\geq t}^{\text{loss}}\right) \sim \mathcal{D}}[\cdot]$: Expectation over preference pairs from the dataset $\mathcal{D}$, where each pair consists of a context $(x, \mathcal{H}_{<t})$ and a winning and losing subsequent trajectory.
  • $\sigma(\cdot)$: The sigmoid function, which scales the preference score between 0 and 1.
  • $\beta$: A hyperparameter from DPO, often related to the inverse temperature.
  • $\log p_{\theta}(\cdot)$: The log-probability of the trajectory under the current policy $\pi_{\theta}$.
  • $\nabla_{\theta} \log p_{\theta}(\cdot)$: The gradient of the log-probability, which is proportional to the policy gradient.

Intra-tree GRPO Gradient: The paper shows that the gradient of the intra-tree GRPO objective can also be derived into a similar form: $ \nabla_{\theta} J_{\text{Intra-tree}}(\theta)=p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right) \cdot p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right) \cdot\left[\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right] $ Where:

  • $\nabla_{\theta} J_{\text{Intra-tree}}(\theta)$: The gradient of the intra-tree GRPO objective.
  • $p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}})$ and $p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}})$: The probabilities of the winning and losing subsequent trajectories under the current policy.

Proposition 3.1 (Structural Equivalence): The paper states that under Assumption 3.1, both step-level DPO and intra-tree GRPO admit gradient estimators of the form: $ \nabla_{\theta} J_{\text{unified}}(\theta)=\underbrace{w}_{\text{Weight}} \cdot \underbrace{\left(\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{win}}\right)-\nabla_{\theta} \log p_{\theta}\left(\mathcal{H}_{\geq t}^{\text{loss}}\right)\right)}_{\text{Preference Advantage Gradient}} $ Where:

  • $w$: The Weight term, which is $\sigma\left(\beta \log p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}}) - \beta \log p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}})\right)$ for step-level DPO, and $p_{\theta}(\mathcal{H}_{\geq t}^{\text{win}}) \cdot p_{\theta}(\mathcal{H}_{\geq t}^{\text{loss}})$ for intra-tree GRPO.

    This proposition indicates that intra-tree GRPO implicitly performs step-level preference optimization, just like DPO, but does so in an online rollout setting by comparing sibling branches within the tree structure. The difference lies only in the specific weighting function applied to the preference advantage gradient. This theoretical equivalence underscores how Tree-GRPO successfully extracts fine-grained process supervision from simple outcome rewards through its tree-based design.
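As a small numerical illustration of Proposition 3.1 (the probabilities and $\beta$ are made-up values, not from the paper): both estimators share the same preference-advantage direction and differ only in the scalar weight $w$.

```python
import math

def w_step_dpo(p_win: float, p_loss: float, beta: float = 0.1) -> float:
    """sigma(beta*log p_loss - beta*log p_win): larger when the policy still favors the losing branch."""
    z = beta * math.log(p_loss) - beta * math.log(p_win)
    return 1.0 / (1.0 + math.exp(-z))

def w_intra_tree_grpo(p_win: float, p_loss: float) -> float:
    """p_win * p_loss: largest when the two sibling branches are equally likely under the policy."""
    return p_win * p_loss

for p_win in (0.2, 0.5, 0.8):
    p_loss = 1.0 - p_win   # binary preference setting of Assumption 3.1
    print(f"p_win={p_win:.1f}  w_DPO={w_step_dpo(p_win, p_loss):.3f}  "
          f"w_TreeGRPO={w_intra_tree_grpo(p_win, p_loss):.3f}")
```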

5. Experimental Setup

5.1. Datasets

To evaluate Tree-GRPO, experiments were conducted on 11 benchmarks categorized into three types of QA tasks:

  • Multi-Hop QA: These tasks require multiple reasoning steps and interactions to gather information from different sources before arriving at an answer.

    • HotpotQA (Yang et al., 2018): Requires finding and combining information from multiple documents.
    • 2WikiMultiHopQA (Ho et al., 2020): Similar to HotpotQA, focused on Wikipedia.
    • Musique (Trivedi et al., 2022): Designed for multi-hop reasoning over complex compositional questions.
    • Bamboogle (Press et al., 2023): A small, hand-crafted set of compositional two-hop questions designed so that they cannot be answered by a single direct search, requiring decomposition into searchable sub-questions.
  • Single-Hop QA: These tasks typically require retrieving a single piece of information or performing a single reasoning step to answer the question.

    • NQ (Natural Questions) (Kwiatkowski et al., 2019): Questions that can be answered directly from a Wikipedia page.
    • TriviaQA (Joshi et al., 2017): Challenging trivia questions that require broad factual knowledge.
    • PopQA (Mallen et al., 2023): A dataset focused on popular knowledge questions.
  • Web-Agent QA: These are more complex tasks that require an agent to interact with a web environment, often involving browsing, searching, and extracting information from web pages.

    • SimpleQA (Press et al., 2023): A benchmark of short, fact-seeking questions with single verifiable answers, used here as a web-search QA task.
    • GAIA (Mialon et al., 2023): A benchmark for general AI agents, often involving multi-tool use and web interaction.
    • WebWalkerQA (Wu et al., 2025b): A dataset designed for web agents.
    • BrowseComp (Wei et al., 2025): A benchmark for LLM agents that need to browse and comprehend web content.

Tool Usage: For Multi-Hop QA and Single-Hop QA, an E5-based local retrieval server (Wang et al., 2024) built on a Wikipedia dump (Karpukhin et al., 2020) was used as the search tool. This simulates a controlled and consistent information retrieval environment. For Web-Agent QA, a real web search API was employed for retrieval, reflecting the open-ended and dynamic nature of web interaction tasks.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, a complete explanation is provided below:

5.2.1. Exact Match (EM)

  • Conceptual Definition: Exact Match (EM) is a stringent metric commonly used in question answering tasks. It quantifies the proportion of predictions that exactly match the ground truth answer. If a prediction is even slightly different from the reference answer (e.g., missing punctuation, different capitalization, extra words), it scores zero. Otherwise, it scores one. It measures the precision of the model's answer.
  • Mathematical Formula: Given a set of $N$ questions, with ground truth answers $G_i$ and predicted answers $P_i$: $ \mathrm{EM} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(P_i = G_i) $
  • Symbol Explanation:
    • $N$: The total number of questions in the dataset.
    • $i$: An index iterating through each question.
    • $\mathbb{I}(\cdot)$: The indicator function, which returns 1 if the condition inside the parenthesis is true, and 0 otherwise.
    • $P_i$: The predicted answer for the $i$-th question.
    • $G_i$: The ground truth answer for the $i$-th question.

5.2.2. F1 Score

  • Conceptual Definition: The F1 score is a measure of a model's accuracy, particularly useful when there is an uneven class distribution or when false positives and false negatives have different costs. In QA, it's often calculated at the token level, measuring the overlap between the predicted answer and the ground truth answer. It is the harmonic mean of precision and recall.
    • Precision: The proportion of correctly identified positive cases among all positive cases identified by the model. In QA, it's the fraction of predicted tokens that are correct.
    • Recall: The proportion of actual positive cases that are correctly identified by the model. In QA, it's the fraction of ground truth tokens that are captured by the prediction.
  • Mathematical Formula: First, Precision (P) and Recall (R) are calculated: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $ Then, the F1 score is calculated as: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
  • Symbol Explanation:
    • TP\mathrm{TP} (True Positives): The number of tokens correctly identified as part of the answer.
    • FP\mathrm{FP} (False Positives): The number of tokens incorrectly identified as part of the answer.
    • FN\mathrm{FN} (False Negatives): The number of tokens that are part of the answer but were not identified by the model.
    • The F1 score for a dataset is typically the average F1 score across all questions (a minimal computational sketch of both metrics follows).
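A minimal sketch of how EM and token-level F1 are conventionally computed for QA; the whitespace tokenization and lower-casing normalization are assumptions, since the paper does not spell out its exact normalization.

```python
from collections import Counter

def normalize(text: str) -> str:
    """Lower-case and collapse whitespace (a simple, assumed normalization)."""
    return " ".join(text.lower().split())

def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized prediction equals the normalized gold answer, else 0.0."""
    return float(normalize(pred) == normalize(gold))

def f1_score(pred: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over answer tokens."""
    pred_tokens, gold_tokens = normalize(pred).split(), normalize(gold).split()
    common = Counter(pred_tokens) & Counter(gold_tokens)   # overlapping token counts (true positives)
    tp = sum(common.values())
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                      # 1.0
print(round(f1_score("the city of Paris", "Paris"), 2))   # 0.4
```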

5.3. Baselines

The proposed Tree-GRPO method was compared against a comprehensive set of baselines:

  • Direct Prompting Methods:

    • Direct Inference: The LLM answers the question without any explicit tool use or multi-turn interaction. This serves as a foundational baseline for the model's inherent knowledge.
    • ReAct agent framework (Yao et al., 2023b): The LLM uses the Thought-Action-Observation loop but without RL fine-tuning. This shows the effectiveness of the agent framework itself before RL optimization.
  • Advanced RAG Method:

    • Search-o1 (Li et al., 2025a): A strong Retrieval Augmented Generation (RAG) baseline that leverages search capabilities, providing a robust comparison for information retrieval tasks.
  • RL-based Methods (Chain-based):

    • GRPO (DeepSeek-AI Team, 2025): Group Relative Policy Optimization, a chain-based group-based RL algorithm that Tree-GRPO directly improves upon. This is a crucial baseline to demonstrate the advantages of tree-based sampling and supervision.

    • GSPO (Zheng et al., 2025): Group Sequence Policy Optimization, another chain-based group-based RL method.

      These baselines are representative because they cover a spectrum from basic LLM inference, to LLM agents without RL, to advanced RAG, and finally to state-of-the-art chain-based RL methods for LLMs. This allows for a thorough evaluation of Tree-GRPO's unique contributions. The implementation was built upon the Search-R1 (Jin et al., 2025b) repository, ensuring a consistent environment for comparison.

5.4. Models

Experiments were conducted using two prominent series of LLMs at various parameter scales:

  • Qwen-2.5 (Base/Instruct) (Qwen et al., 2025)

  • Llama-3.2 (Base/Instruct) (Llama Team, 2024)

    The parameter scales included:

  • 1.5 billion (1.5b)

  • 3 billion (3b)

  • 7 billion (7b)

  • 14 billion (14b)

    This allows for evaluating the scalability and effectiveness of Tree-GRPO across different model sizes and architectures. The default rollout budget was set to 4 complete rollouts per prompt during training, unless specified otherwise.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results across 11 datasets and three task categories consistently demonstrate the superiority of Tree-GRPO over chain-based RL methods, particularly for smaller models and multi-hop tasks.

6.1.1. Multi-Hop QA Analysis

As shown in Table 1, Multi-Hop QA tasks inherently require multi-turn interactions for information gathering and synthesis.

  • Small Models (<7b): Direct Inference and ReAct (without RL) show minimal improvements, indicating that prompting alone is insufficient for these models to handle complex long-horizon agent tasks.
  • Tree-GRPO's Impact: Tree-GRPO achieves substantial improvements over chain-based GRPO for models below 3b. For instance, on Qwen2.5-1.5b and Llama3.2-1.5b, Tree-GRPO yields relative improvements ranging from 16% to 69% in average EM scores on Multi-Hop QA. This highlights its ability to effectively teach complex multi-turn tool-use behavior to smaller models, where chain-based methods often struggle or fail to stimulate such behavior.
  • Larger Models (e.g., Qwen2.5-14b): While RL offers more limited benefits for larger models that already possess strong agentic capabilities, Tree-GRPO still achieves an average relative improvement of 8.4% over GRPO for Qwen2.5-14b in Multi-Hop QA.
  • Overall: The results strongly suggest that the process signals provided by the tree-based method are highly effective in guiding LLM agents through complex, multi-step reasoning tasks.

6.1.2. Single-Hop QA Analysis

In Single-Hop QA settings, fewer interaction turns are typically required.

  • 14b Models: Larger models (e.g., 14b) already exhibit strong agentic capabilities under the ReAct framework (without RL tuning) due to their inherent knowledge and reasoning.
  • Tree-GRPO's Impact: Tree-GRPO still shows stable improvements over chain-based RL methods, particularly for smaller models like Qwen2.5-1.5b and Qwen2.5-3b.
  • Limited Gains: However, the gains from process-level signals are naturally more limited here because many single-hop questions can be solved with just one round of retrieval and one round of answering, leading to shallow tree depths (typically 2). This means the differentiation between process-level and trajectory-level signals is less pronounced.

6.1.3. Web-Agent QA Analysis

Web-Agent QA tasks are highly challenging, often requiring numerous web interactions.

  • Challenging Domain: The paper notes that existing open-source web-agent QA benchmarks are predominantly test sets with limited training data that can match the required difficulty.

  • Tree-GRPO's Impact: Despite these constraints, Tree-GRPO consistently outperforms chain-based GRPO across the four test datasets, with a notable 28% average improvement on GAIA. This indicates its robustness even in complex, noisy environments.

  • Limitations: For extremely challenging benchmarks like BrowseComp, RL yields only marginal gains, which the authors attribute primarily to the limited training data available for such complex tasks.

    The following are the results from Table 1 of the original paper:

Single-Hop QA columns: NQ, TriviaQA, PopQA; Multi-Hop QA columns: HotpotQA, 2WikiMultiHopQA, Musique, Bamboogle. Δ is the relative change in the average EM versus the GRPO baseline for the same model (values reproduced as printed).

| Method | NQ | Trivia | PopQA | Avg. | Δ | Hotpot | 2wiki | Musiq | Bamb | Avg. | Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-1.5b | | | | | | | | | | | |
| Direct Inference | 7.1 | 22.4 | 9.9 | 13.1 | | 5.9 | 4.3 | 2.6 | 8.0 | 5.2 | |
| Search-o1 | 10.2 | 30.9 | 15.0 | 15.4 | | 11.6 | 12.2 | 3.1 | 13.0 | 10.0 | |
| ReAct | 9.5 | 22.1 | 13.8 | 15.1 | | 7.3 | 8.0 | 1.9 | 11.2 | 7.1 | |
| + GRPO | 39.4 | 51.0 | 39.7 | 43.4 | base | 14.6 | 24.4 | 2.2 | 4.0 | 11.3 | base |
| + GSPO | 36.8 | 48.9 | 37.3 | 41.0 | -5.5% | 15.8 | 23.7 | 2.5 | 4.8 | 11.7 | +0.5% |
| + Tree-GRPO | 43.6 | 57.3 | 41.6 | 47.5 | +0.5% | 29.5 | 26.8 | 6.6 | 13.6 | 19.1 | +69% |
| Qwen2.5-3b | | | | | | | | | | | |
| Direct Inference | 10.6 | 28.8 | 10.8 | 16.7 | | 14.9 | 24.4 | 2.0 | 2.4 | 10.9 | |
| Search-o1 | 15.1 | 44.3 | 13.1 | 24.2 | | 18.7 | 17.6 | 5.8 | 29.6 | 17.9 | |
| ReAct | 21.1 | 43.5 | 28.3 | 31.0 | | 19.2 | 19.1 | 4.8 | 20.0 | 15.8 | |
| + GRPO | 44.4 | 58.0 | 42.0 | 48.1 | base | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 | base |
| + GSPO | 43.0 | 58.8 | 42.5 | 48.1 | +0.0% | 40.2 | 39.8 | 17.0 | 36.8 | 33.5 | +0.3% |
| + Tree-GRPO | 46.8 | 59.7 | 43.6 | 50.0 | +4.0% | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 | +16% |
| Llama3.2-3b-it | | | | | | | | | | | |
| Direct Inference | 16.2 | 29.6 | 7.4 | 17.7 | | 12.6 | 9.2 | 2.0 | 8.0 | 8.0 | |
| Search-o1 | 24.2 | 48.4 | 8.8 | 27.1 | | 19.4 | 17.4 | 6.0 | 32.0 | 14.1 | |
| ReAct | 23.9 | 42.4 | 21.7 | 29.3 | | 16.2 | 10.4 | 3.5 | 23.2 | 13.3 | |
| + GRPO | 45.5 | 58.2 | 42.4 | 48.7 | base | 36.0 | 26.9 | 11.8 | 32.0 | 26.7 | base |
| + GSPO | 41.2 | 57.8 | 40.8 | 46.6 | -4.3% | 28.1 | 24.5 | 8.6 | 32.0 | 23.3 | -13% |
| + Tree-GRPO | 47.7 | 59.9 | 42.3 | 50.0 | +2.7% | 44.6 | 38.4 | 17.6 | 46.4 | 36.8 | +36% |
| Qwen2.5-7b | | | | | | | | | | | |
| Direct Inference | 13.4 | 40.8 | 14.0 | 22.7 | | 18.3 | 25.0 | 3.1 | 12.0 | 14.6 | |
| Search-o1 | 23.8 | 47.2 | 26.2 | 32.4 | | 22.1 | 21.8 | 5.4 | 32.0 | 20.3 | |
| ReAct | 30.6 | 56.3 | 34.6 | 40.5 | | 27.9 | 25.3 | 11.3 | 28.8 | 23.3 | |
| + GRPO | 45.8 | 61.5 | 44.3 | 50.5 | base | 42.5 | 40.7 | 19.1 | 43.2 | 36.4 | base |
| + GSPO | 47.0 | 64.5 | 46.1 | 52.5 | +4.0% | 40.0 | 38.2 | 19.2 | 44.0 | 35.4 | -2.8% |
| + Tree-GRPO | 48.1 | 63.3 | 45.2 | 52.2 | +3.4% | 44.6 | 42.3 | 20.2 | 44.0 | 37.8 | +3.0% |
| Qwen2.5-14b | | | | | | | | | | | |
| Direct Inference | 19.8 | 53.1 | 18.4 | 30.4 | | 21.7 | 25.3 | 4.5 | 16.0 | 16.9 | |
| Search-o1 | 34.7 | 63.5 | 24.1 | 40.8 | | 26.8 | 16.1 | 9.9 | 41.6 | 23.6 | |
| ReAct | 36.1 | 64.2 | 39.3 | 46.5 | | 39.1 | 33.8 | 15.0 | 43.2 | 32.8 | |
| + GRPO | 51.3 | 67.2 | 46.7 | 55.1 | base | 47.7 | 42.6 | 23.2 | 53.6 | 41.8 | base |
| + GSPO | 50.7 | 67.4 | 47.1 | 55.1 | +0.0% | 50.1 | 50.2 | 23.8 | 52.8 | 44.2 | +5.7% |
| + Tree-GRPO | 51.7 | 68.1 | 47.3 | 55.7 | +1.1% | 50.2 | 50.5 | 25.9 | 54.4 | 45.3 | +6.4% |

Table 1: Overall performance on single-hop QA and multi-hop QA, with EM scores for each dataset. The best results are indicated in bold.

The following are the results from Table 2 of the original paper:

Column grouping reconstructed from the flattened header: GAIA is broken down by level (Lv. 1-3) plus average, WebWalkerQA by difficulty (Easy/Medium/Hard) plus average; SimpleQA and BrowseComp report a single score.

| Method | SimpleQA | GAIA Lv. 1 | GAIA Lv. 2 | GAIA Lv. 3 | GAIA Avg. | WebWalker Easy | WebWalker Med. | WebWalker Hard | WebWalker Avg. | BrowseComp |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-3b | | | | | | | | | | |
| DeepSeek-R1-Distill-3b | 7.7 | 8.8 | 7.7 | 3.0 | 7.6 | 6.2 | 9.4 | 5.8 | 7.4 | 2.2 |
| ReAct | 12.6 | 19.2 | 7.8 | 4.1 | 11.7 | 9.4 | 13.3 | 9.4 | 11.0 | 2.4 |
| + GRPO | 25.1 | 6.2 | 3.5 | 1.1 | 4.2 | 8.0 | 9.2 | 5.6 | 7.6 | 1.3 |
| + Tree-GRPO | 61.5 | 17.7 | 14.9 | 4.5 | 14.7 | 8.9 | 11.4 | 11.6 | 10.9 | 2.3 |
| Qwen2.5-7b | | | | | | | | | | |
| ReAct | 62.4 | 19.3 | 17.5 | 5.7 | 16.8 | 9.3 | 11.8 | 11.9 | 11.2 | 2.7 |
| + GRPO | 43.3 | 11.4 | 7.1 | 0.9 | 8.0 | 9.5 | 11.3 | 7.4 | 9.5 | 1.2 |
| + Tree-GRPO | 65.4 | 21.6 | 15.0 | 5.5 | 16.4 | 11.4 | 14.8 | 10.3 | 12.4 | 2.4 |

Table 2: Overall performance on web-agent QA, with F1 scores for each dataset. The best results are indicated in bold.

6.1.4. Summary of Overall Performance

  • Tree-GRPO consistently delivers improvements across various model sizes and task types.
  • It is particularly impactful for smaller models, enabling them to learn complex multi-turn interactions more effectively than chain-based RL.
  • The benefits are most pronounced in multi-hop and web-agent QA, where multi-turn reasoning and tool use are critical and sparse supervision is a major hindrance.
  • This validates the core hypothesis that tree-based sampling and process-level supervision are crucial for advancing LLM agentic RL.

6.2. Ablation Studies / Parameter Analysis

The paper further analyzes the method's robustness and efficiency through various ablation studies and configurations.

6.2.1. Different Training Budget

The cost of LLM rollouts (tokens and tool calls) is a significant concern in agent RL. This study assesses Tree-GRPO's performance under different budget constraints.

The following are the results from Table 3 of the original paper:

| Method | NQ | Trivia | PopQA | Avg. / Δ | Hotpot | 2wiki | Musiq | Bamb | Avg. / Δ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Rollout token/tool budget ≈ 2 per prompt | | | | | | | | | |
| Chain-based | 42.0 | 56.7 | 40.8 | 46.5 (base) | 17.9 | 25.6 | 3.3 | 12.8 | 14.9 |
| Tree-based ($M=1, N=2, L=1$) | 46.1 | 59.4 | 43.6 | 49.7 | 39.5 | 40.2 | 13.7 | 32.8 | 31.6 (125%) |
| Rollout token/tool budget ≈ 4 per prompt | | | | | | | | | |
| Chain-based | 44.4 | 58.0 | 42.0 | 48.1 (base) | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 |
| Tree-based ($M=2, N=2, L=1$) | 46.8 | 59.7 | 43.6 | 50.0 | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 (105%) |
| Rollout token/tool budget ≈ 8 per prompt | | | | | | | | | |
| Chain-based | 46.5 | 59.2 | 44.3 | 50.0 (base) | 39.4 | 36.4 | 16.1 | 33.6 | 31.4 |
| Tree-based ($M=4, N=2, L=1$) | 47.6 | 60.8 | 44.2 | 50.8 | 42.0 | 42.9 | 19.5 | 36.0 | 35.1 (135%) |
| Rollout token/tool budget ≈ 16 per prompt | | | | | | | | | |
| Chain-based | 47.8 | 61.1 | 44.7 | 51.2 (base) | 40.1 | 38.8 | 17.5 | 39.2 | 33.9 |
| Tree-based ($M=8, N=2, L=1$) | 48.6 | 61.7 | 44.9 | 51.7 | 44.6 | 43.2 | 18.2 | 38.4 | 36.1 (85%) |
| Tree-based ($M=6, N=3, L=1$) | 48.5 | 61.6 | 45.0 | 51.7 | 45.3 | 44.1 | 18.8 | 37.6 | 36.5 (75%) |
| Tree-based ($M=4, N=5, L=1$) | 48.4 | 61.3 | 43.8 | 51.2 | 45.0 | 43.9 | 18.5 | 41.6 | 37.3 (105%) |

Table 3: Performance under different training budgets (a budget is defined as the cost of several complete agent trajectories per prompt). The base model is Qwen2.5-3b. The best results are indicated in bold.

  • Consistent Improvements: Tree-based methods consistently demonstrate improvements across all budget settings.
  • Highly Constrained Budgets: Under a highly constrained rollout budget (≈ 2 per prompt), chain-based RL struggles significantly to learn multi-turn interactions. In contrast, Tree-GRPO achieves a substantial 112% relative improvement in Multi-Hop QA, showcasing its superior efficiency and ability to learn from limited data.
  • Increasing Budget: As the rollout budget increases, the advantage of tree-based methods in terms of generating more training trajectories becomes less critical in single-hop settings (where the problem is simpler and less multi-turn by nature). However, the benefit of finer process supervision signals remains significant in multi-hop settings, leading to sustained improvements.
  • Cost-Effectiveness: Tree-GRPO achieves superior performance using only one-quarter of the rollout budget compared to chain-based methods, highlighting its efficiency.
  • Flexibility: With larger budgets, tree-based sampling offers more flexibility in parameter choices (M, N, L), allowing for different trade-offs between exploration scope and number of rollouts; a back-of-the-envelope cost comparison is sketched after this list.
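
To make the efficiency claim concrete, the following sketch compares the token cost of independent chain rollouts with tree rollouts that share prefixes. The tree layout assumed here (a fixed depth with the same branching factor at every level) is a simplification for illustration, not the paper's exact M/N/L sampling procedure.

```python
def chain_cost(num_trajectories: int, steps: int, tokens_per_step: int) -> int:
    """Independent chain rollouts: every trajectory pays for every one of its steps."""
    return num_trajectories * steps * tokens_per_step

def tree_cost(branching: int, depth: int, tokens_per_step: int) -> tuple[int, int]:
    """Full tree rollouts: sibling trajectories share their common prefix.

    Returns (number of root-to-leaf trajectories, total generated tokens),
    counting each distinct tree node exactly once.
    """
    num_leaves = branching ** depth
    num_nodes = sum(branching ** d for d in range(1, depth + 1))
    return num_leaves, num_nodes * tokens_per_step

# Example: 3 agent steps of roughly 200 tokens each, 8 trajectories per prompt.
leaves, tree_tokens = tree_cost(branching=2, depth=3, tokens_per_step=200)
print(chain_cost(num_trajectories=leaves, steps=3, tokens_per_step=200))  # 4800 tokens for 8 chains
print(leaves, tree_tokens)                                                # 8 trajectories for 2800 tokens
```

Under these toy numbers, the tree yields the same eight trajectories for roughly 60% of the chain cost, which is the qualitative effect Table 3 measures.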

6.2.2. Chain-based vs. Tree-based: Beyond Performance Metrics

Beyond raw performance metrics, the paper investigates qualitative differences. Sparse outcome rewards in multi-turn agentic RL can lead models to favor shorter interaction paths or even unreasonable shortcuts, rather than engaging in extensive exploration necessary for complex tasks.

As shown in Figure 5, for multi-hop QA:

  • Tree-based methods encourage the LLM agent to engage in longer interactions (i.e., making more tool calls), increasing the average number of actions per multi-hop question from 2.4 to 3.0.

  • This suggests that Tree-GRPO not only improves performance but also fosters more robust agent behavior, which is crucial for solving truly complex, long-horizon tasks that inherently require multiple steps.

    (img-4.jpeg) Line charts comparing tree-based and chain-based RL for Qwen2.5-3b (left) and Llama3.2-3b-it (right), tracking training reward and the number of agent actions over training steps.

Figure 5: Comparison between tree-based and chain-based RL on reward and action number.

6.2.3. Learning Rate Warmup Analysis

The learning rate (LR) warmup ratio is a sensitive hyperparameter, especially for smaller models. As shown in Figure 6:

  • Tree-based methods consistently outperform chain-based methods across all LR warmup ratio settings for both Qwen2.5-3b and Llama3.2-3b-it. This indicates Tree-GRPO's robustness to hyperparameter choices and its more stable learning process, even in sensitive configurations.

    (img-5.jpeg) Two line charts showing the average test scores of Qwen2.5-3b (left) and Llama3.2-3b-it (right) under tree-based and chain-based RL across different LR warmup ratios.

Figure 6: Ablation study on LR warmup ratio.

6.2.4. Tree-based Advantage Ablation

This study (Table 4) evaluates the contribution of different components of the tree-based advantage estimation.

The following are the results from Table 4 of the original paper:

| Advantage | Hotpot | 2Wiki | Musiq | Bamb | Avg. |
| --- | --- | --- | --- | --- | --- |
| Qwen2.5-3b w. chain-based | | | | | |
| GRPO | 39.0 | 36.3 | 15.2 | 36.8 | 31.8 |
| Qwen2.5-3b w. tree-based | | | | | |
| $\hat{A}_{\text{intra-tree}}$ | 1.1 | 1.7 | 0.2 | 1.6 | 1.2 |
| $\hat{A}_{\text{inter-tree}}$ | 40.6 | 41.3 | 16.5 | 36.8 | 33.8 |
| $\hat{A}_{\text{intra-tree}} + \hat{A}_{\text{inter-tree}}$ | 42.4 | 43.7 | 17.8 | 43.2 | 36.8 |

Table 4: Ablation study on tree-based advantages.

  • $\hat{A}_{\text{intra-tree}}$ alone: Training with only intra-tree advantages leads to unstable learning and training collapse (average score of 1.2). This is because the limited number of rollouts within each tree can lead to unreliable baseline estimations, making the learning process too noisy.

  • $\hat{A}_{\text{inter-tree}}$ alone: Using only inter-tree advantages (which average rewards across all trees for a given prompt, akin to a global baseline) provides more stability (average score of 33.8). Even with just this component, the tree-based sampling strategy inherently offers more efficient rollouts under the same budget, leading to better results than chain-based GRPO (31.8).

  • Combined Advantage ($\hat{A}_{\text{intra-tree}} + \hat{A}_{\text{inter-tree}}$): The combination of both intra-tree and inter-tree advantages yields the best performance (average score of 36.8). This configuration introduces the benefits of step-level preference learning (from the intra-tree term) while leveraging the stability of a broader baseline (from the inter-tree term), validating the design choice of combining both levels of grouping; a minimal sketch of this combination follows this list.
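
As referenced above, here is a minimal sketch of how the two advantage terms could be combined, treating each complete root-to-leaf trajectory as one sample and using plain mean-centering for both baselines. This is an illustrative assumption: the paper assigns the intra-tree term step-wise at branch points, and its exact normalization (e.g., dividing by a group standard deviation) may differ.

```python
from collections import defaultdict

def tree_grpo_advantages(rewards: list[float], tree_ids: list[int]) -> list[float]:
    """Combine intra-tree and inter-tree group-relative advantages.

    rewards[i] is the outcome reward of trajectory i, and tree_ids[i] identifies
    the tree that trajectory i was sampled from (all trees share one prompt).
    """
    # Inter-tree term: baseline over every trajectory sampled for this prompt.
    global_mean = sum(rewards) / len(rewards)

    # Intra-tree term: baseline over trajectories that share the same tree.
    by_tree = defaultdict(list)
    for reward, tree in zip(rewards, tree_ids):
        by_tree[tree].append(reward)
    tree_mean = {tree: sum(rs) / len(rs) for tree, rs in by_tree.items()}

    return [(r - tree_mean[t]) + (r - global_mean) for r, t in zip(rewards, tree_ids)]

# Example: two trees with two trajectories each; only the first trajectory succeeds.
print(tree_grpo_advantages([1.0, 0.0, 0.0, 0.0], tree_ids=[0, 0, 1, 1]))
# [1.25, -0.75, -0.25, -0.25]: the successful branch is credited both against its
# tree-local sibling and against the prompt-wide baseline.
```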

    In summary, the experiments confirm that Tree-GRPO is a more effective and efficient RL approach for LLM agents, particularly excelling in multi-turn tasks and constrained budget scenarios. Its ability to provide finer process-level supervision and encourage deeper multi-step interactions positions it as a significant advancement in agentic RL.

7. Conclusion & Reflections

7.1. Conclusion Summary

This work introduces Tree-based Group Relative Policy Optimization (Tree-GRPO), a novel approach to Reinforcement Learning (RL) for LLM agents that replaces traditional chain-based rollouts with a tree-search rollout strategy. The core innovation lies in defining each tree node as a complete agent interaction step (Thought-Action-Observation), which segments trajectories at semantically meaningful boundaries rather than at arbitrary token positions.
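
To make the node definition concrete, the sketch below shows one way such an agent-step node could be represented: each node holds a single Thought-Action-Observation triple, and branching happens by attaching alternative next steps to a shared prefix. The class and field names are illustrative and not taken from the paper's implementation.

```python
from dataclasses import dataclass, field

@dataclass
class AgentStepNode:
    """One tree node = one complete agent interaction step (Thought-Action-Observation)."""
    thought: str
    action: str
    observation: str
    children: list["AgentStepNode"] = field(default_factory=list)

    def render(self, ancestors: list["AgentStepNode"]) -> str:
        """Render the context for this branch: all ancestor steps plus this step."""
        return "\n".join(
            f"Thought: {n.thought}\nAction: {n.action}\nObservation: {n.observation}"
            for n in ancestors + [self]
        )

# Two alternative second steps branch from, and therefore share, the same first step.
root = AgentStepNode("I need the author first.", "search('novel author')", "Author: X")
root.children.append(AgentStepNode("Now find X's birthplace.", "search('X birthplace')", "City: Y"))
root.children.append(AgentStepNode("Check X's other works instead.", "search('X bibliography')", "..."))

# Both branches reuse the root step; only the divergent second step is new.
print(root.children[0].render(ancestors=[root]))
```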

Tree-GRPO achieves two primary benefits:

  1. Enhanced Rollout Efficiency: By sharing common prefixes among trajectories within the tree structure, it significantly increases the number of effective rollouts that can be generated within a fixed budget of tokens and tool calls, directly addressing the high computational cost of LLM agent training.

  2. Fine-Grained Process Supervision: The tree structure naturally enables the construction of step-wise process supervised signals from sparse outcome rewards. This is achieved by tree-based grouping for advantage estimation at both intra-tree and inter-tree levels. The theoretical analysis further confirms that the intra-tree component implicitly performs step-level preference learning, similar to Direct Preference Optimization (DPO).

    Empirical evaluations across 11 diverse datasets, including single-hop QA, multi-hop QA, and challenging web-agent QA tasks, consistently demonstrate the superiority of Tree-GRPO over chain-based RL methods. It shows particular efficacy for smaller LLMs and in complex multi-turn scenarios, where it not only boosts performance but also encourages agents to engage in longer, more thorough interactions.

7.2. Limitations & Future Work

The paper implicitly points to some limitations:

  • Web-Agent QA Performance: The authors note that the performance improvement from RL on Web-Agent QA benchmarks is relatively limited, primarily constrained by the availability and quality of training data. This suggests that while Tree-GRPO is effective, its full potential can still be bottlenecked by dataset limitations in highly complex, open-ended environments.

  • Scalability to Very Deep Trees: While Tree-GRPO uses agent step-level nodes to manage budget, the complexity of tree search can still grow with very deep trees and broad branching factors. The paper mentions that increasing $L \cdot N$ (expansions) while decreasing $M$ (initial trees) narrows the exploration scope. Balancing this trade-off is an ongoing challenge in tree search.

  • Implicit vs. Explicit Supervision: Although Tree-GRPO intelligently derives process supervision implicitly from outcome rewards, it still relies on the quality of that final outcome reward. Explicit process reward models or human feedback at intermediate steps could potentially provide even richer signals, but at a higher cost.

    Potential future research directions could include:

  • Dynamic Tree Search Strategies: Exploring more adaptive or learned strategies for node sampling and expansion within the tree to optimize exploration-exploitation trade-offs dynamically.

  • Integration with Advanced Reward Models: Investigating how Tree-GRPO could benefit from, or be combined with, process reward models if budget allows, to further enhance the quality of step-level supervision.

  • Application to Broader Agent Tasks: Applying Tree-GRPO to a wider array of LLM agent tasks beyond QA, such as complex planning, code generation in interactive environments, or robotic control, to assess its generalizability.

  • Theoretical Guarantees: Further theoretical analysis on the convergence properties and sample complexity of Tree-GRPO in different settings.

7.3. Personal Insights & Critique

This paper presents an elegant and practical solution to two critical challenges in LLM agent RL: rollout budget efficiency and sparse supervision. The idea of treating a complete Thought-Action-Observation cycle as a single node in a search tree is particularly insightful. It bridges the gap between token-level granularity (often too fine and inefficient for agents) and trajectory-level granularity (too coarse for effective learning).

The theoretical link between intra-tree GRPO and step-level DPO is a strong point, providing a principled justification for why the method generates fine-grained supervision from outcome rewards. This is a significant contribution because it offers the benefits of preference learning at a sub-trajectory level without the overhead of explicit human feedback on intermediate steps, which is often infeasible for long agent interactions.
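
For orientation, the standard DPO objective that this comparison refers to can be written as follows, where $y_w$ and $y_l$ are the preferred and dispreferred responses to a prompt $x$; in the step-level reading suggested by the paper, the compared items would be sibling branches expanded from a shared tree prefix. This rendering is the textbook form of DPO, not a restatement of the paper's derivation.

$$
\mathcal{L}_{\text{DPO}}(\theta) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[\log \sigma\!\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right]
$$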

A potential area for further exploration could be the sensitivity of the method to the quality and consistency of the outcome reward. While it derives process signals from outcome rewards, a noisy or poorly defined outcome reward could still propagate misleading process signals. Additionally, exploring more complex tree-traversal or MCTS-like strategies within the online learning loop (beyond random expansion) could potentially yield even greater exploration efficiency and performance gains. The paper's empirical success, especially with smaller models and under limited budgets, makes Tree-GRPO a highly valuable and practical contribution to the field of LLM agent development. Its methods could be transferable to any multi-step decision-making system where sparse outcome rewards are common and intermediate process supervision is beneficial but costly to obtain explicitly.
