
Tongyi DeepResearch Technical Report

Published: 10/29/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The report presents Tongyi DeepResearch, an agentic large language model designed for long-horizon research tasks. It employs an end-to-end training framework combining mid- and post-training to foster autonomous capabilities, achieving state-of-the-art performance across a range of agentic deep research benchmarks.

Abstract

We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Tongyi DeepResearch Technical Report

1.2. Authors

The paper lists "Tongyi DeepResearch Team" as the authors. The Project Leader is Yong Jiang, with Core Contributors and Contributors also listed, encompassing a large team from Tongyi Lab, Alibaba Group.

1.3. Journal/Conference

The paper is a technical report, published as a preprint on arXiv. While not a formal journal or conference publication, arXiv is a widely used and reputable platform for disseminating cutting-edge research in fields such as AI and computer science, enabling rapid sharing of results ahead of formal peer review. Its influence in the research community for showcasing new advances is significant.

1.4. Publication Year

The paper was published on arXiv on October 28, 2025 (2025-10-28T17:53:02Z), making the publication year 2025.

1.5. Abstract

This technical report introduces Tongyi DeepResearch, an agentic large language model (LLM) specifically engineered for complex, long-horizon information-seeking research tasks. To foster autonomous deep research capabilities, the model is developed using an end-to-end training framework that integrates agentic mid-training and agentic post-training. This framework is designed to enable scalable reasoning and information seeking across diverse and intricate tasks. A key innovation is a highly scalable, fully automatic data synthesis pipeline that operates without costly human annotation and supports all training stages. The system constructs customized environments for each training stage to ensure stable and consistent interactions. Tongyi DeepResearch boasts 30.5 billion total parameters but activates only 3.3 billion parameters per token, demonstrating efficiency. It achieves state-of-the-art performance across various agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, its training framework, and complete solutions are open-sourced to benefit the research community.

Official Source: https://arxiv.org/abs/2510.24701 PDF Link: https://arxiv.org/pdf/2510.24701v2.pdf Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the development of Deep Research agents – AI systems capable of autonomously conducting multi-step reasoning and information seeking on the internet for complex research tasks. This is a crucial step towards Artificial General Intelligence (AGI) and has the potential to significantly enhance human intellectual productivity.

This problem is highly important because traditional Large Language Models (LLMs), while powerful, often lack the agentic capabilities (e.g., planning, searching, reasoning, synthesizing knowledge over extended periods and diverse sources) required for truly autonomous, long-horizon research. Existing deep research systems are mostly closed-source, making their internal workings and research processes inaccessible to the broader community, hindering collaborative progress. There's a significant gap in publicly available, fully open-source models with robust deep research capabilities.

The paper's innovative idea and entry point is to open-source an agentic LLM named Tongyi DeepResearch, explicitly designed for long-horizon, deep information-seeking research tasks. It addresses the limitations of existing models by introducing a novel, end-to-end training framework that combines agentic mid-training and agentic post-training, coupled with a fully automatic, scalable data synthesis pipeline and customized environmental interactions. This approach aims to equip LLMs with practical and open autonomous research capabilities.

2.2. Main Contributions / Findings

The primary contributions of Tongyi DeepResearch are:

  • Novel End-to-End Agentic Training Paradigm: Introduction of a unified agentic mid-training and agentic post-training framework. Agentic mid-training cultivates inherent agentic biases by exposing the model to large-scale agentic data, bridging the gap between pre-training and post-training. Agentic post-training further refines capabilities through scalable multi-turn reinforcement learning (RL). This paradigm enables gradual development from basic interaction skills to advanced autonomous research behaviors.

  • Fully Automated, Scalable Data Synthesis Pipeline: Design of a pipeline that eliminates the need for human annotation to generate diverse, high-quality agent trajectories. This pipeline creates research-level questions, agentic behavior data (planning, reasoning, decision-making actions), and function-calling data, tailored for each training phase. It enables the creation of "super-human-level" datasets and fosters a data flywheel effect.

  • Stage-Specific, Customized Environments: Construction of robust environments that provide consistent interactions for data synthesis and training. These environments range from Prior World Environment (for pre-trained knowledge mining) to Simulated Environment (for controlled, low-cost iteration) and Real-world Environment (for authentic feedback), adapting to the developmental stage of the agent.

  • State-of-the-Art Performance with Efficiency: Tongyi DeepResearch, built on the Qwen3-30B-A3B-Base model, features 30.5 billion total parameters but activates only 3.3 billion per token. Despite its parameter efficiency, it achieves state-of-the-art performance across a suite of agentic deep research benchmarks, outperforming strong baselines such as OpenAI o3 and DeepSeek-V3.1.

  • Open-Sourcing: The model, training framework, and complete solutions are open-sourced, aiming to democratize access to advanced AI research agents and accelerate community progress.

  • Heavy Mode for Enhanced Performance: Introduction of a Heavy Mode that leverages test-time scaling through parallel research and integrative synthesis. This mode deploys multiple agents to explore diverse solution paths and then uses a synthesis model to consolidate findings, achieving further state-of-the-art results on challenging benchmarks.

    The key findings include the effectiveness of this integrated training and data generation approach in creating capable and efficient deep research agents. The systematic analysis covers agentic reinforcement learning and synthetic data, providing insights into the development of such agents. The paper also demonstrates that agentic models represent a significant future trend, capable of internalizing agent-like capabilities and autonomously invoking tools to solve a wide range of problems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a reader should be familiar with several core concepts in Large Language Models (LLMs) and Reinforcement Learning (RL).

  • Large Language Models (LLMs): These are neural networks with billions of parameters, pre-trained on vast amounts of text data to understand and generate human-like text. They learn complex patterns in language, enabling tasks like translation, summarization, and question answering. In this paper, LLMs serve as the "brain" of the deep research agent.
  • Agentic LLM: An LLM specifically designed to act as an agent, meaning it can perceive its environment, make decisions, take actions, and receive feedback to achieve a goal. This involves capabilities beyond just text generation, such as tool use, planning, and memory management.
  • Long-horizon Tasks: These are complex tasks that require many steps, potentially spanning a long duration, and often involve multiple interactions with an environment or various tools. Deep research tasks fall into this category as they might involve searching many web pages, synthesizing information, and performing multiple reasoning steps.
  • Agentic Capabilities/Agency: Refers to an agent's ability to operate autonomously, including:
    • Planning: Decomposing a complex task into smaller, manageable steps.
    • Searching/Information Seeking: Actively querying external knowledge sources (like the internet) to find relevant information.
    • Reasoning: Drawing logical conclusions, inferring new information from existing data, and connecting disparate pieces of knowledge.
    • Synthesizing Knowledge: Combining information from various sources to form a coherent understanding or generate a comprehensive report.
    • Tool Use: The ability of an LLM to invoke external tools (e.g., search engines, code interpreters, web browsers) to perform actions that it cannot do itself.
  • Pre-training: The initial phase of training for an LLM, where it learns general language understanding and generation by processing massive text datasets, typically predicting the next word or masked words.
  • Fine-tuning / Post-training: After pre-training, an LLM is further trained on a smaller, task-specific dataset to adapt its capabilities to particular applications, such as instruction following or, in this case, agentic behaviors.
  • Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives a reward signal for its actions and learns a policy (a strategy) that maps states to actions.
    • Policy ($\pi$): The agent's strategy, defining how it chooses actions given a state or observation.
    • Reward: A scalar feedback signal from the environment indicating the desirability of an agent's action. In this paper, a 0 or 1 reward signal is used for answer correctness.
    • Trajectory / Rollout: A sequence of states, actions, and rewards generated by an agent interacting with an environment over time.
    • On-policy RL: An RL algorithm where the policy being learned is the same as the policy used to generate the trajectories (data).
  • Supervised Fine-tuning (SFT): A common fine-tuning technique where an LLM is trained on labeled (input, output) pairs to learn specific behaviors, often used as a "cold start" for RL.
  • Context Window: The maximum number of tokens (words or sub-word units) an LLM can process and attend to at any given time. Longer context windows allow models to handle more information and longer conversations or documents.
  • Next-Token Prediction: The primary objective function during pre-training and sometimes fine-tuning of LLMs, where the model predicts the next token in a sequence given the preceding tokens.
  • Inductive Bias: In machine learning, this refers to a set of assumptions that a learning algorithm uses to make predictions on unseen data. For agentic LLMs, an agentic inductive bias means the model is pre-disposed to learn and exhibit agent-like behaviors like planning and tool use.

3.2. Previous Works

The paper references several key prior works and concepts that inform its approach:

  • ReAct (Yao et al., 2023): This framework (Reasoning and Acting) synergizes reasoning and acting in language models. An agent generates both a reasoning trace (Thought) and a subsequent Action in an interleaved manner. This creates a trajectory of thought-action-observation triplets. Tongyi DeepResearch is fundamentally based on this ReAct architecture due to its simplicity and alignment with scalable computation principles.
    • The ReAct paradigm follows a sequence:
      1. Thought ($\tau_t$): The agent's internal reasoning process, analyzing the current context, recalling memory, planning, and self-reflecting.
      2. Action ($a_t$): An external operation executed by the agent, often involving tool use (e.g., Search, Visit, Python Interpreter).
      3. Observation ($o_t$): Feedback received from the environment after an action, used to update the agent's internal state.
    • The trajectory $\mathcal{H}_T$ is defined as: $ \mathcal{H}_T = ( \tau_0, a_0, o_0, \ldots, \tau_i, a_i, o_i, \ldots, \tau_T, a_T ) $ Here, $a_T$ is the final answer, and $a_t$ for $t < T$ are intermediate tool calls. The policy $\pi$ generates the current thought and action based on the history: $ \tau_t, a_t \sim \pi ( \cdot \mid \mathcal{H}_{t-1} ) $
  • Context Management Paradigm (Qiao et al., 2025): This addresses the limitation of finite context windows in LLMs for long-horizon tasks. Instead of conditioning on the complete history, the agent is conditioned on a strategically reconstructed workspace containing only essential elements: the question $q$, an evolving report $S_t$ (as compressed memory), and the immediate context from the last interaction ($a_t$ and $o_t$). This Markovian structure helps maintain reasoning capacity across deep explorations.
    • The core update process is formalized as: $ S_t, \tau_{t+1}, a_{t+1} \sim \pi ( \cdot \mid S_{t-1}, a_t, o_t ) $
    • This is crucial because the report $S_t$ serves as a condensed memory, preventing context overflow and enforcing structured reasoning by requiring the agent to synthesize and prioritize information.
  • The Bitter Lesson (Sutton, 2019): This influential principle in AI suggests that general methods that leverage scalable computation ultimately outperform approaches relying on complex, human-engineered knowledge and intricate designs. The authors explicitly cite this lesson to justify their choice of ReAct and their focus on scalable training paradigms over complex, specialized prompt engineering.
  • Agentic Continual Pre-training (Agentic CPT) (Su et al., 2025): A two-stage process mentioned as the core mid-training phase in Tongyi DeepResearch. It aims to provide a base model with a strong inductive bias for agentic behavior while preserving broad linguistic competence. It uses next-token prediction loss and progressively expands context length.
  • rLLM framework (Tan et al., 2025): A framework for post-training language agents used by Tongyi DeepResearch to implement its on-policy asynchronous rollout framework for RL training.
  • GRPO (Shao et al., 2024): A reinforcement learning algorithm, Group Relative Policy Optimization, which serves as the foundation for the RL training algorithm in Tongyi DeepResearch. The paper adapts GRPO to its needs.
  • DAPO (Yu et al., 2025): An open-source LLM reinforcement learning system which influences the application of token-level policy gradient loss and a clip-higher strategy in the RL training objective to encourage exploration.
  • Qwen3-30B-A3B-Base (Yang et al., 2025): The pre-trained base model from which Tongyi DeepResearch is initialized. This signifies that the work builds upon existing powerful LLM architectures.

3.3. Technological Evolution

The field of Large Language Models has rapidly evolved from models primarily focused on text generation and understanding (e.g., initial GPT models) to increasingly agentic systems. Early LLMs were typically pre-trained on vast text corpora and then fine-tuned for specific tasks. The major shifts leading to the current work include:

  1. Instruction Following: LLMs evolved to understand and follow complex instructions (instruction fine-tuning), making them more useful for diverse tasks.

  2. Tool Use Integration: The realization that LLMs alone are limited (e.g., cannot perform precise calculations, access real-time information) led to methods allowing them to invoke external tools (e.g., search engines, code interpreters). Frameworks like ReAct became prominent here.

  3. Agentic Training Paradigms: Moving beyond simple instruction fine-tuning to training LLMs specifically for multi-step decision-making and environmental interaction, often leveraging Reinforcement Learning.

  4. Long-Context Models: The development of models that can handle increasingly longer input sequences, critical for long-horizon tasks like deep research.

  5. Synthetic Data Generation: The increasing sophistication of LLMs themselves has enabled them to generate high-quality training data, reducing reliance on costly human annotation and allowing for scalable data creation.

    Tongyi DeepResearch fits within this timeline by pushing the boundaries of agentic LLMs for deep research. It specifically addresses the need for open-source, capable agents by integrating advanced agentic training (mid-training and post-training with RL), highly scalable synthetic data generation, and robust environmental interaction strategies. It combines existing powerful LLM backbones with novel training methodologies to achieve state-of-the-art agentic performance.

3.4. Differentiation Analysis

Compared to main methods in related work, Tongyi DeepResearch presents several core differences and innovations:

  • End-to-End Integrated Training Framework: While many existing works focus on post-training for DeepResearch agents, Tongyi DeepResearch introduces a novel, integrated end-to-end training framework that unifies agentic mid-training and agentic post-training.

    • Mid-training Innovation: The explicit agentic mid-training phase is a key differentiator. It's designed to instill agentic inductive biases early by exposing the model to large-scale agentic data before the intensive RL post-training. This bridges the gap between general pre-training and specific agentic post-training, addressing optimization conflicts and leading to a stronger agentic foundation model. Most general foundation models lack this specific agentic prior knowledge.
  • Fully Automated and Scalable Data Synthesis: Many agentic systems or LLM fine-tuning efforts still rely on human-annotated data, which is expensive and unscalable for research-level problems. Tongyi DeepResearch emphasizes a fully automated, highly scalable data synthesis pipeline that generates diverse, high-quality agent trajectories without human intervention. This includes:

    • Synthesizing research-level questions efficiently using LLMs.
    • Generating planning, reasoning, and decision-making actions.
    • Creating function-calling data via environment scaling.
    • Focusing on high-quality, high-uncertainty, super-human level QA pairs for post-training, including PhD-level research questions.
  • Strategic Environmental Interaction: The paper explicitly models and leverages three forms of environments (Prior World, Simulated, Real-world) and adapts synthetic data generation and training strategies accordingly. This structured approach to environmental interaction (especially using simulated environments for rapid iteration and real-world sandboxes for stability) is more systematic than simply interacting with the real world or using offline datasets.

  • Efficiency at Scale: Tongyi DeepResearch achieves state-of-the-art performance with significantly fewer activated parameters (3.3 billion per token from a 30.5 billion total parameter model) compared to many proprietary systems. This emphasizes efficiency and scalability for deployment.

  • Open-Source Commitment: Unlike many leading deep research systems that remain closed-source (e.g., OpenAI DeepResearch, Gemini DeepResearch), Tongyi DeepResearch is fully open-sourced. This fosters transparency, reproducibility, and collaborative research.

    In essence, the innovation lies in the holistic, end-to-end framework that strategically integrates mid-training for agentic bias, automated synthetic data for scalability, and adaptive environmental interaction for stable and efficient RL, all while being open-source and parameter-efficient.

4. Methodology

The methodology section details the Tongyi DeepResearch system, outlining its formulation, overall training recipe, and the specifics of agentic mid-training and agentic post-training.

4.1. Principles

The core idea behind Tongyi DeepResearch is to endow Large Language Models (LLMs) with autonomous research capabilities by treating them as agents that can plan, search, reason, and synthesize knowledge across extended sequences of actions and diverse information sources. This is achieved through a novel end-to-end training framework that balances the cultivation of agentic biases and the refinement of deep research capabilities.

The theoretical basis and intuition are rooted in:

  1. Sequential Decision-Making: Framing deep research as a sequence of thoughts, actions, and observations, similar to how humans conduct research or how an agent interacts with an environment in Reinforcement Learning. The ReAct paradigm is fundamental here, combining verbalized reasoning with tool-based actions.
  2. Scalability through Data Synthesis: Recognizing the inherent difficulty and cost of obtaining human-annotated data for complex research tasks. The intuition is that LLMs themselves, when properly guided, can generate high-quality, diverse, and complex agent trajectories and research questions at scale, leading to a data flywheel effect where an improving agent generates better training data.
  3. Controlled Environmental Interaction: Acknowledging that real-world environments are noisy, costly, and non-stationary. The principle is to strategically leverage different types of environments (Prior World, Simulated, Real-world) based on the training stage's needs, optimizing for stability, cost, and fidelity. Simulated environments act as a "wind tunnel" for rapid algorithm iteration.
  4. Progressive Capability Building: The two-stage training pipeline (mid-training then post-training) reflects the idea that agentic capabilities should be built progressively. Mid-training establishes a strong agentic inductive bias (general agentic knowledge), while post-training (via SFT and RL) refines these into robust deep research capabilities for specific complex tasks. This addresses the challenge of directly training agentic behaviors on general LLMs which lack the necessary foundational bias.
  5. Context Efficiency: For long-horizon tasks, managing the context window is critical. The Context Management Paradigm is based on the idea that an agent can maintain coherent reasoning by dynamically summarizing and prioritizing information, mimicking human researchers who periodically synthesize their findings.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Formulation

Tongyi DeepResearch's interaction with the environment at each timestep $t$ is defined by three fundamental components:

  • Thought ($\tau_t$): This represents the agent's internal cognitive process. It involves analyzing the current context, retrieving relevant information from its memory, planning the next steps, and engaging in self-reflection to adapt its strategy. It is the verbalization of the agent's reasoning.

  • Action ($a_t$): This is an external operation performed by the agent to interact with its environment. Tongyi DeepResearch is equipped with a set of versatile tools that define its action space. These tools allow it to interact with various information sources. The available tools are:

    • Search: For performing Google web searches.
    • Visit: For accessing and summarizing content from web pages.
    • Python Interpreter: For executing Python code in a sandboxed environment.
    • Google Scholar: For retrieving information from academic publications.
    • File Parser: For parsing user-uploaded local files (PDF, DOCX, etc.). Actions include all intermediate tool calls ($a_t$ for $t < T$) and the final response to the user, which is an in-depth report ($a_T$).
  • Observation ($o_t$): This is the feedback received from the environment immediately after an action is performed. This new information is then used to update the agent's internal state and guide its subsequent thought and action.

    Based on these components, two different rollout types are defined:

  • ReAct: The architecture is fundamentally based on the ReAct framework. In this paradigm, the agent generates a reasoning trace (Thought) and a subsequent Action in an interleaved manner. This process forms a trajectory, $\mathcal{H}_T$, which is a sequence of thought-action-observation triplets: $ \mathcal{H}_T = ( \tau_0, a_0, o_0, \ldots, \tau_i, a_i, o_i, \ldots, \tau_T, a_T ) $ Here, $a_T$ denotes the final answer to the given task. At any given step $t \leq T$, the agent's policy $\pi$ generates the current thought ($\tau_t$) and action ($a_t$) conditioned on the entire history of previous interactions, $\mathcal{H}_{t-1}$: $ \tau_t, a_t \sim \pi ( \cdot \mid \mathcal{H}_{t-1} ) $ The choice of ReAct is deliberate, emphasizing its simplicity and alignment with The Bitter Lesson principle, which favors general, scalable computational methods over complex, human-engineered ones.

  • Context Management: To address the finite context window constraint in long-horizon tasks, a dynamic context management mechanism based on Markovian state reconstruction is employed. Instead of being conditioned on the complete history, the agent is conditioned on a strategically reconstructed workspace at each step $t$. This workspace contains only essential elements: the question $q$, an evolving report $S_t$ serving as compressed memory, and the immediate context from the last interaction ($a_t$ and $o_t$). This Markovian structure allows the agent to maintain consistent reasoning capacity across arbitrary exploration depths and naturally circumvents context degradation. For every step $0 < t < T$, the core update process can be formalized as: $ S_t, \tau_{t+1}, a_{t+1} \sim \pi ( \cdot \mid S_{t-1}, a_t, o_t ) $ This context management paradigm is crucial as it prevents context overflow and enforces structured reasoning by requiring the agent to explicitly synthesize and prioritize information at each step, aligning with human research patterns.
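To make the two rollout formulations concrete, here is a minimal Python sketch of a ReAct loop and of the Markovian context-management variant. The `call_policy` and `call_tool` callables, the `final_answer` convention, and the step budget are illustrative assumptions, not the report's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Step:
    thought: str      # tau_t: the agent's internal reasoning
    action: dict      # a_t: a tool call, e.g. {"tool": "search", "args": {...}}
    observation: str  # o_t: feedback returned by the environment

def react_rollout(question, call_policy, call_tool, max_steps=128):
    """ReAct mode: every step conditions on the full history H_{t-1}."""
    history = []
    for _ in range(max_steps):
        thought, action = call_policy(question=question, history=history)
        if action["tool"] == "final_answer":            # a_T: terminate with the answer
            return action["args"]["answer"], history
        observation = call_tool(action)                  # o_t from the environment
        history.append(Step(thought, action, observation))
    return None, history                                 # step budget exhausted

def managed_rollout(question, call_policy, call_tool, max_steps=128):
    """Context-management mode: condition only on (S_{t-1}, a_t, o_t)."""
    report, last_action, last_obs = "", None, None       # S_0 starts empty
    for _ in range(max_steps):
        report, thought, action = call_policy(
            question=question, report=report,             # S_{t-1}: compressed memory
            last_action=last_action, last_obs=last_obs)
        if action["tool"] == "final_answer":
            return action["args"]["answer"], report
        last_action, last_obs = action, call_tool(action)
    return None, report
```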

4.2.2. Overall Training Recipe

The Tongyi DeepResearch system is initialized from the pre-trained base model Qwen3-30B-A3B-Base. The development proceeds through an end-to-end training framework that integrates agentic mid-training and agentic post-training. This framework is designed to enable scalable reasoning and information seeking across complex research tasks, establishing a new paradigm for training agentic models.

The overall training pipeline is visualized below:

Figure 2: Training pipeline of Tongyi DeepResearch. The figure shows three main phases, pre-training, agentic mid-training, and agentic post-training, comprising Agentic CPT Stage 1 (32K), Agentic CPT Stage 2 (128K), Agentic SFT, and Agentic RL.

The pipeline begins with Pre-training (e.g., Qwen3-30B-A3B-Base). This is followed by two main agentic training stages:

  1. Agentic Mid-training: This phase aims to instill agentic inductive biases into the base model. It consists of two stages:
    • Agentic CPT Stage 1 (32K context)
    • Agentic CPT Stage 2 (128K context)
  2. Agentic Post-training: This phase refines the agentic capabilities through more targeted training. It also consists of two stages:
    • Agentic SFT (Supervised Fine-tuning)
    • Agentic RL (Reinforcement Learning)

4.2.3. Agentic Mid-training

The mid-training phase is designed to bridge the gap between generally pre-trained models and the specific requirements of agentic post-training. Its primary objective is to provide a base model with a strong inductive bias for agentic behavior while preserving broad linguistic competence.

4.2.3.1. Training Configuration

Tongyi DeepResearch employs a two-stage Agentic Continual Pre-training (Agentic CPT) as its core mid-training phase. The optimization process uses the standard Next-Token Prediction loss function.

  • Stage 1: Initiates with a 32K context length.
  • Stage 2: Expands to a 128K context length. A substantial corpus of long-sequence (64K-128K) agentic behavior data is introduced in this stage to enhance the model's capacity for coherent, long-horizon reasoning and action. Throughout both stages, a small proportion of general pre-training data is interleaved to ensure the model acquires specialized agentic competence without sacrificing its foundational generalization capabilities.
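As a rough illustration of this configuration (standard next-token prediction over interleaved agentic and general data, with the context window expanded between stages), the following PyTorch-style sketch shows the loss and a stage schedule; the mixture ratio and the Hugging Face-style `model(...).logits` interface are assumptions rather than details from the report.

```python
import torch.nn.functional as F

def next_token_loss(model, input_ids):
    """Standard next-token prediction: each position predicts the following token."""
    logits = model(input_ids).logits                    # (batch, seq_len, vocab)
    return F.cross_entropy(
        logits[:, :-1].reshape(-1, logits.size(-1)),    # predictions for positions 0..T-1
        input_ids[:, 1:].reshape(-1),                   # shifted targets
    )

# Two-stage Agentic CPT schedule: context lengths follow the report,
# the general-data ratio is a made-up placeholder.
AGENTIC_CPT_STAGES = [
    {"name": "Agentic CPT Stage 1", "max_seq_len": 32_768,  "general_data_ratio": 0.1},
    {"name": "Agentic CPT Stage 2", "max_seq_len": 131_072, "general_data_ratio": 0.1},
]
```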

4.2.3.2. Large-scale Agent Behavior Data Synthesis

In Agentic CPT, data is synthesized across the complete lifecycle of agent workflows. A typical agent workflow involves starting with a problem, iteratively cycling through reflection and action, and ultimately converging on a final solution. To comprehensively capture this, data is synthesized for critical steps: Question Synthesis, Planning Action, Reasoning Action, and Decision-Making Action. Decision-making is explicitly modeled as a distinct action type.

The process of large-scale agent behavior data synthesis is illustrated below:

Figure 3: Large-scale agent behavior data synthesis for agentic continual pre-training. The figure depicts the flow from task to answer, including question synthesis, planning, decision-making, and reasoning, and highlights the potential paths and hidden processes involved in decision-making.

The workflow begins with an Open World Memory (continuously updated knowledge). This memory is used for Question Synthesis, feeding into Planning, which then flows into Decision Making and Reasoning steps, ultimately leading to the Answer. This entire process is used to generate Agent Behavior Data for Agentic Continual Pre-training.

  • Large-scale Multi-style Question Synthesis:

    • An entity-anchored open-world memory is constructed, consolidating diverse real-world knowledge (web-crawled data, agent interaction trajectories) into structured representations of entities and their associated knowledge.
    • Entities and related knowledge are sampled to generate diverse questions that embed specific behavioral pattern requirements, such as multi-hop reasoning questions and numerical computation questions.
  • Planning Action:

    • Planning involves problem decomposition and first-step action prediction.
    • Open-source models are used to analyze, decompose, and predict initial actions for the synthesized questions.
    • Rejection sampling based on the entities and associated knowledge from question construction ensures high-quality planning outputs.
  • Reasoning Action:

    • Focuses on logical reasoning and knowledge integration from heterogeneous data, especially when external tools return massive unstructured responses.
    • Large models are guided through a two-stage process to generate complete reasoning chains given a question and its dependent knowledge.
    • A dual filtering mechanism based on reasoning length and answer consistency ensures quality.
  • Decision-Making Action:

    • Each step of an agent's thinking and action is essentially an implicit decision-making process, where the agent selects the most promising solution from multiple potential reasoning and action paths.
    • This process is explicitly modeled: existing demonstration trajectories are used to explore the feasible action space at each step.
    • Original trajectories are reconstructed into multi-step decision sequences while preserving the original decision choices.
  • General Function-calling Data Synthesis via Environment Scaling:

    • To enhance the model's general agentic capability, function-calling data is systematically scaled through environment scaling. The principle is that the breadth of function-calling competence is tied to the diversity of environments.
    • A scalable framework is designed to automatically construct heterogeneous, fully simulated environments, effectively broadening the space of function-calling scenarios.
    • The generated data is incorporated into the mid-training phase.
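A hedged skeleton of these four synthesis steps is sketched below; the prompts, thresholds, and helper checks are placeholders invented for illustration, and `generate` stands in for any LLM call.

```python
import random

# Placeholder helpers: in a real pipeline these would be LLM- or rule-based checks.
def is_grounded(plan: str, knowledge: str) -> bool:
    return bool(plan)                        # stub: accept any non-empty plan

def answers_match(answer: str, knowledge: str) -> bool:
    return answer in knowledge               # stub: crude consistency check

def synthesize_behavior_data(memory: dict, generate, n_samples: int) -> list:
    """Skeleton of the four synthesis steps; `generate(prompt) -> str` is any LLM wrapper."""
    data = []
    for _ in range(n_samples):
        entity, knowledge = random.choice(list(memory.items()))
        # 1. Question synthesis anchored on a sampled entity and its knowledge.
        question = generate(f"Write a multi-hop question about {entity}: {knowledge}")
        # 2. Planning action, kept only if grounded in the source knowledge
        #    (rejection sampling).
        plan = generate(f"Decompose the task and predict the first action: {question}")
        if not is_grounded(plan, knowledge):
            continue
        # 3. Reasoning action with dual filtering on length and answer consistency.
        chain = generate(f"Reason step by step, then answer: {question}\n{knowledge}")
        answer = chain.splitlines()[-1] if chain else ""
        if len(chain) < 50 or not answers_match(answer, knowledge):
            continue
        # 4. Decision-making action: each step re-cast as an explicit choice among
        #    candidate actions, preserving the originally chosen one.
        decisions = [{"candidates": [step], "chosen": step}
                     for step in chain.splitlines()]
        data.append({"question": question, "plan": plan,
                     "reasoning": chain, "decisions": decisions})
    return data
```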

4.2.4. Agentic Post-training

The post-training pipeline comprises three stages: data synthesis, supervised fine-tuning for cold start, and agentic reinforcement learning.

4.2.4.1. High-quality Data Synthesis

An end-to-end solution for synthetic data generation is developed to create complex, high-uncertainty, and super-human level question and answer pairs. This fully automated process aims to push the boundaries of agent performance without human intervention.

The high-quality data synthesis pipeline is depicted below:

Figure 4: High-quality data synthesis pipeline. The figure shows three stages: (1) graph construction, (2) subgraph sampling, and (3) uncertainty injection.

The pipeline consists of three main stages:

  1. Graph Construction: Builds a highly interconnected knowledge graph via random walks, leveraging web search to acquire relevant knowledge, and isomorphic tables from real-world websites for a realistic information structure.

  2. Subgraph Sampling: Samples subgraphs and subtables from the constructed knowledge graph to generate initial questions and answers.

  3. Uncertainty Injection: Strategically increases the uncertainty within the question to enhance its difficulty. This is grounded in a theoretical framework that models QA difficulty as atomic operations on entity relationships (e.g., merging entities with similar attributes), allowing systematic complexity increase. Formal modeling of the information-seeking problem based on set theory further enables controllable difficulty and structure scaling, minimizes reasoning shortcuts and structural redundancy, and allows for efficient verification of QA correctness.

    Additionally, an automated data engine generates PhD-level research questions. It starts with a multi-disciplinary knowledge base to create seed QA pairs requiring multi-source reasoning. These seeds undergo iterative complexity upgrades, where a question-crafting agent progressively expands scope and abstraction, refining and compounding prior outputs.
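The three stages can be pictured with the short sketch below, where `get_neighbors` stands in for web search or table lookup and the uncertainty injection is reduced to a single entity-blurring operation; all names and parameters are illustrative, not the report's pipeline.

```python
import random

def random_walk_graph(seed_entity, get_neighbors, steps=20):
    """Stage 1 (illustrative): grow an edge set by random walks from a seed entity.
    `get_neighbors(entity) -> list[(relation, entity)]` stands in for web search."""
    edges, current = set(), seed_entity
    for _ in range(steps):
        neighbors = get_neighbors(current)
        if not neighbors:
            break
        relation, nxt = random.choice(neighbors)
        edges.add((current, relation, nxt))
        current = nxt
    return edges

def sample_subgraph(edges, k=4):
    """Stage 2: sample a small set of edges to ground one QA pair."""
    return random.sample(sorted(edges), min(k, len(edges)))

def inject_uncertainty(question, entity, attribute):
    """Stage 3: replace a concrete entity with a fuzzier description so the solver
    must disambiguate among entities sharing `attribute` (a difficulty knob)."""
    return question.replace(entity, f"the entity whose {attribute} matches the clues")
```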

4.2.4.2. Supervised Fine-tuning for Cold Start

The initial phase of agentic post-training is a supervised fine-tuning (SFT) stage. Its purpose is to equip the base model with a robust initial policy before reinforcement learning.

  • Data Source: Synthesized high-quality QA data is used to obtain training trajectories. These trajectories cover the complete thought process and tool responses generated by high-performing open-source models.

  • Filtering: A rigorous rejection sampling protocol is applied to ensure that only high-quality trajectories exhibiting diverse problem-solving patterns are retained.

  • Mixed Training Paradigm: The SFT phase leverages data from two different formulations to enhance model robustness and generalization:

    • ReAct Mode: Training samples take the historical state $\mathcal{H}_{t-1}$ as input and output the corresponding thought $\tau_t$ and tool call $a_t$ for the current step.
    • Context Management Mode: Training samples take as input the previous step's trajectory summary $S_{t-1}$, tool call $a_{t-1}$, and tool response $o_{t-1}$. They output the current step's trajectory summary $S_t$, thought $\tau_t$, and tool call $a_t$. This mode specifically strengthens the agent's capabilities in state analysis and strategic decision-making, requiring the model to synthesize complex observations into coherent summaries.
  • Two-stage Training Strategy based on Context Length:

    • Stage 1: Context length is set to 40K. Training data includes ReAct Mode samples with context lengths shorter than 40K, along with all Context Management Mode samples (as they are all within 40K).
    • Stage 2: Context length is extended to 128K. Training data includes ReAct Mode samples with context lengths between 40K and 128K, plus a small portion of 40K data for stability.
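The two training-sample formats and the length-based split might look roughly like the sketch below. The sample layout, the `tokenizer` interface, and the simplified bucketing rule are assumptions; the report additionally mixes a small portion of sub-40K data back into Stage 2 for stability.

```python
def react_sample(history, thought, action):
    """ReAct mode: input is the full history H_{t-1}; target is (tau_t, a_t)."""
    return {"input": history, "target": {"thought": thought, "action": action}}

def managed_sample(prev_report, prev_action, prev_obs, report, thought, action):
    """Context-management mode: input is (S_{t-1}, a_{t-1}, o_{t-1});
    target is (S_t, tau_t, a_t)."""
    return {"input": {"report": prev_report, "action": prev_action, "obs": prev_obs},
            "target": {"report": report, "thought": thought, "action": action}}

def split_by_context_length(samples, tokenizer, stage1_cap=40_000):
    """Two-stage SFT bucketing by token length (the 40K cap follows the report;
    `tokenizer(text) -> list[int]` is a placeholder interface)."""
    stage1, stage2 = [], []
    for s in samples:
        n = len(tokenizer(str(s["input"])))
        (stage1 if n <= stage1_cap else stage2).append(s)
    return stage1, stage2
```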

4.2.4.3. Agentic Reinforcement Learning

To advance the model's capabilities in robust and reliable planning and searching in complex web environments, an agentic RL framework is applied.

An overview of the agentic reinforcement learning framework:

Figure 5: An overview of our agentic reinforcement learning framework. The figure shows the asynchronous rollout service, rollout workers, trajectory collection, and reward service, together with the combination of simulated and real-world environments and the corresponding action and observation flows.

The agentic RL framework involves the policy model interacting with the environment (either Simulated Environment or Real-world Environment). This interaction generates trajectories (rollouts) and rewards. These are collected and processed by a Trajectory Collection component, which then feeds into the RL Training module. RL Training updates the policy model, and this cycle iterates. An Async Rollout Service and Rollout Workers facilitate parallel interactions.

  • Real-world Environment: The agent's toolkit integrates several specialized tools: Search, Visit, Python Interpreter, Google Scholar, and File Parser. To ensure reliability in training and evaluation, a unified sandbox is developed.

    • This sandbox orchestrates every tool call through a central scheduling and management layer.
    • For each tool, robust concurrency controls and fault-tolerance mechanisms are implemented (e.g., QPS rate constraints, caching, timeout-and-retry, graceful degradation, failover to backups).
    • This design abstracts tool invocation into a deterministic and stable interface, insulating the training loop from real-world stochasticity and reducing operational costs.
  • Simulated Environment: Direct use of real-world web environment APIs presents numerous practical problems (e.g., instability, cost).

    • An offline environment is built based on the 2024 Wikipedia database.
    • A suite of local RAG tools simulates the web environment.
    • The data synthesis pipeline is reused to create high-quality, structurally complex QA specifically for this offline environment.
    • This provides a low-cost, high-efficiency, fully controllable platform for rapid experimentation, accelerating development.
  • On-Policy Asynchronous Rollout Framework: The iterative nature of agentic rollouts (requiring numerous environment interactions) can be a bottleneck.

    • A custom, step-level asynchronous RL training loop is implemented, built on the rLLM framework.
    • It uses two separate asynchronous online servers: one for model inference and another for tool invocation.
    • A centralized interaction handler processes outputs from both, formatting feedback into a unified message list.
    • This architecture allows multiple agent instances to interact with the environment in parallel, completing rollouts independently.
  • RL Training Algorithm: The RL algorithm is a tailored adaptation of GRPO (Group Relative Policy Optimization). It operates with a strict on-policy regimen, meaning trajectories are consistently sampled using the most up-to-date policy. The reward is a pure 0 or 1 signal indicating answer correctness, with no separate format reward.

    The training objective is: $ \mathcal{J}(\boldsymbol{\theta}) = \mathbb{E}_{(\boldsymbol{q}, \boldsymbol{y}) \sim \mathcal{D},\, \{\mathcal{H}^i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid \mathrm{context})} \left[ \frac{1}{\sum_{i=1}^G \vert\mathcal{H}^i\vert} \sum_{i=1}^G \sum_{j=1}^{\vert\mathcal{H}^i\vert} \min\left( r_{i,j}(\boldsymbol{\theta}) \hat{A}_{i,j}, \ \mathrm{clip}\left( r_{i,j}(\boldsymbol{\theta}), 1-\varepsilon_{\mathrm{low}}, 1+\varepsilon_{\mathrm{high}} \right) \hat{A}_{i,j} \right) \right] $ Where:

    • $\boldsymbol{\theta}$: The parameters of the current policy.

    • $(\boldsymbol{q}, \boldsymbol{y}) \sim \mathcal{D}$: A question-answer pair sampled from the dataset $\mathcal{D}$.

    • $\{\mathcal{H}^i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid \mathrm{context})$: A set of $G$ trajectories (rollouts) sampled using the old policy $\pi_{\theta_{\mathrm{old}}}$.

    • $r_{i,j}(\boldsymbol{\theta})$: The importance sampling ratio for the $j$-th token in the $i$-th trajectory. For strictly on-policy training, this remains 1.0. It is defined as: $ r_{i,j}(\boldsymbol{\theta}) = \frac{ \pi_{\theta}(\mathcal{H}^{i,j} \mid \mathrm{context}) }{ \pi_{\theta_{\mathrm{old}}}(\mathcal{H}^{i,j} \mid \mathrm{context}) } $ Here, $\pi_{\theta}(\mathcal{H}^{i,j} \mid \mathrm{context})$ is the probability of the $j$-th token in trajectory $\mathcal{H}^i$ under the new policy $\pi_{\theta}$, given the context up to that point, and similarly for the old policy.

    • $\hat{A}_{i,j}$: An estimator of the advantage at token $j$ of the $i$-th trajectory. It is calculated as the reward of the trajectory minus the mean reward of all trajectories in the current batch: $ \hat{A}_{i,j} = R_i - \mathrm{mean}( \{R_i\}_{i=1}^G ) $ Here, $R_i$ is the episode reward (0 or 1 for correctness) for trajectory $i$.

    • $\mathrm{clip}(\cdot, 1-\varepsilon_{\mathrm{low}}, 1+\varepsilon_{\mathrm{high}})$: A clipping function applied to the importance sampling ratio to constrain policy updates, preventing excessively large steps.

      Following DAPO, a token-level policy gradient loss is applied, and a clip-higher strategy is used to encourage more exploration. To reduce variance in advantage estimation, a leave-one-out strategy is adopted. Additionally, to improve training stability and prevent policy collapse, certain negative samples are selectively excluded from the loss calculation. The paper notes that these modifications prioritize pragmatic stability and efficiency over algorithmic novelty. A code sketch of this objective appears after this list.

  • Automatic Data Curation: To generalize to out-of-distribution scenarios through self-exploration, data is optimized in real-time, guided by training dynamics.

    • A fully automated data filtering pipeline dynamically adjusts the training set based on the improved policy model.
    • Starts with a large dataset $\mathcal{D}$. An initial SFT model samples multiple rollouts for each problem.
    • An initial training set $\mathcal{D}'$ is created by filtering out problems where the model always fails or always succeeds (as they offer no learning signal). This leaves problems of moderate difficulty.
    • During RL training, problems in $\mathcal{D}'$ are continuously monitored.
    • A separate background process uses intermediate checkpoints of the policy model to sample from the entire original dataset $\mathcal{D}$, identifying new moderately difficult problems for a backup pool.
    • When training reaches a certain step count or the reward plateaus, the active training set $\mathcal{D}'$ is refreshed by removing mastered problems and incorporating new, challenging ones from the backup pool.
    • This pipeline runs independently, never interrupting the main RL training loop, ensuring high training efficiency and stability.
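The adapted objective above, with token-level averaging, an asymmetric clip-higher range, and a leave-one-out advantage baseline, can be sketched as follows. The tensor layout, the epsilon values, and the omission of the selective negative-sample filtering are simplifying assumptions rather than the report's exact code.

```python
import torch

def grpo_token_loss(logp_new, logp_old, rewards, eps_low=0.2, eps_high=0.3):
    """GRPO-style surrogate for one group of G rollouts of the same question.
    logp_new[i] / logp_old[i]: 1-D tensors of per-token log-probs for rollout i.
    rewards: list of G episode rewards (0 or 1 for answer correctness).
    The epsilon values are placeholders; clip-higher means eps_high > eps_low."""
    G = len(rewards)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    # Leave-one-out baseline: compare each rollout against the mean of the others.
    baselines = (rewards.sum() - rewards) / (G - 1)
    advantages = rewards - baselines

    total_tokens = sum(lp.numel() for lp in logp_new)
    loss = torch.zeros(())
    for i in range(G):
        ratio = torch.exp(logp_new[i] - logp_old[i])             # r_{i,j}; 1.0 when on-policy
        clipped = torch.clamp(ratio, 1 - eps_low, 1 + eps_high)
        surrogate = torch.minimum(ratio * advantages[i], clipped * advantages[i])
        loss = loss - surrogate.sum() / total_tokens             # token-level averaging
    return loss
```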
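Similarly, the core rule of the automatic data curation pipeline, keeping only problems the current policy solves sometimes but not always, reduces to a filter like the one below; the rollout count and the interface are invented for illustration.

```python
def filter_moderate_difficulty(problems, policy_rollout, n_rollouts=8):
    """Keep problems with a success rate strictly between 0 and 1 under the current
    policy, i.e. drop always-fail and always-solve items that carry no learning signal.
    `policy_rollout(problem) -> 0 or 1` stands in for one sampled rollout plus the
    correctness check."""
    kept = []
    for problem in problems:
        successes = sum(policy_rollout(problem) for _ in range(n_rollouts))
        if 0 < successes < n_rollouts:
            kept.append(problem)
    return kept
```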

4.2.4.4. Model Merging

At the last stage of the pipeline, model merging is employed. This approach is based on the insight that parameters of different model variants derived from the same pre-trained model can be effectively combined.

  • Process: Several model variants originating from the same base model but exhibiting different capability preferences are selected.
  • Weighted Average: The final merged model is created by computing a weighted average of their parameters: $ \theta_{\mathrm{merged}} = \sum_k \alpha_k \cdot \theta^{(k)}, \quad \mathrm{s.t.} \ \sum_k \alpha_k = 1, \ \alpha_k \geq 0 $ Where:
    • $\theta^{(k)}$: Represents the parameters of the $k$-th model variant.
    • $\alpha_k$: Is its corresponding merge weight.
  • Benefits: This interpolation strategy preserves the core strengths of each contributing model and equips the merged model with robust generalization abilities. It performs comparably to the best source model in its respective area of strength without incurring additional optimization costs.
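A minimal sketch of this weighted parameter interpolation, assuming the variants are available as PyTorch state dicts; how the merge weights are chosen is not specified here.

```python
import torch

def merge_checkpoints(state_dicts, weights):
    """Weighted average of parameters from variants sharing the same base model.
    `weights` are the alpha_k in the formula above: non-negative and summing to 1."""
    assert abs(sum(weights) - 1.0) < 1e-6 and all(w >= 0 for w in weights)
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(w * sd[name].float() for w, sd in zip(weights, state_dicts))
    return merged

# Example: 60/40 interpolation of two hypothetical variants.
# merged = merge_checkpoints([torch.load("variant_a.pt"), torch.load("variant_b.pt")],
#                            weights=[0.6, 0.4])
```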

5. Experimental Setup

5.1. Datasets

The experiments evaluate Tongyi DeepResearch on seven public information-seeking benchmarks designed for long-term reasoning and long-horizon tool use.

  • Humanity's Last Exam (HLE) (Phan et al., 2025): A benchmark designed to test an agent's ability to tackle complex, multidisciplinary questions that often require deep reasoning and knowledge integration. It focuses on questions that might challenge even human experts. The paper evaluates on 2,154 text-only questions.

  • BrowseComp (Wei et al., 2025): A benchmark for browsing agents that requires navigating and extracting information from web pages to answer questions. It tests tool use and information retrieval in a realistic web environment.

  • BrowseComp-ZH (Zhou et al., 2025): The Chinese counterpart of BrowseComp, assessing similar browsing and information retrieval capabilities but in a Chinese language context.

  • GAIA (Mialon et al., 2023): A benchmark for general AI assistants that evaluates complex real-world tasks requiring multiple steps, tool use, and common sense reasoning. It often involves using web search and other tools.

  • xbench-DeepSearch (Xbench Team, 2025): A benchmark specifically designed for evaluating deep search capabilities, likely involving multi-hop information retrieval and complex synthesis from multiple sources.

  • WebWalkerQA (Wu et al., 2025b): A benchmark focused on web traversal and question answering, testing LLMs' ability to navigate through web pages to find answers.

  • FRAMES (Krishna et al., 2025): A benchmark for retrieval-augmented generation, often involving fetching facts and reasoning over them.

  • xbench-DeepSearch-2510: A newly released benchmark for deep search, indicating a continuous effort to push the boundaries of such systems.

    These datasets were chosen because they are widely recognized public benchmarks for evaluating agentic capabilities, information seeking, long-horizon reasoning, and tool use in LLMs. They are effective for validating the proposed method's performance across diverse complexities and language domains (English and Chinese).

The paper also mentions AIME25, HMMT25, and SimpleQA (OpenAI, 2025c) for evaluating performance on general benchmarks.

  • AIME25: Likely a mathematical problem-solving benchmark, possibly related to the American Invitational Mathematics Examination.
  • HMMT25: Possibly referring to the Harvard-MIT Mathematics Tournament, another math competition benchmark.
  • SimpleQA: A knowledge-intensive benchmark focusing on factual question answering.

5.2. Evaluation Metrics

For all deep research benchmarks, the paper follows each benchmark's official evaluation protocol. The primary metric reported is the average performance over three runs, denoted as Avg@3. For completeness, Pass@1 (best result over 3 runs) and Pass@3 are also reported. While the specific calculation for each benchmark's score (e.g., accuracy, F1 score) isn't detailed in the main text, it's implied that they use standard metrics for QA or task completion.

For general benchmarks:

  • Mathematical Problems (AIME25, HMMT25): Manual evaluation is used due to the detailed reports generated by the system and the relatively small scale of these datasets, ensuring accuracy and fairness. The metric is likely accuracy (proportion of correctly solved problems).

  • Knowledge-based Problems (SimpleQA): The official evaluation script of SimpleQA is utilized to maintain consistency with established benchmarks. This typically involves accuracy or F1 score for factual questions.

    Since the paper does not explicitly provide the mathematical formulas for Avg@3, Pass@1, and Pass@3, I will provide their conceptual definitions. These are common metrics in agentic LLM evaluation, especially for tasks with some stochasticity.

  • Pass@1:

    1. Conceptual Definition: As used in this report ("best result over 3 runs"), Pass@1 reports the best single-run score among the three independent runs of the benchmark. It indicates what the agent achieves in its most favorable run, complementing the averaged Avg@3.
    2. Mathematical Formula: Let $N$ be the total number of tasks and $K$ the number of independent runs per task (in this paper, $K=3$). Let $S_{i,k}$ be a binary indicator: 1 if the $k$-th run for task $i$ is successful, 0 otherwise. $ \mathrm{Pass@1} = \max_{k=1,\ldots,K} \frac{1}{N} \sum_{i=1}^{N} S_{i,k} $
    3. Symbol Explanation:
      • $N$: Total number of tasks in the benchmark.
      • $K$: Number of independent runs for each task (here, 3).
      • $S_{i,k}$: Binary indicator; 1 if the $k$-th run for task $i$ is successful, 0 if unsuccessful.
      • $\frac{1}{N} \sum_{i=1}^{N} S_{i,k}$: The score of run $k$; taking the maximum over the $K$ runs gives the best single-run result.
  • Pass@3:

    1. Conceptual Definition: Pass@3 counts a task as solved if at least one of the $K=3$ independent runs solves it. It measures the agent's coverage when allowed multiple attempts and is therefore an upper bound on the single-run metrics, which is useful for tasks with some run-to-run stochasticity.
    2. Mathematical Formula: Let $N$ be the total number of tasks, $K=3$ the number of independent runs per task, and $S_{i,k}$ a binary indicator: 1 if the $k$-th run for task $i$ is successful, 0 otherwise. $ \mathrm{Pass@3} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( \max_{k=1,2,3} S_{i,k} = 1 \right) $
    3. Symbol Explanation: Same as Pass@1, with $\mathbb{I}(\cdot)$ the indicator function, which evaluates to 1 if its argument is true and 0 otherwise; $\max_{k=1,2,3} S_{i,k}$ equals 1 if at least one of the three runs for task $i$ succeeded.
  • Avg@3:

    1. Conceptual Definition: Avg@3 calculates the average performance score across three independent runs for each task and then averages these task-level averages over all tasks. It provides a more robust estimate of typical performance, smoothing out run-to-run variability.
    2. Mathematical Formula: Let $N$ be the total number of tasks and $K=3$ the number of independent runs per task. Let $Score_{i,k}$ be the performance score (e.g., accuracy, F1) for the $k$-th run of task $i$. $ \mathrm{Avg@3} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{K} \sum_{k=1}^{K} Score_{i,k} \right) $
    3. Symbol Explanation:
      • $N$: Total number of tasks in the benchmark.
      • $K$: Number of independent runs for each task (here, 3).
      • $Score_{i,k}$: The performance score achieved by the agent on task $i$ during run $k$. This score can be a direct success indicator (0 or 1) or a more granular metric such as F1, depending on the benchmark.
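The three metrics, under the definitions given above (this analysis's reading of the report, not its official evaluation code), can be computed from an N x K matrix of binary outcomes as follows.

```python
import numpy as np

def run_level_metrics(scores):
    """`scores` is an (N_tasks, K_runs) array of binary task outcomes (1 = solved)."""
    scores = np.asarray(scores, dtype=float)
    per_run = scores.mean(axis=0)                         # accuracy of each individual run
    return {
        "Avg@3":  float(per_run.mean()),                  # average performance over runs
        "Pass@1": float(per_run.max()),                   # best single-run result
        "Pass@3": float((scores.max(axis=1) >= 1).mean()),  # solved in at least one run
    }

# Example with 4 tasks and 3 runs: Avg@3 = 0.5, Pass@1 = 0.5, Pass@3 = 0.75.
print(run_level_metrics([[1, 0, 1], [0, 0, 0], [1, 1, 1], [0, 1, 0]]))
```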

5.3. Baselines

The paper compares Tongyi DeepResearch against two families of systems:

  1. LLM-based ReAct agents: These are models that use LLMs primarily within a ReAct framework for reasoning and tool use.

    • GLM-4.5 (Zeng et al., 2025)
    • Kimi-K2 (Team et al., 2025)
    • DeepSeek-V3.1 (DeepSeek Team, 2025)
    • Claude-4-Sonnet (anthropic, 2025)
    • OpenAI o3/o4-mini (OpenAI, 2025b)
  2. End-to-end deep-research agents: These are systems specifically designed and optimized for deep research tasks, often incorporating more complex agentic architectures and training.

    • OpenAI DeepResearch (OpenAI, 2025a)

    • Gemini DeepResearch (Gemini Team, 2025)

    • Kimi Researcher (Kimi, 2025)

      These baselines are representative because they cover a range of state-of-the-art LLMs (both open and closed-source) that are either general-purpose models adapted for agentic tasks (LLM-based ReAct agents) or specialized systems built for deep research (end-to-end deep-research agents). This allows for a comprehensive comparison of Tongyi DeepResearch against both general LLM-agent capabilities and dedicated deep research solutions.

5.4. Inference Parameters

To ensure stability and reproducibility across evaluations, fixed inference parameters were adopted:

  • temperature = 0.85

  • repetition penalty = 1.1

  • top-p = 0.95

  • A maximum of 128 tool invocations is allowed per task.

  • The context length is constrained to 128K tokens.

    Each benchmark is evaluated three times independently, and the average performance (Avg@3) is reported as the main metric. Pass@1 (best result over 3 runs) and Pass@3 results are also provided. All results were obtained on September 16, 2025, except for xbench-DeepSearch-2510, which was evaluated on October 28, 2025.

The action space for Tongyi DeepResearch includes Search, Visit, Python, Scholar, and File Parser tools. Official reproduction scripts, tool implementations, and prompt configurations are open-sourced on GitHub.
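For reference, these fixed decoding settings could be captured in a configuration object like the sketch below; the report does not specify the serving stack, so the key names and the budget-guard helper are illustrative only.

```python
# Decoding settings reported for evaluation (values from the report; key names assumed).
INFERENCE_CONFIG = {
    "temperature": 0.85,
    "repetition_penalty": 1.1,
    "top_p": 0.95,
    "max_tool_calls": 128,          # per-task cap on tool invocations
    "max_context_tokens": 131_072,  # 128K-token context limit
}

def within_tool_budget(step_index: int, config: dict = INFERENCE_CONFIG) -> bool:
    """Hypothetical guard a rollout loop could use to enforce the tool-call cap."""
    return step_index < config["max_tool_calls"]
```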

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results demonstrate that Tongyi DeepResearch achieves state-of-the-art performance across a range of deep research benchmarks, often outperforming stronger baselines, despite its parameter efficiency.

The following are the results from Table 1 of the original paper:

| Model | Humanity's Last Exam | BrowseComp | BrowseComp-ZH | GAIA | xbench-DeepSearch | WebWalkerQA | FRAMES |
|---|---|---|---|---|---|---|---|
| LLM-based ReAct Agent | | | | | | | |
| GLM-4.5 | 21.2 | 26.4 | 37.5 | 66.0 | 70.0 | 65.6 | 78.9 |
| Kimi-K2 | 18.1 | 14.1 | 28.8 | 57.7 | 50.0 | 63.0 | 72.0 |
| DeepSeek-V3.1 | 29.8 | 30.0 | 49.2 | 63.1 | 71.0 | 61.2 | 83.7 |
| Claude-4-Sonnet | 20.3 | 12.2 | 29.1 | 68.3 | 65.0 | 61.7 | 80.7 |
| OpenAI o3 | 24.9 | 49.7 | 58.1 | - | 67.0 | 71.7 | 84.0 |
| OpenAI o4-mini | 17.7 | 28.3 | - | - | 60.0 | - | - |
| DeepResearch Agent | | | | | | | |
| OpenAI DeepResearch | 26.6 | 51.5 | 42.9 | 67.4 | - | - | - |
| Gemini DeepResearch | 26.9 | - | - | - | - | - | - |
| Kimi Researcher | 26.9 | - | - | - | 69.0 | - | 78.8 |
| Tongyi DeepResearch (30B-A3B) | 32.9 | 43.4 | 46.7 | 70.9 | 75.0 | 72.2 | 90.6 |

Analysis of Advantages and Disadvantages:

  • Humanity's Last Exam: Tongyi DeepResearch achieves 32.9, significantly outperforming all other LLM-based ReAct agents (e.g., DeepSeek-V3.1 at 29.8, OpenAI o3 at 24.9) and competitive with DeepResearch agents (e.g., OpenAI DeepResearch at 26.6, Gemini DeepResearch at 26.9). This indicates strong multi-disciplinary reasoning and knowledge integration capabilities.

  • BrowseComp: While OpenAI DeepResearch (51.5) and OpenAI o3 (49.7) achieve higher scores, Tongyi DeepResearch (43.4) still outperforms other LLM-based ReAct agents like GLM 4.5 (26.4) and DeepSeek-V3.1 (30.0). This suggests good web browsing and information retrieval skills, though there's room for improvement against the absolute best proprietary systems.

  • BrowseComp-ZH: Tongyi DeepResearch scores 46.7, which is competitive but slightly lower than DeepSeek-V3.1 (49.2) and OpenAI o3 (58.1). This shows its ability to generalize to Chinese-language browsing tasks, albeit with some gap compared to top performers in this specific benchmark.

  • GAIA: Tongyi DeepResearch achieves 70.9, the highest score reported among all baselines, surpassing GLM 4.5 (66.0), Claude-4-Sonnet (68.3), and OpenAI DeepResearch (67.4). This highlights its general AI assistant capabilities in complex, real-world tasks requiring multi-step reasoning and tool use.

  • xbench-DeepSearch: Tongyi DeepResearch secures the highest score at 75.0, surpassing DeepSeek-V3.1 (71.0) and GLM 4.5 (70.0). This directly validates its core deep search capabilities.

  • WebWalker QA: With 72.2, Tongyi DeepResearch leads all reported baselines, including OpenAI o3 (71.7). This indicates excellent web traversal and question answering abilities.

  • FRAMES: Tongyi DeepResearch achieves 90.6, significantly higher than any other model, including OpenAI o3 (84.0) and DeepSeek-V3.1 (83.7). This demonstrates superior fact fetching and reasoning over retrieved information.

    Overall: Tongyi DeepResearch consistently achieves state-of-the-art performance across nearly all evaluated benchmarks, especially among open-source deep research agents. It narrows the gap with, and in some cases surpasses, proprietary frontier systems, while activating significantly fewer parameters (3.3 billion out of 30.5 billion total parameters). This underscores its efficiency and scalability. On the newly released xbench-DeepSearch-2510, it ranks just below ChatGPT-5-Pro, further demonstrating its competitive edge.

6.1.1. Heavy Mode Performance

The paper introduces a Heavy Mode to further unlock the potential of deep research agents through test-time scaling. This mode leverages a Research-Synthesis framework built upon the context management paradigm.

The performance comparison of Tongyi DeepResearch Heavy Mode and state-of-the-art models is shown below:

Figure 6: Performance comparison between Tongyi DeepResearch Heavy Mode and state-of-the-art models on Humanity's Last Exam, BrowseComp, and BrowseComp-ZH.

Methodology of Heavy Mode:

  1. Parallel Research Phase: $n$ parallel agents are deployed. Each agent follows the context management paradigm, exploring diverse solution paths with different tool-usage and reasoning strategies. Each agent $u$ independently processes the question $q$ and produces a final report summary $S_T^u$ and answer $\mathrm{answer}_u$: $( S_T^u, \mathrm{answer}_u ) = \mathrm{Agent}_u( q ), \quad u \in [ 1, n ]$. Here, $S_T^u$ denotes the final report summary from agent $u$ after $T$ iterations, encapsulating the complete reasoning trajectory in compressed form.
  2. Integrative Synthesis Phase: A synthesis model consolidates all parallel findings to produce the final answer: $\mathrm{answer}_{\mathrm{final}} = \mathrm{Synthesis}\left( \left\{ \left( S_T^u, \mathrm{answer}_u \right) \right\}_{u=1}^{n} \right)$. Because the context-management reports $S_T^u$ are compressed, the synthesis model can assess $n$ diverse solution strategies within a manageable context window, unlike methods that aggregate full, long trajectories. A minimal control-flow sketch of this two-phase procedure is given below.
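The following is a minimal Python illustration of the two-phase procedure under stated assumptions: `run_agent` and `synthesize` are hypothetical stand-ins for the rollout agent and the synthesis model, and their placeholder logic is not the paper's implementation; only the structure (n parallel rollouts producing compressed summaries, followed by one synthesis call) mirrors the description above.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(question: str, agent_id: int) -> tuple[str, str]:
    """Hypothetical single-agent rollout: returns (compressed report summary S_T^u, answer_u)."""
    # In practice this would drive the context-management agent loop with tool calls.
    summary = f"[agent {agent_id}] compressed reasoning trajectory for: {question}"
    answer = f"[agent {agent_id}] candidate answer"
    return summary, answer

def synthesize(question: str, findings: list[tuple[str, str]]) -> str:
    """Hypothetical synthesis model: consolidates all (summary, answer) pairs into one answer."""
    # A real implementation would prompt an LLM with the n compressed reports.
    return max(findings, key=lambda f: len(f[0]))[1]  # placeholder selection rule

def heavy_mode(question: str, n: int = 4) -> str:
    # Parallel research phase: n independent agents explore diverse solution paths.
    with ThreadPoolExecutor(max_workers=n) as pool:
        findings = list(pool.map(lambda u: run_agent(question, u), range(1, n + 1)))
    # Integrative synthesis phase: one model reads the n compressed summaries.
    return synthesize(question, findings)

print(heavy_mode("example deep-research question"))
```

Passing compressed summaries rather than full trajectories is what keeps the synthesis prompt within a manageable context window.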

Heavy Mode Results:

  • Humanity's Last Exam: Achieves 38.3%, a substantial improvement over the standard mode (32.9%) and all other baselines.

  • BrowseComp-ZH: Reaches 58.1%, matching OpenAI o3 and surpassing DeepSeek-V3.1 (49.2) and its standard mode (46.7%).

  • BrowseComp: Achieves 58.3%, a significant improvement over its standard mode (43.4%) and surpasses OpenAI DeepResearch (51.5) and OpenAI o3 (49.7), becoming the leading model on this benchmark.

    These results validate the effectiveness of Heavy Mode in leveraging test-time compute through parallel exploration and intelligent aggregation for enhanced performance.

6.2. Detailed Analysis

6.2.1. Pass@1 and Pass@3 Performance

The paper reports Avg@3 performance in Table 1. A fine-grained analysis of Pass@1 and Pass@3 is also conducted to demonstrate robustness in a dynamic environment.

The detailed evaluation results using Avg@3, Pass@1, and Pass@3 metrics are shown below:

Figure 7: Detailed evaluation results using the Avg@3, Pass@1, and Pass@3 metrics across benchmarks (HLE, BrowseComp, WebWalkerQA, etc.).

The figure shows that Avg@3 results are consistent with Pass@1 results across benchmarks, indicating robustness. Pass@3, which counts a task as solved if any of the three runs succeeds, shows even higher potential (a small computation sketch of these metrics follows the list):

  • BrowseComp: 59.64% (compared to Avg@3 of 43.4)
  • BrowseComp-ZH: 63.67% (compared to Avg@3 of 46.7)
  • Humanity's Last Exam: 45.9% (compared to Avg@3 of 32.9)

The markedly higher Pass@3 values demonstrate the agent's strong potential when it is given multiple attempts.
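As a concrete reading of these metrics, the sketch below computes Avg@3, Pass@1, and Pass@3 from a toy per-run correctness matrix. The toy data and the single-run reading of Pass@1 are illustrative assumptions, not the report's evaluation harness.

```python
# rows = questions, columns = 3 independent runs; 1 = correct, 0 = incorrect
runs = [
    [1, 0, 1],
    [0, 0, 1],
    [1, 1, 1],
    [0, 0, 0],
]

n_questions, n_runs = len(runs), len(runs[0])

# Avg@3: mean accuracy over the three independent runs
avg_at_3 = sum(sum(col) / n_questions for col in zip(*runs)) / n_runs

# Pass@1: accuracy of a single run (here, the first one)
pass_at_1 = sum(row[0] for row in runs) / n_questions

# Pass@3: a question counts as solved if any of the three runs solves it
pass_at_3 = sum(1 for row in runs if any(row)) / n_questions

print(f"Avg@3={avg_at_3:.2f}  Pass@1={pass_at_1:.2f}  Pass@3={pass_at_3:.2f}")
```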

6.2.2. Training Rewards and Entropy

The agent's performance (reward) and policy entropy during agentic RL training are analyzed.

The reward and entropy loss of agentic RL training is shown below:

Figure 8: Reward (left) and entropy loss (right) of agentic RL training; both panels show raw values and EMA-smoothed curves.

  • Reward: The left panel shows a clear and significant upward trend in the agent's performance (reward) with training, confirming effective policy learning. The sustained improvement is attributed to dynamic data curation, which consistently provides challenging material, preventing learning stagnation.
  • Entropy: The right panel shows that policy entropy exhibits exceptional stability. It converges to a consistent value after a brief initial increase, avoiding both collapse (where the policy becomes too deterministic and stops exploring) and explosion (where the policy becomes too random and inefficient). This stability is strong evidence for the methodological contributions in environment design and algorithm modification that create effective RL training.

6.2.3. Context Length of RL

The impact of the model's context length on the agentic RL training process is analyzed by comparing models with 32k, 48k, and 64k context limits. The dynamic data curation for all variants used a 64k context model.

The comparison of different context length limits for RL training is shown below:

Figure 9: Comparison of 32k, 48k, and 64k context length limits for RL training: reward (left) and average response length (right).

  • Reward Dynamics (Left Panel): All three models (32k, 48k, 64k) demonstrate effective and stable policy learning with monotonically increasing rewards, confirming the robustness of the training framework. However, their performance ceilings diverge. The 64k model achieves the highest reward because the curriculum is populated with problems moderately difficult for a 64k context model, often requiring long and complex reasoning. The 48k and 32k models, being more constrained, cannot solve the most complex problems, thus capping their maximum potential reward.
  • Average Response Length (Right Panel):
    • The 64k context model shows a steady increase in average response length, learning to leverage its expansive context for more elaborate solutions.
    • The 48k context model maintains a consistent equilibrium in response length, improving its policy within a stable complexity budget.
    • The 32k context model displays a clear downward trend in response length. This is a key insight: for models with limited context, RL training on a curriculum designed for a more capable model can force them to discover more efficient solutions. Since the 64k context model curates the data, some problems have optimal solutions longer than 32k tokens; a 32k model attempting these receives a zero-reward signal, implicitly incentivizing it to find more concise, potent action sequences that fit within its limit and thereby become more efficient (a minimal sketch of this length-capped reward rule follows this list).
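The sketch below assumes a simple binary outcome reward; the helper function and its scoring rule are illustrative, not the report's reward implementation.

```python
def trajectory_reward(trajectory_tokens: int, is_correct: bool, context_limit: int) -> float:
    """Length-capped outcome reward (illustrative): over-length rollouts earn nothing."""
    if trajectory_tokens > context_limit:
        return 0.0          # rollout truncated: treated as a failure regardless of content
    return 1.0 if is_correct else 0.0

# A problem whose natural solution needs ~40k tokens:
print(trajectory_reward(40_000, True, context_limit=64_000))  # 1.0 for the 64k policy
print(trajectory_reward(40_000, True, context_limit=32_000))  # 0.0 for the 32k policy
# The 32k policy is therefore implicitly pushed toward shorter, more efficient solutions.
```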

6.2.4. Interaction Test-time Scaling

The paper investigates how the agent's performance scales with the number of interaction turns with the environment, which correlates with context length.

The detailed analysis on interaction scaling and simulated environments is shown below:

Figure 10: Detailed analysis on interaction scaling and simulated environments. (a) Accuracy on BrowseComp rises as context length and interaction turns increase; (b) reward in the simulated Wiki environment grows steadily over training steps.

  • Interaction Scaling (Figure 10a): As the context length and number of interactions grow, the model's performance on the BrowseComp dataset improves consistently. This demonstrates that for DeepResearch agents that rely on environmental interactions, scaling along the dimension of environment interactions (and thus context length) is crucial for performance gains, unlike conventional models that might scale by simply increasing output tokens.

6.2.5. Super-human Level Synthetic Data

To validate the effectiveness of the synthetic data, a statistical analysis of the SFT dataset was conducted.

  • Over 20% of the samples in the SFT dataset exceed 32k tokens and involve more than 10 tool invocations (a small sketch of how such corpus statistics can be computed follows this list).
  • This demonstrates the high complexity and richness of the synthetic data. This high-quality, cold-start data provides the model with a strong foundation for deep reasoning and research capabilities, serving as an excellent initialization for the RL phase. Automated data curation is leveraged during RL to make more effective use of this synthetic data.
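Such corpus statistics are straightforward to reproduce over a trajectory dataset. The sketch below assumes a hypothetical JSONL file with per-sample `num_tokens` and `tool_calls` fields; the field names and file path are assumptions for illustration, not the released data schema.

```python
import json

def long_horizon_fraction(path: str, min_tokens: int = 32_000, min_tool_calls: int = 10) -> float:
    """Fraction of SFT samples exceeding both a token and a tool-invocation threshold."""
    total = qualifying = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            sample = json.loads(line)
            total += 1
            if sample["num_tokens"] > min_tokens and sample["tool_calls"] > min_tool_calls:
                qualifying += 1
    return qualifying / total if total else 0.0

# e.g. long_horizon_fraction("sft_trajectories.jsonl") would return a value above 0.20
# for a corpus matching the statistic reported above.
```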

6.2.6. From Simulation to Reality

To rapidly validate the algorithm, a simulated Wiki environment mirroring real-world conditions was built.

  • The adapted GRPO algorithm was tested in this environment (the standard group-relative advantage computation it builds on is sketched after this list).
  • The resulting reward curve (shown in Figure 10b) closely matches the one observed in the real environment (Figure 8).
  • This Wiki simulation environment functions as a "wind tunnel laboratory," enabling fast algorithm iteration and significantly improving development efficiency.
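For reference, the group-relative advantage at the core of vanilla GRPO can be written in a few lines. The snippet shows only the standard formulation (z-scoring each rollout's reward within its sampled group); it is not the report's adapted variant, whose modifications are described in the training sections.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Vanilla GRPO-style advantages: z-score each rollout's reward within its group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Rewards for a group of rollouts sampled for the same question:
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # positive for successes, negative for failures
```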

6.2.7. Performance on General Benchmark

The paper also evaluates Tongyi DeepResearch on three general benchmarks: AIME25, HMMT25, and SimpleQA.

The performance on general benchmarks is shown below:

Figure 11: Performance on general benchmarks (AIME25, HMMT25, and SimpleQA). Tongyi DeepResearch scores 100 on both AIME25 and HMMT25 and 98.6 on SimpleQA.

  • Results: Tongyi DeepResearch achieves substantial improvements over the base model (which relies solely on reasoning without any tool use).

    • For AIME25 and HMMT25 (mathematical reasoning benchmarks), it scores 100% and 100% respectively, compared to the base model's 52% and 48%. This improvement is attributed to the Python Interpreter, which provides native computational support.
    • For SimpleQA (knowledge-intensive benchmark), it scores 98.6%, compared to the base model's 85%. This improvement is due to the ability to retrieve external information via search.
  • Implication: These results demonstrate that model training increasingly converges with agent training. Solving paradigms are evolving toward agentic architectures that integrate tool invocation and environment interaction, reflecting a more human-like problem-solving process.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Tongyi DeepResearch, an open-source deep research agent developed by Alibaba Group. This agent represents a significant step towards AI systems capable of autonomously transforming information into insight. Its core innovation lies in an end-to-end training paradigm that unifies agentic mid-training and agentic post-training. This framework, supported by automated data synthesis and stage-specific environments, enables the model to autonomously plan, search, reason, and synthesize information for complex, long-horizon research tasks.

Despite its parameter efficiency (30.5 billion total parameters with only 3.3 billion activated per token), Tongyi DeepResearch achieves state-of-the-art results across multiple deep research benchmarks, including Humanity's Last Exam, BrowseComp, GAIA, and FRAMES, often surpassing strong proprietary systems. The introduction of Heavy Mode further enhances performance through parallel exploration and integrative synthesis at test time. The work emphasizes the critical role of synthetic data and stable environmental interactions for effective agentic reinforcement learning. By open-sourcing the model and framework, Tongyi DeepResearch establishes a foundation for reproducible research into autonomous AI agents and contributes to the ongoing development of more general, self-improving intelligence.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Context Length: The current 128K context length is still insufficient for the most complex long-horizon tasks. Future work will explore extended context windows or more advanced context management mechanisms.
  • Model Scale: While the current model is efficient, a larger-scale model is currently in progress.
  • Report Generation Fidelity: Continuous improvement in report generation fidelity and optimization for user preferences is needed to ensure more faithful, useful, and preference-aligned outputs.
  • RL Efficiency: The efficiency of the reinforcement learning framework can be improved by exploring techniques such as partial rollouts, which will require addressing off-policy training challenges, including distributional shift.
  • Generalization Beyond Deep Research: The current Deep Research training focuses on specific prompt instructions and predefined tool sets. The plan is to enhance its robustness and extend the framework from Deep Research to broader agentic tool use scenarios.
  • Larger Models and Edge Deployment: The authors also emphasize the value of training agentic capabilities on relatively small models for efficiency on edge devices and broader accessibility, indicating a direction for practical deployment while acknowledging the concurrent development of larger models.

7.3. Personal Insights & Critique

This paper presents a highly compelling and systematic approach to developing deep research agents. The integration of agentic mid-training and agentic post-training is particularly insightful, addressing a critical challenge: how to effectively instill agentic biases into general LLMs before applying intensive RL. This progressive training strategy seems much more robust than trying to learn everything during a single RL phase.

The emphasis on fully automated, scalable data synthesis is another strong point. The ability to generate super-human level, high-uncertainty QA pairs and PhD-level research questions without human annotation is a game-changer for scaling agentic research. This not only reduces cost but also allows for controlled curriculum generation, which is critical for stable RL. The concept of a data flywheel where improving agents generate better training data is powerful for self-improving AI.

The detailed analysis of context length and response length during RL offers a fascinating insight into how models adapt to their constraints, particularly the observation that limited context length can implicitly force a model to find more efficient action sequences. This suggests that constrained environments can sometimes drive more intelligent behavior, a point worth exploring further in general AI research.

The Heavy Mode is an elegant solution for test-time scaling, effectively addressing the context window limitation by synthesizing compressed reports from parallel agents. This demonstrates a practical way to leverage additional compute for improved performance on complex tasks without redesigning the core model.

Potential Issues/Areas for Improvement:

  • Sim-to-Real Gap: While the paper acknowledges the sim-to-real gap, the heavy reliance on simulated environments for iteration, even with Wikipedia-based RAG tools, might still leave a significant challenge when deploying to truly open-ended, dynamic real-world web environments with their inherent noise, adversarial elements, and constantly changing information landscape. The unified sandbox for real-world interaction helps, but the fundamental challenge remains.

  • Interpretability of Model Merging: While model merging is effective for performance gains, the specific mechanisms by which weighted averaging of parameters from models with "different capability preferences" leads to "robust generalization abilities" could be explored in more depth. What are these capability preferences, and how do they interact?

  • Evaluation Metrics for Complex Reasoning: While quantitative metrics are crucial, evaluating deep research agents for tasks like "Humanity's Last Exam" might also benefit from qualitative assessments of the depth, novelty, and coherence of the generated reports, beyond just correctness.

  • Scalability of Heavy Mode: The Heavy Mode deploys $n$ parallel agents. While effective, the computational cost increases with $n$. Further work might explore dynamic scaling or intelligent pruning of parallel agents based on early indicators of solution quality to optimize resource usage.

    This paper provides a strong foundation for open-source agentic AI and its application to deep research. Its methodologies, particularly the integrated training pipeline and automated data synthesis, offer valuable insights for the broader agentic LLM community. The commitment to open-sourcing is commendable and will undoubtedly accelerate future research.
