Tongyi DeepResearch Technical Report
TL;DR Summary
The report presents Tongyi DeepResearch, an agentic large language model designed for long-horizon research tasks. It employs an end-to-end training framework combining agentic mid-training and agentic post-training to foster autonomous capabilities, achieving state-of-the-art performance across a range of agentic deep research benchmarks.
Abstract
We present Tongyi DeepResearch, an agentic large language model, which is specifically designed for long-horizon, deep information-seeking research tasks. To incentivize autonomous deep research agency, Tongyi DeepResearch is developed through an end-to-end training framework that combines agentic mid-training and agentic post-training, enabling scalable reasoning and information seeking across complex tasks. We design a highly scalable data synthesis pipeline that is fully automatic, without relying on costly human annotation, and empowers all training stages. By constructing customized environments for each stage, our system enables stable and consistent interactions throughout. Tongyi DeepResearch, featuring 30.5 billion total parameters, with only 3.3 billion activated per token, achieves state-of-the-art performance across a range of agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES and xbench-DeepSearch-2510. We open-source the model, framework, and complete solutions to empower the community.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Tongyi DeepResearch Technical Report
1.2. Authors
The paper lists "Tongyi DeepResearch Team" as the authors. The Project Leader is Yong Jiang, with Core Contributors and Contributors also listed, encompassing a large team from Tongyi Lab, Alibaba Group.
1.3. Journal/Conference
The paper is a technical report, published as a preprint on arXiv. While not a formal journal or conference publication, arXiv is a highly reputable platform for disseminating cutting-edge research in fields such as AI and computer science, enabling rapid sharing of results ahead of formal peer review. It is widely used in the academic community for showcasing new advances.
1.4. Publication Year
The paper was published on arXiv on 2025-10-28, making the publication year 2025.
1.5. Abstract
This technical report introduces Tongyi DeepResearch, an agentic large language model (LLM) specifically engineered for complex, long-horizon information-seeking research tasks. To foster autonomous deep research capabilities, the model is developed using an end-to-end training framework that integrates agentic mid-training and agentic post-training. This framework is designed to enable scalable reasoning and information seeking across diverse and intricate tasks. A key innovation is a highly scalable, fully automatic data synthesis pipeline that operates without costly human annotation and supports all training stages. The system constructs customized environments for each training stage to ensure stable and consistent interactions. Tongyi DeepResearch boasts 30.5 billion total parameters but activates only 3.3 billion parameters per token, demonstrating efficiency. It achieves state-of-the-art performance across various agentic deep research benchmarks, including Humanity's Last Exam, BrowseComp, BrowseComp-ZH, WebWalkerQA, xbench-DeepSearch, FRAMES, and xbench-DeepSearch-2510. The model, its training framework, and complete solutions are open-sourced to benefit the research community.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2510.24701
PDF Link: https://arxiv.org/pdf/2510.24701v2.pdf
Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the development of Deep Research agents – AI systems capable of autonomously conducting multi-step reasoning and information seeking on the internet for complex research tasks. This is a crucial step towards Artificial General Intelligence (AGI) and has the potential to significantly enhance human intellectual productivity.
This problem is highly important because traditional Large Language Models (LLMs), while powerful, often lack the agentic capabilities (e.g., planning, searching, reasoning, synthesizing knowledge over extended periods and diverse sources) required for truly autonomous, long-horizon research. Existing deep research systems are mostly closed-source, making their internal workings and research processes inaccessible to the broader community, hindering collaborative progress. There's a significant gap in publicly available, fully open-source models with robust deep research capabilities.
The paper's innovative idea and entry point is to open-source an agentic LLM named Tongyi DeepResearch, explicitly designed for long-horizon, deep information-seeking research tasks. It addresses the limitations of existing models by introducing a novel, end-to-end training framework that combines agentic mid-training and agentic post-training, coupled with a fully automatic, scalable data synthesis pipeline and customized environmental interactions. This approach aims to equip LLMs with practical and open autonomous research capabilities.
2.2. Main Contributions / Findings
The primary contributions of Tongyi DeepResearch are:
- Novel End-to-End Agentic Training Paradigm: Introduction of a unified agentic mid-training and agentic post-training framework. Agentic mid-training cultivates inherent agentic biases by exposing the model to large-scale agentic data, bridging the gap between pre-training and post-training. Agentic post-training further refines capabilities through scalable multi-turn reinforcement learning (RL). This paradigm enables gradual development from basic interaction skills to advanced autonomous research behaviors.
- Fully Automated, Scalable Data Synthesis Pipeline: Design of a pipeline that eliminates the need for human annotation to generate diverse, high-quality agent trajectories. This pipeline creates research-level questions, agentic behavior data (planning, reasoning, decision-making actions), and function-calling data, tailored for each training phase. It enables the creation of "super-human-level" datasets and fosters a data flywheel effect.
- Stage-Specific, Customized Environments: Construction of robust environments that provide consistent interactions for data synthesis and training. These environments range from the Prior World Environment (for pre-trained knowledge mining) to the Simulated Environment (for controlled, low-cost iteration) and the Real-world Environment (for authentic feedback), adapting to the developmental stage of the agent.
- State-of-the-Art Performance with Efficiency: Tongyi DeepResearch, built on the Qwen3-30B-A3B-Base model, features 30.5 billion total parameters but activates only 3.3 billion per token. Despite its parameter efficiency, it achieves state-of-the-art performance across a suite of agentic deep research benchmarks, outperforming strong baselines like OpenAI-o3 and DeepSeek-V3.1.
- Open-Sourcing: The model, training framework, and complete solutions are open-sourced, aiming to democratize access to advanced AI research agents and accelerate community progress.
- Heavy Mode for Enhanced Performance: Introduction of a Heavy Mode that leverages test-time scaling through parallel research and integrative synthesis. This mode deploys multiple agents to explore diverse solution paths and then uses a synthesis model to consolidate findings, achieving further state-of-the-art results on challenging benchmarks.

The key findings include the effectiveness of this integrated training and data generation approach in creating capable and efficient deep research agents. The systematic analysis covers agentic reinforcement learning and synthetic data, providing insights into the development of such agents. The paper also demonstrates that agentic models represent a significant future trend, capable of internalizing agent-like capabilities and autonomously invoking tools to solve a wide range of problems.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several core concepts in Large Language Models (LLMs) and Reinforcement Learning (RL).
- Large Language Models (LLMs): These are neural networks with billions of parameters, pre-trained on vast amounts of text data to understand and generate human-like text. They learn complex patterns in language, enabling tasks like translation, summarization, and question answering. In this paper, LLMs serve as the "brain" of the deep research agent.
- Agentic LLM: An LLM specifically designed to act as an agent, meaning it can perceive its environment, make decisions, take actions, and receive feedback to achieve a goal. This involves capabilities beyond text generation, such as tool use, planning, and memory management.
- Long-horizon Tasks: These are complex tasks that require many steps, potentially spanning a long duration, and often involve multiple interactions with an environment or various tools. Deep research tasks fall into this category, as they might involve searching many web pages, synthesizing information, and performing multiple reasoning steps.
- Agentic Capabilities/Agency: Refers to an agent's ability to operate autonomously, including:
  - Planning: Decomposing a complex task into smaller, manageable steps.
  - Searching/Information Seeking: Actively querying external knowledge sources (like the internet) to find relevant information.
  - Reasoning: Drawing logical conclusions, inferring new information from existing data, and connecting disparate pieces of knowledge.
  - Synthesizing Knowledge: Combining information from various sources to form a coherent understanding or generate a comprehensive report.
- Tool Use: The ability of an LLM to invoke external tools (e.g., search engines, code interpreters, web browsers) to perform actions that it cannot do itself.
- Pre-training: The initial phase of training for an LLM, where it learns general language understanding and generation by processing massive text datasets, typically predicting the next word or masked words.
- Fine-tuning / Post-training: After pre-training, an LLM is further trained on a smaller, task-specific dataset to adapt its capabilities to particular applications, such as instruction following or, in this case, agentic behaviors.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by performing actions in an environment to maximize a cumulative reward. The agent receives a reward signal for its actions and learns a policy (a strategy) that maps states to actions.
  - Policy ($\pi$): The agent's strategy, defining how it chooses actions given a state or observation.
  - Reward: A scalar feedback signal from the environment indicating the desirability of an agent's action. In this paper, a 0 or 1 reward signal is used for answer correctness.
  - Trajectory / Rollout: A sequence of states, actions, and rewards generated by an agent interacting with an environment over time.
  - On-policy RL: An RL algorithm where the policy being learned is the same as the policy used to generate the trajectories (data).
- Supervised Fine-tuning (SFT): A common fine-tuning technique where an LLM is trained on labeled (input, output) pairs to learn specific behaviors, often used as a "cold start" for RL.
- Context Window: The maximum number of tokens (words or sub-word units) an LLM can process and attend to at any given time. Longer context windows allow models to handle more information and longer conversations or documents.
- Next-Token Prediction: The primary objective function during pre-training and sometimes fine-tuning of LLMs, where the model predicts the next token in a sequence given the preceding tokens.
- Inductive Bias: In machine learning, this refers to the set of assumptions a learning algorithm uses to make predictions on unseen data. For agentic LLMs, an agentic inductive bias means the model is predisposed to learn and exhibit agent-like behaviors such as planning and tool use.
3.2. Previous Works
The paper references several key prior works and concepts that inform its approach:
- ReAct (Yao et al., 2023): This framework (Reasoning and Acting) synergizes reasoning and acting in language models. An agent generates both a reasoning trace (Thought) and a subsequent Action in an interleaved manner. This creates a trajectory of thought-action-observation triplets. Tongyi DeepResearch is fundamentally based on this ReAct architecture due to its simplicity and alignment with scalable computation principles. A minimal sketch of such a loop is given after this list.
  - The ReAct paradigm follows a sequence:
    - Thought ($\tau_t$): The agent's internal reasoning process, analyzing the current context, recalling memory, planning, and self-reflecting.
    - Action ($a_t$): An external operation executed by the agent, often involving tool use (e.g., Search, Visit, Python Interpreter).
    - Observation ($o_t$): Feedback received from the environment after an action, used to update the agent's internal state.
  - The trajectory is defined as: $ \mathcal{H}_T = ( \tau_0, a_0, o_0, \ldots, \tau_i, a_i, o_i, \ldots, \tau_T, a_T ) $. Here, $a_T$ is the final answer, and $a_t$ for $t < T$ are intermediate tool calls. The policy generates the current thought and action based on the history: $ \tau_t, a_t \sim \pi ( \cdot \mid \mathcal{H}_{t-1} ) $.
- The Context Management Paradigm (Qiao et al., 2025): This addresses the limitation of finite context windows in LLMs for long-horizon tasks. Instead of conditioning on the complete history, the agent is conditioned on a strategically reconstructed workspace containing only essential elements: the question $q$, an evolving report $S_t$ (as compressed memory), and the immediate context from the last interaction ($a_t$ and $o_t$). This Markovian structure helps maintain reasoning capacity across deep explorations.
  - The core update process is formalized as: $ S_t, \tau_{t+1}, a_{t+1} \sim \pi ( \cdot \mid S_{t-1}, a_t, o_t ) $.
  - This is crucial because the report $S_t$ serves as a condensed memory, preventing context overflow and enforcing structured reasoning by requiring the agent to synthesize and prioritize information.
- The Bitter Lesson (Sutton, 2019): This influential principle in AI suggests that general methods leveraging scalable computation ultimately outperform approaches relying on complex, human-engineered knowledge and intricate designs. The authors explicitly cite this lesson to justify their choice of ReAct and their focus on scalable training paradigms over complex, specialized prompt engineering.
- Agentic Continual Pre-training (Agentic CPT) (Su et al., 2025): A two-stage process mentioned as the core mid-training phase in Tongyi DeepResearch. It aims to provide a base model with a strong inductive bias for agentic behavior while preserving broad linguistic competence. It uses a next-token prediction loss and progressively expands context length.
- rLLM framework (Tan et al., 2025): A framework for post-training language agents, used by Tongyi DeepResearch to implement its on-policy asynchronous rollout framework for RL training.
- GRPO (Shao et al., 2024): A reinforcement learning algorithm, Group Relative Policy Optimization, which serves as the foundation for the RL training algorithm in Tongyi DeepResearch. The paper adapts GRPO to its needs.
- DAPO (Yu et al., 2025): An open-source LLM reinforcement learning system that influences the application of token-level policy gradient loss and a clip-higher strategy in the RL training objective to encourage exploration.
- Qwen3-30B-A3B-Base (Yang et al., 2025): The pre-trained base model from which Tongyi DeepResearch is initialized. This signifies that the work builds upon existing powerful LLM architectures.
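To make the ReAct and context-management formulations above concrete, here is a minimal, illustrative Python sketch of a single-agent rollout loop. It is not the authors' implementation: the `llm`, `call_tool`, and prompt format are hypothetical placeholders, and the context-management branch simply swaps the full history for a compressed workspace as described by Qiao et al. (2025).

```python
# Minimal ReAct-style rollout sketch (illustrative only; not the paper's code).
# Assumes a hypothetical `llm(prompt) -> str` completion function and a
# hypothetical `call_tool(action) -> str` executor for Search/Visit/etc.

def react_rollout(question, llm, call_tool, max_steps=128):
    """Plain ReAct: condition each step on the full history H_{t-1}."""
    history = [f"Question: {question}"]
    for _ in range(max_steps):
        # The model emits an interleaved thought and action, e.g.
        # "Thought: ...\nAction: Search[...]" or "Action: Finish[answer]".
        step = llm("\n".join(history) + "\nThought:")
        history.append(f"Thought:{step}")
        action = step.split("Action:")[-1].strip()
        if action.startswith("Finish"):           # a_T: final in-depth answer
            return action[len("Finish["):-1]
        observation = call_tool(action)            # o_t from the environment
        history.append(f"Observation: {observation}")
    return None  # budget exhausted


def context_managed_rollout(question, llm, call_tool, max_steps=128):
    """Markovian variant: condition on (report S_{t-1}, last action, last observation)."""
    report, last_action, last_obs = "", "", ""
    for _ in range(max_steps):
        workspace = (
            f"Question: {question}\nReport so far: {report}\n"
            f"Last action: {last_action}\nLast observation: {last_obs}\nThought:"
        )
        step = llm(workspace)
        # The model is asked to emit an updated report S_t plus the next action.
        report = step.split("Report:")[-1].split("Action:")[0].strip()
        action = step.split("Action:")[-1].strip()
        if action.startswith("Finish"):
            return action[len("Finish["):-1]
        last_action, last_obs = action, call_tool(action)
    return None
```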
3.3. Technological Evolution
The field of Large Language Models has rapidly evolved from models primarily focused on text generation and understanding (e.g., initial GPT models) to increasingly agentic systems. Early LLMs were typically pre-trained on vast text corpora and then fine-tuned for specific tasks. The major shifts leading to the current work include:
- Instruction Following: LLMs evolved to understand and follow complex instructions (instruction fine-tuning), making them more useful for diverse tasks.
- Tool Use Integration: The realization that LLMs alone are limited (e.g., they cannot perform precise calculations or access real-time information) led to methods allowing them to invoke external tools (e.g., search engines, code interpreters). Frameworks like ReAct became prominent here.
- Agentic Training Paradigms: Moving beyond simple instruction fine-tuning to training LLMs specifically for multi-step decision-making and environmental interaction, often leveraging reinforcement learning.
- Long-Context Models: The development of models that can handle increasingly longer input sequences, critical for long-horizon tasks like deep research.
- Synthetic Data Generation: The increasing sophistication of LLMs themselves has enabled them to generate high-quality training data, reducing reliance on costly human annotation and allowing for scalable data creation.

Tongyi DeepResearch fits within this timeline by pushing the boundaries of agentic LLMs for deep research. It specifically addresses the need for open-source, capable agents by integrating advanced agentic training (mid-training and post-training with RL), highly scalable synthetic data generation, and robust environmental interaction strategies. It combines existing powerful LLM backbones with novel training methodologies to achieve state-of-the-art agentic performance.
3.4. Differentiation Analysis
Compared to main methods in related work, Tongyi DeepResearch presents several core differences and innovations:
- End-to-End Integrated Training Framework: While many existing works focus on post-training for DeepResearch agents, Tongyi DeepResearch introduces a novel, integrated end-to-end training framework that unifies agentic mid-training and agentic post-training.
  - Mid-training Innovation: The explicit agentic mid-training phase is a key differentiator. It is designed to instill agentic inductive biases early by exposing the model to large-scale agentic data before the intensive RL post-training. This bridges the gap between general pre-training and specific agentic post-training, addressing optimization conflicts and leading to a stronger agentic foundation model. Most general foundation models lack this specific agentic prior knowledge.
- Fully Automated and Scalable Data Synthesis: Many agentic systems or LLM fine-tuning efforts still rely on human-annotated data, which is expensive and unscalable for research-level problems. Tongyi DeepResearch emphasizes a fully automated, highly scalable data synthesis pipeline that generates diverse, high-quality agent trajectories without human intervention. This includes:
  - Synthesizing research-level questions efficiently using LLMs.
  - Generating planning, reasoning, and decision-making actions.
  - Creating function-calling data via environment scaling.
  - Focusing on high-quality, high-uncertainty, super-human-level QA pairs for post-training, including PhD-level research questions.
- Strategic Environmental Interaction: The paper explicitly models and leverages three forms of environments (Prior World, Simulated, Real-world) and adapts synthetic data generation and training strategies accordingly. This structured approach to environmental interaction (especially using simulated environments for rapid iteration and real-world sandboxes for stability) is more systematic than simply interacting with the real world or using offline datasets.
- Efficiency at Scale: Tongyi DeepResearch achieves state-of-the-art performance with significantly fewer activated parameters (3.3 billion per token from a 30.5-billion-total-parameter model) compared to many proprietary systems. This emphasizes efficiency and scalability for deployment.
- Open-Source Commitment: Unlike many leading deep research systems that remain closed-source (e.g., OpenAI DeepResearch, Gemini DeepResearch), Tongyi DeepResearch is fully open-sourced. This fosters transparency, reproducibility, and collaborative research.

In essence, the innovation lies in the holistic, end-to-end framework that strategically integrates mid-training for agentic bias, automated synthetic data for scalability, and adaptive environmental interaction for stable and efficient RL, all while being open-source and parameter-efficient.
4. Methodology
The methodology section details the Tongyi DeepResearch system, outlining its formulation, overall training recipe, and the specifics of agentic mid-training and agentic post-training.
4.1. Principles
The core idea behind Tongyi DeepResearch is to endow Large Language Models (LLMs) with autonomous research capabilities by treating them as agents that can plan, search, reason, and synthesize knowledge across extended sequences of actions and diverse information sources. This is achieved through a novel end-to-end training framework that balances the cultivation of agentic biases and the refinement of deep research capabilities.
The theoretical basis and intuition are rooted in:
- Sequential Decision-Making: Framing deep research as a sequence of thoughts, actions, and observations, similar to how humans conduct research or how an agent interacts with an environment in reinforcement learning. The ReAct paradigm is fundamental here, combining verbalized reasoning with tool-based actions.
- Scalability through Data Synthesis: Recognizing the inherent difficulty and cost of obtaining human-annotated data for complex research tasks. The intuition is that LLMs themselves, when properly guided, can generate high-quality, diverse, and complex agent trajectories and research questions at scale, leading to a data flywheel effect where an improving agent generates better training data.
- Controlled Environmental Interaction: Acknowledging that real-world environments are noisy, costly, and non-stationary. The principle is to strategically leverage different types of environments (Prior World, Simulated, Real-world) based on the training stage's needs, optimizing for stability, cost, and fidelity. Simulated environments act as a "wind tunnel" for rapid algorithm iteration.
- Progressive Capability Building: The two-stage training pipeline (mid-training then post-training) reflects the idea that agentic capabilities should be built progressively. Mid-training establishes a strong agentic inductive bias (general agentic knowledge), while post-training (via SFT and RL) refines these into robust deep research capabilities for specific complex tasks. This addresses the challenge of directly training agentic behaviors on general LLMs, which lack the necessary foundational bias.
- Context Efficiency: For long-horizon tasks, managing the context window is critical. The Context Management Paradigm is based on the idea that an agent can maintain coherent reasoning by dynamically summarizing and prioritizing information, mimicking human researchers who periodically synthesize their findings.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Formulation
Tongyi DeepResearch's interaction with the environment at each timestep is defined by three fundamental components:
- Thought ($\tau_t$): This represents the agent's internal cognitive process. It involves analyzing the current context, retrieving relevant information from its memory, planning the next steps, and engaging in self-reflection to adapt its strategy. It is the verbalization of the agent's reasoning.
- Action ($a_t$): This is an external operation performed by the agent to interact with its environment. Tongyi DeepResearch is equipped with a set of versatile tools that define its action space and allow it to interact with various information sources. The available tools are:
  - Search: For performing Google web searches.
  - Visit: For accessing and summarizing content from web pages.
  - Python Interpreter: For executing Python code in a sandboxed environment.
  - Google Scholar: For retrieving information from academic publications.
  - File Parser: For parsing user-uploaded local files (PDF, DOCX, etc.).
  Actions include all intermediate tool calls ($a_t$ where $t < T$) and the final response to the user, which is an in-depth report ($a_T$).
- Observation ($o_t$): This is the feedback received from the environment immediately after an action is performed. This new information is then used to update the agent's internal state and guide its subsequent thought and action.

Based on these components, two different rollout types are defined:

- ReAct: The architecture is fundamentally based on the ReAct framework. In this paradigm, the agent generates a reasoning trace (Thought) and a subsequent Action in an interleaved manner. This process forms a trajectory, $\mathcal{H}_T$, which is a sequence of thought-action-observation triplets: $ \mathcal{H}_T = ( \tau_0, a_0, o_0, \ldots, \tau_i, a_i, o_i, \ldots, \tau_T, a_T ) $. Here, $a_T$ denotes the final answer to the given task. At any given step $t$, the agent's policy $\pi$ generates the current thought $\tau_t$ and action $a_t$ conditioned on the entire history of previous interactions, $\mathcal{H}_{t-1}$: $ \tau_t, a_t \sim \pi ( \cdot \mid \mathcal{H}_{t-1} ) $. The choice of ReAct is deliberate, emphasizing its simplicity and alignment with The Bitter Lesson principle, which favors general, scalable computational methods over complex, human-engineered ones.
- Context Management: To address the finite context window constraint in long-horizon tasks, a dynamic context management mechanism based on Markovian state reconstruction is employed. Instead of being conditioned on the complete history, the agent is conditioned on a strategically reconstructed workspace at each step $t$. This workspace contains only essential elements: the question $q$, an evolving report $S_t$ serving as compressed memory, and the immediate context from the last interaction ($a_t$ and $o_t$). This Markovian structure allows the agent to maintain consistent reasoning capacity across arbitrary exploration depths and naturally circumvents context degradation. For every step $t$, this core update process can be formalized as: $ S_t, \tau_{t+1}, a_{t+1} \sim \pi ( \cdot \mid S_{t-1}, a_t, o_t ) $. This context management paradigm is crucial as it prevents context overflow and enforces structured reasoning by requiring the agent to explicitly synthesize and prioritize information at each step, aligning with human research patterns.
4.2.2. Overall Training Recipe
The Tongyi DeepResearch system is initialized from the pre-trained base model Qwen3-30B-A3B-Base. The development proceeds through an end-to-end training framework that integrates agentic mid-training and agentic post-training. This framework is designed to enable scalable reasoning and information seeking across complex research tasks, establishing a new paradigm for training agentic models.
The overall training pipeline is visualized below:

The figure is a diagram of the Tongyi DeepResearch training pipeline. It shows three main phases: pre-training, mid-training, and post-training, with the stages labeled Agentic CPT Stage 1 (32K), Agentic CPT Stage 2 (128K), Agentic SFT, and Agentic RL.
Figure 2: Training pipeline of Tongyi DeepResearch.
The pipeline begins with Pre-training (e.g., Qwen3-30B-A3B-Base). This is followed by two main agentic training stages:
- Agentic Mid-training: This phase aims to instill agentic inductive biases into the base model. It consists of two stages:
  - Agentic CPT Stage 1 (32K context)
  - Agentic CPT Stage 2 (128K context)
- Agentic Post-training: This phase refines the agentic capabilities through more targeted training. It also consists of two stages:
  - Agentic SFT (Supervised Fine-tuning)
  - Agentic RL (Reinforcement Learning)
4.2.3. Agentic Mid-training
The mid-training phase is designed to bridge the gap between generally pre-trained models and the specific requirements of agentic post-training. Its primary objective is to provide a base model with a strong inductive bias for agentic behavior while preserving broad linguistic competence.
4.2.3.1. Training Configuration
Tongyi DeepResearch employs a two-stage Agentic Continual Pre-training (Agentic CPT) as its core mid-training phase. The optimization process uses the standard Next-Token Prediction loss function.
- Stage 1: Initiates with a 32K context length.
- Stage 2: Expands to a 128K context length. A substantial corpus of long-sequence (64K-128K) agentic behavior data is introduced in this stage to enhance the model's capacity for coherent, long-horizon reasoning and action.

Throughout both stages, a small proportion of general pre-training data is interleaved to ensure the model acquires specialized agentic competence without sacrificing its foundational generalization capabilities.
4.2.3.2. Large-scale Agent Behavior Data Synthesis
In Agentic CPT, data is synthesized across the complete lifecycle of agent workflows. A typical agent workflow involves starting with a problem, iteratively cycling through reflection and action, and ultimately converging on a final solution. To comprehensively capture this, data is synthesized for critical steps: Question Synthesis, Planning Action, Reasoning Action, and Decision-Making Action. Decision-making is explicitly modeled as a distinct action type.
The process of large-scale agent behavior data synthesis is illustrated below:

The figure is a schematic of the task planning and decision-making workflow, showing the steps from task to answer, including question synthesis, planning, decision-making, and reasoning, and highlighting the candidate paths and hidden processes involved in decision-making.
Figure 3: Large-scale agent behavior data synthesis for agentic continual pre-training.
The workflow begins with an Open World Memory (continuously updated knowledge). This memory is used for Question Synthesis, feeding into Planning, which then flows into Decision Making and Reasoning steps, ultimately leading to the Answer. This entire process is used to generate Agent Behavior Data for Agentic Continual Pre-training.
- Large-scale Multi-style Question Synthesis:
  - An entity-anchored open-world memory is constructed, consolidating diverse real-world knowledge (web-crawled data, agent interaction trajectories) into structured representations of entities and their associated knowledge.
  - Entities and related knowledge are sampled to generate diverse questions that embed specific behavioral pattern requirements, such as multi-hop reasoning questions and numerical computation questions.
- Planning Action:
  - Planning involves problem decomposition and first-step action prediction.
  - Open-source models are used to analyze, decompose, and predict initial actions for the synthesized questions.
  - Rejection sampling based on the entities and associated knowledge from question construction ensures high-quality planning outputs.
- Reasoning Action:
  - Focuses on logical reasoning and knowledge integration from heterogeneous data, especially when external tools return massive unstructured responses.
  - Large models are guided through a two-stage process to generate complete reasoning chains given a question and its dependent knowledge.
  - A dual filtering mechanism based on reasoning length and answer consistency ensures quality.
- Decision-Making Action:
  - Each step of an agent's thinking and action is essentially an implicit decision-making process, in which the agent selects the most promising solution from multiple potential reasoning and action paths.
  - This process is explicitly modeled: existing demonstration trajectories are used to explore the feasible action space at each step.
  - Original trajectories are reconstructed into multi-step decision sequences while preserving the original decision choices.
- General Function-calling Data Synthesis via Environment Scaling:
  - To enhance the model's general agentic capability, function-calling data is systematically scaled through environment scaling. The principle is that the breadth of function-calling competence is tied to the diversity of environments.
  - A scalable framework is designed to automatically construct heterogeneous, fully simulated environments, effectively broadening the space of function-calling scenarios.
  - The generated data is incorporated into the mid-training phase.
4.2.4. Agentic Post-training
The post-training pipeline comprises three stages: data synthesis, supervised fine-tuning for cold start, and agentic reinforcement learning.
4.2.4.1. High-quality Data Synthesis
An end-to-end solution for synthetic data generation is developed to create complex, high-uncertainty, and super-human level question and answer pairs. This fully automated process aims to push the boundaries of agent performance without human intervention.
The high-quality data synthesis pipeline is depicted below:

The figure illustrates the three stages of the pipeline: (1) graph construction, (2) subgraph sampling, and (3) uncertainty injection, which together drive efficient information acquisition and processing.
Figure 4: High-quality data synthesis pipeline.
The pipeline consists of three main stages:

- Graph Construction: Builds a highly interconnected knowledge graph via random walks, leveraging web search to acquire relevant knowledge and isomorphic tables from real-world websites for a realistic information structure.
- Subgraph Sampling: Samples subgraphs and subtables from the constructed knowledge graph to generate initial questions and answers.
- Uncertainty Injection: Strategically increases the uncertainty within the question to enhance its difficulty. This is grounded in a theoretical framework that models QA difficulty as atomic operations on entity relationships (e.g., merging entities with similar attributes), allowing systematic complexity increases. Formal modeling of the information-seeking problem based on set theory further enables controllable scaling of difficulty and structure, minimizes reasoning shortcuts and structural redundancy, and allows for efficient verification of QA correctness.

Additionally, an automated data engine generates PhD-level research questions. It starts with a multi-disciplinary knowledge base to create seed QA pairs requiring multi-source reasoning. These seeds undergo iterative complexity upgrades, where a question-crafting agent progressively expands scope and abstraction, refining and compounding prior outputs.
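As a concrete illustration of the graph-construction and subgraph-sampling stages, the following toy sketch builds a small knowledge graph, samples a multi-hop path, and "injects uncertainty" by blurring the anchor entity into a vague description. All entities, relations, and the blurring rule are invented for this example; the paper's actual pipeline operates on web-scale data with a formal set-theoretic difficulty model.

```python
import random
import networkx as nx  # assumed available; a plain adjacency-list dict would also work

# 1) Graph construction: a tiny hand-made knowledge graph (stand-in for
#    web-search-driven random-walk construction).
G = nx.DiGraph()
G.add_edge("Marie Curie", "University of Paris", relation="worked_at")
G.add_edge("University of Paris", "France", relation="located_in")
G.add_edge("Marie Curie", "Nobel Prize in Physics", relation="won")

# 2) Subgraph sampling: take a short random walk to obtain a multi-hop chain.
def sample_path(graph, start, hops=2, seed=0):
    rng, path = random.Random(seed), [start]
    for _ in range(hops):
        nbrs = list(graph.successors(path[-1]))
        if not nbrs:
            break
        path.append(rng.choice(nbrs))
    return path

# 3) Uncertainty injection: hide the anchor entity behind a vaguer paraphrase
#    so the question requires extra disambiguation work.
def make_question(graph, path):
    relations = [graph.edges[path[i], path[i + 1]]["relation"]
                 for i in range(len(path) - 1)]
    blurred = "a two-time Nobel laureate who pioneered radioactivity research"
    question = (f"Starting from {blurred}, follow the relations "
                f"{' -> '.join(relations)}. Which entity do you reach?")
    return question, path[-1]

if __name__ == "__main__":
    path = sample_path(G, "Marie Curie", hops=2)
    question, answer = make_question(G, path)
    print(question, "->", answer)
```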
4.2.4.2. Supervised Fine-tuning for Cold Start
The initial phase of agentic post-training is a supervised fine-tuning (SFT) stage. Its purpose is to equip the base model with a robust initial policy before reinforcement learning.

- Data Source: Synthesized high-quality QA data is used to obtain training trajectories. These trajectories cover the complete thought process and tool responses generated by high-performing open-source models.
- Filtering: A rigorous rejection sampling protocol is applied to ensure that only high-quality trajectories exhibiting diverse problem-solving patterns are retained.
- Mixed Training Paradigm: The SFT phase leverages data from two different formulations to enhance model robustness and generalization:
  - ReAct Mode: Training samples take the historical state as input and output the corresponding thought and tool call for the current step.
  - Context Management Mode: Training samples take as input the previous step's trajectory summary, tool call, and tool response. They output the current step's trajectory summary, thought, and tool call. This mode specifically strengthens the agent's capabilities in state analysis and strategic decision-making, requiring the model to synthesize complex observations into coherent summaries.
- Two-stage Training Strategy based on Context Length:
  - Stage 1: The context length is set to 40K. Training data includes ReAct Mode samples with context lengths shorter than 40K, along with all Context Management Mode samples (as they are all within 40K).
  - Stage 2: The context length is extended to 128K. Training data includes ReAct Mode samples with context lengths between 40K and 128K, plus a small portion of 40K data for stability.
4.2.4.3. Agentic Reinforcement Learning
To advance the model's capabilities in robust and reliable planning and searching in complex web environments, an agentic RL framework is applied.
An overview of the agentic reinforcement learning framework:

The figure is a schematic of the framework, comprising an asynchronous rollout service, rollout workers, trajectory collection, and a reward service. It also shows the combination of simulated and real-world environments, along with the corresponding action and observation flows.
Figure 5: An overview of our agentic reinforcement learning framework.
The agentic RL framework involves the policy model interacting with the environment (either Simulated Environment or Real-world Environment). This interaction generates trajectories (rollouts) and rewards. These are collected and processed by a Trajectory Collection component, which then feeds into the RL Training module. RL Training updates the policy model, and this cycle iterates. An Async Rollout Service and Rollout Workers facilitate parallel interactions.
- Real-world Environment: The agent's toolkit integrates several specialized tools: Search, Visit, Python Interpreter, Google Scholar, and File Parser. To ensure reliability in training and evaluation, a unified sandbox is developed.
  - This sandbox orchestrates every tool call through a central scheduling and management layer.
  - For each tool, robust concurrency controls and fault-tolerance mechanisms are implemented (e.g., QPS rate constraints, caching, timeout-and-retry, graceful degradation, failover to backups).
  - This design abstracts tool invocation into a deterministic and stable interface, insulating the training loop from real-world stochasticity and reducing operational costs.
- Simulated Environment: Direct use of real-world web environment APIs presents numerous practical problems (e.g., instability, cost).
  - An offline environment is built based on the 2024 Wikipedia database.
  - A suite of local RAG tools simulates the web environment.
  - The data synthesis pipeline is reused to create high-quality, structurally complex QA specifically for this offline environment.
  - This provides a low-cost, high-efficiency, fully controllable platform for rapid experimentation, accelerating development.
- On-Policy Asynchronous Rollout Framework: The iterative nature of agentic rollouts (requiring numerous environment interactions) can be a bottleneck.
  - A custom, step-level asynchronous RL training loop is implemented, built on the rLLM framework.
  - It uses two separate asynchronous online servers: one for model inference and another for tool invocation.
  - A centralized interaction handler processes outputs from both, formatting feedback into a unified message list.
  - This architecture allows multiple agent instances to interact with the environment in parallel, completing rollouts independently.
- RL Training Algorithm: The RL algorithm is a tailored adaptation of GRPO (Group Relative Policy Optimization). It operates under a strict on-policy regimen, meaning trajectories are consistently sampled using the most up-to-date policy. The reward is a pure 0 or 1 signal indicating answer correctness, with no separate format reward. The training objective is:
  $ \mathcal{J}(\theta) = \mathbb{E}_{(q, y) \sim \mathcal{D},\ \{\mathcal{H}^i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\cdot \mid \mathrm{context})} \left[ \frac{1}{\sum_{i=1}^G |\mathcal{H}^i|} \sum_{i=1}^G \sum_{j=1}^{|\mathcal{H}^i|} \min\left( r_{i,j}(\theta) \hat{A}_{i,j},\ \mathrm{clip}\left( r_{i,j}(\theta), 1-\varepsilon_{\mathrm{low}}, 1+\varepsilon_{\mathrm{high}} \right) \hat{A}_{i,j} \right) \right] $
  Where:
  - $\theta$: The parameters of the current policy.
  - $(q, y)$: A question-answer pair sampled from the dataset $\mathcal{D}$.
  - $\{\mathcal{H}^i\}_{i=1}^G$: A set of $G$ trajectories (rollouts) sampled using the old policy $\pi_{\theta_{\mathrm{old}}}$.
  - $r_{i,j}(\theta)$: The importance sampling ratio for the $j$-th token of the $i$-th trajectory. For strictly on-policy training, it remains 1.0. It is defined as $ r_{i,j}(\theta) = \frac{ \pi_{\theta}(\mathcal{H}^{i,j} \mid \mathrm{context}) }{ \pi_{\theta_{\mathrm{old}}}(\mathcal{H}^{i,j} \mid \mathrm{context}) } $, where $\pi_{\theta}(\mathcal{H}^{i,j} \mid \mathrm{context})$ is the probability of the $j$-th token in trajectory $i$ under the new policy $\pi_{\theta}$, given the context up to that point, and similarly for the old policy.
  - $\hat{A}_{i,j}$: An estimator of the advantage at token $j$ of the $i$-th trajectory, computed as the trajectory reward minus the mean reward of all trajectories in the current group: $ \hat{A}_{i,j} = R_i - \mathrm{mean}( \{R_i\}_{i=1}^G ) $, where $R_i$ is the episode reward (0 or 1 for correctness) of trajectory $i$.
  - $\mathrm{clip}(\cdot)$: A clipping function applied to the importance sampling ratio to constrain policy updates, preventing excessively large steps.

  Following DAPO, a token-level policy gradient loss is applied, and a clip-higher strategy is used to encourage more exploration. To reduce variance in advantage estimation, a leave-one-out strategy is adopted. Additionally, to improve training stability and prevent policy collapse, certain negative samples are selectively excluded from the loss calculation. The paper notes that these modifications prioritize pragmatic stability and efficiency over algorithmic novelty.
- Automatic Data Curation: To generalize to out-of-distribution scenarios through self-exploration, data is optimized in real time, guided by training dynamics.
  - A fully automated data filtering pipeline dynamically adjusts the training set based on the improved policy model.
  - Training starts with a large dataset $\mathcal{D}$. An initial SFT model samples multiple rollouts for each problem.
  - An initial training set is created by filtering out problems where the model always fails or always succeeds (as they offer no learning signal). This leaves problems of moderate difficulty.
  - During RL training, problems in the active set are continuously monitored.
  - A separate background process uses intermediate checkpoints of the policy model to sample from the entire original dataset $\mathcal{D}$, identifying new moderately difficult problems for a backup pool.
  - When training reaches a certain step count or the reward plateaus, the active training set is refreshed by removing mastered problems and incorporating new, challenging ones from the backup pool.
  - This pipeline runs independently, never interrupting the main RL training loop, ensuring high training efficiency and stability.
4.2.4.4. Model Merging
At the last stage of the pipeline, model merging is employed. This approach is based on the insight that parameters of different model variants derived from the same pre-trained model can be effectively combined.
- Process: Several model variants originating from the same base model but exhibiting different capability preferences are selected.
- Weighted Average: The final merged model is created by computing a weighted average of their parameters: $ \theta_{\mathrm{merged}} = \sum_k \alpha_k \cdot \theta^{(k)}, \quad \mathrm{s.t.} \sum_k \alpha_k = 1,\ \alpha_k \geq 0 $ Where:
  - $\theta^{(k)}$: Represents the parameters of the $k$-th model variant.
  - $\alpha_k$: Is its corresponding merge weight.
- Benefits: This interpolation strategy preserves the core strengths of each contributing model and equips the merged model with robust generalization abilities. It performs comparably to the best source model in its respective area of strength without incurring additional optimization costs. A minimal sketch of such parameter interpolation follows below.
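As a minimal illustration of this weighted parameter averaging, the sketch below merges PyTorch state dicts. It assumes the variants share an identical architecture (same parameter names and shapes); the checkpoint paths and weights in the usage comment are placeholders, not the values used for the released model.

```python
import torch

def merge_state_dicts(state_dicts, weights):
    """Weighted average of model parameters: theta_merged = sum_k alpha_k * theta_k."""
    assert len(state_dicts) == len(weights) and abs(sum(weights) - 1.0) < 1e-6
    merged = {}
    for name in state_dicts[0]:
        merged[name] = sum(
            w * sd[name].to(torch.float32) for w, sd in zip(weights, state_dicts)
        )
    return merged

# Usage sketch (paths and weights are hypothetical):
# variants = [torch.load(p, map_location="cpu") for p in ["variant_a.pt", "variant_b.pt"]]
# merged = merge_state_dicts(variants, weights=[0.6, 0.4])
# torch.save(merged, "merged_model.pt")
```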
5. Experimental Setup
5.1. Datasets
The experiments evaluate Tongyi DeepResearch on seven public information-seeking benchmarks designed for long-term reasoning and long-horizon tool use.
- Humanity's Last Exam (HLE) (Phan et al., 2025): A benchmark designed to test an agent's ability to tackle complex, multidisciplinary questions that often require deep reasoning and knowledge integration. It focuses on questions that might challenge even human experts. The paper evaluates on 2,154 text-only questions.
- BrowseComp (Wei et al., 2025): A benchmark for browsing agents that requires navigating and extracting information from web pages to answer questions. It tests tool use and information retrieval in a realistic web environment.
- BrowseComp-ZH (Zhou et al., 2025): The Chinese counterpart of BrowseComp, assessing similar browsing and information retrieval capabilities but in a Chinese-language context.
- GAIA (Mialon et al., 2023): A benchmark for general AI assistants that evaluates complex real-world tasks requiring multiple steps, tool use, and common-sense reasoning. It often involves using web search and other tools.
- xbench-DeepSearch (Xbench Team, 2025): A benchmark specifically designed for evaluating deep search capabilities, likely involving multi-hop information retrieval and complex synthesis from multiple sources.
- WebWalkerQA (Wu et al., 2025b): A benchmark focused on web traversal and question answering, testing LLMs' ability to navigate through web pages to find answers.
- FRAMES (Krishna et al., 2025): A benchmark for retrieval-augmented generation, often involving fetching facts and reasoning over them.
- xbench-DeepSearch-2510: A newly released benchmark for deep search, indicating a continuous effort to push the boundaries of such systems.

These datasets were chosen because they are widely recognized public benchmarks for evaluating agentic capabilities, information seeking, long-horizon reasoning, and tool use in LLMs. They are effective for validating the proposed method's performance across diverse complexities and language domains (English and Chinese).
The paper also mentions AIME25, HMMT25, and SimpleQA (OpenAI, 2025c) for evaluating performance on general benchmarks.
- AIME25: Likely a mathematical problem-solving benchmark, possibly related to the American Invitational Mathematics Examination.
- HMMT25: Possibly referring to the Harvard-MIT Mathematics Tournament, another math competition benchmark.
- SimpleQA: A knowledge-intensive benchmark focusing on factual question answering.
5.2. Evaluation Metrics
For all deep research benchmarks, the paper follows each benchmark's official evaluation protocol. The primary metric reported is the average performance over three runs, denoted as Avg@3. For completeness, Pass@1 (best result over 3 runs) and Pass@3 are also reported. While the specific calculation for each benchmark's score (e.g., accuracy, F1 score) isn't detailed in the main text, it's implied that they use standard metrics for QA or task completion.
For general benchmarks:

- Mathematical Problems (AIME25, HMMT25): Manual evaluation is used because of the detailed reports generated by the system and the relatively small scale of these datasets, ensuring accuracy and fairness. The metric is likely accuracy (the proportion of correctly solved problems).
- Knowledge-based Problems (SimpleQA): The official evaluation script of SimpleQA is utilized to maintain consistency with established benchmarks. This typically involves accuracy or F1 score for factual questions.

Since the paper does not explicitly provide the mathematical formulas for Avg@3, Pass@1, and Pass@3, their conceptual definitions are given below. These are common metrics in agentic LLM evaluation, especially for tasks with some stochasticity.

- Pass@1:
  - Conceptual Definition: Pass@1, as reported here, measures whether the agent solves a task on its best attempt: if an agent is run multiple times on the same task, it checks whether at least one of those runs succeeded (the paper describes it as the "best result over 3 runs"). It indicates the agent's potential to solve a task given several tries.
  - Mathematical Formula: Let $N$ be the total number of tasks, $K$ the number of independent runs per task (here, $K = 3$), and $S_{i,k}$ a binary indicator that is 1 if the $k$-th run for task $i$ succeeds and 0 otherwise. Then $ \mathrm{Pass@1} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( \max_{k=1,\ldots,K} S_{i,k} = 1 \right) $.
  - Symbol Explanation:
    - $N$: Total number of tasks in the benchmark.
    - $K$: Number of independent runs for each task (here, 3).
    - $S_{i,k}$: Binary indicator; 1 if the $k$-th run for task $i$ is successful, 0 otherwise.
    - $\mathbb{I}(\cdot)$: The indicator function, which evaluates to 1 if its argument is true and 0 otherwise; the max term is 1 if at least one run of task $i$ succeeded.
- Pass@3:
  - Conceptual Definition: In code generation, Pass@K usually estimates the probability that at least one of $K$ attempts succeeds. Given the paper's phrasing, Pass@3 is read the same way here: a task counts as passed if any of the 3 runs passes, so it captures the agent's potential when multiple attempts are allowed.
  - Mathematical Formula: With the same notation, $ \mathrm{Pass@3} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\left( \max_{k=1,2,3} S_{i,k} = 1 \right) $.
  - Symbol Explanation: Same as for Pass@1, with $K = 3$ runs.
- Avg@3:
  - Conceptual Definition: Avg@3 averages the performance score across three independent runs for each task and then averages these task-level means over all tasks. It provides a more robust estimate of typical performance, smoothing out run-to-run variability.
  - Mathematical Formula: Let $Score_{i,k}$ be the performance score for the $k$-th run of task $i$. Then $ \mathrm{Avg@3} = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{1}{K} \sum_{k=1}^{K} Score_{i,k} \right) $.
  - Symbol Explanation:
    - $N$: Total number of tasks in the benchmark.
    - $K$: Number of independent runs for each task (here, 3).
    - $Score_{i,k}$: The score achieved by the agent on task $i$ during run $k$; this can be a binary success indicator (0 or 1) or a more granular metric such as F1, depending on the benchmark.
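The following sketch computes Avg@3 and the best-of-3 pass rate from a task-by-run score matrix, under this document's reading that Pass@1 ("best result over 3 runs") and Pass@3 both reduce to "at least one of the three runs passed". The toy scores are invented for illustration.

```python
import numpy as np

def evaluate_runs(scores):
    """Compute Avg@3 and best-of-3 pass rate from an (N tasks x K runs) score matrix.

    Scores are per-run results in [0, 1] (binary success or a graded metric).
    For binary scores, the best-of-3 rate corresponds to Pass@1/Pass@3 as read here.
    """
    scores = np.asarray(scores, dtype=float)   # shape (N, K)
    avg_at_3 = scores.mean(axis=1).mean()      # average over runs, then over tasks
    best_of_3 = (scores.max(axis=1) >= 1.0).mean()
    return {"Avg@3": avg_at_3, "Pass@1/Pass@3 (best of 3)": best_of_3}

if __name__ == "__main__":
    # 4 toy tasks x 3 runs, binary correctness.
    toy = [[1, 0, 1],
           [0, 0, 0],
           [1, 1, 1],
           [0, 1, 0]]
    print(evaluate_runs(toy))   # Avg@3 = 0.5, best-of-3 = 0.75
```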
5.3. Baselines
The paper compares Tongyi DeepResearch against two families of systems:
- LLM-based ReAct agents: These are models that use LLMs primarily within a ReAct framework for reasoning and tool use.
  - GLM-4.5 (Zeng et al., 2025)
  - Kimi-K2 (Team et al., 2025)
  - DeepSeek-V3.1 (DeepSeek Team, 2025)
  - Claude-4-Sonnet (Anthropic, 2025)
  - OpenAI o3/o4-mini (OpenAI, 2025b)
- End-to-end deep-research agents: These are systems specifically designed and optimized for deep research tasks, often incorporating more complex agentic architectures and training.
  - OpenAI DeepResearch (OpenAI, 2025a)
  - Gemini DeepResearch (Gemini Team, 2025)
  - Kimi Researcher (Kimi, 2025)

These baselines are representative because they cover a range of state-of-the-art LLMs (both open- and closed-source) that are either general-purpose models adapted for agentic tasks (LLM-based ReAct agents) or specialized systems built for deep research (end-to-end deep-research agents). This allows for a comprehensive comparison of Tongyi DeepResearch against both general LLM-agent capabilities and dedicated deep research solutions.
5.4. Inference Parameters
To ensure stability and reproducibility across evaluations, fixed inference parameters were adopted:
- Fixed values are used for temperature, repetition penalty, and top-p.
- A maximum of 128 tool invocations is allowed per task.
- The context length is constrained to 128K tokens.

Each benchmark is evaluated three times independently, and the average performance (Avg@3) is reported as the main metric. Pass@1 (best result over 3 runs) and Pass@3 results are also provided. All results were obtained on September 16, 2025, except for xbench-DeepSearch-2510, which was evaluated on October 28, 2025.
The action space for Tongyi DeepResearch includes Search, Visit, Python, Scholar, and File Parser tools. Official reproduction scripts, tool implementations, and prompt configurations are open-sourced on GitHub.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results demonstrate that Tongyi DeepResearch achieves state-of-the-art performance across a range of deep research benchmarks, often outperforming stronger baselines, despite its parameter efficiency.
The following are the results from Table 1 of the original paper:
| Benchmarks | Humanity's Last Exam | BrowseComp | BrowseComp-ZH | GAIA | xbench-DeepSearch | WebWalkerQA | FRAMES |
| --- | --- | --- | --- | --- | --- | --- | --- |
| LLM-based ReAct Agent | | | | | | | |
| GLM-4.5 | 21.2 | 26.4 | 37.5 | 66.0 | 70.0 | 65.6 | 78.9 |
| Kimi-K2 | 18.1 | 14.1 | 28.8 | 57.7 | 50.0 | 63.0 | 72.0 |
| DeepSeek-V3.1 | 29.8 | 30.0 | 49.2 | 63.1 | 71.0 | 61.2 | 83.7 |
| Claude-4-Sonnet | 20.3 | 12.2 | 29.1 | 68.3 | 65.0 | 61.7 | 80.7 |
| OpenAI o3 | 24.9 | 49.7 | 58.1 | | 67.0 | 71.7 | 84.0 |
| OpenAI o4-mini | 17.7 | 28.3 | 60.0 | | | | |
| DeepResearch Agent | | | | | | | |
| OpenAI DeepResearch | 26.6 | 51.5 | 42.9 | 67.4 | | | |
| Gemini DeepResearch | 26.9 | | | | | | |
| Kimi Researcher | 26.9 | | | | 69.0 | | 78.8 |
| Tongyi DeepResearch (30B-A3B) | 32.9 | 43.4 | 46.7 | 70.9 | 75.0 | 72.2 | 90.6 |
Analysis of Advantages and Disadvantages:
- Humanity's Last Exam: Tongyi DeepResearch achieves 32.9, significantly outperforming all other LLM-based ReAct agents (e.g., DeepSeek-V3.1 at 29.8, OpenAI o3 at 24.9) and competitive with DeepResearch agents (e.g., OpenAI DeepResearch at 26.6, Gemini DeepResearch at 26.9). This indicates strong multi-disciplinary reasoning and knowledge-integration capabilities.
- BrowseComp: While OpenAI DeepResearch (51.5) and OpenAI o3 (49.7) achieve higher scores, Tongyi DeepResearch (43.4) still outperforms other LLM-based ReAct agents such as GLM-4.5 (26.4) and DeepSeek-V3.1 (30.0). This suggests good web browsing and information retrieval skills, though there is room for improvement against the strongest proprietary systems.
- BrowseComp-ZH: Tongyi DeepResearch scores 46.7, which is competitive but slightly lower than DeepSeek-V3.1 (49.2) and OpenAI o3 (58.1). This shows its ability to generalize to Chinese-language browsing tasks, albeit with some gap to the top performers on this benchmark.
- GAIA: Tongyi DeepResearch achieves 70.9, the highest score reported among all baselines, surpassing GLM-4.5 (66.0), Claude-4-Sonnet (68.3), and OpenAI DeepResearch (67.4). This highlights its general AI assistant capabilities on complex, real-world tasks requiring multi-step reasoning and tool use.
- xbench-DeepSearch: Tongyi DeepResearch secures the highest score at 75.0, surpassing DeepSeek-V3.1 (71.0) and GLM-4.5 (70.0). This directly validates its core deep search capabilities.
- WebWalkerQA: With 72.2, Tongyi DeepResearch leads all reported baselines, including OpenAI o3 (71.7). This indicates excellent web traversal and question-answering abilities.
- FRAMES: Tongyi DeepResearch achieves 90.6, significantly higher than any other model, including OpenAI o3 (84.0) and DeepSeek-V3.1 (83.7). This demonstrates superior fact fetching and reasoning over retrieved information.

Overall: Tongyi DeepResearch consistently achieves state-of-the-art performance across nearly all evaluated benchmarks, especially among open-source deep research agents. It narrows the gap to, and in some cases surpasses, proprietary frontier systems, while activating significantly fewer parameters (3.3 billion out of 30.5 billion total). This underscores its efficiency and scalability. On the newly released xbench-DeepSearch-2510, it ranks just below ChatGPT-5-Pro, further demonstrating its competitive edge.
6.1.1. Heavy Mode Performance
The paper introduces a Heavy Mode to further unlock the potential of deep research agents through test-time scaling. This mode leverages a Research-Synthesis framework built upon the context management paradigm.
The performance comparison of Tongyi DeepResearch Heavy Mode and state-of-the-art models is shown below:

The figure is a chart comparing Tongyi DeepResearch's performance on several benchmarks, including Humanity's Last Exam, BrowseComp, and BrowseComp-ZH. The pass-rate data show Tongyi DeepResearch outperforming the other compared models on these tasks.
Figure 6: Performance comparison between Tongyi DeepResearch Heavy Mode and state-of-the-art models.
Methodology of Heavy Mode:
- Parallel Research Phase: $n$ parallel agents are deployed. Each agent follows the context management paradigm, exploring diverse solution paths using different tool-usage and reasoning strategies. Each agent independently processes the question and produces a final report summary ($S_T^u$) and answer ($\mathrm{answer}_u$): $ ( S_T^u, \mathrm{answer}_u ) = \mathrm{Agent}_u( q ), \quad u \in [ 1, n ] $. Here, $S_T^u$ represents the final report summary from agent $u$ after $T$ iterations, encapsulating the complete reasoning trajectory in compressed form.
- Integrative Synthesis Phase: A synthesis model consolidates all parallel findings to produce the final answer: $ \mathrm{answer}_{\mathrm{final}} = \mathrm{Synthesis}\left( \{ ( S_T^u, \mathrm{answer}_u ) \}_{u=1}^n \right) $. The advantage is that the compressed context-management reports ($S_T^u$) allow the synthesis model to assess diverse solution strategies within a manageable context window, unlike traditional methods that would aggregate full, long trajectories. A minimal sketch of this two-phase procedure follows below.
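The Research-Synthesis procedure can be sketched as follows. The `run_agent` and `llm` callables and the prompt wording are placeholders; the paper does not prescribe this exact interface, and a production version would manage many more details than this simple thread pool.

```python
from concurrent.futures import ThreadPoolExecutor

def heavy_mode(question, run_agent, llm, n=4):
    """Parallel research + integrative synthesis (illustrative sketch).

    run_agent(question, seed) -> (report_summary, answer)  # one context-managed rollout
    llm(prompt) -> str                                      # synthesis model call
    """
    # Phase 1: n agents explore independently (different seeds -> diverse strategies).
    with ThreadPoolExecutor(max_workers=n) as pool:
        results = list(pool.map(lambda u: run_agent(question, seed=u), range(n)))

    # Phase 2: the synthesis model sees only compressed reports, not full trajectories.
    findings = "\n\n".join(
        f"Agent {u} report:\n{report}\nAgent {u} answer: {answer}"
        for u, (report, answer) in enumerate(results)
    )
    prompt = (
        f"Question: {question}\n\n{findings}\n\n"
        "Compare the agents' findings, resolve conflicts, and give one final answer."
    )
    return llm(prompt)
```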
Heavy Mode Results:
- Humanity's Last Exam: Achieves 38.3%, a substantial improvement over the standard mode (32.9%) and all other baselines.
- BrowseComp-ZH: Reaches 58.1%, matching OpenAI o3 and surpassing DeepSeek-V3.1 (49.2) and its own standard mode (46.7%).
- BrowseComp: Achieves 58.3%, a significant improvement over its standard mode (43.4%), surpassing OpenAI DeepResearch (51.5) and OpenAI o3 (49.7) and becoming the leading model on this benchmark.

These results validate the effectiveness of Heavy Mode in leveraging test-time compute through parallel exploration and intelligent aggregation for enhanced performance.
6.2. Detailed Analysis
6.2.1. Pass@1 and Pass@3 Performance
The paper reports Avg@3 performance in Table 1. A fine-grained analysis of Pass@1 and Pass@3 (the best result over three runs) is also conducted to demonstrate robustness in a dynamic environment.
The detailed evaluation results using Avg@3, Pass@1, and Pass@3 metrics are shown below:

The figure is a bar chart presenting detailed evaluation results for the benchmarks (e.g., HLE, BrowseComp, WebWalkerQA) under the Avg@3, Pass@1, and Pass@3 metrics, with each benchmark's score shown as a bar for easy comparison.
Figure 7: Detailed evaluation results using the Avg@3, Pass@1 and Pass@3 metrics.
The figure shows that the Avg@3 results are consistent with the Pass@1 results across benchmarks, indicating robustness. Pass@3 (counting a question as solved if any of the three runs passes) shows even higher potential:
- BrowseComp: 59.64% (versus an Avg@3 of 43.4)
- BrowseComp-ZH: 63.67% (versus an Avg@3 of 46.7)
- Humanity's Last Exam: 45.9% (versus an Avg@3 of 32.9)

The higher Pass@3 values demonstrate the strong potential of the agent when given multiple attempts. A small sketch of how these metrics can be computed is given below.
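As a hedged illustration, the sketch below shows one common way to aggregate per-question, per-run binary outcomes into these metrics; the report's exact evaluation scripts and metric conventions are not reproduced here.

```python
# Minimal sketch of how the three reported metrics can be computed from
# per-run binary outcomes (1 = the run answered correctly, 0 = it did not).
# These are common conventions, assumed here rather than taken from the report.
from statistics import mean
from typing import Dict, List


def aggregate_metrics(runs: List[List[int]]) -> Dict[str, float]:
    """`runs[i]` holds the three 0/1 outcomes for question i."""
    avg_at_3 = mean(mean(r) for r in runs) * 100               # mean accuracy over the 3 runs
    pass_at_1 = mean(r[0] for r in runs) * 100                 # accuracy of a single run (one convention)
    pass_at_3 = mean(1 if any(r) else 0 for r in runs) * 100   # any of the 3 runs correct
    return {"Avg@3": avg_at_3, "Pass@1": pass_at_1, "Pass@3": pass_at_3}


# Tiny usage example with made-up outcomes for four questions.
print(aggregate_metrics([[1, 1, 0], [0, 0, 1], [1, 1, 1], [0, 0, 0]]))
```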
6.2.2. Training Rewards and Entropy
The agent's performance (reward) and policy entropy during agentic RL training are analyzed.
The reward and entropy loss of agentic RL training are shown below:

The figure shows how reward and entropy loss evolve during agentic RL training: the left panel plots reward against training steps, the right panel plots the entropy loss, and both panels show raw values alongside EMA-smoothed curves.
Figure 8: Reward and entropy loss of agentic RL training.
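Figure 8 overlays raw values with EMA-smoothed curves. Purely as an illustration (not the report's plotting code), the snippet below shows how such exponential-moving-average smoothing can be computed before plotting.

```python
# Illustrative sketch (assumed, not from the report) of the EMA smoothing used to
# plot training curves such as the reward and entropy-loss traces in Figure 8.
from typing import List


def ema_smooth(values: List[float], alpha: float = 0.1) -> List[float]:
    """Exponential moving average: each point blends the raw value with its history."""
    smoothed: List[float] = []
    running = values[0]
    for v in values:
        running = alpha * v + (1 - alpha) * running
        smoothed.append(running)
    return smoothed


# Example: smooth a noisy reward trace before plotting it next to the raw values.
raw_rewards = [0.21, 0.18, 0.25, 0.24, 0.31, 0.29, 0.36]
print(ema_smooth(raw_rewards))
```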
- Reward: The left panel shows a clear and significant upward trend in the agent's performance (reward) over training, confirming effective policy learning. The sustained improvement is attributed to dynamic data curation, which consistently provides challenging material, preventing learning stagnation.
- Entropy: The right panel shows that policy entropy exhibits exceptional stability. It converges to a consistent value after a brief initial increase, avoiding both collapse (where the policy becomes too deterministic and stops exploring) and explosion (where the policy becomes too random and inefficient). This stability is strong evidence for the methodological contributions in environment design and algorithm modification that create effective RL training.
6.2.3. Context Length of RL
The impact of the model's context length on the agentic RL training process is analyzed by comparing models with 32k, 48k, and 64k context limits. The dynamic data curation for all variants used a 64k context model.
The comparison of different context length limits for RL training is shown below:

The figure shows the effect of different context-length limits on RL training: the left panel plots reward over training steps and the right panel plots average response length, each comparing the 32k, 48k, and 64k curves.
Figure 9: Comparison of different context length limits for RL training.
- Reward Dynamics (Left Panel): All three models (32k, 48k, 64k) demonstrate effective and stable policy learning with monotonically increasing rewards, confirming the robustness of the training framework. However, their performance ceilings diverge. The 64k model achieves the highest reward because the curriculum is populated with problems moderately difficult for a 64k-context model, often requiring long and complex reasoning. The 48k and 32k models, being more constrained, cannot solve the most complex problems, which caps their maximum attainable reward.
- Average Response Length (Right Panel):
  - The 64k-context model shows a steady increase in average response length, learning to leverage its expansive context for more elaborate solutions.
  - The 48k-context model maintains a consistent equilibrium in response length, improving its policy within a stable complexity budget.
  - The 32k-context model displays a clear downward trend in response length. This is a key insight: for models with limited context, RL training on a curriculum designed for a more capable model can force them to discover more efficient solutions. Since the 64k-context model curates the data, problems may have optimal solutions longer than 32k tokens. A 32k model attempting these receives a zero-reward signal, implicitly incentivizing it to discover more concise, potent action sequences that fit within its limit, thereby becoming more efficient. (A minimal sketch of this length-truncation reward rule follows this list.)
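Below is a minimal sketch, under the assumption of a binary outcome reward, of the zero-reward rule described above for rollouts that overflow the context limit; it is illustrative rather than the report's actual reward function.

```python
# Illustrative sketch (an assumption, not the report's exact reward code) of the
# zero-reward rule: a trajectory whose total token count exceeds the model's
# context limit earns no reward, regardless of answer correctness.
def trajectory_reward(num_tokens: int, answer_correct: bool, context_limit: int) -> float:
    """Binary outcome reward, masked to zero when the rollout overflows the context."""
    if num_tokens > context_limit:
        return 0.0  # overflowing rollouts are treated as failures
    return 1.0 if answer_correct else 0.0


# A correct 40k-token solution still scores 0 under a 32k limit, which pushes the
# 32k model toward shorter, more efficient action sequences.
print(trajectory_reward(num_tokens=40_000, answer_correct=True, context_limit=32_000))  # 0.0
print(trajectory_reward(num_tokens=28_000, answer_correct=True, context_limit=32_000))  # 1.0
```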
6.2.4. Interaction Test-time Scaling
The paper investigates how the agent's performance scales with the number of interaction turns with the environment, which correlates with context length.
The detailed analysis on interaction scaling and simulated environments is shown below:

The figure shows the relationship between interaction turns / context length and accuracy on BrowseComp (panel a), where accuracy rises as context length grows, and the reward curve in the simulated environment (panel b), which increases steadily over training steps.
Figure 10: Detailed analysis on interaction scaling and simulated environments.
- Interaction Scaling (Figure 10a): As the context length and number of interactions grow, the model's performance on the BrowseComp dataset improves consistently. This demonstrates that for DeepResearch agents that rely on environmental interactions, scaling along the dimension of environment interactions (and thus context length) is crucial for performance gains, unlike conventional models that might scale by simply increasing output tokens.
6.2.5. Super-human Level Synthetic Data
To validate the effectiveness of the synthetic data, a statistical analysis of the SFT dataset was conducted.
- Over 20% of the samples in the SFT dataset exceed 32k tokens and involve more than 10 tool invocations.
- This demonstrates the high complexity and richness of the synthetic data. This high-quality, cold-start data provides the model with a strong foundation for deep reasoning and research capabilities, serving as an excellent initialization for the RL phase. Automated data curation is leveraged during RL to make more effective use of this synthetic data. (A small counting sketch of how such dataset statistics can be derived is shown below.)
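As an illustration of how such dataset statistics can be derived, the sketch below counts trajectories that both exceed 32k tokens and use more than 10 tool invocations; the record fields (`num_tokens`, `tool_calls`) are assumed here, not the report's actual data schema.

```python
# Minimal sketch (assumed record format) of how an SFT-data statistic like the one
# above could be computed: the share of trajectories that exceed 32k tokens and
# contain more than 10 tool invocations (read here as a joint condition).
from typing import Dict, List


def long_complex_share(samples: List[Dict], token_thresh: int = 32_000, tool_thresh: int = 10) -> float:
    hits = sum(
        1 for s in samples
        if s["num_tokens"] > token_thresh and len(s["tool_calls"]) > tool_thresh
    )
    return hits / len(samples)


# Toy example with two trajectory records.
toy = [
    {"num_tokens": 41_000, "tool_calls": ["search"] * 14},
    {"num_tokens": 9_000, "tool_calls": ["search", "visit"]},
]
print(f"{long_complex_share(toy):.0%} of samples are long and tool-heavy")
```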
6.2.6. From Simulation to Reality
To rapidly validate the algorithm, a simulated Wiki environment mirroring real-world conditions was built.
- The adapted GRPO algorithm was tested in this environment (a sketch of the standard group-normalized advantage used by GRPO is given after this list).
- The resulting reward curve (shown in Figure 10b) closely matches the one observed in the real environment (Figure 8).
- This Wiki simulation environment functions as a "wind tunnel laboratory," enabling fast algorithm iteration and significantly improving development efficiency.
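For reference, the sketch below shows the standard group-relative advantage computation at the heart of GRPO; the report's adapted variant, and how it interacts with the simulated Wiki environment, is not reproduced here.

```python
# Sketch of the group-relative advantage used by standard GRPO (the report's
# adaptation is not shown). For each question, a group of rollouts is scored, and
# each rollout's advantage is its reward normalized by the group mean and std.
from statistics import mean, pstdev
from typing import List


def grpo_advantages(group_rewards: List[float], eps: float = 1e-6) -> List[float]:
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


# Example: four rollouts of the same question, two correct and two incorrect.
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # positive for correct, negative for incorrect
```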
6.2.7. Performance on General Benchmark
The paper also evaluates Tongyi DeepResearch on three general benchmarks: AIME25, HMMT25, and SimpleQA.
The performance on general benchmarks is shown below:

The figure shows model scores on several benchmark tasks, including AIME25, HMMT25, and SimpleQA; Tongyi DeepResearch attains the best score of 100 on both HMMT25 and AIME25, and 98.6 on SimpleQA.
Figure 11: Performance on general benchmarks.
- Results: Tongyi DeepResearch achieves substantial improvements over the base model (which relies solely on reasoning without any tool use).
  - For AIME25 and HMMT25 (mathematical reasoning benchmarks), it scores 100% and 100%, respectively, compared to the base model's 52% and 48%. This improvement is attributed to the Python Interpreter, which provides native computational support.
  - For SimpleQA (a knowledge-intensive benchmark), it scores 98.6%, compared to the base model's 85%. This improvement is due to the ability to retrieve external information via search.
- Implication: These results demonstrate that model training increasingly converges with agent training. Solving paradigms are evolving toward agentic architectures that integrate tool invocation and environment interaction, reflecting a more human-like problem-solving process.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Tongyi DeepResearch, an open-source deep research agent developed by Alibaba Group. This agent represents a significant step towards AI systems capable of autonomously transforming information into insight. Its core innovation lies in an end-to-end training paradigm that unifies agentic mid-training and agentic post-training. This framework, supported by automated data synthesis and stage-specific environments, enables the model to autonomously plan, search, reason, and synthesize information for complex, long-horizon research tasks.
Despite its parameter efficiency (30.5 billion total parameters with only 3.3 billion activated per token), Tongyi DeepResearch achieves state-of-the-art results across multiple deep research benchmarks, including Humanity's Last Exam, BrowseComp, GAIA, and FRAMES, often surpassing strong proprietary systems. The introduction of Heavy Mode further enhances performance through parallel exploration and integrative synthesis at test time. The work emphasizes the critical role of synthetic data and stable environmental interactions for effective agentic reinforcement learning. By open-sourcing the model and framework, Tongyi DeepResearch establishes a foundation for reproducible research into autonomous AI agents and contributes to the ongoing development of more general, self-improving intelligence.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Context Length: The current 128K context length is still insufficient for the most complex long-horizon tasks. Future work will explore extended context windows or more advanced context management mechanisms.
- Model Scale: While the current model is efficient, a larger-scale model is currently in progress.
- Report Generation Fidelity: Continuous improvement in report generation fidelity and optimization for user preferences is needed to ensure more faithful, useful, and preference-aligned outputs.
- RL Efficiency: The efficiency of the reinforcement learning framework can be improved by exploring techniques such as partial rollouts, which will require addressing off-policy training challenges, including distributional shift.
- Generalization Beyond Deep Research: The current Deep Research training focuses on specific prompt instructions and predefined tool sets. The plan is to enhance its robustness and extend the framework from Deep Research to broader agentic tool-use scenarios.
- Larger Models and Edge Deployment: The authors also emphasize the value of training agentic capabilities on relatively small models for efficiency on edge devices and broader accessibility, indicating a direction for practical deployment while acknowledging the concurrent development of larger models.
7.3. Personal Insights & Critique
This paper presents a highly compelling and systematic approach to developing deep research agents. The integration of agentic mid-training and agentic post-training is particularly insightful, addressing a critical challenge: how to effectively instill agentic biases into general LLMs before applying intensive RL. This progressive training strategy seems much more robust than trying to learn everything during a single RL phase.
The emphasis on fully automated, scalable data synthesis is another strong point. The ability to generate super-human level, high-uncertainty QA pairs and PhD-level research questions without human annotation is a game-changer for scaling agentic research. This not only reduces cost but also allows for controlled curriculum generation, which is critical for stable RL. The concept of a data flywheel where improving agents generate better training data is powerful for self-improving AI.
The detailed analysis of context length and response length during RL offers a fascinating insight into how models adapt to their constraints, particularly the observation that limited context length can implicitly force a model to find more efficient action sequences. This suggests that constrained environments can sometimes drive more intelligent behavior, a point worth exploring further in general AI research.
The Heavy Mode is an elegant solution for test-time scaling, effectively addressing the context window limitation by synthesizing compressed reports from parallel agents. This demonstrates a practical way to leverage additional compute for improved performance on complex tasks without redesigning the core model.
Potential Issues/Areas for Improvement:
- Sim-to-Real Gap: While the paper acknowledges the sim-to-real gap, the heavy reliance on simulated environments for iteration, even with Wikipedia-based RAG tools, may still leave a significant challenge when deploying to truly open-ended, dynamic real-world web environments, with their inherent noise, adversarial elements, and constantly changing information landscape. The unified sandbox for real-world interaction helps, but the fundamental challenge remains.
- Interpretability of Model Merging: While model merging is effective for performance gains, the specific mechanisms by which weighted averaging of parameters from models with "different capability preferences" leads to "robust generalization abilities" could be explored in more depth. What are these capability preferences, and how do they interact?
- Evaluation Metrics for Complex Reasoning: While quantitative metrics are crucial, evaluating deep research agents on tasks like "Humanity's Last Exam" might also benefit from qualitative assessments of the depth, novelty, and coherence of the generated reports, beyond correctness alone.
- Scalability of Heavy Mode: Heavy Mode deploys n parallel agents. While effective, the computational cost increases with n. Further work might explore dynamic scaling or intelligent pruning of parallel agents based on early indicators of solution quality to optimize resource usage.

This paper provides a strong foundation for open-source agentic AI and its application to deep research. Its methodologies, particularly the integrated training pipeline and automated data synthesis, offer valuable insights for the broader agentic LLM community. The commitment to open-sourcing is commendable and will undoubtedly accelerate future research.