WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
TL;DR Summary
WebWatcher is a multimodal deep research agent that strengthens visual-language reasoning through synthetic trajectories and reinforcement learning. It is validated on the new BrowseComp-VL benchmark for complex visual-textual retrieval tasks, where it surpasses existing baselines.
Abstract
Under review as a conference paper at ICLR 2026. WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent. Anonymous authors, paper under double-blind review.

Web agents such as deep research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains largely text-centric, overlooking visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency in more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for deep research with enhanced visual-language reasoning capabilities. It uses high-quality synthetic trajectories for efficient cold start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. […]
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
1.2. Authors
The authors are listed as "Anonymous authors," indicating the paper is under double-blind review. Their affiliations are not disclosed in the provided text.
1.3. Journal/Conference
The paper is hosted on OpenReview, a platform commonly used for conference submissions (e.g., ICLR, NeurIPS) under double-blind review. The first page indicates the submission is under review as a conference paper at ICLR 2026; OpenReview is a reputable platform for disseminating machine learning research. The publication status is "Paper under double-blind review."
1.4. Publication Year
2025 (OpenReview record dated 2025-10-08, UTC).
1.5. Abstract
The paper introduces WebWatcher, a novel multimodal deep research agent designed to overcome the text-centric limitations of most existing web agents. WebWatcher integrates enhanced visual-language reasoning capabilities through the use of high-quality synthetic trajectories for efficient cold start training, diverse tools for deep reasoning, and reinforcement learning for improved generalization. To evaluate such agents, the authors propose BrowseComp-VL, a new benchmark styled after BrowseComp that demands complex information retrieval combining visual and textual data. Experimental results demonstrate that WebWatcher either outperforms or matches proprietary baselines, Retrieval-Augmented Generation (RAG) workflows, and open-source agents across four challenging Visual Question Answering (VQA) benchmarks, thereby paving the way for solving intricate multimodal information-seeking tasks.
1.6. Original Source Link
Official Source: https://openreview.net/forum?id=8jsaazdAb3
PDF Link: https://openreview.net/pdf?id=8jsaazdAb3
Publication Status: The paper is currently "under double-blind review" at OpenReview.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the text-centric limitation of current deep research web agents. While these agents, often powered by Large Language Models (LLMs), have shown superhuman abilities in solving complex information-seeking problems, they largely overlook the vast amount of visual information present in the real world. This makes multimodal deep research – tasks requiring reasoning across both visual and textual data – exceptionally challenging.
This problem is important because many real-world scenarios, such as interpreting scientific diagrams, analyzing charts, or navigating visual web interfaces, inherently demand joint vision-language reasoning. Existing Vision-Language (VL) agents often fall short by relying on template-driven pipelines, limiting their flexible reasoning, planning ability, and versatile tool use. Some VL agents focus primarily on image-based perception with visual tools but struggle to integrate this with deep textual understanding and cross-modal inference. Conversely, search-only agents have a limited problem-solving scope, failing when answers are implicit, require interaction, or demand additional computation.
The paper's entry point or innovative idea is to introduce WebWatcher, an agent that directly addresses this gap by combining strong reasoning abilities across both textual and visual information with the effective use of multiple external tools. It focuses on generating high-quality training data that combines complex visual content with multi-step reasoning and then trains the agent through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL).
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introducing WebWatcher: A novel multimodal agent for deep research that enhances visual-language reasoning capabilities. It is designed to integrate various tools for deep reasoning and to generalize through reinforcement learning.
- Developing a Scalable Data Generation Pipeline: This pipeline creates high-quality synthetic trajectories for efficient cold start training. It transforms complex textual Question Answering (QA) pairs into Visual Question Answering (VQA) items, incorporating multi-hop, knowledge-intensive queries grounded in authentic web images, and includes a multi-stage filtering process for quality control.
- Automated Trajectory Generation and Post-Training: The paper proposes an automated pipeline to build tool-use trajectories from action-observation sequences via prompting, followed by Supervised Fine-Tuning (SFT) and Group-Relative Policy Optimization (GRPO) to optimize tool use and decision-making.
- Proposing BrowseComp-VL: A challenging new VQA benchmark that extends BrowseComp into the visual domain, requiring complex information retrieval involving both visual and textual information, cross-modal reasoning, and high-level planning.

The key conclusions or findings reached by the paper are:
- WebWatcher consistently outperforms or matches proprietary baselines, RAG workflows, and open-source agents across four challenging VQA benchmarks (HLE, LiveVQA, BrowseComp-VL, and MMSearch).
- It demonstrates competitive performance even on perception-oriented benchmarks like SimpleVQA, indicating broad applicability.
- The tool usage analysis shows that WebWatcher flexibly composes tool chains based on benchmark demands, rather than over-relying on any single tool, showcasing its adaptability.
- The cold start SFT is crucial for stable and effective reinforcement learning in multimodal agent training, preventing initial instability and ensuring meaningful credit assignment.
- The Pass@k analysis confirms the scalability of the agentic paradigm, where systematic exploration of reasoning paths leads to consistent and robust performance improvements.

These findings address the problem of limited multimodal capabilities in deep research agents, offering a robust framework for agents to effectively interact with and reason over both visual and textual information in complex real-world scenarios.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several core concepts in artificial intelligence, particularly in the areas of natural language processing, computer vision, and reinforcement learning.
- Large Language Models (LLMs): These are advanced artificial intelligence models, often based on the transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide range of tasks, from question answering to code generation. Examples include GPT-4o and Gemini.
- Deep Research Agents (or Web Agents): These are LLM-powered systems designed to autonomously perform complex information-seeking tasks on the web. They go beyond single-turn interactions by planning multi-step actions, using tools (like search engines or code interpreters), and synthesizing information from multiple sources to answer challenging questions or complete intricate tasks. The paper refers to "deep research" as a specific type of web agent focused on comprehensive information gathering and synthesis.
- Multimodal AI: This refers to AI systems that can process and reason over information from multiple modalities, such as text, images, audio, and video. In this paper's context, multimodal primarily refers to vision-language, meaning the ability to understand and integrate both visual (images) and textual (language) information.
- Visual Question Answering (VQA): A task in multimodal AI where a model receives an image and a natural language question about that image, and it must provide a natural language answer. VQA challenges models to perform both visual recognition and language understanding, often requiring reasoning to combine information from both modalities.
- ReAct Framework: Short for "Reasoning and Acting," ReAct is a general paradigm for LLM agents that interleaves Thought, Action, and Observation steps (a minimal loop sketch follows this list):
  - A Thought (or Think) step involves the LLM generating a reasoning trace to decide the next action.
  - An Action (or tool_call) step involves the LLM calling an external tool (e.g., search engine, code interpreter) based on its Thought.
  - An Observation (or tool_response) step involves the environment returning the result of the tool's action, which the LLM then uses to inform its next Thought.
  This cyclical process allows LLMs to perform complex, multi-step tasks by breaking them down into manageable sub-problems and leveraging external knowledge or computation.
- Supervised Fine-Tuning (SFT): A common technique to adapt a pre-trained LLM to a specific task or domain. It involves training the LLM on a dataset of input-output pairs (trajectories in this case) where the desired behavior is explicitly demonstrated. The model learns to mimic this behavior by minimizing a loss function (e.g., cross-entropy) on the labeled data. In WebWatcher, SFT serves as a "cold start" to teach the agent basic tool-augmented reasoning.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions, and its goal is to learn a policy that maximizes the cumulative reward over time. RL is particularly useful for tasks that involve sequential decision-making and where explicit demonstrations for all possible scenarios are hard to provide.
- Group-Relative Policy Optimization (GRPO): An RL algorithm mentioned in the paper, which is a variant of policy gradient methods. GRPO refines decision-making by normalizing rewards within a group of generated trajectories. This "group-relative advantage" helps to stabilize training and encourages exploration of trajectories that yield higher rewards compared to others in the same group, without relying on a separate value function (which can be hard to estimate). It is designed to promote stable updates while encouraging exploration of trajectories with higher relative return.
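To make the Thought-Action-Observation cycle concrete, the following is a minimal ReAct-style loop in Python. It is a sketch under stated assumptions: `call_llm`, the `TOOLS` registry, and the "Action: tool[argument]" / "Final Answer:" conventions are hypothetical placeholders, not APIs or formats taken from the paper.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a (vision-)language model; returns the next Thought/Action text."""
    raise NotImplementedError

# Toy tool registry; a real agent would call a search API, a code interpreter, etc.
TOOLS = {
    "web_text_search": lambda query: f"<text search results for: {query}>",
    "web_image_search": lambda query: f"<image search results for: {query}>",
}

def react_loop(question: str, max_steps: int = 8) -> str:
    """Interleave Thought, Action, and Observation until the model emits a final answer."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(context)                 # model emits a Thought plus an Action
        context += step + "\n"
        if "Final Answer:" in step:              # the model decided it has enough information
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                    # naive parsing of "Action: tool[argument]"
            action = step.split("Action:")[-1].strip()
            name, _, arg = action.partition("[")
            tool = TOOLS.get(name.strip(), lambda a: "unknown tool")
            observation = tool(arg.rstrip("]"))
            context += f"Observation: {observation}\n"   # fed back to guide the next Thought
    return "No answer produced within the step budget."
```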
3.2. Previous Works
The paper contextualizes its contributions by referencing several prior works in deep research agents and multimodal VQA benchmarks.
Deep Research Agents
- Proprietary Solutions (e.g., DeepResearch, Gemini Deep Research): These agents (from OpenAI, Google, Perplexity) demonstrate near-expert performance in fact-finding and analysis. However, their internal architectures and data pipelines are typically secret, hindering research replication and in-depth analysis.
- WebDancer (Wu et al., 2025a): This open-source agent focuses on curriculum-driven SFT over ReAct traces. It teaches agents to use tools through structured demonstrations.
- WebThinker (Li et al., 2025c): Augments SFT with policy-gradient refinement, meaning it uses RL techniques to further improve the agent's decision-making beyond what SFT can achieve.
- R1-Searcher (Song et al., 2025): Leverages self-play to learn tree-structured exploration policies, allowing the agent to explore different search paths more effectively.
- WebSailor (Li et al., 2025a): Focuses on uncertainty reduction using structured task obfuscation, an RFT cold-start, and the DUPO algorithm to handle ambiguous queries.
- WebShaper (Tao et al., 2025b): Proposes a formalization-driven data-synthesis pipeline by introducing Knowledge Projections and an agentic Expander.
- OmniSearch (Li et al., 2025d): A search-oriented open-source agent based on GPT-4o, used as a baseline in WebWatcher's experiments.
Crucial Background Information for Understanding Agent Frameworks:
Many of these agents, including WebWatcher, rely on the ReAct framework or its variations. As explained above, ReAct models generate thoughts and actions in an interleaved manner. This is a significant departure from earlier LLM interactions that were purely prompt-response.
The core idea behind these frameworks is to enable LLMs to:
- Reason: Generate internal thoughts to plan and strategize.
- Act: Utilize external tools to gather information or perform computations.
- Observe: Process the results from tool actions to update their internal state and guide subsequent reasoning.
Multimodal VQA Benchmarks
- Single-step Perception/Shallow Retrieval Benchmarks (e.g., OK-VQA, A-OKVQA): These older benchmarks typically emphasize static knowledge grounding and heuristic answer prediction. They often require models to answer questions based on a single image and some external knowledge, without extensive multi-step reasoning or tool use.
- MMT-Bench (Ying et al., 2024): Offers large-scale coverage of planning-oriented tasks across multiple domains but uses a multiple-choice format, which restricts the assessment of procedural reasoning and rich textual outputs.
- MicroVQA (Burgess et al., 2025) and Open3DVQA (Zhang et al., 2025): Explore domain-specific and spatial reasoning, respectively, but are often constrained by limited scale, manual curation, or a lack of complex planning structures.
- Dyn-VQA (Li et al., 2025d; Chen et al., 2025): Introduces adaptive query tasks but remains narrow in its multimodal scope and size.
- MMMU-Pro (Yue et al., 2024), MMSearch-Plus (Tao et al., 2025a), MM-BrowseComp (Li et al., 2025b): More recent benchmarks exploring the performance limits of current MLLMs on domain-specific and difficult information-seeking tasks. MMSearch is used as an evaluation benchmark for WebWatcher.
- BrowseComp (Wei et al., 2025a, 2025b): A benchmark for browsing agents that emphasizes underspecified and difficult queries requiring retrieval of scattered information and integration of fragmented clues. WebWatcher extends this to the visual domain with BrowseComp-VL.
- Humanity's Last Exam (HLE) (Phan et al., 2025): A challenging benchmark with expert-written questions across diverse academic fields, requiring synthesis of evidence from obscure sources and reasoning through abstract problems. WebWatcher evaluates on a multimodal subset of HLE.
3.3. Technological Evolution
The field has evolved from text-only LLMs to multimodal LLMs (MLLMs) that can process both text and images. Initially, research focused on fundamental VQA tasks, often limited to single-step reasoning or simple retrieval. The next stage involved equipping LLMs with tools (tool-use LLMs) to augment their capabilities, leading to the development of deep research agents that could perform multi-step planning and interaction on the web, but primarily in a text-centric manner.
This paper's work represents a critical step in this evolution: bridging the gap between multimodal perception and deep research agent capabilities. It pushes beyond text-only reasoning by deeply integrating visual information into the agent's reasoning and tool-use loop. This means the agent doesn't just "see" but actively reasons over visual content and uses visual information to guide its multi-step information-seeking process. The introduction of BrowseComp-VL also signifies an evolution in benchmarks, moving towards more realistic, complex, and multimodal information-seeking challenges that mirror real-world tasks.
3.4. Differentiation Analysis
Compared to the main methods in related work, WebWatcher introduces several core differences and innovations:
- Integrated Multimodal Reasoning and Tool Use: Unlike most prior deep research agents that are text-bound, WebWatcher deeply integrates vision-language reasoning with a versatile set of tools. It explicitly addresses tasks that require combining both modalities for complex problem-solving, which is a limitation for agents primarily focused on text or only simple visual perception.
- Advanced Multimodal Data Generation: Existing VQA datasets often focus on single-hop queries or perception. WebWatcher's pipeline generates training data specifically designed for in-depth, multi-step reasoning and strategic planning by converting complex textual QA into VQA and masking entities. This provides a richer and more challenging training environment than is typically available.
- Automated Trajectory Generation: Instead of rigid, template-based trajectories, WebWatcher generates action-observation sequences via prompting, grounding them in actual tool-use behavior and reflecting procedural decision-making. This addresses the challenge of coordinating tools with distinct input-output formats and reasoning roles.
- Robust Training Methodology: WebWatcher combines Supervised Fine-Tuning (SFT) for a strong "cold start" with Reinforcement Learning (RL) via GRPO for further optimization and generalization. The paper specifically highlights the importance of the SFT cold start for stable RL training in complex tool-use scenarios, which is a critical finding for agent development.
- Novel Multimodal Benchmark (BrowseComp-VL): WebWatcher introduces a new benchmark that extends BrowseComp to the visual domain. This benchmark is specifically designed to challenge agents with long, entity-obfuscated queries that demand cross-modal reasoning, thorough information-seeking, and high-level planning across web search, image retrieval, and webpage browsing. This provides a more comprehensive evaluation of multimodal deep research capabilities than previous benchmarks.

In essence, WebWatcher moves beyond merely adding visual tools to an LLM by providing a holistic framework for generating complex multimodal data, training agents to effectively use tools in a multimodal context, and evaluating them on benchmarks that truly demand integrated vision-language reasoning for deep research.
4. Methodology
4.1. Principles
The core idea behind WebWatcher is to build a multimodal deep research agent capable of complex vision-language reasoning and multi-tool interaction. This is achieved by addressing three key challenges:
- Developing strong reasoning across text and vision: This requires constructing high-quality training data that combines rich visual content with complex, multi-step reasoning.
- Enabling effective use of multiple external tools: This involves equipping the agent with a diverse set of tools and training it to coordinate them flexibly.
- Ensuring generalization and robust decision-making: This is achieved through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL).

The theoretical basis, or intuition, is that by providing LLMs with the ability to "see" (process visual information) and "act" (use various tools) in a structured and learned manner, they can transcend text-only limitations and tackle more complex, real-world information-seeking problems. The ReAct framework provides the operational structure for interleaving thought, action, and observation, while carefully curated data and RL techniques refine the agent's strategic capabilities.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology of WebWatcher can be broken down into three main phases: Data Preparation, Trajectory Generation and Post-Training, and Experimental Setup.
4.2.1. Data Preparation
This phase focuses on constructing a high-quality dataset for multimodal deep research agents.
4.2.1.1. Data Overview
The dataset is designed for multimodal deep research agents, with each example comprising:
- A factual image.
- An associated question requiring cross-modal reasoning.
- A corresponding answer.
- Auxiliary metadata about the underlying entities and relations.

The dataset covers 5 major domains (Entertainment, Humanities, Technology, Natural Science, and Other) and 17 fine-grained subfields. It defines two difficulty levels:
- Level 1: Questions require multi-hop reasoning but still reference explicit entities. Answers can be obtained through iterative retrieval, but integrating information across multiple sources is non-trivial.
- Level 2: Questions have obfuscated entities and attributes (e.g., vague time periods, masked names, fuzzed quantitative properties). This introduces uncertainty, forcing the agent to plan, compare, and synthesize information rather than rely on direct retrieval.

This dataset is split into a training set and a benchmark called BrowseComp-VL.
The following figure (Figure 2 from the original paper) illustrates the domain distribution and examples of Level 1 and Level 2 questions:
The image is a diagram with two concentric ring (donut) charts showing the structure and composition of the Level 1 and Level 2 subsets. Different colors distinguish the sub-domains, and example questions and answers are attached.
4.2.1.2. Construction of VQA Pairs
This sub-section details how diverse textual QA pairs are first constructed, then grounded in relevant images to form VQA tasks.
QA Pairs Generation
- Level 1: Inspired by CRAWL-QA from WebDancer (Wu et al., 2025a). Root URLs are collected from authoritative sources (arXiv, GitHub, Wikipedia), and their hyperlinks are recursively traversed to mimic human browsing. GPT-4o (OpenAI, 2024) synthesizes question-answer pairs from the aggregated content.
- Level 2: Following WebSailor (Li et al., 2025a), queries are constructed with fuzzed entities by replacing precise references with partial or ambiguous descriptions. This forces contextual reasoning and synthesis across modalities. A two-stage generation framework is used (a toy sketch of the node-selection step follows this list):
  - Nodes Selecting: Starting from an initial Wikipedia page, GPT-4o generates a base QA pair using the page title as the root entity node. A hyperlink graph is expanded by recursively traversing outgoing links to form a tree with a fixed depth and branching factor, which together determine the number of generated nodes. Subgraphs of entities are sampled, each defining a path from the root node to a target entity, forming the basis for multi-hop QA pairs.
  - Query Generating and Entity Masking: For each subgraph, GPT-4o generates a standard question explicitly referencing entities and relations. A fuzzed version is then created by replacing key references with partial or ambiguous descriptions, preventing simple string matching and forcing cross-modal reasoning.
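The node-selection step can be pictured as breadth-limited expansion of a hyperlink tree followed by sampling a root-to-target path, as sketched below. This is illustrative only: `get_outgoing_links` is a hypothetical helper, and the depth and branching values are placeholders rather than the settings used in the paper.

```python
import random

def get_outgoing_links(page: str) -> list[str]:
    """Hypothetical helper returning hyperlinked entity pages for `page`."""
    raise NotImplementedError

def expand_tree(root: str, depth: int = 2, branching: int = 3) -> dict[str, list[str]]:
    """Expand a hyperlink tree with limited depth and branching (placeholder values)."""
    tree, frontier = {root: []}, [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = get_outgoing_links(node)[:branching]
            tree[node] = children
            for child in children:
                tree.setdefault(child, [])
            next_frontier.extend(children)
        frontier = next_frontier
    return tree

def sample_root_to_target_path(tree: dict[str, list[str]], root: str) -> list[str]:
    """Random walk from the root to a leaf; the leaf plays the role of the target entity."""
    path, node = [root], root
    while tree.get(node):
        node = random.choice(tree[node])
        path.append(node)
    return path
```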
QA-to-VQA Conversion
This process ensures reliable visual grounding and transforms textual QA into VQA queries.
- Visual Context Construction: Trivial or overly ambiguous target entities (those lacking visual grounding) are discarded. For each retained entity, a set of web images is retrieved via Google SerpApi (Google, 2025), with a fixed number of images per entity in this implementation. These images are strictly authentic.
- Question Transformation: To create image-grounded VQA pairs from each textual QA, GPT-4o is used for prompt-based rewriting. The target entity in the question is masked with a visual reference token (e.g., "this entity," "the object in the image"), producing a transformed VQA query. Simultaneously, an image query string is created to guide the filtering of the retrieved images. Each retained image is paired with the transformed (q, a), so multiple multimodal examples are generated from each textual QA.

The following figure (Figure 3 from the original paper) illustrates the data generation pipeline:
The image is a diagram illustrating the multi-level information retrieval and reasoning pipeline for the question "What snake species is named after James Roy Kinghorn?", including graph search, image retrieval, Selector and Examiner modules, and graph-based hierarchical reasoning.
4.2.1.3. Quality Control
A two-stage filtering pipeline ensures high-quality VQA samples:
- Selector:
  - Discards cases where the transformed VQA query is identical to the original question, or where the entity name or its aliases still appear in the transformed query, indicating failed masking.
  - GPT-4o evaluates each image against both the query and the (q, a) pair, scoring contextual alignment, semantic fit, and visual reasoning plausibility. Cases with low scores are removed.
- Examiner: For each retained image-query pair, GPT-4o attempts to answer the question using only the visual content and associated captions. Failure to answer accurately indicates improper visual context, and such cases are discarded. Captions are included to reduce false negatives caused by missing world knowledge.

A schematic sketch of this two-stage filter follows.
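Conceptually, the two-stage filter is a pair of predicates applied to every candidate (image, query, answer) triple. The sketch below is a schematic reading of that pipeline; `gpt4o_score`, `gpt4o_answer`, and the 0.7 threshold are hypothetical stand-ins, not values or interfaces from the paper.

```python
def gpt4o_score(image, question: str, answer: str) -> float:
    """Hypothetical wrapper: GPT-4o scores contextual alignment, semantic fit, and plausibility."""
    raise NotImplementedError

def gpt4o_answer(image, caption: str, question: str) -> str:
    """Hypothetical wrapper: GPT-4o answers using only the image and its caption."""
    raise NotImplementedError

def selector(image, q_vqa: str, q_text: str, answer: str, entity_aliases: list[str]) -> bool:
    """Stage 1: reject failed maskings and poorly aligned images."""
    if q_vqa.strip() == q_text.strip():
        return False                                   # masking did not change the question
    if any(alias.lower() in q_vqa.lower() for alias in entity_aliases):
        return False                                   # entity name or alias leaked into the query
    return gpt4o_score(image, q_vqa, answer) >= 0.7    # illustrative threshold

def examiner(image, caption: str, q_vqa: str, answer: str) -> bool:
    """Stage 2: keep the pair only if the question is answerable from the visual context."""
    predicted = gpt4o_answer(image, caption, q_vqa)
    return predicted.strip().lower() == answer.strip().lower()
```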
4.2.2. Trajectory Generation and Post-Training
This phase involves generating high-quality tool-use trajectories and then using them for supervised fine-tuning (SFT) and reinforcement learning (RL).
4.2.2.1. Multimodal Tools
WebWatcher is equipped with five tools:
- Web Image Search: Uses Google SerpApi (Google, 2025) to retrieve relevant images with captions and URLs.
- Web Text Search: For open-domain information seeking using text queries.
- Visit: Uses Jina (Jina.ai, 2025) to navigate specific URLs and summarize pages according to the agent's goal.
- Code Interpreter: For symbolic computation and numerical reasoning (Cheng et al., 2024).
- OCR (Optical Character Recognition): An internal tool, invoked via prompt and SFT data, to extract text from input images (Huang et al., 2025). This is crucial for interpreting text embedded in visuals such as charts or diagrams.

A schematic tool registry illustrating how such a tool set might be dispatched is sketched below.
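A simple way to organize such a tool set is a name-to-callable registry that parsed tool calls are dispatched against. The sketch below uses hypothetical function names and signatures and is not WebWatcher's actual interface.

```python
from typing import Callable

# Stub tools; real implementations would wrap SerpApi, Jina, a sandboxed interpreter, and an OCR model.
def web_image_search(query: str) -> str: ...
def web_text_search(query: str) -> str: ...
def visit(url: str, goal: str) -> str: ...
def code_interpreter(code: str) -> str: ...
def ocr(image_path: str) -> str: ...

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "web_image_search": web_image_search,
    "web_text_search": web_text_search,
    "visit": visit,
    "code_interpreter": code_interpreter,
    "ocr": ocr,
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Route a parsed tool call to the matching tool and return its observation string."""
    if tool_name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{tool_name}'"
    return str(TOOL_REGISTRY[tool_name](**kwargs))
```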
4.2.2.2. Automated Trajectory Annotation
Given a VQA instance (I, q, a) from BrowseComp-VL, GPT-4o constructs tool-use trajectories simulating step-by-step human reasoning, following the ReAct (Yao et al., 2023) framework. Each trajectory comprises multiple think-act-observe cycles. At each step $t$, the model generates:

- Thought: Intermediate reasoning or a plan, enclosed in think tags.
- Action: A tool invocation wrapped in <tool_call>...</tool_call>, or the final answer in answer tags.
- Observation: The result returned from the environment, within <tool_response>...</tool_response> tags.

The action space consists of discrete tool-use actions plus a Finish action that signals task completion. A trajectory of length $T$ can be written as $\tau = (a_1, o_1, a_2, o_2, \ldots, a_T, o_T)$, where $a_t$ is the action taken at step $t$ and $o_t$ is the observation (environment feedback) returned after executing $a_t$. Each trajectory provides a content-grounded demonstration of planning and tool selection. An example of one serialized step is shown below.
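For illustration, a single think-act-observe step in this tag format might be serialized as follows. The content is invented, and the <think> tag and the JSON-style tool-call arguments are assumptions based on the description; only <tool_call> and <tool_response> appear verbatim in the text above.

```python
# One serialized think-act-observe step (illustrative; contents and some tag names are assumed).
example_step = """
<think>The image shows a snake; I should first search for the species named after this person.</think>
<tool_call>{"name": "web_text_search", "arguments": {"query": "snake species named after James Roy Kinghorn"}}</tool_call>
<tool_response>Top result: a page listing reptile species named after herpetologist J. R. Kinghorn ...</tool_response>
"""
```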
4.2.2.3. Trajectory Filtering and Quality Assurance
A three-stage selection process ensures robust and instructive supervision:
- Final Answer Matching: Only trajectories where the final answer matches the ground truth are retained.
- Step-by-Step Consistency Check:
GPT-4overifies the logical consistency of each intermediate step in . Trajectories with hallucinated content, contradictions, or unjustified tool calls are discarded. This avoids correct answers being reached by chance. - Minimum Tool Usage Requirement: Trajectories with fewer than three tool calls are removed to ensure substantive, process-driven tool interactions and reasoning.
4.2.2.4. Supervised Fine-Tuning (SFT) as Cold Start
After filtering, a dataset $\mathcal{D}$ of high-quality tool-use trajectories is obtained. At each step $t$ of trajectory $\tau_i$, WebWatcher is trained to predict the correct action $a_t$, given the image $I_i$, the question $q_i$, and the previous actions $a_{<t}$ and observations $o_{<t}$. SFT maximizes the log-likelihood of $a_t$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = \sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{T_i} \log \pi_\theta\!\left(a_t \mid I_i, q_i, a_{<t}, o_{<t}\right)$$

Here, $\theta$ are the model parameters, $I_i$ is the image for trajectory $i$, $q_i$ is the question, $a_{<t}$ are the actions before step $t$, $o_{<t}$ are the observations before step $t$, and $T_i$ is the length of trajectory $i$. This cold-start stage teaches the agent effective tool use and structured multi-step reasoning.
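In implementation terms, this cold-start objective is ordinary teacher forcing over the action tokens, conditioned on the image, question, and interleaved history. The sketch below assumes a Hugging Face-style causal LM interface and masks observation tokens out of the loss, which is consistent with (but not copied from) the description above.

```python
import torch

def sft_loss(model, input_ids: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over action tokens only (schematic, not the paper's code).

    input_ids:   (B, T) tokens of the full serialized trajectory (prompt, thoughts/actions, observations).
    action_mask: (B, T) 1 where the token belongs to an action the agent should produce, else 0.
    """
    labels = input_ids.clone()
    labels[action_mask == 0] = -100          # ignore prompt and observation tokens in the loss
    outputs = model(input_ids=input_ids, labels=labels)
    return outputs.loss                      # mean cross-entropy over the unmasked (action) tokens
```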
4.2.2.5. Reinforcement Learning (RL)
With SFT providing cold-start initialization, Group-Relative Policy Optimization (GRPO) (Guo et al., 2025) is applied to refine decision-making.
For a VQA query $q$, the current policy $\pi_\theta$ generates a group of $G$ complete trajectories $\{\tau_i\}_{i=1}^{G}$, each with return $R_i$. The group-relative advantage is defined as:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\!\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{R_j\}_{j=1}^{G}\right)}$$

This normalizes rewards within the group, removing the need for a separate value function. The GRPO objective is defined as a clipped surrogate loss:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\left(r_i(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{old}}}\right)\right]$$
Where:
- $r_i(\theta) = \pi_\theta(\tau_i)/\pi_{\theta_{\mathrm{old}}}(\tau_i)$ is the importance sampling ratio between the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{\mathrm{old}}}$.
- $\hat{A}_i$ is the group-relative advantage for trajectory $\tau_i$.
- $\epsilon$ is the clipping threshold, typically a small positive value (e.g., 0.2), which limits the change in the policy to ensure stable updates.
- $D_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence between the current and previous policies, serving as a penalty that prevents the new policy from deviating too much from the old one, promoting stability.
- $\beta$ is a coefficient controlling the strength of the KL penalty.

This objective promotes stable updates while encouraging exploration of trajectories with higher relative return.
Each trajectory receives a binary format score (1 if all tool calls follow the schema). An LLM grader provides a semantic accuracy score by comparing the final answer with the ground truth. The total reward is a weighted combination of the two, with the accuracy term weighted more heavily to prioritize task completion while maintaining structured tool use. Since the reward is given only at the end of an episode, the group-relative ranking enables effective credit assignment. Rollouts are collected in groups of fixed size $G$ for diversity and computational efficiency.
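The group-relative machinery reduces to a few lines: normalize episode returns within each group of rollouts, apply a PPO-style clipped surrogate with a KL penalty, and mix format and accuracy into a scalar reward. The sketch below is schematic (sequence-level ratios, scalar rewards); the clip value, KL coefficient, and reward weighting are placeholders, not the paper's settings.

```python
import torch

def group_relative_advantages(returns: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """returns: (G,) episode returns for one group of rollouts of the same query."""
    return (returns - returns.mean()) / (returns.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, advantages: torch.Tensor,
              kl: torch.Tensor, clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Clipped surrogate with a KL penalty; returns a loss to minimize (placeholder hyperparameters)."""
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio per rollout
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(torch.min(unclipped, clipped).mean() - beta * kl)

def total_reward(format_ok: bool, accuracy: float, acc_weight: float = 0.9) -> float:
    """Weighted mix of accuracy and format scores; the exact weighting is an assumption."""
    return acc_weight * accuracy + (1.0 - acc_weight) * float(format_ok)
```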
5. Experimental Setup
5.1. Datasets
The experimental setup involves both training data construction and evaluation on several challenging benchmarks.
Training Data Construction
The training data for WebWatcher comes from three sources:
- BrowseComp-VL training set: This includes 110,000 Level-1 and 70,000 Level-2 QA pairs. After VQA conversion and filtering, 60,000 Level-1 and 40,000 Level-2 high-quality examples are retained.
- Long-tail QA pairs converted to VQA: Sampled from training instances with a distribution similar to SimpleVQA, resulting in 4,000 VQA examples.
- Hard VQA samples: Collected from InfoSeek (Chen et al., 2023), VQA v2.0 (Goyal et al., 2017), LogicVista (Xiao et al., 2024), and Encyclopedic VQA (Mensink et al., 2023). Data from Huang et al. (2025) is added to activate OCR. Rejection sampling ensures difficulty (a toy illustration of this idea follows below).

After trajectory generation and filtering, 8,000 high-quality tool-use trajectories are obtained for SFT, with an additional 2,000 samples reserved for GRPO. The final ratio of data sources is 5:3:2 for BrowseComp-VL, long-tail VQA, and hard VQA data, respectively.
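The rejection-sampling step can be read as keeping only examples that a reference model already fails on. The toy sketch below makes this concrete; `answer_fn` and the acceptance rule (fail on all attempts) are assumptions, not the paper's exact procedure.

```python
from typing import Callable

def rejection_sample_hard(examples: list[dict], answer_fn: Callable[[object, str], str],
                          n_tries: int = 4) -> list[dict]:
    """Keep examples the reference model cannot solve in any of `n_tries` attempts (illustrative)."""
    hard = []
    for ex in examples:
        attempts = [answer_fn(ex["image"], ex["question"]) for _ in range(n_tries)]
        if not any(a.strip().lower() == ex["answer"].strip().lower() for a in attempts):
            hard.append(ex)
    return hard
```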
Evaluation Benchmarks
WebWatcher is evaluated on five challenging benchmarks:
- BrowseComp-VL:
  - Source: Proposed in this paper, extending BrowseComp (Wei et al., 2025b).
  - Scale: The evaluation set consists of 100 instances from Level 1 and 200 instances from Level 2, totaling 300 instances. All examples are manually verified by PhD-level AI experts.
  - Characteristics: Designed for in-depth multimodal reasoning and strategic planning. Queries are long, entity-obfuscated, and require multi-page browsing, fine-grained visual grounding, and complex information retrieval across both visual and textual information.
  - Domain: Not explicitly stated, but inferred to cover a broad range based on the training data categories.
- Humanity's Last Exam (HLE) (Phan et al., 2025):
  - Source: An existing benchmark.
  - Scale: Originally 2,500 expert-written questions. WebWatcher evaluates on a subset of 330 multimodal questions.
  - Characteristics: Questions go beyond simple retrieval, requiring models to synthesize evidence from obscure or fragmented sources and reason through abstract academic problems. Multimodal questions assess visual-textual reasoning.
  - Domain: Diverse academic fields such as science, engineering, and the humanities (e.g., Biology, Chemistry, Computer Science/AI, Engineering, Humanities, Math, Physics, Other).
- LiveVQA (Fu et al., 2025):
  - Source: An existing benchmark.
  - Scale: 3,602 multi-hop VQA instances. WebWatcher evaluates on a 300-example subset.
  - Characteristics: Evaluates a model's ability to answer questions grounded in up-to-date visual knowledge, often from recent global news. Requires multi-hop reasoning.
  - Domain: Recent global news across six sources and fourteen topics.
- SimpleVQA (Cheng et al., 2025):
  - Source: An existing benchmark.
  - Scale: 2,025 examples in both English and Chinese. WebWatcher evaluates on 300 examples randomly sampled from the 1,013 English QA pairs.
  - Characteristics: A factual VQA benchmark combining curated image-question pairs from recent VQA datasets and expert-annotated web images. Focuses more on visual reasoning than on external knowledge.
  - Domain: General factual knowledge related to images.
- MMSearch (Jiang et al., 2024):
  - Source: An existing benchmark.
  - Scale: 300 manually curated examples. WebWatcher uses the 171-example visual subset for evaluation.
  - Characteristics: Examples cover both recent news and rare knowledge, requiring search capabilities.
  - Domain: 14 subdomains including recent news and rare knowledge.
5.2. Evaluation Metrics
The primary evaluation metric used is pass@k (Chen et al., 2021) with LLM-as-Judges (Liu et al., 2024) for correctness scoring.
- Conceptual Definition of pass@k: pass@k is a metric used to evaluate the success rate of generative models, particularly in tasks where multiple attempts might be made to find a correct solution. It measures the probability that at least one of $k$ independently generated solutions is correct. If a model generates $k$ candidate solutions and any one of them is correct, the attempt is considered a success. This metric is useful for evaluating agents that can perform multiple rollouts or search for a solution through several tries.
- Mathematical Formula for pass@k: The paper specifies that pass@1 is computed as:

  $$\text{pass@1} = \frac{1}{N}\sum_{i=1}^{N} c_i$$

  Where:
  - $N$ is the total number of evaluation instances.
  - $c_i$ is the binary correctness (1 for correct, 0 for incorrect) of the $i$-th prediction.

  For a general pass@k, the formula is often derived from the probability of failure. A practical, unbiased way to calculate pass@k, given $n$ samples per problem of which $c$ are correct, is (Chen et al., 2021):

  $$\text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

  The paper's description of pass@k in the "Pass@k Analysis on HLE" section implies the proportion of problems for which at least one of $k$ generated solutions is correct: the generation process is repeated $k$ times per problem and the problem counts as solved if any attempt passes. The pass@1 formula above is simply the average accuracy of single attempts.
- Symbol Explanation:
  - $N$: The total number of evaluation instances (questions/problems).
  - $c_i$: A binary indicator for the $i$-th prediction, equal to 1 if the prediction is correct and 0 if it is incorrect.
  - $n$, $c$: The number of samples generated per problem and the number of those samples that are correct.
  - $k$: The number of independent generations or attempts made for each problem.

  A reference implementation of this estimator is sketched below.
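For reference, the standard unbiased pass@k estimator from Chen et al. (2021) can be computed as below; pass@1 reduces to the simple average accuracy reported in the paper. The function names are ours, but the formula is the cited estimator.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0                      # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(per_problem_counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each tuple is (n_samples, n_correct) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)
```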
LLM-as-Judges
- Conceptual Definition: This evaluation approach leverages a powerful Large Language Model (GPT-4o in this paper) to act as an automated judge for assessing the correctness and quality of generated answers. Instead of human annotators, an LLM is prompted with the question, the model's response, and often a ground truth answer, and is then asked to rate or provide a binary correctness judgment. This method aims to automate and scale up evaluation, especially for open-ended generative tasks where traditional exact-match metrics are insufficient.
- Details: The paper mentions using the LLM-as-Judges approach (Liu et al., 2024) and provides the prompt used for Response Accuracy Evaluation in Appendix F.5. This prompt asks the judge LLM to determine whether a given response correctly answers the question based on a correct_answer. It extracts a final answer, provides reasoning, and outputs a binary correct (yes/no) verdict. A generic sketch of this kind of judging call follows.
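Operationally, LLM-as-Judges amounts to one structured call per prediction. The sketch below is a generic illustration; the prompt wording and the `call_judge_llm` helper are hypothetical and do not reproduce the paper's Appendix F.5 prompt.

```python
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Correct answer: {correct_answer}
Model response: {response}
Reply with exactly 'yes' if the response answers the question correctly, otherwise 'no'."""

def call_judge_llm(prompt: str) -> str:
    """Hypothetical wrapper around a judge model such as GPT-4o."""
    raise NotImplementedError

def judge_correct(question: str, correct_answer: str, response: str) -> bool:
    """Binary correctness verdict from the judge LLM."""
    verdict = call_judge_llm(JUDGE_PROMPT.format(
        question=question, correct_answer=correct_answer, response=response))
    return verdict.strip().lower().startswith("yes")
```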
5.3. Baselines
The paper compares WebWatcher against several categories of baselines:
- Direct Inference: Powerful Multimodal Large Language Models (MLLMs) that directly generate answers using their internal knowledge without explicit tool use:
  - GPT-4o (OpenAI, 2024)
  - Gemini-2.5-flash (DeepMind, 2025)
  - Claude-3.7-Sonnet (Anthropic, 2025)
  - Qwen-2.5-VL family (7B/32B/72B) (Bai et al., 2025)
- Prompt Workflow: The same models driven by prompt-based workflows and equipped with the same tools as WebWatcher. This setup evaluates the impact of WebWatcher's training methodology beyond mere tool availability:
  - GPT-4o
  - Gemini-2.5-flash
  - Claude-3.7-Sonnet
  - Qwen-2.5-VL family (7B/32B/72B)
- Reasoning Baselines: Models specifically designed for multi-step reasoning, either as agents or as large LLMs with reasoning capabilities:
  - o4-mini (OpenAI, 2025b): An OpenAI model mentioned in the context of reasoning.
  - Gemini-2.5-Pro (DeepMind, 2025): A powerful Gemini model from Google DeepMind, likely employed with prompt-driven workflows for reasoning.
  - OmniSearch (GPT-4o) (Li et al., 2025d): An open-source, search-oriented agent based on GPT-4o.

These baselines represent a comprehensive comparison across state-of-the-art MLLMs, models utilizing tools via prompting, and dedicated reasoning agents, allowing for a thorough assessment of WebWatcher's innovations.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate WebWatcher's strong performance across various challenging VQA benchmarks, often outperforming or matching proprietary baselines and open-source agents.
Humanity's Last Exam (HLE) Results
On HLE, which requires complex multimodal search, computation, and reasoning, models relying on direct inference perform poorly, with average accuracy scores below 10%. This highlights the limitations of vanilla MLLMs when faced with knowledge-intensive VQA that demands external tool use and multi-step reasoning. RAG-based methods (Prompt Workflow in the table) show moderate improvements, particularly in Chemistry, indicating that external information retrieval helps.
WebWatcher-32B, despite being a smaller model (32B parameters) compared to some proprietary baselines, achieves a competitive overall average accuracy of 13.6%. It particularly excels in specific domains, scoring 33.8% in Biology and showing strong performance in Mathematics and Humanities. This suggests its training and tool-use integration are effective for domain-specific, complex reasoning tasks. While o4-mini and Gemini-2.5-Pro achieve slightly higher overall scores (16.0% and 15.8% respectively), WebWatcher-32B demonstrates parameter efficiency for comparable performance.
Other Challenging Benchmarks
On BrowseComp-VL, LiveVQA, MMSearch, and SimpleVQA, WebWatcher consistently outperforms both direct inference and prompt workflow baselines.
- BrowseComp-VL: This benchmark is highly challenging, requiring multi-page browsing and fine-grained visual grounding. Most baselines score below 20%. WebWatcher-32B achieves 27.0%, significantly outperforming all baselines and its smaller WebWatcher-7B counterpart, validating the effectiveness of its dynamic tool-use loop and training for this complex task.
- LiveVQA: WebWatcher-32B achieves a state-of-the-art result of 58.7%, indicating its strong ability to handle questions grounded in up-to-date visual knowledge.
- MMSearch: WebWatcher-32B also achieves a state-of-the-art result of 55.3%, showcasing its effectiveness in multimodal search scenarios.
- SimpleVQA: Even on SimpleVQA, which emphasizes visual reasoning over external knowledge (a perception-oriented benchmark), WebWatcher-32B performs well with a score of 59.0%. This demonstrates broad applicability beyond knowledge-intensive tasks and suggests its visual understanding component is robust.

These results collectively confirm that WebWatcher excels in tasks requiring knowledge-intensive reasoning and multimodal interaction while maintaining strong visual reasoning capabilities.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper: Table 1: Main results on HLE. All accuracy scores are reported as percentages. Avg signifies the average accuracy score of three inference runs across different subtopics.
| Backbone | Bio. | Chem. | CS/AI | Engineer. | Human. | Math | Physics | Other | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Direct Inference | | | | | | | | | |
| GPT-4o | 13.8 | 0.0 | 0.0 | 3.9 | 12.0 | 6.8 | 7.1 | 7.0 | 6.5 |
| Gemini-2.5-flash | 12.1 | 1.6 | 0.0 | 0.0 | 4.0 | 0.0 | 14.3 | 0.0 | 4.9 |
| Claude-3.7-Sonnet | 1.7 | 4.8 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 12.3 | 2.8 |
| Qwen-2.5-VL-7B | 3.4 | 3.2 | 7.1 | 0.0 | 4.0 | 2.3 | 7.1 | 0.0 | 2.6 |
| Qwen-2.5-VL-32B | 3.4 | 6.5 | 0.0 | 3.9 | 8.0 | 2.3 | 7.1 | 0.0 | 3.7 |
| Qwen-2.5-VL-72B | 3.4 | 8.0 | 0.0 | 5.9 | 8.0 | 0.0 | 0.0 | 7.0 | 4.9 |
| Prompt Workflow | | | | | | | | | |
| GPT-4o | 9.8 | 24.1 | 4.8 | 0.0 | 2.0 | 4.0 | 9.1 | 14.3 | 12.3 |
| Gemini-2.5-flash | 25.9 | 3.2 | 7.1 | 0.0 | 8.0 | 9.1 | 3.5 | 14.0 | 11.4 |
| Claude-3.7-Sonnet | 4.3 | 5.2 | 4.8 | 0.0 | 0.0 | 0.0 | 9.1 | 14.3 | 3.5 |
| Qwen-2.5-VL-7B | 4.3 | 6.9 | 3.2 | 7.1 | 0.0 | 4.0 | 4.5 | 7.1 | 5.3 |
| Qwen-2.5-VL-32B | 5.2 | 10.3 | 3.2 | 7.1 | 0.0 | 0.0 | 4.5 | 7.1 | 8.8 |
| Qwen-2.5-VL-72B | 15.8 | 10.3 | 8.1 | 0.0 | 2.0 | 8.0 | 6.8 | 14.3 | 8.6 |
| Reasoning Model | | | | | | | | | |
| o4-mini | 12.1 | 23.7 | 17.7 | 0.0 | 5.8 | 0.0 | 33.3 | 21.4 | 16.0 |
| Gemini-2.5-Pro | 23.7 | 17.7 | 13.3 | 11.5 | 8.0 | 13.3 | 14.3 | 15.5 | 15.8 |
| Open Source Agents | | | | | | | | | |
| OmniSearch (GPT-4o) | 15.5 | 8.2 | 0.0 | 2.2 | 8.0 | 6.8 | 21.4 | 12.1 | 9.3 |
| WebWatcher-7B | 18.6 | 6.5 | 6.7 | 7.7 | 4.0 | 6.7 | 7.1 | 17.2 | 10.6 |
| WebWatcher-32B | 33.8 | 9.7 | 0.0 | 5.8 | 8.0 | 8.9 | 14.3 | 13.8 | 13.6 |
The following are the results from Table 2 of the original paper: Table 2: Main results on four challenging benchmarks. All accuracy scores are reported as percentages. Avg signifies the average score of three inference across two difficult levels.
| Backbone | BC-VL Level 1 | BC-VL Level 2 | BC-VL Avg. | LiveVQA | MMSearch | SimpleVQA |
|---|---|---|---|---|---|---|
| Direct Inference | | | | | | |
| GPT-4o | 6.4 | 4.0 | 5.5 | 29.7 | 18.7 | 47.0 |
| Gemini-2.5-flash | 11.6 | 6.0 | 9.6 | 35.0 | 19.6 | 63.0 |
| Claude-3.7-Sonnet | 8.8 | 4.0 | 7.1 | 23.7 | 12.3 | 42.7 |
| Qwen-2.5-VL-7B | 0.8 | 0.0 | 0.5 | 22.7 | 4.09 | 30.7 |
| Qwen-2.5-VL-32B | 3.2 | 1.0 | 2.4 | 26.3 | 7.60 | 40.7 |
| Qwen-2.5-VL-72B | 9.2 | 3.0 | 7.1 | 30.3 | 11.7 | 51.3 |
| Prompt Workflow | | | | | | |
| GPT-4o | 16.8 | 7.0 | 13.4 | 34.0 | 24.1 | 61.6 |
| Gemini-2.5-flash | 15.2 | 9.0 | 13.0 | 41.3 | 43.9 | 68.6 |
| Claude-3.7-Sonnet | 13.9 | 6.0 | 11.2 | 30.3 | 32.7 | 59.3 |
| Qwen-2.5-VL-7B | 3.6 | 1.0 | 2.7 | 21.7 | 9.94 | 21.0 |
| Qwen-2.5-VL-32B | 9.4 | 3.0 | 7.2 | 30.5 | 17.5 | 44.6 |
| Qwen-2.5-VL-72B | 14.4 | 6.0 | 11.5 | 35.7 | 29.2 | 58.6 |
| Agents | | | | | | |
| OmniSearch (GPT-4o) | 19.7 | 10.0 | 16.3 | 40.9 | 49.7 | 63.0 |
| WebWatcher-7B | 23.6 | 17.0 | 21.2 | 51.2 | 49.1 | 54.3 |
| WebWatcher-32B | 28.4 | 25.0 | 27.0 | 58.7 | 55.3 | 59.0 |
6.3. Ablation Studies / Parameter Analysis
The paper conducts several analyses to understand the components and behavior of WebWatcher.
6.3.1. Number of Tool Calls
This analysis (Figure 4) examines how WebWatcher adapts its tool usage to the specific demands of different benchmarks.
- HLE: Shows balanced usage across Web Text Search, Web Image Search, and Code Interpreter, with Visit used for navigation. This reflects HLE's requirement for multimodal search, computation, and complex reasoning.
- BrowseComp-VL and MMSearch: For these benchmarks, which focus heavily on information seeking and reasoning, Web Text Search dominates, accounting for 62% of calls; other tools play minor roles. This highlights the agent's ability to prioritize text-based retrieval when problems are primarily information-gathering.
- SimpleVQA: The focus shifts to visual content, with Web Image Search making up one third or more of calls, while Text Search and Visit act as auxiliaries. This indicates that WebWatcher correctly identifies the visual nature of SimpleVQA tasks.
- Code Interpreter: Used only when actual computation is required, confirming that WebWatcher is cost- and context-aware in its tool selection.

Overall, the distribution of tool usage mirrors benchmark demands, underscoring WebWatcher's flexibility in composing tool chains rather than over-relying on any single tool.
The following figure (Figure 4 from the original paper) shows the percentage of external tool calls in the four benchmarks:
The image is a bar chart showing the proportion of four action types (text search, image search, code use, and page visits) used by WebWatcher on five multimodal datasets (HLE, BC-VL, MMSearch, LiveVQA, and SimpleVQA) and in aggregate, reflecting how tool usage frequency differs across tasks.
6.3.2. Cold Start for RL Training
This analysis (Figure 5) verifies the crucial role of supervised fine-tuning (SFT) as a "cold start" for Reinforcement Learning (RL) training in WebWatcher. The authors compare two initializations for the same RL algorithm (GRPO):
- Instruct: Warm-started only with public instruction-following data.
- Cold-start: Includes an extra SFT stage on high-quality trajectories that explicitly demonstrate tool use and step-by-step visual reasoning.

The two initializations behave very differently:
- Instruct initialization: The Instruct initialization stalls near zero on all three benchmarks (HLE, BC-VL, LiveVQA). This is attributed to frequent tool-call format errors and the strict Qwen-2.5-72B grader suppressing partial answers. Without proper initial guidance on structured tool use, the RL agent struggles to receive meaningful rewards, leading to a breakdown in learning.
- Cold-start initialization: In contrast, the cold-start SFT lifts initial scores significantly. Subsequently, GRPO trends diverge:
  - HLE and BC-VL oscillate without improvement, suggesting that for these highly complex benchmarks, GRPO on its own may need further refinement or larger model capacity to build effectively on the SFT foundation.
  - LiveVQA rises steadily, maintaining a 0.06-0.18 margin over Instruct. This shows that for certain tasks, GRPO effectively refines the SFT-initialized policy.

The analysis concludes that reasoning traces (such as Chain-of-Thought from a larger reasoner) cannot replace an SFT cold start under strict RL settings: injecting them into a smaller model led to instability, format violations, repetitions, and context overflow. This confirms the necessity of explicit SFT for robust tool-augmented RL training.
The following figure (Figure 5 from the original paper) shows the performance comparison using cold start in RL training on three benchmarks:
The image consists of three line charts comparing the Cold-start and Instruct training setups for WebWatcher on the HLE, BC-VL, and LiveVQA benchmarks, with training steps on the x-axis and score on the y-axis.
6.3.3. Pass@k Analysis on HLE
This analysis (Figure 6) investigates the performance of WebWatcher on HLE as the number of attempts $k$ increases, using the pass@k metric.

- Single attempt (k = 1): WebWatcher achieves a 13.6% pass rate.
- Initial steep rise: Performance rises steeply with only a few attempts; three roll-outs (k = 3) reach 20.3%. This indicates that even a small number of diverse trajectories generated by the agent can yield large gains in success probability.
- Continued improvement: Accuracy continues to improve at larger k, reaching 35.7% and ultimately 41.9% within the evaluated range. This roughly triples the single-shot inference performance and surpasses reasoning models like Gemini-2.5-Pro and o4-mini (which have pass@1 scores of 15.8% and 16.0%, respectively).
- Diminishing returns: Marginal gains taper at larger k, suggesting that practitioners can cap roll-outs at 8-16 for a significant boost (2-3x) at moderate computational cost.

The smooth curve suggests that de-correlated sampling (generating diverse rollouts) avoids redundant solutions and captures complementary knowledge. This analysis demonstrates the scalability of the agentic paradigm, where systematic exploration of reasoning paths leads to consistent and robust improvements on challenging multimodal benchmarks.
The following figure (Figure 6 from the original paper) shows the Pass@k curve of WebWatcher on HLE for k ranging from 1 to 32:
Note: Figure 6 from the original paper is not provided in the image assets, but the text describes its content, which is primarily a line graph showing Pass@k performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces WebWatcher, a pioneering multimodal deep research agent that effectively integrates complex vision-language reasoning with multi-tool interaction. The authors propose BrowseComp-VL, a novel and challenging benchmark specifically designed for in-depth multimodal reasoning and strategic planning. A scalable pipeline is presented to transform complex textual QA examples into VQA items, generating high-quality training data. Furthermore, an automated trajectory generation pipeline, grounded in action-observation traces, is developed, followed by supervised fine-tuning (SFT) and Group-Relative Policy Optimization (GRPO) to train the agent. Experimental results demonstrate that WebWatcher achieves strong performance across multiple high-difficulty benchmarks (HLE, LiveVQA, BrowseComp-VL, and MMSearch), outperforming both open-source and proprietary research agents. It also delivers competitive results on the perception-oriented SimpleVQA benchmark. WebWatcher establishes a robust foundation for future multimodal deep research agents capable of autonomous, flexible, and deeply reasoned problem-solving in real-world scenarios.
7.2. Limitations & Future Work
The authors do not explicitly list "Limitations" as a separate section. However, the analysis of their results implicitly points to some areas:
- Computational Cost: While WebWatcher-32B is parameter-efficient compared to larger proprietary models, RL training, especially with multiple rollouts, can be computationally intensive. The Pass@k analysis suggests diminishing returns after 16 rollouts, implying a trade-off between performance gain and computational cost.
- Stability of RL: The cold-start analysis shows that GRPO sometimes oscillates without improvement on highly complex benchmarks like HLE and BC-VL, even with an SFT cold start. This indicates that further research may be needed to improve the stability and effectiveness of RL for such intricate multimodal tasks, especially for smaller models.
- Reliance on External LLMs for Data Generation/Filtering: The methodology relies heavily on GPT-4o for QA pair generation, VQA conversion, trajectory annotation, and quality control. This introduces a dependency on the capabilities and biases of these external LLMs, which might limit diversity or bake specific reasoning styles into the training data.

As for future work, the paper states that WebWatcher "paves the way for solving complex multimodal information-seeking tasks" and "establishes a strong foundation for future multimodal deep research agents." Implicitly, this suggests future work could involve:
- Improving RL Stability and Efficiency: Enhancing RL algorithms to perform more robustly on highly complex multimodal tasks and to scale more efficiently.
- Reducing Reliance on Proprietary LLMs: Exploring methods for generating high-quality data and trajectories with open-source models or alternative data synthesis techniques to improve reproducibility and reduce external dependencies.
- Expanding Tool Capabilities: Integrating a wider array of specialized tools or enabling the agent to learn to use new tools dynamically.
- Long-Horizon Multimodal Tasks: Applying WebWatcher to even more complex, long-horizon real-world problems that require extended multi-step planning and cross-modal reasoning over longer periods.
- Human-Agent Collaboration: Exploring ways to integrate human feedback and guidance more seamlessly into the agent's learning and decision-making process for multimodal deep research.
7.3. Personal Insights & Critique
This paper presents a significant step forward in the domain of multimodal deep research agents. The core innovation lies not just in adding visual capabilities to a web agent, but in meticulously building the necessary infrastructure: a novel data generation pipeline, a robust training methodology (SFT + GRPO), and a challenging benchmark (BrowseComp-VL) that truly tests integrated vision-language reasoning.
One key inspiration is the rigorous approach to data generation and quality control. The multi-stage QA-to-VQA conversion, entity masking, and Selector/Examiner filtering steps are crucial for creating a high-quality dataset that drives deep reasoning rather than shallow pattern matching. This highlights that for advanced agent capabilities, simply collecting raw multimodal data is insufficient; structured, reasoning-intensive data curation is paramount.
The emphasis on the SFT cold start for RL training is another critical insight. It debunks the idea that RL alone can learn complex tool-use from scratch when dealing with strict formatting requirements and multi-step reasoning. The SFT phase provides the necessary scaffolding, allowing RL to then refine and generalize the learned behaviors. This suggests a powerful hybrid training paradigm for future complex agent development.
A potential issue or area for improvement could be the generalizability of the LLM-as-Judges method for evaluation. While GPT-4o is powerful, its judgments might still contain biases or miss subtle nuances that human experts would catch, especially in abstract academic problems (like HLE). The reliability of LLM graders, though common, is an ongoing research area. Further, the computational cost of generating rollouts for pass@k evaluation and the dependence on expensive proprietary LLMs for data generation and grading might pose barriers for broader research and open-source development.
The methods and conclusions could be transferred to other domains requiring complex multimodal interpretation and action, such as scientific discovery (e.g., analyzing experimental images alongside textual protocols), medical diagnosis (interpreting scans with patient histories), or even advanced robotics (visual navigation and task execution based on textual commands). WebWatcher's structured approach to integrating visual perception and tool-augmented reasoning provides a strong blueprint for building more capable and versatile AI agents.