WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
TL;DR Summary
WebWatcher is a multimodal deep research agent that strengthens visual-language reasoning through synthetic trajectories and reinforcement learning. It is validated on the new BrowseComp-VL benchmark for complex visual-textual retrieval tasks, where it surpasses existing baselines.
Abstract
Under review as a conference paper at ICLR 2026. WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent. Anonymous authors, paper under double-blind review.

Web agents such as deep research have demonstrated superhuman cognitive abilities, capable of solving highly challenging information-seeking problems. However, most research remains largely text-centric, overlooking visual information in the real world. This makes multimodal deep research highly challenging, as such agents require much stronger perceptual, logical, and knowledge-based reasoning abilities, as well as proficiency in more sophisticated tools. To address this limitation, we introduce WebWatcher, a multimodal agent for deep research with enhanced visual-language reasoning capabilities. It uses high-quality synthetic trajectories for efficient cold start training, utilizes various tools for deep reasoning, and further enhances generalization through reinforcement learning. […]
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
WebWatcher: Breaking New Frontiers of Vision-Language Deep Research Agent
1.2. Authors
The authors are listed as "Anonymous authors," indicating the paper is under double-blind review. Their affiliations are not disclosed in the provided text.
1.3. Journal/Conference
The paper is hosted on OpenReview, a platform commonly used for conference submissions (e.g., ICLR, NeurIPS) under double-blind review. The first page indicates the submission is under review as a conference paper at ICLR 2026; OpenReview is a reputable platform for disseminating machine learning research. The publication status is "Paper under double-blind review."
1.4. Publication Year
2025 (OpenReview record dated 2025-10-08, UTC).
1.5. Abstract
The paper introduces WebWatcher, a novel multimodal deep research agent designed to overcome the text-centric limitations of most existing web agents. WebWatcher integrates enhanced visual-language reasoning capabilities through the use of high-quality synthetic trajectories for efficient cold start training, diverse tools for deep reasoning, and reinforcement learning for improved generalization. To evaluate such agents, the authors propose BrowseComp-VL, a new benchmark styled after BrowseComp that demands complex information retrieval combining visual and textual data. Experimental results demonstrate that WebWatcher either outperforms or matches proprietary baselines, Retrieval-Augmented Generation (RAG) workflows, and open-source agents across four challenging Visual Question Answering (VQA) benchmarks, thereby paving the way for solving intricate multimodal information-seeking tasks.
1.6. Original Source Link
Official Source: https://openreview.net/forum?id=8jsaazdAb3
PDF Link: https://openreview.net/pdf?id=8jsaazdAb3
Publication Status: The paper is currently "under double-blind review" at OpenReview.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the text-centric limitation of current deep research web agents. While these agents, often powered by Large Language Models (LLMs), have shown superhuman abilities in solving complex information-seeking problems, they largely overlook the vast amount of visual information present in the real world. This makes multimodal deep research – tasks requiring reasoning across both visual and textual data – exceptionally challenging.
This problem is important because many real-world scenarios, such as interpreting scientific diagrams, analyzing charts, or navigating visual web interfaces, inherently demand joint vision-language reasoning. Existing Vision-Language (VL) agents often fall short by relying on template-driven pipelines, limiting their flexible reasoning, planning ability, and versatile tool use. Some VL agents focus primarily on image-based perception with visual tools but struggle to integrate this with deep textual understanding and cross-modal inference. Conversely, search-only agents have a limited problem-solving scope, failing when answers are implicit, require interaction, or demand additional computation.
The paper's entry point or innovative idea is to introduce WebWatcher, an agent that directly addresses this gap by combining strong reasoning abilities across both textual and visual information with the effective use of multiple external tools. It focuses on generating high-quality training data that combines complex visual content with multi-step reasoning and then trains the agent through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL).
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introducing WebWatcher: A novel multimodal agent for deep research that enhances visual-language reasoning capabilities. It is designed to integrate various tools for deep reasoning and to generalize through reinforcement learning.
- Developing a Scalable Data Generation Pipeline: This pipeline creates high-quality synthetic trajectories for efficient cold start training. It transforms complex textual Question Answering (QA) pairs into Visual Question Answering (VQA) items, incorporating multi-hop, knowledge-intensive queries grounded in authentic web images, and includes a multi-stage filtering process for quality control.
- Automated Trajectory Generation and Post-Training: The paper proposes an automated pipeline to build tool-use trajectories from action-observation sequences via prompting, followed by Supervised Fine-Tuning (SFT) and Group-Relative Policy Optimization (GRPO) to optimize tool use and decision-making.
- Proposing BrowseComp-VL: A challenging new VQA benchmark that extends BrowseComp into the visual domain, requiring complex information retrieval involving both visual and textual information, cross-modal reasoning, and high-level planning.

The key conclusions or findings reached by the paper are:
- WebWatcher consistently outperforms or matches proprietary baselines, RAG workflows, and open-source agents across four challenging VQA benchmarks (HLE, LiveVQA, BrowseComp-VL, and MMSearch).
- It demonstrates competitive performance even on perception-oriented benchmarks like SimpleVQA, indicating broad applicability.
- The tool usage analysis shows that WebWatcher flexibly composes tool chains based on benchmark demands, rather than over-relying on any single tool, showcasing its adaptability.
- The cold start SFT is crucial for stable and effective reinforcement learning in multimodal agent training, preventing initial instability and ensuring meaningful credit assignment.
- The Pass@k analysis confirms the scalability of the agentic paradigm, where systematic exploration of reasoning paths leads to consistent and robust performance improvements.

These findings address the problem of limited multimodal capabilities in deep research agents, offering a robust framework for agents to effectively interact with and reason over both visual and textual information in complex real-world scenarios.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a reader should be familiar with several core concepts in artificial intelligence, particularly in the areas of natural language processing, computer vision, and reinforcement learning.
- Large Language Models (LLMs): These are advanced artificial intelligence models, often based on the transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They can perform a wide range of tasks, from question answering to code generation. Examples include GPT-4o and Gemini.
- Deep Research Agents (or Web Agents): These are LLM-powered systems designed to autonomously perform complex information-seeking tasks on the web. They go beyond single-turn interactions by planning multi-step actions, using tools (like search engines or code interpreters), and synthesizing information from multiple sources to answer challenging questions or complete intricate tasks. The paper refers to "deep research" as a specific type of web agent focused on comprehensive information gathering and synthesis.
- Multimodal AI: This refers to AI systems that can process and reason over information from multiple modalities, such as text, images, audio, and video. In this paper's context, multimodal primarily refers to vision-language, meaning the ability to understand and integrate both visual (images) and textual (language) information.
- Visual Question Answering (VQA): A task in multimodal AI where a model receives an image and a natural language question about that image, and it must provide a natural language answer. VQA challenges models to perform both visual recognition and language understanding, often requiring reasoning to combine information from both modalities.
- ReAct Framework: Short for "Reasoning and Acting," ReAct is a general paradigm for LLM agents that interleaves Thought, Action, and Observation steps (a minimal loop sketch follows this list):
  - A Thought (or Think) step involves the LLM generating a reasoning trace to decide the next action.
  - An Action (or tool_call) step involves the LLM calling an external tool (e.g., search engine, code interpreter) based on its Thought.
  - An Observation (or tool_response) step involves the environment returning the result of the tool's action, which the LLM then uses to inform its next Thought.
  This cyclical process allows LLMs to perform complex, multi-step tasks by breaking them down into manageable sub-problems and leveraging external knowledge or computation.
- Supervised Fine-Tuning (SFT): A common technique to adapt a pre-trained LLM to a specific task or domain. It involves training the LLM on a dataset of input-output pairs (trajectories in this case) where the desired behavior is explicitly demonstrated. The model learns to mimic this behavior by minimizing a loss function (e.g., cross-entropy) on the labeled data. In WebWatcher, SFT serves as a "cold start" to teach the agent basic tool-augmented reasoning.
- Reinforcement Learning (RL): A paradigm of machine learning where an agent learns to make decisions by interacting with an environment. The agent receives rewards or penalties for its actions, and its goal is to learn a policy that maximizes the cumulative reward over time. RL is particularly useful for tasks that involve sequential decision-making and where explicit demonstrations for all possible scenarios are hard to provide.
- Group-Relative Policy Optimization (GRPO): An RL algorithm mentioned in the paper, which is a variant of policy gradient methods. GRPO refines decision-making by normalizing rewards within a group of generated trajectories. This "group-relative advantage" helps to stabilize training and encourages exploration of trajectories that yield higher rewards compared to others in the same group, without relying on a separate value function (which can be hard to estimate). It is designed to promote stable updates while encouraging exploration of trajectories with higher relative return.
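To make the Thought-Action-Observation cycle concrete, the following is a minimal ReAct-style loop in Python. It is a sketch under stated assumptions: `call_llm`, the `TOOLS` registry, and the "Action: tool[argument]" / "Final Answer:" conventions are hypothetical placeholders, not APIs or formats taken from the paper.

```python
def call_llm(prompt: str) -> str:
    """Hypothetical wrapper around a (vision-)language model; returns the next Thought/Action text."""
    raise NotImplementedError

# Toy tool registry; a real agent would call a search API, a code interpreter, etc.
TOOLS = {
    "web_text_search": lambda query: f"<text search results for: {query}>",
    "web_image_search": lambda query: f"<image search results for: {query}>",
}

def react_loop(question: str, max_steps: int = 8) -> str:
    """Interleave Thought, Action, and Observation until the model emits a final answer."""
    context = f"Question: {question}\n"
    for _ in range(max_steps):
        step = call_llm(context)                 # model emits a Thought plus an Action
        context += step + "\n"
        if "Final Answer:" in step:              # the model decided it has enough information
            return step.split("Final Answer:")[-1].strip()
        if "Action:" in step:                    # naive parsing of "Action: tool[argument]"
            action = step.split("Action:")[-1].strip()
            name, _, arg = action.partition("[")
            tool = TOOLS.get(name.strip(), lambda a: "unknown tool")
            observation = tool(arg.rstrip("]"))
            context += f"Observation: {observation}\n"   # fed back to guide the next Thought
    return "No answer produced within the step budget."
```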
3.2. Previous Works
The paper contextualizes its contributions by referencing several prior works in deep research agents and multimodal VQA benchmarks.
Deep Research Agents
- Proprietary Solutions (e.g., DeepResearch, Gemini Deep Research): These agents (from OpenAI, Google, Perplexity) demonstrate near-expert performance in fact-finding and analysis. However, their internal architectures and data pipelines are typically secret, hindering research replication and in-depth analysis.
- WebDancer (Wu et al., 2025a): This open-source agent focuses on curriculum-driven SFT over ReAct traces. It teaches agents to use tools through structured demonstrations.
- WebThinker (Li et al., 2025c): Augments SFT with policy-gradient refinement, meaning it uses RL techniques to further improve the agent's decision-making beyond what SFT can achieve.
- R1-Searcher (Song et al., 2025): Leverages self-play to learn tree-structured exploration policies, allowing the agent to explore different search paths more effectively.
- WebSailor (Li et al., 2025a): Focuses on uncertainty reduction using structured task obfuscation, an RFT cold-start, and the DUPO algorithm to handle ambiguous queries.
- WebShaper (Tao et al., 2025b): Proposes a formalization-driven data-synthesis pipeline by introducing Knowledge Projections and an agentic Expander.
- OmniSearch (Li et al., 2025d): A search-oriented open-source agent based on GPT-4o, used as a baseline in WebWatcher's experiments.
Crucial Background Information for Understanding Agent Frameworks:
Many of these agents, including WebWatcher, rely on the ReAct framework or its variations. As explained above, ReAct models generate thoughts and actions in an interleaved manner. This is a significant departure from earlier LLM interactions that were purely prompt-response.
The core idea behind these frameworks is to enable LLMs to:
- Reason: Generate internal thoughts to plan and strategize.
- Act: Utilize external tools to gather information or perform computations.
- Observe: Process the results from tool actions to update their internal state and guide subsequent reasoning.
Multimodal VQA Benchmarks
- Single-step Perception/Shallow Retrieval Benchmarks (e.g., OK-VQA, A-OKVQA): These older benchmarks typically emphasize static knowledge grounding and heuristic answer prediction. They often require models to answer questions based on a single image and some external knowledge, without extensive multi-step reasoning or tool use.
- MMT-Bench (Ying et al., 2024): Offers large-scale coverage of planning-oriented tasks across multiple domains but uses a multiple-choice format, which restricts the assessment of procedural reasoning and rich textual outputs.
- MicroVQA (Burgess et al., 2025) and Open3DVQA (Zhang et al., 2025): Explore domain-specific and spatial reasoning, respectively, but are often constrained by limited scale, manual curation, or a lack of complex planning structures.
- Dyn-VQA (Li et al., 2025d; Chen et al., 2025): Introduces adaptive query tasks but remains narrow in its multimodal scope and size.
- MMMU-Pro (Yue et al., 2024), MMSearch-Plus (Tao et al., 2025a), MM-BrowseComp (Li et al., 2025b): More recent benchmarks exploring the performance limits of current MLLMs on domain-specific and difficult information-seeking tasks. MMSearch is used as an evaluation benchmark for WebWatcher.
- BrowseComp (Wei et al., 2025a, 2025b): A benchmark for browsing agents that emphasizes underspecified and difficult queries requiring retrieval of scattered information and integration of fragmented clues. WebWatcher extends this to the visual domain with BrowseComp-VL.
- Humanity's Last Exam (HLE) (Phan et al., 2025): A challenging benchmark with expert-written questions across diverse academic fields, requiring synthesis of evidence from obscure sources and reasoning through abstract problems. WebWatcher evaluates on a multimodal subset of HLE.
3.3. Technological Evolution
The field has evolved from text-only LLMs to multimodal LLMs (MLLMs) that can process both text and images. Initially, research focused on fundamental VQA tasks, often limited to single-step reasoning or simple retrieval. The next stage involved equipping LLMs with tools (tool-use LLMs) to augment their capabilities, leading to the development of deep research agents that could perform multi-step planning and interaction on the web, but primarily in a text-centric manner.
This paper's work represents a critical step in this evolution: bridging the gap between multimodal perception and deep research agent capabilities. It pushes beyond text-only reasoning by deeply integrating visual information into the agent's reasoning and tool-use loop. This means the agent doesn't just "see" but actively reasons over visual content and uses visual information to guide its multi-step information-seeking process. The introduction of BrowseComp-VL also signifies an evolution in benchmarks, moving towards more realistic, complex, and multimodal information-seeking challenges that mirror real-world tasks.
3.4. Differentiation Analysis
Compared to the main methods in related work, WebWatcher introduces several core differences and innovations:
- Integrated Multimodal Reasoning and Tool Use: Unlike most prior deep research agents that are text-bound, WebWatcher deeply integrates vision-language reasoning with a versatile set of tools. It explicitly addresses tasks that require combining both modalities for complex problem-solving, which is a limitation for agents primarily focused on text or only simple visual perception.
- Advanced Multimodal Data Generation: Existing VQA datasets often focus on single-hop queries or perception. WebWatcher's pipeline generates training data specifically designed for in-depth, multi-step reasoning and strategic planning by converting complex textual QA into VQA and masking entities. This provides a richer and more challenging training environment than is typically available.
- Automated Trajectory Generation: Instead of rigid, template-based trajectories, WebWatcher generates action-observation sequences via prompting, grounding them in actual tool-use behavior and reflecting procedural decision-making. This addresses the challenge of coordinating tools with distinct input-output formats and reasoning roles.
- Robust Training Methodology: WebWatcher combines Supervised Fine-Tuning (SFT) for a strong "cold start" with Reinforcement Learning (RL) via GRPO for further optimization and generalization. The paper specifically highlights the importance of the SFT cold start for stable RL training in complex tool-use scenarios, which is a critical finding for agent development.
- Novel Multimodal Benchmark (BrowseComp-VL): WebWatcher introduces a new benchmark that extends BrowseComp to the visual domain. This benchmark is specifically designed to challenge agents with long, entity-obfuscated queries that demand cross-modal reasoning, thorough information-seeking, and high-level planning across web search, image retrieval, and webpage browsing. This provides a more comprehensive evaluation of multimodal deep research capabilities than previous benchmarks.

In essence, WebWatcher moves beyond merely adding visual tools to an LLM by providing a holistic framework for generating complex multimodal data, training agents to effectively use tools in a multimodal context, and evaluating them on benchmarks that truly demand integrated vision-language reasoning for deep research.
4. Methodology
4.1. Principles
The core idea behind WebWatcher is to build a multimodal deep research agent capable of complex vision-language reasoning and multi-tool interaction. This is achieved by addressing three key challenges:
- Developing strong reasoning across text and vision: This requires constructing high-quality training data that combines rich visual content with complex, multi-step reasoning.
- Enabling effective use of multiple external tools: This involves equipping the agent with a diverse set of tools and training it to coordinate them flexibly.
- Ensuring generalization and robust decision-making: This is achieved through a combination of supervised fine-tuning (SFT) and reinforcement learning (RL).

The theoretical basis, or intuition, is that by providing LLMs with the ability to "see" (process visual information) and "act" (use various tools) in a structured and learned manner, they can transcend text-only limitations and tackle more complex, real-world information-seeking problems. The ReAct framework provides the operational structure for interleaving thought, action, and observation, while carefully curated data and RL techniques refine the agent's strategic capabilities.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology of WebWatcher can be broken down into three main phases: Data Preparation, Trajectory Generation and Post-Training, and Experimental Setup.
4.2.1. Data Preparation
This phase focuses on constructing a high-quality dataset for multimodal deep research agents.
4.2.1.1. Data Overview
The dataset is designed for multimodal deep research agents, with each example comprising:
- A factual image.
- An associated question requiring cross-modal reasoning.
- A corresponding answer.
- Auxiliary metadata about the underlying entities and relations.

The dataset covers 5 major domains (Entertainment, Humanities, Technology, Natural Science, and Other) and 17 fine-grained subfields. It defines two difficulty levels:
- Level 1: Questions require multi-hop reasoning but still reference explicit entities. Answers can be obtained through iterative retrieval, but integrating information across multiple sources is non-trivial.
- Level 2: Questions have obfuscated entities and attributes (e.g., vague time periods, masked names, fuzzed quantitative properties). This introduces uncertainty, forcing the agent to plan, compare, and synthesize information rather than rely on direct retrieval.

This dataset is split into a training set and a benchmark called BrowseComp-VL.
The following figure (Figure 2 from the original paper) illustrates the domain distribution and examples of Level 1 and Level 2 questions:
The image is a diagram with two concentric ring (donut) charts showing the structure and composition of the Level 1 and Level 2 subsets. Different colors distinguish the sub-domains, and example questions and answers are attached.
4.2.1.2. Construction of VQA Pairs
This sub-section details how diverse textual QA pairs are first constructed, then grounded in relevant images to form VQA tasks.
QA Pairs Generation
- Level 1: Inspired by CRAWL-QA from WebDancer (Wu et al., 2025a). Root URLs are collected from authoritative sources (arXiv, GitHub, Wikipedia), and their hyperlinks are recursively traversed to mimic human browsing. GPT-4o (OpenAI, 2024) synthesizes question-answer pairs from the aggregated content.
- Level 2: Following WebSailor (Li et al., 2025a), queries are constructed with fuzzed entities by replacing precise references with partial or ambiguous descriptions. This forces contextual reasoning and synthesis across modalities. A two-stage generation framework is used (a toy sketch of the node-selection step follows this list):
  - Nodes Selecting: Starting from an initial Wikipedia page, GPT-4o generates a base QA pair using the page title as the root entity node. A hyperlink graph is expanded by recursively traversing outgoing links to form a tree with a fixed depth and branching factor, which together determine the number of generated nodes. Subgraphs of entities are sampled, each defining a path from the root node to a target entity, forming the basis for multi-hop QA pairs.
  - Query Generating and Entity Masking: For each subgraph, GPT-4o generates a standard question explicitly referencing entities and relations. A fuzzed version is then created by replacing key references with partial or ambiguous descriptions, preventing simple string matching and forcing cross-modal reasoning.
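The node-selection step can be pictured as breadth-limited expansion of a hyperlink tree followed by sampling a root-to-target path, as sketched below. This is illustrative only: `get_outgoing_links` is a hypothetical helper, and the depth and branching values are placeholders rather than the settings used in the paper.

```python
import random

def get_outgoing_links(page: str) -> list[str]:
    """Hypothetical helper returning hyperlinked entity pages for `page`."""
    raise NotImplementedError

def expand_tree(root: str, depth: int = 2, branching: int = 3) -> dict[str, list[str]]:
    """Expand a hyperlink tree with limited depth and branching (placeholder values)."""
    tree, frontier = {root: []}, [root]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            children = get_outgoing_links(node)[:branching]
            tree[node] = children
            for child in children:
                tree.setdefault(child, [])
            next_frontier.extend(children)
        frontier = next_frontier
    return tree

def sample_root_to_target_path(tree: dict[str, list[str]], root: str) -> list[str]:
    """Random walk from the root to a leaf; the leaf plays the role of the target entity."""
    path, node = [root], root
    while tree.get(node):
        node = random.choice(tree[node])
        path.append(node)
    return path
```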
QA-to-VQA Conversion
This process ensures reliable visual grounding and transforms textual QA into VQA queries.
- Visual Context Construction: Trivial or overly ambiguous target entities (those lacking visual grounding) are discarded. For each retained entity, a set of web images is retrieved via Google SerpApi (Google, 2025), with a fixed number of images per entity in this implementation. These images are strictly authentic.
- Question Transformation: To create image-grounded VQA pairs from each textual QA, GPT-4o is used for prompt-based rewriting. The target entity in the question is masked with a visual reference token (e.g., "this entity," "the object in the image"), producing a transformed VQA query. Simultaneously, an image query string is created to guide the filtering of the retrieved images. Each retained image is paired with the transformed (q, a), so multiple multimodal examples are generated from each textual QA.

The following figure (Figure 3 from the original paper) illustrates the data generation pipeline:
The image is a diagram illustrating the multi-level information retrieval and reasoning pipeline for the question "What snake species is named after James Roy Kinghorn?", including graph search, image retrieval, Selector and Examiner modules, and graph-based hierarchical reasoning.
4.2.1.3. Quality Control
A two-stage filtering pipeline ensures high-quality VQA samples:
- Selector:
  - Discards cases where the transformed VQA query is identical to the original question, or where the entity name or its aliases still appear in the transformed query, indicating failed masking.
  - GPT-4o evaluates each image against both the query and the (q, a) pair, scoring contextual alignment, semantic fit, and visual reasoning plausibility. Cases with low scores are removed.
- Examiner: For each retained image-query pair, GPT-4o attempts to answer the question using only the visual content and associated captions. Failure to answer accurately indicates improper visual context, and such cases are discarded. Captions are included to reduce false negatives caused by missing world knowledge.

A schematic sketch of this two-stage filter follows.
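Conceptually, the two-stage filter is a pair of predicates applied to every candidate (image, query, answer) triple. The sketch below is a schematic reading of that pipeline; `gpt4o_score`, `gpt4o_answer`, and the 0.7 threshold are hypothetical stand-ins, not values or interfaces from the paper.

```python
def gpt4o_score(image, question: str, answer: str) -> float:
    """Hypothetical wrapper: GPT-4o scores contextual alignment, semantic fit, and plausibility."""
    raise NotImplementedError

def gpt4o_answer(image, caption: str, question: str) -> str:
    """Hypothetical wrapper: GPT-4o answers using only the image and its caption."""
    raise NotImplementedError

def selector(image, q_vqa: str, q_text: str, answer: str, entity_aliases: list[str]) -> bool:
    """Stage 1: reject failed maskings and poorly aligned images."""
    if q_vqa.strip() == q_text.strip():
        return False                                   # masking did not change the question
    if any(alias.lower() in q_vqa.lower() for alias in entity_aliases):
        return False                                   # entity name or alias leaked into the query
    return gpt4o_score(image, q_vqa, answer) >= 0.7    # illustrative threshold

def examiner(image, caption: str, q_vqa: str, answer: str) -> bool:
    """Stage 2: keep the pair only if the question is answerable from the visual context."""
    predicted = gpt4o_answer(image, caption, q_vqa)
    return predicted.strip().lower() == answer.strip().lower()
```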
4.2.2. Trajectory Generation and Post-Training
This phase involves generating high-quality tool-use trajectories and then using them for supervised fine-tuning (SFT) and reinforcement learning (RL).
4.2.2.1. Multimodal Tools
WebWatcher is equipped with five tools:
- Web Image Search: Uses Google SerpApi (Google, 2025) to retrieve relevant images with captions and URLs.
- Web Text Search: For open-domain information seeking using text queries.
- Visit: Uses Jina (Jina.ai, 2025) to navigate specific URLs and summarize pages according to the agent's goal.
- Code Interpreter: For symbolic computation and numerical reasoning (Cheng et al., 2024).
- OCR (Optical Character Recognition): An internal tool, invoked via prompt and SFT data, to extract text from input images (Huang et al., 2025). This is crucial for interpreting text embedded in visuals such as charts or diagrams.

A schematic tool registry illustrating how such a tool set might be dispatched is sketched below.
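A simple way to organize such a tool set is a name-to-callable registry that parsed tool calls are dispatched against. The sketch below uses hypothetical function names and signatures and is not WebWatcher's actual interface.

```python
from typing import Callable

# Stub tools; real implementations would wrap SerpApi, Jina, a sandboxed interpreter, and an OCR model.
def web_image_search(query: str) -> str: ...
def web_text_search(query: str) -> str: ...
def visit(url: str, goal: str) -> str: ...
def code_interpreter(code: str) -> str: ...
def ocr(image_path: str) -> str: ...

TOOL_REGISTRY: dict[str, Callable[..., str]] = {
    "web_image_search": web_image_search,
    "web_text_search": web_text_search,
    "visit": visit,
    "code_interpreter": code_interpreter,
    "ocr": ocr,
}

def dispatch(tool_name: str, **kwargs) -> str:
    """Route a parsed tool call to the matching tool and return its observation string."""
    if tool_name not in TOOL_REGISTRY:
        return f"Error: unknown tool '{tool_name}'"
    return str(TOOL_REGISTRY[tool_name](**kwargs))
```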
4.2.2.2. Automated Trajectory Annotation
Given a VQA instance (I, q, a) from BrowseComp-VL, GPT-4o constructs tool-use trajectories simulating step-by-step human reasoning, following the ReAct (Yao et al., 2023) framework. Each trajectory comprises multiple think-act-observe cycles. At each step $t$, the model generates:

- Thought: Intermediate reasoning or a plan, enclosed in think tags.
- Action: A tool invocation wrapped in <tool_call>...</tool_call>, or the final answer in answer tags.
- Observation: The result returned from the environment, within <tool_response>...</tool_response> tags.

The action space consists of discrete tool-use actions plus a Finish action that signals task completion. A trajectory of length $T$ can be written as $\tau = (a_1, o_1, a_2, o_2, \ldots, a_T, o_T)$, where $a_t$ is the action taken at step $t$ and $o_t$ is the observation (environment feedback) returned after executing $a_t$. Each trajectory provides a content-grounded demonstration of planning and tool selection. An example of one serialized step is shown below.
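For illustration, a single think-act-observe step in this tag format might be serialized as follows. The content is invented, and the <think> tag and the JSON-style tool-call arguments are assumptions based on the description; only <tool_call> and <tool_response> appear verbatim in the text above.

```python
# One serialized think-act-observe step (illustrative; contents and some tag names are assumed).
example_step = """
<think>The image shows a snake; I should first search for the species named after this person.</think>
<tool_call>{"name": "web_text_search", "arguments": {"query": "snake species named after James Roy Kinghorn"}}</tool_call>
<tool_response>Top result: a page listing reptile species named after herpetologist J. R. Kinghorn ...</tool_response>
"""
```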
4.2.2.3. Trajectory Filtering and Quality Assurance
A three-stage selection process ensures robust and instructive supervision:
- Final Answer Matching: Only trajectories where the final answer matches the ground truth are retained.
- Step-by-Step Consistency Check:
GPT-4overifies the logical consistency of each intermediate step in . Trajectories with hallucinated content, contradictions, or unjustified tool calls are discarded. This avoids correct answers being reached by chance. - Minimum Tool Usage Requirement: Trajectories with fewer than three tool calls are removed to ensure substantive, process-driven tool interactions and reasoning.
4.2.2.4. Supervised Fine-Tuning (SFT) as Cold Start
After filtering, a dataset $\mathcal{D}$ of high-quality tool-use trajectories is obtained. At each step $t$ of trajectory $\tau_i$, WebWatcher is trained to predict the correct action $a_t$, given the image $I_i$, the question $q_i$, and the previous actions $a_{<t}$ and observations $o_{<t}$. SFT maximizes the log-likelihood of $a_t$:

$$\mathcal{L}_{\mathrm{SFT}}(\theta) = \sum_{i=1}^{|\mathcal{D}|} \sum_{t=1}^{T_i} \log \pi_\theta\!\left(a_t \mid I_i, q_i, a_{<t}, o_{<t}\right)$$

Here, $\theta$ are the model parameters, $I_i$ is the image for trajectory $i$, $q_i$ is the question, $a_{<t}$ are the actions before step $t$, $o_{<t}$ are the observations before step $t$, and $T_i$ is the length of trajectory $i$. This cold-start stage teaches the agent effective tool use and structured multi-step reasoning.
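In implementation terms, this cold-start objective is ordinary teacher forcing over the action tokens, conditioned on the image, question, and interleaved history. The sketch below assumes a Hugging Face-style causal LM interface and masks observation tokens out of the loss, which is consistent with (but not copied from) the description above.

```python
import torch

def sft_loss(model, input_ids: torch.Tensor, action_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood over action tokens only (schematic, not the paper's code).

    input_ids:   (B, T) tokens of the full serialized trajectory (prompt, thoughts/actions, observations).
    action_mask: (B, T) 1 where the token belongs to an action the agent should produce, else 0.
    """
    labels = input_ids.clone()
    labels[action_mask == 0] = -100          # ignore prompt and observation tokens in the loss
    outputs = model(input_ids=input_ids, labels=labels)
    return outputs.loss                      # mean cross-entropy over the unmasked (action) tokens
```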
4.2.2.5. Reinforcement Learning (RL)
With SFT providing cold-start initialization, Group-Relative Policy Optimization (GRPO) (Guo et al., 2025) is applied to refine decision-making.
For a VQA query $q$, the current policy $\pi_\theta$ generates a group of $G$ complete trajectories $\{\tau_i\}_{i=1}^{G}$, each with return $R_i$. The group-relative advantage is defined as:

$$\hat{A}_i = \frac{R_i - \operatorname{mean}\!\left(\{R_j\}_{j=1}^{G}\right)}{\operatorname{std}\!\left(\{R_j\}_{j=1}^{G}\right)}$$

This normalizes rewards within the group, removing the need for a separate value function. The GRPO objective is defined as a clipped surrogate loss:

$$\mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}\min\!\left(r_i(\theta)\,\hat{A}_i,\ \operatorname{clip}\!\left(r_i(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_i\right) - \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\theta_{\mathrm{old}}}\right)\right]$$
Where:
- $r_i(\theta) = \pi_\theta(\tau_i)/\pi_{\theta_{\mathrm{old}}}(\tau_i)$ is the importance sampling ratio between the current policy $\pi_\theta$ and the previous policy $\pi_{\theta_{\mathrm{old}}}$.
- $\hat{A}_i$ is the group-relative advantage for trajectory $\tau_i$.
- $\epsilon$ is the clipping threshold, typically a small positive value (e.g., 0.2), which limits the change in the policy to ensure stable updates.
- $D_{\mathrm{KL}}$ denotes the Kullback-Leibler (KL) divergence between the current and previous policies, serving as a penalty that prevents the new policy from deviating too much from the old one, promoting stability.
- $\beta$ is a coefficient controlling the strength of the KL penalty.

This objective promotes stable updates while encouraging exploration of trajectories with higher relative return.
Each trajectory receives a binary format score (1 if all tool calls follow the schema). An LLM grader provides a semantic accuracy score by comparing the final answer with the ground truth. The total reward is a weighted combination of the two, with the accuracy term weighted more heavily to prioritize task completion while maintaining structured tool use. Since the reward is given only at the end of an episode, the group-relative ranking enables effective credit assignment. Rollouts are collected in groups of fixed size $G$ for diversity and computational efficiency.
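The group-relative machinery reduces to a few lines: normalize episode returns within each group of rollouts, apply a PPO-style clipped surrogate with a KL penalty, and mix format and accuracy into a scalar reward. The sketch below is schematic (sequence-level ratios, scalar rewards); the clip value, KL coefficient, and reward weighting are placeholders, not the paper's settings.

```python
import torch

def group_relative_advantages(returns: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """returns: (G,) episode returns for one group of rollouts of the same query."""
    return (returns - returns.mean()) / (returns.std() + eps)

def grpo_loss(logp_new: torch.Tensor, logp_old: torch.Tensor, advantages: torch.Tensor,
              kl: torch.Tensor, clip_eps: float = 0.2, beta: float = 0.01) -> torch.Tensor:
    """Clipped surrogate with a KL penalty; returns a loss to minimize (placeholder hyperparameters)."""
    ratio = torch.exp(logp_new - logp_old)                       # importance sampling ratio per rollout
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return -(torch.min(unclipped, clipped).mean() - beta * kl)

def total_reward(format_ok: bool, accuracy: float, acc_weight: float = 0.9) -> float:
    """Weighted mix of accuracy and format scores; the exact weighting is an assumption."""
    return acc_weight * accuracy + (1.0 - acc_weight) * float(format_ok)
```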
5. Experimental Setup
5.1. Datasets
The experimental setup involves both training data construction and evaluation on several challenging benchmarks.
Training Data Construction
The training data for WebWatcher comes from three sources:
- BrowseComp-VL training set: This includes 110,000 Level-1 and 70,000 Level-2 QA pairs. After VQA conversion and filtering, 60,000 Level-1 and 40,000 Level-2 high-quality examples are retained.
- Long-tail QA pairs converted to VQA: Sampled from training instances with a distribution similar to SimpleVQA, resulting in 4,000 VQA examples.
- Hard VQA samples: Collected from InfoSeek (Chen et al., 2023), VQA v2.0 (Goyal et al., 2017), LogicVista (Xiao et al., 2024), and Encyclopedic VQA (Mensink et al., 2023). Data from Huang et al. (2025) is added to activate OCR. Rejection sampling ensures difficulty (a toy illustration of this idea follows below).

After trajectory generation and filtering, 8,000 high-quality tool-use trajectories are obtained for SFT, with an additional 2,000 samples reserved for GRPO. The final ratio of data sources is 5:3:2 for BrowseComp-VL, long-tail VQA, and hard VQA data, respectively.
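The rejection-sampling step can be read as keeping only examples that a reference model already fails on. The toy sketch below makes this concrete; `answer_fn` and the acceptance rule (fail on all attempts) are assumptions, not the paper's exact procedure.

```python
from typing import Callable

def rejection_sample_hard(examples: list[dict], answer_fn: Callable[[object, str], str],
                          n_tries: int = 4) -> list[dict]:
    """Keep examples the reference model cannot solve in any of `n_tries` attempts (illustrative)."""
    hard = []
    for ex in examples:
        attempts = [answer_fn(ex["image"], ex["question"]) for _ in range(n_tries)]
        if not any(a.strip().lower() == ex["answer"].strip().lower() for a in attempts):
            hard.append(ex)
    return hard
```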
Evaluation Benchmarks
WebWatcher is evaluated on five challenging benchmarks:
- BrowseComp-VL:
  - Source: Proposed in this paper, extending BrowseComp (Wei et al., 2025b).
  - Scale: The evaluation set consists of 100 instances from Level 1 and 200 instances from Level 2, totaling 300 instances. All examples are manually verified by PhD-level AI experts.
  - Characteristics: Designed for in-depth multimodal reasoning and strategic planning. Queries are long, entity-obfuscated, and require multi-page browsing, fine-grained visual grounding, and complex information retrieval across both visual and textual information.
  - Domain: Not explicitly stated, but inferred to cover a broad range based on the training data categories.
- Humanity's Last Exam (HLE) (Phan et al., 2025):
  - Source: An existing benchmark.
  - Scale: Originally 2,500 expert-written questions. WebWatcher evaluates on a subset of 330 multimodal questions.
  - Characteristics: Questions go beyond simple retrieval, requiring models to synthesize evidence from obscure or fragmented sources and reason through abstract academic problems. Multimodal questions assess visual-textual reasoning.
  - Domain: Diverse academic fields such as science, engineering, and the humanities (e.g., Biology, Chemistry, Computer Science/AI, Engineering, Humanities, Math, Physics, Other).
- LiveVQA (Fu et al., 2025):
  - Source: An existing benchmark.
  - Scale: 3,602 multi-hop VQA instances. WebWatcher evaluates on a 300-example subset.
  - Characteristics: Evaluates a model's ability to answer questions grounded in up-to-date visual knowledge, often from recent global news. Requires multi-hop reasoning.
  - Domain: Recent global news across six sources and fourteen topics.
- SimpleVQA (Cheng et al., 2025):
  - Source: An existing benchmark.
  - Scale: 2,025 examples in both English and Chinese. WebWatcher evaluates on 300 examples randomly sampled from the 1,013 English QA pairs.
  - Characteristics: A factual VQA benchmark combining curated image-question pairs from recent VQA datasets and expert-annotated web images. Focuses more on visual reasoning than on external knowledge.
  - Domain: General factual knowledge related to images.
- MMSearch (Jiang et al., 2024):
  - Source: An existing benchmark.
  - Scale: 300 manually curated examples. WebWatcher uses the 171-example visual subset for evaluation.
  - Characteristics: Examples cover both recent news and rare knowledge, requiring search capabilities.
  - Domain: 14 subdomains including recent news and rare knowledge.
5.2. Evaluation Metrics
The primary evaluation metric used is pass@k (Chen et al., 2021) with LLM-as-Judges (Liu et al., 2024) for correctness scoring.
- Conceptual Definition of pass@k: pass@k is a metric used to evaluate the success rate of generative models, particularly in tasks where multiple attempts might be made to find a correct solution. It measures the probability that at least one of $k$ independently generated solutions is correct. If a model generates $k$ candidate solutions and any one of them is correct, the attempt is considered a success. This metric is useful for evaluating agents that can perform multiple rollouts or search for a solution through several tries.
- Mathematical Formula for pass@k: The paper specifies that pass@1 is computed as:

  $$\text{pass@1} = \frac{1}{N}\sum_{i=1}^{N} c_i$$

  Where:
  - $N$ is the total number of evaluation instances.
  - $c_i$ is the binary correctness (1 for correct, 0 for incorrect) of the $i$-th prediction.

  For a general pass@k, the formula is often derived from the probability of failure. A practical, unbiased way to calculate pass@k, given $n$ samples per problem of which $c$ are correct, is (Chen et al., 2021):

  $$\text{pass@}k = \mathbb{E}_{\text{problems}}\!\left[1 - \frac{\binom{n-c}{k}}{\binom{n}{k}}\right]$$

  The paper's description of pass@k in the "Pass@k Analysis on HLE" section implies the proportion of problems for which at least one of $k$ generated solutions is correct: the generation process is repeated $k$ times per problem and the problem counts as solved if any attempt passes. The pass@1 formula above is simply the average accuracy of single attempts.
- Symbol Explanation:
  - $N$: The total number of evaluation instances (questions/problems).
  - $c_i$: A binary indicator for the $i$-th prediction, equal to 1 if the prediction is correct and 0 if it is incorrect.
  - $n$, $c$: The number of samples generated per problem and the number of those samples that are correct.
  - $k$: The number of independent generations or attempts made for each problem.

  A reference implementation of this estimator is sketched below.
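For reference, the standard unbiased pass@k estimator from Chen et al. (2021) can be computed as below; pass@1 reduces to the simple average accuracy reported in the paper. The function names are ours, but the formula is the cited estimator.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k for one problem: n samples drawn, c of them correct (Chen et al., 2021)."""
    if n - c < k:
        return 1.0                      # every size-k subset must contain at least one correct sample
    return 1.0 - comb(n - c, k) / comb(n, k)

def benchmark_pass_at_k(per_problem_counts: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each tuple is (n_samples, n_correct) for one problem."""
    return sum(pass_at_k(n, c, k) for n, c in per_problem_counts) / len(per_problem_counts)
```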
LLM-as-Judges
- Conceptual Definition: This evaluation approach leverages a powerful Large Language Model (GPT-4o in this paper) to act as an automated judge for assessing the correctness and quality of generated answers. Instead of human annotators, an LLM is prompted with the question, the model's response, and often a ground truth answer, and is then asked to rate or provide a binary correctness judgment. This method aims to automate and scale up evaluation, especially for open-ended generative tasks where traditional exact-match metrics are insufficient.
- Details: The paper mentions using the LLM-as-Judges approach (Liu et al., 2024) and provides the prompt used for Response Accuracy Evaluation in Appendix F.5. This prompt asks the judge LLM to determine whether a given response correctly answers the question based on a correct_answer. It extracts a final answer, provides reasoning, and outputs a binary correct (yes/no) verdict. A generic sketch of this kind of judging call follows.
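Operationally, LLM-as-Judges amounts to one structured call per prediction. The sketch below is a generic illustration; the prompt wording and the `call_judge_llm` helper are hypothetical and do not reproduce the paper's Appendix F.5 prompt.

```python
JUDGE_PROMPT = """You are grading an answer.
Question: {question}
Correct answer: {correct_answer}
Model response: {response}
Reply with exactly 'yes' if the response answers the question correctly, otherwise 'no'."""

def call_judge_llm(prompt: str) -> str:
    """Hypothetical wrapper around a judge model such as GPT-4o."""
    raise NotImplementedError

def judge_correct(question: str, correct_answer: str, response: str) -> bool:
    """Binary correctness verdict from the judge LLM."""
    verdict = call_judge_llm(JUDGE_PROMPT.format(
        question=question, correct_answer=correct_answer, response=response))
    return verdict.strip().lower().startswith("yes")
```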
5.3. Baselines
The paper compares WebWatcher against several categories of baselines:
- Direct Inference: Powerful Multimodal Large Language Models (MLLMs) that directly generate answers using their internal knowledge without explicit tool use:
  - GPT-4o (OpenAI, 2024)
  - Gemini-2.5-flash (DeepMind, 2025)
  - Claude-3.7-Sonnet (Anthropic, 2025)
  - Qwen-2.5-VL family (7B/32B/72B) (Bai et al., 2025)
- Prompt Workflow: The same models driven by prompt-based workflows and equipped with the same tools as WebWatcher. This setup evaluates the impact of WebWatcher's training methodology beyond mere tool availability:
  - GPT-4o
  - Gemini-2.5-flash
  - Claude-3.7-Sonnet
  - Qwen-2.5-VL family (7B/32B/72B)
- Reasoning Baselines: Models specifically designed for multi-step reasoning, either as agents or as large LLMs with reasoning capabilities:
  - o4-mini (OpenAI, 2025b): An OpenAI model mentioned in the context of reasoning.
  - Gemini-2.5-Pro (DeepMind, 2025): A powerful Gemini model from Google DeepMind, likely employed with prompt-driven workflows for reasoning.
  - OmniSearch (GPT-4o) (Li et al., 2025d): An open-source, search-oriented agent based on GPT-4o.

These baselines represent a comprehensive comparison across state-of-the-art MLLMs, models utilizing tools via prompting, and dedicated reasoning agents, allowing for a thorough assessment of WebWatcher's innovations.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate WebWatcher's strong performance across various challenging VQA benchmarks, often outperforming or matching proprietary baselines and open-source agents.
Humanity's Last Exam (HLE) Results
On HLE, which requires complex multimodal search, computation, and reasoning, models relying on direct inference perform poorly, with average accuracy scores below 10%. This highlights the limitations of vanilla MLLMs when faced with knowledge-intensive VQA that demands external tool use and multi-step reasoning. RAG-based methods (Prompt Workflow in the table) show moderate improvements, particularly in Chemistry, indicating that external information retrieval helps.
WebWatcher-32B, despite being a smaller model (32B parameters) compared to some proprietary baselines, achieves a competitive overall average accuracy of 13.6%. It particularly excels in specific domains, scoring 33.8% in Biology and showing strong performance in Mathematics and Humanities. This suggests its training and tool-use integration are effective for domain-specific, complex reasoning tasks. While o4-mini and Gemini-2.5-Pro achieve slightly higher overall scores (16.0% and 15.8% respectively), WebWatcher-32B demonstrates parameter efficiency for comparable performance.
Other Challenging Benchmarks
On BrowseComp-VL, LiveVQA, MMSearch, and SimpleVQA, WebWatcher consistently outperforms both direct inference and prompt workflow baselines.
- BrowseComp-VL: This benchmark is highly challenging, requiring multi-page browsing and fine-grained visual grounding. Most baselines score below 20%. WebWatcher-32B achieves 27.0%, significantly outperforming all baselines and its smaller WebWatcher-7B counterpart, validating the effectiveness of its dynamic tool-use loop and training for this complex task.
- LiveVQA: WebWatcher-32B achieves a state-of-the-art result of 58.7%, indicating its strong ability to handle questions grounded in up-to-date visual knowledge.
- MMSearch: WebWatcher-32B also achieves a state-of-the-art result of 55.3%, showcasing its effectiveness in multimodal search scenarios.
- SimpleVQA: Even on SimpleVQA, which emphasizes visual reasoning over external knowledge (a perception-oriented benchmark), WebWatcher-32B performs well with a score of 59.0%. This demonstrates broad applicability beyond knowledge-intensive tasks and suggests its visual understanding component is robust.

These results collectively confirm that WebWatcher excels in tasks requiring knowledge-intensive reasoning and multimodal interaction while maintaining strong visual reasoning capabilities.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper: Table 1: Main results on HLE. All accuracy scores are reported as percentages. Avg signifies the average accuracy score of three inference runs across different subtopics.
| Backbone | Bio. | Chem. | CS/AI | Engineer. | Human. | Math | Physics | Other | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| Direct Inference | | | | | | | | | |
| GPT-4o | 13.8 | 0.0 | 0.0 | 3.9 | 12.0 | 6.8 | 7.1 | 7.0 | 6.5 |
| Gemini-2.5-flash | 12.1 | 1.6 | 0.0 | 0.0 | 4.0 | 0.0 | 14.3 | 0.0 | 4.9 |
| Claude-3.7-Sonnet | 1.7 | 4.8 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 12.3 | 2.8 |
| Qwen-2.5-VL-7B | 3.4 | 3.2 | 7.1 | 0.0 | 4.0 | 2.3 | 7.1 | 0.0 | 2.6 |
| Qwen-2.5-VL-32B | 3.4 | 6.5 | 0.0 | 3.9 | 8.0 | 2.3 | 7.1 | 0.0 | 3.7 |
| Qwen-2.5-VL-72B | 3.4 | 8.0 | 0.0 | 5.9 | 8.0 | 0.0 | 0.0 | 7.0 | 4.9 |
| Prompt Workflow | | | | | | | | | |
| GPT-4o | 9.8 | 24.1 | 4.8 | 0.0 | 2.0 | 4.0 | 9.1 | 14.3 | 12.3 |
| Gemini-2.5-flash | 25.9 | 3.2 | 7.1 | 0.0 | 8.0 | 9.1 | 3.5 | 14.0 | 11.4 |
| Claude-3.7-Sonnet | 4.3 | 5.2 | 4.8 | 0.0 | 0.0 | 0.0 | 9.1 | 14.3 | 3.5 |
| Qwen-2.5-VL-7B | 4.3 | 6.9 | 3.2 | 7.1 | 0.0 | 4.0 | 4.5 | 7.1 | 5.3 |
| Qwen-2.5-VL-32B | 5.2 | 10.3 | 3.2 | 7.1 | 0.0 | 0.0 | 4.5 | 7.1 | 8.8 |
| Qwen-2.5-VL-72B | 15.8 | 10.3 | 8.1 | 0.0 | 2.0 | 8.0 | 6.8 | 14.3 | 8.6 |
| Reasoning Model | | | | | | | | | |
| o4-mini | 12.1 | 23.7 | 17.7 | 0.0 | 5.8 | 0.0 | 33.3 | 21.4 | 16.0 |
| Gemini-2.5-Pro | 23.7 | 17.7 | 13.3 | 11.5 | 8.0 | 13.3 | 14.3 | 15.5 | 15.8 |
| Open Source Agents | | | | | | | | | |
| OmniSearch (GPT-4o) | 15.5 | 8.2 | 0.0 | 2.2 | 8.0 | 6.8 | 21.4 | 12.1 | 9.3 |
| WebWatcher-7B | 18.6 | 6.5 | 6.7 | 7.7 | 4.0 | 6.7 | 7.1 | 17.2 | 10.6 |
| WebWatcher-32B | 33.8 | 9.7 | 0.0 | 5.8 | 8.0 | 8.9 | 14.3 | 13.8 | 13.6 |
The following are the results from Table 2 of the original paper: Table 2: Main results on four challenging benchmarks. All accuracy scores are reported as percentages. Avg signifies the average score of three inference across two difficult levels.
| Backbone | BC-VL Level 1 | BC-VL Level 2 | BC-VL Avg. | LiveVQA | MMSearch | SimpleVQA |
|---|---|---|---|---|---|---|
| Direct Inference | | | | | | |
| GPT-4o | 6.4 | 4.0 | 5.5 | 29.7 | 18.7 | 47.0 |
| Gemini-2.5-flash | 11.6 | 6.0 | 9.6 | 35.0 | 19.6 | 63.0 |
| Claude-3.7-Sonnet | 8.8 | 4.0 | 7.1 | 23.7 | 12.3 | 42.7 |
| Qwen-2.5-VL-7B | 0.8 | 0.0 | 0.5 | 22.7 | 4.09 | 30.7 |
| Qwen-2.5-VL-32B | 3.2 | 1.0 | 2.4 | 26.3 | 7.60 | 40.7 |
| Qwen-2.5-VL-72B | 9.2 | 3.0 | 7.1 | 30.3 | 11.7 | 51.3 |
| Prompt Workflow | | | | | | |
| GPT-4o | 16.8 | 7.0 | 13.4 | 34.0 | 24.1 | 61.6 |
| Gemini-2.5-flash | 15.2 | 9.0 | 13.0 | 41.3 | 43.9 | 68.6 |
| Claude-3.7-Sonnet | 13.9 | 6.0 | 11.2 | 30.3 | 32.7 | 59.3 |
| Qwen-2.5-VL-7B | 3.6 | 1.0 | 2.7 | 21.7 | 9.94 | 21.0 |
| Qwen-2.5-VL-32B | 9.4 | 3.0 | 7.2 | 30.5 | 17.5 | 44.6 |
| Qwen-2.5-VL-72B | 14.4 | 6.0 | 11.5 | 35.7 | 29.2 | 58.6 |
| Agents | | | | | | |
| OmniSearch (GPT-4o) | 19.7 | 10.0 | 16.3 | 40.9 | 49.7 | 63.0 |
| WebWatcher-7B | 23.6 | 17.0 | 21.2 | 51.2 | 49.1 | 54.3 |
| WebWatcher-32B | 28.4 | 25.0 | 27.0 | 58.7 | 55.3 | 59.0 |
6.3. Ablation Studies / Parameter Analysis
The paper conducts several analyses to understand the components and behavior of WebWatcher.
6.3.1. Number of Tool Calls
This analysis (Figure 4) examines how WebWatcher adapts its tool usage to the specific demands of different benchmarks.
- HLE: Shows balanced usage across Web Text Search, Web Image Search, and Code Interpreter, with Visit used for navigation. This reflects HLE's requirement for multimodal search, computation, and complex reasoning.
- BrowseComp-VL and MMSearch: For these benchmarks, which focus heavily on information seeking and reasoning, Web Text Search dominates, accounting for 62% of calls; other tools play minor roles. This highlights the agent's ability to prioritize text-based retrieval when problems are primarily information-gathering.
- SimpleVQA: The focus shifts to visual content, with Web Image Search making up one third or more of calls, while Text Search and Visit act as auxiliaries. This indicates that WebWatcher correctly identifies the visual nature of SimpleVQA tasks.
- Code Interpreter: Used only when actual computation is required, confirming that WebWatcher is cost- and context-aware in its tool selection.

Overall, the distribution of tool usage mirrors benchmark demands, underscoring WebWatcher's flexibility in composing tool chains rather than over-relying on any single tool.
The following figure (Figure 4 from the original paper) shows the percentage of external tool calls in the four benchmarks:
The image is a bar chart showing the proportion of four action types (text search, image search, code use, and page visits) used by WebWatcher on five multimodal datasets (HLE, BC-VL, MMSearch, LiveVQA, and SimpleVQA) and in aggregate, reflecting how tool usage frequency differs across tasks.
6.3.2. Cold Start for RL Training
This analysis (Figure 5) verifies the crucial role of supervised fine-tuning (SFT) as a "cold start" for Reinforcement Learning (RL) training in WebWatcher. The authors compare two initializations for the same RL algorithm (GRPO):
- Instruct: Warm-started only with public instruction-following data.
- Cold-start: Includes an extra SFT stage on high-quality trajectories that explicitly demonstrate tool use and step-by-step visual reasoning.

The two initializations behave very differently:
- Instruct initialization: The Instruct initialization stalls near zero on all three benchmarks (HLE, BC-VL, LiveVQA). This is attributed to frequent tool-call format errors and the strict Qwen-2.5-72B grader suppressing partial answers. Without proper initial guidance on structured tool use, the RL agent struggles to receive meaningful rewards, leading to a breakdown in learning.
- Cold-start initialization: In contrast, the cold-start SFT lifts initial scores significantly. Subsequently, GRPO trends diverge:
  - HLE and BC-VL oscillate without improvement, suggesting that for these highly complex benchmarks, GRPO on its own may need further refinement or larger model capacity to build effectively on the SFT foundation.
  - LiveVQA rises steadily, maintaining a 0.06-0.18 margin over Instruct. This shows that for certain tasks, GRPO effectively refines the SFT-initialized policy.

The analysis concludes that reasoning traces (such as Chain-of-Thought from a larger reasoner) cannot replace an SFT cold start under strict RL settings: injecting them into a smaller model led to instability, format violations, repetitions, and context overflow. This confirms the necessity of explicit SFT for robust tool-augmented RL training.
The following figure (Figure 5 from the original paper) shows the performance comparison using cold start in RL training on three benchmarks:
The image consists of three line charts comparing the Cold-start and Instruct training setups for WebWatcher on the HLE, BC-VL, and LiveVQA benchmarks, with training steps on the x-axis and score on the y-axis.
6.3.3. Pass@k Analysis on HLE
This analysis (Figure 6) investigates the performance of WebWatcher on HLE as the number of attempts $k$ increases, using the pass@k metric.

- Single attempt (k = 1): WebWatcher achieves a 13.6% pass rate.
- Initial steep rise: Performance rises steeply with only a few attempts; three roll-outs (k = 3) reach 20.3%. This indicates that even a small number of diverse trajectories generated by the agent can yield large gains in success probability.
- Continued improvement: Accuracy continues to improve at larger k, reaching 35.7% and ultimately 41.9% within the evaluated range. This roughly triples the single-shot inference performance and surpasses reasoning models like Gemini-2.5-Pro and o4-mini (which have pass@1 scores of 15.8% and 16.0%, respectively).
- Diminishing returns: Marginal gains taper at larger k, suggesting that practitioners can cap roll-outs at 8-16 for a significant boost (2-3x) at moderate computational cost.

The smooth curve suggests that de-correlated sampling (generating diverse rollouts) avoids redundant solutions and captures complementary knowledge. This analysis demonstrates the scalability of the agentic paradigm, where systematic exploration of reasoning paths leads to consistent and robust improvements on challenging multimodal benchmarks.
The following figure (Figure 6 from the original paper) shows the Pass@k curve of WebWatcher on HLE for k ranging from 1 to 32:
Note: Figure 6 from the original paper is not provided in the image assets, but the text describes its content, which is primarily a line graph showing Pass@k performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces WebWatcher, a pioneering multimodal deep research agent that effectively integrates complex vision-language reasoning with multi-tool interaction. The authors propose BrowseComp-VL, a novel and challenging benchmark specifically designed for in-depth multimodal reasoning and strategic planning. A scalable pipeline is presented to transform complex textual QA examples into VQA items, generating high-quality training data. Furthermore, an automated trajectory generation pipeline, grounded in action-observation traces, is developed, followed by supervised fine-tuning (SFT) and Group-Relative Policy Optimization (GRPO) to train the agent. Experimental results demonstrate that WebWatcher achieves strong performance across multiple high-difficulty benchmarks (HLE, LiveVQA, BrowseComp-VL, and MMSearch), outperforming both open-source and proprietary research agents. It also delivers competitive results on the perception-oriented SimpleVQA benchmark. WebWatcher establishes a robust foundation for future multimodal deep research agents capable of autonomous, flexible, and deeply reasoned problem-solving in real-world scenarios.
7.2. Limitations & Future Work
The authors do not explicitly list "Limitations" as a separate section. However, the analysis of their results implicitly points to some areas:
- Computational Cost: While WebWatcher-32B is parameter-efficient compared to larger proprietary models, RL training, especially with multiple rollouts, can be computationally intensive. The Pass@k analysis suggests diminishing returns after 16 rollouts, implying a trade-off between performance gain and computational cost.
- Stability of RL: The cold-start analysis shows that GRPO sometimes oscillates without improvement on highly complex benchmarks like HLE and BC-VL, even with an SFT cold start. This indicates that further research may be needed to improve the stability and effectiveness of RL for such intricate multimodal tasks, especially for smaller models.
- Reliance on External LLMs for Data Generation/Filtering: The methodology relies heavily on GPT-4o for QA pair generation, VQA conversion, trajectory annotation, and quality control. This introduces a dependency on the capabilities and biases of these external LLMs, which might limit diversity or bake specific reasoning styles into the training data.

As for future work, the paper states that WebWatcher "paves the way for solving complex multimodal information-seeking tasks" and "establishes a strong foundation for future multimodal deep research agents." Implicitly, this suggests future work could involve:
- Improving RL Stability and Efficiency: Enhancing RL algorithms to perform more robustly on highly complex multimodal tasks and to scale more efficiently.
- Reducing Reliance on Proprietary LLMs: Exploring methods for generating high-quality data and trajectories with open-source models or alternative data synthesis techniques to improve reproducibility and reduce external dependencies.
- Expanding Tool Capabilities: Integrating a wider array of specialized tools or enabling the agent to learn to use new tools dynamically.
- Long-Horizon Multimodal Tasks: Applying WebWatcher to even more complex, long-horizon real-world problems that require extended multi-step planning and cross-modal reasoning over longer periods.
- Human-Agent Collaboration: Exploring ways to integrate human feedback and guidance more seamlessly into the agent's learning and decision-making process for multimodal deep research.
7.3. Personal Insights & Critique
This paper presents a significant step forward in the domain of multimodal deep research agents. The core innovation lies not just in adding visual capabilities to a web agent, but in meticulously building the necessary infrastructure: a novel data generation pipeline, a robust training methodology (SFT + GRPO), and a challenging benchmark (BrowseComp-VL) that truly tests integrated vision-language reasoning.
One key inspiration is the rigorous approach to data generation and quality control. The multi-stage QA-to-VQA conversion, entity masking, and Selector/Examiner filtering steps are crucial for creating a high-quality dataset that drives deep reasoning rather than shallow pattern matching. This highlights that for advanced agent capabilities, simply collecting raw multimodal data is insufficient; structured, reasoning-intensive data curation is paramount.
The emphasis on the SFT cold start for RL training is another critical insight. It debunks the idea that RL alone can learn complex tool-use from scratch when dealing with strict formatting requirements and multi-step reasoning. The SFT phase provides the necessary scaffolding, allowing RL to then refine and generalize the learned behaviors. This suggests a powerful hybrid training paradigm for future complex agent development.
A potential issue or area for improvement could be the generalizability of the LLM-as-Judges method for evaluation. While GPT-4o is powerful, its judgments might still contain biases or miss subtle nuances that human experts would catch, especially in abstract academic problems (like HLE). The reliability of LLM graders, though common, is an ongoing research area. Further, the computational cost of generating rollouts for pass@k evaluation and the dependence on expensive proprietary LLMs for data generation and grading might pose barriers for broader research and open-source development.
The methods and conclusions could be transferred to other domains requiring complex multimodal interpretation and action, such as scientific discovery (e.g., analyzing experimental images alongside textual protocols), medical diagnosis (interpreting scans with patient histories), or even advanced robotics (visual navigation and task execution based on textual commands). WebWatcher's structured approach to integrating visual perception and tool-augmented reasoning provides a strong blueprint for building more capable and versatile AI agents.