DeepShop: A Benchmark for Deep Research Shopping Agents
TL;DR Summary
DeepShop introduces a benchmark for deep research shopping agents that evolves queries in both diversity and complexity and pairs fine-grained with holistic evaluation, revealing the limitations of current methods, from RAG to web agents and deep research systems, on complex, multi-attribute shopping scenarios.
Abstract
Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. Benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels: easy, medium, and hard, based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DeepShop: A Benchmark for Deep Research Shopping Agents
1.2. Authors
The paper lists six authors with their affiliations:
- Yougang Lyu (University of Amsterdam)
- Xiaoyu Zhang (Shandong University)
- Lingyong Yan (Baidu Inc.)
- Maarten de Rijke (University of Amsterdam)
- Zhaochun Ren (Leiden University)
- Xiuying Chen (Mohamed bin Zayed University of Artificial Intelligence (MBZUAI))
The authors represent a mix of academic institutions and an industry research lab, indicating a collaborative effort spanning different research perspectives, particularly in information retrieval, natural language processing, and artificial intelligence.
1.3. Journal/Conference
The paper was posted on 2025-06-03 (UTC timestamp 2025-06-03T13:08:17.000Z), suggesting it is a recent or forthcoming publication. While a specific conference or journal is not listed in the provided text, its status as an arXiv preprint and the typical submission targets for such work (major AI/NLP/IR venues such as NeurIPS, ICLR, ACL, EMNLP, SIGIR, WWW) suggest it is aimed at high-impact venues in these fields. Benchmarking papers are highly valued in these communities for setting new research directions and evaluation standards.
1.4. Publication Year
2025 (based on the UTC timestamp 2025-06-03T13:08:17.000Z).
1.5. Abstract
This paper introduces DeepShop, a novel benchmark designed to evaluate web agents in complex and realistic online shopping scenarios. The motivation stems from the observation that existing benchmarks often feature overly simplistic queries, failing to capture the multi-dimensional attributes, search filters, and sorting preferences inherent in real-world shopping. DeepShop addresses this gap through three main components: (1) Query diversity evolution, which generates diverse queries across five popular shopping domains from real user queries; (2) Query complexity evolution, which iteratively increases query complexity by adding product attributes, filters, and sorting preferences, categorizing them into easy, medium, and hard levels; and (3) a Fine-grained and holistic evaluation framework that assesses agent performance on these specific aspects and reports an overall success rate. The authors evaluate various approaches, including retrieval-augmented generation (RAG) methods, general web agents, and commercial deep research systems. Results show that RAG struggles due to its lack of web interaction, while other methods face significant challenges with filters and sorting, leading to low overall success rates. The paper concludes by providing cross-category, complexity-based evaluations and error analyses to guide future development in deep research shopping agents.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2506.02839
- PDF Link: https://arxiv.org/pdf/2506.02839v1.pdf
- Publication Status: This is an arXiv preprint, indicating it is publicly available but may not have undergone full peer review for a specific conference or journal yet, or it is a preprint version of a paper that has been accepted or is under review.
2. Executive Summary
2.1. Background & Motivation
The core problem DeepShop aims to solve is the lack of realistic and complex benchmarks for evaluating web agents in online shopping scenarios. Current benchmarks for web agents often consist of overly simplistic queries (e.g., "Find iPhone 15") with deterministic paths, which do not reflect the layered and nuanced nature of real-world shopping tasks.
This problem is important because web agents, especially those integrating large language models (LLMs), show great promise in automating user interactions on e-commerce platforms. However, despite advancements in planning, memory, and web interaction capabilities, existing agents still struggle with complex user queries in dynamic shopping environments. Real shopping requires "deep research"—browsing, filtering, comparing—to meet diverse and nuanced user preferences. The gap between these complex real-world needs and the limited complexity of current benchmarks means that the true capabilities and limitations of web agents in practical e-commerce settings remain underexplored.
The paper's entry point is to bridge this gap by introducing DeepShop, a benchmark that mirrors the complexity and diversity of real-world shopping, enabling a more accurate and comprehensive evaluation of web agents. Its innovative idea is to systematically evolve real user queries to increase both their diversity across product categories and their complexity by adding specific product attributes, search filters, and sorting preferences.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- DeepShop Benchmark Introduction: The creation of DeepShop, a comprehensive benchmark for evaluating web agents in complex online shopping scenarios. It features diverse queries across five popular product categories (Books, Electronics, Home, Fashion, Sports) and varying complexity levels (easy, medium, hard) derived from a multi-stage query diversity and complexity evolution process based on real-world shopping intents.
- Comprehensive Evaluation Framework: The proposal of a fine-grained and holistic evaluation framework. This framework assesses agent performance on specific aspects like correct product attribute matching, proper search filter application, and accurate sorting preference execution, alongside an overall holistic success rate. The evaluation is largely automated using GPT-4o with human agreement validation.
- Extensive Experimental Evaluation: Conducted systematic evaluations of various approaches, including simple retrieval-augmented generation (RAG) methods, advanced web agents (e.g., Agent-E, SeeAct, WebVoyager, Browser Use), and commercial deep research systems (e.g., Gemini Deep Research, OpenAI Deep Research) using the DeepShop benchmark.
- Detailed Analysis and Future Guidance: Provided detailed analyses across product categories, query complexity levels, and specific error types. This analysis reveals critical limitations in current systems (e.g., RAG struggles due to lack of web interaction, web agents struggle with fine-grained requirements like filters and sorting, deep research systems face hallucination errors), offering insights to guide the future development of more effective deep research shopping agents.

The key conclusions and findings reached by the paper are:
- RAG methods perform poorly on complex shopping queries due to their inherent lack of web interaction capabilities, being unable to apply filters or sorting.
- While web agents generally outperform RAG by interacting with websites, they still face significant challenges in simultaneously satisfying DeepShop's fine-grained requirements, particularly search filters and sorting preferences, leading to relatively low overall success rates.
- Deep research systems show improved performance in handling product attributes and sorting preferences due to their multi-step reasoning, but they still struggle with search filters and exhibit hallucination errors, limiting their overall success.
- Agent performance varies significantly across product categories, with visually driven categories like Fashion and Sports posing greater challenges for agents lacking robust multimodal reasoning.
- There is a clear negative correlation between query complexity and agent performance; as queries become harder, success rates drop dramatically for all evaluated methods.
- Critical limitations identified for web agents include poor grounding ability, lack of state assessment and replanning capabilities, a constrained action space (especially for dynamic UI elements), and an inability to learn from execution.

These findings highlight that current web agents and deep research systems are not yet robust enough to handle the full complexity of real-world online shopping scenarios, emphasizing the need for continued research and development guided by benchmarks like DeepShop.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the DeepShop paper, several foundational concepts related to AI, machine learning, and web interaction are crucial.
- Web Agents: In the context of this paper, a web agent (or sometimes AI agent or LLM agent) refers to an artificial intelligence program designed to autonomously interact with web environments, much like a human user would. This involves capabilities like navigating web pages, clicking buttons, typing into search fields, reading content, and interpreting visual layouts, all to achieve a specific goal or task. Web agents are typically powered by underlying AI models, often large language models (LLMs).
- Large Language Models (LLMs): LLMs are advanced AI models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are at the core of many recent AI advancements, including web agents, providing capabilities such as planning, reasoning, understanding instructions, and generating actions in natural language or code. Examples include GPT-4o, Gemini, etc.
- Retrieval-Augmented Generation (RAG): RAG is an architectural pattern for LLMs that combines information retrieval with text generation. When an LLM is asked a question, a RAG system first retrieves relevant information (e.g., documents, web pages, paragraphs) from a knowledge base or the internet. This retrieved information is then fed into the LLM along with the original query, allowing the LLM to generate a more informed and accurate response that is grounded in factual data, rather than relying solely on its pre-trained knowledge. In the context of DeepShop, Simple RAG often involves searching via Google and generating a response based on the search results, without direct interaction with dynamic websites.
- Partially Observable Markov Decision Process (POMDP): A POMDP is a mathematical framework used in AI to model decision-making problems where an agent's current state is not fully known.
  - Markov Decision Process (MDP): In an MDP, an agent interacts with an environment, taking actions that change the state of the environment, and receives rewards. The key is that the agent always knows its current state.
  - Partially Observable: In a POMDP, the agent does not directly observe the true state of the environment. Instead, it receives observations that are related to the state but may not fully reveal it. For a web agent, this means it might not know everything about a webpage's underlying structure or dynamic content, only what it can "see" (e.g., through screenshots or DOM trees).
  - The tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T})$ represents:
    - $\mathcal{S}$: The set of all possible states of the environment (e.g., the underlying HTML structure and backend data of a website).
    - $\mathcal{O}$: The set of all possible observations the agent can receive (e.g., screenshots, visible text, DOM tree elements).
    - $\mathcal{A}$: The set of all possible actions the agent can take (e.g., click, type, scroll).
    - $\mathcal{T}$: The transition function, which describes how taking an action in a given state leads to a new state.
- Online vs. Offline Benchmarks:
- Offline Benchmarks: These evaluate agents in static environments, often constructed from pre-collected web snapshots or manually curated HTML/DOM structures. They offer controlled conditions for reproducibility but fail to capture the dynamic and unpredictable nature of real-world websites.
- Online Benchmarks: These allow agents to operate within real-time web environments, interacting directly with live websites. This provides a more realistic evaluation but introduces challenges related to website changes, dynamic content, and potential non-determinism.
- Fine-grained Evaluation: This refers to assessing an agent's performance on specific, granular aspects of a task. In DeepShop, this includes evaluating whether the agent correctly handles product attributes (e.g., brand, color), search filters (e.g., ratings, price range), and sorting preferences (e.g., lowest price first). It helps diagnose specific failure modes.
- Holistic Evaluation: This refers to assessing the overall success of an agent in completing an entire task, often by aggregating fine-grained success into a single pass/fail metric. In DeepShop, it determines if all required components (attributes, filters, sorting) were successfully satisfied.
- Product Attributes, Search Filters, and Sorting Preferences: These are key components of online shopping queries:
- Product Attributes: Concrete characteristics of a product a user might specify, like "brand: Apple," "color: red," "size: Medium," "price range: $500-$800."
- Search Filters: Categorical or numerical constraints applied to search results, often via checkboxes, sliders, or dropdowns. Examples: "customer rating: 4 stars & up," "shipping: free delivery," "availability: in stock."
- Sorting Preferences: The desired order in which search results should be displayed. Examples: "sort by: lowest price," "sort by: highest user rating," "sort by: newest arrivals."
- Grounding: In the context of web agents, grounding refers to the ability of an LLM or agent to accurately connect natural language instructions (e.g., "click the 'Add to Cart' button") to specific, actionable elements within the visual or structural representation of a webpage (e.g., locating the exact button element in a screenshot or DOM tree). Poor grounding leads to incorrect actions.
- Hallucination (in LLMs): This occurs when an LLM generates information that is factually incorrect, nonsensical, or not supported by its input data, yet presents it as if it were true and confident. In deep research systems, this can manifest as making up product details, claiming a product meets criteria it doesn't, or providing incorrect links.
3.2. Previous Works
The paper extensively references prior research in web agent evaluation and development, categorizing benchmarks into offline and online.
Offline Benchmarks:
- Mind2Web [5]: An early benchmark for general web agents, using static snapshots or simulated environments. It provides controlled conditions but lacks real-world dynamism.
- WebShop [51]: Specifically designed for online shopping tasks but also uses a simulated environment (a fixed product catalog and interface), limiting its realism compared to live websites.
- WebArena [62]: Another offline benchmark for autonomous agents in web environments, aiming for realistic tasks but within a static setup.
- VWebarena [18] and MMInA [58]: Multimodal offline benchmarks, indicating a move towards agents that use both visual and text information, but still within static environments.
- ChatShop [3]: An offline benchmark focused on interactive information seeking with language agents, supporting multi-turn preference elicitation, but constrained by training products and not autonomously browsing live web content.
Online Benchmarks:
- WebLINX [21]: An online benchmark for real-world website navigation with multi-turn dialogue, offering a more dynamic setting.
- Mind2Web-Live [34]: An extension of Mind2Web that allows agents to operate in real-time web environments, marking a step towards more realistic evaluation. DeepShop draws seed queries from Mind2Web-Live.
- WebVoyager [11]: An online benchmark that enables end-to-end web agents with large multimodal models (LMMs), using iterative real-world exploration and feedback. DeepShop also draws seed queries from WebVoyager.
Web Agents for Task Automation (Baselines and related systems):
- WebGPT [30]: An early HTML-based agent that used LLMs to interpret instructions and navigate web interfaces using DOM trees.
- MindAct [5]: Another HTML-based agent, likely building on DOM tree interaction.
- Agent-E [1]: An HTML-based agent using a hierarchical planner-actor framework with DOM tree distillation. This is used as a baseline in DeepShop.
- SeeAct [60]: A vision-based agent exploiting the multimodal capabilities of LLMs, integrating visual perception with structured web-based interactions. Used as a baseline.
- Browser Use [29]: An open-source web agent framework combining visual understanding with HTML structure parsing for robust web navigation and interaction. Used as a baseline.
- OpenAI Deep Research [33] and Gemini Deep Research [8]: Commercial deep research systems that use advanced reasoning LLMs to tackle complex information-seeking tasks, autonomously browsing, analyzing, and synthesizing web information into citation-rich outputs. These are evaluated as baselines.
Query Understanding in E-commerce:
- Traditional information retrieval (IR) systems struggle with complex e-commerce queries due to overwhelming product spaces and nuanced user preferences [4, 45].
- Conversational IR systems, while supporting multi-turn preference elicitation, are limited by training products and cannot autonomously browse web content [3, 57]. Web agents offer a promising alternative by mimicking human browsing behaviors [11, 51].
3.3. Technological Evolution
The field of web agents has evolved significantly:
- Early Automation (Rule-based/Scripted): Initial attempts at web automation were often rule-based or required explicit scripting for specific tasks, lacking flexibility and generalization.
- Text-based Agents (DOM-centric): With the rise of LLMs, agents started using DOM trees (Document Object Model, a programming interface for HTML and XML documents that treats HTML elements as objects) to understand webpage structure and execute actions. WebGPT, MindAct, and Agent-E are examples. These agents primarily interact with the underlying code rather than the visual layout.
- Multimodal/Vision-based Agents: Recognizing that web interaction is inherently visual for humans, agents evolved to incorporate visual perception (e.g., screenshots) alongside text. SeeAct, WebVoyager, and Browser Use are prominent examples, integrating visual grounding to handle complex layouts and interactive components. This allows them to "see" what a human sees.
- Deep Research Systems (Advanced Reasoning): The latest wave involves highly sophisticated LLMs that can perform multi-step reasoning, plan complex tasks, and synthesize information from multiple sources, emulating human research workflows. Gemini Deep Research and OpenAI Deep Research represent this frontier.
- Benchmarking Evolution: Simultaneously, evaluation benchmarks progressed from static/simulated environments (Mind2Web, WebShop, WebArena) to more realistic online environments (WebLINX, Mind2Web-Live, WebVoyager). DeepShop fits into this evolution by addressing the gap in complexity within existing online benchmarks. While previous online benchmarks provided realistic environments, their tasks often remained simple. DeepShop pushes the envelope by introducing systematically diversified and complex queries, pushing web agents beyond simple navigation toward truly "deep research" shopping.
3.4. Differentiation Analysis
Compared to main methods in related work, DeepShop's core differences and innovations are:
- Complexity and Realism of Queries:
  - Existing Benchmarks: Often use simple, deterministic queries like "find an iPhone 15" or focus on general web tasks. They lack the multi-dimensional requirements common in real shopping.
  - DeepShop: Introduces queries with systematically increased complexity, incorporating multiple product attributes (e.g., brand, color, size), search filters (e.g., minimum rating, free delivery), and sorting preferences (e.g., lowest price first). This makes the tasks much more akin to real-world user needs, requiring agents to perform "deep research."
- Systematic Query Evolution:
  - Existing Benchmarks: Queries are often hand-crafted or derived in simpler ways, which might lead to an uneven distribution of difficulty or domain coverage.
  - DeepShop: Employs a multi-stage query diversity evolution (across five product categories) and query complexity evolution (iteratively adding attributes, filters, and sorting) starting from real user seed queries, utilizing GPT-4o for generation. This systematic approach ensures a comprehensive and balanced testbed, categorized into easy, medium, and hard levels.
- Fine-grained and Holistic Evaluation:
  - Existing Benchmarks: Primarily focus on binary task success (did the agent complete the task or not?).
  - DeepShop: Proposes a novel evaluation framework that includes both fine-grained metrics (assessing success on product attributes, search filters, and sorting preferences individually) and a holistic success rate (requiring all specified conditions to be met). This allows for a more nuanced understanding of agent capabilities and precise diagnosis of failure modes. It uses GPT-4o for automated evaluation, validated with human agreement.
- Online Environment Focus with a Specific Shopping Domain:
  - Existing Benchmarks: While some are online, they might focus on general web tasks or lack the specific nuances of e-commerce platforms. Offline benchmarks, though controlled, lack realism.
  - DeepShop: Specifically targets online shopping scenarios on real, dynamic websites, providing a realistic testbed for a critical application domain for web agents.

In essence, DeepShop moves beyond simply evaluating whether an agent can navigate a website to evaluating whether it can effectively understand and fulfill complex, multi-faceted user intents in a dynamic, real-world e-commerce setting, thereby providing a more rigorous challenge for the next generation of deep research shopping agents.
4. Methodology
The DeepShop benchmark is designed to evaluate web agents in realistic and complex online shopping environments. The methodology involves formulating the task, curating seed data, evolving query diversity and complexity, and establishing a comprehensive evaluation framework.
4.1. Principles
The core idea behind DeepShop is that real-world online shopping queries are inherently complex, involving multiple product characteristics, specific filtering criteria, and desired sorting orders. Existing benchmarks often simplify these queries, leading to an overestimation of web agent capabilities. DeepShop aims to bridge this gap by systematically generating diverse and complex queries, starting from real user intents, and providing a granular evaluation to truly assess agent performance in "deep research" shopping scenarios. The theoretical basis is rooted in simulating human-like complex information seeking behavior on e-commerce platforms.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Task Formulation
The online web shopping task is formulated as a partially observable Markov decision process (POMDP). This mathematical framework is used because a web agent operating on a live website does not have full knowledge of the website's underlying state (e.g., full server-side data, all hidden elements), but rather receives partial observations (e.g., screenshots, visible DOM elements).
The POMDP is defined by a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T})$:
- $\mathcal{S}$: Represents the entire state space of the web environment. This includes all possible configurations of a website, its backend data, and user session information.
- $\mathcal{O}$: Represents the observation space. At any given time, the web agent receives an observation $o \in \mathcal{O}$, which is a partial reflection of the true underlying state $s \in \mathcal{S}$. For a web agent, this could be a screenshot of the visible viewport, the DOM tree (Document Object Model, a programming interface for HTML and XML documents), or extracted text content.
- $\mathcal{A}$: Represents the action space. These are the possible actions the web agent can take, such as clicking a link, typing text into a search box, scrolling the page, or selecting an item from a dropdown.
- $\mathcal{T}$: Represents the transition function. This function describes how the environment's state changes when an agent takes an action. Specifically, if the agent takes an action $a_t$ from state $s_t$, the environment transitions to a new state $s_{t+1} = \mathcal{T}(s_t, a_t)$. After this transition, the agent receives an updated observation $o_{t+1}$.

The agent's goal is, given a user query $q$, to navigate this POMDP to find the desired product or information. A minimal sketch of this observe-act loop follows.
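To make the POMDP framing concrete, here is a minimal Python sketch of the observe-act loop such an agent runs; the `Observation`/`Action` structures, the `env`/`agent` interfaces, and the step cap are illustrative assumptions rather than an interface defined by the paper.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot_png: bytes   # partial view of the true page state
    dom_text: str           # extracted DOM / visible text

@dataclass
class Action:
    kind: str               # e.g., "click", "type", "scroll", "answer"
    target: str = ""        # element selector or free-text argument

def run_episode(env, agent, query: str, max_steps: int = 15) -> str:
    """Hypothetical observe-act loop over the POMDP (S, O, A, T).

    The agent never sees the true state s in S; it only receives
    observations o in O and issues actions a in A, after which the
    (live) website transitions according to T.
    """
    obs = env.reset(query)              # initial observation o_0
    for _ in range(max_steps):
        action = agent.act(query, obs)  # choose a_t from o_t (and history)
        if action.kind == "answer":     # agent decides it has found the product
            return action.target
        obs = env.step(action)          # environment applies T, returns o_{t+1}
    return agent.final_answer(query, obs)
```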
4.2.2. Seed Data Curation
To ensure realism and relevance, DeepShop starts with a collection of real user shopping queries, referred to as seed data.
- Source: 50 web shopping queries are manually selected from two existing real-world benchmarks: Mind2Web-Live [34] and WebVoyager [11]. These benchmarks are chosen because they involve real-time web environments.
- Categorization: The selected queries are manually categorized into five representative shopping domains:
  - Books: 4 queries related to physical books, eBooks, and audiobooks.
  - Electronics: 14 queries for smartphones, laptops, headphones, etc.
  - Home: 20 queries for household items, furniture, appliances.
  - Fashion: 5 queries for apparel, footwear, and accessories.
  - Sports: 7 queries for fitness equipment, sportswear.

These 50 seed queries form the initial basis for generating a more diverse and complex dataset.
4.2.3. Shopping Query Diversity Evolution
To address the lack of fine-grained product category diversity in existing datasets, DeepShop generates new queries based on the seed queries and a randomly selected product category. This process is called diversity evolution.
The process is defined by the following formula:
$
q_i^{*} = \mathrm{Diversity}(q_i, d)
$
Where:
- $q_i^{*}$: Represents the new, diversified query that is generated.
- $\mathrm{Diversity}(\cdot)$: This is a function implemented by prompting GPT-4o models. GPT-4o is an advanced large language model that can understand and generate human-like text, making it suitable for rewriting or creating new queries based on given instructions. GPT-4o is instructed to create a new prompt (query) for the web shopping domain, tailored for a different specific product in a given Amazon product field, while maintaining similar length and complexity to the original.
- $q_i$: Represents an original seed query selected from the initial dataset.
- $d$: Denotes a randomly selected product category from the five defined domains.

The goal is to take an existing query (e.g., an electronics query) and rephrase it for a different domain (e.g., books), ensuring the agent can generalize across varied user shopping intents. The final web shopping diversity evolution dataset is constructed by combining the original seed dataset with all the newly generated queries, where $N$ is the number of seed queries.
4.2.4. Shopping Query Complexity Evolution
To simulate increasingly complex real-world shopping scenarios, DeepShop iteratively enhances the complexity of the diversified queries. This is done by adding specific requirements related to product attributes, search filters, and sorting preferences.
The complexity evolution is an iterative process. In each iteration $t$, one of three strategies is randomly selected to evolve the query $q_{i,t}$ from the previous step.
The process is defined by the following formula:
$
q_{i,t+1} = \mathrm{Complexity}(q_{i,t}, c)
$
Where:
- $q_{i,t+1}$: Represents the $i$-th query after the $(t+1)$-th complexity evolution step.
- $\mathrm{Complexity}(\cdot)$: This is a function implemented by prompting GPT-4o. GPT-4o is instructed to rewrite a given prompt into a more complex version by adding specific details, while keeping it reasonable and understandable.
- $q_{i,t}$: Denotes the $i$-th query at the $t$-th complexity evolution step.
- $i$: The index for queries starting from the diversity dataset.
- $t$: The current iteration number, where $T$ is the total number of rounds of complexity evolution. In DeepShop, $T = 5$ rounds are applied (queries with 0-1, 2-3, and 4-5 evolution steps are labeled easy, medium, and hard, respectively).
- $q_{i,0}$: Denotes the $i$-th query from the diversity dataset, serving as the starting point for complexity evolution.
- $c$: The randomly selected strategy for increasing complexity in the current iteration. The three strategies are:
  - Attribute evolution: This strategy enhances the query by incorporating concrete product attributes. Examples include brand, model, specific price range, color, size, weight, or unique product features. GPT-4o is prompted to specify concrete values for one product attribute based on its knowledge, ensuring these details are directly incorporated into the query.
  - Filter evolution: This strategy enhances the query by adding specific search filters commonly available on e-commerce platforms. Examples include constraints like minimum customer rating (e.g., 4.5 stars), minimum number of reviews (e.g., 500+), shipping options (e.g., free delivery), release timeframe (e.g., new arrivals in the past 30 days), return policies, or warranty information. GPT-4o is prompted to specify concrete values for these constraints.
  - Sorting evolution: This strategy enhances the query by appending a sorting preference. This directs the system to find top-ranked products according to criteria such as lowest price, highest user rating, newest arrival, or best seller ranking. GPT-4o is prompted to integrate a specific sorting requirement based on one of these criteria.

By iteratively applying these strategies over $T$ rounds, the method mimics the natural evolution of user queries, generating a hierarchical set of increasingly complex queries. Starting from the diverse queries in the diversity dataset, this process results in a total of 600 queries. A small code sketch of both evolution stages is given after the running example below.
The figure below (Figure 2 from the original paper) illustrates examples of diversity and complexity evolution in DeepShop:
(Figure: running examples of diversity and complexity evolution in DeepShop, covering the three complexity evolution types of attribute, filter, and sorting evolution, and illustrating the gradual transformation of user queries.)
As shown, a seed query like "Find a book on web scraping" can be diversified to "Find a book on python programming" (within the Books category). This diversified query then undergoes complexity evolution:
- Attribute Evolution: adding "by author John Smith, published after 2020."
- Filter Evolution: adding "with 4+ star ratings and free shipping."
- Sorting Evolution: adding "sort by lowest price." These evolutions create queries of increasing difficulty.
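To illustrate the two evolution stages, the sketch below wires a GPT-4o rewrite call into both the Diversity and Complexity functions; the prompt wording, the `CATEGORIES` list, and the strategy instructions are paraphrased assumptions based on the description above, not the paper's released prompts.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CATEGORIES = ["Books", "Electronics", "Home", "Fashion", "Sports"]
STRATEGIES = {
    "attribute": "Add one concrete product attribute (e.g., brand, color, size, price range).",
    "filter": "Add one concrete search filter (e.g., minimum rating, free delivery, recent release).",
    "sorting": "Append one sorting preference (e.g., lowest price, highest rating, newest arrival).",
}

def rewrite(instruction: str, query: str) -> str:
    """Single GPT-4o rewrite call shared by both evolution stages."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{instruction}\n\nQuery: {query}"}],
    )
    return resp.choices[0].message.content.strip()

def diversity_evolve(seed_query: str) -> str:
    """q_i* = Diversity(q_i, d): retarget the query to a random product category."""
    d = random.choice(CATEGORIES)
    return rewrite(
        f"Rewrite this web shopping query for a different specific product in the "
        f"Amazon '{d}' category, keeping similar length and complexity.", seed_query)

def complexity_evolve(query: str, rounds: int = 5) -> list[str]:
    """q_{i,t+1} = Complexity(q_{i,t}, c): apply one random strategy per round.
    Steps 0-1, 2-3, and 4-5 correspond to easy, medium, and hard queries."""
    history = [query]
    for _ in range(rounds):
        c = random.choice(list(STRATEGIES))
        history.append(rewrite(
            f"{STRATEGIES[c]} Rewrite the query to be more complex but still "
            f"reasonable and understandable.", history[-1]))
    return history
```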
4.2.5. Dataset Analysis
The paper conducts analysis on the generated dataset to confirm its characteristics.
4.2.5.1. Analysis of Query Diversity Evolution
Existing benchmarks often have skewed distributions across product categories, introducing bias. To mitigate this, DeepShop constructs a balanced subset of 150 queries from its 600-query pool, systematically selecting 30 queries from each of the five major categories: Books, Electronics, Home, Fashion, and Sports. This balanced distribution (manually verified for quality and availability on corresponding websites) ensures a controlled and equitable testbed for evaluating cross-domain generalization, allowing clearer assessment of an agent's ability to generalize beyond narrow domain specialization.
The following figure (Figure 3 from the original paper) illustrates the product category distribution:
(Figure: distribution of product categories after query diversity evolution. The chart compares the number of queries in the seed data and in DeepShop across the Books, Electronics, Home, Fashion, and Sports categories; DeepShop contains 30 queries in each category, reflecting the improved balance and diversity over the seed data.)
The graph clearly shows that after query diversity evolution, DeepShop has an equal number of queries (30) for each of the five categories, unlike the seed data which had an imbalanced distribution (e.g., 20 for Home, 4 for Books).
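A balanced evaluation subset like the one described above can be drawn with a few lines of Python; the sampling below is a generic sketch (the paper additionally verifies each selected query manually for quality and availability on the corresponding websites).

```python
import random
from collections import defaultdict

def balanced_subset(queries: list[dict], per_category: int = 30, seed: int = 0) -> list[dict]:
    """Sample an equal number of queries from each product category."""
    random.seed(seed)
    by_category = defaultdict(list)
    for q in queries:
        by_category[q["category"]].append(q)
    subset = []
    for items in by_category.values():
        subset.extend(random.sample(items, per_category))  # 5 categories x 30 = 150
    return subset
```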
4.2.5.2. Analysis of Query Complexity Evolution
A fine-grained analysis is performed on how query complexity evolves across the three dimensions (product attributes, search filters, sorting preferences).
The following figure (Figure 4 from the original paper) presents this analysis:
(Figure 4: the three dimensions of query complexity evolution, product attributes, search filters, and sorting preferences, plotted as counts against the number of evolution iterations, comparing the iterative evolution process with the DeepShop subsets at each difficulty level.)
- Product Attributes (Figure 4a): The average number of product attributes per query steadily increases across iterations. The final DeepShop dataset has an average of 0.52 more attributes than the seed data, and the hard subset has an additional 0.66 attributes on average.
- Search Filters (Figure 4b): The average number of search filters per query consistently increases. DeepShop queries include, on average, 1.95 more filters than seed queries, with the hard subset showing an increase of 2.88 filters on average.
- Sorting Preferences (Figure 4c): The average number of sorting preferences per query also shows an upward trend. The final average exceeds the seed data by 0.37, and the hard subset contains an additional 0.66 sorting preferences on average.

This analysis confirms that the complexity evolution strategy successfully creates increasingly complex queries.
4.2.6. Evaluation Metrics
DeepShop uses a two-stage evaluation protocol: fine-grained evaluation and holistic task success evaluation. Given the challenges of human evaluation, GPT-4o is primarily used for automatic assessment, following previous work [11, 50].
4.2.6.1. Fine-grained Evaluation
- Decomposition: Each complex query is first decomposed into its constituent parts: a product attribute subquery, a search filter subquery, and a sorting preference subquery.
- GPT-4o Assessment: For each web agent trajectory (a sequence of actions and observations, typically screenshots), GPT-4o is prompted to assess whether the final results align with the requirements specified in each subquery. The prompt includes the user subquery, screenshots (up to 15), and the agent's final answer (textual response).
- Binary Decision: GPT-4o provides a binary decision ("Success" or "Not Success") for each subquery.
- Purpose: This fine-grained evaluation captures partial success cases and helps diagnose specific failure modes more precisely than a simple holistic pass/fail. If a subquery is not present in the original query (e.g., no explicit sorting preference), its evaluation is skipped and not included in the calculation for that specific aspect. A minimal sketch of this judging step follows.
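The fine-grained, GPT-4o-based judging step could be implemented roughly as below; the prompt text is abbreviated and the helper names (`judge_subquery`, `encode_image`) are illustrative assumptions, mirroring the structure of the evaluation prompt shown later in Appendix A.1.

```python
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> dict:
    """Package a screenshot as an image content part for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def judge_subquery(subquery: str, final_answer: str, screenshot_paths: list[str]) -> bool:
    """Binary fine-grained verdict: does the trajectory satisfy this subquery?"""
    content = [{"type": "text", "text": (
        "You are evaluating a web shopping agent. Decide whether the final result "
        "satisfies the requirement below. Answer 'SUCCESS' or 'NOT SUCCESS'.\n"
        f"Requirement: {subquery}\nAgent's final answer: {final_answer}"
    )}]
    content += [encode_image(p) for p in screenshot_paths[-15:]]  # up to 15 screenshots
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    )
    return "NOT SUCCESS" not in resp.choices[0].message.content.upper()
```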
4.2.6.2. Holistic Evaluation
- Aggregation: The holistic evaluation calculates the overall task success by aggregating the outcomes of the fine-grained evaluation for product attributes, search filters, and sorting preferences.
- Rule-based Checking: For each dimension, the system checks whether the original query explicitly specified a requirement.
  - If a particular aspect (e.g., attribute, filter, or sorting) is present in the query, its corresponding success score from the fine-grained evaluation is considered.
  - If an aspect is not present in the query, it is treated as automatically satisfied for the purpose of holistic evaluation (i.e., the agent is not penalized for not fulfilling a non-existent requirement).
- Overall Success Condition: The final holistic task success is determined as "Success" only if all required components (attributes, filters, and sorting preferences that were explicitly part of the query) are successfully satisfied. A single failure in any explicitly required dimension leads to an overall "Not Success."
- Deep Research Systems: For deep research systems (like Gemini or OpenAI Deep Research), intermediate execution screenshots are typically unavailable. Therefore, both fine-grained and holistic evaluations for these systems are conducted manually by human evaluators, who verify the returned links against the query requirements.
4.2.6.3. Agreement Rate between LLM Evaluation and Human Judge
To ensure the reliability of the GPT-4o-based evaluation, an agreement rate (also known as inter-annotator agreement) is calculated between human judgments and GPT-4o judgments.
- Procedure: Human annotators are shown the full interaction trace of an agent, including screenshots and actions, and are asked to judge whether the agent successfully fulfilled the user's request for each sub-goal and the overall task.
- Results: The agreement rates between human judges and GPT-4o judges are high across all four dimensions: product attributes, search filters, sorting preferences, and overall task success (the exact percentages are reported in the paper). These high agreement rates indicate the effectiveness and reliability of using GPT-4o for evaluation in this context.
The following figure (Figure 6 from the original paper, part of Appendix A.1) shows the general structure of the GPT-4o prompt used for fine-grained evaluation. This prompt guides GPT-4o to assess task completion based on subqueries, screenshots, and the agent's final answer. It explicitly states that GPT-4o should not interact with web pages, make assumptions, or rely solely on the textual response if it contradicts the screenshot.
[System prompt]
As an evaluator, you will be presented with three primary components to assist you in your role:
1. Web Task Instruction: A clear and precise natural language directive that specifies an online shopping activity to be executed. The instruction may involve locating products that meet certain attribute requirements (e.g., color, size, brand), applying specific search filters (e.g., price range, customer ratings, availability), or fulfilling user-defined sorting preferences (e.g., lowest price, newest arrivals, best sellers). Tasks may also include verifying product details, comparing offers, or checking for shipping and return policies, depending on the scenario.
2. Result Screenshots: This is a visual representation of the screen showing the result or an intermediate state of performing a web task. It serves as visual proof of the actions taken in response to the instruction.
3. Result Response: This is a textual response obtained after the execution of the web task. It serves as the textual result in response to the instruction.
-- You DO NOT NEED to interact with web pages or perform actions such as conducting searches on websites.
-- You SHOULD NOT make assumptions based on information not presented in the screenshot when comparing it to the instructions.
-- Your primary responsibility is to conduct a thorough assessment of the web task instruction against the outcome depicted in the screenshot and in the response, evaluating whether the actions taken align with the given instructions.
-- NOTE that the instruction may involve more than one sub-task; failing to complete any required sub-task (for example, locating the right product but not applying the requested filter or providing the requested summary) should be considered unsuccessful.
-- NOTE that the screenshot is authentic, but the response provided by the LLM is generated at the end of web browsing, and there may be discrepancies between the text and the screenshots.
-- Note the difference: 1) Result response may contradict the screenshot, then the content of the screenshot prevails, 2) The content in the Result response is not mentioned on the screenshot, choose to believe the content.
You should elaborate on how you arrived at your final evaluation and then provide a definitive verdict on whether the task has been successfully accomplished, either as 'SUCCESS' or 'NOT SUCCESS'.
[User prompt] TASK: {subquery}
Result Response: {answer}
15 screenshots at the end: {screenshots}
You will be presented with a web shopping task.
For each task, you will receive three subqueries, along with the web agent's action history and corresponding screenshots. Your goal is to evaluate the agent's performance across three specific dimensions: product attributes, search filters, and sorting preferences. Please note: if a subquery is labeled as None, you do not need to assess that particular aspect. Definitions of the three subqueries are as follows:
Product attributes, expressing detailed user intent (e.g., brand, color, or size).
Search filters, representing categorical or numerical constraints commonly used on e-commerce platforms.
Sorting preferences, indicating desired result orderings, such as price or popularity.
Task: {Query}
Product Attribute Requirement: {Subquery1}
Search Filter Requirement: {Subquery2}
Sorting Preference Requirement: {Subquery3}
Agent Action History: {Action}
This prompt is crucial for GPT-4o to act as a reliable evaluator by providing clear instructions on how to interpret task requirements, agent actions, and visual evidence (screenshots) to determine success or failure for each fine-grained aspect.
5. Experimental Setup
5.1. Datasets
The primary dataset used in the experiments is the DeepShop benchmark itself.
- Source: The DeepShop benchmark is constructed by taking seed queries from Mind2Web-Live [34] and WebVoyager [11], then applying query diversity and complexity evolution processes.
- Scale: The full DeepShop benchmark comprises 600 queries. For evaluation, a balanced subset of 150 queries is used, systematically selected with 30 queries from each of the five major categories.
- Characteristics:
  - Diversity: Covers five major e-commerce categories: Books, Electronics, Home, Fashion, and Sports.
  - Complexity: Queries are categorized into easy (0-1 complexity evolution steps), medium (2-3 steps), and hard (4-5 steps) based on the number of product attributes, search filters, and sorting preferences introduced during the evolution process.
  - Realism: Derived from real user queries and designed to be executed on live, real-time web environments (specifically Amazon.com, as hinted by figures and context).
  - Features: Each instance in the dataset includes the following fields (a hypothetical example instance is sketched at the end of this subsection):
    - id: A unique identifier for the example.
    - ques: The natural language shopping query.
    - web_name and web: The e-commerce platform name and its identifier (e.g., "Amazon" and its URL).
    - attribute, filter, sort: Subqueries describing the specific product attribute, search filter, and sorting preferences.
    - category: The product category information (e.g., Books, Electronics).
    - difficulty: The task difficulty level (e.g., easy, medium, hard).
- Domain: The United States region is specified, and the language is English.
- Intended Use: Evaluation of web agents in online shopping tasks through complex query understanding and UI interaction.
- Limitations (of the dataset itself): Currently focuses on desktop web interfaces, lacks support for dynamic user intent changes or multi-turn interactions, does not fully capture cognitive aspects of shopping behavior, and does not cover mobile layouts or multilingual queries.
The figure below (Figure 2 from the original paper) shows examples of data samples and their evolution, illustrating how a simple seed query transforms into more complex versions with specific attributes, filters, and sorting preferences:
(Figure: running examples of diversity and complexity evolution in DeepShop, covering the three complexity evolution types of attribute, filter, and sorting evolution, and illustrating the gradual transformation of user queries.)
For example, a seed query "Find a book on web scraping" is diversified to "Find a book on python programming". This diversified query can then be evolved to "Find a book on python programming by author John Smith, published after 2020 with 4+ star ratings and free shipping, sort by lowest price."
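For concreteness, a single DeepShop-style instance might look like the following Python dictionary; the field names follow the feature list above, while the values themselves are fabricated for illustration only.

```python
example_instance = {
    "id": "fashion_hard_012",                     # hypothetical identifier
    "ques": ("Find a men's waterproof hiking jacket in size L under $150 "
             "with 4+ star ratings and free delivery, sorted by lowest price."),
    "web_name": "Amazon",
    "web": "https://www.amazon.com/",
    "attribute": "men's waterproof hiking jacket, size L, under $150",
    "filter": "customer rating 4 stars & up; free delivery",
    "sort": "lowest price first",
    "category": "Fashion",
    "difficulty": "hard",
}
```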
5.2. Evaluation Metrics
The paper uses both fine-grained and holistic evaluation metrics. These metrics are success rates, indicating the percentage of tasks (or sub-tasks) that an agent successfully completes according to the specified criteria.
- Product Attribute Success Rate:
  - Conceptual Definition: This metric quantifies the agent's ability to correctly identify and match products that satisfy specific product attributes requested in the query. It focuses on whether details like brand, model, color, size, or a price range for the product itself were correctly handled.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Product Attribute Success Rate} = \frac{\text{Number of queries where product attributes are correctly satisfied}}{\text{Total number of queries with explicit product attributes}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where product attributes are correctly satisfied: The count of tasks where the agent successfully found a product meeting all specified product attribute requirements, as judged by GPT-4o (or human evaluators for deep research systems).
    - Total number of queries with explicit product attributes: The total count of tasks in the benchmark that included at least one specific product attribute requirement. Queries without explicit attributes are excluded from the denominator for this metric.
- Search Filter Success Rate:
  - Conceptual Definition: This metric measures the agent's proficiency in applying specified search filters on the e-commerce platform. It assesses whether the agent correctly interacted with UI elements to narrow down results based on criteria like minimum customer rating, shipping options (e.g., free delivery), or specific timeframes.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Search Filter Success Rate} = \frac{\text{Number of queries where search filters are correctly applied}}{\text{Total number of queries with explicit search filters}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where search filters are correctly applied: The count of tasks where the agent successfully applied all specified search filter requirements.
    - Total number of queries with explicit search filters: The total count of tasks in the benchmark that included at least one specific search filter requirement.
- Sorting Preference Success Rate:
  - Conceptual Definition: This metric evaluates the agent's capacity to correctly apply sorting preferences to the search results. It determines whether the agent arranged the listed products according to criteria such as lowest price, highest user rating, or newest arrival.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Sorting Preference Success Rate} = \frac{\text{Number of queries where sorting preferences are correctly applied}}{\text{Total number of queries with explicit sorting preferences}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where sorting preferences are correctly applied: The count of tasks where the agent successfully applied all specified sorting preference requirements.
    - Total number of queries with explicit sorting preferences: The total count of tasks in the benchmark that included at least one specific sorting preference requirement.
- Task Success Rate (Holistic):
  - Conceptual Definition: This is the overall success rate, indicating whether the agent fully completed the entire shopping task by satisfying all explicitly stated requirements, including product attributes, search filters, and sorting preferences. A task is considered successful only if every required component is met.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Task Success Rate} = \frac{\text{Number of tasks where all explicit requirements are met}}{\text{Total number of tasks in the benchmark}} \times 100\% $
  - Symbol Explanation:
    - Number of tasks where all explicit requirements are met: The count of tasks where the agent achieved "Success" in the holistic evaluation, meaning all product attributes, search filters, and sorting preferences (if explicitly stated in the query) were correctly satisfied.
    - Total number of tasks in the benchmark: The total number of queries in the DeepShop evaluation set (150 queries in the balanced subset).

A small sketch of how these success rates could be computed from per-query fine-grained results is given after this list.
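Under the definitions above, the fine-grained and holistic success rates can be computed from per-query results as sketched below; `results` is an assumed list of per-query records combining the dataset fields with the evaluation verdicts, not a format released by the paper.

```python
def success_rates(results: list[dict]) -> dict:
    """Compute the four DeepShop success rates (in %).

    Each record is assumed to contain:
      'required': {'attribute': bool, 'filter': bool, 'sort': bool}  # stated in query?
      'passed':   {'attribute': bool, 'filter': bool, 'sort': bool}  # judged verdicts
    """
    rates = {}
    for aspect in ("attribute", "filter", "sort"):
        relevant = [r for r in results if r["required"][aspect]]
        rates[aspect] = 100.0 * sum(r["passed"][aspect] for r in relevant) / max(len(relevant), 1)
    # Holistic rule: aspects absent from the query count as satisfied.
    holistic = [
        all(r["passed"][a] for a in ("attribute", "filter", "sort") if r["required"][a])
        for r in results
    ]
    rates["task_success"] = 100.0 * sum(holistic) / max(len(results), 1)
    return rates
```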
5.3. Baselines
The paper evaluates a range of approaches, categorized into three types: Simple RAG, Web agents, and Deep research systems.
- Simple RAG (Retrieval-Augmented Generation):
  - Model: GPT-4o + Google Search.
  - Mechanism: This baseline simulates a basic RAG approach. The user query is submitted to Google Search, the top-ranked webpage from the search results is retrieved (using the Serper API for programmatic access), and GPT-4o (version 2024-08-06) then generates a final response based on a screenshot of this retrieved webpage.
  - Representativeness: This represents a simple, non-interactive approach that relies purely on search and static content analysis, highlighting the limitations of RAG when dynamic web interaction is required.
- Web agents: All web agents use GPT-4o (version 2024-08-06) as their underlying large language model. They differ in their perception mechanisms and interaction strategies.
  - Agent-E [1]:
    - Mechanism: An HTML-based agent that employs a hierarchical planner-actor framework. It interprets instructions and navigates web interfaces using DOM trees (Document Object Model, a programming interface for HTML and XML documents). It is augmented with flexible DOM tree distillation and a denoising mechanism to improve decision accuracy. It utilizes full-page screenshots for perception.
    - Representativeness: Represents the capabilities of text-based, DOM-aware agents.
  - SeeAct [60]:
    - Mechanism: A vision-based agent that leverages the multimodal capabilities of LLMs. It integrates visual perception (using full-page screenshots) with structured web-based interactions.
    - Representativeness: Represents agents that primarily rely on visual input interpretation from LLMs.
  - WebVoyager [11]:
    - Mechanism: Also a multimodal reasoning agent. It introduces a set-of-mark prompting scheme, where the agent first generates intermediate thoughts before selecting final actions. It operates on the visible viewport only (not full-page screenshots).
    - Representativeness: Represents advanced multimodal agents with explicit reasoning steps.
  - Browser Use [29]:
    - Mechanism: An open-source web agent framework that combines visual understanding (operating on the visible viewport only) with HTML structure parsing to support robust web navigation and interaction.
    - Representativeness: Represents hybrid agents that leverage both visual and structural information for more robust interaction.
- Deep research systems: These are commercial systems with advanced reasoning capabilities. For these systems, explicit site constraints are included in the prompt to guide the search process, as they cannot be strictly constrained to specific websites in the same way open-source agents can.
  - Gemini Deep Research [8]:
    - Model: Gemini 2.0 Flash model with deep research capabilities, integrated into Google's Gemini Advanced platform.
    - Mechanism: An AI assistant that decomposes queries, performs extensive searches, and generates cited multi-step reports.
    - Representativeness: Represents Google's state-of-the-art commercial deep research LLM product.
  - OpenAI Deep Research [33]:
    - Model: o3 model (likely an internal designation for an advanced GPT model) with deep research enabled, powered by OpenAI's reasoning models.
    - Mechanism: An agentic system that autonomously browses, analyzes, and synthesizes web information into citation-rich outputs, emulating human research workflows.
    - Representativeness: Represents OpenAI's state-of-the-art commercial deep research LLM product.

All open-source agents (Agent-E, SeeAct, WebVoyager, Browser Use) are executed within real-time web environments (Playwright for Agent-E, SeeAct, and Browser Use; Selenium for WebVoyager). Each agent is limited to a maximum of 15 steps per task to control computation cost and prevent excessive exploration. A minimal sketch of such a step-capped live execution loop is given below.
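As a rough illustration of how an open-source agent might be driven in a live browser with the 15-step cap, the sketch below uses Playwright's synchronous Python API; the `agent` object and its `next_action` interface are assumptions standing in for any of the evaluated frameworks, not code from the paper.

```python
from playwright.sync_api import sync_playwright

MAX_STEPS = 15  # per-task cap used in the experiments

def run_task(agent, query: str, start_url: str = "https://www.amazon.com/") -> list[str]:
    """Execute one DeepShop query in a live browser, capturing a screenshot per step."""
    screenshots = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        for step in range(MAX_STEPS):
            path = f"step_{step}.png"
            page.screenshot(path=path)            # observation for the agent / evaluator
            screenshots.append(path)
            action = agent.next_action(query, page.content(), path)  # hypothetical interface
            if action["kind"] == "stop":
                break
            elif action["kind"] == "click":
                page.click(action["selector"])
            elif action["kind"] == "type":
                page.fill(action["selector"], action["text"])
                page.keyboard.press("Enter")
        browser.close()
    return screenshots
```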
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. RQ1: Performance Analysis of Web Agents
RQ1 addresses how Simple RAG methods, web agents, and deep research systems perform on the DeepShop benchmark across fine-grained and holistic evaluation metrics.
The following are the results from Table 2 of the original paper:
| Method | Product attribute | Search filter | Sorting preference | Task success |
| --- | --- | --- | --- | --- |
| Simple RAG | | | | |
| GPT-4o + Google Search | 7.33 | 5.97 | 4.55 | 7.33 |
| Web agents | | | | |
| Agent-E | 12.67 | 9.70 | 3.41 | 6.67 |
| SeeAct | 52.00 | 22.39 | 20.45 | 10.67 |
| WebVoyager | 40.67 | 38.00 | 23.86 | 16.00 |
| Browser Use | 36.00 | 34.33 | 30.68 | 32.00 |
| Deep research systems | | | | |
| Gemini Deep Research | 53.33 | 44.00 | 52.94 | 30.00 |
| OpenAI Deep Research | 60.00 | 46.15 | 58.82 | 30.00 |
Observations:
- Simple RAG struggles significantly: The GPT-4o + Google Search method performs very poorly across all metrics, with Task success at just 7.33%. It particularly struggles with Search filters (5.97%) and Sorting preferences (4.55%). This is expected because RAG fundamentally lacks the ability to interact dynamically with website elements (like clicking buttons to apply filters or change sorting orders). It can only retrieve information and generate text based on static content. This clearly demonstrates that DeepShop queries cannot be solved by retrieval alone.
- Web agents outperform RAG but face challenges with fine-grained requirements:
  - All web agents show better performance than Simple RAG, indicating the necessity of web interaction.
  - There is a progressive gain in Task success from HTML-based Agent-E (6.67%) to vision-based SeeAct (10.67%) and WebVoyager (16.00%), culminating in Browser Use (32.00%). Browser Use, which integrates both HTML and visual inputs, achieves the best performance among web agents.
  - However, even the best web agent (Browser Use) achieves only 32.00% Task success, highlighting the difficulty of simultaneously satisfying all three fine-grained requirements.
  - Different web agents excel in different fine-grained aspects: SeeAct leads in Product attribute (52.00%), WebVoyager in Search filters (38.00%), and Browser Use in Sorting preferences (30.68%). This suggests that no single web agent approach is uniformly superior across all sub-tasks.
- Deep research systems show enhanced fine-grained performance but limited overall success:
  - Both Gemini Deep Research (30.00%) and OpenAI Deep Research (30.00%) achieve similar Task success rates, which are comparable to or slightly lower than the best web agent (Browser Use).
  - However, they significantly excel in Product attributes (53.33% and 60.00%, respectively) and particularly in Sorting preferences (52.94% and 58.82%, respectively), often outperforming web agents in these aspects. This points to their stronger reasoning capabilities in interpreting and fulfilling such explicit instructions.
  - They still struggle with Search filters (44.00% and 46.15%), though they do better than most web agents. The paper suggests this is because many filters require deep exploration and confirmation on product detail pages, which these systems might not handle perfectly.
  - Despite their strong fine-grained performance in some areas, their holistic task success rates remain relatively low (30%), underscoring the immense challenge DeepShop poses: an agent must succeed in all specified aspects simultaneously.

In summary, the results validate DeepShop as a challenging benchmark. RAG methods are insufficient, web agents make progress through interaction but struggle with the combined complexity, and even sophisticated deep research systems face significant hurdles in achieving high holistic success rates, particularly with search filters and when all requirements must be met concurrently.
6.1.2. RQ2: Performance across Different Product Categories
RQ2 investigates how existing methods perform across different product categories (Books, Electronics, Home, Fashion, and Sports) in online shopping tasks.
The following figure (Figure 5 from the original paper, part a) shows the performance across different product categories:
Analysis of Figure 5(a) - Performance across different product categories:
- Simple RAG: Shows variable performance, doing relatively well in Home but dropping to 0% success in Fashion and Sports. This suggests that Home products may have richer, more easily retrievable textual descriptions via Google Search, while Fashion and Sports often rely on visual cues (e.g., specific styles, colors) that are harder for RAG to capture without active web interaction.
- Agent-E (HTML-based): Consistently underperforms across categories, and is particularly weak in Sports. Its reliance on HTML without strong visual processing limits its effectiveness in categories where visual elements are crucial.
- Vision-based agents (SeeAct, WebVoyager): Generally improve performance across domains compared to Agent-E and Simple RAG, demonstrating the value of visual processing.
- Browser Use (hybrid): Achieves the best cross-domain results among web agents by combining HTML and visual inputs, and shows more balanced performance across categories.
- Deep research systems (Gemini, OpenAI): Exhibit relatively stable trends across categories, outperforming most web agents. However, they face significant challenges in the Fashion and Sports categories: Gemini scores 0% in Sports, and OpenAI fails entirely in both Fashion and Sports. This highlights a critical need for robust multimodal reasoning to handle visually driven product categories effectively, even for advanced deep research systems.

The varied performance across categories underscores that different types of agents have strengths and weaknesses depending on the nature of the product domain, especially concerning the importance of visual information versus structured text or DOM elements.
6.1.3. RQ3: Performance across Query Complexity Evolution
RQ3 examines how the performance of web agents varies across different levels of query complexity, from seed queries to evolved complex queries with multiple attributes, filters, and sorting preferences.
The following figure (Figure 5 from the original paper, part b) shows the performance across query complexity evolution:
Analysis of Figure 5(b) - Performance across query complexity evolution:
- Clear negative correlation: There is a clear negative correlation between query complexity and agent performance across all methods. As tasks move from easy (0–1 complexity evolution steps) to medium (2–3 steps) and then hard (4–5 steps), success rates generally decline.
- Simple RAG: Achieves 16% on easy queries, drops to 6.00% on medium queries, and fails completely (0%) on hard tasks. This reinforces that Google Search alone cannot handle complex user needs with multi-faceted criteria.
- Web agents: Also exhibit sharp declines. The average accuracy for web agents falls from 28.5% on easy tasks to 17% on medium tasks, and drops a further 7 percentage points (to 10%) on hard tasks. While web agents are better than RAG, they remain significantly challenged by increasing query complexity.
- Deep research systems: Perform better than web agents on the hard subset. Even on hard tasks, OpenAI Deep Research achieves 20% success (and Gemini 18%), highlighting the importance of strong reasoning capabilities for handling complex instructions. Still, hard tasks remain very challenging even for these advanced systems, with a 20% success rate being relatively low.

The results clearly demonstrate that DeepShop successfully creates a gradient of difficulty. As queries become more complex by layering on attributes, filters, and sorting preferences, the ability of all evaluated systems to fulfill them drops considerably, indicating that current web agents and deep research systems have substantial room for improvement in handling real-world query complexity.
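For reference, the easy/medium/hard buckets used in this comparison are defined purely by the number of complexity-evolution steps applied to a seed query. A minimal sketch of that binning (a hypothetical helper, not DeepShop's code) is shown below.

```python
# Minimal sketch (hypothetical helper, not DeepShop's code): map the number of
# complexity-evolution steps applied to a seed query onto the easy / medium /
# hard labels described above.

def difficulty_level(num_evolution_steps: int) -> str:
    if num_evolution_steps <= 1:   # 0-1 evolutions
        return "easy"
    if num_evolution_steps <= 3:   # 2-3 evolutions
        return "medium"
    return "hard"                  # 4-5 evolutions

assert difficulty_level(0) == "easy"
assert difficulty_level(3) == "medium"
assert difficulty_level(5) == "hard"
```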
6.2. Error Analysis and Future Improvement Guidance
The paper conducts a detailed error analysis to identify primary failure modes, providing critical insights for future research.
6.2.1. Web Agents are limited by grounding ability
- Problem: Web agents struggle to accurately ground interface elements, i.e., to connect natural language instructions to specific UI elements. They fail to correctly identify interactive components such as buttons, sliders, and review sections.
- Examples: HTML-based agents may overlook visual details (e.g., product color, layout cues) crucial for decisions, because they focus on the DOM structure. Vision-based agents using set-of-mark prompts (a technique in which regions of interest are explicitly segmented and labeled with visual "marks") suffer from segmentation errors: interactive buttons are misclassified, and regions such as customer reviews remain unsegmented, preventing the use of rating filters. Small filtering and sorting widgets are often ignored.
- Future Work: Explore multimodal fusion techniques that combine HTML structure with visual context to enable stronger grounding (one possible fusion step is sketched after the figure discussion below).

The following figure (Figure 12 from the original paper) illustrates the limited grounding ability of web agents:
The image is a screenshot of an Amazon shopping page showing product information, user ratings, and price ranges for two green Xbox wireless controllers, illustrating the interface a web agent faces when handling complex queries in a real shopping scenario.
As shown, button 39 (related to user rating) was not properly segmented, preventing the agent from selecting a specific rating range. Buttons 31-37 and 41-44 were rendered too densely and overlapped, making interaction difficult. The sorting button on the right was incorrectly split into two buttons (16 and 17), which could confuse the agent.
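As a concrete illustration of the multimodal fusion direction mentioned above, the sketch below (an assumption, not the paper's method) matches DOM-derived element boxes against vision-model marks by overlap, so that controls like the unsegmented rating button (39) or the split sorting button (16/17) could still be grounded via the HTML side.

```python
# Minimal sketch (an assumption, not the paper's method): fuse HTML-derived
# element boxes with vision-model "marks" by overlap, so that controls the
# segmenter missed or split can still be grounded through the DOM.

from dataclasses import dataclass

@dataclass
class Box:
    x1: float; y1: float; x2: float; y2: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r.x2 - r.x1) * (r.y2 - r.y1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ground_element(dom_box: Box, marks: dict, threshold: float = 0.5):
    """Return the id of the visual mark that best matches a DOM element,
    or None when no mark overlaps enough (caller falls back to the DOM box)."""
    best = max(marks.items(), key=lambda kv: iou(dom_box, kv[1]), default=None)
    if best is not None and iou(dom_box, best[1]) >= threshold:
        return best[0]
    return None
```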
6.2.2. Web Agents often lack state assessment and replanning capabilities
- Problem: Agents fail to dynamically reassess the current webpage state and reformulate their plan when initial attempts fail or conditions are not met.
- Examples:
  - Issuing overly specific search queries and, upon retrieval failure, not backtracking to reformulate broader alternatives.
  - Navigating to a product detail page, finding a requirement unmet (e.g., a specific warranty is not offered), and then continuing to scroll inefficiently on the current page instead of returning to the search results or exploring other options.
  - Repeating ineffective actions (e.g., clicking an unresponsive element multiple times) due to limited awareness of webpage state transitions.
- Future Work: Fine-tune agents in realistic web environments to enhance their ability to reason over search failures and adapt plans dynamically (a minimal replanning-loop sketch follows the figure discussion below).
The following figure (Figure 13 from the original paper) illustrates a web agent's failure to reassess and replan:
The image is Figure 13 from the paper, illustrating a failure case in which a web agent does not reassess its state or replan during shopping: through a series of click and scroll actions, it keeps exploring the current page rather than backtracking.
In this example, the agent enters a product detail page to verify a 1-year warranty. Upon realizing the requirement is unmet, it fails to reassess its state. Instead of returning to the search results page to look for other options, the agent continues to scroll within the current page, inefficiently attempting to locate an alternative product on the same page.
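The sketch below (hypothetical agent and browser interfaces, not the paper's implementation) shows the kind of control loop the authors call for: after each step the agent reassesses whether the current page can still serve the sub-goal, and backtracks or reformulates instead of scrolling indefinitely.

```python
# Minimal sketch (hypothetical interfaces, not the paper's agent): a control
# loop that reassesses page state after every action and backtracks instead
# of scrolling indefinitely when the current page cannot satisfy the sub-goal.

MAX_TOTAL_STEPS = 30
MAX_STEPS_PER_PAGE = 5

def run_task(agent, browser, goal) -> bool:
    steps_on_page = 0
    for _ in range(MAX_TOTAL_STEPS):
        observation = browser.observe()
        if agent.goal_satisfied(goal, observation):
            return True
        # State assessment: can this page still contribute to the goal?
        if not agent.page_is_promising(goal, observation) or steps_on_page >= MAX_STEPS_PER_PAGE:
            browser.go_back()                            # backtrack to the results page
            goal = agent.reformulate(goal, observation)  # e.g., broaden an over-specific query
            steps_on_page = 0
            continue
        browser.execute(agent.next_action(goal, observation))
        steps_on_page += 1
    return False
```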
6.2.3. Web Agents are constrained by a limited action space
- Problem: Web agents operate within a restricted set of browser actions, preventing them from interacting with dynamic UI components commonly found on shopping platforms.
- Examples: An agent fails to filter products within a specific price range because it cannot drag a price slider. Agents also struggle with dropdowns, sliders, and nested menus, which are essential for precise filtering and sorting.
- Future Work: Expand the agent's action repertoire with shopping-specific operations and deeper browser integration, allowing for more complex UI manipulations (one such extended action is sketched after the figure discussion below).
The following figure (Figure 14 from the original paper) illustrates the web agent's failure to apply the price filter during task execution:
The image shows two side-by-side webpage screenshots of using Amazon's price-filter feature: the agent clicks the "Go" button without the products being filtered successfully, and the annotated interactions and the resulting incorrect product listing reflect the shopping agent's failure at filtering.
The agent attempts to filter cameras within the $100–$300 price range. However, it is unable to interact with the dynamic price slider UI element. Instead, it clicks the adjacent "Go" button without adjusting the slider values, resulting in ineffective filtering. This highlights the limitation of a constrained action space.
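As an example of the kind of extended action the authors argue for, the sketch below uses Playwright's low-level mouse API to drag a slider handle before confirming the filter; the selectors are placeholders, not Amazon's real markup, and this is an illustrative assumption rather than a proposed standard action set.

```python
# Minimal sketch (assumes Playwright; selectors are placeholders, not Amazon's
# real markup): an extended, shopping-specific action -- dragging a price-slider
# handle -- that typical click/type-only action spaces lack.

from playwright.sync_api import Page

def drag_slider_handle(page: Page, handle_selector: str, dx: float) -> None:
    """Drag a slider handle horizontally by dx pixels using raw mouse events."""
    handle = page.locator(handle_selector)
    box = handle.bounding_box()
    if box is None:
        raise RuntimeError("slider handle not visible")
    start_x = box["x"] + box["width"] / 2
    start_y = box["y"] + box["height"] / 2
    page.mouse.move(start_x, start_y)
    page.mouse.down()
    page.mouse.move(start_x + dx, start_y, steps=10)  # smooth drag
    page.mouse.up()

# Usage (hypothetical selectors): move the lower-price handle first, and only
# then click "Go", instead of clicking "Go" with the slider untouched.
# drag_slider_handle(page, "#priceSliderLowerHandle", dx=40)
# page.click("input[aria-label='Go']")
```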
6.2.4. Web Agents lack the ability to learn from execution
- Problem: Current agents show limited ability to generalize across tasks. Experience gained in one interaction (successes or failures) is not transferred to future scenarios.
- Examples: Agents repeatedly make the same mistakes, such as misusing a retriever to query filtering or sorting constraints that are only accessible via specific UI components. This leads to irrelevant results and demonstrates a lack of adaptive learning.
- Future Work: Enable execution-time learning and memory, allowing agents to abstract successful patterns, track failure cases, and refine decision-making over time. This could involve task-level memory modules, outcome-based self-reflection, and lifelong learning mechanisms (a small memory-module sketch follows the figure discussion below).

The following figure (Figure 15 from the original paper) illustrates the web agent's failure to learn from execution:
The image is a webpage screenshot illustrating a shopping agent failing to learn effectively during execution; it contains Amazon search results and UI elements, with the failure messages highlighted.
This figure shows screenshots from four different tasks where the web agent consistently misuses the retriever (likely a search bar or internal search function) for filtering or sorting, even though these functionalities are typically handled by dedicated UI components. This repeated error across tasks demonstrates a lack of execution-time learning, as the agent doesn't adapt its strategy based on past failures.
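The sketch below is an illustrative (not paper-proposed) task-level memory: it records failed action patterns so that the agent can check the memory before repeating them, e.g., before typing a sorting constraint into the search box again after that strategy has already failed.

```python
# Minimal sketch (illustrative only, not the paper's proposal): a task-level
# memory that records failed (page, action) patterns so the agent can avoid
# repeating them across tasks.

from collections import defaultdict

class FailureMemory:
    def __init__(self, max_repeats: int = 1):
        self.max_repeats = max_repeats
        self.failures = defaultdict(int)  # (page_kind, action_pattern) -> count

    def record_failure(self, page_kind: str, action_pattern: str) -> None:
        self.failures[(page_kind, action_pattern)] += 1

    def should_avoid(self, page_kind: str, action_pattern: str) -> bool:
        return self.failures[(page_kind, action_pattern)] >= self.max_repeats

memory = FailureMemory()
memory.record_failure("search_results", "type_sorting_constraint_into_search_box")
# Before acting, the agent consults memory and uses the dedicated sort dropdown instead.
assert memory.should_avoid("search_results", "type_sorting_constraint_into_search_box")
```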
6.2.5. Deep research systems are prone to hallucination errors
- Problem: Deep research systems often oversimplify complex queries, neglect constraints, and return confident yet inaccurate recommendations or incorrect information.
- Examples:
  - OpenAI's deep research system may assert that a matching product exists even when it does not, or claim size requirements are met when they are not.
  - Both the Gemini and OpenAI systems frequently return incomplete or incorrect links, redirecting to irrelevant websites or generic navigation pages instead of specific product detail views, violating task constraints.
  - They may extract policy information (e.g., return policies) from external, non-relevant sites rather than from the specified e-commerce platform.
- Future Work: Apply preference alignment and fact-checking techniques to reduce hallucination rates and improve the precision of retrieved links (a simple post-hoc link check is sketched after the figure discussion below).

The following figures (Figure 16 and Figure 17 from the original paper) illustrate hallucination errors in the OpenAI deep research system:
The image shows an e-commerce product page for a women's sleeveless vintage floral-print maxi dress; the page lists only Large and XX-Large as available sizes, while the task explicitly requires a Medium, illustrating the filtering and specification-matching challenges in shopping-agent tasks.
Figure 16 shows the OpenAI deep research system's answer to a task requesting a "Women's Vintage Floral Maxi Dress in Navy Blue, Size: Medium," and an explanation of the return policy. The system returns three links.
Figure 17 provides a detailed view of the first returned product link. Despite the task specifying "Size: Medium," the linked product only offers "Large" and "XX-Large" options. The deep research system hallucinates that the size requirement is met. Furthermore, Link2 and Link3 point to non-Amazon websites (e.g., "smoking-er.com"), violating the implied task constraint of searching on Amazon (as suggested by the context of a shopping benchmark). The system also incorrectly extracts return information from these external sites. These instances demonstrate hallucinations in both satisfying attribute constraints and sourcing accurate information from the correct domain.
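A simple post-hoc check of the kind the fact-checking suggestion points to could catch both failure modes seen here: links leaving the allowed domain and products that do not actually offer the requested size. The sketch below is a hypothetical checker, not the paper's evaluator.

```python
# Minimal sketch (hypothetical checker, not the paper's evaluator): post-hoc
# verification of an agent's answer -- reject links that leave the allowed
# domain and flag products that do not offer the requested size, the two
# failure modes shown in Figures 16-17.

from urllib.parse import urlparse

def on_allowed_domain(url: str, allowed_domain: str = "amazon.com") -> bool:
    host = urlparse(url).netloc.lower()
    return host == allowed_domain or host.endswith("." + allowed_domain)

def size_available(requested: str, offered_sizes: list) -> bool:
    return requested.strip().lower() in {s.strip().lower() for s in offered_sizes}

# Mirroring the Figure 17 case: an off-domain link and a page offering only
# Large / XX-Large when Medium was requested should both be flagged.
assert not on_allowed_domain("https://smoking-er.com/some-dress")
assert not size_available("Medium", ["Large", "XX-Large"])
```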
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces DeepShop, a novel and comprehensive benchmark designed to evaluate web agents in highly realistic and complex online shopping environments. It addresses the critical gap in existing benchmarks, which often feature simplistic queries, by systematically evolving query diversity across five major e-commerce domains and progressively increasing complexity through the addition of product attributes, search filters, and sorting preferences. DeepShop also provides a fine-grained and holistic evaluation framework, leveraging GPT-4o for automated assessment validated by human agreement.
Experimental results demonstrate that DeepShop is a challenging benchmark. Simple RAG methods fail due to their inability to perform dynamic web interactions. While web agents show improved performance through interaction, they still struggle significantly with simultaneously satisfying all fine-grained requirements, especially search filters and sorting preferences. Even advanced deep research systems face considerable challenges, exhibiting hallucination errors and relatively low overall success rates, particularly in visually-driven categories and for hard complexity queries. The detailed error analysis provides crucial insights into limitations in grounding, replanning, action space, and execution-time learning for web agents, and hallucination for deep research systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations of DeepShop that open avenues for future research:
- Desktop Interfaces Only: The benchmark currently focuses solely on desktop web interfaces and does not include mobile-specific layouts or interactions. Future work could extend it to mobile environments.
- Lack of Dynamic User Intent and Multi-turn Interactions: DeepShop does not support dynamic changes in user intent during a task or complex multi-turn conversational interactions, which are common in real shopping assistance. Future benchmarks could incorporate these conversational aspects.
- Limited Cognitive Aspects: The benchmark does not fully capture the nuanced cognitive aspects of human shopping behavior, such as comparison strategies, brand loyalty, or subjective preferences.
- Benefiting from Tool Learning and Agent Capabilities: The authors suggest that DeepShop could benefit from recent advances in tool learning (allowing agents to use external tools more effectively) and broader agent capabilities (e.g., more sophisticated reasoning and planning).

From a societal perspective, the authors note that while shopping agents can assist users, they raise concerns about privacy and consumer manipulation. Future work should consider the broader implications of agent-centric information access on consumer behavior and market dynamics, ensuring ethical decision-making.
7.3. Personal Insights & Critique
DeepShop is a highly valuable contribution to the field of web agents, particularly for e-commerce. Its systematic approach to generating diverse and complex queries is a significant improvement over prior benchmarks, which often oversimplified real-world tasks. The fine-grained evaluation is especially insightful, as it moves beyond a simple pass/fail to diagnose where agents succeed or fail, providing actionable feedback for developers. The high GPT-4o agreement rates with human judgment also enhance the scalability and reproducibility of the benchmark.
Inspirations and Applications:
- Robust Agent Design: The identified error categories (grounding, replanning, action space, learning from execution, hallucination) provide a clear roadmap for designing more robust web agents. For instance, the need for multimodal fusion to improve grounding is a critical insight that can be applied to other GUI-based automation tasks beyond shopping.
- Curriculum Learning for Agents: The easy, medium, and hard complexity levels within DeepShop naturally lend themselves to curriculum learning approaches, where agents could be initially trained or fine-tuned on simpler tasks before progressing to more complex ones.
- Evaluation Beyond Binary: The fine-grained evaluation paradigm can be transferred to other complex multi-step tasks (e.g., customer support, data entry, research tasks) to provide more diagnostic insights into agent performance.
- Benchmarking Deep Research Systems: The inclusion of commercial deep research systems provides a valuable baseline and highlights their current limitations, pushing the research community to improve these powerful, yet imperfect, systems.
Potential Issues/Areas for Improvement:
- Dependence on GPT-4o for Query Generation and Evaluation: While GPT-4o is powerful, its use for both query generation and evaluation introduces a potential risk of "model overfitting" or hallucination in the benchmark creation process itself. Although human verification is performed, the inherent biases or limitations of GPT-4o could subtly influence the types of queries generated or how success is judged.
- Action Space Definition: The critique of web agents' limited action space is valid, but the paper does not propose a concrete expanded action set or a method for learning new actions. Future work stemming from DeepShop could focus on designing a more universal and extensible action space that better handles dynamic UI elements.
- Dynamic Website Changes: While DeepShop uses live websites, e-commerce platforms are constantly updated. This dynamism, while realistic, can lead to benchmark decay over time, requiring continuous maintenance and re-verification of tasks.
- Cognitive Aspects: The acknowledged limitation regarding cognitive aspects is significant. Real shopping involves subjective preferences, trust, and comparison strategies that are hard to capture with objective attributes, filters, and sorting. Integrating user feedback or preference learning into the benchmark could make it even more realistic.

Overall, DeepShop represents a crucial step forward in evaluating web agents, setting a higher bar for realistic performance and offering clear directions for future research in building truly intelligent and robust deep research shopping agents.