DeepShop: A Benchmark for Deep Research Shopping Agents
TL;DR Summary
DeepShop introduces a benchmark for deep research shopping agents that evolves queries in both diversity and complexity and pairs fine-grained with holistic evaluation, revealing the limitations of current methods, from RAG to web agents and deep research systems, on complex, multi-attribute shopping scenarios.
Abstract
Web agents for online shopping have shown great promise in automating user interactions across e-commerce platforms. Benchmarks for assessing such agents do not reflect the complexity of real-world shopping scenarios, as they often consist of overly simple queries with deterministic paths, such as "Find iPhone 15." Real shopping scenarios are inherently more layered, involving multi-dimensional product attributes, search filters, and user-specific sorting preferences. To address this gap, we introduce DeepShop, a benchmark designed to evaluate web agents in complex and realistic online shopping environments. DeepShop comprises three key components. (1) Query diversity evolution: Starting from real user queries, we generate diverse queries across five popular online shopping domains. (2) Query complexity evolution: We further evolve these queries to increase complexity, considering product attributes, search filters, and sorting preferences, and classify them into three levels: easy, medium, and hard, based on the number of evolutions. (3) Fine-grained and holistic evaluation: We propose an automated evaluation framework that assesses agent performance in terms of fine-grained aspects (product attributes, search filters, and sorting preferences) and reports the overall success rate through holistic evaluation. We conduct a systematic evaluation of retrieval-augmented generation (RAG) methods, web agents, and deep research systems. Results show that RAG struggles with complex queries due to its lack of web interaction, while other methods face significant challenges with filters and sorting preferences, leading to low overall success rates. We also perform cross-category, complexity-based evaluations and error analyses to support the advancement of deep research shopping agents.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
DeepShop: A Benchmark for Deep Research Shopping Agents
1.2. Authors
The paper lists six authors with their affiliations:
- Yougang Lyu (University of Amsterdam)
- Xiaoyu Zhang (Shandong University)
- Lingyong Yan (Baidu Inc.)
- Maarten de Rijke (University of Amsterdam)
- Zhaochun Ren (Leiden University)
- Xiuying Chen (Mohamed bin Zayed University of Artificial Intelligence (MBZUAI))
The authors represent a mix of academic institutions and an industry research lab, indicating a collaborative effort spanning different research perspectives, particularly in information retrieval, natural language processing, and artificial intelligence.
1.3. Journal/Conference
The paper was posted on 2025-06-03 (UTC timestamp 2025-06-03T13:08:17.000Z), suggesting it is a recent or forthcoming publication. While a specific conference or journal is not listed in the provided text, its status as an arXiv preprint and the typical submission targets for such work (major AI/NLP/IR venues such as NeurIPS, ICLR, ACL, EMNLP, SIGIR, WWW) suggest it is aimed at high-impact venues in these fields. Benchmarking papers are highly valued in these communities for setting new research directions and evaluation standards.
1.4. Publication Year
2025 (based on the UTC timestamp 2025-06-03T13:08:17.000Z).
1.5. Abstract
This paper introduces DeepShop, a novel benchmark designed to evaluate web agents in complex and realistic online shopping scenarios. The motivation stems from the observation that existing benchmarks often feature overly simplistic queries, failing to capture the multi-dimensional attributes, search filters, and sorting preferences inherent in real-world shopping. DeepShop addresses this gap through three main components: (1) Query diversity evolution, which generates diverse queries across five popular shopping domains from real user queries; (2) Query complexity evolution, which iteratively increases query complexity by adding product attributes, filters, and sorting preferences, categorizing them into easy, medium, and hard levels; and (3) a Fine-grained and holistic evaluation framework that assesses agent performance on these specific aspects and reports an overall success rate. The authors evaluate various approaches, including retrieval-augmented generation (RAG) methods, general web agents, and commercial deep research systems. Results show that RAG struggles due to its lack of web interaction, while other methods face significant challenges with filters and sorting, leading to low overall success rates. The paper concludes by providing cross-category, complexity-based evaluations and error analyses to guide future development in deep research shopping agents.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2506.02839
- PDF Link: https://arxiv.org/pdf/2506.02839v1.pdf
- Publication Status: This is an arXiv preprint, indicating it is publicly available but may not have undergone full peer review for a specific conference or journal yet, or it is a preprint version of a paper that has been accepted or is under review.
2. Executive Summary
2.1. Background & Motivation
The core problem DeepShop aims to solve is the lack of realistic and complex benchmarks for evaluating web agents in online shopping scenarios. Current benchmarks for web agents often consist of overly simplistic queries (e.g., "Find iPhone 15") with deterministic paths, which do not reflect the layered and nuanced nature of real-world shopping tasks.
This problem is important because web agents, especially those integrating large language models (LLMs), show great promise in automating user interactions on e-commerce platforms. However, despite advancements in planning, memory, and web interaction capabilities, existing agents still struggle with complex user queries in dynamic shopping environments. Real shopping requires "deep research"—browsing, filtering, comparing—to meet diverse and nuanced user preferences. The gap between these complex real-world needs and the limited complexity of current benchmarks means that the true capabilities and limitations of web agents in practical e-commerce settings remain underexplored.
The paper's entry point is to bridge this gap by introducing DeepShop, a benchmark that mirrors the complexity and diversity of real-world shopping, enabling a more accurate and comprehensive evaluation of web agents. Its innovative idea is to systematically evolve real user queries to increase both their diversity across product categories and their complexity by adding specific product attributes, search filters, and sorting preferences.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- DeepShop Benchmark Introduction: The creation of DeepShop, a comprehensive benchmark for evaluating web agents in complex online shopping scenarios. It features diverse queries across five popular product categories (Books, Electronics, Home, Fashion, Sports) and varying complexity levels (easy, medium, hard) derived from a multi-stage query diversity and complexity evolution process based on real-world shopping intents.
- Comprehensive Evaluation Framework: The proposal of a fine-grained and holistic evaluation framework. This framework assesses agent performance on specific aspects like correct product attribute matching, proper search filter application, and accurate sorting preference execution, alongside an overall holistic success rate. The evaluation is largely automated using GPT-4o with human agreement validation.
- Extensive Experimental Evaluation: Conducted systematic evaluations of various approaches, including simple retrieval-augmented generation (RAG) methods, advanced web agents (e.g., Agent-E, SeeAct, WebVoyager, Browser Use), and commercial deep research systems (e.g., Gemini Deep Research, OpenAI Deep Research) using the DeepShop benchmark.
- Detailed Analysis and Future Guidance: Provided detailed analyses across product categories, query complexity levels, and specific error types. This analysis reveals critical limitations in current systems (e.g., RAG struggles due to lack of web interaction, web agents struggle with fine-grained requirements like filters and sorting, deep research systems face hallucination errors), offering insights to guide the future development of more effective deep research shopping agents.

The key conclusions and findings reached by the paper are:
- RAG methods perform poorly on complex shopping queries due to their inherent lack of web interaction capabilities, being unable to apply filters or sorting.
- While web agents generally outperform RAG by interacting with websites, they still face significant challenges in simultaneously satisfying DeepShop's fine-grained requirements, particularly search filters and sorting preferences, leading to relatively low overall success rates.
- Deep research systems show improved performance in handling product attributes and sorting preferences due to their multi-step reasoning, but they still struggle with search filters and exhibit hallucination errors, limiting their overall success.
- Agent performance varies significantly across product categories, with visually driven categories like Fashion and Sports posing greater challenges for agents lacking robust multimodal reasoning.
- There is a clear negative correlation between query complexity and agent performance; as queries become harder, success rates drop dramatically for all evaluated methods.
- Critical limitations identified for web agents include poor grounding ability, lack of state assessment and replanning capabilities, a constrained action space (especially for dynamic UI elements), and an inability to learn from execution.

These findings highlight that current web agents and deep research systems are not yet robust enough to handle the full complexity of real-world online shopping scenarios, emphasizing the need for continued research and development guided by benchmarks like DeepShop.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the DeepShop paper, several foundational concepts related to AI, machine learning, and web interaction are crucial.
- Web Agents: In the context of this paper, a web agent (or sometimes AI agent or LLM agent) refers to an artificial intelligence program designed to autonomously interact with web environments, much like a human user would. This involves capabilities like navigating web pages, clicking buttons, typing into search fields, reading content, and interpreting visual layouts, all to achieve a specific goal or task. Web agents are typically powered by underlying AI models, often large language models (LLMs).
- Large Language Models (LLMs): LLMs are advanced AI models trained on vast amounts of text data, enabling them to understand, generate, and process human language. They are at the core of many recent AI advancements, including web agents, providing capabilities such as planning, reasoning, understanding instructions, and generating actions in natural language or code. Examples include GPT-4o, Gemini, etc.
- Retrieval-Augmented Generation (RAG): RAG is an architectural pattern for LLMs that combines information retrieval with text generation. When an LLM is asked a question, a RAG system first retrieves relevant information (e.g., documents, web pages, paragraphs) from a knowledge base or the internet. This retrieved information is then fed into the LLM along with the original query, allowing the LLM to generate a more informed and accurate response that is grounded in factual data, rather than relying solely on its pre-trained knowledge. In the context of DeepShop, Simple RAG often involves searching via Google and generating a response based on the search results, without direct interaction with dynamic websites.
- Partially Observable Markov Decision Process (POMDP): A POMDP is a mathematical framework used in AI to model decision-making problems where an agent's current state is not fully known.
  - Markov Decision Process (MDP): In an MDP, an agent interacts with an environment, taking actions that change the state of the environment, and receives rewards. The key is that the agent always knows its current state.
  - Partially Observable: In a POMDP, the agent does not directly observe the true state of the environment. Instead, it receives observations that are related to the state but may not fully reveal it. For a web agent, this means it might not know everything about a webpage's underlying structure or dynamic content, only what it can "see" (e.g., through screenshots or DOM trees).
  - The tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T})$ represents:
    - $\mathcal{S}$: The set of all possible states of the environment (e.g., the underlying HTML structure and backend data of a website).
    - $\mathcal{O}$: The set of all possible observations the agent can receive (e.g., screenshots, visible text, DOM tree elements).
    - $\mathcal{A}$: The set of all possible actions the agent can take (e.g., click, type, scroll).
    - $\mathcal{T}$: The transition function, which describes how taking an action in a given state leads to a new state.
- Online vs. Offline Benchmarks:
- Offline Benchmarks: These evaluate agents in static environments, often constructed from pre-collected web snapshots or manually curated HTML/DOM structures. They offer controlled conditions for reproducibility but fail to capture the dynamic and unpredictable nature of real-world websites.
- Online Benchmarks: These allow agents to operate within real-time web environments, interacting directly with live websites. This provides a more realistic evaluation but introduces challenges related to website changes, dynamic content, and potential non-determinism.
- Fine-grained Evaluation: This refers to assessing an agent's performance on specific, granular aspects of a task. In DeepShop, this includes evaluating whether the agent correctly handles product attributes (e.g., brand, color), search filters (e.g., ratings, price range), and sorting preferences (e.g., lowest price first). It helps diagnose specific failure modes.
- Holistic Evaluation: This refers to assessing the overall success of an agent in completing an entire task, often by aggregating fine-grained success into a single pass/fail metric. In DeepShop, it determines if all required components (attributes, filters, sorting) were successfully satisfied.
- Product Attributes, Search Filters, and Sorting Preferences: These are key components of online shopping queries:
- Product Attributes: Concrete characteristics of a product a user might specify, like "brand: Apple," "color: red," "size: Medium," "price range: $500-$800."
- Search Filters: Categorical or numerical constraints applied to search results, often via checkboxes, sliders, or dropdowns. Examples: "customer rating: 4 stars & up," "shipping: free delivery," "availability: in stock."
- Sorting Preferences: The desired order in which search results should be displayed. Examples: "sort by: lowest price," "sort by: highest user rating," "sort by: newest arrivals."
- Grounding: In the context of web agents, grounding refers to the ability of an LLM or agent to accurately connect natural language instructions (e.g., "click the 'Add to Cart' button") to specific, actionable elements within the visual or structural representation of a webpage (e.g., locating the exact button element in a screenshot or DOM tree). Poor grounding leads to incorrect actions.
- Hallucination (in LLMs): This occurs when an LLM generates information that is factually incorrect, nonsensical, or not supported by its input data, yet presents it as if it were true and confident. In deep research systems, this can manifest as making up product details, claiming a product meets criteria it doesn't, or providing incorrect links.
3.2. Previous Works
The paper extensively references prior research in web agent evaluation and development, categorizing benchmarks into offline and online.
Offline Benchmarks:
- Mind2Web [5]: An early benchmark for general web agents, using static snapshots or simulated environments. It provides controlled conditions but lacks real-world dynamism.
- WebShop [51]: Specifically designed for online shopping tasks but also uses a simulated environment (a fixed product catalog and interface), limiting its realism compared to live websites.
- WebArena [62]: Another offline benchmark for autonomous agents in web environments, aiming for realistic tasks but within a static setup.
- VWebarena [18] and MMInA [58]: Multimodal offline benchmarks, indicating a move towards agents that use both visual and text information, but still within static environments.
- ChatShop [3]: An offline benchmark focused on interactive information seeking with language agents, supporting multi-turn preference elicitation, but constrained by training products and not autonomously browsing live web content.
Online Benchmarks:
- WebLINX [21]: An online benchmark for real-world website navigation with multi-turn dialogue, offering a more dynamic setting.
- Mind2Web-Live [34]: An extension of Mind2Web that allows agents to operate in real-time web environments, marking a step towards more realistic evaluation. DeepShop draws seed queries from Mind2Web-Live.
- WebVoyager [11]: An online benchmark that enables end-to-end web agents with large multimodal models (LMMs), using iterative real-world exploration and feedback. DeepShop also draws seed queries from WebVoyager.
Web Agents for Task Automation (Baselines and related systems):
- WebGPT [30]: An early HTML-based agent that used LLMs to interpret instructions and navigate web interfaces using DOM trees.
- MindAct [5]: Another HTML-based agent, likely building on DOM tree interaction.
- Agent-E [1]: An HTML-based agent using a hierarchical planner-actor framework with DOM tree distillation. This is used as a baseline in DeepShop.
- SeeAct [60]: A vision-based agent exploiting the multimodal capabilities of LLMs, integrating visual perception with structured web-based interactions. Used as a baseline.
- Browser Use [29]: An open-source web agent framework combining visual understanding with HTML structure parsing for robust web navigation and interaction. Used as a baseline.
- OpenAI Deep Research [33] and Gemini Deep Research [8]: Commercial deep research systems that use advanced reasoning LLMs to tackle complex information-seeking tasks, autonomously browsing, analyzing, and synthesizing web information into citation-rich outputs. These are evaluated as baselines.
Query Understanding in E-commerce:
- Traditional information retrieval (IR) systems struggle with complex e-commerce queries due to overwhelming product spaces and nuanced user preferences [4, 45].
- Conversational IR systems, while supporting multi-turn preference elicitation, are limited by training products and cannot autonomously browse web content [3, 57]. Web agents offer a promising alternative by mimicking human browsing behaviors [11, 51].
3.3. Technological Evolution
The field of web agents has evolved significantly:
- Early Automation (Rule-based/Scripted): Initial attempts at web automation were often rule-based or required explicit scripting for specific tasks, lacking flexibility and generalization.
- Text-based Agents (DOM-centric): With the rise of LLMs, agents started using DOM trees (Document Object Model, a programming interface for HTML and XML documents that treats HTML elements as objects) to understand webpage structure and execute actions. WebGPT, MindAct, and Agent-E are examples. These agents primarily interact with the underlying code rather than the visual layout.
- Multimodal/Vision-based Agents: Recognizing that web interaction is inherently visual for humans, agents evolved to incorporate visual perception (e.g., screenshots) alongside text. SeeAct, WebVoyager, and Browser Use are prominent examples, integrating visual grounding to handle complex layouts and interactive components. This allows them to "see" what a human sees.
- Deep Research Systems (Advanced Reasoning): The latest wave involves highly sophisticated LLMs that can perform multi-step reasoning, plan complex tasks, and synthesize information from multiple sources, emulating human research workflows. Gemini Deep Research and OpenAI Deep Research represent this frontier.
- Benchmarking Evolution: Simultaneously, evaluation benchmarks progressed from static/simulated environments (Mind2Web, WebShop, WebArena) to more realistic online environments (WebLINX, Mind2Web-Live, WebVoyager). DeepShop fits into this evolution by addressing the gap in complexity within existing online benchmarks. While previous online benchmarks provided realistic environments, their tasks often remained simple. DeepShop pushes the envelope by introducing systematically diversified and complex queries, pushing web agents beyond simple navigation toward truly "deep research" shopping.
3.4. Differentiation Analysis
Compared to main methods in related work, DeepShop's core differences and innovations are:
- Complexity and Realism of Queries:
  - Existing Benchmarks: Often use simple, deterministic queries like "find an iPhone 15" or focus on general web tasks. They lack the multi-dimensional requirements common in real shopping.
  - DeepShop: Introduces queries with systematically increased complexity, incorporating multiple product attributes (e.g., brand, color, size), search filters (e.g., minimum rating, free delivery), and sorting preferences (e.g., lowest price first). This makes the tasks much more akin to real-world user needs, requiring agents to perform "deep research."
- Systematic Query Evolution:
  - Existing Benchmarks: Queries are often hand-crafted or derived in simpler ways, which might lead to an uneven distribution of difficulty or domain coverage.
  - DeepShop: Employs a multi-stage query diversity evolution (across five product categories) and query complexity evolution (iteratively adding attributes, filters, and sorting) starting from real user seed queries, utilizing GPT-4o for generation. This systematic approach ensures a comprehensive and balanced testbed, categorized into easy, medium, and hard levels.
- Fine-grained and Holistic Evaluation:
  - Existing Benchmarks: Primarily focus on binary task success (did the agent complete the task or not?).
  - DeepShop: Proposes a novel evaluation framework that includes both fine-grained metrics (assessing success on product attributes, search filters, and sorting preferences individually) and a holistic success rate (requiring all specified conditions to be met). This allows for a more nuanced understanding of agent capabilities and precise diagnosis of failure modes. It uses GPT-4o for automated evaluation, validated with human agreement.
- Online Environment Focus with a Specific Shopping Domain:
  - Existing Benchmarks: While some are online, they might focus on general web tasks or lack the specific nuances of e-commerce platforms. Offline benchmarks, though controlled, lack realism.
  - DeepShop: Specifically targets online shopping scenarios on real, dynamic websites, providing a realistic testbed for a critical application domain for web agents.

In essence, DeepShop moves beyond simply evaluating whether an agent can navigate a website to evaluating whether it can effectively understand and fulfill complex, multi-faceted user intents in a dynamic, real-world e-commerce setting, thereby providing a more rigorous challenge for the next generation of deep research shopping agents.
4. Methodology
The DeepShop benchmark is designed to evaluate web agents in realistic and complex online shopping environments. The methodology involves formulating the task, curating seed data, evolving query diversity and complexity, and establishing a comprehensive evaluation framework.
4.1. Principles
The core idea behind DeepShop is that real-world online shopping queries are inherently complex, involving multiple product characteristics, specific filtering criteria, and desired sorting orders. Existing benchmarks often simplify these queries, leading to an overestimation of web agent capabilities. DeepShop aims to bridge this gap by systematically generating diverse and complex queries, starting from real user intents, and providing a granular evaluation to truly assess agent performance in "deep research" shopping scenarios. The theoretical basis is rooted in simulating human-like complex information seeking behavior on e-commerce platforms.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Task Formulation
The online web shopping task is formulated as a partially observable Markov decision process (POMDP). This mathematical framework is used because a web agent operating on a live website does not have full knowledge of the website's underlying state (e.g., full server-side data, all hidden elements), but rather receives partial observations (e.g., screenshots, visible DOM elements).
The POMDP is defined by a tuple $(\mathcal{S}, \mathcal{O}, \mathcal{A}, \mathcal{T})$:
- $\mathcal{S}$: Represents the entire state space of the web environment. This includes all possible configurations of a website, its backend data, and user session information.
- $\mathcal{O}$: Represents the observation space. At any given time, the web agent receives an observation $o \in \mathcal{O}$, which is a partial reflection of the true underlying state $s \in \mathcal{S}$. For a web agent, this could be a screenshot of the visible viewport, the DOM tree (Document Object Model, a programming interface for HTML and XML documents), or extracted text content.
- $\mathcal{A}$: Represents the action space. These are the possible actions the web agent can take, such as clicking a link, typing text into a search box, scrolling the page, or selecting an item from a dropdown.
- $\mathcal{T}$: Represents the transition function. This function describes how the environment's state changes when an agent takes an action. Specifically, if the agent takes an action $a_t$ from state $s_t$, the environment transitions to a new state $s_{t+1} = \mathcal{T}(s_t, a_t)$. After this transition, the agent receives an updated observation $o_{t+1}$.

The agent's goal is, given a user query $q$, to navigate this POMDP to find the desired product or information. A minimal sketch of this observe-act loop follows.
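To make the POMDP framing concrete, here is a minimal Python sketch of the observe-act loop such an agent runs; the `Observation`/`Action` structures, the `env`/`agent` interfaces, and the step cap are illustrative assumptions rather than an interface defined by the paper.

```python
from dataclasses import dataclass

@dataclass
class Observation:
    screenshot_png: bytes   # partial view of the true page state
    dom_text: str           # extracted DOM / visible text

@dataclass
class Action:
    kind: str               # e.g., "click", "type", "scroll", "answer"
    target: str = ""        # element selector or free-text argument

def run_episode(env, agent, query: str, max_steps: int = 15) -> str:
    """Hypothetical observe-act loop over the POMDP (S, O, A, T).

    The agent never sees the true state s in S; it only receives
    observations o in O and issues actions a in A, after which the
    (live) website transitions according to T.
    """
    obs = env.reset(query)              # initial observation o_0
    for _ in range(max_steps):
        action = agent.act(query, obs)  # choose a_t from o_t (and history)
        if action.kind == "answer":     # agent decides it has found the product
            return action.target
        obs = env.step(action)          # environment applies T, returns o_{t+1}
    return agent.final_answer(query, obs)
```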
4.2.2. Seed Data Curation
To ensure realism and relevance, DeepShop starts with a collection of real user shopping queries, referred to as seed data.
- Source: 50 web shopping queries are manually selected from two existing real-world benchmarks: Mind2Web-Live [34] and WebVoyager [11]. These benchmarks are chosen because they involve real-time web environments.
- Categorization: The selected queries are manually categorized into five representative shopping domains:
  - Books: 4 queries related to physical books, eBooks, and audiobooks.
  - Electronics: 14 queries for smartphones, laptops, headphones, etc.
  - Home: 20 queries for household items, furniture, appliances.
  - Fashion: 5 queries for apparel, footwear, and accessories.
  - Sports: 7 queries for fitness equipment, sportswear.

These 50 seed queries form the initial basis for generating a more diverse and complex dataset.
4.2.3. Shopping Query Diversity Evolution
To address the lack of fine-grained product category diversity in existing datasets, DeepShop generates new queries based on the seed queries and a randomly selected product category. This process is called diversity evolution.
The process is defined by the following formula:
$
q_i^{*} = \mathrm{Diversity}(q_i, d)
$
Where:
- $q_i^{*}$: Represents the new, diversified query that is generated.
- $\mathrm{Diversity}(\cdot)$: This is a function implemented by prompting GPT-4o models. GPT-4o is an advanced large language model that can understand and generate human-like text, making it suitable for rewriting or creating new queries based on given instructions. GPT-4o is instructed to create a new prompt (query) for the web shopping domain, tailored for a different specific product in a given Amazon product field, while maintaining similar length and complexity to the original.
- $q_i$: Represents an original seed query selected from the initial dataset.
- $d$: Denotes a randomly selected product category from the five defined domains.

The goal is to take an existing query (e.g., an electronics query) and rephrase it for a different domain (e.g., books), ensuring the agent can generalize across varied user shopping intents. The final web shopping diversity evolution dataset is constructed by combining the original seed dataset with all the newly generated queries, where $N$ is the number of seed queries.
4.2.4. Shopping Query Complexity Evolution
To simulate increasingly complex real-world shopping scenarios, DeepShop iteratively enhances the complexity of the diversified queries. This is done by adding specific requirements related to product attributes, search filters, and sorting preferences.
The complexity evolution is an iterative process. In each iteration $t$, one of three strategies is randomly selected to evolve the query $q_{i,t}$ from the previous step.
The process is defined by the following formula:
$
q_{i,t+1} = \mathrm{Complexity}(q_{i,t}, c)
$
Where:
- $q_{i,t+1}$: Represents the $i$-th query after the $(t+1)$-th complexity evolution step.
- $\mathrm{Complexity}(\cdot)$: This is a function implemented by prompting GPT-4o. GPT-4o is instructed to rewrite a given prompt into a more complex version by adding specific details, while keeping it reasonable and understandable.
- $q_{i,t}$: Denotes the $i$-th query at the $t$-th complexity evolution step.
- $i$: The index for queries starting from the diversity dataset.
- $t$: The current iteration number, where $T$ is the total number of rounds of complexity evolution. In DeepShop, $T = 5$ rounds are applied (queries with 0-1, 2-3, and 4-5 evolution steps are labeled easy, medium, and hard, respectively).
- $q_{i,0}$: Denotes the $i$-th query from the diversity dataset, serving as the starting point for complexity evolution.
- $c$: The randomly selected strategy for increasing complexity in the current iteration. The three strategies are:
  - Attribute evolution: This strategy enhances the query by incorporating concrete product attributes. Examples include brand, model, specific price range, color, size, weight, or unique product features. GPT-4o is prompted to specify concrete values for one product attribute based on its knowledge, ensuring these details are directly incorporated into the query.
  - Filter evolution: This strategy enhances the query by adding specific search filters commonly available on e-commerce platforms. Examples include constraints like minimum customer rating (e.g., 4.5 stars), minimum number of reviews (e.g., 500+), shipping options (e.g., free delivery), release timeframe (e.g., new arrivals in the past 30 days), return policies, or warranty information. GPT-4o is prompted to specify concrete values for these constraints.
  - Sorting evolution: This strategy enhances the query by appending a sorting preference. This directs the system to find top-ranked products according to criteria such as lowest price, highest user rating, newest arrival, or best seller ranking. GPT-4o is prompted to integrate a specific sorting requirement based on one of these criteria.

By iteratively applying these strategies over $T$ rounds, the method mimics the natural evolution of user queries, generating a hierarchical set of increasingly complex queries. Starting from the diverse queries in the diversity dataset, this process results in a total of 600 queries. A small code sketch of both evolution stages is given after the running example below.
The figure below (Figure 2 from the original paper) illustrates examples of diversity and complexity evolution in DeepShop:
(Figure: running examples of diversity and complexity evolution in DeepShop, covering the three complexity evolution types of attribute, filter, and sorting evolution, and illustrating the gradual transformation of user queries.)
As shown, a seed query like "Find a book on web scraping" can be diversified to "Find a book on python programming" (within the Books category). This diversified query then undergoes complexity evolution:
- Attribute Evolution: adding "by author John Smith, published after 2020."
- Filter Evolution: adding "with 4+ star ratings and free shipping."
- Sorting Evolution: adding "sort by lowest price." These evolutions create queries of increasing difficulty.
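To illustrate the two evolution stages, the sketch below wires a GPT-4o rewrite call into both the Diversity and Complexity functions; the prompt wording, the `CATEGORIES` list, and the strategy instructions are paraphrased assumptions based on the description above, not the paper's released prompts.

```python
import random
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
CATEGORIES = ["Books", "Electronics", "Home", "Fashion", "Sports"]
STRATEGIES = {
    "attribute": "Add one concrete product attribute (e.g., brand, color, size, price range).",
    "filter": "Add one concrete search filter (e.g., minimum rating, free delivery, recent release).",
    "sorting": "Append one sorting preference (e.g., lowest price, highest rating, newest arrival).",
}

def rewrite(instruction: str, query: str) -> str:
    """Single GPT-4o rewrite call shared by both evolution stages."""
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": f"{instruction}\n\nQuery: {query}"}],
    )
    return resp.choices[0].message.content.strip()

def diversity_evolve(seed_query: str) -> str:
    """q_i* = Diversity(q_i, d): retarget the query to a random product category."""
    d = random.choice(CATEGORIES)
    return rewrite(
        f"Rewrite this web shopping query for a different specific product in the "
        f"Amazon '{d}' category, keeping similar length and complexity.", seed_query)

def complexity_evolve(query: str, rounds: int = 5) -> list[str]:
    """q_{i,t+1} = Complexity(q_{i,t}, c): apply one random strategy per round.
    Steps 0-1, 2-3, and 4-5 correspond to easy, medium, and hard queries."""
    history = [query]
    for _ in range(rounds):
        c = random.choice(list(STRATEGIES))
        history.append(rewrite(
            f"{STRATEGIES[c]} Rewrite the query to be more complex but still "
            f"reasonable and understandable.", history[-1]))
    return history
```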
4.2.5. Dataset Analysis
The paper conducts analysis on the generated dataset to confirm its characteristics.
4.2.5.1. Analysis of Query Diversity Evolution
Existing benchmarks often have skewed distributions across product categories, introducing bias. To mitigate this, DeepShop constructs a balanced subset of 150 queries from its 600-query pool, systematically selecting 30 queries from each of the five major categories: Books, Electronics, Home, Fashion, and Sports. This balanced distribution (manually verified for quality and availability on corresponding websites) ensures a controlled and equitable testbed for evaluating cross-domain generalization, allowing clearer assessment of an agent's ability to generalize beyond narrow domain specialization.
The following figure (Figure 3 from the original paper) illustrates the product category distribution:
(Figure: distribution of product categories after query diversity evolution. The chart compares the number of queries in the seed data and in DeepShop across the Books, Electronics, Home, Fashion, and Sports categories; DeepShop contains 30 queries in each category, reflecting the improved balance and diversity over the seed data.)
The graph clearly shows that after query diversity evolution, DeepShop has an equal number of queries (30) for each of the five categories, unlike the seed data which had an imbalanced distribution (e.g., 20 for Home, 4 for Books).
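A balanced evaluation subset like the one described above can be drawn with a few lines of Python; the sampling below is a generic sketch (the paper additionally verifies each selected query manually for quality and availability on the corresponding websites).

```python
import random
from collections import defaultdict

def balanced_subset(queries: list[dict], per_category: int = 30, seed: int = 0) -> list[dict]:
    """Sample an equal number of queries from each product category."""
    random.seed(seed)
    by_category = defaultdict(list)
    for q in queries:
        by_category[q["category"]].append(q)
    subset = []
    for items in by_category.values():
        subset.extend(random.sample(items, per_category))  # 5 categories x 30 = 150
    return subset
```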
4.2.5.2. Analysis of Query Complexity Evolution
A fine-grained analysis is performed on how query complexity evolves across the three dimensions (product attributes, search filters, sorting preferences).
The following figure (Figure 4 from the original paper) presents this analysis:
(Figure 4: the three dimensions of query complexity evolution, product attributes, search filters, and sorting preferences, plotted as counts against the number of evolution iterations, comparing the iterative evolution process with the DeepShop subsets at each difficulty level.)
- Product Attributes (Figure 4a): The average number of product attributes per query steadily increases across iterations. The final DeepShop dataset has an average of 0.52 more attributes than the seed data, and the hard subset has an additional 0.66 attributes on average.
- Search Filters (Figure 4b): The average number of search filters per query consistently increases. DeepShop queries include, on average, 1.95 more filters than seed queries, with the hard subset showing an increase of 2.88 filters on average.
- Sorting Preferences (Figure 4c): The average number of sorting preferences per query also shows an upward trend. The final average exceeds the seed data by 0.37, and the hard subset contains an additional 0.66 sorting preferences on average.

This analysis confirms that the complexity evolution strategy successfully creates increasingly complex queries.
4.2.6. Evaluation Metrics
DeepShop uses a two-stage evaluation protocol: fine-grained evaluation and holistic task success evaluation. Given the challenges of human evaluation, GPT-4o is primarily used for automatic assessment, following previous work [11, 50].
4.2.6.1. Fine-grained Evaluation
- Decomposition: Each complex query is first decomposed into its constituent parts: a product attribute subquery, a search filter subquery, and a sorting preference subquery.
- GPT-4o Assessment: For each web agent trajectory (a sequence of actions and observations, typically screenshots), GPT-4o is prompted to assess whether the final results align with the requirements specified in each subquery. The prompt includes the user subquery, screenshots (up to 15), and the agent's final answer (textual response).
- Binary Decision: GPT-4o provides a binary decision ("Success" or "Not Success") for each subquery.
- Purpose: This fine-grained evaluation captures partial success cases and helps diagnose specific failure modes more precisely than a simple holistic pass/fail. If a subquery is not present in the original query (e.g., no explicit sorting preference), its evaluation is skipped and not included in the calculation for that specific aspect. A minimal sketch of this judging step follows.
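The fine-grained, GPT-4o-based judging step could be implemented roughly as below; the prompt text is abbreviated and the helper names (`judge_subquery`, `encode_image`) are illustrative assumptions, mirroring the structure of the evaluation prompt shown later in Appendix A.1.

```python
import base64
from openai import OpenAI

client = OpenAI()

def encode_image(path: str) -> dict:
    """Package a screenshot as an image content part for the chat API."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}

def judge_subquery(subquery: str, final_answer: str, screenshot_paths: list[str]) -> bool:
    """Binary fine-grained verdict: does the trajectory satisfy this subquery?"""
    content = [{"type": "text", "text": (
        "You are evaluating a web shopping agent. Decide whether the final result "
        "satisfies the requirement below. Answer 'SUCCESS' or 'NOT SUCCESS'.\n"
        f"Requirement: {subquery}\nAgent's final answer: {final_answer}"
    )}]
    content += [encode_image(p) for p in screenshot_paths[-15:]]  # up to 15 screenshots
    resp = client.chat.completions.create(
        model="gpt-4o", messages=[{"role": "user", "content": content}]
    )
    return "NOT SUCCESS" not in resp.choices[0].message.content.upper()
```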
4.2.6.2. Holistic Evaluation
- Aggregation: The holistic evaluation calculates the overall task success by aggregating the outcomes of the fine-grained evaluation for product attributes, search filters, and sorting preferences.
- Rule-based Checking: For each dimension, the system checks whether the original query explicitly specified a requirement.
  - If a particular aspect (e.g., attribute, filter, or sorting) is present in the query, its corresponding success score from the fine-grained evaluation is considered.
  - If an aspect is not present in the query, it is treated as automatically satisfied for the purpose of holistic evaluation (i.e., the agent is not penalized for not fulfilling a non-existent requirement).
- Overall Success Condition: The final holistic task success is determined as "Success" only if all required components (attributes, filters, and sorting preferences that were explicitly part of the query) are successfully satisfied. A single failure in any explicitly required dimension leads to an overall "Not Success."
- Deep Research Systems: For deep research systems (like Gemini or OpenAI Deep Research), intermediate execution screenshots are typically unavailable. Therefore, both fine-grained and holistic evaluations for these systems are conducted manually by human evaluators, who verify the returned links against the query requirements.
4.2.6.3. Agreement Rate between LLM Evaluation and Human Judge
To ensure the reliability of the GPT-4o-based evaluation, an agreement rate (also known as inter-annotator agreement) is calculated between human judgments and GPT-4o judgments.
- Procedure: Human annotators are shown the full interaction trace of an agent, including screenshots and actions, and are asked to judge whether the agent successfully fulfilled the user's request for each sub-goal and the overall task.
- Results: The agreement rates between human judges and GPT-4o judges are high across all four dimensions: product attributes, search filters, sorting preferences, and overall task success (the exact percentages are reported in the paper). These high agreement rates indicate the effectiveness and reliability of using GPT-4o for evaluation in this context.
The following figure (Figure 6 from the original paper, part of Appendix A.1) shows the general structure of the GPT-4o prompt used for fine-grained evaluation. This prompt guides GPT-4o to assess task completion based on subqueries, screenshots, and the agent's final answer. It explicitly states that GPT-4o should not interact with web pages, make assumptions, or rely solely on the textual response if it contradicts the screenshot.
[System prompt]
As an evaluator, you will be presented with three primary components to assist you in your role:
1. Web Task Instruction: A clear and precise natural language directive that specifies an online shopping activity to be executed. The instruction may involve locating products that meet certain attribute requirements (e.g., color, size, brand), applying specific search filters (e.g., price range, customer ratings, availability), or fulfilling user-defined sorting preferences (e.g., lowest price, newest arrivals, best sellers). Tasks may also include verifying product details, comparing offers, or checking for shipping and return policies, depending on the scenario.
2. Result Screenshots: This is a visual representation of the screen showing the result or an intermediate state of performing a web task. It serves as visual proof of the actions taken in response to the instruction.
3. Result Response: This is a textual response obtained after the execution of the web task. It serves as the textual result in response to the instruction.
-- You DO NOT NEED to interact with web pages or perform actions such as conducting searches on websites.
-- You SHOULD NOT make assumptions based on information not presented in the screenshot when comparing it to the instructions.
-- Your primary responsibility is to conduct a thorough assessment of the web task instruction against the outcome depicted in the screenshot and in the response, evaluating whether the actions taken align with the given instructions.
-- NOTE that the instruction may involve more than one sub-task; failing to complete any required sub-task (for example, locating the right product but not applying the requested filter or providing the requested summary) should be considered unsuccessful.
-- NOTE that the screenshot is authentic, but the response provided by the LLM is generated at the end of web browsing, and there may be discrepancies between the text and the screenshots.
-- Note the difference: 1) Result response may contradict the screenshot, then the content of the screenshot prevails, 2) The content in the Result response is not mentioned on the screenshot, choose to believe the content.
You should elaborate on how you arrived at your final evaluation and then provide a definitive verdict on whether the task has been successfully accomplished, either as 'SUCCESS' or 'NOT SUCCESS'.
[User prompt] TASK: {subquery}
Result Response: {answer}
15 screenshots at the end: {screenshots}
You will be presented with a web shopping task.
For each task, you will receive three subqueries, along with the web agent's action history and corresponding screenshots. Your goal is to evaluate the agent's performance across three specific dimensions: product attributes, search filters, and sorting preferences. Please note: if a subquery is labeled as None, you do not need to assess that particular aspect. Definitions of the three subqueries are as follows:
Product attributes, expressing detailed user intent (e.g., brand, color, or size).
Search filters, representing categorical or numerical constraints commonly used on e-commerce platforms.
Sorting preferences, indicating desired result orderings, such as price or popularity.
Task: {Query}
Product Attribute Requirement: {Subquery1}
Search Filter Requirement: {Subquery2}
Sorting Preference Requirement: {Subquery3}
Agent Action History: {Action}
This prompt is crucial for GPT-4o to act as a reliable evaluator by providing clear instructions on how to interpret task requirements, agent actions, and visual evidence (screenshots) to determine success or failure for each fine-grained aspect.
5. Experimental Setup
5.1. Datasets
The primary dataset used in the experiments is the DeepShop benchmark itself.
- Source: The DeepShop benchmark is constructed by taking seed queries from Mind2Web-Live [34] and WebVoyager [11], then applying query diversity and complexity evolution processes.
- Scale: The full DeepShop benchmark comprises 600 queries. For evaluation, a balanced subset of 150 queries is used, systematically selected with 30 queries from each of the five major categories.
- Characteristics:
  - Diversity: Covers five major e-commerce categories: Books, Electronics, Home, Fashion, and Sports.
  - Complexity: Queries are categorized into easy (0-1 complexity evolution steps), medium (2-3 steps), and hard (4-5 steps) based on the number of product attributes, search filters, and sorting preferences introduced during the evolution process.
  - Realism: Derived from real user queries and designed to be executed on live, real-time web environments (specifically Amazon.com, as hinted by figures and context).
  - Features: Each instance in the dataset includes the following fields (a hypothetical example instance is sketched at the end of this subsection):
    - id: A unique identifier for the example.
    - ques: The natural language shopping query.
    - web_name and web: The e-commerce platform name and its identifier (e.g., "Amazon" and its URL).
    - attribute, filter, sort: Subqueries describing the specific product attribute, search filter, and sorting preferences.
    - category: The product category information (e.g., Books, Electronics).
    - difficulty: The task difficulty level (e.g., easy, medium, hard).
- Domain: The United States region is specified, and the language is English.
- Intended Use: Evaluation of web agents in online shopping tasks through complex query understanding and UI interaction.
- Limitations (of the dataset itself): Currently focuses on desktop web interfaces, lacks support for dynamic user intent changes or multi-turn interactions, does not fully capture cognitive aspects of shopping behavior, and does not cover mobile layouts or multilingual queries.
The figure below (Figure 2 from the original paper) shows examples of data samples and their evolution, illustrating how a simple seed query transforms into more complex versions with specific attributes, filters, and sorting preferences:
(Figure: running examples of diversity and complexity evolution in DeepShop, covering the three complexity evolution types of attribute, filter, and sorting evolution, and illustrating the gradual transformation of user queries.)
For example, a seed query "Find a book on web scraping" is diversified to "Find a book on python programming". This diversified query can then be evolved to "Find a book on python programming by author John Smith, published after 2020 with 4+ star ratings and free shipping, sort by lowest price."
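For concreteness, a single DeepShop-style instance might look like the following Python dictionary; the field names follow the feature list above, while the values themselves are fabricated for illustration only.

```python
example_instance = {
    "id": "fashion_hard_012",                     # hypothetical identifier
    "ques": ("Find a men's waterproof hiking jacket in size L under $150 "
             "with 4+ star ratings and free delivery, sorted by lowest price."),
    "web_name": "Amazon",
    "web": "https://www.amazon.com/",
    "attribute": "men's waterproof hiking jacket, size L, under $150",
    "filter": "customer rating 4 stars & up; free delivery",
    "sort": "lowest price first",
    "category": "Fashion",
    "difficulty": "hard",
}
```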
5.2. Evaluation Metrics
The paper uses both fine-grained and holistic evaluation metrics. These metrics are success rates, indicating the percentage of tasks (or sub-tasks) that an agent successfully completes according to the specified criteria.
- Product Attribute Success Rate:
  - Conceptual Definition: This metric quantifies the agent's ability to correctly identify and match products that satisfy specific product attributes requested in the query. It focuses on whether details like brand, model, color, size, or a price range for the product itself were correctly handled.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Product Attribute Success Rate} = \frac{\text{Number of queries where product attributes are correctly satisfied}}{\text{Total number of queries with explicit product attributes}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where product attributes are correctly satisfied: The count of tasks where the agent successfully found a product meeting all specified product attribute requirements, as judged by GPT-4o (or human evaluators for deep research systems).
    - Total number of queries with explicit product attributes: The total count of tasks in the benchmark that included at least one specific product attribute requirement. Queries without explicit attributes are excluded from the denominator for this metric.
- Search Filter Success Rate:
  - Conceptual Definition: This metric measures the agent's proficiency in applying specified search filters on the e-commerce platform. It assesses whether the agent correctly interacted with UI elements to narrow down results based on criteria like minimum customer rating, shipping options (e.g., free delivery), or specific timeframes.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Search Filter Success Rate} = \frac{\text{Number of queries where search filters are correctly applied}}{\text{Total number of queries with explicit search filters}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where search filters are correctly applied: The count of tasks where the agent successfully applied all specified search filter requirements.
    - Total number of queries with explicit search filters: The total count of tasks in the benchmark that included at least one specific search filter requirement.
- Sorting Preference Success Rate:
  - Conceptual Definition: This metric evaluates the agent's capacity to correctly apply sorting preferences to the search results. It determines whether the agent arranged the listed products according to criteria such as lowest price, highest user rating, or newest arrival.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Sorting Preference Success Rate} = \frac{\text{Number of queries where sorting preferences are correctly applied}}{\text{Total number of queries with explicit sorting preferences}} \times 100\% $
  - Symbol Explanation:
    - Number of queries where sorting preferences are correctly applied: The count of tasks where the agent successfully applied all specified sorting preference requirements.
    - Total number of queries with explicit sorting preferences: The total count of tasks in the benchmark that included at least one specific sorting preference requirement.
- Task Success Rate (Holistic):
  - Conceptual Definition: This is the overall success rate, indicating whether the agent fully completed the entire shopping task by satisfying all explicitly stated requirements, including product attributes, search filters, and sorting preferences. A task is considered successful only if every required component is met.
  - Mathematical Formula: Not explicitly provided in the paper as a formal equation, but it represents a percentage: $ \text{Task Success Rate} = \frac{\text{Number of tasks where all explicit requirements are met}}{\text{Total number of tasks in the benchmark}} \times 100\% $
  - Symbol Explanation:
    - Number of tasks where all explicit requirements are met: The count of tasks where the agent achieved "Success" in the holistic evaluation, meaning all product attributes, search filters, and sorting preferences (if explicitly stated in the query) were correctly satisfied.
    - Total number of tasks in the benchmark: The total number of queries in the DeepShop evaluation set (150 queries in the balanced subset).

A small sketch of how these success rates could be computed from per-query fine-grained results is given after this list.
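Under the definitions above, the fine-grained and holistic success rates can be computed from per-query results as sketched below; `results` is an assumed list of per-query records combining the dataset fields with the evaluation verdicts, not a format released by the paper.

```python
def success_rates(results: list[dict]) -> dict:
    """Compute the four DeepShop success rates (in %).

    Each record is assumed to contain:
      'required': {'attribute': bool, 'filter': bool, 'sort': bool}  # stated in query?
      'passed':   {'attribute': bool, 'filter': bool, 'sort': bool}  # judged verdicts
    """
    rates = {}
    for aspect in ("attribute", "filter", "sort"):
        relevant = [r for r in results if r["required"][aspect]]
        rates[aspect] = 100.0 * sum(r["passed"][aspect] for r in relevant) / max(len(relevant), 1)
    # Holistic rule: aspects absent from the query count as satisfied.
    holistic = [
        all(r["passed"][a] for a in ("attribute", "filter", "sort") if r["required"][a])
        for r in results
    ]
    rates["task_success"] = 100.0 * sum(holistic) / max(len(results), 1)
    return rates
```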
5.3. Baselines
The paper evaluates a range of approaches, categorized into three types: Simple RAG, Web agents, and Deep research systems.
- Simple RAG (Retrieval-Augmented Generation):
  - Model: GPT-4o + Google Search.
  - Mechanism: This baseline simulates a basic RAG approach. The user query is submitted to Google Search, the top-ranked webpage from the search results is retrieved (using the Serper API for programmatic access), and GPT-4o (version 2024-08-06) then generates a final response based on a screenshot of this retrieved webpage.
  - Representativeness: This represents a simple, non-interactive approach that relies purely on search and static content analysis, highlighting the limitations of RAG when dynamic web interaction is required.
- Web agents: All web agents use GPT-4o (version 2024-08-06) as their underlying large language model. They differ in their perception mechanisms and interaction strategies.
  - Agent-E [1]:
    - Mechanism: An HTML-based agent that employs a hierarchical planner-actor framework. It interprets instructions and navigates web interfaces using DOM trees (Document Object Model, a programming interface for HTML and XML documents). It is augmented with flexible DOM tree distillation and a denoising mechanism to improve decision accuracy. It utilizes full-page screenshots for perception.
    - Representativeness: Represents the capabilities of text-based, DOM-aware agents.
  - SeeAct [60]:
    - Mechanism: A vision-based agent that leverages the multimodal capabilities of LLMs. It integrates visual perception (using full-page screenshots) with structured web-based interactions.
    - Representativeness: Represents agents that primarily rely on visual input interpretation from LLMs.
  - WebVoyager [11]:
    - Mechanism: Also a multimodal reasoning agent. It introduces a set-of-mark prompting scheme, where the agent first generates intermediate thoughts before selecting final actions. It operates on the visible viewport only (not full-page screenshots).
    - Representativeness: Represents advanced multimodal agents with explicit reasoning steps.
  - Browser Use [29]:
    - Mechanism: An open-source web agent framework that combines visual understanding (operating on the visible viewport only) with HTML structure parsing to support robust web navigation and interaction.
    - Representativeness: Represents hybrid agents that leverage both visual and structural information for more robust interaction.
- Deep research systems: These are commercial systems with advanced reasoning capabilities. For these systems, explicit site constraints are included in the prompt to guide the search process, as they cannot be strictly constrained to specific websites in the same way open-source agents can.
  - Gemini Deep Research [8]:
    - Model: Gemini 2.0 Flash model with deep research capabilities, integrated into Google's Gemini Advanced platform.
    - Mechanism: An AI assistant that decomposes queries, performs extensive searches, and generates cited multi-step reports.
    - Representativeness: Represents Google's state-of-the-art commercial deep research LLM product.
  - OpenAI Deep Research [33]:
    - Model: o3 model (likely an internal designation for an advanced GPT model) with deep research enabled, powered by OpenAI's reasoning models.
    - Mechanism: An agentic system that autonomously browses, analyzes, and synthesizes web information into citation-rich outputs, emulating human research workflows.
    - Representativeness: Represents OpenAI's state-of-the-art commercial deep research LLM product.

All open-source agents (Agent-E, SeeAct, WebVoyager, Browser Use) are executed within real-time web environments (Playwright for Agent-E, SeeAct, and Browser Use; Selenium for WebVoyager). Each agent is limited to a maximum of 15 steps per task to control computation cost and prevent excessive exploration. A minimal sketch of such a step-capped live execution loop is given below.
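As a rough illustration of how an open-source agent might be driven in a live browser with the 15-step cap, the sketch below uses Playwright's synchronous Python API; the `agent` object and its `next_action` interface are assumptions standing in for any of the evaluated frameworks, not code from the paper.

```python
from playwright.sync_api import sync_playwright

MAX_STEPS = 15  # per-task cap used in the experiments

def run_task(agent, query: str, start_url: str = "https://www.amazon.com/") -> list[str]:
    """Execute one DeepShop query in a live browser, capturing a screenshot per step."""
    screenshots = []
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(start_url)
        for step in range(MAX_STEPS):
            path = f"step_{step}.png"
            page.screenshot(path=path)            # observation for the agent / evaluator
            screenshots.append(path)
            action = agent.next_action(query, page.content(), path)  # hypothetical interface
            if action["kind"] == "stop":
                break
            elif action["kind"] == "click":
                page.click(action["selector"])
            elif action["kind"] == "type":
                page.fill(action["selector"], action["text"])
                page.keyboard.press("Enter")
        browser.close()
    return screenshots
```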
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. RQ1: Performance Analysis of Web Agents
RQ1 addresses how Simple RAG methods, web agents, and deep research systems perform on the DeepShop benchmark across fine-grained and holistic evaluation metrics.
The following are the results from Table 2 of the original paper:
| Method | Product attribute | Search filter | Sorting preference | Task success |
| --- | --- | --- | --- | --- |
| Simple RAG | | | | |
| GPT-4o + Google Search | 7.33 | 5.97 | 4.55 | 7.33 |
| Web agents | | | | |
| Agent-E | 12.67 | 9.70 | 3.41 | 6.67 |
| SeeAct | 52.00 | 22.39 | 20.45 | 10.67 |
| WebVoyager | 40.67 | 38.00 | 23.86 | 16.00 |
| Browser Use | 36.00 | 34.33 | 30.68 | 32.00 |
| Deep research systems | | | | |
| Gemini Deep Research | 53.33 | 44.00 | 52.94 | 30.00 |
| OpenAI Deep Research | 60.00 | 46.15 | 58.82 | 30.00 |
Observations:
- Simple RAG struggles significantly: The GPT-4o + Google Search method performs very poorly across all metrics, with Task success at just 7.33%. It particularly struggles with Search filters (5.97%) and Sorting preferences (4.55%). This is expected because RAG fundamentally lacks the ability to interact dynamically with website elements (like clicking buttons to apply filters or change sorting orders). It can only retrieve information and generate text based on static content. This clearly demonstrates that DeepShop queries cannot be solved by retrieval alone.
- Web agents outperform RAG but face challenges with fine-grained requirements:
  - All web agents show better performance than Simple RAG, indicating the necessity of web interaction.
  - There is a progressive gain in Task success from HTML-based Agent-E (6.67%) to vision-based SeeAct (10.67%) and WebVoyager (16.00%), culminating in Browser Use (32.00%). Browser Use, which integrates both HTML and visual inputs, achieves the best performance among web agents.
  - However, even the best web agent (Browser Use) achieves only 32.00% Task success, highlighting the difficulty of simultaneously satisfying all three fine-grained requirements.
  - Different web agents excel in different fine-grained aspects: SeeAct leads in Product attribute (52.00%), WebVoyager in Search filters (38.00%), and Browser Use in Sorting preferences (30.68%). This suggests that no single web agent approach is uniformly superior across all sub-tasks.
- Deep research systems show enhanced fine-grained performance but limited overall success:
  - Both Gemini Deep Research (30.00%) and OpenAI Deep Research (30.00%) achieve similar Task success rates, which are comparable to or slightly lower than the best web agent (Browser Use).
  - However, they significantly excel in Product attributes (53.33% and 60.00%, respectively) and particularly in Sorting preferences (52.94% and 58.82%, respectively), often outperforming web agents in these aspects. This points to their stronger reasoning capabilities in interpreting and fulfilling such explicit instructions.
  - They still struggle with Search filters (44.00% and 46.15%), though they do better than most web agents. The paper suggests this is because many filters require deep exploration and confirmation on product detail pages, which these systems might not handle perfectly.
  - Despite their strong fine-grained performance in some areas, their holistic task success rates remain relatively low (30%), underscoring the immense challenge DeepShop poses: an agent must succeed in all specified aspects simultaneously.

In summary, the results validate DeepShop as a challenging benchmark. RAG methods are insufficient, web agents make progress through interaction but struggle with the combined complexity, and even sophisticated deep research systems face significant hurdles in achieving high holistic success rates, particularly with search filters and when all requirements must be met concurrently.
6.1.2. RQ2: Performance across Different Product Categories
RQ2 investigates how existing methods perform across different product categories (Books, Electronics, Home, Fashion, and Sports) in online shopping tasks.
The following figure (Figure 5 from the original paper, part a) shows the performance across different product categories:
Analysis of Figure 5(a) - Performance across different product categories:
- Simple RAG: Shows variable performance, doing relatively well in Home but dropping to 0% success in Fashion and Sports. This suggests that Home products may have richer, more easily retrievable textual descriptions via Google Search, while Fashion and Sports often rely on visual cues (e.g., specific styles, colors) that are harder for RAG to capture without active web interaction.
- Agent-E (HTML-based): Consistently underperforms across categories, and is particularly weak in Sports. Its reliance on HTML without strong visual processing limits its effectiveness in categories where visual elements are crucial.
- Vision-based agents (SeeAct, WebVoyager): Generally improve performance across domains compared to Agent-E and Simple RAG, demonstrating the value of visual processing.
- Browser Use (hybrid): Achieves the best cross-domain results among web agents by combining HTML and visual inputs, and shows more balanced performance across categories.
- Deep research systems (Gemini, OpenAI): Exhibit relatively stable trends across categories, outperforming most web agents. However, they face significant challenges in the Fashion and Sports categories: Gemini scores 0% in Sports, and OpenAI fails entirely in both Fashion and Sports. This highlights a critical need for robust multimodal reasoning to handle visually driven product categories effectively, even for advanced deep research systems.

The varied performance across categories underscores that different types of agents have strengths and weaknesses depending on the nature of the product domain, especially concerning the importance of visual information versus structured text or DOM elements.
6.1.3. RQ3: Performance across Query Complexity Evolution
RQ3 examines how the performance of web agents varies across different levels of query complexity, from seed queries to evolved complex queries with multiple attributes, filters, and sorting preferences.
The following figure (Figure 5 from the original paper, part b) shows the performance across query complexity evolution:
Analysis of Figure 5(b) - Performance across query complexity evolution:
- Clear negative correlation: There is a clear negative correlation between query complexity and agent performance across all methods. As tasks move from easy (0–1 complexity evolution steps) to medium (2–3 steps) and then hard (4–5 steps), success rates generally decline.
- Simple RAG: Achieves 16% on easy queries, drops to 6.00% on medium queries, and fails completely (0%) on hard tasks. This reinforces that Google Search alone cannot handle complex user needs with multi-faceted criteria.
- Web agents: Also exhibit sharp declines. The average accuracy for web agents falls from 28.5% on easy tasks to 17% on medium tasks, and drops a further 7 percentage points (to 10%) on hard tasks. While web agents are better than RAG, they remain significantly challenged by increasing query complexity.
- Deep research systems: Perform better than web agents on the hard subset. Even on hard tasks, OpenAI Deep Research achieves 20% success (and Gemini 18%), highlighting the importance of strong reasoning capabilities for handling complex instructions. Still, hard tasks remain very challenging even for these advanced systems, with a 20% success rate being relatively low.

The results clearly demonstrate that DeepShop successfully creates a gradient of difficulty. As queries become more complex by layering on attributes, filters, and sorting preferences, the ability of all evaluated systems to fulfill them drops considerably, indicating that current web agents and deep research systems have substantial room for improvement in handling real-world query complexity.
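For reference, the easy/medium/hard buckets used in this comparison are defined purely by the number of complexity-evolution steps applied to a seed query. A minimal sketch of that binning (a hypothetical helper, not DeepShop's code) is shown below.

```python
# Minimal sketch (hypothetical helper, not DeepShop's code): map the number of
# complexity-evolution steps applied to a seed query onto the easy / medium /
# hard labels described above.

def difficulty_level(num_evolution_steps: int) -> str:
    if num_evolution_steps <= 1:   # 0-1 evolutions
        return "easy"
    if num_evolution_steps <= 3:   # 2-3 evolutions
        return "medium"
    return "hard"                  # 4-5 evolutions

assert difficulty_level(0) == "easy"
assert difficulty_level(3) == "medium"
assert difficulty_level(5) == "hard"
```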
6.2. Error Analysis and Future Improvement Guidance
The paper conducts a detailed error analysis to identify primary failure modes, providing critical insights for future research.
6.2.1. Web Agents are limited by grounding ability
- Problem: Web agents struggle to accurately ground interface elements, i.e., to connect natural language instructions to specific UI elements. They fail to correctly identify interactive components such as buttons, sliders, and review sections.
- Examples: HTML-based agents may overlook visual details (e.g., product color, layout cues) crucial for decisions, because they focus on the DOM structure. Vision-based agents using set-of-mark prompts (a technique in which regions of interest are explicitly segmented and labeled with visual "marks") suffer from segmentation errors: interactive buttons are misclassified, and regions such as customer reviews remain unsegmented, preventing the use of rating filters. Small filtering and sorting widgets are often ignored.
- Future Work: Explore multimodal fusion techniques that combine HTML structure with visual context to enable stronger grounding (one possible fusion step is sketched after the figure discussion below).

The following figure (Figure 12 from the original paper) illustrates the limited grounding ability of web agents:
The image is a screenshot of an Amazon shopping page showing product information, user ratings, and price ranges for two green Xbox wireless controllers, illustrating the interface a web agent faces when handling complex queries in a real shopping scenario.
As shown, button 39 (related to user rating) was not properly segmented, preventing the agent from selecting a specific rating range. Buttons 31-37 and 41-44 were rendered too densely and overlapped, making interaction difficult. The sorting button on the right was incorrectly split into two buttons (16 and 17), which could confuse the agent.
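As a concrete illustration of the multimodal fusion direction mentioned above, the sketch below (an assumption, not the paper's method) matches DOM-derived element boxes against vision-model marks by overlap, so that controls like the unsegmented rating button (39) or the split sorting button (16/17) could still be grounded via the HTML side.

```python
# Minimal sketch (an assumption, not the paper's method): fuse HTML-derived
# element boxes with vision-model "marks" by overlap, so that controls the
# segmenter missed or split can still be grounded through the DOM.

from dataclasses import dataclass

@dataclass
class Box:
    x1: float; y1: float; x2: float; y2: float

def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a.x1, b.x1), max(a.y1, b.y1)
    ix2, iy2 = min(a.x2, b.x2), min(a.y2, b.y2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r.x2 - r.x1) * (r.y2 - r.y1)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ground_element(dom_box: Box, marks: dict, threshold: float = 0.5):
    """Return the id of the visual mark that best matches a DOM element,
    or None when no mark overlaps enough (caller falls back to the DOM box)."""
    best = max(marks.items(), key=lambda kv: iou(dom_box, kv[1]), default=None)
    if best is not None and iou(dom_box, best[1]) >= threshold:
        return best[0]
    return None
```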
6.2.2. Web Agents often lack state assessment and replanning capabilities
- Problem: Agents fail to dynamically reassess the current webpage state and reformulate their plan when initial attempts fail or conditions are not met.
- Examples:
  - Issuing overly specific search queries and, upon retrieval failure, not backtracking to reformulate broader alternatives.
  - Navigating to a product detail page, finding a requirement unmet (e.g., a specific warranty is not offered), and then continuing to scroll inefficiently on the current page instead of returning to the search results or exploring other options.
  - Repeating ineffective actions (e.g., clicking an unresponsive element multiple times) due to limited awareness of webpage state transitions.
- Future Work: Fine-tune agents in realistic web environments to enhance their ability to reason over search failures and adapt plans dynamically (a minimal replanning-loop sketch follows the figure discussion below).
The following figure (Figure 13 from the original paper) illustrates a web agent's failure to reassess and replan:
The image is Figure 13 from the paper, illustrating a failure case in which a web agent does not reassess its state or replan during shopping: through a series of click and scroll actions, it keeps exploring the current page rather than backtracking.
In this example, the agent enters a product detail page to verify a 1-year warranty. Upon realizing the requirement is unmet, it fails to reassess its state. Instead of returning to the search results page to look for other options, the agent continues to scroll within the current page, inefficiently attempting to locate an alternative product on the same page.
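The sketch below (hypothetical agent and browser interfaces, not the paper's implementation) shows the kind of control loop the authors call for: after each step the agent reassesses whether the current page can still serve the sub-goal, and backtracks or reformulates instead of scrolling indefinitely.

```python
# Minimal sketch (hypothetical interfaces, not the paper's agent): a control
# loop that reassesses page state after every action and backtracks instead
# of scrolling indefinitely when the current page cannot satisfy the sub-goal.

MAX_TOTAL_STEPS = 30
MAX_STEPS_PER_PAGE = 5

def run_task(agent, browser, goal) -> bool:
    steps_on_page = 0
    for _ in range(MAX_TOTAL_STEPS):
        observation = browser.observe()
        if agent.goal_satisfied(goal, observation):
            return True
        # State assessment: can this page still contribute to the goal?
        if not agent.page_is_promising(goal, observation) or steps_on_page >= MAX_STEPS_PER_PAGE:
            browser.go_back()                            # backtrack to the results page
            goal = agent.reformulate(goal, observation)  # e.g., broaden an over-specific query
            steps_on_page = 0
            continue
        browser.execute(agent.next_action(goal, observation))
        steps_on_page += 1
    return False
```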
6.2.3. Web Agents are constrained by a limited action space
- Problem: Web agents operate within a restricted set of browser actions, preventing them from interacting with dynamic UI components commonly found on shopping platforms.
- Examples: An agent fails to filter products within a specific price range because it cannot drag a price slider. Agents also struggle with dropdowns, sliders, and nested menus, which are essential for precise filtering and sorting.
- Future Work: Expand the agent's action repertoire with shopping-specific operations and deeper browser integration, allowing for more complex UI manipulations (one such extended action is sketched after the figure discussion below).
The following figure (Figure 14 from the original paper) illustrates the web agent's failure to apply the price filter during task execution:
The image shows two side-by-side webpage screenshots of using Amazon's price-filter feature: the agent clicks the "Go" button without the products being filtered successfully, and the annotated interactions and the resulting incorrect product listing reflect the shopping agent's failure at filtering.
The agent attempts to filter cameras within the $100–$300 price range. However, it is unable to interact with the dynamic price slider UI element. Instead, it clicks the adjacent "Go" button without adjusting the slider values, resulting in ineffective filtering. This highlights the limitation of a constrained action space.
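As an example of the kind of extended action the authors argue for, the sketch below uses Playwright's low-level mouse API to drag a slider handle before confirming the filter; the selectors are placeholders, not Amazon's real markup, and this is an illustrative assumption rather than a proposed standard action set.

```python
# Minimal sketch (assumes Playwright; selectors are placeholders, not Amazon's
# real markup): an extended, shopping-specific action -- dragging a price-slider
# handle -- that typical click/type-only action spaces lack.

from playwright.sync_api import Page

def drag_slider_handle(page: Page, handle_selector: str, dx: float) -> None:
    """Drag a slider handle horizontally by dx pixels using raw mouse events."""
    handle = page.locator(handle_selector)
    box = handle.bounding_box()
    if box is None:
        raise RuntimeError("slider handle not visible")
    start_x = box["x"] + box["width"] / 2
    start_y = box["y"] + box["height"] / 2
    page.mouse.move(start_x, start_y)
    page.mouse.down()
    page.mouse.move(start_x + dx, start_y, steps=10)  # smooth drag
    page.mouse.up()

# Usage (hypothetical selectors): move the lower-price handle first, and only
# then click "Go", instead of clicking "Go" with the slider untouched.
# drag_slider_handle(page, "#priceSliderLowerHandle", dx=40)
# page.click("input[aria-label='Go']")
```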
6.2.4. Web Agents lack the ability to learn from execution
- Problem: Current agents show limited ability to generalize across tasks. Experience gained in one interaction (successes or failures) is not transferred to future scenarios.
- Examples: Agents repeatedly make the same mistakes, such as misusing a retriever to query filtering or sorting constraints that are only accessible via specific UI components. This leads to irrelevant results and demonstrates a lack of adaptive learning.
- Future Work: Enable execution-time learning and memory, allowing agents to abstract successful patterns, track failure cases, and refine decision-making over time. This could involve task-level memory modules, outcome-based self-reflection, and lifelong learning mechanisms (a small memory-module sketch follows the figure discussion below).

The following figure (Figure 15 from the original paper) illustrates the web agent's failure to learn from execution:
The image is a webpage screenshot illustrating a shopping agent failing to learn effectively during execution; it contains Amazon search results and UI elements, with the failure messages highlighted.
This figure shows screenshots from four different tasks where the web agent consistently misuses the retriever (likely a search bar or internal search function) for filtering or sorting, even though these functionalities are typically handled by dedicated UI components. This repeated error across tasks demonstrates a lack of execution-time learning, as the agent doesn't adapt its strategy based on past failures.
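The sketch below is an illustrative (not paper-proposed) task-level memory: it records failed action patterns so that the agent can check the memory before repeating them, e.g., before typing a sorting constraint into the search box again after that strategy has already failed.

```python
# Minimal sketch (illustrative only, not the paper's proposal): a task-level
# memory that records failed (page, action) patterns so the agent can avoid
# repeating them across tasks.

from collections import defaultdict

class FailureMemory:
    def __init__(self, max_repeats: int = 1):
        self.max_repeats = max_repeats
        self.failures = defaultdict(int)  # (page_kind, action_pattern) -> count

    def record_failure(self, page_kind: str, action_pattern: str) -> None:
        self.failures[(page_kind, action_pattern)] += 1

    def should_avoid(self, page_kind: str, action_pattern: str) -> bool:
        return self.failures[(page_kind, action_pattern)] >= self.max_repeats

memory = FailureMemory()
memory.record_failure("search_results", "type_sorting_constraint_into_search_box")
# Before acting, the agent consults memory and uses the dedicated sort dropdown instead.
assert memory.should_avoid("search_results", "type_sorting_constraint_into_search_box")
```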
6.2.5. Deep research systems are prone to hallucination errors
- Problem: Deep research systems often oversimplify complex queries, neglect constraints, and return confident yet inaccurate recommendations or incorrect information.
- Examples:
  - OpenAI's deep research system may assert that a matching product exists even when it does not, or claim size requirements are met when they are not.
  - Both the Gemini and OpenAI systems frequently return incomplete or incorrect links, redirecting to irrelevant websites or generic navigation pages instead of specific product detail views, violating task constraints.
  - They may extract policy information (e.g., return policies) from external, non-relevant sites rather than from the specified e-commerce platform.
- Future Work: Apply preference alignment and fact-checking techniques to reduce hallucination rates and improve the precision of retrieved links (a simple post-hoc link check is sketched after the figure discussion below).

The following figures (Figure 16 and Figure 17 from the original paper) illustrate hallucination errors in the OpenAI deep research system:
The image shows an e-commerce product page for a women's sleeveless vintage floral-print maxi dress; the page lists only Large and XX-Large as available sizes, while the task explicitly requires a Medium, illustrating the filtering and specification-matching challenges in shopping-agent tasks.
Figure 16 shows the OpenAI deep research system's answer to a task requesting a "Women's Vintage Floral Maxi Dress in Navy Blue, Size: Medium," and an explanation of the return policy. The system returns three links.
Figure 17 provides a detailed view of the first returned product link. Despite the task specifying "Size: Medium," the linked product only offers "Large" and "XX-Large" options. The deep research system hallucinates that the size requirement is met. Furthermore, Link2 and Link3 point to non-Amazon websites (e.g., "smoking-er.com"), violating the implied task constraint of searching on Amazon (as suggested by the context of a shopping benchmark). The system also incorrectly extracts return information from these external sites. These instances demonstrate hallucinations in both satisfying attribute constraints and sourcing accurate information from the correct domain.
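A simple post-hoc check of the kind the fact-checking suggestion points to could catch both failure modes seen here: links leaving the allowed domain and products that do not actually offer the requested size. The sketch below is a hypothetical checker, not the paper's evaluator.

```python
# Minimal sketch (hypothetical checker, not the paper's evaluator): post-hoc
# verification of an agent's answer -- reject links that leave the allowed
# domain and flag products that do not offer the requested size, the two
# failure modes shown in Figures 16-17.

from urllib.parse import urlparse

def on_allowed_domain(url: str, allowed_domain: str = "amazon.com") -> bool:
    host = urlparse(url).netloc.lower()
    return host == allowed_domain or host.endswith("." + allowed_domain)

def size_available(requested: str, offered_sizes: list) -> bool:
    return requested.strip().lower() in {s.strip().lower() for s in offered_sizes}

# Mirroring the Figure 17 case: an off-domain link and a page offering only
# Large / XX-Large when Medium was requested should both be flagged.
assert not on_allowed_domain("https://smoking-er.com/some-dress")
assert not size_available("Medium", ["Large", "XX-Large"])
```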
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces DeepShop, a novel and comprehensive benchmark designed to evaluate web agents in highly realistic and complex online shopping environments. It addresses the critical gap in existing benchmarks, which often feature simplistic queries, by systematically evolving query diversity across five major e-commerce domains and progressively increasing complexity through the addition of product attributes, search filters, and sorting preferences. DeepShop also provides a fine-grained and holistic evaluation framework, leveraging GPT-4o for automated assessment validated by human agreement.
Experimental results demonstrate that DeepShop is a challenging benchmark. Simple RAG methods fail due to their inability to perform dynamic web interactions. While web agents show improved performance through interaction, they still struggle significantly with simultaneously satisfying all fine-grained requirements, especially search filters and sorting preferences. Even advanced deep research systems face considerable challenges, exhibiting hallucination errors and relatively low overall success rates, particularly in visually-driven categories and for hard complexity queries. The detailed error analysis provides crucial insights into limitations in grounding, replanning, action space, and execution-time learning for web agents, and hallucination for deep research systems.
7.2. Limitations & Future Work
The authors acknowledge several limitations of DeepShop that open avenues for future research:
- Desktop Interfaces Only: The benchmark currently focuses solely on desktop web interfaces and does not include mobile-specific layouts or interactions. Future work could extend it to mobile environments.
- Lack of Dynamic User Intent and Multi-turn Interactions: DeepShop does not support dynamic changes in user intent during a task or complex multi-turn conversational interactions, which are common in real shopping assistance. Future benchmarks could incorporate these conversational aspects.
- Limited Cognitive Aspects: The benchmark does not fully capture the nuanced cognitive aspects of human shopping behavior, such as comparison strategies, brand loyalty, or subjective preferences.
- Benefiting from Tool Learning and Agent Capabilities: The authors suggest that DeepShop could benefit from recent advances in tool learning (allowing agents to use external tools more effectively) and broader agent capabilities (e.g., more sophisticated reasoning and planning).

From a societal perspective, the authors note that while shopping agents can assist users, they raise concerns about privacy and consumer manipulation. Future work should consider the broader implications of agent-centric information access on consumer behavior and market dynamics, ensuring ethical decision-making.
7.3. Personal Insights & Critique
DeepShop is a highly valuable contribution to the field of web agents, particularly for e-commerce. Its systematic approach to generating diverse and complex queries is a significant improvement over prior benchmarks, which often oversimplified real-world tasks. The fine-grained evaluation is especially insightful, as it moves beyond a simple pass/fail to diagnose where agents succeed or fail, providing actionable feedback for developers. The high GPT-4o agreement rates with human judgment also enhance the scalability and reproducibility of the benchmark.
Inspirations and Applications:
- Robust Agent Design: The identified error categories (grounding, replanning, action space, learning from execution, hallucination) provide a clear roadmap for designing more robust web agents. For instance, the need for multimodal fusion to improve grounding is a critical insight that can be applied to other GUI-based automation tasks beyond shopping.
- Curriculum Learning for Agents: The easy, medium, and hard complexity levels within DeepShop naturally lend themselves to curriculum learning approaches, where agents could be initially trained or fine-tuned on simpler tasks before progressing to more complex ones.
- Evaluation Beyond Binary: The fine-grained evaluation paradigm can be transferred to other complex multi-step tasks (e.g., customer support, data entry, research tasks) to provide more diagnostic insights into agent performance.
- Benchmarking Deep Research Systems: The inclusion of commercial deep research systems provides a valuable baseline and highlights their current limitations, pushing the research community to improve these powerful, yet imperfect, systems.
Potential Issues/Areas for Improvement:
- Dependence on GPT-4o for Query Generation and Evaluation: While GPT-4o is powerful, its use for both query generation and evaluation introduces a potential risk of "model overfitting" or hallucination in the benchmark creation process itself. Although human verification is performed, the inherent biases or limitations of GPT-4o could subtly influence the types of queries generated or how success is judged.
- Action Space Definition: The critique of web agents' limited action space is valid, but the paper does not propose a concrete expanded action set or a method for learning new actions. Future work stemming from DeepShop could focus on designing a more universal and extensible action space that better handles dynamic UI elements.
- Dynamic Website Changes: While DeepShop uses live websites, e-commerce platforms are constantly updated. This dynamism, while realistic, can lead to benchmark decay over time, requiring continuous maintenance and re-verification of tasks.
- Cognitive Aspects: The acknowledged limitation regarding cognitive aspects is significant. Real shopping involves subjective preferences, trust, and comparison strategies that are hard to capture with objective attributes, filters, and sorting. Integrating user feedback or preference learning into the benchmark could make it even more realistic.

Overall, DeepShop represents a crucial step forward in evaluating web agents, setting a higher bar for realistic performance and offering clear directions for future research in building truly intelligent and robust deep research shopping agents.