WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
TL;DR Summary
WebMall is a new benchmark for evaluating LLM-based web agents in multi-shop comparison-shopping scenarios. It features four simulated shops populated with authentic, heterogeneous product offers and a suite of 91 cross-shop tasks, advancing online shopping research beyond single-shop settings.
Abstract
LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the user's needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. These tasks include basic tasks such as finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT 4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.
In-depth Reading
1. Bibliographic Information
1.1. Title
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents
1.2. Authors
- Ralph Peeters (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
- Aaron Steiner (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
- Luca Schwarz (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
- Julian Yuya Caspary (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
- Christian Bizer (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicating it is likely awaiting peer review or has been submitted to a conference/journal. Given the publication date in 2025, it might be targeting a major conference or journal in web science, artificial intelligence, or natural language processing. Christian Bizer is a well-known researcher in web data extraction and semantic web, suggesting the paper aligns with high-impact research in these fields.
1.4. Publication Year
2025
1.5. Abstract
This paper introduces WebMall, a novel benchmark designed to evaluate LLM-based web agents in multi-shop online shopping scenarios, specifically focusing on comparison-shopping. WebMall comprises four simulated online shops, populated with authentic product offers sourced from the Common Crawl. It features a suite of 91 cross-shop tasks, categorized into basic tasks (e.g., finding specific products, price comparison, adding to cart, checkout) and advanced tasks (e.g., searching with vague requirements, identifying substitutes, finding compatible products). A key innovation is its focus on comparison-shopping across multiple, heterogeneous shops, and its use of more diverse, real-world product data compared to existing single-shop benchmarks like WebShop or ShoppingBench. The tasks in WebMall require longer interaction trajectories, reflecting realistic shopping behaviors. The authors evaluate eight baseline agents, varying observation modality (accessibility tree, screenshots), memory utilization, and underlying large language model (GPT 4.1, Claude Sonnet 4). The top-performing configurations achieved completion rates of 75% and 53%, and F1 scores of 87% and 63%, for basic and advanced task sets, respectively. WebMall is publicly released to foster research in web agent navigation, reasoning, and efficiency in e-commerce.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2508.13024
- PDF Link: https://arxiv.org/pdf/2508.13024v1.pdf
- Publication Status: This is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The proliferation of large language models (LLMs) has ignited significant interest in developing web agents capable of automating complex, long-running online tasks. A crucial application area is online shopping, where users often need to compare products across multiple stores to find the best deals or specific items. However, existing benchmarks for evaluating these web agents in e-commerce scenarios primarily focus on single-shop environments. These benchmarks either simulate a single online store or evaluate agents on the live web, which presents challenges for reproducibility. The lack of a standardized, reproducible benchmark for comparison-shopping across multiple, diverse online shops represents a significant gap in research. This gap hinders the development and rigorous evaluation of LLM-based web agents that can handle the real-world complexity of navigating, comparing, and transacting across heterogeneous e-commerce platforms.
The core problem the paper aims to solve is the absence of a comprehensive, reproducible, multi-shop benchmark for LLM-based web agents in comparison-shopping scenarios. This problem is important because real-world online shopping often involves comparing offers from various retailers, which requires agents to possess sophisticated navigation, reasoning, and cross-site information aggregation capabilities. Existing benchmarks fall short by either limiting agents to a single store, using artificial tasks, or relying on live web environments that prevent exact reproducibility. The paper's innovative idea, or entry point, is the creation of WebMall, a simulated multi-shop environment populated with realistic, heterogeneous product data and a challenging suite of cross-shop tasks that demand advanced comparison-shopping skills.
2.2. Main Contributions / Findings
The primary contributions of this paper are:
- Novel Multi-Shop Benchmark (WebMall): The introduction of WebMall, the first benchmark designed for comparison-shopping tasks across multiple simulated e-shops. It consists of four locally hostable online stores, populated with 4,421 authentic product offers derived from the Common Crawl, and a set of 91 cross-shop tasks across 11 categories. These tasks range from basic product search and checkout to advanced tasks requiring vague requirement reasoning, substitution, and compatibility analysis.
- Extensive Baseline Evaluation: The paper conducts a thorough evaluation of eight baseline agent configurations using the Browsergym/AgentLab framework. These configurations vary across observation space (accessibility tree, screenshots, or both), the use of persistent short-term memory, and the underlying large language model (GPT-4.1 and Claude Sonnet 4). This evaluation provides insights into the effectiveness and efficiency of current web agents in multi-shop scenarios.

Key conclusions and findings include:

- Challenging Benchmark: WebMall proves challenging for state-of-the-art LLMs, with the best configurations achieving completion rates of 75% for basic tasks and 53% for advanced tasks, and F1 scores of 87% and 63% respectively. This indicates significant room for improvement in current web agent capabilities for complex e-commerce tasks.
- Importance of the Accessibility Tree: The accessibility tree is identified as the most crucial observation modality for reliable navigation and high task completion rates, especially in e-commerce scenarios where structured information about UI elements is vital. Screenshots can be supplementary but cannot replace the structured information provided by accessibility trees.
- Benefits of Persistent Short-Term Memory: The integration of persistent short-term memory significantly improves task completion rates, particularly for long-running tasks that require agents to track and aggregate information across multiple shops and steps. This helps mitigate premature termination and information loss.
- LLM Performance Trade-offs: GPT-4.1 demonstrates better efficiency (faster, cheaper) and accuracy for structured, basic tasks. Claude Sonnet 4, while often slower and more costly, showed superior performance on less clearly defined, advanced tasks involving vague requirements or attribute-based reasoning.
- Common Failure Modes: Recurring issues include rigid search strategies (missing variants, not broadening queries), insufficient cross-shop reasoning (stopping after finding one offer), UI interaction errors, and output formatting mistakes (e.g., incomplete URLs).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand the WebMall paper, a reader should be familiar with several foundational concepts related to Large Language Models (LLMs), AI agents, and web technologies.
- Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. Examples include the GPT (Generative Pre-trained Transformer) series and Claude. They are characterized by their large number of parameters (billions or trillions) and their ability to perform a wide range of natural language processing tasks, including reasoning, summarization, and instruction following. In the context of web agents, LLMs act as the "brain" that interprets user instructions, plans actions, and processes observations from web pages.
- Web Agents (LLM-based Agents): These are AI agents that leverage LLMs to interact with websites. They are designed to understand natural language instructions from a user and perform complex tasks on the web by simulating human interaction (e.g., clicking buttons, typing text, scrolling, navigating between pages). The goal is to automate long-running web tasks that would otherwise require manual human effort.
- Benchmarks: In machine learning and AI, a benchmark is a standardized set of tasks or problems used to evaluate and compare the performance of different models or agents. A good benchmark is typically reproducible, covers a representative range of challenges, and provides clear evaluation metrics. WebMall is presented as such a benchmark.
- Accessibility Tree (AX-Tree): This is a representation of the user interface (UI) of a web page that provides structured, semantic information about its elements. Unlike a visual screenshot, the accessibility tree is an abstract tree structure that contains information about elements like buttons, input fields, and links, together with their associated labels, roles, and states. It is primarily designed to help assistive technologies (e.g., screen readers) understand and interact with web content. For web agents, it offers a programmatic way to understand the structure and interactive elements of a page, enabling more precise navigation and interaction than visual cues alone.
- Observation Space/Modality: This refers to the type of information an agent receives about its environment. In the context of web agents, common observation modalities include:
  - Accessibility Tree: Provides structured, semantic information.
  - Screenshot: A visual image of the web page, capturing layout, colors, product images, etc. Interpreting it requires vision models (like GPT-4V or Claude Sonnet's vision capabilities).
  - HTML/DOM: The raw HTML or Document Object Model of the page, offering the most detailed structural information, but often overwhelming for LLMs directly.
- Memory (Persistent Short-Term Memory): For LLM-based agents, memory refers to the ability to store and recall information relevant to the current task over an extended sequence of actions. Persistent short-term memory means the agent can retain specific pieces of information (e.g., found prices, product URLs, user requirements) across multiple steps or page navigations, rather than relying solely on the context of the immediate prompt or a simple action history. This is crucial for long-running tasks like comparison-shopping, where information needs to be collected and aggregated from different sources.
- Completion Rate (CR): An evaluation metric that measures the percentage of tasks for which an agent successfully produces a perfect and correct answer within a given step limit.
- Precision (P), Recall (R), and F1 Score (F1): Standard metrics used in information retrieval and classification tasks.
  - Precision: The proportion of correctly identified positive results (e.g., correct product offers) out of all positive results identified by the agent. It answers: "Of all items the agent said were relevant, how many actually were relevant?"
  - Recall: The proportion of correctly identified positive results out of all actual positive results. It answers: "Of all items that were relevant, how many did the agent find?"
  - F1 Score: The harmonic mean of Precision and Recall, providing a single score that balances both. It is particularly useful when dealing with imbalanced classes or when both Precision and Recall are important.
- Token Usage: LLMs process input and generate output in units called tokens. A token can be a word, part of a word, or punctuation. Token usage is a measure of the computational resources (and cost) consumed by an LLM, as billing is often based on the number of tokens processed.
- API Cost: The monetary cost associated with using LLM services through their Application Programming Interfaces (APIs). This cost is typically calculated based on token usage, model type, and sometimes other factors like image processing.
- Docker: A platform that uses OS-level virtualization to deliver software in packages called containers. Containers are isolated from each other and bundle their own software, libraries, and configuration files, ensuring that software runs consistently across different environments. WebMall uses Docker for its locally hostable simulated shops, guaranteeing reproducibility.
- WordPress/WooCommerce: WordPress is a popular open-source content management system (CMS) used for building websites. WooCommerce is a free e-commerce plugin for WordPress that adds online store functionality, allowing users to sell products, manage inventory, and process payments. WebMall leverages these technologies to create realistic, functional online shops.
- Common Crawl: A non-profit organization that provides open datasets of web crawl data. It crawls billions of web pages monthly and makes the raw data available to the public. WebMall sources its authentic product offers from the Common Crawl to ensure realism and diversity.
- Schema.org: A collaborative effort to create, maintain, and promote schemas for structured data on the internet. It provides a collection of shared vocabularies that webmasters can use to mark up their web pages with semantic information (e.g., Product, Offer, price, description). This structured data helps search engines and AI agents better understand the content of web pages. WebMall uses schema.org annotations to extract product offers.
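To make the accessibility-tree modality concrete, the following minimal sketch uses the Playwright Python library to load a page and dump its accessibility tree. This is illustrative only and is not the paper's Browsergym/AgentLab setup; the localhost URL is a placeholder.

```python
# Minimal sketch: obtaining an accessibility tree with Playwright (Python).
# Illustrative only; the paper's agents use Browsergym/AgentLab, whose
# observation format may differ. The URL below is a placeholder.
from playwright.sync_api import sync_playwright

def dump_ax_tree(url: str) -> dict:
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url)
        # Playwright exposes a snapshot of the accessibility tree:
        # a nested dict of roles, names, and states of UI elements.
        tree = page.accessibility.snapshot()
        browser.close()
        return tree

if __name__ == "__main__":
    tree = dump_ax_tree("http://localhost:8080")  # e.g., a locally hosted shop
    print(tree["role"], tree.get("name"))
    for child in tree.get("children", [])[:5]:
        print("-", child["role"], child.get("name"))
```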
3.2. Previous Works
The paper extensively references existing benchmarks for evaluating web agents, particularly in online shopping and broader web interaction contexts. Understanding these prior works and their limitations is crucial for appreciating WebMall's contributions.
- WebShop [20]: An early and influential benchmark for online shopping agents.
  - Description: Simulates a single e-shop populated with over a million real product offers scraped from Amazon. Agents navigate this shop to fulfill user requests (e.g., "Find me a cheap laptop").
  - Limitation addressed by WebMall: WebShop is a single-shop environment, meaning it does not require comparison-shopping or cross-site information aggregation, which are core to WebMall. Its product offers, while numerous, are from a single source (Amazon), potentially lacking the heterogeneity WebMall aims for.
- WebArena [22]: A broader web agent benchmark.
  - Description: Simulates multiple websites across various domains, including e-commerce, social media, and productivity.
  - Limitation addressed by WebMall: While it includes e-commerce tasks, its shopping tasks are confined to a single e-shop and often focus on administrative tasks (e.g., shop management, sales statistics) rather than complex user-centric comparison-shopping.
- REAL [6]: Benchmarks autonomous agents on deterministic simulations of real websites.
  - Description: Spans various task types, including single-shop e-commerce tasks like product search, cart management, and checkout.
  - Limitation addressed by WebMall: Similar to WebShop and WebArena, REAL also operates within a single-shop environment, thereby not addressing the multi-shop comparison-shopping challenge.
- ShoppingBench [16]: Another single-store benchmark.
  - Description: Simulates a single-store environment with tasks covering user intents like searching for products, using vouchers, and adhering to a budget.
  - Limitation addressed by WebMall: Focuses on a single store, missing the cross-shop comparison aspect.
- Live Web Benchmarks (Mind2Web [3], BrowseComp [18], DeepShop [10]):
  - Description: These benchmarks evaluate agents directly on the live World Wide Web rather than in simulated environments. DeepShop specifically features complex product search queries.
  - Limitation addressed by WebMall: While offering realism, evaluating on the live Web makes reproducibility extremely difficult. Website content changes, links break, and layouts evolve, making consistent comparative evaluation challenging. WebMall explicitly avoids this by providing a containerized, locally hostable environment. BrowseComp is noted for featuring artificial tasks designed to be difficult, which contrasts with WebMall's focus on tasks representative of real-world shopping behaviors.
- Other LLM-based Agent Benchmarks (AgentBench [9], VisualWebArena [7], WebChoreArena [11], DeepResearchBench [4], ECom-Bench [15]): These benchmarks cover a wider range of agent capabilities beyond just web shopping.
  - AgentBench: Extends beyond the web to databases and operating systems.
  - VisualWebArena: Focuses on visually grounded tasks.
  - WebChoreArena: Targets memory-intensive, tedious web tasks.
  - DeepResearchBench: Evaluates web research agents on multi-step tasks.
  - ECom-Bench: Focuses on customer support dialogues in e-commerce.
  - Limitation addressed by WebMall: While valuable, these benchmarks either do not focus on e-commerce or do not specifically address the multi-shop comparison-shopping paradigm with heterogeneous product data.
3.3. Technological Evolution
The field of web agents has seen rapid evolution, primarily driven by advances in LLMs and multi-modal AI.
- Early Web Scrapers/Bots: Automated web interaction initially involved rule-based web scrapers and simple bots designed for specific, repetitive tasks. These lacked generalizability and natural language understanding.
- Reinforcement Learning for Web Navigation: Research then explored using reinforcement learning to train agents to navigate websites, often requiring large amounts of interaction data and suffering from poor generalization across different website layouts.
- Emergence of LLMs: The development of powerful LLMs (e.g., GPT-2, GPT-3) marked a turning point. These models could understand complex instructions and generate human-like text, paving the way for more flexible and intelligent web agents.
- LLM-as-Agent Frameworks: ReAct [21] (Reasoning and Acting) showed how LLMs could interleave reasoning (generating thoughts) and acting (executing web actions), making them more capable. Reflexion [13] further enhanced this by incorporating verbal reinforcement learning, allowing agents to learn from past successes and failures. Voyager [14] introduced curriculum learning and modular skill libraries for open-ended tasks.
- Multi-modal Agents: The advent of multi-modal LLMs (e.g., GPT-4V) that can process both text and images enabled agents to interpret visual cues from screenshots, complementing the structural information from HTML or accessibility trees.
- Benchmarking Evolution: As web agents became more sophisticated, the need for comprehensive benchmarks grew. Initial benchmarks like WebShop focused on single-site interaction. More recent efforts expanded to broader web interaction (WebArena, Mind2Web) or specific challenges (VisualWebArena, WebChoreArena). WebMall fits into this evolution by pushing the boundaries of e-commerce benchmarking to include the complex, real-world scenario of multi-shop comparison-shopping.
3.4. Differentiation Analysis
WebMall differentiates itself from existing web agent benchmarks in several key aspects:
- Multi-Shop Comparison-Shopping: This is the most significant differentiator. Unlike WebShop, WebArena, REAL, or ShoppingBench, which are single-shop environments, WebMall explicitly requires agents to navigate and aggregate information across four distinct online shops. This introduces challenges like cross-site reasoning, price comparison, and product offer aggregation that are absent in single-shop benchmarks.
- Heterogeneous Product Offers: The product offers in WebMall are sourced from hundreds of distinct real-world shops via the Common Crawl and schema.org annotations. This leads to more heterogeneous product descriptions, titles, and attribute representations than in benchmarks populated from a single source (e.g., WebShop from Amazon), making the task of matching and comparing products more challenging and realistic.
- Longer Interaction Trajectories: The tasks in WebMall are designed to necessitate longer interaction trajectories than, for instance, those in WebShop. This includes not just finding a product but often comparing it across multiple shops, adding it to the cart, and completing checkout, or performing advanced reasoning over vague requirements. This better reflects real-world shopping behaviors.
- Reproducible Environment vs. Live Web: In contrast to Mind2Web, BrowseComp, and DeepShop, which evaluate on the live Web, WebMall provides a fully containerized, locally hostable environment. This ensures exact reproducibility of evaluation results, allowing for fair and consistent comparison of different agent architectures without the variability inherent in the live internet.
- Advanced Task Categories: WebMall introduces advanced tasks such as searching with vague requirements, identifying suitable substitutes, and finding compatible products. These tasks go beyond simple product search or checkout and demand more sophisticated reasoning and understanding from the agents, reflecting more nuanced user needs.
4. Methodology
4.1. Principles
The core idea behind WebMall is to create a realistic, reproducible, and challenging environment for evaluating LLM-based web agents in e-commerce comparison-shopping scenarios. The theoretical basis is that for web agents to be truly useful in automating online tasks, they must be able to handle the complexity of the real web, which includes navigating multiple, diverse websites, extracting and comparing heterogeneous information, and performing complex reasoning to fulfill user needs. By simulating this multi-shop environment with authentic data and a comprehensive set of tasks, WebMall aims to push the boundaries of web agent capabilities beyond existing single-shop or artificial benchmarks. The intuition is that an agent that can successfully comparison-shop across WebMall's heterogeneous stores and tasks will demonstrate strong navigation, information extraction, reasoning, and decision-making skills transferable to real-world applications.
4.2. Core Methodology In-depth (Layer by Layer)
The WebMall methodology involves several key components: the environment (simulated shops), the data (product offers), the task set, and the evaluation framework.
4.2.1. WebMall Environment
The WebMall environment consists of four simulated online shops and a solution submission website.
- Shop Implementation: The four shops are implemented using WordPress with the WooCommerce plugin. This choice provides realistic e-commerce functionality (shopping cart, checkout, search bar, product detail pages, category navigation) and allows for heterogeneity in user interfaces.
- Shop Templates: Four distinct, free WooCommerce templates are used to ensure that the shops have heterogeneous visual interfaces and layouts, mimicking the diversity found in the real world.
- Local Hostability: The entire environment is containerized using Docker. After cloning the repository, a two-command setup automatically downloads backup files, configures services, and launches the four shops, their databases, and Elasticsearch instances. This guarantees reproducibility across different evaluation setups.
- Solution Website: In addition to the shops, a dedicated website is part of the environment where agents submit their task solutions (e.g., URLs of relevant product offers) or indicate task completion.
4.2.2. Product Offer Collection and Distribution
To ensure realism and challenge, WebMall populates its shops with authentic product offers.
- Data Source: Product offers are sourced from the October 2024 Common Crawl via schema.org annotations. Schema.org is a vocabulary for structured-data markup on web pages, which allows for programmatic extraction of product information.
- Filtering: A multi-step filtering process is applied to the raw Common Crawl data (a code sketch of this pipeline follows this subsection):
  - Property Check: Only offers containing the title, description, price, and priceCurrency schema.org properties are retained.
  - Deduplication: Exact duplicates based on the combination of these four attributes are removed.
  - Language Filtering: Since WebMall is an English-language benchmark, the fastText language classification model is applied to titles and descriptions to filter for English offers only.
  - Product Clustering: Offers containing globally unique product identifiers like GTIN (Global Trade Item Number) or MPN (Manufacturer Part Number) are grouped into clusters. These clusters represent the same real-world product, facilitating later task creation and distribution.
- Manual and Automated Distribution:
  - Initial Manual Selection: A set of product offers (selected during task creation) is manually distributed across the four shops to ensure specific tasks can be formed.
  - Automated Filler Population: GPT-4.1 is used to query the corpus for additional offers to fill the shops in three designated categories: PC components, PC peripherals, and other electronics.
    - Embedding Generation: For each category query, OpenAI's text-embedding-3-small model is used to compute embeddings for product offers. Embeddings are numerical representations of text that capture semantic meaning, allowing for similarity comparisons.
    - Nearest Neighbor Retrieval: Elasticsearch is used to retrieve nearest neighbors (the most similar product offers) via cosine similarity over pre-indexed product vectors. Cosine similarity measures the cosine of the angle between two vectors, indicating their directional similarity.
    - Cleaning and Assessment: Retrieved candidates are cleaned (HTML removal, normalization) and then assessed by GPT-4.1 for listing quality (English, informative description, specific non-generic title, not list-like) and category relevance.
    - Constraint Checking: Each candidate is screened against a constraint list derived from the task set to prevent newly added offers from creating unintended valid task solutions. The resulting distribution of product offers is shown in Table 1.
The following are the results from Table 1 of the original paper:
| Product Category | Total Offers | Total % | Shop 1 Offers | Shop 1 % | Shop 2 Offers | Shop 2 % | Shop 3 Offers | Shop 3 % | Shop 4 Offers | Shop 4 % |
| PC Components | 1,477 | 33.4 | 348 | 30.2 | 369 | 33.7 | 430 | 37.2 | 330 | 32.4 |
| PC Peripherals | 1,388 | 31.4 | 432 | 37.5 | 255 | 23.3 | 336 | 29.1 | 365 | 35.8 |
| Other Electronics | 1,556 | 35.2 | 370 | 32.3 | 471 | 43.0 | 390 | 33.7 | 325 | 31.9 |
| Total | 4,421 | 100.0 | 1,150 | 100.0 | 1,095 | 100.0 | 1,156 | 100.0 | 1,020 | 100.0 |
- Product Characteristics: The 4,421 offers have varied titles (6 to 264 characters, median 69, average 76.4) and descriptions (15 to >14,000 characters, median 573, average 1,059), reflecting real-world diversity.
- Category Trees: Each shop has manually created, distinct category trees to simulate heterogeneity.
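To make the collection pipeline above concrete, here is a minimal sketch of the filtering and embedding-retrieval steps. It is an illustrative reconstruction, not the authors' code: the offer dict fields mirror the four schema.org properties, while the fastText model file, the Elasticsearch index name, and the vector field name are assumptions.

```python
# Illustrative reconstruction of the offer-collection pipeline (not the
# authors' code). Assumes offers are dicts carrying the four schema.org
# fields, a local fastText language-ID model ("lid.176.bin"), and an
# Elasticsearch index "webmall-offers" with a dense_vector field
# "embedding" configured for cosine similarity.
import fasttext
from openai import OpenAI
from elasticsearch import Elasticsearch

REQUIRED = ("title", "description", "price", "priceCurrency")

def filter_offers(raw_offers: list[dict]) -> list[dict]:
    lang_model = fasttext.load_model("lid.176.bin")
    seen, kept = set(), []
    for offer in raw_offers:
        # Property check: all four schema.org properties must be present.
        if not all(offer.get(k) for k in REQUIRED):
            continue
        # Deduplication on the combination of the four attributes.
        key = tuple(str(offer[k]) for k in REQUIRED)
        if key in seen:
            continue
        seen.add(key)
        # Language filtering: keep English titles/descriptions only.
        text = f"{offer['title']} {offer['description']}".replace("\n", " ")
        labels, _ = lang_model.predict(text)
        if labels[0] == "__label__en":
            kept.append(offer)
    return kept

def retrieve_similar_offers(category_query: str, k: int = 50) -> list[dict]:
    # Embed the category query with the same model used for the offers.
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    emb = client.embeddings.create(
        model="text-embedding-3-small", input=category_query
    ).data[0].embedding
    # kNN retrieval over pre-indexed offer vectors (cosine similarity).
    es = Elasticsearch("http://localhost:9200")
    resp = es.search(
        index="webmall-offers",  # assumed index name
        knn={"field": "embedding", "query_vector": emb,
             "k": k, "num_candidates": 10 * k},
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]
```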
4.2.3. WebMall Task Set
The WebMall task set comprises 91 tasks designed to evaluate web agents in comparison-shopping scenarios, grouped into basic and advanced categories.
- Task Definition: Each task includes a natural-language instruction for the web agent and a set of one or more solution URLs if the task requires finding specific offers.
- Basic Tasks: Represent typical, straightforward online shopping actions.
  - Find Specific Product (12 tasks): Locate all offers for a named product across all shops.
  - Find Cheapest Offer (10 tasks): Identify the lowest-priced offer for a named product across all shops.
  - Products Fulfilling Specific Requirements (11 tasks): Find offers based on specific attribute constraints (e.g., display size, memory) without a named product.
  - Add To Cart (7 tasks): Add specific named product offers to the shopping cart.
  - Checkout (8 tasks): Add a specific offer to the cart and complete the full checkout process (including filling in shipping/billing details).
- Advanced Tasks: Incorporate higher complexity, vagueness, and reasoning requirements.
  - Cheapest Offer with Specific Requirements (10 tasks): Extends Products Fulfilling Specific Requirements by also requiring comparison and selection of the cheapest offer.
  - Products Satisfying Vague Requirements (8 tasks): Find products based on vaguely described user needs, requiring agent reasoning.
  - Cheapest Offer with Vague Requirements (6 tasks): Combines vague requirements with price comparison to find the cheapest offers.
  - Find Substitutes (6 tasks): Suggest cheaper alternative products, simulating scenarios of unavailability or high price.
  - Find Compatible Products (5 tasks): Requires reasoning over compatibility (e.g., finding compatible CPUs for a motherboard).
  - End-to-End (8 tasks): Integrates multiple steps (searching for products, price comparison, adding to cart, and checkout) into a single workflow.

The following are the results from Table 2 of the original paper:

| Task Category | Count | Example |
| Basic Task Set | | |
| Find Specific Product | 12 | Find all offers for the AMD Ryzen 9 5900X. |
| Find Cheapest Offer | 10 | Find the cheapest offer for the Samsung Galaxy S24 Plus. |
| Products Fulfilling Specific Requirements | 11 | Find all offers for orange straps that fit with the Apple Watch Series 6. |
| Add to Cart | 7 | Find all offers for the Asus DUAL RTX4070 SUPER OC White and add them to the shopping cart. |
| Checkout | 8 | Add the product on page {PRODUCT_URL} to the shopping cart and complete the checkout process. |
| Advanced Task Set | | |
| Cheapest Offer Specific Requirements | 10 | Find the cheapest offer for a new Xbox gaming console with at least 512gb disk space in white. |
| Products Satisfying Vague Requirements | 8 | Find all offers for the largest available MX500 model by Crucial. |
| Cheapest Offer Vague Requirements | 6 | Find the cheapest offers for each model of mid-tier nVidia gaming GPUs in the 4000 series. |
| Find Substitutes | 6 | Find the cheapest alternative for this item: {PRODUCT_URL}. |
| Find Compatible Products | 5 | Find all offers for compatible CPUs for this motherboard: {PRODUCT_URL}. |
| End To End | 8 | Find the cheapest offer for the Asrock B550 PHANTOM GAMING 4 and purchase it. |

- Artifacts: All tasks and their solutions are provided in a JSON file. Agents receive instructions explaining the WebMall environment (shop URLs, submission process) before each task.
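The paper does not reproduce the schema of this JSON file. Purely as an illustration, a task entry of the kind described might look like the following Python dict (all field names and the URL are hypothetical):

```python
# Hypothetical shape of a WebMall task entry (field names and URL are
# illustrative assumptions; the actual schema is defined in the repository).
task = {
    "id": "cheapest_offer_03",
    "category": "Find Cheapest Offer",
    "instruction": "Find the cheapest offer for the Samsung Galaxy S24 Plus.",
    "solution_urls": [
        "http://localhost:8082/product/samsung-galaxy-s24-plus",
    ],
}
```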
4.2.4. Agent Evaluation Framework (Browsergym/AgentLab)
The Browsergym and AgentLab frameworks are used to configure and run the baseline agents.
- Browsergym: Provides common tools for web agents, including web browsing capabilities (using the Python Playwright library), experimental framing, and result/trace tracking. It supports any API-based hosted LLM.
- AgentLab: Integrates with Browsergym and allows for configuring more sophisticated agents by affording API-based LLMs specific capabilities.
- Agent Configurations: Eight baseline agent configurations are evaluated, varying along three dimensions:
  - Observation Space: How the agent perceives the web page.
    - AX-Tree: The agent receives the HTML accessibility tree (structural information).
    - Screenshot: The agent receives a visual screenshot of the viewport. Vision capability is implemented using set-of-mark [19] prompting, where visual elements are marked up for LLM processing.
    - AX-Tree + Screenshot: The agent receives both modalities.
  - Memory: The ability to retain information over time.
    - Memory: AgentLab's persistent short-term memory is activated, allowing agents to store and filter discovered information (e.g., the cheapest product offer and its URL) across steps.
    - No Memory: Agents rely solely on their action history and thoughts at each step, without explicit persistent storage of task-relevant data.
  - Large Language Model (LLM): The underlying AI model driving the agent's decisions.
    - GPT-4.1: An iteration of GPT-4 from OpenAI.
    - Claude Sonnet 4: An iteration of Claude Sonnet from Anthropic.
- Step Limit: Each agent is allowed up to 50 steps to complete a task. A step is an action like go to page, click, fill text, or scroll, as defined by AgentLab.
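Browsergym/AgentLab handle this loop internally. As a rough mental model only (this is not AgentLab's actual API), an observe-decide-act loop with a 50-step budget might look like:

```python
# Rough mental model of the observe-decide-act loop with a 50-step limit.
# This is NOT AgentLab's API; it only illustrates the control flow. The
# llm_choose_action function stands in for a prompted LLM call, and env is
# a hypothetical wrapper around the browser environment.
from dataclasses import dataclass

MAX_STEPS = 50

@dataclass
class Action:
    kind: str    # e.g. "goto", "click", "fill", "scroll", "submit"
    target: str  # element id or URL
    value: str = ""

def llm_choose_action(instruction: str, observation: str,
                      history: list[Action]) -> Action:
    """Placeholder for a call to GPT-4.1 / Claude Sonnet 4 that maps the
    current observation (AX-tree and/or screenshot) to the next action."""
    raise NotImplementedError

def run_task(instruction: str, env) -> bool:
    history: list[Action] = []
    for _ in range(MAX_STEPS):
        observation = env.observe()          # AX-tree text, screenshot, or both
        action = llm_choose_action(instruction, observation, history)
        history.append(action)
        if action.kind == "submit":          # agent submits its solution URLs
            return env.check_solution(action.value)
        env.execute(action)                  # click / fill / goto / scroll
    return False                             # step limit exhausted
```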
4.2.5. Evaluation Metrics
The evaluation measures both effectiveness and efficiency.
- Effectiveness Metrics:
  - Completion Rate (CR): Percentage of tasks where the agent outputs a perfect answer within the step limit.
  - Precision (P), Recall (R), F1-score (F1): Calculated over the returned set of answers (e.g., URLs) and the correct set of answers per task. Macro averaging is used to aggregate these scores across tasks, meaning scores are computed for each task and then averaged.
- Efficiency Metrics:
  - Average Steps: The average number of actions taken per task.
  - Tokens Consumed: The average number of LLM tokens used per task (input and output).
  - Runtime: The average time taken to complete a task.
  - Estimated API Cost: The estimated monetary cost per task based on token usage and LLM API pricing.

The methodology ensures a robust evaluation by providing a standardized, reproducible environment, realistic and challenging tasks, and a comprehensive set of metrics covering various aspects of agent performance.
5. Experimental Setup
5.1. Datasets
The primary "dataset" for the WebMall experiments is the WebMall environment itself, which includes:
- Four Simulated Online Shops: Implemented using WordPress and WooCommerce, designed to be visually distinct and functionally heterogeneous, mirroring real-world e-commerce sites.
- Product Offers: A total of 4,421 authentic product offers.
  - Source: Extracted from the October 2024 Common Crawl using schema.org annotations.
  - Categories: Distributed across PC components, PC peripherals, and other electronics.
  - Characteristics: Varied titles (6 to 264 characters, median 69, average 76.4) and descriptions (15 to over 14,000 characters, median 573, average 1,059).
- Task Set: 91 cross-shop tasks divided into 11 categories (basic and advanced). Each task consists of a natural language instruction and, if applicable, a set of ground truth URLs as solutions.

These data sources (the simulated shops and the tasks) are effective for validating the method's performance because:

- They are realistic: Product data is from the real web (Common Crawl), and the shop functionality (WooCommerce) is common.
- They are diverse: Heterogeneous shop interfaces and product descriptions challenge agent generalization.
- They are multi-shop: The core novelty, requiring comparison-shopping and cross-site reasoning.
- They are reproducible: The Docker-containerized setup ensures consistent environments for comparative evaluation.
- They are challenging: The task set includes vague requirements, compatibility reasoning, and end-to-end workflows, pushing agent capabilities.
5.2. Evaluation Metrics
The paper uses several metrics to evaluate the effectiveness and efficiency of the web agents.
5.2.1. Completion Rate (CR)
- Conceptual Definition: Completion Rate measures the percentage of tasks for which an agent successfully provides a perfect and correct answer within the predefined step limit (50 steps). It quantifies the agent's ability to execute a task from start to finish without errors or premature termination, according to the specified requirements.
- Mathematical Formula: $ \text{CR} = \frac{\text{Number of perfectly completed tasks}}{\text{Total number of tasks}} \times 100\% $
- Symbol Explanation:
  - CR: the Completion Rate.
  - Number of perfectly completed tasks: the count of tasks where the agent's output exactly matches the ground truth solution.
  - Total number of tasks: the total number of tasks in the benchmark set being evaluated.
5.2.2. Precision (P)
- Conceptual Definition: Precision measures the accuracy of the agent's positive predictions. In the context of WebMall, if an agent is asked to find relevant product offers, Precision indicates how many of the offers it returned were actually correct and relevant. It focuses on the quality of the positive results.
- Mathematical Formula: $ P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
- Symbol Explanation:
  - P: Precision.
  - True Positives: items (e.g., product URLs) correctly identified by the agent as part of the solution.
  - False Positives: items incorrectly identified by the agent as part of the solution (i.e., the agent returned them, but they are not in the ground truth solution).
5.2.3. Recall (R)
- Conceptual Definition: Recall measures the agent's ability to find all the relevant items. If there are multiple correct product offers for a task, Recall indicates what proportion of these the agent successfully identified. It focuses on the completeness of the positive results.
- Mathematical Formula: $ R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
- Symbol Explanation:
  - R: Recall.
  - True Positives: items (e.g., product URLs) correctly identified by the agent as part of the solution.
  - False Negatives: items that are part of the ground truth solution but were not identified by the agent.
5.2.4. F1 Score (F1)
- Conceptual Definition: The F1 Score is the harmonic mean of Precision and Recall. It provides a balanced measure that considers both false positives and false negatives. A high F1 Score indicates that the agent has both high Precision and high Recall, making it a good overall indicator of performance, especially when Precision and Recall are in tension. For WebMall, macro averaging is applied, meaning the F1 Score is calculated for each task independently and the average of these F1 Scores is reported.
- Mathematical Formula: $ F1 = 2 \times \frac{P \times R}{P + R} $
- Symbol Explanation:
  - F1: F1 Score.
  - P: Precision.
  - R: Recall.
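Putting the four effectiveness metrics together, here is a minimal sketch of per-task scoring with macro averaging, assuming solutions and answers are sets of URLs (an illustration, not the benchmark's evaluation code):

```python
# Minimal sketch of the effectiveness metrics over URL sets, with macro
# averaging across tasks (illustrative; not the benchmark's evaluation code).

def task_scores(predicted: set[str], truth: set[str]) -> tuple[float, float, float]:
    tp = len(predicted & truth)                     # true positives
    p = tp / len(predicted) if predicted else 0.0   # precision
    r = tp / len(truth) if truth else 0.0           # recall
    f1 = 2 * p * r / (p + r) if (p + r) > 0 else 0.0
    return p, r, f1

def evaluate(runs: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    per_task = [task_scores(pred, truth) for pred, truth in runs]
    n = len(per_task)
    return {
        # Completion rate: fraction of tasks answered perfectly.
        "CR": 100.0 * sum(pred == truth for pred, truth in runs) / n,
        # Macro averaging: compute each metric per task, then average.
        "P": 100.0 * sum(p for p, _, _ in per_task) / n,
        "R": 100.0 * sum(r for _, r, _ in per_task) / n,
        "F1": 100.0 * sum(f for _, _, f in per_task) / n,
    }

scores = evaluate([({"shop1/offer-a"}, {"shop1/offer-a", "shop2/offer-b"})])
print(scores)  # CR 0.0, P 100.0, R 50.0, F1 ~66.7
```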
5.2.5. Efficiency Metrics
- Average Steps: The mean number of actions (e.g., click, fill text, go to page) performed by the agent per task.
- Average Input Tokens: The mean number of tokens sent to the LLM as input per task.
- Average Output Tokens: The mean number of tokens generated by the LLM as output per task.
- Average Runtime: The mean time taken for an agent to complete a task, measured in seconds.
- Average Cost: The estimated mean API cost per task, derived from token usage and current LLM pricing models.
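Since billing is token-based, a per-task cost estimate can be derived directly from token counts. The sketch below uses placeholder per-million-token rates (not figures from the paper); with these rates, the average basic-task token counts reported for GPT-4.1 in Table 5 come out at roughly the $0.28 listed there.

```python
# Sketch of per-task API cost estimation from token counts. The price table
# uses PLACEHOLDER per-million-token rates, not the paper's figures;
# substitute current provider pricing.
PRICE_PER_M = {                        # (input, output) USD per million tokens
    "gpt-4.1": (2.00, 8.00),           # placeholder rates
    "claude-sonnet-4": (3.00, 15.00),  # placeholder rates
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    p_in, p_out = PRICE_PER_M[model]
    return (input_tokens * p_in + output_tokens * p_out) / 1_000_000

# Example with the average basic-task token counts reported for GPT-4.1:
print(f"${estimate_cost('gpt-4.1', 131_301, 2_334):.2f} per task")  # ~$0.28
```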
5.3. Baselines
The paper evaluates eight baseline agent configurations built using the Browsergym/AgentLab framework. These baselines are chosen to explore the impact of different observation modalities, the presence of memory, and the choice of underlying LLM. They are representative because they cover common architectural choices for LLM-based web agents.
The baselines are formed by combining:
- Large Language Models (LLMs):
  - GPT-4.1 (from OpenAI)
  - Claude Sonnet 4 (from Anthropic)
- Observation Spaces:
  - AX-Tree: Only the accessibility tree is provided.
  - AX-Tree + Memory: Accessibility tree with persistent short-term memory enabled.
  - AX-Tree + Vision: Accessibility tree supplemented by screenshots.
  - Vision: Only screenshots are provided.

This results in eight distinct baseline configurations (two LLMs times four observation setups). Each configuration is run on the full WebMall task set (91 tasks) to gather performance data across all specified metrics.
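The eight baselines are simply the cross product of the two LLMs and the four observation setups; a small sketch (configuration names are shorthand, not AgentLab identifiers):

```python
# The eight baselines as the cross product of LLM and observation setup.
# Names are shorthand for this illustration, not AgentLab identifiers.
from itertools import product

LLMS = ["gpt-4.1", "claude-sonnet-4"]
OBS = ["ax_tree", "ax_tree+memory", "ax_tree+vision", "vision"]

CONFIGS = [
    {"llm": llm, "observation": obs, "max_steps": 50}
    for llm, obs in product(LLMS, OBS)
]
assert len(CONFIGS) == 8
```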
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that WebMall is a challenging benchmark for current LLM-based web agents. The analysis highlights the importance of structured observations (accessibility trees) and memory for effective web interaction, while also revealing performance trade-offs between different LLMs across various task complexities.
The following are the results from Table 3 of the original paper:
| Model | Task set | AX-Tree CR (%) | P (%) | R (%) | F1 (%) | AX-Tree + Memory CR (%) | P (%) | R (%) | F1 (%) | AX-Tree + Vision CR (%) | P (%) | R (%) | F1 (%) | Vision CR (%) | P (%) | R (%) | F1 (%) |
| GPT4.1 | Basic | 56.25 | 74.48 | 67.59 | 70.87 | 75.00 | 91.60 | 83.95 | 87.61 | 56.25 | 72.66 | 65.77 | 69.04 | 41.67 | 59.64 | 50.43 | 54.65 |
| GPT4.1 | Advanced | 32.56 | 52.03 | 45.57 | 48.59 | 34.88 | 52.11 | 46.25 | 49.01 | 39.53 | 48.46 | 48.35 | 48.41 | 13.95 | 20.70 | 18.00 | 19.26 |
| Claude Sonnet 4 | Basic | 66.67 | 76.04 | 72.44 | 74.20 | 70.83 | 81.25 | 75.12 | 78.06 | 72.92 | 79.17 | 76.67 | 77.90 | 10.42 | 35.42 | 21.99 | 27.14 |
| Claude Sonnet 4 | Advanced | 53.49 | 63.37 | 63.41 | 63.39 | 48.84 | 61.51 | 58.40 | 59.91 | 37.21 | 41.11 | 41.80 | 41.45 | 4.65 | 10.47 | 6.69 | 8.16 |
Overall Performance by Task Set (Table 3):
- Basic Tasks:
  - The GPT-4.1 with AX-Tree + Memory configuration achieves the highest performance, with a Completion Rate of 75.00% and an F1 Score of 87.61%. This indicates that for more structured and straightforward tasks, a powerful LLM combined with structured observation and memory is highly effective.
  - Claude Sonnet 4 with AX-Tree + Vision performs well on basic tasks (CR 72.92%, F1 77.90%), slightly outperforming its AX-Tree (CR 66.67%, F1 74.20%) and AX-Tree + Memory (CR 70.83%, F1 78.06%) counterparts. This suggests that for Claude Sonnet 4, visual cues might offer some supplementary benefit on basic tasks.
- Advanced Tasks:
  - Claude Sonnet 4 with AX-Tree achieves the best results (CR 53.49%, F1 63.39%), demonstrating its stronger reasoning capabilities for vague requirements and complex comparisons. Notably, adding memory or vision to Claude Sonnet 4 for advanced tasks does not improve performance; in fact, memory slightly degrades it (CR 48.84%, F1 59.91%), and vision significantly so (CR 37.21%, F1 41.45%). This suggests that additional modalities or memory can sometimes confuse or distract the LLM when tasks are already highly complex.
  - GPT-4.1 generally shows lower performance on advanced tasks than Claude Sonnet 4 (highest CR 39.53% with AX-Tree + Vision, highest F1 49.01% with AX-Tree + Memory).
- Impact of Vision (Screenshot-only): Agents using only screenshots (the Vision columns) perform significantly worse across all task sets and both LLMs. For GPT-4.1, Vision achieves a CR of 41.67% on basic tasks and 13.95% on advanced tasks. For Claude Sonnet 4, the performance is even lower, with CRs of 10.42% and 4.65%, respectively. This confirms that screenshots alone lack the structured semantic information necessary for reliable web navigation and interaction.
- Impact of Memory: Memory generally improves performance for GPT-4.1 on basic tasks (CR increases from 56.25% to 75.00%, F1 from 70.87% to 87.61%). For Claude Sonnet 4 on basic tasks, memory provides a slight boost in F1 but a minor decrease in CR. For advanced tasks, memory has a minimal positive or even slightly negative impact, especially for Claude Sonnet 4. This implies memory is most beneficial for long-running tasks where information needs to be explicitly stored and retrieved, preventing premature submission or the forgetting of intermediate results.
6.2. Data Presentation (Tables)
The following are the results from Table 4 of the original paper:
| Model | Task Category | AX-Tree CR (%) | P (%) | R (%) | F1 (%) | AX-Tree + Memory CR (%) | P (%) | R (%) | F1 (%) | AX-Tree + Vision CR (%) | P (%) | R (%) | F1 (%) | Vision CR (%) | P (%) | R (%) | F1 (%) |
| Basic Tasks | | | | | | | | | | | | | | | | | |
| Single Product Search | 33.33 | 85.42 | 66.48 | 74.77 | 66.67 | 88.64 | 81.69 | 85.02 | 33.33 | 67.71 | 54.61 | 60.46 | 41.67 | 69.10 | 56.44 | 62.13 | |
| GPT4.1 | Cheapest Product Search | 60.00 | 60.00 | 60.00 | 60.00 | 90.00 | 90.00 | 90.00 | 90.00 | 40.00 | 42.50 | 42.50 | 42.50 | 50.00 | 63.33 | 57.50 | 60.28 |
| Best Fit Specific Requirements | 27.27 | 50.00 | 40.61 | 44.82 | 36.36 | 84.85 | 59.01 | 69.61 | 45.45 | 68.18 | 56.97 | 62.07 | 27.27 | 54.55 | 38.03 | 44.81 | |
| Add to Cart | 85.71 | 85.71 | 85.71 | 85.71 | 100.00 | 100.00 | 100.00 | 100.00 | 85.71 | 100.00 | 92.86 | 96.30 | 85.71 | 100.00 | 92.86 | 96.30 | |
| Checkout | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 100.00 | 12.50 | 12.50 | 12.50 | 12.50 | |
| Single Product Search | 66.67 | 83.33 | 78.41 | 80.80 | 75.00 | 83.33 | 79.17 | 81.20 | 75.00 | 83.33 | 79.17 | 81.20 | 0.00 | 58.33 | 22.98 | 32.97 | |
| Claude Sonnet 4 | Cheapest Product Search | 70.00 | 75.00 | 75.00 | 75.00 | 70.00 | 70.00 | 70.00 | 70.00 | 80.00 | 80.00 | 80.00 | 80.00 | 40.00 | 60.00 | 50.00 | 54.55 |
| Best Fit Specific Requirements | 45.45 | 63.64 | 53.31 | 58.01 | 45.45 | 81.82 | 59.61 | 68.97 | 45.45 | 63.64 | 57.27 | 60.29 | 9.09 | 36.36 | 25.45 | 29.95 | |
| Add to Cart | 71.43 | 71.43 | 71.43 | 71.43 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 85.71 | 0.00 | 0.00 | 0.00 | 0.00 | |
| Checkout | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 87.50 | 0.00 | 0.00 | 0.00 | 0.00 | |
| Advanced Tasks | |||||||||||||||||
| 40.00 | 40.00 | 40.00 | 40.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 30.00 | 20.00 | 20.00 | 20.00 | |||
| GPT4.1 | Cheapest Best Fit Specific Requirements | 12.50 | 64.03 | 48.09 | 54.93 | 25.00 | 80.09 | 25.00 | 44.27 | 41.95 | 12.50 | 43.75 | 20.00 | 36.81 | |||
| Best Fit Vague Requirements | 16.67 | 54.17 | 48.61 | 51.24 | 16.67 | 66.67 | 65.28 | 71.93 | 16.67 | 39.87 | 48.61 | 50.48 | 0.00 | 6.67 | 31.77 | 4.44 | |
| Cheapest Best Fit Vague Requirements | 44.44 | 53.33 | 52.50 | 3.33 | |||||||||||||
| Find Substitutes | 50.00 | 50.00 | 50.00 | 50.00 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | 33.33 | ||
| Find Compatible Products | 40.00 | 60.00 | 46.67 | 52.50 | 40.00 | 40.00 | 33.33 | 40.00 | 60.00 | 70.00 | 66.67 | 68.29 | 20.00 | 20.00 | 20.00 | 20.00 | |
| End-to-End | 37.50 | 50.00 | 43.75 | 46.67 | 62.50 | 62.50 | 62.50 | 62.50 | 75.00 | 75.00 | 75.00 | 75.00 | 0.00 | 0.00 | 0.00 | 0.00 | |
| 60.00 | 60.00 | 60.00 | 60.00 | ||||||||||||||
| Claude Sonnet 4 | Cheapest Best Fit Specific Requirements | 37.50 | 68.39 | 68.75 | 68.57 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 50.00 | 10.00 | 10.00 | 10.00 | 10.00 |
| Best Fit Vague Requirements | 37.50 | 71.88 | 57.64 | 63.97 | 37.50 | 58.48 | 62.15 | 60.26 | 0.00 | 31.25 | 10.94 | 16.20 | |||||
| Cheapest Best Fit Vague Requirements | |||||||||||||||||
| Find Substitutes | 83.33 | 83.33 | 83.33 | 45.87 | 33.33 | 33.33 | 33.33 | 33.33 | 16.67 | 16.67 | 16.67 | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 | |
| Find Compatible Products | 33.33 | 52.78 | 40.56 | 0.00 | 0.00 | 0.00 | 0.00 | ||||||||||
| End-to-End | 60.00 | 60.00 | 60.00 | 66.67 | 66.67 | 66.67 | 66.67 | 16.67 | 16.67 | 16.67 | 16.67 | 0.00 | 0.00 | 0.00 | 0.00 | ||
6.3. Ablation Studies / Parameter Analysis
Table 4 provides a detailed breakdown of performance by individual task category, revealing granular insights into agent capabilities and failure modes.
6.3.1. Structured Basic Tasks
- Categories: Single Product Search, Cheapest Product Search, Add to Cart, Checkout.
- Performance: GPT-4.1 with AX-Tree + Memory excels in these categories, often achieving Completion Rates of 90-100% and high F1-scores. For Add to Cart and Checkout, GPT-4.1 AX-Tree + Memory achieves 100% CR and F1. Claude Sonnet 4 agents with AX-Tree (and sometimes with added modalities) are competitive, though they may lose some precision or recall.
- Common Failure Modes:
  - Rigid Search Strategies: Agents sometimes issue overly specific queries or stop if initial results are not found, missing product variants or alternative spellings. This reduces recall.
  - Screenshot-only agents: Struggle significantly, often failing to locate crucial UI elements like search boxes or buttons, leading to step-limit exhaustion. For example, Checkout with GPT-4.1 Vision has a CR of 12.50%, and Claude Sonnet 4 Vision has 0% CR for Add to Cart and Checkout.
6.3.2. Attribute-Rich and Ambiguous Tasks
- Categories: Best Fit Specific Requirements, Best Fit Vague Requirements and their Cheapest variants, Find Substitutes, Find Compatible Products.
- Performance: Claude Sonnet 4 with AX-Tree often shows higher F1-scores and Completion Rates in these categories than GPT-4.1, suggesting better attribute-based reasoning and vagueness interpretation. For Best Fit Vague Requirements, Claude Sonnet 4 AX-Tree achieves a CR of 37.50% and an F1 of 63.97%, outperforming GPT-4.1 AX-Tree (CR 16.67%, F1 51.24%). Find Substitutes is another strong category for Claude Sonnet 4 AX-Tree (CR 83.33%, F1 45.87%), though the F1 being lower than the CR indicates some incorrect submissions.
- Impact of Vision: Combining accessibility-tree and screenshot modalities yields modest gains in some categories, such as Find Compatible Products with GPT-4.1, where visual information might aid in identifying matching aesthetic features (e.g., color schemes). GPT-4.1 AX-Tree + Vision achieves a CR of 60.00% and an F1 of 68.29% for Find Compatible Products, higher than AX-Tree alone (CR 40.00%, F1 52.50%).
- Common Failure Modes:
  - Incomplete Cross-Shop Search: Agents frequently fail to comprehensively search across all shops or stop after the first matching result, reducing recall.
  - Attribute Confusion/Misinterpretation: Agents may confuse similar attributes (e.g., RAM kit capacity vs. single-stick capacity) or misinterpret vague requirements.
  - Reasoning Errors: The complexity of compatibility reasoning or of understanding vague descriptions leads to errors in identifying relevant offers.
6.3.3. End-to-End Tasks
- Category: End-to-End (combining search, comparison, add to cart, and checkout).
- Performance: Claude Sonnet 4 AX-Tree + Memory completes 66.67% of end-to-end tasks, with an F1 score of 66.67%. GPT-4.1 benefits from AX-Tree + Vision in this category, achieving 75.00% CR and F1. Memory is especially valuable here, as it helps agents maintain context and prevents the forgetting of intermediate results or submission details over long sequences of actions.
- Common Failure Modes:
  - UI Interaction Errors: Agents may repeatedly click the wrong controls or fail to correctly fill in forms, especially without structured input.
  - Output Formatting Mistakes: Even if an agent finds the correct solution, errors in the format of the submitted URL (e.g., incomplete URLs) lead to tasks being marked incorrect. Memory-enabled agents are less prone to this, as they can store and retrieve solution URLs explicitly.
  - Insufficient Cross-Shop Reasoning: Many agents struggle to aggregate and compare information effectively across multiple shops before making a final decision.
6.3.4. Overall Failure Patterns
A common thread across all categories is insufficient cross-shop reasoning. Many runs terminate after finding a single offer, failing to explore other shops for better deals or complete information. This is partially alleviated by memory, but not fully resolved. UI interaction errors (e.g., struggling with forms, missing buttons) are prevalent, especially for vision-only agents. Finally, output formatting mistakes for solution submission are a consistent source of lost points, particularly for agents without explicit memory to store solution URLs.
6.4. Efficiency Analysis
The efficiency analysis (Table 5 and Figure 2) reveals significant differences in token usage, runtime, and API costs between models and configurations, highlighting the critical trade-off between performance and resource consumption for practical deployment.
The following are the results from Table 5 of the original paper:
| Model | Task Set | Observation Space | Avg. Steps | Avg. Input Tokens | Avg. Output Tokens | Avg. Runtime | Avg. Cost |
| GPT4.1 | Basic | AX-Tree | 22.69 | 131,301 | 2,334 | 130.5s | $0.28 |
| | | AX-Tree + Memory | 20.88 | 130,270 | 3,511 | 142.4s | $0.29 |
| | | AX-Tree + Vision | 20.92 | 135,362 | 1,901 | 155.4s | $0.29 |
| | | Vision | 28.56 | 104,617 | 2,453 | 176.2s | $0.23 |
| GPT4.1 | Advanced | AX-Tree | 24.98 | 160,922 | 2,950 | 159.2s | $0.35 |
| | | AX-Tree + Memory | 24.19 | 178,949 | 4,658 | 177.0s | $0.40 |
| | | AX-Tree + Vision | 23.74 | 169,956 | 2,468 | 187.8s | $0.36 |
| | | Vision | 33.33 | 133,972 | 3,119 | 216.4s | $0.29 |
| Claude Sonnet 4 | Basic | AX-Tree | 23.69 | 188,079 | 6,791 | 222.7s | $0.67 |
| | | AX-Tree + Memory | 22.04 | 236,631 | 15,106 | 334.6s | $0.94 |
| | | AX-Tree + Vision | 25.62 | 242,597 | 6,255 | 279.5s | $0.82 |
| | | Vision | 43.40 | 364,694 | 13,937 | 446.9s | $1.30 |
| Claude Sonnet 4 | Advanced | AX-Tree | 29.65 | 291,048 | 10,063 | 331.7s | $1.02 |
| | | AX-Tree + Memory | 27.33 | 364,858 | 18,149 | 420.9s | $1.37 |
| | | AX-Tree + Vision | 37.26 | 480,199 | 12,630 | 471.9s | $1.63 |
| | | Vision | 47.74 | 421,704 | 17,456 | 536.3s | $1.53 |
6.4.1. Token Usage
- Model Comparison: Claude Sonnet 4 configurations consistently consume substantially more tokens than GPT-4.1 configurations, often more than double for comparable observation spaces. For instance, Claude Sonnet 4 AX-Tree for advanced tasks uses 291,048 input tokens compared to GPT-4.1 AX-Tree's 160,922.
- Observation Space Impact: Configurations incorporating screenshots (AX-Tree + Vision or Vision only) generally lead to higher token usage, as visual information adds significantly to the LLM's input context.
- Memory Impact: While memory-based configurations may reduce the average number of steps, the longer prompts due to the explicit memory section can still result in higher overall token usage, especially for Claude Sonnet 4. For Claude Sonnet 4 advanced tasks, AX-Tree + Memory uses 364,858 input tokens, significantly more than AX-Tree alone.
- Inefficient Vision-only Agents: Vision-only agents, despite their lower performance, often show higher average steps and substantial token usage (e.g., Claude Sonnet 4 Vision for basic tasks uses 364,694 input tokens), reflecting their struggle to navigate efficiently and their tendency to repeat actions due to the lack of structured information.
6.4.2. Runtime
- Model Comparison: GPT-4.1 agents are considerably faster than Claude Sonnet 4 agents. GPT-4.1 typically completes basic tasks in 2-3 minutes and advanced tasks in about 3 minutes. In contrast, Claude Sonnet 4 often requires 4-8 minutes per task, especially for complex workflows or with additional modalities. This difference is largely attributable to the higher token usage of Claude Sonnet 4 and potentially to differences in API latency.
- Observation Space Impact: Adding vision (AX-Tree + Vision) or relying solely on vision (Vision) tends to increase runtime for both LLMs due to the overhead of processing visual data and the increased token count.
- Efficiency Trade-off: The data suggests that GPT-4.1 is the more efficient choice for basic structured tasks due to its lower runtime and token consumption. For advanced tasks, while Claude Sonnet 4 may offer better effectiveness (as seen in Table 3), this comes at a significant cost in runtime.
6.4.3. API Usage Fees
- Cost Scaling: API costs directly correlate with token usage and runtime.
- Model Comparison: GPT-4.1 configurations are generally more cost-effective. For basic tasks, GPT-4.1 costs range from ~$0.23 to ~$0.29 per task.
- High Cost of Claude Sonnet 4: Claude Sonnet 4 configurations are considerably more expensive. For basic tasks, costs range from ~$0.67 to ~$1.30 per task. For advanced tasks, Claude Sonnet 4 with AX-Tree + Vision can reach ~$1.63 per task. The highest-performing Claude Sonnet 4 AX-Tree configuration for advanced tasks still costs ~$1.02 per task.
- Performance-Cost Trade-off (Figure 2): The following figure (Figure 2 from the original paper) shows the relationship between cost and task completion rate:
Figure 2 (image description, translated): a comparison chart showing average task cost versus task completion rate for the basic (left) and advanced (right) task sets; differently colored points represent the individual agent configurations, making performance differences visible in terms of cost versus completion rate.

Figure 2 visually represents this trade-off. For basic tasks (left plot), GPT-4.1 configurations generally cluster in the lower-cost, moderate-to-high completion-rate area. The GPT-4.1 AX-Tree + Memory configuration, which achieves the highest completion rate (75%) on basic tasks, remains relatively cheap (~$0.29). Claude Sonnet 4 configurations, while sometimes achieving comparable or slightly higher completion rates, always incur significantly higher costs. For advanced tasks (right plot), the pattern is similar: Claude Sonnet 4 AX-Tree achieves the highest completion rate (53.49%) but at a cost of ~$1.02, other Claude Sonnet 4 configurations often cost even more with lower completion rates, and GPT-4.1 configurations are cheaper but achieve lower completion rates. The plots clearly illustrate that while more sophisticated agent architectures or LLMs may yield higher success rates, they often come with a substantial increase in token usage, runtime, and cost, making them less practical for widespread, high-volume deployment (a minimal reproduction sketch follows below).
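Such a trade-off plot can be rebuilt from the numbers quoted in this section. The sketch below includes only the two configurations whose (cost, completion rate) pairs are explicitly restated above; the remaining configurations from Tables 3 and 5 would be added the same way.

```python
import matplotlib.pyplot as plt

# Only the (cost, completion-rate) pairs quoted in the text above;
# the full figure would include all eight configurations per task set.
basic    = {"GPT-4.1 AX-Tree + Memory": (0.29, 75.0)}
advanced = {"Claude Sonnet 4 AX-Tree": (1.02, 53.49)}

fig, axes = plt.subplots(1, 2, figsize=(10, 4), sharey=True)
for ax, data, title in zip(axes, (basic, advanced), ("Basic tasks", "Advanced tasks")):
    for label, (cost, rate) in data.items():
        ax.scatter(cost, rate, label=label)
    ax.set_title(title)
    ax.set_xlabel("Avg. cost per task (USD)")
    ax.legend()
axes[0].set_ylabel("Task completion rate (%)")
plt.tight_layout()
plt.show()
```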
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces WebMall, a pioneering multi-shop benchmark specifically designed for evaluating LLM-based web agents in complex e-commerce comparison-shopping scenarios. WebMall stands out by simulating four distinct online shops populated with authentic, heterogeneous product offers derived from the Common Crawl and offering 91 diverse tasks across 11 categories, including both basic shopping flows and advanced reasoning-intensive tasks involving vagueness and compatibility. The benchmark addresses a critical gap in existing single-shop or live-web benchmarks by providing a reproducible and realistic environment for cross-site information aggregation and reasoning.
The comprehensive evaluation of eight baseline agent configurations revealed several key insights:
- Importance of Structured Observation: The accessibility tree is paramount for reliable web navigation and interaction, demonstrating superior performance compared to vision-only approaches.
- Value of Memory: Persistent short-term memory significantly boosts performance on long-running tasks that require tracking information across multiple shops and steps, preventing premature termination and information loss (a schematic sketch follows after this list).
- LLM Trade-offs: GPT-4.1 emerged as more efficient (faster, cheaper) and accurate for structured, basic tasks. Claude Sonnet 4, while generally more expensive and slower, demonstrated superior reasoning capabilities on less clearly defined advanced tasks involving vague or specific-requirement constraints.
- Current Limitations: Despite promising results (best F1 of 87% on basic and 63% on advanced tasks), web agents still face challenges with rigid search strategies, insufficient cross-shop reasoning, UI interaction errors, and output formatting issues. High API costs remain a significant barrier to widespread adoption.
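To make the memory insight concrete, here is a minimal sketch of how a persistent notes buffer might be threaded into each step's prompt. This is a hypothetical structure for illustration only, not the paper's actual prompt format; `AgentState` and `build_prompt` are invented names.

```python
from dataclasses import dataclass, field

# Illustrative only: a persistent notes buffer carried across steps,
# in the spirit of the paper's memory-augmented configurations.
@dataclass
class AgentState:
    task: str
    memory: list[str] = field(default_factory=list)  # findings so far

def build_prompt(state: AgentState, ax_tree: str) -> str:
    """Assemble one step's prompt: task, accumulated memory, observation."""
    notes = "\n".join(f"- {m}" for m in state.memory) or "- (empty)"
    return (
        f"Task: {state.task}\n"
        f"Memory (findings from previous steps):\n{notes}\n"
        f"Current page (accessibility tree):\n{ax_tree}\n"
        "Respond with the next action and any new memory entries."
    )

state = AgentState("Find the cheapest offer for product X across all four shops")
state.memory.append("Shop A: product X costs $19.99 (/product/x)")
print(build_prompt(state, "<ax-tree snapshot here>"))
```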
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Rigid Search Strategies: Agents often fail to adapt their search when an initial query yields no results, missing product variants or cases that require broader queries. Future work should focus on more flexible and adaptive search and exploration mechanisms (see the sketch after this list).
- Difficulties in Handling UI: UI interaction errors (e.g., misclicking, failing to locate or fill form fields) are prevalent, especially for agents lacking structured input. Improved multi-modal reasoning that better integrates visual and structural cues could mitigate this.
- Premature Termination and Output Formatting: Agents sometimes give up too early or make output formatting mistakes (e.g., incomplete URLs) when submitting solutions. More robust memory integration can help agents track task progress and ensure correct submission.
- Insufficient Cross-Shop Reasoning: A core challenge is the agents' inability to consistently aggregate information and make decisions across all four shops; they often stop after the first relevant finding. This highlights the need for more sophisticated reasoning and planning capabilities that explicitly handle multi-source comparisons.
- High API Costs: Current LLM-based agents incur substantial API costs, a practical limitation for real-world deployment. Future work should explore more efficient LLMs or agent architectures that reduce token usage without sacrificing performance.
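To make the first point concrete, the sketch below shows one possible adaptive-search loop that progressively broadens a query instead of giving up after an empty result page. It is purely illustrative: `shop.search` is a hypothetical helper standing in for whatever search action an agent exposes, and dropping the most specific term is just one of many relaxation heuristics.

```python
def relax(query: str) -> str:
    """Drop the last (typically most specific) term, e.g.
    'logitech mx master 3s graphite' -> 'logitech mx master 3s'."""
    terms = query.split()
    return " ".join(terms[:-1]) if len(terms) > 1 else query

def adaptive_search(shop, query: str, max_relaxations: int = 3) -> list:
    """Retry a shop search with progressively broader queries
    instead of giving up after the first empty result page."""
    for _ in range(max_relaxations + 1):
        results = shop.search(query)   # hypothetical search action
        if results:
            return results
        broader = relax(query)
        if broader == query:           # nothing left to drop
            break
        query = broader
    return []
```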
7.3. Personal Insights & Critique
WebMall represents a crucial step forward in web agent benchmarking. By focusing on the multi-shop comparison-shopping paradigm, it directly addresses a highly practical and complex real-world use case that previous benchmarks largely overlooked. The use of authentic product offers from Common Crawl and the containerized environment are commendable for ensuring both realism and reproducibility.
A key insight from this paper is the clear demonstration of the complementary strengths of accessibility trees and memory. Accessibility trees provide the indispensable structured information that LLMs need to reliably interact with web pages, while memory allows them to maintain context and aggregate information over long interaction trajectories, which is vital for comparison-shopping. The trade-off between GPT-4.1's efficiency on structured tasks and Claude Sonnet 4's reasoning capabilities on ambiguous tasks also highlights the ongoing evolution and specialization of LLMs themselves.
Critically, the paper implicitly points to the need for more human-like learning and adaptation in web agents. The observed rigid search strategies and insufficient cross-shop reasoning suggest that current LLM-based agents often struggle with exploratory behavior and complex information synthesis when faced with diverse, dynamic environments. Humans, when comparison-shopping, intuitively broaden searches, infer compatibility, and cross-reference information. Mimicking these adaptive behaviors remains a significant challenge.
The reported high API costs are a stark reminder that while LLM-based agents show promise, their practical deployment at scale is currently limited by economic factors. Future research might explore smaller, specialized LLMs for specific sub-tasks or more efficient agent architectures that reduce the number of LLM calls or token usage per task.
The paper's "GPT 4.1" refers to OpenAI's publicly released GPT-4.1 model family (announced in April 2025), not an internal variant of GPT-4. Even so, pinning down the exact API model identifier and snapshot date (though not the paper's primary focus) would be helpful for future researchers trying to replicate or build upon these baselines.
Overall, WebMall provides a much-needed benchmark that will undoubtedly stimulate further research into building more intelligent, robust, and efficient web agents capable of navigating the true complexity of the internet. Its methods and conclusions could be transferred to other domains requiring multi-source information aggregation and decision-making, such as news summarization from multiple sources, travel planning across different booking sites, or research assistance by combining data from various academic databases.