
WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

Published: 08/18/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

WebMall is a new benchmark for evaluating LLM-based web agents in multi-shop comparison-shopping scenarios, featuring four simulated shops and 91 tasks that enhance online shopping research by offering authentic product diversity.

Abstract

LLM-based web agents have the potential to automate long-running web tasks, such as finding offers for specific products in multiple online shops and subsequently ordering the cheapest products that meet the user's needs. This paper introduces WebMall, a multi-shop online shopping benchmark for evaluating the effectiveness and efficiency of web agents for comparison-shopping. WebMall consists of four simulated online shops populated with authentic product offers sourced from the Common Crawl, alongside a suite of 91 cross-shop tasks. These tasks include basic tasks such as finding specific products in multiple shops, performing price comparisons, adding items to the shopping cart, and completing checkout. Advanced tasks involve searching for products based on vague requirements, identifying suitable substitutes, and finding compatible products. Compared to existing e-commerce benchmarks, such as WebShop or ShoppingBench, WebMall introduces comparison-shopping tasks across multiple shops. Furthermore, the product offers are more heterogeneous, as they originate from hundreds of distinct real-world shops. The tasks in WebMall require longer interaction trajectories than those in WebShop, while remaining representative of real-world shopping behaviors. We evaluate eight baseline agents on WebMall, varying in observation modality, memory utilization, and underlying large language model (GPT 4.1 and Claude Sonnet 4). The best-performing configurations achieve completion rates of 75% and 53%, and F1 scores of 87% and 63%, on the basic and advanced task sets, respectively. WebMall is publicly released to facilitate research on web agents and to promote advancements in navigation, reasoning, and efficiency within e-commerce scenarios.

In-depth Reading

1. Bibliographic Information

1.1. Title

WebMall -- A Multi-Shop Benchmark for Evaluating Web Agents

1.2. Authors

  • Ralph Peeters (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
  • Aaron Steiner (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
  • Luca Schwarz (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
  • Julian Yuya Caspary (Data and Web Science Group, University of Mannheim, Mannheim, Germany)
  • Christian Bizer (Data and Web Science Group, University of Mannheim, Mannheim, Germany)

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicating it is likely awaiting peer review or has been submitted to a conference/journal. Given the publication date in 2025, it might be targeting a major conference or journal in web science, artificial intelligence, or natural language processing. Christian Bizer is a well-known researcher in web data extraction and semantic web, suggesting the paper aligns with high-impact research in these fields.

1.4. Publication Year

2025

1.5. Abstract

This paper introduces WebMall, a novel benchmark designed to evaluate LLM-based web agents in multi-shop online shopping scenarios, specifically focusing on comparison-shopping. WebMall comprises four simulated online shops, populated with authentic product offers sourced from the Common Crawl. It features a suite of 91 cross-shop tasks, categorized into basic tasks (e.g., finding specific products, price comparison, adding to cart, checkout) and advanced tasks (e.g., searching with vague requirements, identifying substitutes, finding compatible products). A key innovation is its focus on comparison-shopping across multiple, heterogeneous shops, and its use of more diverse, real-world product data compared to existing single-shop benchmarks like WebShop or ShoppingBench. The tasks in WebMall require longer interaction trajectories, reflecting realistic shopping behaviors. The authors evaluate eight baseline agents, varying observation modality (accessibility tree, screenshots), memory utilization, and underlying large language model (GPT 4.1, Claude Sonnet 4). The top-performing configurations achieved completion rates of 75% and 53%, and F1 scores of 87% and 63%, for basic and advanced task sets, respectively. WebMall is publicly released to foster research in web agent navigation, reasoning, and efficiency in e-commerce.

  • Original Source Link: https://arxiv.org/abs/2508.13024
  • PDF Link: https://arxiv.org/pdf/2508.13024v1.pdf
  • Publication Status: This is a preprint available on arXiv.

2. Executive Summary

2.1. Background & Motivation

The proliferation of large language models (LLMs) has ignited significant interest in developing web agents capable of automating complex, long-running online tasks. A crucial application area is online shopping, where users often need to compare products across multiple stores to find the best deals or specific items. However, existing benchmarks for evaluating these web agents in e-commerce scenarios primarily focus on single-shop environments. These benchmarks either simulate a single online store or evaluate agents on the live web, which presents challenges for reproducibility. The lack of a standardized, reproducible benchmark for comparison-shopping across multiple, diverse online shops represents a significant gap in research. This gap hinders the development and rigorous evaluation of LLM-based web agents that can handle the real-world complexity of navigating, comparing, and transacting across heterogeneous e-commerce platforms.

The core problem the paper aims to solve is the absence of a comprehensive, reproducible, multi-shop benchmark for LLM-based web agents in comparison-shopping scenarios. This problem is important because real-world online shopping often involves comparing offers from various retailers, which requires agents to possess sophisticated navigation, reasoning, and cross-site information aggregation capabilities. Existing benchmarks fall short by either limiting agents to a single store, using artificial tasks, or relying on live web environments that prevent exact reproducibility. The paper's innovative idea, or entry point, is the creation of WebMall, a simulated multi-shop environment populated with realistic, heterogeneous product data and a challenging suite of cross-shop tasks that demand advanced comparison-shopping skills.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  • Novel Multi-Shop Benchmark (WebMall): The introduction of WebMall, the first benchmark designed for comparison-shopping tasks across multiple simulated e-shops. It consists of four locally hostable online stores, populated with 4,421 authentic product offers derived from the Common Crawl, and a set of 91 cross-shop tasks across 11 categories. These tasks range from basic product search and checkout to advanced tasks requiring vague requirement reasoning, substitution, and compatibility analysis.

  • Extensive Baseline Evaluation: The paper conducts a thorough evaluation of eight baseline agent configurations using the Browsergym/AgentLab framework. These configurations vary across observation space (accessibility tree, screenshots, or both), the use of persistent short-term memory, and the underlying large language model (GPT-4.1 and Claude Sonnet 4). This evaluation provides insights into the effectiveness and efficiency of current web agents in multi-shop scenarios.

    Key conclusions and findings include:

  • Challenging Benchmark: WebMall proves challenging for state-of-the-art LLMs, with the best configurations achieving completion rates of 75% for basic tasks and 53% for advanced tasks, and F1 scores of 87% and 63% respectively. This indicates significant room for improvement in current web agent capabilities for complex e-commerce tasks.

  • Importance of Accessibility Tree: The accessibility tree is identified as the most crucial observation modality for reliable navigation and high task completion rates, especially in e-commerce scenarios where structured information about UI elements is vital. Screenshots can be supplementary but cannot replace the structured information provided by accessibility trees.

  • Benefits of Persistent Short-Term Memory: The integration of persistent short-term memory significantly improves task completion rates, particularly for long-running tasks that require agents to track and aggregate information across multiple shops and steps. This helps mitigate premature termination and information loss.

  • LLM Performance Trade-offs: GPT-4.1 demonstrates better efficiency (faster, cheaper) and accuracy for structured, basic tasks. Claude Sonnet 4, while often slower and more costly, showed superior performance on less clearly defined, advanced tasks involving vague requirements or attribute-based reasoning.

  • Common Failure Modes: Recurring issues include rigid search strategies (missing variants, not broadening queries), insufficient cross-shop reasoning (stopping after finding one offer), UI interaction errors, and output formatting mistakes (e.g., incomplete URLs).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the WebMall paper, a reader should be familiar with several foundational concepts related to Large Language Models (LLMs), AI agents, and web technologies.

  • Large Language Models (LLMs): These are advanced artificial intelligence models trained on vast amounts of text data, enabling them to understand, generate, and process human language. Examples include GPT (Generative Pre-trained Transformer) series and Claude. They are characterized by their large number of parameters (billions or trillions) and their ability to perform a wide range of natural language processing tasks, including reasoning, summarization, and instruction following. In the context of web agents, LLMs act as the "brain" that interprets user instructions, plans actions, and processes observations from web pages.
  • Web Agents (LLM-based Agents): These are AI agents that leverage LLMs to interact with websites. They are designed to understand natural language instructions from a user and perform complex tasks on the web by simulating human interaction (e.g., clicking buttons, typing text, scrolling, navigating between pages). The goal is to automate long-running web tasks that would otherwise require manual human effort.
  • Benchmarks: In machine learning and AI, a benchmark is a standardized set of tasks or problems used to evaluate and compare the performance of different models or agents. A good benchmark is typically reproducible, covers a representative range of challenges, and provides clear evaluation metrics. WebMall is presented as such a benchmark.
  • Accessibility Tree (AX-Tree): This is a representation of the user interface (UI) of a web page that provides structured, semantic information about its elements. Unlike a visual screenshot, the accessibility tree is an abstract tree structure that contains information about elements like buttons, input fields, links, and their associated labels, roles, and states. It's primarily designed to help assistive technologies (e.g., screen readers) understand and interact with web content. For web agents, it offers a programmatic way to understand the structure and interactive elements of a page, enabling more precise navigation and interaction than visual cues alone.
  • Observation Space/Modality: This refers to the type of information an agent receives about its environment. In the context of web agents, common observation modalities include:
    • Accessibility Tree: Provides structured, semantic information.
    • Screenshot: A visual image of the web page, capturing layout, colors, product images, etc. This requires vision models (like GPT-4V or Claude Sonnet's vision capabilities) to interpret.
    • HTML/DOM: The raw HTML or Document Object Model of the page, offering the most detailed structural information, but often overwhelming for LLMs directly.
  • Memory (Persistent Short-Term Memory): For LLM-based agents, memory refers to the ability to store and recall information relevant to the current task over an extended sequence of actions. Persistent short-term memory means the agent can retain specific pieces of information (e.g., found prices, product URLs, user requirements) across multiple steps or page navigations, rather than relying solely on the context of the immediate prompt or a simple action history. This is crucial for long-running tasks like comparison-shopping where information needs to be collected and aggregated from different sources.
  • Completion Rate (CR): An evaluation metric that measures the percentage of tasks for which an agent successfully produces a perfect and correct answer within a given step limit.
  • Precision (P), Recall (R), and F1 Score (F1): Standard metrics used in information retrieval and classification tasks.
    • Precision: The proportion of correctly identified positive results (e.g., correct product offers) out of all positive results identified by the agent. It answers: "Of all items the agent said were relevant, how many actually were relevant?"
    • Recall: The proportion of correctly identified positive results out of all actual positive results. It answers: "Of all items that were relevant, how many did the agent find?"
    • F1 Score: The harmonic mean of Precision and Recall, providing a single score that balances both. It is particularly useful when dealing with imbalanced classes or when both Precision and Recall are important.
  • Token Usage: LLMs process input and generate output in units called tokens. A token can be a word, part of a word, or punctuation. Token usage is a measure of the computational resources (and cost) consumed by an LLM, as billing is often based on the number of tokens processed.
  • API Cost: The monetary cost associated with using LLM services through their Application Programming Interfaces (APIs). This cost is typically calculated based on token usage, model type, and sometimes other factors like image processing.
  • Docker: A platform that uses OS-level virtualization to deliver software in packages called containers. Containers are isolated from each other and bundle their own software, libraries, and configuration files, ensuring that software runs consistently across different environments. WebMall uses Docker for locally hostable simulated shops, guaranteeing reproducibility.
  • WordPress/WooCommerce: WordPress is a popular open-source content management system (CMS) used for building websites. WooCommerce is a free e-commerce plugin for WordPress that adds online store functionality, allowing users to sell products, manage inventory, and process payments. WebMall leverages these technologies to create realistic, functional online shops.
  • Common Crawl: A non-profit organization that provides open datasets of web crawl data. It crawls billions of web pages monthly and makes the raw data available to the public. WebMall sources its authentic product offers from the Common Crawl to ensure realism and diversity.
  • Schema.org: A collaborative effort to create, maintain, and promote schemas for structured data on the internet. It provides a collection of shared vocabularies that webmasters can use to mark up their web pages with semantic information (e.g., Product, Offer, price, description). This structured data helps search engines and AI agents better understand the content of web pages. WebMall uses schema.org annotations to extract product offers.
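
To make the extraction source concrete, the sketch below shows the kind of schema.org Product/Offer annotation that WebMall's corpus construction relies on, rendered as the JSON-LD dictionary a crawler would extract. The concrete product, identifier, and price values are hypothetical; only the property names follow the schema.org vocabulary.

```python
# Illustrative schema.org Product/Offer annotation as an extracted JSON-LD dictionary.
# The product and all values are hypothetical, not taken from the WebMall corpus.
offer_markup = {
    "@context": "https://schema.org",
    "@type": "Product",
    "name": "Example 1TB NVMe SSD",                        # used as the offer title
    "description": "PCIe 4.0 M.2 solid-state drive with 1 TB capacity.",
    "gtin13": "0000000000000",                             # identifier used for product clustering
    "offers": {
        "@type": "Offer",
        "price": "89.99",
        "priceCurrency": "USD",
    },
}
```

During corpus construction, WebMall retains only offers whose markup provides the title, description, price, and priceCurrency properties (see Section 4.2.2).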

3.2. Previous Works

The paper extensively references existing benchmarks for evaluating web agents, particularly in online shopping and broader web interaction contexts. Understanding these prior works and their limitations is crucial for appreciating WebMall's contributions.

  • WebShop [20]: An early and influential benchmark for online shopping agents.
    • Description: Simulates a single e-shop populated with over a million real product offers scraped from Amazon. Agents navigate this shop to fulfill user requests (e.g., "Find me a cheap laptop").
    • Limitation addressed by WebMall: WebShop is a single-shop environment, meaning it does not require comparison-shopping or cross-site information aggregation, which are core to WebMall. Its product offers, while numerous, are from a single source (Amazon), potentially lacking the heterogeneity WebMall aims for.
  • WebArena [22]: A broader web agent benchmark.
    • Description: Simulates multiple websites across various domains, including e-commerce, social media, and productivity.
    • Limitation addressed by WebMall: While it includes e-commerce tasks, its shopping tasks are confined to a single e-shop and often focus on administrative tasks (e.g., shop management, sales statistics) rather than complex user-centric comparison-shopping.
  • REAL [6]: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites.
    • Description: Spans various task types, including single-shop e-commerce tasks like product search, cart management, and checkout.
    • Limitation addressed by WebMall: Similar to WebShop and WebArena, REAL also operates within a single-shop environment, thereby not addressing the multi-shop comparison-shopping challenge.
  • ShoppingBench [16]: Another single-store benchmark.
    • Description: Simulates a single-store environment with tasks covering user intents like searching for products, using vouchers, and adhering to a budget.
    • Limitation addressed by WebMall: Focuses on a single store, missing the cross-shop comparison aspect.
  • Live Web Benchmarks (Mind2Web [3], BrowseComp [18], DeepShop [10]):
    • Description: These benchmarks evaluate agents directly on the live World Wide Web rather than simulated environments. DeepShop specifically features complex product search queries.
    • Limitation addressed by WebMall: While offering realism, evaluating on the live Web makes reproducibility extremely difficult. Website content changes, links break, and layouts evolve, making consistent comparative evaluation challenging. WebMall explicitly avoids this by providing a containerized, locally hostable environment. BrowseComp is noted for featuring "artificial tasks" designed to be difficult, which contrasts with WebMall's focus on "representative of real-world shopping behaviors."
  • Other LLM-based Agent Benchmarks (AgentBench [9], VisualWebArena [7], WebChoreArena [11], DeepResearchBench [4], ECom-Bench [15]): These benchmarks cover a wider range of agent capabilities beyond just web shopping.
    • AgentBench: Extends beyond the web to databases and operating systems.
    • VisualWebArena: Focuses on visually grounded tasks.
    • WebChoreArena: Targets memory-intensive tedious web tasks.
    • DeepResearchBench: Evaluates web research agents on multi-step tasks.
    • ECom-Bench: Focuses on customer support dialogues in e-commerce.
    • Limitation addressed by WebMall: While valuable, these benchmarks either do not focus on e-commerce or do not specifically address the multi-shop comparison-shopping paradigm with heterogeneous product data.

3.3. Technological Evolution

The field of web agents has seen rapid evolution, primarily driven by advances in LLMs and multi-modal AI.

  1. Early Web Scrapers/Bots: Automated web interaction initially involved rule-based web scrapers and simple bots designed for specific, repetitive tasks. These lacked generalizability and natural language understanding.
  2. Reinforcement Learning for Web Navigation: Research then explored using reinforcement learning to train agents to navigate websites, often requiring large amounts of interaction data and suffering from poor generalization across different website layouts.
  3. Emergence of LLMs: The development of powerful LLMs (e.g., GPT-2, GPT-3) marked a turning point. These models could understand complex instructions and generate human-like text, paving the way for more flexible and intelligent web agents.
  4. LLM-as-Agent Frameworks: ReAct [21] (Reasoning and Acting) showed how LLMs could interleave reasoning (generating thoughts) and acting (executing web actions), making them more capable. Reflexion [13] further enhanced this by incorporating verbal reinforcement learning, allowing agents to learn from past successes and failures. Voyager [14] introduced curriculum learning and modular skill libraries for open-ended tasks.
  5. Multi-modal Agents: The advent of multi-modal LLMs (e.g., GPT-4V) that can process both text and images enabled agents to interpret visual cues from screenshots, complementing the structural information from HTML or accessibility trees.
  6. Benchmarking Evolution: As web agents became more sophisticated, the need for comprehensive benchmarks grew. Initial benchmarks like WebShop focused on single-site interaction. More recent efforts expanded to broader web interaction (WebArena, Mind2Web) or specific challenges (VisualWebArena, WebChoreArena). WebMall fits into this evolution by pushing the boundaries of e-commerce benchmarking to include the complex, real-world scenario of multi-shop comparison-shopping.

3.4. Differentiation Analysis

WebMall differentiates itself from existing web agent benchmarks in several key aspects:

  • Multi-Shop Comparison-Shopping: This is the most significant differentiator. Unlike WebShop, WebArena, REAL, or ShoppingBench, which are single-shop environments, WebMall explicitly requires agents to navigate and aggregate information across four distinct online shops. This introduces challenges like cross-site reasoning, price comparison, and product offer aggregation that are absent in single-shop benchmarks.
  • Heterogeneous Product Offers: The product offers in WebMall are sourced from hundreds of distinct real-world shops via the Common Crawl and schema.org annotations. This leads to more heterogeneous product descriptions, titles, and attribute representations than benchmarks populated from a single source (e.g., WebShop from Amazon), making the task of matching and comparing products more challenging and realistic.
  • Longer Interaction Trajectories: The tasks in WebMall are designed to necessitate longer interaction trajectories compared to, for instance, WebShop. This includes not just finding a product but often comparing it across multiple shops, adding to cart, and completing checkout, or performing advanced reasoning over vague requirements. This better reflects real-world shopping behaviors.
  • Reproducible Environment vs. Live Web: In contrast to Mind2Web, BrowseComp, and DeepShop which evaluate on the live Web, WebMall provides a fully containerized, locally hostable environment. This ensures exact reproducibility of evaluation results, allowing for fair and consistent comparison of different agent architectures without the variability inherent in the live internet.
  • Advanced Task Categories: WebMall introduces advanced tasks such as searching with vague requirements, identifying suitable substitutes, and finding compatible products. These tasks go beyond simple product search or checkout and demand more sophisticated reasoning and understanding from the agents, reflecting more nuanced user needs.

4. Methodology

4.1. Principles

The core idea behind WebMall is to create a realistic, reproducible, and challenging environment for evaluating LLM-based web agents in e-commerce comparison-shopping scenarios. The theoretical basis is that for web agents to be truly useful in automating online tasks, they must be able to handle the complexity of the real web, which includes navigating multiple, diverse websites, extracting and comparing heterogeneous information, and performing complex reasoning to fulfill user needs. By simulating this multi-shop environment with authentic data and a comprehensive set of tasks, WebMall aims to push the boundaries of web agent capabilities beyond existing single-shop or artificial benchmarks. The intuition is that an agent that can successfully comparison-shop across WebMall's heterogeneous stores and tasks will demonstrate strong navigation, information extraction, reasoning, and decision-making skills transferable to real-world applications.

4.2. Core Methodology In-depth (Layer by Layer)

The WebMall methodology involves several key components: the environment (simulated shops), the data (product offers), the task set, and the evaluation framework.

4.2.1. WebMall Environment

The WebMall environment consists of four simulated online shops and a solution submission website.

  • Shop Implementation: The four shops are implemented using WordPress with the WooCommerce plugin. This choice provides realistic e-commerce functionality (shopping cart, checkout, search bar, product detail pages, category navigation) and allows for heterogeneity in user interfaces.
  • Shop Templates: Four distinct, free WooCommerce templates are used to ensure that the shops have heterogeneous visual interfaces and layouts, mimicking the diversity found in the real world.
  • Local Hostability: The entire environment is containerized using Docker. This means that after cloning the repository, a two-command setup automatically downloads backup files, configures services, and launches the four shops, their databases, and Elasticsearch instances. This guarantees reproducibility across different evaluation setups.
  • Solution Website: In addition to the shops, a dedicated website is part of the environment where agents submit their task solutions (e.g., URLs of relevant product offers) or indicate task completion.

4.2.2. Product Offer Collection and Distribution

To ensure realism and challenge, WebMall populates its shops with authentic product offers.

  • Data Source: Product offers are sourced from the October 2024 Common Crawl via schema.org annotations. Schema.org is a vocabulary for structured data markup on web pages, which allows for programmatic extraction of product information.
  • Filtering: A multi-step filtering process is applied to the raw Common Crawl data:
    1. Property Check: Only offers containing title, description, price, and priceCurrency schema.org properties are retained.
    2. Deduplication: Exact duplicates based on the combination of these four attributes are removed.
    3. Language Filtering: Since WebMall is an English-language benchmark, the fastText language classification model is used on titles and descriptions to filter for English offers only.
    4. Product Clustering: Offers containing globally unique product identifiers like GTIN (Global Trade Item Number) or MPN (Manufacturer Part Number) are grouped into clusters. These clusters represent the same real-world product, facilitating later task creation and distribution.
  • Manual and Automated Distribution:
    1. Initial Manual Selection: A set of product offers (selected during task creation) is manually distributed across the four shops to ensure specific tasks can be formed.
    2. Automated Filler Population: GPT-4.1 is used to query the corpus for additional offers to fill the shops in three designated categories: PC components, PC peripherals, and other electronics.
      • Embedding Generation: For each category query, OpenAI's text-embedding-3-small model is used to compute embeddings for product offers. Embeddings are numerical representations of text that capture semantic meaning, allowing for similarity comparisons.
      • Nearest Neighbor Retrieval: Elasticsearch is used to retrieve nearest neighbors (most similar product offers) via cosine similarity over pre-indexed product vectors. Cosine similarity measures the cosine of the angle between two vectors, indicating their directional similarity.
      • Cleaning and Assessment: Retrieved candidates are cleaned (HTML removal, normalization) and then assessed by GPT-4.1 for listing quality (English, informative description of at least 100 characters, specific non-generic title, not list-like) and category relevance.
      • Constraint Checking: Each candidate is screened against a constraint list derived from the task set to prevent newly added offers from creating unintended valid task solutions. The resulting distribution of product offers is shown in Table 1.

The following are the results from Table 1 of the original paper:

Product Category     Overall           Shop 1           Shop 2           Shop 3           Shop 4
                     Offers      %     Offers     %     Offers     %     Offers     %     Offers     %
PC Components         1,477    33.4       348   30.2       369   33.7       430   37.2       330   32.4
PC Peripherals        1,388    31.4       432   37.5       255   23.3       336   29.1       365   35.8
Other Electronics     1,556    35.2       370   32.3       471   43.0       390   33.7       325   31.9
Total                 4,421   100.0     1,150  100.0     1,095  100.0     1,156  100.0     1,020  100.0
  • Product Characteristics: The 4,421 offers have varied titles (6 to 264 characters, median 69, average 76.4) and descriptions (15 to >14,000 characters, median 573, average 1,059), reflecting real-world diversity.
  • Category Trees: Each shop has manually created, distinct category trees to simulate heterogeneity.
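
As a rough illustration of the filler-population step described above, the following sketch embeds a category query with OpenAI's text-embedding-3-small model and retrieves nearest neighbours from Elasticsearch via cosine similarity over pre-indexed offer vectors. The index name, field names, and candidate counts are assumptions made for illustration, not the authors' actual configuration.

```python
# Sketch of nearest-neighbour retrieval for filler offers; index and field names are assumed.
from openai import OpenAI
from elasticsearch import Elasticsearch

openai_client = OpenAI()
es = Elasticsearch("http://localhost:9200")

def retrieve_candidates(category_query: str, k: int = 50) -> list[dict]:
    """Embed a category query and fetch the k most similar offers from the index."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small",
        input=category_query,
    ).data[0].embedding
    response = es.search(
        index="webmall-offers",            # assumed index of pre-embedded offers
        knn={
            "field": "embedding",          # assumed dense_vector field configured for cosine similarity
            "query_vector": embedding,
            "k": k,
            "num_candidates": 10 * k,
        },
        source=["title", "description", "price", "priceCurrency"],
    )
    return [hit["_source"] for hit in response["hits"]["hits"]]
```

In the paper's pipeline, each retrieved candidate is then cleaned (HTML removal, normalization), assessed by GPT-4.1 for listing quality and category relevance, and screened against the task-derived constraint list before being added to a shop.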

4.2.3. WebMall Task Set

The WebMall task set comprises 91 tasks designed to evaluate web agents in comparison-shopping scenarios, grouped into basic and advanced categories.

  • Task Definition: Each task includes a natural-language instruction for the web agent and a set of one or more solution URLs if the task requires finding specific offers.

  • Basic Tasks: Represent typical, straightforward online shopping actions.

    • Find Specific Product (12 tasks): Locate all offers for a named product across all shops.
    • Find Cheapest Offer (10 tasks): Identify the lowest-priced offer for a named product across all shops.
    • Products Fulfilling Specific Requirements (11 tasks): Find offers based on specific attribute constraints (e.g., display size, memory) without a named product.
    • Add To Cart (7 tasks): Add specific named product offers to the shopping cart.
    • Checkout (8 tasks): Add a specific offer to the cart and complete the full checkout process (including filling shipping/billing details).
  • Advanced Tasks: Incorporate higher complexity, vagueness, and reasoning requirements.

    • Cheapest Offer with Specific Requirements (10 tasks): Extend Products Fulfilling Specific Requirements by also requiring comparison and selection of the cheapest.

    • Products Satisfying Vague Requirements (8 tasks): Find products based on vaguely described user needs, requiring agent reasoning.

    • Cheapest Offer with Vague Requirements (6 tasks): Combine vague requirements with price comparison to find the cheapest offers.

    • Find Substitutes (6 tasks): Suggest cheaper alternative products, simulating scenarios of unavailability or high price.

    • Find Compatible Products (5 tasks): Requires reasoning over compatibility (e.g., finding compatible CPUs for a motherboard).

    • End-to-End (8 tasks): Integrates multiple steps: searching for products, price comparison, adding to cart, and checkout into a single workflow.

The following are the results from Table 2 of the original paper:

Task Category                               Count   Example
Basic Task Set
Find Specific Product                          12   Find all offers for the AMD Ryzen 9 5900X.
Find Cheapest Offer                            10   Find the cheapest offer for the Samsung Galaxy S24 Plus.
Products Fulfilling Specific Requirements      11   Find all offers for orange straps that fit with the Apple Watch Series 6.
Add to Cart                                     7   Find all offers for the Asus DUAL RTX4070 SUPER OC White and add them to the shopping cart.
Checkout                                        8   Add the product on page {PRODUCT_URL} to the shopping cart and complete the checkout process.
Advanced Task Set
Cheapest Offer Specific Requirements           10   Find the cheapest offer for a new Xbox gaming console with at least 512gb disk space in white.
Products Satisfying Vague Requirements          8   Find all offers for the largest available MX500 model by Crucial.
Cheapest Offer Vague Requirements               6   Find the cheapest offers for each model of mid-tier nVidia gaming GPUs in the 4000 series.
Find Substitutes                                6   Find the cheapest alternative for this item: {PRODUCT_URL}.
Find Compatible Products                        5   Find all offers for compatible CPUs for this motherboard: {PRODUCT_URL}.
End To End                                      8   Find the cheapest offer for the Asrock B550 PHANTOM GAMING 4 and purchase it.
  • Artifacts: All tasks and their solutions are provided in a JSON file. Agents receive instructions explaining the WebMall environment (shop URLs, submission process) before each task.
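
The released JSON file pairs each natural-language instruction with its ground-truth solution URLs. A hypothetical entry could look like the sketch below; the field names and the URL are illustrative and may differ from the actual artifact, while the instruction text is the Find Cheapest Offer example from Table 2.

```python
# Hypothetical shape of one WebMall task entry; field names and the URL are illustrative only.
task_entry = {
    "id": "find_cheapest_offer_03",
    "category": "Find Cheapest Offer",
    "instruction": "Find the cheapest offer for the Samsung Galaxy S24 Plus.",
    "solution_urls": [
        "http://localhost:8082/product/samsung-galaxy-s24-plus/",  # placeholder shop URL
    ],
}
```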

4.2.4. Agent Evaluation Framework (Browsergym/AgentLab)

The Browsergym and AgentLab frameworks are used to configure and run the baseline agents.

  • Browsergym: Provides common tools for web agents, including web browsing capabilities (using the Playwright Python library), experiment framing, and result/trace tracking. It supports any API-based hosted LLM.
  • AgentLab: Integrates with Browsergym and allows for configuring more sophisticated agents by affording API-based LLMs specific capabilities.
  • Agent Configurations: Eight baseline agent configurations are evaluated, varying along three dimensions:
    1. Observation Space: How the agent perceives the web page.
      • AX-Tree: Agent receives the HTML accessibility tree (structural information).
      • Screenshot: Agent receives a visual screenshot of the viewport. Vision capability is implemented using set-of-mark [19] prompting, where visual elements are marked up for LLM processing.
      • AX-Tree + Screenshot: Agent receives both modalities.
    2. Memory: The ability to retain information over time.
      • Memory: AgentLab's persistent short-term memory is activated, allowing agents to store and filter discovered information (e.g., cheapest product offer and its URL) across steps.
      • No Memory: Agents rely solely on their action history and thoughts at each step, without explicit persistent storage of task-relevant data.
    3. Large Language Model (LLM): The underlying AI model driving the agent's decisions.
      • GPT-4.1: An iteration of GPT-4 from OpenAI.
      • Claude Sonnet 4: An iteration of Claude Sonnet from Anthropic.
  • Step Limit: Each agent is allowed up to 50 steps to complete a task. A step is an action like go to page, click, fill text, or scroll, defined by AgentLab.
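
The control flow shared by these configurations can be summarized by the schematic loop below. The interfaces (env, llm, and the fields of the returned decision) are hypothetical and do not correspond to the actual Browsergym/AgentLab API; the sketch only illustrates how the observation space, persistent short-term memory, and the 50-step limit interact.

```python
# Schematic observe-decide-act loop for the baseline agents (hypothetical interfaces, not AgentLab's API).
from dataclasses import dataclass, field

@dataclass
class AgentConfig:
    use_axtree: bool = True       # include the accessibility tree in each observation
    use_screenshot: bool = False  # include a set-of-mark annotated screenshot
    use_memory: bool = False      # keep persistent short-term memory across steps
    max_steps: int = 50           # step limit per task, as in the WebMall evaluation

@dataclass
class AgentState:
    memory: list[str] = field(default_factory=list)   # e.g. "cheapest so far: 89.99 EUR at <URL>"
    history: list[str] = field(default_factory=list)  # past thoughts and actions

def build_prompt(instruction: str, obs, state: AgentState, cfg: AgentConfig) -> str:
    """Assemble the LLM prompt from the instruction, observation, memory, and history."""
    memory_block = "\n".join(state.memory) if cfg.use_memory else ""
    return f"Task: {instruction}\nMemory:\n{memory_block}\nObservation:\n{obs}\nDecide the next action."

def run_task(env, llm, instruction: str, cfg: AgentConfig):
    """Run one task; env and llm are hypothetical stand-ins for the browser and the model."""
    state = AgentState()
    for _ in range(cfg.max_steps):
        obs = env.observe(axtree=cfg.use_axtree, screenshot=cfg.use_screenshot)
        decision = llm.decide(build_prompt(instruction, obs, state, cfg))
        state.history.append(decision.thought)
        if cfg.use_memory and decision.memory_note:
            state.memory.append(decision.memory_note)   # e.g. a found price and its URL
        if decision.action.kind == "submit":
            return decision.action.payload              # answer posted to the solution website
        env.execute(decision.action)                    # click, fill, go_to_url, scroll, ...
    return None                                         # step limit exhausted without a submission
```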

4.2.5. Evaluation Metrics

The evaluation measures both effectiveness and efficiency.

  • Effectiveness Metrics:
    • Completion Rate (CR): Percentage of tasks where the agent outputs a perfect answer within the step limit.
    • Precision (P), Recall (R), F1-score (F1): Calculated over the returned set of answers (e.g., URLs) and the correct set of answers per task. Macro averaging is used to aggregate these scores across tasks, meaning scores are computed for each task, and then averaged.
  • Efficiency Metrics:
    • Average Steps: The average number of actions taken per task.

    • Tokens Consumed: The average number of LLM tokens used per task (input and output).

    • Runtime: The average time taken to complete a task.

    • Estimated API Cost: The estimated monetary cost per task based on token usage and LLM API pricing.

      The methodology ensures a robust evaluation by providing a standardized, reproducible environment, realistic and challenging tasks, and a comprehensive set of metrics covering various aspects of agent performance.

5. Experimental Setup

5.1. Datasets

The primary "dataset" for the WebMall experiments is the WebMall environment itself, which includes:

  • Four Simulated Online Shops: Implemented using WordPress and WooCommerce, designed to be visually distinct and functionally heterogeneous, mirroring real-world e-commerce sites.

  • Product Offers: A total of 4,421 authentic product offers.

    • Source: Extracted from the October 2024 Common Crawl using schema.org annotations.
    • Categories: Distributed across PC components, PC peripherals, and other electronics.
    • Characteristics: Varied titles (6 to 264 characters, median 69, average 76.4) and descriptions (15 to over 14,000 characters, median 573, average 1,059).
  • Task Set: 91 cross-shop tasks divided into 11 categories (basic and advanced). Each task consists of a natural language instruction and, if applicable, a set of ground truth URLs as solutions.

    These data sources (the simulated shops and the tasks) are effective for validating the method's performance because:

  • They are realistic: Product data is from the real web (Common Crawl), and shop functionality (WooCommerce) is common.

  • They are diverse: Heterogeneous shop interfaces and product descriptions challenge agent generalization.

  • They are multi-shop: The core novelty, requiring comparison-shopping and cross-site reasoning.

  • They are reproducible: The Docker-containerized setup ensures consistent environments for comparative evaluation.

  • They are challenging: The task set includes vague requirements, compatibility reasoning, and end-to-end workflows, pushing agent capabilities.

5.2. Evaluation Metrics

The paper uses several metrics to evaluate the effectiveness and efficiency of the web agents.

5.2.1. Completion Rate (CR)

  • Conceptual Definition: Completion Rate measures the percentage of tasks for which an agent successfully provides a perfect and correct answer within the predefined step limit (50 steps). It quantifies the agent's ability to successfully execute a task from start to finish without errors or premature termination, according to the specified requirements.
  • Mathematical Formula: $ \text{CR} = \frac{\text{Number of perfectly completed tasks}}{\text{Total number of tasks}} \times 100\% $
  • Symbol Explanation:
    • CR: Completion Rate.
    • Number of perfectly completed tasks: The count of tasks where the agent's output exactly matches the ground truth solution.
    • Total number of tasks: The total number of tasks in the benchmark set being evaluated.

5.2.2. Precision (P)

  • Conceptual Definition: Precision measures the accuracy of the agent's positive predictions. In the context of WebMall, if an agent is asked to find relevant product offers, Precision indicates how many of the offers it returned were actually correct and relevant. It focuses on the quality of the positive results.
  • Mathematical Formula: $ P = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $
  • Symbol Explanation:
    • P: Precision.
    • True Positives: Items (e.g., product URLs) correctly identified by the agent as part of the solution.
    • False Positives: Items incorrectly identified by the agent as part of the solution (i.e., the agent returned them, but they are not in the ground truth solution).

5.2.3. Recall (R)

  • Conceptual Definition: Recall measures the agent's ability to find all the relevant items. If there are multiple correct product offers for a task, Recall indicates what proportion of these the agent successfully identified. It focuses on the completeness of the positive results.
  • Mathematical Formula: $ R = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $
  • Symbol Explanation:
    • R: Recall.
    • True Positives: Items (e.g., product URLs) correctly identified by the agent as part of the solution.
    • False Negatives: Items that are part of the ground truth solution but were not identified by the agent.

5.2.4. F1 Score (F1)

  • Conceptual Definition: The F1 Score is the harmonic mean of Precision and Recall. It provides a balanced measure that considers both false positives and false negatives. A high F1 Score indicates that the agent has high Precision and high Recall, making it a good overall indicator of performance, especially when Precision and Recall might be in tension. For WebMall, macro averaging is applied, meaning the F1 Score is calculated for each task independently, and then the average of these F1 Scores is reported.
  • Mathematical Formula: $ F1 = 2 \times \frac{P \times R}{P + R} $
  • Symbol Explanation:
    • F1: F1 Score.
    • P: Precision.
    • R: Recall.
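
A minimal sketch of how these effectiveness metrics can be computed over URL sets is given below, assuming each task's prediction and ground truth are represented as sets of offer URLs. It follows the macro averaging described above (per-task scores first, then averaged), but it is not the authors' evaluation code; in the benchmark, completion additionally requires staying within the 50-step limit, which this sketch does not model.

```python
# Macro-averaged precision/recall/F1 and exact-match completion rate over URL sets (sketch).
def task_prf1(pred: set[str], gold: set[str]) -> tuple[float, float, float]:
    """Per-task precision, recall, and F1 over returned vs. correct URLs."""
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def evaluate(predictions: list[set[str]], golds: list[set[str]]) -> dict[str, float]:
    """Macro-average per-task scores and count tasks whose answer set matches exactly."""
    per_task = [task_prf1(p, g) for p, g in zip(predictions, golds)]
    n = len(per_task)
    macro_p = 100 * sum(p for p, _, _ in per_task) / n
    macro_r = 100 * sum(r for _, r, _ in per_task) / n
    macro_f1 = 100 * sum(f for _, _, f in per_task) / n
    completion_rate = 100 * sum(p == g for p, g in zip(predictions, golds)) / n
    return {"CR": completion_rate, "P": macro_p, "R": macro_r, "F1": macro_f1}
```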

5.2.5. Efficiency Metrics

  • Average Steps: The mean number of actions (e.g., click, fill text, go to page) performed by the agent per task.
  • Average Input Tokens: The mean number of tokens sent to the LLM as input per task.
  • Average Output Tokens: The mean number of tokens generated by the LLM as output per task.
  • Average Runtime: The mean time taken for an agent to complete a task, measured in seconds.
  • Average Cost: The estimated mean API cost per task, derived from token usage and current LLM pricing models.
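
The cost estimate is essentially a linear function of the token counts. Below is a small sketch assuming illustrative prices of $2 per million input tokens and $8 per million output tokens; these prices are an assumption chosen to be consistent with the GPT-4.1 figures in Table 5, not values stated in the paper.

```python
def estimate_cost(input_tokens: int, output_tokens: int,
                  usd_per_m_input: float = 2.0, usd_per_m_output: float = 8.0) -> float:
    """Estimate API cost in USD from token counts and assumed per-million-token prices."""
    return input_tokens / 1e6 * usd_per_m_input + output_tokens / 1e6 * usd_per_m_output

# Example: the GPT-4.1 Basic AX-Tree averages from Table 5 (131,301 input / 2,334 output
# tokens) give roughly 0.26 + 0.02, i.e. about $0.28 per task under these assumed prices.
cost = estimate_cost(131_301, 2_334)
```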

5.3. Baselines

The paper evaluates eight baseline agent configurations built using the Browsergym/AgentLab framework. These baselines are chosen to explore the impact of different observation modalities, the presence of memory, and the choice of underlying LLM. They are representative because they cover common architectural choices for LLM-based web agents.

The baselines are formed by combining:

  1. Large Language Models (LLMs):
    • GPT-4.1 (from OpenAI)
    • Claude Sonnet 4 (from Anthropic)
  2. Observation Spaces:
    • AX-Tree: Only the accessibility tree is provided.

    • AX-Tree + Memory: Accessibility tree with persistent short-term memory enabled.

    • AX-Tree + Vision: Accessibility tree supplemented by screenshots.

    • Vision: Only screenshots are provided.

      This results in 2 × 4 = 8 distinct baseline configurations. Each configuration is run on the full WebMall task set (91 tasks) to gather performance data across all specified metrics.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that WebMall is a challenging benchmark for current LLM-based web agents. The analysis highlights the importance of structured observations (accessibility trees) and memory for effective web interaction, while also revealing performance trade-offs between different LLMs across various task complexities.

The following are the results from Table 3 of the original paper:

All values are percentages (CR = Completion Rate, P = Precision, R = Recall, F1 = F1 score).

Model            Task set    AX-Tree                           AX-Tree + Memory                  AX-Tree + Vision                  Vision
                             CR      P      R      F1          CR      P      R      F1          CR      P      R      F1          CR      P      R      F1
GPT4.1           Basic       56.25   74.48  67.59  70.87       75.00   91.60  83.95  87.61       56.25   72.66  65.77  69.04       41.67   59.64  50.43  54.65
GPT4.1           Advanced    32.56   52.03  45.57  48.59       34.88   52.11  46.25  49.01       39.53   48.46  48.35  48.41       13.95   20.70  18.00  19.26
Claude Sonnet 4  Basic       66.67   76.04  72.44  74.20       70.83   81.25  75.12  78.06       72.92   79.17  76.67  77.90       10.42   35.42  21.99  27.14
Claude Sonnet 4  Advanced    53.49   63.37  63.41  63.39       48.84   61.51  58.40  59.91       37.21   41.11  41.80  41.45        4.65   10.47   6.69   8.16

Overall Performance by Task Set (Table 3):

  • Basic Tasks:
    • The GPT-4.1 with AX-Tree + Memory configuration achieves the highest performance with a Completion Rate of 75.00% and an F1 Score of 87.61%. This indicates that for more structured and straightforward tasks, a powerful LLM combined with structured observation and memory is highly effective.
    • Claude Sonnet 4 with AX-Tree + Vision performs well on basic tasks (CR 72.92%, F1 77.90%), achieving a higher completion rate than its AX-Tree (CR 66.67%, F1 74.20%) and AX-Tree + Memory (CR 70.83%, F1 78.06%) counterparts, although the latter has a marginally higher F1. This suggests that for Claude Sonnet 4, visual cues might offer some supplementary benefit for basic tasks.
  • Advanced Tasks:
    • Claude Sonnet 4 with AX-Tree achieves the best results (CR 53.49%, F1 63.39%), demonstrating its stronger reasoning capabilities for vague requirements and complex comparisons. Notably, adding memory or vision to Claude Sonnet 4 for advanced tasks does not improve performance; in fact, memory slightly degrades it (CR 48.84%, F1 59.91%), and vision significantly so (CR 37.21%, F1 41.45%). This suggests that additional modalities or memory can sometimes confuse or distract the LLM when tasks are already highly complex.
    • GPT-4.1 generally shows lower performance on advanced tasks compared to Claude Sonnet 4 (highest CR 39.53% with AX-Tree + Vision, highest F1 49.01% with AX-Tree + Memory).
  • Impact of Vision (Screenshot-only): Agents using only screenshots (Vision column) perform significantly worse across all task sets and both LLMs. For GPT-4.1, Vision achieves a CR of 41.67% on basic tasks and 13.95% on advanced tasks. For Claude Sonnet 4, the performance is even lower, with CRs of 10.42% and 4.65% respectively. This confirms that screenshots alone lack the structured semantic information necessary for reliable web navigation and interaction.
  • Impact of Memory: Memory generally improves performance for GPT-4.1 on basic tasks (CR increases from 56.25% to 75.00%, F1 from 70.87% to 87.61%). For Claude Sonnet 4 on basic tasks, memory provides a slight boost in F1 but a minor decrease in CR. For advanced tasks, memory has a minimal positive or even slightly negative impact, especially for Claude Sonnet 4. This implies memory is most beneficial for long-running tasks where information needs to be explicitly stored and retrieved, preventing premature submission or forgetting intermediate results.

6.2. Data Presentation (Tables)

The following are the results from Table 4 of the original paper:

All values are percentages (CR = Completion Rate, P = Precision, R = Recall, F1 = F1 score).

Model            Task Category                      AX-Tree                            AX-Tree + Memory                   AX-Tree + Vision                   Vision
                                                    CR      P       R       F1         CR      P       R       F1         CR      P       R       F1         CR      P       R       F1
Basic Tasks
GPT4.1           Single Product Search              33.33   85.42   66.48   74.77      66.67   88.64   81.69   85.02      33.33   67.71   54.61   60.46      41.67   69.10   56.44   62.13
GPT4.1           Cheapest Product Search            60.00   60.00   60.00   60.00      90.00   90.00   90.00   90.00      40.00   42.50   42.50   42.50      50.00   63.33   57.50   60.28
GPT4.1           Best Fit Specific Requirements     27.27   50.00   40.61   44.82      36.36   84.85   59.01   69.61      45.45   68.18   56.97   62.07      27.27   54.55   38.03   44.81
GPT4.1           Add to Cart                        85.71   85.71   85.71   85.71     100.00  100.00  100.00  100.00      85.71  100.00   92.86   96.30      85.71  100.00   92.86   96.30
GPT4.1           Checkout                          100.00  100.00  100.00  100.00     100.00  100.00  100.00  100.00     100.00  100.00  100.00  100.00      12.50   12.50   12.50   12.50
Claude Sonnet 4  Single Product Search              66.67   83.33   78.41   80.80      75.00   83.33   79.17   81.20      75.00   83.33   79.17   81.20       0.00   58.33   22.98   32.97
Claude Sonnet 4  Cheapest Product Search            70.00   75.00   75.00   75.00      70.00   70.00   70.00   70.00      80.00   80.00   80.00   80.00      40.00   60.00   50.00   54.55
Claude Sonnet 4  Best Fit Specific Requirements     45.45   63.64   53.31   58.01      45.45   81.82   59.61   68.97      45.45   63.64   57.27   60.29       9.09   36.36   25.45   29.95
Claude Sonnet 4  Add to Cart                        71.43   71.43   71.43   71.43      85.71   85.71   85.71   85.71      85.71   85.71   85.71   85.71       0.00    0.00    0.00    0.00
Claude Sonnet 4  Checkout                           87.50   87.50   87.50   87.50      87.50   87.50   87.50   87.50      87.50   87.50   87.50   87.50       0.00    0.00    0.00    0.00
Advanced Tasks
40.00 40.00 40.00 40.00 30.00 30.00 30.00 30.00 30.00 30.00 30.00 30.00 20.00 20.00 20.00
GPT4.1 Cheapest Best Fit Specific Requirements 12.50 64.03 48.09 54.93 25.00 80.09 25.00 44.27 41.95 12.50 43.75 20.00 36.81
Best Fit Vague Requirements 16.67 54.17 48.61 51.24 16.67 66.67 65.28 71.93 16.67 39.87 48.61 50.48 0.00 6.67 31.77 4.44
Cheapest Best Fit Vague Requirements 44.44 53.33 52.50 3.33
Find Substitutes 50.00 50.00 50.00 50.00 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33 33.33
Find Compatible Products 40.00 60.00 46.67 52.50 40.00 40.00 33.33 40.00 60.00 70.00 66.67 68.29 20.00 20.00 20.00 20.00
End-to-End 37.50 50.00 43.75 46.67 62.50 62.50 62.50 62.50 75.00 75.00 75.00 75.00 0.00 0.00 0.00 0.00
60.00 60.00 60.00 60.00
Claude Sonnet 4 Cheapest Best Fit Specific Requirements 37.50 68.39 68.75 68.57 50.00 50.00 50.00 50.00 50.00 50.00 50.00 50.00 10.00 10.00 10.00 10.00
Best Fit Vague Requirements 37.50 71.88 57.64 63.97 37.50 58.48 62.15 60.26 0.00 31.25 10.94 16.20
Cheapest Best Fit Vague Requirements
Find Substitutes 83.33 83.33 83.33 45.87 33.33 33.33 33.33 33.33 16.67 16.67 16.67 16.67 0.00 0.00 0.00 0.00
Find Compatible Products 33.33 52.78 40.56 0.00 0.00 0.00 0.00
End-to-End 60.00 60.00 60.00 66.67 66.67 66.67 66.67 16.67 16.67 16.67 16.67 0.00 0.00 0.00 0.00

6.3. Ablation Studies / Parameter Analysis

Table 4 provides a detailed breakdown of performance by individual task category, revealing granular insights into agent capabilities and failure modes.

6.3.1. Structured Basic Tasks

  • Categories: Single Product Search, Cheapest Product Search, Add to Cart, Checkout.
  • Performance: GPT-4.1 with AX-Tree + Memory excels in these categories, often achieving Completion Rates of 90-100% and high F1-scores. For Add to Cart and Checkout, GPT-4.1 AX-Tree + Memory achieves 100% CR and F1. Claude Sonnet 4 agents with AX-Tree (and sometimes AX-Tree + Vision) are competitive, though they might lose some precision or recall.
  • Common Failure Modes:
    • Rigid Search Strategies: Agents sometimes issue overly specific queries or stop if initial results are not found, missing product variants or alternative spellings. This reduces recall.
    • Screenshot-only agents: Struggle significantly, often failing to locate crucial UI elements like search boxes or buttons, leading to step limit exhaustion. For example, Checkout with GPT-4.1 Vision has a CR of 12.50%, and Claude Sonnet 4 Vision has 0% for Add to Cart and Checkout.

6.3.2. Attribute-Rich and Ambiguous Tasks

  • Categories: Best Fit Specific Requirements, Best Fit Vague Requirements, and their Cheapest variants, Find Substitutes, Find Compatible Products.
  • Performance: Claude Sonnet 4 with AX-Tree often shows higher F1-scores and Completion Rates for these categories compared to GPT-4.1, suggesting better attribute-based reasoning and vagueness interpretation. For Best Fit Vague Requirements, Claude Sonnet 4 AX-Tree achieves a CR of 37.50% and F1 of 63.97%, outperforming GPT-4.1 AX-Tree (CR 16.67%, F1 51.24%). Find Substitutes is another strong category for Claude Sonnet 4 AX-Tree (CR 83.33%, F1 45.87%), though the F1 is lower than its CR, indicating some incorrect submissions.
  • Impact of Vision: Combining accessibility tree and screenshot modalities yields modest gains in some categories, such as Find Compatible Products with GPT-4.1, where visual information might aid in identifying matching aesthetic features (e.g., color schemes). GPT-4.1 AX-Tree + Vision achieves a CR of 60.00% and F1 of 68.29% for Find Compatible Products, which is higher than AX-Tree alone (CR 40.00%, F1 52.50%).
  • Common Failure Modes:
    • Incomplete Cross-Shop Search: Agents frequently fail to comprehensively search across all shops or stop after the first matching result, reducing recall.
    • Attribute Confusion/Misinterpretation: Agents may confuse similar attributes (e.g., RAM kit capacity vs. single stick capacity) or misinterpret vague requirements.
    • Reasoning Errors: The complexity of compatibility reasoning or understanding vague descriptions leads to errors in identifying relevant offers.

6.3.3. End-to-End Tasks

  • Category: End-to-End (combining search, comparison, add to cart, and checkout).
  • Performance: Claude Sonnet 4 AX-Tree + Memory completes 66.67% of end-to-end tasks, with an F1 score of 66.67%. GPT-4.1 benefits from AX-Tree + Vision in this category, achieving 75.00% CR and F1. Memory is especially valuable here, as it helps agents maintain context and prevent forgetting intermediate results or submission details over long sequences of actions.
  • Common Failure Modes:
    • UI Interaction Errors: Agents may repeatedly click the wrong controls or fail to correctly fill in forms, especially without structured input.
    • Output Formatting Mistakes: Even if an agent finds the correct solution, errors in the format of the submitted URL (e.g., incomplete URLs) lead to tasks being marked incorrect. Memory-enabled agents are less prone to this as they can store and retrieve solution URLs explicitly.
    • Insufficient Cross-Shop Reasoning: Many agents struggle to aggregate and compare information effectively across multiple shops before making a final decision.

6.3.4. Overall Failure Patterns

A common thread across all categories is insufficient cross-shop reasoning. Many runs terminate after finding a single offer, failing to explore other shops for better deals or complete information. This is partially alleviated by memory, but not fully resolved. UI interaction errors (e.g., struggling with forms, missing buttons) are prevalent, especially for vision-only agents. Finally, output formatting mistakes for solution submission are a consistent source of lost points, particularly for agents without explicit memory to store solution URLs.

6.4. Efficiency Analysis

The efficiency analysis (Table 5 and Figure 2) reveals significant differences in token usage, runtime, and API costs between models and configurations, highlighting the critical trade-off between performance and resource consumption for practical deployment.

The following are the results from Table 5 of the original paper:

Model            Task Set   Observation Space   Avg. Steps   Avg. Input Tokens   Avg. Output Tokens   Avg. Runtime   Avg. Cost
GPT4.1           Basic      AX-Tree                  22.69             131,301                2,334         130.5s       $0.28
                            AX-Tree + Memory         20.88             130,270                3,511         142.4s       $0.29
                            AX-Tree + Vision         20.92             135,362                1,901         155.4s       $0.29
                            Vision                   28.56             104,617                2,453         176.2s       $0.23
GPT4.1           Advanced   AX-Tree                  24.98             160,922                2,950         159.2s       $0.35
                            AX-Tree + Memory         24.19             178,949                4,658         177.0s       $0.40
                            AX-Tree + Vision         23.74             169,956                2,468         187.8s       $0.36
                            Vision                   33.33             133,972                3,119         216.4s       $0.29
Claude Sonnet 4  Basic      AX-Tree                  23.69             188,079                6,791         222.7s       $0.67
                            AX-Tree + Memory         22.04             236,631               15,106         334.6s       $0.94
                            AX-Tree + Vision         25.62             242,597                6,255         279.5s       $0.82
                            Vision                   43.40             364,694               13,937         446.9s       $1.30
Claude Sonnet 4  Advanced   AX-Tree                  29.65             291,048               10,063         331.7s       $1.02
                            AX-Tree + Memory         27.33             364,858               18,149         420.9s       $1.37
                            AX-Tree + Vision         37.26             480,199               12,630         471.9s       $1.63
                            Vision                   47.74             421,704               17,456         536.3s       $1.53

6.4.1. Token Usage

  • Model Comparison: Claude Sonnet 4 configurations consistently consume substantially more tokens than GPT-4.1 configurations, often more than double for comparable observation spaces. For instance, Claude Sonnet 4 AX-Tree for advanced tasks uses 291,048 input tokens compared to GPT-4.1 AX-Tree's 160,922.
  • Observation Space Impact: Configurations incorporating screenshots (AX-Tree + Vision or Vision only) generally lead to higher token usage, as visual information adds significantly to the LLM's input context.
  • Memory Impact: While memory-based configurations might reduce average steps, the longer prompts due to the explicit memory section can still result in higher overall token usage, especially for Claude Sonnet 4. For Claude Sonnet 4 Advanced tasks, AX-Tree + Memory uses 364,858 input tokens, significantly more than AX-Tree alone.
  • Inefficient Vision-only agents: Vision-only agents, despite their lower performance, often show higher average steps and substantial token usage (e.g., Claude Sonnet 4 Vision for basic tasks uses 364,694 input tokens), reflecting their struggle to navigate efficiently and their tendency to repeat actions due to lack of structured information.

6.4.2. Runtime

  • Model Comparison: GPT-4.1 agents are considerably faster than Claude Sonnet 4 agents. GPT-4.1 typically completes basic tasks in 2-3 minutes and advanced tasks in ~3 minutes. In contrast, Claude Sonnet 4 often requires 4-8 minutes per task, especially for complex workflows or with additional modalities. This difference is largely attributable to the higher token usage of Claude Sonnet 4 and potentially differences in API latency.
  • Observation Space Impact: Adding vision (AX-Tree + Vision) or relying solely on vision (Vision) tends to increase runtime for both LLMs due to the overhead of processing visual data and the increased token count.
  • Efficiency Trade-off: The data suggests that GPT-4.1 is a more efficient choice for basic structured tasks due to its lower runtime and token consumption. For advanced tasks, while Claude Sonnet 4 might offer better effectiveness (as seen in Table 3), this comes at a significant cost in runtime.

6.4.3. API Usage Fees

  • Cost Scaling: API costs directly correlate with token usage and runtime.

  • Model Comparison: GPT-4.1 configurations are generally more cost-effective. For basic tasks, GPT-4.1 costs range from ~$0.23 to ~$0.29 per task.

  • High Cost of Claude Sonnet 4: Claude Sonnet 4 configurations are considerably more expensive. For basic tasks, costs range from ~$0.67 to ~$1.30 per task. For advanced tasks, Claude Sonnet 4 with AX-Tree + Vision can reach ~$1.63 per task. The highest-performing Claude Sonnet 4 AX-Tree configuration for advanced tasks still costs ~$1.02 per task.

  • Performance-Cost Trade-off (Figure 2): The following figure (Figure 2 from the original paper) shows the relationship between cost and task completion rate:

    Figure 2: Cost versus task completion rate for the basic (left) and advanced (right) task set. The scatter plots relate average per-task cost to task completion rate; differently colored points denote the different agent configurations, making the performance differences between the agents directly visible.

    Figure 2 visually represents this trade-off. For basic tasks (left plot), GPT-4.1 configurations generally cluster in the lower-cost, moderate-to-high completion rate area. The GPT-4.1 AX-Tree + Memory configuration, which has the highest completion rate (75%) for basic tasks, remains relatively low cost (~$0.29). Claude Sonnet 4 configurations, while sometimes achieving comparable or slightly higher completion rates, always incur significantly higher costs. For advanced tasks (right plot), the pattern is similar. Claude Sonnet 4 AX-Tree achieves the highest completion rate (53.49%) but at a cost of ~$1.02. Other Claude Sonnet 4 configurations often have even higher costs with lower completion rates. GPT-4.1 configurations are cheaper but achieve lower completion rates on advanced tasks. The plots clearly illustrate that while more sophisticated agent architectures or LLMs might yield higher success rates, they often come with a substantial increase in token usage, runtime, and cost, making them less practical for widespread, high-volume deployment.
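
Since both models are served through public APIs, the cost column can be roughly reproduced from the token counts. The sketch below assumes per-million-token prices of about $2 / $8 (input / output) for GPT-4.1 and $3 / $15 for Claude Sonnet 4; these rates are an assumption based on published pricing rather than figures taken from the paper, but they match the reported per-task costs closely.

```python
# Hedged sketch: estimate per-task API cost from the token counts in the table.
# The per-million-token prices are assumptions, not values stated in the paper.
PRICES_PER_MILLION = {
    "gpt-4.1": {"input": 2.00, "output": 8.00},
    "claude-sonnet-4": {"input": 3.00, "output": 15.00},
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated USD cost of one task run for the given model."""
    rates = PRICES_PER_MILLION[model]
    return input_tokens / 1e6 * rates["input"] + output_tokens / 1e6 * rates["output"]


# GPT-4.1, basic tasks, AX-Tree configuration (token counts from the table above):
print(round(estimate_cost("gpt-4.1", 131_301, 2_334), 2))          # ~0.28
# Claude Sonnet 4, basic tasks, AX-Tree configuration:
print(round(estimate_cost("claude-sonnet-4", 188_079, 6_791), 2))  # ~0.67
```

Dividing the estimated per-task cost by the completion rate gives an expected cost per successfully completed task, which is essentially the trade-off that Figure 2 visualizes.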

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces WebMall, a pioneering multi-shop benchmark specifically designed for evaluating LLM-based web agents in complex e-commerce comparison-shopping scenarios. WebMall stands out by simulating four distinct online shops populated with authentic, heterogeneous product offers derived from the Common Crawl and offering 91 diverse tasks across 11 categories, including both basic shopping flows and advanced reasoning-intensive tasks involving vagueness and compatibility. The benchmark addresses a critical gap in existing single-shop or live-web benchmarks by providing a reproducible and realistic environment for cross-site information aggregation and reasoning.

The comprehensive evaluation of eight baseline agent configurations revealed several key insights:

  • Importance of Structured Observation: The accessibility tree is paramount for reliable web navigation and interaction, demonstrating superior performance compared to vision-only approaches.
  • Value of Memory: Persistent short-term memory significantly boosts performance on long-running tasks that require tracking information across multiple shops and steps, preventing premature termination and information loss (see the sketch after this list).
  • LLM Trade-offs: GPT-4.1 emerged as more efficient (faster, cheaper) and more accurate on structured, basic tasks. Claude Sonnet 4, while generally more expensive and slower, demonstrated stronger reasoning on the less clearly defined advanced tasks involving vague or highly specific requirements.
  • Current Limitations: Despite promising results (best F1 of 87% for basic, 63% for advanced tasks), web agents still face challenges with rigid search strategies, insufficient cross-shop reasoning, UI interaction errors, and output formatting issues. The high API costs remain a significant barrier to widespread adoption.
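
The first two findings suggest a concrete agent design: a decision loop whose observation is the accessibility tree and whose prompt carries a persistent notes buffer that survives page navigations and shop switches. The sketch below illustrates this pattern in general terms; it is not the paper's baseline implementation, and the prompt layout, NOTE convention, and call_llm hook are placeholders.

```python
# Illustrative sketch of an agent step combining an accessibility-tree observation
# with persistent short-term memory. Not the WebMall baseline code; the prompt
# layout, NOTE convention, and call_llm hook are placeholders.
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class AgentState:
    task: str
    memory: List[str] = field(default_factory=list)  # notes that persist across pages


def build_prompt(state: AgentState, ax_tree: str) -> str:
    notes = "\n".join(f"- {n}" for n in state.memory) or "(no notes yet)"
    return (
        f"Task: {state.task}\n"
        f"Memory (facts gathered so far):\n{notes}\n"
        f"Current page (accessibility tree):\n{ax_tree}\n"
        "Reply with the next action; prefix any fact worth keeping with NOTE:."
    )


def agent_step(state: AgentState, ax_tree: str, call_llm: Callable[[str], str]) -> str:
    """One decision step: observe, recall, act, and update memory."""
    response = call_llm(build_prompt(state, ax_tree))
    if "NOTE:" in response:  # e.g. "NOTE: Shop B lists the SSD at $79.90"
        state.memory.append(response.split("NOTE:", 1)[1].strip())
    return response  # the chosen action, e.g. "click('add-to-cart-42')"
```

Because the notes buffer travels with the agent across all four shops, price observations collected early on remain available when the final comparison and checkout decision is made.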

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Rigid Search Strategies: Agents often fail to adapt their search when an initial query returns no results, overlooking product variants that a reformulated or broader query would have surfaced. Future work should focus on more flexible and adaptive search and exploration mechanisms; a minimal sketch of such a fallback policy follows this list.
  • Difficulties in Handling UI: UI interaction errors (e.g., misclicking, failing to locate or fill fields) are prevalent, especially for agents lacking structured input. Improved multi-modal reasoning that better integrates visual and structural cues could mitigate this.
  • Premature Termination and Output Formatting: Agents sometimes give up too early or make output formatting mistakes (e.g., incomplete URLs) when submitting solutions. More robust memory integration can help agents track task progress and ensure correct submission.
  • Insufficient Cross-Shop Reasoning: A core challenge identified is the agents' inability to consistently aggregate information and make decisions across all four shops, often stopping after the first relevant finding. This highlights a need for more sophisticated reasoning and planning capabilities that explicitly handle multi-source comparisons.
  • High API Costs: The current LLM-based agents incur substantial API costs, which is a practical limitation for real-world deployment. Future work should explore more efficient LLMs or agent architectures that reduce token usage without sacrificing performance.
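
As a concrete illustration of the first point, a simple fallback policy can progressively relax a query when a shop's search returns nothing, instead of giving up after the first empty result page. The helper below is a hypothetical sketch; `search_shop` stands in for however the agent drives a shop's search box and is not part of WebMall.

```python
# Hedged sketch of an adaptive search policy: retry with progressively broader
# queries when a shop returns no results. `search_shop` is a hypothetical helper
# wrapping the agent's interaction with a shop's search form.
from typing import Callable, Dict, List


def adaptive_search(query: str, search_shop: Callable[[str], List[Dict]]) -> List[Dict]:
    """Try the full query first, then drop trailing qualifiers one at a time."""
    tokens = query.split()
    # e.g. "ssd 1tb nvme gen4" -> "ssd 1tb nvme" -> "ssd 1tb" -> "ssd"
    candidates = [" ".join(tokens[:cut]) for cut in range(len(tokens), 0, -1)]
    for candidate in candidates:
        results = search_shop(candidate)
        if results:
            return results
    return []
```

A fuller version would also try synonyms and model-number variants, and would record which reformulation finally succeeded so the agent can reuse it in the other shops.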

7.3. Personal Insights & Critique

WebMall represents a crucial step forward in web agent benchmarking. By focusing on the multi-shop comparison-shopping paradigm, it directly addresses a highly practical and complex real-world use case that previous benchmarks largely overlooked. The use of authentic product offers from Common Crawl and the containerized environment are commendable for ensuring both realism and reproducibility.

A key insight from this paper is the clear demonstration of the complementary strengths of accessibility trees and memory. Accessibility trees provide the indispensable structured information that LLMs need to reliably interact with web pages, while memory allows them to maintain context and aggregate information over long interaction trajectories, which is vital for comparison-shopping. The trade-off between GPT-4.1's efficiency on structured tasks and Claude Sonnet 4's reasoning capabilities on ambiguous tasks also highlights the ongoing evolution and specialization of LLMs themselves.

Critically, the paper implicitly points to the need for more human-like learning and adaptation in web agents. The observed rigid search strategies and insufficient cross-shop reasoning suggest that current LLM-based agents often struggle with exploratory behavior and complex information synthesis when faced with diverse, dynamic environments. Humans, when comparison-shopping, intuitively broaden searches, infer compatibility, and cross-reference information. Mimicking these adaptive behaviors remains a significant challenge.

The reported high API costs are a stark reminder that while LLM-based agents show promise, their practical deployment at scale is currently limited by economic factors. Future research might explore smaller, specialized LLMs for specific sub-tasks or more efficient agent architectures that reduce the number of LLM calls or token usage per task.

The choice of GPT-4.1 and Claude Sonnet 4 as baseline models ties the reported results to specific commercial API offerings whose behavior can change as providers update their hosted snapshots. Documenting the exact model versions and evaluation dates (though not the paper's primary focus) would help future researchers replicate or build upon these baselines.

Overall, WebMall provides a much-needed benchmark that will undoubtedly stimulate further research into building more intelligent, robust, and efficient web agents capable of navigating the true complexity of the internet. Its methods and conclusions could be transferred to other domains requiring multi-source information aggregation and decision-making, such as news summarization from multiple sources, travel planning across different booking sites, or research assistance by combining data from various academic databases.
