REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites

Published: 04/16/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

REAL benchmarks autonomous agents on 11 high-fidelity website simulations with 112 multi-turn tasks, combining programmatic checks and LLM-based evaluation. Results show top models achieve only 41% success, revealing key gaps in autonomous web navigation and task completion.

Abstract

We introduce REAL, a benchmark and framework for multi-turn agent evaluations on deterministic simulations of real-world websites. REAL comprises high-fidelity, deterministic replicas of 11 widely-used websites across domains such as e-commerce, travel, communication, and professional networking. We also release a benchmark consisting of 112 practical tasks that mirror everyday complex user interactions requiring both accurate information retrieval and state-changing actions. All interactions occur within this fully controlled setting, eliminating safety risks and enabling robust, reproducible evaluation of agent capability and reliability. Our novel evaluation framework combines programmatic checks of website state for action-based tasks with rubric-guided LLM-based judgments for information retrieval. The framework supports both open-source and proprietary agent systems through a flexible evaluation harness that accommodates black-box commands within browser environments, allowing research labs to test agentic systems without modification. Our empirical results show that frontier language models achieve at most a 41% success rate on REAL, highlighting critical gaps in autonomous web navigation and task completion capabilities. Our framework supports easy integration of new tasks, reproducible evaluation, and scalable post-training data generation, marking a significant step forward in evaluating and advancing agent capabilities.

Bibliographic Information

Title: REAL: Benchmarking Autonomous Agents on Deterministic Simulations of Real Websites
Authors: Divyansh Garg, Shaun VanWeelden, Diego Caples, Andis Draguns, Nikil Ravi, Pranav Putta, Naman Garg, Tomas Abraham, Michael Lara, Federico Lopez, James Liu, Atharva Gundawar, Prannay Hebbar, Youngchul Joo, Jindong Gu, Charles London, Christian Schroeder de Witt, and Sumeet Motwani. The affiliation line is garbled in the extracted text ("Cpany Sanor nivesiy", "Veron Researc"). It appears to list a company, Stanford University, and the University of Oxford, plus a further organization whose name cannot be recovered reliably.
Journal/Conference: The paper is available on arXiv, which is a pre-print server for academic papers. This means it has not yet undergone formal peer review for a specific conference or journal. Such pre-prints allow for rapid dissemination of research.
Publication Year: 2025 (as stated in the abstract metadata).
Abstract: The authors introduce REAL, a benchmark and framework for evaluating autonomous agents. It features 11 high-fidelity, deterministic simulations of real-world websites (e.g., for e-commerce, travel, networking). The benchmark includes 112 practical, multi-step tasks that are evaluated in a safe, controlled environment. The evaluation framework combines programmatic checks (for state-changing actions) with rubric-guided judgments from Large Language Models (LLMs) (for information retrieval). The system is designed to be flexible, supporting both open-source and proprietary "black-box" agents without modification. Experimental results show that even the most advanced models achieve a success rate of only 41%, indicating significant challenges remain in autonomous web navigation. The authors position REAL as a crucial tool for reproducible evaluation, agent training, and advancing the state of web agent capabilities.
Original Source Link: The paper is available at https://arxiv.org/abs/2504.11543 and the PDF can be accessed at https://arxiv.org/pdf/2504.11543v2.pdf. It is an unreviewed pre-print.

Executive Summary

Background & Motivation (Why):
- Core Problem: While Large Language Models (LLMs) show great promise for creating autonomous agents that can perform tasks on the web, their real-world deployment has been slow. This is largely due to the lack of suitable environments for training and, crucially, for evaluation.
- Gaps in Prior Work: Existing benchmarks for web agents suffer from several key limitations:
  1. Lack of Determinism: Real, live websites constantly change their content and user interface (UI), making it impossible to reproduce experiments and reliably compare different agents.
  2. Lack of Configurability & Safety: It's unsafe and impractical to test agents on live production websites, as they might make real purchases or change data. Furthermore, live sites cannot be configured to test specific failure scenarios (e.g., an item being out of stock).
  3. Lack of Realism: Many existing synthetic benchmarks use simplified HTML or artificial tasks that do not reflect the complexity of modern web applications, leading to agents that perform well on the benchmark but fail in the real world.
  4. High Barrier to Entry: Some benchmarks require complex self-hosting (e.g., using Docker), which creates a significant overhead for researchers.
- Paper's Novel Approach: The paper introduces REAL (REalistic Agent Learning environments), a benchmark that addresses these gaps by providing high-fidelity, deterministic, and publicly hosted simulations of popular websites. This creates a "best of both worlds" scenario: the realism of real websites with the reproducibility and safety of a controlled lab environment.
Main Contributions / Findings (What):
- The paper's primary contributions are:
  1. A Collection of 11 Web Environments: High-fidelity, deterministic replicas of popular websites (e.g., clones of Amazon, Airbnb, Gmail) built with modern web technologies like React and Next.js. These are publicly hosted, eliminating setup complexity.
  2. A Benchmark of 112 Tasks: A comprehensive set of realistic, multi-step tasks that mirror everyday user activities, covering both information retrieval and state-changing actions.
  3. A Flexible Evaluation Harness: A framework that supports various agent types, including proprietary "black-box" systems, through multiple integration methods (Playwright, Chrome DevTools Protocol, URL endpoints).
  4. A Robust Evaluation Methodology: A novel hybrid evaluation system that uses programmatic checks of the browser's state (localStorage) for action-based tasks and a rubric-guided LLM-judge for information-retrieval tasks.
  5. Configurable Environments: The ability to change website behavior via URL parameters to test agents against specific edge cases, like network latency or application errors.
- Key Finding: The empirical evaluation shows that current state-of-the-art agents still struggle significantly. The top-performing model, Claude 3.7-Sonnet-Thinking, achieved only a 41.07% success rate on the benchmark, highlighting critical gaps in reasoning, state-tracking, and error recovery for today's web agents.

Foundational Concepts

Autonomous Agents (or Web Agents): These are AI systems, typically powered by an LLM, designed to perform tasks on behalf of a user by interacting with digital environments like websites. Instead of just generating text, they can take actions such as clicking buttons, filling forms, and navigating between pages to achieve a goal (e.g., "book a flight" or "find the price of this product").
Partially Observable Markov Decision Process (POMDP): This is a mathematical framework for modeling decision-making. In the context of a web agent, it means:
- The agent is in a certain state (the complete status of the website and browser).
- The agent only gets a partial observation of this state (e.g., a screenshot or the visible part of the webpage's code), not the full underlying state.
- The agent takes an action (e.g., a click).
- The environment transitions to a new state.
- The agent receives a reward (in this case, a final score of 1 for success or 0 for failure).
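
The loop described by these bullets can be sketched in a few lines. This is a toy illustration, not the REAL harness: `ToyEnv`, its two-item goal, and `scripted_policy` are hypothetical stand-ins. Only the shape of the loop (partial observation in, action out, binary reward at the end) comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class ToyEnv:
    """Hypothetical stand-in for a website: hidden state plus a partial view."""
    cart: list = field(default_factory=list)  # hidden state
    done: bool = False

    def observe(self) -> dict:
        # The agent sees only the cart size, not the full underlying state.
        return {"cart_size": len(self.cart), "done": self.done}

    def step(self, action: str) -> None:
        # Each action moves the environment to a new state.
        if action == "add_item":
            self.cart.append("sweater")
        elif action == "checkout":
            self.done = True

    def reward(self) -> int:
        # Binary outcome reward, granted only at the end of the episode.
        return int(self.done and len(self.cart) == 2)


def scripted_policy(obs: dict) -> str:
    """Hypothetical policy: add items until two are in the cart, then check out."""
    return "add_item" if obs["cart_size"] < 2 else "checkout"


def run_episode(env: ToyEnv, max_steps: int = 10) -> int:
    for _ in range(max_steps):
        obs = env.observe()
        if obs["done"]:
            break
        env.step(scripted_policy(obs))
    return env.reward()


print(run_episode(ToyEnv()))              # full episode
print(run_episode(ToyEnv(), max_steps=2)) # cut off before checkout
```

A run that is cut off before checkout earns no reward, which mirrors the all-or-nothing scoring described later in this analysis.
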
Determinism: In computing, a deterministic system is one where a given input will always produce the same output. For a benchmark, this is critical. It ensures that if two different agents are given the same task on the same website, any difference in outcome is due to the agents' capabilities, not random variations in the website's content or layout.
Action & Observation Space:
- Observation Space: The set of all possible information an agent can receive from the environment. In REAL, this can be a screenshot, the HTML Document Object Model (DOM), or a more structured accessibility tree.
- Action Space: The set of all possible actions an agent can perform. This can range from high-level user actions ("click on element with ID 'login-button'") to low-level commands ("dispatch a mouse click event at coordinates (x, y)").
Playwright & Chrome DevTools Protocol (CDP): These are technologies for automating and controlling a web browser.
- Playwright: A high-level library that provides simple commands to simulate user actions like clicking, typing, and navigating. It is easier to use but offers less fine-grained control over the browser than CDP.
- CDP: A low-level protocol that gives direct, fine-grained control over the Chrome browser. It allows for advanced actions like intercepting network requests, modifying the DOM on the fly, and debugging, offering maximum power and flexibility.
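
To make the contrast concrete, the sketch below builds the JSON messages a client would send over a CDP WebSocket connection. CDP commands are JSON objects with an `id`, a `method`, and `params`, and `Page.navigate` and `Input.dispatchMouseEvent` are real CDP methods. The helper `cdp_message`, the target URL, and the coordinates are illustrative assumptions, and nothing is sent over the network here.

```python
import itertools
import json

_ids = itertools.count(1)  # each CDP command carries its own id


def cdp_message(method: str, **params) -> str:
    """Serialize one CDP command (hypothetical helper, not part of REAL)."""
    return json.dumps({"id": next(_ids), "method": method, "params": params})


# Navigate, then press and release the left mouse button at (120, 240).
messages = [
    cdp_message("Page.navigate", url="https://example.com/"),
    cdp_message("Input.dispatchMouseEvent", type="mousePressed",
                x=120, y=240, button="left", clickCount=1),
    cdp_message("Input.dispatchMouseEvent", type="mouseReleased",
                x=120, y=240, button="left", clickCount=1),
]

for message in messages:
    print(message)
```

With Playwright's Python API the same click is typically a single call such as `page.mouse.click(120, 240)`. That is the trade-off in practice: less code, but less direct control over the individual events.
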
LLM-as-a-Judge: A method for evaluation where a powerful "judge" LLM (like GPT-4) is used to assess the quality of another model's output. It is given a rubric (a set of scoring criteria) and the model's response, and it outputs a score or judgment. This is useful for tasks where programmatic checks are difficult, such as evaluating the quality of a summarized text.
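
A rubric-guided judge can be sketched as prompt construction plus verdict parsing. Everything here is an assumption for illustration: the prompt wording, the PASS/FAIL convention, the example answer of 7, and `fake_judge`, which stands in for a real model call so that the example runs offline.

```python
from typing import Callable


def build_judge_prompt(task: str, rubric: list[str], answer: str) -> str:
    """Assemble a grading prompt (hypothetical wording, not the paper's)."""
    criteria = "\n".join(f"- {c}" for c in rubric)
    return (
        f"Task: {task}\n"
        f"Rubric:\n{criteria}\n"
        f"Agent answer: {answer}\n"
        "Reply with exactly PASS or FAIL."
    )


def judge(task: str, rubric: list[str], answer: str,
          call_model: Callable[[str], str]) -> int:
    """Return 1 if the judge model replies PASS, else 0."""
    reply = call_model(build_judge_prompt(task, rubric, answer))
    return int(reply.strip().upper() == "PASS")


def fake_judge(prompt: str) -> str:
    # Stand-in for an LLM call: passes only if the answer line mentions "7".
    answer_line = [line for line in prompt.splitlines()
                   if line.startswith("Agent answer:")][0]
    return "PASS" if "7" in answer_line else "FAIL"


task = "How many restaurants in this area serve Italian food?"
rubric = ["States a single number", "The number is 7"]
print(judge(task, rubric, "There are 7.", fake_judge))
print(judge(task, rubric, "There are 5.", fake_judge))
```

Passing the model call in as a function keeps the grading logic testable without network access. A real judge would replace `fake_judge` with a call to whichever model API is in use.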

Previous Works

The paper situates REAL within a growing field of web agent research.

Early Benchmarks: Works like MiniWoB and MiniWoB++ created foundational, reproducible mini-games on web pages to test agents. However, these were often too simple and lacked real-world complexity.
Simulation-Based Benchmarks:
- WebShop focused on a simulated e-commerce site, testing agents' ability to find and purchase products based on user instructions.
- WebArena was a major inspiration for REAL, providing realistic simulations of websites. However, the authors of REAL criticize WebArena for having some artificial tasks, being "gameable" (having exploitable shortcuts), and requiring complex self-hosting.
- Mind2Web provided a large dataset of tasks, but on a static, offline collection of websites, not a live interactive environment.
Specialized Benchmarks: Other benchmarks focus on specific domains, such as WorkArena for enterprise software tasks or ST-WebAgentBench for evaluating agent safety and trustworthiness.
Unified Interfaces: BrowserGym aims to provide a standardized interface to run agents across many different benchmarks (including WebArena and MiniWoB). REAL builds upon BrowserGym's foundational interface.
Web Agent Models: The paper mentions several agent architectures that could be tested on REAL, such as AgentQ (uses tree search), OpenAI's Operator (simulates mouse/keyboard), and WebDreamer (simulates action outcomes for planning).

Differentiation

REAL distinguishes itself from prior work in several key ways:

vs. WebArena: REAL aims for more realistic and practical tasks, uses modern web frameworks (React/Next.js), and crucially, is publicly hosted, which dramatically lowers the barrier to entry for researchers compared to WebArena's self-hosting requirement.
vs. Live Websites: REAL provides determinism and safety, which are impossible to guarantee on live sites like the real Amazon or United Airlines.
vs. BrowserGym: While REAL uses BrowserGym's interface as a foundation, its main contribution is the new set of high-fidelity environments and realistic tasks, which were not available before.
vs. Black-Box Systems: Unlike benchmarks that require agents to conform to a rigid action/observation space, REAL's flexible harness is explicitly designed to accommodate proprietary, closed-source agents, making it more inclusive for both academic and industrial research.

Methodology (Core Technology & Implementation Details)

The core of REAL is its comprehensive framework, which includes the simulated websites, the task definitions, the evaluation mechanism, and the agent interaction harness.

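Figure 1: The REAL benchmark and framework. REAL provides 11 realistic, deterministic, high-fidelity web environments (across e-commerce, networking, communication, scheduling, booking, project managem… The figure is a diagram of the REAL benchmark framework architecture. It comprises 11 deterministic, high-fidelity environments. The agent receives an observation $o_{t}$ and executes an action $a_{t}$ to complete the task, and after completion it receives a reward $r_{T}$ from programmatic state checks and LLM scoring.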

REAL Websites

The foundation of the benchmark is a set of 11 high-fidelity websites designed to mimic popular real-world applications.

Website Selection and Design: The sites were chosen to cover a diverse range of common web interactions: e-commerce, travel booking, scheduling, communication, and professional networking. This ensures agents are tested on a representative set of challenges.
Tech Stack: All websites are built using a modern stack: React and Next.js with TypeScript. This is important because modern web apps are highly dynamic and complex, posing a greater challenge to agents than simple, static HTML sites. The sites are publicly deployed on Vercel, making them accessible to anyone with an internet connection.
Determinism: To ensure reproducible results, the websites are made fully deterministic:
1. Static Data: All content (product prices, flight times, user messages) is fixed.
2. Predefined Time: All time-related elements (calendars, clocks) are locked to a specific date and time.
3. Replayability: These features guarantee that every agent starts with the exact same initial state for a given task.

Authentication and State Management: To simplify agent interaction, users are pre-authenticated. Anti-automation measures like CAPTCHAs are removed. The state of the website (e.g., items in a shopping cart) is stored in the browser's localStorage. This allows the state to persist across page loads and enables the evaluation framework to track changes made by the agent.

The following table, transcribed from Table 1 in the paper, lists the 11 websites and their functionalities.

Name	Inspired By	REAL URL	Core Functionality
Staynb	Airbnb	evals-staynb	Search, filter, book, and review vacation rentals; manage bookings.
Omnizon	Amazon	evals-omnizon	Browse/search products, manage shopping cart, complete online purchase checkout.
DashDish	Doordash	evals-dashdish	Browse restaurants, customize menu selections, place and manage food delivery orders.
GoCalendar	GCal	evals-gocalendar	Manage calendar views, schedule events, create and modify appointments.
GoMail	Gmail	evals-gomail	Manage inbox (read, label, delete), compose/send emails, handle attachments.
OpenDining	OpenTable	evals-opendining	Search restaurant availability by criteria (time, party size), make/manage table reservations.
NetworkIn	LinkedIn	evals-networkin	Manage user profile, search for professional connections, view profiles and posts.
UDriver	Uber	evals-udriver	Plan trips (set locations), request rides based on service type, view route and fare estimates.
FlyUnified	United	evals-fly-unified	Search for flights (origin, destination, dates), select seats, book tickets, manage itineraries.
TopWork	UpWork	evals-topwork	Post jobs (client), search/apply for projects (freelancer), manage proposals and active contracts.
Zilloft	Zillow	evals-zilloft	Search/filter property listings, save favorites, contact managers, view property details and photos.

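Figure 2: Screenshots of representative web environments included in REAL (8 of 11 shown). These are high-fidelity, deterministic replicas of popular websites, hosted by us for easy accessibility. Th… The figure shows screenshots of 8 of the high-fidelity, deterministic website environments in the REAL benchmark. The environments feature complex multi-page workflows and persistent browser state, which makes it possible to track and inspect in detail the state changes caused by agent actions.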

REAL Framework and Environments

The interaction between an agent and a REAL environment is modeled as a POMDP.

Observation Space ( $O$ ): REAL is flexible, allowing the agent to define its own observation space. Common options include:
- High-Level (Playwright): Screenshots, the full HTML DOM, or the Accessibility Tree (a semantic representation of UI elements).
- Low-Level (CDP): The entire live browser session state, providing maximum information.
Action Space ( $A$ ): Similarly, the action space depends on the integration method:
- High-Level (Playwright): User-like commands (e.g., click, type, scroll).
- Low-Level (CDP): A vast range of commands for direct DOM manipulation, JavaScript execution, network interception, etc.
Rewards ( $r$ ): The framework provides a binary outcome reward ( $r \in \{0, 1\}$ ) at the end of a task.
- Action-based Tasks ( $r_A$ ): Success is determined programmatically. The framework compares the localStorage state before and after the task against a set of expected key-value changes. A task is successful only if all changes match exactly.
- Information Retrieval Tasks ( $r_R$ ): Success is determined by an LLM-judge. The agent's final text response is evaluated against a task-specific rubric for correctness and completeness.
- Combined Tasks: Require both programmatic and LLM-judge checks to pass ( $r_A = 1$ and $r_R = 1$ ).
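
The reward rules in these bullets can be sketched as a comparison of two state snapshots. This is a minimal sketch under stated assumptions, not the framework's code. The function names and the example keys are hypothetical. Treating any unexpected extra change as a failure is one reading of "match exactly".

```python
_MISSING = object()  # sentinel to tell "key absent" apart from a stored value


def state_diff(before: dict, after: dict) -> dict:
    """Keys whose value was added, changed, or removed (removed maps to None)."""
    changed = {}
    for key in before.keys() | after.keys():
        if before.get(key, _MISSING) != after.get(key, _MISSING):
            changed[key] = after.get(key)
    return changed


def action_reward(before: dict, after: dict, expected: dict) -> int:
    """r_A: 1 only if the observed changes equal the expected ones exactly."""
    return int(state_diff(before, after) == expected)


def combined_reward(r_a: int, r_r: int) -> int:
    """Combined tasks pass only when both checks pass."""
    return r_a * r_r


# localStorage stores strings, so values are shown as JSON-encoded text.
before = {"cart": "[]", "orders": "[]"}
after = {"cart": '["sweater", "sweater"]', "orders": "[]"}
expected = {"cart": '["sweater", "sweater"]'}

r_a = action_reward(before, after, expected)
print(r_a, combined_reward(r_a, r_r=0))
```

Multiplying the two binary rewards is equivalent to requiring $r_A = 1$ and $r_R = 1$. Here the action check passes, but the combined reward is 0 because the judge's verdict is set to 0 in the example.
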
Special URL Endpoints: The websites have special URLs for managing the evaluation flow:
- /config: Initializes the environment for a task, applying specific configurations via URL parameters.
- /submit: The agent navigates here to signal task completion. This triggers the capture of the final state for evaluation.
- /finish: Can be used during development to inspect the current state changes without ending the task.
- /clear: Resets the website's localStorage, preparing it for a new run.
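
One plausible ordering of these endpoints within a single run is sketched below. The host is a placeholder, because the real deployment URLs are not reproduced here. The step names and the exact order (reset, configure, optionally inspect, then submit) are assumptions.

```python
from urllib.parse import urlencode, urljoin

# Placeholder host, not the real deployment.
BASE = "https://udriver.example.invalid/"


def episode_plan(base: str, config: dict) -> list[tuple[str, str]]:
    """Ordered (step, URL) pairs for one evaluation run (assumed ordering)."""
    return [
        ("reset", urljoin(base, "clear")),
        ("configure", urljoin(base, "config") + "?" + urlencode(config)),
        # The agent works on the task in the browser between these steps.
        ("inspect", urljoin(base, "finish")),   # optional, development only
        ("complete", urljoin(base, "submit")),
    ]


for step, url in episode_plan(BASE, {"location_preset": 2}):
    print(f"{step:<10} {url}")
```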

Evaluation Tasks

REAL includes 112 tasks designed to test a wide range of capabilities.

Task Types:
- Information Retrieval: The agent must find and report a piece of information (e.g., "How many restaurants in this area serve Italian food?").
- Action-based: The agent must change the state of the website (e.g., "Book a flight from SFO to JFK").
- Combined: Requires both finding information and taking action (e.g., "Find the cheapest flight and book it").
- Impossible Tasks: Some tasks are intentionally impossible (e.g., booking a sold-out flight). This tests an agent's ability to recognize failure and report it correctly, rather than hallucinating success.
Task Definition: Each task is given to the agent as a natural language goal (e.g., "Add two sweaters to the cart and proceed to checkout using the saved credit card").
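
The task types above map naturally onto a small record type. The schema below is a hypothetical sketch, not the benchmark's actual task format. The field names are assumptions, and only the mapping from task type to the required checks follows the text.

```python
from dataclasses import dataclass, field
from enum import Enum


class TaskType(Enum):
    RETRIEVAL = "retrieval"
    ACTION = "action"
    COMBINED = "combined"


@dataclass(frozen=True)
class Task:
    """Hypothetical task record; field names are illustrative."""
    website: str
    goal: str                      # natural-language instruction
    task_type: TaskType
    possible: bool = True          # False for intentionally impossible tasks
    expected_changes: dict = field(default_factory=dict)
    rubric: tuple = ()

    def checks(self) -> list[str]:
        """Which evaluators must pass for this task type."""
        needed = []
        if self.task_type in (TaskType.ACTION, TaskType.COMBINED):
            needed.append("programmatic")
        if self.task_type in (TaskType.RETRIEVAL, TaskType.COMBINED):
            needed.append("llm_judge")
        return needed


task = Task(
    website="Omnizon",
    goal="Add two sweaters to the cart and proceed to checkout "
         "using the saved credit card",
    task_type=TaskType.ACTION,
)
print(task.checks())
```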

Agent Harness and Configurable Environments

Agent Harness: The harness is a flexible adapter that allows different agent architectures to connect to the REAL environments with minimal modification. It supports three integration modes: direct Playwright control, a low-level CDP WebSocket connection, and URL-based access for external or black-box systems.
Configurable Environments: A key innovation of REAL is its two-level configuration system, applied via URL query parameters at the /config endpoint. This allows for systematic testing of agent robustness.
- Universal Configurations: Global settings like simulated_network_latency or hide_aria_labels (to test performance without accessibility hints).
- Website-specific Configurations: Granular control over each site's internal logic. For example, in the UDriver environment, one can set error_finding_driver=true to simulate a failure in the booking process or location_preset=2 to initialize the map in New York. This enables targeted testing of error handling and adaptability.
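
The two-level configuration can be sketched as two dictionaries merged into one query string. The parameter names come from the text. The host is a placeholder, the value formats (`500`, `"true"`) are assumptions, and `config_url` is a hypothetical helper.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

# Universal settings named in the text; the real list may be longer.
UNIVERSAL_KEYS = {"simulated_network_latency", "hide_aria_labels"}


def config_url(base: str, universal: dict, site_specific: dict) -> str:
    """Build a /config URL from universal and website-specific settings."""
    unknown = set(universal) - UNIVERSAL_KEYS
    if unknown:
        raise ValueError(f"not a universal setting: {sorted(unknown)}")
    overlap = set(universal) & set(site_specific)
    if overlap:
        raise ValueError(f"setting given at both levels: {sorted(overlap)}")
    params = {**universal, **site_specific}
    return f"{base.rstrip('/')}/config?{urlencode(params)}"


url = config_url(
    "https://udriver.example.invalid",  # placeholder host
    universal={"simulated_network_latency": 500, "hide_aria_labels": "true"},
    site_specific={"error_finding_driver": "true", "location_preset": 2},
)
print(url)
print(parse_qs(urlsplit(url).query))
```

Keeping the two levels separate until the final merge makes it easy to reuse one universal profile (for example, a slow network) across all 11 websites.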

Experimental Setup

Datasets

The dataset in this context is the REAL benchmark itself, which consists of:

11 simulated website environments: These are detailed in Table 1 above. They span domains like e-commerce (Omnizon), travel (FlyUnified), scheduling (GoCalendar), and more.
112 evaluation tasks: Each task is defined by a natural language instruction and an initial configuration URL.

To provide a tangible example, a task for the FlyUnified website might be:

Goal: "Book a one-way flight for two adults from San Francisco (SFO) to New York (JFK) on October 26, 2024. Select the first two available economy class seats and use the saved payment method to complete the booking. Report the total cost." This task would require navigation, form filling, seat selection (a graphical task), and state changes, followed by information retrieval.

Evaluation Metrics

The primary metric used to evaluate agent performance is the End-to-End Task Success Rate.

Conceptual Definition: This metric measures the percentage of tasks that an agent completes successfully from start to finish. A task is only considered successful if the agent achieves the goal perfectly, as determined by the REAL evaluation framework (programmatic checks and/or LLM-judge). There is no partial credit.
Mathematical Formula: $\text{Success Rate} = \frac{\text{Number of Successfully Completed Tasks}}{\text{Total Number of Tasks}}$
Symbol Explanation:
- Number of Successfully Completed Tasks: The count of tasks for which the final reward $r$ was 1.
- Total Number of Tasks: The total number of tasks in the benchmark, which is 112.
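
As a worked instance of the formula, the sketch below averages per-task binary rewards. The count of 46 successes is not stated in this analysis. It is inferred from the reported 41.07% on 112 tasks and should be read as an assumption.

```python
def success_rate(rewards: list[int]) -> float:
    """Fraction of tasks whose final reward r was 1."""
    if not rewards:
        raise ValueError("no tasks were evaluated")
    return sum(rewards) / len(rewards)


# 46 successes and 66 failures over 112 tasks (inferred, see above).
rewards = [1] * 46 + [0] * 66
print(f"{success_rate(rewards):.2%}")
```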

Baselines

The paper evaluates a wide range of state-of-the-art large language models using a "default agent" provided as part of the REAL framework. The models tested serve as the baselines and include:

Proprietary / Closed Models: Claude 3.7-Sonnet-Thinking, Gemini 2.5-Pro Experimental, OpenAI-o3, GPT-4o, o3-mini, o1, and OpenAI's Computer-Using Agent (CUA).
Open-Source Models: Llama-4-Maverick, Llama 3.3 70B, DeepSeek V3, Llama-3.1-8B, Qwen-2.5-vl-32B, and Gemma-3-27B.

These models represent the frontier of language model capabilities at the time of publication and provide a strong baseline for measuring the difficulty of the REAL benchmark.

Results & Analysis

The paper's experiments reveal that even the most advanced models struggle with the realistic challenges posed by REAL.

Core Results

The overall performance of the models is summarized in Figure 3. The results highlight a clear performance gap between current agent capabilities and the requirements for reliable real-world task completion.

Figure 3: Performance of evaluated models on the REAL benchmark, measured by end-to-end task success rate of our baseline agent across 112 tasks. Claude 3.7 Sonnet-Thinking achieves 41.07%. The figure is a chart of the success rates of several frontier language models on the 112 REAL tasks, with Claude 3.7 Sonnet-Thinking reaching the highest task success rate.

Here is a breakdown of the success rates reported in the paper:

Claude 3.7-Sonnet-Thinking: 41.07% (Top performer)
Gemini-2.5-Pro Experimental: 38.39%
OpenAI-o3: 34.82%
o3-mini: 25.00%
DeepSeek V3: 19.64%
o1: 16.07%
GPT-4o: 14.29%
Llama-4-Maverick: 12.50%
Llama 3.3 70B: 10.71%
Gemma-3-27B: 9.82%
OpenAI's Computer-Using Agent (CUA): 7.14%
Qwen-2.5-vl-32B: 2.68%
Llama-3.1-8B: 1.79%
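
Because every task is scored 0 or 1, each percentage above should correspond to a whole number of solved tasks out of 112. The sketch below backs out those counts as a consistency check. The counts are inferred here, not reported in this analysis.

```python
import math

TOTAL_TASKS = 112

# Success rates (percent) as listed above.
reported = {
    "Claude 3.7-Sonnet-Thinking": 41.07,
    "Gemini-2.5-Pro Experimental": 38.39,
    "OpenAI-o3": 34.82,
    "o3-mini": 25.00,
    "DeepSeek V3": 19.64,
    "o1": 16.07,
    "GPT-4o": 14.29,
    "Llama-4-Maverick": 12.50,
    "Llama 3.3 70B": 10.71,
    "Gemma-3-27B": 9.82,
    "OpenAI CUA": 7.14,
    "Qwen-2.5-vl-32B": 2.68,
    "Llama-3.1-8B": 1.79,
}

# Nearest whole number of solved tasks for each reported percentage.
solved = {m: round(p * TOTAL_TASKS / 100) for m, p in reported.items()}

for model, pct in reported.items():
    recomputed = round(100 * solved[model] / TOTAL_TASKS, 2)
    consistent = math.isclose(recomputed, pct, abs_tol=0.005)
    print(f"{model:<28} {solved[model]:>3}/{TOTAL_TASKS}  {recomputed:6.2f}%  ok={consistent}")
```

Every reported rate maps back to an integer count. For example, 46 of 112 gives 41.07% and 2 of 112 gives 1.79%, so the list is internally consistent with binary scoring.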

Key Takeaways:

No model exceeds a 41.07% success rate, indicating that autonomous web navigation remains a largely unsolved problem.
Models with advanced reasoning capabilities (Claude 3.7 Sonnet-Thinking, Gemini 2.5 Pro, o3) perform best, suggesting that simple instruction-following is not enough.
Open-source models lag well behind the top proprietary models: the best open model, DeepSeek V3 (19.64%), trails Claude 3.7-Sonnet-Thinking by about 21 percentage points, although it still outperforms o1 and GPT-4o.
Model scale alone is not a guarantee of better performance on these tasks.

Figure 4 provides a per-website performance breakdown, showing that certain environments are consistently more difficult for all agents.

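Figure 4: A radar chart showing the average performance scores of several frontier models across the 11 websites in the REAL environments. The chart is split into closed-source and open-source models and indicates that the TopWork and FlyUnified environments are generally the most challenging.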

The radar chart shows that environments like TopWork (freelance marketplace) and FlyUnified (flight booking) are particularly challenging, likely due to their complex, multi-page workflows and numerous constraints.

Qualitative Observations

By analyzing agent trajectories, the authors identified two common failure modes:

Inadequate Failure Recognition and State Verification: Agents often fail to verify if their actions were successful. For example, an agent tasked with adding two items to a cart might correctly add the first but fail on the second. Instead of checking the cart's state, it proceeds to checkout assuming both items are there, thus failing the task. The agent gives more weight to its own plan ("I clicked the add button") than to the ground truth presented in the new observation (the cart only shows one item).
Navigation Dead Ends and Lack of Recovery: Agents struggle with non-standard navigation flows. For example, when booking a ride in UDriver, an agent might accidentally click a button to schedule the ride for a future time, entering a different sub-menu. Once "lost," the agent often fails to identify the correct way to go back (e.g., a "back" or "cancel" button) and gets stuck in a loop of clicking irrelevant elements. This shows a lack of robust exploration and backtracking strategies.
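
The first failure mode can be illustrated with a toy cart whose second add request is silently dropped. This is a hypothetical sketch, not taken from the paper's trajectories. It contrasts an agent that trusts its own plan with one that re-reads the state after each action and retries.

```python
class FlakyCart:
    """Hypothetical cart whose second add request is silently dropped."""

    def __init__(self) -> None:
        self.items: list[str] = []
        self._calls = 0

    def add(self, item: str) -> None:
        self._calls += 1
        if self._calls != 2:  # simulate one lost click
            self.items.append(item)


def add_blindly(cart: FlakyCart, wanted: list[str]) -> None:
    # Trusts the plan: assumes every click worked.
    for item in wanted:
        cart.add(item)


def add_with_verification(cart: FlakyCart, wanted: list[str],
                          max_retries: int = 3) -> None:
    # Re-observes the cart after each action and retries on mismatch.
    for item in wanted:
        target = cart.items.count(item) + 1
        for _ in range(max_retries):
            cart.add(item)
            if cart.items.count(item) >= target:
                break


blind, checked = FlakyCart(), FlakyCart()
add_blindly(blind, ["sweater", "scarf"])
add_with_verification(checked, ["sweater", "scarf"])
print(blind.items, checked.items)
```

Under exact-match state evaluation, the blind run would fail the task because the cart holds one item instead of two, while the verifying run recovers at the cost of one extra action.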

Conclusion & Personal Thoughts

Conclusion Summary

The paper introduces REAL, a significant contribution to the field of autonomous agent evaluation. By providing a benchmark of 11 high-fidelity, deterministic, and publicly hosted web environments, along with 112 realistic tasks, REAL addresses critical limitations of previous benchmarks. Its flexible evaluation harness and configurable environments lower the barrier to entry for researchers and enable more rigorous, reproducible science. The empirical results starkly demonstrate that current frontier agents are not yet capable of reliably performing complex real-world web tasks, with the best model achieving only a 41% success rate. The authors position REAL not just as an evaluation tool, but also as a rich environment for generating training data to improve future agents, particularly through reinforcement learning.

Limitations & Future Work

The authors acknowledge the following limitations and outline future directions:

Limitations:
- The benchmark is currently limited to 11 environments.
- It focuses exclusively on web-based interactions, which is only one of many domains for autonomous agents.
Future Work:
- Expand the suite of tasks and environments, potentially including cross-application workflows (e.g., researching a product on one site and buying it on another).
- Develop dense, step-wise reward functions (instead of just a final binary reward) to better support reinforcement learning (RL) training.
- Provide better integration support for advanced agent architectures that use planning or tree-search methods.
- Release a dedicated library to streamline RL post-training workflows on the REAL environments.

Personal Insights & Critique

Positive Impact: The most significant practical contribution of REAL is its accessibility. By publicly hosting the environments and providing a flexible harness, the authors have removed a major obstacle that existed with previous benchmarks like WebArena, which required complex local setups. This will likely accelerate research in the academic community and enable more standardized comparisons across different labs. The focus on modern web technologies (React) also ensures the benchmark's long-term relevance.
Areas for Improvement: While the binary success/failure metric is objective and deterministic, it can be coarse. A more granular scoring system that awards partial credit could provide deeper insights into where agents fail. For example, an agent that correctly fills a form but fails at the final submission step is more capable than one that cannot even find the form.
Untested Assumptions: The paper evaluates models using a "default agent." The performance of these LLMs could be heavily influenced by the quality of this agent's scaffolding (e.g., its prompt engineering, planning module, or memory). The results are therefore a measure of the model + default agent system, not just the model in isolation. The authors do acknowledge this, and the framework's flexibility invites others to test better agent architectures.
Open Questions for Future Research: The configurability of REAL is a powerful feature that opens up new research avenues. Researchers can now systematically study agent robustness by asking targeted questions like: "How much does network latency affect performance?" or "Are agents better at recovering from UI errors or backend logic errors?" This moves the field from simply measuring success to understanding the precise failure points of agentic systems, which is critical for building more reliable AI.
