
On the Multi-turn Instruction Following for Conversational Web Agents

Published: 08/01/2024

TL;DR Summary

This paper introduces conversational web navigation and the MT-Mind2Web dataset, proposing a self-reflective memory-augmented planning framework to improve LLMs' multi-turn instruction following, validated by extensive experiments.

Abstract

Web agents powered by Large Language Models (LLMs) have demonstrated remarkable abilities in planning and executing multi-step interactions within complex web-based environments, fulfilling a wide range of web navigation tasks. Despite these advancements, the potential for LLM-powered agents to effectively engage with sequential user instructions in real-world scenarios has not been fully explored. In this work, we introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment, supported by a specially developed dataset named Multi-Turn Mind2Web (MT-Mind2Web). To address the limited context length of LLMs and the context-dependency of conversational tasks, we further propose a novel framework, named self-reflective memory-augmented planning (Self-MAP), which employs memory utilization and self-reflection techniques. Extensive experiments are conducted to benchmark the MT-Mind2Web dataset and to validate the effectiveness of the proposed method.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

On the Multi-turn Instruction Following for Conversational Web Agents

1.2. Authors

Yang Deng*, Xuan Zhang*, Wenxuan Zhang†, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua

  • Affiliations: Singapore Management University, National University of Singapore, University of Copenhagen.
  • Research Background: The authors come from institutions known for strong research in AI, natural language processing (NLP), and related areas such as human-computer interaction and web technologies. Yang Deng and Xuan Zhang are co-first authors (marked with *), indicating equal contribution, and the involvement of multiple universities reflects a collaborative research effort.

1.3. Journal/Conference

Published at ACL Long 2024.

  • Reputation and Influence: The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational linguistics and natural language processing (NLP). Its annual conference (ACL) is one of the most prestigious and highly-regarded venues in the field, making publication here a significant achievement that indicates high-quality, impactful research. "ACL Long" typically refers to the main conference track for full papers.

1.4. Publication Year

2024

1.5. Abstract

This paper addresses the underexplored area of Large Language Model (LLM)-powered web agents effectively engaging with sequential user instructions in real-world scenarios. It introduces a new task called Conversational Web Navigation, which requires complex multi-turn interactions with both users and the web environment. To support this task, a novel dataset named Multi-Turn Mind2Web (MT-Mind2Web) is developed. To overcome the challenges of limited context length in LLMs and the context-dependency inherent in conversational tasks, the authors propose a new framework: self-reflective memory-augmented planning (Self-MAP). This framework leverages memory utilization and self-reflection techniques. The paper benchmarks the MT-Mind2Web dataset with extensive experiments, validating the effectiveness of the proposed Self-MAP method.

Official Source: https://aclanthology.org/2024.acl-long.477/ PDF Link: https://aclanthology.org/2024.acl-long.477.pdf

  • Publication Status: Officially published at ACL 2024.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper aims to solve is the limited ability of LLM-powered web agents to handle multi-turn user instructions in real-world web navigation tasks. While LLMs have shown remarkable capabilities in planning and executing multi-step interactions in web environments for single-turn instructions (e.g., booking tickets), their potential for sequential, conversational interactions with users remains largely unexplored.

This problem is critically important for advancing AI agents towards more human-like, intuitive, and practical applications. In real-world interactions, users rarely provide all necessary information or instructions in a single, perfectly structured query. Instead, conversations are often sequential, involving follow-up questions, co-referencing instructions (where later instructions refer to entities or concepts mentioned earlier without full repetition), and brief/succinct instructions that rely heavily on prior context. Current web agents often treat each instruction as standalone, failing to leverage the rich context accumulated over a conversation.

Specific challenges existing in prior research include:

  1. Lack of Multi-turn User Instruction Handling: Existing web navigation datasets and methods primarily focus on completing a single, explicit instruction.

  2. Limited Context Length of LLMs: LLMs have a finite input window (context length). Conversational web navigation requires maintaining a long history, including both user-agent conversations and agent-environment interactions (the steps taken by the agent on the webpage), which can quickly exceed LLM context limits.

  3. Context-Dependency: Follow-up instructions are inherently dependent on previous turns. Ignoring this context leads to failures.

  4. Noisy History: The rich conversational history can also be noisy, containing irrelevant information that can confuse the LLM if not managed properly.

    The paper's entry point and innovative idea is to introduce the new task of Conversational Web Navigation. This task explicitly models sophisticated interactions that span multiple turns with both the users and the environment. To tackle the identified challenges, they propose Self-MAP, a framework that strategically uses memory and self-reflection to manage context and improve planning.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Definition of Conversational Web Navigation: It formally defines a new, challenging task that extends web agent capabilities to multi-turn user instruction following, requiring interaction with both users and dynamic web environments. This addresses a crucial gap in current web agent research.
  • Introduction of MT-Mind2Web Dataset: To enable research and benchmarking for the new task, the paper introduces MT-Mind2Web, a novel dataset specifically designed for Conversational Web Navigation. This dataset is built upon Mind2Web (an expert-annotated web navigation dataset), adapting single-turn interactions into conversation sessions through a meticulous human-AI collaborative annotation process involving instruction decomposition and conversational rewriting.
  • Proposal of Self-MAP Framework: The paper proposes self-reflective memory-augmented planning (Self-MAP), a novel framework tailored to address the inherent challenges of conversational web navigation. Self-MAP tackles LLM context length limitations and context-dependency through three key components: a Memory Module (using multifaceted matching for relevant snippet retrieval), a Reflection Module (performing memory simplification and memory refinement with LLM-generated rationales), and a Planning Module (leveraging the processed memory).
  • Extensive Benchmarking and Validation: The authors conduct comprehensive experiments to benchmark the MT-Mind2Web dataset against various state-of-the-art web navigation and conversational task baselines.
  • Key Findings:
    • Self-MAP consistently and substantially outperforms all baselines across cross-task, cross-website, and cross-subdomain evaluation settings, demonstrating its effectiveness in handling conversational web navigation.

    • Ablation studies reveal that Memory Simplification is the most critical component for enhancing performance, highlighting the importance of efficient context management.

    • Generation-based Planning is superior to Multi-choice Question Answering for LLM agents in this task, balancing generative capabilities with context efficiency.

    • Multifaceted Matching for memory retrieval significantly outperforms simple chronological memory prepending, proving the necessity of filtering noise and focusing on relevance.

    • The Memory Refinement component, while effective, showed relatively lower generalizability in cross-website and cross-subdomain scenarios compared to cross-task.

      These contributions collectively address a significant challenge in LLM-powered AI agents, pushing the boundaries of web automation towards more natural and effective human-agent interaction.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the contributions of this paper, a beginner should understand several core concepts:

  • Large Language Models (LLMs): At their core, LLMs are neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They learn to predict the next word in a sequence, which imbues them with impressive capabilities for understanding, generating, and processing human language. In the context of web agents, LLMs are used for:

    • Planning: Decomposing complex tasks into a sequence of actionable steps.
    • Natural Language Understanding (NLU): Interpreting user instructions and the text content of webpages.
    • Natural Language Generation (NLG): Formulating responses or actions in a human-understandable format.
    • Limited Context Length: A crucial constraint of LLMs is that they can only process a finite amount of input text (tokens) at once. This "context window" limits how much history an LLM can explicitly consider, making long conversations or complex environment states challenging to manage.
  • Web Agents: An AI agent designed to interact with web browsers and web pages to accomplish tasks specified by a user. Unlike traditional chatbots, web agents can perceive (read HTML, identify UI elements), reason (plan actions), and act (click buttons, type text, scroll) within a dynamic web environment. Their goal is to automate web-based tasks, mimicking human browser usage.

  • Multi-turn Interaction: This refers to a sequence of exchanges or actions that build upon each other. In this paper, it has two dimensions:

    • User-Agent Multi-turn (Conversational): The user provides instructions sequentially, where later instructions might depend on previous ones (e.g., "Find me a flight to London." followed by "Now find a hotel near the airport.").
    • Agent-Environment Multi-turn (Multi-step Task Execution): The agent needs to perform a series of actions on a webpage to complete a single instruction (e.g., to book a flight, it might need to click a search button, type in a destination, select dates, and confirm). Conversational Web Navigation combines both.
  • Context-Dependency: This means that the meaning or correct execution of a current instruction or action is contingent upon information or events from previous turns in a conversation or interaction sequence. For example, if a user says "Book that one," "that" refers to something previously discussed.

  • Fine-tuning: A process where a pre-trained LLM (trained on a massive general text corpus) is further trained on a smaller, task-specific dataset. This specialized training adapts the LLM's knowledge and capabilities to a particular domain or task (e.g., web navigation), improving its performance beyond what it could achieve with only its general pre-training.

  • Memory-Augmented Models: These are AI models that are equipped with an external memory component to store and retrieve information beyond their immediate context window. This allows them to "remember" and utilize relevant past information over longer durations, overcoming the context length limitation of LLMs.

  • Self-reflection: In the context of AI agents, self-reflection refers to the agent's ability to analyze its own past actions, decisions, and reasoning processes to identify errors, improve its understanding, or generate better plans for future actions. It's a meta-cognitive ability where the agent "thinks about its thinking."

3.2. Previous Works

The paper positions itself within the broader context of web agents and multi-turn interactions.

3.2.1. Web Agents

Early web agents focused on simplified environments (Shi et al., 2017; Liu et al., 2018). More recent work, often LLM-powered, has advanced to more complex settings:

  • Mind2Web (Deng et al., 2023): A key predecessor to this paper, Mind2Web introduced an expert-annotated web navigation dataset covering diverse domains and real-world interactions. It focused on single-turn user instructions, meaning each instruction was a standalone request. The proposed MINDACT framework from Mind2Web served as a strong baseline here.
  • WebShop (Yao et al., 2022): This platform for web agents focused on online shopping tasks, showcasing scalable real-world web interaction with grounded language agents.
  • WebArena (Zhou et al., 2024): Provides a realistic web environment for building autonomous agents, addressing real-time interactions.
  • Visual UI Understanding: Some works like Zheng et al. (2024a) incorporate visual information from the user interface (UI) to enhance understanding.
  • Prompt-based Methods: Various techniques to improve LLM-powered web agents without extensive fine-tuning, such as recursive self-correction prompting (Kim et al., 2023), code-based prompting (Sun et al., 2023), and trajectory-augmented prompting (Zheng et al., 2024b, Synapse). These methods leverage LLMs' inherent capabilities through carefully crafted prompts. However, the paper notes that prompt-based methods often struggle to compete with fine-tuned methods in advanced settings like Mind2Web.

3.2.2. Multi-turn Interactions with Environment

These works enable LLM-powered agents to handle challenging tasks by interacting with external environments:

  • Code-grounded environments (Xu et al., 2024; Hong et al., 2024): Agents interact with databases or perform programming tasks.
  • Game-grounded environments (Shridhar et al., 2021): Agents play games.
  • Web-grounded environments (Deng et al., 2023; Yao et al., 2022): Agents navigate webpages or shop online. The crucial distinction here is that these works primarily focus on completing a single, standalone user instruction by planning a sequence of actions within the environment. Some studies (Wang et al., 2024; Xie et al., 2023) investigate using multi-turn user feedback to solve a given task, but they don't typically involve sequential, context-dependent user instructions over multiple turns of interaction for different sub-tasks.

3.2.3. Multi-turn Interactions with Users

This category includes LLM capabilities in engaging in conversations with human users for various tasks:

  • Recommendation (He et al., 2023; Huang et al., 2023): Conversational systems that recommend items.
  • Tutoring (Dan et al., 2023; Deng et al., 2024): LLMs acting as educational tutors.
  • Counseling (Zheng et al., 2023b): LLMs providing emotional support.
  • MT-Bench (Zheng et al., 2023a): A popular benchmark for evaluating LLMs' multi-turn instruction-following ability across 80 high-quality multi-turn questions.
  • Conversational Information Seeking (Pan et al., 2023): LLMs retrieve information in a conversational manner. The key limitation of these approaches, highlighted by the paper, is that they mainly rely on the LLMs' inherent knowledge or perform a one-time request from an external environment for each turn. They do not require dynamic, multi-step interaction with an environment for multiple times within a single conversation turn, nor do they manage multi-turn agent-environment interaction history as part of the conversational context.

3.2.4. Context-Aware Query Rewriting (CAR)

  • Context-Aware Query Rewriting (CAR) (Anand et al., 2023): This technique aims to make a query self-contained by incorporating relevant information from the conversational history. The original query, which might be co-referential or elliptical, is rewritten into a query that can be understood without needing the preceding turns. This helps systems that process queries in isolation. This paper adapts CAR as a baseline to see if simply rewriting instructions is sufficient for conversational web navigation.

3.3. Technological Evolution

The evolution of AI agents has progressed from simple, rule-based systems to more sophisticated LLM-powered entities. Initially, web agents were often task-specific and required extensive programming for each interaction. The advent of LLMs significantly boosted agent capabilities by providing powerful natural language understanding and planning skills. However, these LLM-powered agents largely focused on single-turn tasks or multi-step tasks stemming from a single instruction. The field then recognized the importance of multi-turn user interactions for real-world usability in general conversational LLMs. This paper bridges the gap between LLM-powered web agents and multi-turn user interactions, proposing that web agents must not only execute multi-step actions but also interpret multi-turn conversational instructions that are context-dependent. This represents a step towards truly generalist, human-like AI assistants that can engage in sustained, meaningful dialogues while interacting with complex digital environments.

3.4. Differentiation Analysis

Compared to the main methods in related work, this paper's approach presents several core innovations:

  • Focus on Conversational Web Navigation: Unlike previous web agent research that primarily addresses single-turn instructions (e.g., Mind2Web, WebShop), this work explicitly defines and tackles the more complex multi-turn conversational instruction following in dynamic web environments. This is a critical distinction as it requires managing context-dependency across turns.

  • Integrated User-Agent and Agent-Environment Interactions: While some conversational LLMs interact with users and some web agents interact with environments, this paper's task and framework necessitate sophisticated multi-turn interactions with both. The conversational history in MT-Mind2Web includes both user queries and the agent's actions on the webpage, making the context much richer and more challenging than traditional conversational tasks.

  • Novel Context Management for LLMs: The Self-MAP framework directly addresses the limited context length of LLMs and the noisy/long history characteristic of conversational web navigation.

    • Multifaceted Matching: Unlike simple kNN retrieval (as in Synapse) or fixed memory (as in MINDACT + Fixed), Self-MAP's multifaceted matching constructs a query using both the user instruction and the present agent action sequence. This allows for a more nuanced retrieval of memory snippets that are semantically relevant and share similar action trajectories, significantly reducing noise.
    • Memory Simplification: By applying a DOM element ranker to filter irrelevant elements from the environment state (HTML), Self-MAP efficiently compresses memory snippets, freeing up crucial context space for LLMs. This is a unique contribution tailored to the web environment.
    • Memory Refinement with LLM Rationales: Instead of just recalling past interactions, Self-MAP uses an LLM to generate explicit reasoning rationales for past actions. This self-reflection enriches the memory, providing a deeper understanding of why an action was taken, acting as a supervised signal to guide future planning. This differs from traditional self-reflection that often involves analyzing incorrect trajectories.
  • New Dataset MT-Mind2Web: The creation of a dedicated multi-turn dataset through human-AI collaborative annotation (decomposing instructions, rewriting with anaphora and ellipsis) is a significant enabler for this new research direction, providing a concrete benchmark where none existed previously.

    In essence, Self-MAP innovates by providing a structured, memory-augmented and self-reflective approach to overcome the context limitations and dependency issues that arise when LLM-powered web agents must engage in continuous, conversational interactions with users across dynamic web environments.

4. Methodology

4.1. Principles

The core idea of the proposed method, self-reflective memory-augmented planning (Self-MAP), is to enable LLM-powered web agents to effectively follow multi-turn user instructions in dynamic web environments. This involves two main principles:

  1. Intelligent Context Management: LLMs have limited context windows. Conversational web navigation generates a very long and potentially noisy interaction history (both user-agent and agent-environment). The principle here is to intelligently select, simplify, and refine only the most relevant parts of this history to fit within the LLM's context window, rather than simply truncating or appending all information. This is achieved through memory utilization (retrieval and simplification).

  2. Enhanced Planning through Self-Reflection: Beyond just recalling past actions, the agent should understand why certain actions were taken in previous, similar situations. The principle of self-reflection is applied to generate explicit reasoning rationales for past actions, which enriches the memory snippets and provides a deeper, more actionable context for the LLM to plan its next move.

    These principles combine to address the context length limitation and context-dependency issues inherent in conversational web navigation.

4.2. Core Methodology In-depth (Layer by Layer)

The Self-MAP framework is composed of three main components: a Memory Module, a Reflection Module, and a Planning Module. The overall pipeline can be visualized in Figure 3.

4.2.1. Problem Definition

The paper defines the task of Conversational Web Navigation. An agent must engage in multi-turn interactions with both the user and the web environment. Given:

  • The conversational interaction history $C_t = \{q_1, A_1, \ldots, A_{t-1}, q_t\}$

    • $q_i$: User instruction at turn $i$.
    • $A_i = \{a_i^1, a_i^2, \ldots, a_i^k\}$: Environment interaction history (sequence of actions) performed by the agent to fulfill user instruction $q_i$. Each $a_i^j$ is a single atomic action (e.g., click, type).
  • The current environment state $E_t$ (e.g., the HTML of the current webpage).

    Objective:

  • To accurately predict the action sequence $A_t$ required to accomplish the current user instruction $q_t$. This action sequence encompasses both the target element on the webpage for interaction and the specific operation to be performed.
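
To make the notation concrete, here is a minimal Python sketch (hypothetical class and field names, not taken from the paper's code) of how the conversational interaction history $C_t$, the per-turn action sequences $A_i$, and the current environment state $E_t$ could be represented:

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Action:
    """A single atomic action a_i^j: one operation on one target element."""
    element_id: str          # identifier of the target DOM element
    operation: str           # e.g., "CLICK", "TYPE", "SELECT"
    value: str = ""          # e.g., text to type or option to select

@dataclass
class Turn:
    """One conversation turn: user instruction q_i plus its action sequence A_i."""
    instruction: str                                      # q_i
    actions: List[Action] = field(default_factory=list)  # A_i = {a_i^1, ..., a_i^k}

@dataclass
class ConversationHistory:
    """C_t: all completed turns plus the current instruction and environment."""
    past_turns: List[Turn] = field(default_factory=list)  # (q_1, A_1), ..., (q_{t-1}, A_{t-1})
    current_instruction: str = ""                          # q_t
    current_environment_html: str = ""                     # E_t, HTML of the current webpage
```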

4.2.2. Self-MAP Framework Overview

The Self-MAP framework (Figure 3 from the original paper, below) processes the current user instruction ($q_t$), the current environment state ($E_t$), and the conversational history ($C_t$) to generate the next action.

Figure 3: Overview of Self-MAP, illustrating memory-based retrieval of the interaction history, the reflection process of memory refinement and simplification, and planning and action execution guided by the resulting self-reflective memory.

The core flow is:

  1. The Memory Module retrieves relevant past interactions from a memory bank.
  2. The Reflection Module then refines these retrieved memories by simplifying the environment states and enriching them with LLM-generated rationales.
  3. Finally, the Planning Module uses these self-reflective memories along with the current context to decide the next action.

4.2.3. Memory Module

The Memory Module is designed to construct and intelligently query a memory bank of past interactions.

  • Memory Bank Construction: Each memory snippet stored in the bank represents an individual interaction step from the conversational history. A memory snippet $M_t^k$ is represented as $M_t^k = \{q_t, A_t^{k-1}, \hat{E}_t^k, a_t^k\}$

    • $q_t$: The user instruction at turn $t$.
    • $A_t^{k-1}$: The sequence of agent-environment interactions (trajectory) that occurred before the $k$-th action at the current conversation turn $t$.
    • $\hat{E}_t^k$: The environment state (e.g., HTML) before the $k$-th action at turn $t$. The hat indicates that this is an approximation or snapshot of the state at that step.
    • $a_t^k$: The $k$-th action (target element and operation) taken at turn $t$. The challenge is that injecting all these memory snippets into the LLM's current running memory (its input context) would quickly exceed its maximum input length. Additionally, many snippets might be irrelevant or inconsistent with the current environment, introducing noise.
  • Multifaceted Matching: To address the context length limitation and noise, the Memory Module employs a multifaceted matching approach to retrieve only the top-$K$ most relevant snippets. This matching happens at the action level.

    • Query Construction: Given an ongoing conversational interaction at turn $t$ and action step $k$, $C_t^k = \{q_1, A_1, \ldots, q_t, A_t^{k-1}\}$, where $A_t^{k-1} = \{a_t^1, a_t^2, \ldots, a_t^{k-1}\}$ represents the partial trajectory of agent-environment interactions at the current conversation turn. The query for retrieval is constructed using both the current user instruction and the present agent action sequence: $(q_t, A_t^{k-1})$.
    • Encoding Semantics and Trajectory: This query structure is designed to encode two types of relevance:
      1. Semantic relevance: How semantically similar the current user instruction $q_t$ is to instructions in the memory bank.
      2. Action trajectory similarity: How similar the current partial action sequence $A_t^{k-1}$ is to action sequences in the memory bank.
    • Embedding and Retrieval:
      • The paper uses OpenAI's text-embedding-ada-002 to convert the query and all memory snippets into vector representations (embeddings).
      • Cosine similarity is then computed between the query embedding and each memory snippet embedding in the embedding space.
      • The top-$K$ memory snippets with the highest cosine similarity scores are retrieved.
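
As a minimal sketch of this retrieval step (assuming a generic `embed` function standing in for text-embedding-ada-002; all names here are hypothetical, not from the paper's code), the query combining the current instruction with the partial action trajectory is compared against every stored snippet by cosine similarity, and the top-$K$ matches are kept:

```python
import numpy as np
from typing import Callable, List, Tuple

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve_memory(
    query_instruction: str,              # q_t
    partial_actions: List[str],          # textual rendering of A_t^{k-1}
    memory_texts: List[str],             # textual rendering of each stored snippet
    embed: Callable[[str], np.ndarray],  # stand-in for text-embedding-ada-002
    top_k: int = 3,
) -> List[Tuple[int, float]]:
    """Return (index, score) pairs of the top-K most similar memory snippets."""
    # Multifaceted query: current instruction plus the partial action trajectory.
    query = query_instruction + " | " + " ; ".join(partial_actions)
    q_emb = embed(query)
    scores = [(i, cosine_similarity(q_emb, embed(m))) for i, m in enumerate(memory_texts)]
    # Keep the K snippets with the highest cosine similarity.
    return sorted(scores, key=lambda x: x[1], reverse=True)[:top_k]
```

In practice the snippet embeddings would be precomputed and cached, so only the query needs to be embedded at inference time.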

4.2.4. Reflection Module

The Reflection Module further processes the retrieved memory snippets to maximize the utility of the limited memory space for the LLM. It involves two steps: Memory Simplification and Memory Refinement.

  • Memory Simplification:

    • Purpose: To remove task-irrelevant and noisy elements from the environment state within each memory snippet, thereby saving memory space and allowing more relevant information to be retained.
    • Method: The process is inspired by the MINDACT framework's candidate generation. A small pre-trained Language Model (e.g., DeBERTa) acts as a ranker. This ranker identifies and ranks DOM elements from the environment state ($\hat{E}_t^k$) that are most relevant to the instruction and the current step. The simplified environmental state is denoted as $e_t^k$, replacing the original $\hat{E}_t^k$ in the memory snippet.
  • Memory Refinement:

    • Purpose: To enrich the memory information by generating intermediate reasoning rationales for past actions, serving as a supervised signal. This goes beyond merely storing the action; it stores why the action was taken.
    • Method: This step leverages the reasoning capability of LLMs (specifically ChatGPT with gpt-3.5-turbo-1106). For each retrieved memory snippet $(q_t, A_t^{k-1}, a_t^k)$, an LLM is prompted to generate an in-depth rationale $r_t^k$. This rationale explains the decision-making process that led to the execution of the next action $a_t^k$.
    • Distinction from Traditional Self-reflection: Unlike some self-reflection methods (e.g., Reflexion by Shinn et al., 2023) that collect and analyze incorrect trajectories, this approach focuses on generating rationales for correct past actions within a static evaluation setting, primarily to enrich the memory rather than debug errors.
  • Self-reflective Memory Snippet: After these two steps, each processed memory snippet becomes a self-reflective memory snippet, denoted as $\hat{M}_t^k = \{q_t, A_t^{k-1}, e_t^k, a_t^k, r_t^k\}$. This snippet is concise (due to memory simplification) and informative (due to memory refinement).
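
The two reflection steps can be pictured as a small post-processing pass over each retrieved snippet. In this sketch, `simplify_environment` stands in for the DeBERTa-based element ranker and `generate_rationale` for the ChatGPT rationale prompt described above; both are placeholders, and the field names are hypothetical rather than the paper's implementation:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class SelfReflectiveSnippet:
    """A self-reflective memory snippet {q_t, A_t^{k-1}, e_t^k, a_t^k, r_t^k}."""
    instruction: str          # q_t
    prior_actions: List[str]  # A_t^{k-1}
    simplified_state: str     # e_t^k: only the top-ranked, task-relevant DOM elements
    action: str               # a_t^k
    rationale: str            # r_t^k: LLM-generated reasoning behind the action

def reflect(
    snippet: Dict,
    simplify_environment: Callable[[str, str], str],
    generate_rationale: Callable[[str, List[str], str], str],
) -> SelfReflectiveSnippet:
    # Memory simplification: keep only task-relevant DOM elements from the state.
    e_tk = simplify_environment(snippet["environment_html"], snippet["instruction"])
    # Memory refinement: ask an LLM why the recorded action was the right next step.
    r_tk = generate_rationale(snippet["instruction"], snippet["prior_actions"], snippet["action"])
    return SelfReflectiveSnippet(
        instruction=snippet["instruction"],
        prior_actions=snippet["prior_actions"],
        simplified_state=e_tk,
        action=snippet["action"],
        rationale=r_tk,
    )
```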

4.2.5. Planning with Self-reflective Memory

The final stage is to use the self-reflective memory to plan the next action.

  • Input to the LLM: For each interaction step $k$ at the current conversation turn $t$, the LLM receives an input consisting of $(q_t, A_t^{k-1}, e_t^k, \mathcal{M}_t^k)$:

    • $q_t$: The current user instruction.
    • $A_t^{k-1}$: The partial action sequence executed so far in the current turn.
    • $e_t^k$: The simplified current environment state (the top-$N$ candidate DOM elements from $E_t^k$ after being ranked by the DeBERTa ranker, similar to memory simplification).
    • $\mathcal{M}_t^k = \{\hat{M}\}^K$: The top-$K$ self-reflective memory snippets retrieved and refined by the Reflection Module.
  • Action Generation: The LLM (specifically a fine-tuned Flan-T5) is tasked with planning the next action $a_t^k$. This action includes:

    1. The target element to interact with.
    2. The operation to perform on that element (e.g., CLICK, TYPE, SELECT).
  • Planning Paradigms: The paper explores two types of planning paradigms for the LLM (details in Appendix B.2):

    1. Multi-choice Question Answering (MCQ): The LLM selects the target element from a predefined list of options (typically the top-$N$ candidate elements identified by the DeBERTa ranker) and then determines the operation.

    2. Direct Generation: The LLM directly generates the target element and the operation as free-form text. The paper's ablation study shows Direct Generation generally performs better.

      This structured approach allows Self-MAP to leverage relevant historical context efficiently and effectively for planning actions in conversational web navigation, overcoming the inherent limitations of LLMs in handling complex, dynamic, and lengthy interaction histories.
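
As an illustration of the generation-based planning paradigm, the following minimal sketch serializes $(q_t, A_t^{k-1}, e_t^k, \mathcal{M}_t^k)$ into one prompt and decodes the next action with a Flan-T5 checkpoint via Hugging Face Transformers. The prompt template, section labels, and output format are illustrative assumptions rather than the paper's exact design, and in practice a checkpoint fine-tuned on MT-Mind2Web would replace the base model:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

def build_planner_input(instruction, prior_actions, simplified_state, memories):
    """Serialize (q_t, A_t^{k-1}, e_t^k, M_t^k) into one text prompt (hypothetical format)."""
    memory_block = "\n".join(
        f"[Memory] instruction: {m.instruction} | action: {m.action} | rationale: {m.rationale}"
        for m in memories  # self-reflective snippets, e.g., as sketched in the Reflection Module
    )
    prior = " ; ".join(prior_actions) if prior_actions else "None"
    return (
        f"{memory_block}\n"
        f"[Instruction] {instruction}\n"
        f"[Previous actions] {prior}\n"
        f"[Candidate elements] {simplified_state}\n"
        f"Next action:"
    )

def plan_next_action(prompt: str, model_name: str = "google/flan-t5-base") -> str:
    """Generate the next action (target element plus operation) as free-form text."""
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
    inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
    output_ids = model.generate(**inputs, max_new_tokens=64)
    # A fine-tuned planner would decode to something like "<element> CLICK" or "<element> TYPE <value>".
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)
```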

5. Experimental Setup

5.1. Datasets

The core dataset used in this work is Multi-Turn Mind2Web (MT-Mind2Web).

  • Source: MT-Mind2Web is constructed from the existing Mind2Web (Deng et al., 2023) dataset. Mind2Web provides single-turn web navigation interactions annotated by experts, which serves as the foundation for ensuring the quality of agent responses.

  • Construction Process: The construction focuses on generating conversational user instructions while reusing the expert-annotated action sequences from Mind2Web. This ensures the feasibility and correctness of the underlying web interactions. The process involves three main steps (as illustrated in Figure 2 from the original paper, below):

    Figure 2: Overall pipeline for MT-Mind2Web creation with examples, covering the three steps of organizing conversation sessions, decomposing complex instructions, and rewriting conversational instructions, with emphasis on handling conversational context.

    1. Organize Conversation Sessions: Consecutive single-turn instructions from Mind2Web that share the same context (e.g., domain, website, entities, intents) are grouped into conversation sessions. For example, two separate instructions about ticket booking on the same website are combined into one session.
    2. Decompose Complex Instructions: Instructions in Mind2Web with long action sequences are often unnatural for single conversational turns. The paper employs a human-AI collaborative annotation approach for decomposition:
      • ChatGPT is initially used to divide an original complex instruction and its action sequence into $N$ subtasks, each with a corresponding action sub-sequence. The target number of subtasks is set as $N = \lceil N'/4 \rceil$, where $N'$ is the number of actions in the original instruction (e.g., an instruction with 9 actions is split into $\lceil 9/4 \rceil = 3$ subtasks).
      • Human annotators then verify and refine these AI-generated decompositions to ensure they are reasonable and executable, reflecting natural conversational flow. ChatGPT achieved a high pass rate (98.5%) in this step.
      • Example (from Figure 2): An Action Sequence 1 is broken down into Action Sub-sequence 1-1 and Action Sub-sequence 1-2.
    3. Rewrite Conversational Instructions: The original standalone instructions are rephrased to be conversational using anaphora (pronouns or phrases referring to something mentioned earlier) and ellipsis (omission of words/phrases that are implied). This makes the conversation flow naturally.
      • Example (from Figure 2): If T1 mentions a "WWE ticket," T2 might refer to it as "one." If T3 and T4 involve the same action like "booking," the verb might be omitted in T4.
  • Quality Control: A quality verification process is implemented to ensure the coherence and correctness of the constructed conversations. Annotators fix any issues until verification is passed.

  • Dataset Statistics: The resulting MT-Mind2Web dataset is summarized below (Table 1 of the original paper):

    |  | Train | Cross-Task (Test) | Cross-Website (Test) | Cross-Subdomain (Test) |
    | --- | --- | --- | --- | --- |
    | # Conversations | 600 | 34 | 42 | 44 |
    | # Turns | 2,896 | 191 | 218 | 216 |
    | Avg. # Turn/Conv. | 4.83 | 5.62 | 5.19 | 4.91 |
    | Avg. # Action/Turn | 2.95 | 3.16 | 3.01 | 3.07 |
    | Avg. # Element/Turn | 573.8 | 626.3 | 620.6 | 759.4 |
    | Avg. Inst. Length | 36.3 | 37.4 | 39.8 | 36.2 |
    | Avg. HTML Length | 169K | 195K | 138K | 397K |

    Table 1: Statistics of the MT-Mind2Web dataset.

    • Total Conversations: 720 sessions (600 for training, 120 for testing).
    • Total Instruction/Action Pairs: 3,525.
    • Average Turns per Conversation: Approximately 5 turns.
    • Average HTML Length: The HTML content for each turn is very long (average 169K tokens in training, up to 397K in cross-subdomain test set), highlighting the severe context length challenges.
  • Test Splits: The test set is divided into three subsets to evaluate generalization capabilities, mirroring Mind2Web:

    • Cross-Task: Evaluates generalization to new tasks within familiar websites/domains. (34 samples)

    • Cross-Website: Evaluates generalization to new websites within familiar domains. (42 samples from various specific sites like "redbox", "viator").

    • Cross-Subdomain: Evaluates generalization to entirely new subdomains. (44 samples from "Digital" and "Hotel" domains).

      The dataset effectively validates methods due to its multi-turn conversational nature and challenging generalization splits.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to evaluate the performance of web agents in conversational web navigation, building upon metrics from single-turn web navigation. All reported metrics are macro averages, meaning they are first calculated per task and then averaged over all tasks.

  1. Element Accuracy (Ele. Acc):

    • Conceptual Definition: This metric measures whether the agent correctly identifies and selects the target user interface (UI) element on the webpage that corresponds to the required action. It's a measure of the agent's ability to ground the natural language instruction to the correct interactive element.
    • Mathematical Formula: $ \text{Ele. Acc} = \frac{\text{Number of correctly identified elements}}{\text{Total number of elements to be identified}} $
    • Symbol Explanation:
      • Number of correctly identified elements: The count of instances where the agent's chosen element perfectly matches the ground-truth target element for an action step.
      • Total number of elements to be identified: The total count of ground-truth target elements across all action steps in the evaluation.
  2. Operation F1 (Op. F1):

    • Conceptual Definition: This metric assesses the agent's ability to predict the correct operation (e.g., CLICK, TYPE, SELECT) and its associated value (e.g., the text to type, the option to select) for a given element. It uses the F1 score at the token level, which is a harmonic mean of precision and recall, making it suitable for evaluating text generation tasks where partial matches can occur.
    • Mathematical Formula: $ \text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $ $ \text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $ $ \text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ For Operation F1, these are typically calculated at the token level, where True Positives are tokens correctly predicted in the operation string, False Positives are tokens predicted but not in the ground truth, and False Negatives are tokens in the ground truth but not predicted.
    • Symbol Explanation:
      • True Positives: Tokens that are correctly identified by the model in both the predicted operation and the ground-truth operation.
      • False Positives: Tokens identified by the model in the predicted operation but not present in the ground-truth operation.
      • False Negatives: Tokens present in the ground-truth operation but not identified by the model in the predicted operation.
  3. Step Success Rate (SSR):

    • Conceptual Definition: This is a stringent measure of an agent's performance at the individual action step level. A single interaction step is considered successful only if both the selected element (Element Accuracy) and the predicted operation (Operation F1) are correct. This ensures that the agent not only finds the right UI component but also interacts with it in the correct manner.
    • Mathematical Formula: $ \text{SSR} = \frac{\text{Number of successful steps (correct element AND correct operation)}}{\text{Total number of steps}} $
    • Symbol Explanation:
      • Number of successful steps: Count of steps where the agent's chosen element matches the ground truth AND its predicted operation (including value) matches the ground truth.
      • Total number of steps: The total number of atomic actions required across all tasks being evaluated.
  4. Turn Success Rate (TSR):

    • Conceptual Definition: This is the most demanding metric and represents the ultimate goal in conversational web navigation. A conversation turn (which might involve multiple interaction steps to complete one user instruction) is considered successful only if all the individual steps within that turn have succeeded according to the SSR criterion. This metric directly evaluates the agent's ability to fully complete a user's instruction within a conversational context. The paper notes that TSR can be regarded as the main metric.
    • Mathematical Formula: $ \text{TSR} = \frac{\text{Number of successful turns (all steps in turn succeed)}}{\text{Total number of turns}} $
    • Symbol Explanation:
      • Number of successful turns: Count of conversation turns where every single action step performed by the agent to fulfill the user's instruction at that turn was correct.
      • Total number of turns: The total number of user-agent conversation turns being evaluated.
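
A short reference implementation helps pin down how these metrics compose. The sketch below assumes a simple data layout (per-turn lists of predicted/gold element IDs and operation strings), approximates the operation-correctness check for SSR with exact string match, and computes values over a single set of turns; the macro averaging described above would be obtained by averaging these results over tasks:

```python
from collections import Counter
from typing import Dict, List, Tuple

def token_f1(pred: str, gold: str) -> float:
    """Token-level F1 between a predicted and gold operation string (incl. value)."""
    pred_toks, gold_toks = pred.split(), gold.split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if not pred_toks or not gold_toks or overlap == 0:
        return float(pred_toks == gold_toks)  # 1.0 only when both are empty
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

# One step = (pred_element, gold_element, pred_operation, gold_operation); one turn = list of steps.
def evaluate(turns: List[List[Tuple[str, str, str, str]]]) -> Dict[str, float]:
    ele_hits, op_f1_sum, step_success, turn_success, n_steps = 0, 0.0, 0, 0, 0
    for steps in turns:
        all_steps_ok = True
        for pred_el, gold_el, pred_op, gold_op in steps:
            n_steps += 1
            ele_ok = pred_el == gold_el
            op_ok = pred_op == gold_op  # approximation of "operation correct"
            ele_hits += ele_ok
            op_f1_sum += token_f1(pred_op, gold_op)
            if ele_ok and op_ok:
                step_success += 1       # step succeeds only if element AND operation are correct
            else:
                all_steps_ok = False
        turn_success += all_steps_ok    # turn succeeds only if every step succeeds
    return {
        "Ele. Acc": ele_hits / n_steps,
        "Op. F1": op_f1_sum / n_steps,
        "SSR": step_success / n_steps,
        "TSR": turn_success / len(turns),
    }
```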

5.3. Baselines

The paper compares Self-MAP against several state-of-the-art baselines, adapting them for the conversational web navigation task. These baselines represent different approaches to web navigation and conversational tasks.

  1. DeBERTa (He et al., 2021):

    • Description: This is a pre-trained Language Model (Decoding-enhanced BERT with disentangled attention). In this context, it's used as a ranker for selecting target elements from the HTML. It represents a baseline for element selection without explicit operation prediction or advanced LLM-based planning.
    • Why Representative: It establishes a lower bound for performance focused purely on identifying the correct UI element, similar to how MINDACT uses a DeBERTa-based ranker for candidate generation.
  2. MINDACT (Deng et al., 2023):

    • Description: The original framework from the Mind2Web paper. It performs multi-choice question answering (MCQ) to select a target element from a list of options. For the conversational setting, its input is adapted to include the entire conversational interaction history. The paper evaluates MINDACT with two LLM backbones: GPT-3.5 (in-context learning) and Flan-T5 (fine-tuned).
    • Why Representative: It's the direct predecessor for single-turn web navigation and a strong fine-tuned baseline. Its GPT-3.5 variant shows how in-context learning fares without fine-tuning on the specific conversational task.
  3. MINDACT + CAR (Anand et al., 2023):

    • Description: This baseline integrates Context-Aware Rewriting (CAR) using ChatGPT. Before feeding the instruction to MINDACT, CAR attempts to reconstruct a self-contained instruction from the conversational instruction and the conversation context. This rewritten instruction is then used as input for MINDACT.
    • Why Representative: It represents a common strategy in conversational AI to handle context-dependency by making each turn's instruction independent. It tests whether rewriting alone is sufficient for conversational web navigation.
  4. MINDACT + Fixed (Huq et al., 2023):

    • Description: This baseline uses a fixed memory selection strategy. It prepends a fixed number of initial turns (the first 3 turns) from the conversation history as memory to MINDACT's input. The work by Huq et al. (2023) suggested that fixed examples can sometimes outperform relevance-based selection in demonstration-based learning.
    • Why Representative: It tests a simple, static memory augmentation strategy, contrasting with dynamic retrieval methods.
  5. Synapse (Zheng et al., 2024b):

    • Description: A state-of-the-art method from Mind2Web competition, which employs metadata (website, domain, subdomain, task) for kNN-based exemplar retrieval. For MT-Mind2Web, given that website, domain, and subdomain are constant within a conversation, only the task is used in the metadata for turn-level kNN retrieval.

    • Why Representative: It represents a sophisticated memory retrieval approach (exemplar-based prompting) from single-turn web agents and checks its adaptability to a conversational setting.

      These baselines cover a spectrum from basic element selection to advanced LLM-powered planning with various memory management and context handling strategies, providing a robust comparison for Self-MAP.

5.4. Implementation Details

  • Candidate HTML Element Ranker:

    • The DeBERTa-v3-base (He et al., 2021) model is fine-tuned as the ranker for selecting candidate HTML elements.
    • Training: During training, 5 random elements (including the positive ground-truth candidate) are used.
    • Evaluation: For evaluation, the ranker selects the top-50 elements when compared in groups of 5.
    • Hyperparameters: Batch size = 32, learning rate = $3 \times 10^{-5}$, trained for 5 epochs.
  • Action Planning (Generation Model):

    • Flan-T5 (Chung et al., 2022) is used as the generation model, specifically the Flan-T5-base and Flan-T5-large versions.
    • It's used for both MCQ-based and generation-based action planning.
    • Context Length: The maximum sequence length is set to 2,048 tokens. However, the tokenizer's max context length is 512 tokens, implying that inputs larger than 512 tokens would be truncated or processed in chunks. This highlights the context length challenge addressed by Self-MAP. The system message, HTML, user input, and assistant response are tokenized separately.
    • Hyperparameters:
      • Batch size: 8 for Flan-T5-base, 4 for Flan-T5-large.
      • Learning rate: $5 \times 10^{-5}$.
      • Trained for 5 epochs.
  • Multifaceted Matching (Memory Module):

    • Embedding Model: OpenAI's text-embedding-ada-002 is used to generate embeddings for queries and memory snippets.
    • Similarity Metric: Cosine similarity is used to calculate the relevance between embeddings.
    • Number of Retrieved Memories ($K$): Set to $K=3$.
  • Memory Refinement (Reflection Module):

    • LLM for Rationale Generation: ChatGPT with the gpt-3.5-turbo-1106 version.
    • Generation Parameters: Maximum new tokens = 100, temperature = 0 (for deterministic output).
    • Input for Rationale: Only HTML snippets of the positive (ground-truth) element are provided to ChatGPT to generate rationales.
    • Default Rationale: If no positive element is found in the HTML snippet, a default rationale is used: "The assistant's answer is derived from the absence of a specific option in the provided HTML content, leading to the conclusion that none of the options provided are suitable for the user's task."
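
For the rationale-generation step, a rough sketch using the OpenAI Python client with the stated settings (gpt-3.5-turbo-1106, temperature 0, at most 100 new tokens) might look as follows; the prompt wording is an illustrative assumption, not the paper's actual prompt:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def generate_rationale(instruction: str, prior_actions: str,
                       positive_element_html: str, action: str) -> str:
    """Ask the LLM to explain why `action` was the right next step for this snippet."""
    prompt = (
        f"User instruction: {instruction}\n"
        f"Actions taken so far: {prior_actions}\n"
        f"HTML of the target element: {positive_element_html}\n"
        f"Next action taken: {action}\n"
        "Explain concisely the reasoning that leads from the instruction and context to this action."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
        max_tokens=100,
    )
    return response.choices[0].message.content
```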

6. Results & Analysis

6.1. Core Results Analysis

The following are the results from Table 2 of the original paper:

Legend: CT = Cross-Task, CW = Cross-Website, CS = Cross-Subdomain.

Base Model: Flan-T5-base

| Method | CT Ele. Acc | CT Op. F1 | CT SSR | CT TSR | CW Ele. Acc | CW Op. F1 | CW SSR | CW TSR | CS Ele. Acc | CS Op. F1 | CS SSR | CS TSR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DeBERTa (He et al., 2021) | 36.8 | - | - | - | 31.7 | - | - | - | 27.7 | - | - | - |
| MINDACT (GPT-3.5) (Deng et al., 2023) | 4.3 | 27.6 | 1.9 | 1.0 | 6.7 | 22.2 | 2.1 | 1.7 | 4.0 | 22.9 | 1.5 | 1.1 |
| MINDACT (Deng et al., 2023) | 43.2 | 79.1 | 36.6 | 14.2 | 38.8 | 69.4 | 29.2 | 15.2 | 41.9 | 77.2 | 35.5 | 15.7 |
| MINDACT + CAR (Anand et al., 2023) | 47.8 | 78.8 | 41.4 | 16.1 | 37.0 | 67.5 | 32.2 | 9.6 | 41.2 | 75.3 | 35.4 | 13.2 |
| MINDACT + Fixed (Huq et al., 2023) | 51.0 | 80.8 | 42.6 | 18.4 | 42.4 | 70.0 | 35.4 | 15.3 | 43.1 | 77.6 | 37.5 | 17.7 |
| Synapse (Zheng et al., 2024b) | 49.6 | 79.9 | 41.9 | 18.4 | 43.1 | 70.6 | 33.1 | 13.7 | 41.7 | 77.8 | 35.9 | 16.0 |
| Self-MAP | 56.2 | 82.5 | 47.1 | 24.7 | 48.3 | 71.8 | 40.6 | 18.2 | 46.4 | 79.1 | 38.3 | 20.8 |

Base Model: Flan-T5-large

| Method | CT Ele. Acc | CT Op. F1 | CT SSR | CT TSR | CW Ele. Acc | CW Op. F1 | CW SSR | CW TSR | CS Ele. Acc | CS Op. F1 | CS SSR | CS TSR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MINDACT (Deng et al., 2023) | 59.0 | 80.6 | 53.2 | 26.0 | 43.6 | 67.6 | 36.5 | 12.4 | 46.8 | 74.0 | 38.9 | 21.8 |
| MINDACT + CAR (Anand et al., 2023) | 54.5 | 79.5 | 47.8 | 19.8 | 43.2 | 69.2 | 36.1 | 12.2 | 44.5 | 75.0 | 40.2 | 15.6 |
| MINDACT + Fixed (Huq et al., 2023) | 58.0 | 79.7 | 51.3 | 26.4 | 46.2 | 69.7 | 37.6 | 15.2 | 47.4 | 74.9 | 38.8 | 21.4 |
| Synapse (Zheng et al., 2024b) | 57.5 | 82.0 | 50.0 | 23.2 | 45.1 | 69.0 | 37.1 | 13.0 | 47.4 | 74.1 | 39.3 | 19.4 |
| Self-MAP | 58.1 | 80.5 | 51.7 | 26.6 | 44.8 | 68.8 | 36.8 | 15.7 | 52.0 | 77.1 | 43.6 | 25.4 |

Table 2: Experimental results on MT-Mind2Web. TSR can be regarded as the main metric.

The experimental results in Table 2 provide a comprehensive evaluation of Self-MAP against various baselines on the MT-Mind2Web dataset, using both Flan-T5-base and Flan-T5-large as backbone models. TSR (Turn Success Rate) is highlighted as the main metric, indicating the agent's ability to complete an entire conversational turn successfully.

Key Observations and Analysis:

  • Weak Baselines (DeBERTa, MINDACT (GPT-3.5)):

    • DeBERTa (element selection only) shows very low Ele. Acc (27.7-36.8%) and no Op. F1, SSR, or TSR, confirming it's insufficient for complex web navigation.
    • MINDACT (GPT-3.5), which relies on in-context learning without fine-tuning, performs extremely poorly (1.0-1.7% TSR). This reinforces the finding that for intricate web interaction tasks, LLMs require fine-tuning or sophisticated context management beyond simple prompting.
  • Impact of Context-Aware Rewriting (CAR):

    • MINDACT + CAR often performs worse than the vanilla MINDACT, particularly in generalization settings (e.g., with Flan-T5-base, Cross-Website TSR drops to 9.6% vs. 15.2%), even though it helps on Cross-Task (16.1% vs. 14.2%). This suggests that simply rewriting conversational instructions using GPT-3.5 can obfuscate the original instruction, making it harder for the agent to understand and act correctly. The LLM-based rewriting might introduce errors or lose subtle context, leading to reduced performance, especially in generalization settings like Cross-Website.
  • Comparison of Memory Strategies (MINDACT + Fixed, Synapse):

    • Both MINDACT + Fixed and Synapse generally outperform the vanilla MINDACT. This validates the core idea that incorporating conversational interaction history (memory) is beneficial for conversational web navigation.
    • Surprisingly, Synapse (a SOTA method on Mind2Web utilizing kNN-based exemplar retrieval) performs worse than MINDACT + Fixed in several scenarios (e.g., Flan-T5-base Cross-Website TSR: 13.7% vs. 15.3%; Flan-T5-large Cross-Website TSR: 13.0% vs. 15.2%; Flan-T5-large Cross-Task TSR: 23.2% vs. 26.4%). This suggests that the coarse-grained kNN matching in Synapse, while effective for single-turn exemplar retrieval, is insufficient for effectively measuring the intricate relevance between the current conversation status and candidate memory snippets in the more complex conversational setting. The fixed memory approach, despite its simplicity, might provide more consistent and relevant context.
  • Impact of Base Model Size:

    • Using a stronger base model (Flan-T5-large vs. Flan-T5-base) generally improves performance across most metrics and baselines. For example, MINDACT's TSR on Cross-Task jumps from 14.2% (Flan-T5-base) to 26.0% (Flan-T5-large). This indicates that larger LLMs inherently possess better understanding and reasoning capabilities that benefit the task.
  • Superiority of Self-MAP:

    • Self-MAP consistently and substantially outperforms all baselines across all evaluation settings and both LLM sizes.
    • Quantified Improvement (Flan-T5-base): Self-MAP achieves TSR scores of 24.7% (Cross-Task), 18.2% (Cross-Website), and 20.8% (Cross-Subdomain). This represents a significant gain over the strongest baselines (e.g., MINDACT + Fixed and Synapse). For instance, on Cross-Task, Self-MAP's TSR of 24.7% is 6.3 points higher than MINDACT + Fixed (18.4%).
    • Quantified Improvement (Flan-T5-large): Self-MAP maintains its lead, achieving TSR scores of 26.6% (Cross-Task), 15.7% (Cross-Website), and 25.4% (Cross-Subdomain). While the Flan-T5-large version of Self-MAP sometimes trails MINDACT or MINDACT + Fixed on step-level metrics (e.g., Cross-Task SSR of 51.7% vs. 53.2% for MINDACT), its TSR remains the highest in every setting, with a substantial improvement in Cross-Subdomain (25.4% vs. 21.8% for MINDACT and 21.4% for MINDACT + Fixed). This highlights its robustness, especially in challenging generalization scenarios.
    • Overall Validation: The consistent outperformance validates the effectiveness of Self-MAP's memory-augmented planning framework and its self-reflection strategy for enhancing memory utilization in conversational web navigation. The improvements are particularly notable for TSR, which is the most holistic measure of task success in this multi-turn setting.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 3 of the original paper:

Legend: CT = Cross-Task, CW = Cross-Website, CS = Cross-Subdomain.

| Method | CT Ele. Acc | CT Op. F1 | CT SSR | CT TSR | CW Ele. Acc | CW Op. F1 | CW SSR | CW TSR | CS Ele. Acc | CS Op. F1 | CS SSR | CS TSR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Self-MAP | 56.2 | 82.5 | 47.1 | 24.7 | 48.3 | 71.8 | 40.6 | 18.2 | 46.4 | 79.1 | 38.3 | 20.8 |
| w/o Generation-based Planning | 51.7 | 79.4 | 43.5 | 22.2 | 43.1 | 69.5 | 34.9 | 15.5 | 44.8 | 77.2 | 37.3 | 17.7 |
| w/o Memory Simplification | 50.5 | 80.7 | 41.0 | 20.7 | 44.9 | 69.6 | 36.9 | 16.6 | 42.3 | 79.2 | 36.4 | 15.9 |
| w/o Memory Refinement | 52.1 | 81.3 | 43.0 | 23.2 | 48.9 | 70.8 | 39.1 | 18.1 | 46.3 | 78.7 | 37.2 | 17.8 |
| w/o Multifaceted Matching | 52.6 | 80.6 | 44.3 | 21.6 | 46.9 | 71.2 | 37.9 | 17.2 | 44.8 | 78.6 | 35.8 | 17.8 |

Table 3: Ablation study. "w/o Generation-based Planning" denotes that we use MCQ-based Planning, while "w/o Multifaceted Matching" denotes that we prepend the chronological conversation context without retrieval.

The ablation study in Table 3 systematically evaluates the contribution of each component of the Self-MAP framework, using Flan-T5-base as the backbone.

  • w/o Generation-based Planning (i.e., using MCQ-based Planning):

    • Analysis: This variant shows a noticeable drop in performance compared to the full Self-MAP. For instance, Cross-Task TSR drops from 24.7% to 22.2%, Cross-Website TSR from 18.2% to 15.5%, and Cross-Subdomain TSR from 20.8% to 17.7%.
    • Conclusion: Generation-based Planning is superior. This is attributed to the advanced generative capabilities of LLMs and their efficiency in conserving context space (as generating an action directly can be more compact than selecting from a detailed list of options).
  • w/o Memory Simplification:

    • Analysis: This component's removal leads to the most significant performance degradation across all metrics. Cross-Task TSR drops from 24.7% to 20.7% (a 4-point drop), Cross-Website TSR from 18.2% to 16.6%, and Cross-Subdomain TSR from 20.8% to 15.9% (a 4.9-point drop).
    • Conclusion: Memory Simplification (filtering irrelevant DOM elements) is the most critical factor for Self-MAP's success. This underscores the paramount importance of optimizing the use of limited context space by removing noise and irrelevant information, especially given the large HTML sizes in web navigation.
  • w/o Memory Refinement:

    • Analysis: Removing Memory Refinement (the LLM-generated rationales) also causes a performance drop, though less severe than Memory Simplification. Cross-Task TSR goes from 24.7% to 23.2%, Cross-Website TSR from 18.2% to 18.1%, and Cross-Subdomain TSR from 20.8% to 17.8%.
    • Conclusion: Memory Refinement contributes positively. Its impact is more pronounced in cross-task scenarios than in cross-website or cross-subdomain. This suggests that the rationales (which explain decision-making processes) are highly valuable when the agent encounters entirely new tasks. However, its generalizability in modeling decision-making processes across highly diverse websites or subdomains might be relatively lower, possibly because the specific rationales might not transfer perfectly to visually or logically different interfaces.
  • w/o Multifaceted Matching (i.e., prepending chronological conversation context without retrieval):

    • Analysis: This variant, which reverts to a simpler chronological memory appending, also sees a notable performance decrease. Cross-Task TSR drops from 24.7% to 21.6%, Cross-Website TSR from 18.2% to 17.2%, and Cross-Subdomain TSR from 20.8% to 17.8%.
    • Conclusion: Multifaceted Matching for memory retrieval significantly outperforms simply prepending chronological context. This highlights the necessity of intelligently filtering out noisy conversational interaction history to focus on the relevant parts that share both semantic and trajectory similarity. Unfiltered chronological context can dilute the signal and confuse the LLM.

6.2.1. Effect of the Number of Retrieved Memory Snippets ($K$)

The following are the results from Figure 4 of the original paper:

Figure 4: Performance in terms of different numbers of retrieved memory snippets ($K$), reporting Element Accuracy, Operation F1, Step Success Rate, and Turn Success Rate across the Cross-Task, Cross-Website, and Cross-Subdomain settings.

  • Analysis: The graph (Figure 4) illustrates how performance (specifically TSR on the different test splits) changes as the number of retrieved memory snippets ($K$) varies from 1 to 5.
    • Performance initially increases when $K$ grows from 1 to 3 across all three test splits. This indicates that retrieving a small number of relevant memory snippets (up to 3 in this case) is beneficial, as it provides valuable contextual information that helps the agent plan.
    • However, as $K$ continues to increase beyond 3 (e.g., to 4 or 5), the performance either plateaus or, in some cases (e.g., Cross-Task, Cross-Website), even slightly degrades.
  • Conclusion: This suggests an optimal point for memory retrieval. Retrieving too few snippets might miss crucial context, while retrieving too many can introduce noisy information from irrelevant turns. Given that the MT-Mind2Web dataset has an average of about 5 conversational turns (Table 1), increasing $K$ too much starts to include less relevant or even misleading information, which overwhelms the LLM or dilutes the helpful signal within its limited context window. The value of $K=3$ chosen for the main experiments seems to strike a good balance.

6.2.2. Analysis of Generalizability

The following are the results from Figure 5 of the original paper:

Figure 5: Turn Success Rate of Self-MAP on individual websites, shown as a bar chart with websites grouped by domain (e.g., travel, shopping, restaurants), covering the Cross-Task, Cross-Website (from exploretock to redbox), and Cross-Subdomain (from koa to airbnb) splits.

  • Analysis: Figure 5, while only showing TSR for Self-MAP for different generalization settings, allows for comparison of generalizability challenges.
    • Cross-Task vs. Others: All models (including Self-MAP) perform best on the Cross-Task setting compared to Cross-Website and Cross-Subdomain. This is logical because Cross-Task means the agent encounters new tasks but on familiar websites/domains, where interaction patterns and UI structures might be more consistent.
    • Cross-Website vs. Cross-Subdomain: There is no significant performance difference between Cross-Website and Cross-Subdomain settings.
  • Conclusion:
    • The primary challenges to generalization come from diversity in website designs and interaction logic, rather than just domain specifics. Different websites, even within the same domain, can have vastly different UI layouts and interaction flows.
    • Self-MAP's performance gap between Cross-Task and the other two settings (around 10-20% based on numerical values in Table 2 for Self-MAP Flan-T5base) is more substantial than observed in the original Mind2Web dataset (which focused on single-turn tasks). This implies that introducing multi-turn user-agent interactions significantly complicates the interaction logic, making generalization to novel websites or subdomains even harder. The agent must now reason about both dynamic environment states and evolving conversational context simultaneously.

6.2.3. Analysis of Conversation Prompt Designs

The following are the results from Figure 6 of the original paper:

Figure 6: Performance for different conversation prompt designs, reporting element accuracy, operation F1, step success rate, and turn success rate across the Cross-Task, Cross-Website, and Cross-Subdomain settings, comparing Synapse and the multifaceted-matching variants of Self-MAP.

  • Analysis: Figure 6 compares different prompt designs for Synapse and Self-MAP related to memory order and the inclusion of state-based information in matching.
    • Relevance-based Order vs. Chronological Order: Both Synapse and Self-MAP show much better performance when using a relevance-based order for memory snippets compared to a chronological (sequential) order. For example, Self-MAP (Multifaceted) significantly outperforms Self-MAP (Chronological) across all metrics.
    • State-based Information in Matching: In Self-MAP, the multifaceted matching approach was designed to query using the user instruction and agent action sequence (q_t, A_t^{k-1}). The paper also tested including "state-based information" in the retrieved memory. The results suggest that multifaceted matching typically achieves better performance without explicit state-based information in the retrieved memory. This is explained by the idea that while state is critical for planning, explicitly embedding raw state within the retrieved memory snippets (especially when using text embeddings) might not be as effective as having memory simplification and refinement preprocess it. Also, in action-level matching, where there is no sequential structure like A_t^{k-1}, state information cannot be directly inferred from the trajectory.
  • Conclusion: Intelligent memory retrieval based on relevance (both semantic and trajectory) is crucial, significantly outperforming simple chronological prepending. The chosen prompt designs for Self-MAP, which use relevance-based ordering and manage state information through memory simplification rather than raw inclusion in retrieval queries, are optimal.
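
As a concrete illustration of the ordering comparison in Figure 6, the sketch below assembles a planning prompt from retrieved memory snippets either by relevance score or in chronological turn order. The dataclass fields and the prompt template are hypothetical; the paper's actual prompt format is not reproduced here.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MemorySnippet:
    instruction: str       # past user instruction
    observation: str       # simplified (filtered) DOM observation
    action: str            # action taken at that step
    rationale: str         # self-reflection rationale generated for the step
    relevance: float       # multifaceted matching score

def build_planning_prompt(snippets: List[MemorySnippet],
                          current_instruction: str,
                          current_observation: str,
                          order: str = "relevance") -> str:
    """Assemble the planning prompt from retrieved memory snippets.
    order="relevance" sorts by matching score; order="chronological" keeps turn order."""
    ordered = (sorted(snippets, key=lambda s: s.relevance, reverse=True)
               if order == "relevance" else list(snippets))
    memory_block = "\n\n".join(
        f"[Memory {i + 1}]\n"
        f"Instruction: {s.instruction}\n"
        f"Observation: {s.observation}\n"
        f"Action: {s.action}\n"
        f"Rationale: {s.rationale}"
        for i, s in enumerate(ordered)
    )
    return (
        f"{memory_block}\n\n"
        f"Current instruction: {current_instruction}\n"
        f"Current observation: {current_observation}\n"
        f"Next action:"
    )
```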

6.2.4. Runtime Analysis

The following are the results from Table 4 of the original paper:

| Methods | Flan-T5base | Flan-T5large |
| --- | --- | --- |
| MINDACT | 1.90s | 3.43s |
| MINDACT + CAR | 1.23s | 2.11s |
| MINDACT + Fixed | 1.69s | 3.35s |
| Synapse | 1.95s | 3.58s |
| Self-MAP | 2.56s | 4.29s |

Table 4: Runtime analysis (seconds per next-action prediction on a single RTX A5000 GPU).

  • Analysis: Table 4 compares the runtime (in seconds) for predicting the next action on a single RTX A5000 GPU for Flan-T5base and Flan-T5large configurations.
    • MINDACT + CAR shows the lowest runtime (e.g., 1.23s for Flan-T5base). This is because it discards all historical trajectories after rewriting the query, significantly shortening the input token length. However, as shown in Table 2, this comes at a significant cost to performance.
    • Self-MAP is generally slower than the baselines. For Flan-T5base, Self-MAP takes 2.56s, compared to MINDACT (1.90s), MINDACT + Fixed (1.69s), and Synapse (1.95s); for Flan-T5large, Self-MAP takes 4.29s.
  • Conclusion: While Self-MAP introduces a slight increase in runtime due to its memory retrieval, simplification, and refinement processes, its runtime remains within a feasible range for deployment in complex task environments. This marginal increase in computational cost is justified by the substantial improvements in accuracy and adaptability demonstrated in the performance evaluations, making it a valuable tool for real-world applications where task success is paramount.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces the challenging new task of Conversational Web Navigation, which requires web agents to engage in complex, multi-turn interactions with both users and dynamic web environments. To facilitate research in this domain, the authors developed MT-Mind2Web, a novel dataset constructed by transforming single-turn interactions into conversational sessions through human-AI collaborative annotation. To overcome the inherent difficulties, particularly the limited context length of LLMs and the context-dependency of conversational tasks, the paper proposes Self-MAP (self-reflective memory-augmented planning). This framework intelligently leverages memory utilization via multifaceted matching for relevant snippet retrieval and self-reflection through memory simplification and LLM-generated rationales. Extensive experiments on MT-Mind2Web validate Self-MAP's effectiveness, demonstrating its consistent and substantial outperformance over strong baselines, especially in Turn Success Rate, across diverse generalization settings. The ablation studies highlight the critical role of memory simplification and the benefits of generation-based planning and relevance-based memory retrieval.

7.2. Limitations & Future Work

The authors acknowledge two main limitations and suggest future research directions:

  • Multimodal Environment: The current work primarily focuses on HTML-grounded methods. While the MT-Mind2Web dataset is compatible with multimodal environments (as its base, Mind2Web, is), the presented Self-MAP framework does not explicitly incorporate multimodal LLMs (e.g., GPT-4V). Future work could investigate adapting Self-MAP to leverage visual information from web pages, potentially improving UI understanding and action grounding in more complex or visually-driven web scenarios.
  • Online Evaluation: The paper employs offline evaluation settings, which are common in both conversational tasks and single-turn web navigation. While convenient for benchmarking, this setup inherits the drawback of offline evaluation in accurately assessing dynamic interactions with real, evolving web environments. Future work could explore online evaluation methods to better simulate real-world usage and validate the agent's robustness to unforeseen changes and dynamic elements.

7.3. Personal Insights & Critique

This paper makes a crucial step towards building more practical and human-friendly LLM-powered web agents. The introduction of Conversational Web Navigation as a task, coupled with the MT-Mind2Web dataset, addresses a significant gap in the current research landscape where web agents often operate in a single-shot interaction paradigm.

Key Strengths and Inspirations:

  • Realistic Problem Formulation: Recognizing the multi-turn nature of human interaction is vital for real-world AI agent adoption. This paper moves beyond isolated commands to address complex, context-dependent dialogues.
  • Elegant Solution to Context Length: The Self-MAP framework provides a compelling solution to the perennial context length limitation of LLMs. The combination of multifaceted matching, memory simplification (filtering irrelevant DOM elements), and memory refinement (generating rationales) is a powerful, multi-pronged approach that intelligently curates essential information. The finding that Memory Simplification is the most critical factor is a valuable insight, emphasizing that less is more when it comes to raw context if it's not highly relevant.
  • Value of Self-Reflection: Using LLMs to generate rationales for past actions (even correct ones) is a clever way to enrich memory. It transforms raw action sequences into more interpretable, reasoning-backed "experiences," which is akin to how humans learn from past decisions. This could inspire similar approaches in other memory-augmented agent systems.
  • Human-AI Collaboration in Data Curation: The human-AI collaborative annotation process for MT-Mind2Web is commendable. Leveraging ChatGPT for initial instruction decomposition, followed by human refinement, showcases an efficient and effective way to build complex datasets, potentially reducing annotation costs while maintaining quality.

Potential Issues, Unverified Assumptions, and Areas for Improvement:

  • Cost and Latency of Memory Refinement: While LLM-generated rationales are effective, the reliance on an external LLM (ChatGPT) for memory refinement introduces API costs and latency. For agents requiring real-time responses, this could be a bottleneck. Further work could explore distilling these rationales into smaller, faster models or integrating rationale generation more deeply within the primary LLM if context allows.

  • Generalizability of Rationales: The ablation study noted that Memory Refinement's contribution was less generalized across cross-website and cross-subdomain settings compared to cross-task. This is an interesting point. While a rationale for "why click 'Add to Cart'" might be universally applicable, the specific UI elements and their surrounding context can vary greatly. The rationales might be too tied to the specific UI patterns seen during training, making them less effective for visually or functionally novel interfaces. Exploring more abstract or robust rationale generation could be beneficial.

  • Scalability of Multifaceted Matching: As the memory bank grows very large (e.g., with hundreds of thousands of interaction steps), the time taken for embedding generation and cosine similarity computation might become significant, even with efficient vector databases. Strategies for hierarchical memory or more efficient search might be needed; a minimal retrieval-scaling sketch follows this list.

  • Error Propagation in Multi-turn Tasks: While Self-MAP improves TSR, it's still relatively low (e.g., 20-25%). In a multi-turn setting, an error in an early turn can cascade and make subsequent turns impossible to complete. The current framework focuses on correct action prediction for the current turn. Future work could explore mechanisms for error detection, self-correction, or user clarification within the conversation to recover from mistakes, making the agent more robust.

  • Beyond HTML: The current focus on HTML is strong, but many web applications are increasingly dynamic and use complex JavaScript rendering, making raw HTML parsing insufficient. Integrating visual perception (as suggested in limitations) is essential for future practical web agents.
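
One concrete way to keep retrieval fast as the memory bank grows is an off-the-shelf vector index. The sketch below uses FAISS purely as an illustration; the paper does not use FAISS, and the dimensionality, bank size, and random vectors are placeholder assumptions.

```python
# A minimal scaling sketch, assuming FAISS is available (not part of the paper's implementation).
import faiss
import numpy as np

dim = 384                       # assumed embedding dimensionality
bank_size = 100_000             # hypothetical size of the interaction-memory bank

# Snippet embeddings would come from a text encoder; random vectors stand in here.
snippet_vecs = np.random.rand(bank_size, dim).astype("float32")
faiss.normalize_L2(snippet_vecs)            # unit-normalise so inner product equals cosine

index = faiss.IndexFlatIP(dim)              # exact inner-product search;
index.add(snippet_vecs)                     # swap in faiss.IndexHNSWFlat for ANN at larger scales

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)        # top-K = 3 memory snippets
print(ids[0], scores[0])
```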

Overall, Self-MAP represents a significant advancement in conversational web agents. It offers a principled and effective way to manage context and enhance LLM planning in complex interactive environments, paving the way for more intelligent and user-friendly AI assistants.
