On the Multi-turn Instruction Following for Conversational Web Agents
TL;DR Summary
This paper introduces conversational web navigation and the MT-Mind2Web dataset, proposing a self-reflective memory-augmented planning framework to improve LLMs' multi-turn instruction following, validated by extensive experiments.
Abstract
Web agents powered by Large Language Models (LLMs) have demonstrated remarkable abilities in planning and executing multi-step interactions within complex web-based environments, fulfilling a wide range of web navigation tasks. Despite these advancements, the potential for LLM-powered agents to effectively engage with sequential user instructions in real-world scenarios has not been fully explored. In this work, we introduce a new task of Conversational Web Navigation, which necessitates sophisticated interactions that span multiple turns with both the users and the environment, supported by a specially developed dataset named Multi-Turn Mind2Web (MT-Mind2Web). To tackle the limited context length of LLMs and the context-dependency of conversational tasks, we further propose a self-reflective memory-augmented planning framework (Self-MAP), which employs memory utilization and self-reflection techniques. Extensive experiments benchmark the MT-Mind2Web dataset and validate the effectiveness of the proposed method.
In-depth Reading
1. Bibliographic Information
1.1. Title
On the Multi-turn Instruction Following for Conversational Web Agents
1.2. Authors
Yang Deng*, Xuan Zhang*, Wenxuan Zhang†, Yifei Yuan, See-Kiong Ng, Tat-Seng Chua
- Affiliations: Singapore Management University, National University of Singapore, University of Copenhagen.
- Research Backgrounds: The authors generally come from institutions known for strong research in AI, natural language processing (NLP), and potentially human-computer interaction or web technologies. Yang Deng and Xuan Zhang are co-first authors, indicating equal contribution. The involvement of multiple universities suggests a collaborative research effort.
1.3. Journal/Conference
Published at ACL Long 2024.
- Reputation and Influence: The Association for Computational Linguistics (ACL) is the premier international scientific and professional society for people working on computational linguistics and natural language processing (NLP). Its annual conference (ACL) is one of the most prestigious and highly-regarded venues in the field, making publication here a significant achievement that indicates high-quality, impactful research. "ACL Long" typically refers to the main conference track for full papers.
1.4. Publication Year
2024
1.5. Abstract
This paper addresses the underexplored area of Large Language Model (LLM)-powered web agents effectively engaging with sequential user instructions in real-world scenarios. It introduces a new task called Conversational Web Navigation, which requires complex multi-turn interactions with both users and the web environment. To support this task, a novel dataset named Multi-Turn Mind2Web (MT-Mind2Web) is developed. To overcome the challenges of limited context length in LLMs and the context-dependency inherent in conversational tasks, the authors propose a new framework: self-reflective memory-augmented planning (Self-MAP). This framework leverages memory utilization and self-reflection techniques. The paper benchmarks the MT-Mind2Web dataset with extensive experiments, validating the effectiveness of the proposed Self-MAP method.
1.6. Original Source Link
Official Source: https://aclanthology.org/2024.acl-long.477/ PDF Link: https://aclanthology.org/2024.acl-long.477.pdf
- Publication Status: Officially published at ACL 2024.
2. Executive Summary
2.1. Background & Motivation
The core problem this paper aims to solve is the limited ability of LLM-powered web agents to handle multi-turn user instructions in real-world web navigation tasks. While LLMs have shown remarkable capabilities in planning and executing multi-step interactions in web environments for single-turn instructions (e.g., booking tickets), their potential for sequential, conversational interactions with users remains largely unexplored.
This problem is critically important for advancing AI agents towards more human-like, intuitive, and practical applications. In real-world interactions, users rarely provide all necessary information or instructions in a single, perfectly structured query. Instead, conversations are often sequential, involving follow-up questions, co-referencing instructions (where later instructions refer to entities or concepts mentioned earlier without full repetition), and brief/succinct instructions that rely heavily on prior context. Current web agents often treat each instruction as standalone, failing to leverage the rich context accumulated over a conversation.
Specific challenges existing in prior research include:
- Lack of Multi-turn User Instruction Handling: Existing web navigation datasets and methods primarily focus on completing a single, explicit instruction.
- Limited Context Length of LLMs: LLMs have a finite input window (context length). Conversational web navigation requires maintaining a long history, including both user-agent conversations and agent-environment interactions (the steps taken by the agent on the webpage), which can quickly exceed LLM context limits.
- Context-Dependency: Follow-up instructions are inherently dependent on previous turns. Ignoring this context leads to failures.
- Noisy History: The rich conversational history can also be noisy, containing irrelevant information that can confuse the LLM if not managed properly.

The paper's entry point and innovative idea is to introduce the new task of Conversational Web Navigation, which explicitly models sophisticated interactions that span multiple turns with both the users and the environment. To tackle the identified challenges, the authors propose Self-MAP, a framework that strategically uses memory and self-reflection to manage context and improve planning.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Definition of Conversational Web Navigation: It formally defines a new, challenging task that extends web agent capabilities to multi-turn user instruction following, requiring interaction with both users and dynamic web environments. This addresses a crucial gap in current web agent research.
- Introduction of the MT-Mind2Web Dataset: To enable research and benchmarking for the new task, the paper introduces MT-Mind2Web, a novel dataset specifically designed for conversational web navigation. This dataset is built upon Mind2Web (an expert-annotated web navigation dataset), adapting single-turn interactions into conversation sessions through a meticulous human-AI collaborative annotation process involving instruction decomposition and conversational rewriting.
- Proposal of the Self-MAP Framework: The paper proposes self-reflective memory-augmented planning (Self-MAP), a novel framework tailored to the inherent challenges of conversational web navigation. Self-MAP tackles LLM context length limitations and context-dependency through three key components: a Memory Module (using multifaceted matching for relevant snippet retrieval), a Reflection Module (performing memory simplification and memory refinement with LLM-generated rationales), and a Planning Module (leveraging the processed memory).
- Extensive Benchmarking and Validation: The authors conduct comprehensive experiments to benchmark the MT-Mind2Web dataset against various state-of-the-art web navigation and conversational task baselines.
- Key Findings:
  - Self-MAP consistently and substantially outperforms all baselines across cross-task, cross-website, and cross-subdomain evaluation settings, demonstrating its effectiveness in handling conversational web navigation.
  - Ablation studies reveal that memory simplification is the most critical component for enhancing performance, highlighting the importance of efficient context management.
  - Generation-based planning is superior to multi-choice question answering for LLM agents in this task, balancing generative capabilities with context efficiency.
  - Multifaceted matching for memory retrieval significantly outperforms simple chronological memory prepending, proving the necessity of filtering noise and focusing on relevance.
  - The memory refinement component, while effective, showed relatively lower generalizability in cross-website and cross-subdomain scenarios compared to cross-task.

These contributions collectively address a significant challenge in LLM-powered AI agents, pushing the boundaries of web automation towards more natural and effective human-agent interaction.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contributions of this paper, a beginner should understand several core concepts:
- Large Language Models (LLMs): At their core, LLMs are neural networks, typically based on the Transformer architecture, trained on vast amounts of text data. They learn to predict the next word in a sequence, which imbues them with impressive capabilities for understanding, generating, and processing human language. In the context of web agents, LLMs are used for:
  - Planning: Decomposing complex tasks into a sequence of actionable steps.
  - Natural Language Understanding (NLU): Interpreting user instructions and the text content of webpages.
  - Natural Language Generation (NLG): Formulating responses or actions in a human-understandable format.
  - Limited Context Length: A crucial constraint of LLMs is that they can only process a finite amount of input text (tokens) at once. This "context window" limits how much history an LLM can explicitly consider, making long conversations or complex environment states challenging to manage.
- Web Agents: AI agents designed to interact with web browsers and web pages to accomplish tasks specified by a user. Unlike traditional chatbots, web agents can perceive (read HTML, identify UI elements), reason (plan actions), and act (click buttons, type text, scroll) within a dynamic web environment. Their goal is to automate web-based tasks, mimicking human browser usage.
- Multi-turn Interaction: A sequence of exchanges or actions that build upon each other. In this paper, it has two dimensions:
  - User-Agent Multi-turn (Conversational): The user provides instructions sequentially, where later instructions might depend on previous ones (e.g., "Find me a flight to London." followed by "Now find a hotel near the airport.").
  - Agent-Environment Multi-turn (Multi-step Task Execution): The agent needs to perform a series of actions on a webpage to complete a single instruction (e.g., to book a flight, it might need to click a search button, type in a destination, select dates, and confirm).
  Conversational web navigation combines both.
- Context-Dependency: The meaning or correct execution of a current instruction or action is contingent upon information or events from previous turns in a conversation or interaction sequence. For example, if a user says "Book that one," "that" refers to something previously discussed.
- Fine-tuning: A process where a pre-trained LLM (trained on a massive general text corpus) is further trained on a smaller, task-specific dataset. This specialized training adapts the LLM's knowledge and capabilities to a particular domain or task (e.g., web navigation), improving its performance beyond what it could achieve with only its general pre-training.
- Memory-Augmented Models: AI models equipped with an external memory component to store and retrieve information beyond their immediate context window. This allows them to "remember" and utilize relevant past information over longer durations, overcoming the context length limitation of LLMs.
- Self-reflection: In the context of AI agents, the agent's ability to analyze its own past actions, decisions, and reasoning processes to identify errors, improve its understanding, or generate better plans for future actions. It is a meta-cognitive ability where the agent "thinks about its thinking."
3.2. Previous Works
The paper positions itself within the broader context of web agents and multi-turn interactions.
3.2.1. Web Agents
Early web agents focused on simplified environments (Shi et al., 2017; Liu et al., 2018). More recent work, often LLM-powered, has advanced to more complex settings:
- Mind2Web (Deng et al., 2023): A key predecessor to this paper, Mind2Web introduced an expert-annotated web navigation dataset covering diverse domains and real-world interactions. It focused on single-turn user instructions, meaning each instruction was a standalone request. The MINDACT framework proposed in Mind2Web serves as a strong baseline here.
- WebShop (Yao et al., 2022): This platform for web agents focused on online shopping tasks, showcasing scalable real-world web interaction with grounded language agents.
- WebArena (Zhou et al., 2024): Provides a realistic web environment for building autonomous agents, addressing real-time interactions.
- Visual UI Understanding: Some works, like Zheng et al. (2024a), incorporate visual information from the user interface (UI) to enhance understanding.
- Prompt-based Methods: Various techniques improve LLM-powered web agents without extensive fine-tuning, such as recursive self-correction prompting (Kim et al., 2023), code-based prompting (Sun et al., 2023), and trajectory-augmented prompting (Zheng et al., 2024b, Synapse). These methods leverage LLMs' inherent capabilities through carefully crafted prompts. However, the paper notes that prompt-based methods often struggle to compete with fine-tuned methods in advanced settings like Mind2Web.
3.2.2. Multi-turn Interactions with Environment
These works enable LLM-powered agents to handle challenging tasks by interacting with external environments:
- Code-grounded environments (Xu et al., 2024; Hong et al., 2024): Agents interact with databases or perform programming tasks.
- Game-grounded environments (Shridhar et al., 2021): Agents play games.
- Web-grounded environments (Deng et al., 2023; Yao et al., 2022): Agents navigate webpages or shop online.

The crucial distinction is that these works primarily focus on completing a single, standalone user instruction by planning a sequence of actions within the environment. Some studies (Wang et al., 2024; Xie et al., 2023) investigate using multi-turn user feedback to solve a given task, but they don't typically involve sequential, context-dependent user instructions over multiple turns of interaction for different sub-tasks.
3.2.3. Multi-turn Interactions with Users
This category includes LLM capabilities in engaging in conversations with human users for various tasks:
- Recommendation (He et al., 2023; Huang et al., 2023): Conversational systems that recommend items.
- Tutoring (Dan et al., 2023; Deng et al., 2024): LLMs acting as educational tutors.
- Counseling (Zheng et al., 2023b): LLMs providing emotional support.
- MT-Bench (Zheng et al., 2023a): A popular benchmark for evaluating LLMs' multi-turn instruction-following ability across 80 high-quality multi-turn questions.
- Conversational Information Seeking (Pan et al., 2023): LLMs retrieve information in a conversational manner.

The key limitation of these approaches, highlighted by the paper, is that they mainly rely on the LLMs' inherent knowledge or perform a one-time request to an external environment for each turn. They neither require dynamic, multi-step interaction with an environment within a single conversation turn, nor manage multi-turn agent-environment interaction history as part of the conversational context.
3.2.4. Context-Aware Query Rewriting (CAR)
Context-Aware Query Rewriting (CAR) (Anand et al., 2023) aims to make a query self-contained by incorporating relevant information from the conversational history. The original query, which might be co-referential or elliptical, is rewritten into a query that can be understood without the preceding turns, helping systems that process queries in isolation. This paper adapts CAR as a baseline to test whether simply rewriting instructions is sufficient for conversational web navigation.
3.3. Technological Evolution
The evolution of AI agents has progressed from simple, rule-based systems to more sophisticated LLM-powered entities. Initially, web agents were often task-specific and required extensive programming for each interaction. The advent of LLMs significantly boosted agent capabilities by providing powerful natural language understanding and planning skills. However, these LLM-powered agents largely focused on single-turn tasks or multi-step tasks stemming from a single instruction. The field then recognized the importance of multi-turn user interactions for real-world usability in general conversational LLMs. This paper bridges the gap between LLM-powered web agents and multi-turn user interactions, proposing that web agents must not only execute multi-step actions but also interpret multi-turn conversational instructions that are context-dependent. This represents a step towards truly generalist, human-like AI assistants that can engage in sustained, meaningful dialogues while interacting with complex digital environments.
3.4. Differentiation Analysis
Compared to the main methods in related work, this paper's approach presents several core innovations:
- Focus on Conversational Web Navigation: Unlike previous web agent research that primarily addresses single-turn instructions (e.g., Mind2Web, WebShop), this work explicitly defines and tackles the more complex multi-turn conversational instruction following in dynamic web environments. This is a critical distinction, as it requires managing context-dependency across turns.
- Integrated User-Agent and Agent-Environment Interactions: While some conversational LLMs interact with users and some web agents interact with environments, this paper's task and framework necessitate sophisticated multi-turn interactions with both. The conversational history in MT-Mind2Web includes both user queries and the agent's actions on the webpage, making the context much richer and more challenging than in traditional conversational tasks.
- Novel Context Management for LLMs: The Self-MAP framework directly addresses the limited context length of LLMs and the noisy, long history characteristic of conversational web navigation.
  - Multifaceted Matching: Unlike simple kNN retrieval (as in Synapse) or fixed memory (as in MINDACT + Fixed), Self-MAP's multifaceted matching constructs a query using both the user instruction and the present agent action sequence. This allows for a more nuanced retrieval of memory snippets that are semantically relevant and share similar action trajectories, significantly reducing noise.
  - Memory Simplification: By applying a DOM element ranker to filter irrelevant elements from the environment state (HTML), Self-MAP efficiently compresses memory snippets, freeing up crucial context space for LLMs. This is a unique contribution tailored to the web environment.
  - Memory Refinement with LLM Rationales: Instead of just recalling past interactions, Self-MAP uses an LLM to generate explicit reasoning rationales for past actions. This self-reflection enriches the memory, providing a deeper understanding of why an action was taken and acting as a supervised signal to guide future planning. This differs from traditional self-reflection, which often involves analyzing incorrect trajectories.
- New Dataset MT-Mind2Web: The creation of a dedicated multi-turn dataset through human-AI collaborative annotation (decomposing instructions, rewriting with anaphora and ellipsis) is a significant enabler for this new research direction, providing a concrete benchmark where none existed previously.

In essence, Self-MAP innovates by providing a structured, memory-augmented, and self-reflective approach to overcome the context limitations and dependency issues that arise when LLM-powered web agents must engage in continuous, conversational interactions with users across dynamic web environments.
4. Methodology
4.1. Principles
The core idea of the proposed method, self-reflective memory-augmented planning (Self-MAP), is to enable LLM-powered web agents to effectively follow multi-turn user instructions in dynamic web environments. This involves two main principles:
- Intelligent Context Management: LLMs have limited context windows, while conversational web navigation generates a very long and potentially noisy interaction history (both user-agent and agent-environment). The principle is to intelligently select, simplify, and refine only the most relevant parts of this history to fit within the LLM's context window, rather than simply truncating or appending all information. This is achieved through memory utilization (retrieval and simplification).
- Enhanced Planning through Self-Reflection: Beyond just recalling past actions, the agent should understand why certain actions were taken in previous, similar situations. Self-reflection is applied to generate explicit reasoning rationales for past actions, which enriches the memory snippets and provides a deeper, more actionable context for the LLM to plan its next move.

Together, these principles address the context length limitation and context-dependency issues inherent in conversational web navigation.
4.2. Core Methodology In-depth (Layer by Layer)
The Self-MAP framework is composed of three main components: a Memory Module, a Reflection Module, and a Planning Module. The overall pipeline can be visualized in Figure 3.
4.2.1. Problem Definition
The paper defines the task of Conversational Web Navigation. An agent must engage in multi-turn interactions with both the user and the web environment.
Given:

- The conversational interaction history $H_t = \{(q_i, A_i)\}_{i=1}^{t-1}$, where:
  - $q_i$: the user instruction at turn $i$.
  - $A_i$: the environment interaction history (sequence of actions) performed by the agent to fulfill user instruction $q_i$. Each action in $A_i$ is a single atomic action (e.g., click, type).
- The current environment state $E_t$ (e.g., the HTML of the current webpage).

Objective:

- To accurately predict the action sequence $A_t$ required to accomplish the current user instruction $q_t$. This action sequence encompasses both the target element on the webpage for interaction and the specific operation to be performed. A structural sketch of these objects follows below.
4.2.2. Self-MAP Framework Overview
The Self-MAP framework (Figure 3 from the original paper, below) processes the current user instruction ($q_t$), the current environment state ($E_t$), and the conversational history ($H_t$) to generate the next action.

This image is a schematic of the Self-MAP framework (Figure 3), showing memory-based retrieval of the interaction history, the reflection process of memory refinement and simplification, and how the resulting self-reflective memory guides planning and action execution.
Figure 3: Overview of Self-MAP.
The core flow is:
- The Memory Module retrieves relevant past interactions from a memory bank.
- The Reflection Module then refines these retrieved memories by simplifying the environment states and enriching them with LLM-generated rationales.
- Finally, the Planning Module uses these self-reflective memories along with the current context to decide the next action.
4.2.3. Memory Module
The Memory Module is designed to construct and intelligently query a memory bank of past interactions.
- Memory Bank Construction: Each memory snippet stored in the bank represents an individual interaction step from the conversational history. A memory snippet is represented as $m = (q_j, A_j^{<k}, \hat{E}_j^k, a_j^k)$, where:
  - $q_j$: the user instruction at turn $j$.
  - $A_j^{<k}$: the sequence of agent-environment interactions (trajectory) that occurred before the $k$-th action at conversation turn $j$.
  - $\hat{E}_j^k$: the environment state (e.g., HTML) before the $k$-th action at turn $j$. The hat indicates that this is an approximation or snapshot.
  - $a_j^k$: the $k$-th action (target element and operation) taken at turn $j$.

  The challenge is that injecting all these memory snippets into the LLM's current running memory (its input context) would quickly exceed its maximum input length. Additionally, many snippets might be irrelevant or inconsistent with the current environment, introducing noise.
- Multifaceted Matching: To address the context length limitation and noise, the Memory Module employs a multifaceted matching approach to retrieve only the top-$K$ most relevant snippets. This matching happens at the action level.
  - Query Construction: For an ongoing conversational interaction at turn $t$ and action step $k$, the query is constructed from both the current user instruction and the present agent action sequence: $x = (q_t, A_t^{<k})$, where $A_t^{<k}$ is the partial trajectory of agent-environment interactions at the current conversation turn.
  - Encoding Semantics and Trajectory: This query structure encodes two types of relevance:
    - Semantic relevance: how semantically similar the current user instruction is to instructions in the memory bank.
    - Action trajectory similarity: how similar the current partial action sequence is to action sequences in the memory bank.
  - Embedding and Retrieval:
    - The paper uses OpenAI's text-embedding-ada-002 to convert the query and all memory snippets into vector representations (embeddings).
    - Cosine similarity is then computed between the query embedding and each memory snippet embedding in the embedding space.
    - The top-$K$ memory snippets with the highest cosine similarity scores are retrieved (see the retrieval sketch below).
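The retrieval step reduces to embedding an (instruction, partial-trajectory) query and ranking snippets by cosine similarity. A minimal sketch, assuming the OpenAI embeddings client and an illustrative `memory_bank` layout (neither is taken from the paper's code):

```python
import numpy as np
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()

def embed(text: str) -> np.ndarray:
    """Embed a string with text-embedding-ada-002 (the model used in the paper)."""
    resp = client.embeddings.create(model="text-embedding-ada-002", input=text)
    return np.array(resp.data[0].embedding)

def retrieve_memory(query_instruction, partial_actions, memory_bank, k=3):
    """Multifaceted matching: the query couples the current instruction with
    the partial action trajectory, so retrieved snippets match on both
    semantics and trajectory. `memory_bank` is a list of (text, snippet)
    pairs whose text serializes (q_j, A_j^{<k}) the same way."""
    query = query_instruction + "\n" + " -> ".join(partial_actions)
    q = embed(query)
    scored = []
    for text, snippet in memory_bank:
        m = embed(text)  # in practice these would be precomputed and cached
        cos = float(q @ m / (np.linalg.norm(q) * np.linalg.norm(m)))
        scored.append((cos, snippet))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [snippet for _, snippet in scored[:k]]
```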
4.2.4. Reflection Module
The Reflection Module further processes the retrieved memory snippets to maximize the utility of the limited memory space for the LLM. It involves two steps: Memory Simplification and Memory Refinement.
- Memory Simplification:
  - Purpose: To remove task-irrelevant and noisy elements from the environment state within each memory snippet, thereby saving memory space and allowing more relevant information to be retained.
  - Method: The process is inspired by the MINDACT framework's candidate generation. A small pre-trained language model (e.g., DeBERTa) acts as a ranker that identifies and ranks the DOM elements from the environment state most relevant to the instruction and the current step. The simplified environmental state is denoted as $\hat{E}_j^k$, replacing the original full state in the memory snippet.
- Memory Refinement:
  - Purpose: To enrich the memory information by generating intermediate reasoning rationales for past actions, serving as a supervised signal. This goes beyond merely storing the action; it stores why the action was taken.
  - Method: This step leverages the reasoning capability of LLMs (specifically ChatGPT with gpt-3.5-turbo-1106). For each retrieved memory snippet, the LLM is prompted to generate an in-depth rationale $r_j^k$ that explains the decision-making process leading to the execution of the next action $a_j^k$.
  - Distinction from Traditional Self-reflection: Unlike some self-reflection methods (e.g., Reflexion by Shinn et al., 2023) that collect and analyze incorrect trajectories, this approach generates rationales for correct past actions within a static evaluation setting, primarily to enrich the memory rather than to debug errors.
- Self-reflective Memory Snippet: After these two steps, each processed memory snippet becomes a self-reflective memory snippet, $\tilde{m} = (q_j, A_j^{<k}, \hat{E}_j^k, r_j^k, a_j^k)$. This snippet is concise (due to memory simplification) and informative (due to memory refinement). A sketch of the rationale-generation call follows below.
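A sketch of the rationale-generation call, using the generation settings reported in the implementation details; the prompt wording and function name are assumptions:

```python
from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()

# Fallback used when the ground-truth element is absent from the HTML
# snippet (quoted from the paper's implementation details).
DEFAULT_RATIONALE = (
    "The assistant's answer is derived from the absence of a specific option "
    "in the provided HTML content, leading to the conclusion that none of the "
    "options provided are suitable for the user's task."
)

def refine_memory(instruction: str, html_snippet: str, action: str) -> str:
    """Ask gpt-3.5-turbo-1106 why `action` was taken, enriching the snippet
    with a reasoning rationale."""
    if not html_snippet:  # no positive element found in the snippet
        return DEFAULT_RATIONALE
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo-1106",
        max_tokens=100,   # generation settings reported in the paper
        temperature=0,    # deterministic output
        messages=[{
            "role": "user",
            "content": (
                f"Instruction: {instruction}\nHTML: {html_snippet}\n"
                f"Next action: {action}\n"
                "Explain the reasoning that leads to this action."
            ),
        }],
    )
    return resp.choices[0].message.content
```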
4.2.5. Planning with Self-reflective Memory
The final stage is to use the self-reflective memory to plan the next action.
- Input to the LLM: For each interaction step $k$ at the current conversation turn $t$, the LLM receives an input consisting of:
  - $q_t$: the current user instruction.
  - $A_t^{<k}$: the partial action sequence executed so far in the current turn.
  - $\hat{E}_t^k$: the simplified current environment state (the top-ranked candidate DOM elements from $E_t$ after being ranked by the DeBERTa ranker, mirroring memory simplification).
  - The top-$K$ self-reflective memory snippets retrieved and refined by the Reflection Module.
- Action Generation: The LLM (specifically a fine-tuned Flan-T5) is tasked with planning the next action $a_t^k$, which includes:
  - the target element to interact with, and
  - the operation to perform on that element (e.g., CLICK, TYPE, SELECT).
- Planning Paradigms: The paper explores two planning paradigms for the LLM (details in Appendix B.2); a sketch of the assembled prompt follows this list.
  - Multi-choice Question Answering (MCQ): the LLM selects the target element from a predefined list of options (the top candidate elements identified by the DeBERTa ranker) and then determines the operation.
  - Direct Generation: the LLM directly generates the target element and the operation as free-form text. The paper's ablation study shows that direct generation generally performs better.

This structured approach allows Self-MAP to leverage relevant historical context efficiently and effectively for planning actions in conversational web navigation, overcoming the inherent limitations of LLMs in handling complex, dynamic, and lengthy interaction histories.
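As referenced above, here is a hedged sketch of how the generation-based planning input might be assembled; the template is invented for illustration, and the real format is specified in the paper's Appendix B.2:

```python
def build_planning_prompt(q_t, partial_actions, simplified_state, snippets):
    """Assemble the generation-based planning input from the current turn
    and the self-reflective memory snippets (illustrative template)."""
    memory_blocks = []
    for s in snippets:  # snippet keys are assumed, mirroring (q, A, E-hat, r, a)
        memory_blocks.append(
            f"Past instruction: {s['instruction']}\n"
            f"Trajectory: {' -> '.join(s['trajectory'])}\n"
            f"Simplified HTML: {s['state']}\n"
            f"Rationale: {s['rationale']}\n"
            f"Action taken: {s['action']}"
        )
    return (
        "### Relevant memory\n" + "\n\n".join(memory_blocks) + "\n\n"
        "### Current turn\n"
        f"Instruction: {q_t}\n"
        f"Actions so far: {' -> '.join(partial_actions) or '(none)'}\n"
        f"Candidate elements: {simplified_state}\n"
        "Predict the next action as: <element> ; <operation> [value]"
    )
```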
5. Experimental Setup
5.1. Datasets
The core dataset used in this work is Multi-Turn Mind2Web (MT-Mind2Web).
- Source: MT-Mind2Web is constructed from the existing Mind2Web (Deng et al., 2023) dataset. Mind2Web provides single-turn web navigation interactions annotated by experts, which serves as the foundation for ensuring the quality of agent responses.
- Construction Process: The construction focuses on generating conversational user instructions while reusing the expert-annotated action sequences from Mind2Web, ensuring the feasibility and correctness of the underlying web interactions. The process involves three main steps (as illustrated in Figure 2 from the original paper, below):

  This image is Figure 2, showing the overall pipeline of MT-Mind2Web dataset construction with examples, covering three steps (organizing conversation sessions, decomposing complex instructions, and rewriting conversational instructions) and emphasizing the handling of conversational context and instruction transformation. Figure 2: Overall pipeline for MT-Mind2Web creation with examples.

  - Organize Conversation Sessions: Consecutive single-turn instructions from Mind2Web that share the same context (e.g., domain, website, entities, intents) are grouped into conversation sessions. For example, two separate instructions about ticket booking on the same website are combined into one session.
  - Decompose Complex Instructions: Instructions in Mind2Web with long action sequences are often unnatural for single conversational turns. The paper employs a human-AI collaborative annotation approach for decomposition:
    - ChatGPT is initially used to divide an original complex instruction and its action sequence into subtasks, each with corresponding action sub-sequences. The target number of subtasks is set as a function of the number of actions in the original instruction.
    - Human annotators then verify and refine these AI-generated decompositions to ensure they are reasonable and executable, reflecting natural conversational flow. ChatGPT achieved a high pass rate in this step.
    - Example (from Figure 2): Action Sequence 1 is broken down into Action Sub-sequence 1-1 and Action Sub-sequence 1-2.
  - Rewrite Conversational Instructions: The original standalone instructions are rephrased to be conversational using anaphora (pronouns or phrases referring to something mentioned earlier) and ellipsis (omission of words/phrases that are implied). This makes the conversation flow naturally.
    - Example (from Figure 2): If T1 mentions a "WWE ticket," T2 might refer to it as "one." If T3 and T4 involve the same action like "booking," the verb might be omitted in T4.
- Quality Control: A quality verification process ensures the coherence and correctness of the constructed conversations; annotators fix any issues until verification passes.
- Dataset Statistics: The following are the results from Table 1 of the original paper:

| | Train | Cross-Task | Cross-Website | Cross-Subdomain |
|---|---|---|---|---|
| # Conversations | 600 | 34 | 42 | 44 |
| # Turns | 2,896 | 191 | 218 | 216 |
| Avg. # Turn/Conv. | 4.83 | 5.62 | 5.19 | 4.91 |
| Avg. # Action/Turn | 2.95 | 3.16 | 3.01 | 3.07 |
| Avg. # Element/Turn | 573.8 | 626.3 | 620.6 | 759.4 |
| Avg. Inst. Length | 36.3 | 37.4 | 39.8 | 36.2 |
| Avg. HTML Length | 169K | 195K | 138K | 397K |

Table 1: Statistics of the MT-Mind2Web dataset.

  - Total Conversations: 720 sessions (600 for training, 120 for testing).
  - Total Instruction/Action Pairs: 3,525.
  - Average Turns per Conversation: approximately 5.
  - Average HTML Length: The HTML content for each turn is very long (averaging 169K in training and up to 397K in the cross-subdomain test set), highlighting the severe context length challenges.
- Test Splits: The test set is divided into three subsets to evaluate generalization capabilities, mirroring Mind2Web:
  - Cross-Task: Evaluates generalization to new tasks within familiar websites/domains (34 conversations).
  - Cross-Website: Evaluates generalization to new websites within familiar domains (42 conversations, from specific sites such as "redbox" and "viator").
  - Cross-Subdomain: Evaluates generalization to entirely new subdomains (44 conversations, from the "Digital" and "Hotel" domains).
-
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate the performance of web agents in conversational web navigation, building upon metrics from single-turn web navigation. All reported metrics are macro averages, meaning they are first calculated per task and then averaged over all tasks.
- Element Accuracy (Ele. Acc):
  - Conceptual Definition: Measures whether the agent correctly identifies and selects the target user interface (UI) element on the webpage that corresponds to the required action; i.e., the agent's ability to ground the natural language instruction to the correct interactive element.
  - Mathematical Formula: $ \text{Ele. Acc} = \frac{\text{Number of correctly identified elements}}{\text{Total number of elements to be identified}} $
  - Symbol Explanation:
    - Number of correctly identified elements: the count of instances where the agent's chosen element perfectly matches the ground-truth target element for an action step.
    - Total number of elements to be identified: the total count of ground-truth target elements across all action steps in the evaluation.
- Operation F1 (Op. F1):
  - Conceptual Definition: Assesses the agent's ability to predict the correct operation (e.g., CLICK, TYPE, SELECT) and its associated value (e.g., the text to type, the option to select) for a given element. It uses the F1 score at the token level, the harmonic mean of precision and recall, making it suitable for evaluating text generation tasks where partial matches can occur.
  - Mathematical Formula:
$
\text{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}}
$
$
\text{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}}
$
$
\text{F1} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}
$
  - Symbol Explanation (computed at the token level over the predicted operation string):
    - True Positives: tokens correctly predicted, present in both the predicted operation and the ground-truth operation.
    - False Positives: tokens present in the predicted operation but not in the ground-truth operation.
    - False Negatives: tokens present in the ground-truth operation but not in the predicted operation.
- Step Success Rate (SSR):
  - Conceptual Definition: A stringent measure of performance at the individual action step level. A single interaction step is considered successful only if both the selected element (Element Accuracy) and the predicted operation (Operation F1) are correct, ensuring that the agent not only finds the right UI component but also interacts with it in the correct manner.
  - Mathematical Formula: $ \text{SSR} = \frac{\text{Number of successful steps (correct element AND correct operation)}}{\text{Total number of steps}} $
  - Symbol Explanation:
    - Number of successful steps: count of steps where the agent's chosen element matches the ground truth AND its predicted operation (including value) matches the ground truth.
    - Total number of steps: the total number of atomic actions required across all tasks being evaluated.
- Turn Success Rate (TSR):
  - Conceptual Definition: The most demanding metric and the ultimate goal in conversational web navigation. A conversation turn (which may involve multiple interaction steps to complete one user instruction) is considered successful only if all the individual steps within that turn succeed according to the SSR criterion. This metric directly evaluates the agent's ability to fully complete a user's instruction within a conversational context; the paper regards TSR as the main metric.
  - Mathematical Formula: $ \text{TSR} = \frac{\text{Number of successful turns (all steps in turn succeed)}}{\text{Total number of turns}} $
  - Symbol Explanation:
    - Number of successful turns: count of conversation turns where every single action step performed by the agent to fulfill the user's instruction at that turn was correct.
    - Total number of turns: the total number of user-agent conversation turns being evaluated.

A small computational sketch of SSR and TSR follows below.
5.3. Baselines
The paper compares Self-MAP against several state-of-the-art baselines, adapting them for the conversational web navigation task. These baselines represent different approaches to web navigation and conversational tasks.
- DeBERTa (He et al., 2021):
  - Description: A pre-trained language model (Decoding-enhanced BERT with disentangled attention). In this context, it is used as a ranker for selecting target elements from the HTML. It represents a baseline for element selection without explicit operation prediction or advanced LLM-based planning.
  - Why Representative: It establishes a lower bound for performance focused purely on identifying the correct UI element, similar to how MINDACT uses a DeBERTa-based ranker for candidate generation.
- MINDACT (Deng et al., 2023):
  - Description: The original framework from the Mind2Web paper. It performs multi-choice question answering (MCQ) to select a target element from a list of options. For the conversational setting, its input is adapted to include the entire conversational interaction history. The paper evaluates MINDACT with two LLM backbones: GPT-3.5 (in-context learning) and Flan-T5 (fine-tuned).
  - Why Representative: It is the direct predecessor for single-turn web navigation and a strong fine-tuned baseline. Its GPT-3.5 variant shows how in-context learning fares without fine-tuning on the specific conversational task.
- MINDACT + CAR (Anand et al., 2023):
  - Description: This baseline integrates Context-Aware Rewriting (CAR) using ChatGPT. Before feeding the instruction to MINDACT, CAR attempts to reconstruct a self-contained instruction from the conversational instruction and the conversation context. This rewritten instruction is then used as input for MINDACT.
  - Why Representative: It represents a common strategy in conversational AI to handle context-dependency by making each turn's instruction independent. It tests whether rewriting alone is sufficient for conversational web navigation.
- MINDACT + Fixed (Huq et al., 2023):
  - Description: This baseline uses a fixed memory selection strategy, prepending a fixed number of initial turns (the first 3 turns) from the conversation history as memory to MINDACT's input. Huq et al. (2023) suggested that fixed examples can sometimes outperform relevance-based selection in demonstration-based learning.
  - Why Representative: It tests a simple, static memory augmentation strategy, contrasting with dynamic retrieval methods.
- Synapse (Zheng et al., 2024b):
  - Description: A state-of-the-art method from the Mind2Web competition, which employs metadata (website, domain, subdomain, task) for kNN-based exemplar retrieval. For MT-Mind2Web, given that website, domain, and subdomain are constant within a conversation, only the task is used in the metadata for turn-level kNN retrieval.
  - Why Representative: It represents a sophisticated memory retrieval approach (exemplar-based prompting) from single-turn web agents and checks its adaptability to a conversational setting.

These baselines cover a spectrum from basic element selection to advanced LLM-powered planning with various memory management and context handling strategies, providing a robust comparison for Self-MAP.
5.4. Implementation Details
- Candidate HTML Element Ranker:
  - The DeBERTa-v3-base (He et al., 2021) model is fine-tuned as the ranker for selecting candidate HTML elements.
  - Training: During training, 5 random elements (including the positive ground-truth candidate) are used.
  - Evaluation: For evaluation, the ranker selects the top-50 elements when compared in groups of 5.
  - Hyperparameters: batch size = 32, trained for 5 epochs.
- Action Planning (Generation Model):
  - Flan-T5 (Chung et al., 2022) is used as the generation model, specifically the Flan-T5-base and Flan-T5-large versions, for both MCQ-based and generation-based action planning.
  - Context Length: The maximum sequence length is set to 2,048 tokens, while the tokenizer's maximum context length is 512 tokens, implying that inputs larger than 512 tokens would be truncated or processed in chunks. This highlights the context length challenge addressed by Self-MAP. The system message, HTML, user input, and assistant response are tokenized separately.
  - Hyperparameters: batch size = 8 for Flan-T5-base and 4 for Flan-T5-large; trained for 5 epochs.
- Multifaceted Matching (Memory Module):
  - Embedding Model: OpenAI's text-embedding-ada-002 is used to generate embeddings for queries and memory snippets.
  - Similarity Metric: Cosine similarity is used to calculate the relevance between embeddings.
  - Number of Retrieved Memories (K): set to 3 in the main experiments (see Section 6.2.1).
- Memory Refinement (Reflection Module):
  - LLM for Rationale Generation: ChatGPT, gpt-3.5-turbo-1106 version.
  - Generation Parameters: maximum new tokens = 100, temperature = 0 (for deterministic output).
  - Input for Rationale: Only HTML snippets of the positive (ground-truth) element are provided to ChatGPT to generate rationales.
  - Default Rationale: If no positive element is found in the HTML snippet, a default rationale is used: "The assistant's answer is derived from the absence of a specific option in the provided HTML content, leading to the conclusion that none of the options provided are suitable for the user's task."

A fine-tuning sketch for the generation-based planner follows below.
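For orientation, a minimal Hugging Face fine-tuning sketch for the generation-based planner; the output path, the prompt/action field names, and the learning rate are assumptions, not values from the paper:

```python
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          Seq2SeqTrainer, Seq2SeqTrainingArguments)

tok = AutoTokenizer.from_pretrained("google/flan-t5-base")
model = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-base")

def preprocess(example):
    # `prompt` is the assembled planning input; `action` is the target string,
    # e.g. "<a id=12> Buy tickets </a> ; CLICK" (format is illustrative).
    x = tok(example["prompt"], truncation=True, max_length=2048)
    x["labels"] = tok(example["action"], truncation=True).input_ids
    return x

args = Seq2SeqTrainingArguments(
    output_dir="selfmap-flan-t5",
    per_device_train_batch_size=8,  # 8 for base, 4 for large (per the paper)
    num_train_epochs=5,
    learning_rate=5e-5,             # assumption: the paper's value is not given here
)
# trainer = Seq2SeqTrainer(model=model, args=args,
#                          train_dataset=dataset.map(preprocess))
# trainer.train()
```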
6. Results & Analysis
6.1. Core Results Analysis
The following are the results from Table 2 of the original paper:
| | Cross-Task | | | | Cross-Website | | | | Cross-Subdomain | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Method | Ele. Acc | Op. F1 | SSR | TSR | Ele. Acc | Op. F1 | SSR | TSR | Ele. Acc | Op. F1 | SSR | TSR |
| Base Model: Flan-T5-base | | | | | | | | | | | | |
| DeBERTa (He et al., 2021) | 36.8 | - | - | - | 31.7 | - | - | - | 27.7 | - | - | - |
| MINDACT (GPT-3.5) (Deng et al., 2023) | 4.3 | 27.6 | 1.9 | 1.0 | 6.7 | 22.2 | 2.1 | 1.7 | 4.0 | 22.9 | 1.5 | 1.1 |
| MINDACT (Deng et al., 2023) | 43.2 | 79.1 | 36.6 | 14.2 | 38.8 | 69.4 | 29.2 | 15.2 | 41.9 | 77.2 | 35.5 | 15.7 |
| MINDACT + CAR (Anand et al., 2023) | 47.8 | 78.8 | 41.4 | 16.1 | 37.0 | 67.5 | 32.2 | 9.6 | 41.2 | 75.3 | 35.4 | 13.2 |
| MINDACT + Fixed (Huq et al., 2023) | 51.0 | 80.8 | 42.6 | 18.4 | 42.4 | 70.0 | 35.4 | 15.3 | 43.1 | 77.6 | 37.5 | 17.7 |
| Synapse (Zheng et al., 2024b) | 49.6 | 79.9 | 41.9 | 18.4 | 43.1 | 70.6 | 33.1 | 13.7 | 41.7 | 77.8 | 35.9 | 16.0 |
| Self-MAP | 56.2 | 82.5 | 47.1 | 24.7 | 48.3 | 71.8 | 40.6 | 18.2 | 46.4 | 79.1 | 38.3 | 20.8 |
| Base Model: Flan-T5-large | | | | | | | | | | | | |
| MINDACT (Deng et al., 2023) | 59.0 | 80.6 | 53.2 | 26.0 | 43.6 | 67.6 | 36.5 | 12.4 | 46.8 | 74.0 | 38.9 | 21.8 |
| MINDACT + CAR (Anand et al., 2023) | 54.5 | 79.5 | 47.8 | 19.8 | 43.2 | 69.2 | 36.1 | 12.2 | 44.5 | 75.0 | 40.2 | 15.6 |
| MINDACT + Fixed (Huq et al., 2023) | 58.0 | 79.7 | 51.3 | 26.4 | 46.2 | 69.7 | 37.6 | 15.2 | 47.4 | 74.9 | 38.8 | 21.4 |
| Synapse (Zheng et al., 2024b) | 57.5 | 82.0 | 50.0 | 23.2 | 45.1 | 69.0 | 37.1 | 13.0 | 47.4 | 74.1 | 39.3 | 19.4 |
| Self-MAP | 58.1 | 80.5 | 51.7 | 26.6 | 44.8 | 68.8 | 36.8 | 15.7 | 52.0 | 77.1 | 43.6 | 25.4 |
Table 2: Experimental results on MT-Mind2Web. TSR can be regarded as the main metric.
The experimental results in Table 2 provide a comprehensive evaluation of Self-MAP against various baselines on the MT-Mind2Web dataset, using both Flan-T5-base and Flan-T5-large as backbone models. TSR (Turn Success Rate) is highlighted as the main metric, indicating the agent's ability to complete an entire conversational turn successfully.
Key Observations and Analysis:
- Weak Baselines (DeBERTa, MINDACT (GPT-3.5)):
  - DeBERTa (element selection only) shows low Ele. Acc (27.7-36.8%) and no Op. F1, SSR, or TSR, confirming it is insufficient for complex web navigation.
  - MINDACT (GPT-3.5), which relies on in-context learning without fine-tuning, performs extremely poorly (1.0-1.7% TSR). This reinforces the finding that, for intricate web interaction tasks, LLMs require fine-tuning or sophisticated context management beyond simple prompting.
- Impact of Context-Aware Rewriting (CAR): MINDACT + CAR helps on Cross-Task with Flan-T5-base (TSR 16.1% vs 14.2%) but hurts generalization (TSR 9.6% vs 15.2% on Cross-Website), and it generally underperforms the vanilla MINDACT with Flan-T5-large. This suggests that rewriting conversational instructions with GPT-3.5 can obfuscate the original instruction: the LLM-based rewriting may introduce errors or lose subtle context, reducing performance especially in generalization settings like Cross-Website.
- Comparison of Memory Strategies (MINDACT + Fixed, Synapse):
  - Both MINDACT + Fixed and Synapse generally outperform the vanilla MINDACT. This validates the core idea that incorporating conversational interaction history (memory) is beneficial for conversational web navigation.
  - Surprisingly, Synapse (a SOTA method on Mind2Web utilizing kNN-based exemplar retrieval) performs worse than MINDACT + Fixed in several scenarios (e.g., Flan-T5-base Cross-Website TSR: 13.7% vs 15.3%; Flan-T5-large Cross-Website TSR: 13.0% vs 15.2%; Flan-T5-large Cross-Task TSR: 23.2% vs 26.4%). This suggests that the coarse-grained kNN matching in Synapse, while effective for single-turn exemplar retrieval, is insufficient for measuring the intricate relevance between the current conversation status and candidate memory snippets in the more complex conversational setting. The fixed memory approach, despite its simplicity, may provide more consistent and relevant context.
- Impact of Base Model Size: Using a stronger base model (Flan-T5-large vs Flan-T5-base) generally improves performance across most metrics and baselines. For example, MINDACT's Cross-Task TSR jumps from 14.2% (Flan-T5-base) to 26.0% (Flan-T5-large), indicating that larger LLMs inherently possess better understanding and reasoning capabilities for the task.
- Superiority of Self-MAP:
  - Self-MAP consistently and substantially outperforms all baselines across evaluation settings and both LLM sizes.
  - Quantified Improvement (Flan-T5-base): Self-MAP achieves TSR scores of 24.7% (Cross-Task), 18.2% (Cross-Website), and 20.8% (Cross-Subdomain), a significant gain over the strongest baselines; on Cross-Task, its 24.7% TSR is 6.3 points above MINDACT + Fixed (18.4%).
  - Quantified Improvement (Flan-T5-large): Self-MAP maintains its lead on TSR with 26.6% (Cross-Task), 15.7% (Cross-Website), and 25.4% (Cross-Subdomain). While it shows marginally lower SSR than MINDACT on Cross-Task (51.7% vs 53.2%), its TSR still edges out MINDACT (26.0%) and MINDACT + Fixed (26.4%), and it demonstrates a substantial improvement on Cross-Subdomain (TSR 25.4% vs 21.8% for MINDACT and 21.4% for MINDACT + Fixed), highlighting its robustness in challenging generalization scenarios.
  - Overall Validation: The consistent outperformance validates the effectiveness of Self-MAP's memory-augmented planning framework and its self-reflection strategy for enhancing memory utilization in conversational web navigation. The improvements are particularly notable for TSR, the most holistic measure of task success in this multi-turn setting.
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 3 of the original paper:
| | Cross-Task | | | | Cross-Website | | | | Cross-Subdomain | | | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Variant | Ele. Acc | Op. F1 | SSR | TSR | Ele. Acc | Op. F1 | SSR | TSR | Ele. Acc | Op. F1 | SSR | TSR |
| Self-MAP | 56.2 | 82.5 | 47.1 | 24.7 | 48.3 | 71.8 | 40.6 | 18.2 | 46.4 | 79.1 | 38.3 | 20.8 |
| w/o Generation-based Planning | 51.7 | 79.4 | 43.5 | 22.2 | 43.1 | 69.5 | 34.9 | 15.5 | 44.8 | 77.2 | 37.3 | 17.7 |
| w/o Memory Simplification | 50.5 | 80.7 | 41.0 | 20.7 | 44.9 | 69.6 | 36.9 | 16.6 | 42.3 | 79.2 | 36.4 | 15.9 |
| w/o Memory Refinement | 52.1 | 81.3 | 43.0 | 23.2 | 48.9 | 70.8 | 39.1 | 18.1 | 46.3 | 78.7 | 37.2 | 17.8 |
| w/o Multifaceted Matching | 52.6 | 80.6 | 44.3 | 21.6 | 46.9 | 71.2 | 37.9 | 17.2 | 44.8 | 78.6 | 35.8 | 17.8 |
Table 3: Ablation study. "w/o Generation-based Planning" denotes using MCQ-based planning instead, while "w/o Multifaceted Matching" denotes prepending the chronological conversation context without retrieval.
The ablation study in Table 3 systematically evaluates the contribution of each component of the Self-MAP framework, using Flan-T5base as the backbone.
- w/o Generation-based Planning (i.e., using MCQ-based planning):
  - Analysis: This variant shows a noticeable drop compared to the full Self-MAP: Cross-Task TSR falls from 24.7% to 22.2%, Cross-Website TSR from 18.2% to 15.5%, and Cross-Subdomain TSR from 20.8% to 17.7%.
  - Conclusion: Generation-based planning is superior, attributed to the advanced generative capabilities of LLMs and their efficiency in conserving context space (generating an action directly can be more compact than selecting from a detailed list of options).
- w/o Memory Simplification:
  - Analysis: Removing this component leads to the most significant performance degradation across all metrics: Cross-Task TSR drops from 24.7% to 20.7% (4.0 points), Cross-Website TSR from 18.2% to 16.6%, and Cross-Subdomain TSR from 20.8% to 15.9% (4.9 points).
  - Conclusion: Memory simplification (filtering irrelevant DOM elements) is the most critical factor for Self-MAP's success. This underscores the paramount importance of optimizing the limited context space by removing noise and irrelevant information, especially given the large HTML sizes in web navigation.
- w/o Memory Refinement:
  - Analysis: Removing the LLM-generated rationales also causes a performance drop, though less severe than removing memory simplification: Cross-Task TSR goes from 24.7% to 23.2%, Cross-Website TSR from 18.2% to 18.1%, and Cross-Subdomain TSR from 20.8% to 17.8%.
  - Conclusion: Memory refinement contributes positively, with its impact more pronounced in cross-task scenarios than in cross-website or cross-subdomain ones. The rationales (which explain decision-making processes) are highly valuable when the agent encounters entirely new tasks, but their generalizability in modeling decision-making across highly diverse websites or subdomains may be relatively lower, possibly because specific rationales do not transfer perfectly to visually or logically different interfaces.
- w/o Multifaceted Matching (i.e., prepending the chronological conversation context without retrieval):
  - Analysis: This variant, which reverts to simple chronological memory appending, also sees a notable decrease: Cross-Task TSR drops from 24.7% to 21.6%, Cross-Website TSR from 18.2% to 17.2%, and Cross-Subdomain TSR from 20.8% to 17.8%.
  - Conclusion: Multifaceted matching for memory retrieval significantly outperforms simply prepending chronological context, highlighting the necessity of filtering out noisy conversational interaction history to focus on the parts that share both semantic and trajectory similarity. Unfiltered chronological context can dilute the signal and confuse the LLM.
6.2.1. Effect of the Number of Retrieved Memory Snippets (K)
The following are the results from Figure 4 of the original paper:

This image is a chart showing model performance on four metrics (element accuracy, operation F1, step success rate, and turn success rate) for different numbers of retrieved memory snippets (K), comparing the cross-task, cross-website, and cross-subdomain settings.
Figure 4: Performance with different numbers of retrieved memory snippets.
- Analysis: The graph (Figure 4) illustrates how performance (specifically TSR on the different test splits) changes as the number of retrieved memory snippets ($K$) varies from 1 to 5.
  - Performance initially increases as $K$ grows from 1 to 3 across all three test splits, indicating that retrieving a small number of relevant memory snippets provides valuable contextual information that helps the agent plan.
  - As $K$ increases beyond 3 (to 4 or 5), performance plateaus or, in some cases (e.g., Cross-Task, Cross-Website), even slightly degrades.
- Conclusion: This suggests an optimal point for memory retrieval. Retrieving too few snippets can miss crucial context, while retrieving too many introduces noisy information from irrelevant turns. Given that MT-Mind2Web conversations average about 5 turns (Table 1), increasing $K$ too far starts to include less relevant or even misleading information, which overwhelms the LLM or dilutes the helpful signal within its limited context window. The value $K = 3$ chosen for the main experiments strikes a good balance.
6.2.2. Analysis of Generalizability
The following are the results from Figure 5 of the original paper:

This image is a bar chart showing a performance metric for many websites grouped by category (e.g., travel, shopping, restaurants); bar height represents the metric value and categories are distinguished by color, comparing websites across multiple domains.
Figure 5: Per-website results for Cross-Task (ending at yellowpages), Cross-Website (from exploretock to redbox), and Cross-Subdomain (from koa to airbnb).
- Analysis: Figure 5, which shows TSR for Self-MAP under the different generalization settings, allows a comparison of generalization challenges.
  - Cross-Task vs. Others: All models (including Self-MAP) perform best in the Cross-Task setting compared to Cross-Website and Cross-Subdomain. This is logical because Cross-Task means the agent encounters new tasks on familiar websites/domains, where interaction patterns and UI structures are more consistent.
  - Cross-Website vs. Cross-Subdomain: There is no significant performance difference between the Cross-Website and Cross-Subdomain settings.
- Conclusion:
  - The primary challenges to generalization come from diversity in website designs and interaction logic, rather than domain specifics alone. Different websites, even within the same domain, can have vastly different UI layouts and interaction flows.
  - Self-MAP's performance gap between Cross-Task and the other two settings (per the Flan-T5-base numbers in Table 2) is more substantial than observed in the original Mind2Web dataset (which focused on single-turn tasks). This implies that introducing multi-turn user-agent interactions significantly complicates the interaction logic, making generalization to novel websites or subdomains even harder: the agent must reason about both dynamic environment states and an evolving conversational context simultaneously.
6.2.3. Analysis of Conversation Prompt Designs
The following are the results from Figure 6 of the original paper:

This image is a chart showing the performance of different conversation prompt designs on several metrics (element accuracy, operation F1, step success rate, and turn success rate) across the cross-task, cross-website, and cross-subdomain settings, comparing Synapse and multifaceted variants of the models.
Figure 6: Performance in terms of different Conversation Prompt Designs.
- Analysis: Figure 6 compares different prompt designs for Synapse and Self-MAP, concerning memory ordering and the inclusion of state-based information in matching.
  - Relevance-based Order vs. Chronological Order: Both Synapse and Self-MAP perform much better when memory snippets are arranged in a relevance-based order rather than a chronological (sequential) order. For example, Self-MAP (Multifaceted) significantly outperforms Self-MAP (Chronological) across all metrics.
  - State-based Information in Matching: In Self-MAP, the multifaceted matching approach queries using the user instruction and the agent action sequence. The paper also tested including "state-based information" in the retrieved memory. The results suggest that multifaceted matching typically achieves better performance without explicit state-based information in the retrieved memory. This is explained by the idea that while state is critical for planning, explicitly embedding raw state within the retrieved memory snippets (especially when using text embeddings) is less effective than letting memory simplification and refinement preprocess it. Also, in action-level matching, which lacks a sequential trajectory structure, state information cannot be directly inferred from the trajectory.
- Conclusion: Intelligent memory retrieval based on relevance (both semantic and trajectory-level) is crucial, significantly outperforming simple chronological prepending. The chosen prompt designs for Self-MAP, which use relevance-based ordering and manage state information through memory simplification rather than raw inclusion in retrieval queries, are the most effective.
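As a concrete illustration of the two orderings compared above, the sketch below assembles the planning prompt either chronologically or by relevance (reusing `MemorySnippet` from the earlier sketch). The prompt format is an illustrative assumption, and whether the most relevant snippet goes first or last is a design choice; here it is placed nearest the current instruction.

```python
def build_prompt(current_instruction: str,
                 snippets: list[MemorySnippet],
                 scores: list[float],
                 order: str = "relevance") -> str:
    """Assemble the conversation prompt under a chosen memory ordering."""
    if order == "chronological":
        ordered = snippets  # keep the original turn order
    else:
        # Sort ascending by score so the most relevant snippet sits last,
        # adjacent to the current instruction and safest from truncation.
        ordered = [s for _, s in sorted(zip(scores, snippets), key=lambda p: p[0])]
    memory_text = "\n".join(
        f"[Past turn] {s.instruction}\n[Actions] {s.action_sequence}"
        for s in ordered
    )
    return f"{memory_text}\n[Current instruction] {current_instruction}\n[Next action]"
```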
6.2.4. Runtime Analysis
The following are the results from Table 4 of the original paper:
| Methods | Flan-T5 (base) | Flan-T5 (large) |
|---|---|---|
| MindAct | 1.90s | 3.43s |
| MindAct + CAR | 1.23s | 2.11s |
| MindAct + Fixed | 1.69s | 3.35s |
| Synapse | 1.95s | 3.58s |
| Self-MAP | 2.56s | 4.29s |

Table 4: Runtime analysis (seconds per predicted action).
- Analysis: Table 4 compares the runtime (in seconds) for predicting the next action on a single RTX A5000 GPU under the Flan-T5 (base) and Flan-T5 (large) configurations.
  - MindAct + CAR shows the lowest runtime (e.g., 1.23s for Flan-T5 base). This is because it discards all historical trajectories after rewriting the query, significantly shortening the input token length. However, as shown in Table 2, this comes at a significant cost to performance.
  - Self-MAP is generally slower than the other baselines. For Flan-T5 (base), Self-MAP takes 2.56s, compared to MindAct (1.90s), MindAct + Fixed (1.69s), and Synapse (1.95s). Similarly, for Flan-T5 (large), Self-MAP takes 4.29s.
- Conclusion: While Self-MAP introduces a modest increase in runtime due to its memory retrieval, simplification, and refinement processes, its runtime remains within a feasible range for deployment in complex task environments. This marginal increase in computational cost is justified by the substantial improvements in accuracy and adaptability demonstrated in the performance evaluations, making it a valuable tool for real-world applications where task success is paramount.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces the challenging new task of Conversational Web Navigation, which requires web agents to engage in complex, multi-turn interactions with both users and dynamic web environments. To facilitate research in this domain, the authors developed MT-Mind2Web, a novel dataset constructed by transforming single-turn interactions into conversational sessions through human-AI collaborative annotation. To overcome the inherent difficulties, particularly the limited context length of LLMs and the context-dependency of conversational tasks, the paper proposes Self-MAP (self-reflective memory-augmented planning). This framework intelligently leverages memory utilization via multifaceted matching for relevant snippet retrieval and self-reflection through memory simplification and LLM-generated rationales. Extensive experiments on MT-Mind2Web validate Self-MAP's effectiveness, demonstrating its consistent and substantial outperformance over strong baselines, especially in Turn Success Rate, across diverse generalization settings. The ablation studies highlight the critical role of memory simplification and the benefits of generation-based planning and relevance-based memory retrieval.
7.2. Limitations & Future Work
The authors acknowledge two main limitations and suggest future research directions:
- Multimodal Environment: The current work primarily focuses on HTML-grounded methods. While the MT-Mind2Web dataset is compatible with multimodal environments (as its base, Mind2Web, is), the presented Self-MAP framework does not explicitly incorporate multimodal LLMs (e.g., GPT-4V). Future work could investigate adapting Self-MAP to leverage visual information from web pages, potentially improving UI understanding and action grounding in more complex or visually driven web scenarios.
- Online Evaluation: The paper employs offline evaluation settings, which are common in both conversational tasks and single-turn web navigation. While convenient for benchmarking, this setup inherits the drawback of offline evaluation: it cannot accurately assess dynamic interactions with real, evolving web environments. Future work could explore online evaluation methods to better simulate real-world usage and validate the agent's robustness to unforeseen changes and dynamic elements.
7.3. Personal Insights & Critique
This paper makes a crucial step towards building more practical and human-friendly LLM-powered web agents. The introduction of Conversational Web Navigation as a task, coupled with the MT-Mind2Web dataset, addresses a significant gap in the current research landscape where web agents often operate in a single-shot interaction paradigm.
Key Strengths and Inspirations:
- Realistic Problem Formulation: Recognizing the multi-turn nature of human interaction is vital for real-world AI agent adoption. This paper moves beyond isolated commands to address complex, context-dependent dialogues.
- Elegant Solution to Context Length: The Self-MAP framework provides a compelling solution to the perennial context length limitation of LLMs. The combination of multifaceted matching, memory simplification (filtering irrelevant DOM elements; see the sketch after this list), and memory refinement (generating rationales) is a powerful, multi-pronged approach that intelligently curates essential information. The finding that memory simplification is the most critical factor is a valuable insight, emphasizing that less is more when raw context is not highly relevant.
- Value of Self-Reflection: Using LLMs to generate rationales for past actions (even correct ones) is a clever way to enrich memory. It transforms raw action sequences into more interpretable, reasoning-backed "experiences," akin to how humans learn from past decisions. This could inspire similar approaches in other memory-augmented agent systems.
- Human-AI Collaboration in Data Curation: The human-AI collaborative annotation process for MT-Mind2Web is commendable. Leveraging ChatGPT for initial instruction decomposition, followed by human refinement, showcases an efficient and effective way to build complex datasets, potentially reducing annotation costs while maintaining quality.
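To illustrate the memory-simplification idea referenced above: the paper filters each stored webpage state down to the elements relevant to the interaction. The heuristic below (keep the acted-on element plus a few other interactive elements) is my own simplified assumption, not the paper's method, which relies on candidate-element filtering rather than this rule of thumb.

```python
from bs4 import BeautifulSoup

def simplify_snapshot(html: str, acted_element_id: str, keep: int = 5) -> str:
    """Keep the acted-on element plus a few interactive elements as context."""
    soup = BeautifulSoup(html, "html.parser")
    interactive = soup.find_all(["a", "button", "input", "select", "textarea"])
    # Always keep the element the agent actually interacted with.
    kept = [el for el in interactive if el.get("id") == acted_element_id]
    # Pad with a handful of other interactive elements as lightweight context.
    kept += [el for el in interactive if el.get("id") != acted_element_id][: keep - len(kept)]
    return "\n".join(str(el) for el in kept)
```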
Potential Issues, Unverified Assumptions, and Areas for Improvement:
- Cost and Latency of Memory Refinement: While LLM-generated rationales are effective, the reliance on an external LLM (ChatGPT) for memory refinement introduces API costs and latency. For agents requiring real-time responses, this could become a bottleneck. Further work could explore distilling these rationales into smaller, faster models or integrating rationale generation more deeply within the primary LLM if context allows.
- Generalizability of Rationales: The ablation study noted that memory refinement's contribution was less generalized across cross-website and cross-subdomain settings compared to cross-task. This is an interesting point. While a rationale for "why click 'Add to Cart'" might be universally applicable, the specific UI elements and their surrounding context can vary greatly. The rationales might be too tied to the specific UI patterns seen during training, making them less effective for visually or functionally novel interfaces. Exploring more abstract or robust rationale generation could be beneficial.
- Scalability of Multifaceted Matching: As the memory bank grows very large (e.g., to hundreds of thousands of interaction steps), the time taken for embedding generation and cosine similarity computation could become significant, even with efficient vector databases. Strategies such as hierarchical memory or approximate nearest-neighbor search may be needed (a minimal sketch follows at the end of this section).
- Error Propagation in Multi-turn Tasks: While Self-MAP improves TSR, it remains relatively low (e.g., 20-25%). In a multi-turn setting, an error in an early turn can cascade and make subsequent turns impossible to complete. The current framework focuses on correct action prediction for the current turn. Future work could explore mechanisms for error detection, self-correction, or user clarification within the conversation to recover from mistakes, making the agent more robust.
- Beyond HTML: The current focus on HTML is reasonable, but many web applications are increasingly dynamic and rely on complex JavaScript rendering, making raw HTML parsing insufficient. Integrating visual perception (as suggested in the limitations) is essential for future practical web agents.

Overall, Self-MAP represents a significant advancement in conversational web agents. It offers a principled and effective way to manage context and enhance LLM planning in complex interactive environments, paving the way for more intelligent and user-friendly AI assistants.
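On the scalability concern raised above, one standard mitigation (not discussed in the paper) is approximate nearest-neighbor search over the stored snippet embeddings. A minimal FAISS sketch follows; the embedding dimension, cluster count, and random stand-in vectors are all illustrative assumptions.

```python
import faiss
import numpy as np

dim = 384  # e.g., MiniLM-sized sentence embeddings (assumption)
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFFlat(quantizer, dim, 100, faiss.METRIC_INNER_PRODUCT)

# Stand-in for the stored snippet embeddings; real vectors would come
# from the same encoder used at query time.
memory_vecs = np.random.rand(100_000, dim).astype("float32")
faiss.normalize_L2(memory_vecs)  # inner product == cosine on unit vectors
index.train(memory_vecs)         # learn the 100 coarse clusters
index.add(memory_vecs)

query = np.random.rand(1, dim).astype("float32")
faiss.normalize_L2(query)
scores, ids = index.search(query, 3)  # approximate top-3 snippet indices
```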