Paper status: completed

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

Published: 06/13/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Synapse integrates state abstraction, trajectory-as-exemplar prompting, and exemplar memory to overcome LLM limitations, achieving notable success and generalization in complex long-horizon computer control tasks.

Abstract

Building agents with large language models (LLMs) for computer control is a burgeoning research area, where the agent receives computer states and performs actions to complete complex tasks. Previous computer agents have demonstrated the benefits of in-context learning (ICL); however, their performance is hindered by several issues. First, the limited context length of LLMs and complex computer states restrict the number of exemplars, as a single webpage can consume the entire context. Second, the exemplars in current methods, such as high-level plans and multi-choice questions, cannot represent complete trajectories, leading to suboptimal performance in long-horizon tasks. Third, existing computer agents rely on task-specific exemplars and overlook the similarity among tasks, resulting in poor generalization to novel tasks. To address these challenges, we introduce Synapse, a computer agent featuring three key components: i) state abstraction, which filters out task-irrelevant information from raw states, allowing more exemplars within the limited context, ii) trajectory-as-exemplar prompting, which prompts the LLM with complete trajectories of the abstracted states and actions to improve multi-step decision-making, and iii) exemplar memory, which stores the embeddings of exemplars and retrieves them via similarity search for generalization to novel tasks. We evaluate Synapse on MiniWoB++, a standard task suite, and Mind2Web, a real-world website benchmark. In MiniWoB++, Synapse achieves a 99.2% average success rate (a 10% relative improvement) across 64 tasks using demonstrations from only 48 tasks. Notably, Synapse is the first ICL method to solve the book-flight task in MiniWoB++. Synapse also exhibits a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme in Mind2Web.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control

1.2. Authors

The paper is authored by Longtao Zheng, Rundong Wang, Xinrun Wang, and Bo An, all affiliated with Nanyang Technological University (NTU), Singapore. Their research interests appear to lie in artificial intelligence, particularly in developing intelligent agents capable of computer control using large language models.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server for scientific papers. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a widely recognized platform for disseminating cutting-edge research in AI and other scientific fields, making it a common initial publication venue for impactful work.

1.4. Publication Year

The paper was first published on arXiv on June 13, 2023.

1.5. Abstract

This paper introduces Synapse, a novel computer agent designed to address several limitations of existing large language model (LLM)-based agents for computer control. The core problems identified are: (1) restricted exemplar quantity due to LLM context length and complex computer states; (2) suboptimal performance in long-horizon tasks because current exemplars (e.g., high-level plans, multi-choice questions) do not represent complete trajectories; and (3) poor generalization to novel tasks due to reliance on task-specific exemplars and neglect of task similarities. Synapse tackles these issues through three key components: (i) state abstraction, which filters task-irrelevant information from raw states to enable more exemplars within context; (ii) trajectory-as-exemplar (TaE) prompting, which uses complete sequences of abstracted states and actions as prompts to improve multi-step decision-making; and (iii) exemplar memory, which stores and retrieves relevant exemplars via similarity search for better generalization to new tasks. Evaluated on MiniWoB++ and Mind2Web, Synapse achieved a 99.2% average success rate on MiniWoB++ (a 10% relative improvement over prior ICL methods, notably solving the challenging book-flight task) using demonstrations from only 48 out of 64 tasks. On Mind2Web, it demonstrated a 56% relative improvement in average step success rate over the previous state-of-the-art prompting scheme.

Official Source Link: https://arxiv.org/abs/2306.07863
PDF Link: https://arxiv.org/pdf/2306.07863v3.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The overarching goal in artificial intelligence is to create agents capable of performing complex tasks to reduce human effort in routine operations. A significant step towards this is developing agents proficient in computer control, where an agent perceives computer states (like webpage HTML or screenshots) and performs actions (via keyboard/mouse) to complete natural language-specified tasks (e.g., flight booking, email management).

Recent research has increasingly focused on using large language models (LLMs) for these agents, leveraging their in-context learning (ICL) capabilities, which allow them to generate actions based on a few demonstrations (also called exemplars). This approach offers advantages in generalization and data efficiency compared to traditional methods like behavioral cloning (BC) or deep reinforcement learning (RL).

However, existing LLM-based computer agents face several critical challenges:

  1. Limited Context Length of LLMs: LLMs have a finite input capacity. When computer states (like detailed webpage HTML) are complex and lengthy, they consume a large portion of this context, severely restricting the number of exemplars that can be provided. This limits the LLM's ability to learn from diverse examples, hindering few-shot learning performance.

  2. Incomplete Trajectory Representation in Exemplars: Current methods often use high-level plans or multi-choice questions (MCQs) as exemplars. These structures fail to capture complete trajectories (sequences of states and actions), leading to error accumulation over multiple decision-making steps, especially in long-horizon tasks (tasks requiring many sequential actions). They don't provide the LLM with enough information about how actions unfold over time.

  3. Poor Generalization to Novel Tasks: Many existing agents rely on task-specific exemplars, meaning they require demonstrations tailored for each new task. They often overlook similarities between tasks (e.g., different email management scenarios sharing similar interfaces), leading to a lack of generalization and inability to handle novel tasks efficiently.

    The paper aims to solve these problems by introducing a novel agent architecture and prompting scheme that specifically addresses the constraints of LLM context, the need for complete trajectory information, and the desire for better task generalization.

2.2. Main Contributions / Findings

The paper introduces Synapse, an LLM-powered computer agent with three core components designed to overcome the aforementioned challenges:

  1. State Abstraction: This component processes raw, complex computer states (e.g., long HTML documents) into concise, task-relevant observations. By filtering out task-irrelevant information, it significantly reduces the token length per state. This allows more exemplars to fit within the LLM's limited context window, improving the efficiency of few-shot learning.
  2. Trajectory-as-Exemplar (TaE) Prompting: Instead of fragmented prompts, Synapse utilizes full, successful trajectories (sequences of abstracted states and actions) as few-shot exemplars. This provides the LLM with a complete history of interactions, enabling better multi-step decision-making, reducing error accumulation in long-horizon tasks, and implicitly encouraging temporal abstraction (generating multiple actions before requesting a new state).
  3. Exemplar Memory: To enable generalization to novel tasks and exploit task similarity, Synapse stores embeddings of exemplars (successful trajectories) along with their task metadata. When a new task arises, relevant exemplars are retrieved from this memory via similarity search, automatically providing contextually appropriate demonstrations without requiring task-specific hard-coding.

Key Conclusions / Findings:

  • Human-Level Performance on MiniWoB++: Synapse achieved a 99.2% average success rate across 64 tasks in MiniWoB++, marking a 10% relative improvement over the previous state-of-the-art ICL methods. Notably, it was the first ICL method to successfully solve the challenging book-flight task, which requires complex, long-horizon decision-making under intricate states.
  • Improved Generalization with Fewer Demonstrations: Synapse achieved its high performance on MiniWoB++ using demonstrations from only 48 tasks for 64 tasks, demonstrating superior generalization compared to methods requiring task-specific exemplars for each task. It also achieved this without relying on self-correction, a common but resource-intensive feature in other ICL agents.
  • Significant Improvement on Real-World Benchmarks: On Mind2Web, a real-world website benchmark, Synapse demonstrated a 56% relative improvement in average step success rate over MindAct, the previous state-of-the-art ICL method, when using GPT-3.5. With CodeLlama-7B, it achieved an average 2.5x step success rate compared to MindAct. This highlights its effectiveness in more realistic and complex web navigation scenarios.
  • Validation of Component Effectiveness: Ablation studies confirmed the individual contributions of each component: state abstraction for handling complex states and allowing more exemplars, TaE prompting for multi-step decision-making in long-horizon tasks, and exemplar memory for generalization across tasks.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Synapse, a reader should be familiar with the following core concepts:

  • Large Language Models (LLMs): These are advanced neural network models (e.g., GPT-3, GPT-4, Llama) trained on vast amounts of text data. They can understand, generate, and process human language, making them powerful tools for various natural language processing (NLP) tasks, including reasoning, planning, and code generation. Their ability to perform in-context learning (ICL) is central to agents like Synapse.
  • In-Context Learning (ICL): A paradigm where LLMs learn from a few examples (demonstrations or exemplars) provided directly in their input prompt, without requiring explicit model weight updates (fine-tuning). The LLM leverages its pre-trained knowledge to generalize from these examples to new, similar tasks presented within the same prompt.
  • Few-shot Learning: A learning setting where a model is trained or performs a task with only a small number of examples (shots). In the context of LLMs, few-shot learning is often achieved through ICL.
  • Computer Control Agents: AI systems designed to interact with and control computer interfaces (e.g., web browsers, operating systems) to complete tasks. They perceive computer states (e.g., HTML, screenshots) and perform actions (e.g., clicks, typing, key presses) via simulated keyboard and mouse.
  • HTML (HyperText Markup Language): The standard markup language for documents designed to be displayed in a web browser. HTML defines the structure and content of web pages. Computer control agents often parse HTML to understand the current state of a webpage and identify interactive elements.
  • Embeddings: Numerical representations of text, images, or other data that capture their semantic meaning. Words, sentences, or even entire documents can be converted into embedding vectors in a high-dimensional space, where semantically similar items are located closer to each other. Embedding models are used to generate these vectors.
  • Similarity Search: A technique used to find items in a database that are similar to a given query item. This is often performed by comparing the embedding vectors of items using distance metrics (e.g., Euclidean distance, cosine similarity). Vector databases are specialized databases optimized for storing and querying embedding vectors. A toy example follows this list.
  • MiniWoB++: A standard benchmark task suite for evaluating computer control agents. It consists of a variety of web-based tasks (e.g., filling forms, clicking elements, using a terminal) that simulate human-computer interactions in a simplified environment.
  • Mind2Web: A more realistic and challenging benchmark dataset for web navigation tasks. It features human demonstrations on open-domain tasks across diverse real-world websites, designed to test an agent's ability to generalize across cross-task, cross-website, and cross-domain scenarios.
  • Trajectory: In the context of computer control or reinforcement learning, a trajectory refers to a sequence of states, actions, and observations experienced by an agent in an environment over time, typically from the start of a task to its completion or termination. A complete trajectory captures the full interaction history.
  • Temporal Abstraction: The ability to represent or generate actions at different levels of granularity over time. In computer control, this might mean generating a sequence of simple low-level actions (e.g., type('ls'), press('enter')) as a single conceptual "step" for the LLM, rather than querying the LLM after each atomic action.
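To make the embedding and similarity-search notions concrete, here is a toy example (illustrative only; real embedding models such as text-embedding-ada-002 produce vectors with hundreds of dimensions):

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity: near 1.0 for similar directions, near 0.0 for unrelated ones."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy 3-dimensional "embeddings" standing in for real model outputs.
email_task_1 = np.array([0.9, 0.1, 0.0])  # e.g., "forward the email from Bob"
email_task_2 = np.array([0.8, 0.2, 0.1])  # e.g., "reply to the email from Alice"
flight_task = np.array([0.0, 0.1, 0.9])   # e.g., "book a flight to TVC"

print(cosine_similarity(email_task_1, email_task_2))  # high: similar email tasks
print(cosine_similarity(email_task_1, flight_task))   # low: unrelated tasks
```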

3.2. Previous Works

The paper discusses several key prior works and their limitations, primarily categorized into traditional training-based methods and LLM-based ICL methods.

3.2.1. Training-Based Methods

  • Behavioral Cloning (BC): (Pomerleau, 1989) A supervised learning approach where a model learns to imitate expert demonstrations (state-action pairs). For computer control, this involves training a neural network to predict actions given screenshots or HTML states, based on human recorded actions.
    • Limitation: Requires large-scale datasets and struggles with out-of-distribution states.
  • Deep Reinforcement Learning (RL): (Sutton and Barto, 2018) Agents learn optimal policies through trial-and-error interactions with an environment, maximizing a reward signal.
    • Limitation: Often sample-inefficient, difficult to explore complex environments, and sensitive to reward function design.
  • CC-Net (Humphreys et al., 2022): A notable BC+RL method that achieved human-level performance on MiniWoB++.
    • Limitation: Required an extensive dataset of 2.4 million demonstrations (6,300 hours of human effort), making it challenging to generalize to new tasks or customize for users due to its data-hungry nature.
  • WebN-T5 (Gur et al., 2023a): A fine-tuned variant of the T5 model (Raffel et al., 2020), trained on 12,000 demonstrations across 56 tasks in MiniWoB++.
    • Limitation: Still required a large amount of data and did not reach human-level performance.
  • WebGUM (Furuta et al., 2023): Another fine-tuned LLM that used 346,827 demonstrations for MiniWoB++.
    • Limitation: Similar to WebN-T5, it was data-intensive and fell short of human performance.
  • Pix2Act (Shaw et al., 2023): Solved 59 MiniWoB++ tasks using tree search and BC on 1.3 million demonstrations.
    • Limitation: While effective, still relies on large-scale data, similar to other BC-based methods.

3.2.2. LLM-Based ICL Methods

  • RCI (Kim et al., 2023): An ICL method for MiniWoB++ that prompts LLMs to generate a high-level plan and then grounds each action based on this plan. It employs recursive self-correction.
    • Limitation: Relies on task-specific exemplars (requiring demonstrations for each of the 54 tasks it solves), limiting generalization to novel scenarios. Its exemplar structure (high-level plans) doesn't represent complete trajectories, leading to error accumulation in multi-round LLM queries and long-horizon tasks. The paper notes it may struggle to correctly predict the next action even with a correct plan.
  • MindAct (Deng et al., 2023): Introduced alongside the Mind2Web benchmark. It prompts LLMs to predict the next action using multi-choice questions (MCQ), with candidate choices generated by an element-ranking model.
    • Limitation: The MCQ-based exemplars also do not represent complete trajectories, contributing to error accumulation and suboptimal performance in long-horizon tasks. It struggles with element selection accuracy despite using a ranking model.
  • AdaPlanner (Sun et al., 2023): Utilized environment feedback for self-correction and achieved a 92.9% success rate in 53 tasks.
    • Limitation: Similar to RCI, it relies on self-correction and task-specific exemplars, suggesting similar issues with generalization and data efficiency.
  • WebAgent (Gur et al., 2023b): Employed a small element-ranking model for HTML state filtering and an LLM for few-shot action generation.
    • Limitation: Did not deeply investigate the exemplar structure and memory mechanisms for generalization, which are key aspects of Synapse.

The paper also briefly mentions other LLM advancements that could complement Synapse:

  • Reasoning and Planning (e.g., Chain-of-Thought, ReAct, Tree-of-Thoughts): Methods like Chain-of-Thought (Wei et al., 2022a; Kojima et al., 2022), Least-to-Most (Zhou et al., 2022), ReAct (Yao et al., 2022a), Tree-of-Thoughts (Yao et al., 2023), Reflexion (Shinn et al., 2023), and self-debugging (Chen et al., 2023) improve LLMs' reasoning and planning capabilities. Synapse focuses on action grounding and exemplar structure, which could be combined with these.
  • Tool Use (Wu et al., 2023a; Shen et al., 2023; Schick et al., 2023; Lu et al., 2023; Xu et al., 2023): Leveraging external tools can enhance LLM capabilities.
  • Code Generation (Liang et al., 2023; Singh et al., 2023): Framing policies as code generation, similar to how Synapse uses code for state abstraction and action generation.
  • Skill Libraries (Wang et al., 2023a): Voyager uses skill libraries for life-long skill acquisition, which can be seen as complementary to Synapse's exemplar memory.
  • Context Filtering (Wu et al., 2023b): Identified the issue of limited context and used a pretrained QA model to filter states, similar to MindAct, which aligns with Synapse's state abstraction principle but Synapse uses few-shot LLM learning for this.

3.3. Technological Evolution

The field of computer control agents has evolved from reliance on massive training-based methods like Behavioral Cloning (BC) and Reinforcement Learning (RL) (e.g., CC-Net needing millions of demonstrations) to more data-efficient and generalizable approaches leveraging Large Language Models (LLMs). Early LLM agents focused on generating high-level plans or making decisions in isolation using multi-choice questions.

This evolution highlights a shift from:

  • Data Hunger to Data Efficiency: Moving from millions of demonstrations to few-shot learning via ICL.

  • Task-Specific Training to Generalization: Aiming for models that can perform novel tasks without retraining or extensive new demonstrations.

  • Fragmented Decision-Making to Holistic Trajectories: Recognizing the limitations of step-by-step, isolated decisions and moving towards understanding and generating complete sequences of interactions.

    Synapse fits into this evolution by pushing the boundaries of LLM-based ICL through its focus on optimizing the context window via state abstraction, leveraging the sequential nature of tasks with trajectory-as-exemplar prompting, and enabling broad generalization through an intelligent exemplar memory system. It aims to bridge the gap between impressive LLM capabilities and practical, robust computer control in complex environments.

3.4. Differentiation Analysis

Compared to the main methods in related work, Synapse presents core differences and innovations:

  • Addressing Context Length (vs. RCI, MindAct, most ICL):

    • Prior methods: Directly feed raw states (often lengthy HTML) or rely on simple element filtering (e.g., MindAct's element-ranking model with k=50). This quickly exhausts the LLM's context window, limiting the number of exemplars.
    • Synapse's innovation (state abstraction): Explicitly uses LLMs (few-shot) or existing tools to create concise, task-relevant observations. This is a more principled approach to drastically reducing token count per state, allowing significantly more exemplars to fit within the context. This enables solving tasks previously impossible due to context limits (e.g., book-flight).
  • Trajectory Representation (vs. RCI, MindAct):

    • Prior methods: RCI uses high-level plans and grounds actions step-by-step; MindAct uses MCQs for single-step action prediction. Both query the LLM at each step, and neither provides a full interaction history, leading to error accumulation and poor performance in long-horizon tasks.
    • Synapse's innovation (trajectory-as-exemplar prompting): Prompts the LLM with complete successful trajectories (interleaved abstracted states and actions). This provides a rich, sequential context, improves multi-step decision-making, and implicitly encourages temporal abstraction (generating multiple actions before needing a new state), which reduces LLM queries, cost, and latency. This is a fundamental change in how ICL is applied for sequential decision-making.
  • Generalization (vs. RCI, AdaPlanner):

    • Prior methods: RCI and AdaPlanner rely on task-specific exemplars, requiring manual mapping or demonstrations for each task. They struggle to leverage similarities between tasks.
    • Synapse's innovation (exemplar memory): Introduces a vector database to store embedding representations of task metadata and their corresponding exemplar trajectories. This allows for similarity search to automatically retrieve relevant exemplars for novel tasks or tasks with shared characteristics. This mechanism enables broader generalization and reduces the need for extensive task-specific demonstrations.
  • Reliance on Self-Correction (vs. RCI, AdaPlanner):

    • Prior methods: RCI and AdaPlanner often incorporate self-correction mechanisms to recover from errors, indicating inherent weaknesses in their core prompting or decision-making.

    • Synapse's advantage: Achieves superior performance without relying on self-correction, suggesting a more robust and accurate primary decision-making process due to its state abstraction and TaE prompting.

      In essence, Synapse systematically tackles the three primary bottlenecks of LLM-based computer control by designing an ICL framework that is more context-efficient, sequentially intelligent, and broadly generalizable, moving beyond ad-hoc solutions like self-correction.

4. Methodology

4.1. Principles

The core idea behind Synapse is to overcome the limitations of Large Language Models (LLMs) in computer control tasks—specifically, restricted context length, inadequate trajectory representation, and poor generalization—by refining how in-context learning (ICL) is applied. The theoretical basis is that by providing LLMs with cleaner, more relevant state information, richer sequential demonstrations, and an intelligent retrieval mechanism for exemplars, their inherent reasoning and pattern-matching capabilities can be fully leveraged for complex, long-horizon tasks. The intuition is to make the LLM's input as informative and efficient as possible, mimicking how a human might learn by observing complete, relevant examples.

4.2. Core Methodology In-depth (Layer by Layer)

Synapse operates through a pipeline consisting of three main components: state abstraction, trajectory-as-exemplar (TaE) prompting, and exemplar memory. The general workflow begins with retrieving relevant exemplars from memory for a given task. Then, at each step, raw computer states are processed into clean observations via state abstraction, and finally, the LLM uses TaE prompting with these observations and retrieved exemplars to generate the next action.

4.2.1. State Abstraction

Problem: Raw computer states (e.g., full HTML documents of webpages) are often very long and contain much task-irrelevant information. This consumes a significant portion of the LLM's limited context length, reducing the number of exemplars that can be included and potentially distracting the LLM from generating correct actions.

Solution: State abstraction is introduced to extract only task-relevant observations from raw states, significantly reducing the token count and providing cleaner input to the LLM. Synapse employs two forms of few-shot state abstraction:

  1. Explicit Abstraction (for simpler states or manageable context):

    • Used when the LLM's context can handle multiple state-observation pairs.

    • The LLM is provided with few-shot exemplars in the format (state, observation).

    • Given a current raw state, the LLM is prompted to generate the corresponding clean observation.

      Example from Appendix C.1.1 (Terminal Task): The LLM is given several pairs of raw HTML states and their corresponding concise observations. Input to LLM for Explicit Abstraction:

    Role: User
    
    State:
    <div id="wrap" data-wob_ref="2" data-wob_eps="e0"> <div id="query">Use the terminal below to delete a file ending with the extension <span class="bold">.png</span></div> ... (rest of HTML) ...
    Observation:
    Use the terminal below to delete a file ending with the extension .png
    
    State:
    <div id="wrap" data-wob_ref="2" data-wob_eps="e0"> <div id="query">Use the terminal below to delete a file ending with the extension <span class="bold">.py</span></div> ... (rest of HTML) ...
    Observation:
    Use the terminal below to delete a file ending with the extension .py
    
    State:
    <div id="wrap" data-wob_ref="2" data-wob_eps="e0"> <div id="query">Use the terminal below to delete a file that has <span class="bold">no file extension</span>.</div> ... (rest of HTML) ...
    Observation:
    Use the terminal below to delete a file that has no file extension
    
    State:
    <div id="wrap" data-wob_ref="2" data-wob_eps="e0"> <div id="query">Use the terminal below to delete a file ending with the extension <span class="bold">.gpg</span></div> ... (rest of HTML) ...
    Observation:
    

    Output from LLM:

    Role: Assistant
    Use the terminal below to delete a file ending with the extension .gpg.
    

    This shows the LLM learning to extract the essential task description from the verbose HTML.

  2. Implicit Abstraction (for complex states like book-flight):

    • Used when explicit state-observation pairs are impractical due to the extreme complexity and length of raw states.

    • The LLM is prompted with few-shot exemplars in the format {task, code}. The code here is designed to parse the raw state and return a clean observation.

    • Given the current task, the LLM generates code that, when executed, takes the raw state as a parameter and returns the clean observation. If the generated code fails, the LLM performs zero-shot state abstraction as a fallback.

      Example from Appendix C.2.3 (Book-Flight Task): The LLM is given a task description and Python code that uses BeautifulSoup to parse the HTML and extract specific information. Input to LLM for Implicit Abstraction:

    Role: User
    Filter the following raw HTML state into a clean observation via code based on the task. State: 
    (a very long webpage containing flight information)
    Write code (between three backticks) to extract the price of the wanted ticket. First check if we want the shortest or cheapest flight. Parse the HTML using BeautifulSoup. Loop through all the available flights. Check the time duration (split 'h' and 'm') or the flight price based on whether we want the shortest or the cheapest. We have searched with airports and the date, so there is no need to check them. Keep the integer price of the ticket in a string variable 'obs' in this format: 'The price of the {preference} one-way flight is {price}.'. The string of the raw state is already in the variable 'state', so do not repeat the state in the code.

    Write code within three backticks to Book the cheapest one-way flight from: Hartford, CT to: TVC on 10/05/2016.

    ```python
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(state, 'html.parser')
    preference = 'cheapest'
    best_flight_price = float('inf')
    # Loop through all the available flights
    for flight in soup.find_all("div", class_="flight"):
        # the preference is cheapest, compare flight prices
        price = int(flight.find("button", class_="flight-price")["data-price"])
        if price < best_flight_price:
            best_flight_price = price
    price = best_flight_price
    obs = f'The price of the {preference} one-way flight is {price}.'
    ```

    (Similar example for "shortest" flight code)

    Now, write code within three backticks ``...`` to Book the cheapest one-way flight from: Anvik, AK to: MOT on 12/23/2016.
    Code:
    

    Output from LLM:

    Role: Assistant
    ```python
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(state, 'html.parser')
    preference = 'cheapest'
    best_flight_price = float('inf')
    # Loop through all the available flights
    for flight in soup.find_all("div", class_="flight"): # the preference is cheapest, compare flight prices
        price = int(flight.find("button", class_="flight-price")["data-price"])
        if price < best_flight_price:
            best_flight_price = price
    price = best_flight_price
    obs = f'The price of the {preference} one-way flight is {price}.'
    ```

The LLM generates Python code to parse the HTML based on the task, effectively abstracting the state.

In Mind2Web, the paper demonstrates that this principle can be applied to existing tools, e.g., by using a pretrained element-ranking model to filter HTML elements. While the SOTA method (MindAct) used k=50 (the top 50 elements), Synapse found better performance with a much smaller k (3 for previous observations, 5 for the current observation), even with reduced recall of target elements. This suggests that fewer, highly relevant elements are better than more, potentially distracting ones.
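As a concrete illustration of how such generated code might be executed with the fallback described above, here is a minimal sketch (an assumed implementation, not the paper's actual code; llm_zero_shot_abstract is a hypothetical helper that queries the LLM directly):

```python
def run_implicit_abstraction(generated_code: str, state: str, llm_zero_shot_abstract) -> str:
    """Execute LLM-generated parsing code; fall back to zero-shot abstraction on failure."""
    namespace = {"state": state}  # the raw HTML is exposed as `state`, per the prompt
    try:
        exec(generated_code, namespace)  # the code is expected to leave its result in `obs`
        return namespace["obs"]
    except Exception:
        # If the code raises or never defines `obs`, abstract the state zero-shot instead.
        return llm_zero_shot_abstract(state)
```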
    
4.2.2. Trajectory-as-Exemplar (TaE) Prompting

Problem: Existing ICL methods for computer control (e.g., RCI's high-level plans, MindAct's MCQs) often treat decision-making steps in isolation. They query the LLM for one action at a time and provide no complete trajectory information, leading to error accumulation across multi-round LLM queries and poor performance in long-horizon tasks or tasks with repeated actions.

Solution: TaE prompting uses complete, successful trajectories as few-shot exemplars to prompt the LLM for action generation. A trajectory is formatted as {task, observation, action, ..., observation, action}.

Process (a minimal sketch of this loop closes the subsection below):

  1. Exemplar Provision: The LLM is first fed several successful exemplar trajectories.
  2. Current Trajectory Context: The LLM is then prompted with the current task description and the sequence of previously executed clean observations and actions.
  3. Action Generation: The LLM generates the next action (or sequence of actions).
  4. Iterative Process: After the returned action(s) are executed, the new clean observation is appended to the current trajectory, and the process repeats until the task is complete or the maximum number of steps is reached.
    
The following figure (Figure 2 from the original paper) shows an illustrative comparison of TaE prompting with the prompting schemes of other computer agents:

Figure 2: Comparison of trajectory-as-exemplar prompting with other prompting schemes. The illustration is based on the terminal task in MiniWoB++, where the agent is asked to delete a file ending with a specific extension.
    
Benefits of TaE Prompting:

  • Consistent Format: The interleaved observations and actions provide a clear and interactive format suitable for grounding actions.
  • More Information: Providing a complete trajectory offers significantly more context for decision-making compared to isolated plans or MCQs.
  • Temporal Abstraction: It implicitly encourages the LLM to generate multiple actions consecutively (e.g., type('ls') followed by press('enter')) without requiring an intermediate state query. This reduces the number of LLM calls, leading to lower cost and latency.
  • Direct from Demonstrations: TaE exemplars can be directly converted from human demonstrations, simplifying their creation.
    
Example from Appendix C.1.2 (Terminal Task - Trajectory-as-Exemplar Prompting): The LLM is provided with successful exemplar trajectories before the current task.

Input to LLM for TaE Prompting (Current Trajectory):

    Role: User
    Task: Use the terminal below to delete a file ending with the extension .gpg
    Trajectory:
    Role: User
    Observation:
    Use the terminal below to delete a file ending with the extension .gpg
    Action:

Output from LLM:

    Role: Assistant
    agent.type('ls')
    agent.press('enter')

This demonstrates the LLM generating two actions (type('ls') and press('enter')) sequentially without an intermediate observation, illustrating temporal abstraction.
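As referenced in the process list above, the following is a minimal sketch of the TaE decision loop under assumed interfaces (env, llm, abstract_state, format_trajectory, and parse_actions are hypothetical stand-ins, not the paper's actual API):

```python
def tae_episode(task, exemplars, env, llm, abstract_state, max_steps=10):
    """Run one episode with trajectory-as-exemplar prompting (illustrative sketch)."""
    # Few-shot block: each exemplar is a complete {task, obs, action, ..., obs, action} trajectory.
    prefix = "\n\n".join(format_trajectory(e) for e in exemplars)
    trajectory = [f"Task: {task}"]
    for _ in range(max_steps):
        obs = abstract_state(env.raw_state())  # state abstraction on the raw HTML
        trajectory.append(f"Observation:\n{obs}")
        # One query may return several consecutive actions (temporal abstraction).
        actions = llm(prefix + "\n\n" + "\n".join(trajectory) + "\nAction:")
        trajectory.append(f"Action:\n{actions}")
        for action in parse_actions(actions):  # e.g., agent.type('ls'), agent.press('enter')
            env.execute(action)
        if env.done():
            return True
    return False
```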
    
4.2.3. Exemplar Memory

Problem: Previous computer agents relied on task-specific exemplars, requiring explicit pre-definition or hard-coding for each task. This approach fails to leverage task similarities and hinders generalization to novel tasks.

Solution: Exemplar memory is introduced as a mechanism to store and retrieve relevant exemplars dynamically, enabling generalization and adaptability.

Structure: The exemplar memory is defined as a tuple $\mathcal{D} = (K, V)$:

  • K: A fixed-sized array of embedding vectors representing task metadata (e.g., task descriptions, initial states).
  • V: The corresponding state abstraction prompts and exemplary trajectories associated with the metadata in K.

Process (a minimal sketch follows the Benefits list below):

  1. Memory Construction: All task metadata from available exemplars is encoded using an embedding model (e.g., text-embedding-ada-002). These embedding vectors and their corresponding exemplars are stored in a vector database (e.g., Faiss).
  2. Retrieval: For a given new task, its metadata (the query metadata q) is encoded into an embedding vector. This query embedding is then used to perform a similarity search over the keys K in the vector database.
  3. Exemplar Selection: The top-n most similar exemplars are retrieved. In MiniWoB++, if the retrieved exemplars belong to different tasks, the most common task among them is identified, and all exemplars associated with that task are retrieved. For Mind2Web, the website name, domain, and task description are encoded as metadata.

Formal Retrieval Process: The retrieval finds the top-n exemplars d in the memory $\mathcal{D}$ that are most similar to the query q. This can be formally expressed as:

$ \arg \operatorname{top-n}_{d \in \mathcal{D}} \operatorname{sim}(q, d) $

Where:

  • q: The query metadata (embedding vector of the new task).
  • d: An exemplar in the memory (embedding vector of stored task metadata).
  • sim(q, d): A similarity function (e.g., Euclidean distance or cosine similarity) that quantifies the similarity between the query embedding q and an exemplar embedding d. Lower Euclidean distance implies higher similarity.

Benefits:

  • Automatic Exemplar Selection: Eliminates the need for hard-coded task-to-exemplar mappings.
  • Generalization to Novel Tasks: By leveraging task similarity, the agent can effectively use exemplars from known tasks to tackle new, unseen tasks.
  • Adaptive Agent: Lays the foundation for a more general-purpose and adaptive computer agent.
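As referenced above, here is a minimal sketch of memory construction and retrieval, assuming Faiss and an embed function wrapping an embedding model such as text-embedding-ada-002 (the structure is illustrative, not the paper's actual implementation):

```python
import faiss
import numpy as np

class ExemplarMemory:
    """D = (K, V): task-metadata embeddings as keys, exemplar trajectories as values."""

    def __init__(self, embed, dim=1536):     # 1536 is the text-embedding-ada-002 dimension
        self.embed = embed                    # callable: str -> 1-D numpy array of length dim
        self.index = faiss.IndexFlatL2(dim)   # exact Euclidean-distance search over keys K
        self.values = []                      # V: exemplar trajectories / abstraction prompts

    def add(self, metadata: str, exemplar) -> None:
        key = np.asarray(self.embed(metadata), dtype=np.float32).reshape(1, -1)
        self.index.add(key)
        self.values.append(exemplar)

    def retrieve(self, query_metadata: str, n: int = 3):
        query = np.asarray(self.embed(query_metadata), dtype=np.float32).reshape(1, -1)
        distances, ids = self.index.search(query, n)  # lower distance = more similar
        return [self.values[i] for i in ids[0] if i != -1]
```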

Integration of Components: These three components work in concert. The exemplar memory selects relevant exemplar trajectories. State abstraction then ensures that the states within these exemplars and the current state are concisely represented. Finally, TaE prompting uses these abstracted trajectories to guide the LLM's multi-step decision-making.

5. Experimental Setup

5.1. Datasets

The evaluation of Synapse was conducted on two distinct benchmarks: MiniWoB++ and Mind2Web.

5.1.1. MiniWoB++

  • Description: MiniWoB++ (Shi et al., 2017; Liu et al., 2018) is a standard research task suite designed to integrate diverse web-based tasks that mirror human-computer interactions. It is a relatively simplified environment but includes challenging tasks like book-flight, click-checkboxes-soft, and use-autocomplete.
  • Scale: The evaluation used 64 tasks from the benchmark.
  • Demonstrations: Exemplars were provided for 48 of these tasks, with 16 tasks being unseen during the exemplar collection phase to test generalization. The average number of exemplars per task for Synapse was 3.45.
  • State Space: Raw HTML code of the webpages.
  • Action Space: Consists of five keyboard and mouse operations:
    • click-xpath(xpath: str): Clicks an HTML element identified by its xpath.
    • move-mouse(xpath: str): Moves the mouse cursor to an HTML element identified by its xpath.
    • type(characters: str): Types a string using the keyboard.
    • press(key_type: str): Presses a specific key (e.g., enter, space, arrowleft).
    • click-options(xpath: str): Clicks an option within an HTML element (typically a dropdown or select box).
  • Prompt Example (Action Space in System Prompt):
    Role: System
    You are a large language model trained to navigate the web. To accomplish the task, use methods in the following Agent class to generate actions until you need the new state to proceed.
    class Agent:
        def __init__(self, args):
            ...

        # Action: type a string via the keyboard
        def type(self, characters: str) -> None:
            ...

        # Action: click an HTML element with a valid xpath
        def click_xpath(self, xpath: str):
            ...

        # Action: press a key on the keyboard, including: enter, space, arrowleft,
        # arrowright, backspace, arrowup, arrowdown, command+a, command+c, command+v
        def press(self, key_type: str) -> None:
            ...

        # Action: click an option HTML element in a list with a valid xpath
        def click_option(self, xpath: str):
            ...

        # Action: move mouse cursor on an HTML element with a valid xpath
        def move_mouse(self, xpath: str):
            ...
    
  • State Abstraction in MiniWoB++:
    • For scenarios with multiple states in context (e.g., email-inbox), explicit abstraction with state-observation pairs is used. These pairs are collected alongside trajectories, with states from the environment and cleaned observations provided by humans (or inferred by LLMs).
    • For scenarios with complex states where explicit abstraction is not applicable (e.g., book-flight), implicit abstraction is used. This involves pairing task descriptions with state-parsing code. The task descriptions are environment-given, and the code is collected via zero-shot sampling from LLMs (GPT-4) and human-reviewed.

5.1.2. Mind2Web

  • Description: Mind2Web (Deng et al., 2023) is a realistic dataset containing human demonstrations of open-domain tasks from 137 real-world websites across 31 different domains (e.g., Airbnb, Twitter). Task descriptions provide high-level goals, encouraging agents to autonomously understand and perform tasks.
  • Scale: Over 2,000 tasks.
  • Dataset Splits:
    • Training Set: 1,009 tasks from 73 websites. Only this set is stored in the exemplar memory.
    • Cross-Task Test Set: 252 tasks from 69 websites, where similar tasks might have been seen in training. Evaluates generalizability over tasks from the same websites.
    • Cross-Website Test Set: 177 tasks from 10 unique websites, unseen in training but belonging to familiar domains. Evaluates generalizability to new websites within known domains.
    • Cross-Domain Test Set: 912 tasks from 73 websites, requiring adaptation to completely unseen domains. Evaluates generalizability to entirely novel domains.
  • Observation Space: HTML provided in the dataset. Observations are filtered by an element-ranking model (a simplified state abstraction of selecting top-k elements) before being passed to the LLM.
  • Action Space: Includes click, type, and select operations:
    • CLICK [id]: Clicks an HTML element by its id.
    • TYPE [id] [value]: Types a string value into an element by its id.
    • SELECT [id] [value]: Selects a value for an HTML element by its id.
  • State Abstraction in Mind2Web:
    • MindAct's default uses k=50 (the top 50 elements) from an element-ranking model.
    • Synapse uses k=3 for exemplars in memory and previous observations in the current history, and k=5 for the current observation. This significantly reduces context length compared to k=50.

5.1.3. LLM Configuration

  • MiniWoB++: gpt-3.5-turbo-0301 was queried, with 50 episodes run per task.
  • Mind2Web: gpt-3.5-turbo-16k-0613 was the default LLM.
  • Decoding: Temperature was set to 0 for greedy decoding, ensuring deterministic outputs.
  • Embedding Model: text-embedding-ada-002 was used for generating embeddings for exemplar memory.
  • Vector Database: Faiss (Johnson et al., 2019) was used for efficient storage and similarity search of embeddings.
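For illustration, the query configuration described above might look like the following under the OpenAI SDK of that period (a sketch; system_prompt and tae_prompt are assumed variables holding the prompts described in Section 4):

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo-0301",  # MiniWoB++ setting; Mind2Web used gpt-3.5-turbo-16k-0613
    temperature=0,               # greedy decoding for deterministic outputs
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": tae_prompt},
    ],
)
action_text = response["choices"][0]["message"]["content"]
```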

5.2. Evaluation Metrics

5.2.1. MiniWoB++

  • Metric: Success Rate (SR)
    • Conceptual Definition: The proportion of tasks that the agent successfully completes from start to finish according to the task's criteria. It is a binary measure: a task is either solved or not.
    • Mathematical Formula: $ SR = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks attempted}} $
    • Symbol Explanation:
      • SR: Success Rate.
      • Number of successfully completed tasks: The count of unique tasks where the agent reached the defined goal state.
      • Total number of tasks attempted: The total count of unique tasks the agent was evaluated on.

5.2.2. Mind2Web

  • Metrics: Element Selection Accuracy (Ele. Acc), Step Success Rate (Step SR), and Task Success Rate (SR).
    • Element Selection Accuracy (Ele. Acc)
      • Conceptual Definition: Measures how accurately the agent selects the correct interactive HTML element for a given action step. This is crucial for agents interacting with graphical user interfaces.
      • Mathematical Formula: $ Ele. Acc = \frac{\text{Number of correctly selected elements}}{\text{Total number of element selections attempted}} $
      • Symbol Explanation:
        • Ele. Acc: Element Selection Accuracy.
        • Number of correctly selected elements: The count of individual steps where the agent identified and selected the ground-truth target HTML element.
        • Total number of element selections attempted: The total count of individual steps where an HTML element was supposed to be selected.
    • Step Success Rate (Step SR)
      • Conceptual Definition: The proportion of individual action steps (e.g., a click, a type, a select) that are correctly executed by the agent within a task trajectory. This is a fine-grained measure of performance, assessing the agent's correctness at each immediate decision point.
      • Mathematical Formula: $ Step SR = \frac{\text{Number of correctly executed steps}}{\text{Total number of steps in all trajectories}} $
      • Symbol Explanation:
        • Step SR: Step Success Rate.
        • Number of correctly executed steps: The count of individual actions taken by the agent that match the ground-truth action for that step.
        • Total number of steps in all trajectories: The sum of all individual actions across all trajectories being evaluated.
    • Task Success Rate (SR)
      • Conceptual Definition: Same as for MiniWoB++, indicating the proportion of complete tasks successfully accomplished.
      • Mathematical Formula: $ SR = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks attempted}} $
      • Symbol Explanation:
        • SR: Task Success Rate.
        • Number of successfully completed tasks: The count of unique tasks where the agent reached the defined goal state.
        • Total number of tasks attempted: The total count of unique tasks the agent was evaluated on.
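Putting the three formulas together, here is a small sketch of how these metrics could be computed from step-level evaluation records (illustrative; the field names are hypothetical):

```python
def mind2web_metrics(trajectories):
    """Compute Ele. Acc, Step SR, and Task SR from per-step records.

    `trajectories` is a list of tasks; each task is a list of step dicts with
    hypothetical boolean fields 'element_correct' and 'step_correct'.
    For simplicity, every step here is assumed to involve an element selection.
    """
    steps = [step for traj in trajectories for step in traj]
    ele_acc = sum(s["element_correct"] for s in steps) / len(steps)
    step_sr = sum(s["step_correct"] for s in steps) / len(steps)
    # A task succeeds only if every one of its steps is correct.
    task_sr = sum(all(s["step_correct"] for s in traj) for traj in trajectories) / len(trajectories)
    return ele_acc, step_sr, task_sr
```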

5.3. Baselines

5.3.1. MiniWoB++ Baselines

The paper compares Synapse against various state-of-the-art approaches, categorized by their underlying methodology:

  • BC+RL Baselines: These methods combine Behavioral Cloning (BC) with Reinforcement Learning (RL), typically requiring vast amounts of data.
    • CC-Net (Humphreys et al., 2022): A strong baseline that achieved human-level performance but with 2.4 million demonstrations.
    • Pix2Act (Shaw et al., 2023): A concurrent work that uses tree search and BC on 1.3 million demonstrations.
  • Fine-tuning Baselines: These involve fine-tuning LLMs on large datasets of demonstrations.
    • WebGUM (Furuta et al., 2023): A multimodal LLM fine-tuned on 346,827 demonstrations.
    • WebN-T5 (Gur et al., 2023a): A T5 model fine-tuned on 12,000 demonstrations.
  • ICL Methods: These are LLM-based agents utilizing in-context learning, often incorporating self-correction.
    • RCI (Kim et al., 2023): The previous SOTA ICL method, employing recursive self-correction to solve 54 tasks with a 90.6% success rate, but relying on task-specific exemplars.
    • AdaPlanner (Sun et al., 2023): A concurrent work that uses environment feedback for self-correction, achieving 92.9% success in 53 tasks.
  • Human Scores: Sourced from Humphreys et al. (2022) to provide a benchmark for human-level performance.
  • Others (BC): Represents aggregated performance of earlier BC-based methods like (Gur et al., 2018; Jia et al., 2018; Liu et al., 2018; Shi et al., 2017).

5.3.2. Mind2Web Baselines

  • MindAct (Deng et al., 2023): The current SOTA ICL method for this benchmark. It uses an element-ranking model for HTML state filtering (top-50 elements) and MCQ-formatted prompts for action generation.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Synapse significantly outperforms existing state-of-the-art (SOTA) methods across both standard and real-world computer control benchmarks.

6.1.1. MiniWoB++ Performance

Synapse achieves a remarkable 99.2% average success rate across 64 tasks in MiniWoB++, establishing itself as the first ICL method to reach human-level performance. This is achieved using demonstrations from only 48 tasks, without relying on self-correction.

The following figure (Figure 3 from the original paper) shows the average success rates of different methods across MiniWoB++ tasks and a comparison of data efficiency:

The figure presents box plots of per-task success rates for each method (left) and success rate versus the number of exemplars per task (right), highlighting Synapse's superior performance with few exemplars.

Figure 3: Synapse is the first ICL method that achieves human-level performance in MiniWoB++. It outperforms previous self-correction methods, including RCI and AdaPlanner. A comprehensive task-wise evaluation is shown in Appendix A. Pix2Act and AdaPlanner are concurrent with our work. We excluded them from the right-side figure due to their overlap with CC-Net and RCI. The outlier tasks are determined with an interquartile range of 1.5.

Key observations from MiniWoB++ evaluation:

  • Outperformance across Categories: Synapse surpasses BC+RL (e.g., CC-Net, Pix2Act), fine-tuning (e.g., WebGUM, WebN-T5), and other ICL SOTA methods (RCI, AdaPlanner).
  • No Self-Correction: Unlike RCI and AdaPlanner, which incorporate self-correction to compensate for limitations, Synapse achieves higher performance without this mechanism, indicating a more robust inherent decision-making process.
  • Solving Challenging Tasks: Synapse successfully solves tasks previously considered intractable for ICL methods, such as book-flight (requiring long-horizon decision-making under complex states) and click-pie.
  • Generalization to Unseen Tasks: Synapse solves 64 tasks using exemplars from only 48 tasks, demonstrating strong generalization capabilities, particularly for the 16 unseen tasks where it achieved an average success rate of almost 100%.
  • Specific Task Improvements (Figure 4 Analysis):
    • Text Processing Tasks: Synapse significantly outperforms BC+RL methods (Figure 4a) on tasks like terminal and text-transform, highlighting its LLM-driven reasoning strength.

    • Detailed State Understanding: Surpasses other ICL methods (Figure 4b) on tasks like click-collapsible-2, click-tab-2-hard, and count-shape, attributing to its effective state abstraction.

    • Long-Horizon/Repeated Action Tasks: Shows marked improvement over other ICL methods on use-autocomplete, use-spinner, where error accumulation is a major issue for single-step prompting.

    • Overall Superiority: Consistently achieves better performance than ICL (Figure 4b) and fine-tuning SOTA (Figure 4c) across all comparable tasks.

      The following figure (Figure 4 from the original paper) shows task-wise success rate comparisons between Synapse and three SOTA methods (CC-Net, RCI, WebGUM), with bars ordered by ascending success rate differences and tasks with differences less than 0.05 excluded:

      Figure 4: Task-wise comparisons between Synapse and various SOTA methods. For clarity, the bars in each figure are arranged in ascending order based on the success rate differences. Tasks not reported by other methods and those with differences less than 0.05 are not included.

      Failure Cases: The paper notes that most observed failures stem from incorrect reasoning by LLMs, such as miscounting items in count-shape or misrecognizing characters (e.g., jrpf as jrfp in text-transform).

6.1.2. Mind2Web Performance

Synapse effectively validates its three core components on the Mind2Web benchmark, showcasing significant improvements over the SOTA ICL method, MindAct.

The following table (Table 1 from the original paper) shows the Mind2Web results and ablations with CodeLlama-7B (top) and GPT-3.5 (bottom). Each cell reports Ele. Acc / Step SR / SR:

Method | Cross-Task | Cross-Website | Cross-Domain
MindAct (CodeLlama-7B) | 11.2 / 7.7 / 0.4 | 12.4 / 9.0 / 0.6 | 13.8 / 9.9 / 0.2
SYNAPSE w/ state abstraction | 21.4 / 15.5 / 0.8 | 19.5 / 13.5 / 0.6 | 20.0 / 15.6 / 1.2
SYNAPSE w/ state abstraction + TaE | 27.0 / 24.5 / 1.2 | 21.4 / 17.9 / 0.0 | 22.1 / 19.9 / 0.7
SYNAPSE w/ state abstraction + TaE + memory | 29.4 / 26.4 / 3.2 | 22.9 / 18.9 / 0.6 | 22.6 / 19.7 / 0.3
MindAct (GPT-3.5) | 20.3 / 17.4 / 0.8 | 19.3 / 16.2 / 0.6 | 21.6 / 18.6 / 1.0
SYNAPSE w/ state abstraction | 29.8 / 25.2 / 2.0 | 26.1 / 19.6 / 0.6 | 28.0 / 24.3 / 1.8
SYNAPSE w/ state abstraction + TaE | 32.8 / 29.2 / 2.0 | 28.0 / 22.7 / 0.6 | 29.0 / 26.2 / 1.8
SYNAPSE w/ state abstraction + TaE + memory | 34.0 / 30.6 / 2.4 | 29.1 / 24.2 / 0.6 | 29.6 / 26.4 / 1.5

Key observations from Mind2Web evaluation:

  • Incremental Improvements: Incrementally adding Synapse's components (state abstraction, TaE prompting, exemplar memory) to a direct generation baseline consistently improves performance.
  • GPT-3.5 Performance:
    • Synapse with state abstraction alone (using direct generation with fewer top-ranked elements) shows a 32% relative improvement in average Step SR over MindAct's MCQ-formatted prompting (25.2% vs. 17.4% with GPT-3.5).
    • Adding TaE prompting further boosts average Step SR to 29.2% (a 50% relative improvement over MindAct).
    • Incorporating exemplar memory leads to an average Step SR of 30.6% (a 56% relative improvement over MindAct).
  • CodeLlama-7B Performance: Synapse demonstrates a 2.5x average Step SR compared to MindAct when using CodeLlama-7B as the underlying LLM, proving its effectiveness across different models.
  • Element Selection Accuracy: Synapse consistently achieves higher Ele. Acc across all generalization levels compared to MindAct, indicating better element identification.
  • Cross-Task Generalization: Shows the most significant improvements, particularly in Task SR (e.g., from 0.8% for MindAct to 3.2% for full Synapse with CodeLlama-7B, and 0.8% to 2.4% with GPT-3.5).
  • Cross-Domain Challenge: While Step SR improvements are seen across all levels, Task SR for Cross-Domain remains low for both Synapse and MindAct, highlighting the inherent difficulty of generalizing to entirely unseen domains.

6.2. Ablation Studies

Ablation studies were conducted to isolate and evaluate the contribution of each of Synapse's three core components.

6.2.1. Ablating State Abstraction

  • Effectiveness: State abstraction consistently improves performance, particularly for tasks with complex states or vague descriptions.

  • MiniWoB++:

    • Complex States: Enables solving tasks like book-flight, which LLMs without state abstraction struggle with due to limited context.
    • More Exemplars: Improves performance on tasks like email-inbox-nl-turk by allowing more exemplars to fit in context, raising its success rate from 52% to 100%.
  • Mind2Web: Direct generation with a reduced number of elements (top five) in clean observations (a simplified form of state abstraction) improved step success rate by 32% compared to MindAct's MCQ with top-50 elements. This is notable because reducing k from 50 to 5 for element selection causes a recall drop from 86% to 53%, yet performance still increased, suggesting the quality and conciseness of observations are more critical than raw quantity.

    The following figure (Figure 5 from the original paper) shows the impact of state abstraction on the success rates of various MiniWoB++ tasks:

    Figure 5: State abstraction enables solving MiniWoB++ tasks with complex states (e.g., book-flight). It also improves performance by providing more exemplars (e.g., email-inbox).

6.2.2. Ablating Trajectory-as-Exemplar Prompting

  • Effectiveness: TaE prompting significantly benefits tasks with long horizons and repeated actions.
  • MiniWoB++: Improves tasks like guess-number, use-spinner, and use-autocomplete, which are challenging for SOTA ICL methods due to error accumulation. RCI's success rates were 20%, 88%, and 58% for these, respectively, while Synapse achieved 100%, 100%, and 98%. This improvement is attributed to the comprehensive history and implicit temporal action abstraction provided by TaE prompting.
  • Mind2Web: TaE prompting further improved the average step success rate by 19% compared to Synapse with state abstraction only (refer to Table 1), and outperformed MindAct by 50% across the three generalization levels (a sketch of TaE prompt assembly follows).
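
To make "complete trajectories as exemplars" concrete in prompt form, here is a minimal sketch of assembling a TaE prompt from retrieved demonstrations; the template, delimiters, and function name are our assumptions, not the paper's verbatim format.

```python
def build_tae_prompt(exemplars: list[dict], task: str, current_obs: str) -> str:
    """Concatenate complete (observation, action) trajectories as exemplars,
    then append the target task and its current abstracted observation.
    Each exemplar: {"task": str, "steps": [(obs, action), ...]}."""
    parts = []
    for demo in exemplars:
        parts.append(f"Task: {demo['task']}")
        for t, (obs, action) in enumerate(demo["steps"]):
            parts.append(f"Observation {t}: {obs}")
            parts.append(f"Action {t}: {action}")
        parts.append("")  # blank line between trajectories
    parts.append(f"Task: {task}")
    parts.append(f"Observation 0: {current_obs}")
    parts.append("Action 0:")  # the LLM completes the next action(s)
    return "\n".join(parts)
```

Because the exemplar trajectories expose several consecutive actions between observations, the LLM can emit multi-action sequences in a single completion, which is the temporal abstraction credited above for reduced error accumulation.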

6.2.3. Ablating Exemplar Memory

  • Effectiveness: Exemplar memory is crucial for generalization, especially for similar computer control tasks.

  • MiniWoB++:

    • Generalization to Unseen Tasks: Allows Synapse to solve 64 tasks using demonstrations from only 48 tasks. The average success rate for the 16 unseen tasks was almost 100%.
    • Leveraging Task Similarity: Automatically finds relevant exemplars, enabling generalization from tasks like email-inbox-nl-turk to email-inbox-forward-nl-turk or adapting to different layouts (e.g., multi-layouts to multi-orderings).
    • Failure Case: Only a 2% failure rate in click-tab-2-hard was observed due to exemplar mismatches, indicating the overall robustness of the retrieval mechanism.
  • Mind2Web: Memory further improved Synapse (with state abstraction and TaE prompting) by 6% in Step SR for cross-task and cross-website generalization (Table 1).

  • Cross-Domain Limitation: The improvement from memory was marginal for cross-domain generalization. The paper attributes this to LLMs being influenced by exemplars from unrelated domains and the fact that cross-domain tasks are inherently less similar to the training data.

    The following table (Table 2 from the original paper) shows the average retrieval distance and the Step SR improvement from memory (Δ) in Mind2Web:

| Test set | Distance | Δ GPT-3.5 | Δ CodeLlama |
| --- | --- | --- | --- |
| Cross-Task | 17.0 | 1.4 | 1.9 |
| Cross-Website | 24.3 | 1.5 | 1.0 |
| Cross-Domain | 32.9 | 0.2 | -0.2 |
  • Distance and Improvement Correlation: Table 2 illustrates that a greater average distance (lower similarity) between retrieved exemplars and the target task corresponds to a smaller Step SR improvement from memory. Cross-Domain tasks have the highest distance (32.9), explaining the minimal to negative impact of memory.
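
The retrieval behavior behind Table 2 can be sketched as nearest-neighbor search over exemplar embeddings. The sketch below is illustrative only: `embed` is a hypothetical embedding function, the memory entries use assumed field names, and plain NumPy Euclidean search stands in for the paper's vector database.

```python
import numpy as np

def retrieve_exemplars(query_metadata: str, memory: list[dict], embed, top_n: int = 3):
    """Embed the query task metadata and return the top_n most similar
    stored exemplars plus the mean retrieval distance (cf. Table 2).
    Each memory entry: {"embedding": array-like, "trajectory": ...}."""
    q = np.asarray(embed(query_metadata))
    keys = np.stack([np.asarray(e["embedding"]) for e in memory])
    dists = np.linalg.norm(keys - q, axis=1)  # smaller = more similar
    idx = np.argsort(dists)[:top_n]
    return [memory[i]["trajectory"] for i in idx], float(dists[idx].mean())
```

A larger mean distance signals that even the closest stored exemplars are only loosely related to the target task, matching the observation that memory helps least (and can slightly hurt) on Cross-Domain tasks.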

6.3. Data Presentation (Tables)

The following table (Table 3 from the original paper) shows the per-task mean success rate comparison between Synapse, humans, and baselines:

Task Ours Human RCI AdaPlanner Pix2Act CC-Net WebGUM WebN-T5 Others(BC)
bisect-angle n/a 0.92 n/a n/a 0.96 0.32 0.97 0.29 n/a n/a 0.80
book-flight 1.00 0.87 n/a n/a n/a n/a 0.87 0.00 0.98 0.00 1.00
chase-circle n/a 0.82 n/a n/a n/a n/a 0.93 0.80 n/a n/a 1.00
choose-date 1.00 0.97 n/a n/a 0.79 0.06 0.97 0.12 0.13 0.00 1.00
choose-date-easy n/a 0.99 n/a n/a n/a n/a 0.99 0.42 1.00 0.03 n/a
choose-date-medium n/a 0.98 n/a n/a n/a n/a 0.99 0.26 0.60 0.00 n/a
choose-list 1.00 0.98 1.00 1.00 n/a n/a 0.99 0.19 0.24 0.26 0.26
circle-center n/a 0.96 n/a n/a 0.96 0.52 0.97 0.36 n/a n/a 0.98
click-button 1.00 0.98 1.00 1.00 0.99 0.32 1.00 0.78 1.00 1.00 1.00
click-button-sequence 1.00 0.94 1.00 1.00 0.99 1.00 1.00 0.47 1.00 1.00 1.00
click-checkboxes-large 1.00 0.87 0.94 1.00 0.99 1.00 0.71 0.00 0.99 0.22 0.84
click-checkboxes-soft 1.00 0.73 0.72 0.80 0.61 0.91 0.95 0.04 0.98 0.54 0.94
click-checkboxes-transfer 1.00 0.98 1.00 0.98 1.00 0.76 0.99 0.36 0.99 0.63 0.64
click-collapsible 1.00 0.99 1.00 1.00 0.94 0.80 1.00 0.81 0.98 0.00 1.00
click-collapsible-2 1.00 0.97 0.62 0.84 0.97 0.31 0.98 0.17 0.95 0.00 0.99
click-color 1.00 0.97 1.00 1.00 0.99 0.88 1.00 0.82 0.34 0.27 1.00
click-dialog 1.00 1.00 1.00 1.00 1.00 0.12 1.00 0.95 1.00 1.00 1.00
click-dialog-2 1.00 0.99 1.00 1.00 1.00 0.73 1.00 0.88 0.43 0.24 1.00
click-link 1.00 0.99 1.00 0.98 0.98 0.86 0.99 0.59 1.00 1.00 1.00
click-menu 1.00 0.97 1.00 0.78 n/a n/a 0.94 0.22 0.37 0.37 0.13
click-menu-2 n/a 0.98 n/a n/a n/a n/a 0.83 0.52 n/a n/a 0.16
click-option 1.00 0.99 1.00 1.00 1.00 0.00 0.99 0.21 1.00 0.87 1.00
click-pie 1.00 0.98 n/a n/a 0.99 0.81 0.97 0.15 0.99 0.51 1.00
click-scroll-list 1.00 0.91 1.00 1.00 n/a n/a 0.60 0.01 0.00 0.00 0.07
click-shades 1.00 0.91 1.00 1.00 0.99 0.76 1.00 0.04 0.00 0.00 0.99
click-shape 0.98 0.88 0.98 0.75 0.94 0.19 0.95 0.11 0.72 0.53 0.64
click-tab 1.00 0.99 1.00 1.00 1.00 0.54 1.00 0.95 1.00 0.74 1.00
click-tab-2-easy n/a 0.99 n/a n/a 0.99 0.77 0.99 0.61 n/a n/a n/a
click-tab-2-hard 0.98 0.96 0.76 0.78 0.97 0.00 0.98 0.19 0.95 0.12 n/a
click-tab-2-medium n/a 0.97 n/a n/a 1.00 0.07 0.99 0.54 n/a n/a n/a
click-test 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
click-test-2 1.00 0.99 1.00 1.00 1.00 1.00 1.00 0.95 1.00 1.00 1.00
click-test-transfer n/a 0.99 n/a n/a 1.00 1.00 1.00 0.94 n/a n/a n/a
click-widget 1.00 0.83 0.98 1.00 1.00 0.87 1.00 0.56 1.00 1.00 1.00
copy-paste 1.00 0.94 n/a n/a n/a n/a 0.79 0.04 n/a n/a 0.00
count-shape 0.90 0.82 0.40 0.50 0.70 0.00 0.85 0.21 0.68 0.41 0.76
count-sides n/a 0.98 n/a n/a 1.00 0.38 1.00 0.74 n/a n/a 0.30
drag-box n/a 0.99 n/a n/a 0.99 1.00 1.00 0.61 n/a n/a 0.31
drag-cube n/a 0.99 n/a n/a n/a n/a 0.79 0.23 n/a n/a 0.18
drag-item n/a 0.98 n/a n/a 1.00 0.85 1.00 0.61 n/a n/a n/a
drag-items n/a 0.93 n/a n/a 1.00 0.64 0.99 0.13 n/a n/a 0.41
drag-shapes n/a 0.96 n/a n/a 0.98 0.96 0.99 0.26 n/a n/a 0.92
drag-sort-numbers n/a 0.92 n/a n/a 0.95 0.08 0.97 0.11 n/a n/a 0.66
email-inbox-delete n/a 0.99 n/a n/a 1.00 0.99 1.00 0.22 n/a n/a 1.00
email-inbox-forward n/a 0.96 n/a n/a n/a n/a 1.00 0.01 n/a n/a n/a
email-inbox-forward-nl 1.00 0.91 1.00 1.00 n/a n/a 1.00 0.00 1.00 0.60 n/a
email-inbox-forward-nl-turk 1.00 0.88 0.94 1.00 n/a n/a 1.00 0.00 1.00 0.33 n/a
email-inbox-nl-turk 1.00 0.93 0.98 0.90 n/a n/a 1.00 0.05 0.98 0.23 0.93
email-inbox-noscroll n/a 0.96 n/a n/a n/a n/a 1.00 0.13 n/a n/a n/a
email-inbox-reply n/a 0.91 n/a n/a n/a n/a 1.00 0.00 n/a n/a n/a
email-inbox-star-reply n/a 0.95 n/a n/a n/a n/a 1.00 0.11 n/a n/a n/a
enter-password 1.00 0.96 1.00 0.98 n/a n/a 1.00 0.02 1.00 0.97 1.00
enter-text 1.00 0.98 1.00 0.98 n/a n/a 1.00 0.35 1.00 0.89 1.00
enter-text-2 n/a 0.91 n/a n/a 0.97 1.00 0.98 0.04 n/a n/a 0.00
enter-text-dynamic 1.00 0.97 1.00 0.96 n/a n/a 1.00 0.39 1.00 0.98 1.00
enter-time 1.00 0.98 1.00 0.96 1.00 0.78 0.97 0.04 0.00 0.00 0.90
find-midpoint n/a 0.94 n/a n/a 0.96 0.74 0.97 0.35 n/a n/a 0.31
focus-text 1.00 1.00 1.00 1.00 n/a n/a 1.00 0.99 1.00 1.00 1.00
focus-text-2 1.00 0.99 1.00 0.94 n/a n/a 1.00 0.96 1.00 1.00 1.00
grid-coordinate 1.00 0.87 1.00 1.00 0.92 0.97 1.00 0.66 1.00 0.49 1.00
guess-number 1.00 0.99 0.20 0.88 n/a n/a 1.00 0.21 0.11 0.00 0.20
highlight-text n/a 0.97 n/a n/a n/a n/a 1.00 0.51 n/a n/a 0.90
highlight-text-2 n/a 0.97 n/a n/a n/a n/a 1.00 0.40 n/a n/a 0.13
identify-shape 1.00 0.98 0.76 0.96 1.00 0.94 1.00 0.68 1.00 0.88 1.00
login-user 1.00 0.96 1.00 1.00 n/a n/a 1.00 0.00 1.00 0.82 1.00
login-user-popup 1.00 0.94 0.68 0.98 n/a n/a 1.00 0.02 0.99 0.72 n/a
multi-layouts 0.98 0.95 0.72 0.84 n/a n/a 1.00 0.00 1.00 0.83 1.00
multi-orderings 1.00 0.96 1.00 1.00 n/a n/a 1.00 0.00 1.00 0.88 1.00
navigate-tree 0.98 0.98 0.86 0.82 0.99 0.07 0.99 0.32 1.00 0.91 1.00
number-checkboxes n/a 0.96 n/a n/a 0.84 0.26 0.99 0.00 n/a n/a 0.16
read-table 1.00 0.97 n/a n/a n/a n/a 0.97 0.01 n/a n/a 0.00
read-table-2 n/a 0.95 n/a n/a n/a n/a 0.94 0.00 n/a n/a 0.00
resize-textarea n/a 0.94 n/a n/a 0.99 1.00 1.00 0.27 n/a n/a 0.11
right-angle n/a 0.87 n/a n/a 0.97 1.00 0.98 0.26 n/a n/a 0.38
scroll-text n/a 0.97 n/a n/a n/a n/a 0.96 0.04 n/a n/a 0.00
scroll-text-2 n/a 0.97 n/a n/a n/a n/a 1.00 0.88 n/a n/a 0.96
search-engine 1.00 0.97 1.00 1.00 n/a n/a 1.00 0.15 0.96 0.34 1.00
simon-says n/a 0.62 n/a n/a n/a n/a 0.00 0.02 n/a n/a 0.28
simple-algebra 1.00 0.86 1.00 0.82 1.00 0.99 0.75 0.03 n/a n/a 0.04
simple-arithmetic 1.00 0.96 n/a n/a 1.00 0.67 0.86 0.38 n/a n/a 0.07
social-media 1.00 0.96 0.98 0.82 n/a n/a 0.90 0.03 1.00 0.21 1.00
social-media-all 1.00 0.89 1.00 1.00 n/a n/a 0.75 0.00 0.31 0.00 1.00
social-media-some 1.00 0.91 0.90 0.90 n/a n/a 0.85 0.01 0.68 0.02 0.42
terminal 1.00 0.88 1.00 0.98 n/a n/a 0.01 0.00 n/a n/a 0.00
text-editor n/a 0.88 n/a n/a n/a n/a 0.98 0.11 n/a n/a 0.01
text-transform 0.98 0.86 0.80 n/a 0.92 0.91 0.60 0.19 n/a n/a 0.00
tic-tac-toe 0.70 0.71 0.56 0.48 0.83 0.76 0.83 0.32 0.56 0.48 0.47
unicode-test 1.00 0.99 n/a n/a 1.00 0.64 1.00 0.86 n/a n/a n/a
use-autocomplete 0.98 0.98 0.58 0.88 0.99 0.95 1.00 0.07 0.98 0.22 0.98
use-colorwheel n/a 0.90 n/a n/a 0.97 0.98 0.98 0.68 n/a n/a 1.00
use-colorwheel-2 n/a 0.94 n/a n/a 0.95 1.00 0.95 0.38 n/a n/a 1.00
use-slider 0.98 0.98 n/a n/a 0.92 0.69 0.91 0.18 n/a n/a 0.51
use-slider-2 n/a 0.97 n/a n/a 1.00 0.09 0.95 0.03 n/a n/a 0.15
use-spinner 1.00 0.98 0.88 0.90 n/a n/a 1.00 0.47 0.11 0.07 0.17
visual-addition n/a 0.97 n/a n/a 1.00 0.68 0.99 0.36 n/a n/a 0.01
Average 0.992 0.935 0.906 0.929 0.962 0.665 0.935 0.305 0.802 0.484 0.646
# of solved tasks 64 104 54 53 59 59 104 104 56 56 88

This comprehensive table further underscores Synapse's dominant performance across a wide range of MiniWoB++ tasks, often matching or exceeding human performance and significantly outperforming all baselines.

6.4. Parameter Analysis (Mind2Web Element Abstraction)

The paper also includes an analysis of how the number of elements after state abstraction impacts performance in Mind2Web.

The following figure (Figure 7 from the original paper) shows the impact of different numbers of elements k after state abstraction on recall and step success rate in Mind2Web:

Figure 7: Ablations of the number of elements after state abstraction in Mind2Web. The backbone language model here is CodeLlama-Instruct-7B.

  • Trade-off between Recall and Step SR: Figure 7 illustrates an inverse relationship between the recall of target elements and Step Success Rate. While a higher k (more elements) increases recall (meaning the target element is more likely to be present in the abstracted state), it generally leads to a decrease in Step Success Rate.
  • Optimal Conciseness: This finding supports Synapse's design choice of small k values (3 or 5) for state abstraction in Mind2Web, suggesting that a concise, task-relevant observation (even with somewhat lower recall) is more useful for the LLM's decision-making than a verbose one containing many potentially distracting elements. The LLM performs better when fed fewer, higher-quality tokens (a sketch of the recall@k measurement follows).
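
The recall side of this trade-off is straightforward to measure. The sketch below computes recall@k over ranked candidate lists; this is the standard metric, with data structures and field names of our choosing.

```python
def recall_at_k(ranked_ids: list[str], target_id: str, k: int) -> bool:
    """True if the ground-truth element survives top-k state abstraction."""
    return target_id in ranked_ids[:k]

def average_recall(steps: list[dict], k: int) -> float:
    """Fraction of steps whose target element is kept after top-k filtering,
    e.g., roughly 0.86 at k=50 vs. 0.53 at k=5 per the Mind2Web numbers above.
    Each step: {"ranked": [element ids in ranked order], "target": id}."""
    hits = sum(recall_at_k(s["ranked"], s["target"], k) for s in steps)
    return hits / len(steps)
```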

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Synapse, a novel computer agent that significantly advances the state-of-the-art in LLM-based computer control. Synapse effectively addresses three major challenges: the limited context length of LLMs, the inadequacy of traditional exemplar structures for multi-step decision-making, and the poor generalization capabilities to novel tasks. It achieves this through three integrated components:

  1. State Abstraction: By converting verbose raw states into concise, task-relevant observations, it maximizes the utility of the LLM's context window, enabling more exemplars and reducing LLM distraction.

  2. Trajectory-as-Exemplar (TaE) Prompting: This innovative prompting scheme uses complete sequences of abstracted states and actions as exemplars. This provides rich sequential context, improves multi-step decision-making, and promotes temporal abstraction, reducing LLM query frequency, cost, and latency.

  3. Exemplar Memory: By storing embeddings of exemplars and retrieving them via similarity search based on task metadata, Synapse enables robust generalization to novel tasks and leverages task similarity more effectively than previous hard-coded approaches.

    Empirical evaluations showcase Synapse's superior performance: it achieved a 99.2% average success rate on MiniWoB++ (a 10% relative improvement over SOTA ICL methods) without self-correction, notably solving the challenging book-flight task. On Mind2Web, it demonstrated a 56% relative improvement in average step success rate over the previous SOTA ICL method. These results confirm Synapse's effectiveness in both controlled and real-world web navigation environments.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • High Inference Latency: A major concern when using LLMs is the inference latency. The authors suggest that their prompting scheme could be used to distill a more responsive, task-specific agent from existing LLMs, potentially by training smaller models on the LLM's outputs.
  • Dependence on Exemplar Quality: Synapse's performance is tied to the quality of the provided exemplars. Future work could explore instruction tuning (Wei et al., 2021; Chung et al., 2022) to train zero-shot computer agents based on their prompting techniques, reducing the reliance on specific demonstrations.
  • Memory Module Refinement: The memory structure and retrieval process could be further optimized for better generalization. This might involve more sophisticated embedding models, similarity metrics, or retrieval strategies.
  • Human Intervention and Customization: The memory module could be seamlessly combined with human intervention to allow for user customization and task adaptation, opening avenues for interactive agent development.
  • Compositional Generalization: Combining Synapse's methods with compositional generalization methods (Zhou et al., 2022; Wang et al., 2023a) could enable solving even more complicated tasks by breaking them down into simpler, reusable components.
  • Multimodal and Video Understanding: While Synapse currently operates on text (HTML), integrating multi-modal and video understanding capabilities (e.g., for pixel-based Android control as in Li et al., 2020; Toyama et al., 2021) could expand its applicability to more challenging, visually-driven tasks.

7.3. Personal Insights & Critique

Synapse presents a compelling and logically sound framework for advancing LLM-based computer control. Its strength lies in its systematic approach to addressing known LLM limitations rather than relying on brute-force solutions like self-correction.

  • Elegance of State Abstraction: The idea of dynamically abstracting states via few-shot LLM learning is particularly elegant. It leverages the LLM's understanding to curate its own input, a self-improving loop that maximizes context efficiency. The finding that fewer, more relevant tokens (small k in Mind2Web) are better than more, possibly distracting ones highlights the importance of input quality for LLMs.
  • Power of Trajectory-as-Exemplar Prompting: This is a crucial shift from isolated action generation to sequential decision-making. By providing full trajectories, the LLM can learn temporal dependencies and abstract behaviors, which is a more natural way for humans to learn complex procedures. The resulting temporal abstraction (generating multiple actions without immediate state feedback) is a significant practical benefit in terms of cost and latency.
  • Intelligent Generalization with Memory: The exemplar memory system provides a robust mechanism for generalization, moving beyond task-specific prompting. This is a vital step towards truly general-purpose agents. The ablation study showing reduced benefits for cross-domain generalization is insightful, pointing to the inherent difficulty of transferring knowledge across vastly different contexts and perhaps suggesting directions for domain adaptation techniques within the memory retrieval.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Robustness of State Abstraction: While effective, LLM-based state abstraction could be sensitive to minor state variations or adversarial inputs, so ensuring its robustness across highly diverse and dynamic web environments is critical. In particular, how reliably can the LLM generate correct parsing code for implicit abstraction when faced with entirely novel website layouts?

  • Scalability of Exemplar Memory: As the number of tasks and exemplars grows, similarity search in the vector database should remain efficient, but retrieval quality becomes paramount; the choice of task-metadata representation and embedding model is crucial. Could the exemplar memory itself become a bottleneck for long-horizon tasks if too many exemplars are retrieved, or if the LLM context remains limited?

  • Interpretability of LLM Decisions: Like many LLM-based systems, Synapse's decision-making process, especially the specific reasoning within the LLM that leads to an action or state abstraction, remains a black box. Understanding why an LLM makes a mistake could guide better exemplar design or prompt engineering.

  • Dependence on GPT-3.5: While CodeLlama-7B results are promising, the headline improvements are with GPT-3.5. Open-source LLMs are rapidly improving, and verifying Synapse's efficacy with the latest open-source models (e.g., Llama 3) would be valuable.

  • Cold Start Problem: For a truly novel task in a novel domain with no similar exemplars in memory, how well does Synapse perform zero-shot or few-shot with generic exemplars? The cross-domain results suggest this is still a hard problem.

    Overall, Synapse provides a strong foundation and a clear roadmap for building more intelligent and adaptive LLM-powered agents for computer control, marking a significant step towards reducing human effort in routine digital tasks. Its principles could be transferred to other sequential decision-making domains where LLMs interact with complex environments, such as robotics or software development.
