
Agent Workflow Memory

Published: 09/12/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

AWM induces reusable task workflows to improve LM-based agents' performance on complex, long-horizon web navigation tasks, boosting success rates by up to 51% and reducing steps on major benchmarks with strong generalization.

Abstract

Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.


In-depth Reading


1. Bibliographic Information

1.1. Title

Agent Workflow Memory

1.2. Authors

Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig. Their affiliations are Carnegie Mellon University and Massachusetts Institute of Technology, indicating a strong academic background in computer science, likely with a focus on natural language processing, machine learning, and AI agents.

1.3. Journal/Conference

This paper was published on arXiv, which is a preprint server. As such, it was not formally peer-reviewed by a journal or conference at the time of its publication (2024-09-11). However, arXiv is a highly influential platform in fields like AI and machine learning, used for rapid dissemination of research findings before or during the peer-review process. Many groundbreaking papers first appear on arXiv before being formally published in top-tier conferences (e.g., NeurIPS, ICML, ICLR) or journals.

1.4. Publication Year

2024

1.5. Abstract

This paper introduces Agent Workflow Memory (AWM), a novel method designed to enhance the performance of language model (LM)-based agents, particularly in solving complex, long-horizon real-world tasks such as web navigation. AWM addresses the current limitations of LM agents, which often struggle with adapting to diverse task contexts and environments, by enabling them to learn and reuse common task routines, referred to as workflows, from past experiences. The method can operate in both offline and online scenarios: offline AWM induces workflows from pre-existing training examples, while online AWM learns dynamically from self-generated predictions during test queries.

The authors evaluate AWM on two prominent web navigation benchmarks, Mind2Web and WebArena, which together encompass over 1000 tasks across 200+ domains. AWM significantly improves baseline results, achieving a 24.6% relative success rate increase on Mind2Web and a 51.1% relative increase on WebArena. Additionally, it reduces the number of steps required to successfully complete tasks on WebArena. A key finding is online AWM's robust generalization across different tasks, websites, and domains, demonstrating absolute performance gains of 8.9 to 14.0 points over baselines, especially as the distribution gap between training and testing data widens.

The paper is available as a preprint on arXiv: https://arxiv.org/abs/2409.07429

https://arxiv.org/pdf/2409.07429v1.pdf

2. Executive Summary

2.1. Background & Motivation

The field of language model (LM)-based agents is rapidly advancing, enabling them to tackle increasingly complex digital tasks, such as web navigation and mobile app operation. However, current methods face significant challenges, particularly with long-horizon tasks that involve intricate sequences of actions (complex action trajectories). A major limitation is their lack of robustness and adaptability to changes in task contexts or environments. Existing agents primarily integrate fixed examples through training or in-context learning, so they perform well on action sequences similar to those seen before but adapt poorly when tasks or environments differ. They often fail to extract and learn reusable task workflows shared across similar tasks, preventing them from learning from past successes and failures.

The paper is motivated by an analogy to human cognitive processes: humans excel at solving complex problems by abstracting common routines from past experiences and applying this knowledge to guide future actions. The core problem the paper aims to solve is to imbue LM agents with a similar capability to learn, store, and utilize reusable task workflows. This is crucial for building more intelligent and adaptable agents that can perform effectively in dynamic, real-world environments. The paper's innovative idea is Agent Workflow Memory (AWM), a mechanism designed to enable agents to continuously induce and apply these workflows, thereby creating a continual learning system that improves performance over time.

2.2. Main Contributions / Findings

The paper makes several significant contributions:

  • Introduction of Agent Workflow Memory (AWM): The authors propose AWM, a novel method for inducing commonly reused routines (workflows) from agent trajectories and selectively providing them to the agent's memory to guide subsequent action generation.

  • Flexible Application (Offline and Online): AWM is designed to operate in both offline settings (where workflows are induced from pre-existing annotated training examples) and online settings (where workflows are induced on-the-fly from successful self-generated test queries in a supervision-free manner).

  • Significant Performance Improvement: AWM substantially improves the success rate of LM agents on two major web navigation benchmarks:

    • WebArena: Achieves a 51.1% relative success rate increase over baselines, and even surpasses SteP, a method using human-engineered workflows, by 2.5 absolute points (7.6% relative). It also reduces the average number of steps taken per task by about 2.0 compared to a BrowserGym baseline.
    • Mind2Web: Improves relative success rate by 24.6% in cross-task scenarios.
  • Enhanced Generalization Capabilities: Online AWM demonstrates robust generalization across varied scenarios, including cross-task, cross-website, and cross-domain evaluations. It consistently outperforms baselines, with performance margins increasing from 8.9 to 14.0 absolute points as the train-test task distribution gap widens.

  • Demonstration of Continual Learning: The paper illustrates how AWM enables agents to learn basic workflows and then build more complex ones on top of them, creating a snowball effect that leads to substantial performance gains with minimal additional data.

  • Analysis of Workflow Representation: The authors explore different workflow representations (sub-routine/abstract vs. concrete, textual vs. programmatic, NL state description vs. HTML) and their impact on agent performance, providing insights into optimal design choices.

    In summary, AWM provides a flexible and effective mechanism for LM agents to learn and leverage reusable routines, significantly improving their performance and generalization capabilities on complex web navigation tasks, and paving the way for more adaptive and continually learning agents.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a foundational grasp of several concepts related to language models, AI agents, and web interaction is beneficial.

  • Language Model (LM) based Agents:

    • Concept: These are artificial intelligence systems that leverage the capabilities of large language models (LLMs) to perform tasks in digital environments. Unlike traditional software that follows explicit rules, LM agents interpret natural language instructions, reason about their environment, and generate actions, often in an iterative observe-act loop.
    • How they work: An LM agent typically receives a natural language instruction (the query), observes the current state of an environment (e.g., a webpage), and uses its internal knowledge (memory) and the LM's generative power to decide the next best action. This action is then executed, changing the environment's state, and the cycle repeats until the task is complete.
    • Relevance: This paper focuses on improving the robustness and adaptability of these LM-based agents, particularly for web navigation.
  • Web Navigation Tasks:

    • Concept: These are tasks that require an agent to interact with a web browser to achieve a specific goal. Examples include searching for a product on an e-commerce site, booking a flight, filling out a form, or navigating social media.
    • Actions: Common actions include CLICK (on buttons, links, etc.), TYPE (into text fields), SCROLL, NAVIGATE (to a URL), and STOP (when the task is completed).
    • Environment: The environment for a web agent is typically a web browser, and its state is represented by the current webpage content.
    • Relevance: The paper evaluates AWM on two major web navigation benchmarks, Mind2Web and WebArena.
  • Long-Horizon Tasks:

    • Concept: Tasks that require a long sequence of interdependent actions to complete. They often involve multiple sub-goals and can be complex to plan and execute.
    • Relevance: Current LM agents struggle with these due to cascading errors and difficulty in maintaining context or learning sequential dependencies, which AWM aims to address by breaking them down into reusable workflows.
  • In-context Learning:

    • Concept: A paradigm where a large language model (LLM) learns to perform a task by being provided with a few examples (demonstrations) within its input prompt, rather than through explicit weight updates (fine-tuning). The LLM then uses these examples to condition its behavior on a new, unseen input.
    • Relevance: Many current LM agents use in-context learning with fixed examples. AWM extends this by dynamically generating and integrating reusable workflows into the agent's context/memory.
  • Offline vs. Online Scenarios (in ML/Reinforcement Learning):

    • Offline Learning: The agent learns from a fixed dataset of pre-collected experiences. Once training is complete, the agent's policy is fixed and deployed. New data does not continuously update the model.
    • Online Learning: The agent continuously learns and updates its knowledge/policy as new data or experiences become available during deployment. This allows for adaptation to changing environments or tasks.
    • Relevance: AWM is designed to be flexible, operating in both offline (learning from a training set) and online (continually learning from test queries) modes, which is a key contribution for adaptability.
  • Accessibility Trees:

    • Concept: A hierarchical representation of the content and structure of a webpage, designed to make web content accessible to assistive technologies (like screen readers). It converts the visual DOM (Document Object Model) into a more semantic structure, often highlighting interactive elements and their roles.
    • Relevance: BrowserGym, a baseline method, uses accessibility trees to represent webpages, and AWM also leverages this representation. It's a structured way for agents to "see" and interact with web elements.

3.2. Previous Works

The paper builds upon and compares against several existing approaches and benchmarks in the field of web agents.

  • Web Agent Benchmarks:

    • MiniWob (Shi et al., 2017) & MiniWob++ (Liu et al., 2018): Early benchmarks for web agents, providing various scenarios like flight booking. They helped establish the foundational challenges.
    • WebShop (Yao et al., 2022): Focused on simulated e-commerce with crowd-sourced instructions, introducing more realistic interactions.
    • WebArena (Zhou et al., 2024): A more recent and comprehensive benchmark used in this paper. It integrates four additional websites beyond e-commerce, covering domains like social forums, software development, and content management. Crucially, it provides rigorous execution-based evaluation, meaning task success is determined by actual functional correctness rather than just action sequence correctness.
    • VisualWebArena (Koh et al., 2024): Extends WebArena by including tasks that require visual inputs, pushing towards multimodal agents.
    • Mind2Web (Deng et al., 2023): Another major benchmark used in this paper. It emphasizes the generality of agents across diverse operations and environments, featuring cross-task, cross-website, and cross-domain settings. Each task has a fixed number of steps, and evaluation includes element accuracy, action F1, and step success rate.
  • Enhancing Agents for Complex Tasks (Baselines):

    • BrowserGym (Drouin et al., 2024): A state-of-the-art method without human-annotated site-specific knowledge, used as a primary baseline in WebArena. It modifies the agent's default action space and uses accessibility tree representations.
    • SteP (Sodhi et al., 2023): An approach that uses human expert written workflows tailored for WebArena. AWM compares itself to SteP to show its ability to achieve competitive results without manual supervision.
    • AutoEval (Pan et al., 2024): A method that focuses on autonomous evaluation and refinement of digital agents. It involves additional evaluation and refinement steps to ensure task correctness.
    • MindAct (Deng et al., 2023): A method designed for Mind2Web that uses webpage element filtering and a multi-choice task format to simplify observation processing for LM agents.
    • Synapse (Zheng et al., 2024): Another Mind2Web baseline. It changes the input format to a trajectory style and augments retrieved relevant examples into the agent's context/memory. This is a direct point of comparison for AWM regarding the benefits of reusable workflows over concrete examples.
  • Learning Common Procedures from Experiences:

    • This is a broad area of related work that AWM falls into. Many prior works have explored extracting sub-routines:
      • Rule-based methods (Ellis et al., 2023; Bowers et al., 2023; Grand et al., 2023): These approaches typically use predefined rules or heuristics to identify repetitive action sequences.
      • LM-based methods (Cai et al., 2023; Wang et al., 2024c;a): These leverage large language models to identify and abstract common patterns from demonstrations. AWM falls into this category for its LM-based workflow induction.
    • The goal of these methods is often to create auxiliary skills or context guidance to improve future task-solving (Oh et al., 2017; Liang et al., 2023; Yu et al., 2023; Mao et al., 2023).

3.3. Technological Evolution

The evolution of web agents has moved from simple scripted bots to sophisticated LM-based agents. Early systems (MiniWob) focused on highly constrained environments and rule-based interactions. The advent of large language models brought in-context learning and the ability to interpret natural language instructions, shifting the focus to more open-ended and complex web tasks (WebShop, WebArena, Mind2Web).

Initially, agents relied on direct parsing of HTML or DOM structures. The adoption of accessibility trees improved semantic understanding of webpages. Methods also evolved from simply generating actions to incorporating self-feedback, planning, and memory augmentation with example demonstrations. However, these demonstrations often struggled with generalization because they were example-specific and context-dependent.

This paper's work (AWM) represents a significant step in this evolution by moving beyond mere example augmentation to workflow abstraction. Instead of providing concrete, full examples, AWM induces abstract, reusable routines. This allows for better generalization and continuous learning, addressing a key limitation of prior in-context learning approaches that struggled with extrapolating to other tasks or domains (Majumder et al., 2023). It bridges the gap between fixed, example-driven learning and dynamic, adaptable intelligence by allowing agents to build an evolving "skill library" (workflow memory) similar to how humans learn.

3.4. Differentiation Analysis

Compared to the main methods in related work, AWM introduces several core differences and innovations:

  • Abstraction of Reusable Routines (Workflows) vs. Concrete Examples:

    • Prior Work (e.g., Synapse): Often augments agent memory with concrete, full example trajectories or relevant demonstrations. While helpful, these example-specific contexts can bias agents and struggle to generalize to tasks that differ even slightly from the provided examples.
    • AWM's Innovation: Induces abstract, reusable workflows by explicitly prompting LMs to extract common sub-routines and abstracting out example-specific contexts (e.g., replacing "dry cat food" with {product-name}). This makes workflows much more flexible and widely applicable across diverse tasks and domains. The paper explicitly shows AWM's superiority over Synapse on Mind2Web, noting better element accuracy due to less bias.
  • Flexible Offline and Online Workflow Induction:

    • Prior Work: Many methods rely on available training data or human supervision for learning or policy design (e.g., SteP uses human-engineered workflows, BrowserGym uses a predefined action space). Collecting high-quality, domain-aligned examples is often difficult or impossible.
    • AWM's Innovation: Can operate in two modes:
      • Offline AWM: Utilizes existing canonical examples (if available) for workflow induction.
      • Online AWM: Crucially, it can function in a supervision-free manner by iteratively inducing workflows from its own successful predictions during test time. This allows it to learn and adapt even when no training data exists, directly addressing the challenge of data scarcity and enabling continual learning in dynamic environments. This flexibility significantly improves its generalization to unseen websites and domains.
  • Continual Learning and Workflow Composition:

    • Prior Work: Agents typically solve each task separately, without explicitly learning from past successes or failures to build a cumulative knowledge base of routines.
    • AWM's Innovation: AWM facilitates continual learning where agents can first learn basic workflows (e.g., "find a place by its name") and then use these as sub-goals to build more complex workflows (e.g., "get the zip code of a place" by first finding the place using the learned workflow). This snowball effect of increasingly complex workflows is a novel aspect that significantly boosts long-term performance.
  • Efficiency in Action Trajectories:

    • Prior Work (e.g., AutoEval): May require additional evaluation and refinement steps, leading to longer action trajectories.

    • AWM's Innovation: By providing direct guidance through relevant workflows, AWM enables agents to achieve high success rates while taking fewer steps per task, indicating more efficient task execution.

      In essence, AWM differentiates itself by moving beyond static, example-based learning to a more dynamic, abstract, and human-like approach of acquiring and composing reusable skills (workflows), leading to superior adaptability and performance across diverse and challenging web environments.

4. Methodology

The core idea of Agent Workflow Memory (AWM) is to enable language model-based agents to learn and reuse common operational routines, termed workflows, from past experiences. This process is analogous to how humans abstract common procedures and apply them to guide future actions. AWM achieves this by inducing workflows from agent trajectories and integrating them into the agent's memory to guide subsequent task-solving.

4.1. Principles

The fundamental principle behind AWM is workflow induction and memory augmentation. An agent's observed action trajectories from successfully completed tasks contain patterns of sub-routines that are frequently reused. By extracting these sub-routines and representing them as workflows with abstract descriptions and parameterized actions, AWM creates a reusable knowledge base. This knowledge base is then integrated into the agent's text-based memory, providing context-sensitive guidance to the underlying language model during future task execution. This mechanism allows the agent to move beyond merely imitating concrete examples to leveraging generalized procedural knowledge, improving efficiency, success rate, and generalization across varying tasks and environments.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Statement

The paper defines the task for an LM-based agent as follows: An agent operates with a language model backbone $L$ and text-based memory $M$. The base memory initially contains documentation of built-in actions (e.g., CLICK, TYPE). To solve a task specified by a natural language (NL) instruction $q$, the agent interacts with an environment defined by a transition function $T$.

For each time step $t_i$:

  1. The environment state $s_i$ provides an observation $o_i$.
  2. The observation $o_i$, along with the instruction $q$ and memory $M$, is passed to the language model to generate an action $a_i$. This is represented as: $L(q, M, o_i) \to a_i$
    • $L$: The language model backbone.
    • $q$: The natural language instruction for the current task.
    • $M$: The agent's text-based memory, potentially augmented with workflows.
    • $o_i$: The observation obtained from the current environment state $s_i$.
    • $a_i$: The action generated by the language model.
  3. The generated action $a_i$ is executed in the environment, changing the state: $T(s_i, a_i) \to s_{i+1}$
    • $T$: The environment's transition function.

    • $s_i$: The current environment state.

    • $a_i$: The executed action.

    • $s_{i+1}$: The new environment state after executing $a_i$.

      This observe-act loop continues until the model predicts a STOP action ($a_i = \mathrm{STOP}$) or a pre-determined task termination condition (e.g., maximum steps) is met.

Each completed task forms an experience $e$, which comprises the NL instruction $q$ and a trajectory of steps $P^e$. Each step $p$ in the trajectory contains the agent's observation $o$ and the action $a$ taken, formulated as $p = (o, a)$. So, an experience is $e = (q, P^e)$, where $P^e = (p_1^e, \dots, p_n^e)$.

The goal of AWM is to induce useful workflows $\mathcal{W} = \{w\}$ from a set of experiences $\mathcal{E} = \{e\}$ (collected from past or newly generated examples) using an induction module $I$. This induction process is formalized as: $I(\mathcal{E}) \to \mathcal{W}$

  • $I$: The workflow induction module.

  • $\mathcal{E}$: A set of agent experiences.

  • $\mathcal{W}$: The set of induced workflows.

    These induced workflows $\mathcal{W}$ are then added to the agent's memory $M$ as guidance for subsequent task-solving.
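To make the loop concrete, the following is a minimal Python sketch of the observe-act loop under stated assumptions: `lm` and `env` are hypothetical interfaces (not the paper's implementation), and the memory is a plain list of text entries.

```python
def run_agent(lm, env, query, memory, max_steps=30):
    """Minimal observe-act loop sketch: L(q, M, o_i) -> a_i, then T(s_i, a_i) -> s_{i+1}.
    `lm` and `env` are hypothetical interfaces, not the paper's code."""
    trajectory = []
    observation = env.observe()  # o_i from the current state s_i
    for _ in range(max_steps):   # pre-determined termination condition
        action = lm.generate(query, memory, observation)  # L(q, M, o_i) -> a_i
        trajectory.append((observation, action))          # step p_i = (o_i, a_i)
        if action == "STOP":
            break
        observation = env.step(action)  # T(s_i, a_i) -> s_{i+1}, then observe o_{i+1}
    return (query, trajectory)  # experience e = (q, P^e)
```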

4.2.2. Workflow Representation

A workflow is structured similarly to an experience but represents a reusable routine rather than a specific task instance. Each workflow $w$ consists of two main components:

  1. Textual Description ($d$): An NL task description that summarizes the high-level goal or function of the workflow. This description is heuristically extracted from experience instructions or summarized by an LM.
  2. Workflow Trajectory ($P^d$): A series of steps $(p_1, p_2, \ldots)$ required to complete the process described in $d$. Each step $p$ within a workflow trajectory has three parts:
    • Environment State Description: An NL description of the current environment state (e.g., "Order {id} is shown").

    • Reasoning Process: The agent's elaborated reasoning for deciding on the next action (e.g., "Order {id} is found, I will now terminate the task.").

    • Executable Action: An action represented as an executable program over the environment (e.g., stop()).

      The following figure (Figure 2 from the original paper) illustrates the AWM pipeline, showing how an agent interacts with the environment, induces workflows, and integrates them into memory.


Figure 2: Illustration of our AWM pipeline: an agent takes actions to solve given queries, induces workflows from successful ones, and integrates them into memory.
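To ground this representation, here is a minimal sketch of a workflow as a data structure, assuming hypothetical class and field names; in the paper, workflows are stored as plain text in the agent's memory. The example instance reuses the "Order {id}" step quoted above.

```python
from dataclasses import dataclass

@dataclass
class WorkflowStep:
    state_description: str  # NL description of the current environment state
    reasoning: str          # the agent's reasoning for choosing the next action
    action: str             # executable action over the environment

@dataclass
class Workflow:
    description: str           # NL summary of the routine's high-level goal
    steps: list[WorkflowStep]  # the workflow trajectory P^d

# Illustrative instance built from the step quoted in the text above;
# the description string is a hypothetical example.
find_order = Workflow(
    description="Find an order by its {id}",
    steps=[
        WorkflowStep(
            state_description="Order {id} is shown",
            reasoning="Order {id} is found, I will now terminate the task.",
            action="stop()",
        ),
    ],
)
```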

4.2.3. Inducing and Using Workflows

At the core of AWM is the induction module $I$, which generates a set of workflows $\mathcal{W}$ from one or more past agent experiences $\mathcal{E} = \{e_i\}_{i=1}^m$. Each experience $e = (q, P^e)$ contains an NL task instruction $q$ and a sequence of observation-action steps $P^e = (p_1^e, \dots, p_n^e)$. The induction module's output is: $I(\mathcal{E}) \to \mathcal{W} = \{w\} = \{(d_j, P_j^d)\}$

  • $\mathcal{W}$: Set of induced workflows.
  • $d_j$: Textual description of the $j$-th workflow.
  • $P_j^d$: Trajectory of steps for the $j$-th workflow.

LM-based Workflow Induction

To ensure workflows are reusable and capture common sub-routines, AWM uses an LM-based module for induction. This module prompts an LM to extract these sub-routines from input experiences. Key aspects of this process include:

  • Finer Granularity: Unlike concrete task instructions, workflows are induced at a finer granularity, focusing on sub-tasks (e.g., "search for a product on Amazon" instead of "Buy dry cat food on Amazon and deliver to my address").

  • Abstraction of Contexts: To enhance generality, example-specific values are abstracted into variables (e.g., "dry cat food" becomes {product-name}). This is achieved by specifying this abstraction in the workflow induction prompts (details in Appendix A).

  • Segmentation: Induced workflows are segmented (based on double-line breaks in the model output) and stored separately in workflow memory.

    After induction, workflows $\mathcal{W}$ are integrated into the agent's memory, augmenting the original memory $M$ to become $M_w = M + \mathcal{W}$. When solving a new instruction $q$, the agent now uses this augmented memory: $L(q, M + \mathcal{W}, o) \to a$
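As a concrete illustration of the LM-based induction described above, here is a minimal sketch assuming a hypothetical `lm.complete` interface; the prompt wording paraphrases the paper's description rather than quoting its Appendix A prompt.

```python
INDUCTION_PROMPT = """You are given agent experiences, each consisting of a task
instruction and a trajectory of (observation, action) steps. Extract commonly
reused sub-routines as workflows at a finer granularity than the full tasks.
Abstract example-specific values into variables, e.g. replace "dry cat food"
with {product-name}. Separate different workflows with a blank line."""

def format_experience(experience):
    """Render one experience e = (q, P^e) as text for the induction prompt."""
    query, trajectory = experience
    steps = "\n".join(f"obs: {o}\nact: {a}" for o, a in trajectory)
    return f"Instruction: {query}\n{steps}"

def induce_workflows(lm, experiences):
    """LM-based induction module I(E) -> W; the output is segmented on
    double-line breaks, as described in the paper."""
    prompt = INDUCTION_PROMPT + "\n\n" + "\n\n".join(
        format_experience(e) for e in experiences
    )
    output = lm.complete(prompt)  # hypothetical LM call
    return [w.strip() for w in output.split("\n\n") if w.strip()]
```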

AWM is applied in two main scenarios:

Offline Scenario ($\mathbf{AWM}_{offline}$)

This scenario applies when additional canonical experiences (e.g., human-annotated data or synthesized examples) are available for training. The process is divided into two distinct phases:

  1. Workflow Induction (Training Time): AWM takes all available training examples from a website, concatenates them into a single prompt, and feeds them to an LM to create a set of workflows: $I(\mathcal{E}_{train}) \to \mathcal{W}_{offline}$
    • $\mathcal{E}_{train}$: Set of training experiences.
    • $\mathcal{W}_{offline}$: Set of workflows induced in the offline setting.
  2. Workflow Utilization (Inference Time): At inference time, the agent incorporates all induced workflows $\mathcal{W}_{offline}$ into its memory to solve test instructions. The same workflow memory $\mathcal{W}_{offline}$ is used for every test task: $L(q, M + \mathcal{W}_{offline}, o_i^{test}) \to a_i^{test}$

The following figure (Figure 3 from the original paper) illustrates the $\mathbf{AWM}_{offline}$ pipeline: (1) workflows are induced from additional training examples and added to memory, then (2) applied at test-time inference to guide task execution.

Figure 3: Illustration of $\mathbf{AWM}_{offline}$
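A minimal sketch of this two-phase pipeline, reusing the hypothetical `run_agent` and `induce_workflows` helpers sketched earlier:

```python
def awm_offline(lm, env, train_experiences, test_queries, base_memory):
    """Offline AWM sketch: induce W_offline once from training examples,
    then reuse the same augmented memory M + W_offline for every test task."""
    workflows = induce_workflows(lm, train_experiences)  # I(E_train) -> W_offline
    memory = base_memory + workflows                     # M_w = M + W_offline
    return [run_agent(lm, env, q, memory) for q in test_queries]
```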

Online Scenario ($\mathbf{AWM}_{online}$)

This scenario is designed for situations where no additional canonical experiences are available, and AWM operates in a supervision-free manner using only test queries. Agents with $\mathbf{AWM}_{online}$ process test queries in a streaming fashion, continually inducing, integrating, and utilizing workflows. The process is iterative:

  1. Initialization: The agent starts with its default memory $M$.

  2. Task Execution: Given the $t$-th test instruction $q^t$, the agent attempts to solve it by generating an action trajectory $(p_1^t, p_2^t, \ldots)$, forming an experience $e^t = (q^t, \{p^t\})$.

  3. Success Evaluation: An LM-based evaluation model (from Pan et al., 2024) outputs a binary label, $L_{eval}(e^t) \in \{0, 1\}$, indicating whether the experience $e^t$ successfully solves $q^t$.

  4. Workflow Induction and Memory Update: If $e^t$ is judged successful (i.e., $L_{eval}(e^t) = 1$), it is transformed into one or more workflows, $I(e^t) \to \{w^t\}$, which are then added to the agent's memory for the next task: $M^t + \{w^t\} \to M^{t+1}$

    • $M^t$: Agent's memory before processing task $t$.
    • $\{w^t\}$: Workflows induced from successful task $t$.
    • $M^{t+1}$: Updated memory for task $t+1$.
  5. Iteration: This memory-updating process continues iteratively as test instructions are streamed. The success rate is evaluated on the predicted action trajectories $\{p^t\}$ for all tests.

    The following figure (Figure 4 from the original paper) illustrates the $\mathbf{AWM}_{online}$ pipeline: workflows are induced from the stream of test examples, grow in memory over time, and are applied to guide subsequent test inference.

Figure 4: Illustration of $\mathbf{AWM}_{online}$
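A minimal sketch of this streaming loop, under the same assumptions as the earlier snippets; `evaluator` stands in for the LM-based judge of Pan et al. (2024):

```python
def awm_online(lm, env, evaluator, test_queries, base_memory):
    """Online AWM sketch: process test queries as a stream, inducing workflows
    only from trajectories the evaluator judges successful."""
    memory = list(base_memory)  # M^0: default memory
    experiences = []
    for query in test_queries:  # t-th test instruction q^t
        experience = run_agent(lm, env, query, memory)    # e^t = (q^t, {p^t})
        experiences.append(experience)
        if evaluator(experience) == 1:                    # L_eval(e^t) = 1
            memory += induce_workflows(lm, [experience])  # M^t + {w^t} -> M^{t+1}
    return experiences  # predicted trajectories, later scored for success rate
```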

4.2.4. AWM's Continual Learning Mechanism

As depicted in Figure 1, AWM enables a continual learning loop. The agent begins with basic actions and, through solving tasks in a streaming manner, induces workflows. For instance, it might first learn "find a place by its name." Subsequently, it can build more complex workflows by composing existing ones, such as using "find a place by its name" as a subgoal within a larger workflow like "get the zip code of a place." This iterative process of inducing and applying increasingly complex workflows expands the agent's memory and knowledge, leading to a snowball effect that substantially improves performance over time compared to static baselines.


Figure 1: AWM enables agents to continuously induce and apply workflows to improve performance, compared to stagnant baselines. We show results by AWM on the WebArena map split as an example.
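For illustration, the two workflows in this composition example might look like the following once induced; these are paraphrased sketches, not the paper's exact outputs:

```python
# Paraphrased, illustrative workflow texts; not the paper's exact outputs.
basic_workflow = """Workflow: find a place by its {name}
Step 1: Type {name} into the search box.
Step 2: Click the search button."""

composed_workflow = """Workflow: get the zip code of a place {name}
Step 1: Find the place by its {name} (reuses the earlier workflow as a sub-goal).
Step 2: Read the zip code from the place's information panel."""
```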

5. Experimental Setup

The paper evaluates AWM on two prominent web navigation benchmarks: WebArena and Mind2Web. The experiments are designed to assess AWM's task success and its generalization capabilities across various setups. For both benchmarks, AWM is applied on a website basis, meaning examples are grouped by their associated websites, and AWM runs separately on each group. This ensures that the collection of workflows remains relevant to the test tasks.
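The per-website grouping can be pictured with a small sketch; the `website` field name is a hypothetical stand-in for however examples record their source site:

```python
from collections import defaultdict

def group_by_website(examples):
    """Group examples by source website so that workflow induction
    runs separately on each group (field name hypothetical)."""
    groups = defaultdict(list)
    for example in examples:
        groups[example["website"]].append(example)
    return groups
```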

5.1. Datasets

  • WebArena (Zhou et al., 2024):

    • Source & Characteristics: Provides 812 web navigation tasks across five distinct websites. These websites cover four common application domains: e-commerce, social forum discussions, collaborative software development, and content management.
    • Key Feature: Supports rigorous execution-based evaluation, meaning task success is determined by the functional correctness of the agent's actions within the environment, rather than just the correctness of the action sequence itself.
    • Data Availability: WebArena primarily has test examples. Due to the lack of additional high-quality, domain-aligned training examples, AWM is mainly conducted in the online setting for this benchmark.
  • Mind2Web (Deng et al., 2023):

    • Source & Characteristics: Features web navigation tasks that explicitly stress generality of agents across cross-task, cross-website, and cross-domain settings.
    • Task Structure: Each task in Mind2Web has a fixed number of steps. At each step, the agent must predict an action.
    • Data Availability: Mind2Web provides a training set that covers a portion of the tested websites (used for the cross-task split). This allows for exploration of both offline (inducing workflows from the training set) and online (streaming workflow induction and inference on test queries) AWM settings.
    • Splits:
      • Cross-task: Training and test examples are from the same websites/domains but involve different tasks.
      • Cross-website: Test examples are from websites not seen during training, but within the same domain.
      • Cross-domain: Test examples are from entirely new websites and domains not seen during training.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, a complete explanation is provided below.

  • Success Rate (SR):

    1. Conceptual Definition: Task Success Rate is a high-level metric that measures the percentage of tasks for which the agent successfully achieves the desired goal from start to finish. It is the ultimate measure of an agent's ability to complete a given instruction. For Mind2Web, specifically, it measures if all intermediate steps for a given task are successfully conducted.
    2. Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} $
    3. Symbol Explanation:
      • SR\mathrm{SR}: Success Rate.
  • Number of Steps (WebArena):

    1. Conceptual Definition: This metric quantifies the average number of actions an agent takes to successfully complete a task. A lower number of steps generally indicates higher efficiency.
    2. Mathematical Formula: Not explicitly provided in the paper as a formula, but conceptually it is the sum of actions taken across all successful tasks divided by the number of successful tasks.
    3. Symbol Explanation: Not applicable as it's a direct count.
  • Mind2Web Specific Step-wise Metrics: For Mind2Web, evaluation is conducted at each step, and then aggregated.

    1. Element Accuracy (Elem Acc):

      • Conceptual Definition: Measures whether the agent correctly identifies and selects the target web page element (e.g., a button, a text field) that it needs to interact with at a given step.
      • Mathematical Formula: Not explicitly provided in the paper as a formula, but conceptually it is the percentage of steps where the correct page element is selected.
      • Symbol Explanation: Not applicable as it's a direct percentage.
    2. Action F1:

      • Conceptual Definition: Measures the harmonic mean of Precision and Recall for the action taken on the selected element. It assesses whether the agent performs the correct type of action (e.g., CLICK, TYPE, SCROLL) given the selected element.
      • Mathematical Formula: The standard F1-score formula is: $ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ Where: $ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} $ $ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $
      • Symbol Explanation:
        • F1F_1: The F1-score, which balances Precision and Recall.
        • Precision\text{Precision}: The proportion of correctly predicted positive actions out of all positive actions predicted by the model. In this context, it would be the proportion of times the agent's predicted action type was correct among all action types it predicted.
        • Recall\text{Recall}: The proportion of correctly predicted positive actions out of all actual positive actions. It would be the proportion of times the agent correctly identified an action type among all the required action types.
        • True Positives (TP)\text{True Positives (TP)}: The agent correctly predicts an action type that is indeed the correct action type.
        • False Positives (FP)\text{False Positives (FP)}: The agent predicts an action type, but it is not the correct action type.
        • False Negatives (FN)\text{False Negatives (FN)}: The agent fails to predict an action type that was actually required.
    3. Step Success Rate (Step SR):

      • Conceptual Definition: Aggregates Element Accuracy and Action F1. A step is considered successful only if both the correct page element is selected AND the correct action is taken on that element.
      • Mathematical Formula: Not explicitly provided as a formula, but implicitly, it's the percentage of steps where both element accuracy and action selection are correct.
      • Symbol Explanation: Not applicable as it's a direct percentage.
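To make the step-wise aggregation concrete, here is a minimal sketch of Step SR and task SR computation, assuming hypothetical dictionaries with `element` and `action` fields for predicted and gold steps (Action F1 follows the standard formula given above):

```python
def step_success_rate(pred_steps, gold_steps):
    """Step SR sketch: a step succeeds only if both the selected element and
    the action taken on it are correct (field names hypothetical)."""
    correct = sum(
        1 for p, g in zip(pred_steps, gold_steps)
        if p["element"] == g["element"] and p["action"] == g["action"]
    )
    return correct / len(gold_steps)

def task_success_rate(tasks):
    """Task SR: fraction of tasks whose every intermediate step succeeded.
    `tasks` is a list of (pred_steps, gold_steps) pairs."""
    solved = sum(1 for pred, gold in tasks
                 if step_success_rate(pred, gold) == 1.0)
    return solved / len(tasks)
```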

5.3. Baselines

The paper compares AWM against several state-of-the-art LM-based agent methods, selected for their relevance to web navigation tasks and their performance on the chosen benchmarks.

  • WebArena Baselines:

    • WebArena (Zhou et al., 2024): The original benchmark's reported baseline performance.
    • AutoEval (Pan et al., 2024): A method focusing on autonomous evaluation and refinement steps for digital agents.
    • BrowserGym (Drouin et al., 2024): The current state-of-the-art autonomous method on WebArena (without human-annotated site-specific knowledge). It alters the agent's default action space.
    • BrowserGym_ax-tree: A variant of BrowserGym that uses only accessibility tree webpage representations, mirroring AWM's input representation for a fairer comparison.
    • SteP (Sodhi et al., 2023): A method that uses 14 human expert written workflows specifically tailored to solving WebArena tasks. This serves as a strong baseline demonstrating the upper bound of performance with extensive human supervision.
  • Mind2Web Baselines:

    • MindAct (Deng et al., 2023): A method that introduces webpage element filtering and a multi-choice task format to ease observation processing for LM agents.

    • Synapse (Zheng et al., 2024): A method that uses a trajectory-style format and augments retrieved relevant examples into the agent's memory. This is a critical comparison point for AWM as it tests the efficacy of abstract workflows versus concrete examples.

      For both benchmarks, the experiments are conducted using GPT-4 (specifically gpt-4-0613) and gpt-3.5-turbo models with a temperature of 0.0, which ensures deterministic and stable model outputs for consistent evaluation. The same model is used for neural workflow induction and agent action generation.

6. Results & Analysis

This section details the experimental results of AWM on the WebArena and Mind2Web benchmarks, comparing its performance against various baselines and analyzing its generalization capabilities and design choices.

6.1. Core Results Analysis

6.1.1. WebArena Main Results

The following are the results from Table 1 of the original paper:

| Method | Total SR | Shopping | CMS | Reddit | GitLab | Maps | # Steps |
|---|---|---|---|---|---|---|---|
| *With human engineered workflows* | | | | | | | |
| *SteP (Sodhi et al., 2023) | 33.0 | 37.0 | 24.0 | 59.0 | 32.0 | 30.0 | - |
| *Autonomous agent only* | | | | | | | |
| WebArena (Zhou et al., 2024) | 14.9 | 14.0 | 11.0 | 6.0 | 15.0 | 16.0 | - |
| AutoEval (Pan et al., 2024) | 20.2 | 25.5 | 18.1 | 25.4 | 28.6 | 31.9 | 46.7 |
| BrowserGym (Drouin et al., 2024) | 23.5 | - | - | - | - | - | - |
| BrowserGym_ax-tree | 15.0 | 17.2 | 14.8 | 20.2 | 19.0 | 25.5 | 7.9 |
| AWM (ours) | 35.5 | 30.8 | 29.1 | 50.9 | 31.8 | 43.3 | 5.9 |

AWM achieves the highest Total SR (Success Rate) of 35.5% on WebArena, significantly outperforming all other methods. It surpasses the BrowserGym baseline (an autonomous method) by 12.0 absolute points (a 51.1% relative increase) and BrowserGym_ax-tree by 20.5 absolute points. Notably, AWM even outperforms SteP (which uses human-engineered workflows) by 2.5 absolute points (a 7.6% relative increase), demonstrating its ability to learn effective workflows without manual supervision. The performance gains are consistent across all five website categories, with AWM improving over the BrowserGym_ax-tree baseline by 12.8 to 30.7 absolute points, indicating broad applicability.

Beyond task success, AWM also demonstrates efficiency by completing tasks in fewer steps. It uses an average of 5.9 steps per example, which is 2.0 fewer steps than the BrowserGym_ax-tree baseline and a remarkable 40.8 fewer steps than AutoEval, highlighting that AWM achieves higher success with more concise and efficient action trajectories.

6.1.2. Efficient Learning from Small Amounts of Data (WebArena)

The behavior of AWM_online is illustrated by its cumulative success rate over the process of online evaluation. The following figure (Figure 5 from the original paper) shows the cumulative success rate of AWM on the WebArena map test split.


Figure 5: AWM enables rapid learning from a small amount of data, i.e., about 40 queries, using WebArena map test split as an example.

The graph shows that AWM exhibits a fast learning curve within the first 0-40 examples, where it acquires essential workflows that lead to a rapid increase in success rates. After this initial phase, the learning curve gradually stabilizes, indicating that the agent continues to learn more advanced workflows but the most significant performance gains occur early with a relatively small amount of data. This demonstrates AWM's efficient learning process, achieving substantial performance improvements by leveraging insights from merely tens of examples.

6.1.3. Cross-Template Workflow Generalization (WebArena)

To test AWM's generalization beyond task templates, a cross-template subset of WebArena examples (from non-overlapping templates) was created. The following are the results from Table 2 of the original paper:

| Method | Total SR | Shopping (51) | CMS (45) | Reddit (24) | GitLab (45) | Maps (32) |
|---|---|---|---|---|---|---|
| *With human engineered workflows* | | | | | | |
| *SteP (Sodhi et al., 2023) | 32.1 | 26.5 | 29.3 | 52.2 | 27.3 | 36.4 |
| *Autonomous agent only* | | | | | | |
| AutoEval (Pan et al., 2024) | 23.2 | 12.2 | 17.1 | 21.7 | 31.8 | 36.4 |
| BrowserGym_ax-tree | 20.5 | 10.4 | 17.8 | 23.1 | 27.3 | 28.6 |
| AWM (ours) | 33.2 | 24.5 | 29.3 | 52.2 | 31.8 | 39.4 |

Even on this challenging cross-template subset, AWM still achieves the highest Total SR of 33.2%, outperforming all baselines and maintaining its superiority across individual website splits. This demonstrates that AWM's induced workflows effectively generalize across different tasks, not just within instances of the same task template. This validates AWM's ability to extract truly reusable routines.

The following figure (Figure 6 from the original paper) provides a case study illustrating how AWM builds increasingly complex workflows.


Figure 6: AWM builds increasingly complex workflows over time, by learning from past examples and earlier workflows.

As shown in the example, AWM first learns a basic workflow like "Find a place by its name" from initial examples. Later, when encountering a more complex task (e.g., "get the zip code of a place"), it reuses the "Find a place by its name" workflow as a sub-routine and adds subsequent steps to obtain the zip code. This compositional learning demonstrates AWM's ability to generalize and build hierarchical knowledge.

6.1.4. Mind2Web Main Results (Offline)

The following are the results from Table 3 of the original paper:

| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct_3.5 | 20.3 | 56.6 | 17.4 | 0.8 |
| CogAgent_3.5 | - | - | 18.6 | - |
| Synapse_3.5 | 34.0 | - | 30.6 | 2.4 |
| AWM_3.5 | 39.0 | 52.8 | 34.6 | 2.8 |
| MindAct_4 | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM_4 | 50.6 | 57.3 | 45.1 | 4.8 |

In the Mind2Web cross-task dataset, using AWM_offline, the method consistently achieves the highest success rates with both GPT-3.5-turbo and GPT-4 variants. For GPT-4, AWM achieves a Step SR of 45.1% and a Task SR of 4.8%, which are substantial improvements over MindAct_4 (36.2% Step SR, 2.0% Task SR). This represents a 24.6% relative increase in Step SR and a 140% relative increase in Task SR.

A breakdown shows that the improvements largely come from Element Accuracy, with AWM_4 achieving 50.6% compared to MindAct_4's 41.6% (a 9.0 absolute point increase). This suggests that AWM's abstract workflow representation helps agents more accurately select the correct web elements.

Comparing AWM to Synapse, which uses concrete examples, AWM achieves higher Element Accuracy (+5.0 points) and Step SR (+4.0 points for GPT-3.5-turbo). This supports the argument that abstract sub-routines in AWM introduce less bias in element selection compared to concrete, full examples, which might lead agents to prefer elements similar to those in the provided demonstrations. The reusable nature of workflows is shown to be more flexible than full example trajectories.

However, AWM shows a slightly lower Action F1 score than MindAct (e.g., 57.3% vs. 60.6% for GPT-4). This suggests that while workflows guide better element selection, they might sometimes prompt actions that are not perfectly aligned with the current environment state, indicating a challenge in knowing when to diverge from workflow guidelines.

6.1.5. Online AWM Enables Generalization (Mind2Web)

The following are the results from Table 4 of the original paper:

Columns are grouped by split (CT = Cross-Task, CW = Cross-Website, CD = Cross-Domain); EA = Element Accuracy, AF1 = Action F1.

| Method | CT EA | CT AF1 | CT Step SR | CT SR | CW EA | CW AF1 | CW Step SR | CW SR | CD EA | CD AF1 | CD Step SR | CD SR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MindAct* | 41.6 | 60.6 | 36.2 | 2.0 | 35.8 | 51.1 | 30.1 | 2.0 | 21.6 | 52.8 | 18.6 | 1.0 |
| AWM_offline | 50.6 | 57.3 | 45.1 | 4.8 | 41.4 | 46.2 | 33.7 | 2.3 | 36.4 | 41.6 | 32.6 | 0.7 |
| AWM_online | 50.0 | 56.4 | 43.6 | 4.0 | 42.1 | 45.1 | 33.9 | 1.6 | 40.9 | 46.3 | 35.5 | 1.7 |

On Mind2Web, both AWM_online and AWM_offline outperform the MindAct baseline across all three generalization scenarios: cross-task, cross-website, and cross-domain. Step SR gains range from 3.6-3.8 absolute points on the cross-website split up to 14.0-16.9 points on the cross-domain split.

  • In-domain, Cross-task Scenario: AWM_online and AWM_offline perform comparably, with AWM_offline slightly ahead in Step SR (45.1% vs 43.6%). This is attributed to AWM_offline benefiting from high-quality, distribution-matching training examples when available, which can alleviate the train-test gap. AWM_online can induce incorrect workflows from self-generated (potentially flawed) trajectories.
  • Extending to Unseen Websites and Domains: As the domain gaps widen (from cross-website to cross-domain), AWM_online demonstrates greater generalization abilities. For example, in the cross-domain setting, AWM_online achieves a Step SR of 35.5% compared to AWM_offline's 32.6% and MindAct's 18.6%. This is because AWM_online does not rely on training data and thus is not affected by domain gaps between training and testing data, allowing it to adapt better to novel environments by learning workflows directly from the test distribution. Even AWM_offline, despite domain gaps, still shows substantial improvements over MindAct (e.g., 32.6% vs 18.6% Step SR cross-domain), indicating the inherent benefit of a workflow repository.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Rule-based vs. LM-based Workflow Induction

The paper explores alternative methods for workflow induction, comparing the LM-based approach to a rule-based one. The rule-based induction method (IruleI_{rule}) extracts action sequences, deduplicates them, and removes invalid steps.
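A minimal sketch of such a rule-based module, with a hypothetical `is_valid` predicate standing in for the paper's invalid-step filtering:

```python
def induce_workflows_rule(experiences, is_valid):
    """Rule-based induction I_rule sketch: extract action sequences,
    drop invalid steps, and deduplicate identical sequences."""
    seen, workflows = set(), []
    for query, trajectory in experiences:
        actions = tuple(a for _, a in trajectory if is_valid(a))
        if actions and actions not in seen:  # keep each sequence once
            seen.add(actions)
            workflows.append((query, actions))
    return workflows
```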

  • WebArena Results: The following are the results from Table 5 of the original paper:

    | Method | Total SR | # Steps |
    |---|---|---|
    | AWM_rule | 35.6 | 6.3 |
    | AWM_lm | 35.5 | 5.9 |

    On WebArena, AWM_rule and AWM_lm perform comparably in Total SR (35.6% vs. 35.5%), with AWM_lm being slightly more efficient (5.9 vs. 6.3 steps). Manual analysis suggests LM-based workflows are finer-grained, avoiding unnecessary steps found in rule-induced workflows.

  • Mind2Web Results: The following are the results from Table 6 of the original paper:

    | Method | Elem Acc | Action F1 | Step SR | SR |
    |---|---|---|---|---|
    | MindAct_4 | 41.6 | 60.6 | 36.2 | 2.0 |
    | AWM_4,rule | 49.5 | 57.0 | 43.4 | 2.0 |
    | AWM_4,lm | 50.6 | 57.3 | 45.1 | 4.8 |

    On Mind2Web, AWM_lm (50.6% Elem Acc, 45.1% Step SR) outperforms AWM_rule (49.5% Elem Acc, 43.4% Step SR), a 1.7 point improvement in Step SR and a 2.8 point improvement in Task SR. This highlights that LM-based induction's abstract representation and focus on frequently-used sub-routines lead to less bias in element selection and more flexible utilization across test examples, compared to rule-induced full example trajectories.

6.2.2. Workflows in Descriptive Texts

The paper investigates whether representing workflow steps in a textual format (NL descriptions) is better than a program format (code actions). The following are the results from Table 7 of the original paper:

| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM | 50.6 | 57.3 | 45.1 | 4.8 |
| AWM_text | 51.2 | 57.4 | 45.4 | 3.6 |

AWM_text (workflows represented in NL) achieves slightly higher Element Accuracy (51.2% vs. 50.6%) and Step SR (45.4% vs. 45.1%) compared to AWM (workflows in code format). However, AWM_text shows a degradation in Task SR (3.6% vs. 4.8%). Overall, the performance variance between text and code formats is not substantial, suggesting both can effectively augment agent memory.

6.2.3. Environment Abstraction in Workflows

The paper examines how intermediate webpage states are represented within workflows. AWM uses NL descriptions, but the study considers adding website HTML (filtered for relevance) or both. The following are the results from Table 8 of the original paper:

| Desc. | HTML | Elem Acc | Act F1 | Step SR | SR |
|---|---|---|---|---|---|
| ✔️ | | 39.0 | 52.8 | 34.6 | 2.8 |
| | ✔️ | 38.1 | 54.0 | 33.8 | 2.8 |
| ✔️ | ✔️ | 37.1 | 51.3 | 32.9 | 2.0 |

The results indicate that NL descriptions of states (Desc. ✔️) are more effective than HTML (HTML ✔️), as replacing NL with HTML leads to a slight 0.8 point drop in Step SR. Interestingly, using both NL and filtered HTML (Desc. ✔️ HTML ✔️) leads to worse results. The authors conjecture two reasons: (1) increased context length overloads the models, and (2) the filtered HTML often contains irrelevant items (missing correct elements 47% of the time), which can contradict NL descriptions and impair the agent's abilities. This suggests that a concise, high-level NL description is often superior to raw or partially filtered HTML for grounding agents.

6.2.4. Workflow Utilization in Context and in Action ($\mathbf{AWM}_{AS}$)

This section explores expanding the agent's action space with workflows, treating them as high-level functions or shortcut tools. The following are the results from Table 9 of the original paper:

| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM | 50.6 | 57.3 | 45.1 | 4.8 |
| AWM_AS | 51.8 | 56.7 | 46.4 | 3.6 |

Expanding the agent's action space with workflows (AWM_AS) leads to a slight improvement in Step SR (1.3 points higher than base AWM) but a decrease in Task SR (3.6% vs 4.8%). Analysis reveals that agents call workflow actions in only 18.5% of tasks, suggesting a resistance to using newly added high-level actions. This indicates that while workflow actions can reinforce workflows in memory and offer small gains as auxiliary actions, agents struggle to flexibly integrate them into their planning.
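A minimal sketch of what exposing a workflow as a callable action could look like, reusing the hypothetical Workflow structure from earlier; fixed replay like this is exactly what struggles with the dynamic pop-ups discussed below:

```python
def register_workflow_action(action_space, workflow):
    """Sketch of the AWM_AS idea: expose an induced workflow as a
    high-level callable action (interfaces hypothetical)."""
    def run_workflow(env, params):
        # Replay the parameterized steps with no intermediate observation,
        # so dynamic changes (e.g., pop-ups) cannot be handled here.
        for step in workflow.steps:
            env.step(step.action.format_map(params))
    action_space[workflow.description] = run_workflow
    return action_space
```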

The following figure (Figure 7 from the original paper) illustrates the challenge of dynamic environment changes that can impact workflow action utilization.


Figure 7: An example of dynamic environment changes that challenge workflow action utilization.

This example shows that a book_flight workflow might hardcode a sequence of actions. However, dynamic elements, like a popup for airport selection after typing a city, require intermediate state observation and flexible decision-making that fixed workflow actions cannot easily handle. This limitation points to a need for more advanced techniques, such as real-time state access or dynamic execution loops, for better integration of workflow actions.

6.2.5. Workflow Quality Analysis (Appendix A.3)

The paper provides metrics to evaluate the quality of model-induced workflows. The following are the results from Table 10 of the original paper:

| Benchmark | # Workflows | Coverage | Function Overlap | Utility Rate |
|---|---|---|---|---|
| WebArena | 7.4 | - | 0.08 | 0.94 |
| Mind2Web | 7.3 | 0.40 | 0.20 | 0.91 |

  • # Workflows: Neural-based induction produces a modest number of workflows (7.3-7.4 on average), which doesn't excessively bloat memory.
  • Utility Rate: Workflows are highly utilized (0.94 on WebArena, 0.91 on Mind2Web), indicating their broad applicability.
  • Function Overlap: Low overlap on WebArena (0.08) suggests efficiency in workflow management, minimizing redundancy. Higher overlap on Mind2Web (0.20) is noted.
  • Coverage: On Mind2Web, coverage is 0.40. This is deemed reasonable given the substantial task distribution variances between training and cross-task test examples. Coverage is not evaluated on WebArena due to lack of canonical trajectories.

6.2.6. Integrating AWM Offline and Online (Appendix C)

The paper explores combining AWM_offline and AWM_online into AWM_off+on. The following are the results from Table 11 of the original paper:

Columns are grouped by split (CT = Cross-Task, CW = Cross-Website, CD = Cross-Domain); EA = Element Accuracy, AF1 = Action F1.

| Method | CT EA | CT AF1 | CT Step SR | CT SR | CW EA | CW AF1 | CW Step SR | CW SR | CD EA | CD AF1 | CD Step SR | CD SR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MindAct* | 41.6 | 60.6 | 36.2 | 2.0 | 35.8 | 51.1 | 30.1 | 2.0 | 21.6 | 52.8 | 18.6 | 1.0 |
| AWM_offline | 50.6 | 57.3 | 45.1 | 4.8 | 41.4 | 46.2 | 33.7 | 2.3 | 36.4 | 41.6 | 32.6 | 0.7 |
| AWM_online | 50.0 | 56.4 | 43.6 | 4.0 | 42.1 | 45.1 | 33.9 | 1.6 | 40.9 | 46.3 | 35.5 | 1.7 |
| AWM_off+on | 50.0 | 57.0 | 44.5 | 1.6 | 41.8 | 45.5 | 33.3 | 1.1 | 39.3 | 44.3 | 34.1 | 1.5 |

AWM_off+on (integrating both offline and online induced workflows) generally scores between AWM_offline and AWM_online across the three test splits. This suggests that the combination does not yield a straightforward additive benefit. The authors hypothesize that offline workflows might impair the generative quality and utility of online workflows due to incompatibility, resulting in medium overall performance rather than synergistic gains. This points to challenges in harmonizing workflows learned from different distributions or induction processes.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces Agent Workflow Memory (AWM), a novel and effective method for enhancing language model-based agents' ability to tackle complex, long-horizon real-world tasks, particularly in web navigation. Inspired by human learning, AWM enables agents to induce reusable routines (workflows) from past experiences and integrate them into their memory to guide future actions. The method offers remarkable flexibility, operating successfully in both offline scenarios (learning from annotated training data) and online, supervision-free scenarios (learning dynamically from successful self-generated test queries).

The empirical evaluation on WebArena and Mind2Web benchmarks demonstrates AWM's significant impact: a 51.1% relative increase in success rate on WebArena and a 24.6% relative increase on Mind2Web, alongside reduced action steps. Crucially, online AWM showcases robust generalization capabilities across tasks, websites, and domains, with performance gains becoming more substantial as the distribution gap between training and testing data widens. The paper also provides valuable insights into optimal workflow representations and the trade-offs between different induction mechanisms. AWM represents a substantial step towards building more adaptive, continually learning, and generalized AI agents.

7.2. Limitations & Future Work

The authors highlight several limitations and suggest future research directions:

  • Dynamic Environment Challenges for Workflow Actions: While AWM can expand the agent's action space with workflows (AWM_AS), agents show resistance to consistently utilizing these newly added high-level actions. More importantly, dynamic environment changes (e.g., pop-up windows during a flight booking process, as shown in Figure 7) pose a significant challenge to fixed workflow actions which might not be flexible enough to handle intermediate states.

  • Incompatibility of Offline and Online Workflows: The AWM_off+on experiment revealed that simply combining offline and online induced workflows does not lead to additive benefits; instead, offline workflows can impair the utility of online workflows. This suggests a challenge in harmonizing workflows from different sources or distributions.

    Based on these, the authors suggest future work to explore:

  • Real-time State Access: Granting agents real-time state access within workflow actions could make them more flexible in dynamic environments.

  • Dynamic Execution Loops: Implementing dynamic execution loops within workflows could allow for more adaptive decision-making when unexpected intermediate states arise.

  • Workflow Integration Strategies: Further research is needed on more sophisticated strategies for integrating workflows from various sources (offline, online, human-provided) to ensure compatibility and maximize synergistic benefits.

  • Robustness of Workflow Induction: Improving the robustness of the online induction process to avoid learning from incorrect trajectories is also an implicit area for improvement.

7.3. Personal Insights & Critique

This paper presents a highly intuitive and powerful approach to improving LM-based agents that resonates strongly with human learning principles. The analogy to humans abstracting routines is particularly compelling.

  • Innovation of Online Learning: The online AWM is a standout innovation. Its ability to learn supervision-free and adapt to unseen domains and websites addresses a critical bottleneck in deploying LM agents in diverse, real-world scenarios where pre-annotated data is often scarce or quickly becomes outdated. This continual learning aspect is crucial for building truly autonomous and generalist agents.

  • Value of Abstraction: The emphasis on abstracting out example-specific contexts is a key insight. It highlights that simply providing more concrete examples (as in Synapse) can lead to overfitting or bias, whereas AWM's focus on generalized sub-routines leads to superior generalization. This concept could be transferable to other domains where agents need to learn reusable skills, such as robotics or code generation.

  • Nuances in Workflow Representation: The ablation studies on NL vs. HTML and text vs. code workflow representations offer practical guidance for agent design. The finding that NL descriptions are often better than raw HTML (especially when noisy) underscores the importance of high-level semantic understanding for LM agents. The slight Action F1 decrease in AWM compared to MindAct also hints at a subtle challenge: while workflows provide powerful guidance, knowing when to deviate from them based on immediate environmental cues is a sophisticated skill that warrants further exploration.

  • Challenges of High-Level Actions: The AWM_AS experiment, showing agent resistance to calling high-level workflow actions, reveals an interesting gap between providing capabilities and enabling their effective utilization. This suggests that LM agents may struggle with hierarchical planning or action selection when the action space becomes multi-granular. Future work could explore more explicit hierarchical reasoning, planning modules, or reinforcement learning approaches to better integrate these high-level actions.

  • Integration Complexity: The AWM_off+on results are a crucial practical insight. They indicate that simply accumulating workflows from different sources is not a panacea; sophisticated workflow management, conflict resolution, or contextual activation mechanisms are likely needed to prevent interference and ensure synergy. This suggests that the "memory" in Agent Workflow Memory might need more advanced organizational and retrieval capabilities beyond simple augmentation.

    Overall, AWM provides a robust framework for improving LM agents by focusing on the adaptive learning and reuse of procedural knowledge. Its success in generalization, particularly in online settings, makes it a highly promising direction for future research in artificial intelligence.
