Agent Workflow Memory
TL;DR Summary
AWM induces reusable task workflows to improve LM-based agents on complex, long-horizon web navigation tasks, raising relative success rates by 24.6% on Mind2Web and 51.1% on WebArena, reducing the number of steps needed on WebArena, and generalizing well across tasks, websites, and domains.
Abstract
Despite the potential of language model-based agents to solve real-world tasks such as web navigation, current methods still struggle with long-horizon tasks with complex action trajectories. In contrast, humans can flexibly solve complex tasks by learning reusable task workflows from past experiences and using them to guide future actions. To build agents that can similarly benefit from this process, we introduce Agent Workflow Memory (AWM), a method for inducing commonly reused routines, i.e., workflows, and selectively providing workflows to the agent to guide subsequent generations. AWM flexibly applies to both offline and online scenarios, where agents induce workflows from training examples beforehand or from test queries on the fly. We experiment on two major web navigation benchmarks -- Mind2Web and WebArena -- that collectively cover 1000+ tasks from 200+ domains across travel, shopping, and social media, among others. AWM substantially improves the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while reducing the number of steps taken to solve WebArena tasks successfully. Furthermore, online AWM robustly generalizes in cross-task, website, and domain evaluations, surpassing baselines from 8.9 to 14.0 absolute points as train-test task distribution gaps widen.
1. Bibliographic Information
1.1. Title
Agent Workflow Memory
1.2. Authors
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, Graham Neubig. Their affiliations are Carnegie Mellon University and Massachusetts Institute of Technology, indicating a strong academic background in computer science, likely with a focus on natural language processing, machine learning, and AI agents.
1.3. Journal/Conference
This paper was posted to arXiv, a preprint server, so it had not been formally peer-reviewed by a journal or conference at the time of posting (2024-09-11T17:21:00.000Z). arXiv is nonetheless a highly influential platform in fields like AI and machine learning, used to disseminate research findings rapidly before or during peer review. Many influential papers first appear on arXiv before being formally published at top-tier conferences (e.g., NeurIPS, ICML, ICLR) or in journals.
1.4. Publication Year
2024
1.5. Abstract
This paper introduces Agent Workflow Memory (AWM), a novel method designed to enhance the performance of language model (LM)-based agents, particularly in solving complex, long-horizon real-world tasks such as web navigation. AWM addresses the current limitations of LM agents, which often struggle with adapting to diverse task contexts and environments, by enabling them to learn and reuse common task routines, referred to as workflows, from past experiences. The method can operate in both offline and online scenarios: offline AWM induces workflows from pre-existing training examples, while online AWM learns dynamically from self-generated predictions during test queries.
The authors evaluate AWM on two prominent web navigation benchmarks, Mind2Web and WebArena, which together encompass over 1000 tasks across 200+ domains. AWM significantly improves baseline results, achieving a 24.6% relative success rate increase on Mind2Web and a 51.1% relative increase on WebArena. Additionally, it reduces the number of steps required to successfully complete tasks on WebArena. A key finding is online AWM's robust generalization across different tasks, websites, and domains, demonstrating absolute performance gains of 8.9 to 14.0 points over baselines, especially as the distribution gap between training and testing data widens.
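The abstract reports both relative gains (24.6%, 51.1%) and absolute gains (8.9 to 14.0 points), so it may help to write the two quantities side by side. The block below is a small worked sketch: the symbols $s_{\text{base}}$ and $s_{\text{AWM}}$ are notation introduced here for the baseline and AWM success rates, and the numbers in the example are purely hypothetical, not results from the paper.

```latex
% Absolute gain, in percentage points
\Delta_{\text{abs}} = s_{\text{AWM}} - s_{\text{base}}

% Relative gain, as a fraction of the baseline
\Delta_{\text{rel}} = \frac{s_{\text{AWM}} - s_{\text{base}}}{s_{\text{base}}}

% Hypothetical example: s_base = 20.0, s_AWM = 30.0 (both in percent)
% \Delta_abs = 30.0 - 20.0 = 10.0 points
% \Delta_rel = 10.0 / 20.0 = 0.50 = 50\%
```

The same absolute gain looks larger in relative terms when the baseline is low, which is worth keeping in mind when comparing the WebArena and Mind2Web figures.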
1.6. Original Source Link
https://arxiv.org/abs/2409.07429 The paper is available as a preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2409.07429v1.pdf
2. Executive Summary
2.1. Background & Motivation
The field of language model (LM)-based agents is rapidly advancing, enabling them to tackle increasingly complex digital tasks, such as web navigation and mobile app operation. However, current methods face significant challenges, particularly with long-horizon tasks that involve intricate sequences of actions (complex action trajectories). A major limitation is their lack of robustness and adaptability to changes in task contexts or environments. Existing agents primarily integrate fixed examples through training or in-context learning, which means they perform well on action sequences similar to those seen during training but struggle to disentangle complex tasks. They often fail to extract and learn reusable task workflows shared across similar tasks, preventing them from learning from past successes and failures.
The paper is motivated by an analogy to human cognitive processes: humans excel at solving complex problems by abstracting common routines from past experiences and applying this knowledge to guide future actions. The core problem the paper aims to solve is to imbue LM agents with a similar capability to learn, store, and utilize reusable task workflows. This is crucial for building more intelligent and adaptable agents that can perform effectively in dynamic, real-world environments. The paper's innovative idea is Agent Workflow Memory (AWM), a mechanism designed to enable agents to continuously induce and apply these workflows, thereby creating a continual learning system that improves performance over time.
2.2. Main Contributions / Findings
The paper makes several significant contributions:
- Introduction of Agent Workflow Memory (AWM): The authors propose AWM, a novel method for inducing commonly reused routines (workflows) from agent trajectories and selectively providing them to the agent's memory to guide subsequent action generation.
- Flexible Application (Offline and Online): AWM is designed to operate in both offline settings (where workflows are induced from pre-existing annotated training examples) and online settings (where workflows are induced on the fly from the agent's own successful trajectories on test queries, in a supervision-free manner).
- Significant Performance Improvement: AWM substantially improves the success rate of LM agents on two major web navigation benchmarks:
  - WebArena: Achieves a 51.1% relative success rate increase over baselines, and even surpasses methods using human-engineered workflows (SteP) by 7.9%. It also reduces the average number of steps taken per task by about 2.0 compared to a BrowserGym baseline.
  - Mind2Web: Improves relative success rate by 24.6% in cross-task scenarios.
- Enhanced Generalization Capabilities: Online AWM demonstrates robust generalization across varied scenarios, including cross-task, cross-website, and cross-domain evaluations. It consistently outperforms baselines, with performance margins increasing from 8.9 to 14.0 absolute points as the train-test task distribution gap widens.
- Demonstration of Continual Learning: The paper illustrates how AWM enables agents to learn basic workflows and then build more complex ones on top of them, creating a snowball effect that leads to substantial performance gains with minimal additional data.
- Analysis of Workflow Representation: The authors explore different workflow representations (sub-routine/abstract vs. concrete, textual vs. programmatic, NL state description vs. HTML) and their impact on agent performance, providing insights into design choices.
In summary, AWM provides a flexible and effective mechanism for LM agents to learn and leverage reusable routines, significantly improving their performance and generalization on complex web navigation tasks, and paving the way for more adaptive, continually learning agents.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, a foundational grasp of several concepts related to language models, AI agents, and web interaction is beneficial.
- Language Model (LM) based Agents:
  - Concept: These are artificial intelligence systems that use large language models (LLMs) to perform tasks in digital environments. Unlike traditional software that follows explicit rules, LM agents interpret natural language instructions, reason about their environment, and generate actions, often in an iterative observe-act loop.
  - How they work: An LM agent typically receives a natural language instruction (the query), observes the current state of an environment (e.g., a webpage), and uses its internal knowledge (memory) and the LM's generative power to decide the next action. This action is then executed, changing the environment's state, and the cycle repeats until the task is complete.
  - Relevance: This paper focuses on improving the robustness and adaptability of these LM-based agents, particularly for web navigation.
- Web Navigation Tasks:
  - Concept: These are tasks that require an agent to interact with a web browser to achieve a specific goal. Examples include searching for a product on an e-commerce site, booking a flight, filling out a form, or navigating social media.
  - Actions: Common actions include CLICK (on buttons, links, etc.), TYPE (into text fields), SCROLL, NAVIGATE (to a URL), and STOP (when the task is completed).
  - Environment: The environment for a web agent is typically a web browser, and its state is represented by the current webpage content.
  - Relevance: The paper evaluates AWM on two major web navigation benchmarks, Mind2Web and WebArena.
- Long-Horizon Tasks:
  - Concept: Tasks that require a long sequence of interdependent actions to complete. They often involve multiple sub-goals and can be complex to plan and execute.
  - Relevance: Current LM agents struggle with these due to cascading errors and difficulty in maintaining context or learning sequential dependencies, which AWM aims to address by breaking them down into reusable workflows.
- In-context Learning:
  - Concept: A paradigm where a large language model (LLM) learns to perform a task from a few examples (demonstrations) placed in its input prompt, rather than through explicit weight updates (fine-tuning). The LLM then uses these examples to condition its behavior on a new, unseen input.
  - Relevance: Many current LM agents use in-context learning with fixed examples. AWM extends this by dynamically generating and integrating reusable workflows into the agent's context/memory.
- Offline vs. Online Scenarios (in ML/Reinforcement Learning):
  - Offline Learning: The agent learns from a fixed dataset of pre-collected experiences. Once training is complete, the agent's policy is fixed and deployed. New data does not continuously update the model.
  - Online Learning: The agent continuously learns and updates its knowledge/policy as new data or experiences become available during deployment. This allows for adaptation to changing environments or tasks.
  - Relevance: AWM is designed to be flexible, operating in both offline (learning from a training set) and online (continually learning from test queries) modes, which is a key contribution for adaptability.
- Accessibility Trees:
  - Concept: A hierarchical representation of the content and structure of a webpage, designed to make web content accessible to assistive technologies (like screen readers). It is derived from the DOM (Document Object Model) and exposes a more semantic structure, often highlighting interactive elements and their roles.
  - Relevance: BrowserGym, a baseline method, uses accessibility trees to represent webpages, and AWM also leverages this representation. It is a structured way for agents to "see" and interact with web elements.
3.2. Previous Works
The paper builds upon and compares against several existing approaches and benchmarks in the field of web agents.
- Web Agent Benchmarks:
  - MiniWob (Shi et al., 2017) & MiniWob++ (Liu et al., 2018): Early benchmarks for web agents, providing various scenarios like flight booking. They helped establish the foundational challenges.
  - WebShop (Yao et al., 2022): Focused on simulated e-commerce with crowd-sourced instructions, introducing more realistic interactions.
  - WebArena (Zhou et al., 2024): A more recent and comprehensive benchmark used in this paper. It integrates four additional websites beyond e-commerce, covering domains like social forums, software development, and content management. Crucially, it provides rigorous execution-based evaluation, meaning task success is determined by actual functional correctness rather than just action sequence correctness.
  - VisualWebArena (Koh et al., 2024): Extends WebArena by including tasks that require visual inputs, pushing towards multimodal agents.
  - Mind2Web (Deng et al., 2023): Another major benchmark used in this paper. It emphasizes the generality of agents across diverse operations and environments, featuring cross-task, cross-website, and cross-domain settings. Each task has a fixed number of steps, and evaluation includes element accuracy, action F1, and step success rate.
- Enhancing Agents for Complex Tasks (Baselines):
  - BrowserGym (Drouin et al., 2024): A state-of-the-art method without human-annotated site-specific knowledge, used as a primary baseline in WebArena. It modifies the agent's default action space and uses accessibility tree representations.
  - SteP (Sodhi et al., 2023): An approach that uses workflows written by human experts and tailored for WebArena. AWM compares itself to SteP to show its ability to achieve competitive results without manual supervision.
  - AutoEval (Pan et al., 2024): A method that focuses on autonomous evaluation and refinement of digital agents. It involves additional evaluation and refinement steps to ensure task correctness.
  - MindAct (Deng et al., 2023): A method designed for Mind2Web that uses webpage element filtering and a multi-choice task format to simplify observation processing for LM agents.
  - Synapse (Zheng et al., 2024): Another Mind2Web baseline. It changes the input format to a trajectory style and augments the agent's context/memory with retrieved relevant examples. This is a direct point of comparison for AWM regarding the benefits of reusable workflows over concrete examples.
- Learning Common Procedures from Experiences:
  - This is a broad area of related work that AWM falls into. Many prior works have explored extracting sub-routines:
    - Rule-based methods (Ellis et al., 2023; Bowers et al., 2023; Grand et al., 2023): These approaches typically use predefined rules or heuristics to identify repetitive action sequences.
    - LM-based methods (Cai et al., 2023; Wang et al., 2024c;a): These leverage large language models to identify and abstract common patterns from demonstrations. AWM falls into this category for its LM-based workflow induction.
  - The goal of these methods is often to create auxiliary skills or context guidance to improve future task-solving (Oh et al., 2017; Liang et al., 2023; Yu et al., 2023; Mao et al., 2023).
3.3. Technological Evolution
The evolution of web agents has moved from simple scripted bots to sophisticated LM-based agents. Early systems (MiniWob) focused on highly constrained environments and rule-based interactions. The advent of large language models brought in-context learning and the ability to interpret natural language instructions, shifting the focus to more open-ended and complex web tasks (WebShop, WebArena, Mind2Web).
Initially, agents relied on direct parsing of HTML or DOM structures. The adoption of accessibility trees improved semantic understanding of webpages. Methods also evolved from simply generating actions to incorporating self-feedback, planning, and memory augmentation with example demonstrations. However, these demonstrations often struggled with generalization because they were example-specific and context-dependent.
This paper's work (AWM) represents a significant step in this evolution by moving beyond mere example augmentation to workflow abstraction. Instead of providing concrete, full examples, AWM induces abstract, reusable routines. This allows for better generalization and continuous learning, addressing a key limitation of prior in-context learning approaches that struggled with extrapolating to other tasks or domains (Majumder et al., 2023). It bridges the gap between fixed, example-driven learning and dynamic, adaptable intelligence by allowing agents to build an evolving "skill library" (workflow memory) similar to how humans learn.
3.4. Differentiation Analysis
Compared to the main methods in related work, AWM introduces several core differences and innovations:
- Abstraction of Reusable Routines (Workflows) vs. Concrete Examples:
  - Prior Work (e.g., Synapse): Often augments agent memory with concrete, full example trajectories or relevant demonstrations. While helpful, these example-specific contexts can bias agents and struggle to generalize to tasks that differ even slightly from the provided examples.
  - AWM's Innovation: Induces abstract, reusable workflows by explicitly prompting LMs to extract common sub-routines and to abstract out example-specific contexts (e.g., replacing "dry cat food" with a placeholder such as {product-name}). This makes workflows more flexible and more widely applicable across diverse tasks and domains. The paper explicitly shows AWM's superiority over Synapse on Mind2Web, noting better element accuracy due to less bias.
- Flexible Offline and Online Workflow Induction:
  - Prior Work: Many methods rely on available training data or human supervision for learning or policy design (e.g., SteP uses human-engineered workflows, BrowserGym uses a predefined action space). Collecting high-quality, domain-aligned examples is often difficult or impossible.
  - AWM's Innovation: Can operate in two modes:
    - Offline AWM: Utilizes existing canonical examples (if available) for workflow induction.
    - Online AWM: Can function in a supervision-free manner by iteratively inducing workflows from its own successful predictions at test time. This allows it to learn and adapt even when no training data exists, directly addressing data scarcity and enabling continual learning in dynamic environments. This flexibility improves its generalization to unseen websites and domains.
- Continual Learning and Workflow Composition:
  - Prior Work: Agents typically solve each task separately, without explicitly learning from past successes or failures to build a cumulative knowledge base of routines.
  - AWM's Innovation: AWM facilitates continual learning where agents can first learn basic workflows (e.g., "find a place by its name") and then use these as sub-goals to build more complex workflows (e.g., "get the zip code of a place" by first finding the place using the learned workflow). This snowball effect of increasingly complex workflows is a novel aspect that boosts long-term performance.
- Efficiency in Action Trajectories:
  - Prior Work (e.g., AutoEval): May require additional evaluation and refinement steps, leading to longer action trajectories.
  - AWM's Innovation: By providing direct guidance through relevant workflows, AWM enables agents to achieve high success rates while taking fewer steps per task, indicating more efficient task execution.
In essence, AWM differentiates itself by moving beyond static, example-based learning to a more dynamic, abstract, and human-like approach of acquiring and composing reusable skills (workflows), leading to better adaptability and performance across diverse and challenging web environments.
4. Methodology
The core idea of Agent Workflow Memory (AWM) is to enable language model-based agents to learn and reuse common operational routines, termed workflows, from past experiences. This process is analogous to how humans abstract common procedures and apply them to guide future actions. AWM achieves this by inducing workflows from agent trajectories and integrating them into the agent's memory to guide subsequent task-solving.
4.1. Principles
The fundamental principle behind AWM is workflow induction and memory augmentation. An agent's observed action trajectories from successfully completed tasks contain patterns of sub-routines that are frequently reused. By extracting these sub-routines and representing them as workflows with abstract descriptions and parameterized actions, AWM creates a reusable knowledge base. This knowledge base is then integrated into the agent's text-based memory, providing context-sensitive guidance to the underlying language model during future task execution. This mechanism allows the agent to move beyond merely imitating concrete examples to leveraging generalized procedural knowledge, improving efficiency, success rate, and generalization across varying tasks and environments.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Statement
The paper defines the task for an LM-based agent as follows:
An agent operates with a language model backbone $L$ and text-based memory $M$. The base memory initially contains documentation of built-in actions (e.g., CLICK, TYPE).
To solve a task specified by a natural language (NL) instruction $q$, the agent interacts with an environment defined by a transition function $T$.
For each time step $t_i$:
- The environment state $s_i$ provides an observation $o_i$.
- The observation $o_i$, along with the instruction $q$ and memory $M$, is passed to the language model to generate an action $a_i$. This is represented as: $ L(q, M, o_i) \to a_i $
  - $L$: The language model backbone.
  - $q$: The natural language instruction for the current task.
  - $M$: The agent's text-based memory, potentially augmented with workflows.
  - $o_i$: The observation obtained from the current environment state $s_i$.
  - $a_i$: The action generated by the language model.
- The generated action is executed in the environment, changing the state: $ T(s_i, a_i) \to s_{i+1} $
  - $T$: The environment's transition function.
  - $s_i$: The current environment state.
  - $a_i$: The executed action.
  - $s_{i+1}$: The new environment state after executing $a_i$.
This observe-act loop continues until the model predicts a STOP action ($a_i = \text{STOP}$) or a pre-determined task termination condition (e.g., maximum steps) is met.
Each completed task forms an experience $e$, which comprises the NL instruction $q$ and a trajectory of steps $P^e$. Each step $p$ in the trajectory contains the agent's observation $o$ and the action taken $a$, formulated as $p = (o, a)$. So, an experience is:
$
e = (q, P^e)
$
where $P^e = (p_1, \ldots, p_n)$.
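The observe-act loop and the resulting experience $e = (q, P^e)$ can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `run_episode`, `Experience`, `toy_lm`, and `toy_transition` are hypothetical, states and observations are plain strings, and the language model is replaced by a hard-coded stand-in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Step = Tuple[str, str]  # p = (o, a): observation text and action text


@dataclass
class Experience:
    """e = (q, P^e): an instruction plus the trajectory of steps taken."""
    instruction: str
    trajectory: List[Step]


def run_episode(
    lm: Callable[[str, str, str], str],     # L(q, M, o) -> a
    observe: Callable[[str], str],          # maps state s to observation o
    transition: Callable[[str, str], str],  # T(s, a) -> next state
    instruction: str,
    memory: str,
    initial_state: str,
    max_steps: int = 10,                    # termination condition (assumed value)
) -> Experience:
    state = initial_state
    trajectory: List[Step] = []
    for _ in range(max_steps):
        obs = observe(state)
        action = lm(instruction, memory, obs)
        trajectory.append((obs, action))
        if action == "STOP":                # the model decides the task is done
            break
        state = transition(state, action)
    return Experience(instruction, trajectory)


# Hypothetical stand-ins for the LM and the environment.
def toy_lm(q: str, M: str, o: str) -> str:
    return "STOP" if o == "page:2" else "CLICK next"


def toy_transition(s: str, a: str) -> str:
    return str(int(s) + 1)


exp = run_episode(toy_lm, lambda s: f"page:{s}", toy_transition,
                  "go to page 2", "docs: CLICK, TYPE, STOP", "0")
print(len(exp.trajectory), exp.trajectory[-1])
```

In the toy run the agent clicks twice and then stops, so the recorded trajectory has three steps, the last one being the STOP action.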
The goal of AWM is to induce useful workflows $\mathcal{W}$ from a set of experiences $\mathcal{E}$ (collected from past or newly generated examples) using an induction module $I$. This induction process is formalized as:
$
I(\mathcal{E}) \to \mathcal{W}
$
- $I$: The workflow induction module.
- $\mathcal{E}$: A set of agent experiences.
- $\mathcal{W}$: The set of induced workflows.
These induced workflows are then added into the agent's memory as guidance for subsequent task-solving.
4.2.2. Workflow Representation
A workflow is structured similarly to an experience but represents a reusable routine rather than a specific task instance. Each workflow consists of two main components:
- Textual Description ($d$): An NL task description that summarizes the high-level goal or function of the workflow. This description is heuristically extracted from experience instructions or summarized by an LM.
- Workflow Trajectory ($P^d$): A series of steps required to complete the process described in $d$. Each step within a workflow trajectory has three parts:
  - Environment State Description: An NL description of the current environment state (e.g., "Order {id} is shown").
  - Reasoning Process: The agent's elaborated reasoning for deciding on the next action (e.g., "Order {id} is found, I will now terminate the task.").
  - Executable Action: An action represented as an executable program over the environment (e.g., stop()).
The following figure (Figure 2 from the original paper) illustrates the AWM pipeline, showing how an agent interacts with the environment, induces workflows, and integrates them into memory.
The image is a schematic of Figure 2 in the paper, showing the AWM pipeline: the agent acts and observes in the environment through its language model backbone, and successful cases induce workflows that are stored in memory to guide subsequent actions.
Figure 2: Illustration of our AWM pipeline: an agent takes actions to solve given queries, induces workflows from successful ones, and integrates them into memory.
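The two-part workflow structure described above (a description plus a trajectory of state/reasoning/action steps) maps naturally onto a small data structure. The sketch below is one possible encoding, assuming plain strings for every field; the class names, the `render` method, its `[state]`/`[think]`/`[act]` tags, and the description "Find order {id}" are all hypothetical and only illustrate how a workflow might be serialized into text-based memory.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class WorkflowStep:
    state_description: str  # NL description of the environment state
    reasoning: str          # why the next action is chosen
    action: str             # executable action, e.g. "stop()"


@dataclass
class Workflow:
    description: str            # textual description d
    steps: List[WorkflowStep]   # workflow trajectory P^d

    def render(self) -> str:
        """Serialize the workflow as text so it can be appended to memory."""
        lines = [f"## {self.description}"]
        for s in self.steps:
            lines.append(f"[state] {s.state_description}")
            lines.append(f"[think] {s.reasoning}")
            lines.append(f"[act] {s.action}")
        return "\n".join(lines)


# The step values reuse the examples quoted in the text; {id} stays a placeholder.
wf = Workflow(
    "Find order {id}",
    [WorkflowStep("Order {id} is shown",
                  "Order {id} is found, I will now terminate the task.",
                  "stop()")],
)
text = wf.render()
print(text)
```

Keeping `{id}` as a literal placeholder in the stored text is what lets the same workflow apply to any order number.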
4.2.3. Inducing and Using Workflows
At the core of AWM is the induction module $I$, which generates a set of workflows $\mathcal{W}$ from one or more past agent experiences $\mathcal{E}$. Each experience $e = (q, P^e)$ contains an NL task instruction $q$ and a sequence of observation-action steps $P^e$. The induction module's output is:
$
I(\mathcal{E}) \to \mathcal{W} = \{w\} = \{(d_j, P_j^d)\}
$
- $\mathcal{W}$: Set of induced workflows.
- $d_j$: Textual description of the $j$-th workflow.
- $P_j^d$: Trajectory of steps for the $j$-th workflow.
LM-based Workflow Induction
To ensure workflows are reusable and capture common sub-routines, AWM uses an LM-based module for induction. This module prompts an LM to extract these sub-routines from input experiences. Key aspects of this process include:
- Finer Granularity: Unlike concrete task instructions, workflows are induced at a finer granularity, focusing on sub-tasks (e.g., "search for a product on Amazon" instead of "Buy dry cat food on Amazon and deliver to my address").
- Abstraction of Contexts: To enhance generality, example-specific values are abstracted into variables (e.g., "dry cat food" becomes {product-name}). This is achieved by specifying this abstraction in the workflow induction prompts (details in Appendix A).
- Segmentation: Induced workflows are segmented (based on double-line breaks in the model output) and stored separately in workflow memory.
After induction, workflows are integrated into the agent's memory, augmenting the original memory $M$ to become $M_w = M + \mathcal{W}$. When solving a new instruction $q$, the agent now uses this augmented memory: $ L(q, M_w, o) \to L(q, M + \mathcal{W}, o) \to a $
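The segmentation and memory-augmentation steps can be sketched as plain string processing. Two caveats: in AWM the abstraction of values into variables is requested from the LM through the induction prompt, so the `abstract_values` helper below only imitates the effect with string replacement; and all function names, the toy model output, and the memory format are assumptions made for illustration.

```python
import re
from typing import Dict, List


def abstract_values(text: str, bindings: Dict[str, str]) -> str:
    """Imitate context abstraction: swap concrete values for {variable} slots."""
    for value, var in bindings.items():
        text = text.replace(value, "{" + var + "}")
    return text


def segment_workflows(lm_output: str) -> List[str]:
    """Split raw induction output into separate workflows on blank lines."""
    blocks = re.split(r"\n\s*\n", lm_output.strip())
    return [b.strip() for b in blocks if b.strip()]


def augment_memory(base_memory: str, workflows: List[str]) -> str:
    """M + W -> M_w, with memory kept as plain text."""
    return "\n\n".join([base_memory, *workflows])


# Hypothetical raw output from the induction LM, before abstraction.
raw = (
    "Workflow: search for dry cat food\n"
    "type('search box', 'dry cat food')\n"
    "\n"
    "Workflow: open first result\n"
    "click('first result')\n"
)
workflows = [abstract_values(w, {"dry cat food": "product-name"})
             for w in segment_workflows(raw)]
memory = augment_memory("Actions: click, type, stop", workflows)
print(memory)
```

Storing each workflow as its own entry, rather than one blob, is what makes it possible to add or select workflows individually later.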
AWM is applied in two main scenarios:
Offline Scenario ($\text{AWM}_{offline}$)
This scenario applies when additional canonical experiences (e.g., human-annotated data or synthesized examples) are available for training. The process is divided into two distinct phases:
- Workflow Induction (Training Time): AWM takes all available training examples from a website, concatenates them into a single prompt, and feeds them to an LM to create a set of workflows. $ I(\mathcal{E}_{train}) \to \mathcal{W}_{offline} $
  - $\mathcal{E}_{train}$: Set of training experiences.
  - $\mathcal{W}_{offline}$: Set of workflows induced in the offline setting.
- Workflow Utilization (Inference Time): At inference time, the agent incorporates all induced workflows into its memory to solve test instructions. The same workflow memory is used for every test task. $ L(q, M + \mathcal{W}_{offline}, o_i^{test}) \to a_i^{test} $
The following figure (Figure 3 from the original paper) illustrates the $\text{AWM}_{offline}$ pipeline.
The image is a schematic of Figure 3, showing the workflow of the offline AWM method. The steps are: (1) induce workflows from the additional examples in the "training" phase and add them to memory; (2) apply these workflows during test-time inference to guide task execution.
Figure 3: Illustration of $\text{AWM}_{offline}$.
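A rough sketch of the two offline phases follows: training examples are grouped by website, each group is concatenated into one prompt for a single induction call, and the resulting workflows are reused unchanged for every test task on that website. The function names, the `Example` tuple layout, the prompt format, and `toy_induce_lm` (a stand-in for a real LM call) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Example = Tuple[str, str, str]  # (website, instruction, trajectory text)


def induce_offline(
    examples: Iterable[Example],
    induce_lm: Callable[[str], str],
) -> Dict[str, str]:
    """Phase 1: one induction prompt per website, built from all its examples."""
    by_site: Dict[str, List[str]] = defaultdict(list)
    for site, instruction, trajectory in examples:
        by_site[site].append(f"Task: {instruction}\n{trajectory}")
    return {site: induce_lm("\n\n".join(items)) for site, items in by_site.items()}


def build_memory(base_memory: str, site_workflows: Dict[str, str], site: str) -> str:
    """Phase 2: the same augmented memory serves every test task on a site."""
    return base_memory + "\n\n" + site_workflows.get(site, "")


# Stand-in for the LM: reports how many examples it was shown.
def toy_induce_lm(prompt: str) -> str:
    return f"<workflows induced from {prompt.count('Task:')} examples>"


train = [
    ("shop", "buy cat food", "click('search')"),
    ("shop", "buy a lamp", "click('search')"),
    ("forum", "post a reply", "type('reply box', 'hi')"),
]
site_workflows = induce_offline(train, toy_induce_lm)
shop_memory = build_memory("Actions: click, type, stop", site_workflows, "shop")
print(shop_memory)
```

Because induction happens once before testing, the offline memory never changes between test tasks, in contrast to the online setting.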
Online Scenario ($\text{AWM}_{online}$)
This scenario is designed for situations where no additional canonical experiences are available, and AWM operates in a supervision-free manner using only test queries. Agents with $\text{AWM}_{online}$ process test queries in a streaming fashion, continually inducing, integrating, and utilizing workflows.
The process is iterative:
- Initialization: The agent starts with its default memory $M$.
- Task Execution: Given the $t$-th test instruction $q^t$, the agent attempts to solve it by generating an action trajectory $P^t$, forming an experience $e^t = (q^t, P^t)$.
- Success Evaluation: An LM-based evaluation model (from Pan et al., 2024) outputs a binary label, $L_{eval}(e^t) \in \{0, 1\}$, indicating whether the experience $e^t$ successfully solves $q^t$.
- Workflow Induction and Memory Update: If $e^t$ is judged successful (i.e., $L_{eval}(e^t) = 1$), it is transformed into one or more workflows: $ I(e^t) \to \{w^t\} $ These new workflows are then added to the agent's memory for the next task: $ M^t + \{w^t\} \to M^{t+1} $
  - $M^t$: Agent's memory used for task $t$, before the update.
  - $\{w^t\}$: Workflows induced from successful task $t$.
  - $M^{t+1}$: Updated memory for task $t+1$.
- Iteration: This memory-updating process continues iteratively as test instructions are streamed. The success rate is evaluated on the predicted action trajectories for all tests.
The following figure (Figure 4 from the original paper) illustrates the $\text{AWM}_{online}$ pipeline.
The image is a schematic of Figure 4 in the paper, showing the workflow of online Agent Workflow Memory ($\text{AWM}_{online}$): how workflows are induced from the stream of test examples, stored (growing over time), and applied to guide test-time inference.
Figure 4: Illustration of $\text{AWM}_{online}$.
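The online procedure is essentially a loop with a gated memory update, which the sketch below tries to capture. It assumes memory is a list of text entries and treats the solver, evaluator, and induction module as injected callables; `run_online` and the three `toy_*` functions are hypothetical stand-ins, not the paper's code.

```python
from typing import Callable, List, Tuple

Experience = Tuple[str, List[Tuple[str, str]]]  # e^t = (q^t, P^t)


def run_online(
    queries: List[str],
    solve: Callable[[str, List[str]], Experience],  # agent rollout with memory M^t
    evaluate: Callable[[Experience], int],          # L_eval(e^t) in {0, 1}
    induce: Callable[[Experience], List[str]],      # I(e^t) -> {w^t}
    base_memory: List[str],
) -> Tuple[List[str], List[int]]:
    memory = list(base_memory)                # M^1 = default memory M
    labels: List[int] = []
    for q in queries:                         # test queries arrive as a stream
        experience = solve(q, memory)
        label = evaluate(experience)
        labels.append(label)
        if label == 1:                        # only successful runs are induced
            memory.extend(induce(experience))  # M^t + {w^t} -> M^{t+1}
    return memory, labels


# Toy stand-ins: any query containing "hard" is judged a failure.
def toy_solve(q: str, memory: List[str]) -> Experience:
    return (q, [("obs", "STOP")])


def toy_evaluate(e: Experience) -> int:
    return 0 if "hard" in e[0] else 1


def toy_induce(e: Experience) -> List[str]:
    return [f"workflow induced from: {e[0]}"]


memory, labels = run_online(
    ["find a place", "hard task", "get zip code"],
    toy_solve, toy_evaluate, toy_induce, ["action docs"],
)
print(labels, memory)
```

Since the evaluator is itself a model, a false positive would add a flawed workflow to memory; the gate reduces that risk but does not remove it.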
4.2.4. AWM's Continual Learning Mechanism
As depicted in Figure 1, AWM enables a continual learning loop. The agent begins with basic actions and, through solving tasks in a streaming manner, induces workflows. For instance, it might first learn "find a place by its name." Subsequently, it can build more complex workflows by composing existing ones, such as using "find a place by its name" as a subgoal within a larger workflow like "get the zip code of a place." This iterative process of inducing and applying increasingly complex workflows expands the agent's memory and knowledge, leading to a snowball effect that substantially improves performance over time compared to static baselines.
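The image is a line chart showing how the cumulative success rate of AWM on the WebArena map split rises as the number of examples increases, compared against the baseline method, with AWM showing a clear advantage as tasks become more complex.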
Figure 1: AWM enables agents to continuously induce and apply workflows to improve performance, compared to stagnant baselines. We show results by AWM on the WebArena map split as an example.
5. Experimental Setup
The paper evaluates AWM on two prominent web navigation benchmarks: WebArena and Mind2Web. The experiments are designed to assess AWM's task success and its generalization capabilities across various setups. For both benchmarks, AWM is applied on a website basis, meaning examples are grouped by their associated websites, and AWM runs separately on each group. This ensures that the collection of workflows remains relevant to the test tasks.
5.1. Datasets
- WebArena (Zhou et al., 2024):
  - Source & Characteristics: Provides 812 web navigation tasks across five distinct websites. These websites cover four common application domains: e-commerce, social forum discussions, collaborative software development, and content management.
  - Key Feature: Supports rigorous execution-based evaluation, meaning task success is determined by the functional correctness of the agent's actions within the environment, rather than just the correctness of the action sequence itself.
  - Data Availability: WebArena primarily has test examples. Due to the lack of additional high-quality, domain-aligned training examples, AWM is mainly conducted in the online setting for this benchmark.
- Mind2Web (Deng et al., 2023):
  - Source & Characteristics: Features web navigation tasks that explicitly stress the generality of agents across cross-task, cross-website, and cross-domain settings.
  - Task Structure: Each task in Mind2Web has a fixed number of steps. At each step, the agent must predict an action.
  - Data Availability: Mind2Web provides a training set that covers a portion of the tested websites (used for the cross-task split). This allows for exploration of both offline (inducing workflows from the training set) and online (streaming workflow induction and inference on test queries) AWM settings.
  - Splits:
    - Cross-task: Training and test examples are from the same websites/domains but involve different tasks.
    - Cross-website: Test examples are from websites not seen during training, but within the same domain.
    - Cross-domain: Test examples are from entirely new websites and domains not seen during training.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided below.
- Success Rate (SR):
  - Conceptual Definition: Task Success Rate is a high-level metric that measures the percentage of tasks for which the agent successfully achieves the desired goal from start to finish. It is the ultimate measure of an agent's ability to complete a given instruction. For Mind2Web, specifically, it measures if all intermediate steps for a given task are successfully conducted.
  - Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successfully completed tasks}}{\text{Total number of tasks}} $
  - Symbol Explanation:
    - $\mathrm{SR}$: Success Rate.
- Number of Steps (WebArena):
  - Conceptual Definition: This metric quantifies the average number of actions an agent takes to successfully complete a task. A lower number of steps generally indicates higher efficiency.
  - Mathematical Formula: Not explicitly provided in the paper as a formula, but conceptually it is the sum of actions taken across all successful tasks divided by the number of successful tasks.
  - Symbol Explanation: Not applicable as it's a direct count.
- Mind2Web Specific Step-wise Metrics: For Mind2Web, evaluation is conducted at each step and then aggregated.
  - Element Accuracy (Elem Acc):
    - Conceptual Definition: Measures whether the agent correctly identifies and selects the target web page element (e.g., a button, a text field) that it needs to interact with at a given step.
    - Mathematical Formula: Not explicitly provided in the paper as a formula, but conceptually it is the percentage of steps where the correct page element is selected.
    - Symbol Explanation: Not applicable as it's a direct percentage.
  - Action F1:
    - Conceptual Definition: Measures the harmonic mean of Precision and Recall for the action taken on the selected element. It assesses whether the agent performs the correct type of action (e.g., CLICK, TYPE, SELECT) given the selected element.
    - Mathematical Formula: The standard F1-score formula is: $ F_1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $ where $ \text{Precision} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Positives (FP)}} $ and $ \text{Recall} = \frac{\text{True Positives (TP)}}{\text{True Positives (TP)} + \text{False Negatives (FN)}} $
    - Symbol Explanation:
      - $F_1$: The F1-score, which balances Precision and Recall.
      - $\text{Precision}$: The proportion of correctly predicted positive actions out of all positive actions predicted by the model. In this context, it would be the proportion of times the agent's predicted action type was correct among all action types it predicted.
      - $\text{Recall}$: The proportion of correctly predicted positive actions out of all actual positive actions. It would be the proportion of times the agent correctly identified an action type among all the required action types.
      - $\text{TP}$: The agent correctly predicts an action type that is indeed the correct action type.
      - $\text{FP}$: The agent predicts an action type, but it is not the correct action type.
      - $\text{FN}$: The agent fails to predict an action type that was actually required.
  - Step Success Rate (Step SR):
    - Conceptual Definition: Aggregates Element Accuracy and Action F1. A step is considered successful only if both the correct page element is selected AND the correct action is taken on that element.
    - Mathematical Formula: Not explicitly provided as a formula, but implicitly, it's the percentage of steps where both element accuracy and action selection are correct.
    - Symbol Explanation: Not applicable as it's a direct percentage.
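The metric definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the benchmark's official scoring script: the `Step` record, the function names, and the choice to pool all steps before averaging are assumptions made here, and the official Mind2Web evaluation may aggregate differently.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    """Hypothetical per-step record (not a Mind2Web data structure)."""
    element_correct: bool  # predicted element matches the annotated target
    action_correct: bool   # predicted operation matches the annotated one


def _all_steps(tasks: List[List[Step]]) -> List[Step]:
    return [step for task in tasks for step in task]


def element_accuracy(tasks: List[List[Step]]) -> float:
    """Percentage of steps where the correct element was selected."""
    steps = _all_steps(tasks)
    return 100.0 * sum(s.element_correct for s in steps) / len(steps)


def step_success_rate(tasks: List[List[Step]]) -> float:
    """Percentage of steps where both element and action are correct."""
    steps = _all_steps(tasks)
    ok = sum(s.element_correct and s.action_correct for s in steps)
    return 100.0 * ok / len(steps)


def task_success_rate(tasks: List[List[Step]]) -> float:
    """Percentage of tasks in which every step succeeded."""
    ok = sum(all(s.element_correct and s.action_correct for s in t) for t in tasks)
    return 100.0 * ok / len(tasks)


def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Two toy tasks: the first fully correct, the second with two failed steps.
toy_tasks = [
    [Step(True, True), Step(True, True)],
    [Step(True, False), Step(False, False)],
]
```

Because task-level SR requires every step to succeed, it is much stricter than Step SR, which is consistent with the single-digit task SR values reported alongside Step SR values of 30 to 45 percent.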
5.3. Baselines
The paper compares AWM against several state-of-the-art LM-based agent methods, selected for their relevance to web navigation tasks and their performance on the chosen benchmarks.
- WebArena Baselines:
  - WebArena (Zhou et al., 2024): The original benchmark's reported baseline performance.
  - AutoEval (Pan et al., 2024): A method focusing on autonomous evaluation and refinement steps for digital agents.
  - BrowserGym (Drouin et al., 2024): The current state-of-the-art autonomous method on WebArena (without human-annotated site-specific knowledge). It alters the agent's default action space.
  - BrowserGym_ax-tree: A variant of BrowserGym that uses only accessibility tree webpage representations, mirroring AWM's input representation for a fairer comparison.
  - SteP (Sodhi et al., 2023): A method that uses 14 human expert written workflows specifically tailored to solving WebArena tasks. This serves as a strong baseline demonstrating the upper bound of performance with extensive human supervision.
- Mind2Web Baselines:
  - MindAct (Deng et al., 2023): A method that introduces webpage element filtering and a multi-choice task format to ease observation processing for LM agents.
  - Synapse (Zheng et al., 2024): A method that uses a trajectory-style format and augments retrieved relevant examples into the agent's memory. This is a critical comparison point for AWM as it tests the efficacy of abstract workflows versus concrete examples.
For both benchmarks, the experiments are conducted using GPT-4 (specifically gpt-4-0613) and gpt-3.5-turbo models with a temperature of 0.0, which keeps model outputs as stable as possible for consistent evaluation. The same model is used for neural workflow induction and agent action generation.
6. Results & Analysis
This section details the experimental results of AWM on the WebArena and Mind2Web benchmarks, comparing its performance against various baselines and analyzing its generalization capabilities and design choices.
6.1. Core Results Analysis
6.1.1. WebArena Main Results
The following are the results from Table 1 of the original paper:
| Method | Total SR | Shopping | CMS | Reddit | GitLab | Maps | # Steps |
|---|---|---|---|---|---|---|---|
| With human engineered workflows | | | | | | | |
| *SteP (Sodhi et al., 2023) | 33.0 | 37.0 | 24.0 | 59.0 | 32.0 | 30.0 | - |
| Autonomous agent only | | | | | | | |
| WebArena (Zhou et al., 2024) | 14.9 | 14.0 | 11.0 | 6.0 | 15.0 | 16.0 | - |
| AutoEval (Pan et al., 2024) | 20.2 | 25.5 | 18.1 | 25.4 | 28.6 | 31.9 | 46.7 |
| BrowserGym (Drouin et al., 2024) | 23.5 | - | - | - | - | - | - |
| BrowserGym_ax-tree | 15.0 | 17.2 | 14.8 | 20.2 | 19.0 | 25.5 | 7.9 |
| AWM (OURS) | 35.5 | 30.8 | 29.1 | 50.9 | 31.8 | 43.3 | 5.9 |
AWM achieves the highest Total SR (Success Rate) of 35.5% on WebArena, significantly outperforming all other methods. It surpasses the BrowserGym baseline (an autonomous method) by 12.0 absolute points (51.1% relative increase) and BrowserGym_ax-tree by 20.5 absolute points. Notably, AWM even outperforms SteP (which uses human-engineered workflows) by 2.5 absolute points (7.6% relative increase), demonstrating its ability to learn effective workflows without manual supervision. The performance gains are consistent across all five website categories, with AWM showing substantial improvements ranging from 11.8 to 30.7 absolute points over the BrowserGym_ax-tree baseline, indicating broad applicability.
Beyond task success, AWM also demonstrates efficiency by completing tasks in fewer steps. It uses an average of 5.9 steps per example, which is 2.0 fewer steps than the BrowserGym_ax-tree baseline and a remarkable 40.8 fewer steps than AutoEval, highlighting that AWM achieves higher success with more concise and efficient action trajectories.
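The absolute and relative gains quoted above follow directly from the Table 1 numbers. A minimal sketch of the arithmetic, with `gains` as a hypothetical helper name introduced here for illustration:

```python
def gains(ours: float, baseline: float) -> tuple:
    """Return (absolute gain in points, relative gain in percent), rounded to 0.1."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    return round(absolute, 1), round(relative, 1)


# Total SR on WebArena, taken from Table 1.
print(gains(35.5, 23.5))  # AWM vs. BrowserGym -> (12.0, 51.1)
print(gains(35.5, 33.0))  # AWM vs. SteP       -> (2.5, 7.6)

# Step reduction relative to AutoEval (46.7 steps vs. 5.9 steps).
print(round(46.7 - 5.9, 1))  # -> 40.8
```

The relative figure divides the gain by the baseline's score, which is why a 12.0-point gain over a 23.5% baseline reads as 51.1%.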
6.1.2. Efficient Learning from Small Amounts of Data (WebArena)
The behavior of AWM_online is illustrated by its cumulative success rate over the process of online evaluation.
The following figure (Figure 5 from the original paper) shows the cumulative success rate of AWM on the WebArena map test split.
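The image is a chart showing AWM's rapid learning on the WebArena map test split over roughly 40 queries. The horizontal axis is the number of examples and the vertical axis is the cumulative success rate (%); the chart marks a rapid learning phase and a stable inference phase.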
Figure 5: AWM enables rapid learning from a small amount of data, i.e., about 40 queries, using WebArena map test split as an example.
The graph shows that AWM exhibits a fast learning curve within the first 0-40 examples, where it acquires essential workflows that lead to a rapid increase in success rates. After this initial phase, the learning curve gradually stabilizes, indicating that the agent continues to learn more advanced workflows but the most significant performance gains occur early with a relatively small amount of data. This demonstrates AWM's efficient learning process, achieving substantial performance improvements by leveraging insights from merely tens of examples.
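A cumulative success rate curve of the kind described for Figure 5 can be computed from a stream of per-query outcomes. The sketch below assumes each outcome is recorded as 1 (success) or 0 (failure) in the order the queries were processed; the function name is hypothetical.

```python
from itertools import accumulate
from typing import Iterable, List


def cumulative_success_rate(outcomes: Iterable[int]) -> List[float]:
    """Success rate (%) over the first k queries, for k = 1..n."""
    running_successes = accumulate(int(o) for o in outcomes)
    return [100.0 * s / (k + 1) for k, s in enumerate(running_successes)]


# Toy stream: early failures, then successes once useful workflows exist.
curve = cumulative_success_rate([0, 1, 1, 0])
print([round(x, 1) for x in curve])  # -> [0.0, 50.0, 66.7, 50.0]
```

Because each point averages over all queries seen so far, the curve is noisy for the first few examples and flattens as more outcomes accumulate, so a plateau reflects both slower learning and the averaging itself.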
6.1.3. Cross-Template Workflow Generalization (WebArena)
To test AWM's generalization beyond task templates, a cross-template subset of WebArena examples (from non-overlapping templates) was created.
The following are the results from Table 2 of the original paper:
| Method | Total SR | Shopping (51) | CMS (45) | Reddit (24) | GitLab (45) | Maps (32) |
|---|---|---|---|---|---|---|
| With human engineered workflows | | | | | | |
| *SteP (Sodhi et al., 2023) | 32.1 | 26.5 | 29.3 | 52.2 | 27.3 | 36.4 |
| Autonomous agent only | | | | | | |
| AutoEval (Pan et al., 2024) | 23.2 | 12.2 | 17.1 | 21.7 | 31.8 | 36.4 |
| BrowserGym_ax-tree | 20.5 | 10.4 | 17.8 | 23.1 | 27.3 | 28.6 |
| AWM (OURS) | 33.2 | 24.5 | 29.3 | 52.2 | 31.8 | 39.4 |
Even on this challenging cross-template subset, AWM still achieves the highest Total SR of 33.2%, outperforming all baselines overall. On individual website splits it matches or exceeds the autonomous baselines everywhere (tying AutoEval on GitLab at 31.8%), while the human-engineered SteP remains ahead on Shopping (26.5% vs. 24.5%) and ties AWM on CMS and Reddit. This demonstrates that AWM's induced workflows effectively generalize across different tasks, not just within instances of the same task template, and supports AWM's ability to extract reusable routines.
The following figure (Figure 6 from the original paper) provides a case study illustrating how AWM builds increasingly complex workflows.
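The image is a schematic showing how AWM builds increasingly complex task workflows by borrowing the first few steps of earlier workflows. The example involves two tasks: finding a place by its name and getting the zip code of a place.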
Figure 6: AWM builds increasingly complex workflows over time, by learning from past examples and earlier workflows.
As shown in the example, AWM first learns a basic workflow like "Find a place by its name" from initial examples. Later, when encountering a more complex task (e.g., "get the zip code of a place"), it reuses the "Find a place by its name" workflow as a sub-routine and adds subsequent steps to obtain the zip code. This compositional learning demonstrates AWM's ability to generalize and build hierarchical knowledge.
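The composition described above can be pictured as one routine calling another. The sketch below is purely illustrative: the action strings, element names, and function names are hypothetical and are not the workflow format or action space used in the paper.

```python
from typing import List

Action = str  # hypothetical: one browser action rendered as a string


def find_place_by_name(name: str) -> List[Action]:
    """Basic workflow, assumed to be induced from early examples."""
    return [
        f"type('search-box', '{name}')",
        "click('search-button')",
        "click('first-result')",
    ]


def get_zip_code_of_place(name: str) -> List[Action]:
    """Later workflow that reuses the basic one as its first steps."""
    return find_place_by_name(name) + [
        "read('address-panel')",
        "stop(zip_code)",
    ]


plan = get_zip_code_of_place("Carnegie Mellon University")
```

Only the last two steps are new in the second workflow; the shared prefix is what lets a small set of early workflows keep paying off on later, more complex queries.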
6.1.4. Mind2Web Main Results (Offline)
The following are the results from Table 3 of the original paper:
| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct3.5 | 20.3 | 56.6 | 17.4 | 0.8 |
| CogAgent3.5 | - | - | 18.6 | - |
| Synapse3.5 | 34.0 | - | 30.6 | 2.4 |
| AWM3.5 | 39.0 | 52.8 | 34.6 | 2.8 |
| MindAct4 | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM4 | 50.6 | 57.3 | 45.1 | 4.8 |
In the Mind2Web cross-task dataset, using AWM_offline, the method consistently achieves the highest success rates with both GPT-3.5-turbo and GPT-4 variants. For GPT-4, AWM achieves a Step SR of 45.1% and a Task SR of 4.8%, which are substantial improvements over MindAct4 (36.2% Step SR, 2.0% Task SR). This represents a 24.6% relative increase in Step SR and a 140% relative increase in Task SR.
A breakdown shows that the improvements largely come from Element Accuracy, with AWM4 achieving 50.6% compared to MindAct4's 41.6% (a 9.0 absolute point increase). This suggests that AWM's abstract workflow representation helps agents more accurately select the correct web elements.
Comparing AWM to Synapse, which uses concrete examples, AWM achieves higher Element Accuracy (+5.0 points) and Step SR (+4.0 points for GPT-3.5-turbo). This supports the argument that abstract sub-routines in AWM introduce less bias in element selection compared to concrete, full examples, which might lead agents to prefer elements similar to those in the provided demonstrations. The reusable nature of workflows is shown to be more flexible than full example trajectories.
However, AWM shows a slightly lower Action F1 score than MindAct (e.g., 57.3% vs. 60.6% for GPT-4). This suggests that while workflows guide better element selection, they might sometimes prompt actions that are not perfectly aligned with the current environment state, indicating a challenge in knowing when to diverge from workflow guidelines.
6.1.5. Online AWM Enables Generalization (Mind2Web)
The following are the results from Table 4 of the original paper:
| Method | Cross-Task EA | Cross-Task AF1 | Cross-Task Step SR | Cross-Task SR | Cross-Website EA | Cross-Website AF1 | Cross-Website Step SR | Cross-Website SR | Cross-Domain EA | Cross-Domain AF1 | Cross-Domain Step SR | Cross-Domain SR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MindAct* | 41.6 | 60.6 | 36.2 | 2.0 | 35.8 | 51.1 | 30.1 | 2.0 | 21.6 | 52.8 | 18.6 | 1.0 |
| AWM_offline | 50.6 | 57.3 | 45.1 | 4.8 | 41.4 | 46.2 | 33.7 | 2.3 | 36.4 | 41.6 | 32.6 | 0.7 |
| AWM_online | 50.0 | 56.4 | 43.6 | 4.0 | 42.1 | 45.1 | 33.9 | 1.6 | 40.9 | 46.3 | 35.5 | 1.7 |
On Mind2Web, both AWM_online and AWM_offline outperform the MindAct baseline in Step SR across all three generalization scenarios. Per Table 4, the Step SR gains are 7.4 to 8.9 absolute points in cross-task, 3.6 to 3.8 in cross-website, and 14.0 to 16.9 in cross-domain.
- In-domain, Cross-task Scenario: AWM_online and AWM_offline perform comparably, with AWM_offline slightly ahead in Step SR (45.1% vs 43.6%). This is attributed to AWM_offline benefiting from high-quality, distribution-matching training examples when available, which can alleviate the train-test gap. AWM_online can induce incorrect workflows from self-generated (potentially flawed) trajectories.
- Extending to Unseen Websites and Domains: As the domain gaps widen (from cross-website to cross-domain), AWM_online demonstrates greater generalization abilities. For example, in the cross-domain setting, AWM_online achieves a Step SR of 35.5% compared to AWM_offline's 32.6% and MindAct's 18.6%. This is because AWM_online does not rely on training data and thus is not affected by domain gaps between training and testing data, allowing it to adapt better to novel environments by learning workflows directly from the test distribution. Even AWM_offline, despite domain gaps, still shows substantial improvements over MindAct (e.g., 32.6% vs 18.6% Step SR cross-domain), indicating the inherent benefit of a workflow repository.
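The online setting discussed above can be sketched as a streaming loop in which the agent acts, an evaluator judges the trajectory, and workflows are induced only from trajectories judged successful. This is a simplified sketch under stated assumptions: `act`, `judge`, and `induce` are hypothetical callables standing in for the LM agent, the success evaluator, and the workflow induction step, and their signatures are not taken from the paper's code.

```python
from typing import Callable, List, Tuple

Trajectory = List[str]


def run_online_awm(
    queries: List[str],
    act: Callable[[str, List[str]], Trajectory],     # agent conditioned on memory
    judge: Callable[[str, Trajectory], bool],        # hypothetical success evaluator
    induce: Callable[[str, Trajectory], List[str]],  # hypothetical workflow inducer
) -> Tuple[List[str], List[bool]]:
    """Process test queries in order, growing the workflow memory as it goes."""
    memory: List[str] = []
    results: List[bool] = []
    for query in queries:
        trajectory = act(query, memory)
        success = judge(query, trajectory)
        results.append(success)
        if success:  # only trajectories judged successful feed induction
            for workflow in induce(query, trajectory):
                if workflow not in memory:
                    memory.append(workflow)
    return memory, results


# Toy stand-ins, for illustration only.
toy_memory, toy_results = run_online_awm(
    ["search shoes", "fail task", "search hats"],
    act=lambda q, mem: [f"do: {q}"],
    judge=lambda q, traj: not q.startswith("fail"),
    induce=lambda q, traj: [q.split()[0]],
)
```

Since the judge is itself fallible in practice, a wrongly accepted trajectory would add a flawed workflow to memory, which matches the caveat above about AWM_online inducing incorrect workflows from self-generated trajectories.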
6.2. Ablation Studies / Parameter Analysis
6.2.1. Rule-based vs. LM-based Workflow Induction
The paper explores alternative methods for workflow induction, comparing the LM-based approach to a rule-based one. The rule-based induction method (AWM_rule) extracts action sequences, deduplicates them, and removes invalid steps.
- WebArena Results: The following are the results from Table 5 of the original paper:
| Method | Total SR | # Steps |
|---|---|---|
| AWM_rule | 35.6 | 6.3 |
| AWM_lm | 35.5 | 5.9 |
On WebArena, AWM_rule and AWM_lm perform comparably in Total SR (35.6% vs. 35.5%), with AWM_lm being slightly more efficient (5.9 vs. 6.3 steps). Manual analysis suggests LM-based workflows are finer-grained, avoiding unnecessary steps found in rule-induced workflows.
- Mind2Web Results: The following are the results from Table 6 of the original paper:
| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct4 | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM4,rule | 49.5 | 57.0 | 43.4 | 2.0 |
| AWM4,lm | 50.6 | 57.3 | 45.1 | 4.8 |
On Mind2Web, AWM_lm (50.6% Elem Acc, 45.1% Step SR) outperforms AWM_rule (49.5% Elem Acc, 43.4% Step SR), a 1.7-point improvement in Step SR by the Table 6 values, and more than doubles task SR (4.8% vs. 2.0%). This highlights that LM-based induction's abstract representation and focus on frequently-used sub-routines lead to less bias in element selection and more flexible utilization across test examples, compared to rule-induced full example trajectories.
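The rule-based procedure is described only at a high level (extract action sequences, deduplicate, remove invalid steps), so the sketch below fills in details by assumption: what counts as an invalid step is supplied by a caller-provided `is_valid` predicate, invalid steps are dropped before deduplication, and two trajectories count as duplicates when their filtered action sequences are identical. None of these choices is confirmed by the source.

```python
from typing import Callable, List


def induce_rule_based(
    trajectories: List[List[str]],
    is_valid: Callable[[str], bool],  # hypothetical validity check
) -> List[List[str]]:
    """Keep one copy of each distinct action sequence after dropping invalid steps."""
    seen = set()
    workflows: List[List[str]] = []
    for trajectory in trajectories:
        steps = tuple(step for step in trajectory if is_valid(step))
        if steps and steps not in seen:
            seen.add(steps)
            workflows.append(list(steps))
    return workflows


toy_workflows = induce_rule_based(
    [
        ["click(1)", "noop", "type(2, 'a')"],
        ["click(1)", "type(2, 'a')"],
        ["click(3)"],
    ],
    is_valid=lambda step: step != "noop",
)
```

Each surviving workflow here is still a full example trajectory, which illustrates the contrast drawn above with LM-based induction and its shorter, more abstract sub-routines.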
6.2.2. Workflows in Descriptive Texts
The paper investigates whether representing workflow steps in a textual format (NL descriptions) is better than a program format (code actions).
The following are the results from Table 7 of the original paper:
| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM | 50.6 | 57.3 | 45.1 | 4.8 |
| AWMtext | 51.2 | 57.4 | 45.4 | 3.6 |
AWM_text (workflows represented in NL) achieves slightly higher Element Accuracy (51.2% vs. 50.6%) and Step SR (45.4% vs. 45.1%) compared to AWM (workflows in code format). However, AWM_text shows a degradation in Task SR (3.6% vs. 4.8%). Overall, the performance variance between text and code formats is not substantial, suggesting both can effectively augment agent memory.
6.2.3. Environment Abstraction in Workflows
The paper examines how intermediate webpage states are represented within workflows. AWM uses NL descriptions, but the study considers adding website HTML (filtered for relevance) or both.
The following are the results from Table 8 of the original paper:
| Desc. | HTML | Elem Acc | Act F1 | Step SR | SR |
|---|---|---|---|---|---|
| ✔️ | | 39.0 | 52.8 | 34.6 | 2.8 |
| | ✔️ | 38.1 | 54.0 | 33.8 | 2.8 |
| ✔️ | ✔️ | 37.1 | 51.3 | 32.9 | 2.0 |
The results indicate that NL descriptions of states (Desc. ✔️) are more effective than HTML (HTML ✔️), as replacing NL with HTML leads to a slight 0.8 point drop in Step SR. Interestingly, using both NL and filtered HTML (Desc. ✔️ HTML ✔️) leads to worse results. The authors conjecture two reasons: (1) increased context length overloads the models, and (2) the filtered HTML often contains irrelevant items (missing correct elements 47% of the time), which can contradict NL descriptions and impair the agent's abilities. This suggests that a concise, high-level NL description is often superior to raw or partially filtered HTML for grounding agents.
6.2.4. Workflow Utilization in Context and in Action (AWM_AS)
This section explores expanding the agent's action space with workflows, treating them as high-level functions or shortcut tools.
The following are the results from Table 9 of the original paper:
| Method | Elem Acc | Action F1 | Step SR | SR |
|---|---|---|---|---|
| MindAct | 41.6 | 60.6 | 36.2 | 2.0 |
| AWM | 50.6 | 57.3 | 45.1 | 4.8 |
| AWMAS | 51.8 | 56.7 | 46.4 | 3.6 |
Expanding the agent's action space with workflows (AWM_AS) leads to a slight improvement in Step SR (1.3 points higher than base AWM) but a decrease in Task SR (3.6% vs 4.8%). Analysis reveals that agents call workflow actions in only 18.5% of tasks, suggesting a resistance to using newly added high-level actions. This indicates that while workflow actions can reinforce workflows in memory and offer small gains as auxiliary actions, agents struggle to flexibly integrate them into their planning.
The following figure (Figure 7 from the original paper) illustrates the challenge of dynamic environment changes that can impact workflow action utilization.
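The image is a schematic of the challenge that dynamic environment changes pose to workflow action utilization (Figure 7). The left side shows the initial flight search interface, and the right side shows the list of options that pops up after a location is entered, emphasizing that the selection action depends on the popup options.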
Figure 7: An example of dynamic environment changes that challenge workflow action utilization.
This example shows that a book_flight workflow might hardcode a sequence of actions. However, dynamic elements, like a popup for airport selection after typing a city, require intermediate state observation and flexible decision-making that fixed workflow actions cannot easily handle. This limitation points to a need for more advanced techniques, such as real-time state access or dynamic execution loops, for better integration of workflow actions.
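The contrast between a hardcoded workflow action and one that observes intermediate state can be illustrated with a toy simulation. Everything below is hypothetical: `ToyFlightForm`, its action names, and both workflow functions are invented for illustration and do not reproduce the paper's environment or its proposed solution, which is left as future work.

```python
class ToyFlightForm:
    """Toy page: typing a city opens a popup that must be resolved before searching."""

    def __init__(self) -> None:
        self.popup_open = False
        self.origin = None

    def step(self, action: str, arg: str = "") -> bool:
        """Apply one action and report whether it succeeded."""
        if action == "type_city":
            self.popup_open = True  # the dynamic change: an option list appears
            return True
        if action == "select_option":
            if not self.popup_open:
                return False
            self.origin = arg
            self.popup_open = False
            return True
        if action == "click_search":
            # Search only works once an option was chosen and the popup is gone.
            return self.origin is not None and not self.popup_open
        return False


def fixed_workflow(env: ToyFlightForm, city: str) -> bool:
    """Hardcoded sequence that never inspects the page state."""
    typed = env.step("type_city", city)
    searched = env.step("click_search")
    return typed and searched


def state_aware_workflow(env: ToyFlightForm, city: str) -> bool:
    """Same goal, but checks the intermediate state before continuing."""
    env.step("type_city", city)
    if env.popup_open:  # observe the page, then react to the popup
        env.step("select_option", city)
    return env.step("click_search")
```

In this toy setting the fixed sequence fails at the search click because the popup was never handled, while the state-aware variant succeeds, which is the kind of gap that real-time state access or dynamic execution loops are meant to close.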
6.2.5. Workflow Quality Analysis (Appendix A.3)
The paper provides metrics to evaluate the quality of model-induced workflows. The following are the results from Table 10 of the original paper:
| Metric | # Workflows | Coverage | Function Overlap | Utility Rate |
|---|---|---|---|---|
| WebArena | 7.4 | - | 0.08 | 0.94 |
| Mind2Web | 7.3 | 0.40 | 0.20 | 0.91 |
- # Workflows: Neural-based induction produces an efficient number of workflows (7.3-7.4 per example), which doesn't excessively bloat memory.
- Utility Rate: Workflows are highly utilized (0.94 on WebArena, 0.91 on Mind2Web), indicating their broad applicability.
- Function Overlap: Low overlap on WebArena (0.08) suggests efficiency in workflow management, minimizing redundancy. Higher overlap on Mind2Web (0.20) is noted.
- Coverage: On Mind2Web, coverage is 0.40. This is deemed reasonable given the substantial task distribution variances between training and cross-task test examples. Coverage is not evaluated on WebArena due to lack of canonical trajectories.
6.2.6. Integrating AWM Offline and Online (Appendix C)
The paper explores combining AWM_offline and AWM_online into AWM_off+on.
The following are the results from Table 11 of the original paper:
| Method | Cross-Task EA | Cross-Task AF1 | Cross-Task Step SR | Cross-Task SR | Cross-Website EA | Cross-Website AF1 | Cross-Website Step SR | Cross-Website SR | Cross-Domain EA | Cross-Domain AF1 | Cross-Domain Step SR | Cross-Domain SR |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MindAct* | 41.6 | 60.6 | 36.2 | 2.0 | 35.8 | 51.1 | 30.1 | 2.0 | 21.6 | 52.8 | 18.6 | 1.0 |
| AWM_offline | 50.6 | 57.3 | 45.1 | 4.8 | 41.4 | 46.2 | 33.7 | 2.3 | 36.4 | 41.6 | 32.6 | 0.7 |
| AWM_online | 50.0 | 56.4 | 43.6 | 4.0 | 42.1 | 45.1 | 33.9 | 1.6 | 40.9 | 46.3 | 35.5 | 1.7 |
| AWM_off+on | 50.0 | 57.0 | 44.5 | 1.6 | 41.8 | 45.5 | 33.3 | 1.1 | 39.3 | 44.3 | 34.1 | 1.5 |
AWM_off+on (integrating both offline and online induced workflows) generally scores between AWM_offline and AWM_online across the three test splits. This suggests that the combination does not yield a straightforward additive benefit. The authors hypothesize that offline workflows might impair the generative quality and utility of online workflows due to incompatibility, resulting in medium overall performance rather than synergistic gains. This points to challenges in harmonizing workflows learned from different distributions or induction processes.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Agent Workflow Memory (AWM), a novel and effective method for enhancing language model-based agents' ability to tackle complex, long-horizon real-world tasks, particularly in web navigation. Inspired by human learning, AWM enables agents to induce reusable routines (workflows) from past experiences and integrate them into their memory to guide future actions. The method offers remarkable flexibility, operating successfully in both offline scenarios (learning from annotated training data) and online, supervision-free scenarios (learning dynamically from successful self-generated test queries).
The empirical evaluation on WebArena and Mind2Web benchmarks demonstrates AWM's significant impact: a 51.1% relative increase in success rate on WebArena and a 24.6% relative increase on Mind2Web, alongside reduced action steps. Crucially, online AWM showcases robust generalization capabilities across tasks, websites, and domains, with performance gains becoming more substantial as the distribution gap between training and testing data widens. The paper also provides valuable insights into optimal workflow representations and the trade-offs between different induction mechanisms. AWM represents a substantial step towards building more adaptive, continually learning, and generalized AI agents.
7.2. Limitations & Future Work
The authors highlight several limitations and suggest future research directions:
- Dynamic Environment Challenges for Workflow Actions: While AWM can expand the agent's action space with workflows (AWM_AS), agents show resistance to consistently utilizing these newly added high-level actions. More importantly, dynamic environment changes (e.g., pop-up windows during a flight booking process, as shown in Figure 7) pose a significant challenge to fixed workflow actions, which might not be flexible enough to handle intermediate states.
- Incompatibility of Offline and Online Workflows: The AWM_off+on experiment revealed that simply combining offline and online induced workflows does not lead to additive benefits; instead, offline workflows can impair the utility of online workflows. This suggests a challenge in harmonizing workflows from different sources or distributions.
Based on these, the authors suggest future work to explore:
- Real-time State Access: Granting agents real-time state access within workflow actions could make them more flexible in dynamic environments.
- Dynamic Execution Loops: Implementing dynamic execution loops within workflows could allow for more adaptive decision-making when unexpected intermediate states arise.
- Workflow Integration Strategies: Further research is needed on more sophisticated strategies for integrating workflows from various sources (offline, online, human-provided) to ensure compatibility and maximize synergistic benefits.
- Robustness of Workflow Induction: Improving the robustness of the online induction process to avoid learning from incorrect trajectories is also an implicit area for improvement.
7.3. Personal Insights & Critique
This paper presents a highly intuitive and powerful approach to improving LM-based agents that resonates strongly with human learning principles. The analogy to humans abstracting routines is particularly compelling.
- Innovation of Online Learning: The online AWM is a standout innovation. Its ability to learn supervision-free and adapt to unseen domains and websites addresses a critical bottleneck in deploying LM agents in diverse, real-world scenarios where pre-annotated data is often scarce or quickly becomes outdated. This continual learning aspect is crucial for building truly autonomous and generalist agents.
- Value of Abstraction: The emphasis on abstracting out example-specific contexts is a key insight. It highlights that simply providing more concrete examples (as in Synapse) can lead to overfitting or bias, whereas AWM's focus on generalized sub-routines leads to superior generalization. This concept could be transferable to other domains where agents need to learn reusable skills, such as robotics or code generation.
- Nuances in Workflow Representation: The ablation studies on NL vs. HTML and text vs. code workflow representations offer practical guidance for agent design. The finding that NL descriptions are often better than raw HTML (especially when noisy) underscores the importance of high-level semantic understanding for LM agents. The slight Action F1 decrease in AWM compared to MindAct also hints at a subtle challenge: while workflows provide powerful guidance, knowing when to deviate from them based on immediate environmental cues is a sophisticated skill that warrants further exploration.
- Challenges of High-Level Actions: The AWM_AS experiment, showing agent resistance to calling high-level workflow actions, reveals an interesting gap between providing capabilities and enabling their effective utilization. This suggests that LM agents may struggle with hierarchical planning or action selection when the action space becomes multi-granular. Future work could explore more explicit hierarchical reasoning, planning modules, or reinforcement learning approaches to better integrate these high-level actions.
- Integration Complexity: The AWM_off+on results are a crucial practical insight. They indicate that simply accumulating workflows from different sources is not a panacea; sophisticated workflow management, conflict resolution, or contextual activation mechanisms are likely needed to prevent interference and ensure synergy. This suggests that the "memory" in Agent Workflow Memory might need more advanced organizational and retrieval capabilities beyond simple augmentation.
Overall, AWM provides a robust framework for improving LM agents by focusing on the adaptive learning and reuse of procedural knowledge. Its success in generalization, particularly in online settings, makes it a highly promising direction for future research in artificial intelligence.