ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

Published: 09/30/2025

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ReasoningBank distills self-judged experiences into general reasoning strategies, enabling agents to retrieve and update memories for continual improvement. Combined with MaTTS, it enhances learning efficiency and performance in continuous multi-task scenarios.

Abstract

With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with emergent behaviors naturally arise.

1. Bibliographic Information

1.1. Title

ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory

1.2. Authors

The authors of this paper are:

Siru Ouyang (University of Illinois Urbana-Champaign)
Jun Yan (Google Cloud AI Research)
I-Hung Hsu (Google Cloud AI Research)
Yanfei Chen (Google Cloud AI Research)
Ke Jiang (Google Cloud AI Research)
Zifeng Wang (Google Cloud AI Research)
Rujun Han (Google Cloud AI Research)
Long T. Le (Google Cloud AI Research)
Samira Daruki (Google Cloud AI Research)
Xiangru Tang (Yale University)
Vishy Tirumalashetty (Google Cloud AI Research)
George Lee (Google Cloud AI Research)
Mahsan Rofouei (Google Cloud AI)
Hangfei Lin (Google Cloud AI)
Jiawei Han (University of Illinois Urbana-Champaign)
Chen-Yu Lee (Google Cloud AI Research)
Tomas Pfister (Google Cloud AI Research)

The affiliations show a strong presence from Google Cloud AI Research and Google Cloud AI, indicating industrial research efforts, complemented by academic contributions from the University of Illinois Urbana-Champaign and Yale University.

1.3. Journal/Conference

The paper is posted on arXiv (arxiv.org), a preprint server, so it may not yet have undergone peer review for a specific journal or conference. arXiv is nonetheless a widely respected platform for quickly disseminating research in fields like AI and machine learning.

1.4. Publication Year

The paper was posted on 2025-09-29 at 17:51:03 UTC, so the publication year is 2025.

1.5. Abstract

The paper introduces ReasoningBank, a novel memory framework for large language model (LLM) agents. The core idea is to distill generalizable reasoning strategies from an agent's self-judged successful and failed experiences, rather than just storing raw interaction trajectories or successful routines. At test time, agents retrieve relevant memories from ReasoningBank to guide their actions and then integrate new learnings back into the bank, fostering continuous self-evolution. Building on this, the authors propose memory-aware test-time scaling (MaTTS), which enhances learning by generating diverse and abundant interaction experiences. By allocating more compute per task, MaTTS provides richer contrastive signals for synthesizing higher-quality memory, which in turn guides more effective scaling, creating a powerful synergy. Experiments on web browsing and software engineering benchmarks demonstrate that ReasoningBank consistently outperforms existing memory mechanisms in both effectiveness and efficiency, with MaTTS further amplifying these gains. The findings establish memory-driven experience scaling as a new dimension for enabling agents to self-evolve and exhibit emergent behaviors.

1.6. Original Source Link

Original Source Link: https://arxiv.org/abs/2509.25140
PDF Link: https://arxiv.org/pdf/2509.25140v1.pdf
Publication Status: This is a preprint, indicating it is publicly available but might not have completed a formal peer-review process for a journal or conference yet.

2. Executive Summary

2.1. Background & Motivation

The rapid advancements in large language models (LLMs) have led to the development of sophisticated LLM agents capable of interacting with complex environments, such as web browsers and software development tools. These agents are increasingly deployed in persistent, long-running roles where they encounter a continuous stream of tasks.

The core problem addressed by this paper is a significant limitation of current LLM agents: their inability to effectively learn from their accumulated interaction history. Typically, these agents approach each new task in isolation, essentially "forgetting" past experiences. This leads to several critical issues:

Repeating Past Errors: Agents are prone to making the same mistakes repeatedly, as they don't retain lessons from failures.
Discarding Valuable Insights: Useful strategies or solutions discovered in one task are often not transferred to similar future tasks.
Lack of Self-Evolving Capabilities: Without a mechanism to learn and adapt from experience, agents cannot continuously improve their capabilities over time, which is essential for real-world, dynamic environments.

Existing approaches to agent memory, such as storing raw interaction trajectories or common successful routines (workflows), suffer from two fundamental drawbacks:
Limited Generalizability: They struggle to distill high-level, transferable reasoning patterns that can apply broadly across different tasks or domains. Raw trajectories are too specific and noisy, while successful workflows are often too rigid.
Underexplored Failures: Many memory systems predominantly focus on successful experiences, neglecting the rich learning opportunities presented by failures (e.g., what not to do, common pitfalls).

This highlights a critical need for memory-aware agent systems that can learn from past experiences in a more abstract, generalizable, and comprehensive way to overcome these limitations and enable true self-evolution.

2.2. Main Contributions / Findings

The paper makes several primary contributions to address the aforementioned challenges:

ReasoningBank Framework:
- Novel Memory Mechanism: The paper introduces ReasoningBank, a new memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. Unlike prior work focused on raw trajectories or only successful routines, ReasoningBank extracts high-level, transferable patterns and actionable principles, including preventative lessons from failures.
- Closed-Loop Self-Evolution: ReasoningBank operates in a continuous closed loop. Agents retrieve relevant memories to guide current tasks, and upon task completion, new experiences are analyzed, distilled into memory items, and consolidated back into the ReasoningBank. This enables agents to continuously evolve and improve their strategic capabilities over time.
Memory-aware Test-Time Scaling (MaTTS):
- Synergistic Scaling Dimension: The paper proposes MaTTS, a novel approach that integrates ReasoningBank with test-time scaling techniques. Instead of merely scaling the number of tasks, MaTTS focuses on scaling the depth of experience per task by allocating more computational resources to generate abundant and diverse exploration trajectories.
- Enhanced Memory Curation: MaTTS leverages the contrastive signals arising from these diverse successful and failed trajectories to synthesize higher-quality, more generalizable memories for ReasoningBank. This creates a powerful synergy: better memory guides more effective exploration, and richer exploration yields even stronger memories, establishing memory-driven experience scaling as a new scaling dimension for agents.
Empirical Validation and Emergent Behaviors:
- Superior Performance: Extensive experiments on challenging benchmarks (WebArena for web browsing, Mind2Web for generalization, and SWE-Bench-Verified for software engineering) demonstrate that ReasoningBank consistently outperforms existing memory mechanisms (raw trajectories, successful routines) in both effectiveness (higher success rates) and efficiency (fewer interaction steps). MaTTS further amplifies these gains.
- Learning from Failures: The paper shows that ReasoningBank effectively transforms failures into constructive signals, unlike baselines that either ignore failures or degrade when they are included.
- Emergent Reasoning Strategies: Analysis reveals that ReasoningBank enables agents to develop increasingly complex, emergent reasoning strategies over time, evolving from simple procedural steps to adaptive checks and compositional strategies.
  
These findings collectively establish a practical pathway towards building adaptive, lifelong-learning agents that can continuously improve and generalize from their experiences in dynamic, real-world environments.

3.1. Foundational Concepts

To fully grasp the contributions of this paper, it's helpful to understand a few core concepts related to large language models and agent systems:

Large Language Models (LLMs): These are advanced artificial intelligence models trained on massive amounts of text data, enabling them to understand, generate, and process human language. Examples include OpenAI's GPT series, Google's Gemini, and Anthropic's Claude. In the context of agents, LLMs serve as the "brain" or policy $\pi_{\mathcal{L}}$ that generates thoughts, plans, and actions.
LLM Agents: An LLM agent is an AI system that leverages an LLM to perform tasks in an environment. Unlike simple LLM prompting, agents can perceive their environment (receive observations), plan multi-step actions, execute those actions, and learn from the outcomes to achieve a goal. They interact sequentially with an environment, making decisions over time.
Environment: For an LLM agent, the environment is the interactive system or world it operates in. This paper focuses on two main types:
- Web Browsing Environments: Where the agent navigates websites, clicks elements, types text, and extracts information (e.g., WebArena, Mind2Web). The observations are typically text-based representations of web pages (e.g., accessibility trees), and actions are web navigation operations (e.g., click(), type()).
- Software Engineering (SWE) Environments: Where the agent interacts with a codebase, executes commands, and modifies files to resolve software issues (e.g., SWE-Bench-Verified). Observations might be code snippets, error messages, or file contents, and actions are bash commands.
Policy ($\pi_{\mathcal{L}}$): In reinforcement learning and agent systems, a policy defines an agent's behavior. It's a function that maps an observed state or history of observations to an action. In this paper, the agent policy $\pi_{\mathcal{L}}(\cdot \mid \mathcal{M}, \mathcal{A})$ is parameterized by the backbone LLM $\mathcal{L}$, conditioned on a memory module $\mathcal{M}$ and the action space $\mathcal{A}$. This means the LLM decides what to do based on its internal knowledge, available memory, and the possible actions it can take.
Trajectory / Experience: A trajectory (or experience) is a sequence of observations and actions taken by an agent during its interaction with an environment to complete a task. It's a record of what the agent "did" and "saw." Formally, it's represented as $(o_{0:t}, a_{0:t})$ for $t$ steps, where $o_t$ is the observation at time $t$ and $a_t$ is the action.
Test-Time Learning: This paradigm refers to agents learning and improving their capabilities during the deployment or test phase, rather than just during a separate training phase. In this setting, tasks arrive in a streaming fashion (one after another), and the agent must adapt and evolve without access to future tasks or external ground-truth labels. It relies on self-verification and its own past experiences.
LLM-as-a-Judge: This is a technique where an LLM is used to evaluate the performance or correctness of another LLM's output or an agent's trajectory, often without needing human-labeled ground truth. The judge LLM is prompted with the task, the agent's actions, and the outcome, and then determines if the task was successful or failed. This is crucial for enabling self-judgement in test-time learning.

3.2. Previous Works

The paper positions ReasoningBank and MaTTS within the context of existing research on LLM agent memory and test-time scaling.

3.2.1. Memory for LLM Agents

Prior work on LLM agent memory aims to enhance agent performance by enabling them to utilize past information. This research can be categorized by the form of stored information and the mechanisms for retrieval and management:

Memory Content:
- Plain Text: Simple storage of raw text.
- Latent Knowledge Embeddings: Representing knowledge in a dense vector space for semantic retrieval.
- Structured Graphs: Organizing information into graph structures to capture relationships.
Memory Management: Involves retrieval mechanisms (e.g., semantic search) and management strategies (e.g., updating, merging). Reinforcement learning (RL) has also been explored for memory management.

More specifically, the paper highlights two common approaches that ReasoningBank aims to improve upon (as illustrated in Figure 1):

Trajectory Memory (e.g., Synapse [Zheng et al., 2024]): These methods primarily store raw past interactions, or trajectories, for direct reuse. While comprehensive, these trajectories can be lengthy and noisy, making them less efficient and generalizable. They are akin to recording every step an agent took. The paper cites Kagaya et al., 2024, Kong et al., 2025, and Zheng et al., 2024 as examples.

Example (from paper's Figure 1):

<action> click('188') </action>
<think> I have successfully navigated to the 'My Account' page. The user wants to know the date of their first purchase. On this page, I see a "Recent Orders" section. To ensure I find the *first* purchase, I need to view the complete order history, not just the recent ones. There is a "View All" link next to "Recent Orders" which should take me to the full order history. </think>
<action> click('1530') </action>
<think> I am on the "My Orders" page. The user wants to know the date of their first purchase. The orders are currently displayed from most recent to oldest. ...</think>
<action> click('1614') </action>

This shows a detailed, step-by-step log of actions and thoughts.

Workflow Memory (e.g., AWM [Wang et al., 2025d]): These approaches abstract common, successful routines or procedures (workflows) from past interactions. They generalize a bit more than raw trajectories by identifying sequences of actions that commonly lead to success. However, they often over-emphasize successful experiences and may still lack the ability to distill higher-level, transferable reasoning patterns, especially from failures. The paper cites Fang et al., 2025 and Wang et al., 2025d as examples.
- Example (from paper's Figure 1):
```
Workflow: Retrieve Information from Specific Page Section
<think> This workflow is designed to access a specific section .. </think>
<action> click('section_or_tab_id') </action>
<action> send_msg_to_user('extracted_info_message') </action>
```
  This represents a more abstract sequence of actions than a raw trajectory.

3.2.2. Agent Test-Time Scaling (TTS)

Test-time scaling (TTS) involves allocating additional computation during inference to boost performance. It has been widely adopted in tasks like coding and math reasoning. Common TTS methods include:

Best-of-N (BoN) [Chow et al., 2025]: Generating $N$ independent solutions or trajectories for a problem and then selecting the best one (e.g., using a verifier or self-evaluation).
Beam Search [Wu et al., 2024b]: Exploring multiple candidate sequences of actions or thoughts in parallel, expanding the most promising ones.
Leveraging Verifiers [Setlur et al., 2025]: Using external or self-generated mechanisms to check the correctness of intermediate steps or final solutions.
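
As a rough illustration of Best-of-N from the list above, the sketch below samples $N$ candidates and keeps the one a scoring function ranks highest. The names `best_of_n`, `toy_generate`, and `toy_score` are hypothetical and exist only for this example; in practice the generator would be an LLM or agent rollout and the scorer a verifier or self-evaluation.

```python
import random
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[random.Random], str],
    score: Callable[[str], float],
    n: int,
    seed: int = 0,
) -> Tuple[str, float]:
    """Sample n candidates independently and return the best one with its score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    rng = random.Random(seed)
    candidates = [generate(rng) for _ in range(n)]
    best = max(candidates, key=score)
    return best, score(best)


# Toy stand-ins (assumptions): candidates are integers written as strings,
# and the "verifier" prefers values close to 42.
def toy_generate(rng: random.Random) -> str:
    return str(rng.randint(0, 100))


def toy_score(candidate: str) -> float:
    return -abs(int(candidate) - 42)
```

With a fixed seed, raising `n` can only keep or improve the selected score, because the larger sample contains the smaller one.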

While effective for reasoning tasks, TTS for multi-turn interactive agentic tasks remains underexplored. Existing works in this area scale different dimensions of agentic systems, such as:
Search space for each action [Yu et al., 2025b].
Number of agents in multi-agent systems [Jin et al., 2025].
Number of interactions with the environment [Shen et al., 2025].

Crucially, the paper notes that none of these efforts considers the role of agent memory in scaling, where memory could guide future decisions and enhance the scaling process.

3.3. Technological Evolution

The evolution of LLM agents has progressed from simple prompt-response interactions to complex multi-step decision-making. Initially, agents were often stateless, treating each task as novel. The introduction of memory mechanisms marked a significant step, allowing agents to retain episodic (short-term, context-specific) or semantic (long-term, knowledge-based) information. Early memory systems focused on storing raw interaction logs or successful procedural steps.

However, the limitations of these approaches became apparent: raw logs are too detailed and non-transferable, while successful routines often miss crucial lessons from failures and struggle to generalize high-level strategies. This paper's ReasoningBank represents an evolution by moving beyond mere record-keeping to distilling generalizable reasoning strategies and explicitly incorporating lessons from failures.

Concurrently, test-time scaling has emerged as a powerful way to enhance LLM performance by allocating more compute to explore diverse solutions. This paper identifies a gap in combining TTS with agent memory, proposing MaTTS to create a synergy. This integration represents a new frontier, moving towards agents that not only learn from past experiences but also actively and strategically generate richer experiences to accelerate that learning, leading to a self-evolving capability.

3.4. Differentiation Analysis

ReasoningBank and MaTTS differentiate themselves from prior work in several key ways:

Memory Content (What to store):
- ReasoningBank vs. Trajectory Memory (e.g., Synapse): Instead of storing raw, low-level trajectories, ReasoningBank distills high-level, transferable reasoning patterns and strategies. This abstraction makes memories more compact, reusable, and less susceptible to noise from specific execution details.
- ReasoningBank vs. Workflow Memory (e.g., AWM): While Workflow Memory abstracts successful routines, ReasoningBank goes a step further by extracting generalizable reasoning strategies from both successful and failed experiences. This is crucial because failures provide valuable counterfactual signals and pitfalls that help an agent understand what not to do, leading to more robust and comprehensive learning. Prior methods often neglect this aspect.
Learning Mechanism (How to learn):
- Self-Judged Experiences: ReasoningBank relies on self-judged successful and failed experiences (using LLM-as-a-judge) without requiring external ground-truth labels. This enables continuous learning in real-world, dynamic environments where ground truth might be unavailable.
- Closed-Loop Evolution: The framework establishes a closed-loop process where memories guide actions, and new experiences continuously update the memory, fostering genuine self-evolution.
Integration with Test-Time Scaling (How to scale learning):
- MaTTS as a Synergy: The paper is the first to explicitly explore a memory-aware test-time scaling (MaTTS) strategy. Unlike vanilla TTS which might generate diverse but uncurated experiences, MaTTS uses ReasoningBank to guide the scaled exploration towards more promising paths and, conversely, leverages the diverse experiences from scaling to synthesize higher-quality, more generalizable memories. This creates a positive feedback loop that traditional TTS or memory-only systems lack.
- Contrastive Signals: MaTTS explicitly utilizes contrastive signals (comparing successful vs. failed trajectories, or different exploration paths) during memory curation, which is a key innovation for distilling robust strategies.
  
In essence, ReasoningBank aims to create actionable, generalizable guidance for future decisions by capturing why certain strategies work or fail, rather than just what happened. MaTTS then supercharges this learning process by intelligently generating rich, diverse experiences for ReasoningBank to learn from, making the agent more capable over time.

4. Methodology

This section details the problem formulation, the design of ReasoningBank, and the proposed Memory-aware Test-Time Scaling (MaTTS).

4.1. Problem Formulation

The paper frames the problem within the context of LLM-based agents operating in a test-time learning paradigm.

4.1.1. Agent Configuration

The agent's behavior is governed by a policy $\pi_{\mathcal{L}}(\cdot \mid \mathcal{M}, \mathcal{A})$, where:

$\mathcal{L}$ represents the backbone LLM (e.g., Gemini-2.5-flash, Claude-3.7-sonnet) that parameterizes the policy.
$\mathcal{M}$ denotes the memory module, which in this work is ReasoningBank. Initially, $\mathcal{M}$ is empty and accumulates memories over time.
$\mathcal{A}$ is the action space, defining the set of available actions the agent can take.

The agent's goal is to perform a task by interacting with an environment. This interaction is modeled as a sequential decision-making process. The transition function of the environment is defined as $\mathcal{T}(s_{t+1} \mid s_t, a_t)$, where:
$s_t$ is the state of the environment at time $t$.
$a_t$ is the action selected by the agent's policy $\pi_{\mathcal{L}}$ at time $t$.
$s_{t+1}$ is the new state after executing action $a_t$ from state $s_t$ .

The paper focuses on two main types of tasks:
Web Browsing Tasks: The action space $\mathcal{A}$ includes web navigation operations (e.g., click, type, scroll). Observations $o_t$ are typically text-based accessibility trees of web pages, providing a structured representation of the current webpage content.
Software Engineering (SWE) Tasks: The action space $\mathcal{A}$ consists of bash commands. Observations $o_t$ can be code snippets, error messages, or file contents.

For a given task, the agent generates a trajectory of interactions $(o_{0:t}, a_{0:t})$ over $t$ steps. At each step, the agent observes $o_t$ from $s_t$ and uses its policy to generate the next action: $a_{t+1} = \pi_{\mathcal{L}}(o_{0:t}, a_{0:t}; \mathcal{M}, \mathcal{A})$. Crucially, the memory module $\mathcal{M}$ contributes relevant memories as additional system instructions to the LLM policy $\pi_{\mathcal{L}}$, guiding its decision-making.
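
The interaction loop described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `rollout`, `toy_policy`, `toy_step`, and the `"stop"` terminal action are all hypothetical, and a plain function stands in for the LLM policy. The point is only that each action is chosen from the observation and action history plus the retrieved memory.

```python
from typing import Callable, List, Tuple

# (observations so far, actions so far, memory hints) -> next action
Policy = Callable[[List[str], List[str], List[str]], str]


def rollout(
    policy: Policy,
    step: Callable[[str], str],
    first_observation: str,
    memory_hints: List[str],
    max_steps: int = 10,
) -> Tuple[List[str], List[str]]:
    """Collect one trajectory as parallel lists of observations and actions."""
    observations = [first_observation]
    actions: List[str] = []
    for _ in range(max_steps):
        action = policy(observations, actions, memory_hints)
        actions.append(action)
        if action == "stop":  # hypothetical terminal action
            break
        observations.append(step(action))  # environment transition
    return observations, actions


# Toy policy (assumption): follows "View All" only when memory suggests it.
def toy_policy(observations: List[str], actions: List[str], memory_hints: List[str]) -> str:
    sees_link = "View All" in observations[-1]
    has_hint = any("full order history" in hint for hint in memory_hints)
    return "click('View All')" if sees_link and has_hint else "stop"


# Toy environment (assumption): every action leads to the same page.
def toy_step(action: str) -> str:
    return "My Orders page"
```

In this toy setup the same policy stops immediately when the memory is empty and takes the extra navigation step when a relevant hint is present, which mirrors how memory is meant to steer decisions.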

4.1.2. Test-Time Learning

The paper operates under a test-time learning paradigm (Wang et al., 2025c; Wu et al., 2024a). This means:

Streaming Task Queries: A sequence of tasks $Q = \{q_1, q_2, \ldots, q_N\}$ arrives sequentially. The agent must process each task one by one, without knowing future tasks.
No Ground Truth: During test time, no ground truth labels or external supervision are available to tell the agent if it succeeded or failed.
Continuous Evolution: The agent must continually evolve and improve its capabilities by solely leveraging its own past trajectories and any form of self-verification.

This setup poses two main challenges:

Memory Extraction and Preservation: How to effectively extract useful, generalizable information from past trajectories and store it in memory.
Memory Utilization: How to effectively retrieve and leverage such memory for future queries to avoid repeating errors or rediscovering known strategies.

4.2. ReasoningBank

ReasoningBank is designed to address the limitations of storing raw, lengthy, and noisy trajectories by distilling useful strategies and reasoning hints into structured, reusable memory items.

The following figure (Figure 2 from the original paper) shows the overall architecture of ReasoningBank:

Figure 2 | Overview of ReasoningBank. Experiences are distilled into structured memory items with a title, description, and content. For each new task, the agent retrieves relevant items to interact with the environment, and constructs new ones from both successful and failed trajectories. These items are then consolidated into ReasoningBank, forming a closed-loop memory process.

4.2.1. Memory Schema

Memory items in ReasoningBank are structured knowledge units abstracted from past experiences. They focus on preserving transferable reasoning patterns and strategies rather than low-level execution details. Each memory item comprises three components:

Title: A concise identifier summarizing the core strategy or reasoning pattern (e.g., "Navigation Strategy").
Description: A brief one-sentence summary of the memory item.
Content: Detailed distilled reasoning steps, decision rationales, or operational insights extracted from past experiences.

This structured format makes memory items both human-interpretable (understandable) and machine-usable (easy for the agent to integrate and act upon), facilitating efficient reuse.
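
One possible way to represent this three-field schema in code is a small immutable record. The class name `MemoryItem`, the `to_prompt` layout, and the example description and content are assumptions made for illustration; only the three field names and the "Navigation Strategy" title come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    title: str        # concise identifier of the strategy
    description: str  # one-sentence summary
    content: str      # distilled reasoning steps or insights

    def to_prompt(self) -> str:
        """Render the item as plain text for a prompt (layout is an assumption)."""
        return (
            f"Title: {self.title}\n"
            f"Description: {self.description}\n"
            f"Content: {self.content}"
        )


# Illustrative item; the description and content are made up for this example.
item = MemoryItem(
    title="Navigation Strategy",
    description="Open the complete listing before answering questions about history.",
    content="If a page shows only recent entries, follow the 'View All' link first.",
)
```

Keeping the record frozen reflects that an item is a distilled unit that is stored and reused, not edited in place during a task.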

4.2.2. Integration of ReasoningBank with Agents

The integration of ReasoningBank with an agent follows a closed-loop process with three steps: Memory Retrieval, Memory Construction, and Memory Consolidation.

4.2.2.1. Memory Retrieval

When facing a new task (query context):

Querying ReasoningBank: The agent queries ReasoningBank with the current query context.
Similarity Search: It identifies the top-k most relevant experiences and their corresponding memory items using embedding-based similarity search. This involves:
- Embedding Tasks: Each task query is embedded into a vector space using a pre-trained model (e.g., gemini-embedding-001).
- Cosine Distance: Similarity between the current task's embedding and stored memory item embeddings is calculated using cosine distance.
Instruction Injection: The retrieved top-k memory items are concatenated into the agent's system instruction (prompt). This ensures that the agent's current decision-making process is informed and guided by relevant past experiences. The prompt includes the title and content of each retrieved memory item, along with an instruction for the agent to explicitly consider these items.
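
The retrieval and injection steps above can be sketched with plain vectors. This is a simplified illustration under stated assumptions: embeddings are supplied as lists of floats instead of coming from an embedding model, and `retrieve_top_k`, `inject`, and the injected wording are hypothetical. Ranking by highest cosine similarity is equivalent to ranking by lowest cosine distance.

```python
import math
from typing import List, Sequence, Tuple

Entry = Tuple[Sequence[float], str, str]  # (embedding, title, content)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_vec: Sequence[float], bank: List[Entry], k: int) -> List[Tuple[str, str]]:
    """Return (title, content) of the k entries most similar to the query."""
    ranked = sorted(bank, key=lambda e: cosine_similarity(query_vec, e[0]), reverse=True)
    return [(title, content) for _, title, content in ranked[:k]]


def inject(base_instruction: str, items: List[Tuple[str, str]]) -> str:
    """Append retrieved items to the system instruction (wording is an assumption)."""
    if not items:
        return base_instruction
    lines = [base_instruction, "", "Consider these memory items from past experience:"]
    lines += [f"- {title}: {content}" for title, content in items]
    return "\n".join(lines)


# Toy bank with hand-written 2-D embeddings (assumption).
bank: List[Entry] = [
    ([1.0, 0.0], "Navigation Strategy", "Open the full listing first."),
    ([0.0, 1.0], "Form Filling", "Validate required fields."),
    ([0.7, 0.7], "Pagination", "Check later pages."),
]
```

For example, `retrieve_top_k([0.9, 0.1], bank, 2)` ranks "Navigation Strategy" first and "Pagination" second, and `inject` then places both in the instruction text.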

4.2.2.2. Memory Construction

After the current task is completed, new memory items are extracted from the agent's recent experience. This process involves:

Correctness Signals (LLM-as-a-judge): Since no ground truth is available, the agent first obtains proxy signals for the correctness of its completed trajectories. This is done using an LLM-as-a-judge (Gu et al., 2024), which labels the outcome as success or failure based on the query and the agent's trajectory. The system instructions for obtaining binary signals indicating success or failures of the current trajectory are shown in Figure 9 from the paper:

Figure 9 | System instructions for obtaining binary signals indicating success or failure of the current trajectory. This prompt guides the LLM-as-a-judge to output "Success" or "Failure" based on the User Intent, Trajectory, Final State of the Webpage, and Bot Response.
Extraction Strategies: Based on the success/failure signal, different strategies are applied for memory extraction:
- Successful Experiences: Contribute validated strategies, highlighting effective approaches. The system instruction for extracting memory items from successful trajectories emphasizes analyzing why the trajectory succeeded and summarizing transferable reasoning strategies.
- Failed Experiences: Supply counterfactual signals and pitfalls, helping to identify common mistakes and sharpen guardrails. The system instruction for extracting memory items from failed trajectories requires reflecting on the causes of failure and articulating lessons or preventive strategies. The system instructions for extracting memory items are shown in Figure 8 from the paper:
  
Figure 8 | System instructions for extracting memory items from agent trajectories: the left panel targets successful trajectories (summarizing why they succeed), while the right targets failed trajectories (reflecting on failure and deriving lessons). In both cases, the output is constrained to at most three memory items in a structured Markdown format, ensuring conciseness, non-redundancy, and generalizability (i.e., not tied to specific websites or queries).
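
The construction step above amounts to a judge-then-branch routine, sketched below. The function names and the toy judge and extractors are hypothetical stand-ins for the LLM prompts in Figures 8 and 9; the cap of three items follows the Figure 8 caption.

```python
from typing import Callable, List

MAX_ITEMS = 3  # Figure 8 constrains each extraction to at most three items

Judge = Callable[[str, str], bool]            # (query, trajectory) -> success?
Extractor = Callable[[str, str], List[str]]   # (query, trajectory) -> memory items


def construct_memory(
    query: str,
    trajectory: str,
    judge: Judge,
    from_success: Extractor,
    from_failure: Extractor,
) -> List[str]:
    """Label the trajectory, then extract items with the matching strategy."""
    succeeded = judge(query, trajectory)
    extractor = from_success if succeeded else from_failure
    return extractor(query, trajectory)[:MAX_ITEMS]


# Toy stand-ins (assumptions) for the LLM-as-a-judge and the two extractors.
def toy_judge(query: str, trajectory: str) -> bool:
    return "answer:" in trajectory


def toy_from_success(query: str, trajectory: str) -> List[str]:
    return ["Strategy: open the full history before answering."]


def toy_from_failure(query: str, trajectory: str) -> List[str]:
    return [f"Pitfall {i}: placeholder lesson" for i in range(1, 5)]
```

Because the judge's label is only a proxy signal, a wrong label sends the trajectory to the wrong extractor; the sketch makes that dependency explicit.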

4.2.2.3. Memory Consolidation

Finally, the newly constructed memory items are consolidated into ReasoningBank.

Simple Addition: The paper adopts a minimal consolidation strategy, where newly generated items are directly added to the memory pool. This approach highlights the pure contribution of ReasoningBank without the complexities of advanced consolidation algorithms (e.g., merging, pruning, forgetting).
Evolving Repository: This continuous addition maintains an evolving repository of memory items, allowing the agent to continuously learn and adapt over time.
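
Putting retrieval, construction, and addition-only consolidation together over a task stream gives the closed loop, sketched here. `run_stream` and the toy callables are hypothetical; the only behavior taken from the text is that new items are appended without merging, pruning, or forgetting.

```python
from typing import Callable, List, Tuple


def run_stream(
    tasks: List[str],
    retrieve: Callable[[str, List[str]], List[str]],
    solve: Callable[[str, List[str]], str],
    construct: Callable[[str, str], List[str]],
) -> Tuple[List[str], List[int]]:
    """Process tasks one by one; the bank starts empty and only grows."""
    bank: List[str] = []
    sizes: List[int] = []
    for task in tasks:
        hints = retrieve(task, bank)              # memory retrieval
        trajectory = solve(task, hints)           # act with the retrieved hints
        bank.extend(construct(task, trajectory))  # plain addition, no merging
        sizes.append(len(bank))
    return bank, sizes


# Toy stand-ins (assumptions) for the three stages.
def toy_retrieve(task: str, bank: List[str]) -> List[str]:
    return bank[-2:]  # pretend the two newest items are the most relevant


def toy_solve(task: str, hints: List[str]) -> str:
    return f"{task} solved with {len(hints)} hints"


def toy_construct(task: str, trajectory: str) -> List[str]:
    return [f"lesson from {task}"]
```

Since nothing is ever removed, the bank size is non-decreasing across the stream, which is the trade-off of choosing the minimal consolidation strategy.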

4.3. MaTTS: Memory-aware Test-Time Scaling

Memory-aware Test-Time Scaling (MaTTS) is introduced to establish a powerful synergy between memory and test-time scaling. It aims to accelerate and diversify the learning process by intelligently scaling up the agent's interaction experience. The core idea is to leverage the abundant successful and failed trajectories generated during scaling for more effective memory curation, rather than simply converting more trajectories into more memory items independently (which is the "vanilla TTS" approach).

The following figure (Figure 3 from the original paper) compares vanilla TTS and MaTTS:

Figure 3 | Comparison of (a) vanilla TTS and MaTTS with (b) parallel scaling, where self-contrast across multiple trajectories curates reliable memory, and (c) sequential scaling, where self-refinement enriches memory with intermediate reasoning signals.

MaTTS is instantiated in two complementary settings: parallel scaling and sequential scaling. A scaling factor $k$ denotes the number of trajectories for parallel scaling or refinement steps for sequential scaling.

4.3.1. Parallel Scaling

In parallel scaling, the agent generates multiple trajectories ( $k$ trajectories) for the same query under the initial guidance of retrieved memory items.

Diverse Exploration: This process promotes diverse exploration by allowing the agent to attempt the task in several different ways.
Self-Contrast: The key innovation is to leverage self-contrast (Chen et al., 2020) across these multiple generated trajectories. By comparing and contrasting the successful and failed attempts, the agent can identify reliable patterns that lead to success and pitfalls that result in spurious solutions.
Reliable Memory Curation: This provides rich contrastive signals for ReasoningBank to synthesize more reliable and generalizable memory items from multiple trials of a single query. The system instructions for memory-aware test-time scaling are shown in Figure 10 from the paper. The left panel shows the parallel scaling instruction, where the model is prompted to compare and contrast multiple trajectories to identify useful strategies and common mistakes. The following figure (Figure 10 from the original paper) illustrates the system instructions for MaTTS:

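Figure 14 | ReasoningBank enables the agent to recall and apply past reasoning hints, guiding it to the full order history and yielding the correct first purchase date, unlike the baseline that fai… The image is a schematic of Figure 14 from the paper, comparing the Baseline (no memory) and ReasoningBank when querying a user's first purchase date. ReasoningBank draws on reasoning hints stored in memory to retrieve the full order history and give the correct answer, whereas the Baseline relies only on recent order information and answers incorrectly.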

Figure 10 | System instructions for memory-aware test-time scaling: the left panel shows parallel scaling (comparing multiple trajectories to extract generalizable insights), while the right panel shows sequential scaling (iteratively re-checking a trajectory to refine the final answer).
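
The parallel setting can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: `rollout` and `judge` are hypothetical callables standing in for the agent and the LLM-as-a-judge, and splitting trajectories by judged outcome before extraction is an assumed way of preparing the self-contrast input.

```python
from typing import Callable, List, Tuple


def parallel_scaling(
    query: str,
    hints: List[str],
    k: int,
    rollout: Callable[[str, List[str]], str],
    judge: Callable[[str, str], bool],
) -> Tuple[List[str], List[str]]:
    """Run k rollouts for one query and split them by self-judged outcome."""
    successes: List[str] = []
    failures: List[str] = []
    for _ in range(k):
        # Every rollout starts from the same retrieved memory hints.
        trajectory = rollout(query, hints)
        if judge(query, trajectory):
            successes.append(trajectory)
        else:
            failures.append(trajectory)
    return successes, failures
```

Both groups would then be passed together to a single extraction prompt, so that memory items are derived from the contrast between attempts rather than from each trajectory in isolation.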

4.3.2. Sequential Scaling

In sequential scaling, the agent iteratively refines its reasoning within a single trajectory after an initial completion, following the principle of self-refinement (Madaan et al., 2023).

Iterative Refinement: After an initial attempt, the agent is prompted to re-examine its own trajectory and intermediate notes (e.g., thoughts, plans) generated during the process.
Enriching Memory with Intermediate Signals: These intermediate notes (reasoning attempts, corrections, insights) are valuable signals for memory construction. They capture the agent's internal thought process, identifying where it struggled, made mistakes, or found breakthroughs, even if not explicitly part of the final solution.
Consistency and Correction: The iterative re-checking ensures consistency and allows for corrections, enriching the memory with a deeper understanding of the problem-solving process. The right panel of Figure 10 shows the instruction for sequential scaling, where the agent iteratively re-examines its previous trajectory and reasoning steps, correcting inconsistencies and confirming its final answer.
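
A minimal sketch of the sequential setting, assuming that a scaling factor of $k$ means $k$ passes in total (the initial attempt plus $k-1$ re-checks). `rollout` and `refine` are hypothetical callables, not functions from the paper.

```python
from typing import Callable, List


def sequential_scaling(
    query: str,
    hints: List[str],
    k: int,
    rollout: Callable[[str, List[str]], str],
    refine: Callable[[str, str], str],
) -> List[str]:
    """Return the initial trajectory followed by k - 1 refined versions."""
    passes = [rollout(query, hints)]
    for _ in range(k - 1):
        # Each re-check sees the most recent trajectory and its notes.
        passes.append(refine(query, passes[-1]))
    return passes
```

Keeping every pass, not only the last one, preserves the intermediate corrections that the text describes as valuable signals for memory construction.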

4.3.3. Synergy between Memory and Test-Time Scaling

MaTTS creates a positive feedback loop:

Memory Guides Scaling: High-quality memory items retrieved from ReasoningBank provide better initial guidance to the agent, steering the scaled exploration (both parallel and sequential) towards more promising paths and reducing wasted computation.
Scaling Enriches Memory: The diverse and abundant experiences (trajectories, intermediate signals) generated through MaTTS provide richer contrastive signals for ReasoningBank to synthesize higher-quality, more generalizable memory.

This synergy positions memory-driven experience scaling as a new and powerful dimension for agent improvement, where the agent's ability to learn and its ability to explore effectively mutually reinforce each other.

5. Experimental Setup

The paper evaluates ReasoningBank and MaTTS on challenging benchmarks across web browsing and software engineering domains.

5.1. Datasets

The experiments are conducted on three agentic datasets:

5.1.1. WebArena

Description: A benchmark for general web navigation across diverse real-world domains (Zhou et al., 2024). It involves tasks requiring agents to interact with web pages to achieve specific goals.
Domains/Subsets:
- Shopping (187 instances): Tasks related to e-commerce websites.
- Admin (182 instances): Tasks involving administrative interfaces.
- Gitlab (180 instances): Tasks related to software development platforms.
- Reddit (106 instances): Tasks involving forum navigation.
- Multi (29 instances): Tasks that require transferring memory across multiple websites.
Scale: 684 test instances in total.
Data Sample Example (Conceptual): A task might be "Find the price of a specific product on an e-commerce site and add it to your cart," or "Change a user's permission setting in an admin panel." The agent receives the initial webpage state (accessibility tree) and the task instruction.

5.1.2. Mind2Web

Description: A benchmark designed to test the generalization capabilities of agents on versatile operations and environments (Deng et al., 2023). It assesses how well agents can perform tasks in cross-task, cross-website, and cross-domain settings.
Settings:
- Cross-Task (252 instances): Generalization to new tasks within familiar websites.
- Cross-Website (177 instances): Generalization to familiar tasks on new websites.
- Cross-Domain (912 instances): Generalization to new tasks on new websites from entirely different domains.
Scale: 1341 test instances in total.
Data Sample Example (Conceptual): A task could be "Book a flight from city A to city B" on a travel website. A cross-website challenge might be doing the same task on a different travel website, while a cross-domain challenge might be applying skills learned from flight booking to, say, ordering food online.

5.1.3. SWE-Bench-Verified

Description: A repository-level issue resolution benchmark for agentic coding tasks (Jimenez et al., 2024). It involves fixing software bugs described in GitHub issues.
Task: Agents need to generate a patch (code changes) that resolves the underlying bug, such that all provided test scripts execute successfully.
Scale: 500 high-quality, manually verified test instances.
Data Sample Example (Conceptual): A task might be a GitHub issue description like "Bug: When submitting form, validation error message is not displayed for empty 'email' field." The agent interacts with a bash environment, reads code, modifies files, runs tests, and submits a patch.

5.2. Evaluation Metrics

The paper uses a set of metrics tailored to each benchmark to assess effectiveness (success rate) and efficiency (number of interaction steps).

5.2.1. WebArena Metrics

Success Rate (SR):
1. Conceptual Definition: Measures the percentage of user queries that are successfully resolved by the agent. It quantifies the agent's ability to complete tasks correctly.
2. Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successfully resolved queries}}{\text{Total number of queries}} \times 100\% $
3. Symbol Explanation:
  - Number of successfully resolved queries: The count of tasks where the agent's final output or state satisfies the task requirements.
  - Total number of queries: The total number of tasks attempted in the benchmark.
- Measurement: Evaluation employs LLM-based fuzzy matching and exact string matching to verify if essential answer terms appear in the agent's predictions.
Steps:
1. Conceptual Definition: Represents the average number of interaction steps (actions) taken by the agent to complete each query. It quantifies the computational and interaction cost, serving as an efficiency metric.
2. Mathematical Formula: $ \mathrm{Steps} = \frac{\sum_{i=1}^{N} \text{Steps}_i}{\text{Number of completed queries}} $
3. Symbol Explanation:
  - $N$: The number of completed queries, so that the sum runs over exactly the queries counted in the denominator.
  - $\text{Steps}_i$ : The number of interaction steps taken for query $i$ .
  - Number of completed queries: The total number of queries for which the agent either succeeded or failed (i.e., did not time out).
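
As a small worked example of the two WebArena metrics, the sketch below computes SR over all queries and averages steps over completed queries only. The record layout (`status`, `steps`) is a hypothetical one chosen for illustration.

```python
from typing import Dict, List


def success_rate(records: List[Dict]) -> float:
    """Percentage of all queries that were resolved successfully."""
    return 100.0 * sum(r["status"] == "success" for r in records) / len(records)


def average_steps(records: List[Dict]) -> float:
    """Mean interaction steps over completed queries (timeouts excluded)."""
    completed = [r for r in records if r["status"] != "timeout"]
    return sum(r["steps"] for r in completed) / len(completed)


records = [
    {"status": "success", "steps": 6},
    {"status": "failure", "steps": 10},
    {"status": "timeout", "steps": 30},
]
print(round(success_rate(records), 1), average_steps(records))
```

With these three toy records, one of three queries succeeds and the two completed queries take 6 and 10 steps, so the average is 8 steps.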

5.2.2. Mind2Web Metrics

Mind2Web tasks have a predefined fixed number of steps. At each step, the agent predicts an action, and specific metrics are used:

Element Accuracy (EA):
1. Conceptual Definition: Measures if the agent correctly selects the target page element for an action at a given step. It quantifies the agent's ability to visually and semantically identify the correct interactive component on a web page.
2. Mathematical Formula: $ \mathrm{EA} = \frac{\text{Number of correctly selected elements}}{\text{Total number of steps}} \times 100\% $
3. Symbol Explanation:
  - Number of correctly selected elements: The count of steps where the agent's chosen element matches the ground-truth target element.
  - Total number of steps: The total number of action steps across all tasks.
Action F1 (AF1):
1. Conceptual Definition: Measures the correctness of the action type taken on a selected element. It's an F1 score, balancing precision and recall for action labels.
2. Mathematical Formula: $ \mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $, $ \mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $, $ \mathrm{AF1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\% $
3. Symbol Explanation:
  - True Positives: Correctly predicted action types.
  - False Positives: Incorrectly predicted action types (predicted when not ground truth).
  - False Negatives: Missed ground-truth action types (not predicted when ground truth).
Step Success Rate (SSR):
1. Conceptual Definition: Checks if both the correct element is selected AND the correct action is taken on it at a given step. It's a stricter per-step metric.
2. Mathematical Formula: $ \mathrm{SSR} = \frac{\text{Number of steps with correct element and action}}{\text{Total number of steps}} \times 100\% $
3. Symbol Explanation:
  - Number of steps with correct element and action: Count of steps where both Element Accuracy and Action F1 criteria are met for that specific step.
  - Total number of steps: The total number of action steps across all tasks.
Task-level Success Rate (SR):
1. Conceptual Definition: Measures if all intermediate steps for a given task are successfully conducted. A task is considered successful only if every single step within that task achieves a Step Success Rate of 1.0.
2. Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of tasks with all steps successfully completed}}{\text{Total number of tasks}} \times 100\% $
3. Symbol Explanation:
  - Number of tasks with all steps successfully completed: The count of tasks where the SSR for every step in the task is 1.0.
  - Total number of tasks: The total number of tasks in the benchmark.
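
The relationship between the per-step and per-task metrics can be made concrete with a short sketch. It assumes each step has already been reduced to two booleans (element correct, action correct), which is a simplification: Action F1 itself is omitted because it needs the predicted and ground-truth action labels.

```python
from typing import Dict, List, Tuple

Step = Tuple[bool, bool]  # (element_correct, action_correct), an assumed encoding


def mind2web_metrics(tasks: List[List[Step]]) -> Dict[str, float]:
    """Compute EA and SSR over all steps, and SR over tasks, as percentages."""
    steps = [s for task in tasks for s in task]
    ea = 100.0 * sum(e for e, _ in steps) / len(steps)
    ssr = 100.0 * sum(e and a for e, a in steps) / len(steps)
    # A task counts as successful only if every one of its steps is fully correct.
    sr = 100.0 * sum(all(e and a for e, a in task) for task in tasks) / len(tasks)
    return {"EA": ea, "SSR": ssr, "SR": sr}


tasks = [[(True, True), (True, False)], [(True, True)]]
print(mind2web_metrics(tasks))
```

In this toy input every element is chosen correctly, two of three steps are fully correct, and only one of the two tasks has all steps correct. This shows why task-level SR is much lower than the step-level metrics in the results tables: a single wrong step fails the whole task.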

5.2.3. SWE-Bench-Verified Metrics

Resolve Rate:
1. Conceptual Definition: The primary evaluation metric for SWE-Bench-Verified, measuring the percentage of issues where the agent's submitted patch successfully passes all provided test scripts.
2. Mathematical Formula: $ \mathrm{Resolve\ Rate} = \frac{\text{Number of issues with passing patch}}{\text{Total number of issues}} \times 100\% $
3. Symbol Explanation:
  - Number of issues with passing patch: The count of software issues for which the agent generated a code patch that successfully resolved the issue (i.e., all tests pass).
  - Total number of issues: The total number of issues in the benchmark.
Steps: (Same definition as WebArena Steps metric, but applied to SWE tasks)
1. Conceptual Definition: The average number of interaction steps (bash commands, thought processes) taken by the agent to attempt to resolve each issue.
2. Mathematical Formula: $ \mathrm{Steps} = \frac{\sum_{i=1}^{N} \text{Steps}_i}{\text{Number of completed issues}} $
3. Symbol Explanation:
  - $N$: The number of completed issues, so that the sum runs over exactly the issues counted in the denominator.
  - $\text{Steps}_i$ : The number of interaction steps taken for issue $i$ .
  - Number of completed issues: The total number of issues for which the agent completed its process (not necessarily successful).

5.2.4. Pass@k (for MaTTS)

Conceptual Definition: Used in the context of parallel scaling, Pass@k measures the probability of finding at least one correct solution among $k$ independently generated solutions (trajectories). Pass@1 refers to the success rate of a single trajectory. Best-of-N (BoN) is closely related: $N$ trajectories are generated and the best one is selected (by an LLM selector, see Section 5.4). BoN therefore equals Pass@N only if the selector picks a successful trajectory whenever one exists; otherwise Pass@N is an upper bound on BoN.
Mathematical Formula (General): If $P_s$ is the success probability of a single attempt and attempts are independent, then the probability of at least one success in $k$ attempts is $ \mathrm{Pass}@k = 1 - (1 - P_s)^k $. In practice, BoN is calculated by generating $N$ trajectories, selecting one of them, and checking whether the selected trajectory is successful: $ \mathrm{BoN} = \frac{\text{Number of tasks where the selected trajectory succeeded}}{\text{Total number of tasks}} \times 100\% $
Symbol Explanation:
- $P_s$ : The success rate of a single, randomly chosen trajectory.
- $k$ : The number of independent trajectories generated.
- Number of tasks where the selected trajectory succeeded: The count of tasks where the trajectory chosen from the $N$ parallel executions led to a successful outcome.
- Total number of tasks: The total number of tasks attempted.
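
The sketch below contrasts three quantities: the closed-form Pass@k, an empirical "at least one success" rate, and BoN with an explicit selector. The `select` argument is a hypothetical stand-in for the LLM selector mentioned under Implementation Details; here it simply returns an index.

```python
from typing import Callable, List


def pass_at_k(p_single: float, k: int) -> float:
    """Closed form 1 - (1 - P_s)^k, assuming independent attempts."""
    return 1.0 - (1.0 - p_single) ** k


def empirical_pass(results: List[List[bool]]) -> float:
    """Percentage of tasks with at least one successful trajectory."""
    return 100.0 * sum(any(task) for task in results) / len(results)


def best_of_n(results: List[List[bool]], select: Callable[[List[bool]], int]) -> float:
    """Percentage of tasks whose selected trajectory is successful."""
    return 100.0 * sum(task[select(task)] for task in results) / len(results)


results = [[False, True, False], [False, False, False], [True, True, False]]
print(pass_at_k(0.5, 3))                  # 0.875
print(empirical_pass(results))            # two of three tasks have a success
print(best_of_n(results, lambda t: 0))    # a naive selector that always picks index 0
```

The naive selector misses the successful trajectory in the first task, so its BoN falls below the empirical "at least one success" rate. A selector can at best match that rate, never exceed it.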

5.3. Baselines

The paper compares ReasoningBank against several representative memory-augmented approaches and a memory-free baseline:

No Memory (Vanilla):
- Description: This baseline represents the backbone LLM agent operating without any explicit memory module. Each task is approached in isolation, relying solely on the LLM's inherent knowledge and the current task prompt.
- Purpose: Serves as a fundamental reference point to demonstrate the benefits of any memory mechanism.
Synapse (Zheng et al., 2024):
- Description: A trajectory-based memory approach. It organizes and reuses past raw trajectories as in-context memory for new tasks. This means the agent might be provided with examples of how similar tasks were solved in the past, including all the detailed steps.
- Purpose: Represents methods that leverage direct past interaction records.
AWM (Agent Workflow Memory) (Wang et al., 2025d):
- Description: A more abstract memory mechanism compared to Synapse. It distills common successful routines or workflows from past trajectories into reusable procedures. These workflows abstract away some low-level details, focusing on sequences of successful high-level actions.
- Purpose: Represents methods that generalize from successful patterns, but typically do not explicitly leverage failures.
  
  These baselines provide a comprehensive comparison, spanning from agents without memory, to those reusing raw trajectories, and finally to methods that distill higher-level structures, allowing for a clear evaluation of ReasoningBank's innovations.

5.4. Implementation Details

Backbone LLMs:
- Google Gemini-2.5-Flash (Comanici et al., 2025)
- Google Gemini-2.5-Pro (Comanici et al., 2025)
- Anthropic Claude-3.7-Sonnet (Anthropic, 2025)
- These LLMs are accessed via the Vertex AI API. This allows for investigation of performance across different LLM families (Gemini, Claude) and model sizes/capabilities (Flash, Pro).
Execution Environments:
- Web Browsing (WebArena, Mind2Web): BrowserGym (de Chezelles et al., 2025) is used as the execution environment. A maximum step limit of 30 is set per query to prevent infinite loops.
- Software Engineering (SWE-Bench-Verified): A bash-only environment with no specialized tools or scaffold structures, following the setup of miniSWE-Agent (Yang et al., 2024).
Agent Style: The agent is implemented in a ReAct (Reasoning and Acting) style (Yao et al., 2023). This means the agent alternates between thinking (generating internal reasoning steps) and acting (executing an action in the environment) until it predicts a stop action or a task termination condition is met.
Decoding Configurations:
- Web Browsing (WebArena, Mind2Web): A decoding temperature of 0.7 is used for model generations. This allows for some creativity and diversity in agent actions and thoughts.
- Memory Extraction: For memory extraction, the backbone LLM of the extractor is set to the same as the agent system, with a temperature of 1.0 to encourage diverse memory item generation.
- LLM-as-a-Judge: For the LLM-as-a-judge (for success/failure signals), the backbone LLM is also the same, but the decoding temperature is set to 0.0 to ensure determinism in its judgment.
- Best-of-N Calculation: For selecting the best trajectory among $N$ candidates, an LLM (same backbone as the agent) is used with a carefully curated prompt.
Memory Details:
- Embedding Model: gemini-embedding-001 (Lee et al., 2025) is used for embedding task queries for memory retrieval.
- Similarity Search: Cosine distance is used for similarity search over the memory pool.
- Retrieval Quantity: Top-k relevant memory items are retrieved, with a default $k=1$ (further analyzed in ablations).
- Memory Storage: ReasoningBank is maintained in JSON format, with each entry containing the task query, original trajectory, and corresponding memory items (structured as {title, description, content}). Embeddings are pre-computed and stored in a separate JSON for efficient search. Memory is persisted across independent runs for continual accumulation.
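
A minimal sketch of the retrieval step, assuming embeddings are plain lists of floats that have already been computed. For brevity each entry carries its own embedding, whereas the text above describes a separate JSON file; the entry layout and function names are hypothetical.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as unrelated
    return dot / (norm_a * norm_b)


def retrieve_top_k(query_emb: List[float], memory: List[Dict], k: int = 1) -> List[Dict]:
    """Return the k entries whose task-query embedding is most similar to the query."""
    ranked = sorted(
        memory,
        key=lambda entry: cosine_similarity(query_emb, entry["embedding"]),
        reverse=True,
    )
    return ranked[:k]


memory = [
    {"query": "a", "embedding": [1.0, 0.0]},
    {"query": "b", "embedding": [0.0, 1.0]},
    {"query": "c", "embedding": [0.7, 0.7]},
]
print([e["query"] for e in retrieve_top_k([0.9, 0.1], memory, k=2)])
```

Ranking by highest cosine similarity is the same as ranking by lowest cosine distance, since the distance is one minus the similarity.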

6. Results & Analysis

This section presents the experimental results for ReasoningBank and MaTTS, analyzing their effectiveness, efficiency, and the synergy between memory and scaling.

6.1. Core Results Analysis

The paper first demonstrates the consistent outperformance of ReasoningBank over baselines across various benchmarks and LLM backbones.

6.1.1. WebArena Benchmark Results

The following are the results from Table 1 of the original paper:

Models	Shopping (187)		Admin (182)		Gitlab (180)		Reddit (106)		Multi (29)		Overall (684)
Models	SR	Step	SR	Step	SR	Step	SR	Step	SR	Step	SR	Step
Gemini-2.5-flash
No Memory	39.0	8.2	44.5	9.5	33.9	13.3	55.7	6.7	10.3	10.0	40.5	9.7
Synapse	40.6	7.0	45.1	9.1	35.6	13.0	59.4	6.5	10.3	10.5	42.1	9.2
AWM	44.4	7.0	46.7	8.8	37.2	13.2	62.3	6.1	3.4	7.7	44.1	9.0
ReasoningBank	49.7	6.1	51.1	8.2	40.6	12.3	67.0	5.6	13.8	8.8	48.8	8.3
Gemini-2.5-pro
No Memory	45.5	7.6	51.1	8.7	35.0	11.6	71.7	6.0	6.9	8.8	46.7	8.8
Synapse	46.5	6.6	52.2	8.9	38.3	11.3	68.9	5.9	6.9	9.0	47.7	8.5
AWM	48.1	6.4	49.3	9.8	40.0	11.2	68.9	6.4	3.4	9.3	47.6	8.7
ReasoningBank	51.9	6.0	56.6	7.7	44.4	9.8	80.2	5.1	13.8	8.2	53.9	7.4
Claude-3.7-sonnet
No Memory	38.5	6.1	49.5	8.4	36.7	10.6	53.8	5.5	0.0	11.6	41.7	8.0
Synapse	39.6	5.8	50.5	8.5	38.0	10.0	53.8	6.1	0.0	11.8	42.6	7.9
AWM	39.6	7.2	47.8	9.3	34.6	10.9	52.8	7.0	0.0	12.4	40.8	8.9
ReasoningBank	44.9	5.6	53.3	7.6	41.1	9.5	57.5	5.2	3.4	10.5	46.3	7.3

Consistent Outperformance: ReasoningBank consistently achieves higher Success Rates (SR) across all LLM backbones (Gemini-2.5-flash, Gemini-2.5-pro, Claude-3.7-sonnet) and almost all WebArena subsets compared to No Memory, Synapse, and AWM. For instance, with Gemini-2.5-flash, it improves the Overall SR from 40.5% (No Memory) to 48.8%. With Gemini-2.5-pro, the Overall SR increases from 46.7% to 53.9%. This highlights its robustness and general applicability.
Enhanced Generalization (Multi subset): On the challenging Multi subset, which requires transferring memory across different websites, ReasoningBank improves SR over No Memory by 3.5 percentage points with Gemini-2.5-flash, 6.9 with Gemini-2.5-pro, and 3.4 with Claude-3.7-sonnet. Notably, strong baselines like AWM even show degradation (e.g., 3.4% SR for AWM with Gemini-2.5-flash vs. 10.3% for No Memory). This indicates that ReasoningBank curates more robust and transferable memory items.
Superior Efficiency: Beyond effectiveness, ReasoningBank also reduces the average number of Steps needed to complete tasks. On WebArena, it lowers the overall average step count by up to 1.4 compared with No Memory and by up to 1.6 compared with the other memory baselines. This suggests that ReasoningBank helps agents find solutions more directly and efficiently by reusing refined reasoning knowledge, avoiding redundant exploration.
Distinct Memory Source: The paper attributes ReasoningBank's superior performance to its extraction strategy, which distills memory from both successful and failed experiences, unlike Synapse and AWM that rely on a narrower, success-only memory source.
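
As a quick sanity check on Table 1, the Overall SR values are consistent with an instance-weighted average of the per-subset SRs. The sketch below assumes that this is how the Overall column is aggregated, and uses the Gemini-2.5-flash rows as transcribed above.

```python
# Subset sizes and success rates transcribed from the WebArena table above.
sizes = {"Shopping": 187, "Admin": 182, "Gitlab": 180, "Reddit": 106, "Multi": 29}
sr_no_memory = {"Shopping": 39.0, "Admin": 44.5, "Gitlab": 33.9, "Reddit": 55.7, "Multi": 10.3}
sr_reasoningbank = {"Shopping": 49.7, "Admin": 51.1, "Gitlab": 40.6, "Reddit": 67.0, "Multi": 13.8}


def weighted_overall(sr_by_subset: dict, sizes: dict) -> float:
    """Instance-weighted average of subset success rates."""
    total = sum(sizes.values())
    return sum(sr_by_subset[name] * sizes[name] for name in sizes) / total


print(round(weighted_overall(sr_no_memory, sizes), 1))      # 40.5
print(round(weighted_overall(sr_reasoningbank, sizes), 1))  # 48.8
```

Both values match the reported Overall column. The large subsets (Shopping, Admin, Gitlab) therefore dominate the headline number, while the 29-instance Multi subset has little influence on it.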

6.1.2. Mind2Web Benchmark Results

The following are the results from Table 3 of the original paper:

Models	Cross-Task (252)				Cross-Website (177)				Cross-Domain (912)
Models	EA	AF1	SSR	SR	EA	AF1	SSR	SR	EA	AF1	SSR	SR
Gemini-2.5-flash
No Memory	46.0	59.1	40.3	3.3	39.8	45.1	31.7	1.7	35.8	37.9	31.9	1.0
Synapse	47.0	59.5	41.2	3.5	40.3	46.0	32.1	1.9	36.3	38.5	32.4	1.1
AWM	46.3	56.1	41.0	3.5	39.1	42.2	31.7	2.1	33.3	36.5	30.1	0.7
ReasoningBank	52.1	60.4	44.9	4.8	44.3	52.6	33.9	2.3	40.6	41.3	36.6	1.6
Gemini-2.5-pro
No Memory	49.3	60.2	44.4	3.5	41.2	49.8	34.8	3.4	37.9	37.7	35.0	1.4
Synapse	50.1	61.0	44.7	3.6	41.8	51.2	35.0	3.2	38.5	39.8	35.6	1.5
AWM	48.6	61.2	44.4	3.7	41.9	47.9	34.8	2.3	37.3	38.1	34.4	1.2
ReasoningBank	53.6	62.7	45.6	5.1	46.1	54.8	36.9	3.8	42.8	45.2	38.1	1.7

Generalization Performance: ReasoningBank consistently improves success rates (SR) across all generalization settings (Cross-Task, Cross-Website, Cross-Domain) on Mind2Web.
Pronounced Gains in Cross-Domain: The paper highlights the gains in the Cross-Domain setting, which demands the highest level of generalization. Compared to No Memory, SR rises by 0.6 percentage points for Gemini-2.5-flash (1.0 to 1.6) and by 0.3 for Gemini-2.5-pro (1.4 to 1.7), while element accuracy rises from 35.8 to 40.6 and from 37.9 to 42.8, respectively. This supports the view that ReasoningBank's curated memory is more robust and transferable, enabling agents to apply learned strategies to truly novel scenarios.

6.1.3. SWE-Bench-Verified Results

The following are the results from Table 2 of the original paper:

Methods	Resolve Rate	Step
Gemini-2.5-flash
No Memory	34.2	30.3
Synapse	35.4	30.7
ReasoningBank	38.8	27.5
Gemini-2.5-pro
No Memory	54.0	21.1
Synapse	53.4	21.0
ReasoningBank	57.4	19.8

Robustness on SWE Tasks: ReasoningBank demonstrates its robustness on SWE-Bench-Verified for repository-level issue resolution tasks. It consistently achieves higher Resolve Rates (e.g., 38.8% vs. 34.2% for No Memory with Gemini-2.5-flash, and 57.4% vs. 54.0% for No Memory with Gemini-2.5-pro) and reduces Steps (e.g., 27.5 vs. 30.3 with Gemini-2.5-flash, and 19.8 vs. 21.1 with Gemini-2.5-pro). This confirms its efficacy beyond web browsing, suggesting broad applicability.

6.2. Results of MaTTS

The paper further investigates the impact of MaTTS on ReasoningBank's performance, focusing on the WebArena-Shopping subset with Gemini-2.5-flash.

The following figure (Figure 4 from the original paper) shows the effect of the scaling factor $k$ for MaTTS:

Figure 4 | Effect of scaling factor $k$ for MaTTS with ReasoningBank on the WebArena-Shopping subset. We compare (a) parallel and (b) sequential test-time scaling.

Performance Boost from Scaling: Both parallel scaling and sequential scaling generally boost performance as the scaling factor $k$ increases, confirming the benefit of allocating more inference-time computation. With MaTTS, parallel scaling increases from 49.7% ( $k=1$ ) to 55.1% ( $k=5$ ), and sequential scaling rises from 49.7% to 54.5%.
MaTTS Superiority over Vanilla TTS: MaTTS (which integrates ReasoningBank and memory-aware aggregation) consistently surpasses vanilla TTS (labeled as MaTTS w/o aggregation in the figure). At $k=5$ , MaTTS achieves 55.1% in parallel scaling compared to 52.4% for vanilla TTS, and 54.5% versus 51.9% in sequential scaling. This indicates that the memory-aware coordination and aggregation within MaTTS are crucial for effectively leveraging scaling.
Memory's Role in Scaling: For the baseline MaTTS w/o memory, the gains from scaling are smaller and less consistent (e.g., parallel scaling fluctuates between 39.0% and 42.2%). This highlights that memory is essential for making scaling truly effective, guiding the agent toward more promising solutions.
Parallel vs. Sequential Scaling:
- Sequential scaling shows higher gains at small k with ReasoningBank (e.g., initially matching or slightly exceeding parallel). However, its benefit saturates quickly as further refinements yield little new insight.
- Parallel scaling provides diverse rollouts that allow for critique and improvement, leading it to surpass sequential scaling at larger k (e.g., 55.1% vs. 54.5% at k=5).
- For vanilla TTS (without memory-aware aspects), sequential scaling offers little to no benefit, and parallel scaling consistently dominates.

6.3. Synergy of Memory and Test-Time Scaling

This section analyzes the bidirectional interaction between memory quality and scaling effectiveness using Pass@1 (success rate of a single, randomly selected trajectory) and Best-of-N (BoN) (success rate if the best of N trajectories is chosen). Results are for WebArena-Shopping subset with parallel scaling factor $k=3$ .

The following figure (Figure 5 from the original paper) shows a snapshot of MaTTS on the WebArena-Shopping subset with different memory mechanisms with $k = 3$ :

Figure 5 | Snapshot of MaTTS on the WebArena-Shopping subset with different memory mechanisms with $k = 3$ . We compute BoN for all 3 trajectories and Pass@1 with one randomly selected trajectory.

Better Memory Enables Stronger TTS Performance (BoN): The BoN results (blue bars) show that the benefit of scaling depends critically on the underlying memory mechanism.
- No Memory: Scaling yields only a slight improvement in BoN (from 39.0% to 40.6%).
- Weaker Memories (Synapse, AWM): Provide moderate gains, reaching 42.8% and 45.5% respectively.
- MaTTS with ReasoningBank: Delivers the strongest benefit, with BoN climbing from 49.7% to 52.4%. This demonstrates that high-quality memory directs scaling toward more promising rollouts, ensuring that additional trajectories are effectively converted into higher success rates.
Scaling Yields Better Memory Curation (Pass@1): The Pass@1 results (pink bars) reveal how scaling feeds back into memory curation.
- Weaker Memories (Synapse, AWM): Pass@1 declines with scaling (Synapse from 40.6% to 40.1%, AWM from 44.4% to 41.2%). This suggests that without strong guidance, the extra rollouts generated by scaling introduce noise rather than useful signals for memory curation.
- ReasoningBank: Is the only method where Pass@1 rises with scaling (from 49.7% to 50.8%). This indicates that high-quality memory can harness the diversity of scaling to extract constructive contrastive signals, leading to more effective memory curation.
  
  This asymmetry highlights a virtuous cycle: scaling alone is insufficient; it must be paired with a good memory mechanism like ReasoningBank for scaling to contribute to the curation of more effective memory, thereby closing the learning loop.

6.4. Incorporating Failure Trajectories

The paper explicitly analyzes the impact of including failure trajectories in memory construction, a key differentiator of ReasoningBank.

The following figure (Figure 7 from the original paper) shows the ablation results of incorporating failure trajectories for memory induction:

Figure 7 | Ablation results of incorporating failure trajectories for memory induction.

Baselines' Limitations: Synapse and AWM build memory solely from successful trajectories. When failures are included:
- Synapse: Only marginally improves from 40.6% (success-only) to 41.7% (with failures).
- AWM: Degrades from 44.4% to 42.2%. This indicates that these baselines are either unable to benefit from failures or that failures introduce noise that harms their performance.
ReasoningBank's Advantage: ReasoningBank is designed to distill reasoning patterns from both successes and failures.
- Success-only: Achieves 46.5%.
- With failures: Further improves to 49.7%. This clearly demonstrates that ReasoningBank can transform failures into constructive signals rather than noise, leading to more robust generalization and better overall performance. Lessons from mistakes are actively integrated and utilized.

6.5. Efficiency Study

To gain deeper insight into the Steps reduction, the analysis separates the average number of steps into successful and failed test instances.

The following are the results from Table 4 of the original paper:

Models	Shopping		Admin		Gitlab		Reddit
Models	Successful	Failed	Successful	Failed	Successful	Failed	Successful	Failed
No Memory	6.8	8.7	8.4	10.4	8.6	15.7	6.1	7.6
ReasoningBank	4.7 (↓2.1)	7.3 (↓1.4)	7.0 (↓1.4)	9.5 (↓0.9)	7.6 (↓1.0)	15.5 (↓0.2)	5.0 (↓1.1)	6.8 (↓0.8)

Consistent Step Reduction: ReasoningBank consistently reduces the number of steps across all domains for both successful and failed instances compared to No Memory.
Greater Reduction in Successful Cases: The reduction is particularly pronounced on successful cases. For example, in Shopping, ReasoningBank reduces steps by 2.1 for successful cases compared to 1.4 for failed cases. The successful-case reduction is quoted as a 26.9% relative reduction, although the table values give 2.1 / 6.8 ≈ 30.9%. Similar patterns are observed across other domains.
Guiding Purposeful Decision-Making: This finding indicates that ReasoningBank primarily helps the agent reach solutions with fewer interactions by strengthening its ability to follow effective reasoning paths. It suggests that the memory isn't just truncating failed attempts prematurely but is guiding more purposeful and efficient decision-making when the agent is on the right track, leading to faster successes.
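
The relative reductions can be recomputed directly from the Table 4 values transcribed above. This is only an arithmetic check; for Shopping successful cases it gives about 30.9%, which differs from the 26.9% quoted in the text.

```python
# (No Memory, ReasoningBank) average steps, transcribed from Table 4 above.
steps = {
    "Shopping": {"successful": (6.8, 4.7), "failed": (8.7, 7.3)},
    "Admin": {"successful": (8.4, 7.0), "failed": (10.4, 9.5)},
    "Gitlab": {"successful": (8.6, 7.6), "failed": (15.7, 15.5)},
    "Reddit": {"successful": (6.1, 5.0), "failed": (7.6, 6.8)},
}


def relative_reduction(baseline: float, improved: float) -> float:
    """Percentage reduction relative to the baseline."""
    return 100.0 * (baseline - improved) / baseline


reductions = {
    domain: {case: relative_reduction(*pair) for case, pair in cases.items()}
    for domain, cases in steps.items()
}
for domain, cases in reductions.items():
    print(domain, {case: round(value, 1) for case, value in cases.items()})
```

In every domain the relative reduction is larger for successful cases than for failed ones, which is the pattern the analysis relies on.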

6.6. Emergent Behaviors with ReasoningBank

The paper highlights that ReasoningBank fosters the evolution of strategies over time, leading to emergent behaviors akin to Reinforcement Learning dynamics.

The following figure (Figure 6 from the original paper) shows a case study illustrating emergent behaviors in ReasoningBank through memory items:

Figure 6 | A case study illustrating emergent behaviors in ReasoningBank through memory items.

The case study illustrates the evolution of a memory item:

Execution-oriented/Procedural (Initial): Strategies start simple, focusing on straightforward action rules (e.g., "find navigation links").
Adaptive Self-Reflection (Intermediate): With more experience, the agent develops strategies for re-verifying identifiers to reduce simple mistakes.
Adaptive Checks (Advanced): The memory item evolves to include systematic checks, such as leveraging available search or filters to ensure completeness before concluding results.
Compositional Strategies (Mature): Finally, it matures into complex compositional strategies, like cross-referencing task requirements and reassessing options based on multiple criteria.

This evolution from low-level actions to high-level reasoning demonstrates how ReasoningBank enables agents to refine and abstract strategies during test-time learning, leading to more sophisticated and robust problem-solving capabilities.

6.7. Ablation Studies / Parameter Analysis

6.7.1. Number of Retrieved Experiences

The paper conducts an ablation study on the number of retrieved experiences (i.e., the top-k value for memory retrieval). This is performed using Gemini-2.5-flash on the WebArena-Shopping subset.

The following figure (Figure 12 from the original paper) shows the ablation results for varying numbers of retrieved experiences:

Figure 12 | Ablation results for using various number of experiences.

Benefit of Memory: Incorporating just one relevant memory item ($k=1$) boosts the success rate from 39.0% (without memory) to 49.7%. This confirms the value of memory guidance.
Declining Returns from Noise: As the number of retrieved experiences increases beyond one, the success rate gradually declines (46.0% with $k=2$, 45.5% with $k=3$, and 44.4% with $k=4$), although every setting still beats the no-memory baseline.
Quality over Quantity: This suggests that while some memory helps, additional experiences may introduce conflicts or noise when they are not all relevant and mutually coherent. The relevance and quality of retrieved memory items matter more than the quantity retrieved. This supports ReasoningBank's design choice of distilling high-quality, structured memories.
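
To make the role of $k$ concrete, here is a minimal sketch of top-k retrieval over a memory bank. It assumes cosine similarity over precomputed embeddings, which is one common reading of "embedding-based retrieval"; the toy vectors, memory titles, and function names are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve_top_k(
    query_vec: Sequence[float],
    memory: List[Tuple[str, Sequence[float]]],
    k: int = 1,
) -> List[str]:
    """Return the titles of the k memory items most similar to the query."""
    ranked = sorted(memory, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [title for title, _ in ranked[:k]]


# Hypothetical memory items with made-up 3-dimensional embeddings.
memory = [
    ("check full order history", [0.9, 0.1, 0.0]),
    ("use category filters", [0.1, 0.9, 0.0]),
    ("re-verify identifiers", [0.0, 0.2, 0.9]),
]
query = [1.0, 0.2, 0.0]

print(retrieve_top_k(query, memory, k=1))
print(retrieve_top_k(query, memory, k=2))
```

With $k=1$ only the closest item is injected into the prompt; raising $k$ pulls in progressively less similar items, which is one plausible source of the noise discussed above.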

6.7.2. Pass@k Analysis for MaTTS

A Pass@k analysis is performed under parallel scaling on the WebArena-Shopping subset with Gemini-2.5-flash to understand the sample efficiency and performance gains of MaTTS.

The following figure (Figure 13 from the original paper) shows Pass@k under parallel scaling with ReasoningBank:

Figure 13 | Pass@k under parallel scaling with ReasoningBank.

Vanilla TTS Improves Sample Efficiency: MaTTS w/o aggregation (equivalent to vanilla TTS) already makes test-time learning behave similarly to RL training. Instead of merely inflating Pass@k at large $k$, it improves sample efficiency by guiding exploration. For example, at $k=2$, MaTTS w/o aggregation achieves a Pass@k of 50.8%, compared to 47.6% for MaTTS w/o memory. In other words, it extracts more value from each rollout.
MaTTS Amplifies Gains: Equipping TTS with memory-aware scaling (MaTTS) pushes performance even further. MaTTS not only preserves efficiency at small $k$ (51.3% at $k=2$) but also sustains strong growth with scaling, reaching 62.1% at $k=5$. This is 9.7 percentage points higher than MaTTS w/o memory (which reaches only 52.4% at $k=5$).
Unlocking Potential: This analysis shows that MaTTS unlocks more potential in agent systems, encouraging diverse generation that leads to better Pass@k performance and more efficient learning from exploration.
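
For readers unfamiliar with the metric, the following sketch computes Pass@k in its simplest form: the fraction of tasks for which at least one of the first $k$ rollouts succeeds. This is an assumed definition for illustration; the paper may compute it differently (for example, with an unbiased estimator), and the outcome data below is invented.

```python
from typing import Dict, List


def pass_at_k(outcomes: Dict[str, List[bool]], k: int) -> float:
    """Fraction of tasks solved by at least one of the first k rollouts."""
    if not outcomes:
        raise ValueError("outcomes must contain at least one task")
    solved = sum(any(rollouts[:k]) for rollouts in outcomes.values())
    return solved / len(outcomes)


# Invented per-task rollout outcomes (True = the rollout succeeded).
outcomes = {
    "task_a": [False, True, False],
    "task_b": [False, False, False],
    "task_c": [True, True, False],
    "task_d": [False, False, True],
}

for k in (1, 2, 3):
    print(k, pass_at_k(outcomes, k))
```

A method that is more sample-efficient raises this curve at small $k$, rather than only catching up once many rollouts are available.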

6.8. Case Studies

The paper presents two case studies to intuitively illustrate the benefits of ReasoningBank.

6.8.1. First Purchase Date Retrieval

The following figure (Figure 14 from the original paper) shows a case study for first purchase date retrieval:

Baseline Failure: A baseline agent (without memory) faced with the query "What is the date when I made my first purchase on this site?" only checks the "Recent Orders" table and incorrectly outputs the most recent purchase date. It fails to identify the need for a complete order history.
ReasoningBank Success: The agent equipped with ReasoningBank recalls relevant past reasoning hints (memory items). These hints guide it to explore the full purchase history, allowing it to correctly identify and output the earliest order date. This demonstrates ReasoningBank's ability to transfer high-level strategies (e.g., "always check full history for 'first' or 'earliest' queries").

6.8.2. Shopping Task Efficiency

The following figure (Figure 15 from the original paper) shows a case study for shopping task efficiency:

Figure 15 | ReasoningBank improves efficiency by leveraging past reasoning hints, reducing the navigation from 29 steps to 10 steps compared to the baseline without memory.

Baseline Inefficiency: In a navigation-heavy shopping task (e.g., "find men's clothing"), the baseline agent without memory struggles to find the correct filter for "Men" and requires 29 steps due to repeated, inefficient browsing and getting stuck.
ReasoningBank Efficiency: The agent with ReasoningBank leverages stored reasoning about category filtering. This allows it to directly reach the relevant items and complete the task in only 10 steps. This vividly illustrates how ReasoningBank improves efficiency by guiding the agent to purposeful decisions and avoiding redundant exploration.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces ReasoningBank, a novel memory framework that significantly enhances the self-evolving capabilities of large language model (LLM) agents. ReasoningBank distinguishes itself by distilling generalizable reasoning strategies from an agent's self-judged successful and failed experiences, moving beyond traditional memory mechanisms that rely on raw trajectories or only successful routines. This allows agents to learn not only effective strategies but also crucial preventative lessons from past mistakes. The framework operates in a closed loop, where agents retrieve relevant memories to inform current tasks and then integrate new learnings back into the ReasoningBank.
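
The closed loop described above can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: the `retrieve`, `act`, `judge`, and `distill` callables are hypothetical stand-ins for embedding retrieval, the agent rollout, the LLM-as-a-judge, and memory extraction, and the stubs exist only so the loop runs end to end.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class MemoryItem:
    title: str
    description: str
    content: str


def run_task_stream(
    tasks: List[str],
    bank: List[MemoryItem],
    retrieve: Callable[[str, List[MemoryItem]], List[MemoryItem]],
    act: Callable[[str, List[MemoryItem]], str],
    judge: Callable[[str, str], bool],
    distill: Callable[[str, str, bool], List[MemoryItem]],
) -> List[MemoryItem]:
    """Retrieve, act, self-judge, distill, and append, one task at a time."""
    for task in tasks:
        hints = retrieve(task, bank)              # memories that inform this task
        trajectory = act(task, hints)             # agent interaction with the environment
        success = judge(task, trajectory)         # self-judged outcome, no ground truth
        bank.extend(distill(task, trajectory, success))  # consolidate by simple addition
    return bank


# Toy stubs (all hypothetical) so the sketch is runnable.
def retrieve(task: str, bank: List[MemoryItem]) -> List[MemoryItem]:
    return bank[-1:]  # stand-in for similarity search


def act(task: str, hints: List[MemoryItem]) -> str:
    return f"{task} | hints={len(hints)}"


def judge(task: str, trajectory: str) -> bool:
    return "hints=0" not in trajectory  # toy rule: succeeds only when a hint was available


def distill(task: str, trajectory: str, success: bool) -> List[MemoryItem]:
    label = "strategy" if success else "pitfall"
    return [MemoryItem(f"{label}: {task}", "toy description", trajectory)]


bank = run_task_stream(
    ["find first order", "find men's clothing"], [], retrieve, act, judge, distill
)
print([item.title for item in bank])
```

Note that the failed first task still contributes a memory item, mirroring the point that failures yield preventative lessons rather than being discarded.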

Building on this, the paper proposes memory-aware test-time scaling (MaTTS), which synergistically combines ReasoningBank with test-time scaling. MaTTS accelerates and diversifies the learning process by intelligently generating abundant and diverse interaction experiences (through parallel or sequential scaling). These rich experiences provide contrastive signals for ReasoningBank to synthesize higher-quality, more generalizable memory. In turn, this refined memory guides more effective scaling, creating a powerful positive feedback loop.
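
A rough sketch of the parallel variant is given below. It assumes that $k$ rollouts are sampled for one task, each is self-judged, and the successes and failures are then contrasted to produce memory. All function names and the toy stubs are hypothetical; in the paper the contrast step would be carried out by an LLM rather than string formatting.

```python
from typing import Callable, List


def parallel_scaling(
    task: str,
    k: int,
    rollout: Callable[[str, int], str],
    judge: Callable[[str, str], bool],
    contrast: Callable[[str, List[str], List[str]], List[str]],
) -> List[str]:
    """Sample k trajectories for one task and distill memory from their contrast."""
    trajectories = [rollout(task, seed) for seed in range(k)]
    labeled = [(t, judge(task, t)) for t in trajectories]
    successes = [t for t, ok in labeled if ok]
    failures = [t for t, ok in labeled if not ok]
    return contrast(task, successes, failures)


# Toy stubs (hypothetical) so the sketch is runnable.
def rollout(task: str, seed: int) -> str:
    return f"{task}#path{seed}"


def judge(task: str, trajectory: str) -> bool:
    return int(trajectory[-1]) % 2 == 0  # toy rule: even-numbered paths succeed


def contrast(task: str, successes: List[str], failures: List[str]) -> List[str]:
    return [f"prefer {s}" for s in successes] + [f"avoid {f}" for f in failures]


memory_items = parallel_scaling("filter by Men", 3, rollout, judge, contrast)
print(memory_items)
```

The design point is that both outcome groups reach the distillation step, so the resulting memory can encode what distinguishes a working path from a failing one.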

Extensive experiments on web browsing (WebArena, Mind2Web) and software engineering (SWE-Bench-Verified) benchmarks consistently demonstrate that ReasoningBank outperforms existing memory mechanisms in both effectiveness (higher success rates, up to 34.2% relative improvement) and efficiency (16.0% fewer interaction steps). MaTTS further amplifies these gains, showing superior performance over vanilla test-time scaling. The findings establish memory-driven experience scaling as a new dimension for agent scaling, enabling self-evolution and the natural emergence of complex behaviors.

7.2. Limitations & Future Work

The authors acknowledge several limitations of their work and suggest future research directions:

7.2.1. Limitations

Focus on Memory Content: The study primarily emphasizes how to curate and utilize memory content (e.g., integrating failure trajectories, distilled reasoning cues). It did not extensively compare with other memory architectures like episodic or hierarchical memory. These architectural designs address orthogonal concerns (memory form/structure), while ReasoningBank focuses on what specific knowledge should be stored and reused.
Simplicity in Memory Retrieval and Consolidation: The current implementation uses simple embedding-based retrieval and straightforward consolidation (direct addition of new items). This was an intentional choice to isolate the effect of memory content quality. More sophisticated strategies (e.g., adaptive retrieval, hierarchical consolidation, merging, forgetting policies) were not explored.
Dependence on LLM-as-a-Judge for Correctness Signals: The success/failure signals for trajectories are determined by an LLM-as-a-judge. While this enables scalable self-evaluation without ground-truth feedback, it may introduce noise if the judge LLM makes errors or if tasks are ambiguous. Although the framework showed robustness to such noise, this is a potential source of imperfection.
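
To illustrate what a consolidation policy beyond direct addition might look like, here is a minimal sketch of a capacity-bounded bank that evicts the least recently used item. Least-recently-used eviction is an assumption chosen for simplicity; the paper does not propose it, and the class and method names are hypothetical.

```python
from collections import OrderedDict
from typing import List


class BoundedMemoryBank:
    """Memory bank with a fixed capacity and least-recently-used eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def add(self, title: str, content: str) -> None:
        """Insert or refresh an item, then evict the oldest items if over capacity."""
        self._items[title] = content
        self._items.move_to_end(title)
        while len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used item

    def touch(self, title: str) -> str:
        """Mark an item as recently used (e.g. after retrieval) and return it."""
        self._items.move_to_end(title)
        return self._items[title]

    def titles(self) -> List[str]:
        return list(self._items)


bank = BoundedMemoryBank(capacity=2)
bank.add("a", "check full history")
bank.add("b", "use filters")
bank.touch("a")                    # "a" was just retrieved, so "b" is now the oldest
bank.add("c", "re-verify ids")     # exceeds capacity and evicts "b"
print(bank.titles())
```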

7.2.2. Future Work

Compositional Memory: Explore how memory items could be composed into higher-level strategies or reusable macros. The current framework distills individual memory items and retrieves them independently. Future work could investigate composition-aware retrieval and consolidation to enable agents to combine complementary items for richer strategies and stronger generalization in long-horizon tasks.
Advanced Memory Architectures: Integrate ReasoningBank's philosophy with more complex memory architectures. That philosophy is compatible with the following components, and combining them could lead to a more comprehensive memory system:
- Episodic traces (Fountas et al., 2025) for per-task context.
- Short-term "working" memory (Lumer et al., 2025) for within-session state.
- Long-term consolidated knowledge (Wang et al., 2025b) with decay/refresh policies.
Sophisticated Retrieval and Consolidation Mechanisms: Move beyond simple embedding-based similarity for retrieval to reasoning-intensive controllers (Shao et al., 2025). These controllers could decompose queries, plan multi-hop lookups across memory tiers, and condition selection on factors like uncertainty, recency, and cost. Learning-based routers and consolidation policies could automate these processes, potentially turning ReasoningBank with MaTTS into a deployable memory service across domains.
Stronger Verifiers for Self-Judgement: Incorporate stronger verifiers, human-in-the-loop feedback, or ensemble judgment to enhance the reliability of memory induction, mitigating the noise introduced by the LLM-as-a-judge.
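
The ensemble-judgment idea can be sketched as a simple majority vote over several judges. This is an illustrative assumption about how such an ensemble might be wired, not a method from the paper; the judge callables and the threshold parameter are hypothetical.

```python
from typing import Callable, Sequence

Judge = Callable[[str, str], bool]


def ensemble_judge(
    task: str,
    trajectory: str,
    judges: Sequence[Judge],
    threshold: float = 0.5,
) -> bool:
    """Label a trajectory successful only if more than `threshold` of judges agree."""
    if not judges:
        raise ValueError("at least one judge is required")
    votes = [bool(judge(task, trajectory)) for judge in judges]
    return sum(votes) / len(votes) > threshold


# Toy judges (hypothetical stand-ins for separate LLM calls).
def lenient(task: str, trajectory: str) -> bool:
    return True


def strict(task: str, trajectory: str) -> bool:
    return "verified" in trajectory


def keyword(task: str, trajectory: str) -> bool:
    return task in trajectory


print(ensemble_judge("order date", "order date verified", [lenient, strict, keyword]))
print(ensemble_judge("order date", "gave up", [lenient, strict]))
```

Because the comparison is strict, a tie counts as a failure, which errs toward storing a trajectory as a cautionary lesson rather than as a validated strategy.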

7.3. Personal Insights & Critique

This paper presents a highly valuable contribution to the field of LLM agents, addressing a critical bottleneck: the lack of effective learning from experience. The core idea of ReasoningBank, distilling generalizable reasoning strategies from both successes and failures, is intuitively powerful. It moves agent memory from passive record-keeping to active knowledge creation. The emphasis on learning from failures, in particular, is a standout feature, as real-world learning often hinges on understanding and avoiding past mistakes.

The introduction of MaTTS is equally compelling. The concept of memory-driven experience scaling creates a virtuous cycle where compute is not just blindly thrown at a problem but intelligently used to generate diverse experiences that refine memory, which in turn makes future explorations more efficient. This synergistic relationship is a fundamental insight that could guide future research in scaling LLM agents.

Potential Strengths and Applications:

Robustness and Generalization: By learning from high-level reasoning and failures, agents become more robust to novel situations and can generalize across domains, as shown in the Mind2Web results. This is crucial for deploying agents in dynamic, open-ended environments.
Efficiency: The observed reduction in interaction steps is highly practical, as it translates directly to lower computational costs and faster task completion in real-world applications.
Path to Lifelong Learning: The closed-loop nature of ReasoningBank offers a clear path toward truly lifelong learning agents that continuously improve over their operational lifetime, a long-sought goal in AI.
Inspiration for Other Domains: The principles of distilling reasoning from experience and using contrastive signals could be applied to various other AI challenges, such as robotics, scientific discovery, or complex decision support systems.

Areas for Further Consideration / Critique:

Complexity of Memory Items: While structuring memory items into title, description, and content is a good start, the "content" itself can still be quite textual. Future work might explore more formal or executable representations of reasoning strategies to make them even more machine-actionable and less reliant on LLM interpretation.
Scalability of Retrieval: As ReasoningBank accumulates more memory items, the embedding-based similarity search for top-k items might become computationally intensive. The paper acknowledges this and suggests reasoning-intensive controllers or hierarchical memory as future directions, which would be crucial for industrial-scale deployment.
Bias in LLM-as-a-Judge: The reliance on LLM-as-a-judge for correctness signals, while practical, introduces the possibility of LLM biases or hallucinations affecting memory curation. While the paper states robustness, this remains a fundamental challenge in LLM-driven self-correction. Exploring diverse or ensemble judges, or incorporating human oversight, could mitigate this.
Computational Cost of MaTTS: While MaTTS improves overall efficiency by reducing steps to success, generating $k$ parallel trajectories still incurs a $k$-fold increase in inference cost for a single task. The trade-off between the increased compute for exploration and the long-term benefits of better memory needs to be carefully managed for practical applications.
"Forgetting" Mechanism: The current minimal consolidation strategy (simple addition) means the memory bank will continuously grow. Without a forgetting or pruning mechanism for redundant, outdated, or less useful memories, ReasoningBank could become unwieldy over very long timescales, potentially leading to increased retrieval latency or noise. This is explicitly noted as a future direction.

Overall, "ReasoningBank" is an inspiring paper that takes a significant step toward creating more intelligent, adaptive, and autonomous LLM agents. Its dual focus on refined memory content and memory-aware scaling provides a robust framework for continuous improvement in agents operating in complex real-world environments.
