ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
TL;DR Summary
ReasoningBank distills self-judged experiences into general reasoning strategies, enabling agents to retrieve and update memories for continual improvement. Combined with MaTTS, it enhances learning efficiency and performance in continuous multi-task scenarios.
Abstract
With the growing adoption of large language model agents in persistent real-world roles, they naturally encounter continuous streams of tasks. A key limitation, however, is their failure to learn from the accumulated interaction history, forcing them to discard valuable insights and repeat past errors. We propose ReasoningBank, a novel memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. At test time, an agent retrieves relevant memories from ReasoningBank to inform its interaction and then integrates new learnings back, enabling it to become more capable over time. Building on this powerful experience learner, we further introduce memory-aware test-time scaling (MaTTS), which accelerates and diversifies this learning process by scaling up the agent's interaction experience. By allocating more compute to each task, the agent generates abundant, diverse experiences that provide rich contrastive signals for synthesizing higher-quality memory. The better memory in turn guides more effective scaling, establishing a powerful synergy between memory and test-time scaling. Across web browsing and software engineering benchmarks, ReasoningBank consistently outperforms existing memory mechanisms that store raw trajectories or only successful task routines, improving both effectiveness and efficiency; MaTTS further amplifies these gains. These findings establish memory-driven experience scaling as a new scaling dimension, enabling agents to self-evolve with naturally emerging behaviors.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
ReasoningBank: Scaling Agent Self-Evolving with Reasoning Memory
1.2. Authors
The authors of this paper are:
- Siru Ouyang (University of Illinois Urbana-Champaign)
- Jun Yan (Google Cloud AI Research)
- I-Hung Hsu (Google Cloud AI Research)
- Yanfei Chen (Google Cloud AI Research)
- Ke Jiang (Google Cloud AI Research)
- Zifeng Wang (Google Cloud AI Research)
- Rujun Han (Google Cloud AI Research)
- Long T. Le (Google Cloud AI Research)
- Samira Daruki (Google Cloud AI Research)
- Xiangru Tang (Yale University)
- Vishy Tirumalashetty (Google Cloud AI Research)
- George Lee (Google Cloud AI Research)
- Mahsan Rofouei (Google Cloud AI)
- Hangfei Lin (Google Cloud AI)
- Jiawei Han (University of Illinois Urbana-Champaign)
- Chen-Yu Lee (Google Cloud AI Research)
- Tomas Pfister (Google Cloud AI Research)
The affiliations show a strong presence from Google Cloud AI Research and Google Cloud AI, indicating industrial research efforts, complemented by academic contributions from the University of Illinois Urbana-Champaign and Yale University.
1.3. Journal/Conference
The paper is published at arxiv.org. This is a preprint server, meaning the paper has been submitted but might not yet have undergone peer review for a specific journal or conference. However, arXiv is a widely respected platform for quickly disseminating research in fields like AI and machine learning.
1.4. Publication Year
The paper was published at (UTC) 2025-09-29T17:51:03.000Z, indicating a publication year of 2025.
1.5. Abstract
The paper introduces ReasoningBank, a novel memory framework for large language model (LLM) agents. The core idea is to distill generalizable reasoning strategies from an agent's self-judged successful and failed experiences, rather than just storing raw interaction trajectories or successful routines. At test time, agents retrieve relevant memories from ReasoningBank to guide their actions and then integrate new learnings back into the bank, fostering continuous self-evolution. Building on this, the authors propose memory-aware test-time scaling (MaTTS), which enhances learning by generating diverse and abundant interaction experiences. By allocating more compute per task, MaTTS provides richer contrastive signals for synthesizing higher-quality memory, which in turn guides more effective scaling, creating a powerful synergy. Experiments on web browsing and software engineering benchmarks demonstrate that ReasoningBank consistently outperforms existing memory mechanisms in both effectiveness and efficiency, with MaTTS further amplifying these gains. The findings establish memory-driven experience scaling as a new dimension for enabling agents to self-evolve and exhibit emergent behaviors.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2509.25140
- PDF Link: https://arxiv.org/pdf/2509.25140v1.pdf
- Publication Status: This is a preprint, indicating it is publicly available but might not have completed a formal peer-review process for a journal or conference yet.
2. Executive Summary
2.1. Background & Motivation
The rapid advancements in large language models (LLMs) have led to the development of sophisticated LLM agents capable of interacting with complex environments, such as web browsers and software development tools. These agents are increasingly deployed in persistent, long-running roles where they encounter a continuous stream of tasks.
The core problem addressed by this paper is a significant limitation of current LLM agents: their inability to effectively learn from their accumulated interaction history. Typically, these agents approach each new task in isolation, essentially "forgetting" past experiences. This leads to several critical issues:
- Repeating Past Errors: Agents are prone to making the same mistakes repeatedly, as they don't retain lessons from failures.
- Discarding Valuable Insights: Useful strategies or solutions discovered in one task are often not transferred to similar future tasks.
- Lack of Self-Evolving Capabilities: Without a mechanism to learn and adapt from experience, agents cannot continuously improve their capabilities over time, which is essential for real-world, dynamic environments.
Existing approaches to agent memory, such as storing raw interaction trajectories or common successful routines (workflows), suffer from two fundamental drawbacks:
- Limited Generalizability: They struggle to distill high-level, transferable reasoning patterns that apply broadly across different tasks or domains. Raw trajectories are too specific and noisy, while successful workflows are often too rigid.
- Underexplored Failures: Many memory systems predominantly focus on successful experiences, neglecting the rich learning opportunities presented by failures (e.g., what not to do, common pitfalls).
This highlights a critical need for memory-aware agent systems that can learn from past experiences in a more abstract, generalizable, and comprehensive way, overcoming these limitations and enabling true self-evolution.
2.2. Main Contributions / Findings
The paper makes several primary contributions to address the aforementioned challenges:
- ReasoningBank Framework:
  - Novel Memory Mechanism: The paper introduces ReasoningBank, a new memory framework that distills generalizable reasoning strategies from an agent's self-judged successful and failed experiences. Unlike prior work focused on raw trajectories or only successful routines, ReasoningBank extracts high-level, transferable patterns and actionable principles, including preventative lessons from failures.
  - Closed-Loop Self-Evolution: ReasoningBank operates in a continuous closed loop. Agents retrieve relevant memories to guide current tasks, and upon task completion, new experiences are analyzed, distilled into memory items, and consolidated back into ReasoningBank. This enables agents to continuously evolve and improve their strategic capabilities over time.
- Memory-aware Test-Time Scaling (MaTTS):
  - Synergistic Scaling Dimension: The paper proposes MaTTS, a novel approach that integrates ReasoningBank with test-time scaling techniques. Instead of merely scaling the number of tasks, MaTTS focuses on scaling the depth of experience per task by allocating more computational resources to generate abundant and diverse exploration trajectories.
  - Enhanced Memory Curation: MaTTS leverages the contrastive signals arising from these diverse successful and failed trajectories to synthesize higher-quality, more generalizable memories for ReasoningBank. This creates a powerful synergy: better memory guides more effective exploration, and richer exploration yields even stronger memories, establishing memory-driven experience scaling as a new scaling dimension for agents.
- Empirical Validation and Emergent Behaviors:
  - Superior Performance: Extensive experiments on challenging benchmarks (WebArena for web browsing, Mind2Web for generalization, and SWE-Bench-Verified for software engineering) demonstrate that ReasoningBank consistently outperforms existing memory mechanisms (raw trajectories, successful routines) in both effectiveness (higher success rates) and efficiency (fewer interaction steps). MaTTS further amplifies these gains.
  - Learning from Failures: The paper shows that ReasoningBank effectively transforms failures into constructive signals, unlike baselines that either ignore failures or degrade when they are included.
  - Emergent Reasoning Strategies: Analysis reveals that ReasoningBank enables agents to develop increasingly complex, emergent reasoning strategies over time, evolving from simple procedural steps to adaptive checks and compositional strategies.
These findings collectively establish a practical pathway towards building adaptive, lifelong-learning agents that can continuously improve and generalize from their experiences in dynamic, real-world environments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contributions of this paper, it's helpful to understand a few core concepts related to large language models and agent systems:
-
Large Language Models (LLMs): These are advanced artificial intelligence models trained on massive amounts of text data, enabling them to understand, generate, and process human language. Examples include OpenAI's GPT series, Google's Gemini, and Anthropic's Claude. In the context of agents, LLMs serve as the "brain" or
policythat generates thoughts, plans, and actions. -
LLM Agents: An
LLM agentis an AI system that leverages an LLM to perform tasks in an environment. Unlike simple LLM prompting, agents can perceive their environment (receiveobservations), plan multi-step actions, execute those actions, and learn from the outcomes to achieve a goal. They interact sequentially with an environment, making decisions over time. -
Environment: For an
LLM agent, theenvironmentis the interactive system or world it operates in. This paper focuses on two main types:- Web Browsing Environments: Where the agent navigates websites, clicks elements, types text, and extracts information (e.g., WebArena, Mind2Web). The
observationsare typically text-based representations of web pages (e.g., accessibility trees), andactionsare web navigation operations (e.g.,click(),type()). - Software Engineering (SWE) Environments: Where the agent interacts with a codebase, executes commands, and modifies files to resolve software issues (e.g., SWE-Bench-Verified).
Observationsmight be code snippets, error messages, or file contents, andactionsare bash commands.
- Web Browsing Environments: Where the agent navigates websites, clicks elements, types text, and extracts information (e.g., WebArena, Mind2Web). The
-
Policy (): In reinforcement learning and agent systems, a
policydefines an agent's behavior. It's a function that maps an observed state or history of observations to an action. In this paper, theagent policyis parameterized by thebackbone LLM, conditioned on amemory moduleand theaction space. This means the LLM decides what to do based on its internal knowledge, available memory, and the possible actions it can take. -
Trajectory / Experience: A
trajectory(orexperience) is a sequence of observations and actions taken by an agent during its interaction with an environment to complete a task. It's a record of what the agent "did" and "saw." Formally, it's represented as( o _ { 0 : t } , a _ { 0 : t } )for steps, where is the observation at time and is the action. -
Test-Time Learning: This paradigm refers to agents learning and improving their capabilities during the deployment or
test phase, rather than just during a separate training phase. In this setting, tasks arrive in astreaming fashion(one after another), and the agent must adapt and evolve without access to future tasks or external ground-truth labels. It relies onself-verificationand its own past experiences. -
LLM-as-a-Judge: This is a technique where an
LLMis used to evaluate the performance or correctness of anotherLLM's output or an agent's trajectory, often without needing human-labeled ground truth. The judgeLLMis prompted with the task, the agent's actions, and the outcome, and then determines if the task was successful or failed. This is crucial for enablingself-judgementin test-time learning.
3.2. Previous Works
The paper positions ReasoningBank and MaTTS within the context of existing research on LLM agent memory and test-time scaling.
3.2.1. Memory for LLM Agents
Prior work on LLM agent memory aims to enhance agent performance by enabling them to utilize past information. This research can be categorized by the form of stored information and the mechanisms for retrieval and management:
-
Memory Content:
- Plain Text: Simple storage of raw text.
- Latent Knowledge Embeddings: Representing knowledge in a dense vector space for semantic retrieval.
- Structured Graphs: Organizing information into graph structures to capture relationships.
-
Memory Management: Involves
retrieval mechanisms (e.g., semantic search) and management strategies (e.g., updating, merging). Reinforcement learning (RL) has also been explored for memory management. More specifically, the paper highlights two common approaches that
ReasoningBank aims to improve upon (as illustrated in Figure 1):
Trajectory Memory (e.g., Synapse [Zheng et al., 2024]): These methods primarily store raw past interactions, or
trajectories, for direct reuse. While comprehensive, these trajectories can belengthy and noisy, making them less efficient and generalizable. They are akin to recording every step an agent took. The paper citesKagaya et al., 2024,Kong et al., 2025, andZheng et al., 2024as examples.- Example (from paper's Figure 1):
This shows a detailed, step-by-step log of actions and thoughts:
<action> click('188') </action>
<think> I have successfully navigated to the 'My Account' page. The user wants to know the date of their first purchase. On this page, I see a "Recent Orders" section. To ensure I find the *first* purchase, I need to view the complete order history, not just the recent ones. There is a "View All" link next to "Recent Orders" which should take me to the full order history. </think>
<action> click('1530') </action>
<think> I am on the "My Orders" page. The user wants to know the date of their first purchase. The orders are currently displayed from most recent to oldest. ...</think>
<action> click('1614') </action>
- Example (from paper's Figure 1):
-
Workflow Memory (e.g., AWM [Wang et al., 2025d]): These approaches abstract common,
successful routinesorprocedures(workflows) from past interactions. They generalize a bit more than raw trajectories by identifying sequences of actions that commonly lead to success. However, they oftenover-emphasize successful experiencesand may still lack the ability to distill higher-level, transferable reasoning patterns, especially from failures. The paper citesFang et al., 2025andWang et al., 2025das examples.- Example (from paper's Figure 1):
This represents a more abstract sequence of actions than a raw trajectory:
Workflow: Retrieve Information from Specific Page Section
<think> This workflow is designed to access a specific section .. </think>
<action> click('section_or_tab_id') </action>
<action> send_msg_to_user('extracted_info_message') </action>
- Example (from paper's Figure 1):
3.2.2. Agent Test-Time Scaling (TTS)
Test-time scaling (TTS) involves allocating additional computation during inference to boost performance. It has been widely adopted in tasks like coding and math reasoning. Common TTS methods include:
-
Best-of-N (BoN) [Chow et al., 2025]: Generating independent solutions or trajectories for a problem and then selecting the best one (e.g., using a verifier or self-evaluation).
-
Beam Search [Wu et al., 2024b]: Exploring multiple candidate sequences of actions or thoughts in parallel, expanding the most promising ones.
-
Leveraging Verifiers [Setlur et al., 2025]: Using external or self-generated mechanisms to check the correctness of intermediate steps or final solutions.
While effective for reasoning tasks,
TTSformulti-turn interactive agentic tasksremains underexplored. Existing works in this area scale different dimensions of agentic systems, such as: -
Search space for each action[Yu et al., 2025b]. -
Number of agents in multi-agent systems[Jin et al., 2025]. -
Number of interactions with the environment[Shen et al., 2025].Crucially, the paper notes that
none of these efforts considers the role of agent memory in scaling, where memory could guide future decisions and enhance the scaling process.
3.3. Technological Evolution
The evolution of LLM agents has progressed from simple prompt-response interactions to complex multi-step decision-making. Initially, agents were often stateless, treating each task as novel. The introduction of memory mechanisms marked a significant step, allowing agents to retain episodic (short-term, context-specific) or semantic (long-term, knowledge-based) information. Early memory systems focused on storing raw interaction logs or successful procedural steps.
However, the limitations of these approaches became apparent: raw logs are too detailed and non-transferable, while successful routines often miss crucial lessons from failures and struggle to generalize high-level strategies. This paper's ReasoningBank represents an evolution by moving beyond mere record-keeping to distilling generalizable reasoning strategies and explicitly incorporating lessons from failures.
Concurrently, test-time scaling has emerged as a powerful way to enhance LLM performance by allocating more compute to explore diverse solutions. This paper identifies a gap in combining TTS with agent memory, proposing MaTTS to create a synergy. This integration represents a new frontier, moving towards agents that not only learn from past experiences but also actively and strategically generate richer experiences to accelerate that learning, leading to a self-evolving capability.
3.4. Differentiation Analysis
ReasoningBank and MaTTS differentiate themselves from prior work in several key ways:
- Memory Content (What to store):
  - ReasoningBank vs. Trajectory Memory (e.g., Synapse): Instead of storing raw, low-level trajectories, ReasoningBank distills high-level, transferable reasoning patterns and strategies. This abstraction makes memories more compact, reusable, and less susceptible to noise from specific execution details.
  - ReasoningBank vs. Workflow Memory (e.g., AWM): While Workflow Memory abstracts successful routines, ReasoningBank goes a step further by extracting generalizable reasoning strategies from both successful and failed experiences. This is crucial because failures provide valuable counterfactual signals and pitfalls that help an agent understand what not to do, leading to more robust and comprehensive learning. Prior methods often neglect this aspect.
- Learning Mechanism (How to learn):
  - Self-Judged Experiences: ReasoningBank relies on self-judged successful and failed experiences (using an LLM-as-a-judge) without requiring external ground-truth labels. This enables continuous learning in real-world, dynamic environments where ground truth might be unavailable.
  - Closed-Loop Evolution: The framework establishes a closed-loop process where memories guide actions, and new experiences continuously update the memory, fostering genuine self-evolution.
- Integration with Test-Time Scaling (How to scale learning):
  - MaTTS as a Synergy: The paper is the first to explicitly explore a memory-aware test-time scaling (MaTTS) strategy. Unlike vanilla TTS, which might generate diverse but uncurated experiences, MaTTS uses ReasoningBank to guide the scaled exploration towards more promising paths and, conversely, leverages the diverse experiences from scaling to synthesize higher-quality, more generalizable memories. This creates a positive feedback loop that traditional TTS or memory-only systems lack.
  - Contrastive Signals: MaTTS explicitly utilizes contrastive signals (comparing successful vs. failed trajectories, or different exploration paths) during memory curation, which is a key innovation for distilling robust strategies.
In essence, ReasoningBank aims to create actionable, generalizable guidance for future decisions by capturing why certain strategies work or fail, rather than just what happened. MaTTS then supercharges this learning process by intelligently generating rich, diverse experiences for ReasoningBank to learn from, making the agent more capable over time.
4. Methodology
This section details the problem formulation, the design of ReasoningBank, and the proposed Memory-aware Test-Time Scaling (MaTTS).
4.1. Problem Formulation
The paper frames the problem within the context of LLM-based agents operating in a test-time learning paradigm.
4.1.1. Agent Configuration
The agent's behavior is governed by a policy
$
\pi_{\mathcal{L}}(\cdot \mid \mathcal{M}, \mathcal{A})
$
Here:
- $\mathcal{L}$ represents the backbone LLM (e.g., Gemini-2.5-flash, Claude-3.7-sonnet) that parameterizes the policy.
- $\mathcal{M}$ denotes the memory module, which in this work is ReasoningBank. Initially, $\mathcal{M}$ is empty and accumulates memories over time.
- $\mathcal{A}$ is the action space, defining the set of available actions the agent can take.
The agent's goal is to perform a task by interacting with an environment. This interaction is modeled as a sequential decision-making process. The transition function of the environment is defined as $s_{t+1} = \mathcal{T}(s_t, a_t)$, where:
- $s_t$ is the state of the environment at time $t$.
- $a_t$ is the action selected by the agent's policy at time $t$.
- $s_{t+1}$ is the new state after executing action $a_t$ from state $s_t$.
The paper focuses on two main types of tasks:
- Web Browsing Tasks: The action space includes web navigation operations (e.g., click, type, scroll). Observations are typically text-based accessibility trees of web pages, providing a structured representation of the current webpage content.
- Software Engineering (SWE) Tasks: The action space consists of bash commands. Observations can be code snippets, error messages, or file contents.
For a given task, the agent generates a trajectory of interactions $(o_{0:t}, a_{0:t})$ over $t$ steps. At each step, the agent observes $o_t$ from state $s_t$ and uses its policy to generate the next action $a_{t+1}$:
$
a_{t+1} = \pi_{\mathcal{L}}(o_{0:t}, a_{0:t}; \mathcal{M}, \mathcal{A})
$
Crucially, the memory module $\mathcal{M}$ contributes relevant memories as additional system instructions to the LLM policy, guiding its decision-making. A minimal sketch of this loop follows below.
4.1.2. Test-Time Learning
The paper operates under a test-time learning paradigm (Wang et al., 2025c; Wu et al., 2024a). This means:
-
Streaming Task Queries: A sequence of tasks arrives sequentially. The agent must process each task one by one, without knowing future tasks.
- No Ground Truth: During test time, no ground-truth labels or external supervision are available to tell the agent whether it succeeded or failed.
- Continuous Evolution: The agent must continually evolve and improve its capabilities by solely leveraging its own past trajectories and any form of self-verification.
This setup poses two main challenges:
- Memory Extraction and Preservation: How to effectively extract useful, generalizable information from past trajectories and store it in memory.
- Memory Utilization: How to effectively retrieve and leverage such memory for future queries to avoid repeating errors or rediscovering known strategies.
4.2. ReasoningBank
ReasoningBank is designed to address the limitations of storing raw, lengthy, and noisy trajectories by distilling useful strategies and reasoning hints into structured, reusable memory items.
The overall architecture of ReasoningBank is depicted in Figure 2.
The following figure (Figure 2 from the original paper) shows the overall architecture of ReasoningBank:
Figure 2 | Overview of ReasoningBank. Experiences are distilled into structured memory items with a title, description, and content. For each new task, the agent retrieves relevant items to interact with the environment, and constructs new ones from both successful and failed trajectories. These items are then consolidated into ReasoningBank, forming a closed-loop memory process.
4.2.1. Memory Schema
Memory items in ReasoningBank are structured knowledge units abstracted from past experiences. They focus on preserving transferable reasoning patterns and strategies rather than low-level execution details. Each memory item comprises three components:
-
Title: A concise identifier summarizing the core strategy or reasoning pattern (e.g., "Navigation Strategy").
-
Description: A brief one-sentence summary of the memory item.
-
Content: Detailed distilled reasoning steps, decision rationales, or operational insights extracted from past experiences.
This structured format makes memory items both
human-interpretable (understandable) and machine-usable (easy for the agent to integrate and act upon), facilitating efficient reuse. An illustrative example of one such item is sketched below.
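The following is a purely illustrative memory item in the title/description/content schema. The wording is invented for this analysis (loosely inspired by the "first purchase" trajectory in Figure 1) and is not taken from the paper's actual memory bank.

```python
# Hypothetical memory item following the {title, description, content} schema.
example_item = {
    "title": "Prefer full listings over recent-items widgets",
    "description": "Check complete history pages before answering 'first/earliest' questions.",
    "content": (
        "When a query asks about the earliest or first occurrence of something, "
        "navigate from summary widgets (e.g., 'Recent Orders') to the full listing "
        "('View All') and paginate or sort to the oldest entry before answering."
    ),
}
```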
4.2.2. Integration of ReasoningBank with Agents
The integration of ReasoningBank with an agent follows a closed-loop process with three steps: Memory Retrieval, Memory Construction, and Memory Consolidation.
4.2.2.1. Memory Retrieval
When facing a new task (query context):
- Querying ReasoningBank: The agent queries ReasoningBank with the current query context.
- Similarity Search: It identifies the top-k most relevant experiences and their corresponding memory items using embedding-based similarity search. This involves:
  - Embedding Tasks: Each task query is embedded into a vector space using a pre-trained model (e.g., gemini-embedding-001).
  - Cosine Distance: Similarity between the current task's embedding and stored memory item embeddings is calculated using cosine distance.
- Instruction Injection: The retrieved top-k memory items are concatenated into the agent's system instruction (prompt). This ensures that the agent's current decision-making process is informed and guided by relevant past experiences. The prompt includes the title and content of each retrieved memory item, along with an instruction for the agent to explicitly consider these items. A minimal retrieval sketch follows below.
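Below is a minimal sketch of the retrieval step: embed the query, rank stored items by cosine similarity, take the top-k, and format them into a system instruction. The `embed` function is a deterministic stand-in; a real system would call an embedding model such as gemini-embedding-001.

```python
import hashlib
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: hash-seeded random unit vector (replace with a real API call)."""
    seed = int(hashlib.md5(text.encode()).hexdigest(), 16) % (2**32)
    v = np.random.default_rng(seed).normal(size=64)
    return v / np.linalg.norm(v)

def retrieve(query: str, memory: list[dict], k: int = 5) -> list[dict]:
    """Rank stored memory items by cosine similarity to the task query and keep top-k."""
    q = embed(query)
    scored = [(float(np.dot(q, embed(m["title"] + " " + m["description"]))), m) for m in memory]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [m for _, m in scored[:k]]

def build_system_instruction(items: list[dict]) -> str:
    """Concatenate retrieved items into the prompt, as described above."""
    lines = ["Consider the following strategies distilled from past experience:"]
    for m in items:
        lines.append(f"### {m['title']}\n{m['content']}")
    return "\n".join(lines)
```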
4.2.2.2. Memory Construction
After the current task is completed, new memory items are extracted from the agent's recent experience. This process involves:
-
Correctness Signals (LLM-as-a-judge): Since no ground truth is available, the agent first obtains proxy signals for the correctness of its completed trajectories. This is done using an LLM-as-a-judge (Gu et al., 2024), which labels the outcome as success or failure based on the query and the agent's trajectory. The system instructions for obtaining these binary signals are shown in Figure 9 of the paper.
Figure 9 | System instructions for obtaining binary signals indicating success or failure of the current trajectory. This prompt guides the LLM-as-a-judge to output "Success" or "Failure" based on the User Intent, Trajectory, Final State of the Webpage, and Bot Response.
- Extraction Strategies: Based on the success/failure signal, different strategies are applied for memory extraction:
  - Successful Experiences: Contribute validated strategies, highlighting effective approaches. The system instruction for extracting memory items from successful trajectories emphasizes analyzing why the trajectory succeeded and summarizing transferable reasoning strategies.
  - Failed Experiences: Supply counterfactual signals and pitfalls, helping to identify common mistakes and sharpen guardrails. The system instruction for extracting memory items from failed trajectories requires reflecting on the causes of failure and articulating lessons or preventive strategies. The system instructions for extracting memory items are shown in Figure 8 of the paper.
Figure 8 | System instructions for extracting memory items from agent trajectories: the left panel targets successful trajectories (summarizing why they succeed), while the right targets failed trajectories (reflecting on failure and deriving lessons). In both cases, the output is constrained to at most three memory items in a structured Markdown format, ensuring conciseness, non-redundancy, and generalizability (i.e., not tied to specific websites or queries).
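A hedged sketch of the construction step follows: an LLM-as-a-judge labels the finished trajectory, then a success- or failure-specific prompt distills at most three memory items. `call_llm`, the prompt strings, and the JSON output format are illustrative assumptions, not the paper's exact prompts (those are in Figures 8 and 9).

```python
import json

def call_llm(prompt: str, temperature: float = 1.0) -> str:
    """Placeholder for a backbone LLM call; returns an empty item list by default."""
    return '{"items": []}'

def judge_trajectory(query: str, trajectory: str) -> bool:
    """Proxy correctness signal from an LLM-as-a-judge (temperature 0 for determinism)."""
    verdict = call_llm(
        f"User intent: {query}\nTrajectory: {trajectory}\n"
        "Answer strictly with 'Success' or 'Failure'.",
        temperature=0.0,
    )
    return verdict.strip().lower().startswith("success")

def construct_memory(query: str, trajectory: str) -> list[dict]:
    """Distill at most three memory items, using a success- or failure-specific instruction."""
    if judge_trajectory(query, trajectory):
        instruction = ("Explain why this trajectory succeeded and summarize up to three "
                       "transferable reasoning strategies.")
    else:
        instruction = ("Reflect on why this trajectory failed and derive up to three "
                       "preventative lessons or guardrails.")
    raw = call_llm(
        f"{instruction}\nQuery: {query}\nTrajectory: {trajectory}\n"
        'Return JSON: {"items": [{"title": ..., "description": ..., "content": ...}]}'
    )
    return json.loads(raw).get("items", [])
```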
4.2.2.3. Memory Consolidation
Finally, the newly constructed memory items are consolidated into ReasoningBank.
- Simple Addition: The paper adopts a minimal consolidation strategy, where newly generated items are directly added to the memory pool. This approach highlights the pure contribution of ReasoningBank without the complexities of advanced consolidation algorithms (e.g., merging, pruning, forgetting).
- Evolving Repository: This continuous addition maintains an evolving repository of memory items, allowing the agent to continuously learn and adapt over time.
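Because consolidation is add-only, the implementation can be very small. The sketch below appends new items to a persisted pool; the JSON layout and file name are assumptions for illustration (the implementation details later note that memory is kept in JSON and persisted across runs).

```python
import json
import pathlib

def consolidate(new_items: list[dict], path: str = "reasoningbank.json") -> None:
    """Add-only consolidation: append new memory items and persist the pool."""
    p = pathlib.Path(path)
    pool = json.loads(p.read_text()) if p.exists() else []
    pool.extend(new_items)          # no merging, pruning, or forgetting
    p.write_text(json.dumps(pool, indent=2))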
4.3. MATTS: Memory-aware Test-Time Scaling
Memory-aware Test-Time Scaling (MaTTS) is introduced to establish a powerful synergy between memory and test-time scaling. It aims to accelerate and diversify the learning process by intelligently scaling up the agent's interaction experience. The core idea is to leverage the abundant successful and failed trajectories generated during scaling for more effective memory curation, rather than simply converting more trajectories into more memory items independently (which is the "vanilla TTS" approach).
The following figure (Figure 3 from the original paper) compares vanilla TTS and MaTTS:
Figure 3 | Comparison of (a) vanilla TTS and MaTTS with (b) parallel scaling, where self-contrast across multiple trajectories curates reliable memory, and (c) sequential scaling, where self-refinement enriches memory with intermediate reasoning signals.
MaTTS is instantiated in two complementary settings: parallel scaling and sequential scaling. A scaling factor $k$ denotes the number of trajectories for parallel scaling or the number of refinement steps for sequential scaling.
4.3.1. Parallel Scaling
In parallel scaling, the agent generates multiple trajectories ($k$ trajectories) for the same query under the initial guidance of retrieved memory items.
- Diverse Exploration: This process promotes diverse exploration by allowing the agent to attempt the task in several different ways.
- Self-Contrast: The key innovation is to leverage self-contrast (Chen et al., 2020) across these multiple generated trajectories. By comparing and contrasting the successful and failed attempts, the agent can identify reliable patterns that lead to success and pitfalls that result in spurious solutions.
- Reliable Memory Curation: This provides rich contrastive signals for ReasoningBank to synthesize more reliable and generalizable memory items from multiple trials of a single query. The system instructions for memory-aware test-time scaling are shown in Figure 10 of the paper; the left panel shows the parallel scaling instruction, where the model is prompted to compare and contrast multiple trajectories to identify useful strategies and common mistakes. A code-level sketch of this procedure follows the figure caption below.
The following figure (Figure 10 from the original paper) illustrates the system instructions for MaTTS:
Figure 10 | System instructions for memory-aware test-time scaling: the left panel shows parallel scaling (comparing multiple trajectories to extract generalizable insights), while the right panel shows sequential scaling (iteratively re-checking a trajectory to refine the final answer).
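The sketch below illustrates memory-aware parallel scaling: roll out $k$ attempts for the same query, label each with the judge, then build a contrastive prompt over successes and failures before extracting memory items. The callables `rollout_fn`, `judge_fn`, and `extract_fn` are caller-supplied stand-ins (e.g., the episode loop, judge, and extractor sketched earlier), and the prompt wording is an assumption rather than the paper's Figure 10 prompt.

```python
def parallel_matts(query, rollout_fn, judge_fn, extract_fn, k=3):
    """Memory-aware parallel scaling sketch.

    rollout_fn(query) -> trajectory text (one full attempt under retrieved memory)
    judge_fn(query, trajectory) -> bool (LLM-as-a-judge proxy label)
    extract_fn(prompt) -> distilled memory items (LLM call with a contrastive prompt)
    """
    rollouts = []
    for _ in range(k):                                  # k diverse attempts at the same query
        trajectory = rollout_fn(query)
        rollouts.append((trajectory, judge_fn(query, trajectory)))
    contrast_prompt = (
        "Compare and contrast the trajectories below for the same task. Identify "
        "strategies shared by successful attempts and pitfalls behind failures, "
        "then distill at most three transferable memory items.\n"
        + "\n---\n".join(f"[{'success' if ok else 'failure'}]\n{t}" for t, ok in rollouts)
    )
    return extract_fn(contrast_prompt)                  # items are then consolidated
```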
4.3.2. Sequential Scaling
In sequential scaling, the agent iteratively refines its reasoning within a single trajectory after an initial completion, following the principle of self-refinement (Madaan et al., 2023).
- Iterative Refinement: After an initial attempt, the agent is prompted to re-examine its own trajectory and intermediate notes (e.g., thoughts, plans) generated during the process.
- Enriching Memory with Intermediate Signals: These intermediate notes (reasoning attempts, corrections, insights) are valuable signals for memory construction. They capture the agent's internal thought process, identifying where it struggled, made mistakes, or found breakthroughs, even if not explicitly part of the final solution.
- Consistency and Correction: The iterative re-checking ensures consistency and allows for corrections, enriching the memory with a deeper understanding of the problem-solving process. The right panel of Figure 10 shows the instruction for sequential scaling, where the agent iteratively re-examines its previous trajectory and reasoning steps, correcting inconsistencies and confirming its final answer. A minimal sketch of this refinement loop follows below.
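A corresponding sketch of memory-aware sequential scaling is given below: after an initial rollout, the agent is asked $k-1$ times to re-examine its trajectory, and the intermediate critiques are kept as extra signals for memory construction. `refine_fn` and the refinement prompt are illustrative assumptions.

```python
def sequential_matts(query, initial_trajectory, refine_fn, k=3):
    """Memory-aware sequential scaling sketch.

    refine_fn(prompt) -> critique/correction text from the backbone LLM.
    Returns the refined trajectory plus the intermediate notes; both feed
    memory construction.
    """
    trajectory, notes = initial_trajectory, []
    for _ in range(k - 1):                              # k - 1 self-refinement passes
        critique = refine_fn(
            "Re-examine the trajectory below for this task. Point out inconsistencies, "
            "correct mistakes, and confirm or revise the final answer.\n"
            f"Task: {query}\nTrajectory: {trajectory}"
        )
        notes.append(critique)                          # intermediate reasoning signals
        trajectory += "\n[refinement] " + critique
    return trajectory, notes
```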
4.3.3. Synergy between Memory and Test-Time Scaling
MaTTS creates a positive feedback loop:
-
Memory Guides Scaling: High-quality memory items retrieved from ReasoningBank provide better initial guidance to the agent, steering the scaled exploration (both parallel and sequential) towards more promising paths and reducing wasted computation.
- Scaling Enriches Memory: The diverse and abundant experiences (trajectories, intermediate signals) generated through MaTTS provide richer contrastive signals for ReasoningBank to synthesize higher-quality, more generalizable memory.
This synergy positions memory-driven experience scaling as a new and powerful dimension for agent improvement, where the agent's ability to learn and its ability to explore effectively mutually reinforce each other.
5. Experimental Setup
The paper evaluates ReasoningBank and MaTTS on challenging benchmarks across web browsing and software engineering domains.
5.1. Datasets
The experiments are conducted on three agentic datasets:
5.1.1. WebArena
- Description: A benchmark for
general web navigationacross diverse real-world domains (Zhou et al., 2024). It involves tasks requiring agents to interact with web pages to achieve specific goals. - Domains/Subsets:
  - Shopping (187 instances): Tasks related to e-commerce websites.
  - Admin (182 instances): Tasks involving administrative interfaces.
  - Gitlab (180 instances): Tasks related to software development platforms.
  - Reddit (106 instances): Tasks involving forum navigation.
  - Multi (29 instances): Tasks that require transferring memory across multiple websites.
- Scale: 684 test instances in total.
- Data Sample Example (Conceptual): A task might be "Find the price of a specific product on an e-commerce site and add it to your cart," or "Change a user's permission setting in an admin panel." The agent receives the initial webpage state (accessibility tree) and the task instruction.
5.1.2. Mind2Web
- Description: A benchmark designed to test the
generalization capabilitiesof agents on versatile operations and environments (Deng et al., 2023). It assesses how well agents can perform tasks incross-task,cross-website, andcross-domainsettings. - Settings:
  - Cross-Task (252 instances): Generalization to new tasks within familiar websites.
  - Cross-Website (177 instances): Generalization to familiar tasks on new websites.
  - Cross-Domain (912 instances): Generalization to new tasks on new websites from entirely different domains.
- Scale: 1341 test instances in total.
- Data Sample Example (Conceptual): A task could be "Book a flight from city A to city B" on a travel website. A
cross-website challenge might be doing the same task on a different travel website, while a cross-domain challenge might be applying skills learned from flight booking to, say, ordering food online.
5.1.3. SWE-Bench-Verified
- Description: A
repository-level issue resolution benchmark for agentic coding tasks (Jimenez et al., 2024). It involves fixing software bugs described in GitHub issues.
patch(code changes) that resolves the underlying bug, such that all provided test scripts execute successfully. - Scale: 500 high-quality, manually verified test instances.
- Data Sample Example (Conceptual): A task might be a GitHub issue description like "Bug: When submitting form, validation error message is not displayed for empty 'email' field." The agent interacts with a bash environment, reads code, modifies files, runs tests, and submits a patch.
5.2. Evaluation Metrics
The paper uses a set of metrics tailored to each benchmark to assess effectiveness (success rate) and efficiency (number of interaction steps).
5.2.1. WebArena Metrics
-
Success Rate (SR):
- Conceptual Definition: Measures the percentage of user queries that are successfully resolved by the agent. It quantifies the agent's ability to complete tasks correctly.
- Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successfully resolved queries}}{\text{Total number of queries}} \times 100\% $
- Symbol Explanation:
Number of successfully resolved queries: The count of tasks where the agent's final output or state satisfies the task requirements.Total number of queries: The total number of tasks attempted in the benchmark.
- Measurement: Evaluation employs
LLM-based fuzzy matchingandexact string matchingto verify if essential answer terms appear in the agent's predictions.
-
Steps:
- Conceptual Definition: Represents the average number of interaction steps (actions) taken by the agent to complete each query. It quantifies the computational and interaction cost, serving as an efficiency metric.
- Mathematical Formula: $ \mathrm{Steps} = \frac{\sum_{i=1}^{N} \text{Steps}_i}{\text{Number of completed queries}} $
- Symbol Explanation:
- $N$: Total number of queries.
- $\text{Steps}_i$: The number of interaction steps taken for query $i$.
Number of completed queries: The total number of queries for which the agent either succeeded or failed (i.e., did not time out).
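As a quick worked example of the two metrics above, using made-up per-query records (all three queries here run to completion):

```python
# Hypothetical per-query outcomes: success flag and number of interaction steps.
records = [
    {"success": True,  "steps": 6},
    {"success": False, "steps": 12},
    {"success": True,  "steps": 8},
]
sr = 100.0 * sum(r["success"] for r in records) / len(records)   # 2/3 -> 66.7%
avg_steps = sum(r["steps"] for r in records) / len(records)      # 26/3 -> 8.67
print(f"SR = {sr:.1f}%, Steps = {avg_steps:.2f}")
```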
5.2.2. Mind2Web Metrics
Mind2Web tasks have a predefined fixed number of steps. At each step, the agent predicts an action, and specific metrics are used:
-
Element Accuracy (EA):
- Conceptual Definition: Measures if the agent correctly selects the target page element for an action at a given step. It quantifies the agent's ability to visually and semantically identify the correct interactive component on a web page.
- Mathematical Formula: $ \mathrm{EA} = \frac{\text{Number of correctly selected elements}}{\text{Total number of steps}} \times 100\% $
- Symbol Explanation:
Number of correctly selected elements: The count of steps where the agent's chosen element matches the ground-truth target element.Total number of steps: The total number of action steps across all tasks.
-
Action F1 (AF1):
- Conceptual Definition: Measures the correctness of the action type taken on a selected element. It's an F1 score, balancing precision and recall for action labels.
- Mathematical Formula: $ \mathrm{Precision} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Positives}} $, $ \mathrm{Recall} = \frac{\text{True Positives}}{\text{True Positives} + \text{False Negatives}} $, $ \mathrm{AF1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} \times 100\% $
- Symbol Explanation:
True Positives: Correctly predicted action types.False Positives: Incorrectly predicted action types (predicted when not ground truth).False Negatives: Missed ground-truth action types (not predicted when ground truth).
-
Step Success Rate (SSR):
- Conceptual Definition: Checks if both the correct element is selected AND the correct action is taken on it at a given step. It's a stricter per-step metric.
- Mathematical Formula: $ \mathrm{SSR} = \frac{\text{Number of steps with correct element and action}}{\text{Total number of steps}} \times 100\% $
- Symbol Explanation:
Number of steps with correct element and action: Count of steps where both Element Accuracy and Action F1 criteria are met for that specific step.Total number of steps: The total number of action steps across all tasks.
-
Task-level Success Rate (SR):
- Conceptual Definition: Measures if all intermediate steps for a given task are successfully conducted. A task is considered successful only if every single step within that task achieves a Step Success Rate of 1.0.
- Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of tasks with all steps successfully completed}}{\text{Total number of tasks}} \times 100\% $
- Symbol Explanation:
Number of tasks with all steps successfully completed: The count of tasks where theSSRfor every step in the task is 1.0.Total number of tasks: The total number of tasks in the benchmark.
5.2.3. SWE-Bench-Verified Metrics
-
Resolve Rate:
- Conceptual Definition: The primary evaluation metric for SWE-Bench-Verified, measuring the percentage of issues where the agent's submitted patch successfully passes all provided test scripts.
- Mathematical Formula: $ \text{Resolve Rate} = \frac{\text{Number of issues with passing patch}}{\text{Total number of issues}} \times 100\% $
- Symbol Explanation:
Number of issues with passing patch: The count of software issues for which the agent generated a code patch that successfully resolved the issue (i.e., all tests pass).Total number of issues: The total number of issues in the benchmark.
-
Steps: (Same definition as WebArena Steps metric, but applied to SWE tasks)
- Conceptual Definition: The average number of interaction steps (bash commands, thought processes) taken by the agent to attempt to resolve each issue.
- Mathematical Formula: $ \mathrm{Steps} = \frac{\sum_{i=1}^{N} \text{Steps}_i}{\text{Number of completed issues}} $
- Symbol Explanation:
- $N$: Total number of issues.
- $\text{Steps}_i$: The number of interaction steps taken for issue $i$.
Number of completed issues: The total number of issues for which the agent completed its process (not necessarily successful).
5.2.4. Pass@k (for MATTS)
- Conceptual Definition: Used in the context of parallel scaling, Pass@k measures the probability of finding at least one correct solution among $k$ independently generated solutions (trajectories). Pass@1 refers to the success rate of a single trajectory. Best-of-N (BoN) is essentially equivalent to Pass@N when $N$ trajectories are generated and the best one is selected.
- Mathematical Formula (General): If $P_s$ is the success probability of a single attempt, then the probability of at least one success in $k$ attempts is
$
\mathrm{Pass}@k = 1 - (1 - P_s)^k
$
In practice, BoN is calculated by generating $N$ trajectories and checking whether any of them are successful:
$
\mathrm{BoN} = \frac{\text{Number of tasks where at least one of } N \text{ trajectories succeeded}}{\text{Total number of tasks}} \times 100\%
$
- Symbol Explanation:
  - $P_s$: The success rate of a single, randomly chosen trajectory.
  - $k$ (or $N$): The number of independent trajectories generated.
  - Number of tasks where at least one of $N$ trajectories succeeded: The count of tasks where at least one of the parallel executions led to a successful outcome.
  - Total number of tasks: The total number of tasks attempted.
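The short example below shows how Pass@1 and BoN differ on the same set of rollouts. The per-task outcome flags are invented for illustration.

```python
import random

# Hypothetical outcomes: each inner list holds the success flags of N=3 parallel trajectories.
outcomes = [[True, False, False], [False, False, False], [True, True, False]]

bon = 100.0 * sum(any(task) for task in outcomes) / len(outcomes)            # 2/3 -> 66.7%
pass_at_1 = 100.0 * sum(random.choice(task) for task in outcomes) / len(outcomes)
print(f"BoN = {bon:.1f}%, Pass@1 (one random trajectory per task) = {pass_at_1:.1f}%")
```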
5.3. Baselines
The paper compares ReasoningBank against several representative memory-augmented approaches and a memory-free baseline:
-
No Memory (Vanilla):
- Description: This baseline represents the
backbone LLM agent operating without any explicit memory module. Each task is approached in isolation, relying solely on the LLM's inherent knowledge and the current task prompt.
- Description: This baseline represents the
-
Synapse (Zheng et al., 2024):
- Description: A
trajectory-based memory approach. It organizes and reuses past raw trajectories as in-context memory for new tasks. This means the agent might be provided with examples of how similar tasks were solved in the past, including all the detailed steps.
- Description: A
-
AWM (Agent Workflow Memory) (Wang et al., 2025d):
-
Description: A more abstract memory mechanism compared to
Synapse. It distills common successful routines or workflows from past trajectories into reusable procedures. These workflows abstract away some low-level details, focusing on sequences of successful high-level actions.
Purpose: Represents methods that generalize from successful patterns, but typically do not explicitly leverage failures.
These baselines provide a comprehensive comparison, spanning from agents without memory, to those reusing raw trajectories, and finally to methods that distill higher-level structures, allowing for a clear evaluation of
ReasoningBank's innovations.
-
5.4. Implementation Details
-
Backbone LLMs:
- Google Gemini-2.5-Flash (Comanici et al., 2025)
- Google Gemini-2.5-Pro (Comanici et al., 2025)
- Anthropic Claude-3.7-Sonnet (Anthropic, 2025)
- These LLMs are accessed via the
Vertex AI API. This allows for investigation of performance across different LLM families (Gemini, Claude) and model sizes/capabilities (Flash, Pro).
-
Execution Environments:
- Web Browsing (WebArena, Mind2Web):
BrowserGym (de Chezelles et al., 2025) is used as the execution environment. A maximum step limit of 30 is set per query to prevent infinite loops.
bash-only environmentwith no specialized tools or scaffold structures, following the setup ofminiSWE-Agent(Yang et al., 2024).
- Web Browsing (WebArena, Mind2Web):
-
Agent Style: The agent is implemented in a
ReAct (Reasoning and Acting) style(Yao et al., 2023). This means the agent alternates betweenthinking(generating internal reasoning steps) andacting(executing an action in the environment) until it predicts a stop action or a task termination condition is met. -
Decoding Configurations:
- Web Browsing (WebArena, Mind2Web): A
decoding temperature of 0.7is used for model generations. This allows for some creativity and diversity in agent actions and thoughts. - Memory Extraction: For memory extraction, the backbone LLM of the extractor is set to the same as the agent system, with a
temperature of 1.0to encourage diverse memory item generation. - LLM-as-a-Judge: For the
LLM-as-a-judge(for success/failure signals), the backbone LLM is also the same, but thedecoding temperature is set to 0.0to ensuredeterminismin its judgment. - Best-of-N Calculation: For selecting the best trajectory among candidates, an LLM (same backbone as the agent) is used with a carefully curated prompt.
- Web Browsing (WebArena, Mind2Web): A
-
Memory Details:
- Embedding Model:
gemini-embedding-001(Lee et al., 2025) is used for embedding task queries formemory retrieval. - Similarity Search:
Cosine distanceis used for similarity search over the memory pool. - Retrieval Quantity:
Top-krelevant memory items are retrieved, with a default (further analyzed in ablations). - Memory Storage:
ReasoningBankis maintained inJSON format, with each entry containing the task query, original trajectory, and corresponding memory items (structured as{title, description, content}). Embeddings are pre-computed and stored in a separate JSON for efficient search. Memory ispersistedacross independent runs forcontinual accumulation.
- Embedding Model:
6. Results & Analysis
This section presents the experimental results for ReasoningBank and MaTTS, analyzing their effectiveness, efficiency, and the synergy between memory and scaling.
6.1. Core Results Analysis
The paper first demonstrates the consistent outperformance of ReasoningBank over baselines across various benchmarks and LLM backbones.
6.1.1. WebArena Benchmark Results
The following are the results from Table 1 of the original paper:
| Models | Shopping (187) SR | Shopping Step | Admin (182) SR | Admin Step | Gitlab (180) SR | Gitlab Step | Reddit (106) SR | Reddit Step | Multi (29) SR | Multi Step | Overall (684) SR | Overall Step |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-flash | ||||||||||||
| No Memory | 39.0 | 8.2 | 44.5 | 9.5 | 33.9 | 13.3 | 55.7 | 6.7 | 10.3 | 10.0 | 40.5 | 9.7 |
| Synapse | 40.6 | 7.0 | 45.1 | 9.1 | 35.6 | 13.0 | 59.4 | 6.5 | 10.3 | 10.5 | 42.1 | 9.2 |
| AWM | 44.4 | 7.0 | 46.7 | 8.8 | 37.2 | 13.2 | 62.3 | 6.1 | 3.4 | 7.7 | 44.1 | 9.0 |
| ReasoningBank | 49.7 | 6.1 | 51.1 | 8.2 | 40.6 | 12.3 | 67.0 | 5.6 | 13.8 | 8.8 | 48.8 | 8.3 |
| Gemini-2.5-pro | ||||||||||||
| No Memory | 45.5 | 7.6 | 51.1 | 8.7 | 35.0 | 11.6 | 71.7 | 6.0 | 6.9 | 8.8 | 46.7 | 8.8 |
| Synapse | 46.5 | 6.6 | 52.2 | 8.9 | 38.3 | 11.3 | 68.9 | 5.9 | 6.9 | 9.0 | 47.7 | 8.5 |
| AWM | 48.1 | 6.4 | 49.3 | 9.8 | 40.0 | 11.2 | 68.9 | 6.4 | 3.4 | 9.3 | 47.6 | 8.7 |
| ReasoningBank | 51.9 | 6.0 | 56.6 | 7.7 | 44.4 | 9.8 | 80.2 | 5.1 | 13.8 | 8.2 | 53.9 | 7.4 |
| Claude-3.7-sonnet | ||||||||||||
| No Memory | 38.5 | 6.1 | 49.5 | 8.4 | 36.7 | 10.6 | 53.8 | 5.5 | 0.0 | 11.6 | 41.7 | 8.0 |
| Synapse | 39.6 | 5.8 | 50.5 | 8.5 | 38.0 | 10.0 | 53.8 | 6.1 | 0.0 | 11.8 | 42.6 | 7.9 |
| AWM | 39.6 | 7.2 | 47.8 | 9.3 | 34.6 | 10.9 | 52.8 | 7.0 | 0.0 | 12.4 | 40.8 | 8.9 |
| ReasoningBank | 44.9 | 5.6 | 53.3 | 7.6 | 41.1 | 9.5 | 57.5 | 5.2 | 3.4 | 10.5 | 46.3 | 7.3 |
- Consistent Outperformance: ReasoningBank consistently achieves higher Success Rates (SR) across all LLM backbones (Gemini-2.5-flash, Gemini-2.5-pro, Claude-3.7-sonnet) and almost all WebArena subsets compared to No Memory, Synapse, and AWM. For instance, with Gemini-2.5-flash, it improves the Overall SR from 40.5% (No Memory) to 48.8%. With Gemini-2.5-pro, the Overall SR increases from 46.7% to 53.9%. This highlights its robustness and general applicability.
- Enhanced Generalization (Multi subset): On the challenging Multi subset, which requires transferring memory across different websites, ReasoningBank shows significant gains (e.g., +3.5% with Gemini-2.5-flash, +6.9% with Gemini-2.5-pro, +3.4% with Claude-3.7-sonnet compared to No Memory). Notably, strong baselines like AWM even show degradation (e.g., 3.4% SR for AWM with Gemini-2.5-flash vs. 10.3% for No Memory). This indicates that ReasoningBank curates more robust and transferable memory items.
- Superior Efficiency: Beyond effectiveness, ReasoningBank also reduces the average number of Steps needed to complete tasks. On WebArena, it lowers the average step count by up to 1.4 (vs. No Memory) and 1.6 (vs. other memory baselines). This suggests that ReasoningBank helps agents find solutions more directly and efficiently by reusing refined reasoning knowledge, avoiding redundant exploration.
- Distinct Memory Source: The paper attributes ReasoningBank's superior performance to its extraction strategy, which distills memory from both successful and failed experiences, unlike Synapse and AWM, which rely on a narrower, success-only memory source.
6.1.2. Mind2Web Benchmark Results
The following are the results from Table 3 of the original paper:
| Models | Cross-Task (252) EA | AF1 | SSR | SR | Cross-Website (177) EA | AF1 | SSR | SR | Cross-Domain (912) EA | AF1 | SSR | SR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Gemini-2.5-flash | ||||||||||||
| No Memory | 46.0 | 59.1 | 40.3 | 3.3 | 39.8 | 45.1 | 31.7 | 1.7 | 35.8 | 37.9 | 31.9 | 1.0 |
| Synapse | 47.0 | 59.5 | 41.2 | 3.5 | 40.3 | 46.0 | 32.1 | 1.9 | 36.3 | 38.5 | 32.4 | 1.1 |
| AWM | 46.3 | 56.1 | 41.0 | 3.5 | 39.1 | 42.2 | 31.7 | 2.1 | 33.3 | 36.5 | 30.1 | 0.7 |
| ReasoningBank | 52.1 | 60.4 | 44.9 | 4.8 | 44.3 | 52.6 | 33.9 | 2.3 | 40.6 | 41.3 | 36.6 | 1.6 |
| Gemini-2.5-pro | ||||||||||||
| No Memory | 49.3 | 60.2 | 44.4 | 3.5 | 41.2 | 49.8 | 34.8 | 3.4 | 37.9 | 37.7 | 35.0 | 1.4 |
| Synapse | 50.1 | 61.0 | 44.7 | 3.6 | 41.8 | 51.2 | 35.0 | 3.2 | 38.5 | 39.8 | 35.6 | 1.5 |
| AWM | 48.6 | 61.2 | 44.4 | 3.7 | 41.9 | 47.9 | 34.8 | 2.3 | 37.3 | 38.1 | 34.4 | 1.2 |
| ReasoningBank | 53.6 | 62.7 | 45.6 | 5.1 | 46.1 | 54.8 | 36.9 | 3.8 | 42.8 | 45.2 | 38.1 | 1.7 |
- Generalization Performance: ReasoningBank consistently improves success rates (SR) across all generalization settings (Cross-Task, Cross-Website, Cross-Domain) on Mind2Web.
- Pronounced Gains in Cross-Domain: The gains are particularly significant in the Cross-Domain setting (e.g., +0.6% SR for Gemini-2.5-flash and +0.3% SR for Gemini-2.5-pro compared to Synapse), which demands the highest level of generalization. This reinforces that ReasoningBank's curated memory is more robust and transferable, enabling agents to apply learned strategies to truly novel scenarios.
6.1.3. SWE-Bench-Verified Results
The following are the results from Table 2 of the original paper:
| Methods | Resolve Rate | Step |
| Gemini-2.5-flash | ||
| No Memory | 34.2 | 30.3 |
| Synapse | 35.4 | 30.7 |
| ReasoningBank | 38.8 | 27.5 |
| Gemini-2.5-pro | ||
| No Memory | 54.0 | 21.1 |
| Synapse | 53.4 | 21.0 |
| ReasoningBank | 57.4 | 19.8 |
- Robustness on SWE Tasks:
ReasoningBank demonstrates its robustness on SWE-Bench-Verified for repository-level issue resolution tasks. It consistently achieves higher Resolve Rates (e.g., 38.8% vs. 34.2% for No Memory with Gemini-2.5-flash, and 57.4% vs. 54.0% for No Memory with Gemini-2.5-pro) and reduces Steps (e.g., 27.5 vs. 30.3 with Gemini-2.5-flash, and 19.8 vs. 21.1 with Gemini-2.5-pro). This confirms its efficacy beyond web browsing, suggesting broad applicability.
6.2. Results of MATTS
The paper further investigates the impact of MaTTS on ReasoningBank's performance, focusing on the WebArena-Shopping subset with Gemini-2.5-flash.
The following figure (Figure 4 from the original paper) shows the effect of scaling factor k for MATTS:
Figure 4 | Effect of scaling factor k for MaTTS with ReasoningBank on the WebArena-Shopping subset. We compare (a) parallel and (b) sequential test-time scaling.
- Performance Boost from Scaling: Both parallel scaling and sequential scaling generally boost performance as the scaling factor increases, confirming the benefit of allocating more inference-time computation. With MaTTS, parallel scaling increases from 49.7% (k=1) to 55.1% (k=5), and sequential scaling rises from 49.7% to 54.5%.
- MaTTS Superiority over Vanilla TTS: MaTTS (which integrates ReasoningBank and memory-aware aggregation) consistently surpasses vanilla TTS (labeled as MaTTS w/o aggregation in the figure). At k=5, MaTTS achieves 55.1% in parallel scaling compared to 52.4% for vanilla TTS, and 54.5% versus 51.9% in sequential scaling. This indicates that the memory-aware coordination and aggregation within MaTTS are crucial for effectively leveraging scaling.
- Memory's Role in Scaling: For the baseline MaTTS w/o memory, the gains from scaling are smaller and less consistent (e.g., parallel scaling fluctuates between 39.0% and 42.2%). This highlights that memory is essential for making scaling truly effective, guiding the agent toward more promising solutions.
- Parallel vs. Sequential Scaling: Sequential scaling shows higher gains at small k with ReasoningBank (initially matching or slightly exceeding parallel scaling). However, its benefit saturates quickly as further refinements yield little new insight. Parallel scaling provides diverse rollouts that allow for critique and improvement, leading it to surpass sequential scaling at larger k (e.g., 55.1% vs. 54.5% at k=5).
  - For vanilla TTS (without memory-aware aspects), sequential scaling offers little to no benefit, and parallel scaling consistently dominates.
6.3. Synergy of Memory and Test-Time Scaling
This section analyzes the bidirectional interaction between memory quality and scaling effectiveness using Pass@1 (success rate of a single, randomly selected trajectory) and Best-of-N (BoN) (success rate if the best of N trajectories is chosen). Results are reported on the WebArena-Shopping subset with parallel scaling factor k=3.
The following figure (Figure 5 from the original paper) shows a snapshot of MaTTS on the WebArena-Shopping subset with different memory mechanisms at k=3:
This figure is a combined bar-and-line chart showing the success rate of MaTTS on the WebArena-Shopping subset under different memory mechanisms (No Memory, Synapse, AWM, and ReasoningBank), reporting Pass@1 and Best-of-3.
Figure 5 | Snapshot of MaTTS on the WebArena-Shopping subset with different memory mechanisms at k=3. We compute BoN over all 3 trajectories and Pass@1 with one randomly selected trajectory.
- Better Memory Enables Stronger TTS Performance (BoN): The BoN results (blue bars) show that the benefit of scaling depends critically on the underlying memory mechanism.
  - No Memory: Scaling yields only a slight improvement in BoN (from 39.0% to 40.6%).
  - Weaker Memories (Synapse, AWM): Provide moderate gains, reaching 42.8% and 45.5% respectively.
  - MaTTS with ReasoningBank: Delivers the strongest benefit, with BoN climbing from 49.7% to 52.4%. This demonstrates that high-quality memory directs scaling toward more promising rollouts, ensuring that additional trajectories are effectively converted into higher success rates.
- Scaling Yields Better Memory Curation (Pass@1): The Pass@1 results (pink bars) reveal how scaling feeds back into memory curation.
  - Weaker Memories (Synapse, AWM): Pass@1 decreases with scaling (Synapse from 40.6% to 40.1%, AWM from 44.4% to 41.2%). This suggests that without strong guidance, the extra rollouts generated by scaling introduce noise rather than useful signals for memory curation.
  - ReasoningBank: Is the only method where Pass@1 rises with scaling (from 49.7% to 50.8%). This indicates that high-quality memory can harness the diversity of scaling to extract constructive contrastive signals, leading to more effective memory curation.

This asymmetry highlights a virtuous cycle: scaling alone is insufficient; it must be paired with a good memory mechanism like ReasoningBank for scaling to contribute to the curation of more effective memory, thereby closing the learning loop.
6.4. Incorporating Failure Trajectories
The paper explicitly analyzes the impact of including failure trajectories in memory construction, a key differentiator of ReasoningBank.
The following figure (Figure 7 from the original paper) shows the ablation results of incorporating failure trajectories for memory induction:
This figure is a chart showing the ablation results of Figure 7, comparing memory induction from successful trajectories only versus also including failure trajectories, measured by task success rate across models; including failure trajectories generally improves performance.
Figure 7 | Ablation results of incorporating failure trajectories for memory induction.
- Baselines' Limitations: Synapse and AWM build memory solely from successful trajectories. When failures are included:
  - Synapse: Only marginally improves from 40.6% (success-only) to 41.7% (with failures).
  - AWM: Degrades from 44.4% to 42.2%.
  This indicates that these baselines are either unable to benefit from failures or that failures introduce noise that harms their performance.
- ReasoningBank's Advantage: ReasoningBank is designed to distill reasoning patterns from both successes and failures (a sketch of this outcome-aware memory induction follows this list).
  - Success-only: Achieves 46.5%.
  - With failures: Further improves to 49.7%.
  This clearly demonstrates that ReasoningBank can transform failures into constructive signals rather than noise, leading to more robust generalization and better overall performance. Lessons from mistakes are actively integrated and utilized.
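As a rough illustration of this design choice, the sketch below shows how memory induction might branch on the self-judged outcome, distilling transferable strategies from successes and preventative lessons from failures into structured items (title / description / content). The judge and extractor callables are hypothetical stand-ins for the LLM prompts involved, not the authors' actual prompts.

```python
from typing import Callable, Dict, List

MemoryItem = Dict[str, str]  # assumed schema: {"title", "description", "content"}

def induce_memories(
    task: str,
    trajectory: str,
    judge: Callable[[str, str], bool],                     # LLM-as-a-judge: did this trajectory succeed?
    extract: Callable[[str, str, str], List[MemoryItem]],  # LLM extractor guided by an instruction
) -> List[MemoryItem]:
    """Distill structured reasoning strategies from one self-judged trajectory (sketch)."""
    if judge(task, trajectory):
        # Success: distill the generalizable strategies that made the attempt work.
        instruction = "Extract transferable reasoning strategies that led to success."
    else:
        # Failure: distill preventative lessons (pitfalls to avoid), instead of
        # discarding the experience as success-only baselines do.
        instruction = "Extract the pitfalls and lessons behind this failed attempt."
    return extract(task, trajectory, instruction)
```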
6.5. Efficiency Study
To gain deeper insight into the Steps reduction, the analysis separates the average number of steps into successful and failed test instances.
The following are the results from Table 4 of the original paper:
| Models | Shopping | | Admin | | Gitlab | | | |
| | Successful | Failed | Successful | Failed | Successful | Failed | Successful | Failed |
| No Memory | 6.8 | 8.7 | 8.4 | 10.4 | 8.6 | 15.7 | 6.1 | 7.6 |
| ReasoningBank | 4.7 ↓2.1 | 7.3 ↓1.4 | 7.0 ↓1.4 | 9.5 ↓0.9 | 7.6 ↓1.0 | 15.5 ↓0.2 | 5.0 ↓1.1 | 6.8 ↓0.8 |
- Consistent Step Reduction: ReasoningBank consistently reduces the number of steps across all domains for both successful and failed instances compared to No Memory.
- Greater Reduction in Successful Cases: The reduction is particularly pronounced on successful cases. For example, in Shopping, ReasoningBank reduces steps by 2.1 for successful cases (from 6.8 to 4.7, roughly a 31% relative reduction) compared to 1.4 for failed cases. Similar patterns are observed across the other domains.
- Guiding Purposeful Decision-Making: This finding indicates that ReasoningBank primarily helps the agent reach solutions with fewer interactions by strengthening its ability to follow effective reasoning paths. It suggests that the memory isn't just truncating failed attempts prematurely but is guiding more purposeful and efficient decision-making when the agent is on the right track, leading to faster successes.
6.6. Emergent Behaviors with ReasoningBank
The paper highlights that ReasoningBank fosters the evolution of strategies over time, leading to emergent behaviors akin to Reinforcement Learning dynamics.
The following figure (Figure 6 from the original paper) shows a case study illustrating emergent behaviors in ReasoningBank through memory items:
This figure is a chart showing examples of emergent behaviors in ReasoningBank memory items, depicting a timeline of steps and strategies during test-time learning, with key stages including procedural execution, self-reflection, adaptive checks, and compositional strategies.
Figure 6 | A case study illustrating emergent behaviors in ReasoningBank through memory items.
The case study illustrates the evolution of a memory item:
- Execution-oriented/Procedural (Initial): Strategies start simple, focusing on straightforward action rules (e.g., "find navigation links").
- Adaptive Self-Reflection (Intermediate): With more experience, the agent develops strategies for re-verifying identifiers to reduce simple mistakes.
- Adaptive Checks (Advanced): The memory item evolves to include systematic checks, such as leveraging available search or filters to ensure completeness before concluding results.
- Compositional Strategies (Mature): Finally, it matures into complex compositional strategies, like cross-referencing task requirements and reassessing options based on multiple criteria.

This evolution from low-level actions to high-level reasoning demonstrates how ReasoningBank enables agents to refine and abstract strategies during test-time learning, leading to more sophisticated and robust problem-solving capabilities.
6.7. Ablation Studies / Parameter Analysis
6.7.1. Number of Retrieved Experiences
The paper conducts an ablation study on the number of retrieved experiences (i.e., the top-k value for memory retrieval). This is performed using Gemini-2.5-flash on the WebArena-Shopping subset.
The following figure (Figure 12 from the original paper) shows the ablation results for using various number of experiences:
Figure 12 | Ablation results for using various number of experiences.
- Benefit of Memory: Incorporating just one relevant memory item (top-1 retrieval) significantly boosts performance from 39.0% (without memory) to 49.7%. This confirms the value of memory guidance (a sketch of this kind of embedding-based retrieval follows this list).
- Diminishing Returns/Noise: As the number of retrieved experiences increases, the success rate gradually declines (to 46.0%, then 45.5%, and then 44.4% as more items are retrieved).
- Quality over Quantity: This suggests that while some memory is good, excessive experiences may introduce conflicts or noise if they are not all perfectly relevant or coherent. The relevance and quality of retrieved memory items are more crucial than simply retrieving a large quantity. This validates the design choice of ReasoningBank to focus on distilling high-quality, structured memories.
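For reference, here is a minimal sketch of the kind of embedding-based top-k retrieval this ablation varies, using cosine similarity over memory-item text; the embedding function is assumed to be any sentence-embedding model, and the exact fields and scoring used in the paper may differ.

```python
import numpy as np
from typing import Callable, Dict, List

MemoryItem = Dict[str, str]  # assumed schema: {"title", "description", "content"}

def retrieve_top_k(
    query: str,
    bank: List[MemoryItem],
    embed: Callable[[str], np.ndarray],  # any sentence-embedding model (assumption)
    k: int = 1,
) -> List[MemoryItem]:
    """Return the k memory items most similar to the query by cosine similarity (sketch)."""
    if not bank:
        return []
    q = embed(query)
    scores = []
    for item in bank:
        v = embed(item["title"] + " " + item["description"])
        # Cosine similarity; the small epsilon guards against zero-norm vectors.
        scores.append(float(np.dot(q, v) / (np.linalg.norm(q) * np.linalg.norm(v) + 1e-8)))
    top_indices = np.argsort(scores)[::-1][:k]
    return [bank[i] for i in top_indices]
```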
6.7.2. Pass@k Analysis for MaTTS
A Pass@k analysis is performed under parallel scaling on the WebArena-Shopping subset with Gemini-2.5-flash to understand the sample efficiency and performance gains of MaTTS.
The following figure (Figure 13 from the original paper) shows Pass@k under parallel scaling with ReasoningBank:

Figure 13 | Pass@k under parallel scaling with ReasoningBank.
- Vanilla TTS Improves Sample Efficiency: MaTTS w/o aggregation (equivalent to vanilla TTS) already makes test-time learning behave similarly to RL training. Instead of merely inflating Pass@k at large k, it improves sample efficiency by guiding exploration. For example, MaTTS w/o aggregation achieves 50.8% Pass@k compared to 47.6% for MaTTS w/o memory at the same k. This means it extracts more value from each rollout.
- MaTTS Amplifies Gains: Equipping TTS with memory-aware scaling (MaTTS) pushes performance even further. MaTTS not only preserves efficiency at small k (51.3%) but also sustains strong growth with scaling, reaching 62.1% at the largest k evaluated, significantly higher than MaTTS w/o memory (which reaches only 52.4%).
- Unlocking Potential: This analysis shows that MaTTS unlocks more potential in agent systems, encouraging diverse generation that leads to better Pass@k performance and more efficient learning from exploration.
6.8. Case Studies
The paper presents two case studies to intuitively illustrate the benefits of ReasoningBank.
6.8.1. First Purchase Date Retrieval
The following figure (Figure 14 from the original paper) shows a case study for first purchase date retrieval:

Figure 14 | ReasoningBank enables the agent to recall and apply past reasoning hints, guiding it to the full order history and yielding the correct first purchase date, unlike the baseline that fails with only recent orders.
- Baseline Failure: A baseline agent (without memory) faced with the query "What is the date when I made my first purchase on this site?" only checks the "Recent Orders" table and incorrectly outputs the most recent purchase date. It fails to identify the need for a complete order history.
- ReasoningBank Success: The agent equipped with ReasoningBank recalls relevant past reasoning hints (memory items). These hints guide it to explore the full purchase history, allowing it to correctly identify and output the earliest order date. This demonstrates ReasoningBank's ability to transfer high-level strategies (e.g., "always check full history for 'first' or 'earliest' queries").
6.8.2. Shopping Task Efficiency
The following figure (Figure 15 from the original paper) shows a case study for shopping task efficiency:

Figure 15 | ReasoningBank improves efficiency by leveraging past reasoning hints, reducing the navigation from 29 steps to 10 steps compared to the baseline without memory.
- Baseline Inefficiency: In a navigation-heavy shopping task (e.g., "find men's clothing"), the baseline agent without memory struggles to find the correct filter for "Men" and requires 29 steps due to repeated, inefficient browsing and getting stuck.
- ReasoningBank Efficiency: The agent with ReasoningBank leverages stored reasoning about category filtering. This allows it to directly reach the relevant items and complete the task in only 10 steps. This vividly illustrates how ReasoningBank improves efficiency by guiding the agent to purposeful decisions and avoiding redundant exploration.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces ReasoningBank, a novel memory framework that significantly enhances the self-evolving capabilities of large language model (LLM) agents. ReasoningBank distinguishes itself by distilling generalizable reasoning strategies from an agent's self-judged successful and failed experiences, moving beyond traditional memory mechanisms that rely on raw trajectories or only successful routines. This allows agents to learn not only effective strategies but also crucial preventative lessons from past mistakes. The framework operates in a closed loop, where agents retrieve relevant memories to inform current tasks and then integrate new learnings back into the ReasoningBank.
Building on this, the paper proposes memory-aware test-time scaling (MaTTS), which synergistically combines ReasoningBank with test-time scaling. MaTTS accelerates and diversifies the learning process by intelligently generating abundant and diverse interaction experiences (through parallel or sequential scaling). These rich experiences provide contrastive signals for ReasoningBank to synthesize higher-quality, more generalizable memory. In turn, this refined memory guides more effective scaling, creating a powerful positive feedback loop.
Extensive experiments on web browsing (WebArena, Mind2Web) and software engineering (SWE-Bench-Verified) benchmarks consistently demonstrate that ReasoningBank outperforms existing memory mechanisms in both effectiveness (higher success rates, up to 34.2% relative improvement) and efficiency (fewer interaction steps, 16.0% less). MaTTS further amplifies these gains, showing superior performance over vanilla test-time scaling. The findings establish memory-driven experience scaling as a new dimension for agent scaling, enabling self-evolution and the natural emergence of complex behaviors.
7.2. Limitations & Future Work
The authors acknowledge several limitations of their work and suggest future research directions:
7.2.1. Limitations
- Focus on Memory Content: The study primarily emphasizes how to curate and utilize memory content (e.g., integrating failure trajectories, distilled reasoning cues). It did not extensively compare with other memory architectures like episodic or hierarchical memory. These architectural designs address orthogonal concerns (memory form/structure), while ReasoningBank focuses on what specific knowledge should be stored and reused.
- Simplicity in Memory Retrieval and Consolidation: The current implementation uses simple embedding-based retrieval and straightforward consolidation (direct addition of new items). This was an intentional choice to isolate the effect of memory content quality. More sophisticated strategies (e.g., adaptive retrieval, hierarchical consolidation, merging, forgetting policies) were not explored.
- Dependence on LLM-as-a-Judge for Correctness Signals: The success/failure signals for trajectories are determined by an LLM-as-a-judge. While this enables scalable self-evaluation without ground-truth feedback, it may introduce noise if the judge LLM makes errors or if tasks are ambiguous. Although the framework showed robustness to such noise, this is a potential source of imperfection.
7.2.2. Future Work
- Compositional Memory: Explore how memory items could be composed into higher-level strategies or reusable macros. The current framework distills individual memory items and retrieves them independently. Future work could investigate composition-aware retrieval and consolidation to enable agents to combine complementary items for richer strategies and stronger generalization in long-horizon tasks.
- Advanced Memory Architectures: Integrate ReasoningBank's philosophy with more complex memory architectures, such as:
  - Episodic traces (Fountas et al., 2025) for per-task context.
  - Short-term "working" memory (Lumer et al., 2025) for within-session state.
  - Long-term consolidated knowledge (Wang et al., 2025b) with decay/refresh policies.
  ReasoningBank's philosophy is compatible with these, and such integration could lead to a more comprehensive memory system.
- Sophisticated Retrieval and Consolidation Mechanisms: Move beyond simple embedding-based similarity for retrieval to reasoning-intensive controllers (Shao et al., 2025). These controllers could decompose queries, plan multi-hop lookups across memory tiers, and condition selection on factors like uncertainty, recency, and cost. Learning-based routers and consolidation policies could automate these processes, potentially turning ReasoningBank with MaTTS into a deployable memory service across domains.
- Stronger Verifiers for Self-Judgement: Incorporate stronger verifiers, human-in-the-loop feedback, or ensemble judgment to enhance the reliability of memory induction, mitigating the noise introduced by the LLM-as-a-judge.
7.3. Personal Insights & Critique
This paper presents a highly valuable contribution to the field of LLM agents, addressing a critical bottleneck: the lack of effective learning from experience. The core idea of ReasoningBank—distilling generalizable reasoning strategies from both successes and failures—is intuitively powerful. It moves agent memory from passive record-keeping to active knowledge creation. The emphasis on learning from failures, in particular, is a standout feature, as real-world learning often hinges on understanding and avoiding past mistakes.
The introduction of MaTTS is equally compelling. The concept of memory-driven experience scaling creates a virtuous cycle where compute is not just blindly thrown at a problem but intelligently used to generate diverse experiences that refine memory, which in turn makes future explorations more efficient. This synergistic relationship is a fundamental insight that could guide future research in scaling LLM agents.
Potential Strengths and Applications:
- Robustness and Generalization: By learning from high-level reasoning and failures, agents become more robust to novel situations and can generalize across domains, as shown in the Mind2Web results. This is crucial for deploying agents in dynamic, open-ended environments.
- Efficiency: The observed reduction in interaction steps is highly practical, as it translates directly to lower computational costs and faster task completion in real-world applications.
- Path to Lifelong Learning: The closed-loop nature of ReasoningBank offers a clear path toward truly lifelong learning agents that continuously improve over their operational lifetime, a long-sought goal in AI.
- Inspiration for Other Domains: The principles of distilling reasoning from experience and using contrastive signals could be applied to various other AI challenges, such as robotics, scientific discovery, or complex decision support systems.
Areas for Further Consideration / Critique:
- Complexity of Memory Items: While structuring memory items into title, description, and content is a good start, the "content" itself can still be quite textual. Future work might explore more formal or executable representations of reasoning strategies to make them even more machine-actionable and less reliant on LLM interpretation.
- Scalability of Retrieval: As ReasoningBank accumulates more memory items, the embedding-based similarity search for top-k items might become computationally intensive. The paper acknowledges this and suggests reasoning-intensive controllers or hierarchical memory as future directions, which would be crucial for industrial-scale deployment.
- Bias in LLM-as-a-Judge: The reliance on LLM-as-a-judge for correctness signals, while practical, introduces the possibility of LLM biases or hallucinations affecting memory curation. While the paper states robustness, this remains a fundamental challenge in LLM-driven self-correction. Exploring diverse or ensemble judges, or incorporating human oversight, could mitigate this.
- Computational Cost of MaTTS: While MaTTS improves overall efficiency by reducing steps to success, generating parallel trajectories still incurs a k-fold increase in inference cost for a single task. The trade-off between the increased compute for exploration and the long-term benefits of better memory needs to be carefully managed for practical applications.
- "Forgetting" Mechanism: The current minimal consolidation strategy (simple addition) means the memory bank will continuously grow. Without a forgetting or pruning mechanism for redundant, outdated, or less useful memories, ReasoningBank could become unwieldy over very long timescales, potentially leading to increased retrieval latency or noise. This is explicitly noted as a future direction.

Overall, "ReasoningBank" is an inspiring paper that takes a significant step toward creating more intelligent, adaptive, and autonomous LLM agents. Its dual focus on refined memory content and memory-aware scaling provides a robust framework for continuous improvement in agents operating in complex real-world environments.