Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
TL;DR Summary
The ACE framework evolves contexts through a modular generation-reflection-curation loop, tackling brevity bias and context collapse. It boosts performance on agent and finance tasks while cutting adaptation latency and cost, enabling efficient LLM self-improvement without labeled supervision.
Abstract
Large language model (LLM) applications such as agents and domain-specific reasoning increasingly rely on context adaptation -- modifying inputs with instructions, strategies, or evidence, rather than weight updates. Prior approaches improve usability but often suffer from brevity bias, which drops domain insights for concise summaries, and from context collapse, where iterative rewriting erodes details over time. Building on the adaptive memory introduced by Dynamic Cheatsheet, we introduce ACE (Agentic Context Engineering), a framework that treats contexts as evolving playbooks that accumulate, refine, and organize strategies through a modular process of generation, reflection, and curation. ACE prevents collapse with structured, incremental updates that preserve detailed knowledge and scale with long-context models. Across agent and domain-specific benchmarks, ACE optimizes contexts both offline (e.g., system prompts) and online (e.g., agent memory), consistently outperforming strong baselines: +10.6% on agents and +8.6% on finance, while significantly reducing adaptation latency and rollout cost. Notably, ACE could adapt effectively without labeled supervision and instead by leveraging natural execution feedback. On the AppWorld leaderboard, ACE matches the top-ranked production-level agent on the overall average and surpasses it on the harder test-challenge split, despite using a smaller open-source model. These results show that comprehensive, evolving contexts enable scalable, efficient, and self-improving LLM systems with low overhead.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Agentic Context Engineering: Evolving Contexts for Self-Improving Language Models
- Authors: Qizheng Zhang, Changran Hu, Shubhangi Upasani, Boyuan Ma, Fenglu Hong, Vamsidhar Kamanuru, Jay Rainton, Chen Wu, Mengmeng Ji, Hanchen Li, Urmish Thakker, James Zou, and Kunle Olukotun.
- Affiliations: The authors are affiliated with Stanford University and SambaNova Systems.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic papers, which means it has not yet undergone formal peer review for a conference or journal. arXiv is the standard platform for quickly sharing cutting-edge research within the AI/ML community.
- Publication Year: 2025. The arXiv identifier (2510.04618) corresponds to an October 2025 submission; for the purpose of this analysis, we treat it as a recent preprint.
- Abstract: The paper introduces ACE (Agentic Context Engineering), a framework to improve Large Language Models (LLMs) by adapting their input contexts rather than their weights. It addresses two key problems in prior methods: brevity bias (losing details by over-summarizing) and context collapse (iterative rewriting erodes information). ACE treats contexts as evolving "playbooks" that accumulate strategies through a modular process of generation, reflection, and curation. It uses structured, incremental updates to preserve knowledge. The framework demonstrates significant performance gains on agent benchmarks (+10.6%) and financial tasks (+8.6%), while reducing adaptation costs. Notably, an open-source model using ACE matched a top-ranked, larger proprietary model on the AppWorld leaderboard, showing that evolving contexts can lead to scalable, self-improving, and efficient LLM systems.
- Original Source Link (arXiv): https://arxiv.org/abs/2510.04618
- PDF Link: https://arxiv.org/pdf/2510.04618v1.pdf
- Status: Preprint.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Modern AI systems, especially LLM-powered agents, rely on context adaptation—modifying inputs with instructions, examples, or memory—to improve performance without costly retraining. However, existing methods for automatic context optimization face critical flaws.
- Gaps in Prior Work: The paper identifies two major challenges:
- Brevity Bias: Many prompt optimizers are designed to create short, concise instructions. While this can be useful, it often strips away domain-specific details, heuristics, and failure recovery strategies that are essential for complex tasks.
- Context Collapse: When an LLM is asked to iteratively rewrite and update a large context, it tends to over-summarize, causing a sudden and catastrophic loss of accumulated knowledge. The paper provides a stark example where context size and accuracy plummeted after a single rewriting step (Figure 2).
- Fresh Angle: The paper argues that for complex reasoning, LLMs benefit from comprehensive, detailed contexts, not concise summaries. It proposes that contexts should be treated as evolving playbooks that accumulate and organize knowledge over time, much like a human expert's collection of notes and strategies.
- Main Contributions / Findings (What):
- Novel Framework (ACE): The primary contribution is ACE (Agentic Context Engineering), a framework for creating and maintaining these evolving playbooks. It features a modular, agentic architecture.
- Structured, Incremental Updates: To prevent context collapse, ACE introduces incremental delta updates. Instead of rewriting the entire context, it generates small, localized changes (`deltas`) that are merged into the existing playbook. This preserves accumulated knowledge.
- Modular Workflow: ACE divides the labor of context improvement into three specialized roles: a `Generator` (executes tasks), a `Reflector` (extracts lessons), and a `Curator` (integrates lessons into the playbook).
- Strong Empirical Performance: ACE significantly outperforms strong baselines across agent tasks (`AppWorld`) and domain-specific financial reasoning (`FiNER`, `Formula`), with average gains of +10.6% and +8.6% respectively.
- Self-Improvement without Supervision: ACE can effectively adapt using only natural feedback from task execution (e.g., code success/failure), eliminating the need for expensive ground-truth labels.
- Efficiency and Scalability: ACE drastically reduces the time and computational cost of adaptation compared to prior methods, achieving up to 86.9% lower latency.
- Competitive Results with Smaller Models: A key finding is that a smaller open-source model (DeepSeek-V3.1) equipped with ACE matched the performance of a top-ranked, production-level agent using GPT-4.1 on the `AppWorld` benchmark, and even surpassed it on the more difficult `test-challenge` split.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Large Language Models (LLMs): These are massive neural networks trained on vast amounts of text data (e.g., GPT-4, Llama 3). They can understand and generate human-like text but their behavior is highly dependent on the input they receive.
- Context Adaptation (or Context Engineering): The practice of improving an LLM's performance by carefully crafting its input, known as the "context." This includes writing better instructions (`system prompts`), providing relevant examples (`in-context learning`), or supplying external memory. It is an alternative to fine-tuning, which involves updating the model's internal weights and is far more computationally expensive.
- LLM Agents: LLMs configured to perform tasks in an environment by reasoning, planning, and using tools (like APIs or code interpreters). Their success often depends on learning from past interactions.
- KV Cache: In transformer-based models like LLMs, the Key-Value (KV) cache stores intermediate computations (attention keys and values) for the input tokens. By reusing the KV cache for static parts of the context (like a system prompt or playbook), the cost of processing long inputs is significantly reduced, as these computations don't need to be repeated for every new generation. This makes long-context approaches like ACE more practical.
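To make the KV-cache point concrete, here is a toy, hypothetical sketch (not from the paper or any serving library) of the idea: the expensive prefill over a static playbook prefix is computed once and reused for every query that shares it. `encode_prefix` and `answer` are illustrative stand-ins for a model's prefill and decode steps.

```python
# Conceptual sketch only: a stable playbook prefix lets a serving system cache
# the attention keys/values from prefill and reuse them, so each new request
# only pays for its own suffix tokens.
from functools import lru_cache

@lru_cache(maxsize=8)
def encode_prefix(prefix: str) -> tuple:
    # Stand-in for the expensive prefill pass that would produce KV tensors.
    print(f"prefill over {len(prefix)} prefix characters (expensive, done once)")
    return ("kv-cache-for", hash(prefix))

def answer(playbook: str, query: str) -> str:
    kv = encode_prefix(playbook)  # cache hit after the first call
    # Only the query tokens would need fresh computation here.
    return f"decode(query={query!r}, reusing {kv[0]})"

playbook = "STRATEGIES AND HARD RULES:\n- Always verify API pagination.\n" * 200
print(answer(playbook, "Reply to the latest email"))
print(answer(playbook, "Summarize this month's spending"))  # prefill skipped
```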
- Previous Works:
- Language Feedback Methods: The paper situates itself within a line of work that uses an LLM to generate feedback on its own performance to improve.
- Reflexion: An agent framework that reflects on past failures to generate textual feedback and improve its plans for subsequent attempts.
- GEPA (Genetic-Pareto): A state-of-the-art prompt optimizer that uses an evolutionary algorithm to refine prompts based on reflective feedback from execution traces. The paper uses GEPA as a strong baseline but critiques its tendency toward brevity.
- Dynamic Cheatsheet (DC): An approach where an LLM maintains an external "cheatsheet" of reusable strategies. ACE builds directly on the agentic design of DC but introduces more structured mechanisms to prevent the `context collapse` that DC can suffer from.
- Technological Evolution: The field has moved from static `prompt engineering` to dynamic, automated methods for `prompt optimization`. However, these optimizers often aim for a single, fixed, and concise prompt. With the advent of long-context LLMs, the constraint on input length has loosened. This paper capitalizes on this trend, arguing that the goal should shift from finding one perfect, short prompt to building a comprehensive, ever-growing knowledge base that the LLM can leverage at inference time.
- Differentiation: ACE distinguishes itself from prior work in several key ways:
- Playbook vs. Prompt: ACE creates a detailed, structured, and growing "playbook," whereas methods like GEPA aim for a single, concise, and static instruction prompt.
- Incremental vs. Monolithic Updates: ACE uses `delta updates` to incrementally add or refine small pieces of knowledge. This contrasts with methods that ask an LLM to rewrite the entire context at once, which risks `context collapse`.
- Division of Labor: ACE's `Generator`-`Reflector`-`Curator` architecture is more modular than systems like Dynamic Cheatsheet, explicitly separating task execution, lesson extraction, and knowledge integration. This dedicated `Reflector` is a key innovation for improving the quality of insights.
4. Methodology (Core Technology & Implementation)
ACE is a framework designed to treat LLM contexts as dynamic, evolving playbooks. Its core philosophy is "grow-and-refine" to prevent information loss.
- Principles: The central idea is that LLMs perform better on complex tasks when provided with a rich, detailed, and organized set of strategies, examples, and warnings, rather than a short, abstract instruction. The system should accumulate knowledge, not compress it away.
- Steps & Procedures: The ACE framework operates in a continuous loop, orchestrated by three specialized agentic components, as shown in Figure 4.
Figure 4 (caption, translated): A schematic of the ACE framework, showing the agentic architecture composed of a Generator, Reflector, and Curator. A query plus the context playbook yields an execution trajectory; the Reflector iteratively distills insights from it; and the Curator updates the context entries, forming a continuously evolving context.
- The Generator: This is the primary LLM agent responsible for solving a given task or query. It uses the current version of the "playbook" (the context) to inform its reasoning and actions. As it works, it produces an execution trace (thoughts, tool calls, outputs) and can also provide feedback on which entries in the playbook were helpful or misleading for the current task.
- The Reflector: This component acts as an analyst. It takes the `Generator`'s execution trace and feedback as input. Its job is to critically evaluate the `Generator`'s performance, identify the root causes of successes and failures, and distill these observations into concrete, actionable "lessons." The paper highlights that this reflection can be an iterative process to refine the quality of the insights.
- The Curator: This component acts as a librarian. It receives the lessons from the `Reflector` and is responsible for integrating them into the playbook. Crucially, it does not rewrite the whole playbook. Instead, it synthesizes the lessons into `delta entries`: small, structured pieces of new or updated information. A lightweight, non-LLM logic then deterministically merges these deltas into the main playbook. This might involve adding a new bullet point, updating an existing one, or marking one for de-duplication.
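To make the division of labor concrete, here is a minimal sketch of one Generator-Reflector-Curator cycle based on our reading of Figure 4. `ace_step`, `call_llm`, and `feedback_fn` are hypothetical names, and the prompts only paraphrase the roles; they are not the paper's actual prompts or implementation.

```python
# Minimal sketch of ACE's division of labor (assumed interfaces, not the paper's code).
from typing import Callable, Dict, List

def ace_step(task: str, playbook: List[Dict], call_llm: Callable[[str], str],
             feedback_fn: Callable[[str], str]) -> List[Dict]:
    """Run one Generator -> Reflector -> Curator cycle and return delta entries."""
    playbook_text = "\n".join(f"[{b['id']}] {b['content']}" for b in playbook)

    # Generator: solve the task using the current playbook, producing a trace.
    trace = call_llm(f"Playbook:\n{playbook_text}\n\nTask: {task}\n"
                     "Solve the task and show your reasoning and tool calls.")
    feedback = feedback_fn(trace)  # e.g. code success/failure; no labels needed

    # Reflector: turn the trace plus feedback into concrete lessons.
    lessons = call_llm(f"Trace:\n{trace}\nFeedback:\n{feedback}\n"
                       "Diagnose what helped or failed; list actionable lessons.")

    # Curator: emit small delta entries instead of rewriting the playbook.
    delta_text = call_llm(f"Lessons:\n{lessons}\n"
                          "Return new or updated playbook bullets, one per line.")
    return [{"id": f"new-{i:05d}", "content": line, "helpful": 0, "harmful": 0}
            for i, line in enumerate(delta_text.splitlines()) if line.strip()]
```

One design point the sketch makes visible: only the Curator's output touches the playbook, so a slow or expensive model could be used for reflection while a cheaper one handles generation.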
- Mathematical Formulas & Key Details:
Incremental Delta Updates (§3.1): This is the core mechanism to prevent context collapse.
- The context is not a single block of text but a collection of structured, itemized bullets.
- Each bullet is a small unit of knowledge (e.g., a strategy, a code snippet, a pitfall) and has two parts:
- Metadata: A unique ID, and counters for how often it was marked as helpful or harmful.
- Content: The actual text of the strategy or rule.
- Instead of rewriting the entire context, the `Reflector` and `Curator` generate a `delta context`, which is just a small set of new or modified bullets. This `delta` is then merged into the main context. The process is computationally cheap, avoids LLM-induced summarization, and ensures that valuable past knowledge is preserved.
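Below is a minimal sketch of what an itemized bullet and the deterministic, non-LLM merge could look like. The `Bullet` fields, ID strings, and `merge_delta` logic are illustrative assumptions consistent with the description above, not the paper's exact schema.

```python
# Sketch of the itemized context and a deterministic delta merge (assumed schema).
from dataclasses import dataclass

@dataclass
class Bullet:
    id: str           # unique identifier, e.g. "shr-00009"
    content: str      # the strategy, snippet, or pitfall text
    helpful: int = 0  # how often the Generator marked it helpful
    harmful: int = 0  # how often it was marked misleading

def merge_delta(playbook: dict[str, Bullet], delta: list[Bullet]) -> None:
    """Apply a small delta in place; the playbook is never rewritten wholesale."""
    for b in delta:
        if b.id in playbook:          # update: bump counters, optionally refresh text
            cur = playbook[b.id]
            cur.helpful += b.helpful
            cur.harmful += b.harmful
            if b.content:
                cur.content = b.content
        else:                         # grow: append a brand-new bullet
            playbook[b.id] = b

playbook = {"shr-00009": Bullet("shr-00009", "Always check API pagination limits.", helpful=3)}
merge_delta(playbook, [Bullet("shr-00009", "", helpful=1),
                       Bullet("tsp-00002", "Retry transient 5xx errors once before failing.")])
print(len(playbook), playbook["shr-00009"].helpful)  # -> 2 4
```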
Example of an ACE-Generated Playbook Entry: The paper provides Figure 3 as an example of what the playbook looks like. It is structured into sections such as `STRATEGIES AND HARD RULES`, `USEFUL CODE SNIPPETS AND TEMPLATES`, and `TROUBLESHOOTING AND PITFALLS`. Each entry is tagged with an ID (e.g., `[shr-00009]`) and contains specific, actionable advice.
Grow-and-Refine (§3.2): This mechanism ensures the playbook remains useful and manageable as it grows.
- Grow: New bullets from `delta contexts` are appended to the playbook.
- Refine: This process manages the quality and size of the playbook.
- Existing bullets can be updated (e.g., their "helpful" counter is incremented).
- A de-duplication step is performed periodically or lazily (when the context window is full). This step compares bullets using semantic embeddings (numerical representations of their meaning) to identify and prune redundant entries.
- This dual process allows the playbook to steadily expand with new knowledge while controlling for redundancy, ensuring it remains both comprehensive and efficient.
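A small sketch of the lazy de-duplication idea under stated assumptions: the paper compares bullets with semantic embeddings, while the `embed` function here is a toy bag-of-words stand-in chosen only to keep the example self-contained; `deduplicate` and the 0.85 threshold are our own illustrative choices.

```python
# Sketch of embedding-based de-duplication: drop near-duplicate bullets,
# preferring the ones more often marked helpful.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())  # toy stand-in for a semantic embedding

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def deduplicate(bullets: list[dict], threshold: float = 0.85) -> list[dict]:
    kept: list[dict] = []
    for b in sorted(bullets, key=lambda x: -x["helpful"]):  # keep proven bullets first
        if all(cosine(embed(b["content"]), embed(k["content"])) < threshold for k in kept):
            kept.append(b)
    return kept

bullets = [
    {"id": "shr-00009", "content": "always check api pagination limits", "helpful": 4},
    {"id": "shr-00031", "content": "always check the api pagination limits", "helpful": 1},
    {"id": "tsp-00002", "content": "retry transient errors once before failing", "helpful": 2},
]
print([b["id"] for b in deduplicate(bullets)])  # -> ['shr-00009', 'tsp-00002']
```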
5. Experimental Setup
- Datasets:
- `AppWorld`: An agent benchmark where the LLM interacts with a simulated environment of common applications (e.g., email, file system) via APIs. It tests API understanding, code generation, and multi-step reasoning, and includes `normal` and `challenge` difficulty levels.
- `FiNER`: A financial-domain task requiring the LLM to perform named entity recognition on financial documents written in XBRL (eXtensible Business Reporting Language). It involves labeling tokens with one of 139 fine-grained entity types.
- `Formula`: Another financial task that tests numerical reasoning. The LLM must extract values from XBRL documents and perform calculations to answer financial questions.
- Evaluation Metrics:
- Task Goal Completion (TGC): Used for `AppWorld`.
- Conceptual Definition: Measures whether the agent successfully completes a specific, low-level instruction within a scenario.
- Mathematical Formula: $\mathrm{TGC} = \frac{N_{\text{completed task goals}}}{N_{\text{total task goals}}}$
- Symbol Explanation: A simple ratio of successfully achieved sub-goals to the total number of sub-goals presented.
- Scenario Goal Completion (SGC): Used for `AppWorld`.
- Conceptual Definition: A stricter metric that measures whether the agent achieves the overall, high-level objective of the entire scenario, which may require successfully completing multiple interdependent tasks.
- Mathematical Formula: $\mathrm{SGC} = \frac{N_{\text{completed scenarios}}}{N_{\text{total scenarios}}}$
- Symbol Explanation: A ratio of successfully completed overarching scenarios to the total number of scenarios.
- Accuracy: Used for `FiNER` and `Formula`.
- Conceptual Definition: Measures the proportion of the model's predictions that exactly match the ground-truth answers.
- Mathematical Formula: $\mathrm{Accuracy} = \frac{N_{\text{correct predictions}}}{N_{\text{total predictions}}}$
- Symbol Explanation: A standard classification metric that rewards exact matches.
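For completeness, the three ratios expressed as code; this is our own formulation for illustration (the benchmark harnesses compute these internally), and the counts and labels in the usage lines are made up.

```python
# The three evaluation metrics as simple ratios.
def task_goal_completion(completed_goals: int, total_goals: int) -> float:
    return completed_goals / total_goals

def scenario_goal_completion(completed_scenarios: int, total_scenarios: int) -> float:
    return completed_scenarios / total_scenarios

def accuracy(predictions: list[str], labels: list[str]) -> float:
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

print(f"TGC = {task_goal_completion(107, 168):.3f}")                      # illustrative counts
print(f"Acc = {accuracy(['Asset', 'Other'], ['Asset', 'Debt']):.2f}")     # -> 0.50
```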
- Baselines:
- `Base LLM`: The model (DeepSeek-V3.1) with a default prompt, without any advanced context engineering.
- `In-Context Learning (ICL)`: Providing the model with as many training examples (demonstrations) as can fit in its context window.
- `MIPROv2`: A well-known prompt optimizer that uses Bayesian optimization.
- `GEPA`: A state-of-the-art prompt optimizer based on reflective evolution, representing a strong baseline for automated instruction generation.
- `Dynamic Cheatsheet (DC)`: An adaptive memory method that ACE is conceptually based on; it serves as a direct point of comparison for online adaptation.
6. Results & Analysis
The paper's results consistently show that ACE's "evolving playbook" approach leads to significant improvements in performance and efficiency.
- Core Results:
Agent Benchmark (AppWorld): The following is a transcription of Table 1 from the paper.

| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| DeepSeek-V3.1 as Base LLM | | | | | | |
| ReAct | | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | | | | | | |
| ReAct + ICL | ✓ | 64.3 (+0.6) | 46.4 (+3.5) | 46.0 (+4.5) | 27.3 (+5.7) | 46.0 (+3.6) |
| ReAct + GEPA | ✓ | 64.9 (+1.2) | 44.6 (+1.7) | 46.0 (+4.5) | 30.2 (+8.6) | 46.4 (+4.0) |
| ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| ReAct + ACE | ✗ | 75.0 (+11.3) | 64.3 (+21.4) | 54.4 (+12.9) | 35.2 (+13.6) | 57.2 (+14.8) |
| Online Adaptation | | | | | | |
| ReAct + DC (CU) | ✗ | 65.5 (+1.8) | 58.9 (+16.0) | 52.3 (+10.8) | 30.8 (+9.2) | 51.9 (+9.5) |
| ReAct + ACE | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |
- Analysis: Offline, `ACE` (+17.0%) massively outperforms other optimizers like `GEPA` (+4.0%) and simple `ICL` (+3.6%). This supports the claim that a comprehensive playbook is better than a concise prompt or a set of examples.
- Crucially, `ACE` performs nearly as well without ground-truth (GT) labels (+14.8%) as with them (+17.0%), demonstrating its ability to learn from natural execution feedback (e.g., API call success/failure).
- In online adaptation, `ACE` (+17.1%) also surpasses `Dynamic Cheatsheet` (+9.5%), its closest conceptual relative.
- The leaderboard result (Figure 5) is particularly striking: a smaller open-source model with `ACE` matches a production agent using the much larger GPT-4.1, highlighting that better context can be as impactful as a bigger model.
Domain-Specific Benchmarks (Financial Analysis): The following is a transcription of Table 2 from the paper.

| Method | GT Labels | FiNER (Acc↑) | Formula (Acc↑) | Average |
|---|---|---|---|---|
| DeepSeek-V3.1 as Base LLM | | | | |
| Base LLM | | 70.7 | 67.5 | 69.1 |
| Offline Adaptation | | | | |
| ICL | ✓ | 72.3 (+1.6) | 67.0 (-0.5) | 69.6 (+0.5) |
| MIPROv2 | ✓ | 72.4 (+1.7) | 69.5 (+2.0) | 70.9 (+1.8) |
| GEPA | ✓ | 73.5 (+2.8) | 71.5 (+4.0) | 72.5 (+3.4) |
| ACE | ✓ | 78.3 (+7.6) | 85.5 (+18.0) | 81.9 (+12.8) |
| ACE | ✗ | 71.1 (+0.4) | 83.0 (+15.5) | 77.1 (+8.0) |
| Online Adaptation | | | | |
| DC (CU) | ✓ | 74.2 (+3.5) | 69.5 (+2.0) | 71.8 (+2.7) |
| DC (CU) | ✗ | 68.3 (-2.4) | 62.5 (-5.0) | 65.4 (-3.7) |
| ACE | ✓ | 76.7 (+6.0) | 76.5 (+9.0) | 76.6 (+7.5) |
| ACE | ✗ | 67.3 (-3.4) | 78.5 (+11.0) | 72.9 (+3.8) |

- Analysis: `ACE` again shows the largest gains, achieving an average improvement of +12.8% in the offline setting. The +18.0% gain on `Formula` is particularly large, suggesting the playbook is highly effective for tasks requiring a combination of knowledge extraction and numerical reasoning.
- However, the table also shows that without reliable feedback (no GT labels and no clear execution signal like in `AppWorld`), performance for both `ACE` and `DC` can degrade on some tasks (e.g., FiNER). This highlights a key dependency: the quality of adaptation is tied to the quality of the feedback signal.
- Ablations / Parameter Sensitivity:
The following is a transcription of Table 3, which ablates `ACE` components on `AppWorld`.

| Method | GT Labels | Test-Normal TGC↑ | Test-Normal SGC↑ | Test-Challenge TGC↑ | Test-Challenge SGC↑ | Average |
|---|---|---|---|---|---|---|
| DeepSeek-V3.1 as Base LLM | | | | | | |
| ReAct | | 63.7 | 42.9 | 41.5 | 21.6 | 42.4 |
| Offline Adaptation | | | | | | |
| ReAct + ACE w/o Reflector or multi-epoch | ✓ | 70.8 (+7.1) | 55.4 (+12.5) | 55.9 (+14.4) | 38.1 (+17.5) | 55.1 (+12.7) |
| ReAct + ACE w/o multi-epoch | ✓ | 72.0 (+8.3) | 60.7 (+17.8) | 54.9 (+13.4) | 39.6 (+18.0) | 56.8 (+14.4) |
| ReAct + ACE | ✓ | 76.2 (+12.5) | 64.3 (+21.4) | 57.3 (+15.8) | 39.6 (+18.0) | 59.4 (+17.0) |
| Online Adaptation | | | | | | |
| ReAct + ACE | ✗ | 67.9 (+4.2) | 51.8 (+8.9) | 61.4 (+19.9) | 43.2 (+21.6) | 56.1 (+13.7) |
| ReAct + ACE + offline warmup | ✗ | 69.6 (+5.9) | 53.6 (+10.7) | 66.0 (+24.5) | 48.9 (+27.3) | 59.5 (+17.1) |

- Analysis: Each component contributes to performance. Removing the `Reflector` and multi-epoch adaptation reduces the average gain from +17.0% to +12.7%. Adding back the `Reflector` (but without multi-epoch) boosts it to +14.4%. The full `ACE` model achieves the best result at +17.0%. This confirms that the dedicated `Reflector` and iterative refinement over the training data are both crucial design choices.
- For online adaptation, using an offline warmup phase to initialize the playbook provides a clear performance boost (from +13.7% to +17.1%), showing the synergy between offline preparation and online learning.
- Cost and Speed Analysis:
The following is a transcription of Table 4.

(a) Offline (AppWorld).

| Method | Latency (s) ↓ | # Rollouts ↓ |
|---|---|---|
| ReAct + GEPA | 53898 | 1434 |
| ReAct + ACE | 9517 (-82.3%) | 357 (-75.1%) |

(b) Online (FiNER).

| Method | Latency (s) ↓ | Token Cost ($) ↓ |
|---|---|---|
| DC (CU) | 65104 | 17.7 |
| ACE | 5503 (-91.5%) | 2.9 (-83.6%) |

- Analysis: ACE's incremental `delta update` mechanism is vastly more efficient than methods requiring full context rewrites. It achieves an 82.3% reduction in adaptation latency compared to `GEPA` offline and a 91.5% reduction compared to `Dynamic Cheatsheet` online. This efficiency makes continuous, self-improving systems much more practical for real-world deployment.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully identifies and provides a solution for two critical limitations in context adaptation: brevity bias and context collapse. The proposed ACE framework, which treats contexts as evolving playbooks, is shown to be highly effective. By using a modular agentic architecture and incremental updates, ACE creates comprehensive, detailed contexts that significantly boost LLM performance on complex agent and domain-specific tasks. The framework is not only effective but also highly efficient, and its ability to learn from natural feedback without ground-truth labels is a major step toward truly self-improving AI systems.
- Limitations & Future Work:
- Reflector Quality: The authors acknowledge that ACE's performance is dependent on the `Reflector` model's ability to extract meaningful insights. If the `Reflector` is weak or the feedback signals are noisy, it could pollute the playbook with useless or harmful information.
- Task Suitability: ACE is most beneficial for tasks that require deep domain knowledge, complex tool use, or strategies that evolve over time. For simpler tasks or those with fixed, simple strategies (e.g., Game of 24), the overhead of maintaining a large playbook may not be justified.
- Personal Insights & Critique:
- Paradigm Shift: The shift in perspective from "prompt optimization" to "context evolution" is the most powerful idea in this paper. It reframes the goal from finding a single perfect instruction to cultivating a dynamic knowledge base, which feels much more aligned with how experts learn and work.
- Architectural Elegance: The `Generator`-`Reflector`-`Curator` design is clean and intuitive. This modularity is a strong engineering principle, as it allows each component to be improved or replaced independently. One could imagine using a highly specialized (and perhaps costly) model for reflection, while using a faster, cheaper model for generation.
- The Unspoken Challenge: The reliance on a good `Reflector` hints at a recursive problem: how do we ensure the quality of reflection? This remains an open and challenging research area. While ACE shows it can work with execution feedback, the quality of that feedback is paramount.
- Practical Impact: The demonstration that a smaller, open-source model with superior context engineering can compete with a much larger proprietary model is highly significant. It suggests that organizations can achieve state-of-the-art performance not just by scaling up models, but by investing in smarter, more adaptive systems around them. This has major implications for making cutting-edge AI more accessible. Future work could explore more advanced curation strategies, such as automatically identifying conflicting advice or structuring the playbook into a knowledge graph for more sophisticated retrieval.