Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
TL;DR Summary
Dynamic Cheatsheet equips black-box LLMs with persistent adaptive memory, enabling test-time learning that reuses strategies and code snippets, significantly boosting performance without labeled data or retraining.
Abstract
Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement on GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.
Mind Map
In-depth Reading
English Analysis
Bibliographic Information
- Title: Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
- Authors: Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou
- Journal/Conference: Published at arXiv (a preprint server for research papers).
- Publication Year: 2025
- Abstract: Current language models (LMs) process each input query independently, lacking memory of previous attempts. This paper introduces Dynamic Cheatsheet (DC), a lightweight framework that provides a black-box LM with a persistent, evolving memory. DC allows models to store and reuse strategies, code snippets, and problem-solving insights at inference time, enhancing performance without requiring ground-truth labels or human feedback. For example, Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams by retaining algebraic insights, and GPT-4o's success rate on Game of 24 increased from 10% to 99% after discovering and reusing a Python-based solution. DC also led to near-perfect accuracy in tasks prone to arithmetic mistakes (e.g., balancing equations) by recalling validated code. Beyond arithmetic, DC improved Claude's accuracy by 9% on GPQA-Diamond and 8% on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than full transcripts, and it adapts LMs' problem-solving skills on the fly without modifying underlying parameters. The findings present DC as a promising approach for augmenting LMs with persistent memory, bridging isolated inference events with cumulative, experience-driven learning.
- Original Source Link: https://arxiv.org/abs/2504.07952
- PDF Link: https://arxiv.org/pdf/2504.07952v1.pdf
- Publication Status Comment: The paper is published on arXiv, which is a preprint server. This means it has been shared publicly but may not have undergone formal peer review at a journal or conference yet.
Executive Summary
- Background & Motivation (Why):
  - Problem: Current Large Language Models (LLMs) operate in a vacuum, treating each input query as a standalone problem. They do not retain insights, strategies, or mistakes from past interactions, leading to repetitive re-discovery of solutions and re-committing of errors during inference. This fundamental limitation prevents LLMs from exhibiting cumulative learning akin to human cognition.
  - Importance: This problem limits the efficiency, performance, and adaptability of LLMs in real-world, sequential problem-solving scenarios. Models waste computational resources and time by re-deriving known solutions, struggle with consistency, and cannot autonomously improve their problem-solving heuristics over time. Existing solutions like fine-tuning are costly and modify model parameters, while static retrieval-augmented generation (RAG) systems rely on fixed knowledge bases.
  - Novel Approach: The paper introduces Dynamic Cheatsheet (DC), a lightweight and intuitive framework that endows black-box LLMs with a persistent, evolving memory at inference time. This approach enables the LLM to learn on the fly without modifying its internal parameters or requiring explicit ground-truth labels or human feedback.
- Main Contributions / Findings (What):
  - Novel Framework: Presentation of Dynamic Cheatsheet (DC), a non-parametric, test-time learning framework that allows LLMs to store and reuse strategies, solution sketches, and code snippets.
  - Significant Performance Gains: DC substantially improves LLM performance across a range of challenging tasks. Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams (e.g., AIME 2024 from 23% to 50%). GPT-4o's success rate on Game of 24 soared from ~10% to 99% by discovering and reusing a Python-based brute-force solution. GPT-4o and Claude achieved near-perfect accuracy (98-100%) on the Math Equation Balancer by recalling validated code, up from ~50%. Claude showed notable gains on knowledge-demanding tasks: a 9% improvement on GPQA-Diamond and 8% on MMLU-Pro (Engineering and Physics).
  - Self-Curated Memory: DC's memory is actively managed, focusing on concise, transferable snippets instead of entire transcripts, thus preventing context-window bloat and ensuring efficient retrieval.
  - Black-Box & Parameter-Free Adaptation: DC enhances LLMs' problem-solving skills without fine-tuning or modifying their underlying parameters, making it compatible with black-box APIs (e.g., GPT-4, Claude).
  - Efficient Tool Usage: DC fosters LLMs' inclination towards code generation for computationally intensive tasks, allowing models to offload complex calculations to external tools (e.g., a Python interpreter).
  - Scalability Dependency: The effectiveness of DC is tied to the LLM's scale and generative capacity; smaller models showed limited or inconsistent gains.
Prerequisite Knowledge & Related Work
This section aims to equip a beginner with the necessary context to understand the paper.
- Foundational Concepts:
  - Language Models (LMs) / Large Language Models (LLMs): Deep learning models trained on vast amounts of text data to understand, generate, and process human language. They predict the next word in a sequence, enabling tasks like text generation, question answering, and translation. LLMs are typically characterized by billions or trillions of parameters (learnable values that define the model's behavior).
  - Inference Time: The phase where a trained LLM is used to make predictions or generate outputs for new, unseen inputs. It contrasts with training time, when the model learns from data.
  - Black-Box LLM: An LLM accessible only through an API (Application Programming Interface): users can send inputs and receive outputs, but they cannot access or modify the model's internal parameters or weights. GPT-4 and Claude are examples of black-box LLMs.
  - Fine-tuning: A process where a pre-trained LLM is further trained on a smaller, task-specific dataset to adapt its parameters for a particular application (e.g., medical text generation). This typically requires significant computational resources and access to the model's internal workings.
  - Retrieval-Augmented Generation (RAG): A technique where an LLM is augmented with a retrieval system that fetches relevant information from an external knowledge base (e.g., a database of documents) and provides it as context before the LLM generates a response. This helps LLMs produce more factual and up-to-date outputs.
  - Test-Time Learning (Online/Incremental Learning): A paradigm in which a model's behavior or knowledge is adapted during inference (test time) as new data streams in. Unlike traditional training, it involves continuous, often lightweight updates that improve performance on subsequent tasks without full retraining.
  - Heuristic: A practical, experience-based approach to problem-solving that may not be optimal or perfect but is often sufficient for immediate goals. In LLMs, a heuristic might be a common problem-solving strategy or a rule of thumb.
  - Brute-Force Algorithm: A straightforward problem-solving technique that tries every possible option or combination until a solution is found. While often inefficient for large problem spaces, it guarantees a solution if one exists. GPT-4o's Python solver for Game of 24 is an example.
  - t-SNE (t-Distributed Stochastic Neighbor Embedding): (Mentioned in the Image 10 description.) A dimensionality-reduction technique used to visualize high-dimensional data (such as LLM embeddings) in two or three dimensions, so that similar data points cluster together and dissimilar points lie far apart. It helps reveal relationships and clusters within data.
  - Context Window: The maximum amount of text (measured in tokens) that an LLM can process at once. If the input exceeds this limit, some information must be discarded, which can cause lost information or performance degradation.
- Previous Works & Technological Evolution:
  - Fixed Models Post-Deployment: Traditionally, LLMs (like earlier versions of GPT or Claude) were static once deployed. Their parameters were fixed, so they could not learn or adapt from new experiences during actual use; they approached every problem de novo.
  - Dynamic Evaluation (Krause et al., 2019): An early form of test-time adaptation, in which a language model is updated with gradient steps on the test-time data itself. This requires parameter updates, which is difficult or impossible with black-box APIs.
  - Domain Adaptation (Gururangan et al., 2020): Methods to adapt LMs to specific domains or tasks after pre-training, often through further training on relevant datasets.
  - Traditional Retrieval-Augmented Generation (RAG) (Guu et al., 2020; Zhang et al., 2024b): These systems retrieve facts from a massive static corpus to augment LLM responses. The key distinction from DC is that the knowledge base is typically fixed and does not evolve based on the model's own inference experiences.
  - Iterative Refinement Approaches (Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023), Self-Critic (Gou et al., 2023), Chameleon (Lu et al., 2023), Meta-Prompting (Suzgun & Kalai, 2024), Self-RAG (Asai et al., 2023)): These methods use feedback loops or verification mechanisms to correct mistakes in solutions. While they involve iterative improvement, DC distinguishes itself by storing generalizable heuristics and solution strategies that can be repeatedly retrieved and applied across tasks, rather than just correcting a single solution.
  - Memory-Augmented Generation (Thought-Retriever (Feng et al., 2024), Buffer-of-Thoughts (BoT) (Yang et al., 2025)): These works also explore storing reasoning processes. Thought-Retriever logs chains-of-thought for reuse, and BoT distills thought templates. DC differs by emphasizing selective curation of transferable snippets and by being fully external and training-free.
  - Tool Usage / Code Execution (Schick et al., 2023; Lu et al., 2023): Research showing how LLMs can call external tools such as Python interpreters. DC builds on this by letting the LLM learn when and how to use these tools effectively and store those strategies for future reuse.
- Differentiation:
  - DC vs. Fine-tuning: DC does not modify the LLM's parameters or weights. It is a non-parametric approach, making it compatible with black-box APIs and avoiding expensive retraining.
  - DC vs. Static RAG: DC's memory is dynamic and evolves based on the LLM's test-time experiences (successes and failures), rather than being a fixed corpus.
  - DC vs. Naive Full-History Appending (FH): FH simply adds all prior interactions to the context window, leading to context bloat and diluting relevant information. DC employs active curation, selecting only succinct, useful, and transferable knowledge.
  - DC vs. Other Iterative Refinement Methods: DC focuses on generalizable heuristics and solution sketches that can be applied across similar problems, effectively amortizing the cost of discovering robust strategies. Other methods often focus on improving a single solution or rely on specific feedback mechanisms.
Methodology (Core Technology & Implementation Details)
The Dynamic Cheatsheet (DC) framework operates by endowing a black-box LLM with an external, non-parametric memory that evolves in tandem with the LLM's inference process. Instead of modifying the model's internal weights through gradient-based updates, DC tracks the successes and failures of the model at test time and selectively stores heuristics, strategies, or short textual artifacts that can guide the LLM in future problem instances. This design respects the black-box nature of many commercial LLM APIs.
Principles
The core idea behind DC is to enable an LLM to learn from its own experiences during inference, mimicking how humans accumulate knowledge. This is achieved through:
- External, Non-Parametric Memory: The memory is separate from the LLM's parameters, allowing flexible updates without gradient-based changes.
- Iterative Evolution: The memory is continuously updated, refined, and curated based on the outcomes of the LLM's problem-solving attempts.
- Black-Box Compatibility: The framework works with LLMs whose internal parameters are inaccessible.
- Self-Curation: The system itself decides what to store, discard, or refine in memory, without requiring explicit ground-truth labels or human feedback for every decision.
Steps & Procedures
The DC framework consists of two core modules: generation and curation. These modules can be implemented using the same LLM (with different prompts) or separate LLMs.
1. DC-Cu (Dynamic Cheatsheet - Cumulative)
The DC-Cu variant represents the basic iterative loop of the Dynamic Cheatsheet.
1.1. Solution Generation with Memory
When presented with a new query, the LLM first consults its external memory. This memory contains previously stored insights, strategies, techniques, or heuristics. The LLM then uses this combined information—the new query and the current memory state—to produce a candidate solution.
The process is formalized as:

$$A_t = \mathrm{Gen}(Q_t, M_t)$$

- $Q_t$: the $t$-th new input query or problem presented to the model.
- $M_t$: the current state of the external memory at the $t$-th step, representing accumulated knowledge.
- $\mathrm{Gen}$: the solution generator module (typically the LLM itself, appropriately prompted).
- $A_t$: the candidate solution produced by the model for the query $Q_t$, conditioned on the memory $M_t$.
1.2. Memory Curation Step
After generating a candidate solution for , a curator module (which can be the same LLM with a different prompt) updates the memory. The curator assesses the usefulness, generalizability, and correctness of the generated solution. Since no ground-truth labels are available, the curator must make these assessments autonomously (e.g., by checking for logical consistency, common patterns, or successful execution of generated code).
The memory update process is formalized as:

$$M_{t+1} = \mathrm{Cur}(M_t, Q_t, A_t)$$

- $M_t$: the memory state before the current curation step.
- $Q_t$: the input query for which the solution was generated.
- $A_t$: the candidate solution produced by the generator.
- $\mathrm{Cur}$: the curator module (typically the LLM itself, prompted for curation).
- $M_{t+1}$: the updated state of the memory, incorporating insights from the current interaction.

During curation, Cur primarily considers:

- Usefulness and Generalizability: if $A_t$ is correct or provides valuable, generalizable insights, it is distilled into a concise form suitable for future reference.
- Refinement or Removal: if an existing memory entry was found to be incorrect, or if a more efficient or versatile strategy emerges, Cur may revise or remove the old entry.
- Clarity and Compactness: memory entries are consolidated to maintain succinct, high-impact references and heuristics, preventing memory bloat.

Image 7: Illustration of the Dynamic Cheatsheet (DC-Cu variant), corresponding to Figure 4 in the paper. The left side shows the solution-generation module, where the language model combines its memory with the query to produce an answer; the right side shows the memory-curation module, which evaluates and filters model outputs to update the memory into a better knowledge base.
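The DC-Cu loop described above can be sketched in a few lines of Python. The `generate` and `curate` functions below are stubs standing in for prompted LLM calls; they are illustrative assumptions, not the paper's implementation, and exist only so the control flow can run end to end.

```python
# Minimal sketch of the DC-Cu loop: generate with memory, then curate memory.

def generate(query, memory):
    """Gen(Q_t, M_t): produce a candidate answer conditioned on memory."""
    hint = memory[-1] if memory else "no prior insight"
    return f"answer({query}) via {hint}"

def curate(memory, query, answer):
    """Cur(M_t, Q_t, A_t): distill the interaction into a concise entry."""
    entry = f"insight from {query}"
    if entry not in memory:      # keep memory deduplicated
        memory = memory + [entry]
    return memory[-5:]           # cap size to prevent memory bloat

def dc_cu(queries):
    """Run the cumulative loop over a stream of test queries."""
    memory, answers = [], []
    for q in queries:
        a = generate(q, memory)        # A_t = Gen(Q_t, M_t)
        memory = curate(memory, q, a)  # M_{t+1} = Cur(M_t, Q_t, A_t)
        answers.append(a)
    return answers, memory
```

The key design point this captures is that later queries see the curated entries produced by earlier ones, so the answer to the second query is conditioned on the insight stored after the first.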
2. DC-RS (Dynamic Cheatsheet - Retrieval & Synthesis)
DC-RS modifies the DC-Cu approach by introducing a retrieval mechanism and refining the memory before generating a response. This addresses two potential drawbacks of DC-Cu:
- DC-Cu updates memory only after generation, missing opportunities to incorporate new insights during the current reasoning process.
- DC-Cu does not explicitly store or revisit past input-output pairs, which can be valuable for diverse topics.

The DC-RS workflow is as follows:

1. Retrieval: for a new query $Q_t$, the system first retrieves the top-$k$ most similar past input-output pairs from its knowledge base, using a retrieval module $\mathrm{Ret}$:

   $$R_t = \mathrm{Ret}(Q_t, H_t, k)$$

   - $R_t$: the set of top-$k$ most similar input-output pairs retrieved from past examples for the current query $Q_t$.
   - $\mathrm{Ret}$: the retrieval module.
   - $Q_t$: the current input query.
   - $H_t$: the collection of all previously processed input queries and their generated solutions.
   - $k$: the number of top similar examples to retrieve.

2. Pre-Generation Curation: the retrieved examples and the most recent memory content are passed to the curator to update the memory before generation:

   $$M_t' = \mathrm{Cur}(M_{t-1}, Q_t, R_t)$$

   - $M_{t-1}$: the memory state from the previous step.
   - $R_t$: the retrieved top-$k$ examples.
   - $\mathrm{Cur}$: the curator module.
   - $M_t'$: the updated memory state before generation, now informed by the retrieved examples and the current query.

3. Solution Generation: finally, the generator produces a solution using the new query and the freshly updated memory:

   $$A_t = \mathrm{Gen}(Q_t, M_t')$$

   - $\mathrm{Gen}$: the solution generator module.
   - $A_t$: the final candidate solution for $Q_t$.
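The retrieval step can be sketched with a simple lexical similarity. The paper does not specify this exact retriever; the bag-of-words cosine below is a minimal stand-in to make the top-$k$ selection concrete.

```python
# Minimal sketch of Ret(Q_t, H_t, k): rank past (input, output) pairs by
# bag-of-words cosine similarity to the new query and return the top-k.
from collections import Counter
import math

def similarity(a, b):
    """Cosine similarity between bag-of-words vectors of two strings."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, history, k):
    """Return the k past pairs whose inputs are most similar to `query`."""
    ranked = sorted(history, key=lambda io: similarity(query, io[0]),
                    reverse=True)
    return ranked[:k]
```

A production system would likely use dense embeddings instead of word counts, but the interface is the same: query in, top-$k$ past pairs out.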
Prompts
The paper provides examples of prompts used for the Generator and Memory Curator modules.
Generator Prompt (for DR, FH, and DC Approaches)
This prompt instructs the LLM to act as a problem solver, combining its expertise with provided reference materials. It outlines a structured approach:
- Analysis & Strategy: Analyze the problem, search for patterns and strategies in the cheatsheet, create a structured approach, and document limitations.
- Solution Development: Present clear, logical steps, explain reasoning, provide detailed explanations, and verify assumptions.
- Programming Tasks: For tasks requiring code, it instructs the model to write clean, efficient Python code, explicitly request execution with "EXECUTE CODE!", declare imports, add inline comments, and validate results.
- Final Answer Format: Crucially, it mandates that the final answer be wrapped in specific XML-style tags.
Memory Curation Prompt (under DC-RS)
This prompt guides the curator LLM in its role of maintaining the cheatsheet.
- Purpose and Goals: Emphasizes continuous learning and adaptation.
- Core Responsibilities:
  - Selective Knowledge Retention: Discard redundant or trivial details, ensure effective solutions remain accessible, and incorporate new, superior methods.
  - Continuous Refinement & Optimization: Extract, generalize, and introduce meta-strategies.
  - Structure & Organization: Maintain a well-organized cheatsheet with sections such as Reusable Code Snippets, General Problem-Solving Heuristics, Optimization Techniques, and Specialized Knowledge & Theorems.
- Principles and Best Practices: For each new problem, it instructs the curator to:
  - Evaluate Solution Effectiveness: Assess optimality, potential for improvement, and relevance to existing strategies.
  - Curate & Document Valuable Insights: Identify patterns, edge cases, and insights worth retaining, replacing old versions if a better approach is found.
  - Maintain Concise, Actionable Entries: Keep entries clear, actionable, concise, and focused on widely applicable methods, aiming to extract useful and general solution strategies and/or Python code snippets.
  - Implement a Usage Counter: Use a counter to prioritize frequently used solutions.

These detailed prompts are essential for guiding the black-box LLMs to perform the generation and curation tasks effectively within the DC framework.
Baselines
To quantify the efficacy of memory-driven test-time learning, the DC framework and its variants are compared against four baselines:

- Baseline prompting (BL): A standard vanilla-prompting approach with minimal instructions. It reflects traditional one-off inference, where each query is processed independently without any iterative memory or retrieval mechanism.
- DC Ø (empty memory): This variant uses the DC framework's structured problem-solving and explicit tool-use instructions (from the generator prompt) but always keeps the memory content empty. It isolates the effect of memory curation by showing how much improvement comes purely from storing and reusing knowledge over time, as opposed to just better prompting.
- Full-History Appending (FH): A naive approach where the entire conversation history (all previous input queries and generated outputs) is appended to the model input without any curation or truncation. This can exceed context-window limits and include redundant or low-value information, but it serves as a comparison to DC's curated memory.
- Dynamic Retrieval (DR): This baseline uses retrieval but no curation. For each new query, it retrieves the most similar past interactions and pastes them verbatim into the prompt. The model can see relevant input-output pairs, but no abstract or generalized solutions are codified.

Image 6: Diagram of the pseudocode structures of Baseline, Empty Memory, Full History, and the dynamic memory strategies (DC-RS, DC-Cu, DR), contrasting how each method initializes memory, retrieves examples, and solves problems.
Experimental Setup
The experiments were designed to rigorously evaluate DC's effectiveness on challenging tasks where state-of-the-art LLMs still face limitations, prioritizing tasks demanding multi-step reasoning, heuristic search, strategic adaptation, and cumulative learning.
Datasets
The selected datasets cover algorithmic, logical, and domain-specific reasoning tasks:
- AIME 2020-2025 Exam Questions:
  - Description: The American Invitational Mathematics Examination (AIME) is a prestigious high-school competition featuring complex problems in algebra, combinatorics, number theory, geometry, and probability, requiring deep mathematical reasoning.
  - Subsets Used: AIME 2024 (30 questions), AIME 2025 (30 questions), AIME 2020-2024 (133 questions).
  - Justification: Chosen to stress-test the model's ability to refine its mathematical reasoning over time.
- GPQA-Diamond (Rein et al., 2024):
  - Description: A high-quality, difficult subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark, comprising 198 expert-validated questions across the natural sciences (biology, chemistry, physics). The questions were answered correctly by domain experts but often missed by non-experts.
  - Justification: Ideal for evaluating DC's ability to handle complex, multi-hop reasoning tasks requiring specialized knowledge.
- Game of 24 (Yao et al., 2023; Suzgun & Kalai, 2024):
  - Description: A heuristic-driven arithmetic challenge: given four numbers, the goal is to form an arithmetic expression that uses each number exactly once and evaluates to 24.
  - Data Sample: Input: 7 7 8 11. Output: a valid expression over the four input numbers that evaluates to 24.
  - Size: 100 examples from Suzgun & Kalai (2024).
  - Justification: Emphasizes systematic search, strategic reasoning, and pattern recognition, making it suitable for assessing DC's capacity for refining computational heuristics.
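The paper reports that GPT-4o discovered and reused a Python brute-force solver for this task. A minimal sketch of such a solver (an illustration, not the model's actual snippet) tries every ordering of the numbers, every operator choice, and every parenthesization:

```python
# Brute-force Game of 24 solver: enumerate permutations of the numbers,
# all operator triples, and the five binary parenthesizations.
from itertools import permutations, product

def solve24(nums, target=24, eps=1e-6):
    ops = ['+', '-', '*', '/']
    patterns = [                       # the 5 groupings of 4 operands
        "(({0}{4}{1}){5}{2}){6}{3}",
        "({0}{4}({1}{5}{2})){6}{3}",
        "{0}{4}(({1}{5}{2}){6}{3})",
        "{0}{4}({1}{5}({2}{6}{3}))",
        "({0}{4}{1}){5}({2}{6}{3})",
    ]
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            for pat in patterns:
                expr = pat.format(a, b, c, d, o1, o2, o3)
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr    # first expression that hits 24
                except ZeroDivisionError:
                    continue
    return None                        # no valid expression exists
```

Once a snippet like this lands in the cheatsheet, every subsequent Game of 24 query reduces to retrieving and executing it, which is exactly the mechanism behind the 10% to 99% jump reported for GPT-4o.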
- Math Equation Balancer:
  - Description: Requires the model to complete equations by inserting appropriate operators to form valid expressions.
  - Data Sample: an arithmetic expression with missing operators; any operator placement that yields a valid equality counts as correct.
  - Size: 250 arithmetic expressions.
  - Justification: Focuses on elementary arithmetic reasoning and the sequential placement of operators.
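Since the exact sample format is not fully reproduced above, the following sketch assumes a simplified version of the task: given a list of operands and a target value, find operators (applied left to right) that balance the equation. The function name and input format are illustrative assumptions, not the paper's specification.

```python
# Brute-force balancer sketch: try every operator sequence between the
# operands (evaluated left to right) until one matches the right-hand side.
from itertools import product

def balance(operands, rhs, eps=1e-6):
    ops = {'+': lambda x, y: x + y,
           '-': lambda x, y: x - y,
           '*': lambda x, y: x * y,
           '/': lambda x, y: x / y if y else float('inf')}
    for combo in product(ops, repeat=len(operands) - 1):
        acc = operands[0]
        for op, val in zip(combo, operands[1:]):
            acc = ops[op](acc, val)    # strict left-to-right evaluation
        if abs(acc - rhs) < eps:
            return list(combo)         # operator sequence that balances
    return None
```

For example, `balance([2, 3, 4], 20)` finds `['+', '*']`, i.e. (2 + 3) * 4 = 20. A small exhaustive routine like this is the kind of reusable snippet DC stores to push both models to near-perfect accuracy on this task.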
- MMLU-Pro (Engineering and Physics) (Wang et al., 2024b):
  - Description: A professional-level subset of the MMLU benchmark, focusing specifically on physics and engineering. All questions are in multiple-choice format.
  - Size: 250 questions sampled from each subset (the original dataset has 1,299 physics and 969 engineering questions).
  - Justification: Evaluates DC's ability to maintain a "toolkit" of formulas and general problem-solving patterns in knowledge-intensive domains.
Language Models
The efficacy of DC was evaluated across a range of LLMs:
- State-of-the-art LLMs: GPT-4o and Claude 3.5 Sonnet.
- Smaller-scale counterparts: GPT-4o-mini and Claude 3.5 Haiku.
- Specialized models: DeepSeek R1 (designed for reasoning-intensive tasks), alongside a similar specialized reasoning model.
Evaluation Protocol
All models were instructed to format their final answers in a structured, machine-readable format to ensure accurate and consistent parsing.
Accuracy Metrics
Given the diversity of tasks, different accuracy metrics were used:
- Soft Match (SM):
  - Conceptual Definition: A lenient metric where an answer is considered correct if it matches the ground truth after ignoring minor formatting differences, such as punctuation or whitespace variations. It is suitable for tasks where the core content of the answer matters, not strict formatting.
  - Mathematical Formula: For multiple-choice questions, SM evaluates to 1 if the model's chosen option (after normalization) matches the ground-truth option, and 0 otherwise. There is no single universal formula for Soft Match, since its implementation depends on the task and on what counts as a "minor formatting difference." Conceptually:

    $$\mathrm{SM}(\text{model\_answer}, \text{ground\_truth}) = \mathbb{1}\left[\mathrm{normalize}(\text{model\_answer}) = \mathrm{normalize}(\text{ground\_truth})\right]$$

  - Symbol Explanation:
    - $\mathrm{SM}(\cdot, \cdot)$: the Soft Match score for a given model answer and ground truth.
    - model_answer: the output generated by the LLM.
    - ground_truth: the correct answer from the dataset.
    - $\mathrm{normalize}(\cdot)$: a function that performs text normalization (e.g., lowercasing, removing punctuation, standardizing whitespace) so that answers are comparable despite minor formatting variations.
  - Application: Applied to GPQA-Diamond and MMLU-Pro (Engineering and Physics), which use multiple-choice formats.
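A normalize-then-compare check of this kind can be sketched as follows; this is an illustration of the concept, not the paper's exact implementation.

```python
# Soft Match sketch: lowercase, strip punctuation, collapse whitespace,
# then compare the normalized strings exactly.
import re
import string

def normalize(text):
    text = text.lower().strip()
    text = text.translate(str.maketrans('', '', string.punctuation))
    return re.sub(r'\s+', ' ', text)

def soft_match(model_answer, ground_truth):
    """Return 1 if the normalized answers agree, else 0."""
    return int(normalize(model_answer) == normalize(ground_truth))
```

Under this check, answers like "  (B) " and "b" count as the same choice, while genuinely different options still score 0.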
- Functionally Correct (FC):
  - Conceptual Definition: An even more flexible metric that assesses whether the model's output satisfies the task-specific constraints, even if the exact numeric presentation or formatting differs slightly from the reference solution. This is particularly relevant for problems where multiple valid expressions or numerical forms lead to the same correct result.
  - Mathematical Formula: Like Soft Match, Functionally Correct typically relies on a custom verification function. For math problems, it checks whether the model's derived answer (e.g., a numerical value or a mathematical expression) is equivalent to the ground truth when evaluated:

    $$\mathrm{FC}(\text{model\_output}, \text{ground\_truth}) = \mathbb{1}\left[\mathrm{verify}(\text{model\_output}, \text{ground\_truth})\right]$$

  - Symbol Explanation:
    - $\mathrm{FC}(\cdot, \cdot)$: the Functionally Correct score.
    - model_output: the output generated by the LLM.
    - ground_truth: the correct reference solution.
    - $\mathrm{verify}(\cdot, \cdot)$: a task-specific verification function that checks whether model_output fulfills the problem's requirements or evaluates to the same result as ground_truth. For example, for Game of 24 it checks whether the expression correctly evaluates to 24.
  - Application: Applied to Game of 24, Math Equation Balancer, and the AIME benchmarks.
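A task-specific verify function for Game of 24 might look like the sketch below (illustrative, not the paper's code): an output expression is functionally correct if it uses exactly the given numbers and evaluates to 24, regardless of how it is written.

```python
# Functionally Correct sketch for Game of 24: check that the expression
# uses exactly the given numbers and that it evaluates to 24.
import re

def functionally_correct_24(expr, numbers, eps=1e-6):
    used = sorted(int(n) for n in re.findall(r'\d+', expr))
    if used != sorted(numbers):        # each given number exactly once
        return 0
    try:
        return int(abs(eval(expr) - 24) < eps)
    except (SyntaxError, NameError, ZeroDivisionError):
        return 0                       # malformed or invalid expression
```

Both "(1+2+3)*4" and "4*(1+2+3)" score 1 for the input 1 2 3 4, which is exactly the flexibility FC is meant to provide over exact-string matching.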
Results & Analysis
The results demonstrate that DC enables test-time learning and significantly reduces repetitive errors across various challenging reasoning benchmarks, particularly for larger LLMs.
Core Results
DC Enables Test-Time Learning and Reduces Repetitive Errors
The Game of 24 task provides a striking example. GPT-4o's baseline accuracy was 10%, which dramatically increased to 99% under DC-RS. This was largely due to GPT-4o discovering an efficient Python-based brute-force solver early in the test sequence. Once stored, this snippet was consistently retrieved and applied, eliminating manual arithmetic errors. The comparison with DC Ø (19%) highlights that memory curation and retrieval are the primary drivers of this improvement. In contrast, Claude 3.5 Sonnet showed only marginal gain (12% to 14%), indicating that DC's success depends on the underlying LLM's capacity to identify and encode robust, reusable strategies.
DC Provides Substantial Improvements Across Various Challenging Reasoning Benchmarks
Image 1: Combined bar charts showing the accuracy improvements of the Dynamic Cheatsheet method across models and tasks (AIME 2020-2025, GPQA Diamond, Game of 24, and the Math Equation Balancer), comparing Baseline, DC-Ø, and DC-RS; the DC methods markedly improve model performance.

Image 5: Radar chart of Claude 3.5 Sonnet's overall task performance under baseline prompting versus Dynamic Cheatsheet with Retrieval & Synthesis (DC-RS) across multiple math and knowledge tasks; DC-RS generally outperforms the baseline.

Image 11: Radar chart comparing DC-RS against the baseline across math-competition, arithmetic, and knowledge-reasoning tasks.

Image 12: Radar chart comparing DC-RS accuracy against the baseline across tasks including the AIME exams, Game of 24, GPQA Diamond, and MMLU; the gains are especially pronounced on Game of 24 and the Math Equation Balancer.
- **AIME Exam Problems:** Claude 3.5 Sonnet saw significant improvements on AIME 2020-2024, surging from 6.7% to 40.6% under DC-RS. On AIME 2024, accuracy rose from 23.3% to 50.0%, and on AIME 2025, from 6.7% to 36.7%, under DC-Cu. GPT-4o also gained, with AIME 2024 performance rising from 20.0% to 40.0% under DC-RS, and AIME 2025 from 6.7% to 20.0%. These results indicate that structured test-time memory can effectively tackle difficult math problems.
- **GPQA-Diamond:** Claude 3.5 Sonnet improved from 59.6% to 68.7% under DC-RS (a 9.1-point gain). This shows that memory curation and synthesis provide additional benefits beyond retrieval alone (DR at 63.6%). GPT-4o had only a slight increase (57.1% to 58.1%), suggesting that retrieval can introduce confusion if suboptimal examples are recalled, and that success depends on the model's generation and curation capabilities.
- **Math Equation Balancer:** Both Claude 3.5 Sonnet (44.8% to roughly 98-100% with DC-RS and DC-Cu) and GPT-4o (50.0% to 99-100%) reached near-perfect accuracy. As in Game of 24, models learned and reused an algorithmic or Python-based balancing routine.
- **MMLU-Pro Tasks:** Claude 3.5 Sonnet showed consistent gains, up to 8.0 points in Physics (from 74% to 82%). It stored and retrieved compact "reference guides" on engineering and physics principles. GPT-4o experienced slight decreases, suggesting that domain complexity and baseline knowledge gaps can attenuate DC's benefits if curated memory is unreliable.
Data Presentation (Tables)
Table 1: Performance comparison of Claude 3.5 Sonnet (left five method columns) and GPT-4o (right five) across various tasks under different methods (Accuracy, %). (Manual Transcription)
| Tasks | BL | DC-Ø | DR | DC-Cu. | DC-RS | BL | DC-Ø | DR | DC-Cu. | DC-RS |
|---|---|---|---|---|---|---|---|---|---|---|
| AIME 2024 | 23.3 | 36.7 | 43.3 | 50.0 | 46.7 | 20.0 | 36.7 | 26.7 | 36.7 | 40.0 |
| AIME 2025 | 6.7 | 23.3 | 23.3 | 36.7 | 30.0 | 6.7 | 10.0 | 10.0 | 16.7 | 20.0 |
| AIME 2020-24 | 6.7 | 30.1 | 39.1 | 38.4 | 40.6 | 9.8 | 24.1 | 24.1 | 20.3 | 24.8 |
| Game of 24 | 12.0 | 10.0 | 11.0 | 14.0 | 14.0 | 10.0 | 19.0 | 6.0 | 93.0 | 99.0 |
| GPQA Diamond | 59.6 | 60.1 | 63.6 | 61.1 | 68.7 | 57.1 | 57.1 | 55.1 | 58.1 | 57.1 |
| Math Eqn. Balancer | 44.8 | 56.4 | 60.4 | 100 | 97.8 | 50.0 | 88.0 | 100 | 100 | 99.2 |
| MMLU Pro Eng. | 61.2 | 57.2 | 65.2 | 66.8 | 67.6 | 53.2 | 51.6 | 48.8 | 44.0 | 51.2 |
| MMLU Pro Physics | 74.0 | 75.6 | 80.4 | 77.6 | 82.0 | 75.6 | 70.8 | 75.6 | 70.4 | 75.2 |
Table 2: Performance breakdown of BL (default baseline), FH (full history), DC-Cu, and DC-RS under AIME 2024 and 2025; Claude 3.5 Sonnet (left three columns) vs. GPT-4o (right three). (Manual Transcription)
| Tasks | BL | FH | DC-Cu. | BL | FH | DC-RS |
|---|---|---|---|---|---|---|
| AIME 2024 | 23.3 | 26.7 | 50.0 | 20.0 | 13.3 | 40.0 |
| AIME 2025 | 6.7 | 6.7 | 36.7 | 6.7 | 3.3 | 20.0 |
Table 3: Performance of Claude 3.5 Haiku and GPT-4o-mini across AIME (2024, 2025) and GPQA-Diamond. (Manual Transcription)
| Tasks | BL | DC-Ø | DC-Cu. | DC-RS |
|---|---|---|---|---|
| **Claude 3.5 Haiku** | | | | |
| AIME 2024 | 10.0 | 26.7 | 36.7 | 30.0 |
| AIME 2025 | 0.0 | 13.3 | 13.3 | 10.0 |
| GPQA-Diamond | 43.4 | 41.9 | 43.7 | 49.0 |
| **GPT-4o-mini** | | | | |
| AIME 2024 | 16.7 | 20.0 | 13.3 | 13.3 |
| AIME 2025 | 10.0 | 13.3 | 13.3 | 16.7 |
| GPQA-Diamond | 34.3 | 34.3 | 33.8 | 32.3 |
Table 4: Comparison of majority voting (MV) with DC on AIME for Claude 3.5 Sonnet. (Manual Transcription)
| Tasks | BL | MV(BL) | DC-Ø | MV(DC-Ø) | DC-Cu. |
|---|---|---|---|---|---|
| AIME 2024 | 23.3 | 23.3 | 36.7 | 33.3 | 50.0 |
| AIME 2025 | 6.7 | 6.7 | 23.3 | 23.3 | 36.7 |
Ablations / Parameter Sensitivity
- **DC-Ø vs. DC-RS/DC-Cu:** The DC-Ø baseline (empty memory) measured the impact of structured problem-solving and explicit tool-use prompting without actual memory retention. The significant gap between DC-Ø and DC-RS/DC-Cu (e.g., Game of 24 with GPT-4o: 19% for DC-Ø vs. 99% for DC-RS) shows that storing and reusing knowledge, not just advanced prompting, is the main driver of performance gains.
- **FH (Full-History) vs. DC:** DC's memory curation provides gains over full-history appending. For Claude 3.5 Sonnet on AIME 2024, FH reached 26.7% accuracy, while DC-Cu hit 50.0%. For GPT-4o, FH decreased AIME 2024 performance to 13.3% from a 20.0% baseline, whereas DC-RS achieved 40.0%. This highlights that uncurated, excessive history can overwhelm the model, dilute insights, and increase inference costs, while DC's selective curation ensures efficient access to high-value knowledge.
- **Model Scale and Capacity Impact:** The effectiveness of DC is strongly tied to the LLM's scale and generative capacity.
  - Larger models (Claude 3.5 Sonnet, GPT-4o) showed notable gains across multiple tasks.
  - Smaller models (Claude 3.5 Haiku, GPT-4o-mini) showed limited and inconsistent gains (Table 3). Claude 3.5 Haiku had moderate gains (e.g., AIME 2024 from 10.0% to 36.7% under DC-Cu; GPQA-Diamond from 43.4% to 49.0% under DC-RS). GPT-4o-mini showed even smaller, sometimes negative, gains: on AIME 2024, DC-Cu and DC-RS performed worse than baseline (13.3% vs. 16.7%), and GPQA-Diamond performance was largely stagnant or declined.
  - Reasons for smaller-model limitations:
    - Generative competence: smaller models produce correct solutions less reliably, leading to a sparse or low-quality memory repository.
    - Contextual and memory-curation limitations: they struggle with long-context understanding and memory retrieval, failing to retrieve the most relevant solutions or misapplying retrieved knowledge.
- **Majority Voting (MV) Comparison:** DC outperforms conventional majority voting. On AIME 2024, MV with BL performed identically to BL (23.3%), and MV with DC-Ø slightly underperformed DC-Ø (33.3% vs. 36.7%). In contrast, DC-Cu significantly outperformed MV, reaching 50.0% on AIME 2024 and 36.7% on AIME 2025. This confirms that memory-based adaptation is more effective than simple statistical voting for complex reasoning.
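The FH vs. DC contrast above comes down to what is carried between queries: a growing transcript versus a rewritten cheatsheet. A minimal sketch of the DC-Cu loop, assuming a generic `call_lm` wrapper for any black-box LM API and paraphrased prompts (the paper's actual generator/curator prompts are not reproduced here):

```python
def dynamic_cheatsheet(queries, call_lm, memory=""):
    """Answer queries in sequence while the model curates a shared memory."""
    answers = []
    for q in queries:
        # Generator: solve the query with the current cheatsheet in context.
        answer = call_lm(
            f"Cheatsheet:\n{memory}\n\nQuestion: {q}\nSolve step by step.")
        answers.append(answer)
        # Curator: rewrite the cheatsheet rather than appending the full
        # transcript (the FH baseline); keep concise, transferable insights.
        memory = call_lm(
            f"Current cheatsheet:\n{memory}\n\nNew solved example:\n"
            f"Q: {q}\nA: {answer}\n\nUpdate the cheatsheet: keep general, "
            "reusable strategies; drop redundant or example-specific detail.")
    return answers, memory
```

DC-RS would add a retrieval step before the generator call, surfacing only the stored entries most similar to the incoming query.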
Other Key Observations
- **DC Fosters Efficient Tool Usage / Code Generation:** GPT-4o's shift to Python scripts for Game of 24 is a prime example. The model learned that code-based brute force is more systematic than manual arithmetic, generated a Python function, stored it, and refined it iteratively. This demonstrates DC's potential to nurture LLMs' ability to recognize when external tools are more robust.
  Image 8: Excerpt from GPT-4o's external memory after processing 100 examples from Game of 24 under DC-RS, showing the stored Python code and strategy for solving the puzzle, including step-by-step instructions and an automated script.
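A brute-force Game of 24 solver of the kind GPT-4o converged on might look like the following sketch (an illustration of the technique, not the model's actual stored code):

```python
from itertools import permutations, product

def solve_24(nums, target=24, eps=1e-6):
    """Brute-force Game of 24: try every ordering of the four numbers,
    every choice of three operators, and all five parenthesization
    shapes for four operands; return a solving expression or None."""
    ops = ["+", "-", "*", "/"]
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product(ops, repeat=3):
            exprs = [
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ]
            for e in exprs:
                try:
                    if abs(eval(e) - target) < eps:
                        return e
                except ZeroDivisionError:
                    pass
    return None
```

Because every Game of 24 instance has the same structure, one such routine stored in memory solves the entire task family, which is exactly why accuracy jumped to 99%.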
- **Test-Time Task Similarity and Example Ordering:** DC thrives when test examples share structural similarities. For tasks like Game of 24, Math Equation Balancer, and AIME, discovering a solution or strategy for one problem allowed easy transfer across structurally similar questions. This suggests that a curriculum-style approach, where simpler or archetypal problems are presented first, could bootstrap performance.
- **Reasoning and Information Efficiency:** DC reduces the need to "reinvent the wheel," cutting reasoning overhead and token usage in subsequent queries by encoding and reusing well-established techniques.
- **Clustering of Errors and Corrections:** Experiments suggest that errors and their corrections often cluster in a latent embedding space. Once a high-quality heuristic is acquired for a cluster, it can be applied to tightly embedded neighbors. However, faulty heuristics can also be amplified, underscoring the need for careful curation and pruning to avoid propagating erroneous strategies.
  Image 2: t-SNE visualization of question embeddings in the GPQA Diamond task; point colors encode the combination of baseline and DC-RS correctness, showing where the two approaches diverge.
- **Transferability of Memory Content Across Models:** While larger models can produce high-quality strategies, transferring this memory to smaller models sometimes yielded mixed results. If a smaller model lacks the generative capacity to interpret or refine strategies correctly, its performance can stall or degrade. Memory entries cannot fully compensate for inadequate base capabilities.
Conclusion & Personal Thoughts
Conclusion Summary
The paper effectively introduces Dynamic Cheatsheet (DC), a novel framework that bridges the gap between isolated LLM inference events and the cumulative, experience-driven learning characteristic of human cognition. By endowing black-box LLMs with a persistent, evolving, and self-curated memory, DC enables models to store and reuse problem-solving strategies, code snippets, and general insights at test time. This leads to substantial performance improvements across diverse and challenging tasks, including complex math problems (AIME), arithmetic puzzles (Game of 24, Math Equation Balancer), and knowledge-intensive Q&A (GPQA-Diamond, MMLU-Pro). DC operates without modifying LLM parameters and actively curates memory to avoid context bloat, fostering efficient tool usage and reducing repetitive errors. The findings underscore the critical role of model capacity and task similarity in DC's effectiveness.
Limitations & Future Work (as identified by the authors)
- **Memory Curation Challenges:** DC's memory curation can demand precise reproduction or modification of prior knowledge. LLMs sometimes merely reference or abbreviate existing memory ("Previous content [...] preserved") instead of explicitly rewriting it, which can reduce the quality of stored heuristics over time. Potential solutions include maintaining a structured, external database that the LLM can reference without regenerating large texts.
- **Retrieval Bottlenecks and Noise:** While DC-RS improves accuracy, poorly filtered retrieval mechanisms can introduce confusion, especially with diverse or loosely related queries. GPT-4o occasionally dipped in GPQA-Diamond due to suboptimal retrieval choices. This highlights the need for robust retrieval methods (e.g., dense vector search, advanced ranking algorithms) to surface high-quality exemplars and suppress irrelevant information.
- **Hierarchical and Modular Memory:** For scaling LLM deployments and specialized domains, future work could explore subdividing or hierarchically organizing memory (e.g., separate memories for combinatorics or physics). This could reduce the load on a unified memory, isolate errors, and improve the clarity and reliability of retrieved heuristics.
- **Time and Token Complexity:** Although DC improves efficiency over time by reducing redundant computation and token usage, its sequential structure still poses challenges for large-scale parallel or batch tasks requiring independent inference.
- **Limitations with Smaller Models:** Smaller models (e.g., GPT-4o-mini, Claude 3.5 Haiku, DeepSeek R1) show limited or inconsistent gains. Their restricted generative ability makes it difficult to produce reliable strategies for storage or to interpret retrieved heuristics effectively. DC requires a capable foundation model to seed and refine curated knowledge.
Personal Insights & Critique
- **Novelty and Impact:** The paper presents a highly novel and practical approach. The concept of a self-curated, evolving cheatsheet directly addresses a fundamental limitation of LLMs: their lack of persistent memory. The performance gains, especially the dramatic increases in Game of 24 and Math Equation Balancer accuracy, are compelling. This framework has significant implications for making LLMs more efficient, consistent, and adaptable in real-world applications where sequential problem-solving is common.
- **Practicality and Generalizability:** DC's black-box compatibility is a major strength, allowing it to be used with powerful commercial APIs without access to model weights, making it immediately deployable. While the paper focuses on math, logic, and knowledge tasks, the underlying principle of curating transferable insights could generalize to domains such as coding assistance, scientific discovery, or legal research, where LLMs could maintain libraries of reusable patterns and best practices.
- **Prompt Engineering Complexity:** The effectiveness of DC relies heavily on sophisticated prompt engineering for both the generator and curator modules. Crafting prompts that reliably extract generalizable heuristics and manage memory (e.g., deciding what to keep, discard, or refine) is non-trivial. The quality of memory curation depends directly on the quality of these curator prompts.
- **Risk of Catastrophic Forgetting/Misguidance:** While DC aims to prevent errors, the paper acknowledges that "faulty heuristics that slip into memory can be equally amplified." This reinforcement of bad strategies is a potential risk. Effective pruning and verification mechanisms within the curator are crucial to maintaining memory quality. The "usage counter" proposed in the curator prompt is a good step, but more sophisticated methods for evaluating utility and correctness may be necessary.
- **Open Questions:**
  - Scaling memory size: How does memory scale with an ever-growing number of tasks? While curation keeps it compact, what are the practical limits before retrieval becomes too slow or the LLM struggles with a large cheatsheet?
  - Complexity of curation decisions: How robust is the LLM's ability to discern "usefulness," "generalizability," and "correctness" without external ground truth? This self-assessment is powerful but also the most prone to hallucinations or errors.
  - Proactive strategy discovery: Could DC be extended to proactively discover strategies through self-play or simulated environments, rather than waiting for a successful solution generation?
- **Long-Term Memory Stability:** The paper highlights that truncated memory updates can reduce the quality of stored heuristics, indicating a potential challenge in maintaining long-term coherence and completeness of the cheatsheet.
- **Comparison to Human Learning:** The analogy to human cognition is strong, but humans also forget and selectively update. DC's pruning and refinement mechanisms are a step in this direction, but deeper meta-learning (learning how to learn, or how to curate) within the LLM could be further explored.