Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory

Published: 04/11/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Dynamic Cheatsheet equips black-box LLMs with persistent adaptive memory, enabling test-time learning that reuses strategies and code snippets, significantly boosting performance without labeled data or retraining.

Abstract

Despite their impressive performance on complex tasks, current language models (LMs) typically operate in a vacuum: Each input query is processed separately, without retaining insights from previous attempts. Here, we present Dynamic Cheatsheet (DC), a lightweight framework that endows a black-box LM with a persistent, evolving memory. Rather than repeatedly re-discovering or re-committing the same solutions and mistakes, DC enables models to store and reuse accumulated strategies, code snippets, and general problem-solving insights at inference time. This test-time learning enhances performance substantially across a range of tasks without needing explicit ground-truth labels or human feedback. Leveraging DC, Claude 3.5 Sonnet's accuracy more than doubled on AIME math exams once it began retaining algebraic insights across questions. Similarly, GPT-4o's success rate on Game of 24 increased from 10% to 99% after the model discovered and reused a Python-based solution. In tasks prone to arithmetic mistakes, such as balancing equations, DC enabled GPT-4o and Claude to reach near-perfect accuracy by recalling previously validated code, whereas their baselines stagnated around 50%. Beyond arithmetic challenges, DC yields notable accuracy gains on knowledge-demanding tasks. Claude achieved a 9% improvement in GPQA-Diamond and an 8% boost on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than entire transcripts. Unlike finetuning or static retrieval methods, DC adapts LMs' problem-solving skills on the fly, without modifying their underlying parameters. Overall, our findings present DC as a promising approach for augmenting LMs with persistent memory, bridging the divide between isolated inference events and the cumulative, experience-driven learning characteristic of human cognition.

In-depth Reading

Bibliographic Information

  • Title: Dynamic Cheatsheet: Test-Time Learning with Adaptive Memory
  • Authors: Mirac Suzgun, Mert Yuksekgonul, Federico Bianchi, Dan Jurafsky, James Zou
  • Journal/Conference: Published at arXiv (a preprint server for research papers).
  • Publication Year: 2025
  • Abstract: Current language models (LMs) process each input query independently, lacking memory of previous attempts. This paper introduces Dynamic Cheatsheet (DC), a lightweight framework that provides a black-box LM with a persistent, evolving memory. DC allows models to store and reuse strategies, code snippets, and problem-solving insights at inference time, enhancing performance without requiring ground-truth labels or human feedback. For example, Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams by retaining algebraic insights, and GPT-4o's success rate on Game of 24 increased from 10% to 99% after discovering and reusing a Python-based solution. DC also led to near-perfect accuracy in tasks prone to arithmetic mistakes (e.g., balancing equations) by recalling validated code. Beyond arithmetic, DC improved Claude's accuracy by 9% on GPQA-Diamond and 8% on MMLU-Pro problems. Crucially, DC's memory is self-curated, focusing on concise, transferable snippets rather than full transcripts, and adapts LMs' problem-solving skills on the fly without modifying underlying parameters. The findings present DC as a promising approach for augmenting LMs with persistent memory, bridging isolated inference events with cumulative, experience-driven learning.
  • Original Source Link: https://arxiv.org/abs/2504.07952
  • PDF Link: https://arxiv.org/pdf/2504.07952v1.pdf
    • Publication Status Comment: The paper is published on arXiv, which is a preprint server. This means it has been shared publicly but may not have undergone formal peer review by a specific journal or conference yet.

Executive Summary

  • Background & Motivation (Why):

    • Problem: Current Large Language Models (LLMs) operate in a vacuum, treating each input query as a standalone problem. They do not retain insights, strategies, or mistakes from past interactions, leading to repetitive re-discovery of solutions and re-committing of errors during inference. This fundamental limitation prevents LLMs from exhibiting cumulative learning akin to human cognition.
    • Importance: This problem is crucial because it limits the efficiency, performance, and adaptability of LLMs in real-world, sequential problem-solving scenarios. Models waste computational resources and time by re-deriving known solutions, struggle with consistency, and cannot autonomously improve their problem-solving heuristics over time. Existing solutions like fine-tuning are costly and modify model parameters, while static retrieval-augmented generation (RAG) systems rely on fixed knowledge bases.
    • Novel Approach: The paper introduces Dynamic Cheatsheet (DC), a lightweight and intuitive framework that endows black-box LLMs with a persistent, evolving memory at inference time. This approach enables the LLM to learn on the fly without modifying its internal parameters or requiring explicit ground-truth labels or human feedback.
  • Main Contributions / Findings (What):

    • Novel Framework: Presentation of Dynamic Cheatsheet (DC), a non-parametric, test-time learning framework that allows LLMs to store and reuse strategies, solution sketches, and code snippets.
    • Significant Performance Gains: DC substantially improves LLM performance across a range of challenging tasks:
      • Claude 3.5 Sonnet more than doubled its accuracy on AIME math exams (e.g., AIME 2024 from 23% to 50%).
      • GPT-4o's success rate on Game of 24 soared from ~10% to 99% by discovering and reusing a Python-based brute-force solution.
      • GPT-4o and Claude achieved near-perfect accuracy (98-100%) on Math Equation Balancer by recalling validated code, up from ~50%.
      • Claude showed notable gains on knowledge-demanding tasks: 9% improvement on GPQA-Diamond and 8% on MMLU-Pro (Engineering and Physics).
    • Self-Curated Memory: DC's memory is actively managed, focusing on concise, transferable snippets instead of entire transcripts, thus preventing context window bloat and ensuring efficient retrieval.
    • Black-Box & Parameter-Free Adaptation: DC enhances LLMs' problem-solving skills without fine-tuning or modifying their underlying parameters, making it compatible with black-box APIs (e.g., GPT-4, Claude).
    • Efficient Tool Usage: DC fosters LLMs' inclination towards code generation for computationally intensive tasks, allowing models to offload complex calculations to external tools (e.g., Python interpreter).
    • Scalability Dependency: The effectiveness of DC is tied to the LLM's scale and generative capacity; smaller models showed limited or inconsistent gains.

Prerequisite Knowledge & Related Work

This section aims to equip a beginner with the necessary context to understand the paper.

  • Foundational Concepts:

    • Language Models (LMs)/Large Language Models (LLMs): These are deep learning models trained on vast amounts of text data to understand, generate, and process human language. They predict the next word in a sequence, enabling tasks like text generation, question answering, and translation. LLMs are typically characterized by billions or trillions of parameters (learnable values that define the model's behavior).
    • Inference Time: This refers to the phase where a trained LLM is used to make predictions or generate outputs for new, unseen inputs. It contrasts with training time, where the model learns from data.
    • Black-Box LLM: An LLM that is accessible only through an API (Application Programming Interface), meaning users can send inputs and receive outputs, but they cannot access or modify the model's internal parameters or weights. GPT-4 and Claude are examples of black-box LLMs.
    • Fine-tuning: A process where a pre-trained LLM is further trained on a smaller, task-specific dataset to adapt its parameters for a particular application (e.g., medical text generation). This typically requires significant computational resources and access to the model's internal workings.
    • Retrieval-Augmented Generation (RAG): A technique where an LLM is augmented with a retrieval system that can fetch relevant information from an external knowledge base (e.g., a database of documents) and provide it as context to the LLM before generating a response. This helps LLMs produce more factual and up-to-date outputs.
    • Test-Time Learning (Online Learning/Incremental Learning): A paradigm where a model's behavior or knowledge is adapted during inference (or test time) as new data streams in. Unlike traditional training, it involves continuous, often lightweight, updates to improve performance on subsequent tasks without full re-training.
    • Heuristic: A practical, experience-based approach to problem-solving that may not be optimal or perfect but is often sufficient for immediate goals. In LLMs, a heuristic might be a common problem-solving strategy or a rule of thumb.
    • Brute-Force Algorithm: A straightforward problem-solving technique that tries every possible option or combination until a solution is found. While often inefficient for large problem spaces, it guarantees a solution if one exists. GPT-4o's Python solver for Game of 24 is an example.
    • t-SNE (t-Distributed Stochastic Neighbor Embedding): A dimensionality reduction technique used to visualize high-dimensional data (like LLM embeddings) in two or three dimensions, such that similar data points cluster together and dissimilar points lie far apart. It helps reveal relationships and clusters within data (see the t-SNE figure in the Results section).
    • Context Window: The maximum amount of text (measured in tokens) that an LLM can process at once. If input exceeds this limit, some information must be discarded, which can lead to lost information or performance degradation.
  • Previous Works & Technological Evolution:

    • Fixed Models Post-Deployment: Traditionally, LLMs (like earlier versions of GPT or Claude) were static once deployed. Their parameters were fixed, meaning they couldn't learn or adapt from new experiences during actual use. They approached every problem de novo.
    • Dynamic Evaluation (Krause et al., 2019): Early attempts at test-time adaptation involved dynamic evaluation, where a language model might be updated with gradient steps on the test-time data itself. However, this often requires parameter updates, which is difficult or impossible for black-box APIs.
    • Domain Adaptation (Gururangan et al., 2020): Methods to adapt LMs to specific domains or tasks post-pre-training, often through further training on relevant datasets.
    • Traditional Retrieval-Augmented Generation (RAG) (Guu et al., 2020; Zhang et al., 2024b): These systems retrieve facts from a massive static corpus to augment LLM responses. The key distinction from DC is that the knowledge base is typically fixed and does not evolve based on the model's own inference experiences.
    • Iterative Refinement Approaches (Reflexion (Shinn et al., 2023), Self-Refine (Madaan et al., 2023), Self-Critic (Gou et al., 2023), Chameleon (Lu et al., 2023), Meta-Prompting (Suzgun & Kalai, 2024), SelfRAG (Asai et al., 2023)): These methods use feedback loops or verification mechanisms to correct mistakes in solutions. While they involve iterative improvement, DC distinguishes itself by focusing on storing generalizable heuristics and solution strategies that can be repeatedly retrieved and applied across tasks, rather than just correcting a single solution.
    • Memory-Augmented Generation (Thought-Retriever (Feng et al., 2024), Buffer-of-Thoughts (BoT) (Yang et al., 2025)): These works also explore storing reasoning processes. Thought-Retriever logs chains-of-thought for reuse, and BoT distills thought templates. DC differs by emphasizing selective curation of transferable snippets and being fully external and training-free.
    • Tool Usage/Code Execution (Schick et al., 2023; Lu et al., 2023): Research showing how LLMs can call external tools like Python interpreters. DC builds on this by allowing the LLM to learn when and how to use these tools effectively and store those strategies for future reuse.
  • Differentiation:

    • DC vs. Fine-tuning: DC does not modify the LLM's parameters or weights. It's a non-parametric approach, making it compatible with black-box APIs and avoiding expensive retraining.
    • DC vs. Static RAG: DC's memory is dynamic and evolves based on the LLM's test-time experiences (successes and failures), rather than being a fixed corpus.
    • DC vs. Naive Full-History Appending (FH): FH simply adds all prior interactions to the context window, leading to context bloat and diluting relevant information. DC employs active curation, selecting only succinct, useful, and transferable knowledge.
    • DC vs. Other Iterative Refinement Methods: DC focuses on generalizable heuristics and solution sketches that can be applied across similar problems, effectively amortizing the cost of discovering robust strategies. Other methods often focus on improving a single solution or rely on specific feedback mechanisms.

Methodology (Core Technology & Implementation Details)

The Dynamic Cheatsheet (DC) framework operates by endowing a black-box LLM with an external, non-parametric memory that evolves in tandem with the LLM's inference process. Instead of modifying the model's internal weights through gradient-based updates, DC tracks the successes and failures of the model at test time and selectively stores heuristics, strategies, or short textual artifacts that can guide the LLM in future problem instances. This design respects the black-box nature of many commercial LLM APIs.

Principles

The core idea behind DC is to enable an LLM to learn from its own experiences during inference, mimicking how humans accumulate knowledge. This is achieved through:

  • External, Non-Parametric Memory: The memory is separate from the LLM's parameters, allowing for flexible updates without gradient-based changes.
  • Iterative Evolution: The memory is continuously updated, refined, and curated based on the outcomes of the LLM's problem-solving attempts.
  • Black-Box Compatibility: The framework works with LLMs whose internal parameters are inaccessible.
  • Self-Curation: The system itself decides what to store, discard, or refine in memory, without requiring explicit ground-truth labels or human feedback for every decision.

Steps & Procedures

The DC framework consists of two core modules: generation and curation. These modules can be implemented using the same LLM (with different prompts) or separate LLMs.

1. DC-Cu (Dynamic Cheatsheet - Cumulative)

The DC-Cu variant represents the basic iterative loop of the Dynamic Cheatsheet.

1.1. Solution Generation with Memory

When presented with a new query, the LLM first consults its external memory. This memory contains previously stored insights, strategies, techniques, or heuristics. The LLM then uses this combined information—the new query and the current memory state—to produce a candidate solution.

The process is formalized as:

$\tilde{y}_i = \mathsf{Gen}(x_i, M_i)$

  • $x_i$: The $i$-th new input query or problem presented to the model.
  • $M_i$: The current state of the external memory at the $i$-th step, representing accumulated knowledge.
  • $\mathsf{Gen}$: The solution generator module (typically the LLM itself, appropriately prompted).
  • $\tilde{y}_i$: The candidate solution produced by the model for the query $x_i$, conditioned on the memory $M_i$.

1.2. Memory Curation Step

After generating a candidate solution y~i\tilde{y}_i for xix_i, a curator module (which can be the same LLM with a different prompt) updates the memory. The curator assesses the usefulness, generalizability, and correctness of the generated solution. Since no ground-truth labels are available, the curator must make these assessments autonomously (e.g., by checking for logical consistency, common patterns, or successful execution of generated code).

The memory update process is formalized as:

$M_{i+1} = \mathsf{Cur}(M_i, x_i, \tilde{y}_i)$

  • $M_i$: The memory state before the current curation step.
  • $x_i$: The input query for which the solution was generated.
  • $\tilde{y}_i$: The candidate solution produced by the generator.
  • $\mathsf{Cur}$: The curator module (typically the LLM itself, prompted for curation).
  • $M_{i+1}$: The updated state of the memory, incorporating insights from the current interaction.

    During curation, $\mathsf{Cur}$ primarily considers:

  • Usefulness and Generalizability: If $\tilde{y}_i$ is correct or provides valuable, generalizable insights, it is distilled into a concise form suitable for future reference.
  • Refinement or Removal: If an existing memory entry was found to be incorrect, or if a more efficient or versatile strategy emerges, $\mathsf{Cur}$ may revise or remove the old entry.
  • Clarity and Compactness: Memory entries are consolidated to maintain succinct, high-impact references and heuristics, preventing memory bloat.

    Figure 4: Illustration of Dynamic Cheatsheet (DC-Cu variant). The left side shows the solution-generation module, where the language model combines its memory to produce an answer; the right side shows the memory-curation module, which evaluates and filters model outputs to update the memory into a stronger knowledge base.
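To make the two-module loop concrete, here is a minimal sketch of DC-Cu in Python, assuming a hypothetical `call_lm` wrapper around a black-box chat API; the prompt texts are illustrative paraphrases, not the paper's verbatim prompts.

```python
# Minimal DC-Cu sketch. `call_lm` is a hypothetical wrapper around any
# black-box chat API (e.g., GPT-4o or Claude); prompts are paraphrases.

def call_lm(prompt: str) -> str:
    """Placeholder for a black-box LM API call."""
    raise NotImplementedError

def generate(query: str, memory: str) -> str:
    # Gen(x_i, M_i): solve the query conditioned on the current cheatsheet.
    return call_lm(
        f"Cheatsheet of previous insights:\n{memory}\n\n"
        f"Problem:\n{query}\n\n"
        "Solve step by step and wrap the final answer in <answer></answer> tags."
    )

def curate(memory: str, query: str, solution: str) -> str:
    # Cur(M_i, x_i, y~_i): rewrite the cheatsheet, keeping only concise,
    # transferable strategies, code snippets, and heuristics.
    return call_lm(
        f"Current cheatsheet:\n{memory}\n\n"
        f"Latest problem:\n{query}\n\nLatest solution:\n{solution}\n\n"
        "Return an updated cheatsheet: distill generalizable insights, revise "
        "or drop entries shown to be wrong, and keep it compact."
    )

def dc_cu(queries: list[str]) -> list[str]:
    """Process a test-time stream of queries with a cumulative cheatsheet."""
    memory, answers = "(empty)", []
    for query in queries:
        answer = generate(query, memory)        # y~_i = Gen(x_i, M_i)
        memory = curate(memory, query, answer)  # M_{i+1} = Cur(M_i, x_i, y~_i)
        answers.append(answer)
    return answers
```

Note that curation happens after generation here; DC-RS, described next, moves curation before generation.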

2. DC-RS (Dynamic Cheatsheet - Retrieval & Synthesis)

DC-RS modifies the DC-Cu approach by introducing a retrieval mechanism and refining the memory before generating a response. This addresses two potential drawbacks of DC-Cu:

  1. DC-Cu updates memory after generation, missing opportunities to incorporate new insights during the current reasoning process.

  2. DC-Cu doesn't explicitly store or revisit past input-output pairs, which can be valuable for diverse topics.

    The DC-RS workflow is as follows:

  1. Retrieval: For a new query $x_i$, the system first retrieves the top-$k$ most similar past input-output pairs $\{(x_j, \tilde{y}_j)\}_{j<i}$ from its knowledge base, using a retrieval module $\mathsf{Retr}$.

  2. Pre-Generation Curation: The retrieved examples $R_i$ and the most recent memory content $M_{i-1}$ are passed to the curator to update the memory before generation.

  3. Solution Generation: Finally, the generator produces a solution $\tilde{y}_i$, using the new query $x_i$ and the freshly updated memory $M_i$.

    The steps are summarized as:

$R_i = \mathsf{Retr}(x_i, \{(x_j, \tilde{y}_j)\}_{j<i}, k)$

  • $R_i$: The set of top-$k$ most similar input-output pairs retrieved from past examples for the current query $x_i$.
  • $\mathsf{Retr}$: The retrieval module.
  • $x_i$: The current input query.
  • $\{(x_j, \tilde{y}_j)\}_{j<i}$: The collection of all previously processed input queries and their generated solutions.
  • $k$: The number of top similar examples to retrieve.

$M_i = \mathsf{Cur}(M_{i-1}, x_i, R_i)$

  • $M_{i-1}$: The memory state from the previous step.
  • $x_i$: The current input query.
  • $R_i$: The retrieved top-$k$ examples.
  • $\mathsf{Cur}$: The curator module.
  • $M_i$: The updated memory state before generation, now informed by retrieved examples and the current query.

$\tilde{y}_i = \mathsf{Gen}(x_i, M_i)$

  • $x_i$: The current input query.
  • $M_i$: The memory state after pre-generation curation.
  • $\mathsf{Gen}$: The solution generator module.
  • $\tilde{y}_i$: The final candidate solution for $x_i$.
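A minimal DC-RS sketch follows, reusing `call_lm` and `generate` from the DC-Cu sketch above. `embed` is a hypothetical text-embedding call, and cosine similarity stands in for the paper's retrieval module $\mathsf{Retr}$; none of this is the authors' actual implementation.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for an embedding API call."""
    raise NotImplementedError

def retrieve(query: str, history: list[tuple[str, str]], k: int = 3):
    # Retr(x_i, {(x_j, y~_j)}_{j<i}, k): top-k most similar past pairs.
    q = embed(query)
    q = q / np.linalg.norm(q)
    scored = []
    for x, y in history:
        e = embed(x)
        scored.append((float(q @ e) / float(np.linalg.norm(e)), (x, y)))
    scored.sort(key=lambda t: t[0], reverse=True)
    return [pair for _, pair in scored[:k]]

def curate_with_examples(memory: str, query: str, examples) -> str:
    # M_i = Cur(M_{i-1}, x_i, R_i): refresh the cheatsheet *before* generation.
    shown = "\n\n".join(f"Q: {x}\nA: {y}" for x, y in examples)
    return call_lm(
        f"Current cheatsheet:\n{memory}\n\nUpcoming problem:\n{query}\n\n"
        f"Similar solved examples:\n{shown}\n\n"
        "Update the cheatsheet so it holds the most relevant, concise "
        "strategies for the upcoming problem. Return the full cheatsheet."
    )

def dc_rs(queries: list[str], k: int = 3) -> list[str]:
    memory, history, answers = "(empty)", [], []
    for query in queries:
        examples = retrieve(query, history, k)                  # R_i
        memory = curate_with_examples(memory, query, examples)  # M_i
        answer = generate(query, memory)                        # y~_i
        history.append((query, answer))
        answers.append(answer)
    return answers
```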

Prompts

The paper provides examples of prompts used for the Generator and Memory Curator modules.

Generator Prompt (for DR, FH, and DC Approaches)

This prompt instructs the LLM to act as a problem solver, combining its expertise with provided reference materials. It outlines a structured approach:

  1. Analysis & Strategy: Analyze the problem, search for patterns/strategies in the cheatsheet, create a structured approach, and document limitations.
  2. Solution Development: Present clear, logical steps, explain reasoning, provide detailed explanations, and verify assumptions.
  3. Programming Tasks: For tasks requiring code, it instructs the model to write clean, efficient Python code, explicitly request execution with "EXECUTE CODE!", declare imports, add inline comments, and validate results.
  4. Final Answer Format: Crucially, it mandates that the final answer be wrapped in XML-style tags: <answer> (final answer) </answer>. A minimal parsing sketch follows.
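The paper does not show its parsing code, so the extractor below is an assumed implementation of how such tagged answers could be pulled out of a model transcript.

```python
import re

def extract_answer(model_output: str) -> str | None:
    """Pull the final answer out of <answer>...</answer> tags (last match wins)."""
    matches = re.findall(r"<answer>(.*?)</answer>", model_output, flags=re.DOTALL)
    return matches[-1].strip() if matches else None

# extract_answer("... therefore <answer>42</answer>") -> "42"
```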

Memory Curation Prompt (under DC-RS)

This prompt guides the curator LLM in its role of maintaining the cheatsheet.

  • Purpose and Goals: Emphasizes continuous learning and adaptation.
  • Core Responsibilities:
    • Selective Knowledge Retention: Discard redundant/trivial details, ensure effective solutions remain accessible, and incorporate new superior methods.
    • Continuous Refinement & Optimization: Extract, generalize, and introduce meta-strategies.
    • Structure & Organization: Maintain a well-organized cheatsheet with sections like Reusable Code Snippets, General Problem-Solving Heuristics, Optimization Techniques, and Specialized Knowledge & Theorems.
  • Principles and Best Practices: For each new problem, it instructs the curator to:
    1. Evaluate Solution's Effectiveness: Assess optimality, potential for improvement, and relevance to existing strategies.

    2. Curate & Document Valuable Insights: Identify patterns, edge cases, and insights worth retaining, replacing old versions if a better approach is found.

    3. Maintain Concise, Actionable Entries: Keep entries clear, actionable, concise, and focused on widely applicable methods, aiming to extract useful and general solution strategies and/or Python code snippets.

    4. Implement a Usage Counter: Use a counter to prioritize frequently used solutions.

      These detailed prompts are essential for guiding the black-box LLMs to perform the generation and curation tasks effectively within the DC framework.
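To make the division of labor concrete, here is an illustrative curator prompt template paraphrasing the responsibilities above; `CURATOR_PROMPT` is an assumption for this sketch, and the paper's verbatim prompt differs.

```python
# Illustrative curator prompt template paraphrasing the responsibilities
# listed above; not the paper's actual prompt text.
CURATOR_PROMPT = """You maintain a compact cheatsheet for future problems.

Current cheatsheet:
{memory}

New problem and attempted solution:
{query}
{solution}

Update the cheatsheet:
1. Keep only effective, generalizable strategies and Python code snippets.
2. Revise or remove entries that proved wrong or were superseded.
3. Organize entries under: Reusable Code Snippets / General Problem-Solving
   Heuristics / Optimization Techniques / Specialized Knowledge & Theorems.
4. Keep a usage counter next to each entry, e.g. "(used 3x)".

Return the full rewritten cheatsheet and nothing else."""

# Usage inside the DC loop sketched earlier:
# memory = call_lm(CURATOR_PROMPT.format(memory=memory, query=x, solution=y))
```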

Baselines

To quantify the efficacy of memory-driven test-time learning, the DC framework and its variants are compared against four baselines:

  1. Baseline prompting (BL): A standard vanilla prompting approach with minimal instructions. It reflects traditional one-off inference, where each query is processed independently without any iterative memory or retrieval mechanism.

  2. DC Ø (empty memory): This variant uses the DC framework's structured problem-solving and explicit tool-use instructions (from the generator prompt) but always keeps the memory content empty. It isolates the effect of memory curation by showing how much performance improvement comes purely from storing and reusing knowledge over time, as opposed to just better prompting.

  3. Full-History Appending (FH): A naive approach where the entire conversation history (all previous input queries and generated outputs) is appended to the model input without any curation or truncation. This can exceed context-window limits and include redundant or low-value information but serves as a comparison to DC's curated memory.

  4. Dynamic Retrieval (DR): This baseline uses retrieval but no curation. For each new query, it retrieves the most similar past interactions and directly pastes them, verbatim, into the prompt. It allows the model to see relevant input-output pairs but does not codify any abstract or generalized solutions.

    Image 6: Pseudocode comparison of Baseline, Empty Memory (DC Ø), Full History, and the dynamic memory strategies (DC-RS, DC-Cu, DR), contrasting how each method initializes memory, retrieves context, and solves problems.
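Concretely, the baselines differ in how they assemble the model's input context. A minimal sketch mirroring the pseudocode comparison above, reusing the hypothetical `retrieve` helper from the DC-RS sketch; `instructions` stands for the structured generator prompt:

```python
# How each baseline builds the model input. DC Ø keeps the cheatsheet
# permanently empty; FH appends everything; DR pastes top-k pairs verbatim.

def build_context(method: str, query: str, history: list[tuple[str, str]],
                  instructions: str, k: int = 3) -> str:
    if method == "BL":        # vanilla prompting: the query alone
        return query
    if method == "DC-EMPTY":  # DC Ø: structured instructions, empty memory
        return f"{instructions}\n\nCheatsheet: (empty)\n\nProblem:\n{query}"
    if method == "FH":        # full history: every prior pair, uncurated
        past = "\n\n".join(f"Q: {x}\nA: {y}" for x, y in history)
        return f"{past}\n\nProblem:\n{query}"
    if method == "DR":        # dynamic retrieval: top-k past pairs, verbatim
        past = "\n\n".join(f"Q: {x}\nA: {y}" for x, y in retrieve(query, history, k))
        return f"{instructions}\n\n{past}\n\nProblem:\n{query}"
    raise ValueError(f"unknown method: {method}")
```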

Experimental Setup

The experiments were designed to rigorously evaluate DC's effectiveness on challenging tasks where state-of-the-art LLMs still face limitations, prioritizing tasks demanding multi-step reasoning, heuristic search, strategic adaptation, and cumulative learning.

Datasets

The selected datasets cover algorithmic, logical, and domain-specific reasoning tasks:

  • AIME 2020-2025 Exam Questions:

    • Description: The American Invitational Mathematics Examination (AIME) is a prestigious high-school competition featuring complex problems in algebra, combinatorics, number theory, geometry, and probability. These require deep mathematical reasoning.
    • Subsets Used:
      • AIME 2024 (30 questions)
      • AIME 2025 (30 questions)
      • AIME 2020-2024 (133 questions)
    • Justification: Chosen to stress-test the model's ability to refine its mathematical reasoning over time.
  • GPQA-Diamond (Rein et al., 2024):

    • Description: A high-quality, difficult subset of the Graduate-Level Google-Proof Q&A (GPQA) benchmark, comprising 198 expert-validated questions across natural sciences (biology, chemistry, physics). Questions were correctly answered by domain experts but often missed by non-experts.
    • Justification: Ideal for evaluating DC's ability to handle complex, multi-hop reasoning tasks requiring specialized knowledge.
  • Game of 24 (Yao et al., 2023; Suzgun & Kalai, 2024):

    • Description: A heuristic-driven arithmetic challenge: form an expression that equals 24, using each of four given numbers exactly once.
    • Data Sample: Input: 7 7 8 11. Output: 8 * (7 + 7 - 11), which uses the numbers 8, 7, 7, and 11 and evaluates to 8 * (14 - 11) = 8 * 3 = 24.
    • Size: 100 examples from (Suzgun & Kalai, 2024).
    • Justification: Emphasizes systematic search, strategic reasoning, and pattern recognition, making it suitable for assessing DC's capacity for refining computational heuristics.
  • Math Equation Balancer:

    • Description: Requires the model to complete equations by inserting appropriate operators to form valid expressions.
    • Data Sample: Input: 1 ? 2 ? 3 = 6. Correct outputs include 1 + 2 + 3 = 6 and 1 * 2 * 3 = 6.
    • Size: 250 arithmetic expressions.
    • Justification: Focuses on elementary arithmetic reasoning and sequential placement of operators.
  • MMLU-Pro (Engineering and Physics) (Wang et al., 2024b):

    • Description: A professional-level subset of the MMLU benchmark, specifically focusing on physics and engineering. All questions are in a multiple-choice format.
    • Size: Sampled 250 questions from each subset (original dataset has 1,299 physics and 969 engineering questions).
    • Justification: Evaluates DC's ability to maintain a "toolkit" of formulas and general problem-solving patterns in knowledge-intensive domains.

Language Models

The efficacy of DC was evaluated across a range of LLMs:

  • State-of-the-art LLMs: GPT-4o and Claude 3.5 Sonnet.
  • Smaller-scale counterparts: GPT-4o-mini and Claude 3.5 Haiku.
  • Specialized reasoning models: DeepSeek R1 and o1, both designed for reasoning-intensive tasks.

Evaluation Protocol

All models were instructed to format their final answers in a structured, machine-readable format to ensure accurate and consistent parsing: <answer> (final answer) </answer>

Accuracy Metrics

Given the diversity of tasks, different accuracy metrics were used:

  1. Soft Match (SM):

    • Conceptual Definition: A lenient metric where an answer is considered correct if it matches the ground truth after ignoring minor formatting differences, such as punctuation or whitespace variations. This metric is suitable for tasks where the core content of the answer is what matters, not strict formatting.
    • Mathematical Formula: For multiple-choice questions, SM typically evaluates to 1 if the model's chosen option (after normalization) matches the ground-truth option, and 0 otherwise. There is no single universal formula for Soft Match, since what counts as a "minor formatting difference" is task-dependent. Conceptually, for a given model_answer and ground_truth:

$$\mathrm{SM}(\mathrm{model\_answer}, \mathrm{ground\_truth}) = \begin{cases} 1 & \text{if } \mathrm{normalize}(\mathrm{model\_answer}) = \mathrm{normalize}(\mathrm{ground\_truth}) \\ 0 & \text{otherwise} \end{cases}$$

    • Symbol Explanation:
      • $\mathrm{SM}(\mathrm{model\_answer}, \mathrm{ground\_truth})$: The Soft Match score for a given model answer and ground truth.
      • $\mathrm{model\_answer}$: The output generated by the LLM.
      • $\mathrm{ground\_truth}$: The correct answer from the dataset.
      • $\mathrm{normalize}(\cdot)$: A function that performs text normalization (e.g., lowercasing, removing punctuation, standardizing whitespace) to make answers comparable despite minor formatting variations.
    • Application: Applied to GPQA-Diamond and MMLU-Pro (Engineering and Physics), which use multiple-choice formats.
  2. Functionally Correct (FC):

    • Conceptual Definition: An even more flexible metric that assesses whether the model's output satisfies the task-specific constraints, even if the exact numeral presentation or formatting differs slightly from the reference solution. This is particularly relevant for problems where multiple valid expressions or numerical forms can lead to the same correct result.
    • Mathematical Formula: Similar to Soft Match, Functionally Correct typically relies on a custom verification function. For math tasks, it checks whether the model's derived answer (e.g., a numerical value or a mathematical expression) is equivalent to the ground truth when evaluated:

$$\mathrm{FC}(\mathrm{model\_output}, \mathrm{ground\_truth}) = \begin{cases} 1 & \text{if } \mathrm{evaluate\_functionally}(\mathrm{model\_output}, \mathrm{ground\_truth}) \\ 0 & \text{otherwise} \end{cases}$$

    • Symbol Explanation:
      • $\mathrm{FC}(\mathrm{model\_output}, \mathrm{ground\_truth})$: The Functionally Correct score.
      • $\mathrm{model\_output}$: The output generated by the LLM.
      • $\mathrm{ground\_truth}$: The correct reference solution.
      • $\mathrm{evaluate\_functionally}(\cdot)$: A task-specific verification function that checks whether the model output fulfills the problem's requirements or evaluates to the same result as the ground truth. For Game of 24, for example, it checks whether the expression correctly evaluates to 24.
    • Application: Applied to Game of 24, Math Equation Balancer, and the AIME benchmarks. (Minimal sketches of both metrics follow.)
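The sketches below instantiate both metrics under the conceptual definitions above; the paper's exact normalization and verification code is not reproduced here, so treat these as assumptions.

```python
import re

def normalize(text: str) -> str:
    # Lowercase, strip punctuation, collapse whitespace.
    return re.sub(r"\s+", " ", re.sub(r"[^\w\s]", "", text.lower())).strip()

def soft_match(model_answer: str, ground_truth: str) -> int:
    return int(normalize(model_answer) == normalize(ground_truth))

def functionally_correct_24(expr: str, nums: list[int]) -> int:
    """Task-specific FC check for Game of 24: uses the given numbers
    exactly once each and evaluates to 24."""
    used = sorted(int(t) for t in re.findall(r"\d+", expr))
    try:
        return int(used == sorted(nums) and abs(eval(expr) - 24) < 1e-6)
    except (SyntaxError, ZeroDivisionError):
        return 0

print(soft_match("(A)", "a"))                                # 1
print(functionally_correct_24("8*(7+7-11)", [7, 7, 8, 11]))  # 1
```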

Results & Analysis

The results demonstrate that DC enables test-time learning and significantly reduces repetitive errors across various challenging reasoning benchmarks, particularly for larger LLMs.

Core Results

DC Enables Test-Time Learning and Reduces Repetitive Errors

The Game of 24 task provides a striking example. GPT-4o's baseline accuracy was 10%, which dramatically increased to 99% under DC-RS. This was largely due to GPT-4o discovering an efficient Python-based brute-force solver early in the test sequence. Once stored, this snippet was consistently retrieved and applied, eliminating manual arithmetic errors. The comparison with DC Ø (19%) highlights that memory curation and retrieval are the primary drivers of this improvement. In contrast, Claude 3.5 Sonnet showed only marginal gain (12% to 14%), indicating that DC's success depends on the underlying LLM's capacity to identify and encode robust, reusable strategies.
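For illustration, a brute-force solver of the kind GPT-4o reportedly discovered might look like the sketch below; this is a reimplementation under the task's rules, not the model's actual stored snippet.

```python
# Exhaustive Game-of-24 search: all orderings of the four numbers, all
# operator choices, and all five parenthesizations of four operands.
from itertools import permutations, product

def solve_24(nums, target=24, eps=1e-6):
    for a, b, c, d in permutations(nums):
        for o1, o2, o3 in product("+-*/", repeat=3):
            for expr in (
                f"(({a}{o1}{b}){o2}{c}){o3}{d}",
                f"({a}{o1}({b}{o2}{c})){o3}{d}",
                f"({a}{o1}{b}){o2}({c}{o3}{d})",
                f"{a}{o1}(({b}{o2}{c}){o3}{d})",
                f"{a}{o1}({b}{o2}({c}{o3}{d}))",
            ):
                try:
                    if abs(eval(expr) - target) < eps:
                        return expr
                except ZeroDivisionError:
                    continue
    return None

print(solve_24([7, 7, 8, 11]))  # e.g. "8*((7+7)-11)", equivalent to 8*(7+7-11)
```

Once such a routine is stored in the cheatsheet, every later Game of 24 instance reduces to retrieving and executing it, which explains the jump from manual-arithmetic accuracy to near-perfect performance.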

DC Provides Substantial Improvements Across Various Challenging Reasoning Benchmarks

Image 1: Combined bar charts showing accuracy improvements of the Dynamic Cheatsheet method across models and tasks (AIME 2020-2025, GPQA Diamond, Game of 24, and Math Equation Balancer), comparing the Baseline, DC-Ø, and DC-RS methods; the DC methods markedly improve model performance.

Figure 2: Radar chart of overall task performance of Claude 3.5 Sonnet under the baseline prompting approach with minimal instructions (BL) and Dynamic Cheatsheet with Retrieval & Synthesis (DC-RS), across math and knowledge tasks; DC-RS generally outperforms the baseline.

Image 11: Radar chart comparing DC-RS against the baseline across math-competition, arithmetic, and knowledge-reasoning tasks.

Image 12: Radar chart comparing DC-RS accuracy against the baseline across the AIME exams, Game of 24, GPQA Diamond, and MMLU tasks; gains are especially pronounced on Game of 24 and the Math Equation Balancer.

  • AIME Exam Problems: Claude 3.5 Sonnet saw significant improvements on AIME 2020-2024, surging from 6.7% to 40.6% under DC-RS. On AIME 2024, accuracy rose from 23.3% to 50.0%, and on AIME 2025, from 6.7% to 36.7% under DC-Cu. GPT-4o also gained, with AIME 2024 performance rising from 20.0% to 40.0% under DC-RS, and AIME 2025 from 6.7% to 20.0%. These results indicate that structured test-time memory can effectively tackle difficult math problems.

  • GPQA-Diamond: Claude 3.5 Sonnet improved from 59.6% to 68.7% under DC-RS (a 9.1% gain). This shows that memory curation and synthesis provide additional benefits beyond just retrieval (DR at 63.6%). GPT-4o had only a slight increase (57.1% to 58.1%), suggesting that retrieval can introduce confusion if suboptimal examples are recalled, and success depends on the model's generation and curation capabilities.

  • Math Equation Balancer: Both Claude 3.5 Sonnet (44.8% to 98-100% with DC-RS and DC-Cu) and GPT-4o (50.0% to 99-100%) reached near-perfect accuracy. As with Game of 24, the models learned and reused an algorithmic or Python-based balancing routine (sketched after this list).

  • MMLU-Pro Tasks: Claude 3.5 Sonnet showed consistent gains, up to 8.0% in Physics (from 74% to 82%). It stored and retrieved compact "reference guides" on engineering and physics principles. GPT-4o experienced slight decreases, suggesting that domain complexity and baseline knowledge gaps can attenuate DC's benefits if curated memory is unreliable.
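As referenced above, a minimal operator-insertion routine for the balancer task might look like the following. The operator set {+, -, *, /} and the '?' slot format are assumptions based on the data sample; this is not the models' stored code.

```python
# Try every operator assignment for the '?' slots until the equation balances.
from itertools import product

def balance(equation: str, ops="+-*/", eps=1e-6):
    lhs, rhs = equation.split("=")
    target = float(rhs)
    for combo in product(ops, repeat=lhs.count("?")):
        it = iter(combo)
        candidate = "".join(next(it) if ch == "?" else ch for ch in lhs)
        try:
            if abs(eval(candidate) - target) < eps:
                return f"{candidate.strip()} = {rhs.strip()}"
        except ZeroDivisionError:
            continue
    return None

print(balance("1 ? 2 ? 3 = 6"))  # -> "1 + 2 + 3 = 6" (or "1 * 2 * 3 = 6")
```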

Data Presentation (Tables)

Table 1: Performance comparison of Claude 3.5 Sonnet and GPT-4o across various tasks under different methods (accuracy, %; manual transcription).

Claude 3.5 Sonnet:

| Task | BL | DC-Ø | DR | DC-Cu | DC-RS |
|---|---|---|---|---|---|
| AIME 2024 | 23.3 | 36.7 | 43.3 | 50.0 | 46.7 |
| AIME 2025 | 6.7 | 23.3 | 23.3 | 36.7 | 30.0 |
| AIME 2020-24 | 6.7 | 30.1 | 39.1 | 38.4 | 40.6 |
| Game of 24 | 12.0 | 10.0 | 11.0 | 14.0 | 14.0 |
| GPQA Diamond | 59.6 | 60.1 | 63.6 | 61.1 | 68.7 |
| Math Eqn. Balancer | 44.8 | 56.4 | 60.4 | 100 | 97.8 |
| MMLU Pro Eng. | 61.2 | 57.2 | 65.2 | 66.8 | 67.6 |
| MMLU Pro Physics | 74.0 | 75.6 | 80.4 | 77.6 | 82.0 |

GPT-4o:

| Task | BL | DC-Ø | DR | DC-Cu | DC-RS |
|---|---|---|---|---|---|
| AIME 2024 | 20.0 | 36.7 | 26.7 | 36.7 | 40.0 |
| AIME 2025 | 6.7 | 10.0 | 10.0 | 16.7 | 20.0 |
| AIME 2020-24 | 9.8 | 24.1 | 24.1 | 20.3 | 24.8 |
| Game of 24 | 10.0 | 19.0 | 6.0 | 93.0 | 99.0 |
| GPQA Diamond | 57.1 | 57.1 | 55.1 | 58.1 | 57.1 |
| Math Eqn. Balancer | 50.0 | 88.0 | 100 | 100 | 99.2 |
| MMLU Pro Eng. | 53.2 | 51.6 | 48.8 | 44.0 | 51.2 |
| MMLU Pro Physics | 75.6 | 70.8 | 75.6 | 70.4 | 75.2 |

Table 2: Performance breakdown of BL (default baseline), FH (full history), and DC variants on AIME 2024 and 2025 (accuracy, %; manual transcription).

Claude 3.5 Sonnet:

| Task | BL | FH | DC-Cu |
|---|---|---|---|
| AIME 2024 | 23.3 | 26.7 | 50.0 |
| AIME 2025 | 6.7 | 6.7 | 36.7 |

GPT-4o:

| Task | BL | FH | DC-RS |
|---|---|---|---|
| AIME 2024 | 20.0 | 13.3 | 40.0 |
| AIME 2025 | 6.7 | 3.3 | 20.0 |

Table 3: Performance of Claude 3.5 Haiku and GPT-4o-mini across AIME (2024, 2025) and GPQA-Diamond (accuracy, %; manual transcription).

Claude 3.5 Haiku:

| Task | BL | DC-Ø | DC-Cu | DC-RS |
|---|---|---|---|---|
| AIME 2024 | 10.0 | 26.7 | 36.7 | 30.0 |
| AIME 2025 | 0.0 | 13.3 | 13.3 | 10.0 |
| GPQA-Diamond | 43.4 | 41.9 | 43.7 | 49.0 |

GPT-4o-mini:

| Task | BL | DC-Ø | DC-Cu | DC-RS |
|---|---|---|---|---|
| AIME 2024 | 16.7 | 20.0 | 13.3 | 13.3 |
| AIME 2025 | 10.0 | 13.3 | 13.3 | 16.7 |
| GPQA-Diamond | 34.3 | 34.3 | 33.8 | 32.3 |

Table 4: Comparison of majority voting (MV) with DC on AIME, Claude 3.5 Sonnet (accuracy, %; manual transcription).

| Task | BL | MV(BL) | DC-Ø | MV(DC-Ø) | DC-Cu |
|---|---|---|---|---|---|
| AIME 2024 | 23.3 | 23.3 | 36.7 | 33.3 | 50.0 |
| AIME 2025 | 6.7 | 6.7 | 23.3 | 23.3 | 36.7 |

Ablations / Parameter Sensitivity

  • DC Ø vs. DC-RS/DC-Cu: The DC Ø baseline (empty memory) allowed for measuring the impact of structured problem-solving and explicit tool-use prompting without actual memory retention. The significant gap between DC Ø and DC-RS/DC-Cu (e.g., Game of 24 with GPT-4o: 19% for DC Ø vs. 99% for DC-RS) clearly shows that storing and reusing knowledge is the main driver of performance gains, not just advanced prompting.

  • FH (Full-History) vs. DC: DC's memory curation provides gains over full-history appending. For Claude 3.5 Sonnet on AIME 2024, FH reached 26.7% accuracy, while DC-Cu hit 50.0%. For GPT-4o, FH decreased performance on AIME 2024 to 13.3% from a 20.0% baseline, whereas DC-RS achieved 40.0%. This highlights that uncurated, excessive history can overwhelm the model, dilute insights, and increase inference costs, while DC's selective curation ensures efficient access to high-value knowledge.

  • Model Scale and Capacity Impact: The effectiveness of DC is strongly tied to the LLM's scale and generative capacity.

    • Larger Models (Claude 3.5 Sonnet, GPT-4o): Showed notable gains across multiple tasks.
    • Smaller Models (Claude 3.5 Haiku, GPT-4o-mini): Showed limited and inconsistent gains (Table 3).
      • Claude 3.5 Haiku had moderate gains (e.g., AIME 2024 from 10.0% to 36.7% under DC-Cu, GPQA-Diamond from 43.4% to 49.0% under DC-RS).
      • GPT-4o-mini showed even smaller, sometimes negative, gains. On AIME 2024, DC-Cu and DC-RS performed worse than baseline (13.3% vs. 16.7%). GPQA-Diamond performance was largely stagnant or declined.
    • Reasons for Smaller Model Limitations:
      • Generative Competence: Smaller models produce correct solutions less reliably, leading to a sparse or low-quality memory repository.
      • Contextual and Memory Curation Limitations: They struggle with long-context understanding and memory retrieval, failing to retrieve the most relevant solutions or misapplying retrieved knowledge.
  • Majority Voting (MV) Comparison: DC performs better than conventional majority voting (MV). On AIME 2024, MV with BL performed identically to BL (23.3%). MV with DC Ø slightly underperformed DC Ø (33.3% vs. 36.7%). In contrast, DC-Cu significantly outperformed MV, reaching 50.0% on AIME 2024 and 36.7% on AIME 2025. This confirms that memory-based adaptation is more effective than simple statistical voting for complex reasoning.
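For reference, the MV baselines amount to self-consistency-style voting over repeated samples. A minimal sketch, reusing the `generate` and `extract_answer` helpers from earlier sketches; the sample count is an assumption:

```python
from collections import Counter

def majority_vote(question: str, n_samples: int = 5) -> str | None:
    """MV baseline: sample several independent answers, keep the most frequent."""
    answers = [extract_answer(generate(question, "(empty)")) for _ in range(n_samples)]
    votes = Counter(a for a in answers if a is not None)
    return votes.most_common(1)[0][0] if votes else None
```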

Other Key Observations

  • DC Fosters Efficient Tool Usage / Code Generation: GPT-4o's shift to Python scripts for Game of 24 is a prime example. The model learned that code-based brute force is more systematic than manual arithmetic, generated a Python function, stored it, and refined it iteratively. This demonstrates DC's potential to nurture LLMs' ability to recognize when external tools are more robust.

    Figure 5: Excerpt from GPT-4o's external memory after processing 100 examples from Game of 24 under DC-RS. Early in the test sequence, the model discovered a Python-based brute-force solution and stored it, including step-by-step instructions and an automated script for solving the 24 game.

  • Test-Time Task Similarity and Example Ordering: DC thrives when test examples share structural similarities. For tasks like Game of 24, Math Equation Balancer, and AIME, discovering a solution or strategy for one problem allowed for easy transfer across structurally similar questions. This suggests that a curriculum-style learning approach, where simpler or archetypal problems are presented first, could bootstrap performance.

  • Reasoning and Information Efficiency: DC reduces the need to "reinvent the wheel," cutting down reasoning overhead and token usage in subsequent queries by encoding and reusing well-established techniques.

  • Clustering of Errors and Corrections: Experiments suggest that errors and their corrections often cluster in a latent embedding space. Once a high-quality heuristic is acquired for a cluster, it can be applied to tightly embedded neighbors. However, faulty heuristics can also be amplified, underscoring the need for careful curation and pruning to avoid propagating erroneous strategies. (A plotting sketch follows this list.)

    Image 2: t-SNE projection of question embeddings for the GPQA Diamond task; point colors encode the combination of baseline and DC-RS correctness, highlighting where the two approaches diverge.

  • Transferability of Memory Content Across Models: While larger models can produce high-quality strategies, transferring this memory to smaller models sometimes yielded mixed results. If a smaller model lacks the generative capacity to interpret or refine strategies correctly, its performance can stall or degrade. Memory entries cannot fully compensate for inadequate base capabilities.
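As referenced above, the clustering observation can be reproduced qualitatively with a t-SNE projection. The sketch below uses random vectors as stand-ins for real question embeddings and correctness flags; swap in an embedding API and actual evaluation results for real use.

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
emb = rng.normal(size=(198, 256))  # stand-in for 198 GPQA-Diamond question embeddings
bl_ok = rng.random(198) < 0.60     # stand-in baseline correctness flags
dc_ok = rng.random(198) < 0.69     # stand-in DC-RS correctness flags

# Project to 2-D and color by the four (baseline, DC-RS) correctness combinations.
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(emb)
colors = bl_ok.astype(int) * 2 + dc_ok.astype(int)
plt.scatter(xy[:, 0], xy[:, 1], c=colors, cmap="viridis", s=12)
plt.title("GPQA-Diamond questions: baseline vs. DC-RS correctness")
plt.show()
```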

Conclusion & Personal Thoughts

Conclusion Summary

The paper effectively introduces Dynamic Cheatsheet (DC), a novel framework that bridges the gap between isolated LLM inference events and the cumulative, experience-driven learning characteristic of human cognition. By endowing black-box LLMs with a persistent, evolving, and self-curated memory, DC enables models to store and reuse problem-solving strategies, code snippets, and general insights at test time. This leads to substantial performance improvements across diverse and challenging tasks, including complex math problems (AIME), arithmetic puzzles (Game of 24, Math Equation Balancer), and knowledge-intensive Q&A (GPQA-Diamond, MMLU-Pro). DC operates without modifying LLM parameters and actively curates memory to avoid context bloat, fostering efficient tool usage and reducing repetitive errors. The findings underscore the critical role of model capacity and task similarity in DC's effectiveness.

Limitations & Future Work (as identified by the authors)

  • Memory Curation Challenges: DC's memory curation can demand precise reproduction or modification of prior knowledge. LLMs sometimes merely reference or abbreviate existing memory ("Previous content [...] preserved") instead of explicitly rewriting it, which can reduce the quality of stored heuristics over time. Potential solutions include maintaining a structured, external database that the LLM can reference without regenerating large texts.
  • Retrieval Bottlenecks and Noise: While DC-RS improves accuracy, poorly filtered retrieval mechanisms can introduce confusion, especially with diverse or loosely related queries. GPT-4o occasionally dipped in GPQA-Diamond due to suboptimal retrieval choices. This highlights the need for robust retrieval methods (e.g., dense vector search, advanced ranking algorithms) to surface high-quality exemplars and suppress irrelevant information.
  • Hierarchical and Modular Memory: For scaling LLM deployments and specialized domains, future work could explore subdividing or hierarchically organizing memory (e.g., separate memories for combinatorics or physics). This could reduce the load on a unified memory, isolate errors, and improve clarity and reliability of retrieved heuristics.
  • Time and Token Complexity: Although DC optimizes efficiency over time by reducing redundant computation and token usage, its sequential structure still poses challenges for large-scale parallel or batch tasks requiring independent inference.
  • Limitations with Smaller Models: Smaller or specialized models (e.g., GPT-4o-mini, Claude 3.5 Haiku, DeepSeek R1, o1) show limited or inconsistent gains. Their restricted generative ability makes it difficult for them to produce reliable strategies for storage or to interpret retrieved heuristics effectively. DC requires a capable foundation model to seed and refine curated knowledge.

Personal Insights & Critique

  • Novelty and Impact: The paper presents a highly novel and practical approach. The concept of a self-curated, evolving cheatsheet directly addresses a fundamental limitation of LLMs – their lack of persistent memory. The performance gains, especially the dramatic increase in Game of 24 and Math Equation Balancer accuracy, are compelling. This framework has significant implications for making LLMs more efficient, consistent, and adaptable in real-world applications where sequential problem-solving is common.
  • Practicality and Generalizability: DC's black-box compatibility is a huge strength, allowing it to be used with powerful commercial APIs without needing access to model weights. This makes it immediately deployable. While the paper focuses on math, logic, and knowledge tasks, the underlying principle of curating transferable insights could generalize to various domains, such as coding assistance, scientific discovery, or legal research, where LLMs could maintain libraries of reusable patterns and best practices.
  • Prompt Engineering Complexity: The effectiveness of DC heavily relies on sophisticated prompt engineering for both the generator and curator modules. Crafting prompts that reliably extract generalizable heuristics and manage memory (e.g., deciding what to keep, discard, or refine) is a non-trivial task. The quality of memory curation is directly proportional to the quality of these curator prompts.
  • Risk of Catastrophic Forgetting/Misguidance: While DC aims to prevent errors, the paper acknowledges that "faulty heuristics that slip into memory can be equally amplified." This catastrophic reinforcement of bad strategies is a potential risk. Effective pruning and verification mechanisms within the curator are crucial to maintain memory quality. The "usage counter" proposed in the curator prompt is a good step, but more sophisticated methods for evaluating utility and correctness might be necessary.
  • Open Questions:
    • Scaling Memory Size: How does memory scale with an ever-growing number of tasks? While curation keeps it compact, what are the practical limits before retrieval becomes too slow or the LLM struggles with a large cheatsheet?
    • Complexity of Curation Decisions: How robust is the LLM's ability to discern "usefulness," "generalizability," and "correctness" without external ground truth? This self-assessment is powerful but also the most prone to LLM hallucinations or errors.
    • Proactive Strategy Discovery: Could DC be extended to proactively discover strategies through self-play or simulated environments, rather than waiting for a successful solution generation?
  • Long-Term Memory Stability: The paper highlights that truncated memory updates can reduce the quality of stored heuristics. This indicates a potential challenge in maintaining long-term coherence and completeness of the cheatsheet.
  • Comparison to Human Learning: The analogy to human cognition is strong, but humans also forget or selectively update. DC's pruning and refinement mechanisms are a step towards this, but the depth of meta-learning (learning how to learn or how to curate) within the LLM for DC could be further explored.
