Attention Is All You Need for KV Cache in Diffusion LLMs
TL;DR Summary
Proposes Elastic-Cache, a training-free method that adaptively refreshes KV caches in diffusion LLMs based on attention drift and depth scheduling, accelerating inference by up to 45× without accuracy loss.
Abstract
This work studies how to adaptively recompute key-value (KV) caches for diffusion large language models (DLMs) to maximize prediction accuracy while minimizing decoding latency. Standard DLM decoders recompute QKV for all tokens at every denoising step and layer, despite KV states changing little across most steps, especially in shallow layers, leading to substantial redundancy. We make three observations: (1) distant tokens primarily act as a length-bias and can be cached block-wise beyond the active prediction window; (2) KV dynamics increase with depth, suggesting that selective refresh starting from deeper layers is sufficient; and (3) the most-attended token exhibits the smallest KV drift, providing a conservative lower bound on cache change for other tokens. Building on these, we propose Elastic-Cache, a training-free, architecture-agnostic strategy that jointly decides when to refresh (via an attention-aware drift test on the most-attended token) and where to refresh (via a depth-aware schedule that recomputes from a chosen layer onward while reusing shallow-layer caches and off-window MASK caches). Unlike fixed-period schemes, Elastic-Cache performs adaptive, layer-aware cache updates for diffusion LLMs, reducing redundant computation and accelerating decoding with negligible loss in generation quality. Experiments on LLaDA-Instruct, LLaDA-1.5, and LLaDA-V across mathematical reasoning and code generation tasks demonstrate consistent speedups: 8.7x on GSM8K (256 tokens), 45.1x on longer sequences, and 4.8x on HumanEval, while consistently maintaining higher accuracy than the baseline. Our method achieves significantly higher throughput (6.8x on GSM8K) than existing confidence-based approaches while preserving generation quality, enabling practical deployment of diffusion LLMs.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Attention Is All You Need for KV Cache in Diffusion LLMs
- Authors:
- Quan Nguyen-Tri (FPT AI Residency, Hanoi, Vietnam)
- Mukul Ranjan (VILA Lab, MBZUAI, Abu Dhabi, UAE)
- Zhiqiang Shen (VILA Lab, MBZUAI, Abu Dhabi, UAE)
- The authors are affiliated with research institutions known for contributions to AI and machine learning.
- Journal/Conference: This paper is available as a preprint on arXiv and has not yet been published in a peer-reviewed journal or conference. Preprints are common in the fast-moving field of AI, allowing rapid dissemination of new research.
- Publication Year: 2025. The arXiv identifier 2510.14973v1 follows arXiv's YYMM numbering convention, indicating a submission in October 2025.
- Abstract: The abstract introduces the problem of high decoding latency in diffusion large language models (DLMs) due to redundant computation. Standard DLM decoders recompute the entire Query-Key-Value (QKV) matrix for all tokens at every denoising step and in every layer, even though the internal states (Key-Value or KV cache) change minimally. The authors propose
Elastic-Cache, a training-free, adaptive strategy to manage the KV cache. This method decides when to refresh the cache (using an attention-based "drift test") and where to refresh it (using a depth-aware schedule that recomputes only deeper layers). The approach is based on three observations about DLM behavior. Experiments show significant speedups (e.g., up to 45.1x on long sequences) and higher throughput compared to existing methods, all while maintaining or even improving generation accuracy on tasks like mathematical reasoning and code generation.
- Original Source Link:
- Official Source: https://arxiv.org/abs/2510.14973v1
- PDF Link: https://arxiv.org/pdf/2510.14973v1.pdf
- Status: Preprint.
2. Executive Summary
-
Background & Motivation (Why):
- Core Problem: Diffusion Large Language Models (DLMs) are a promising alternative to standard autoregressive models (like GPT) because they can generate text in parallel, potentially offering much faster inference. However, their iterative "denoising" process is computationally expensive. At each step, the model refines its predictions for all masked tokens, requiring a full-pass computation through all layers. This involves re-calculating the Key-Value (KV) cache for every token, creating massive computational redundancy, as most token representations change very little from one step to the next, especially in the early layers of the network.
- Importance & Gaps: Prior acceleration methods for DLMs, such as Fast-dLLM, use a fixed schedule (e.g., refreshing the cache every fixed number of steps or after a block of tokens is decoded). These "one-size-fits-all" approaches are inefficient because they recompute the cache even when it is stable and fail to update it when rapid changes occur. They also treat all layers equally, wasting computation on shallow layers that have already converged while under-serving deeper layers where semantic adjustments are still happening.
- Innovation: This paper introduces an adaptive, fine-grained caching strategy called Elastic-Cache. Instead of a fixed schedule, it uses the model's own attention patterns as a signal to determine if an update is needed and which parts of the model need updating. This aligns computation with the model's actual information flow, drastically reducing waste.
-
Main Contributions / Findings (What):
- Diagnosis of Redundancy: The paper empirically demonstrates and quantifies the redundancy in DLM decoding, showing that Key-Value states change minimally (small KV drift) in shallow layers and for most tokens across denoising steps.
- Elastic-Cache Algorithm: It proposes a novel, training-free, and architecture-agnostic caching policy with two core components:
  - Attention-Aware Refresh Trigger (When): A lightweight test that monitors the change in attention patterns for the single "most-attended" token. If this token's attention distribution shifts significantly, it triggers a cache refresh, acting as a conservative proxy for the entire cache's stability.
  - Depth-Aware Refresh Schedule (Where): When a refresh is triggered at a certain layer, Elastic-Cache only recomputes the KV cache for that layer and all subsequent (deeper) layers, reusing the stable caches of the shallow layers.
- Block-wise MASK Caching: It introduces a strategy to cache the representations of distant, un-decoded (MASK) tokens, which are observed to have negligible influence on the current prediction window.
- Significant Performance Gains: Elastic-Cache achieves substantial speedups (e.g., 8.7x on GSM8K, 45.1x on longer sequences) and higher throughput (6.8x on GSM8K) compared to baselines, often with a slight improvement in accuracy. This makes DLMs more practical for real-world deployment.
3. Prerequisite Knowledge & Related Work
-
Foundational Concepts:
- Transformer & Self-Attention: The core building block of modern LLMs. In a Transformer, for each token, the model computes three vectors: a Query (Q), a Key (K), and a Value (V). The Q vector of a token is compared with the K vectors of all other tokens to compute "attention scores," which determine how much focus to place on each of those other tokens. These scores are then used to create a weighted sum of the V vectors, producing the token's final representation.
- KV Cache in Autoregressive LLMs: In standard models like GPT, text is generated one token at a time. To generate the 100th token, the model needs to attend to the previous 99. Instead of recomputing the K and V vectors for all 99 tokens every time, they are calculated once and stored in a KV Cache. For the 101st token, the model only needs to compute the K and V for the 100th token and append them to the cache. This is efficient because past token representations are fixed.
- Diffusion Models (DMs): A class of generative models that learn to reverse a "noising" process. For images, this involves adding Gaussian noise. For text, a discrete version is used.
- Masked Diffusion Models (DLMs): A specific type of diffusion model for text. The "noising" process involves progressively replacing original tokens with a special [MASK] token. Generation starts with a sequence of all [MASK] tokens. In each "denoising" step, the model predicts the original tokens for some of the [MASK] positions. A key feature is bidirectional attention: every token can attend to every other token, including future ones (which are [MASK]s). This breaks the assumption of standard KV caching, as the representations of all tokens can change at every step, rendering the simple append-only cache invalid.
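To make this last point concrete, here is a minimal sketch (PyTorch, with made-up shapes; not tied to any particular model) of why the append-only cache fits autoregressive decoding but breaks under the bidirectional attention of a masked diffusion model:

```python
# Minimal sketch with made-up shapes: append-only KV caching vs. masked diffusion decoding.
import torch

d_model = 8

# Autoregressive decoding: past K/V never change, so the cache only grows by one row.
k_cache = torch.empty(0, d_model)
v_cache = torch.empty(0, d_model)
for step in range(4):
    k_new, v_new = torch.randn(1, d_model), torch.randn(1, d_model)  # new token only
    k_cache = torch.cat([k_cache, k_new], dim=0)                     # append, never rewrite
    v_cache = torch.cat([v_cache, v_new], dim=0)

# Masked diffusion decoding: bidirectional attention lets every position (prompt, decoded,
# and still-MASKed) see every other one, so the naive decoder recomputes the whole
# (seq_len, d_model) K/V block at every denoising step.
seq_len = 16
for step in range(4):
    k_all = torch.randn(seq_len, d_model)   # stand-in for a full per-step recomputation
    v_all = torch.randn(seq_len, d_model)
```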
-
Previous Works:
- LLaDA and Dream 7B: These are foundational DLMs that demonstrated performance competitive with autoregressive models, establishing the viability of the diffusion paradigm for language. The baseline in this paper is a LLaDA model with no caching.
- Fast-dLLM: A key prior work on accelerating DLMs. It introduced a block-wise caching mechanism: instead of recomputing the cache at every step, it recomputes it only after a "block" of tokens has been decoded. This is a fixed, non-adaptive schedule. Elastic-Cache is directly compared against Fast-dLLM and shown to be superior.
- dKV-Cache and DeepCache: Other caching methods. DeepCache is for standard diffusion models (e.g., for images) and reuses features from deeper layers. dKV-Cache also proposes caching for DLMs but relies on a fixed interval. Elastic-Cache differentiates itself by being fully adaptive and fine-grained, using attention drift as a dynamic signal rather than a fixed interval or block size.
-
Technological Evolution: The field has moved from computationally intensive, full-recomputation DLMs to more efficient versions. The first step was fixed-schedule caching (Fast-dLLM). This paper represents the next logical step: intelligent, adaptive caching that is aware of the model's internal dynamics.
-
Differentiation:
- Elastic-Cache vs. Fast-dLLM: Fast-dLLM uses a rigid, block-wise schedule; Elastic-Cache uses a dynamic, event-driven schedule based on attention drift (see the toy sketch after this list).
- Elastic-Cache vs. interval-based caching: Elastic-Cache is more granular. It decides not only when to update but also where (which layers), whereas interval-based methods typically perform a full refresh.
- Elastic-Cache vs. autoregressive KV cache: The mechanisms are fundamentally different. Autoregressive caching is simple and append-only; Elastic-Cache is a policy for managing a fully dynamic, bidirectional cache.
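A toy illustration of the scheduling difference (function names and thresholds below are illustrative, not from the paper): a fixed-period policy refreshes every k steps regardless of need, while an adaptive policy refreshes only when a drift signal crosses a threshold.

```python
# Toy contrast between fixed-period and drift-triggered cache refresh policies.
def fixed_period_refresh(step: int, period: int = 8) -> bool:
    return step % period == 0            # Fast-dLLM-style: refresh on a rigid schedule

def adaptive_refresh(drift_similarity: float, gamma: float = 0.9) -> bool:
    return drift_similarity < gamma      # Elastic-Cache-style: refresh only on real drift

# A step where the cache barely changed (similarity 0.99) is still refreshed by the
# fixed policy whenever step % period == 0, but skipped by the adaptive one.
print(fixed_period_refresh(16), adaptive_refresh(0.99))   # True False
```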
4. Methodology (Core Technology & Implementation)
The core of Elastic-Cache is built on three empirical observations about DLM behavior during decoding:
- Distant MASK tokens act as a length bias: MASK tokens far from the current decoding position receive very little attention and have minimal impact on the prediction. They can be cached without frequent updates (a toy check of this is sketched after this list).
- KV dynamics increase with depth: The representations in shallow layers of the network stabilize quickly, while deeper layers continue to refine semantic relationships. This suggests that recomputation is most beneficial in deeper layers.
- The most-attended token has the smallest KV drift: The token that receives the most attention from the currently predicted MASKs is the most influential, and its representation tends to be the most stable. Therefore, if this token's representation starts to change significantly, it is a strong, conservative signal that other, less stable tokens are also changing, and a cache refresh is needed.

These observations lead to the Elastic-Cache algorithm, which jointly decides when and where to refresh the cache.
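As a rough, self-contained illustration of observation (1) — the attention matrix and index ranges below are made up, not taken from the paper — one can measure how much attention mass the in-window queries actually send to faraway MASK keys:

```python
# Illustrative check of observation (1): attention mass placed on faraway MASK positions.
import torch

seq_len = 64
attn = torch.softmax(torch.randn(seq_len, seq_len), dim=-1)   # rows = queries, cols = keys

window_rows = list(range(32, 36))     # MASK positions currently being decoded
far_mask_cols = list(range(48, 64))   # MASK positions far beyond the active window

mass_to_far_masks = attn[window_rows][:, far_mask_cols].sum(dim=-1)   # per-query mass
print(mass_to_far_masks.mean())
# In a trained DLM this mass is consistently tiny, which is what justifies caching the
# off-window MASK tokens block-wise and refreshing them only rarely.
```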
Figure 1 Analysis: This figure visually confirms the paper's core motivations. (a) Attention is concentrated on prompt tokens and nearby tokens, while "Faraway MASK tokens" receive negligible attention. (b) The heatmap shows cosine similarity of KV states between steps. The similarity is very high (blue, >0.98) for shallow layers but drops for "Deep layers," indicating larger changes. (d) The most-attended tokens show very high similarity across all layers, confirming they are the most stable.
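The per-layer drift measurement behind Figure 1(b) can be sketched as follows (a simplified, assumed setup: `kv_prev` and `kv_curr` stand for the cached key states of all layers at two consecutive denoising steps, and the synthetic "drift grows with depth" data is for illustration only):

```python
# Sketch of a per-layer KV drift diagnostic in the spirit of Figure 1(b).
import torch
import torch.nn.functional as F

def per_layer_kv_similarity(kv_prev: torch.Tensor, kv_curr: torch.Tensor) -> torch.Tensor:
    # Cosine similarity per token, averaged over the sequence -> one score per layer.
    sim = F.cosine_similarity(kv_prev, kv_curr, dim=-1)   # (num_layers, seq_len)
    return sim.mean(dim=-1)                               # (num_layers,)

num_layers, seq_len, d_model = 32, 128, 64
kv_prev = torch.randn(num_layers, seq_len, d_model)
depth_scale = torch.arange(1, num_layers + 1).view(-1, 1, 1).float()
kv_curr = kv_prev + 0.02 * depth_scale * torch.randn_like(kv_prev)   # deeper = more drift

print(per_layer_kv_similarity(kv_prev, kv_curr))   # similarity decreases with depth
```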
Steps & Procedures
The Elastic-Cache pipeline is contrasted with the Fast-dLLM baseline in Figure 2.
Figure 2 Analysis: This diagram clearly contrasts the two methods. (a) Fast-dLLM uses a fixed block-based approach: it decodes a block of tokens (e.g., 4) and then performs a full KV cache recomputation. (b) Elastic-Cache is more dynamic. At each step, it checks for significant attention change using the cosine-similarity condition. If the condition is met at some layer, it triggers a partial recomputation from that layer up to the final layer, reusing the cache for all shallower layers.
The algorithm proceeds as follows:
- Initialization: At the first denoising step, a full forward pass is performed, and the initial K and V vectors for all tokens (prompt + MASKs) are computed and stored in the cache.
- Sliding Window Decoding: For subsequent steps, computation is focused on a "sliding window" of MASK tokens, the next positions targeted for decoding. Other, "off-window" MASK tokens are assumed to be stable, and their cached representations are reused without recomputation.
- Attention-Aware Trigger (When to Refresh):
  - At each layer and step, the model identifies the most-attended token among the already-decoded tokens: the token whose Key vector receives the highest cumulative attention score from the Query vectors of the MASKs in the sliding window. The quantities involved are the set of tokens decoded before the current step, the set of MASK tokens in the current prediction window, and the attention score matrix at the current step and layer.
  - The algorithm then checks whether the attention distribution directed at this most-attended token has changed significantly from the previous step, by computing the cosine similarity between the attention score vectors for this token at the current and previous steps.
  - If the similarity drops below a predefined threshold γ, a refresh is triggered for that layer.
- Layer-Aware Schedule (Where to Refresh):
  - Let the trigger fire first at some (shallowest) layer.
  - The KV caches of all layers below this layer are reused.
  - The KV caches of this layer and all deeper layers are recomputed, by taking the hidden states output just below the triggering layer and running a full forward pass from that layer to the final layer.
  - This "recompute from here onward" strategy saves significant computation by not re-running the stable shallow layers. (A schematic code sketch of the joint when/where decision follows below.)
The complete process is detailed in Algorithm 1 of the paper.
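The joint when/where decision can be condensed into the following schematic sketch. This is not the authors' released code: the attention tensors, index lists, and the "recompute from this layer onward" return convention are assumptions made for illustration.

```python
# Schematic sketch of the Elastic-Cache when/where decision (illustrative only).
import torch
import torch.nn.functional as F

GAMMA = 0.9  # drift threshold; the paper's ablation sweeps roughly 0.5-0.95

def most_attended_index(attn, window_rows, decoded_cols):
    # Cumulative attention that in-window queries place on each already-decoded key.
    mass = attn[window_rows][:, decoded_cols].sum(dim=0)
    return decoded_cols[int(mass.argmax())]

def layer_triggers_refresh(attn_prev, attn_curr, window_rows, decoded_cols, gamma=GAMMA):
    # "When": compare the attention directed at the most-attended token across steps.
    j = most_attended_index(attn_curr, window_rows, decoded_cols)
    sim = F.cosine_similarity(attn_prev[window_rows, j], attn_curr[window_rows, j], dim=0)
    return bool(sim < gamma)

def first_refresh_layer(attn_prev_layers, attn_curr_layers, window_rows, decoded_cols):
    # "Where": shallowest layer whose trigger fires; caches below it are reused, and the
    # caller recomputes KV from this layer up to the final layer.
    for layer, (a_prev, a_curr) in enumerate(zip(attn_prev_layers, attn_curr_layers)):
        if layer_triggers_refresh(a_prev, a_curr, window_rows, decoded_cols):
            return layer
    return None  # no drift detected anywhere: reuse every layer's cache this step

# Tiny usage example with random attention maps for a 2-layer toy model.
seq = 16
prev = [torch.softmax(torch.randn(seq, seq), dim=-1) for _ in range(2)]
curr = [torch.softmax(torch.randn(seq, seq), dim=-1) for _ in range(2)]
print(first_refresh_layer(prev, curr, window_rows=[8, 9, 10], decoded_cols=[0, 1, 2, 3]))
```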
-
5. Experimental Setup
-
Datasets:
- GSM8K: A dataset of grade school math word problems.
- MATH: A more challenging dataset of competition-level mathematics problems.
- HumanEval & MBPP: Datasets for evaluating code generation from natural language descriptions.
- MathVista & MathVerse: Multimodal datasets that require mathematical reasoning over visual contexts (e.g., charts, diagrams). These datasets were chosen to test the method's effectiveness on complex reasoning tasks where generation quality is critical.
-
Evaluation Metrics:
- Accuracy:
- Conceptual Definition: Measures the correctness of the generated output. The specific calculation depends on the task.
- For GSM8K, flexible_extract is used, which checks whether the final numerical answer in the generation matches the ground truth.
- For MATH, math_verify checks the correctness of mathematical derivations.
- For HumanEval and MBPP, pass@1 is used: the percentage of problems for which the generated code passes a set of unit tests.
- For MathVista/MathVerse, gpt-eval-score uses a powerful model such as GPT-4 to judge the correctness of the response.
- Throughput:
- Conceptual Definition: Measures the speed of generation. It is the total number of tokens generated divided by the total time taken.
- Mathematical Formula: $\text{Throughput} = \dfrac{\text{number of generated tokens}}{\text{total generation time (s)}}$
- Unit: tokens/second. Higher is better.
- Speedup:
- Conceptual Definition: A relative measure of how much faster a method is compared to a baseline.
- Mathematical Formula: $\text{Speedup} = \dfrac{\text{Throughput}_{\text{method}}}{\text{Throughput}_{\text{baseline}}}$
- Unit: A dimensionless factor (e.g., 2.0x means twice as fast).
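As a tiny worked example of the two formulas above (the numbers are illustrative, not taken from the paper's tables):

```python
# Worked example of the throughput and speedup definitions; numbers are illustrative.
def throughput(num_tokens: int, seconds: float) -> float:
    return num_tokens / seconds          # tokens per second

def speedup(method_tps: float, baseline_tps: float) -> float:
    return method_tps / baseline_tps     # dimensionless factor

baseline = throughput(256, 35.0)         # 256 tokens in 35 s  -> ~7.3 tok/s
cached = throughput(256, 4.4)            # 256 tokens in 4.4 s -> ~58.2 tok/s
print(f"{speedup(cached, baseline):.1f}x")   # ~8.0x
```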
-
Baselines:
- LLaDA (No Cache): The standard DLM implementation, in which QKV is recomputed for all tokens at every step and layer. This is the slowest but most accurate baseline.
- Fast-dLLM: The primary competitor, which uses a block-wise caching strategy. It represents the state of the art in fixed-schedule caching for DLMs.
6. Results & Analysis
The experiments convincingly demonstrate that Elastic-Cache provides a superior trade-off between speed and accuracy.
Core Results
The following transcribed tables summarize the main results on the LLaDA-Instruct and LLaDA-1.5 models.
Table 1: Performance on LLaDA-Instruct (Manual Transcription). Each cell shows Accuracy% / Throughput in tokens/s (Speedup).
| Benchmark | Gen Length | LLaDA (Baseline) | Fast-dLLM | Elastic-Cache (Ours) |
|---|---|---|---|---|
| GSM8K (5-shot) | 256 | 78.01 / 7.3 (1.0x) | 77.94 / 53.7 (7.7x) | 78.24 / 58.0 (8.2x) |
| GSM8K (5-shot) | 512 | 77.10 / 3.6 (1.0x) | 74.83 / 44.0 (12.3x) | 77.71 / 90.1 (25.2x) |
| MATH (4-shot) | 256 | 33.58 / 9.5 (1.0x) | 32.50 / 49.0 (5.1x) | 33.14 / 48.7 (5.1x) |
| MATH (4-shot) | 512 | 40.85 / 7.1 (1.0x) | 37.20 / 52.8 (7.4x) | 40.24 / 59.3 (7.9x) |
| HumanEval (0-shot) | 256 | 43.90 / 33.3 (1.0x) | 45.73 / 99.8 (3.0x) | 46.34 / 160.5 (4.8x) |
| HumanEval (0-shot) | 512 | 43.29 / 17.7 (1.0x) | 45.73 / 76.1 (4.3x) | 46.34 / 100.7 (5.0x) |
| MBPP (3-shot) | 256 | 29.80 / 6.5 (1.0x) | 25.40 / 45.1 (7.0x) | 32.20 / 46.9 (7.3x) |
| MBPP (3-shot) | 512 | 15.0 / 4.7 (1.0x) | 13.6 / 44.7 (9.5x) | 15.6 / 63.0 (13.4x) |
Table 2: Performance on LLaDA-1.5 (Manual Transcription). Each cell shows Accuracy% / Throughput in tokens/s (Speedup). Both accelerated methods (Fast-dLLM and Elastic-Cache) use confidence-aware parallel decoding.
| Benchmark | Gen Length | LLaDA-1.5 (Baseline) | Fast-dLLM | Elastic-Cache (Ours) |
|---|---|---|---|---|
| GSM8K (5-shot) | 256 | 80.36 / 6.7 (1.0x) | 80.59 / 51.2 (7.6x) | 81.50 / 58.0 (8.7x) |
| GSM8K (5-shot) | 512 | 81.35 / 2.6 (1.0x) | 80.82 / 36.8 (14.1x) | 81.35 / 117.2 (45.1x) |
| MATH (4-shot) | 256 | 33.52 / 8.5 (1.0x) | 32.74 / 44.4 (5.2x) | 33.50 / 51.0 (6.5x) |
| MATH (4-shot) | 512 | 35.63 / 5.0 (1.0x) | 33.68 / 44.4 (8.8x) | 35.36 / 74.8 (14.9x) |
| HumanEval (0-shot) | 256 | 43.29 / 7.0 (1.0x) | 34.75 / 18.7 (2.7x) | 36.59 / 20.9 (3.0x) |
| HumanEval (0-shot) | 512 | 40.85 / 3.2 (1.0x) | 36.59 / 15.4 (4.8x) | 37.80 / 16.8 (5.3x) |
| MBPP (3-shot) | 256 | 38.00 / 2.4 (1.0x) | 34.60 / 28.0 (11.6x) | 41.20 / 32.7 (13.5x) |
| MBPP (3-shot) | 512 | 38.20 / 1.0 (1.0x) | 36.20 / 17.8 (17.8x) | 39.00 / 32.8 (32.8x) |
Analysis:
- Superior Speed-Accuracy Trade-off: In almost all cases, Elastic-Cache achieves higher throughput than Fast-dLLM while also achieving better accuracy. For example, on GSM8K with LLaDA-1.5 (length 512), Elastic-Cache achieves a staggering 45.1x speedup while maintaining the baseline's 81.35% accuracy. In contrast, Fast-dLLM's speedup is only 14.1x, and its accuracy drops to 80.82%.
- Advantage Increases with Length: The speedup advantage of Elastic-Cache over Fast-dLLM becomes more pronounced for longer generation lengths. In Table 1 (GSM8K), the speedup factor goes from 8.2x (at 256 tokens) to 25.2x (at 512 tokens). This is because the sliding window and selective refresh mechanisms become more effective as the proportion of cacheable "distant" tokens grows.
- Robustness Across Models and Tasks: The gains are consistent across different models (LLaDA-Instruct, LLaDA-1.5, LLaDA-V) and tasks (math, code, multimodal reasoning), demonstrating the general applicability of the method.
Ablations / Parameter Sensitivity

*该图像是包含三部分的图表,分别展示了滑动窗口与块大小(a)、缓存更新时间频率(b)、以及置信感知解码阈值(c)对GSM8K 5-shot准确率和吞吐量的影响。图中用箭头标注了不同配置下的加速比,说明了Elastic-Cache在保持准确率的同时显著提升吞吐量。*
*Figure 3 Analysis: This figure explores the method's sensitivity to its design choices.
(a) Sliding Window vs. Block-wise: The proposed sliding window approach (blue lines) consistently achieves higher accuracy than a block-wise approach (orange lines). While block-wise can have high throughput at a specific size, the sliding window provides a better and more stable accuracy-throughput curve.
(b) Cache Update Frequency vs. Gamma (γ): As the gamma threshold increases (making the trigger more sensitive), throughput decreases because the cache is updated more often. However, accuracy increases. Elastic-Cache (blue line) maintains high accuracy even with low update frequencies, significantly outperforming the no-cache baseline's speed (orange line).
(c) Confidence-Aware Decoding: Elastic-Cache (blue line) maintains a high throughput advantage over the baseline (orange line) across different confidence thresholds for parallel decoding.*
Table 4: Impact of γ on GSM8K (512 tokens) (Manual Transcription). Rows alternate Accuracy% and Throughput in tokens/s (Speedup); the γ columns are Elastic-Cache (Ours) at different thresholds.
| Model | Metric | No Cache | Fast-dLLM | γ=0.5 | γ=0.7 | γ=0.8 | γ=0.85 | γ=0.9 | γ=0.95 |
|---|---|---|---|---|---|---|---|---|---|
| LLaDA | Accuracy (%) | 77.10 | 74.83 | 71.57 | 73.46 | 74.30 | 74.68 | 77.71 | 76.72 |
| LLaDA | Throughput (Speedup) | 3.6 (1.0x) | 44.0 (12.2x) | 109.9 (30.5x) | 108.7 (30.2x) | 103.9 (28.9x) | 99.1 (27.5x) | 91.5 (25.4x) | 75.5 (21.0x) |
| LLaDA-1.5 | Accuracy (%) | 81.35 | 80.82 | 76.04 | 77.63 | 79.45 | 80.21 | 81.35 | 83.02 |
| LLaDA-1.5 | Throughput (Speedup) | 2.6 (1.0x) | 36.8 (14.2x) | 142.7 (54.9x) | 138.6 (53.3x) | 131.2 (50.5x) | 129.9 (50.0x) | 117.2 (45.1x) | 98.4 (37.8x) |
Analysis:
- Tunable Trade-off: Table 4 clearly shows that the hyperparameter γ provides a direct lever to trade speed for accuracy. For LLaDA-1.5, setting γ=0.5 yields a massive 54.9x speedup but drops accuracy to 76.04%. Setting γ=0.95 yields a lower but still large 37.8x speedup while improving accuracy to 83.02%, above the baseline. This tunability is a powerful feature for practical deployment.
- Optimal γ: The best γ depends on the model. For the more accurate LLaDA-1.5, a higher γ is beneficial, suggesting that as models get better, more sensitive cache updates are needed to preserve their performance.
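In deployment, the numbers in Table 4 can be used directly to choose an operating point; the helper below is a hypothetical illustration of that selection (the function name and interface are mine, not the paper's).

```python
# Hypothetical gamma-selection helper: pick the fastest Table-4 setting whose accuracy
# stays above a chosen floor.
def pick_gamma(points, min_accuracy):
    # points: iterable of (gamma, accuracy_percent, tokens_per_second)
    feasible = [p for p in points if p[1] >= min_accuracy]
    return max(feasible, key=lambda p: p[2]) if feasible else None

# LLaDA-1.5 rows from Table 4.
llada_15 = [(0.5, 76.04, 142.7), (0.7, 77.63, 138.6), (0.8, 79.45, 131.2),
            (0.85, 80.21, 129.9), (0.9, 81.35, 117.2), (0.95, 83.02, 98.4)]
print(pick_gamma(llada_15, min_accuracy=81.0))   # -> (0.9, 81.35, 117.2)
```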
7. Conclusion & Reflections
-
Conclusion Summary: The paper successfully identifies and addresses a major bottleneck in diffusion LLM inference: redundant KV cache recomputation. The proposed Elastic-Cache method offers a principled, training-free, and highly effective solution. By using attention dynamics to build an adaptive policy that decides when and where to refresh the cache, it achieves significant latency reductions and throughput improvements with negligible, and sometimes positive, impact on generation quality. This work makes an important contribution toward the practical, large-scale deployment of diffusion LLMs.
Limitations & Future Work: The authors propose several avenues for future work:
- Integrating Elastic-Cache with other acceleration techniques, such as speculative decoding.
- Extending the core principles to autoregressive LLMs and other multimodal diffusion frameworks.
- Integrating
-
Personal Insights & Critique:
- Novelty and Elegance: The central idea of using the most-attended token's stability as a proxy for the entire cache's drift is both clever and elegant. It's a simple heuristic grounded in a strong empirical observation, avoiding the complexity of a learned policy while achieving excellent results. The depth-aware refresh is another simple yet powerful optimization.
- Practical Impact: Being training-free and architecture-agnostic makes Elastic-Cache highly practical. It can be implemented as a wrapper around existing DLMs without requiring model retraining, lowering the barrier to adoption. The dramatic speedups shown could be a game-changer for the real-world viability of DLMs.
- Open Questions:
- While the "most-attended token" is a good heuristic, is it always the optimal one? An alternative could be to monitor a small set of highly-attended tokens to get a more robust signal.
- The method still relies on a manually tuned hyperparameter, γ. While it offers a useful trade-off lever, an interesting research direction would be to learn this threshold dynamically based on the input prompt or decoding state.
- The analysis focuses on latency reduction. A deeper analysis of the memory footprint implications would be valuable, especially as sequence lengths grow.