dKV-Cache: The Cache for Diffusion Language Models
TL;DR Summary
dKV-Cache introduces a delayed KV-cache for diffusion language models, enabling 2-10× faster inference by conditionally caching key-value states. Two variants balance speed and performance, revealing underutilized context and narrowing efficiency gaps with autoregressive models.
Abstract
Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. In the end, dKV-Cache achieves a 2-10x inference speedup, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that caching can also be used in DLMs, even in a training-free manner with current DLMs.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
dKV-Cache: The Cache for Diffusion Language Models
1.2. Authors
Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Affiliation: National University of Singapore
1.3. Journal/Conference
The paper is an arXiv preprint, indicating it has not yet undergone formal peer review and publication in a journal or conference. arXiv is a widely recognized platform for disseminating research quickly in fields like AI, often serving as a precursor to formal publication.
1.4. Publication Year
2025
1.5. Abstract
Diffusion Language Models (DLMs) are emerging as a promising alternative to autoregressive language models (ARs) but are hampered by slow inference speeds. This bottleneck primarily stems from their non-autoregressive architecture and bidirectional attention, which prevent the direct use of the key-value (KV) cache mechanism that typically accelerates AR decoding. To address this, the authors propose dKV-Cache (delayed KV-Cache), a KV-cache-like mechanism tailored for the denoising process in DLMs. The approach is motivated by the observation that token representations exhibit distinct dynamics during the diffusion process.
dKV-Cache employs a delayed and conditioned caching strategy for key and value states. Two complementary variants are introduced:
- dKV-Cache-Decode: Offers almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs might under-utilize contextual information.
- dKV-Cache-Greedy: Achieves higher speed-ups, with quadratic time complexity, at the cost of some performance degradation, through aggressive caching with a reduced cache lifespan.

Overall, dKV-Cache achieves inference speedups of 2-10x, significantly narrowing the speed gap between ARs and DLMs. The method is evaluated on various benchmarks (general language understanding, mathematical reasoning, code generation) and demonstrates that caching can be applied to DLMs in a training-free manner using existing models.
1.6. Original Source Link
Official Source or PDF Link: https://arxiv.org/abs/2505.15781
Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the critical issue of slow inference in Diffusion Language Models (DLMs). While DLMs offer a promising alternative to traditional autoregressive language models (ARs) due to their parallel decoding potential, their practical inference speed remains significantly slower. This inefficiency is a major barrier to their widespread adoption.
The core problem lies in the architectural differences between DLMs and ARs. ARs benefit greatly from a mechanism called KV-Cache, which reuses previously computed key and value states in the attention mechanism, accelerating decoding. However, DLMs, with their non-autoregressive architecture and bidirectional attention, cannot directly leverage the standard KV-Cache. This is because:
- Timestep-variant key and value states: In ARs, key and value states for prior tokens remain fixed due to the causal attention mask. DLMs use bidirectional attention, meaning key and value states can change at every timestep, making direct reuse problematic.
- Non-sequential decoding order: ARs decode strictly left-to-right, allowing deterministic computation of the next token's QKV states. DLMs can update any masked position dynamically, making it impossible to predict which QKV states need to be computed or reused.

These limitations lead to a cubic overall time complexity of O(L^3) for DLMs (where L is the sequence length), compared to the quadratic overall complexity of O(L^2) for ARs using KV-Cache. The paper's innovative idea is to adapt the concept of KV-Cache to DLMs by observing the representation dynamics of tokens during the diffusion process. The authors hypothesize that despite bidirectional attention, key and value states are not entirely unreusable but instead require a "delayed and conditioned reuse."
2.2. Main Contributions / Findings
The paper makes several key contributions:
- First KV-Cache for DLMs: It proposes dKV-Cache (delayed KV-Cache), the first mechanism to integrate KV-Cache-like functionality into diffusion language models, specifically designed to be compatible with their bidirectional attention and non-sequential decoding. This addresses a long-standing bottleneck.
- Leveraging Representation Dynamics: The design is motivated by empirical observations that key and value states exhibit consistent patterns: they are relatively stable after a token is decoded, with the most significant changes occurring precisely at the decoding step. This insight underpins the delayed caching strategy.
- Two Practical Variants:
  - dKV-Cache-Decode: Achieves almost lossless acceleration and, in some cases, even improves performance on long sequences, suggesting better utilization of contextual information.
  - dKV-Cache-Greedy: Offers higher speed-ups by aggressively refreshing caches, reducing the per-step time complexity from O(L^2) to O(L) at the cost of some performance degradation.
- Significant Inference Speedup: dKV-Cache delivers 2-10x inference speedups on 7B-scale diffusion language models (LLaDA, Dream) across diverse benchmarks (general language understanding, math, code generation), significantly closing the gap with ARs.
- Training-Free Application: The method can be applied to existing DLMs without requiring any re-training, making it immediately usable.
The key conclusion is that caching is indeed viable for DLMs, and by carefully designing a delayed and conditioned caching strategy, significant inference speedups can be achieved with minimal or even positive impact on performance, especially for long contexts.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the innovations of dKV-Cache, a foundational understanding of several key concepts is essential for a beginner:
Autoregressive Language Models (ARs)
Autoregressive Language Models (ARs), such as GPT or LLaMA, generate text sequentially, token by token, from left to right. When predicting the next token, the model considers all previously generated tokens as context. This sequential generation means that each new token depends only on tokens that came before it in the sequence. This property is crucial for the efficiency of KV-Cache.
Diffusion Language Models (DLMs)
Diffusion Language Models (DLMs) are a newer paradigm for text generation, inspired by diffusion models in computer vision. Unlike ARs, DLMs generate text by gradually refining a sequence of initially noisy or masked tokens into a coherent output.
- Forward Process: This process gradually adds noise (e.g., masking tokens) to a clean text sequence until it becomes completely noisy or masked; a minimal sketch of this masking step is shown right after this list.
- Reverse (Denoising) Process: This is the generative part. A neural network learns to reverse the forward process, gradually removing noise or unmasking tokens to reconstruct the original text. This denoising often happens in multiple steps.
- Non-autoregressive: DLMs are typically non-autoregressive, meaning they can update multiple tokens or even the entire sequence in parallel at each denoising step, rather than generating one token at a time. This potential for parallelism is a key advantage.
- Bidirectional Attention: A core component of DLMs (and other non-autoregressive models) is bidirectional attention. Unlike the causal attention mask used in ARs, bidirectional attention allows each token to attend to all other tokens in the sequence (both preceding and succeeding tokens) when computing its representation. This comprehensive contextual understanding is powerful but complicates caching.
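To make the forward (masking) process concrete, here is a minimal PyTorch sketch. The function name and the uniform per-token masking schedule are illustrative assumptions, not the exact corruption schedule used by LLaDA or Dream.

```python
import torch

def forward_mask(tokens: torch.Tensor, t: float, mask_id: int) -> torch.Tensor:
    """Toy forward process of a masked diffusion LM: each token is independently
    replaced by [MASK] with probability t, where t runs from 0 (clean) to 1 (fully masked)."""
    noise = torch.rand(tokens.shape)
    return torch.where(noise < t, torch.full_like(tokens, mask_id), tokens)

# Example: a length-8 "sentence" of token ids, corrupted at t = 0.5.
x0 = torch.arange(8)
print(forward_mask(x0, t=0.5, mask_id=-1))  # roughly half the positions become -1 ([MASK])
```

The reverse (denoising) process then runs this corruption backwards: the network repeatedly predicts the masked positions and commits a subset of them at each step.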
Transformers
The Transformer architecture is the backbone of most modern language models, including both ARs and DLMs. It revolutionized sequence processing by relying entirely on attention mechanisms instead of recurrent neural networks (RNNs).
- Encoder-Decoder/Decoder-Only: Transformers can have an encoder-decoder structure (for tasks like machine translation) or be decoder-only (common for large language models like GPT). DLMs often use a Transformer-like architecture for their denoising network.
- Attention Mechanism: The core innovation of Transformers. It allows the model to weigh the importance of different parts of the input sequence when processing each token. It computes an output based on three inputs: a Query (Q), a Key (K), and a Value (V).
Attention Mechanism (Detailed)
The attention mechanism calculates a weighted sum of value vectors, where the weights are determined by the similarity between a query vector and all key vectors.
The standard scaled dot-product attention formula is:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- Q (Query): A matrix where each row represents a query vector for a token.
- K (Key): A matrix where each row represents a key vector for a token.
- V (Value): A matrix where each row represents a value vector for a token.
- d_k: The dimension of the key vectors, used to scale the dot products to prevent them from becoming too large, which could push the softmax function into regions with very small gradients.
- QK^T: The dot product of the query and key matrices, which measures the similarity between each query and all keys.
- softmax: A function that converts the similarity scores into probability distributions, ensuring the attention weights sum to 1.
- Attention(Q, K, V): The output of the attention mechanism, a weighted sum of the value vectors.

In a Transformer layer, for each token in the input sequence, its hidden state is projected into Query, Key, and Value vectors using learned weight matrices (W_Q, W_K, W_V). A minimal code sketch of this computation follows.
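The sketch below is illustrative only: the tensor shapes and the random projection matrices stand in for a trained model's learned weights.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [batch, seq_len, d_k]
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # similarity of each query with every key
    weights = F.softmax(scores, dim=-1)            # attention weights sum to 1 over the keys
    return weights @ V                             # weighted sum of the value vectors

# Toy usage: one sequence of 4 tokens with hidden size 8.
x = torch.randn(1, 4, 8)
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))  # stand-ins for learned projections
out = scaled_dot_product_attention(x @ W_q, x @ W_k, x @ W_v)
print(out.shape)  # torch.Size([1, 4, 8])
```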
KV-Cache
The Key-Value Cache (KV-Cache) is a crucial optimization for accelerating inference in autoregressive Transformers. In AR models, when generating a sequence token by token:
- For the first token, all Query, Key, and Value vectors for that token are computed.
- For subsequent tokens, only the Query for the new token needs to be computed. The Key and Value vectors for all previous tokens are already known and can be reused (cached) from the previous steps. This reuse is possible because of the causal attention mask in ARs, which ensures that a token can only attend to itself and preceding tokens. Therefore, the key and value states of prior tokens do not change once they are computed.

The KV-Cache significantly reduces redundant computations, changing the per-step cost from O(L^2) (if everything is recomputed) to O(L) (the new token's Q plus its dot products with the cached K and V). Overall, this makes AR inference much faster; a minimal sketch of such a cached decoding step follows.
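This is a minimal sketch of the idea, not any library's actual API: at each step only the new token's Q/K/V are computed, and the cached K/V simply grow.

```python
import torch
import torch.nn.functional as F

def ar_decode_step(h_new, W_q, W_k, W_v, cache):
    """One autoregressive decoding step with a KV-cache.
    h_new: [batch, 1, d] hidden state of the newly generated token only."""
    q, k, v = h_new @ W_q, h_new @ W_k, h_new @ W_v
    K = k if cache is None else torch.cat([cache[0], k], dim=1)  # past keys never change
    V = v if cache is None else torch.cat([cache[1], v], dim=1)  # past values never change
    attn = F.softmax(q @ K.transpose(-2, -1) / K.size(-1) ** 0.5, dim=-1)
    return attn @ V, (K, V)  # output for the new token, plus the updated cache

# Toy loop: 5 decoding steps, hidden size 8.
W_q, W_k, W_v = (torch.randn(8, 8) for _ in range(3))
cache = None
for _ in range(5):
    out, cache = ar_decode_step(torch.randn(1, 1, 8), W_q, W_k, W_v, cache)
print(cache[0].shape)  # torch.Size([1, 5, 8]): one cached key per generated token
```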
3.2. Previous Works
Diffusion Models
The concept of diffusion models originated in continuous domains like image generation [15, 48] and video generation [5, 28]. They model data generation as the inverse of a forward noise process. For language, initial extensions were made to categorical data [3, 25] or continuous word-embedding spaces [21, 29].
Discrete Diffusion Language Models
- D3PM [3]: Extended Denoising Diffusion Probabilistic Models (DDPM) to categorical data, defining transition matrices for corruption and denoising.
- SEDD [33]: Introduced score entropy training to extend score-matching to discrete data.
- MDLM [40]: Showed that simple masked discrete diffusion models can be competitive; the current paper builds upon this family of models.
- LLaDA [37] and Dream [52]: Scaled masked diffusion language models to billions of parameters, achieving performance comparable to leading autoregressive LLMs. These models are the primary targets for dKV-Cache.
Cache in Generative Models
- Transformers [44]: The KV-Cache was originally introduced here and became fundamental.
- KV-Cache improvements: Several works have focused on reducing the memory consumption of the KV-Cache for long contexts in ARs [18, 26, 32], for instance through quantization [26] or compression [18, 32].
- Cache in image/video diffusion: Caching has been explored in diffusion models for images [36, 46] and videos [34, 56] by leveraging temporal similarities in features [9, 35] or attention maps [31, 54].
- Cache in 3D generative modeling: Also explored [50].
- KV-Cache in DLMs (limited previous work):
  - Block Diffusion [2]: Extended non-autoregressive discrete language diffusion models into a semi-autoregressive form, making KV-Cache feasible but requiring training considerations and still being constrained by an autoregressive formulation.
  - MDLM [40]: Also considered a form of caching, but under the strict condition that no new tokens have been calculated, which limits its applicability.
3.3. Technological Evolution
The evolution of generative language models progressed from Recurrent Neural Networks (RNNs) to Transformers with their attention mechanism. Autoregressive Transformers (like GPT-series) became dominant, and KV-Cache emerged as a critical optimization for their sequential decoding process. Simultaneously, diffusion models gained prominence in continuous data domains (images, audio, video) and were later adapted for discrete data, specifically text, leading to Diffusion Language Models (DLMs).
The challenge then became how to transfer the efficiency gains from ARs (like KV-Cache) to DLMs, which inherently operate differently with bidirectional attention and non-sequential decoding. Previous attempts at caching in DLMs were either semi-autoregressive [2] or imposed strict conditions [40], limiting their full potential.
3.4. Differentiation Analysis
Compared to the main methods in related work, dKV-Cache offers several core differences and innovations:
- Addressing Fundamental Incompatibility: Unlike previous works that either adapted DLMs to be semi-autoregressive [2] or imposed strict conditions on caching [40], dKV-Cache directly tackles the fundamental incompatibility of bidirectional attention and non-sequential decoding with the standard KV-Cache. It proposes a solution that works for fully non-autoregressive DLMs.
- Leveraging Representation Dynamics: The core innovation is the empirical observation of token representation dynamics (Figure 2). This insight, that tokens become stable after decoding, allows for a delayed and conditioned caching strategy that is unique to the diffusion process. This is a novel way to adapt caching to the characteristics of DLMs.
- Training-Free Application: A significant advantage is that dKV-Cache is training-free. It can be applied directly to existing, pre-trained DLMs (like LLaDA and Dream) without any modification to their training regimen or architecture, making it highly practical for immediate deployment.
- Two Complementary Variants: The proposal of dKV-Cache-Decode (near-lossless, potentially improved performance) and dKV-Cache-Greedy (higher speed-up, O(L^2) overall complexity) offers flexibility to users based on their specific performance-accuracy trade-offs. This provides a more comprehensive solution than a single caching approach.
- Enhanced Performance for Long Contexts: dKV-Cache-Decode is shown to even improve performance on long sequences, suggesting that existing DLMs might not fully utilize contextual information during inference. This points to a deeper architectural insight beyond mere speedup.
- System-Level Design: The concat_reorder mechanism (Appendix A) specifically addresses the challenges of gathering and scattering key and value states in non-contiguous memory for DLMs, a novel system-level optimization for this domain.

In essence, dKV-Cache provides a novel, practical, and effective solution to a critical bottleneck in DLMs by deeply understanding their unique generative process, without compromising their core non-autoregressive nature or requiring re-training.
4. Methodology
4.1. Principles
The core idea behind dKV-Cache is rooted in an empirical observation about the behavior of token representations within Diffusion Language Models (DLMs) during their denoising process. Unlike autoregressive models where key and value states of prior tokens are static once computed, DLMs utilize bidirectional attention, which theoretically means these states could change at every denoising step. However, the authors observe that:
- The key and value embeddings, K and V, exhibit consistently high similarity across timesteps despite step-to-step differences.
- Crucially, once a token is decoded (i.e., its identity is finalized from a [MASK] token), its representation becomes relatively stable in subsequent steps. Conversely, still-masked tokens continue to fluctuate significantly.
- The most substantial changes in K and V occur precisely at the decoding step of each token and in the early stages of the denoising process.

These observations motivate a delayed and conditioned caching strategy. Instead of caching key and value states immediately upon their computation (as in ARs), dKV-Cache delays caching them until they are "stable" (i.e., after the token has been decoded and its representation has largely settled). This addresses the timestep-variant nature of representations in DLMs. Furthermore, the caching is "conditioned" on the token's state (decoded vs. masked). This principle allows reuse of the stable parts of the sequence while recomputing the volatile parts, balancing efficiency with correctness under bidirectional attention.
4.2. Core Methodology In-depth (Layer by Layer)
The paper first outlines why traditional KV-Cache cannot be directly applied to Diffusion Language Models (DLMs), then presents empirical observations that motivate dKV-Cache, and finally details its variants.
Why KV-Cache Cannot be Used in DLMs?
The effectiveness of KV-Cache in autoregressive models (ARs) relies on two key assumptions that DLMs violate:
- Timestep-variant key and value states: In ARs, a causal attention mask ensures that a token at step t only attends to tokens at positions 1 to t. This means that the key and value states for tokens 1 to t-1 remain constant for all subsequent steps: K_{1:t-1} and V_{1:t-1} are the same at any later step. In contrast, DLMs employ a bidirectional attention mask, allowing any token to attend to all other tokens. Consequently, the key and value states of a token can change at each denoising step based on the context of other tokens being updated. This means K^(t) ≠ K^(t') and V^(t) ≠ V^(t') in general when t ≠ t', breaking the global reuse assumption of the conventional KV-Cache.
- Non-sequential decoding order: ARs generate tokens strictly left-to-right, making the position of the next token to be decoded deterministic. This allows them to compute Query, Key, and Value (QKV) states selectively, only for the current decoding position. DLMs, however, support flexible generation orders: at each denoising step, any masked token position can be selected for update based on probabilities or confidence scores. This uncertainty means the model cannot pre-determine which token's hidden states (and hence QKV states) will need computation, rendering the selective computation strategy of the AR KV-Cache ineffective.
Representation Dynamics for Tokens in Diffusion Sampling
To understand whether key and value states could still be reused despite bidirectional attention, the authors investigated the dynamics of K and V for each token throughout the diffusion process. This analysis (summarized in Figure 2) revealed critical insights:
- Consistent Similarity (Figure 2a): Despite being updated at each step, the key and value embeddings exhibit consistently high similarity across different timesteps. This suggests that the representations do not radically change, offering potential for reuse.
- Stability After Decoding (Figure 2b): Once a token is decoded (i.e., its identity is no longer [MASK]), its representation becomes relatively stable in subsequent steps. Conversely, still-masked tokens show significant fluctuations. This implies that decoded tokens are good candidates for caching.
- Largest Changes at Decoding Step (Figure 2c): The most substantial changes in K and V occur precisely at the decoding step of each token and in the early stages of the denoising process. This indicates that caching a token's QKV immediately when it is computed might be premature, as its representation is still undergoing significant change.

These observations directly inform the design of dKV-Cache, particularly the delayed aspect.
Delayed KV-Cache for Masked Diffusion Language Models
The core design of dKV-Cache is based on a generalized non-sequential KV-Cache formulation that accommodates the dynamic nature of DLMs.
Generalized Non-sequential KV-Cache
The standard KV-Cache in ARs uses a contiguous slice K_{1:t-1} and V_{1:t-1}. dKV-Cache replaces this with a set of cached indices S_t ⊆ T (where T is the set of all token positions). The next token position is not fixed at t but is instead D_t, the position of the token being denoised at step t.

The modified attention computation at step t is:
$
\mathbf{z}_t = \mathrm{softmax}\left(\frac{\mathbf{Q}_t^{D_t}\left(\mathbf{K}_t^{S_t\cup\{D_t\}}\right)^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_t^{S_t\cup\{D_t\}} \;\; \mathrm{with} \;\; \left\{\begin{array}{ll} \mathbf{K}_t^{S_t\cup\{t\}} = \mathrm{concat\_reorder}\left(\mathbf{K}_{t-1}^{S_t}, \mathbf{K}_t^{D_t}\right) \\ \mathbf{V}_t^{S_t\cup\{t\}} = \mathrm{concat\_reorder}\left(\mathbf{V}_{t-1}^{S_t}, \mathbf{V}_t^{D_t}\right) \end{array}\right.
$
Where:
- z_t: The output of the attention head at step t.
- softmax(·): The softmax function applied to the scaled dot product.
- Q_t^{D_t}: The query vector for the token at position D_t (the token being decoded at step t).
- K_t^{S_t∪{D_t}} and V_t^{S_t∪{D_t}}: The key and value matrices formed by concatenating the cached states for positions in S_t and the newly computed states for position D_t. The superscript S_t∪{t} in the "with" clause is likely a typo in the paper and should consistently read S_t∪{D_t}, i.e., the cached positions plus the position currently being updated.
- d_k: The dimension of the key vectors, used for scaling.
- D_t: The position of the token currently being denoised/decoded at step t.
- S_t: The set of token positions for which key and value states have been cached from previous steps.
- concat_reorder(·, ·): A specialized operation (explained below) that concatenates the cached states from the previous step (K_{t-1}^{S_t} and V_{t-1}^{S_t}) with the newly computed states for D_t (K_t^{D_t} and V_t^{D_t}) and then reorders them.

To generalize, the query input is extended from a single token to multiple, arbitrary tokens. At decoding step t, a dynamic set M_t is constructed, representing tokens that are not yet finalized (i.e., still masked or noisy). Caching is delayed not by input time but by the step at which a token is decoded: key and value states corresponding to decoded tokens (T \ M_t) are reused, while non-finalized positions (M_t) are recomputed at each step. This also avoids the need to predict the denoising order.
One-step Delayed Caching (dKV-Cache-Decode)
Building on the observation that the most significant changes in QKV states happen at the exact decoding step (Figure 2c), caching a token's key and value immediately after it's decoded might lead to using stale representations. To mitigate this, dKV-Cache introduces one-step delayed caching.
At timestep t, the method uses the masking state from the previous step, M_{t-1}, to determine which tokens are cacheable. This means tokens decoded at step t-1 are cached at step t.
The final formulation for dKV-Cache-Decode is:
$
\mathbf{z}_t = \mathrm{softmax}\left(\frac{\mathbf{Q}_t^{\mathcal{M}_t}\left(\mathbf{K}_t^{\mathcal{T}}\right)^{\top}}{\sqrt{d_k}}\right)\mathbf{V}_t^{\mathcal{T}} \;\; \mathrm{with} \;\; \left\{\begin{array}{ll} \mathbf{K}_t^{\mathcal{T}} = \mathrm{concat\_reorder}\left(\mathbf{K}_{t-1}^{\mathcal{T}\setminus\mathcal{M}_{t-1}}, \mathbf{K}_t^{\mathcal{M}_{t-1}}\right) \\ \mathbf{V}_t^{\mathcal{T}} = \mathrm{concat\_reorder}\left(\mathbf{V}_{t-1}^{\mathcal{T}\setminus\mathcal{M}_{t-1}}, \mathbf{V}_t^{\mathcal{M}_{t-1}}\right) \end{array}\right.
$
Where:
- z_t: The attention output at step t.
- Q_t^{M_t}: The query vectors for the set of currently non-finalized tokens M_t.
- K_t^{T} and V_t^{T}: The full key and value matrices for the entire sequence at step t.
- T: The set of all token positions {1, ..., L}.
- M_{t-1}: The set of tokens that were not finalized at the previous step t-1.
- T \ M_{t-1}: The set of tokens that were finalized (decoded) by step t-1. Their key and value states (K_{t-1}^{T\M_{t-1}} and V_{t-1}^{T\M_{t-1}}) are reused from the previous step.
- K_t^{M_{t-1}} and V_t^{M_{t-1}}: The key and value states newly computed at step t for the tokens that were not finalized at step t-1.
- concat_reorder(·, ·): Concatenates the reused and newly computed states and reorders them for the attention calculation.
This delay stabilizes performance and accuracy, though it slightly reduces immediate efficiency.
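As a rough illustration (a simplified sketch, not the authors' implementation), the step below recomputes Q/K/V only for positions that were still masked at the previous step and concatenates them with the cached rows of already-decoded positions; the function name, shapes, and single-head layout are my own simplifications.

```python
import torch
import torch.nn.functional as F

def dkv_attention_step(h_t, K_cached, V_cached, active_idx, W_q, W_k, W_v):
    """One denoising step with delayed KV reuse (illustrative sketch).
    h_t:         [L, d] hidden states at step t
    K_cached/V_cached: rows previously stored for already-decoded positions
    active_idx:  positions still masked at the previous step (one-step delay); recomputed here"""
    h_active = h_t[active_idx]
    Q = h_active @ W_q                        # queries only for non-finalized tokens
    K_new, V_new = h_active @ W_k, h_active @ W_v
    K = torch.cat([K_cached, K_new], dim=0)   # cached rows first, fresh rows after
    V = torch.cat([V_cached, V_new], dim=0)
    attn = F.softmax(Q @ K.T / K.size(-1) ** 0.5, dim=-1)
    return attn @ V, K, V                     # outputs for active positions + full K/V to cache

# Toy usage: sequence of 6 tokens, 2 already decoded (cached), 4 still active.
d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
h_t = torch.randn(6, d)
K_cached, V_cached = torch.randn(2, d), torch.randn(2, d)
out, K, V = dkv_attention_step(h_t, K_cached, V_cached, torch.tensor([2, 3, 4, 5]), W_q, W_k, W_v)
print(out.shape, K.shape)  # torch.Size([4, 8]) torch.Size([6, 8])
```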
Cache Refreshing Mechanism
To maintain correctness and consistency over long sequences, dKV-Cache includes a cache refreshing mechanism. Every fixed number of steps (the refresh interval), the entire stored cache is discarded, and the calculation reverts to a normal, full computation step (i.e., the set of cached positions effectively becomes empty for that step in Equation 4). This recomputes all key and value states from scratch, ensuring freshness.
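The refresh schedule itself is simple; here is a sketch of the step selection. The interval value 8 is just an example, matching the refresh settings reported for dKV-Cache-Decode on LLaDA in Table 5.

```python
def is_refresh_step(step: int, refresh_interval: int) -> bool:
    """Steps at which the stored dKV-Cache is discarded and all K/V recomputed from scratch."""
    return refresh_interval > 0 and step % refresh_interval == 0

# With a refresh interval of 8, steps 0, 8, 16, ... run a full computation;
# every other step reuses the cached key/value states of decoded tokens.
print([s for s in range(20) if is_refresh_step(s, 8)])  # [0, 8, 16]
```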
dKV-Cache-Prefill and dKV-Cache-PD
These are specialized variants for handling prefill tokens (the initial context provided to the model):
- dKV-Cache-Prefill: Based on the observation that prefill tokens primarily attend to each other and are less influenced by later generated tokens, this strategy caches prefill tokens without ever refreshing them. This is beneficial for very long input contexts.
- dKV-Cache-PD: This variant intermittently refreshes only the newly decoded tokens while keeping the key and value states of prefill tokens without any recomputation.
dKV-Cache-Greedy: Greedy Formulation of dKV-Cache
The dKV-Cache-Decode method, while accurate, still involves O(L^2) computation per step because the set of non-finalized tokens can initially contain almost all tokens, only shrinking towards the end. To achieve O(L) per-step time complexity (comparable to ARs with KV-Cache), dKV-Cache-Greedy adopts a more aggressive caching strategy by restricting the scope of recomputation.
Instead of recomputing QKV states for all tokens in M_{t-1}, dKV-Cache-Greedy restricts the recomputed set to only three components:
- The token decoded at the current step.
- The token from the previous step (motivated by one-step delayed caching).
- A local window, which includes the current token and its neighbors within a fixed size w. The paper found better performance when centering this window around D_t.

The formula for the local window is:
$
\mathcal{W}_t = \left\{ x_i \;\middle|\; i \in \left[ D_t - \left\lceil \tfrac{w}{2} \right\rceil,\; D_t + \left\lfloor \tfrac{w}{2} \right\rfloor \right] \right\}
$
Where:
- W_t: The set of token positions included in the local window at step t.
- x_i: The token at position i.
- i: The index of the token position.
- D_t: The position of the token being decoded at step t.
- w: The fixed window size (at most 6 in the experiments).
- ⌈·⌉ and ⌊·⌋: Ceiling and floor functions, respectively, used to define the window boundaries.
By restricting the QKV recomputation to a small, fixed-size subset of tokens (current, previous, and local window), the size of the recomputed set no longer depends on L, thereby reducing the per-step time complexity to O(L). This aggressive strategy provides higher speedups but may incur some performance degradation due to staler key and value states. A short sketch of the recomputed set follows.
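The helper below sketches which positions dKV-Cache-Greedy recomputes at one step; the function name and set-based return type are my own, while the window centering follows the formula above.

```python
import math

def greedy_recompute_set(decode_pos, prev_pos, window, seq_len):
    """Positions recomputed at one dKV-Cache-Greedy step: the current decode position,
    the previous step's position, and a local window of size `window` centred on it."""
    lo = max(0, decode_pos - math.ceil(window / 2))
    hi = min(seq_len - 1, decode_pos + math.floor(window / 2))
    return {decode_pos, prev_pos} | set(range(lo, hi + 1))

print(sorted(greedy_recompute_set(decode_pos=10, prev_pos=7, window=4, seq_len=64)))
# [7, 8, 9, 10, 11, 12]
```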
A. Design for concat_reorder (from Appendix)
The concat_reorder operation is crucial for the efficiency of dKV-Cache. Unlike standard KV-Cache in ARs which uses simple concatenation for contiguous memory, dKV-Cache needs to handle non-contiguous memory access due to arbitrary positions for cached and newly computed tokens. This involves gathering (collecting data from scattered memory locations) and scattering (writing data to scattered memory locations), which can be inefficient.
The concat_reorder algorithm aims to minimize these overheads by reordering token positions during the Transformer's forward calculation. The key idea is to arrange the sequence such that all cached tokens are contiguous (e.g., on the left side) and newly decoded tokens are on the other side. This allows for more efficient matrix operations.
The process involves:
- Retrieval of Cached States: At step t, previously cached key and value states (K_{t-1}^{T\M_{t-1}} and V_{t-1}^{T\M_{t-1}}) are retrieved based on their position indices (tokens that were finalized by step t-1).
- Reordering the Sequence Input: The input sequence for the current step (x) is reordered. Cached tokens (those in T\M_{t-1}) are placed first, followed by uncached tokens (those in M_{t-1}). Denote this reordered sequence as x'.
- Positional Embeddings Reordering: The positional embeddings (PE) are also reordered to match the new sequence order. This is a one-time operation per model evaluation and can be shared across layers.
- Compute QKV for Uncached Tokens: The Transformer calculates QKV states only for the uncached tokens (those in M_{t-1}).
- Concatenation for Attention Calculation: The retrieved cached K/V states and the newly computed K/V states are concatenated directly to form the full K_t and V_t matrices, which are then used in the attention calculation.
- Attention Output and Final Reordering: The attention mechanism produces an output. If needed, this output can be reordered back to the original sequence positions using an inverse mapping to match the model's expected output format.

The one-step shift in caching means that the positions of cached tokens at step t correspond to tokens decoded at step t-1. This alignment helps track which key and value entries need to be cached without storing entire matrices.
The pseudo-algorithm for this process (Algorithm 1 in the paper) is:
Algorithm 1: dKV-Cache
Require: Sequence x_{1:L} at step t (simplified as x), position index of masked tokens M_t, cached Key K_t-1^(T\M_t-1) and Value V_t-1^(T\M_t-1)
1: x' ← x[M_t-1] (M_t-1: t-1 for one-step shift)
2: PE' ← [PE[T\M_t-1]; PE[M_t-1]] (Positional embeddings: cached on left, uncached on right)
3: Q_t^M_t, K_t^M_t, V_t^M_t ← T(x') (T: Calculation in Transformer to get Q, K and V)
4: K_t^T ← Concat(K_t-1^(T\M_t-1), K_t^M_t), V_t^T ← Concat(V_t-1^(T\M_t-1), V_t^M_t) ▹ Get all K and V
5: K_t^(L\M_t) ← Reorder(K_t^L, I'), V_t^(L\M_t) ← Reorder(V_t^L, I') (I': The index of T\M_t in the [x[T\M_t-1]; x[M_t-1]])
6: p' ← A(Q_t^M_t, K_t^T, V_t^T)
7: p ← Scatter(p', M_t-1) (Put the token logits back to the original position)
8: Return p, K_t^T, V_t^T
Let's break down Algorithm 1:
- Require: This specifies the inputs:
  - x: The input sequence of length L at the current step t.
  - M_t: The set of indices of masked (non-finalized) tokens at the current step t.
  - K_{t-1}^(T\M_{t-1}) and V_{t-1}^(T\M_{t-1}): The cached Key and Value states from the previous step t-1 for tokens that were finalized (decoded). (T\M_{t-1} denotes the indices of finalized tokens.)
- Line 1: x' ← x[M_{t-1}]: The accompanying text describes reordering the full sequence so that cached tokens sit on the left and uncached tokens on the right, whereas this line extracts only the tokens that were still masked at the previous step (M_{t-1}). Interpreted that way, x' is exactly the subset of tokens that needs QKV recomputation, which aligns with the following lines; the comment confirms that the masking state from the previous step is used (the one-step shift).
- Line 2: PE' ← [PE[T\M_{t-1}]; PE[M_{t-1}]]: PE refers to the positional embeddings. This line constructs a reordered positional embedding matrix PE', placing the embeddings of the finalized tokens (T\M_{t-1}) first, followed by those of the non-finalized tokens (M_{t-1}). PE' is used when computing states for the tokens that need recomputation.
- Line 3: Q_t^M_t, K_t^M_t, V_t^M_t ← T(x'): The Transformer block (T) computes Query, Key, and Value states only for the tokens in x' (the tokens that were non-finalized at the previous step). The generated QKV states are thus specifically for these masked tokens at the current step t.
- Line 4: K_t^T ← Concat(K_{t-1}^(T\M_{t-1}), K_t^M_t), V_t^T ← Concat(V_{t-1}^(T\M_{t-1}), V_t^M_t): This is the crucial concat_reorder step for Key and Value. It concatenates the cached K/V states (from tokens finalized by step t-1) with the newly computed K/V states, forming the complete Key and Value matrices for the entire sequence at step t, denoted K_t^T and V_t^T.
- Line 5: K_t^(L\M_t) ← Reorder(K_t^L, I'), V_t^(L\M_t) ← Reorder(V_t^L, I'): This step reorders the full Key and Value states; I' is described as "the index of T\M_t in [x[T\M_{t-1}]; x[M_{t-1}]]". The notation is somewhat inconsistent with Line 4 (L versus T), but the most plausible reading is that the full K and V matrices from Line 4 are brought into the canonical order needed for the attention computation and for storage as the next step's cache.
- Line 6: p' ← A(Q_t^M_t, K_t^T, V_t^T): The attention mechanism (A) is applied. The query comes from the non-finalized tokens at step t (Q_t^M_t), and it attends to the full Key and Value matrices for the entire sequence (K_t^T, V_t^T). p' is the attention output (e.g., logits or hidden states) for the non-finalized tokens.
- Line 7: p ← Scatter(p', M_{t-1}): The attention output p' (for the non-finalized tokens) is scattered back to its original positions in the sequence. p is the full sequence's output after this step, with updated logits or hidden states for the non-finalized tokens.
- Line 8: Return p, K_t^T, V_t^T: The algorithm returns the updated sequence output p and the current full Key and Value matrices (K_t^T, V_t^T), which serve as the cached states for the next step.

The concat_reorder strategy is key to making dKV-Cache efficient, particularly because it handles Rotary Positional Embeddings (RoPE) by simply reordering them alongside the tokens, incurring minimal overhead. A tensor-level sketch of the reorder-then-concatenate pattern follows.
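The snippet below is a minimal sketch under simplifying assumptions (single head, no batching, no RoPE handling); the function name and return convention are mine, but it illustrates the idea of keeping cached rows contiguous and tracking an inverse permutation back to original positions.

```python
import torch

def concat_reorder(K_cached, V_cached, K_new, V_new, cached_idx, new_idx):
    """Cached K/V rows (decoded tokens) sit contiguously on the left, freshly computed
    rows (still-masked tokens) on the right; `inverse` maps each original sequence
    position to its row in the reordered matrices."""
    K_full = torch.cat([K_cached, K_new], dim=0)
    V_full = torch.cat([V_cached, V_new], dim=0)
    order = torch.cat([cached_idx, new_idx])   # row r of K_full holds position order[r]
    inverse = torch.argsort(order)             # position p lives in row inverse[p]
    return K_full, V_full, inverse

# Toy usage: positions 0 and 3 are cached, positions 1 and 2 are freshly computed.
d = 4
K_full, V_full, inv = concat_reorder(torch.randn(2, d), torch.randn(2, d),
                                     torch.randn(2, d), torch.randn(2, d),
                                     torch.tensor([0, 3]), torch.tensor([1, 2]))
print(K_full.shape, inv.tolist())  # torch.Size([4, 4]) [0, 2, 3, 1]
```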
B. Design for Dream (from Appendix)
Dream models, being adapted from pre-trained autoregressive models, have a unique characteristic: their output positions are shifted. Specifically, the hidden state at a given position is used to predict the next token rather than the token at that position itself. This differs from standard masked diffusion models, where the output aligns with the current token. The shifted output position necessitates different caching strategies:
- Un-Shift (Figure 6a): In this variant, the key and value for the i-th token are stored at its own position i. This is the standard interpretation of caching.
- Right-Shift (Figure 6b): Due to the shifted output, the hidden state at a given position may actually correspond to the neighbouring token. This variant explores caching the shifted key and value for the i-th token. However, this largely harms performance.
- Un&Right-Shift (Figure 6c): This is a stricter variant where caching is conditioned on both input stability and decoding completion. For the i-th token, features are cached only after its input is fixed and it has been decoded. This variant performed best for Dream but is incompatible with the concat_reorder mechanism. The paper uses the Un-Shift strategy for Dream in its main experiments for compatibility, despite Un&Right-Shift showing slightly better performance for Dream specifically.
Figure 6 illustrates these variants:
- Figure 6a (Un-Shift): Shows the key/value cache being stored at a token's own index and reused for the queries of subsequent steps.
- Figure 6b (Right-Shift): Illustrates storing the key/value at the shifted index before reuse in subsequent steps.
- Figure 6c (Un&Right-Shift): Shows a more complex caching decision in which a token's key/value is cached only after the token is finalized and its input is stable, and is then reused in later steps.

These show the varying strategies for dealing with the positional shift typical of models adapted from autoregressive designs.
5. Experimental Setup
5.1. Datasets
The experiments evaluated dKV-Cache on two 7B-scale Diffusion Language Models (DLMs): LLaDA [37] and Dream [52]. A diverse set of benchmarks was used to assess various capabilities:
- General Language Understanding:
- MMLU (Massive Multitask Language Understanding) [23]: A benchmark covering 57 subjects across STEM, humanities, social sciences, and more, testing a wide range of knowledge and problem-solving abilities.
- GPQA (Graduate-level Google-Proof Q&A) [39]: A challenging question-answering benchmark designed to require deep expertise and complex reasoning, often "Google-proof" (not easily answered by simple search).
- Mathematical Reasoning:
- GSM8K [10]: A dataset of 8.5K diverse grade school math word problems.
- Math500: A dataset for mathematical problem solving.
- Code Generation:
- HumanEval [4]: A dataset of programming problems to test code synthesis, where models generate Python functions based on docstrings.
- MBPP (Mostly Basic Python Problems) [8]: Another dataset for code generation, focusing on basic programming tasks.

For LLaDA, evaluations were zero-shot (no examples provided in the prompt), with answers generated and matched against ground truth. For Dream, few-shot in-context learning (ICL) was employed, where a few examples are provided in the prompt to guide the model.
5.2. Evaluation Metrics
The performance of dKV-Cache was primarily measured using the following metrics:
- Accuracy: This metric quantifies the correctness of the model's generated answers, typically the percentage of correctly solved problems or correctly generated tokens/sequences. Conceptually, it is simply the proportion of correct predictions out of the total number of predictions.
- Token/s (Tokens per Second): This metric measures inference speed, indicating how many tokens the model can generate per second. It directly reflects the efficiency gains of dKV-Cache.
- Cache Ratio: This metric quantifies the proportion of Key-Value (KV) pairs that are reused from the cache during the denoising process. It provides insight into how much of the computation is being saved by dKV-Cache.

The formula for the Cache Ratio is:
$
\mathrm{Cache\ Ratio} = \frac{1}{T}\sum_{i=1}^{T}\frac{\left|\mathcal{T}_i^{\mathrm{cache}}\right|}{\left|\mathcal{T}_i\right|}
$
Where:
- T: The total number of denoising steps in the generation process.
- i: The index of the current denoising step.
- T_i: The set of all tokens processed at step i, typically the entire sequence of length L.
- T_i^cache: The subset of tokens whose Key-Value (KV) pairs are reused from the cache at step i, i.e., the tokens whose KV states are carried over from the previous step rather than recomputed (the cached set in Equation 4).

A toy computation of this metric is sketched below.
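This is a tiny illustration with made-up step sizes, not numbers from the paper.

```python
def cache_ratio(cached_sizes, total_sizes):
    """Average fraction of K/V pairs served from the cache over all denoising steps."""
    return sum(c / t for c, t in zip(cached_sizes, total_sizes)) / len(total_sizes)

# 4 denoising steps over a length-8 sequence; more tokens become cacheable as decoding progresses.
print(cache_ratio([0, 2, 4, 6], [8, 8, 8, 8]))  # 0.375
```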
5.3. Baselines
The proposed dKV-Cache methods were compared against baselines that accelerate generation by reducing the number of denoising steps:
- Base (random / confidence): The original LLaDA or Dream model running with its full number of denoising steps (e.g., T = L, as in Table 5). It serves as the unaccelerated accuracy baseline. The remasking strategy (how tokens are selected to be unmasked) can be random or based on confidence scores.
- Few-Steps (random): This baseline reduces the total number of denoising steps by a fixed fraction (see the Steps T column in Table 5) using random remasking. This is a common way to speed up DLMs but often comes with a performance trade-off.
- Half-Steps (confidence): Similar to Few-Steps, but uses 50% of the original denoising steps and employs a confidence-based remasking strategy.

The number of steps for the Few-Steps and Half-Steps baselines was selected such that their sampling speed was comparable to or slower than dKV-Cache, allowing a fair comparison showing that dKV-Cache can achieve better performance at similar (or faster) speeds.
C. Evaluation Details (from Appendix)
C.1 For LLaDA
The authors re-implemented the evaluation for LLaDA using the specified datasets. Instead of comparing log probabilities for multiple-choice questions, they generated the full answer text and extracted the final answer for matching. This method can lead to lower reported results for MMLU and GPQA compared to previous works, as the model might fail to generate the answer in the expected format or might not generate an answer at all.
The configurations for LLaDA experiments are detailed in Table 5:
The following are the results from Table 5 of the original paper:
| Benchmark | Base (random / confidence) Configuration | Few-Steps (random) Steps T | dKV-Cache-Greedy (random) Cache Interval | dKV-Cache-Greedy (random) Window Size | Half-Steps (confidence) Steps T | dKV-Cache-Decode (confidence) Cache Interval |
|---|---|---|---|---|---|---|
| MMLU | L=32, T=32, B=16 | T=20 | 2 | 4 | T=16 | 8 |
| GSM8K | L=256, T=256, B=32 | T=160 | 2 | 4 | T=128 | 8 |
| Math500 | L=256, T=256, B=64 | T=160 | 2 | 4 | T=128 | 8 |
| GPQA | L=128, T=128, B=64 | T=80 | 2 | 4 | T=64 | 8 |
| HumanEval | L=512, T=512, B=32 | T=320 | 2 | 4 | T=256 | 8 |
| MBPP | L=512, T=512, B=32 | T=320 | 2 | 2 | T=256 | 8 |
- L: Denotes the decoding length (maximum generation length).
- T: Denotes the total number of denoising steps.
- B: Denotes the batch size.
- Cache Interval: For dKV-Cache-Greedy, how often the cache is refreshed; for dKV-Cache-Decode, the refresh step.
- Window Size: For dKV-Cache-Greedy, the size of the local window used for recomputation.
C.2 For Dream
For Dream, the authors followed the original evaluation pipeline. They adopted MMLU and GPQA to generate answers instead of comparing probabilities, similar to LLaDA. The evaluation also utilized few-shot in-context learning with a specified number of shots.
6. Results & Analysis
6.1. Core Results Analysis
The experiments aim to demonstrate the performance trade-offs and speedups introduced by dKV-Cache across various benchmarks and configurations.
LLaDA Results
The following are the results from Table 1 of the original paper:
| Benchmark | Base (random) Acc. (%) | Base (random) Token/s | Few-Steps (random) Acc. (%) | Few-Steps (random) Token/s | dKV-Cache-Greedy (random) Acc. (%) | dKV-Cache-Greedy (random) Token/s | dKV-Cache-Greedy w/ Window (random) Acc. (%) | dKV-Cache-Greedy w/ Window (random) Token/s | Base (confidence) Acc. (%) | Base (confidence) Token/s | Half-Steps (confidence) Acc. (%) | Half-Steps (confidence) Token/s | dKV-Cache-Decode (confidence) Acc. (%) | dKV-Cache-Decode (confidence) Token/s |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MMLU | 51.79 | 30.20 | 43.19 | 47.49 (1.67x) | 45.77 | 50.56 (1.57x) | 47.70 (4) | 45.72 (1.51x) | 51.11 | 28.27 | 51.11 | 55.80 (1.97×) | 51.00 | 66.52 (2.35x) |
| GSM8K | 72.25 | 15.16 | 65.58 | 24.08 (1.59x) | 67.93 | 25.47 (1.68×) | 68.23 (4) | 24.76 (1.63x) | 77.56 | 14.31 | 77.91 | 28.71 (2.00x) | 78.85 | 27.50 (1.92x) |
| Math500 | 27.4 | 12.00 | 21.8 | 19.36 (1.61x) | 26.0 | 20.34 (1.70x) | 27.0 (4) | 19.86 (1.66x) | 36.6 | 11.53 | 34.2 | 23.10 (2.00x) | 36.8 | 24.46 (2.12x) |
| GPQA | 27.46 | 11.40 | 24.78 | 18.59 (1.63x) | 26.79 | 19.27 (1.69x) | 28.35 (4) | 18.26 (1.60x) | 30.80 | 11.86 | 27.68 | 23.88 (2.01x) | 28.13 | 28.73 (2.42x) |
| HumanEval | 19.88 | 7.50 | 15.61 | 12.50 (1.67×) | 15.13 | 12.31 (1.64x) | 15.37 (4) | 12.13 (1.62x) | 39.63 | 7.08 | 33.54 | 14.18 (2.00x) | 46.34 | 13.76 (1.83x) |
| MBPP | 21.4 | 7.51 | 15.6 | 12.97 (1.73x) | 17.8 | 12.55 (1.67×) | 20.4 (2) | 12.44 (1.66×) | 40.4 | 7.50 | 33.8 | 15.01 (2.00×) | 40.4 | 13.93 (1.86x) |
(Note: The number in parentheses next to the dKV-Cache-Greedy w/ Window accuracy (e.g., 4) represents the window size.)
- dKV-Cache-Greedy (random remasking): This variant generally outperforms the few-step baselines in terms of accuracy across most benchmarks (MMLU, GSM8K, Math500, GPQA, MBPP). For instance, on MMLU, Base (random) gets 51.79%, Few-Steps gets 43.19%, while dKV-Cache-Greedy achieves 45.77% and dKV-Cache-Greedy w/ Window achieves 47.70%. This shows that, with a local window, dKV-Cache-Greedy can significantly close the gap to the base model while still providing speedups (roughly 1.5x to 1.7x in Table 1). HumanEval is an exception where dKV-Cache-Greedy shows lower accuracy than Few-Steps, possibly because the aggressive caching is less suitable for code generation's stricter requirements.
- dKV-Cache-Decode (confidence remasking): This variant offers the best trade-off between speed and accuracy. It achieves near-lossless performance, often very close to the Base (confidence) model, and in some cases even outperforms it (e.g., GSM8K: 78.85% vs. 77.56%; HumanEval: 46.34% vs. 39.63%). It consistently provides speedups ranging from 1.83x to 2.42x in Table 1. This variant demonstrates that a KV-Cache can be applied to DLMs without sacrificing performance, and sometimes even improving it.
Dream Results
The following are the results from Table 2 of the original paper:
| Benchmark | Steps | Dream-7B | Half-Steps | dKV-Cache-Decode | dKV-Cache-Prefill | dKV-Cache-PD |
|---|---|---|---|---|---|---|
| GSM8K (8-shot), L = 256 | T = 256 | 76.88 / 15.1 (1.00x) | 68.08 / 30.3 (2.00x) | 76.57 / 31.6 (2.09x) | 75.66 / 53.6 (3.55x) | 74.07 / 50.2 (3.32x) |
| | T = 128 | 68.81 / 30.3 (2.01x) | 46.63 / 60.5 (4.01x) | 65.35 / 62.31 (4.13x) | 65.96 / 107.4 (7.11x) | 63.31 / 99.5 (6.6x) |
| MBPP (3-shot), L = 512 | T = 512 | 55.8 / 5.4 (1.00x) | 45.2 / 10.8 (2.00x) | 53.4 / 10.4 (1.93x) | 55.2 / 13.6 (2.52x) | 51.0 / 14.5 (2.69x) |
| | T = 256 | 45.2 / 10.8 (2.00x) | 26.2 / 21.5 (3.98x) | 43.4 / 20.6 (3.81x) | 41.8 / 27.1 (5.02x) | 42.6 / 28.9 (5.35x) |
| HumanEval (0-shot), L = 512 | T = 512 | 57.93 / 10.3 (1.00x) | 37.20 / 20.5 (1.99x) | 57.32 / 15.5 (1.50x) | 56.10 / 14.4 (1.40x) | 59.76 / 17.4 (1.69x) |
| | T = 256 | 37.20 / 20.5 (1.99x) | 18.29 / 40.9 (3.97x) | 31.70 / 31.1 (3.02x) | 33.54 / 28.7 (2.79x) | 31.70 / 34.8 (3.38x) |

(Each cell reports Acc. (%) / Token/s (Speedup).)
- Impact of few-shot ICL and Prefill: For Dream, which uses few-shot in-context learning (ICL), long input contexts are common, leading to significant overhead. In these scenarios, dKV-Cache-Prefill (which caches prefill tokens without refreshing) shows substantial speed improvements.
- Significant Speedups from dKV-Cache-Prefill/PD: On GSM8K with T = 128, dKV-Cache-Prefill achieves a 7.11x speedup (107.4 Token/s) with 65.96% accuracy, significantly better than Half-Steps (4.01x, 46.63%). dKV-Cache-PD also shows strong speedups (e.g., 6.6x on GSM8K).
- Performance with Increased Diffusion Steps: As the number of diffusion steps (T) increases, the speed gains from dKV-Cache become even more pronounced relative to the baseline. For example, on GSM8K, when T is reduced from 256 to 128 for dKV-Cache, it still maintains strong accuracy while drastically increasing the speedup; the Half-Steps baseline suffers a larger accuracy drop for the same speedup.
- Improved Performance: In cases like GSM8K at T = 128, dKV-Cache-Decode (65.35%) and dKV-Cache-Prefill (65.96%) significantly improve over Half-Steps (46.63%) while delivering comparable or higher speedups. dKV-Cache-PD also performs well.
- Code Generation: On HumanEval at T = 512, dKV-Cache-PD even slightly surpasses the Dream-7B base model in accuracy (59.76% vs. 57.93%) while providing a 1.69x speedup.
Long Prefill Settings
The following are the results from Table 3 of the original paper:
| Benchmark | Steps | Dream-7B | Half-Steps | dKV-Cache-Decode | dKV-Cache-Prefill |
|---|---|---|---|---|---|
| MMLU (5-shot), L = 8 | T = 8 | 72.19 / 9.1 (1.00x) | 72.21 / 18.1 (1.99x) | 71.74 / 25.2 (2.77x) | 71.76 / 57.6 (6.33x) |
| | T = 4 | 72.21 / 18.1 (1.99x) | 71.63 / 36.1 (3.97x) | 71.69 / 49.2 (5.41x) | 71.71 / 67.3 (7.40x) |
| GPQA (5-shot), L = 128 | T = 128 | 36.83 / 7.4 (1.00x) | 35.49 / 14.7 (1.99x) | 35.71 / 18.2 (2.46x) | 35.27 / 75.40 (10.19x) |
| | T = 64 | 35.49 / 14.7 (1.99x) | 35.27 / 29.4 (3.97x) | 34.15 / 36.8 (4.97x) | 35.27 / 139.9 (18.91x) |

(Each cell reports Acc. (%) / Token/s (Speedup).)
- Extreme Speedup for Long Prefill: On GPQA with L = 128, dKV-Cache-Prefill achieves an astounding 10.19x speedup at T = 128 steps, and an even more impressive 18.91x speedup at T = 64 steps, while maintaining accuracy comparable to the baseline and dKV-Cache-Decode. This highlights the immense value of dKV-Cache-Prefill for tasks with extensive prefill contexts due to ICL.
6.2. Ablation Studies / Parameter Analysis
The One-step Delay in dKV-Cache
As can be seen from the results in Figure 3 of the original paper:
(Figure 3: effect of one-step delayed caching on GSM8K accuracy. The chart compares how accuracy changes as the cache ratio increases with and without the delay; with delayed caching, accuracy remains high and stable.)
Figure 3 illustrates the critical role of the one-step delay in the dKV-Cache mechanism.
- Without Delay: When caching is applied immediately without delay, performance (accuracy %) remains acceptable only at low cache ratios. As the cache ratio increases (meaning more KV pairs are aggressively cached), performance degrades rapidly, eventually collapsing to near-zero accuracy. This confirms the initial observation that token representations undergo significant changes at their decoding step, making immediate caching detrimental.
- With Delay: Introducing a one-step delay stabilizes generation quality. The model maintains nearly lossless performance even under high cache ratios. This validates the hypothesis that delaying caching until the token's representation has settled (i.e., after its most significant change point) is crucial for effective and accurate caching in DLMs.
Performance on Different Decoding Length, Denoising Steps, and Refreshing Steps
As can be seen from the results in Figure 4 of the original paper:
(Figure 4: performance versus speed for dKV-Cache-Decode (left) and dKV-Cache-Greedy (right) on GSM8K, with speed (Tokens/s) on the horizontal axis and accuracy (%) on the vertical axis; the effects of decoding length, sampling steps, refresh interval, and window size are shown.)
Figure 4 presents a detailed analysis of dKV-Cache-Decode (left) and dKV-Cache-Greedy (right) on GSM8K under various configurations, plotting Performance (Accuracy %) against Speed (Tokens/s).
- Decoding Robustness: The performance of dKV-Cache is largely insensitive to the decoding length and the number of sampling steps, indicating strong robustness across varying generation lengths.
- Enhanced Long-form Generation: In tasks requiring longer outputs, dKV-Cache-Decode not only accelerates generation but can outperform the baseline in accuracy, on both GSM8K and HumanEval (Figure 4). This suggests that the original bidirectional attention in DLMs might under-utilize contextual information for long sequences, and dKV-Cache helps to alleviate this.
- Effectiveness with Few Refreshing Steps: dKV-Cache-Decode maintains good performance even with infrequent refreshes (e.g., every 16 steps), showing only small degradation. This indicates efficiency in terms of cache maintenance.
- Effectiveness of Local Windows: For dKV-Cache-Greedy, incorporating a local window notably enhances performance with minimal additional computational cost, bringing its accuracy closer to the dKV-Cache-Decode or Base models while offering higher speeds.
Memory and Speed Analysis
As can be seen from the results in Figure 5 of the original paper:
(Figure 5: speed and memory usage of dKV-Cache-Decode and dKV-Cache-Greedy under different prefill lengths (128 and 512) and refresh steps (2 and 8), including comparisons of throughput and peak GPU memory.)
Figure 5 analyzes the speed and memory footprint of dKV-Cache-Decode and dKV-Cache-Greedy across different decoding and prefill lengths (128 and 512).
- Substantial Inference Acceleration: Both dKV-Cache variants achieve significant inference acceleration across the tested decoding and prefill lengths.
- Modest Memory Increase: This acceleration comes with only a modest increase in memory usage.
- dKV-Cache-Greedy Potential: dKV-Cache-Greedy generally demonstrates greater potential for accelerating inference compared to dKV-Cache-Decode. Although dKV-Cache-Greedy may see performance degradation if the refresh interval is too large (e.g., greater than 2 in the experiments), it consistently achieves higher speedups than dKV-Cache-Decode under the same refresh interval.
Impact of Batch Size on Speed
As can be seen from the results in Figure 7 of the original paper:
(Figure 7: impact of batch size on decoding speed (Token/s) for the LLaDA model evaluated on a single NVIDIA H20, with the prefill length fixed at 100 tokens; dKV-Cache-Decode is compared against the original method across sequence lengths L from 16 to 128.)
Figure 7 shows the impact of batch size on decoding speed for LLaDA with a fixed prefill length of 100 tokens.
- Memory-Bound at Small Batch Sizes: The dKV-Cache inference pipeline, involving indexing operations, gathers, and scatters, generates a stream of small, non-contiguous memory accesses. At batch size one, these uncoalesced reads make inference memory-bound, causing the GPU's compute units to sit idle waiting for data. In such scenarios, caching can even underperform the unaccelerated baseline.
- Scaling with Batch Size: As the batch size increases, this problem is alleviated, and dKV-Cache delivers far greater, more stable speed-up ratios over the baseline. This implies that dKV-Cache benefits significantly from batch processing, a common optimization in deep learning.
Comparison of Dream Caching Strategies
The following are the results from Table 4 of the original paper:
| Benchmark | Un-Shift | Right-Shift | Un&Right-Shift |
|---|---|---|---|
| MMLU | 71.78 | 64.60 | 71.73 |
| GSM8K | 76.34 | 32.68 | 77.71 |
This table compares different caching strategies for Dream-Base-7B, which is adapted from autoregressive models and has a shifted output position.
- Un&Right-Shift Best: The Un&Right-Shift strategy (which conditions caching on both input stability and decoding completion) generally yields the best performance (e.g., 71.73% on MMLU, 77.71% on GSM8K).
- Right-Shift Harmful: The Right-Shift variant significantly harms model performance (e.g., 64.60% on MMLU, 32.68% on GSM8K). This indicates that simply shifting the cached positions to align with a future output position is not effective.
- Un-Shift Trade-off: The Un-Shift strategy, while not the absolute best for Dream's accuracy, was used in the main experiments because Un&Right-Shift is incompatible with the concat_reorder mechanism. This highlights a practical trade-off between architectural compatibility and optimal performance for specific model adaptations.
6.3. Case Study
The case studies illustrate the qualitative output of dKV-Cache variants compared to the base models.
The following are the results from Table 6 of the original paper:
| Model | Generated Text (L=128, T=128, B=64) |
|---|---|
| LLaDA-8B-Instruct (low-confidence) | 1. Calculate the distance Lily runs in the first 4 hours:- Speed: 12 kilometers per hour- Time: 4 hours- Distance: 12 km/h * 4 h = 48 km2. Calculate the distance Lily runs in the next 4 hours:- Speed: 6 kilometers per hour- Time: 4 hours- Distance: 6 km/h * 4 h = 24 km3. Add the distances together:- Total distance: \boxed{72} km |
| dKV-Cache-Decode,Refresh = 4 | 1. Calculate the distance Lily runs in the first 4 hours:- Distance = Speed × Time = 12 km/h × 4 h = 48 km2. Calculate the distance Lily runs in the next 4 hours:- Distance = Speed × Time = 6 km/h x 4 h = 24 km3. Add the distances from both periods to find the total distance Lilyruns in 8 hours:- Total Distance = 48 km + 24 km = 72 km\boxed{72 km} |
| dKV-Cache-Decode,Refresh=16 | 1. Calculate the distance Lily runs in the first 4 hours:12 km/hour * 4 hours = 48 km2. Calculate the distance Lily runs in the next 4 hours:6 km/hour * 4 hours = 24 km3. Add the distances from the first 4 hours and the next 4 hours to findthe total distance she runs:48 km + 24 km = 72 kmTherefore, Lily can run \boxed{72} kilometers in 8 hours. |
| dKV-Cache-Greedy,Refresh = 2 | To determine how many kilometers Lily can run in 8 hours, we need tobreak down her running schedule and sum the distances.First, calculate the distance she runs at 12 kilometers per hour for 4hours:12 * 4 = 48After 4 hours, she runs 6 kilometers per hour for the remaining 4 hours:6 * 4 = 24Now, add the distances:48 + 24 = 72\boxed{72} |
| dKV-Cache-Greedy,Refresh=4(random) | For the first 4 hours Lily runs at 12 kilometers per hour, so she runs 12 *4 = 48 kilometers. 8 hours - 4 hours are left after the first 4 (8 - 4 = 4hours).For the next 4 hours, she runs at 6 kilometers per hour, so she runs 6 * 4= 24 kilometers.Adding the two distances together, Lily can run 48 + 24 = 72 kilometersin 8 hours.Answer: \boxed{72} |
This case study for LLaDA-8B-Instruct on a math word problem shows that all dKV-Cache variants (Decode with refresh 4 and 16, Greedy with refresh 2 and 4) successfully solve the problem and produce the correct answer (72). The generated text for each variant is coherent, follows a logical step-by-step reasoning process, and is qualitatively very similar to the Base LLaDA model. This reinforces the quantitative findings that dKV-Cache maintains performance quality.
The following are the results from Table 7 of the original paper:
| Model | Generate Text (L=256, T=256) |
| --- | --- |
| Dream-Instruct-7B (Entropy) | Once upon a time, there was a young man named Joey who lived in a small town in the middle of nowhere. He was a kind and gentle soul who loved to sing and play the guitar. One day, he met a beautiful girl named Rachel who was also a talented singer and musician. They fell in love and decided to get married. However, there was a problem. Joey's parents were against the marriage because they didn't know Rachel well enough. They thought that she was not a good match for their son. Rachel, on the other hand, was not worried about Joey's parents' opinion. She knew that she loved Joey and that he was the one for her. She decided to try and win over Joey's parents by showing them how much she cared about him. She started by cooking for them and inviting them to her house for dinner. She also took them on a trip to the mountains where they could enjoy the scenery and the fresh air. After a while, Joey's parents started to see Rachel in a different light. They realized that she was a good person and that she loved Joey very much. They decided to give their blessing to the marriage. Finally, Joey and Rachel got married and lived happily ever after. |
| dKV-Cache-Decode, Refresh = 16 | Once upon a time, in a small town called New York, there lived a young man named Joey and a young woman named Rachel. They were both very happy with their lives, but they were also very different from each other. Joey was a very outgoing person who loved to be around people, while Rachel was a very introverted person who loved to be alone. One day, Joey and Rachel met at a coffee shop and fell in love. They spent hours talking and laughing together, and they realized that they were perfect for each other. They decided to be together and build their lives together. As the years went by, Joey and Rachel grew closer and closer. They shared their dreams and their fears, and they supported each other through thick and thin. They were a perfect match, and they knew that they were meant to be together. Finally, Joey and Rachel got married. They had a beautiful wedding with all their friends and family in attendance. They were so happy to be together, and they knew that they would always be together. |
| dKV-Cache-Prefill | Once upon a time, in a small town called New York, there lived a young man named Joey. Joey was a kind and caring person who always tried to help others. He had a beautiful girlfriend named Rachel, who was also kind and caring. They were in love with each other and wanted to spend their lives together. One day, Joey and Rachel decided to get married. They planned their wedding for a long time, and everyone in the town was excited to attend. On the day of the wedding, Joey and Rachel exchanged vows and promised to love each other forever. After the wedding, Joey and Rachel moved into a beautiful house in the countryside. They were happy and content with their lives, and they had two children, a boy and a girl. The children grew up to be kind and caring people, just like their parents. As the years went by, Joey and Rachel continued to love each other deeply. They had many ups and downs, but they always found a way to overcome their challenges and stay together. Finally, after many years of marriage, Joey and Rachel decided to retire and live in a small cottage in the countryside. They were happy and content with their lives, and they knew that they had found the love of their lives in each other. |
This case study for Dream-Instruct-7B on a story generation task with a specific ending shows that dKV-Cache-Decode and dKV-Cache-Prefill both successfully generate coherent stories that adhere to the prompt's ending. The narrative styles and details vary slightly between models, but all maintain a reasonable quality similar to the Base Dream model, further underscoring the lossless or near-lossless performance of dKV-Cache variants.
The following are the results from Table 8 of the original paper:
| Model | Generate Text (L=128, T=128) |
| --- | --- |
| Dream-Instruct-7B (Entropy) | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily runs 6 kilometers per hour, so she runs a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. #### 72 |
| dKV-Cache-Decode, Refresh = 4 | In the first 4 hours, Lily can run 12 kilometers per hour, so she can run a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily can run 6 kilometers per hour, so she can run a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. #### 72 |
| dKV-Cache-Decode, Refresh = 16 | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily runs 6 kilometers per hour, so she runs a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. The answer is: 72 |
| dKV-Cache-Prefill, No refresh | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 × 4 = 48 kilometers. In the next 4 hours, she runs 6 kilometers per hour, so she runs a total of 6 × 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = \boxed{72} kilometers. The answer is: 72 |
This case study for Dream-Instruct-7B on the same math word problem as Table 6 further confirms the robustness of dKV-Cache variants. All models, including dKV-Cache-Decode (with refresh 4 and 16) and dKV-Cache-Prefill, correctly solve the problem and provide the answer 72 with clear, logical steps. This demonstrates that for both LLaDA and Dream, dKV-Cache preserves the model's problem-solving capabilities.
6.4. Overall Analysis
The results consistently show that dKV-Cache effectively tackles the slow inference problem in DLMs.
- The one-step delayed caching is critical for maintaining accuracy, especially at high cache ratios.
- dKV-Cache-Decode offers the best balance, providing substantial speedups on both LLaDA and Dream with near-lossless performance, sometimes even improving accuracy, particularly for long sequences.
- dKV-Cache-Prefill excels in scenarios with long input contexts (common in few-shot ICL), delivering impressive speedups.
- dKV-Cache-Greedy achieves higher raw tokens/s but with a slight accuracy trade-off, making it suitable for applications where speed is paramount and some performance degradation is acceptable. Its performance can be significantly improved with a local window.
- The system benefits from batch processing, which mitigates the overhead of gather/scatter operations.

Overall, dKV-Cache successfully pushes the Pareto front forward, allowing DLMs to achieve significantly faster inference while maintaining or even improving their generation quality.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully demonstrates the feasibility of incorporating a caching mechanism, dKV-Cache, into Diffusion Language Models (DLMs), a task previously considered challenging due to their non-autoregressive architecture and bidirectional attention. The core innovation lies in leveraging the observed dynamics of token representations during the diffusion process, specifically that representations stabilize after a token is decoded. This insight led to the design of a delayed and conditioned caching strategy.
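The delayed, conditioned caching strategy can be summarized as: recompute K/V for positions that are still changing, reuse K/V for positions whose token was decoded at least one step ago, and periodically refresh the whole cache. The following is a heavily simplified sketch under several assumptions (batch size one, one token unmasked per step, and a hypothetical `model(x, kv_cache=..., reuse_idx=...)` interface); it mirrors the spirit of dKV-Cache-Decode rather than the authors' actual implementation.

```python
import torch

def decode_with_delayed_cache(model, x, mask_id, steps, refresh_every=4):
    """Simplified denoising loop with a delayed, conditioned KV-cache.

    Assumes batch size 1, one token unmasked per step, and a hypothetical
    `model(x, kv_cache=..., reuse_idx=...)` interface that reuses cached K/V
    for the listed positions and recomputes the rest.
    """
    L = x.size(1)
    cache = None
    decoded = torch.zeros(L, dtype=torch.bool)        # tokens fixed so far
    newly_decoded = torch.zeros(L, dtype=torch.bool)  # tokens fixed at the previous step

    for step in range(steps):
        # Delay: a token's K/V become reusable only one step AFTER it is decoded,
        # since its representation still changes at the step where it leaves [MASK].
        reusable = decoded & ~newly_decoded
        if cache is None or step % refresh_every == 0:
            reusable = torch.zeros_like(reusable)     # periodic full refresh

        logits, cache = model(x, kv_cache=cache,
                              reuse_idx=reusable.nonzero().squeeze(-1))

        # Unmask the single most confident masked position (greedy, simplified).
        conf, pred = logits.softmax(-1).max(-1)       # (1, L) each
        conf = conf.masked_fill(x != mask_id, float("-inf"))
        pos = conf.argmax(-1)                         # (1,)
        x[0, pos] = pred[0, pos]

        newly_decoded = torch.zeros(L, dtype=torch.bool)
        newly_decoded[pos] = True
        decoded |= newly_decoded

    return x
```

The `refresh_every` parameter plays the same role as the refresh intervals varied in the experiments: smaller values recompute the cache more often and stay closer to the uncached model, while larger values trade some accuracy for additional speed.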
The authors proposed two main variants: dKV-Cache-Decode, which offers near-lossless accuracy and even improves performance on long sequences, and dKV-Cache-Greedy, which provides higher speed-ups by making more aggressive caching decisions. Extensive experiments on 7B-scale LLaDA and Dream models across diverse benchmarks (language understanding, math, code generation) confirm that dKV-Cache achieves 2-10× inference speedups. Crucially, this is achieved in a training-free manner, making it immediately applicable to existing DLMs and significantly narrowing the inference-efficiency gap with autoregressive language models.
7.2. Limitations & Future Work
The authors acknowledge one primary limitation:
- Focus on Algorithmic Design in Isolation: The current work primarily focuses on the algorithmic design of dKV-Cache. While effective, Diffusion Language Models still have substantial room for improvement at the system level.

They suggest future research directions:

- Integrating Algorithmic Innovations with System-Level Optimization: This includes areas such as memory management, parallelism, and hardware-aware execution. Combining dKV-Cache with these system-level improvements could unlock further efficiency gains and performance enhancements for DLMs.
7.3. Personal Insights & Critique
This paper presents a highly impactful contribution to the field of Diffusion Language Models. The core insight that token representations become stable after decoding, despite bidirectional attention, is elegant and effectively addresses a fundamental challenge. It transforms what was perceived as an inherent limitation of non-autoregressive models into an opportunity for optimization.
The introduction of dKV-Cache-Decode and dKV-Cache-Greedy provides a practical toolkit for researchers and practitioners, allowing them to choose between near-lossless performance and maximal speedup based on their specific needs. The training-free nature of the method is a significant advantage, ensuring immediate applicability without the costly overhead of retraining large models. The observation that dKV-Cache-Decode can improve performance on long sequences is particularly intriguing. It suggests that bidirectional attention, while powerful, might sometimes be inefficient or redundant in how it processes context, and that judicious caching can implicitly guide the model towards more effective contextual aggregation.
One potential area for deeper exploration could be a more fine-grained analysis of why dKV-Cache-Decode improves performance on long sequences. Is it acting as a form of regularization? Is it forcing the model to focus on more stable, "decoded" information, implicitly reducing noise or conflicting signals from still-fluctuating masked tokens? Further qualitative and quantitative analysis, perhaps comparing attention maps or gradient flows with and without dKV-Cache, could shed light on this phenomenon.
From a practical standpoint, the reliance on batch size to overcome the memory-bound nature of gather/scatter operations (Figure 7) is a common challenge in non-contiguous memory access patterns. This highlights that while the algorithmic solution is sound, its practical deployment will require careful system-level engineering to achieve optimal gains, as the authors themselves note. Future work could investigate specialized hardware kernels or memory allocation strategies to minimize these overheads even at small batch sizes, which are often crucial for low-latency inference in real-time applications.
Finally, the paper's clear exposition, rigorous experimental validation, and thoughtful analysis of different caching strategies for a complex model architecture like DLMs make it a valuable piece of research that paves the way for more efficient and practical Diffusion Language Models. The methods and insights could potentially be transferred to other non-autoregressive generative models that share similar representation dynamics during their iterative generation processes.