
dKV-Cache: The Cache for Diffusion Language Models

Published: 05/22/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

dKV-Cache introduces a delayed KV-cache for diffusion language models, enabling 2-10× faster inference by conditionally caching key-value states. Two variants balance speed and performance, revealing underutilized context and narrowing efficiency gaps with autoregressive models.

Abstract

Diffusion Language Models (DLMs) have been seen as a promising competitor for autoregressive language models. However, diffusion language models have long been constrained by slow inference. A core challenge is that their non-autoregressive architecture and bidirectional attention preclude the key-value cache that accelerates decoding. We address this bottleneck by proposing a KV-cache-like mechanism, delayed KV-Cache, for the denoising process of DLMs. Our approach is motivated by the observation that different tokens have distinct representation dynamics throughout the diffusion process. Accordingly, we propose a delayed and conditioned caching strategy for key and value states. We design two complementary variants to cache key and value step-by-step: (1) dKV-Cache-Decode, which provides almost lossless acceleration, and even improves performance on long sequences, suggesting that existing DLMs may under-utilise contextual information during inference. (2) dKV-Cache-Greedy, which has aggressive caching with reduced lifespan, achieving higher speed-ups with quadratic time complexity at the cost of some performance degradation. In the end, dKV-Cache achieves a 2-10× speedup in inference, largely narrowing the gap between ARs and DLMs. We evaluate our dKV-Cache on several benchmarks, delivering acceleration across general language understanding, mathematical, and code-generation benchmarks. Experiments demonstrate that caching can also be used in DLMs, even in a training-free manner from current DLMs.


In-depth Reading


1. Bibliographic Information

1.1. Title

dKV-Cache: The Cache for Diffusion Language Models

1.2. Authors

Xinyin Ma, Runpeng Yu, Gongfan Fang, Xinchao Wang
Affiliation: National University of Singapore

1.3. Journal/Conference

The paper is an arXiv preprint, indicating it has not yet undergone formal peer review and publication in a journal or conference. arXiv is a widely recognized platform for disseminating research quickly in fields like AI, often serving as a precursor to formal publication.

1.4. Publication Year

2025

1.5. Abstract

Diffusion Language Models (DLMs) are emerging as a promising alternative to autoregressive language models (ARs) but are hampered by slow inference speeds. This bottleneck primarily stems from their non-autoregressive architecture and bidirectional attention, which prevent the direct use of the key-value (KV) cache mechanism that typically accelerates AR decoding. To address this, the authors propose dKV-Cache (delayed KV-Cache), a KV-cache-like mechanism tailored for the denoising process in DLMs. The approach is motivated by the observation that token representations exhibit distinct dynamics during the diffusion process.

dKV-Cache employs a delayed and conditioned caching strategy for key and value states. Two complementary variants are introduced:

  1. dKV-Cache-Decode: Offers almost lossless acceleration and even improves performance on long sequences, suggesting that existing DLMs might under-utilize contextual information.

  2. dKV-Cache-Greedy: Achieves higher speed-ups with quadratic time complexity at the cost of some performance degradation, through aggressive caching with reduced lifespan.

    Overall, dKV-Cache achieves inference speedups of 2-10×, significantly narrowing the speed gap between ARs and DLMs. The method is evaluated on various benchmarks (general language understanding, mathematical, code-generation) and demonstrates that caching can be applied to DLMs in a training-free manner using existing models.

Official Source or PDF Link: https://arxiv.org/abs/2505.15781
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the critical issue of slow inference in Diffusion Language Models (DLMs). While DLMs offer a promising alternative to traditional autoregressive language models (ARs) due to their parallel decoding potential, their practical inference speed remains significantly slower. This inefficiency is a major barrier to their widespread adoption.

The core problem lies in the architectural differences between DLMs and ARs. ARs benefit greatly from a mechanism called KV-Cache, which reuses previously computed key and value states in the attention mechanism, accelerating decoding. However, DLMs, with their non-autoregressive architecture and bidirectional attention, cannot directly leverage the standard KV-Cache. This is because:

  1. Timestep-variant key and value states: In ARs, key and value states for prior tokens remain fixed due to a causal attention mask. DLMs use bidirectional attention, meaning key and value states can change at every timestep, making direct reuse problematic.

  2. Non-sequential decoding order: ARs decode strictly left-to-right, allowing for deterministic computation of the next token's QKV states. DLMs can update any masked position dynamically, making it impossible to predict which QKV states need to be computed or reused.

    These limitations give DLMs an overall decoding cost of $\mathcal{O}(L^3)$, where $L$ is the sequence length (roughly $L$ denoising steps, each recomputing full attention at $\mathcal{O}(L^2)$), compared to $\mathcal{O}(L^2)$ overall for ARs using KV-Cache. The paper's innovative idea is to adapt the concept of KV-Cache to DLMs by observing the representation dynamics of tokens during the diffusion process. The authors hypothesize that despite bidirectional attention, key and value states are not entirely unreusable but require a "delayed and conditioned reuse."
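To make the gap concrete, here is a rough cost accounting (an illustrative simplification that treats the hidden dimension as constant and assumes the number of denoising steps is on the order of $L$):

```latex
% DLM without caching: every denoising step recomputes full bidirectional attention
L \ \text{steps} \times \mathcal{O}(L^2) \ \text{per step} \;=\; \mathcal{O}(L^3)

% AR with KV-Cache: each new token attends only to the cached keys/values
L \ \text{tokens} \times \mathcal{O}(L) \ \text{per token} \;=\; \mathcal{O}(L^2)
```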

2.2. Main Contributions / Findings

The paper makes several key contributions:

  1. First KV-Cache for DLMs: It proposes dKV-Cache (delayed KV-Cache), the first mechanism to integrate KV-Cache-like functionality into diffusion language models, specifically designed to be compatible with their bidirectional attention and non-sequential decoding. This addresses a long-standing bottleneck.

  2. Leveraging Representation Dynamics: The design is motivated by empirical observations that key and value states exhibit consistent patterns: they are relatively stable after a token is decoded, with the most significant changes occurring precisely at the decoding step. This insight underpins the delayed caching strategy.

  3. Two Practical Variants:

    • dKV-Cache-Decode: Achieves almost lossless acceleration and in some cases, even improves performance on long sequences, suggesting better utilization of contextual information.
    • dKV-Cache-Greedy: Offers higher speed-ups by aggressively refreshing caches, reducing the overall time complexity from $\mathcal{O}(L^3)$ to $\mathcal{O}(L^2)$ at the cost of some performance degradation.
  4. Significant Inference Speedup: dKV-Cache delivers 2-10× inference speedups on 7B-scale diffusion language models (LLaDA, Dream) across diverse benchmarks (general language understanding, math, code generation). This significantly closes the gap with ARs.

  5. Training-Free Application: The method can be applied to existing DLMs without requiring any re-training, making it immediately usable.

    The key conclusion is that caching is indeed viable for DLMs, and by carefully designing a delayed and conditioned caching strategy, significant inference speedups can be achieved with minimal or even positive impact on performance, especially for long contexts.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully grasp the innovations of dKV-Cache, a foundational understanding of several key concepts is essential for a beginner:

Autoregressive Language Models (ARs)

Autoregressive Language Models (ARs), such as GPT or LLaMA, generate text sequentially, token by token, from left to right. When predicting the next token, the model considers all previously generated tokens as context. This sequential generation means that each new token depends only on tokens that came before it in the sequence. This property is crucial for the efficiency of KV-Cache.

Diffusion Language Models (DLMs)

Diffusion Language Models (DLMs) are a newer paradigm for text generation, inspired by diffusion models in computer vision. Unlike ARs, DLMs generate text by gradually refining a sequence of initially noisy or masked tokens into a coherent output.

  • Forward Process: This process gradually adds noise (e.g., masking tokens) to a clean text sequence until it becomes completely noisy or masked.
  • Reverse (Denoising) Process: This is the generative part. A neural network learns to reverse the forward process, gradually removing noise or unmasking tokens to reconstruct the original text. This denoising often happens in multiple steps.
  • Non-autoregressive: DLMs are typically non-autoregressive, meaning they can update multiple tokens or even the entire sequence in parallel at each denoising step, rather than generating one token at a time. This potential for parallelism is a key advantage.
  • Bidirectional Attention: A core component of DLMs (and other non-autoregressive models) is bidirectional attention. Unlike the causal attention mask used in ARs, bidirectional attention allows each token to attend to all other tokens in the sequence (both preceding and succeeding tokens) when computing its representation. This comprehensive contextual understanding is powerful but complicates caching.

Transformers

The Transformer architecture is the backbone of most modern language models, including both ARs and DLMs. It revolutionized sequence processing by relying entirely on attention mechanisms instead of recurrent neural networks (RNNs).

  • Encoder-Decoder/Decoder-Only: Transformers can have an encoder-decoder structure (for tasks like machine translation) or be decoder-only (common for large language models like GPT). DLMs often use a Transformer-like architecture for their denoising network.
  • Attention Mechanism: The core innovation of Transformers. It allows the model to weigh the importance of different parts of the input sequence when processing each token. It computes an output based on three inputs: a Query (Q), a Key (K), and a Value (V).

Attention Mechanism (Detailed)

The attention mechanism calculates a weighted sum of value vectors, where the weights are determined by the similarity between a query vector and all key vectors. The standard scaled dot-product attention formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • Q (Query): A matrix where each row represents a query vector for a token.

  • K (Key): A matrix where each row represents a key vector for a token.

  • V (Value): A matrix where each row represents a value vector for a token.

  • $d_k$: The dimension of the key vectors, used to scale the dot products to prevent them from becoming too large, which could push the softmax function into regions with very small gradients.

  • $QK^T$: The dot product of query and key matrices, which measures the similarity between each query and all keys.

  • $\mathrm{softmax}(\cdot)$: A function that converts the similarity scores into probability distributions, ensuring the attention weights sum to 1.

  • $\mathrm{Attention}(Q, K, V)$: The output of the attention mechanism, a weighted sum of the value vectors.

    In a Transformer layer, for each token in the input sequence, its hidden state is projected into Query, Key, and Value vectors using learned weight matrices ($W_Q, W_K, W_V$). A minimal numeric sketch of this computation follows below.
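Below is a minimal NumPy sketch of scaled dot-product attention; the toy dimensions and random projection matrices are illustrative only, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Q: (L_q, d_k), K: (L_kv, d_k), V: (L_kv, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)       # similarity of each query with every key
    weights = softmax(scores, axis=-1)    # each row sums to 1
    return weights @ V                    # weighted sum of value vectors

# Toy example: 4 tokens, model width 16, head width 8.
rng = np.random.default_rng(0)
h = rng.normal(size=(4, 16))                               # hidden states
W_Q, W_K, W_V = (rng.normal(size=(16, 8)) for _ in range(3))
out = attention(h @ W_Q, h @ W_K, h @ W_V)
print(out.shape)                                           # -> (4, 8)
```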

KV-Cache

The Key-Value Cache (KV-Cache) is a crucial optimization for accelerating inference in autoregressive Transformers. In AR models, when generating a sequence token by token:

  1. For the first token, all Query, Key, and Value vectors for that token are computed.
  2. For subsequent tokens, only the Query for the new token needs to be computed. The Key and Value vectors for all previous tokens are already known and can be reused (cached) from the previous steps. This reuse is possible because of the causal attention mask in ARs, which ensures that a token can only attend to itself and preceding tokens; the key and value states of prior tokens therefore do not change once they are computed.

    The KV-Cache significantly reduces redundant computation, changing the per-step time complexity from $\mathcal{O}(L^2)$ (if everything were recomputed) to $\mathcal{O}(L)$ (compute Q for the new token and its dot products with the cached K and V). Overall, this makes AR inference much faster. A minimal sketch of this bookkeeping follows below.
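The sketch below is illustrative single-head, unbatched NumPy code, not any particular library's API:

```python
import numpy as np

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

class ARKVCache:
    """Per-layer key/value cache for an autoregressive decoder (toy sketch)."""
    def __init__(self):
        self.K, self.V = [], []

    def step(self, q, k, v):
        # Only the new token's q, k, v are computed each step; earlier K/V rows
        # are reused because causal attention keeps them unchanged.
        self.K.append(k)
        self.V.append(v)
        K = np.stack(self.K)                      # (t, d)
        V = np.stack(self.V)                      # (t, d)
        scores = q @ K.T / np.sqrt(K.shape[-1])   # O(t) dot products
        return softmax(scores) @ V                # attention output for the new token

# Toy decode loop: 5 tokens with head width 8.
rng = np.random.default_rng(0)
cache = ARKVCache()
for _ in range(5):
    q, k, v = rng.normal(size=(3, 8))
    out = cache.step(q, k, v)
print(out.shape)  # -> (8,)
```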

3.2. Previous Works

Diffusion Models

The concept of diffusion models originated in continuous domains like image generation [15, 48] and video generation [5, 28]. They model data generation as the inverse of a forward noise process. For language, initial extensions were made to categorical data [3, 25] or continuous word-embedding spaces [21, 29].

Discrete Diffusion Language Models

  • D3PM [3]: Extended Denoising Diffusion Probabilistic Models (DDPM) to categorical data, defining transition matrices for corruption and denoising.
  • SEDD [33]: Introduced score entropy training to extend score-matching to discrete data.
  • MDLM [40]: Showed that simple masked discrete diffusion models can be competitive, and the current paper builds upon this family of models.
  • LLaDA [37] and Dream [52]: Scaled masked diffusion language models to billions of parameters, achieving performance comparable to leading autoregressive LLMs. These models are the primary targets for dKV-Cache.

Cache in Generative Models

  • Transformers [44]: The KV-Cache was originally introduced here and became fundamental.
  • KV-Cache improvements: Several works have focused on reducing memory consumption of KV-Cache for long contexts in ARs [18, 26, 32], for instance, through quantization [26] or compression [18, 32].
  • Cache in image/video diffusion: Caching has been explored in diffusion models for images [36, 46] and videos [34, 56] by leveraging temporal similarities in features [9, 35] or attention maps [31, 54].
  • Cache in 3D generative modeling: Also explored [50].
  • KV-Cache in DLMs (limited previous work):
    • Block Diffusion [2]: Extended non-autoregressive discrete language diffusion models into a semi-autoregressive form, making KV-Cache feasible but requiring training considerations and still being constrained by an autoregressive formula.
    • MDLM [40]: Also considered a form of caching, but under the strict condition that no new tokens have been calculated, which limits its applicability.

3.3. Technological Evolution

The evolution of generative language models progressed from Recurrent Neural Networks (RNNs) to Transformers with their attention mechanism. Autoregressive Transformers (like GPT-series) became dominant, and KV-Cache emerged as a critical optimization for their sequential decoding process. Simultaneously, diffusion models gained prominence in continuous data domains (images, audio, video) and were later adapted for discrete data, specifically text, leading to Diffusion Language Models (DLMs).

The challenge then became how to transfer the efficiency gains from ARs (like KV-Cache) to DLMs, which inherently operate differently with bidirectional attention and non-sequential decoding. Previous attempts at caching in DLMs were either semi-autoregressive [2] or imposed strict conditions [40], limiting their full potential.

3.4. Differentiation Analysis

Compared to the main methods in related work, dKV-Cache offers several core differences and innovations:

  • Addressing Fundamental Incompatibility: Unlike previous works that either adapted DLMs to be semi-autoregressive [2] or imposed strict conditions on caching [40], dKV-Cache directly tackles the fundamental incompatibility of bidirectional attention and non-sequential decoding with standard KV-Cache. It proposes a solution that works for fully non-autoregressive DLMs.

  • Leveraging Representation Dynamics: The core innovation is the empirical observation of token representation dynamics (Figure 2). This insight—that tokens become stable after decoding—allows for a delayed and conditioned caching strategy, which is unique to the diffusion process. This is a novel way to adapt caching to the unique characteristics of DLMs.

  • Training-Free Application: A significant advantage is that dKV-Cache is training-free. This means it can be applied directly to existing, pre-trained DLMs (like LLaDA and Dream) without any modification to their training regimen or architecture, making it highly practical for immediate deployment.

  • Two Complementary Variants: The proposal of dKV-Cache-Decode (near-lossless, potentially improved performance) and dKV-Cache-Greedy (higher speed-up, $\mathcal{O}(L^2)$ complexity) offers flexibility to users based on their specific performance-accuracy trade-offs. This provides a more comprehensive solution compared to a single caching approach.

  • Enhanced Performance for Long Contexts: dKV-Cache-Decode is shown to even improve performance on long sequences, suggesting that existing DLMs might not fully utilize contextual information during inference. This points to a deeper architectural insight beyond mere speedup.

  • System-Level Complexity: The concat_reorder mechanism (Appendix A) specifically addresses the challenges of gathering and scattering key and value states in non-contiguous memory for DLMs, which is a novel system-level optimization for this domain.

    In essence, dKV-Cache provides a novel, practical, and effective solution to a critical bottleneck in DLMs by deeply understanding their unique generative process, without compromising their core non-autoregressive nature or requiring re-training.

4. Methodology

4.1. Principles

The core idea behind dKV-Cache is rooted in an empirical observation about the behavior of token representations within Diffusion Language Models (DLMs) during their denoising process. Unlike autoregressive models where key and value states of prior tokens are static once computed, DLMs utilize bidirectional attention, which theoretically means these states could change at every denoising step. However, the authors observe that:

  1. The key and value embeddings, $\mathbf{K}$ and $\mathbf{V}$, exhibit consistently high similarity across timesteps despite step-to-step differences.

  2. Crucially, once a token is decoded (i.e., its identity is finalized from a [MASK] token), its representation becomes relatively stable in subsequent steps. Conversely, still-masked tokens continue to fluctuate significantly.

  3. The most substantial changes in $\mathbf{K}$ and $\mathbf{V}$ occur precisely at the decoding step of each token and in the early stages of the denoising process.

    These observations motivate a delayed and conditioned caching strategy. Instead of caching key and value states immediately upon their computation (like in ARs), dKV-Cache delays caching them until they are "stable" (i.e., after the token has been decoded and its representation has largely settled). This addresses the timestep-variant nature of representations in DLMs. Furthermore, the caching is "conditioned" on the token's state (decoded vs. masked). This principle allows for reuse of stable parts of the sequence while recomputing the volatile parts, balancing efficiency with correctness under bidirectional attention.

4.2. Core Methodology In-depth (Layer by Layer)

The paper first outlines why traditional KV-Cache cannot be directly applied to Diffusion Language Models (DLMs), then presents empirical observations that motivate dKV-Cache, and finally details its variants.

Why KV-Cache Cannot be Used in DLMs?

The effectiveness of KV-Cache in autoregressive models (ARs) relies on two key assumptions that DLMs violate:

  • Timestep-variant key and value states: In ARs, a causal attention mask ensures that a token at step $t$ only attends to tokens at positions 1 to $t$. This means that the key and value states for tokens 1 to $t-1$ remain constant for all subsequent steps: $\mathbf{K}_m^{[1:t-1]}$ and $\mathbf{V}_m^{[1:t-1]}$ are identical for any step $m \geq t$. In contrast, DLMs employ a bidirectional attention mask, allowing any token to attend to all other tokens. Consequently, the key and value states of a token can change at each denoising step as the surrounding context is updated: $\mathbf{K}_m^{[1:t-1]} \neq \mathbf{K}_n^{[1:t-1]}$ and $\mathbf{V}_m^{[1:t-1]} \neq \mathbf{V}_n^{[1:t-1]}$ for $n \neq m$, breaking the global reuse assumption of conventional KV-Cache.
  • Non-sequential decoding order: ARs generate tokens strictly left-to-right, making the position of the next token to be decoded deterministic. This allows them to compute Query, Key, and Value (QKV) states selectively, only for the current decoding position. DLMs, however, support flexible generation orders: at each denoising step, any masked token position can be selected for update based on probabilities or confidence scores. This uncertainty means the model cannot pre-determine which token's hidden states ($\mathbf{h}^{[i]}$) and QKV states will need computation, rendering the selective computation strategy of AR KV-Cache ineffective.

Representation Dynamics for Tokens in Diffusion Sampling

To understand whether key and value states could still be reused despite bidirectional attention, the authors investigated the dynamics of $\mathbf{K}$ and $\mathbf{V}$ for each token throughout the diffusion process. This analysis (summarized in Figure 2) revealed critical insights:

  • Consistent Similarity (Figure 2a): Despite being updated at each step, the key and value embeddings exhibit consistently high similarity across different timesteps. This suggests that the representations don't radically change, offering a potential for reuse.

  • Stability After Decoding (Figure 2b): Once a token is decoded (i.e., its identity is no longer [MASK]), its representation becomes relatively stable in subsequent steps. Conversely, still-masked tokens show significant fluctuations. This implies that decoded tokens are good candidates for caching.

  • Largest Changes at Decoding Step (Figure 2c): The most substantial changes in $\mathbf{K}$ and $\mathbf{V}$ occur precisely at the decoding step of each token and in the early stages of the denoising process. This indicates that caching a token's QKV immediately when it is computed might be premature, as its representation is still undergoing significant change.

    These observations directly inform the design of dKV-Cache, particularly the delayed aspect.

Delayed KV-Cache for Masked Diffusion Language Models

The core design of dKV-Cache is based on a generalized non-sequential KV-Cache formulation that accommodates the dynamic nature of DLMs.

Generalized Non-sequential KV-Cache

The standard KV-Cache in ARs uses a contiguous slice $\mathbf{K}^{[1:t-1]}$ and $\mathbf{V}^{[1:t-1]}$. dKV-Cache replaces this with a set of cached indices $\mathcal{S}_t \subseteq \mathcal{T} = \{1, \dots, L\}$ (where $\mathcal{T}$ is the set of all token positions). The next token position is not fixed at $t$ but is instead $D_t$, the position of the token being denoised at step $t$.

The modified attention computation at step $t$ is:
$ \mathbf{z}_t = \mathrm{softmax}\left( \frac{ \mathbf{Q}_t^{D_t} \left( \mathbf{K}_t^{\mathcal{S}_t \cup \{D_t\}} \right)^{\top} }{ \sqrt{d_k} } \right) \mathbf{V}_t^{\mathcal{S}_t \cup \{D_t\}} \quad \text{with} \quad \begin{cases} \mathbf{K}_t^{\mathcal{S}_t \cup \{D_t\}} = \mathrm{concat\_reorder}\left( \mathbf{K}_{t-1}^{\mathcal{S}_t}, \mathbf{K}_t^{D_t} \right) \\ \mathbf{V}_t^{\mathcal{S}_t \cup \{D_t\}} = \mathrm{concat\_reorder}\left( \mathbf{V}_{t-1}^{\mathcal{S}_t}, \mathbf{V}_t^{D_t} \right) \end{cases} $
Where:

  • $\mathbf{z}_t$: The output of the attention head at step $t$.

  • $\mathrm{softmax}(\cdot)$: The softmax function applied to the scaled dot product.

  • $\mathbf{Q}_t^{D_t}$: The query vector for the token at position $D_t$ (the token being decoded at step $t$).

  • $\mathbf{K}_t^{\mathcal{S}_t \cup \{D_t\}}$ and $\mathbf{V}_t^{\mathcal{S}_t \cup \{D_t\}}$: The key and value matrices formed by concatenating the cached states for positions in $\mathcal{S}_t$ with the newly computed states for position $D_t$. (The paper writes $\mathcal{S}_t \cup \{t\}$ in places, which appears to be a typo for $\mathcal{S}_t \cup \{D_t\}$; the equation above uses the latter, i.e., the set of positions whose key and value states participate in attention at step $t$.)

  • $d_k$: The dimension of the key vectors, used for scaling.

  • $D_t$: The position of the token currently being denoised/decoded at step $t$.

  • $\mathcal{S}_t$: The set of token positions for which key and value states have been cached from previous steps.

  • $\mathrm{concat\_reorder}(\cdot)$: A specialized operation (explained below under concat_reorder) that concatenates the cached states from the previous step ($\mathbf{K}_{t-1}^{\mathcal{S}_t}$ and $\mathbf{V}_{t-1}^{\mathcal{S}_t}$) with the newly computed states for $D_t$ ($\mathbf{K}_t^{D_t}$ and $\mathbf{V}_t^{D_t}$), and then reorders them.

    To generalize, the query input $\mathbf{Q}_t^{D_t}$ is extended from a single token to multiple, arbitrary tokens. At decoding step $t$, a dynamic set $\mathcal{M}_t$ is constructed, containing the tokens that are not yet finalized (i.e., still masked or noisy). Caching is delayed, not by input time, but by the step at which a token is decoded: key and value states corresponding to decoded tokens ($\mathcal{T} \setminus \mathcal{M}_t$) are reused, while non-finalized positions ($\mathcal{M}_t$) are recomputed at each step. This also avoids the need to predict the denoising order.

One-step Delayed Caching (dKV-Cache-Decode)

Building on the observation that the most significant changes in QKV states happen at the exact decoding step (Figure 2c), caching a token's key and value immediately after it is decoded might lead to using stale representations. To mitigate this, dKV-Cache introduces one-step delayed caching: at timestep $t$, the method uses the masking state from the previous step, $\mathcal{M}_{t-1}$, to determine which tokens are cacheable. In other words, tokens decoded at step $t-1$ are cached starting at step $t$.

The final formulation for dKV-Cache-Decode is:
$ \mathbf{z}_t = \mathrm{softmax}\left( \frac{ \mathbf{Q}_t^{\mathcal{M}_t} \left( \mathbf{K}_t^{\mathcal{T}} \right)^{\top} }{ \sqrt{d_k} } \right) \mathbf{V}_t^{\mathcal{T}} \quad \text{with} \quad \begin{cases} \mathbf{K}_t^{\mathcal{T}} = \mathrm{concat\_reorder}\left( \mathbf{K}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}, \mathbf{K}_t^{\mathcal{M}_{t-1}} \right) \\ \mathbf{V}_t^{\mathcal{T}} = \mathrm{concat\_reorder}\left( \mathbf{V}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}, \mathbf{V}_t^{\mathcal{M}_{t-1}} \right) \end{cases} $
Where:

  • $\mathbf{z}_t$: The attention output at step $t$.

  • $\mathbf{Q}_t^{\mathcal{M}_t}$: The query vectors for the set of currently non-finalized tokens $\mathcal{M}_t$.

  • $\mathbf{K}_t^{\mathcal{T}}$ and $\mathbf{V}_t^{\mathcal{T}}$: The full key and value matrices for the entire sequence at step $t$.

  • $\mathcal{T}$: The set of all token positions $\{1, \dots, L\}$.

  • $\mathcal{M}_{t-1}$: The set of tokens that were not finalized at the previous step $t-1$.

  • $\mathcal{T} \setminus \mathcal{M}_{t-1}$: The set of tokens that were finalized (decoded) by step $t-1$. Their key and value states ($\mathbf{K}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$ and $\mathbf{V}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$) are reused from the previous step.

  • $\mathbf{K}_t^{\mathcal{M}_{t-1}}$ and $\mathbf{V}_t^{\mathcal{M}_{t-1}}$: The key and value states newly computed at step $t$ for the tokens that were not finalized at step $t-1$.

  • $\mathrm{concat\_reorder}(\cdot)$: Concatenates the reused and newly computed states and reorders them for the attention calculation.

    This delay stabilizes performance and accuracy, though it slightly reduces immediate efficiency.
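A minimal sketch of the bookkeeping behind this formulation, under the assumption of a toy `compute_kv` callback that recomputes key/value rows for the requested positions (illustrative only, not the authors' implementation):

```python
import numpy as np

def decode_step_kv(compute_kv, L, masked_prev, K_prev, V_prev):
    """One dKV-Cache-Decode step for the key/value side (sketch).

    masked_prev : positions still masked at step t-1 (M_{t-1}); only these rows
                  are recomputed at step t, everything else reuses the cache.
    K_prev, V_prev : (L, d) arrays holding K/V from step t-1.
    compute_kv  : callable(positions) -> (K_new, V_new) for those rows at step t.
    """
    K_new, V_new = compute_kv(masked_prev)            # fresh rows for volatile tokens
    K_t, V_t = K_prev.copy(), V_prev.copy()
    K_t[masked_prev], V_t[masked_prev] = K_new, V_new
    # Rows outside masked_prev are carried over unchanged: those tokens were
    # decoded before step t-1, so their representations have already settled.
    return K_t, V_t

# Toy usage with random "recomputation" standing in for the Transformer.
rng = np.random.default_rng(0)
L, d = 6, 4
K_prev, V_prev = rng.normal(size=(L, d)), rng.normal(size=(L, d))
fake_kv = lambda pos: (rng.normal(size=(len(pos), d)), rng.normal(size=(len(pos), d)))
K_t, V_t = decode_step_kv(fake_kv, L, [2, 4, 5], K_prev, V_prev)
print(K_t.shape, V_t.shape)  # -> (6, 4) (6, 4)
```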

Cache Refreshing Mechanism

To maintain correctness and consistency over long sequences, dKV-Cache includes a cache refreshing mechanism. Every $N$ steps, the entire stored cache is discarded and the calculation reverts to a normal, full computation step (i.e., $\mathcal{M}_{t-1}$ is effectively treated as the full set $\mathcal{T}$ in Equation 4, so nothing is reused). This recomputes all key and value states from scratch, ensuring freshness.
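A sketch of how such a refresh schedule might be wired into a sampling loop (the names `refresh_interval` and `cache` are illustrative, not from the paper's code):

```python
def should_refresh(step, refresh_interval):
    # Every `refresh_interval` steps, drop the cache and run a full forward pass
    # that recomputes key/value states for every position.
    return refresh_interval > 0 and step % refresh_interval == 0

cache = None
for step in range(1, 9):                   # toy loop standing in for T denoising steps
    if should_refresh(step, refresh_interval=4):
        cache = None                       # forces full recomputation on this step
    # ... run the denoising step, then store the returned K/V as `cache` ...
    cache = "kv-from-this-step"            # placeholder for the returned K/V
```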

dKV-Cache-Prefill and dKV-Cache-PD

These are specialized variants for handling prefill tokens (the initial context provided to the model):

  • dKV-Cache-Prefill: Based on the observation that prefill tokens primarily attend to each other and are less influenced by later generated tokens, this strategy caches prefill tokens without ever refreshing them. This is beneficial for very long input contexts.
  • dKV-Cache-PD: This variant intermittently refreshes only the newly decoded tokens while maintaining the key and value states of prefill tokens without any recomputation.

dKV-Cache-Greedy: Greedy Formulation of dKV-Cache

The dKV-Cache-Decode method, while accurate, still has $\mathcal{O}(L^3)$ time complexity because the set $\mathcal{M}_t$ (non-finalized tokens) can initially contain almost all $L$ tokens, only shrinking towards the end. To achieve $\mathcal{O}(L^2)$ time complexity (comparable to ARs), dKV-Cache-Greedy adopts a more aggressive caching strategy that restricts the scope of recomputation.

Instead of recomputing QKV states for all tokens in $\mathcal{M}_t$, dKV-Cache-Greedy restricts $\mathcal{M}_t$ to three components for recomputation:

  1. The token at the current step, $D_t$.
  2. The token from the previous step, $D_{t-1}$ (motivated by one-step delayed caching).
  3. A local window $\mathcal{W}_t$, which includes the current token and its neighbours within a fixed size $w$. The paper reports better performance when centering this window around $D_{t-1}$. The local window is defined as: $ \mathcal{W}_t = \left\{ x_i \,\middle|\, i \in \left[ D_t - \left\lceil \tfrac{w}{2} \right\rceil,\; D_t + \left\lfloor \tfrac{w}{2} \right\rfloor \right] \right\} $ Where:
  • $\mathcal{W}_t$: The set of token positions included in the local window at step $t$.

  • $x_i$: The token at position $i$.

  • $i$: The index of a token position.

  • $D_t$: The position of the token being decoded at step $t$.

  • $w$: The fixed window size (at most 6 in the experiments).

  • $\lceil \cdot \rceil$ and $\lfloor \cdot \rfloor$: Ceiling and floor functions, respectively, used to define the window boundaries.

    By restricting QKV recomputation to a small, fixed-size subset of tokens (current token, previous token, and local window), the size of $\mathcal{M}_t$ no longer depends on $L$, reducing the time complexity to $\mathcal{O}(L^2)$. This aggressive strategy yields higher speedups but may incur some performance degradation because key and value states become more aggressively stale.
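A small sketch of how this recompute set could be assembled, following the notation above (illustrative; clipping the window to valid positions is an assumption, not spelled out in the paper):

```python
import math

def greedy_recompute_set(d_t, d_prev, w, L):
    """Positions whose Q/K/V are recomputed at step t under dKV-Cache-Greedy.

    d_t    : position decoded at the current step (D_t)
    d_prev : position decoded at the previous step (D_{t-1})
    w      : local window size
    L      : sequence length (used only to clip the window to valid positions)
    """
    lo = d_t - math.ceil(w / 2)
    hi = d_t + math.floor(w / 2)
    window = set(range(max(0, lo), min(L - 1, hi) + 1))   # W_t from the formula above
    return sorted(window | {d_t, d_prev})                 # size independent of L

print(greedy_recompute_set(d_t=10, d_prev=7, w=4, L=64))  # -> [7, 8, 9, 10, 11, 12]
```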

A. Design for concat_reorder (from Appendix)

The concat_reorder operation is crucial for the efficiency of dKV-Cache. Unlike standard KV-Cache in ARs which uses simple concatenation for contiguous memory, dKV-Cache needs to handle non-contiguous memory access due to arbitrary positions for cached and newly computed tokens. This involves gathering (collecting data from scattered memory locations) and scattering (writing data to scattered memory locations), which can be inefficient.

The concat_reorder algorithm aims to minimize these overheads by reordering token positions during the Transformer's forward calculation. The key idea is to arrange the sequence such that all cached tokens are contiguous (e.g., on the left side) and newly decoded tokens are on the other side. This allows for more efficient matrix operations.

The process involves:

  1. Retrieval of Cached States: At step $t$, the previously cached key and value states ($\mathbf{K}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$ and $\mathbf{V}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$) are retrieved based on their position indices $\mathcal{T} \setminus \mathcal{M}_{t-1}$ (tokens that were finalized by step $t-1$).

  2. Reordering the Sequence Input: The input sequence for the current step ($\mathbf{x}$) is reordered. Cached tokens (those in $\mathcal{T} \setminus \mathcal{M}_{t-1}$) are placed first, followed by uncached tokens (those in $\mathcal{M}_{t-1}$). Denote this reordered sequence as $\mathbf{x}'$.

  3. Positional Embeddings Reordering: The positional embeddings (PE) are also reordered to match the new sequence order. This is a one-time operation per model evaluation and can be shared across layers.

  4. Compute QKV for Uncached Tokens: The Transformer calculates QKV states ($\mathbf{Q}_t^{\mathcal{M}_t}, \mathbf{K}_t^{\mathcal{M}_t}, \mathbf{V}_t^{\mathcal{M}_t}$) only for the uncached tokens (those in $\mathcal{M}_t$).

  5. Concatenation for Attention Calculation: The retrieved cached K/V states and the newly computed K/V states are concatenated directly to form the full $\mathbf{K}_t^{\mathcal{T}}$ and $\mathbf{V}_t^{\mathcal{T}}$ matrices, which are then used in the attention calculation.

  6. Attention Output and Final Reordering: The attention mechanism produces an output. If needed, this output can be reordered back to the original sequence positions using an inverse mapping to match the model's expected output format.

    The one-step shift in caching means that the positions of cached tokens at step tt correspond to tokens decoded at step t-1. This alignment helps track which key and value entries need to be cached without storing entire matrices.

The pseudo-algorithm for this process (Algorithm 1 in the paper) is:

Algorithm 1: dKV-Cache

Require: Sequence x_1:L at step t (simplified as x), position index of masked tokens M_t, cached Key K_t-1^(T\M_t-1) and Value V_t-1^(T\M_t-1)
1: x' ← x[M_t-1]                    (M_t-1: t-1 for one-step shift)
2: PE' ← [PE[T\M_t-1]; PE[M_t-1]]  (Positional embeddings: cached on left, uncached on right)
3: Q_t^M_t, K_t^M_t, V_t^M_t ← T(x') (T: Calculation in Transformer to get Q, K and V)
4: K_t^T ← Concat(K_t-1^(T\M_t-1), K_t^M_t), V_t^T ← Concat(V_t-1^(T\M_t-1), V_t^M_t) ▹ Get all K and V
5: K_t^(L\M_t) ← Reorder(K_t^L, I'), V_t^(L\M_t) ← Reorder(V_t^L, I') (I': The index of T\M_t in the [x[T\M_t-1]; x[M_t-1]])
6: p' ← A(Q_t^M_t, K_t^T, V_t^T)
7: p ← Scatter(p', M_t-1)         (Put the token logits back to the original position)
8: Return p, K_t^T, V_t^T

Let's break down Algorithm 1:

  • Require: This specifies the inputs:

    • $x$: The input sequence of length $L$ at the current step $t$.
    • $\mathcal{M}_t$: The set of indices of masked (non-finalized) tokens at the current step $t$.
    • $\mathbf{K}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$ and $\mathbf{V}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}$: The cached Key and Value states from the previous step $t-1$ for tokens that were finalized (decoded); $\mathcal{T} \setminus \mathcal{M}_{t-1}$ denotes the indices of those finalized tokens.
  • Line 1: $x' \leftarrow x[\mathcal{M}_{t-1}]$: The accompanying text describes reordering the full sequence so that cached tokens sit on the left and uncached tokens on the right, whereas this line takes only the tokens at the positions in $\mathcal{M}_{t-1}$. Interpreting it as extracting the still-masked tokens from the previous step, $\mathbf{x}'$ is exactly the subset of tokens that needs QKV recomputation, which aligns with the following lines. The comment ($\mathcal{M}_{t-1}$: $t-1$ for the one-step shift) confirms that the masking state from the previous step is used.

  • Line 2: $PE' \leftarrow [PE[\mathcal{T} \setminus \mathcal{M}_{t-1}]; PE[\mathcal{M}_{t-1}]]$: PE refers to positional embeddings. This line constructs a reordered positional-embedding matrix $PE'$: the embeddings for the finalized tokens ($\mathcal{T} \setminus \mathcal{M}_{t-1}$) are placed first, followed by those for the non-finalized tokens ($\mathcal{M}_{t-1}$). $PE'$ is used with the reordered input sequence when computing the states of the tokens that need recomputation.

  • Line 3: $\mathbf{Q}_t^{\mathcal{M}_t}, \mathbf{K}_t^{\mathcal{M}_t}, \mathbf{V}_t^{\mathcal{M}_t} \leftarrow T(x')$: The Transformer block ($T(\cdot)$) calculates Query, Key, and Value states only for the tokens in $\mathbf{x}'$ (the non-finalized tokens from the previous step, $\mathcal{M}_{t-1}$). The resulting QKV states are thus specific to these masked tokens at the current step $t$.

  • Line 4: $\mathbf{K}_t^{\mathcal{T}} \leftarrow \mathrm{Concat}(\mathbf{K}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}, \mathbf{K}_t^{\mathcal{M}_t})$, $\mathbf{V}_t^{\mathcal{T}} \leftarrow \mathrm{Concat}(\mathbf{V}_{t-1}^{\mathcal{T} \setminus \mathcal{M}_{t-1}}, \mathbf{V}_t^{\mathcal{M}_t})$: This is the crucial concat_reorder step for Key and Value. It concatenates the cached K/V states (for tokens finalized by step $t-1$) with the newly computed K/V states (for tokens that were still masked at step $t-1$), forming the complete Key and Value matrices for the entire sequence at step $t$, denoted $\mathbf{K}_t^{\mathcal{T}}$ and $\mathbf{V}_t^{\mathcal{T}}$.

  • Line 5: $\mathbf{K}_t^{\mathcal{T} \setminus \mathcal{M}_t} \leftarrow \mathrm{Reorder}(\mathbf{K}_t^{\mathcal{T}}, I')$, $\mathbf{V}_t^{\mathcal{T} \setminus \mathcal{M}_t} \leftarrow \mathrm{Reorder}(\mathbf{V}_t^{\mathcal{T}}, I')$: This step reorders the full Key and Value matrices. The algorithm's notation is somewhat inconsistent here: $L \setminus \mathcal{M}_t$ should presumably be $\mathcal{T}$, and $\mathcal{M}_t$ should be $\mathcal{M}_{t-1}$ to match the other lines. $I'$ is described as "the index of $\mathcal{T} \setminus \mathcal{M}_t$ in $[\mathbf{x}[\mathcal{T} \setminus \mathcal{M}_{t-1}]; \mathbf{x}[\mathcal{M}_{t-1}]]$". Given the overall mechanism, the most logical interpretation is that this line rearranges the concatenated $\mathbf{K}_t^{\mathcal{T}}$ and $\mathbf{V}_t^{\mathcal{T}}$ from Line 4 into a canonical order, either matching the query structure used in the attention computation or the original sequence order for storage as the next step's cache.

  • Line 6: $p' \leftarrow A(\mathbf{Q}_t^{\mathcal{M}_t}, \mathbf{K}_t^{\mathcal{T}}, \mathbf{V}_t^{\mathcal{T}})$: The attention mechanism ($A(\cdot)$) is applied. The queries come from the non-finalized tokens at step $t$ ($\mathbf{Q}_t^{\mathcal{M}_t}$) and attend to the full Key and Value matrices for the entire sequence ($\mathbf{K}_t^{\mathcal{T}}, \mathbf{V}_t^{\mathcal{T}}$). $p'$ is the attention output (e.g., logits or hidden states) for the non-finalized tokens.

  • Line 7: $p \leftarrow \mathrm{Scatter}(p', \mathcal{M}_{t-1})$: The attention output $p'$ (for the non-finalized tokens) is scattered back to its original positions in the sequence. $p$ is the full sequence output after this step, with updated logits or hidden states for the non-finalized tokens.

  • Line 8: Return $p, \mathbf{K}_t^{\mathcal{T}}, \mathbf{V}_t^{\mathcal{T}}$: The algorithm returns the updated sequence output $p$ and the current full Key and Value matrices ($\mathbf{K}_t^{\mathcal{T}}, \mathbf{V}_t^{\mathcal{T}}$), which serve as the cached states for the next step.

    The concat_reorder strategy is key to making dKV-Cache efficient, particularly because it handles Rotary Positional Embeddings (RoPE) by simply reordering them alongside the tokens, incurring minimal overhead. A minimal sketch of this flow follows below.
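To make the data movement concrete, below is a minimal single-head NumPy sketch of one forward step in the spirit of Algorithm 1. The projection matrices and shapes are toy stand-ins, positional-embedding (RoPE) reordering is omitted, and the cache layout is simplified, so this is an illustration of the gather/concat/scatter flow rather than the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dkv_cache_step(h, masked_prev, K_cache, V_cache, W_Q, W_K, W_V):
    """One attention step with delayed KV reuse (sketch of Algorithm 1).

    h            : (L, d_model) hidden states at step t
    masked_prev  : positions still masked at step t-1 (M_{t-1}); only these rows
                   receive fresh Q/K/V, all other rows reuse the cache
    K_cache/V_cache : cached rows for the finalized positions (T \\ M_{t-1}),
                   stored in ascending position order
    """
    L = h.shape[0]
    cached_pos = [i for i in range(L) if i not in set(masked_prev)]   # T \ M_{t-1}
    order = cached_pos + list(masked_prev)       # cached on the left, fresh on the right
    # Lines 1 & 3: fresh Q/K/V only for the still-masked rows.
    h_new = h[list(masked_prev)]
    Q_new, K_new, V_new = h_new @ W_Q, h_new @ W_K, h_new @ W_V
    # Line 4: concatenate cache with the freshly computed rows (layout follows `order`).
    K_full = np.concatenate([K_cache, K_new], axis=0)
    V_full = np.concatenate([V_cache, V_new], axis=0)
    # Line 6: queries from masked positions attend to the full (reordered) sequence.
    scores = Q_new @ K_full.T / np.sqrt(K_full.shape[-1])
    out_new = softmax(scores, axis=-1) @ V_full
    # Line 7: scatter outputs back to their original positions.
    out = np.zeros((L, V_full.shape[-1]))
    out[list(masked_prev)] = out_new
    return out, K_full, V_full, order            # K/V (in `order`) feed the next step

# Toy usage: L=6, d_model=16, head width 8, positions 2, 4, 5 still masked at t-1.
rng = np.random.default_rng(0)
L, d_model, d = 6, 16, 8
h = rng.normal(size=(L, d_model))
W_Q, W_K, W_V = (rng.normal(size=(d_model, d)) for _ in range(3))
K_cache = rng.normal(size=(3, d))   # rows for finalized positions 0, 1, 3
V_cache = rng.normal(size=(3, d))
out, K_full, V_full, order = dkv_cache_step(h, [2, 4, 5], K_cache, V_cache, W_Q, W_K, W_V)
print(out.shape, K_full.shape, order)  # -> (6, 8) (6, 8) [0, 1, 3, 2, 4, 5]
```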

B. Design for Dream (from Appendix)

Dream models, being adapted from pre-trained autoregressive models, have a unique characteristic: their output positions are shifted. Specifically, the model predicts the token at position $t+1$ based on the input up to position $t$. This differs from standard masked diffusion models, where the output aligns with the current token. The shifted output position necessitates different caching strategies:

  • Un-Shift (Figure 6a): For the $t$-th token, its key and value are stored as $\mathbf{K}^t$ and $\mathbf{V}^t$ at position $t$. This is the standard interpretation of caching.
  • Right-Shift (Figure 6b): Due to the shifted output, the model's hidden state at position $t$ might actually correspond to the token at $t-1$. This variant explores caching $\mathbf{K}^{t+1}$ and $\mathbf{V}^{t+1}$ for the $t$-th token. However, this largely harms performance.
  • Un&Right-Shift (Figure 6c): A stricter variant where caching is conditioned on both input stability and decoding completion: for the $t$-th token, features are cached only after its input is fixed and it has been decoded. This variant performed best on Dream but is incompatible with the concat_reorder mechanism. The paper uses the Un-Shift strategy for Dream in its main experiments for compatibility, despite Un&Right-Shift showing slightly better performance on Dream specifically.

Figure 6 illustrates these variants:

  • Figure 6a (Un-Shift): The Key/Value cache is stored for the token at index $i$ (labeled $K_i$, $V_i$) and used for the query at step $t+1$. The input at step $t$ is $[x_0, \dots, x_L]$.
  • Figure 6b (Right-Shift): The K/V for the token at index $i+1$ (labeled $K_{i+1}$, $V_{i+1}$) are stored and then used for the query at step $t+1$.
  • Figure 6c (Un&Right-Shift): A stricter caching decision in which $K_i$, $V_i$ are cached only after the token $x_i$ is finalized and its input is stable, and are then used in subsequent steps. Together these illustrate the strategies for handling the positional shift typical of models adapted from autoregressive designs.

5. Experimental Setup

5.1. Datasets

The experiments evaluated dKV-Cache on two 7B-scale Diffusion Language Models (DLMs): LLaDA [37] and Dream [52]. A diverse set of benchmarks was used to assess various capabilities:

  • General Language Understanding:
    • MMLU (Massive Multitask Language Understanding) [23]: A benchmark covering 57 subjects across STEM, humanities, social sciences, and more, testing a wide range of knowledge and problem-solving abilities.
    • GPQA (Graduate-level Google-Proof Q&A) [39]: A challenging question-answering benchmark designed to require deep expertise and complex reasoning, often "Google-proof" (not easily answered by simple search).
  • Mathematical Reasoning:
    • GSM8K [10]: A dataset of 8.5K diverse grade school math word problems.
    • Math500: A dataset for mathematical problem solving.
  • Code Generation:
    • HumanEval [4]: A dataset of programming problems to test code synthesis, where models generate Python functions based on docstrings.

    • MBPP (Mostly Basic Python Problems) [8]: Another dataset for code generation, focusing on basic programming tasks.

      For LLaDA, evaluations were zero-shot (no examples provided in the prompt), with answers generated and matched against ground truth. For Dream, few-shot in-context learning (ICL) was employed, where a few examples are provided in the prompt to guide the model.

5.2. Evaluation Metrics

The performance of dKV-Cache was primarily measured using two metrics:

  1. Accuracy: This metric quantifies the correctness of the model's generated answers, typically the percentage of correctly solved problems or correctly generated tokens/sequences. The conceptual definition is simply the proportion of correct predictions out of the total number of predictions.

  2. Token/s (Tokens per Second): This metric measures the inference speed, indicating how many tokens the model can process per second. It directly reflects the efficiency gains of dKV-Cache.

  3. Cache Ratio: This metric quantifies the proportion of Key-Value (KV) pairs that are reused from the cache during the denoising process. It provides insight into how much of the computation is being saved by dKV-Cache.

    The formula for Cache Ratio is: $ \mathrm{Cache\ Ratio} = \frac{1}{T} \sum_{i=1}^{T} \frac{\left| \mathcal{T}_i^{\mathrm{cache}} \right|}{\left| \mathcal{T}_i \right|} $ Where:

    • $T$: The total number of denoising steps in the generation process.
    • $i$: The index of the current denoising step.
    • $\mathcal{T}_i$: The set of all tokens processed at step $i$, typically the entire sequence, so $|\mathcal{T}_i| = L$.
    • $\mathcal{T}_i^{\mathrm{cache}}$: The subset of tokens whose Key-Value (KV) pairs are reused from the cache at step $i$; in the notation of Equation 4, this is $\mathcal{T} \setminus \mathcal{M}_{i-1}$, the tokens whose KV states are carried over from the previous step rather than recomputed. A toy computation of this metric follows below.
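For intuition, a tiny worked computation of this metric (toy numbers, not results from the paper):

```python
def cache_ratio(reused_per_step, seq_len):
    # Average, over denoising steps, of the fraction of positions whose K/V
    # were reused from the cache rather than recomputed.
    return sum(r / seq_len for r in reused_per_step) / len(reused_per_step)

# Toy run: L = 8 and four steps reusing 0, 2, 4, and 6 cached positions.
print(cache_ratio([0, 2, 4, 6], seq_len=8))   # -> 0.375
```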

5.3. Baselines

The proposed dKV-Cache methods were compared against baselines that accelerate generation by reducing the number of denoising steps:

  • Base (random / confidence): The original LLaDA or Dream model running with its full number of denoising steps (e.g., $T = L$). It serves as the unaccelerated accuracy baseline. The remasking strategy (how tokens are selected to be unmasked) can be random or based on confidence scores.

  • Few-Steps (random): Reduces the total number of denoising steps to a fraction of the original (e.g., 50% or 62.5%) using random remasking. This is a common way to speed up DLMs but often comes with a performance trade-off.

  • Half-Steps (confidence): Similar to Few-Steps, but specifically uses 50% of the original denoising steps with a confidence-based remasking strategy.

    The number of steps for Few-Steps and Half-Steps baselines was selected such that their sampling speed was comparable to or slower than dKV-Cache, allowing for a fair comparison to show that dKV-Cache can achieve better performance at similar (or faster) speeds.

C. Evaluation Details (from Appendix)

C.1 For LLaDA

The authors re-implemented the evaluation for LLaDA using the specified datasets. Instead of comparing log probabilities for multiple-choice questions, they generated the full answer text and extracted the final answer for matching. This method can lead to lower reported results for MMLU and GPQA compared to previous works, as the model might fail to generate the answer in the expected format or might not generate an answer at all.

The configurations for LLaDA experiments are detailed in Table 5:

The following are the results from Table 5 of the original paper:

| Benchmark | Base (random / confidence) Configuration | Few-Steps (random) Steps | dKV-Cache-Greedy (random) Cache Interval | dKV-Cache-Greedy (random) Window Size | Half-Steps (confidence) Steps | dKV-Cache-Decode (confidence) Cache Interval |
| --- | --- | --- | --- | --- | --- | --- |
| MMLU | L=32, T=32, B=16 | T=20 | 2 | 4 | T=16 | 8 |
| GSM8K | L=256, T=256, B=32 | T=160 | 2 | 4 | T=128 | 8 |
| Math500 | L=256, T=256, B=64 | T=160 | 2 | 4 | T=128 | 8 |
| GPQA | L=128, T=128, B=64 | T=80 | 2 | 4 | T=64 | 8 |
| HumanEval | L=512, T=512, B=32 | T=320 | 2 | 4 | T=256 | 8 |
| MBPP | L=512, T=512, B=32 | T=320 | 2 | 2 | T=256 | 8 |
  • L: The decoding length (maximum sequence length).
  • T: The total number of denoising steps.
  • B: The batch size.
  • Cache Interval: How often the cache is refreshed, for both dKV-Cache-Greedy and dKV-Cache-Decode.
  • Window Size: For dKV-Cache-Greedy, the size of the local window used for recomputation.

C.2 For Dream

For Dream, the authors followed the original evaluation pipeline. They adopted MMLU and GPQA to generate answers instead of comparing probabilities, similar to LLaDA. The evaluation also utilized few-shot in-context learning with a specified number of shots.

6. Results & Analysis

6.1. Core Results Analysis

The experiments aim to demonstrate the performance trade-offs and speedups introduced by dKV-Cache across various benchmarks and configurations.

LLaDA Results

The following are the results from Table 1 of the original paper:

Each cell reports Acc. (%) / Token/s (speedup):

| Benchmark | Base (random) | Few-Steps (random) | dKV-Cache-Greedy (random) | dKV-Cache-Greedy w/ Window (random) | Base (confidence) | Half-Steps (confidence) | dKV-Cache-Decode (confidence) |
| --- | --- | --- | --- | --- | --- | --- | --- |
| MMLU | 51.79 / 30.20 | 43.19 / 47.49 (1.67×) | 45.77 / 50.56 (1.57×) | 47.70 (4) / 45.72 (1.51×) | 51.11 / 28.27 | 51.11 / 55.80 (1.97×) | 51.00 / 66.52 (2.35×) |
| GSM8K | 72.25 / 15.16 | 65.58 / 24.08 (1.59×) | 67.93 / 25.47 (1.68×) | 68.23 (4) / 24.76 (1.63×) | 77.56 / 14.31 | 77.91 / 28.71 (2.00×) | 78.85 / 27.50 (1.92×) |
| Math500 | 27.4 / 12.00 | 21.8 / 19.36 (1.61×) | 26.0 / 20.34 (1.70×) | 27.0 (4) / 19.86 (1.66×) | 36.6 / 11.53 | 34.2 / 23.10 (2.00×) | 36.8 / 24.46 (2.12×) |
| GPQA | 27.46 / 11.40 | 24.78 / 18.59 (1.63×) | 26.79 / 19.27 (1.69×) | 28.35 (4) / 18.26 (1.60×) | 30.80 / 11.86 | 27.68 / 23.88 (2.01×) | 28.13 / 28.73 (2.42×) |
| HumanEval | 19.88 / 7.50 | 15.61 / 12.50 (1.67×) | 15.13 / 12.31 (1.64×) | 15.37 (4) / 12.13 (1.62×) | 39.63 / 7.08 | 33.54 / 14.18 (2.00×) | 46.34 / 13.76 (1.83×) |
| MBPP | 21.4 / 7.51 | 15.6 / 12.97 (1.73×) | 17.8 / 12.55 (1.67×) | 20.4 (2) / 12.44 (1.66×) | 40.4 / 7.50 | 33.8 / 15.01 (2.00×) | 40.4 / 13.93 (1.86×) |

(Note: The number in parentheses next to the dKV-Cache-Greedy w/ Window accuracy (e.g., 4) represents the window size.)

  • dKV-Cache-Greedy (random remasking): This variant generally outperforms few-step baselines in terms of accuracy across most benchmarks (MMLU, GSM8K, Math500, GPQA, MBPP). For instance, on MMLU, Base (random) gets 51.79%, Few-Steps gets 43.19%, while dKV-Cache-Greedy achieves 45.77% and dKV-Cache-Greedy w/ Window achieves 47.70%. This shows that with a local window, dKV-Cache-Greedy can significantly close the gap to the base model while still providing speedups (1.51× to 1.70×). HumanEval is an exception where dKV-Cache-Greedy shows lower accuracy than Few-Steps, possibly because the aggressive caching is less suitable for code generation's stricter requirements.
  • dKV-Cache-Decode (confidence remasking): This variant offers the best trade-off between speed and accuracy. It achieves near-lossless performance, often very close to the Base (confidence) model, and in some cases even outperforms it (e.g., GSM8K: 78.85% vs. 77.56%; HumanEval: 46.34% vs. 39.63%). It consistently provides speedups ranging from 1.83× to 2.35×. This variant demonstrates that KV-Cache can be applied to DLMs without sacrificing performance, and sometimes even improving it.

Dream Results

The following are the results from Table 2 of the original paper:

Each cell reports Acc. (%) / Token/s (speedup):

| Setting | Dream-7B | Half-Steps | dKV-Cache-Decode | dKV-Cache-Prefill | dKV-Cache-PD |
| --- | --- | --- | --- | --- | --- |
| GSM8K (8-shot), L=256, T=256 | 76.88 / 15.1 (1.00×) | 68.08 / 30.3 (2.00×) | 76.57 / 31.6 (2.09×) | 75.66 / 53.6 (3.55×) | 74.07 / 50.2 (3.32×) |
| GSM8K (8-shot), L=256, T=128 | 68.81 / 30.3 (2.01×) | 46.63 / 60.5 (4.01×) | 65.35 / 62.31 (4.13×) | 65.96 / 107.4 (7.11×) | 63.31 / 99.5 (6.6×) |
| MBPP (3-shot), L=512, T=512 | 55.8 / 5.4 (1.00×) | 45.2 / 10.8 (2.00×) | 53.4 / 10.4 (1.93×) | 55.2 / 13.6 (2.52×) | 51.0 / 14.5 (2.69×) |
| MBPP (3-shot), L=512, T=256 | 45.2 / 10.8 (2.00×) | 26.2 / 21.5 (3.98×) | 43.4 / 20.6 (3.81×) | 41.8 / 27.1 (5.02×) | 42.6 / 28.9 (5.35×) |
| HumanEval (0-shot), L=512, T=512 | 57.93 / 10.3 (1.00×) | 37.20 / 20.5 (1.99×) | 57.32 / 15.5 (1.50×) | 56.10 / 14.4 (1.40×) | 59.76 / 17.4 (1.69×) |
| HumanEval (0-shot), L=512, T=256 | 37.20 / 20.5 (1.99×) | 18.29 / 40.9 (3.97×) | 31.70 / 31.1 (3.02×) | 33.54 / 28.7 (2.79×) | 31.70 / 34.8 (3.38×) |
  • Impact of few-shot ICL and Prefill: For Dream, which uses few-shot in-context learning (ICL), long input contexts are common and incur significant overhead. In these scenarios, dKV-Cache-Prefill (which caches prefill tokens without refreshing) shows substantial speed improvements.
  • Significant Speedups from dKV-Cache-Prefill/PD: On GSM8K with L=256, T=128, dKV-Cache-Prefill achieves a 7.11× speedup (107.4 Token/s) with 65.96% accuracy, significantly better than Half-Steps (4.01×, 46.63%). dKV-Cache-PD also shows strong speedups (e.g., 6.6× on GSM8K).
  • Robustness When Reducing Diffusion Steps: When the number of diffusion steps T is reduced (e.g., from 256 to 128 on GSM8K with L=256), dKV-Cache still maintains strong accuracy while the speedup over the full-step baseline grows drastically, whereas the Half-Steps baseline suffers a larger accuracy drop for the same speedup.
  • Improved Performance: On GSM8K (L=256, T=128), dKV-Cache-Decode (65.35%) and dKV-Cache-Prefill (65.96%) significantly improve over Half-Steps (46.63%) while delivering comparable or higher speedups. dKV-Cache-PD also performs well.
  • Code Generation: On HumanEval (L=512, T=512), dKV-Cache-PD even slightly surpasses the Dream-7B base model in accuracy (59.76% vs. 57.93%) while providing a 1.69× speedup.

Long Prefill Settings

The following are the results from Table 3 of the original paper:

Each cell reports Acc. (%) / Token/s (speedup):

| Setting | Dream-7B | Half-Steps | dKV-Cache-Decode | dKV-Cache-Prefill |
| --- | --- | --- | --- | --- |
| MMLU (5-shot), L=8, T=8 | 72.19 / 9.1 (1.00×) | 72.21 / 18.1 (1.99×) | 71.74 / 25.2 (2.77×) | 71.76 / 57.6 (6.33×) |
| MMLU (5-shot), L=8, T=4 | 72.21 / 18.1 (1.99×) | 71.63 / 36.1 (3.97×) | 71.69 / 49.2 (5.41×) | 71.71 / 67.3 (7.40×) |
| GPQA (5-shot), L=128, T=128 | 36.83 / 7.4 (1.00×) | 35.49 / 14.7 (1.99×) | 35.71 / 18.2 (2.46×) | 35.27 / 75.40 (10.19×) |
| GPQA (5-shot), L=128, T=64 | 35.49 / 14.7 (1.99×) | 35.27 / 29.4 (3.97×) | 34.15 / 36.8 (4.97×) | 35.27 / 139.9 (18.91×) |
  • Extreme Speedup for Long Prefill: On GPQA with L=128, dKV-Cache-Prefill achieves a 10.19× speedup at T=128 steps and an even more impressive 18.91× speedup at T=64 steps, while maintaining accuracy comparable to the baseline and to dKV-Cache-Decode. This highlights the value of dKV-Cache-Prefill for tasks with extensive prefill contexts due to ICL.

6.2. Ablation Studies / Parameter Analysis

The One-step Delay in dKV-Cache

As can be seen from the results in Figure 3 of the original paper:

Figure 3: Effect of one-step delayed caching on GSM8K accuracy. The plot compares accuracy as the cache ratio increases, with and without the delay; with delayed caching, accuracy remains high and stable.

Figure 3 illustrates the critical role of the one-step delay in the dKV-Cache mechanism.

  • Without Delay: When caching is applied immediately without delay, performance (Accuracy %) remains acceptable only at low cache ratios. As the cache ratio increases (meaning more KV pairs are aggressively cached), performance degrades rapidly, eventually collapsing to near-zero accuracy. This confirms the initial observation that token representations undergo significant changes at their decoding step, making immediate caching detrimental.
  • With Delay: Introducing a one-step delay stabilizes generation quality. The model maintains nearly lossless performance even under high cache ratios. This validates the hypothesis that delaying caching until the token's representation has settled (i.e., after its most significant change point) is crucial for effective and accurate caching in DLMs.

Performance on Different Decoding Length, Denoising Steps, and Refreshing Steps

As can be seen from the results in Figure 4 of the original paper:

Figure 4: dKV-Cache-Decode (left) and dKV-Cache-Greedy (right) on GSM8K under different settings: decoding length L, sampling steps S, refresh interval, and window size. Performance (accuracy %) is plotted against speed (Tokens/s).

Figure 4 presents a detailed analysis of dKV-Cache-Decode (left) and dKV-Cache-Greedy (right) on GSM8K under various configurations, plotting Performance (Accuracy %) against Speed (Tokens/s).

  • Decoding Robustness: The performance of dKV-Cache is largely insensitive to the decoding length (L) and the number of sampling steps (S), indicating strong robustness across generation lengths and step counts.
  • Enhanced Long-form Generation: In tasks requiring longer outputs (e.g., L=512), dKV-Cache-Decode not only accelerates generation but also outperforms the baseline in accuracy: GSM8K improves from 80.97% to 83.13%, and HumanEval from 39.63% to 46.34%. This suggests that the bidirectional attention in existing DLMs may under-utilize contextual information for long sequences, and that dKV-Cache helps alleviate this.
  • Effectiveness with Few Refreshing Steps: dKV-Cache-Decode maintains good performance even with infrequent refreshes (e.g., every 16 steps), showing only small degradation. This indicates efficiency in terms of cache maintenance.
  • Effectiveness of Local Windows: For dKV-Cache-Greedy, incorporating a local window (w in the figure) notably enhances performance with minimal additional computational cost, bringing its accuracy closer to dKV-Cache-Decode or the base model while offering higher speeds (a rough sketch of the refresh-and-window rule follows this list).
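The rough sketch below shows one way the refresh interval and local window could be combined: a full recomputation every few steps, plus always-fresh key/value states in a small window around the most recently decoded position. The function, parameter names, and defaults are ours, intended only to make the two knobs from Figure 4 concrete.

```python
import torch

def needs_recompute(positions, last_decoded_pos, step, refresh_every=4, window=2):
    """Mark positions whose K/V should be recomputed this step (illustrative rule)."""
    if step % refresh_every == 0:
        # Periodic full refresh: every position gets fresh K/V.
        return torch.ones_like(positions, dtype=torch.bool)
    # Otherwise, only a local window around the newest decoded token is refreshed;
    # all other positions reuse cached K/V (dKV-Cache-Greedy-style behaviour).
    return (positions - last_decoded_pos).abs() <= window

positions = torch.arange(8)
print(needs_recompute(positions, last_decoded_pos=3, step=5))
# tensor([False,  True,  True,  True,  True,  True, False, False])
```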

Memory and Speed Analysis

As can be seen from the results in Figure 5 of the original paper:

Figure 5: Speed and memory for dKV-Cache-Decode and dKV-Cache-Greedy. The numbers (2 and 8) indicate that the cache is refreshed every 2 or 8 steps; throughput and peak GPU memory are compared at prefill/decoding lengths of 128 and 512.

Figure 5 analyzes the speed and memory footprint of dKV-Cache-Decode and dKV-Cache-Greedy across different decoding and prefill lengths (128 and 512).

  • Substantial Inference Acceleration: Both dKV-Cache variants achieve significant inference acceleration, ranging from 1.75× to 3.3×.
  • Modest Memory Increase: This acceleration comes with only a modest increase in memory usage.
  • dKV-Cache-Greedy Potential: dKV-Cache-Greedy generally demonstrates greater potential for accelerating inference compared to dKV-Cache-Decode. Although dKV-Cache-Greedy might see performance degradation if the refresh interval is too large (e.g., greater than 2 in experiments), it consistently achieves higher speedups than dKV-Cache-Decode under the same refresh interval.

Impact of Batch Size on Speed

As can be seen from the results in Figure 7 of the original paper:

Figure 7: Impact of batch size on decoding speed, evaluated on LLaDA with a single NVIDIA H20 and a prefill length fixed at 100 tokens. dKV-Cache-Decode is compared against the original pipeline at sequence lengths L from 16 to 128.

Figure 7 shows the impact of batch size on decoding speed for LLaDA with a fixed prefill length of 100 tokens.

  • Memory-Bound at Small Batch Sizes: The dKV-Cache inference pipeline involves indexing, gather, and scatter operations that generate a stream of small, non-contiguous memory accesses. At batch size one, these uncoalesced reads make inference memory-bound, leaving the GPU's compute units idle while waiting for data; in such scenarios, caching can even underperform the unaccelerated baseline (a toy illustration of this access pattern follows the list).
  • Scaling with Batch Size: As the batch size increases, this problem is alleviated, and dKV-Cache delivers far greater, more stable speed-up ratios over the baseline. This implies that dKV-Cache benefits significantly from batch processing, a common optimization in deep learning.
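The toy snippet below reproduces the access pattern in question with arbitrary shapes: freshly computed key/value states are scattered into non-contiguous cache slots, and the states needed for attention are gathered back out. It is not the paper's kernel, just an illustration of why these small, uncoalesced accesses dominate at batch size one and are amortized as the batch grows.

```python
import torch

B, L, H, D = 8, 128, 32, 128                 # batch, sequence length, heads, head dim
kv_cache = torch.randn(B, L, H, D)           # cached key (or value) states
fresh_idx = torch.tensor([5, 17, 42])        # positions whose K/V were just recomputed
fresh_kv = torch.randn(B, fresh_idx.numel(), H, D)

# Scatter: write fresh states into non-contiguous positions of the cache.
kv_cache[:, fresh_idx] = fresh_kv

# Gather: read the cached states needed for this step's attention.
kept_idx = torch.tensor([0, 1, 2, 5, 17, 42])
kv_step = kv_cache.index_select(1, kept_idx)
print(kv_step.shape)                         # torch.Size([8, 6, 32, 128])
```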

Comparison of Dream Caching Strategies

The following are the results from Table 4 of the original paper:

|  | Un-Shift | Right-Shift | Un&Right-Shift |
| --- | --- | --- | --- |
| MMLU | 71.78 | 64.60 | 71.73 |
| GSM8K | 76.34 | 32.68 | 77.71 |

This table compares different caching strategies for Dream-Base-7B, which is adapted from autoregressive models and has a shifted output position.

  • Un&Right-Shift Best Overall: The Un&Right-Shift strategy (which conditions caching on both input stability and decoding completion) yields the best GSM8K accuracy (77.71%) and is essentially on par with Un-Shift on MMLU (71.73% vs. 71.78%).
  • Right-Shift Harmful: The Right-Shift variant significantly harms model performance (e.g., 64.60% on MMLU, 32.68% on GSM8K). This indicates that simply shifting the cached positions to align with a future output position is not effective.
  • Un-Shift Trade-off: The Un-Shift strategy, though not always the most accurate for Dream, was used in the main experiments because Un&Right-Shift is incompatible with the concat_reorder mechanism. This highlights a practical trade-off between architectural compatibility and optimal performance for specific model adaptations (a sketch of the three conditions follows this list).
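The sketch below encodes one plausible reading of the three conditions for Dream's shifted outputs: Un-Shift caches position i once token i is decoded, Right-Shift once token i+1 is decoded, and Un&Right-Shift requires both. The exact semantics are our interpretation of the paper's description, not its released code.

```python
import torch

def cacheable_positions(decoded: torch.Tensor, strategy: str = "un_and_right") -> torch.Tensor:
    """decoded[i] is True once token i has been unmasked. Because Dream reuses AR
    weights, the state at position i also serves the shifted prediction of token i+1."""
    un = decoded                               # input at position i is stable
    right = torch.roll(decoded, shifts=-1)     # token i+1 has been decoded
    right[-1] = decoded[-1]                    # last position has no token i+1
    if strategy == "un":
        return un
    if strategy == "right":
        return right
    return un & right                          # Un&Right-Shift: both conditions hold

decoded = torch.tensor([True, True, False, True])
print(cacheable_positions(decoded))            # tensor([ True, False, False,  True])
```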

6.3. Case Study

The case studies illustrate the qualitative output of dKV-Cache variants compared to the base models.

The following are the results from Table 6 of the original paper:

| Model | Generated Text (L=128, T=128, B=64) |
| --- | --- |
| LLaDA-8B-Instruct (low-confidence) | 1. Calculate the distance Lily runs in the first 4 hours: - Speed: 12 kilometers per hour - Time: 4 hours - Distance: 12 km/h * 4 h = 48 km. 2. Calculate the distance Lily runs in the next 4 hours: - Speed: 6 kilometers per hour - Time: 4 hours - Distance: 6 km/h * 4 h = 24 km. 3. Add the distances together: - Total distance: \boxed{72} km |
| dKV-Cache-Decode, Refresh = 4 | 1. Calculate the distance Lily runs in the first 4 hours: - Distance = Speed × Time = 12 km/h × 4 h = 48 km. 2. Calculate the distance Lily runs in the next 4 hours: - Distance = Speed × Time = 6 km/h × 4 h = 24 km. 3. Add the distances from both periods to find the total distance Lily runs in 8 hours: - Total Distance = 48 km + 24 km = 72 km. \boxed{72 km} |
| dKV-Cache-Decode, Refresh = 16 | 1. Calculate the distance Lily runs in the first 4 hours: 12 km/hour * 4 hours = 48 km. 2. Calculate the distance Lily runs in the next 4 hours: 6 km/hour * 4 hours = 24 km. 3. Add the distances from the first 4 hours and the next 4 hours to find the total distance she runs: 48 km + 24 km = 72 km. Therefore, Lily can run \boxed{72} kilometers in 8 hours. |
| dKV-Cache-Greedy, Refresh = 2 | To determine how many kilometers Lily can run in 8 hours, we need to break down her running schedule and sum the distances. First, calculate the distance she runs at 12 kilometers per hour for 4 hours: 12 * 4 = 48. After 4 hours, she runs 6 kilometers per hour for the remaining 4 hours: 6 * 4 = 24. Now, add the distances: 48 + 24 = 72. \boxed{72} |
| dKV-Cache-Greedy, Refresh = 4 (random) | For the first 4 hours Lily runs at 12 kilometers per hour, so she runs 12 * 4 = 48 kilometers. 8 hours - 4 hours are left after the first 4 (8 - 4 = 4 hours). For the next 4 hours, she runs at 6 kilometers per hour, so she runs 6 * 4 = 24 kilometers. Adding the two distances together, Lily can run 48 + 24 = 72 kilometers in 8 hours. Answer: \boxed{72} |

This case study for LLaDA-8B-Instruct on a math word problem shows that all dKV-Cache variants (Decode with refresh 4 and 16, Greedy with refresh 2 and 4) successfully solve the problem and produce the correct answer (72). The generated text for each variant is coherent, follows a logical step-by-step reasoning process, and is qualitatively very similar to the Base LLaDA model. This reinforces the quantitative findings that dKV-Cache maintains performance quality.

The following are the results from Table 7 of the original paper:

| Model | Generated Text (L=256, T=256) |
| --- | --- |
| Dream-Instruct-7B (Entropy) | Once upon a time, there was a young man named Joey who lived in a small town in the middle of nowhere. He was a kind and gentle soul who loved to sing and play the guitar. One day, he met a beautiful girl named Rachel who was also a talented singer and musician. They fell in love and decided to get married. However, there was a problem. Joey's parents were against the marriage because they didn't know Rachel well enough. They thought that she was not a good match for their son. Rachel, on the other hand, was not worried about Joey's parents' opinion. She knew that she loved Joey and that he was the one for her. She decided to try and win over Joey's parents by showing them how much she cared about him. She started by cooking for them and inviting them to her house for dinner. She also took them on a trip to the mountains where they could enjoy the scenery and the fresh air. After a while, Joey's parents started to see Rachel in a different light. They realized that she was a good person and that she loved Joey very much. They decided to give their blessing to the marriage. Finally, Joey and Rachel got married and lived happily ever after. |
| dKV-Cache-Decode, Refresh = 16 | Once upon a time, in a small town called New York, there lived a young man named Joey and a young woman named Rachel. They were both very happy with their lives, but they were also very different from each other. Joey was a very outgoing person who loved to be around people, while Rachel was a very introverted person who loved to be alone. One day, Joey and Rachel met at a coffee shop and fell in love. They spent hours talking and laughing together, and they realized that they were perfect for each other. They decided to be together and build their lives together. As the years went by, Joey and Rachel grew closer and closer. They shared their dreams and their fears, and they supported each other through thick and thin. They were a perfect match, and they knew that they were meant to be together. Finally, Joey and Rachel got married. They had a beautiful wedding with all their friends and family in attendance. They were so happy to be together, and they knew that they would always be together. |
| dKV-Cache-Prefill | Once upon a time, in a small town called New York, there lived a young man named Joey. Joey was a kind and caring person who always tried to help others. He had a beautiful girlfriend named Rachel, who was also kind and caring. They were in love with each other and wanted to spend their lives together. One day, Joey and Rachel decided to get married. They planned their wedding for a long time, and everyone in the town was excited to attend. On the day of the wedding, Joey and Rachel exchanged vows and promised to love each other forever. After the wedding, Joey and Rachel moved into a beautiful house in the countryside. They were happy and content with their lives, and they had two children, a boy and a girl. The children grew up to be kind and caring people, just like their parents. As the years went by, Joey and Rachel continued to love each other deeply. They had many ups and downs, but they always found a way to overcome their challenges and stay together. Finally, after many years of marriage, Joey and Rachel decided to retire and live in a small cottage in the countryside. They were happy and content with their lives, and they knew that they had found the love of their lives in each other. |

This case study for Dream-Instruct-7B on a story generation task with a specific ending shows that dKV-Cache-Decode and dKV-Cache-Prefill both successfully generate coherent stories that adhere to the prompt's ending. The narrative styles and details vary slightly between models, but all maintain a reasonable quality similar to the Base Dream model, further underscoring the lossless or near-lossless performance of dKV-Cache variants.

The following are the results from Table 8 of the original paper:

| Model | Generated Text (L=128, T=128) |
| --- | --- |
| Dream-Instruct-7B (Entropy) | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily runs 6 kilometers per hour, so she runs a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. #### 72 |
| dKV-Cache-Decode, Refresh = 4 | In the first 4 hours, Lily can run 12 kilometers per hour, so she can run a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily can run 6 kilometers per hour, so she can run a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. #### 72 |
| dKV-Cache-Decode, Refresh = 16 | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 * 4 = 48 kilometers. In the next 4 hours, Lily runs 6 kilometers per hour, so she runs a total of 6 * 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = 72 kilometers. The answer is: 72 |
| dKV-Cache-Prefill, No refresh | In the first 4 hours, Lily runs 12 kilometers per hour, so she runs a total of 12 × 4 = 48 kilometers. In the next 4 hours, she runs 6 kilometers per hour, so she runs a total of 6 × 4 = 24 kilometers. Therefore, in 8 hours, Lily can run a total of 48 + 24 = \boxed{72} kilometers. The answer is: 72 |

This case study for Dream-Instruct-7B on the same math word problem as Table 6 further confirms the robustness of dKV-Cache variants. All models, including dKV-Cache-Decode (with refresh 4 and 16) and dKV-Cache-Prefill, correctly solve the problem and provide the answer 72 with clear, logical steps. This demonstrates that for both LLaDA and Dream, dKV-Cache preserves the model's problem-solving capabilities.

6.4. Overall Analysis

The results consistently show that dKV-Cache effectively tackles the slow inference problem in DLMs.

  • The one-step delayed caching is critical for maintaining accuracy, especially at high cache ratios.

  • dKV-Cache-Decode offers the best balance, providing substantial speedups (2-3× for LLaDA, up to 5× for Dream in some cases) with near-lossless performance, sometimes even improving accuracy, particularly for long sequences.

  • dKV-Cache-Prefill excels in scenarios with long input contexts (common in few-shot ICL), delivering speedups of up to 10× or even 18×.

  • dKV-Cache-Greedy achieves higher raw token/s but with a slight accuracy trade-off, making it suitable for applications where speed is paramount and some performance degradation is acceptable. Its performance can be significantly improved with a local window.

  • The system benefits from batch processing, mitigating the overhead of gather/scatter operations.

    Overall, dKV-Cache successfully pushes the Pareto front forward, allowing DLMs to achieve significantly faster inference while maintaining or even improving their generation quality.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper successfully demonstrates the feasibility of incorporating a caching mechanism, dKV-Cache, into Diffusion Language Models (DLMs), a task previously considered challenging due to their non-autoregressive architecture and bidirectional attention. The core innovation lies in leveraging the observed dynamics of token representations during the diffusion process, specifically that representations stabilize after a token is decoded. This insight led to the design of a delayed and conditioned caching strategy.

The authors proposed two main variants: dKV-Cache-Decode, which offers near-lossless accuracy and even improves performance on long sequences, and dKV-Cache-Greedy, which provides higher speed-ups by making more aggressive caching decisions. Extensive experiments on 7B-scale LLaDA and Dream models across diverse benchmarks (language understanding, math, code generation) confirm that dKV-Cache achieves 2-10× inference speedups. Crucially, this is achieved in a training-free manner, making it immediately applicable to existing DLMs, significantly narrowing the performance gap with autoregressive language models.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation:

  • Focus on Algorithmic Design in Isolation: The current work primarily focuses on the algorithmic design of dKV-Cache. While effective, Diffusion Language Models still have substantial room for improvement at the system level.

    They suggest future research directions:

  • Integrating Algorithmic Innovations with System-Level Optimization: This includes areas such as memory management, parallelism, and hardware-aware execution. Combining dKV-Cache with these system-level improvements could unlock further efficiency gains and performance enhancements for DLMs.

7.3. Personal Insights & Critique

This paper presents a highly impactful contribution to the field of Diffusion Language Models. The core insight that token representations become stable after decoding, despite bidirectional attention, is elegant and effectively addresses a fundamental challenge. It transforms what was perceived as an inherent limitation of non-autoregressive models into an opportunity for optimization.

The introduction of dKV-Cache-Decode and dKV-Cache-Greedy provides a practical toolkit for researchers and practitioners, allowing them to choose between near-lossless performance or maximal speedup based on their specific needs. The training-free nature of the method is a significant advantage, ensuring immediate applicability without the costly overhead of retraining large models. The observation that dKV-Cache-Decode can improve performance on long sequences is particularly intriguing. It suggests that bidirectional attention, while powerful, might sometimes be inefficient or redundant in how it processes context, and judicious caching can implicitly guide the model towards more effective contextual aggregation.

One potential area for deeper exploration could be a more fine-grained analysis of why dKV-Cache-Decode improves performance on long sequences. Is it acting as a form of regularization? Is it forcing the model to focus on more stable, "decoded" information, implicitly reducing noise or conflicting signals from still-fluctuating masked tokens? Further qualitative and quantitative analysis, perhaps comparing attention maps or gradient flows with and without dKV-Cache, could shed light on this phenomenon.

From a practical standpoint, the reliance on batch size to overcome the memory-bound nature of gather/scatter operations (Figure 7) is a common challenge in non-contiguous memory access patterns. This highlights that while the algorithmic solution is sound, its practical deployment will require careful system-level engineering to achieve optimal gains, as the authors themselves note. Future work could investigate specialized hardware kernels or memory allocation strategies to minimize these overheads even at small batch sizes, which are often crucial for low-latency inference in real-time applications.

Finally, the paper's clear exposition, rigorous experimental validation, and thoughtful analysis of different caching strategies for a complex model architecture like DLMs make it a valuable piece of research that paves the way for more efficient and practical Diffusion Language Models. The methods and insights could potentially be transferred to other non-autoregressive generative models that share similar representation dynamics during their iterative generation processes.
