dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
TL;DR Summary
dLLM-Cache is a training-free adaptive caching method that accelerates diffusion large language models by reusing intermediate computations, achieving up to 9.1× speedup on LLaDA 8B and Dream 7B without degrading output quality.
Abstract
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1 x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching
1.2. Authors
- Zhiyuan Liu (Shanghai Jiao Tong University, EPIC Lab)
- Yicun Yang (Shanghai Jiao Tong University, EPIC Lab)
- Yaojie Zhang (Shanghai Jiao Tong University, University of Electronic Science and Technology of China)
- Junjie Chen (EPIC Lab, Shanghai Jiao Tong University)
- Chang Zou (Shanghai Jiao Tong University, EPIC Lab, University of Electronic Science and Technology of China)
- Qingyan Wei (EPIC Lab, Shanghai Jiao Tong University)
- Shaobo Wang (Shanghai Jiao Tong University, EPIC Lab)
- Linfeng Zhang (Shanghai Jiao Tong University, EPIC Lab)
The authors are primarily affiliated with the School of Artificial Intelligence and EPIC Lab at Shanghai Jiao Tong University, with some also associated with the University of Electronic Science and Technology of China. Their research backgrounds appear to be in artificial intelligence, particularly large language models and diffusion models, focusing on efficiency and acceleration.
1.3. Journal/Conference
This paper was published on arXiv, a preprint server, on 2025-05-17. arXiv is widely used by researchers to disseminate their work prior to, or in parallel with, peer review and formal publication. While not a peer-reviewed journal or conference in itself, it allows for rapid sharing and feedback within the scientific community.
1.4. Publication Year
2025
1.5. Abstract
Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1× speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2506.06295 (Preprint)
- PDF Link: https://arxiv.org/pdf/2506.06295v1.pdf (Preprint)
2. Executive Summary
2.1. Background & Motivation
The field of Large Language Models (LLMs) has been primarily dominated by autoregressive models (ARMs) like GPT and LLaMA, which generate text sequentially. Recently, a new and promising paradigm, diffusion-based Large Language Models (dLLMs), has emerged. These models generate text by iteratively denoising masked segments and have shown significant potential, including overcoming certain limitations of ARMs, such as the "reversal curse" (where models trained on "A is B" may fail to infer "B is A").
However, the practical adoption of dLLMs is severely hampered by their high inference latency. Unlike ARMs that generate one token at a time, dLLMs require an iterative, multi-step denoising process for the entire sequence, leading to substantial computational demands. Existing acceleration techniques face specific challenges:
- Incompatibility with ARM acceleration: Traditional ARM acceleration methods, such as Key-Value (KV) caching, rely on the causal attention mechanism inherent in autoregressive generation. dLLMs, conversely, use a bidirectional attention mechanism, making KV caching directly inapplicable.
- Limitations of general diffusion model acceleration: While acceleration techniques exist for diffusion models (DMs), primarily in computer vision, they typically apply uniform policies to all tokens. This fails to account for the unique characteristics of language generation, particularly the presence of a static prompt and a dynamically evolving, yet largely stable, response.

The core problem the paper addresses is the high inference latency of dLLMs, specifically the computational redundancies that existing acceleration methods overlook. The paper's innovative idea stems from a key observation: during dLLM inference, the input prompt tokens remain static, while the response tokens, though dynamic, exhibit high stability across adjacent denoising steps, meaning most of their intermediate representations do not change significantly. This sparse dynamism suggests that much of the computation is redundant and can be cached.
2.2. Main Contributions / Findings
The paper introduces dLLM-Cache, a novel, training-free adaptive caching framework specifically designed to accelerate dLLM inference by exploiting the identified computational redundancies.
The primary contributions and findings are:
- Identification and Characterization of Dual Redundancies: The authors identify two distinct computational redundancies in dLLM inference:
- Quasi-static prompt redundancy: The input prompt remains constant, but its intermediate features are recomputed in every denoising step.
- Dynamic but sparse response redundancy: The generated response features evolve, but most tokens' representations remain stable across adjacent steps.
- Proposal of dLLM-Cache Framework: A training-free, model-agnostic adaptive caching framework that differentiates its caching strategy based on the nature of prompt and response tokens.
  - It employs long-interval prompt caching for the static prompt features, updating them only at sparse intervals.
  - It uses adaptive short-interval response caching for response tokens, with full refreshes at more frequent short intervals and partial updates in between.
- Introduction of V-verify Mechanism: A lightweight yet effective mechanism that guides the adaptive partial updates for response tokens. It uses the cosine similarity of Value (V) vectors between current and cached states to identify the most dynamic tokens (those with the lowest similarity) that require recomputation. These Value changes are empirically correlated with changes in the subsequent Attention and Feedforward Network (FFN) outputs.
- Significant Inference Acceleration with Lossless Quality: Extensive experiments on representative dLLMs, LLaDA 8B and Dream 7B, demonstrate that:
  - dLLM-Cache achieves up to 9.1× speedup over standard inference.
  - This acceleration is achieved without compromising output quality (lossless impact on response quality in most cases).
  - The method significantly reduces inference latency, bringing dLLM performance close to that of ARMs under many settings.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
Large Language Models (LLMs)
Large Language Models (LLMs) are a class of artificial intelligence models that are trained on vast amounts of text data to understand, generate, and process human language. They typically use deep learning architectures, most commonly Transformers, and are capable of performing a wide range of natural language processing (NLP) tasks, such as translation, summarization, question answering, and text generation. They are considered "foundational" because their broad capabilities can be adapted to many specific applications.
Autoregressive Models (ARMs)
Autoregressive Models (ARMs) are the dominant paradigm for LLMs. These models generate text sequentially, token by token. In an ARM, each new token is predicted based on all previously generated tokens (and the initial input prompt). This sequential generation process is characterized by causal attention, meaning that when computing the representation for a given token, the model can only attend to tokens that appear before it in the sequence. Examples include OpenAI's GPT series, Google's PaLM, and Meta's LLaMA.
Diffusion Models (DMs)
Diffusion Models (DMs) are a class of generative models that have achieved state-of-the-art results, particularly in image generation. The core idea behind DMs is to learn to reverse a gradual data corruption process.
- Forward (Diffusion) Process: In this process, noise is progressively added to an input data sample (e.g., an image) over several steps, transforming it into a noisy, unrecognizable state (often pure Gaussian noise).
- Reverse (Denoising) Process: The model is trained to predict and remove this noise at each step, effectively learning to transform pure noise back into a coherent data sample. By iteratively applying this learned denoising step, the model can generate new data samples from random noise.
Diffusion-based Large Language Models (dLLMs)
Diffusion-based Large Language Models (dLLMs) adapt the diffusion model paradigm to discrete data, specifically text. Unlike image generation where data is continuous, text consists of discrete tokens. A common approach for dLLMs is Masked Diffusion Models (MDMs), which operate by iteratively denoising masked segments of a text sequence.
- Initialization: The process starts with a fully masked sequence (e.g., a sequence of [MASK] tokens).
- Iterative Denoising: In each denoising step, the model predicts the original, clean tokens for some or all of the masked positions based on the surrounding context (which can include both prompt and partially denoised response tokens). This prediction is then used to update the masked sequence, progressively revealing the complete text.
- Bidirectional Attention: A key characteristic of dLLMs, particularly those built on Transformer architectures, is their use of bidirectional attention. Unlike the causal attention in ARMs, bidirectional attention allows each token to attend to all other tokens in the input sequence (both preceding and succeeding) at every step. This provides rich context for denoising and can be advantageous for tasks requiring global context understanding.
Attention Mechanism
The Attention mechanism is a core component of Transformer neural network architectures, which are foundational to modern LLMs. It allows the model to weigh the importance of different parts of the input sequence when processing a particular token.
The fundamental components of attention are Query (Q), Key (K), and Value (V) vectors. For each token in a sequence, the model generates Q, K, and V vectors.
The self-attention mechanism calculates an attention score by taking the dot product of the Query vector of the current token with the Key vectors of all tokens in the sequence. These scores are then scaled and passed through a softmax function to produce attention weights, indicating how much attention each token should pay to other tokens. Finally, these weights are used to compute a weighted sum of the Value vectors, resulting in an Attention Output vector for the current token.
The mathematical formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
- $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices containing the query, key, and value vectors for all tokens in the sequence. Each row corresponds to a token, and the columns represent feature dimensions.
- $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax function into regions with very small gradients.
- $QK^T$ is the product of the query matrix and the transpose of the key matrix, which computes the similarity between each query and all keys.
- $\mathrm{softmax}$ normalizes the attention scores into probabilities, ensuring they sum to 1.
- The final multiplication by $V$ computes the weighted sum of value vectors, producing the attention output.
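To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the tensor shapes and variable names are illustrative and not taken from the paper's code.

```python
# Minimal sketch of scaled dot-product attention, matching the formula above.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    # Q, K, V: [num_tokens, d_k]
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # similarity of each query with all keys
    weights = F.softmax(scores, dim=-1)             # rows sum to 1
    return weights @ V                              # weighted sum of Value vectors

Q = torch.randn(5, 64)
K = torch.randn(5, 64)
V = torch.randn(5, 64)
out = scaled_dot_product_attention(Q, K, V)         # shape [5, 64]
```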
Key-Value (KV) Caching (for ARMs)
Key-Value (KV) caching is an optimization technique used to accelerate autoregressive models (ARMs) during inference. Since ARMs generate text token by token, the Key (K) and Value (V) vectors for previously generated tokens remain constant and are reused for subsequent token generation steps. Instead of recomputing these K and V vectors for all preceding tokens at each step, they are computed once and stored in a cache. This significantly reduces computation, as the attention mechanism only needs to compute K and V for the new token and attend to the cached K and V of previous tokens.
However, KV caching is incompatible with dLLMs because of their bidirectional attention mechanism. In ARMs, new tokens only attend to past tokens (causal attention). In dLLMs, tokens can attend to all other tokens (prompt and response) in the current intermediate state, and this state changes iteratively. This means that the K and V vectors for tokens in the response are not static and must be recomputed or re-evaluated at each denoising step, making direct KV caching inefficient.
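For intuition, here is a toy sketch of the KV-cache pattern in an autoregressive decode loop (single head, no batching, hypothetical helper names). The key property is that cached K/V entries never change once written, which is exactly what bidirectional dLLM attention breaks.

```python
# Illustrative sketch of KV caching in an ARM decode loop (not the paper's code).
import torch

k_cache, v_cache = [], []           # grows by one entry per generated token

def decode_step(q_new, k_new, v_new):
    # Only the new token's K/V are computed; past K/V come straight from the cache.
    k_cache.append(k_new)
    v_cache.append(v_new)
    K = torch.stack(k_cache)        # [t, d]
    V = torch.stack(v_cache)        # [t, d]
    scores = (q_new @ K.T) / K.size(-1) ** 0.5
    return torch.softmax(scores, dim=-1) @ V

out = decode_step(torch.randn(64), torch.randn(64), torch.randn(64))
```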
Cosine Similarity
Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space. It is defined as the cosine of the angle between the vectors, which ranges from -1 (opposite) to 1 (identical), with 0 indicating orthogonality (no correlation). In the context of this paper, it's used to quantify how much a token's feature vector (like its Value (V) vector) has changed between adjacent denoising steps. A higher cosine similarity (closer to 1) means the vectors are more similar, indicating less change.
The formula for cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is: $ \mathrm{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\Vert \mathbf{A} \Vert \, \Vert \mathbf{B} \Vert} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}} $ Where:
- $\mathbf{A}$ and $\mathbf{B}$ are the two vectors being compared.
- $\mathbf{A} \cdot \mathbf{B}$ denotes the dot product of the vectors.
- $\Vert \mathbf{A} \Vert$ and $\Vert \mathbf{B} \Vert$ denote the Euclidean norm (magnitude) of each vector.
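A tiny PyTorch equivalent of the formula (names are illustrative):

```python
# Cosine similarity between two feature vectors, matching the formula above.
import torch

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    return (a @ b) / (a.norm() * b.norm())

a, b = torch.randn(128), torch.randn(128)
# Agrees with the built-in: torch.nn.functional.cosine_similarity(a, b, dim=0)
print(cosine_similarity(a, b))
```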
FLOPs and TPS
- FLOPs (Floating Point Operations): A measure of computational cost. It quantifies the number of floating-point arithmetic operations (additions, multiplications, divisions, etc.) performed by a model. Lower FLOPs generally indicate higher computational efficiency. In the paper, TFLOPs refers to tera-FLOPs ($10^{12}$ FLOPs).
- TPS (Tokens Per Second): A measure of inference speed or throughput. It indicates how many tokens the model can process or generate per second. Higher TPS means faster inference.
3.2. Previous Works
Autoregressive Models (ARMs)
The paper acknowledges the dominance of ARMs, citing foundational works such as:
- Radford (2018): Introduced Generative Pre-training (GPT), a Transformer-based model pre-trained on large text corpora.
- Radford et al. (2019): GPT-2, demonstrating the power of scaling pre-training data and model size.
- Brown (2020): GPT-3, which highlighted the emergent capabilities of very large models, especially in few-shot learning.
- OpenAI (2022): ChatGPT, showcasing optimized language models for dialogue.

These models established the sequential, causal attention paradigm for text generation.
Diffusion Models for Language
The adaptation of DMs to discrete text data has been a significant research area. Key works include:
- Austin et al. (2021): Introduced Structured Denoising Diffusion Models in discrete state-spaces, a foundational work for Masked Diffusion Models (MDMs).
- Lou et al. (2023) and Shi et al. (2024): Further explored discrete diffusion for language modeling, often by iteratively predicting masked tokens.
- Ou et al. (2024): Investigated the theoretical underpinnings of absorbing discrete diffusion for clean data conditional distributions.
Diffusion Large Language Models (dLLMs)
Recent scaling efforts have led to powerful dLLMs:
- LLaDA [Nie et al., 2025]: An 8-billion-parameter MDM trained from scratch using a bidirectional Transformer. It aims to be a direct alternative to ARMs.
- Dream [Ye et al., 2025]: A 7-billion-parameter dLLM that initializes from pre-trained ARM weights, suggesting a different training strategy.

Both LLaDA and Dream demonstrate performance comparable to similarly sized ARMs such as LLaMA3 8B, and both leverage bidirectional architectures, potentially offering advantages such as circumventing the "reversal curse."
Key-Value Caching for ARMs
- Pope et al. (2023): This work describes how Key-Value caching accelerates ARMs by reusing previously computed attention states. As discussed in Foundational Concepts, this method is not directly applicable to dLLMs due to their bidirectional attention.
Feature Caching for DMs (primarily for vision)
Acceleration techniques for diffusion models (DMs) have largely focused on vision tasks, leveraging temporal redundancy in intermediate features across denoising steps.
- DCache [Ma et al., 2024] and FastDPM [Li et al., 2023]: Early approaches focusing on U-Net architectures common in image DMs.
- FORA [Selvaraju et al., 2024]: Applies static-interval caching for attention and MLP layers within Diffusion Transformers.
- ToCa series [Zou et al., 2024a,b]: Aims to reduce information loss by dynamically updating cached features.
- TaylorSeer [Liu et al., 2025]: Introduces a "cache-then-forecast" paradigm, predicting features using Taylor series expansion rather than direct reuse.
3.3. Technological Evolution
The evolution of Large Language Models (LLMs) has seen a progression from early statistical methods to neural networks, culminating in the dominance of Transformer-based Autoregressive Models (ARMs). ARMs, while powerful, inherently generate text sequentially, which can be a bottleneck for certain tasks and prone to issues like the "reversal curse."
In parallel, Diffusion Models (DMs) revolutionized generative AI in continuous domains like image synthesis. The challenge was to adapt DMs to the discrete nature of text. Masked Diffusion Models (MDMs) emerged as a solution, offering an iterative, non-autoregressive generation process that utilizes bidirectional attention. This brought forth a new class of dLLMs that could potentially overcome ARM limitations.
However, this architectural shift introduced a new performance bottleneck: the iterative, multi-step denoising process of dLLMs leads to high inference latency. Existing acceleration techniques, whether Key-Value caching for ARMs (incompatible due to bidirectional attention) or general feature caching for vision DMs (ineffective due to uniform policies ignoring language-specific structure), fail to adequately address this.
This paper's work, dLLM-Cache, fits into this technological timeline by specifically addressing the inference efficiency gap for dLLMs. It acknowledges the unique challenges posed by dLLMs' bidirectional attention and iterative nature, drawing inspiration from existing caching ideas but fundamentally redesigning them to exploit the particular characteristics of text generation (static prompt, partially dynamic response).
3.4. Differentiation Analysis
dLLM-Cache differentiates itself from prior work through its specific adaptation to the unique characteristics of diffusion-based Large Language Models (dLLMs).
- Compared to ARM Key-Value (KV) Caching:
  - Prior Work: ARM KV caching (e.g., Pope et al., 2023) is highly effective for autoregressive models (ARMs) because they operate with causal attention. When generating a new token, the model only needs to attend to previously generated tokens, whose Key (K) and Value (V) states remain constant and can therefore be cached and reused.
  - dLLM-Cache Differentiation: dLLM-Cache explicitly notes that ARM KV caching is incompatible with dLLMs, because dLLMs employ bidirectional attention, where each token can attend to all other tokens in the sequence (both preceding and succeeding) at every denoising step. As the text evolves during the iterative denoising process, the K and V representations for response tokens also change, making simple reuse of static KV pairs impossible. dLLM-Cache addresses this incompatibility with a dynamic caching strategy.
- Compared to General Diffusion Model Feature Caching (primarily for vision):
  - Prior Work: Existing feature caching methods for DMs (e.g., DCache, FastDPM, FORA, ToCa, TaylorSeer) were primarily developed for continuous data such as images and often apply a uniform caching policy across all features, reusing intermediate features across denoising steps due to temporal redundancy.
  - dLLM-Cache Differentiation: dLLM-Cache identifies a crucial gap: these uniform caching strategies are ineffective for dLLMs because they overlook the distinct structure of language tasks. The core innovation of dLLM-Cache is its recognition and exploitation of dual computational redundancies unique to dLLMs:
    - Static Prompt: The input prompt remains constant throughout the inference process, making its features highly stable and suitable for long-interval caching.
    - Partially Dynamic Response: The generated response evolves, but most tokens remain stable across adjacent denoising steps. This sparse dynamism allows for an adaptive, partial update strategy rather than recomputing everything or uniformly caching.
  - Adaptive vs. Uniform: Unlike uniform policies, dLLM-Cache employs a differentiated adaptive caching strategy. It combines long-interval prompt caching with adaptive short-interval response caching guided by feature similarity (V-verify). This allows efficient reuse where features are stable and targeted recomputation where changes are significant, striking a better balance between speed and quality for language generation.
4. Methodology
4.1. Principles
The core idea behind dLLM-Cache is to leverage the inherent structural and temporal redundancies observed during the inference process of diffusion-based Large Language Models (dLLMs). The method operates on two fundamental principles:
- Exploiting Static Prompt Redundancy: The input prompt remains constant throughout the iterative denoising process. This implies that the internal representations and computations related to the prompt tokens are highly stable across denoising steps, and therefore do not need to be repeated at every step.
- Leveraging Partially Dynamic Response Redundancy: While the generated response evolves over the denoising steps, this evolution is often sparse: only a small fraction of the response tokens undergo significant changes between adjacent denoising steps. Recomputing the features for all response tokens at every step is therefore inefficient. Instead, an adaptive approach that selectively updates only the most dynamic tokens can yield substantial computational savings.

Based on these principles, dLLM-Cache proposes a training-free adaptive caching mechanism. It aims to efficiently reuse intermediate computations, specifically the Key (K), Value (V), Attention Output (AttnOut), and Feedforward Network Output (FFNOut) features, without compromising the quality of the generated text. The framework differentiates between prompt and response tokens and employs a V-verify mechanism based on cosine similarity to identify and selectively update dynamic response tokens.
4.2. Core Methodology In-depth
4.2.1. Inference Process of dLLMs (Preliminary)
The paper explains the dLLM inference process using LLaDA as a representative example. dLLMs generate text through a non-autoregressive, iterative diffusion-based process that starts from a fully masked sequence and progressively denoises it into the target output.
Let $\mathcal{T}$ be the token vocabulary, which includes a special [MASK] token.
Given an input prompt $\mathbf{c}$, the model aims to generate a response of a desired length $L$.
The generation occurs over $K$ discrete denoising steps, indexed from $K$ down to 0.
Let $\mathbf{y}^{(k)}$ denote the intermediate state of the response sequence at step $k$. The process begins with a fully masked sequence:
$ \mathbf{y}^{(K)} = (\underbrace{[\mathtt{MASK}], \dots, [\mathtt{MASK}]}_{L\ \mathrm{times}}) $
At each step $k$, a mask predictor $f_{\phi}$ estimates the distribution over the clean sequence. It takes the prompt $\mathbf{c}$ and the current intermediate response state $\mathbf{y}^{(k)}$ as input, parameterized by the model weights $\phi$:
$ P_{\phi}(\mathbf{y}|\mathbf{c}, \mathbf{y}^{(k)}) = f_{\phi}(\mathbf{c}, \mathbf{y}^{(k)}; \phi) $
Where:
- $P_{\phi}(\mathbf{y}|\mathbf{c}, \mathbf{y}^{(k)})$ represents the probability distribution over all possible clean sequences $\mathbf{y}$, conditioned on the prompt $\mathbf{c}$ and the current (partially masked/denoised) response $\mathbf{y}^{(k)}$, and parameterized by $\phi$.
- $f_{\phi}$ is the mask predictor network, typically a Transformer-based architecture, which learns to predict the original tokens from the noisy/masked input.

The most likely clean sequence is then estimated by maximizing this probability: $ \hat{\mathbf{y}}^{(0)} = \underset{\mathbf{y} \in \mathcal{T}^L}{\arg\max}\, P_{\phi}(\mathbf{y}|\mathbf{c}, \mathbf{y}^{(k)}) $ Where $\mathcal{T}^L$ denotes the set of all possible sequences of length $L$ composed of tokens from the vocabulary $\mathcal{T}$.
A transition function $S$ is then applied to yield the next intermediate state $\mathbf{y}^{(k-1)}$. This function selectively updates tokens in $\mathbf{y}^{(k)}$ based on the predicted clean sequence $\hat{\mathbf{y}}^{(0)}$, the original prompt $\mathbf{c}$, and the current step index $k$:
$ \mathbf{y}^{(k-1)} = S(\hat{\mathbf{y}}^{(0)}, \mathbf{y}^{(k)}, \mathbf{c}, k) $
The specific strategy for $S$ can vary, potentially involving confidence-based remasking (masking low-confidence tokens again) or semi-autoregressive block updates (updating segments of tokens). This iterative process, while enabling flexible generation, incurs high latency due to the repeated computation of features across many steps, especially as the number of denoising steps $K$ increases.
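To make the iterative procedure concrete, the following is a minimal sketch of such a masked-diffusion decoding loop; `mask_predictor` and `transition` are hypothetical stand-ins for the mask predictor $f_{\phi}$ and the transition function $S$, and `MASK_ID` is an assumed token id.

```python
# Minimal sketch of the masked-diffusion decoding loop described above.
# `mask_predictor` (f_phi) and `transition` (S) are hypothetical stand-ins.
import torch

MASK_ID = 0  # assumed id of the [MASK] token

def generate(prompt_ids: torch.Tensor, gen_len: int, K: int, mask_predictor, transition):
    # The response starts fully masked.
    y = torch.full((gen_len,), MASK_ID, dtype=torch.long)
    for k in range(K, 0, -1):
        logits = mask_predictor(prompt_ids, y)   # [gen_len, vocab]: distribution over clean tokens
        y_hat = logits.argmax(dim=-1)            # most likely clean sequence
        y = transition(y_hat, y, prompt_ids, k)  # e.g., remask low-confidence positions
    return y
```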
4.2.2. dLLM-Cache Overview
dLLM-Cache is a training-free caching framework designed to improve the inference efficiency of dLLMs by selectively updating intermediate computations. The framework makes a distinction between prompt and response features due to their differing dynamics.
For each Transformer layer within the dLLM, the framework stores four types of intermediate features:
- Key (K) vectors for the layer.
- Value (V) vectors for the layer.
- Attention output (AttnOut) for the layer.
- Feedforward Network output (FFNOut) for the layer.

These features are stored in two separate caches:
- Prompt Cache: Stores features related to the prompt tokens.
- Response Cache: Stores features related to the response tokens.

The caching behavior is governed by three hyperparameters:
- Prompt refresh interval ($K_p$): Determines how frequently the prompt features are fully recomputed. A larger $K_p$ means less frequent recomputation.
- Response refresh interval ($K_r$): Determines how frequently the response features are fully recomputed. A smaller $K_r$ implies more frequent full refreshes.
- Adaptive update ratio: Used during adaptive partial updates for response tokens. It specifies the fraction of response tokens (those deemed most dynamic) whose features will be recomputed.

The general workflow for dLLM-Cache during the iterative steps (from $K$ down to 1) is:
- Prompt Management: If the current step is a multiple of $K_p$, the prompt features in the Prompt Cache are fully recomputed. Otherwise, they are reused from the cache.
- Response Management: If the current step is a multiple of $K_r$, the response features in the Response Cache are fully recomputed. Otherwise, an adaptive partial update strategy is employed, where only a fraction (given by the adaptive update ratio) of the most dynamic response tokens are recomputed.
- Computation: The layer's computation proceeds using either cached or freshly updated features.

This differentiated and adaptive strategy allows dLLM-Cache to significantly reduce computational load by avoiding redundant computations while maintaining the model's generation quality.
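A minimal sketch of this per-step routing decision is given below, assuming the convention that a feature group is fully refreshed whenever the step index is a multiple of its interval; the function and case names are illustrative, not the paper's implementation.

```python
# Sketch of the per-step cache-routing decision described above.
def choose_strategy(k: int, K_p: int, K_r: int) -> str:
    refresh_prompt = (k % K_p == 0)
    refresh_response = (k % K_r == 0)
    if refresh_prompt and refresh_response:
        return "FullRefresh"          # Algorithm 3
    if refresh_prompt:
        return "RefreshPromptOnly"    # Algorithm 4
    if refresh_response:
        return "RefreshResponseOnly"  # Algorithm 5
    return "AdaptiveUpdate"           # Algorithm 6

# Example: with K_p = 100 and K_r = 8, most steps fall into AdaptiveUpdate.
print(choose_strategy(96, 100, 8))    # RefreshResponseOnly
print(choose_strategy(97, 100, 8))    # AdaptiveUpdate
```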
4.2.3. Prompt Cache Management
The input prompt remains constant throughout the entire dLLM inference process across all denoising steps. Consequently, its corresponding intermediate features (Key, Value, Attention Output, FFN Output) exhibit very high temporal stability. dLLM-Cache capitalizes on this by employing a Long-Interval Prompt Caching strategy.
Initialization (at step $k = K$):
- At the very first denoising step, all prompt-token features (K, V, AttnOut, FFNOut) for each Transformer layer are computed from scratch.
- These computed prompt features are then stored in the Prompt Cache.

Subsequent Denoising Steps:
- Full Recomputation: If the current step is a multiple of the prompt refresh interval $K_p$, the prompt features in the Prompt Cache are fully recomputed. This periodic recomputation ensures that the prompt representations are conditioned on the latest available (and potentially updated) Key and Value features from the response tokens (which may have been refreshed in the same step or via adaptive updates). This matters because prompt tokens still attend to response tokens, so when response tokens change, the prompt's attention output can also change slightly.
- Cache Reuse: If the current step is not a multiple of $K_p$, the prompt features for the current layer are directly retrieved and reused from the Prompt Cache; no recomputation is performed for these features.

This strategy dramatically reduces the computational load associated with processing the static prompt, especially when $K_p$ is set to a large value, allowing for significant speedups.
4.2.4. Response Cache with Adaptive Updates
In contrast to the static prompt, the response features dynamically evolve across denoising steps. However, as noted by the authors, this evolution is often sparse, meaning most tokens change gradually. dLLM-Cache manages the Response Cache using a two-pronged approach:
- Full Refresh:
  - When the current step is a multiple of the response refresh interval $K_r$, or at the very beginning of inference, all response features are recomputed from scratch.
  - These newly computed features (K, V, AttnOut, FFNOut for the response tokens) are then used to fully update the Response Cache. This ensures that the response features are periodically synchronized with the current intermediate state, preventing significant information loss or drift that might accumulate from prolonged reuse of stale cached features.
- Adaptive Partial Update:
  - When the current step is not a multiple of $K_r$, instead of a full recomputation, an adaptive partial update strategy is employed. This mechanism selectively updates only the subset of response tokens whose features are deemed to have changed significantly (a minimal sketch of the selection rule follows this list). The process involves:
    - Identify Change (V-verify mechanism): For each response token $j$ at the current layer $l$, dLLM-Cache computes its Value (V) vector based on the current input. It then calculates the cosine similarity between this newly computed V vector ($\mathbf{v}_{r,j}^{(l)}$) and its cached counterpart ($\tilde{\mathbf{v}}_{r,j}^{(l)}$): $ s_j = \frac{(\mathbf{v}_{r,j}^{(l)})^\top \tilde{\mathbf{v}}_{r,j}^{(l)}}{\Vert \mathbf{v}_{r,j}^{(l)} \Vert \, \Vert \tilde{\mathbf{v}}_{r,j}^{(l)} \Vert} $ Where:
      - $\mathbf{v}_{r,j}^{(l)}$ is the Value vector for response token $j$ at layer $l$, computed at the current denoising step.
      - $\tilde{\mathbf{v}}_{r,j}^{(l)}$ is the Value vector for response token $j$ at layer $l$, retrieved from the Response Cache.
      - The superscript $\top$ denotes the transpose.
      - $\Vert \cdot \Vert$ denotes the Euclidean norm (magnitude) of a vector.
    - The paper highlights that changes in a token's Value vector (measured by cosine similarity) strongly correlate with changes in its subsequent Attention Output (AttnOut) and Feedforward Network Output (FFNOut), as shown in Figure 2. This makes V-verify an efficient proxy for identifying tokens whose downstream features have changed.
    - Select Tokens: The framework then selects a fixed number of tokens for recomputation, determined by the adaptive update ratio and the response length. Specifically, it identifies the tokens with the lowest cosine similarity $s_j$; these are considered the most dynamic and form the update set.
    - Recompute & Reuse: Only the features (K, V, AttnOut, FFNOut) of the tokens in the update set are recomputed for the current layer. For all other response tokens, the features are directly reused from the Response Cache.
    - Update Cache: The recomputed features for the selected tokens are then used to update their corresponding entries in the Response Cache. The features of reused tokens remain as they were.
  - This adaptive update mechanism is crucial for balancing computational efficiency with generation quality. By focusing recomputation only on the most dynamic parts of the response, dLLM-Cache minimizes unnecessary calculations while ensuring that critical information changes are captured.
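Below is a minimal sketch of the V-verify selection rule under these definitions: compute a per-token cosine similarity between new and cached Value vectors and pick the least-similar fraction for recomputation. The tensor shapes and the `rho` parameter name are assumptions for illustration.

```python
# Sketch of the V-verify selection rule: compare new vs. cached Value vectors per
# response token and mark the least-similar fraction for recomputation.
import torch

def select_tokens_to_update(v_new: torch.Tensor, v_cached: torch.Tensor, rho: float) -> torch.Tensor:
    # v_new, v_cached: [L_response, d]
    sims = torch.nn.functional.cosine_similarity(v_new, v_cached, dim=-1)  # s_j per token
    num_update = max(1, int(rho * v_new.size(0)))
    # lowest similarity == most changed tokens
    return torch.topk(sims, k=num_update, largest=False).indices

v_new = torch.randn(64, 128)
v_cached = v_new + 0.05 * torch.randn(64, 128)            # mostly stable features
idx = select_tokens_to_update(v_new, v_cached, rho=0.25)  # indices of the ~16 most dynamic tokens
```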
4.2.5. Algorithmic Workflow
The overall process of dLLM-Cache is outlined in Algorithm 1, which calls several helper functions (Algorithms 2-6) for specific caching scenarios.
Algorithm 1 dLLM-Cache: Main Inference Algorithm
This algorithm orchestrates the entire denoising process, managing prompt and response caches across all steps and layers.
Input:
- Prompt $\mathbf{c}$
- Initial masked sequence $\mathbf{y}^{(K)}$ (e.g., all [MASK] tokens)
- Total denoising steps $K$
- Prompt refresh interval $K_p$
- Response refresh interval $K_r$
- Adaptive update ratio

Output: Final predicted sequence
- Initialize Caches (at step $k = K$):
  - Call InitializeCache (Algorithm 2) with the prompt and the initial masked sequence to compute all initial features and populate the Prompt Cache and Response Cache.
- Generate Initial Prediction and Transition:
  - Generate an initial prediction using the model based on the fully computed features.
  - Apply the transition function to obtain the first intermediate response state.
- Iterative Denoising Loop (from $k = K-1$ down to 1):
  - For each denoising step $k$:
    - x_layer_in is initialized as the concatenation of the prompt and the current response, the input to the first layer.
    - Layer Loop: For each Transformer layer $l$:
      - Determine refresh conditions: refresh_prompt is true if $k$ is a multiple of $K_p$; refresh_response is true if $k$ is a multiple of $K_r$.
      - Cache Strategy Branching:
        - If refresh_prompt AND refresh_response: Call FullRefresh (Algorithm 3) to recompute and update all features for both prompt and response.
        - If refresh_prompt AND NOT refresh_response: Call RefreshPromptOnly (Algorithm 4) to recompute prompt features and reuse response features.
        - If NOT refresh_prompt AND refresh_response: Call RefreshResponseOnly (Algorithm 5) to reuse prompt features and recompute response features.
        - If NOT refresh_prompt AND NOT refresh_response: Call AdaptiveUpdate (Algorithm 6) to reuse prompt features and perform adaptive partial updates for response features.
      - Update x_layer_in for the next layer with the output of the current layer.
    - Generate Prediction and Transition (after all layers for step $k$):
      - Generate the prediction using the final layer output.
      - Apply the transition function to obtain $\mathbf{y}^{(k-1)}$.
- Return: Final predicted sequence at the end of the last step.
Algorithm 2 InitializeCache
This algorithm sets up the initial Prompt Cache and Response Cache at the beginning of the inference process (step $k = K$).
Input:
- Prompt $\mathbf{c}$
- Initial masked sequence $\mathbf{y}^{(K)}$
- Transformer network with $L$ layers

Output: Initialized caches
- Cache Structure Definition:
  - For each layer $l$, initialize empty dictionaries for:
    - Prompt key-value cache
    - Prompt attention output cache
    - Prompt FFN output cache
    - Response key-value cache
    - Response attention output cache
    - Response FFN output cache
- Initial Caching (Step $k = K$):
  - Concatenate the prompt and the initial response; this is the input to the first layer.
  - Layer Loop: For each layer $l$:
    - Attention Block:
      - Apply LayerNorm to the layer input.
      - Project the normalized input to Query, Key, and Value vectors.
      - Split K and V into prompt and response parts, and store them in the respective KV caches.
      - Compute Attention over the full sequence.
      - Split AttnOut into prompt and response parts, and store them in the respective attention caches.
      - Compute the post-attention residual.
    - FFN Block:
      - Apply LayerNorm to the post-attention state.
      - Compute the FFN output.
      - Split FFNOut into prompt and response parts, and store them in the respective FFN caches.
      - Compute the final residual.
    - Use the layer output as the input to the next layer.
- Return the populated caches.
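As a rough illustration of the cache layout that InitializeCache populates (four feature types per layer, kept separately for prompt and response), here is a minimal data-structure sketch; the class and field names are assumptions, not the paper's implementation.

```python
# Minimal sketch of the per-layer cache layout described above (illustrative names).
from dataclasses import dataclass, field
from typing import Optional, List
import torch

@dataclass
class LayerCache:
    key: Optional[torch.Tensor] = None        # K for prompt or response tokens
    value: Optional[torch.Tensor] = None      # V
    attn_out: Optional[torch.Tensor] = None   # attention output
    ffn_out: Optional[torch.Tensor] = None    # feed-forward output

@dataclass
class DLLMCache:
    prompt: List[LayerCache] = field(default_factory=list)    # one entry per layer
    response: List[LayerCache] = field(default_factory=list)

def init_cache(num_layers: int) -> DLLMCache:
    return DLLMCache(prompt=[LayerCache() for _ in range(num_layers)],
                     response=[LayerCache() for _ in range(num_layers)])
```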
Algorithm 3 dLLM-Cache: Case 1 - Full Refresh
This algorithm handles the scenario where both the prompt and the response features need to be fully refreshed, i.e., when the current step is a multiple of both $K_p$ and $K_r$.
Input:
- Layer input (the output of layer $l-1$, or the embedded sequence for the first layer)
- Layer index $l$
- The Prompt Cache and Response Cache

Output: Layer output, updated caches
- Attention Block:
  - Compute the LayerNorm of the layer input.
  - Project it to Q, K, and V.
  - Split K and V into prompt and response parts, and update the prompt and response KV caches.
  - Compute the combined attention over the full sequence.
  - Split AttnOut into prompt and response parts, and update the prompt and response attention caches.
  - Compute the post-attention residual.
- FFN Block:
  - Compute the LayerNorm of the post-attention state.
  - Compute the combined FFN output.
  - Split FFNOut into prompt and response parts, and update the prompt and response FFN caches.
  - Compute the final residual.
- Return the layer output and the updated caches.
Algorithm 4 dLLM-Cache: Case 2 - Refresh Prompt Only
This algorithm is executed when the prompt features need to be refreshed (the current step is a multiple of $K_p$) but the response features do not.
Input:
- Layer input
- Layer index $l$
- The Prompt Cache and Response Cache

Output: Layer output, updated caches
- Retrieve Response Features from Cache:
  - Get the cached response K and V.
  - Get the cached response attention output.
  - Get the cached response FFN output.
- Compute Fresh Prompt Features:
  - Extract the prompt part of the layer input.
  - Compute the LayerNorm of the prompt input.
  - Project it to prompt Q, K, and V.
  - Update the prompt KV cache.
- Compute Attention with Mixed Features:
  - Combine the newly computed prompt K, V with the cached response K, V.
  - Compute attention only for the prompt queries using the combined K, V.
  - Update the prompt attention cache.
  - Combine the prompt and (cached) response attention outputs.
  - Compute the post-attention residual.
- FFN Block:
  - Split the post-attention state into prompt and response parts.
  - Compute the LayerNorm of the prompt part.
  - Compute the FFN output for the prompt part.
  - Update the prompt FFN cache.
  - Combine the prompt and (cached) response FFN outputs.
  - Compute the final residual.
- Return the layer output and the updated caches.
Algorithm 5 dLLM-Cache: Case 3 - Refresh Response Only
This algorithm is executed when the response features need to be refreshed (the current step is a multiple of $K_r$) but the prompt features do not.
Input:
- Layer input
- Layer index $l$
- The Prompt Cache and Response Cache

Output: Layer output, updated caches
- Retrieve Prompt Features from Cache:
  - Get the cached prompt K and V.
  - Get the cached prompt attention output.
  - Get the cached prompt FFN output.
- Compute Fresh Response Features:
  - Extract the response part of the layer input.
  - Compute the LayerNorm of the response input.
  - Project it to response Q, K, and V.
  - Update the response KV cache.
- Compute Attention with Mixed Features:
  - Combine the cached prompt K, V with the newly computed response K, V.
  - Compute attention only for the response queries using the combined K, V.
  - Update the response attention cache.
  - Combine the (cached) prompt and response attention outputs.
  - Compute the post-attention residual.
- FFN Block:
  - Split the post-attention state into prompt and response parts.
  - Compute the LayerNorm of the response part.
  - Compute the FFN output for the response part.
  - Update the response FFN cache.
  - Combine the (cached) prompt and response FFN outputs.
  - Compute the final residual.
- Return the layer output and the updated caches.
Algorithm 6 dLLM-Cache: Case 4 - Adaptive Partial Update
This algorithm is triggered when neither the prompt nor the response features require a full refresh (the current step is a multiple of neither $K_p$ nor $K_r$). It performs the selective update for response tokens.
Input:
- Layer input
- Layer index $l$
- The Prompt Cache and Response Cache
- Adaptive update ratio

Output: Layer output, updated caches
- Retrieve Cached Prompt Features:
  - Get the cached prompt K and V.
  - Get the cached prompt attention output.
  - Get the cached prompt FFN output.
- Conditional Adaptive Update (if the adaptive update ratio is greater than 0):
  - Compute Current Response Value Projections:
    - Extract the response part of the layer input.
    - Compute the LayerNorm of the response input.
    - Compute new Value vectors for all response tokens. (Note: Queries and Keys are not recomputed here for all response tokens, only for the tokens selected later.)
  - Retrieve Cached Response Features:
    - Get the cached response K and V (specifically, the cached Value vectors for the similarity comparison).
  - Compute Similarity to Identify Tokens Needing Update (V-verify):
    - For each response token index $j$, compute the cosine similarity $s_j$ between the newly computed Value vector and the cached Value vector for that token.
    - Identify the indices of the tokens with the lowest $s_j$ (i.e., the most dissimilar, and therefore most changed, tokens).
  - Selective Computation for Selected Tokens:
    - Gather the corresponding LayerNorm inputs for the selected tokens.
    - Compute the Query and Key vectors for these selected tokens.
  - Update KV Cache with New Values:
    - Use ScatterUpdate to write the new Key vectors into the cached response Keys at the selected indices.
    - The entire set of newly computed response Value vectors is used as the updated Value cache (all Value vectors are effectively recomputed for the similarity check, but only the selected subset drives full recomputation).
  - Compute Attention Only for Selected Tokens:
    - Combine the cached prompt K, V with the updated response K, V.
    - Compute attention only for the selected response queries.
  - Update Response Attention Cache:
    - Retrieve the existing response attention outputs.
    - Use ScatterUpdate to write the newly computed attention outputs at the selected indices, and update the cache with the result.
    - Combine the prompt and (updated) response attention outputs.
    - Compute the post-attention residual.
  - FFN Block (Adaptive):
    - Split the post-attention state into prompt and response parts.
    - Gather the selected tokens from the response part.
    - Compute the LayerNorm for the selected response tokens.
    - Compute the FFN output for the selected response tokens.
  - Update Response FFN Cache:
    - Retrieve the existing response FFN outputs.
    - Use ScatterUpdate to write the newly computed FFN outputs at the selected indices, and update the cache with the result.
    - Combine the prompt and (updated) response FFN outputs.
- Else (adaptive update ratio = 0, pure cache retrieval):
  - No adaptive updates are performed.
  - Retrieve the cached response attention output and combine it with the prompt attention output.
  - Compute the post-attention residual.
  - Retrieve the cached response FFN output and combine it with the prompt FFN output.
- Final Output:
  - Compute the final residual.
  - Return the layer output and the updated caches.
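The gather/scatter pattern at the heart of this adaptive partial update can be sketched as follows; `recompute_fn` is a hypothetical stand-in for the attention or FFN computation restricted to the selected tokens.

```python
# Sketch of the gather/scatter pattern: recompute features only for the selected
# token indices and write them back into the cached tensor.
import torch

def partial_update(cached: torch.Tensor, layer_inputs: torch.Tensor,
                   idx: torch.Tensor, recompute_fn) -> torch.Tensor:
    # cached:       [L_response, d]  stale per-token features (e.g., AttnOut or FFNOut)
    # layer_inputs: [L_response, d]  current inputs for the response tokens
    # idx:          [m]              indices chosen by V-verify
    selected = layer_inputs.index_select(0, idx)   # gather only the dynamic tokens
    fresh = recompute_fn(selected)                 # e.g., attention or FFN over m tokens
    updated = cached.clone()
    updated.index_copy_(0, idx, fresh)             # scatter the fresh features back
    return updated

# Example with a toy "recompute" function:
cached = torch.zeros(8, 4)
inputs = torch.arange(32, dtype=torch.float32).reshape(8, 4)
idx = torch.tensor([1, 5])
out = partial_update(cached, inputs, idx, lambda x: x * 2.0)
```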
4.2.6. Storage Overhead of Caching
The paper provides a proof for the storage overhead of dLLM-Cache in Appendix B.
Theorem: The storage overhead of caching in our method is $O(4 \times T \times d \times L)$, where $T$ is the number of tokens, $d$ is the embedding dimension, and $L$ is the number of layers.
Proof:
- Memory per Layer: The dLLM-Cache method stores four types of intermediate features per Transformer layer: Key (K), Value (V), Attention Output (AttnOut), and Feedforward Network Output (FFNOut).
  - Each of these features has a size proportional to the total number of tokens ($T$) multiplied by the embedding dimension ($d$), i.e., $T \times d$.
  - Therefore, the memory required for each layer is: $ M_{\mathrm{layer}} = 4 \times T \times d $ This accounts for the four distinct feature types stored for all tokens within a single layer.
- Total Memory for the Model: The entire Transformer model consists of $L$ layers. Since these features are cached for each layer, the total memory required for caching across all layers is the memory per layer multiplied by the number of layers: $ M_{\mathrm{total}} = L \times M_{\mathrm{layer}} = L \times 4 \times T \times d $
- Considering Data Precision: The paper states that bfloat16 precision is used, where each element (a single floating-point number in a vector) requires 2 bytes of memory. Incorporating this, the total memory in bytes is: $ M_{\mathrm{total}} = 2 \times L \times 4 \times T \times d \quad \mathrm{bytes} $
- Asymptotic Analysis: In asymptotic analysis (which describes the growth rate of a function as the input size grows), constant factors are ignored. Ignoring the constant factor of 2, the storage overhead therefore grows as $ O(T \times d \times 4 \times L) $. This completes the proof, showing that the memory overhead scales linearly with the number of tokens, the embedding dimension, and the number of layers.
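As a quick sanity check of this bound, the following worked example plugs in illustrative values; the token count, embedding dimension, and layer count here are assumptions, not figures from the paper.

```python
# Worked example of the storage bound above with illustrative values.
T = 1024               # total tokens (prompt + response), assumed
d = 4096               # embedding dimension, assumed
L = 32                 # number of Transformer layers, assumed
bytes_per_elem = 2     # bfloat16
features_per_layer = 4 # K, V, AttnOut, FFNOut

total_bytes = bytes_per_elem * features_per_layer * T * d * L
print(f"{total_bytes / 2**30:.2f} GiB")   # 1.00 GiB for these settings
```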
5. Experimental Setup
5.1. Datasets
The authors evaluated dLLM-Cache on two representative diffusion-based Large Language Models (dLLMs):
- LLaDA 8B [Nie et al., 2025]: An 8-billion parameter model.
- Dream 7B [Ye et al., 2025]: A 7-billion parameter model.
Both models were tested in their Base and Instruct variants. The Instruct variant is typically fine-tuned to follow instructions, often performing better on specific tasks.
The experiments were conducted across a variety of standard benchmarks, categorized into "Mathematics & Science", "General Tasks", and "Code" to assess performance across different domains. The following are the benchmarks mentioned in the paper:
Mathematics & Science:
- GSM8K [Cobbe et al., 2021]: A dataset of elementary school math word problems. It assesses a model's ability to reason and solve multi-step mathematical problems.
- GPQA: (No specific citation for GPQA is provided, but it typically refers to a General Purpose Question Answering benchmark, often requiring complex reasoning.)
- Math: (No specific citation, but likely refers to a general mathematics benchmark or a subset of a larger math dataset.)
General Tasks:
- MMLU-pro: (Likely a more challenging or domain-specific variant of MMLU.)
- MMLU [Hendrycks et al., 2021, not cited but standard]: Massive Multitask Language Understanding. A benchmark designed to measure knowledge and reasoning abilities across 57 subjects, including humanities, social sciences, STEM, and more.
- BBH [Srivastava et al., 2022, not cited but standard]: BIG-Bench Hard, a challenging collection of 23 tasks drawn from the Beyond the Imitation Game Benchmark (BIG-bench), designed to test the advanced reasoning capabilities of LLMs.
Code:
- MBPP [Austin et al., 2021]: Mostly Basic Python Problems. A dataset of Python programming problems with test cases, used to evaluate code generation capabilities.
- HumanEval [Chen et al., 2021, not cited but standard]: A benchmark for evaluating code synthesis models, consisting of programming problems with docstrings, function signatures, and test cases.
Additionally, the
LongBench-HotpotQA [Bai et al., 2023] task is mentioned in the discussion section to demonstrate effectiveness in long-prompt scenarios. HotpotQA is a multi-hop question answering dataset that requires reasoning over multiple documents to answer questions.
The models were configured according to their original inference settings. The specific parameters for different tasks are detailed in Table 4 of the appendix. All models used a low-confidence remasking strategy unless specified.
The following are the results from Table 4 of the original paper:
| Task | Steps | Block Len | Gen Len | Few-shot |
| --- | --- | --- | --- | --- |
| MMLU | 3 | 3 | 3 | 5 |
| MMLU-pro | 256 | 256 | 256 | 0 |
| Hellaswag | 3 | 3 | 3 | 0 |
| ARC-C | 512 | 512 | 512 | 0 |
| GSM8K | 256 | 8 | 256 | 4 |
| Math | 256 | 256 | 256 | 0 |
| GPQA | 128 | 64 | 128 | 5 |
| HumanEval | 512 | 32 | 512 | 0 |
| MBPP | 512 | 32 | 512 | 3 |
- Steps: The total number of
denoising steps used in the diffusion process for generation. More steps generally lead to better quality but higher latency.
- Gen Len: The target total length of the generated response (number of tokens).
- Few-shot: The number of example input-output pairs provided to the model during inference to guide its generation, common in
in-context learning for LLMs. 0 indicates zero-shot inference.
5.2. Evaluation Metrics
The performance of dLLM-Cache was evaluated using a combination of efficiency metrics and task-specific quality metrics.
-
Inference Efficiency Metrics:
- Tokens Per Second (TPS):
- Conceptual Definition: TPS measures the number of output tokens a model can generate or process per second. It is a direct indicator of inference speed or throughput. A higher TPS value implies faster generation and better efficiency.
- Mathematical Formula: Not explicitly provided in the paper, but generally calculated as: $ \mathrm{TPS} = \frac{\text{Total number of generated tokens}}{\text{Total inference time (seconds)}} $
- Symbol Explanation:
- : The sum of lengths of all responses generated in a given time frame.
- : The wall-clock time taken for the generation process, including all computational steps.
- Floating Point Operations (FLOPs):
- Conceptual Definition: FLOPs quantifies the total number of floating-point arithmetic operations (e.g., additions, multiplications, divisions) performed during the inference process. It serves as a hardware-agnostic measure of computational cost. Lower FLOPs indicate better computational efficiency. The paper uses
TFLOPs, which means $10^{12}$ FLOPs.
- Mathematical Formula: Not a single formula, but an aggregate count of all floating-point operations. For Transformer layers, the FLOPs calculation involves matrix multiplications and other operations; for a single Transformer layer, FLOPs scale roughly with the sequence length and the embedding dimension.
- Symbol Explanation:
FLOPs is a cumulative count, not derived from a simple formula with variables; it is typically estimated or measured by profiling.
- Conceptual Definition: FLOPs quantifies the total number of floating-point arithmetic operations (e.g., additions, multiplications, divisions) performed during the inference process. It serves as a hardware-agnostic measure of computational cost. Lower FLOPs indicate better computational efficiency. The paper uses
- Speed(TPS) / Speed(FLOPs): These are speedup ratios relative to the baseline (standard inference without
dLLM-Cache). For TPS, Speed(TPS) is the ratio of TPS with dLLM-Cache to baseline TPS. For FLOPs, Speed(FLOPs) is the ratio of baseline FLOPs to FLOPs with dLLM-Cache (since lower FLOPs are better).
- Tokens Per Second (TPS):
-
Task Performance (Quality) Metrics:
- Score / Accuracy:
- Conceptual Definition:
Accuracy is a common metric for classification or exact-match tasks, measuring the proportion of correctly predicted instances out of the total. For benchmarks like GSM8K, MMLU, GPQA, and Math, it typically represents the percentage of questions or problems for which the model provides the correct answer.
- Symbol Explanation:
- : The count of instances where the model's output matches the ground truth.
- : The total number of instances evaluated.
- Conceptual Definition:
- F1 Score:
- Conceptual Definition: The
F1 scoreis the harmonic mean ofprecisionandrecall, often used in tasks like information extraction or question answering (e.g.,HotpotQA) where both false positives and false negatives are important. It provides a single score that balances both metrics.Precisionmeasures the proportion of true positive results among all positive results (true positives + false positives).Recall(orsensitivity) measures the proportion of true positive results among all relevant samples (true positives + false negatives).
- Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
- Symbol Explanation:
- : True Positives (correctly identified positive instances).
- : False Positives (incorrectly identified positive instances).
- : False Negatives (incorrectly identified negative instances).
- : The F1 score, ranging from 0 to 1 (or 0% to 100%).
- Conceptual Definition: The
- GPU Memory (GB):
-
Conceptual Definition: Measures the peak Graphics Processing Unit (GPU) memory usage during inference, reported in gigabytes (GB). This metric is crucial for understanding the hardware requirements and scalability of the method. Lower memory usage allows for larger models or batch sizes.
All TPS and FLOPs tests were performed on a single
NVIDIA RTX 4090 GPU. The GPU memory measurements were also taken on a specific GPU setup.
-
- Score / Accuracy:
5.3. Baselines
The primary baseline for comparison was the standard inference of the dLLM models (LLaDA 8B and Dream 7B) without any caching or acceleration techniques applied. This "baseline" represents the original, unoptimized performance of the dLLMs.
In one specific comparison (Table 3), LLaMA3 8B (a state-of-the-art autoregressive model of similar size) was included as an additional baseline to contextualize the performance of dLLMs (with and without dLLM-Cache) against the dominant ARM paradigm. This comparison is particularly relevant because the paper aims to bring dLLM inference latency closer to that of ARMs.
6. Results & Analysis
6.1. Core Results Analysis
The experiments consistently demonstrate that dLLM-Cache significantly improves inference efficiency for diffusion-based Large Language Models (dLLMs) while largely preserving output quality.
Overall Efficacy:
dLLM-Cache achieves up to 9.1x speedup over standard inference without compromising output quality (often achieving "lossless" acceleration). This acceleration is measured by increased Tokens Per Second (TPS) and reduced Floating Point Operations (FLOPs). The method effectively brings dLLM inference latency closer to that of autoregressive models (ARMs) in many settings.
LLaDA 8B Results (Table 1)
The following are the results from Table 1 of the original paper:
| Task | Method | TPS↑ | Speed(TPS)↑ | FLOPs(T)↓ | Speed(FLOPs)↑ | Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Mathematics & Science | ||||||
| GSM8K | LLaDA Base | 7.32 | 1.00× | 16.12 | 1.00× | 69.06 |
| + Cache (Kp = 100, Kr = 6) | 31.43 | 4.29× | 2.76 | 5.84× | 67.32 | |
| LLaDA Instruct | 23.19 | 1.00× | 3.21 | 1.00× | 70.66 | |
| + Cache (Kp = 25, Kr = 5) | 15.87 | 3.17× | 13.36 | 5.02× | 78.54 | |
| GPQA | LLaDA Base | 5.12 | 1.00× | 22.97 | 1.00× | 31.91 |
| + Cache (Kp = 100, Kr = 8) | 25.23 | 4.93× | 3.20 | 7.18× | 32.81 | |
| LLaDA Instruct | 5.33 | 1.00× | 22.07 | 1.00× | 29.01 | |
| + Cache (Kp = 50, Kr = 6) | 28.01 | 5.26× | 2.73 | 8.08× | 29.01 | |
| Math | LLaDA Base | 8.31 | 1.00× | 14.11 | 1.00× | 30.84 |
| + Cache (Kp = 50, Kr = 8) | 33.92 | 4.08× | 2.61 | 5.41× | 29.84 | |
| LLaDA Instruct | 23.65 | 1.00× | 5.16 | 1.00× | 22.32 | |
| + Cache (Kp = 50, Kr = 1) | 31.02 | 1.31× | 3.96 | 1.30× | 22.52 | |
| General Tasks | ||||||
| MMLU-pro | LLaDA Base | 14.08 | 1.00× | 8.40 | 1.00× | 24.26 |
| + Cache (Kp = 100, Kr = 6) | 45.75 | 3.25× | 2.15 | 3.91× | 24.69 | |
| LLaDA Instruct | 14.01 | 1.00× | 8.46 | 1.00× | 36.41 | |
| + Cache (Kp = 51, Kr = 3) | 39.63 | 2.83× | 2.62 | 3.23× | 36.08 | |
| MMLU | LLaDA Base | 8.09 | 1.00× | 14.56 | 1.00× | 63.99 |
| + Cache (Kp = 100, Kr = 6) | 33.52 | 4.14× | 2.64 | 5.52× | 64.26 | |
| LLaDA Instruct | 10.12 | 1.00× | 11.85 | 1.00× | 61.24 | |
| + Cache (Kp = 100, Kr = 7) | 21.23 | 2.10× | 4.50 | 2.63× | 62.82 | |
| BBH | LLaDA Base | 6.41 | 1.00× | 18.29 | 1.00× | 44.77 |
| + Cache (Kp = 50, Kr = 6) | 27.90 | 4.35× | 3.09 | 5.92× | 45.04 | |
| LLaDA Instruct | 6.18 | 1.00× | 18.98 | 1.00× | 51.49 | |
| + Cache (Kp = 100, Kr = 5) | 27.55 | 4.46× | 3.08 | 6.16× | 51.98 | |
| Code | ||||||
| MBPP | LLaDA Base | 7.87 | 1.00× | 14.91 | 1.00× | 40.80 |
| + Cache (Kp = 50, Kr = 4) | 30.74 | 3.91× | 2.99 | 4.99× | 39.00 | |
| LLaDA Instruct | 24.61 | 1.00× | 3.07 | 1.00× | 40.60 | |
| + Cache (Kp = 25, Kr = 4) | 16.74 | 3.13× | 11.92 | 4.86× | 40.60 | |
| HumanEval | LLaDA Base | 7.55 | 1.00× | 15.53 | 1.00× | 39.20 |
| + Cache (Kp = 100, Kr = 5) | 31.73 | 4.20× | 2.80 | 5.55× | 39.60 | |
| LLaDA Instruct | 19.98 | 1.00× | 6.03 | 1.00× | 32.92 | |
| + Cache (Kp = 50, Kr = 5) | 51.96 | 2.60× | 2.04 | 2.96× | 32.31 | |
- Exceptional Speedup on GPQA: For LLaDA Instruct on GPQA, dLLM-Cache achieves an outstanding 8.08× speedup in FLOPs (from 22.07 TFLOPs to 2.73 TFLOPs) with lossless quality (the 29.01% score is maintained). This is a strong indicator of its efficiency for complex reasoning tasks (a quick sanity check of these ratios follows this list).
- Consistent Gains Across Benchmarks: Across tasks such as GSM8K, MMLU-pro, MMLU, BBH, MBPP, and HumanEval, dLLM-Cache consistently delivers 2×–5× speedups in TPS and 3×–7× speedups in FLOPs for both Base and Instruct variants.
- Quality Preservation: In most cases, the task performance (Score) either remains identical or changes only slightly (e.g., LLaDA Base on GSM8K drops from 69.06% to 67.32%, while LLaDA Instruct on GSM8K improves from 70.66% to 78.54%). This demonstrates that the caching strategy does not significantly compromise output quality.
- Variable Settings: The optimal prompt refresh interval ($K_p$) and response refresh interval ($K_r$) vary by task and model, highlighting the adaptive tuning required for different contexts. For example, Math (LLaDA Instruct) uses $K_r = 1$, implying very frequent response refreshes, which yields a lower speedup (1.31×) but still a meaningful reduction in FLOPs.
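The Speed columns in Table 1 read as simple ratios against the corresponding baseline row. Assuming that interpretation, the GPQA LLaDA Instruct numbers above can be sanity-checked in a few lines:

```python
# Assumption: Speed(TPS) = TPS_cached / TPS_baseline and
#             Speed(FLOPs) = FLOPs_baseline / FLOPs_cached.
baseline_tps, cached_tps = 5.33, 28.01          # GPQA, LLaDA Instruct (Table 1)
baseline_tflops, cached_tflops = 22.07, 2.73

print(f"TPS speedup:   {cached_tps / baseline_tps:.2f}x")        # ~5.26x, matching the table
print(f"FLOPs speedup: {baseline_tflops / cached_tflops:.2f}x")  # ~8.08x, matching the table
```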
Dream 7B Results (Table 2)
The following are the results from Table 2 of the original paper:
| Task | Configuration | TPS↑ | Speed(TPS)↑ | FLOPs(T)↓ | Speed(FLOPs)↑ | Score↑ |
| Mathematics & Science |
| GSM8K | Dream Base | 6.36 | 1.00× | 19.59 | 1.00× | 76.95 |
| + Cache (Kp = 100, Kr = 8) | 32.44 | 5.10× | 2.84 | 6.90× | 76.95 | |
| Dream Instruct | 6.39 | 1.00× | 19.57 | 1.00× | 77.55 | |
| + Cache (Kp = 25, Kr = 2) | 24.52 | 3.84× | 4.24 | 4.62× | 76.80 | |
| GPQA | Dream Base | 5.80 | 1.00× | 21.66 | 1.00× | 33.92 |
| + Cache (Kp = 100, Kr = 8) | 30.95 | 5.33× | 3.03 | 7.15× | 34.15 | |
| Dream Instruct | 5.53 | 1.00× | 22.63 | 1.00× | 34.38 | |
| + Cache (Kp = 10, Kr = 8) | 21.98 | 3.97× | 4.69 | 4.83× | 33.93 | |
| Math | Dream Base | 9.40 | 1.00× | 13.31 | 1.00× | 38.68 |
| + Cache (Kp = 100, Kr = 4) | 36.89 | 3.92× | 2.61 | 5.10× | 37.94 | |
| Dream Instruct | 8.85 | 1.00× | 14.11 | 1.00× | 38.28 | |
| + Cache (Kp = 50, Kr = 1) | 23.52 | 2.66× | 4.66 | 3.03× | 37.62 | |
| General Tasks | ||||||
| MMLU-pro | Dream Base | 15.61 | 1.00× | 7.92 | 1.00× | 24.13 |
| + Cache (Kp = 25, Kr = 2) | 35.86 | 2.30× | 2.89 | 2.74× | 23.86 | |
| Dream Instruct | 15.40 | 1.00× | 7.98 | 1.00× | 43.79 | |
| + Cache (Kp = 5, Kr = 1) | 23.98 | 1.56× | 4.77 | 1.67× | 43.96 | |
| MMLU | Dream Base | 9.10 | 1.00× | 13.73 | 1.00× | 73.49 |
| + Cache (Kp = 100, Kr = 2) | 31.07 | 3.41× | 3.27 | 4.20× | 73.20 | |
| Dream Instruct | 8.45 | 1.00× | 14.75 | 1.00× | 73.40 | |
| + Cache (Kp = 100, Kr = 8) | 38.01 | 4.50× | 2.42 | 6.10× | 73.42 | |
| BBH | Dream Base | 7.24 | 1.00× | 17.25 | 1.00× | 52.25 |
| + Cache (Kp = 25, Kr = 4) | 29.61 | 4.09× | 3.35 | 5.15× | 51.66 | |
| Dream Instruct | 6.98 | 1.00× | 17.90 | 1.00× | 57.07 | |
| + Cache (Kp = 10, Kr = 2) | 22.31 | 3.20× | 4.82 | 3.71× | 57.07 | |
| Code | ||||||
| MBPP | Dream Base | 8.91 | 1.00× | 14.06 | 1.00× | 54.20 |
| + Cache (Kp = 25, Kr = 8) | 35.69 | 4.01× | 2.66 | 5.29× | 54.20 | |
| Dream Instruct | 8.46 | 1.00× | 14.65 | 1.00× | 57.00 | |
| + Cache (Kp = 10, Kr = 8) | 29.77 | 3.52× | 3.33 | 4.40× | 56.80 | |
| HumanEval | Dream Base | 21.43 | 1.00× | 5.68 | 1.00× | 58.53 |
| + Cache (Kp = 5, Kr = 1) | 27.40 | 1.28× | 4.17 | 1.36× | 57.31 | |
| Dream Instruct | 17.88 | 1.00× | 6.84 | 1.00× | 57.92 | |
| + Cache (Kp = 50, Kr = 1) | 28.03 | 1.57× | 3.94 | 1.74× | 56.09 | |
- Strong Performance for Dream Base: For Dream Base on GSM8K, dLLM-Cache achieves a 6.90× speedup in FLOPs with perfect preservation of accuracy (76.95%). Similarly, on GPQA, it shows a 7.15× speedup with a slight increase in score (33.92% to 34.15%).
- Consistent with LLaDA: The trends observed for LLaDA are largely replicated for Dream, indicating the generality of dLLM-Cache across different dLLM architectures. Speedups range from 1.28× to 5.33× in TPS and from 1.36× to 7.15× in FLOPs.
- Minor Quality Fluctuations: While mostly lossless, some tasks show minor accuracy drops (e.g., Dream Instruct on GSM8K from 77.55% to 76.80%) or slight increases. These small variations are common and generally acceptable given the substantial speedups.
Comparison with ARMs (Table 3)
The following are the results from Table 3 of the original paper:
| Method | TPS↑ | Speed(TPS) ↑ | Accuracy↑ | GPU Memory (GB)↓ |
| LLaMA3 8B [Dubey et al., 2024] | 47.73 | 1.00× | 49.05 | 16.06 |
| LLaDA* | 7.37 | 1.00× | 69.06 | 16.94 |
| + Cache* (Kp = 100, Kr = 6) | 25.02 | 3.39× | 67.32 | 17.93 |
| LLaDA† | 14.77 | 1.00× | 64.14 | 16.94 |
| + Cache† (Kp = 50, Kr = 8) | 49.15 | 3.33× | 62.32 | 17.93 |
- The superscripts * and † denote LLaDA with 256 and 128 decoding steps, respectively.
- dLLMs vs. ARMs on Accuracy: The LLaDA* baseline (7.37 TPS, 69.06% accuracy) significantly outperforms LLaMA3 8B (47.73 TPS, 49.05% accuracy) in accuracy on GSM8K. This highlights a potential strength of dLLMs on certain tasks, even at higher latency.
- dLLM-Cache Closing the Latency Gap:
  - With dLLM-Cache (+ Cache*), LLaDA* achieves 25.02 TPS (a 3.39× speedup), moving much closer to LLaMA3 8B's 47.73 TPS while still maintaining superior accuracy (67.32% vs. 49.05%).
  - When using fewer denoising steps for LLaDA (LLaDA†, 128 steps), dLLM-Cache (+ Cache†) boosts TPS to 49.15, surpassing LLaMA3 8B's 47.73 TPS. This is a crucial finding, demonstrating that dLLM-Cache can enable dLLMs to reach ARM-level inference speeds while retaining their accuracy advantages.
- GPU Memory: LLaMA3 8B uses 16.06 GB of GPU memory, baseline LLaDA uses 16.94 GB, and LLaDA with dLLM-Cache uses 17.93 GB. This indicates a moderate memory overhead (around 1 GB) for dLLM-Cache, which is deemed acceptable given the substantial speedups and accuracy gains.
Effectiveness on Long-Prompt Scenarios
The benefits of dLLM-Cache are particularly notable for tasks involving long input prompts, such as document-based question answering.
- On the LongBench-HotpotQA task, applying dLLM-Cache to LLaDA 8B Base resulted in a 9.1× speedup over the unaccelerated baseline.
- More impressively, it also led to a performance improvement, with the F1 score increasing from 34.56 to 36.10. This highlights that Long-Interval Prompt Caching is maximally leveraged in such scenarios, not only reducing computation but potentially also improving context handling for extensive static prompts.
6.2. Ablation Studies / Parameter Analysis
Effect of Cache Refresh Intervals $K_p$ and $K_r$
The authors conducted an ablation study to analyze the impact of the prompt refresh interval (K_p) and response refresh interval (K_r) on efficiency and accuracy, using LLaDA 8B Instruct on the GSM8K dataset.
The following are the results from Figure 4 of the original paper:

Figure 4: Effect of cache refresh intervals on LLaDA 8B Instruct. Panel (a) plots accuracy against TFLOPs as the prompt cache interval varies; panel (b) compares the effect of different response refresh intervals on accuracy and TFLOPs for a baseline setting and the proposed method.
- Figure 4(a): Effect of varying $K_p$ (with $K_r$ fixed)
  - This plot shows Accuracy vs. TFLOPs as $K_p$ increases.
  - As $K_p$ increases (meaning prompt features are refreshed less frequently), the TFLOPs decrease significantly, leading to greater efficiency.
  - Crucially, accuracy remains stable even with very large $K_p$ values. This confirms the hypothesis that prompt tokens are highly static and their features can be cached for long intervals without degrading performance; infrequent prompt updates are sufficient.
- Figure 4(b): Effect of varying $K_r$ (under two settings)
  - This plot shows Accuracy vs. TFLOPs as $K_r$ increases (meaning response features are refreshed less frequently).
  - Gray line (baseline): a scenario with no prompt caching and no adaptive updates for the response (equivalent to a uniform full refresh, but only for the response). As $K_r$ increases, accuracy drops sharply, indicating that blindly caching response tokens for long intervals without selective updates leads to significant information loss and poor performance.
  - Orange/blue line (dLLM-Cache): the proposed setting (the Table 1 configuration for LLaDA Instruct on GSM8K). Here, even as $K_r$ increases, accuracy remains high while TFLOPs decrease. This demonstrates that the combination of long-interval prompt caching and adaptive partial updates for response tokens (guided by feature similarity) maintains accuracy even with less frequent full response refreshes, validating the differentiated and adaptive caching design. A schematic of how the two intervals interact is sketched below.
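To clarify how $K_p$ and $K_r$ interact during decoding, the sketch below shows one plausible per-step refresh schedule. This is an illustration, not the authors' released code; the function name and the "full"/"adaptive" labels are ours.

```python
def refresh_plan(step: int, k_p: int, k_r: int) -> tuple[bool, str]:
    """Illustrative schedule: which features to recompute at a given denoising step.

    Returns (refresh_prompt, response_mode), where response_mode is either
    'full' (recompute all response features) or 'adaptive' (partial update of
    only the most-changed response tokens, V-verify style).
    """
    refresh_prompt = (step % k_p == 0)                 # prompt features refreshed every K_p steps
    response_mode = "full" if step % k_r == 0 else "adaptive"
    return refresh_prompt, response_mode

# Example with the GSM8K LLaDA Instruct setting from Table 1 (K_p = 25, K_r = 5):
for step in range(8):
    print(step, refresh_plan(step, k_p=25, k_r=5))
```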
Effect of Update Ratio and Selection Strategy
This ablation investigates how the adaptive update ratio $\rho$ and the method for selecting tokens to update impact performance. The study uses LLaDA 8B Instruct on GSM8K.
The following are the results from Figure 5 of the original paper:

Figure 5: For LLaDA 8B Instruct on GSM8K, the effect of different token selection strategies on accuracy and computational cost under varying adaptive update ratios $\rho$. At a high update ratio ($\rho = 0.9$), performance is lossless, with accuracy close to the original 77.48%, but TFLOPs rise noticeably.
- Figure 5: Accuracy vs. TFLOPs for different token selection strategies under varying $\rho$
  - Strategies Compared:
    - V-verify (red line): the proposed method, using cosine similarity of Value (V) vectors (a minimal sketch of this selection step follows this list).
    - K-verify (blue line): using cosine similarity of Key (K) vectors.
    - Random selection (green line): randomly selecting a fraction $\rho$ of tokens to update.
  - Performance at high $\rho$: At $\rho = 0.9$ (meaning 90% of response tokens are recomputed), all methods achieve close to the original accuracy (77.48% for LLaDA 8B Instruct on GSM8K). This indicates that if a large enough fraction is updated, quality is preserved, but at a higher computational cost.
  - Superiority of similarity-based strategies: Both V-verify and K-verify consistently outperform random selection across a wide range of $\rho$ values. This confirms that intelligently identifying dynamic tokens is crucial for maintaining accuracy at lower computational cost; random selection degrades quickly as $\rho$ decreases.
  - V-verify vs. K-verify: V-verify (red line) generally achieves slightly higher accuracy for a given TFLOPs budget than K-verify (blue line), especially in the mid-range of $\rho$.
  - Optimal trade-off at $\rho \approx 0.25$: The V-verify strategy reaches its highest accuracy (around 78.54%, actually higher than the baseline LLaDA Instruct score for GSM8K in Table 1) at roughly this point, while still requiring significantly fewer FLOPs than full recomputation (which would correspond to $\rho = 1$). This suggests that a moderate, targeted update of approximately 25% of response tokens strikes a favorable balance between efficiency and output quality.
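The selection step behind V-verify can be pictured with a small PyTorch-style sketch. The shapes, names, and exact thresholding here are our assumptions for illustration; the idea taken from the paper is only that the $\rho$ fraction of response tokens whose Value vectors changed most (lowest cosine similarity to the cached ones) gets recomputed.

```python
import torch

def select_tokens_to_update(v_new: torch.Tensor,
                            v_cached: torch.Tensor,
                            rho: float = 0.25) -> torch.Tensor:
    """Return a boolean mask over response tokens selecting the rho fraction
    whose Value vectors changed the most (lowest cosine similarity to cache).

    v_new, v_cached: [num_response_tokens, hidden_dim]
    """
    sim = torch.nn.functional.cosine_similarity(v_new, v_cached, dim=-1)  # [N]
    num_update = max(1, int(rho * sim.numel()))
    # Tokens with the *lowest* similarity are treated as the most dynamic ones.
    idx = torch.topk(-sim, k=num_update).indices
    mask = torch.zeros_like(sim, dtype=torch.bool)
    mask[idx] = True
    return mask

# Toy usage: 16 response tokens, hidden size 8; recompute ~25% of them.
mask = select_tokens_to_update(torch.randn(16, 8), torch.randn(16, 8), rho=0.25)
print(mask.sum().item(), "tokens flagged for recomputation")
```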
Impact of Similarity Metric
The choice of similarity metric for V-verify was also analyzed.
- On GSM8K with LLaDA 8B Instruct, cosine similarity achieved 78.54% accuracy.
- In contrast, L2 distance (another common measure of vector dissimilarity) resulted in a significantly lower accuracy of 55.95%.
- This clear difference demonstrates that cosine similarity is much more effective at capturing semantic change in Value vectors, making it the superior choice for identifying dynamic tokens. Hence, cosine similarity is adopted as the default metric in dLLM-Cache. The geometric intuition is illustrated below.
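A toy example (ours, not the paper's) of why the two metrics can disagree: cosine similarity ignores magnitude and looks only at direction, whereas L2 distance flags any change in scale.

```python
import torch

# Two Value vectors pointing in the same direction but with different magnitudes:
v_cached = torch.tensor([1.0, 2.0, 3.0])
v_new = 2.5 * v_cached  # same direction, larger norm

cos = torch.nn.functional.cosine_similarity(v_new, v_cached, dim=0)
l2 = torch.dist(v_new, v_cached, p=2)

print(f"cosine similarity: {cos.item():.3f}")  # 1.000 -> direction unchanged
print(f"L2 distance:       {l2.item():.3f}")   # large, despite the unchanged direction
```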
6.3. Discussion on Overheads
Storage Overhead of Caching
dLLM-Cache introduces additional memory requirements because it stores four types of intermediate features (Key, Value, attention output AttnOut, and FFN output FFNOut) per layer. The theoretical proof in Appendix B indicates that the storage overhead scales as $\mathcal{O}(L \cdot N \cdot d)$, where $N$ is the number of tokens, $d$ is the embedding dimension, and $L$ is the number of layers.
- Empirical Observation (Table 3): On GSM8K with LLaDA 8B Base, the peak GPU memory usage without caching was 16.94 GB. With dLLM-Cache, this increased moderately to 17.93 GB.
- Comparison to ARMs: For context, LLaMA3 8B (an ARM) used 16.06 GB.
- Conclusion: The memory overhead introduced by dLLM-Cache is approximately 1 GB, which is considered moderate and acceptable given the significant inference speedups and superior accuracy compared to standard ARMs. A rough estimate of where this ~1 GB comes from is sketched after this list.
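A back-of-the-envelope estimate shows why roughly 1 GB is plausible. The hidden size (4096), layer count (32), precision (fp16), and token count (~1k) below are assumptions chosen to be typical for an 8B-parameter model, not values reported in the paper:

```python
def cache_overhead_gb(num_tokens: int, hidden_dim: int, num_layers: int,
                      bytes_per_value: int = 2, features_per_layer: int = 4) -> float:
    """Rough cache size: K, V, AttnOut, and FFNOut stored per layer for every token (fp16)."""
    total_bytes = features_per_layer * num_layers * num_tokens * hidden_dim * bytes_per_value
    return total_bytes / 1024**3

# Assumed LLaDA-8B-like dimensions and ~1k cached tokens (prompt + response).
print(f"{cache_overhead_gb(num_tokens=1024, hidden_dim=4096, num_layers=32):.2f} GB")  # ≈ 1.00 GB
```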
Cost of V-verify and the Fixed Update Overheads
While the V-verify mechanism itself (computing cosine similarity for Value vectors) is computationally lightweight, the overall practical speedup from adaptive partial updates is constrained by certain fixed operational overheads.
The following are the results from Figure 6 of the original paper:

Figure 6: TPS as a function of the parameter $\rho$. When $\rho$ is very small, TPS drops noticeably, reflecting the fixed operational overhead incurred whenever selective updates are triggered.
- Figure 6: TPS versus $\rho$
  - This plot shows that Tokens Per Second (TPS) generally decreases as the adaptive update ratio $\rho$ decreases toward zero.
  - A notable decrease in TPS persists even at minimal $\rho$ values (i.e., when very few tokens are being actively recomputed).
- Explanation: This phenomenon highlights that initiating any selective recomputation (i.e., whenever $\rho > 0$) triggers certain non-negligible system-level latencies. These fixed overheads are not proportional to the number of tokens being updated. Examples include:
  - GPU kernel management: the overhead of launching new GPU kernels for specific computations.
  - Data movement: the time taken to transfer data between different memory locations (e.g., from main memory to GPU registers or caches).
  - Control flow logic: the cost of the branching and conditional operations in the algorithm itself (e.g., identifying tokens, gathering, scattering).
- Implication: At very low $\rho$ values, these fixed overheads start to dominate any savings from reduced dynamic computation. Therefore, simply minimizing $\rho$ indefinitely does not guarantee a proportional increase in TPS.
- Optimal $\rho$: An optimal adaptive update ratio must balance these fixed costs against the benefits of reduced dynamic computation, while simultaneously preserving model quality. As shown in Figure 5, a moderate $\rho$ (around 0.25) typically offers an effective trade-off, optimizing overall efficiency and fidelity without being excessively impacted by these fixed overheads.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces dLLM-Cache, an innovative training-free and model-agnostic caching framework specifically designed to accelerate inference in diffusion-based Large Language Models (dLLMs). The core insight driving dLLM-Cache is the identification and exploitation of dual computational redundancies unique to dLLM inference: the quasi-static nature of prompt tokens and the sparse dynamism of response tokens across iterative denoising steps.
By combining a long-interval prompt caching strategy with an adaptive short-interval response caching mechanism (guided by the V-verify method using cosine similarity to identify the most dynamic tokens), dLLM-Cache achieves substantial efficiency gains. Extensive experiments on LLaDA 8B and Dream 7B demonstrate up to 9.1x speedup in inference throughput and significant FLOPs reduction, critically without compromising the quality of the generated text. A notable finding is that dLLM-Cache can bring dLLM inference latency to parity with, or even surpass, that of Autoregressive Models (ARMs) like LLaMA3 8B while retaining dLLMs' inherent accuracy advantages.
7.2. Limitations & Future Work
The authors openly acknowledge a current limitation of their work:

- Model Scale: Existing open-source dLLMs are currently restricted to smaller scales (e.g., 8 billion parameters). This prevented dLLM-Cache from being evaluated on much larger models (e.g., 33B or 70B parameters).

Based on this, they propose a clear direction for future work:

- Larger Scale Evaluation: The authors believe that dLLM-Cache will deliver even greater acceleration benefits at larger model scales. This remains an important area for validation once larger open-source dLLMs become available. It also suggests that the overheads (both computational and memory) of traditional dLLM inference become even more prohibitive at larger scales, making the proportional savings from caching potentially higher.
7.3. Personal Insights & Critique
Inspirations
- Clever Observation of Redundancy: The paper's strength lies in its astute observation of the specific computational redundancies within dLLM inference—the static prompt and the partially dynamic response. This granularity of analysis, distinguishing between different parts of the input sequence and their temporal stability, is a significant departure from previous, more uniform caching strategies. It highlights that understanding the task-specific dynamics of a model is key to efficient optimization.
- Adaptive and Training-Free Design: The adaptive caching approach, particularly the V-verify mechanism, is elegant. It uses a lightweight proxy (cosine similarity of Value vectors) to make intelligent decisions about recomputation, avoiding the need for costly retraining or complex learning-based caching policies. This makes the method practical and easily deployable.
- Bridging the Performance Gap: The ability to bring dLLM inference speed on par with ARMs while maintaining, or even improving, accuracy is a critical achievement. It strengthens the case for dLLMs as viable alternatives to ARMs, especially if they offer architectural benefits (like addressing the "reversal curse"). This could encourage broader adoption and research into dLLMs.
Potential Issues, Unverified Assumptions, or Areas for Improvement
- Generality of $K_p$, $K_r$, and $\rho$: While the paper suggests $\rho \approx 0.25$ is a good trade-off, and the optimal $K_p$ and $K_r$ values vary, the robustness of these hyperparameters across a much wider range of dLLM architectures, datasets, and generation lengths is not fully explored. Tuning them for every new application might still incur some effort. A more dynamic, possibly self-tuning, approach for these hyperparameters could be a future research direction.
- "Fixed Operational Overheads": The paper acknowledges that fixed operational overheads limit speedup at very low $\rho$. While these are system-level latencies (GPU kernel management, data movement), further investigation into their exact nature and potential mitigation strategies could unlock even greater efficiency gains. Are there hardware-aware optimizations or scheduling techniques that could reduce these fixed costs?
- Complexity of Implementation: While training-free, the dLLM-Cache algorithm involves splitting, gathering, scattering, and conditional execution of computations within Transformer layers. This can add implementation complexity and might introduce minor performance overheads compared to a perfectly optimized, monolithic forward pass. The reported performance figures already account for this, but it remains a factor.
- Information Loss in Partial Updates: Although quality is largely preserved, the minor accuracy drops on some tasks suggest that even V-verify might occasionally miss critical changes, or that reusing stale features for a portion of tokens can have subtle impacts. Future work could explore more sophisticated change-detection mechanisms or strategies to further minimize this potential information loss without sacrificing efficiency.
- Bidirectional Context for Prompt Features: The paper states that prompt recomputation (at each $K_p$-step refresh) accounts for evolving response features, which is important. However, the core advantage of dLLMs' bidirectional attention means that prompt features are always influenced by the current state of the response. Reusing prompt features for long intervals, even when the prompt itself is static, means they are not re-attending to the latest evolving response within those intervals. While practical, this is an implicit trade-off made for speed. The F1 score increase on LongBench-HotpotQA is interesting and might suggest that prompt caching, by stabilizing certain features, inadvertently helps in some complex reasoning tasks; more analysis of this specific observation would be valuable.

Overall, dLLM-Cache presents a highly valuable contribution to the practical deployment of dLLMs. It is a pragmatic and well-motivated solution to a critical bottleneck, demonstrating how careful analysis of model behavior can lead to significant performance improvements.