
dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

Published: 05/17/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

dLLM-Cache is a training-free adaptive caching method accelerating diffusion large language models by reusing intermediate computations, achieving up to 9.1× speedup on LLaDA 8B and Dream 7B without degrading output quality.

Abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1 x speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. Codes are provided in the supplementary material and will be released publicly on GitHub.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

1.2. Authors

  • Zhiyuan Liu (Shanghai Jiao Tong University, EPIC Lab)

  • Yicun Yang (Shanghai Jiao Tong University, EPIC Lab)

  • Yaojie Zhang (Shanghai Jiao Tong University, University of Electronic Science and Technology of China)

  • Junjie Chen (EPIC Lab, Shanghai Jiao Tong University)

  • Chang Zou (Shanghai Jiao Tong University, EPIC Lab, University of Electronic Science and Technology of China)

  • Qingyan Wei (EPIC Lab, Shanghai Jiao Tong University)

  • Shaobo Wang (Shanghai Jiao Tong University, EPIC Lab)

  • Linfeng Zhang (Shanghai Jiao Tong University, EPIC Lab)

    The authors are primarily affiliated with the School of Artificial Intelligence and EPIC Lab at Shanghai Jiao Tong University, with some also associated with the University of Electronic Science and Technology of China. Their research backgrounds appear to be in artificial intelligence, particularly large language models and diffusion models, focusing on efficiency and acceleration.

1.3. Journal/Conference

This paper was published on arXiv, a preprint server, on May 17, 2025. arXiv is widely used by researchers to disseminate their work prior to, or in parallel with, peer review and formal publication. While not a peer-reviewed journal or conference, it allows for rapid sharing and feedback within the scientific community.

1.4. Publication Year

2025

1.5. Abstract

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1× speedup over standard inference without compromising output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings.

  • Original Source Link: https://arxiv.org/abs/2506.06295 (Preprint)
  • PDF Link: https://arxiv.org/pdf/2506.06295v1.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The field of Large Language Models (LLMs) has been primarily dominated by autoregressive models (ARMs) like GPT and LLaMA, which generate text sequentially. Recently, a new and promising paradigm, diffusion-based Large Language Models (dLLMs), has emerged. These models generate text by iteratively denoising masked segments and have shown significant potential, including overcoming certain limitations of ARMs, such as the "reversal curse" (where models trained on "A is B" may fail to infer "B is A").

However, the practical adoption of dLLMs is severely hampered by their high inference latency. Unlike ARMs that generate one token at a time, dLLMs require an iterative, multi-step denoising process for the entire sequence, leading to substantial computational demands. Existing acceleration techniques face specific challenges:

  1. Incompatibility with ARM acceleration: Traditional ARM acceleration methods, such as Key-Value (KV) caching, rely on the causal attention mechanism inherent in autoregressive generation. dLLMs, conversely, utilize a bidirectional attention mechanism, making KV caching directly inapplicable.

  2. Limitations of general diffusion model acceleration: While techniques exist for accelerating diffusion models (DMs) (primarily in computer vision), they often apply uniform policies to all tokens. This approach fails to account for the unique characteristics of language generation, particularly the presence of a static prompt and a dynamically evolving, yet largely stable, response.

    The core problem the paper addresses is this high inference latency in dLLMs, specifically targeting the computational redundancies that existing acceleration methods overlook. The paper's innovative idea stems from a key observation: during dLLM inference, the input prompt tokens remain static, while the response tokens, though dynamic, often exhibit high stability across adjacent denoising steps, meaning most of their intermediate representations do not change significantly. This sparse dynamism suggests that much of the computation is redundant and can be cached.

2.2. Main Contributions / Findings

The paper introduces dLLM-Cache, a novel, training-free adaptive caching framework specifically designed to accelerate dLLM inference by exploiting the identified computational redundancies.

The primary contributions and findings are:

  1. Identification and Characterization of Dual Redundancies: The authors identify two distinct computational redundancies in dLLM inference:
    • Quasi-static prompt redundancy: The input prompt remains constant, but its intermediate features are recomputed in every denoising step.
    • Dynamic but sparse response redundancy: The generated response features evolve, but most tokens' representations remain stable across adjacent steps.
  2. Proposal of dLLM-Cache Framework: A training-free, model-agnostic adaptive caching framework that differentiates its caching strategy based on the nature of prompt and response tokens.
    • It employs long-interval prompt caching for the static prompt features, updating them only at sparse intervals.
    • It uses adaptive short-interval response caching for response tokens, with full refreshes at more frequent short intervals and partial updates in between.
  3. Introduction of V-verify Mechanism: A lightweight yet effective mechanism that guides the adaptive partial updates for response tokens. It uses cosine similarity of Value (V) vectors between current and cached states to identify the most dynamic tokens (those with the lowest similarity) that require recomputation. This mechanism is empirically correlated with changes in subsequent Attention and Feedforward Network (FFN) outputs.
  4. Significant Inference Acceleration with Lossless Quality: Extensive experiments on representative dLLMs, LLaDA 8B and Dream 7B, demonstrate:
    • dLLM-Cache achieves up to 9.1x speedup over standard inference.
    • This acceleration is achieved without compromising output quality (lossless impact on response quality in most cases).
    • The method significantly reduces inference latency, bringing dLLM performance close to that of ARMs under many settings.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

Large Language Models (LLMs)

Large Language Models (LLMs) are a class of artificial intelligence models that are trained on vast amounts of text data to understand, generate, and process human language. They typically use deep learning architectures, most commonly Transformers, and are capable of performing a wide range of natural language processing (NLP) tasks, such as translation, summarization, question answering, and text generation. They are considered "foundational" because their broad capabilities can be adapted to many specific applications.

Autoregressive Models (ARMs)

Autoregressive Models (ARMs) are the dominant paradigm for LLMs. These models generate text sequentially, token by token. In an ARM, each new token is predicted based on all previously generated tokens (and the initial input prompt). This sequential generation process is characterized by causal attention, meaning that when computing the representation for a given token, the model can only attend to tokens that appear before it in the sequence. Examples include OpenAI's GPT series, Google's PaLM, and Meta's LLaMA.

Diffusion Models (DMs)

Diffusion Models (DMs) are a class of generative models that have achieved state-of-the-art results, particularly in image generation. The core idea behind DMs is to learn to reverse a gradual data corruption process.

  1. Forward (Diffusion) Process: In this process, noise is progressively added to an input data sample (e.g., an image) over several steps, transforming it into a noisy, unrecognizable state (often pure Gaussian noise).
  2. Reverse (Denoising) Process: The model is trained to predict and remove this noise at each step, effectively learning to transform pure noise back into a coherent data sample. By iteratively applying this learned denoising step, the model can generate new data samples from random noise.

Diffusion-based Large Language Models (dLLMs)

Diffusion-based Large Language Models (dLLMs) adapt the diffusion model paradigm to discrete data, specifically text. Unlike image generation where data is continuous, text consists of discrete tokens. A common approach for dLLMs is Masked Diffusion Models (MDMs), which operate by iteratively denoising masked segments of a text sequence.

  1. Initialization: The process starts with a fully masked sequence (e.g., a sequence of [MASK] tokens).
  2. Iterative Denoising: In each denoising step, the model predicts the original, clean tokens for some or all of the masked positions based on the surrounding context (which can include both prompt and partially denoised response tokens). This prediction is then used to update the masked sequence, progressively revealing the complete text.
  3. Bidirectional Attention: A key characteristic of dLLMs, particularly those built on Transformer architectures, is their use of bidirectional attention. Unlike the causal attention in ARMs, bidirectional attention allows each token to attend to all other tokens in the input sequence (both preceding and succeeding) at every step. This provides a rich context for denoising and can be advantageous for tasks requiring global context understanding.

Attention Mechanism

The Attention mechanism is a core component of Transformer neural network architectures, which are foundational to modern LLMs. It allows the model to weigh the importance of different parts of the input sequence when processing a particular token. The fundamental components of attention are Query (Q), Key (K), and Value (V) vectors. For each token in a sequence, the model generates Q, K, and V vectors. The self-attention mechanism calculates an attention score by taking the dot product of the Query vector of the current token with the Key vectors of all tokens in the sequence. These scores are then scaled and passed through a softmax function to produce attention weights, indicating how much attention each token should pay to other tokens. Finally, these weights are used to compute a weighted sum of the Value vectors, resulting in an Attention Output vector for the current token.

The mathematical formula for scaled dot-product attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices containing the query, key, and value vectors for all tokens in the sequence; each row corresponds to a token and each column to a feature dimension.
  • $d_k$ is the dimension of the key vectors. Dividing by $\sqrt{d_k}$ is a scaling factor that prevents large dot products from pushing the softmax into regions with very small gradients.
  • $QK^T$ is the product of the query matrix and the transpose of the key matrix, which computes the similarity between each query and all keys.
  • $\mathrm{softmax}(\cdot)$ is the softmax function, which normalizes the attention scores into probabilities that sum to 1.
  • The final multiplication by $V$ computes the weighted sum of value vectors, producing the attention output (a short code sketch follows this list).
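To make the formula concrete, the following is a minimal NumPy sketch of scaled dot-product attention. It is an illustrative example with assumed toy shapes, not code from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: arrays of shape (seq_len, d_k). Returns an output of shape (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                      # (seq_len, seq_len) similarity scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ V                                   # weighted sum of value vectors

# Toy usage: 4 tokens, 8-dimensional head
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)              # shape (4, 8)
```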

Key-Value (KV) Caching (for ARMs)

Key-Value (KV) caching is an optimization technique used to accelerate autoregressive models (ARMs) during inference. Since ARMs generate text token by token, the Key (K) and Value (V) vectors for previously generated tokens remain constant and are reused for subsequent token generation steps. Instead of recomputing these K and V vectors for all preceding tokens at each step, they are computed once and stored in a cache. This significantly reduces computation, as the attention mechanism only needs to compute K and V for the new token and attend to the cached K and V of previous tokens.

However, KV caching is incompatible with dLLMs because of their bidirectional attention mechanism. In ARMs, new tokens only attend to past tokens (causal attention). In dLLMs, tokens can attend to all other tokens (prompt and response) in the current intermediate state, and this state changes iteratively. This means that the K and V vectors for tokens in the response are not static and must be recomputed or re-evaluated at each denoising step, making direct KV caching inefficient.
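For contrast, here is a hedged sketch of why KV caching suits causal ARMs: the Key/Value projections of past tokens never change, so each decoding step only computes projections for the newest token. This is an illustrative single-head simplification with assumed names, not any specific library's implementation.

```python
import numpy as np

d = 8
W_q, W_k, W_v = (np.random.randn(d, d) for _ in range(3))
k_cache, v_cache = [], []                # grows by one entry per generated token

def decode_step(x_new):
    """x_new: hidden state of the newly generated token, shape (d,)."""
    q = x_new @ W_q
    k_cache.append(x_new @ W_k)          # past K/V are never recomputed (causal attention)
    v_cache.append(x_new @ W_v)
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(d)
    w = np.exp(scores - scores.max()); w /= w.sum()
    return w @ V

for x in np.random.randn(5, d):          # five decoding steps, cache grows each time
    _ = decode_step(x)

# In a dLLM, bidirectional attention over an evolving sequence breaks this invariant:
# the K/V of earlier positions change as masked tokens are denoised, so they cannot
# simply be appended once and reused.
```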

Cosine Similarity

Cosine similarity is a measure of similarity between two non-zero vectors in an inner product space. It is defined as the cosine of the angle between the vectors, which ranges from -1 (opposite) to 1 (identical), with 0 indicating orthogonality (no correlation). In the context of this paper, it's used to quantify how much a token's feature vector (like its Value (V) vector) has changed between adjacent denoising steps. A higher cosine similarity (closer to 1) means the vectors are more similar, indicating less change.

The formula for cosine similarity between two vectors $\mathbf{A}$ and $\mathbf{B}$ is: $ \mathrm{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\Vert \mathbf{A} \Vert \, \Vert \mathbf{B} \Vert} = \frac{\sum_{i=1}^n A_i B_i}{\sqrt{\sum_{i=1}^n A_i^2} \sqrt{\sum_{i=1}^n B_i^2}} $ Where:

  • $\mathbf{A}$ and $\mathbf{B}$ are the two vectors being compared.
  • $\mathbf{A} \cdot \mathbf{B}$ denotes their dot product.
  • $\Vert \mathbf{A} \Vert$ and $\Vert \mathbf{B} \Vert$ denote the Euclidean norms (magnitudes) of the vectors (a one-line implementation follows this list).
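A minimal implementation of this measure, as used later to compare a token's current and cached Value vectors (the toy vectors are assumptions):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b; 1.0 means an unchanged direction."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_prev = np.array([1.0, 2.0, 3.0])
v_curr = np.array([1.1, 2.0, 2.9])
print(cosine_similarity(v_prev, v_curr))  # close to 1.0: the token's V vector barely changed
```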

FLOPs and TPS

  • FLOPs (Floating Point Operations): A measure of computational cost. It quantifies the number of floating-point arithmetic operations (additions, multiplications, divisions, etc.) performed by a model. Lower FLOPs generally indicate higher computational efficiency. In the paper, TFLOPs refers to tera-FLOPs ($10^{12}$ FLOPs).
  • TPS (Tokens Per Second): A measure of inference speed or throughput. It indicates how many tokens the model can process or generate per second. Higher TPS means faster inference.

3.2. Previous Works

Autoregressive Models (ARMs)

The paper acknowledges the dominance of ARMs, citing foundational works such as:

  • Radford (2018): Introduced Generative Pre-training (GPT), a Transformer-based model pre-trained on large text corpora.
  • Radford et al. (2019): GPT-2, demonstrating the power of scaling pre-training data and model size.
  • Brown (2020): GPT-3, which highlighted the emergent capabilities of very large models, especially in few-shot learning.
  • OpenAI (2022): ChatGPT, showcasing optimized language models for dialogue. These models established the sequential, causal attention paradigm for text generation.

Diffusion Models for Language

The adaptation of DMs to discrete text data has been a significant research area. Key works include:

  • Austin et al. (2021): Introduced Structured Denoising Diffusion Models in discrete state-spaces, a foundational work for Masked Diffusion Models (MDMs).
  • Lou et al. (2023) and Shi et al. (2024): Further explored discrete diffusion for language modeling, often by iteratively predicting masked tokens.
  • Ou et al. (2024): Investigated the theoretical underpinnings of absorbing discrete diffusion for clean data conditional distributions.

Diffusion Large Language Models (dLLMs)

Recent scaling efforts have led to powerful dLLMs:

  • LLaDA [Nie et al., 2025]: An 8-billion parameter MDM trained from scratch using a bidirectional Transformer. It aims to be a direct alternative to ARMs.
  • Dream [Ye et al., 2025]: A 7-billion-parameter dLLM initialized from pre-trained ARM weights, reflecting a different training strategy. Both LLaDA and Dream demonstrate performance comparable to similarly sized ARMs such as LLaMA3 8B, and their bidirectional architectures potentially offer advantages such as circumventing the "reversal curse."

Key-Value Caching for ARMs

  • Pope et al. (2023): This work describes how Key-Value caching accelerates ARMs by reusing previously computed attention states. As discussed in Foundational Concepts, this method is not directly applicable to dLLMs due to their bidirectional attention.

Feature Caching for DMs (primarily for vision)

Acceleration techniques for diffusion models (DMs) have largely focused on vision tasks, leveraging temporal redundancy in intermediate features across denoising steps.

  • DCache [Ma et al., 2024] and FastDPM [Li et al., 2023]: Early approaches focusing on U-Net architectures common in image DMs.
  • FORA [Selvaraju et al., 2024]: Applies static-interval caching for attention and MLP layers within Diffusion Transformers.
  • ToCa series [Zou et al., 2024a,b]: Aims to reduce information loss by dynamically updating cached features.
  • TaylorSeer [Liu et al., 2025]: Introduces a "cache-then-forecast" paradigm, predicting features using Taylor series expansion rather than direct reuse.

3.3. Technological Evolution

The evolution of Large Language Models (LLMs) has seen a progression from early statistical methods to neural networks, culminating in the dominance of Transformer-based Autoregressive Models (ARMs). ARMs, while powerful, inherently generate text sequentially, which can be a bottleneck for certain tasks and prone to issues like the "reversal curse."

In parallel, Diffusion Models (DMs) revolutionized generative AI in continuous domains like image synthesis. The challenge was to adapt DMs to the discrete nature of text. Masked Diffusion Models (MDMs) emerged as a solution, offering an iterative, non-autoregressive generation process that utilizes bidirectional attention. This brought forth a new class of dLLMs that could potentially overcome ARM limitations.

However, this architectural shift introduced a new performance bottleneck: the iterative, multi-step denoising process of dLLMs leads to high inference latency. Existing acceleration techniques, whether Key-Value caching for ARMs (incompatible due to bidirectional attention) or general feature caching for vision DMs (ineffective due to uniform policies ignoring language-specific structure), fail to adequately address this.

This paper's work, dLLM-Cache, fits into this technological timeline by specifically addressing the inference efficiency gap for dLLMs. It acknowledges the unique challenges posed by dLLMs' bidirectional attention and iterative nature, drawing inspiration from existing caching ideas but fundamentally redesigning them to exploit the particular characteristics of text generation (static prompt, partially dynamic response).

3.4. Differentiation Analysis

dLLM-Cache differentiates itself from prior work through its specific adaptation to the unique characteristics of diffusion-based Large Language Models (dLLMs).

  • Compared to ARM Key-Value (KV) Caching:

    • Prior Work: ARM KV caching (e.g., Pope et al., 2023) is highly effective for autoregressive models (ARMs) because they operate with causal attention. This means that when generating a new token, the model only needs to attend to previously generated tokens. The Key (K) and Value (V) states of these past tokens remain constant and can be cached and reused.
    • dLLM-Cache Differentiation: dLLM-Cache explicitly states that ARM KV caching is incompatible with dLLMs. This is because dLLMs employ bidirectional attention, where each token can attend to all other tokens in the sequence (both preceding and succeeding) at every denoising step. As the text evolves during the iterative denoising process, the K and V representations for response tokens also change, making simple reuse of static KV pairs impossible. dLLM-Cache addresses this incompatibility by developing a dynamic caching strategy.
  • Compared to General Diffusion Model Feature Caching (primarily for vision):

    • Prior Work: Existing feature caching methods for DMs (e.g., DCache, FastDPM, FORA, ToCa, TaylorSeer) were primarily developed for continuous data such as images and often apply a uniform caching policy across all features. They typically reuse intermediate features across denoising steps to exploit temporal redundancy.
    • dLLM-Cache Differentiation: dLLM-Cache identifies a crucial gap: these uniform caching strategies are ineffective for dLLMs because they overlook the distinct nature of language tasks. The core innovation of dLLM-Cache is its recognition and exploitation of dual computational redundancies unique to dLLMs:
      1. Static Prompt: The input prompt remains constant throughout the inference process, making its features highly stable and suitable for long-interval caching.
      2. Partially Dynamic Response: The generated response evolves, but most tokens remain stable across adjacent denoising steps. This sparse dynamism allows for an adaptive, partial update strategy rather than recomputing everything or uniformly caching.
    • Adaptive vs. Uniform: Unlike uniform policies, dLLM-Cache employs a differentiated adaptive caching strategy. It combines long-interval prompt caching with adaptive short-interval response caching guided by feature similarity (V-verify). This allows for efficient reuse where features are stable and targeted recomputation where changes are significant, striking a superior balance between speed and quality for language generation.

4. Methodology

4.1. Principles

The core idea behind dLLM-Cache is to leverage the inherent structural and temporal redundancies observed during the inference process of diffusion-based Large Language Models (dLLMs). The method operates on two fundamental principles:

  1. Exploiting Static Prompt Redundancy: The input prompt remains constant throughout the iterative denoising process. This implies that the internal representations and computations related to the prompt tokens are highly stable across denoising steps. Therefore, these computations do not need to be performed repeatedly in every step.

  2. Leveraging Partially Dynamic Response Redundancy: While the generated response evolves over the denoising steps, this evolution is often sparse. That is, only a small fraction of the response tokens undergo significant changes between adjacent denoising steps. Recomputing the features for all response tokens in every step is therefore inefficient. Instead, an adaptive approach that selectively updates only the most dynamic tokens can yield substantial computational savings.

    Based on these principles, dLLM-Cache proposes a training-free adaptive caching mechanism. It aims to efficiently reuse intermediate computations, specifically Key (K), Value (V), Attention Output (AttnOut), and Feedforward Network Output (FFNOut) features, without compromising the quality of the generated text. The framework differentiates between prompt and response tokens and employs a V-verify mechanism based on cosine similarity to identify and selectively update dynamic response tokens.

4.2. Core Methodology In-depth

4.2.1. Inference Process of dLLMs (Preliminary)

The paper explains the dLLM inference process using LLaDA as a representative example. dLLMs generate text through a non-autoregressive, iterative diffusion-based process that starts from a fully masked sequence and progressively denoises it into the target output.

Let $\mathcal{T}$ be the token vocabulary, which includes a special [MASK] token. Given an input prompt $\mathbf{c} = (c_1, \ldots, c_M)$, where $M$ is the prompt length, the model aims to generate a response $\mathbf{y} = (y_1, \ldots, y_L)$, where $L$ is the desired response length. Generation occurs over $K$ discrete denoising steps, indexed from $k = K$ down to $0$. Let $\mathbf{y}^{(k)} \in \mathcal{T}^L$ denote the intermediate state of the response sequence at step $k$. The process begins with a fully masked sequence: $ \mathbf{y}^{(K)} = (\underbrace{[\mathtt{MASK}], \dots, [\mathtt{MASK}]}_{L\ \text{times}}) $. At each step $k$, a mask predictor $f_{\phi}$ estimates the distribution over the clean sequence, taking the prompt $\mathbf{c}$ and the current intermediate response state $\mathbf{y}^{(k)}$ as input, with model parameters $\phi$: $ P_{\phi}(\mathbf{y} \mid \mathbf{c}, \mathbf{y}^{(k)}) = f_{\phi}(\mathbf{c}, \mathbf{y}^{(k)}; \phi) $ Where:

  • $P_{\phi}(\mathbf{y} \mid \mathbf{c}, \mathbf{y}^{(k)})$ is the probability distribution over all possible clean sequences $\mathbf{y}$, conditioned on the prompt $\mathbf{c}$ and the current (partially masked/denoised) response $\mathbf{y}^{(k)}$, parameterized by $\phi$.

  • $f_{\phi}$ is the mask predictor network, typically a Transformer-based architecture, which learns to predict the original tokens from the noisy/masked input.

    The most likely clean sequence $\hat{\mathbf{y}}^{(0)}$ is then estimated by maximizing this probability: $ \hat{\mathbf{y}}^{(0)} = \underset{\mathbf{y} \in \mathcal{T}^L}{\arg\max}\, P_{\phi}(\mathbf{y} \mid \mathbf{c}, \mathbf{y}^{(k)}) $ where $\mathcal{T}^L$ denotes the set of all sequences of length $L$ composed of tokens from vocabulary $\mathcal{T}$.

A transition function $S$ is then applied to yield the next intermediate state $\mathbf{y}^{(k-1)}$. It selectively updates tokens in $\mathbf{y}^{(k)}$ based on the predicted clean sequence $\hat{\mathbf{y}}^{(0)}$, the original prompt $\mathbf{c}$, and the current step index $k$: $ \mathbf{y}^{(k-1)} = S(\hat{\mathbf{y}}^{(0)}, \mathbf{y}^{(k)}, \mathbf{c}, k) $. The specific strategy for $S$ can vary, for example confidence-based remasking (re-masking low-confidence tokens) or semi-autoregressive block updates (updating segments of tokens). This iterative process enables flexible generation but incurs high latency, since features are recomputed at every one of the $K$ denoising steps.
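The loop described above can be summarized in the short Python sketch below; mask_predictor and transition stand in for the model's prediction head and remasking strategy and are hypothetical placeholders, not the LLaDA API.

```python
MASK = "[MASK]"

def dllm_generate(prompt_tokens, L, K, mask_predictor, transition):
    """Iteratively denoise a fully masked response of length L over K steps."""
    y = [MASK] * L                                   # y^(K): fully masked response
    for k in range(K, 0, -1):
        y_hat = mask_predictor(prompt_tokens, y)     # argmax estimate of the clean sequence
        y = transition(y_hat, y, prompt_tokens, k)   # e.g. remask low-confidence tokens
    return y
```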

4.2.2. dLLM-Cache Overview

dLLM-Cache is a training-free caching framework designed to improve the inference efficiency of dLLMs by selectively updating intermediate computations. The framework makes a distinction between prompt and response features due to their differing dynamics.

For each Transformer layer ll within the dLLM, the framework stores four types of intermediate features:

  • $\mathbf{K}^{(l)}$: Key vectors for layer $l$.

  • $\mathbf{V}^{(l)}$: Value vectors for layer $l$.

  • $\mathbf{AttnOut}^{(l)}$: Attention output for layer $l$.

  • $\mathbf{FFNOut}^{(l)}$: Feedforward Network output for layer $l$.

    These features are stored in two separate caches:

  • Prompt Cache ($\mathcal{C}_p$): Stores features of the prompt tokens.

  • Response Cache ($\mathcal{C}_r$): Stores features of the response tokens.

    The caching behavior is governed by three hyperparameters:

  • prompt refresh interval ($K_p$): Determines how frequently prompt features are fully recomputed. A larger $K_p$ means less frequent recomputation.

  • response refresh interval ($K_r$): Determines how frequently response features are fully recomputed. A smaller $K_r$ implies more frequent full refreshes.

  • adaptive update ratio ($\rho \in [0, 1]$): Used during adaptive partial updates for response tokens. It specifies the fraction of response tokens (those deemed most dynamic) whose features are recomputed.

    The general workflow for dLLM-Cache during iterative steps ($k = K-1$ down to $1$) is:

  1. Prompt Management: If the current step $k$ is a multiple of $K_p$ (i.e., $k \equiv 0 \pmod{K_p}$), prompt features in $\mathcal{C}_p$ are fully recomputed. Otherwise, they are reused from the cache.

  2. Response Management: If the current step $k$ is a multiple of $K_r$ (i.e., $k \equiv 0 \pmod{K_r}$), response features in $\mathcal{C}_r$ are fully recomputed. Otherwise, an adaptive partial update strategy is employed, where only a fraction ($\rho$) of the most dynamic response tokens are recomputed (this branching is sketched in code after this overview).

  3. Computation: The layer's computation proceeds using either cached or freshly updated features.

    This differentiated and adaptive strategy allows dLLM-Cache to significantly reduce computational load by avoiding redundant computations while maintaining the model's generation quality.
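The per-step branching can be captured in a few lines; the sketch below uses assumed variable names and illustrative interval values, not the paper's released code.

```python
def cache_policy(k, K_p, K_r):
    """Return which caching branch to take at denoising step k."""
    refresh_prompt = (k % K_p == 0)
    refresh_response = (k % K_r == 0)
    if refresh_prompt and refresh_response:
        return "full_refresh"            # Algorithm 3
    if refresh_prompt:
        return "refresh_prompt_only"     # Algorithm 4
    if refresh_response:
        return "refresh_response_only"   # Algorithm 5
    return "adaptive_partial_update"     # Algorithm 6

# Illustrative intervals: prompt features recomputed rarely, response features
# fully refreshed every K_r steps, partial updates in between.
print(cache_policy(96, 100, 8))  # 'refresh_response_only'
```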

4.2.3. Prompt Cache Management

The input prompt $\mathbf{c}$ remains constant throughout the entire dLLM inference process across all denoising steps. Consequently, its corresponding intermediate features (Key, Value, Attention Output, FFN Output) exhibit very high temporal stability. dLLM-Cache capitalizes on this by employing a Long-Interval Prompt Caching strategy.

Initialization (at step $k = K$):

  • At the very first denoising step ($k = K$), all prompt-related features ($\mathbf{K}_p^{(l)}, \mathbf{V}_p^{(l)}, \mathbf{AttnOut}_p^{(l)}, \mathbf{FFNOut}_p^{(l)}$) for each Transformer layer $l$ are computed from scratch.
  • These computed prompt features are then stored in the Prompt Cache $\mathcal{C}_p$.

Subsequent Denoising Steps ($k = K-1$ down to $1$):

  • Full Recomputation: If the current step $k$ is a multiple of the prompt refresh interval $K_p$ (i.e., $k \equiv 0 \pmod{K_p}$), the prompt features in $\mathcal{C}_p$ are fully recomputed. This periodic recomputation ensures that the prompt representations are conditioned on the latest Key and Value features of the response tokens (which may have been updated in the same step if $k \equiv 0 \pmod{K_r}$, or via adaptive updates). This matters because prompt tokens still attend to response tokens, so when response tokens change, the prompt's attention output can also change slightly.

  • Cache Reuse: If $k$ is not a multiple of $K_p$ (i.e., $k \not\equiv 0 \pmod{K_p}$), the prompt features for the current layer $l$ are retrieved directly from $\mathcal{C}_p$; no recomputation is performed.

    This strategy dramatically reduces the computational load of processing the static prompt, especially when $K_p$ is set to a large value, allowing for significant speedups.

4.2.4. Response Cache with Adaptive Updates

In contrast to the static prompt, the response features $\mathbf{y}^{(k)}$ evolve dynamically across denoising steps. However, as the authors note, this evolution is often sparse: most tokens change gradually. dLLM-Cache manages the Response Cache $\mathcal{C}_r$ using a two-pronged approach:

  1. Full Refresh:

    • When the current step $k$ is a multiple of the response refresh interval $K_r$ (i.e., $k \equiv 0 \pmod{K_r}$), or at the very beginning of inference ($k = K$), all response features are recomputed from scratch.
    • These newly computed features ($\mathbf{K}$, $\mathbf{V}$, AttnOut, FFNOut for response tokens) fully update the Response Cache $\mathcal{C}_r$. This periodically resynchronizes the response features with the current intermediate state, preventing the information loss or drift that could accumulate from prolonged reuse of stale cached features.
  2. Adaptive Partial Update:

    • When $k$ is not a multiple of $K_r$ (i.e., $k \not\equiv 0 \pmod{K_r}$), an adaptive partial update is performed instead of a full recomputation: only the subset of response tokens whose features have changed significantly is updated. The process involves:
      • Identify Change (V-verify mechanism): For each response token $j$ at the current layer $l$, dLLM-Cache computes its Value (V) vector from the current input and calculates the cosine similarity $s_j$ between this newly computed vector $\mathbf{v}_{r,j}^{(l)}$ and its cached counterpart $\tilde{\mathbf{v}}_{r,j}^{(l)}$ from the previous step: $ s_j = \frac{(\mathbf{v}_{r,j}^{(l)})^\top \tilde{\mathbf{v}}_{r,j}^{(l)}}{\Vert \mathbf{v}_{r,j}^{(l)} \Vert \, \Vert \tilde{\mathbf{v}}_{r,j}^{(l)} \Vert} $ Where:

        • $\mathbf{v}_{r,j}^{(l)}$ is the Value vector for response token $j$ at layer $l$, computed at the current denoising step $k$.
        • $\tilde{\mathbf{v}}_{r,j}^{(l)}$ is the Value vector for response token $j$ at layer $l$, retrieved from the Response Cache $\mathcal{C}_r$ (i.e., from step $k+1$).
        • The superscript $\top$ denotes the transpose.
        • $\Vert \cdot \Vert$ denotes the Euclidean norm (magnitude) of a vector. The paper shows that changes in a token's Value vector (measured by cosine similarity) strongly correlate with changes in its subsequent Attention Output (AttnOut) and Feedforward Network Output (FFNOut) (Figure 2), which makes V-verify an efficient proxy for identifying tokens whose downstream features have changed.
      • Select Tokens: The framework selects the $\lfloor \rho L \rfloor$ tokens (where $L$ is the response length and $\rho$ is the adaptive update ratio) with the lowest cosine similarity $s_j$. These are considered the most dynamic and form the set $\mathcal{T}_{\mathrm{update}}$ (a code sketch of this selection appears after this list).

      • Recompute & Reuse: Only the features ($\mathbf{K}$, $\mathbf{V}$, AttnOut, FFNOut) of tokens in $\mathcal{T}_{\mathrm{update}}$ are recomputed for the current layer. For all other response tokens, features are reused directly from the Response Cache $\mathcal{C}_r$.

      • Update Cache: The recomputed features of tokens in $\mathcal{T}_{\mathrm{update}}$ overwrite their corresponding entries in $\mathcal{C}_r$; the entries of reused tokens are left unchanged.

        This adaptive update mechanism balances computational efficiency with generation quality: by focusing recomputation on the most dynamic parts of the response, dLLM-Cache avoids unnecessary calculation while still capturing the important feature changes.
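A compact PyTorch sketch of the V-verify selection step (tensor shapes and the helper name are assumptions; it shows only the token-selection logic, not the full layer update):

```python
import torch

def v_verify_select(V_new, V_cached, rho):
    """Return the indices of the floor(rho * L) response tokens whose Value vectors
    changed the most, i.e. have the lowest cosine similarity to their cached versions."""
    sim = torch.nn.functional.cosine_similarity(V_new, V_cached, dim=-1)  # (L,)
    n_update = int(rho * V_new.shape[0])
    # Lowest similarity = most dynamic tokens -> these get recomputed
    return torch.topk(-sim, k=n_update).indices

# Toy usage: 16 response tokens with dimension 32, update the 25% most-changed tokens
V_new, V_cached = torch.randn(16, 32), torch.randn(16, 32)
update_idx = v_verify_select(V_new, V_cached, rho=0.25)   # 4 indices
```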

4.2.5. Algorithmic Workflow

The overall process of dLLM-Cache is outlined in Algorithm 1, which calls several helper functions (Algorithms 2-6) for specific caching scenarios.

Algorithm 1 dLLM-Cache: Main Inference Algorithm

This algorithm orchestrates the entire denoising process, managing prompt and response caches across all steps and layers.

Input:

  • Prompt c\mathbf{c}

  • Initial masked sequence y(K)\mathbf{y}^{(K)} (e.g., all [MASK] tokens)

  • Total denoising steps KK

  • Prompt refresh interval KpK_p

  • Response refresh interval KrK_r

  • Adaptive update ratio ρ\rho

    Output: Final predicted sequence y^(0)\hat{\mathbf{y}}^{(0)}

  1. Initialize Caches (at step k=Kk=K):
    • Call InitializeCache (Algorithm 2) with c\mathbf{c} and y(K)\mathbf{y}^{(K)} to compute all initial features and populate Cp\mathcal{C}_p and Cr\mathcal{C}_r.
  2. Generate Initial Prediction and Transition:
    • Generate an initial prediction y^(0)\hat{\mathbf{y}}^{(0)} using the model fϕf_{\phi} based on the fully computed features.
    • Apply the transition function SS to get the first intermediate response state: y(K1)S(y^(0),y(K),c,K)\mathbf{y}^{(K-1)} \gets S(\hat{\mathbf{y}}^{(0)}, \mathbf{y}^{(K)}, \mathbf{c}, K).
  3. Iterative Denoising Loop (from k=K1k=K-1 down to 1):
    • For each denoising step kk:
      • x_layer_in is initialized as the concatenated input [c;y(k)][\mathbf{c}; \mathbf{y}^{(k)}] for the first layer.
      • Layer Loop: For each Transformer layer ll:
        • Determine refresh conditions:
          • refresh_prompt is true if k0(modKp)k \equiv 0 \pmod{K_p}.
          • refresh_response is true if k0(modKr)k \equiv 0 \pmod{K_r}.
        • Cache Strategy Branching:
          • If refresh_prompt AND refresh_response: Call FullRefresh (Algorithm 3) to recompute and update all features for both prompt and response.
          • If refresh_prompt AND NOT refresh_response: Call RefreshPromptOnly (Algorithm 4) to recompute prompt features and reuse response features.
          • If NOT refresh_prompt AND refresh_response: Call RefreshResponseOnly (Algorithm 5) to reuse prompt features and recompute response features.
          • If NOT refresh_prompt AND NOT refresh_response: Call AdaptiveUpdate (Algorithm 6) to reuse prompt features and perform adaptive partial updates for response features.
        • Update x_layer_in for the next layer with the output of the current layer.
      • Generate Prediction and Transition (after all layers for step kk):
        • Generate prediction y^(0)\hat{\mathbf{y}}^{(0)} using the model fϕf_{\phi} with the final layer output.
        • Apply transition function SS: y(k1)S(y^(0),y(k),c,k)\mathbf{y}^{(k-1)} \gets S(\hat{\mathbf{y}}^{(0)}, \mathbf{y}^{(k)}, \mathbf{c}, k).
  4. Return: Final predicted sequence y^(0)\hat{\mathbf{y}}^{(0)} at the end of the last step.
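As a rough sanity check on the schedule in Algorithm 1, the sketch below counts how many token-level layer computations the four branches perform over a run, under assumed hyperparameters and ignoring the small V-verify projection overhead. It is a back-of-the-envelope model, not the paper's FLOPs accounting.

```python
def simulate_savings(K=256, K_p=100, K_r=8, rho=0.25, M=128, L=128):
    """Illustrative count of token-level layer computations with and without dLLM-Cache
    (assumed hyperparameters; adaptive-branch V projections are ignored)."""
    baseline = K * (M + L)                  # every token recomputed at every step
    cached = M + L                          # step k = K: InitializeCache computes everything
    for k in range(K - 1, 0, -1):
        refresh_p, refresh_r = (k % K_p == 0), (k % K_r == 0)
        if refresh_p:
            cached += M                     # Algorithm 3 or 4: prompt recomputed
        if refresh_r:
            cached += L                     # Algorithm 3 or 5: response fully recomputed
        elif not refresh_p:
            cached += int(rho * L)          # Algorithm 6: adaptive partial update
    return baseline / cached

print(f"~{simulate_savings():.1f}x fewer token-level computations per layer")
```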

Algorithm 2 InitializeCache

This algorithm sets up the initial Prompt Cache (Cp\mathcal{C}_p) and Response Cache (Cr\mathcal{C}_r) at the beginning of the inference process (step k=Kk=K).

Input:

  • Prompt c\mathbf{c}

  • Initial masked sequence y(K)\mathbf{y}^{(K)}

  • Transformer network with LL layers

    Output: Initialized caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r

  1. Cache Structure Definition:
    • For each layer l{1,,L}l \in \{1, \ldots, L\}, initialize empty dictionaries for:
      • Prompt key-value cache: Cp[l][kv_cache]\mathcal{C}_p[l][\mathtt{kv\_cache}]
      • Prompt attention output cache: Cp[l][attn]\mathcal{C}_p[l][\mathtt{attn}]
      • Prompt FFN output cache: $\mathcal{C}_p[l][\mathtt{ffn}]$ (referred to as mlp in the paper's pseudocode)
      • Response key-value cache: Cr[l][kv_cache]\mathcal{C}_r[l][\mathtt{kv\_cache}]
      • Response attention output cache: Cr[l][attn]\mathcal{C}_r[l][\mathtt{attn}]
      • Response FFN output cache: $\mathcal{C}_r[l][\mathtt{ffn}]$ (referred to as mlp in the paper's pseudocode)
  2. Initial Caching (Step k=Kk=K):
    • Concatenate prompt and initial response: xin[c;y(K)]\mathbf{x}_{in} \gets [\mathbf{c}; \mathbf{y}^{(K)}]. This is the input for the first layer.
    • Layer Loop: For each layer ll:
      • Attention Block:
        • Apply LayerNorm to xin\mathbf{x}_{in}: xnormLayerNorm(xin)\mathbf{x}_{norm} \gets \mathrm{LayerNorm}(\mathbf{x}_{in}).
        • Project xnorm\mathbf{x}_{norm} to Query, Key, Value vectors: Q,K,VQ_proj(xnorm),K_proj(xnorm),V_proj(xnorm)\mathbf{Q}, \mathbf{K}, \mathbf{V} \gets \mathrm{Q\_proj}(\mathbf{x}_{norm}), \mathrm{K\_proj}(\mathbf{x}_{norm}), \mathrm{V\_proj}(\mathbf{x}_{norm}).
        • Split K, V for caching:
          • Prompt K, V: Kp,VpK1:c,V1:c\mathbf{K}_p, \mathbf{V}_p \gets \mathbf{K}_{1:|\mathbf{c}|}, \mathbf{V}_{1:|\mathbf{c}|}
          • Response K, V: Kr,VrKc+1:,Vc+1:\mathbf{K}_r, \mathbf{V}_r \gets \mathbf{K}_{|\mathbf{c}|+1:}, \mathbf{V}_{|\mathbf{c}|+1:}
          • Store them in respective caches: Cp[l][kv_cache]{Kp,Vp}\mathcal{C}_p[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_p, \mathbf{V}_p\}, Cr[l][kv_cache]{Kr,Vr}\mathcal{C}_r[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_r, \mathbf{V}_r\}.
        • Compute Attention using Q,K,V\mathbf{Q}, \mathbf{K}, \mathbf{V}: AttnOutAttention(Q,K,V)\mathbf{AttnOut} \gets \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}).
        • Split AttnOut for caching:
          • Prompt AttnOut: AttnOutpAttnOut1:c\mathbf{AttnOut}_p \gets \mathbf{AttnOut}_{1:|\mathbf{c}|}
          • Response AttnOut: AttnOutrAttnOutc+1:\mathbf{AttnOut}_r \gets \mathbf{AttnOut}_{|\mathbf{c}|+1:}
          • Store them: Cp[l][attn]AttnOutp\mathcal{C}_p[l][\mathtt{attn}] \gets \mathbf{AttnOut}_p, Cr[l][attn]AttnOutr\mathcal{C}_r[l][\mathtt{attn}] \gets \mathbf{AttnOut}_r.
        • Compute post-attention residual: hxin+AttnOut\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}.
      • FFN Block:
        • Apply LayerNorm to h\mathbf{h}: hnormLayerNorm(h)\mathbf{h}_{norm} \gets \mathrm{LayerNorm}(\mathbf{h}).
        • Compute FFN output: FFNOutFFN(hnorm)\mathbf{FFNOut} \gets \mathrm{FFN}(\mathbf{h}_{norm}).
        • Split FFNOut for caching:
          • Prompt FFNOut: FFNOutpFFNOut1:c\mathbf{FFNOut}_p \gets \mathbf{FFNOut}_{1:|\mathbf{c}|}
          • Response FFNOut: FFNOutrFFNOutc+1:\mathbf{FFNOut}_r \gets \mathbf{FFNOut}_{|\mathbf{c}|+1:}
          • Store them: Cp[l][ffn]FFNOutp\mathcal{C}_p[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_p, Cr[l][ffn]FFNOutr\mathcal{C}_r[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_r.
        • Compute final residual: xouth+FFNOut\mathbf{x}_{out} \gets \mathbf{h} + \mathbf{FFNOut}.
        • Update input for the next layer: xinxout\mathbf{x}_{in} \gets \mathbf{x}_{out}.
    • Return the populated caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r.
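The cache layout and the split-and-store pattern used throughout Algorithms 2-6 might look like the following minimal sketch. The dictionary keys mirror the pseudocode's kv_cache/attn/ffn naming; everything else is an assumption.

```python
def make_empty_caches(num_layers):
    """Per-layer stores for the four cached feature types, for prompt and response."""
    def empty():
        return [{"kv_cache": None, "attn": None, "ffn": None} for _ in range(num_layers)]
    return empty(), empty()                              # C_p, C_r

def split_and_store(C_p, C_r, l, K, V, attn_out, ffn_out, prompt_len):
    """Split full-sequence features at the prompt/response boundary and cache both halves."""
    C_p[l]["kv_cache"] = (K[:prompt_len], V[:prompt_len])
    C_r[l]["kv_cache"] = (K[prompt_len:], V[prompt_len:])
    C_p[l]["attn"], C_r[l]["attn"] = attn_out[:prompt_len], attn_out[prompt_len:]
    C_p[l]["ffn"], C_r[l]["ffn"] = ffn_out[:prompt_len], ffn_out[prompt_len:]
```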

Algorithm 3 dLLM-Cache: Case 1 - Full Refresh

This algorithm handles the scenario where both prompt and response features need to be fully refreshed. This happens when k0(modKp)k \equiv 0 \pmod{K_p} AND k0(modKr)k \equiv 0 \pmod{K_r}.

Input:

  • Layer input xin\mathbf{x}_{in} (output of layer l-1, or [c;y(k)][\mathbf{c}; \mathbf{y}^{(k)}] for l=1l=1)

  • Layer index ll

  • Caches Cp\mathcal{C}_p and Cr\mathcal{C}_r

    Output: Layer output xout\mathbf{x}_{out}, updated caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r

  1. Attention Block:
    • Compute LayerNorm of xin\mathbf{x}_{in}: xnormLayerNorm(xin)\mathbf{x}_{norm} \gets \mathrm{LayerNorm}(\mathbf{x}_{in}).
    • Project xnorm\mathbf{x}_{norm} to QQ, KK, VV: Q,K,VQ_proj(xnorm),K_proj(xnorm),V_proj(xnorm)\mathbf{Q}, \mathbf{K}, \mathbf{V} \gets \mathrm{Q\_proj}(\mathbf{x}_{norm}), \mathrm{K\_proj}(\mathbf{x}_{norm}), \mathrm{V\_proj}(\mathbf{x}_{norm}).
    • Split K, V for caching:
      • Kp,KrK1:c,Kc+1:\mathbf{K}_p, \mathbf{K}_r \gets \mathbf{K}_{1:|\mathbf{c}|}, \mathbf{K}_{|\mathbf{c}|+1:}
      • Vp,VrV1:c,Vc+1:\mathbf{V}_p, \mathbf{V}_r \gets \mathbf{V}_{1:|\mathbf{c}|}, \mathbf{V}_{|\mathbf{c}|+1:}
      • Update prompt KV cache: Cp[l][kv_cache]{Kp,Vp}\mathcal{C}_p[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_p, \mathbf{V}_p\}.
      • Update response KV cache: Cr[l][kv_cache]{Kr,Vr}\mathcal{C}_r[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_r, \mathbf{V}_r\}.
    • Compute combined attention: AttnOutAttention(Q,K,V)\mathbf{AttnOut} \gets \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}).
    • Split AttnOut for caching:
      • AttnOutp,AttnOutrAttnOut1:c,AttnOutc+1:\mathbf{AttnOut}_p, \mathbf{AttnOut}_r \gets \mathbf{AttnOut}_{1:|\mathbf{c}|}, \mathbf{AttnOut}_{|\mathbf{c}|+1:}
      • Update prompt attention cache: Cp[l][attn]AttnOutp\mathcal{C}_p[l][\mathtt{attn}] \gets \mathbf{AttnOut}_p.
      • Update response attention cache: Cr[l][attn]AttnOutr\mathcal{C}_r[l][\mathtt{attn}] \gets \mathbf{AttnOut}_r.
    • Compute post-attention residual: hxin+AttnOut\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}.
  2. FFN Block:
    • Compute LayerNorm of h\mathbf{h}: hnormLayerNorm(h)\mathbf{h}_{norm} \gets \mathrm{LayerNorm}(\mathbf{h}).
    • Compute combined FFN output: FFNOutFFN(hnorm)\mathbf{FFNOut} \gets \mathrm{FFN}(\mathbf{h}_{norm}).
    • Split FFNOut for caching:
      • FFNOutp,FFNOutrFFNOut1:c,FFNOutc+1:\mathbf{FFNOut}_p, \mathbf{FFNOut}_r \gets \mathbf{FFNOut}_{1:|\mathbf{c}|}, \mathbf{FFNOut}_{|\mathbf{c}|+1:}
      • Update prompt FFN cache: Cp[l][ffn]FFNOutp\mathcal{C}_p[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_p.
      • Update response FFN cache: Cr[l][ffn]FFNOutr\mathcal{C}_r[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_r.
    • Compute final residual: xouth+FFNOut\mathbf{x}_{out} \gets \mathbf{h} + \mathbf{FFNOut}.
  3. Return xout,Cp,Cr\mathbf{x}_{out}, \mathcal{C}_p, \mathcal{C}_r.

Algorithm 4 dLLM-Cache: Case 2 - Refresh Prompt Only

This algorithm is executed when prompt features need refreshment (k0(modKp)k \equiv 0 \pmod{K_p}) but response features do not (k≢0(modKr)k \not\equiv 0 \pmod{K_r}).

Input:

  • Layer input xin\mathbf{x}_{in}

  • Layer index ll

  • Caches Cp\mathcal{C}_p and Cr\mathcal{C}_r

    Output: Layer output xout\mathbf{x}_{out}, updated caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r

  1. Retrieve Response Features from Cache:
    • Get cached response KV: {Kr,Vr}Cr[l][kv_cache]\{\mathbf{K}_r, \mathbf{V}_r\} \gets \mathcal{C}_r[l][\mathtt{kv\_cache}].
    • Get cached response attention output: AttnOutrCr[l][attn]\mathbf{AttnOut}_r \gets \mathcal{C}_r[l][\mathtt{attn}].
    • Get cached response FFN output: FFNOutrCr[l][ffn]\mathbf{FFNOut}_r \gets \mathcal{C}_r[l][\mathtt{ffn}].
  2. Compute Fresh Prompt Features:
    • Extract prompt input part: xp_inxin,1:c\mathbf{x}_{p\_in} \gets \mathbf{x}_{in, 1:|\mathbf{c}|}.
    • Compute LayerNorm of prompt input: xp_normLayerNorm(xp_in)\mathbf{x}_{p\_norm} \gets \mathrm{LayerNorm}(\mathbf{x}_{p\_in}).
    • Project xp_norm\mathbf{x}_{p\_norm} to QQ, KK, VV for prompt: QpQ_proj(xp_norm)\mathbf{Q}_p \gets \mathrm{Q\_proj}(\mathbf{x}_{p\_norm}); KpK_proj(xp_norm)\mathbf{K}_p \gets \mathrm{K\_proj}(\mathbf{x}_{p\_norm}); VpV_proj(xp_norm)\mathbf{V}_p \gets \mathrm{V\_proj}(\mathbf{x}_{p\_norm}).
    • Update prompt KV cache: Cp[l][kv_cache]{Kp,Vp}\mathcal{C}_p[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_p, \mathbf{V}_p\}.
  3. Compute Attention with Mixed Features:
    • Combine newly computed prompt K, V with cached response K, V: K[Kp;Kr]\mathbf{K} \gets [\mathbf{K}_p; \mathbf{K}_r], V[Vp;Vr]\mathbf{V} \gets [\mathbf{V}_p; \mathbf{V}_r].
    • Compute attention only for prompt queries using combined K, V: AttnOutpAttention(Qp,K,V)\mathbf{AttnOut}_p \gets \mathrm{Attention}(\mathbf{Q}_p, \mathbf{K}, \mathbf{V}).
    • Update prompt attention cache: Cp[l][attn]AttnOutp\mathcal{C}_p[l][\mathtt{attn}] \gets \mathbf{AttnOut}_p.
    • Combine prompt and (cached) response attention outputs: AttnOut[AttnOutp;AttnOutr]\mathbf{AttnOut} \gets [\mathbf{AttnOut}_p; \mathbf{AttnOut}_r].
    • Compute post-attention residual: hxin+AttnOut\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}.
  4. FFN Block:
    • Split post-attention state: hp,hrh1:c,hc+1:\mathbf{h}_p, \mathbf{h}_r \gets \mathbf{h}_{1:|\mathbf{c}|}, \mathbf{h}_{|\mathbf{c}|+1:}.
    • Compute LayerNorm for prompt part: hp_normLayerNorm(hp)\mathbf{h}_{p\_norm} \gets \mathrm{LayerNorm}(\mathbf{h}_p).
    • Compute FFN for prompt part: FFNOutpFFN(hp_norm)\mathbf{FFNOut}_p \gets \mathrm{FFN}(\mathbf{h}_{p\_norm}).
    • Update prompt FFN cache: Cp[l][ffn]FFNOutp\mathcal{C}_p[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_p.
    • Combine prompt and (cached) response FFN outputs: FFNOut[FFNOutp;FFNOutr]\mathbf{FFNOut} \gets [\mathbf{FFNOut}_p; \mathbf{FFNOut}_r].
    • Compute final residual: xouth+FFNOut\mathbf{x}_{out} \gets \mathbf{h} + \mathbf{FFNOut}.
  5. Return xout,Cp,Cr\mathbf{x}_{out}, \mathcal{C}_p, \mathcal{C}_r.
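The essential trick in this branch is that freshly computed prompt queries attend over the concatenation of fresh prompt K/V and cached response K/V, while the response half of the layer output comes straight from the cache. A single-head PyTorch sketch follows (weight and tensor names are assumptions, not the paper's code):

```python
import torch
import torch.nn.functional as F

def refresh_prompt_attention(x_prompt_norm, W_q, W_k, W_v, K_r_cached, V_r_cached):
    """Recompute prompt Q/K/V, concatenate with cached response K/V, and attend
    only for the prompt queries (sketch of Algorithm 4's attention step)."""
    Q_p = x_prompt_norm @ W_q                     # (M, d)
    K_p = x_prompt_norm @ W_k
    V_p = x_prompt_norm @ W_v
    K = torch.cat([K_p, K_r_cached], dim=0)       # (M + L, d): fresh prompt + cached response
    V = torch.cat([V_p, V_r_cached], dim=0)
    scores = Q_p @ K.T / K.shape[-1] ** 0.5       # only prompt rows are recomputed
    attn_p = F.softmax(scores, dim=-1) @ V        # (M, d) fresh prompt attention output
    return attn_p, (K_p, V_p)

# Toy usage: prompt length 4, response length 6, dimension 8
M, L_resp, d = 4, 6, 8
x_p = torch.randn(M, d)
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
K_r_cached, V_r_cached = torch.randn(L_resp, d), torch.randn(L_resp, d)
attn_p, (K_p, V_p) = refresh_prompt_attention(x_p, W_q, W_k, W_v, K_r_cached, V_r_cached)
```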

Algorithm 5 dLLM-Cache: Case 3 - Refresh Response Only

This algorithm is executed when response features need refreshment (k0(modKr)k \equiv 0 \pmod{K_r}) but prompt features do not (k≢0(modKp)k \not\equiv 0 \pmod{K_p}).

Input:

  • Layer input xin\mathbf{x}_{in}

  • Layer index ll

  • Caches Cp\mathcal{C}_p and Cr\mathcal{C}_r

    Output: Layer output xout\mathbf{x}_{out}, updated caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r

  1. Retrieve Prompt Features from Cache:
    • Get cached prompt KV: {Kp,Vp}Cp[l][kv_cache]\{\mathbf{K}_p, \mathbf{V}_p\} \gets \mathcal{C}_p[l][\mathtt{kv\_cache}].
    • Get cached prompt attention output: AttnOutpCp[l][attn]\mathbf{AttnOut}_p \gets \mathcal{C}_p[l][\mathtt{attn}].
    • Get cached prompt FFN output: FFNOutpCp[l][ffn]\mathbf{FFNOut}_p \gets \mathcal{C}_p[l][\mathtt{ffn}].
  2. Compute Fresh Response Features:
    • Extract response input part: xr_inxin,c+1:\mathbf{x}_{r\_in} \gets \mathbf{x}_{in, |\mathbf{c}|+1:}.
    • Compute LayerNorm of response input: xr_normLayerNorm(xr_in)\mathbf{x}_{r\_norm} \gets \mathrm{LayerNorm}(\mathbf{x}_{r\_in}).
    • Project xr_norm\mathbf{x}_{r\_norm} to QQ, KK, VV for response: QrQ_proj(xr_norm)\mathbf{Q}_r \gets \mathrm{Q\_proj}(\mathbf{x}_{r\_norm}); KrK_proj(xr_norm)\mathbf{K}_r \gets \mathrm{K\_proj}(\mathbf{x}_{r\_norm}); VrV_proj(xr_norm)\mathbf{V}_r \gets \mathrm{V\_proj}(\mathbf{x}_{r\_norm}).
    • Update response KV cache: Cr[l][kv_cache]{Kr,Vr}\mathcal{C}_r[l][\mathtt{kv\_cache}] \gets \{\mathbf{K}_r, \mathbf{V}_r\}.
  3. Compute Attention with Mixed Features:
    • Combine cached prompt K, V with newly computed response K, V: K[Kp;Kr]\mathbf{K} \gets [\mathbf{K}_p; \mathbf{K}_r], V[Vp;Vr]\mathbf{V} \gets [\mathbf{V}_p; \mathbf{V}_r].
    • Compute attention only for response queries using combined K, V: AttnOutrAttention(Qr,K,V)\mathbf{AttnOut}_r \gets \mathrm{Attention}(\mathbf{Q}_r, \mathbf{K}, \mathbf{V}).
    • Update response attention cache: Cr[l][attn]AttnOutr\mathcal{C}_r[l][\mathtt{attn}] \gets \mathbf{AttnOut}_r.
    • Combine (cached) prompt and response attention outputs: AttnOut[AttnOutp;AttnOutr]\mathbf{AttnOut} \gets [\mathbf{AttnOut}_p; \mathbf{AttnOut}_r].
    • Compute post-attention residual: hxin+AttnOut\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}.
  4. FFN Block:
    • Split post-attention state: hp,hrh1:c,hc+1:\mathbf{h}_p, \mathbf{h}_r \gets \mathbf{h}_{1:|\mathbf{c}|}, \mathbf{h}_{|\mathbf{c}|+1:}.
    • Compute LayerNorm for response part: hr_normLayerNorm(hr)\mathbf{h}_{r\_norm} \gets \mathrm{LayerNorm}(\mathbf{h}_r).
    • Compute FFN for response part: FFNOutrFFN(hr_norm)\mathbf{FFNOut}_r \gets \mathrm{FFN}(\mathbf{h}_{r\_norm}).
    • Update response FFN cache: Cr[l][ffn]FFNOutr\mathcal{C}_r[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_r.
    • Combine (cached) prompt and response FFN outputs: FFNOut[FFNOutp;FFNOutr]\mathbf{FFNOut} \gets [\mathbf{FFNOut}_p; \mathbf{FFNOut}_r].
    • Compute final residual: xouth+FFNOut\mathbf{x}_{out} \gets \mathbf{h} + \mathbf{FFNOut}.
  5. Return xout,Cp,Cr\mathbf{x}_{out}, \mathcal{C}_p, \mathcal{C}_r.

Algorithm 6 dLLM-Cache: Case 4 - Adaptive Partial Update

This algorithm is triggered when neither prompt nor response features require a full refresh (k≢0(modKp)k \not\equiv 0 \pmod{K_p} AND k≢0(modKr)k \not\equiv 0 \pmod{K_r}). It performs the selective update for response tokens.

Input:

  • Layer input xin\mathbf{x}_{in}

  • Layer index ll

  • Caches Cp\mathcal{C}_p and Cr\mathcal{C}_r

  • Adaptive update ratio ρ\rho

    Output: Layer output xout\mathbf{x}_{out}, updated caches Cp,Cr\mathcal{C}_p, \mathcal{C}_r

  1. Retrieve Cached Prompt Features:
    • Get cached prompt KV: {Kp,Vp}Cp[l][kv_cache]\{\mathbf{K}_p, \mathbf{V}_p\} \gets \mathcal{C}_p[l][\mathtt{kv\_cache}].
    • Get cached prompt attention output: AttnOutpCp[l][attn]\mathbf{AttnOut}_p \gets \mathcal{C}_p[l][\mathtt{attn}].
    • Get cached prompt FFN output: FFNOutpCp[l][ffn]\mathbf{FFNOut}_p \gets \mathcal{C}_p[l][\mathtt{ffn}].
  2. Conditional Adaptive Update (if ρ>0\rho > 0):
    • Compute Current Response Value Projections:
      • Extract response input part: xr_inxin,c+1:\mathbf{x}_{r\_in} \gets \mathbf{x}_{in, |\mathbf{c}|+1:}.
      • Compute LayerNorm for response input: xr_normLayerNorm(xr_in)\mathbf{x}_{r\_norm} \gets \mathrm{LayerNorm}(\mathbf{x}_{r\_in}).
      • Compute new Value vectors for response tokens: VrnewV_proj(xr_norm)\mathbf{V}_r^{\mathrm{new}} \gets \mathrm{V\_proj}(\mathbf{x}_{r\_norm}). (Note: Queries Qr\mathbf{Q}_r and Keys Kr\mathbf{K}_r are not fully recomputed here for all response tokens, only for selected ones later).
    • Retrieve Cached Response Features:
      • Get cached response KV (specifically, the cached Value vectors for similarity comparison): {Kr,Vr}Cr[l][kv_cache]\{\mathbf{K}_r, \mathbf{V}_r\} \gets \mathcal{C}_r[l][\mathtt{kv\_cache}].
    • Compute Similarity to Identify Tokens Needing Update (V-verify):
      • For each response token index jj: $ s_j \gets \frac{(\mathbf{V}_r^{\mathrm{new}}[j])^\top \mathbf{V}_r[j]}{\Vert \mathbf{V}_r^{\mathrm{new}}[j] \Vert \Vert \mathbf{V}_r[j] \Vert} $ (This computes cosine similarity between the current and cached Value vector for token jj).
      • Identify Tupdate\mathcal{T}_{\mathrm{update}}: the indices of ρy(k)\lfloor \rho |\mathbf{y}^{(k)}| \rfloor tokens with the lowest sjs_j (i.e., most dissimilar, thus most changed).
    • Selective Computation for Selected Tokens:
      • Gather the corresponding LayerNorm inputs from xr_norm\mathbf{x}_{r\_norm} for tokens in Tupdate\mathcal{T}_{\mathrm{update}}: xr_norm_selected\mathbf{x}_{r\_norm\_selected}.
      • Compute Query and Key for these selected tokens: QrselectedQ_proj(xr_norm_selected)\mathbf{Q}_r^{\mathrm{selected}} \gets \mathrm{Q\_proj}(\mathbf{x}_{r\_norm\_selected}), KrselectedK_proj(xr_norm_selected)\mathbf{K}_r^{\mathrm{selected}} \gets \mathrm{K\_proj}(\mathbf{x}_{r\_norm\_selected}).
    • Update KV Cache with New Values:
      • Use ScatterUpdate to update the cached Kr\mathbf{K}_r with Krselected\mathbf{K}_r^{\mathrm{selected}} at indices Tupdate\mathcal{T}_{\mathrm{update}}: KrupdatedScatterUpdate(Kr,Tupdate,Krselected)\mathbf{K}_r^{\mathrm{updated}} \gets \mathrm{ScatterUpdate}(\mathbf{K}_r, \mathcal{T}_{\mathrm{update}}, \mathbf{K}_r^{\mathrm{selected}}).
      • The entire Vrnew\mathbf{V}_r^{\mathrm{new}} is used as the updated Value cache (meaning all Value vectors are effectively 'recomputed' for similarity check, but only a subset drives full recomputation). So, the kv_cache is set to {Krupdated,Vrnew}\{\mathbf{K}_r^{\mathrm{updated}}, \mathbf{V}_r^{\mathrm{new}}\}.
    • Compute Attention Only for Selected Tokens:
      • Combine cached prompt K, V with updated response K, V: $\mathbf{K} \gets [\mathbf{K}_p; \mathbf{K}_r^{\mathrm{updated}}]$, $\mathbf{V} \gets [\mathbf{V}_p; \mathbf{V}_r^{\mathrm{new}}]$.
      • Compute attention only for the selected response queries: $\mathbf{AttnOut}_r^{\mathrm{selected}} \gets \mathrm{Attention}(\mathbf{Q}_r^{\mathrm{selected}}, \mathbf{K}, \mathbf{V})$.
    • Update Response Attention Cache:
      • Retrieve the existing response attention outputs: $\mathbf{AttnOut}_r \gets \mathcal{C}_r[l][\mathtt{attn}]$.
      • Use ScatterUpdate to update $\mathbf{AttnOut}_r$ at $\mathcal{T}_{\mathrm{update}}$ with $\mathbf{AttnOut}_r^{\mathrm{selected}}$ to form $\mathbf{AttnOut}_r^{\mathrm{updated}}$.
      • Update the cache: $\mathcal{C}_r[l][\mathtt{attn}] \gets \mathbf{AttnOut}_r^{\mathrm{updated}}$.
    • Combine prompt and (updated) response attention outputs: $\mathbf{AttnOut} \gets [\mathbf{AttnOut}_p; \mathbf{AttnOut}_r^{\mathrm{updated}}]$.
    • Compute post-attention residual: $\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}$.
    • FFN Block (Adaptive):
      • Split post-attention state: $\mathbf{h}_p, \mathbf{h}_r \gets \mathbf{h}_{1:|\mathbf{c}|}, \mathbf{h}_{|\mathbf{c}|+1:}$.
      • Gather selected tokens from $\mathbf{h}_r$: $\mathbf{h}_r^{\mathrm{selected}}$.
      • Compute LayerNorm for selected response tokens: $\mathbf{h}_{r\_selected\_norm} \gets \mathrm{LayerNorm}(\mathbf{h}_r^{\mathrm{selected}})$.
      • Compute FFN output for selected response tokens: $\mathbf{FFNOut}_r^{\mathrm{selected}} \gets \mathrm{FFN}(\mathbf{h}_{r\_selected\_norm})$.
      • Update Response FFN Cache:
        • Retrieve existing response FFN outputs: $\mathbf{FFNOut}_r \gets \mathcal{C}_r[l][\mathtt{ffn}]$.
        • Use ScatterUpdate to update $\mathbf{FFNOut}_r$ at $\mathcal{T}_{\mathrm{update}}$ with $\mathbf{FFNOut}_r^{\mathrm{selected}}$ to form $\mathbf{FFNOut}_r^{\mathrm{updated}}$.
        • Update the cache: $\mathcal{C}_r[l][\mathtt{ffn}] \gets \mathbf{FFNOut}_r^{\mathrm{updated}}$.
      • Combine prompt and (updated) response FFN outputs: $\mathbf{FFNOut} \gets [\mathbf{FFNOut}_p; \mathbf{FFNOut}_r^{\mathrm{updated}}]$.
  3. Else ($\rho = 0$, pure cache retrieval):
    • If $\rho$ is 0, no adaptive updates are performed.
    • Retrieve cached response attention output: $\mathbf{AttnOut}_r \gets \mathcal{C}_r[l][\mathtt{attn}]$.
    • Combine prompt and response attention outputs: $\mathbf{AttnOut} \gets [\mathbf{AttnOut}_p; \mathbf{AttnOut}_r]$.
    • Compute post-attention residual: $\mathbf{h} \gets \mathbf{x}_{in} + \mathbf{AttnOut}$.
    • Retrieve cached response FFN output: $\mathbf{FFNOut}_r \gets \mathcal{C}_r[l][\mathtt{ffn}]$.
    • Combine prompt and response FFN outputs: $\mathbf{FFNOut} \gets [\mathbf{FFNOut}_p; \mathbf{FFNOut}_r]$.
  4. Final Output:
    • Compute final residual: $\mathbf{x}_{out} \gets \mathbf{h} + \mathbf{FFNOut}$.
    • Return $\mathbf{x}_{out}, \mathcal{C}_p, \mathcal{C}_r$.
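
To make the attention side of this control flow concrete, the sketch below re-expresses the V-verify selection and ScatterUpdate steps for a single layer in PyTorch-like code. It is a simplified illustration rather than the authors' released implementation: single-head attention, dictionary-based caches, and the module names (attn_norm, q_proj, k_proj, v_proj) are assumptions, and the analogous FFN-side partial update is omitted.

```python
import torch
import torch.nn.functional as F


def adaptive_response_update(x_in, prompt_len, rho, layer, cache_r, kv_prompt):
    """Partial recomputation of response tokens in one layer (attention side only).

    x_in:       (seq_len, d) hidden states entering the layer at the current step.
    prompt_len: number of prompt tokens |c|.
    rho:        fraction of response tokens to recompute.
    layer:      dict of modules: attn_norm, q_proj, k_proj, v_proj (assumed names).
    cache_r:    dict holding the cached response "K", "V", "attn" tensors.
    kv_prompt:  (K_p, V_p) cached prompt Key/Value tensors.
    """
    x_r = x_in[prompt_len:]                          # response part of the input
    x_r_norm = layer["attn_norm"](x_r)
    v_new = layer["v_proj"](x_r_norm)                # fresh Value vectors (cheap)

    # V-verify: cosine similarity between fresh and cached Value vectors.
    sim = F.cosine_similarity(v_new, cache_r["V"], dim=-1)          # (resp_len,)
    n_update = max(1, int(rho * x_r.shape[0]))
    update_idx = torch.topk(sim, n_update, largest=False).indices   # most-changed tokens

    # Recompute Q/K only for the selected tokens, then ScatterUpdate the Key cache.
    x_sel = x_r_norm[update_idx]
    q_sel = layer["q_proj"](x_sel)
    k_r = cache_r["K"].clone()
    k_r[update_idx] = layer["k_proj"](x_sel)
    cache_r["K"], cache_r["V"] = k_r, v_new          # Values are refreshed wholesale

    # Attention only for the selected response queries, over prompt + response K/V.
    K = torch.cat([kv_prompt[0], k_r], dim=0)
    V = torch.cat([kv_prompt[1], v_new], dim=0)
    attn_sel = torch.softmax(q_sel @ K.T / K.shape[-1] ** 0.5, dim=-1) @ V

    # ScatterUpdate the cached response attention outputs; untouched tokens keep
    # their reused outputs, analogous to the pure cache-retrieval branch above.
    attn_r = cache_r["attn"].clone()
    attn_r[update_idx] = attn_sel
    cache_r["attn"] = attn_r
    return attn_r, update_idx


if __name__ == "__main__":
    torch.manual_seed(0)
    d, prompt_len, resp_len = 64, 8, 16
    layer = {"attn_norm": torch.nn.LayerNorm(d),
             "q_proj": torch.nn.Linear(d, d),
             "k_proj": torch.nn.Linear(d, d),
             "v_proj": torch.nn.Linear(d, d)}
    cache_r = {"K": torch.randn(resp_len, d),
               "V": torch.randn(resp_len, d),
               "attn": torch.randn(resp_len, d)}
    kv_prompt = (torch.randn(prompt_len, d), torch.randn(prompt_len, d))
    x_in = torch.randn(prompt_len + resp_len, d)
    attn_r, idx = adaptive_response_update(x_in, prompt_len, rho=0.25,
                                           layer=layer, cache_r=cache_r,
                                           kv_prompt=kv_prompt)
    print(attn_r.shape, idx.tolist())   # (16, 64) and the 4 recomputed indices
```

The FFN-side update in the algorithm follows the same gather-compute-scatter pattern on the selected indices, so between refresh steps only roughly a $\rho$ fraction of response positions pays the full per-layer cost.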

4.2.6. Storage Overhead of Caching

The paper provides a proof for the storage overhead of dLLM-Cache in Appendix B.

Theorem: The storage overhead of caching in our method is $O(T \times d \times 4 \times L)$, where $T$ is the number of tokens, $d$ is the embedding dimension, and $L$ is the number of layers.

Proof:

  1. Memory per Layer: The dLLM-Cache method stores four types of intermediate features per Transformer layer: Key (K), Value (V), Attention Output (AttnOut), and Feedforward Network Output (FFNOut).

    • Each of these features (e.g., K, V, AttnOut, FFNOut for all tokens in a layer) has a size proportional to the total number of tokens ($T$) multiplied by the embedding dimension ($d$). This can be represented as $T \times d$.
    • Therefore, the memory required for each layer, denoted as $M_{\mathrm{layer}}$, is: $ M_{\mathrm{layer}} = 4 \times T \times d $ This accounts for the four distinct feature types stored for all tokens within a single layer.
  2. Total Memory for the Model: The entire Transformer model consists of $L$ layers. Since these features are cached for each layer, the total memory required for caching across all layers, $M_{\mathrm{total}}$, is the memory per layer multiplied by the number of layers: $ M_{\mathrm{total}} = L \times M_{\mathrm{layer}} = L \times 4 \times T \times d $

  3. Considering Data Precision: The paper states that bfloat16 precision is used, where each element (e.g., a single floating-point number in a vector) requires 2 bytes of memory. Incorporating this, the total memory in bytes would be: $ M_{\mathrm{total}} = 2 \times L \times 4 \times T \times d \quad \mathrm{bytes} $

  4. Asymptotic Analysis: In asymptotic analysis (which describes the growth rate of a function as the input size grows), constant factors are ignored. Dropping the constant factor of $2 \times 4 = 8$, the storage overhead grows as $O(L \times T \times d)$, which the paper writes with the per-layer feature count made explicit as $O(T \times d \times 4 \times L)$. This completes the proof, showing that the memory overhead scales linearly with the number of tokens, the embedding dimension, and the number of layers.
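
As a sanity check on this bound, the snippet below plugs in illustrative dimensions. The values $L = 32$ and $d = 4096$ are assumptions chosen to be roughly LLaMA-8B-scale, and $T = 512$ or $T = 1024$ are arbitrary example sequence lengths; none of these numbers are stated in the paper.

```python
def cache_bytes(num_tokens: int, d: int = 4096, layers: int = 32,
                bytes_per_elem: int = 2, features_per_layer: int = 4) -> int:
    # K, V, AttnOut, and FFNOut are cached for every token in every layer,
    # each stored in bfloat16 (2 bytes per element). Dimensions are illustrative.
    return bytes_per_elem * features_per_layer * layers * num_tokens * d


print(cache_bytes(512) / 2**30)    # 0.5  GiB for a 512-token sequence
print(cache_bytes(1024) / 2**30)   # 1.0  GiB for a 1024-token sequence
```

Under these assumed dimensions the bound lands in the same order of magnitude as the roughly 1 GB memory increase reported empirically in Table 3.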

5. Experimental Setup

5.1. Datasets

The authors evaluated dLLM-Cache on two representative diffusion-based Large Language Models (dLLMs):

  • LLaDA 8B [Nie et al., 2025]: An 8-billion parameter model.
  • Dream 7B [Ye et al., 2025]: A 7-billion parameter model.

Both models were tested in their Base and Instruct variants. The Instruct variant is typically fine-tuned to follow instructions and often performs better on specific tasks.

The experiments were conducted across a variety of standard benchmarks, categorized into "Mathematics & Science", "General Tasks", and "Code", to assess performance across different domains. The following are the benchmarks mentioned in the paper:

Mathematics & Science:

  • GSM8K [Cobbe et al., 2021]: A dataset of elementary school math word problems. It assesses a model's ability to reason and solve multi-step mathematical problems.
  • GPQA [Rein et al., 2023, not cited but standard]: Graduate-Level Google-Proof Q&A. A benchmark of expert-written, multiple-choice science questions (biology, physics, chemistry) that are deliberately hard to answer by simple lookup and therefore require complex reasoning.
  • Math: No specific citation is given; this most likely refers to the MATH benchmark [Hendrycks et al., 2021, not cited but standard] of competition-level mathematics problems requiring multi-step solutions.

General Tasks:

  • MMLU-pro [Wang et al., 2024, not cited but standard]: A harder extension of MMLU featuring more reasoning-intensive questions and ten answer options per question.
  • MMLU [Hendrycks et al., 2021, not cited but standard]: Massive Multitask Language Understanding. A benchmark designed to measure knowledge and reasoning abilities across 57 subjects, including humanities, social sciences, STEM, and more.
  • BBH [Suzgun et al., 2022, not cited but standard]: BIG-Bench Hard, a suite of 23 particularly challenging tasks drawn from the Beyond the Imitation Game Benchmark (BIG-Bench) [Srivastava et al., 2022], designed to test advanced reasoning capabilities of LLMs.

Code:

  • MBPP [Austin et al., 2021]: Mostly Basic Python Problems. A dataset of Python programming problems with test cases, used to evaluate code generation capabilities.

  • HumanEval [Chen et al., 2021, not cited but standard]: A benchmark for evaluating code synthesis models, consisting of programming problems with docstrings, function signatures, and test cases.

    Additionally, the LongBench-HotpotQA [Bai et al., 2023] task was mentioned in the discussion section to demonstrate the effectiveness on long-prompt scenarios. HotpotQA is a multi-hop question answering dataset that requires reasoning over multiple documents to answer questions.

The models were configured according to their original inference settings. The task-specific parameters are detailed in Table 4 of the appendix. All models used a low-confidence remasking strategy unless otherwise specified.

The following configuration settings are from Table 4 of the original paper:

| Task | Steps | Block Len | Gen Len | Few-shot |
| --- | --- | --- | --- | --- |
| MMLU | 3 | 3 | 3 | 5 |
| MMLU-pro | 256 | 256 | 256 | 0 |
| Hellaswag | 3 | 3 | 3 | 0 |
| ARC-C | 512 | 512 | 512 | 0 |
| GSM8K | 256 | 8 | 256 | 4 |
| Math | 256 | 256 | 256 | 0 |
| GPQA | 128 | 64 | 128 | 5 |
| HumanEval | 512 | 32 | 512 | 0 |
| MBPP | 512 | 32 | 512 | 3 |

  • Steps: The total number of denoising steps ($K$) used in the diffusion process for generation. More steps generally lead to better quality but higher latency.
  • Block Len: The length of the block of tokens that might be updated together in some remasking strategies (e.g., semi-autoregressive block updates).
  • Gen Len: The target total length of the generated response (number of tokens).
  • Few-shot: The number of example input-output pairs provided to the model during inference to guide its generation, common in in-context learning for LLMs. 0 indicates zero-shot inference.

5.2. Evaluation Metrics

The performance of dLLM-Cache was evaluated using a combination of efficiency metrics and task-specific quality metrics.

  1. Inference Efficiency Metrics:

    • Tokens Per Second (TPS):
      • Conceptual Definition: TPS measures the number of output tokens a model can generate or process per second. It is a direct indicator of inference speed or throughput. A higher TPS value implies faster generation and better efficiency.
      • Mathematical Formula: Not explicitly provided in the paper, but generally calculated as: $ \mathrm{TPS} = \frac{\text{Total number of generated tokens}}{\text{Total inference time (seconds)}} $
      • Symbol Explanation:
        • Total number of generated tokens: The sum of lengths of all responses generated in a given time frame.
        • Total inference time (seconds): The wall-clock time taken for the generation process, including all computational steps.
    • Floating Point Operations (FLOPs):
      • Conceptual Definition: FLOPs quantifies the total number of floating-point arithmetic operations (e.g., additions, multiplications, divisions) performed during the inference process. It serves as a hardware-agnostic measure of computational cost. Lower FLOPs indicate better computational efficiency. The paper uses TFLOPs, which means $10^{12}$ FLOPs.
      • Mathematical Formula: Not a single formula, but an aggregate count of all floating-point operations. For Transformer layers, FLOPs calculation involves matrix multiplications and other operations. For a single Transformer layer, the FLOPs scale roughly with $O(N^2 d + N d^2)$, where $N$ is the sequence length and $d$ is the embedding dimension.
      • Symbol Explanation: FLOPs is a cumulative count, not derived from a simple formula with variables. It's often estimated or measured by profiling.
    • Speed(TPS) / Speed(FLOPs): These are speedup ratios relative to the baseline (standard inference without dLLM-Cache). For TPS, Speed(TPS) is the ratio of TPS with dLLM-Cache to baseline TPS. For FLOPs, Speed(FLOPs) is the ratio of baseline FLOPs to FLOPs with dLLM-Cache (since lower FLOPs are better).
  2. Task Performance (Quality) Metrics:

    • Score / Accuracy:
      • Conceptual Definition: Accuracy is a common metric for classification or exact match tasks, measuring the proportion of correctly predicted instances out of the total instances. For benchmarks like GSM8K, MMLU, GPQA, Math, it typically represents the percentage of questions or problems for which the model provides the correct answer.
      • Mathematical Formula: $ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
      • Symbol Explanation:
        • Number of correct predictions: The count of instances where the model's output matches the ground truth.
        • Total number of predictions: The total number of instances evaluated.
    • F1 Score:
      • Conceptual Definition: The F1 score is the harmonic mean of precision and recall, often used in tasks like information extraction or question answering (e.g., HotpotQA) where both false positives and false negatives are important. It provides a single score that balances both metrics.
        • Precision measures the proportion of true positive results among all positive results (true positives + false positives).
        • Recall (or sensitivity) measures the proportion of true positive results among all relevant samples (true positives + false negatives).
      • Mathematical Formula: $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where: $ \mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}} $ $ \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
      • Symbol Explanation:
        • $\mathrm{TP}$: True Positives (correctly identified positive instances).
        • $\mathrm{FP}$: False Positives (incorrectly identified positive instances).
        • $\mathrm{FN}$: False Negatives (incorrectly identified negative instances).
        • F1: The F1 score, ranging from 0 to 1 (or 0% to 100%).
    • GPU Memory (GB):
      • Conceptual Definition: Measures the peak Graphics Processing Unit (GPU) memory usage during inference, reported in gigabytes (GB). This metric is crucial for understanding the hardware requirements and scalability of the method. Lower memory usage allows for larger models or batch sizes.

        All TPS and FLOPs measurements were performed on a single NVIDIA RTX 4090 GPU; the reported GPU memory figures correspond to peak usage observed during inference.
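
As a concrete illustration of how these metrics combine, the snippet below computes the speedup ratios from two number pairs that appear later in the results (the Table 3 TPS pair and the GPQA FLOPs pair), plus a toy F1 calculation; the raw TPS inputs and the confusion counts are made-up examples, not values from the paper.

```python
def tokens_per_second(total_tokens: int, total_seconds: float) -> float:
    return total_tokens / total_seconds


def speedup_tps(tps_cached: float, tps_baseline: float) -> float:
    return tps_cached / tps_baseline          # higher TPS is better


def speedup_flops(flops_baseline: float, flops_cached: float) -> float:
    return flops_baseline / flops_cached      # lower FLOPs is better


def f1_score(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


print(tokens_per_second(256, 10.2))   # e.g. 256 generated tokens in 10.2 s -> ~25.1 TPS
print(speedup_tps(25.02, 7.37))       # ~3.39x, matching the Table 3 TPS ratio
print(speedup_flops(22.07, 2.73))     # ~8.08x, matching the GPQA FLOPs ratio
print(f1_score(tp=8, fp=2, fn=4))     # ~0.727 on a toy confusion count
```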

5.3. Baselines

The primary baseline for comparison was the standard inference of the dLLM models (LLaDA 8B and Dream 7B) without any caching or acceleration techniques applied. This "baseline" represents the original, unoptimized performance of the dLLMs.

In one specific comparison (Table 3), LLaMA3 8B (a state-of-the-art autoregressive model of similar size) was included as an additional baseline to contextualize the performance of dLLMs (with and without dLLM-Cache) against the dominant ARM paradigm. This comparison is particularly relevant because the paper aims to bring dLLM inference latency closer to that of ARMs.

6. Results & Analysis

6.1. Core Results Analysis

The experiments consistently demonstrate that dLLM-Cache significantly improves inference efficiency for diffusion-based Large Language Models (dLLMs) while largely preserving output quality.

Overall Efficacy: dLLM-Cache achieves up to 9.1x speedup over standard inference without compromising output quality (often achieving "lossless" acceleration). This acceleration is measured by increased Tokens Per Second (TPS) and reduced Floating Point Operations (FLOPs). The method effectively brings dLLM inference latency closer to that of autoregressive models (ARMs) in many settings.

LLaDA 8B Results (Table 1)

The following are the results from Table 1 of the original paper:

| Task | Method | TPS↑ | Speed(TPS)↑ | FLOPs(T)↓ | Speed(FLOPs)↑ | Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Mathematics & Science | | | | | | |
| GSM8K | LLaDA Base | 7.32 | 1.00× | 16.12 | 1.00× | 69.06 |
| | + Cache (Kp = 100, Kr = 6) | 31.43 | 4.29× | 2.76 | 5.84× | 67.32 |
| | LLaDA Instruct | 23.19 | 1.00× | 3.21 | 1.00× | 70.66 |
| | + Cache (Kp = 25, Kr = 5) | 15.87 | 3.17× | 13.36 | 5.02× | 78.54 |
| GPQA | LLaDA Base | 5.12 | 1.00× | 22.97 | 1.00× | 31.91 |
| | + Cache (Kp = 100, Kr = 8) | 25.23 | 4.93× | 3.20 | 7.18× | 32.81 |
| | LLaDA Instruct | 5.33 | 1.00× | 22.07 | 1.00× | 29.01 |
| | + Cache (Kp = 50, Kr = 6) | 28.01 | 5.26× | 2.73 | 8.08× | 29.01 |
| Math | LLaDA Base | 8.31 | 1.00× | 14.11 | 1.00× | 30.84 |
| | + Cache (Kp = 50, Kr = 8) | 33.92 | 4.08× | 2.61 | 5.41× | 29.84 |
| | LLaDA Instruct | 23.65 | 1.00× | 5.16 | 1.00× | 22.32 |
| | + Cache (Kp = 50, Kr = 1) | 31.02 | 1.31× | 3.96 | 1.30× | 22.52 |
| General Tasks | | | | | | |
| MMLU-pro | LLaDA Base | 14.08 | 1.00× | 8.40 | 1.00× | 24.26 |
| | + Cache (Kp = 100, Kr = 6) | 45.75 | 3.25× | 2.15 | 3.91× | 24.69 |
| | LLaDA Instruct | 14.01 | 1.00× | 8.46 | 1.00× | 36.41 |
| | + Cache (Kp = 51, Kr = 3) | 39.63 | 2.83× | 2.62 | 3.23× | 36.08 |
| MMLU | LLaDA Base | 8.09 | 1.00× | 14.56 | 1.00× | 63.99 |
| | + Cache (Kp = 100, Kr = 6) | 33.52 | 4.14× | 2.64 | 5.52× | 64.26 |
| | LLaDA Instruct | 10.12 | 1.00× | 11.85 | 1.00× | 61.24 |
| | + Cache (Kp = 100, Kr = 7) | 21.23 | 2.10× | 4.50 | 2.63× | 62.82 |
| BBH | LLaDA Base | 6.41 | 1.00× | 18.29 | 1.00× | 44.77 |
| | + Cache (Kp = 50, Kr = 6) | 27.90 | 4.35× | 3.09 | 5.92× | 45.04 |
| | LLaDA Instruct | 6.18 | 1.00× | 18.98 | 1.00× | 51.49 |
| | + Cache (Kp = 100, Kr = 5) | 27.55 | 4.46× | 3.08 | 6.16× | 51.98 |
| Code | | | | | | |
| MBPP | LLaDA Base | 7.87 | 1.00× | 14.91 | 1.00× | 40.80 |
| | + Cache (Kp = 50, Kr = 4) | 30.74 | 3.91× | 2.99 | 4.99× | 39.00 |
| | LLaDA Instruct | 24.61 | 1.00× | 3.07 | 1.00× | 40.60 |
| | + Cache (Kp = 25, Kr = 4) | 16.74 | 3.13× | 11.92 | 4.86× | 40.60 |
| HumanEval | LLaDA Base | 7.55 | 1.00× | 15.53 | 1.00× | 39.20 |
| | + Cache (Kp = 100, Kr = 5) | 31.73 | 4.20× | 2.80 | 5.55× | 39.60 |
| | LLaDA Instruct | 19.98 | 1.00× | 6.03 | 1.00× | 32.92 |
| | + Cache (Kp = 50, Kr = 5) | 51.96 | 2.60× | 2.04 | 2.96× | 32.31 |

  • Exceptional Speedup on GPQA: For LLaDA Instruct on GPQA, dLLM-Cache achieves an outstanding 8.08x speedup in FLOPs (from 22.07 TFLOPs to 2.73 TFLOPs) with lossless quality (29.01% score maintained). This is a strong indicator of its efficiency for complex reasoning tasks.
  • Consistent Gains Across Benchmarks: Across various tasks like GSM8K, MMLU-pro, MMLU, BBH, MBPP, and HumanEval, dLLM-Cache consistently delivers 2x to 5x speedups in TPS and 3x to 7x speedups in FLOPs for both Base and Instruct variants.
  • Quality Preservation: In most cases, the task performance (Score) either remains identical or shows only minor changes (e.g., LLaDA Base on GSM8K drops from 69.06% to 67.32%, while LLaDA Instruct on GSM8K improves from 70.66% to 78.54%). This demonstrates that the caching strategy does not significantly compromise output quality.
  • Variable $K_p, K_r$ Settings: The optimal prompt refresh interval ($K_p$) and response refresh interval ($K_r$) vary by task and model, highlighting the adaptive nature required for different contexts. For example, Math (LLaDA Instruct) uses $K_r = 1$, implying very frequent response refreshes, resulting in a lower speedup (1.31x) but still a reduction in FLOPs.

Dream 7B Results (Table 2)

The following are the results from Table 2 of the original paper:

| Task | Configuration | TPS↑ | Speed(TPS)↑ | FLOPs(T)↓ | Speed(FLOPs)↑ | Score↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Mathematics & Science | | | | | | |
| GSM8K | Dream Base | 6.36 | 1.00× | 19.59 | 1.00× | 76.95 |
| | + Cache (Kp = 100, Kr = 8) | 32.44 | 5.10× | 2.84 | 6.90× | 76.95 |
| | Dream Instruct | 6.39 | 1.00× | 19.57 | 1.00× | 77.55 |
| | + Cache (Kp = 25, Kr = 2) | 24.52 | 3.84× | 4.24 | 4.62× | 76.80 |
| GPQA | Dream Base | 5.80 | 1.00× | 21.66 | 1.00× | 33.92 |
| | + Cache (Kp = 100, Kr = 8) | 30.95 | 5.33× | 3.03 | 7.15× | 34.15 |
| | Dream Instruct | 5.53 | 1.00× | 22.63 | 1.00× | 34.38 |
| | + Cache (Kp = 10, Kr = 8) | 21.98 | 3.97× | 4.69 | 4.83× | 33.93 |
| Math | Dream Base | 9.40 | 1.00× | 13.31 | 1.00× | 38.68 |
| | + Cache (Kp = 100, Kr = 4) | 36.89 | 3.92× | 2.61 | 5.10× | 37.94 |
| | Dream Instruct | 8.85 | 1.00× | 14.11 | 1.00× | 38.28 |
| | + Cache (Kp = 50, Kr = 1) | 23.52 | 2.66× | 4.66 | 3.03× | 37.62 |
| General Tasks | | | | | | |
| MMLU-pro | Dream Base | 15.61 | 1.00× | 7.92 | 1.00× | 24.13 |
| | + Cache (Kp = 25, Kr = 2) | 35.86 | 2.30× | 2.89 | 2.74× | 23.86 |
| | Dream Instruct | 15.40 | 1.00× | 7.98 | 1.00× | 43.79 |
| | + Cache (Kp = 5, Kr = 1) | 23.98 | 1.56× | 4.77 | 1.67× | 43.96 |
| MMLU | Dream Base | 9.10 | 1.00× | 13.73 | 1.00× | 73.49 |
| | + Cache (Kp = 100, Kr = 2) | 31.07 | 3.41× | 3.27 | 4.20× | 73.20 |
| | Dream Instruct | 8.45 | 1.00× | 14.75 | 1.00× | 73.40 |
| | + Cache (Kp = 100, Kr = 8) | 38.01 | 4.50× | 2.42 | 6.10× | 73.42 |
| BBH | Dream Base | 7.24 | 1.00× | 17.25 | 1.00× | 52.25 |
| | + Cache (Kp = 25, Kr = 4) | 29.61 | 4.09× | 3.35 | 5.15× | 51.66 |
| | Dream Instruct | 6.98 | 1.00× | 17.90 | 1.00× | 57.07 |
| | + Cache (Kp = 10, Kr = 2) | 22.31 | 3.20× | 4.82 | 3.71× | 57.07 |
| Code | | | | | | |
| MBPP | Dream Base | 8.91 | 1.00× | 14.06 | 1.00× | 54.20 |
| | + Cache (Kp = 25, Kr = 8) | 35.69 | 4.01× | 2.66 | 5.29× | 54.20 |
| | Dream Instruct | 8.46 | 1.00× | 14.65 | 1.00× | 57.00 |
| | + Cache (Kp = 10, Kr = 8) | 29.77 | 3.52× | 3.33 | 4.40× | 56.80 |
| HumanEval | Dream Base | 21.43 | 1.00× | 5.68 | 1.00× | 58.53 |
| | + Cache (Kp = 5, Kr = 1) | 27.40 | 1.28× | 4.17 | 1.36× | 57.31 |
| | Dream Instruct | 17.88 | 1.00× | 6.84 | 1.00× | 57.92 |
| | + Cache (Kp = 50, Kr = 1) | 28.03 | 1.57× | 3.94 | 1.74× | 56.09 |

  • Strong Performance for Dream Base: For Dream Base on GSM8K, dLLM-Cache achieves a 6.90x speedup in FLOPs with perfect preservation of accuracy (76.95%). Similarly, on GPQA, it shows a 7.15x speedup with a slight increase in score (33.92% to 34.15%).
  • Consistent with LLaDA: The trends observed for LLaDA are largely replicated for Dream, indicating the generality of dLLM-Cache across different dLLM architectures. Speedups range from 1.28x to 5.33x in TPS and 1.36x to 7.15x in FLOPs.
  • Minor Quality Fluctuations: While mostly lossless, some tasks show minor accuracy drops (e.g., Dream Instruct on GSM8K from 77.55% to 76.80%) or slight increases. These small variations are common and generally acceptable given the substantial speedups.

Comparison with ARMs (Table 3)

The following are the results from Table 3 of the original paper:

| Method | TPS↑ | Speed(TPS)↑ | Accuracy↑ | GPU Memory (GB)↓ |
| --- | --- | --- | --- | --- |
| LLaMA3 8B [Dubey et al., 2024] | 47.73 | 1.00× | 49.05 | 16.06 |
| LLaDA* | 7.37 | 1.00× | 69.06 | 16.94 |
| + Cache* (Kp = 100, Kr = 6) | 25.02 | 3.39× | 67.32 | 17.93 |
| LLaDA† | 14.77 | 1.00× | 64.14 | 16.94 |
| + Cache† (Kp = 50, Kr = 8) | 49.15 | 3.33× | 62.32 | 17.93 |

  • The superscripts * and † denote LLaDA with 256 and 128 decoding steps, respectively.
  • dLLMs vs. ARMs on Accuracy: The baseline LLaDA* (7.37 TPS, 69.06% accuracy) significantly outperforms LLaMA3 8B (47.73 TPS, 49.05% accuracy) in accuracy on GSM8K. This highlights a potential strength of dLLMs in certain tasks, even at higher latency.
  • dLLM-Cache Closing the Latency Gap:
    • With dLLM-Cache (+ Cache*), LLaDA* achieves 25.02 TPS (3.39x speedup), making it much closer to LLaMA3 8B's TPS of 47.73, while still maintaining superior accuracy (67.32% vs 49.05%).
    • When using fewer denoising steps for LLaDA (LLaDA†, 128 steps), dLLM-Cache (+ Cache†) boosts TPS to 49.15, surpassing LLaMA3 8B's 47.73 TPS. This is a crucial finding, demonstrating that dLLM-Cache can enable dLLMs to achieve ARM-level inference speeds while retaining their accuracy advantages.
  • GPU Memory: The GPU memory usage for LLaMA3 8B is 16.06 GB. Baseline LLaDA uses 16.94 GB. With dLLM-Cache, LLaDA uses 17.93 GB. This indicates a moderate increase in memory overhead (around 1 GB) for dLLM-Cache, which is deemed acceptable given the substantial speedups and accuracy gains.

Effectiveness on Long-Prompt Scenarios

The benefits of dLLM-Cache are particularly notable for tasks involving long input prompts, such as document-based question answering.

  • On the LongBench-HotpotQA task, applying dLLM-Cache to LLaDA 8B Base resulted in a 9.1x speedup over the unaccelerated baseline.
  • More impressively, it also led to a performance improvement, with the F1 score increasing from 34.56 to 36.10. This highlights that Long-Interval Prompt Caching is maximally leveraged in such scenarios, not only reducing computations but potentially also improving context handling for extensive static prompts.

6.2. Ablation Studies / Parameter Analysis

Effect of Cache Refresh Intervals $K_p$ and $K_r$

The authors conducted an ablation study to analyze the impact of the prompt refresh interval ($K_p$) and response refresh interval ($K_r$) on efficiency and accuracy, using LLaDA 8B Instruct on the GSM8K dataset. The following are the results from Figure 4 of the original paper:

Figure 4: Effect of cache refresh intervals using LLaDA 8B Instruct. (a) Accuracy and TFLOPs as the prompt refresh interval $K_p$ varies, with $K_r = 1$ and $\rho = 0$. (b) Accuracy and TFLOPs as the response refresh interval $K_r$ varies, under two settings: a baseline with $K_p = 1, \rho = 0$ and the proposed method with $K_p = 50, \rho = 0.25$.

  • Figure 4(a): Effect of varying $K_p$ (with $K_r = 1$, $\rho = 0$)
    • This plot shows Accuracy vs. TFLOPs as $K_p$ increases.
    • As $K_p$ increases (meaning prompt features are refreshed less frequently), the TFLOPs significantly decrease, leading to greater efficiency.
    • Crucially, the accuracy remains stable even with very large $K_p$ values. This confirms the hypothesis that prompt tokens are highly static and their features can be cached for long intervals without degrading performance. Infrequent prompt updates are sufficient.
  • Figure 4(b): Effect of varying $K_r$ (under two settings)
    • This plot shows Accuracy vs. TFLOPs as $K_r$ increases (meaning response features are refreshed less frequently).
    • Gray Line (Baseline: $K_p = 1, \rho = 0$): This setting recomputes the prompt at every step (no prompt caching) and applies no adaptive updates to the response, so cached response features are reused unchanged between full refreshes. As $K_r$ increases, accuracy drops sharply, indicating that blindly caching response tokens for long intervals without selective updates leads to significant information loss and poor performance.
    • Orange/Blue Line (dLLM-Cache: $K_p = 50, \rho = 0.25$): This represents the proposed dLLM-Cache setting (from Table 1, LLaDA Instruct for GSM8K). In this setup, even as $K_r$ increases, the accuracy remains high while TFLOPs decrease, as shown in the sketch below. This demonstrates that the combination of long-interval prompt caching and adaptive partial updates for response tokens (guided by $\rho = 0.25$) effectively maintains accuracy even with less frequent full response refreshes. This validates the design choice of a differentiated and adaptive caching strategy.
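
A minimal sketch of how the two intervals studied in Figure 4 translate into a per-step decision is given below. The function name and the exact scheduling convention (refresh on steps whose counter is divisible by the interval) are illustrative assumptions, not the released implementation.

```python
def refresh_plan(step: int, K_p: int, K_r: int, rho: float):
    """Decide what to recompute at a given denoising step (illustrative)."""
    refresh_prompt = (step % K_p == 0)       # long-interval full prompt refresh
    refresh_response = (step % K_r == 0)     # short-interval full response refresh
    # Between full response refreshes, only a rho fraction of response tokens,
    # chosen by V-verify, is recomputed; rho = 0 would mean pure cache reuse.
    partial_ratio = 0.0 if refresh_response else rho
    return refresh_prompt, refresh_response, partial_ratio


# Example with the Figure 4(b) setting K_p = 50, rho = 0.25 and an
# illustrative K_r = 5:
for k in range(6):
    print(k, refresh_plan(k, K_p=50, K_r=5, rho=0.25))
```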

Effect of Update Ratio $\rho$ and Selection Strategy

This ablation investigates how the adaptive update ratio ($\rho$) and the method for selecting tokens to update impact performance. The study uses LLaDA 8B Instruct on GSM8K. The following are the results from Figure 5 of the original paper:

Figure 5: Effect of the token selection strategy on GSM8K using the LLaDA 8B Instruct model under varying update ratios $\rho$, plotted as accuracy versus TFLOPs. At a high update ratio ($\rho = 0.9$) performance is lossless, with accuracy close to the original 77.48% but at markedly higher TFLOPs.

  • Figure 5: Accuracy vs. TFLOPs for different token selection strategies under varying $\rho$
    • Strategies Compared:
      • V-verify (red line): Our proposed method using cosine similarity of Value (V) vectors.
      • K-verify (blue line): Using cosine similarity of Key (K) vectors.
      • Random Selection (green line): Randomly selecting a fraction $\rho$ of tokens to update.
    • Performance at High $\rho$: At $\rho = 0.9$ (meaning 90% of response tokens are recomputed), all methods achieve close to the original accuracy (77.48% for LLaDA 8B Instruct on GSM8K). This indicates that if a large enough fraction is updated, quality is preserved, but at a higher computational cost.
    • Superiority of Similarity-based Strategies: Both V-verify and K-verify consistently outperform random selection across a wide range of $\rho$ values. This confirms that intelligently identifying dynamic tokens is crucial for maintaining accuracy at lower computational costs. Random selection quickly degrades in accuracy as $\rho$ decreases.
    • V-verify vs. K-verify: V-verify (red line) generally achieves slightly higher accuracy for a given TFLOPs compared to K-verify (blue line), especially in the mid-range of $\rho$.
    • Optimal Trade-off at $\rho \approx 0.25$: The V-verify strategy achieves its highest accuracy (around 78.54%, which is actually higher than the baseline LLaDA Instruct score for GSM8K in Table 1) at $\rho \approx 0.25$. At this point, it still requires significantly fewer FLOPs than full recomputation (which would correspond to $\rho = 1$). This suggests that a moderate, targeted update of approximately 25% of response tokens strikes a favorable balance between efficiency and output quality.

Impact of Similarity Metric

The choice of similarity metric for V-verify was also analyzed.

  • On GSM8K with LLaDA 8B Instruct, cosine similarity achieved 78.54% accuracy.
  • In contrast, L2 distance (another common measure of vector dissimilarity) resulted in a significantly lower accuracy of 55.95%.
  • This clear difference demonstrates that cosine similarity is much more effective at capturing semantic change in Value vectors, making it the superior choice for identifying dynamic tokens. Hence, cosine similarity is adopted as the default metric in dLLM-Cache.
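
The intuition behind this gap can be seen with a toy example: cosine similarity is scale-invariant, so a Value vector that only changes in magnitude is still treated as "unchanged", whereas L2 distance flags it and may waste the limited update budget. The snippet below is an illustrative comparison, not an experiment from the paper.

```python
import torch
import torch.nn.functional as F

v_cached = torch.tensor([1.0, 0.0, 0.0])
v_rescaled = 2.0 * v_cached                 # same direction, larger norm
v_rotated = torch.tensor([0.0, 1.0, 0.0])   # different direction, same norm

for name, v_new in [("rescaled", v_rescaled), ("rotated", v_rotated)]:
    cos = F.cosine_similarity(v_new.unsqueeze(0), v_cached.unsqueeze(0)).item()
    l2 = torch.dist(v_new, v_cached).item()
    print(f"{name:9s} cosine={cos:.2f}  L2={l2:.2f}")

# rescaled: cosine=1.00, L2=1.00 -> cosine keeps it cached, L2 would flag it
# rotated:  cosine=0.00, L2=1.41 -> both measures flag a genuine change
```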

6.3. Discussion on Overheads

Storage Overhead of Caching

dLLM-Cache introduces additional memory requirements because it stores four types of intermediate features ($K$, $V$, AttnOut, FFNOut) per layer. The theoretical proof in Appendix B indicates that the storage overhead scales as $O(T \times d \times 4 \times L)$, where $T$ is the number of tokens, $d$ is the embedding dimension, and $L$ is the number of layers.

  • Empirical Observation (Table 3): On GSM8K with LLaDA 8B Base, the peak GPU memory usage without caching was 16.94 GB. With dLLM-Cache, this increased moderately to 17.93 GB.
  • Comparison to ARMs: For context, LLaMA3 8B (an ARM) used 16.06 GB.
  • Conclusion: The memory overhead introduced by dLLM-Cache is approximately 1 GB, which is considered moderate and acceptable given the significant inference speedups and superior accuracy compared to standard ARMs.

Cost of V-verify and the Fixed Update Overheads

While the V-verify mechanism itself (computing cosine similarity for Value vectors) is computationally lightweight, the overall practical speedup from adaptive partial updates is constrained by certain fixed operational overheads. The following are the results from Figure 6 of the original paper:

Figure 6: TPS versus $\rho$ for different values of $K_r$. A notable decrease in TPS persists at minimal $\rho$ values, highlighting fixed operational overheads associated with initiating any selective update.

  • Figure 6: TPS versus $\rho$
    • This plot shows that Tokens Per Second (TPS) generally decreases as the adaptive update ratio $\rho$ decreases and approaches zero.
    • A notable decrease in TPS persists even at minimal $\rho$ values (i.e., when very few tokens are being actively recomputed).
  • Explanation: This phenomenon highlights that initiating any selective recomputation (when $\rho > 0$) triggers certain non-negligible system-level latencies. These fixed overheads are not directly proportional to the number of tokens being updated. Examples include:
    • GPU kernel management: The overhead of launching new GPU kernels for specific computations.
    • Data movement: The time taken to transfer data between different memory locations (e.g., from main memory to GPU registers or caches).
    • Control flow logic: The computational cost of the branching and conditional operations in the algorithm itself (e.g., identifying tokens, gathering, scattering).
  • Implication: At very low $\rho$ values, these fixed overheads start to dominate any savings from reduced dynamic computation. Therefore, simply minimizing $\rho$ indefinitely does not guarantee a proportional increase in TPS.
  • Optimal $\rho$: An optimal adaptive update ratio $\rho$ must carefully balance these fixed costs against the benefits of reduced dynamic computation, while simultaneously preserving model quality. As shown in Figure 5, $\rho \approx 0.25$ typically offers an effective trade-off, optimizing overall efficiency and fidelity without being excessively impacted by these fixed overheads.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces dLLM-Cache, an innovative training-free and model-agnostic caching framework specifically designed to accelerate inference in diffusion-based Large Language Models (dLLMs). The core insight driving dLLM-Cache is the identification and exploitation of dual computational redundancies unique to dLLM inference: the quasi-static nature of prompt tokens and the sparse dynamism of response tokens across iterative denoising steps.

By combining a long-interval prompt caching strategy with an adaptive short-interval response caching mechanism (guided by the V-verify method using cosine similarity to identify the most dynamic tokens), dLLM-Cache achieves substantial efficiency gains. Extensive experiments on LLaDA 8B and Dream 7B demonstrate up to 9.1x speedup in inference throughput and significant FLOPs reduction, critically without compromising the quality of the generated text. A notable finding is that dLLM-Cache can bring dLLM inference latency to parity with, or even surpass, that of Autoregressive Models (ARMs) like LLaMA3 8B while retaining dLLMs' inherent accuracy advantages.

7.2. Limitations & Future Work

The authors openly acknowledge a current limitation of their work:

  • Model Scale: Existing open-source dLLMs are currently restricted to smaller scales (e.g., 8 billion parameters). This prevented dLLM-Cache from being evaluated on much larger models (e.g., 33B or 70B parameters).

    Based on this, they propose a clear direction for future work:

  • Larger Scale Evaluation: The authors believe that dLLM-Cache will deliver even greater acceleration benefits at larger model scales. This remains an important area for validation once larger open-source dLLMs become available. This suggests that the overheads (both computational and memory) of traditional dLLM inference become even more prohibitive at larger scales, making the proportional savings from caching potentially higher.

7.3. Personal Insights & Critique

Inspirations

  • Clever Observation of Redundancy: The paper's strength lies in its astute observation of the specific computational redundancies within dLLM inference—the static prompt and the partially dynamic response. This granularity of analysis, distinguishing between different parts of the input sequence and their temporal stability, is a significant departure from previous, more uniform caching strategies. It highlights that understanding the task-specific dynamics of a model is key to efficient optimization.
  • Adaptive and Training-Free Design: The adaptive caching approach, particularly the V-verify mechanism, is elegant. It uses a lightweight proxy (cosine similarity of Value vectors) to make intelligent decisions about recomputation, avoiding the need for costly retraining or complex learning-based caching policies. This makes the method practical and easily deployable.
  • Bridging the Performance Gap: The ability to bring dLLM inference speed on par with ARMs while maintaining, or even improving, accuracy is a critical achievement. It strengthens the case for dLLMs as viable alternatives to ARMs, especially if they offer architectural benefits (like addressing the "reversal curse"). This could encourage broader adoption and research into dLLMs.

Potential Issues, Unverified Assumptions, or Areas for Improvement

  • Generality of $\rho$ and $K_p, K_r$: While the paper suggests $\rho \approx 0.25$ is a good trade-off, and optimal $K_p, K_r$ values vary, the robustness of these hyperparameters across a much wider range of dLLM architectures, datasets, and generation lengths is not fully explored. Fine-tuning these parameters for every new application might still incur some effort. A more dynamic, possibly self-tuning, approach for these hyperparameters could be a future research direction.

  • "Fixed Operational Overheads": The paper acknowledges that "fixed operational overheads" limit speedup at very low ρ\rho. While these are system-level latencies (GPU kernel management, data movement), further investigation into their exact nature and potential mitigation strategies could unlock even greater efficiency gains. Are there hardware-aware optimizations or scheduling techniques that could reduce these fixed costs?

  • Complexity of Implementation: While training-free, the dLLM-Cache algorithm involves splitting, gathering, scattering, and conditional execution of computations within Transformer layers. This can add complexity to the implementation and might introduce minor performance overheads compared to a perfectly optimized, monolithic forward pass. The current performance figures already account for this, but it's a factor.

  • Information Loss in Partial Updates: Although the quality is largely preserved, the minor drops in accuracy for some tasks suggest that even V-verify might occasionally miss critical changes or that reusing stale features for a portion of tokens can have subtle impacts. Future work could explore more sophisticated "change detection" mechanisms or strategies to further minimize this potential information loss without sacrificing efficiency.

  • Bidirectional Context for Prompt Features: The paper states that prompt recomputation (when $k \equiv 0 \pmod{K_p}$) accounts for evolving response features. This is important. However, the core advantage of dLLMs' bidirectional attention means that prompt features are always influenced by the current state of the response. Reusing prompt features for long intervals, even if they are static, still means they are not "re-attending" to the latest evolving response within those intervals. While practical, this is an implicit trade-off being made for speed. The F1 score increase on LongBench-HotpotQA is interesting and might suggest that the prompt caching, by stabilizing certain features, inadvertently helps in some complex reasoning tasks. More analysis on this specific observation would be valuable.

    Overall, dLLM-Cache presents a highly valuable contribution to the practical deployment of dLLMs. It's a pragmatic and well-motivated solution to a critical bottleneck, demonstrating how careful analysis of model behavior can lead to significant performance improvements.
