
SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching

Published: 09/15/2025

TL;DR Summary

SpeCa accelerates diffusion transformers with a "Forecast-then-verify" framework. Inspired by speculative decoding in LLMs, it uses speculative sampling to predict and efficiently verify future denoising features, plus sample-adaptive computation. This yields significant speedups (up to 7.3x) with minimal quality degradation.

Abstract

Diffusion models have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. These models face two fundamental challenges: strict temporal dependencies preventing parallelization, and computationally intensive forward passes required at each denoising step. Drawing inspiration from speculative decoding in large language models, we present SpeCa, a novel 'Forecast-then-verify' acceleration framework that effectively addresses both limitations. SpeCa's core innovation lies in introducing Speculative Sampling to diffusion models, predicting intermediate features for subsequent timesteps based on fully computed reference timesteps. Our approach implements a parameter-free verification mechanism that efficiently evaluates prediction reliability, enabling real-time decisions to accept or reject each prediction while incurring negligible computational overhead. Furthermore, SpeCa introduces sample-adaptive computation allocation that dynamically modulates resources based on generation complexity, allocating reduced computation for simpler samples while preserving intensive processing for complex instances. Experiments demonstrate 6.34x acceleration on FLUX with minimal quality degradation (5.5% drop), 7.3x speedup on DiT while preserving generation fidelity, and a 79.84% VBench score at 6.1x acceleration for HunyuanVideo. The verification mechanism incurs minimal overhead (1.67%-3.5% of full inference costs), establishing a new paradigm for efficient diffusion model inference while maintaining generation quality even at aggressive acceleration ratios. Our code has been released on GitHub: https://github.com/Shenyi-Z/Cache4Diffusion


In-depth Reading


1. Bibliographic Information

  • Title: SpeCa: Accelerating Diffusion Transformers with Speculative Feature Caching
  • Authors: Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Fei Ren, Shaobo Wang, Kaixin Li, and Linfeng Zhang.
  • Affiliations: The authors are affiliated with several prominent institutions, including Shanghai Jiao Tong University, Shandong University, University of Electronic Science and Technology of China, The Hong Kong University of Science and Technology (Guangzhou), Tsinghua University, and the National University of Singapore.
  • Journal/Conference: The paper is slated for publication in the Proceedings of the 33rd ACM International Conference on Multimedia (MM '25). ACM Multimedia is a premier international conference in the field of multimedia, known for its high standards and impact.
  • Publication Year: 2025
  • Abstract: The paper introduces SpeCa, a "Forecast-then-verify" framework to accelerate diffusion models, which are computationally expensive. Inspired by speculative decoding in Large Language Models (LLMs), SpeCa predicts intermediate features for future denoising steps and uses a lightweight, parameter-free mechanism to verify these predictions. This allows it to skip full computations for accepted steps. The framework also features sample-adaptive computation, allocating more resources to complex samples and fewer to simpler ones. Experiments on models like FLUX, DiT, and HunyuanVideo show significant speedups (6-7x) with minimal quality loss, outperforming existing methods.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Diffusion models, despite their state-of-the-art performance in generating high-fidelity images and videos, are notoriously slow. Their inference process involves many sequential denoising steps, and each step requires a full, computationally intensive forward pass through a large neural network. This makes them unsuitable for real-time applications.
    • Existing Gaps: Previous acceleration techniques have critical limitations.
      1. Reduced-Step Samplers (e.g., DDIM): These methods reduce the number of steps but inevitably trade off generation quality for speed.
      2. Feature Caching Methods (e.g., ToCa, FORA): These reuse features from previous steps. However, their effectiveness diminishes as the time gap between steps increases, limiting high acceleration ratios.
      3. Forecasting Methods (e.g., TaylorSeer): These predict future features but lack a verification mechanism. This allows prediction errors to accumulate, leading to a catastrophic drop in quality, especially at high speeds.
    • Paper's Innovation: SpeCa introduces a "Forecast-then-verify" paradigm, adapting the successful concept of speculative decoding from LLMs to diffusion models. This approach not only predicts future states but crucially validates them, preventing error accumulation and enabling aggressive acceleration while maintaining high fidelity. It also introduces dynamic, sample-specific resource allocation.
  • Main Contributions / Findings (What):

    1. SpeCa Framework: The paper proposes a novel acceleration framework that combines feature forecasting with a lightweight verification step. This "Forecast-then-verify" approach overcomes the quality collapse problem seen in other methods at high acceleration ratios.
    2. Sample-Adaptive Computation Allocation: SpeCa dynamically adjusts the number of computational steps based on the generation complexity of each sample. Simpler samples are accelerated more, while complex ones receive the necessary computational budget, optimizing the trade-off between efficiency and quality.
    3. State-of-the-Art Performance: SpeCa demonstrates superior performance across multiple benchmarks. It achieves a 6.34x speedup on FLUX with only a 5.5% quality drop, a 7.3x speedup on DiT while maintaining quality, and a 6.1x speedup on the demanding HunyuanVideo model with a top-tier VBench score.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Diffusion Models (DMs): These are generative models that work in two stages. First, a "forward process" gradually adds Gaussian noise to a clean image until it becomes pure noise. Second, a "reverse process" learns to denoise the image step-by-step, starting from random noise to generate a new, clean image. Each denoising step requires the model to predict the noise added at that step.
    • Diffusion Transformers (DiT): A powerful architecture for diffusion models that replaces the commonly used U-Net backbone with a Transformer. Transformers are highly scalable and have proven effective at learning complex dependencies, leading to significant improvements in generation quality, especially for high-resolution images.
    • Speculative Decoding: A technique used to speed up inference in auto-regressive models like LLMs. It works by using a small, fast "draft" model to generate a sequence of candidate tokens. A large, accurate "target" model then checks all these candidates in parallel in a single forward pass. This breaks the strictly sequential nature of token-by-token generation, leading to significant speedups (see the sketch after this list).
  • Previous Works & Differentiation:

    • Sampling Timestep Reduction: Methods like DDIM and DPM-Solver create more efficient sampling trajectories to reduce the total number of steps from thousands to dozens. However, pushing this too far leads to significant quality degradation. SpeCa works orthogonally by reducing the computational cost within a given number of steps.
    • Feature Caching-based Acceleration: This is the most closely related area.
      • "Cache-then-Reuse" (e.g., FORA, ToCa, DuCa): These methods cache intermediate features (like attention maps or MLP outputs) from a previous timestep and reuse them for the current step, skipping parts of the computation. Their main flaw is that features from distant timesteps are often too dissimilar, making reuse ineffective for large acceleration factors.
      • "Cache-then-Forecast" (TaylorSeer): This method improves upon reuse by using a Taylor series to predict future features based on a history of past features. However, it blindly trusts its predictions. Without a verification step, small errors can compound over many skipped steps, leading to poor final results.
    • SpeCa's Key Difference: SpeCa builds on the "Cache-then-Forecast" idea but adds the crucial verification step. By checking each predicted feature's reliability with a lightweight error metric, it can dynamically decide whether to accept the prediction or fall back to a full computation. This prevents the error accumulation that plagues TaylorSeer and allows for much higher, yet safer, acceleration.
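To make the analogy concrete, here is a minimal, illustrative Python sketch of the greedy draft-and-verify loop behind speculative decoding. The `draft_model` and `target_model` callables are hypothetical stand-ins, not from the paper or any specific library (real implementations verify sampled tokens probabilistically and take a bonus token when every draft is accepted):

```python
def speculative_decode_step(draft_model, target_model, prefix, k=4):
    """One round of (greedy) speculative decoding, for illustration.

    draft_model(tokens)  -> one next-token id (cheap, called k times)
    target_model(tokens) -> list of k next-token ids, where entry i is the
                            greedy prediction given tokens[:len(prefix)+i],
                            all obtained from a single parallel forward pass.
    """
    # 1. Draft: the cheap model proposes k candidate tokens sequentially.
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_model(ctx)
        drafts.append(tok)
        ctx.append(tok)

    # 2. Verify: the large model scores all k positions at once.
    targets = target_model(list(prefix) + drafts)

    # 3. Accept drafts left-to-right until the first disagreement.
    accepted = []
    for draft_tok, target_tok in zip(drafts, targets):
        if draft_tok == target_tok:
            accepted.append(draft_tok)      # free token: no extra pass needed
        else:
            accepted.append(target_tok)     # fall back to the target's token
            break
    return accepted
```

SpeCa transplants this pattern to diffusion: the "draft" is a cheap feature predictor, and the "verify" step is a lightweight feature-error check instead of a token comparison.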

4. Methodology (Core Technology & Implementation)

SpeCa operates on a "Forecast-then-verify" cycle, which is visually depicted in Figure 1.

Figure 1: SpeCa's speculative execution workflow. The draft model predicts $N$ future timesteps ($t-1$ to $t-N$); lightweight verification checks activation errors. Steps are accepted sequentially until the error exceeds $\tau$ at $t-k$, where the prediction is rejected. Accepted steps are cached, and the target model resumes computation from $t-k-1$ to ensure fidelity.

The workflow consists of three main stages:

  • 1. Feature Forecasting (The "Draft Model"):

    • At a given timestep $t$, the model performs a full computation to get an accurate feature representation.
    • SpeCa then uses a lightweight predictor to forecast the features for several future timesteps ($t-1, t-2, \dots, t-k$).
    • The paper uses the TaylorSeer predictor as its draft model. This predictor uses a Taylor series expansion to estimate future features based on the current feature and its past derivatives (approximated using finite differences). This is a training-free and computationally cheap approach. The prediction formula is: $$\mathcal{F}_{\mathrm{pred}}(\boldsymbol{x}_{t-k}^{l}) = \mathcal{F}(\boldsymbol{x}_{t}^{l}) + \sum_{i=1}^{m} \frac{\Delta^{i}\mathcal{F}(\boldsymbol{x}_{t}^{l})}{i! \cdot N^{i}} (-k)^{i}$$
      • $\mathcal{F}_{\mathrm{pred}}(\boldsymbol{x}_{t-k}^{l})$: The predicted feature at the future timestep $t-k$ in layer $l$.
      • $\mathcal{F}(\boldsymbol{x}_{t}^{l})$: The accurately computed feature at the current timestep $t$.
      • $\Delta^{i}\mathcal{F}(\boldsymbol{x}_{t}^{l})$: The $i$-th order finite difference of the feature, which approximates its $i$-th derivative.
      • $m$: The order of the Taylor expansion.
      • $N$: The sampling interval.
      • $k$: The number of steps into the future to predict.
  • 2. Error Computation and Validation:

    • After predicting a feature, SpeCa must verify its quality. It does this by computing the actual feature (via a partial forward pass up to the validation layer) and comparing it to the prediction.
    • Relative Error: The paper uses a relative error metric to measure the deviation, which is more robust to changes in feature magnitudes across timesteps: $$e_{k} = \frac{\|\mathcal{F}_{\mathrm{pred}}(x_{t-k}^{l}) - \mathcal{F}(x_{t-k}^{l})\|_{2}}{\|\mathcal{F}(x_{t-k}^{l})\|_{2} + \varepsilon}$$
      • $e_{k}$: The relative error for the prediction at step $t-k$.
      • $\|\cdot\|_{2}$: The L2-norm (Euclidean distance).
      • $\varepsilon$: A small constant to prevent division by zero.
    • Sequential Validation: Verification happens sequentially for each predicted step ($t-1, t-2, \dots$). If the error $e_{k}$ is below a threshold $\tau_{t}$, the prediction is accepted and the verifier moves to the next step. If $e_{k} > \tau_{t}$, the prediction is rejected and the loop breaks; all subsequent predictions are also discarded.
    • Adaptive Threshold: The threshold $\tau_{t}$ is not fixed. It adapts with the timestep to allow more aggressive skipping in the early, noisy stages of generation and more conservative checks in the later, detail-oriented stages: $$\tau_{t} = \tau_{0} \cdot \beta^{\frac{T-t}{T}}$$
      • $\tau_{0}$: The initial base threshold.
      • $\beta \in (0, 1)$: A decay rate.
      • $T$: The total number of denoising steps.
  • 3. Resuming Computation:

    • After the verification process, the main model takes over. If $j$ steps were successfully accepted, the full computation is skipped for those steps, and the main model performs a full forward pass starting from timestep $t-j-1$. This keeps the generation trajectory accurate. (A minimal sketch of the full forecast-then-verify loop follows this list.)
  • Computational Complexity:

    • The theoretical acceleration ratio $S$ is given by: $$S = \frac{1}{1 - \alpha + \alpha \cdot \gamma}$$
      • $\alpha$: The acceptance rate of predicted steps (fraction of total steps that are skipped).
      • $\gamma$: The ratio of the verification cost to a full computation cost. Since verification is very lightweight (e.g., 1.67%-3.5% of a full pass), $\gamma$ is very small.
    • This shows that as the acceptance rate $\alpha$ approaches 1, the speedup approaches $1/\gamma$. For example, if $\gamma = 0.05$, the theoretical maximum speedup is 20x; at a more realistic $\alpha = 0.85$ with $\gamma = 0.03$, $S = 1/(0.15 + 0.0255) \approx 5.7$x.
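The pieces above fit together as follows. Below is a minimal, self-contained NumPy sketch of one forecast-then-verify round, intended as an illustration under stated assumptions rather than the authors' implementation: `full_forward` is a hypothetical callable standing in for the target model (in the actual method, verification only needs a partial pass up to the validation layer, which is why $\gamma$ stays small), and features are treated as flat vectors.

```python
import numpy as np
from math import factorial

def taylor_forecast(history, N, k, m=2):
    """Predict the feature k steps ahead with an m-th order Taylor expansion.

    history: features from fully computed timesteps, oldest first, spaced N
             steps apart; history[-1] is the current feature F(x_t).
             Requires len(history) >= m + 1.
    """
    diffs = [np.asarray(history[-1], dtype=float)]        # Delta^0 = F(x_t)
    level = [np.asarray(h, dtype=float) for h in history]
    for _ in range(m):
        # Backward finite differences approximate successive derivatives.
        level = [b - a for a, b in zip(level[:-1], level[1:])]
        diffs.append(level[-1])
    pred = diffs[0].copy()
    for i in range(1, m + 1):
        pred += diffs[i] / (factorial(i) * N**i) * (-k) ** i
    return pred

def relative_error(pred, actual, eps=1e-8):
    """e_k = ||F_pred - F||_2 / (||F||_2 + eps)."""
    return np.linalg.norm(pred - actual) / (np.linalg.norm(actual) + eps)

def adaptive_threshold(t, T, tau0=0.1, beta=0.5):
    """tau_t = tau0 * beta^((T - t) / T): looser early, stricter late."""
    return tau0 * beta ** ((T - t) / T)

def speca_round(full_forward, history, t, T, N=1, max_skips=4, m=2):
    """One forecast-then-verify round starting from timestep t.

    Returns the number of accepted (skipped) steps j; the target model
    then resumes full computation from timestep t - j - 1.
    """
    accepted = 0
    for k in range(1, max_skips + 1):
        pred = taylor_forecast(history, N, k, m)
        actual = full_forward(t - k)    # verification (partial pass in SpeCa)
        if relative_error(pred, actual) <= adaptive_threshold(t - k, T):
            accepted += 1               # accept and cache the predicted step
        else:
            break                       # reject; discard all later predictions
    return accepted
```

The `tau0` and `beta` defaults here are illustrative; the paper tunes them per model to trade speed against fidelity.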

5. Experimental Setup

  • Models and Datasets:

    • Text-to-Image: FLUX.1-dev model evaluated on 200 prompts from the DrawBench dataset.
    • Text-to-Video: HunyuanVideo model evaluated on 946 prompts from the VBench benchmark suite.
    • Class-Conditional Image Generation: DiT-XL/2 model evaluated on the ImageNet dataset (50,000 generated images across 1,000 classes).
  • Evaluation Metrics:

    1. ImageReward:
      • Conceptual Definition: A metric that predicts human preferences for text-to-image generations. It is a model trained on a large dataset of human ratings to score images based on their quality and alignment with the text prompt. Higher scores are better.
    2. GenEval:
      • Conceptual Definition: An object-focused framework for evaluating text-to-image alignment. It assesses whether the generated image accurately contains the objects, attributes, and relationships described in the prompt. Higher scores are better.
    3. VBench:
      • Conceptual Definition: A comprehensive benchmark for video generation models. It evaluates videos across 16 different dimensions, including temporal consistency, motion quality, object class, and aesthetic quality, providing a holistic score. Higher scores are better.
    4. Fréchet Inception Distance (FID):
      • Conceptual Definition: Measures the similarity between the distribution of generated images and the distribution of real images. It uses features from a pre-trained InceptionV3 network to compare the statistics (mean and covariance) of the two distributions. Lower FID is better, indicating the generated images are more realistic and diverse.
      • Mathematical Formula: $$\mathrm{FID}(x, g) = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right)$$
      • Symbol Explanation:
        • $\mu_x, \mu_g$: The mean feature vectors for real ($x$) and generated ($g$) images.
        • $\Sigma_x, \Sigma_g$: The covariance matrices of the feature vectors.
        • $\mathrm{Tr}(\cdot)$: The trace of a matrix.
    5. sFID (spatial-FID):
      • Conceptual Definition: A variant of FID that is more sensitive to the spatial arrangement of features in an image. It helps detect spatial artifacts or incorrect object layouts that standard FID might miss. Lower is better.
    6. Inception Score (IS):
      • Conceptual Definition: A metric that simultaneously evaluates the quality and diversity of generated images. High-quality images should have a clear, identifiable object (low entropy for the conditional label distribution $p(y|x)$), and a diverse set of images should cover many different classes (high entropy for the marginal label distribution $p(y)$). Higher is better.
      • Mathematical Formula: $$\mathrm{IS}(G) = \exp\left(\mathbb{E}_{x \sim p_g}\, D_{\mathrm{KL}}\big(p(y|x)\,\|\,p(y)\big)\right)$$
      • Symbol Explanation:
        • $x \sim p_g$: An image $x$ sampled from the generator $G$.
        • $p(y|x)$: The probability distribution of labels for image $x$, predicted by an Inception model.
        • $p(y)$: The marginal probability distribution of labels over the entire generated dataset.
        • $D_{\mathrm{KL}}$: The Kullback-Leibler divergence, which measures the difference between the two distributions. (Minimal sketches of FID and IS appear after the baselines list below.)
  • Baselines: The paper compares SpeCa against several leading acceleration methods, including:

    • Simple step reduction (e.g., using DDIM with fewer steps).
    • Caching methods: Δ-DiT, FORA, ToCa, DuCa, TeaCache.
    • Forecasting method: TaylorSeer.
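As a concrete illustration of the FID and IS formulas above, here is a minimal NumPy/SciPy sketch that computes both from pre-extracted statistics. It assumes InceptionV3 features and class probabilities have already been collected for each image; it is not the paper's evaluation code.

```python
import numpy as np
from scipy import linalg

def fid(real_feats, gen_feats):
    """FID from InceptionV3 feature matrices of shape (n_samples, dim)."""
    mu_x, mu_g = real_feats.mean(axis=0), gen_feats.mean(axis=0)
    sigma_x = np.cov(real_feats, rowvar=False)
    sigma_g = np.cov(gen_feats, rowvar=False)
    # Matrix square root of the product of the two covariance matrices.
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)
    covmean = covmean.real  # discard tiny imaginary parts from numerics
    return float(np.sum((mu_x - mu_g) ** 2)
                 + np.trace(sigma_x + sigma_g - 2.0 * covmean))

def inception_score(probs, eps=1e-12):
    """IS from per-image class probabilities p(y|x), shape (n, n_classes)."""
    p_y = probs.mean(axis=0, keepdims=True)   # marginal p(y)
    kl = np.sum(probs * (np.log(probs + eps) - np.log(p_y + eps)), axis=1)
    return float(np.exp(kl.mean()))           # exp(E[D_KL(p(y|x) || p(y))])
```

In practice IS is usually averaged over several splits of the generated set, and FID needs enough samples for stable covariance estimates (50,000 images in the DiT evaluation above).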

6. Results & Analysis

SpeCa consistently outperforms all baselines, especially at high acceleration ratios where other methods suffer from severe quality degradation.

  • Core Results:

    • Text-to-Image (FLUX): The results are shown in the table below (transcribed from Table 1). At a 6.34x speedup, SpeCa achieves an ImageReward of 0.9355 (only a 5.5% drop from baseline), whereas TaylorSeer drops to 0.8168 (a 17.5% drop) and other methods perform even worse. This highlights the critical role of the verification mechanism.


    Table 1 (Transcription): Quantitative comparison in text-to-image generation for FLUX on ImageReward

    | Method | Latency (s) ↓ | Speed ↑ | FLOPs (T) ↓ | Speed ↑ | ImageReward ↑ (DrawBench) | GenEval ↑ (Overall) |
    | --- | --- | --- | --- | --- | --- | --- |
    | FLUX.1[dev]: 50 steps | 17.84 | 1.00× | 3719.50 | 1.00× | 0.9898 (+0.0000) | 0.6752 (+0.0000) |
    | TaylorSeer (N = 9, O = 2) + [24] | 4.72 | 3.78× | 596.07 | 6.24× | 0.8168 (-0.1730) | 0.5380 (-0.1372) |
    | SpeCa | 3.81 | 4.68× | 586.93 | 6.34× | 0.9355 (-0.0543) | 0.5922 (-0.0830) |
    • Text-to-Video (HunyuanVideo): As seen in the table below (transcribed from Table 2), SpeCa achieves a 6.16x speedup while maintaining a VBench score of 79.84%, outperforming all other methods at similar or even lower acceleration ratios. This demonstrates its effectiveness on computationally intensive video models.


    Table 2 (Transcription): Quantitative comparison in text-to-video generation for HunyuanVideo on VBench

    | Method | Latency (s) ↓ | Speed ↑ | FLOPs (T) ↓ | Speed ↑ | VBench ↑ |
    | --- | --- | --- | --- | --- | --- |
    | 50 steps | 145.00 | 1.00× | 29773.0 | 1.00× | 80.66 |
    | TaylorSeer | 31.69 | 4.58× | 5359.1 | 5.56× | 79.78 |
    | SpeCa | 31.45 | 4.61× | 4834.8 | 6.16× | 79.84 |
    • Class-Conditional Generation (DiT): The results on ImageNet are particularly striking (transcribed from Table 3). At ~7x acceleration, SpeCa achieves an FID of 3.76-3.78, which is remarkably close to the baseline FID of 2.32. In contrast, other methods experience a complete quality collapse, with FID scores soaring to 15, 22, or even 133. Figure 2 visually confirms this superiority.


    Table 3 (Transcription): Quantitative comparison on class-to-image generation on ImageNet with DiT-XL/2

    | Method | Latency (s) ↓ | FLOPs (T) ↓ | Speed ↑ | FID ↓ | sFID ↓ | Inception Score ↑ |
    | --- | --- | --- | --- | --- | --- | --- |
    | DDIM, 50 steps | 0.995 | 23.74 | 1.00× | 2.32 (+0.00) | 4.32 (+0.00) | 241.25 (+0.00) |
    | TaylorSeer (N = 9, O = 4) | 0.571 | 3.34 | 7.10× | 5.55 (+3.23) | 8.45 (+4.13) | 191.19 (-50.06) |
    | SpeCa | 0.431 | 3.26 | 7.30× | 3.78 (+1.46) | 6.36 (+2.04) | 217.61 (-23.64) |

    Figure 2: Comparison of caching methods in terms of Inception Score (IS) and FID. SpeCa achieves superior performance, especially at high acceleration ratios.

  • Qualitative Analysis: Visual comparisons in Figures 4 and 5 show that SpeCa maintains high visual fidelity. While other methods produce artifacts like distorted clock faces, blurred textures, or deformed objects at high speeds, SpeCa's generations are nearly indistinguishable from the non-accelerated baseline.

    Figure 5: Text-to-image comparison: SpeCa achieves visual fidelity on par with FLUX.

  • Ablations / Parameter Sensitivity:

    • Validation Layer Selection: Figure 6 shows that the activation error in deeper network layers (e.g., Layer 27) has a much stronger correlation ($r = 0.842$) with the final image error compared to shallow layers. This provides strong empirical justification for using the final layer's features for verification, as it is the most reliable predictor of output quality.

      Figure 6: Strong correlation between errors at layer 27 and the final output, validating it as an effective monitoring point. This finding supports the validation strategy: final generation quality can be predicted efficiently by monitoring deep-layer feature errors, without computing the entire network. It also aligns with the paper's Taylor-expansion-based analysis of error propagation, confirming that deeper features have a more direct and deterministic influence on final output quality. Additionally, trajectory analysis in feature space confirms that SpeCa maintains evolution paths closely aligned with those of the full model (see Figure 9).

    • Hyperparameter Analysis: Figure 8 analyzes the effect of the base threshold $\tau_0$ and decay rate $\beta$. It confirms that a higher threshold leads to greater speed but lower quality, allowing users to tune the trade-off. The adaptive thresholding is shown to be effective at balancing efficiency and fidelity.

      Figure 8: Hyperparameter sensitivity analysis of SpeCa showing the effects of the base threshold ($\tau_0$) and decay rate ($\beta$) on computational efficiency and generation quality.

    • Trajectory Analysis: Figure 9 uses Principal Component Analysis (PCA) to visualize the feature-evolution trajectory during generation. The trajectory of SpeCa (red line) almost perfectly overlaps with that of the original, un-accelerated DiT model (blue line). In contrast, the trajectories of other methods diverge significantly, visually explaining their drop in quality (a minimal sketch of this analysis follows the figure caption below).

      Figure 9: Scatter plot of the trajectories of different diffusion acceleration methods after Principal Component Analysis (PCA). The figure illustrates how the trajectories evolve across different methods, highlighting their relative efficiencies in terms of feature evolution.
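This kind of trajectory plot can be reproduced by caching one intermediate feature per denoising step and projecting the sequences into a shared PCA basis. A minimal scikit-learn/matplotlib sketch, where `features_by_method` is a hypothetical dictionary of collected per-step features (not something the paper provides):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

def plot_feature_trajectories(features_by_method):
    """features_by_method: {method_name: array of shape (n_steps, feat_dim)}."""
    # Fit one shared 2-D PCA basis so all methods' paths are comparable.
    all_feats = np.concatenate(list(features_by_method.values()), axis=0)
    pca = PCA(n_components=2).fit(all_feats)
    for name, feats in features_by_method.items():
        xy = pca.transform(feats)            # (n_steps, 2) trajectory
        plt.plot(xy[:, 0], xy[:, 1], marker="o", markersize=2, label=name)
    plt.xlabel("PC 1"); plt.ylabel("PC 2")
    plt.legend(); plt.title("Feature-evolution trajectories (PCA)")
    plt.show()
```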

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces SpeCa, a novel framework that brings the "Forecast-then-verify" principle of speculative decoding to diffusion models. By combining a training-free feature predictor with a lightweight but rigorous verification step, SpeCa achieves significant acceleration (6-7x) without the quality collapse that plagues other methods. Its sample-adaptive nature further optimizes resource usage. As a plug-and-play module, it offers a practical and powerful solution to a major bottleneck in generative AI.

  • Limitations & Future Work: The authors propose exploring SpeCa's application to other generative modalities and investigating synergies with other acceleration techniques (e.g., quantization, pruning) to push efficiency even further.

  • Personal Insights & Critique:

    • Novelty and Impact: The paper's core strength is the elegant adaptation of an idea from the LLM domain to solve a critical problem in diffusion models. This cross-pollination of techniques is a powerful driver of innovation. The verification step is the "missing piece" that makes forecasting-based acceleration viable at high ratios.
    • Practicality: The framework is highly practical because it requires no retraining or fine-tuning of the original diffusion model. The predictor is based on mathematical principles (Taylor series), and the verifier is a simple error check, making it easy to integrate into existing pipelines.
    • Potential Weaknesses: The method's performance relies on the assumption that feature evolution is smooth enough to be approximated by a low-order Taylor series. While this holds for the models tested, it might be less effective for models with highly chaotic or discontinuous dynamics. The choice of hyperparameters ($\tau_0$, $\beta$) still requires some tuning to find the optimal balance for a given application.
    • Open Questions: The paper mentions sample-adaptive computation but could benefit from a deeper analysis of what characterizes a "simple" versus a "complex" sample. Is it related to the prompt, the image content (e.g., textures vs. flat areas), or the stage of the denoising process? Answering this could lead to even more intelligent resource allocation strategies.
