Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching

Published: 09/12/2025
TL;DR Summary

To accelerate slow Diffusion Transformers, ClusCa proposes spatial clustering of tokens within each timestep, computing only one representative token per cluster. This novel feature caching reduces token computation by over 90%, achieving up to 4.96x acceleration on models like FLUX.

Abstract

Diffusion transformers have gained significant attention in recent years for their ability to generate high-quality images and videos, yet still suffer from a huge computational cost due to their iterative denoising process. Recently, feature caching has been introduced to accelerate diffusion transformers by caching the feature computation in previous timesteps and reusing it in the following timesteps, which leverages the temporal similarity of diffusion models while ignoring similarity in the spatial dimension. In this paper, we introduce Cluster-Driven Feature Caching (ClusCa) as an orthogonal and complementary perspective to previous feature caching. Specifically, ClusCa performs spatial clustering on tokens in each timestep, computes only one token in each cluster and propagates its information to all the other tokens, which reduces the number of tokens by over 90%. Extensive experiments on DiT, FLUX and HunyuanVideo demonstrate its effectiveness in both text-to-image and text-to-video generation. Besides, it can be directly applied to any diffusion transformer without requirements for training. For instance, ClusCa achieves 4.96x acceleration on FLUX with an ImageReward of 99.49%, surpassing the original model by 0.51%. The code is available at https://github.com/Shenyi-Z/Cache4Diffusion.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Compute Only 16 Tokens in One Timestep: Accelerating Diffusion Transformers with Cluster-Driven Feature Caching
  • Authors: Zhixin Zheng, Xinyu Wang, Chang Zou, Shaobo Wang, and Linfeng Zhang. The authors are affiliated with Shanghai Jiao Tong University, University of Electronic Science and Technology of China, and Shandong University. Linfeng Zhang, the corresponding author, appears to lead a research group focused on model efficiency and acceleration.
  • Journal/Conference: The paper is submitted to the 33rd ACM International Conference on Multimedia (MM '25). ACM Multimedia is a premier international conference in the field of multimedia, known for its high standards and significant impact. Acceptance at this venue indicates a strong contribution to the field.
  • Publication Year: 2025 (as per the submission reference).
  • Abstract: The paper addresses the high computational cost of diffusion transformers used for image and video generation. While prior "feature caching" methods accelerate these models by reusing computations from previous timesteps (leveraging temporal similarity), they ignore similarities between different spatial regions (tokens) within the same timestep. The authors introduce Cluster-Driven Feature Caching (ClusCa), a method that performs spatial clustering on tokens. In each step, it computes only one representative token per cluster and propagates its features to the other tokens in the same cluster, reducing the number of computed tokens by over 90%. ClusCa is training-free and can be applied to any diffusion transformer. On the FLUX model, it achieves a 4.96x speedup while improving the ImageReward score, demonstrating its effectiveness.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Diffusion Transformers (DiTs), such as those used in state-of-the-art models like Sora and FLUX, produce stunningly high-quality images and videos. However, their iterative denoising process, which requires hundreds of passes through a large neural network, makes them computationally expensive and slow.
    • Existing Gaps: Current acceleration techniques, particularly feature caching, have successfully exploited temporal redundancy—the observation that features don't change much between consecutive denoising steps. These methods cache and reuse features across time. However, they completely overlook spatial redundancy—the fact that many tokens (patches of the image) within the same timestep are highly similar (e.g., different parts of a blue sky). This untapped source of redundancy represents a significant opportunity for further acceleration.
    • Fresh Angle: This paper introduces a novel perspective: what if we could accelerate inference by exploiting both temporal and spatial similarities simultaneously? The core innovation is to cluster similar tokens in the spatial dimension and compute updates for only a small, representative subset of them.
  • Main Contributions / Findings (What):

    • Spatial Similarity Analysis: The paper is the first to systematically investigate and demonstrate that tokens in diffusion transformers exhibit strong spatial clustering. Tokens within the same cluster not only have similar features but also follow similar evolutionary paths during the denoising process.
    • ClusCa Method: The paper proposes Cluster-Driven Feature Caching (ClusCa), a training-free algorithm that:
      1. Clusters all image tokens into a small number of groups (e.g., 16).
      2. In most timesteps, it only performs the expensive computation for a single token per cluster.
      3. It then intelligently updates the features of all other tokens by blending their old (cached) features with the new features from their cluster's representative token.
    • State-of-the-Art Performance: ClusCa significantly improves the trade-off between speed and quality. For example, it achieves a 4.96x speedup on the FLUX text-to-image model while actually improving the output quality (ImageReward score of 99.49%). This is particularly effective at high acceleration ratios where previous methods suffer from severe quality degradation. The method's effectiveness is demonstrated on diverse models including DiT, FLUX, and HunyuanVideo.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Diffusion Models: These are generative models that learn to create data, like images, by reversing a noise-adding process.
      • Forward Process: Starts with a clean image and gradually adds Gaussian noise over many timesteps ($T$) until only pure noise remains.
      • Backward Process: A neural network is trained to reverse this process. Starting from random noise, it iteratively removes the predicted noise at each timestep ($t$) to gradually recover a clean image. This iterative denoising is what makes diffusion models slow.
    • Diffusion Transformer (DiT): A powerful architecture for the denoising network in diffusion models. Instead of the commonly used U-Net, a DiT uses a Transformer, which has proven to be highly scalable and effective. It treats an image as a sequence of patches or "tokens," similar to how Transformers process words in a sentence.
    • Feature Caching: An acceleration technique based on the idea of temporal redundancy. Since the input to the denoising network at step $t$ is very similar to the input at step $t-1$, the intermediate features (activations) inside the network are also very similar. Feature caching methods compute these features once at a specific step, cache them, and then reuse them for the next few steps, skipping the expensive computation.
  • Previous Works:

    • Step Reduction: Methods like DDIM and DPM-Solver aim to reduce the total number of denoising steps required, but often at the cost of quality.
    • Temporal Feature Caching:
      • DeepCache first introduced this idea for U-Net based models.
      • FORA, Δ-DiT, ToCa, and DuCa adapted this concept for DiTs. These methods operate on a "cache-then-reuse" cycle. For example, in a cycle of $N$ steps, they perform a full computation at the first step and reuse the cached features for the next $N-1$ steps (a minimal sketch of this cycle appears after this list).
      • TaylorSeer improved upon this by "forecasting" what the features would be in future steps using a Taylor series expansion, rather than just reusing old ones. This helps reduce the error accumulated from caching.
  • Differentiation: ClusCa is fundamentally different from all previous feature caching methods because it introduces a spatial dimension to the caching strategy. While others treat each token independently and focus only on its evolution over time, ClusCa recognizes the relationships between tokens at a single point in time. It is orthogonal and complementary to previous methods. Instead of reusing a token's own past feature (temporal reuse), it also uses features from other spatially similar tokens computed in the current timestep (spatial reuse). This dual-reuse mechanism is the key innovation that allows ClusCa to maintain high quality even at extreme speedups.
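To make the cache-then-reuse cycle above concrete, here is a minimal, hypothetical sketch of a purely temporal caching loop in Python; `forward_features` and `denoise_from_features` are assumed placeholder APIs rather than the actual interface of FORA or any other method:

```python
def temporal_cached_sampling(model, x, timesteps, N=3):
    """Sketch of a cache-then-reuse cycle: full computation at the first
    timestep of each N-step cycle; cached features reused otherwise."""
    cache = None
    for step, t in enumerate(timesteps):
        if step % N == 0:
            # Cycle boundary: compute features for all tokens and cache them.
            cache = model.forward_features(x, t)      # hypothetical API
        # All other steps skip the transformer blocks and reuse the cache
        # (temporal reuse only; no information flows between tokens).
        x = model.denoise_from_features(cache, x, t)  # hypothetical API
    return x
```

ClusCa keeps this cycle structure but, on the reuse steps, additionally computes one representative token per spatial cluster and propagates its information, as detailed in Section 4.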

4. Methodology (Core Technology & Implementation)

The core idea of ClusCa is to exploit spatial redundancy by grouping similar tokens and only computing a few representatives. The method operates in cycles.

  • Principles & Observations: The method is built on two key empirical observations about tokens in DiTs:

    1. Intra-Cluster Similarity (Observation 1): Tokens can be grouped into clusters where members are highly similar to each other. As shown in Figure 2(a), the feature distance between tokens within the same cluster is about 100 times smaller than the average distance between all tokens. Furthermore, Figure 2(b) shows that tokens in the same cluster not only have similar features but also follow similar trajectories during the denoising process (a minimal sketch of this distance comparison follows these observations).

      Figure 1: Visualization of similarity in two dimensions. Figure 2: (a) The distributions of distance between tokens in the same cluster (1/2/3) and all the tokens in the current timestep, showing that tokens in the same cluster have significantly lower distance. (b) PCA visualization on the evolution of tokens in different clusters. Points on the same line denote the same token at different timesteps; tokens with the same color belong to the same cluster.

    2. Cluster Stability (Observation 2): The way tokens are grouped into clusters remains relatively stable across adjacent timesteps. As shown in Figure 5 from the paper, the clustering structure doesn't change drastically from one step to the next. This means the expensive clustering step doesn't need to be performed at every timestep, making the method efficient.
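To ground Observation 1, the distance comparison can be reproduced with a few lines; this is a minimal sketch assuming token features of shape `(num_tokens, dim)` extracted from a DiT layer, with scikit-learn's K-Means as an illustrative choice (the paper does not specify an implementation):

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.cluster import KMeans

def intra_vs_global_distance(features: np.ndarray, k: int = 16, seed: int = 0):
    """Compare the mean pairwise distance within each k-means cluster to the
    mean pairwise distance over all tokens; Observation 1 predicts the
    intra-cluster value to be far smaller (roughly 100x in the paper)."""
    labels = KMeans(n_clusters=k, random_state=seed, n_init=10).fit_predict(features)
    intra = np.mean([pdist(features[labels == c]).mean()
                     for c in range(k) if np.sum(labels == c) > 1])
    overall = pdist(features).mean()
    return intra, overall
```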

  • Steps & Procedures: ClusCa operates in "cache cycles" of length $N$.

    1. Full Calculation & Clustering (First Timestep of a Cycle):

      • At the first timestep of a cycle (e.g., step $t$), the model performs a full computation, processing all tokens as usual.
      • The output features from the final layer are then fed into a K-Means clustering algorithm. This algorithm groups all tokens into $K$ clusters based on feature similarity. For example, if there are 1024 tokens and $K=16$, each token is assigned to one of 16 clusters.
      • The computed features for all tokens are stored in a cache.
    2. Partial Calculation (Subsequent Timesteps in a Cycle):

      • For the next $N-1$ timesteps (from $t-1$ to $t-N+1$), the model performs a partial computation.
      • Instead of computing all tokens, it first randomly selects one token from each of the $K$ clusters.
      • Only these $K$ representative tokens are passed through the DiT network for computation. This drastically reduces the computational load (e.g., from 1024 tokens to just 16).
    3. Spatiotemporal Feature Reuse & Propagation:

      • This is the key step where all tokens get updated, even those that were not computed.
      • For the $K$ computed tokens, their entries in the cache are updated with their new, freshly computed features.
      • For the vast majority of uncomputed tokens, their features are updated using a weighted average that combines temporal and spatial information.
  • Mathematical Formulas & Key Details: The cache update rule for a token $x_i$ is the centerpiece of the method:

$$
C(x_i) = \begin{cases} \mathcal{F}(x_i), & i \in \mathcal{I}_{\mathrm{Compute}} \\ \gamma \cdot \mu_{(i)} + (1 - \gamma)\, C(x_i), & i \notin \mathcal{I}_{\mathrm{Compute}} \end{cases}
$$

Let's break this down:

    • $C(x_i)$: The feature cache for token $x_i$.
    • $\mathcal{F}(x_i)$: The result of passing token $x_i$ through the computationally expensive network layers.
    • $\mathcal{I}_{\mathrm{Compute}}$: The set of indices for the $K$ tokens that were selected for computation.
    • Top Case (Computed Tokens): If token $i$ is in the computed set, its cache is directly updated with the new feature $\mathcal{F}(x_i)$.
    • Bottom Case (Uncomputed Tokens): If token $i$ was not computed, its new cache value is a mix of two parts:
      • $(1-\gamma)\,C(x_i)$: This is the temporal reuse component. It is the token's own feature from the previous timestep, scaled by $(1-\gamma)$, and provides historical context.
      • $\gamma \cdot \mu_{(i)}$: This is the spatial reuse component. $\mu_{(i)}$ is the average of the newly computed features of the representative token(s) belonging to the same cluster as token $i$. This term propagates the new information from the computed representative to all its cluster-mates.
    • $\gamma$: The propagation ratio, a small hyperparameter that balances the influence of spatial reuse versus temporal reuse. The authors find that a small, non-zero $\gamma$ is optimal: with $\gamma=0$ the method reduces to temporal-only caching, while a too-large $\gamma$ discards historical information, which can harm quality.
    • $\mu_{(i)}$: The mean feature of the computed tokens in the same cluster as token $i$. Formally:

$$
\mu_{(i)} = \frac{\sum_{j \in \mathcal{I}_{\mathrm{Compute}}} \mathcal{F}(x_j)\,[\mathbf{I}_j = \mathbf{I}_i]}{\sum_{j \in \mathcal{I}_{\mathrm{Compute}}} [\mathbf{I}_j = \mathbf{I}_i]}
$$

where $\mathbf{I}_i$ is the cluster index of token $i$ and $[\cdot]$ is the Iverson bracket (1 if true, 0 if false). Since only one token is computed per cluster, this reduces to the feature of that single representative. A code sketch of one partial timestep follows below.
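Putting the update rule together with the clustering and selection steps, here is a minimal PyTorch sketch of one partial (non-boundary) timestep. `forward_tokens` is a hypothetical stand-in for running only the selected tokens through the expensive transformer blocks; the authors' released implementation in the linked repository may differ:

```python
import torch

def clusca_partial_step(model, x, t, cache, labels, K, gamma):
    """One non-boundary ClusCa timestep (sketch).

    cache:  (num_tokens, dim) features cached from the previous timestep.
    labels: (num_tokens,) cluster index per token from the k-means run at
            the first step of the cycle (clusters assumed non-empty).
    """
    # 1) Randomly pick one representative token per cluster.
    reps = []
    for c in range(K):
        members = torch.nonzero(labels == c, as_tuple=False).flatten()
        reps.append(members[torch.randint(len(members), (1,))].item())
    reps = torch.tensor(reps, device=cache.device)          # (K,)

    # 2) Run only the K representatives through the expensive layers.
    new_feats = model.forward_tokens(x[reps], t)            # hypothetical, (K, dim)

    # 3) Spatiotemporal update: with one computed token per cluster,
    #    mu_(i) is simply that cluster's fresh feature.
    mu = new_feats[labels]                                  # (num_tokens, dim)
    updated = gamma * mu + (1.0 - gamma) * cache            # uncomputed tokens
    updated[reps] = new_feats                               # computed tokens
    return updated
```

With, say, 1024 tokens and $K=16$, step 2 touches only 16 tokens per partial timestep, which is where the claimed >90% reduction in computed tokens comes from.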

5. Experimental Setup

  • Datasets:

    • ImageNet: A large-scale image classification dataset. Used for class-conditional image generation with DiT-XL/2 to generate 50,000 images at $256 \times 256$ resolution.
    • DrawBench: A set of 200 challenging text prompts designed to test the capabilities of text-to-image models. Used with FLUX.
    • VBench: A comprehensive benchmark with 946 prompts for evaluating text-to-video generation models. Used with HunyuanVideo.
  • Evaluation Metrics:

    1. FID (Fréchet Inception Distance):

      • Conceptual Definition: Measures the quality and diversity of generated images. It calculates the "distance" between the feature distribution of generated images and real images. A lower FID score is better, indicating the generated images are more realistic. A minimal code sketch of this computation follows the metric list below.
      • Mathematical Formula: $\mathrm{FID}(x, g) = \left\| \mu_x - \mu_g \right\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right)$
      • Symbol Explanation:
        • $\mu_x, \mu_g$: Mean of the feature vectors for real and generated images, respectively.
        • $\Sigma_x, \Sigma_g$: Covariance matrices of the feature vectors for real and generated images.
        • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
    2. sFID (spatial FID): A variant of FID that is more sensitive to spatial artifacts and object placement errors. Lower is better.

    3. ImageReward:

      • Conceptual Definition: A metric that uses a trained model to predict human preference for generated images. It scores how "good" an image is based on aesthetics and prompt alignment, mimicking human judgment. Higher is better.
    4. CLIP Score:

      • Conceptual Definition: Measures the semantic similarity between a given text prompt and a generated image. It uses the CLIP model to embed both the text and image into a shared space and calculates their cosine similarity. A higher CLIP score indicates better alignment with the prompt.
    5. VBench Score: A comprehensive score for video generation that aggregates multiple metrics evaluating aspects like video quality, temporal consistency, motion realism, and text-video alignment. Higher is better.
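Returning to the FID formula in metric 1, here is a minimal sketch computing it from precomputed feature statistics, using `scipy.linalg.sqrtm` for the matrix square root (a standard choice; the paper does not detail its evaluation pipeline):

```python
import numpy as np
from scipy.linalg import sqrtm

def fid_from_stats(mu_x, sigma_x, mu_g, sigma_g):
    """FID(x, g) = ||mu_x - mu_g||^2 + Tr(Sx + Sg - 2 (Sx Sg)^(1/2)),
    with (mu, sigma) the mean and covariance of Inception features."""
    diff = mu_x - mu_g
    covmean = sqrtm(sigma_x @ sigma_g)
    # sqrtm can return a tiny imaginary component due to numerical noise.
    if np.iscomplexobj(covmean):
        covmean = covmean.real
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```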

  • Baselines: The paper compares ClusCa against several strong baselines:

    • DDIM Step Reduction: Simply using fewer denoising steps. This is the most basic form of acceleration.
    • FORA, ToCa, DuCa, Δ-DiT: State-of-the-art temporal feature caching methods for DiTs.
    • TaylorSeer: An advanced feature forecasting method that represents the previous state-of-the-art in caching-based acceleration.
    • TeaCache: Another relevant caching-based acceleration method.

6. Results & Analysis

  • Core Results:

    • Text-to-Image (FLUX): As shown in the transcribed Table 1, ClusCa demonstrates a superior trade-off between speed and quality.

      • At a ~4.14x FLOPs speedup, ClusCa (N=5, O=2, K=16) achieves an ImageReward of 0.9961, significantly outperforming ToCa (0.9802), Δ-DiT (0.8721), and TaylorSeer (0.9857) at similar speedups.

      • Even more impressively, at a 4.96x speedup, ClusCa maintains an extremely high ImageReward of 0.9949, which is better than the original, unaccelerated model (0.9898). Other methods see a significant quality drop at this level of acceleration.

      • The qualitative results in Figure 6 show this clearly. In the "oil painting of a couple" example, only ClusCa correctly generates an umbrella with proper geometric structure at 4.96x acceleration, while other methods produce distorted or incomplete objects.

        Figure 6: Qualitative comparison on FLUX. ClusCa excels in generating complex scenes and produces images with significantly richer content.

    • Text-to-Video (HunyuanVideo): The transcribed Table 2 shows that ClusCa also excels in video generation.

      • At a 5.54x speedup, ClusCa achieves a VBench score of 79.96%, outperforming TaylorSeer (79.78%) and other caching methods.
      • At an even higher 6.21x speedup, it still maintains a strong score of 79.60%. This is crucial for video, where computational costs are even more prohibitive.
    • Class-to-Image (DiT): Figure 3 provides a clear visual summary of the results from the transcribed Table 3.

      • The plot shows that across all speedup ratios, ClusCa's curve (in dark blue) is consistently lower (better FID/sFID) than all other methods.

      • While other methods like ToCa and DuCa see their FID scores skyrocket beyond a 3.3x speedup, ClusCa's performance degrades much more gracefully, demonstrating its robustness to high acceleration. For instance, at a ~4.6x speedup, ClusCa has an FID of 3.20, whereas FORA is at 6.58.

        Figure 3: Comparison between existing feature caching methods and ClusCa on DiT with FID and sFID (lower is better).

  • Ablations / Parameter Sensitivity:

    • Propagation Ratio ($\gamma$): Figure 10 and the transcribed Table 4 show the impact of $\gamma$.

      • The FID score first improves as $\gamma$ increases from 0, then starts to degrade. This confirms the hypothesis that a balance is needed: $\gamma=0$ ignores the beneficial spatial information, while a large $\gamma$ discards too much temporal history.

      • An optimal, small, non-zero $\gamma$ provides the best results, validating the effectiveness of the spatial propagation mechanism.

        Figure 10: ClusCa with different propagation ratios.

    • Number of Clusters ($K$): The transcribed Table 5 shows that increasing $K$ (from 4 to 32) generally improves FID. This is expected, as more clusters mean more representative tokens are computed, and the intra-cluster similarity is higher. However, this comes at the cost of reduced speedup. The authors chose $K=16$ as a good balance for their experiments.

    • Cache Interval ($N$): The transcribed Table 8 shows that as the cache interval $N$ increases (from 3 to 7), the speedup increases, but the FID also increases (quality degrades). However, the paper notes that even with a large interval like $N=7$, ClusCa still outperforms TaylorSeer using a smaller interval of $N=6$, showcasing its robustness.

  • Visualization on Feature Trajectories: Figure 9 uses PCA to visualize the evolution of features over time.

    • The "DiT" trajectory represents the ground truth (no acceleration).

    • The trajectories of other methods diverge significantly from the ground truth, indicating error accumulation.

    • The ClusCa trajectory stays remarkably close to the ground truth, providing a clear visual confirmation that its update mechanism successfully minimizes error accumulation while accelerating computation.

      Figure 9: PCA visualization on the trajectories of features in different timesteps. "DiT" here denotes the original model without acceleration, which can be considered as the ground-truth trajectory.

  • Time Cost Analysis: Figure 11 shows that the overhead from the new components is minimal.

    • The Clustering step takes up only 2.9%-4.1% of the total time.

    • The Propagation step takes 5.6%-8.0%.

    • The vast majority of the time (around 90%) is still spent on the core network computation, which is exactly what the method aims to reduce. This demonstrates that the management overhead of ClusCa is small.

      Figure 11: Time consumption visualization of each part.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces ClusCa, a novel and highly effective framework for accelerating diffusion transformers. By being the first to systematically exploit spatial similarity in addition to the well-studied temporal similarity, ClusCa sets a new state-of-the-art in the trade-off between inference speed and generation quality. Its core mechanism—clustering tokens, computing a small set of representatives, and propagating their information—is shown to be robust, efficient, and broadly applicable across different models and tasks (image and video generation) without any need for retraining.

  • Limitations & Future Work:

    • The authors do not explicitly state limitations, but some can be inferred. The optimal number of clusters ($K$) and the propagation ratio ($\gamma$) are hyperparameters that may need tuning for different models, resolutions, or datasets.
    • The clustering is performed on the features of the last layer. It's possible that clustering at an intermediate layer could provide different or even better results.
    • The selection of a single random token per cluster is simple and effective, but a more intelligent selection strategy (e.g., choosing the token closest to the cluster centroid) could potentially yield further improvements (see the sketch below).
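For instance, the centroid-based selection mentioned above could look like the following speculative sketch (not something the paper evaluates):

```python
import torch

def nearest_to_centroid(features, labels, K):
    """Pick, per cluster, the token whose feature is closest to the cluster
    centroid, as an alternative to ClusCa's random per-cluster selection."""
    reps = []
    for c in range(K):
        members = torch.nonzero(labels == c, as_tuple=False).flatten()
        centroid = features[members].mean(dim=0, keepdim=True)   # (1, dim)
        dists = torch.cdist(features[members], centroid).squeeze(1)
        reps.append(members[dists.argmin()])
    return torch.stack(reps)
```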
  • Personal Insights & Critique:

    • Simplicity and Elegance: The core idea of ClusCa is intuitive and powerful. The observation that tokens representing similar visual content (like patches of the sky) can be treated as a group is a fundamental insight that was overlooked by prior work. The implementation via K-Means and a simple weighted average is elegant.

    • High Impact Potential: As diffusion transformers become larger and more ubiquitous (e.g., for video generation), efficient inference becomes paramount. ClusCa provides a practical, training-free solution that can significantly reduce computational barriers, making these powerful models more accessible for real-world applications.

    • Transferability: The concept of exploiting spatial redundancy via clustering is not limited to diffusion models. It could potentially be applied to accelerate other vision transformer models in tasks like video understanding or dense prediction, where there is significant spatial correlation in the input.

    • Open Question: How does the optimal clustering strategy change with the content of the image? For an image with simple, large regions (like a landscape), a small KK might suffice. For a highly detailed, complex scene, a larger KK might be necessary. An adaptive method that determines KK on-the-fly could be a promising direction for future research.


Transcribed Tables

This section contains manual transcriptions of the tables from the paper, as no image files were provided for them.

Table 1: Quantitative comparison in text-to-image generation for FLUX on DrawBench (ImageReward and CLIP Score).

Method Latency(s)↓ Speed↑ FLOPs(T)↓ Speed↑ ImageReward↑ CLIP Score↑
[dev]: 50 steps 25.82 1.00× 3719.50 1.00× 0.9898 19.761
60% steps 16.70 1.55× 2231.70 1.67× 0.9663 19.526
50% steps 13.14 1.96× 1859.75 2.00× 0.9595 19.455
40% steps 10.59 2.44× 1487.80 2.62× 0.9554 19.003
Δ-DiT (N = 2) 17.80 1.45× 2480.01 1.50× 0.9444 19.396
Δ-DiT (N = 3) 13.02 1.98× 1686.76 2.21× 0.8721 18.742
34% steps 9.07 2.85× 1264.63 3.13× 0.9453 18.870
FORA (N = 3) [29] 10.16 2.54× 1320.07 2.82× 0.9776 19.339
ToCa (N = 6) [41] 13.16 1.96× 924.30 4.02× 0.9802 18.688
DuCa(N = 5) [42] 8.18 3.15× 978.76 3.80× 0.9955 19.314
TaylorSeer (N = 4, O = 2) 9.24 2.80× 1042.27 3.57× 0.9857 19.496
ClusCa (N = 5, O = 1, K = 16) 8.12 3.18× 897.03 4.14× 0.9825 19.481
ClusCa (N = 5, O = 2, K = 16) 8.19 3.15× 897.03 4.14× 0.9961 19.422
22% steps 6.04 4.28× 818.29 4.55× 0.8183 18.224
FORA (N = 4) [29] 8.12 3.14× 967.91 3.84× 0.9730 19.210
ToCa (N = 8) [41] 11.36 2.27× 784.54 4.74× 0.9451 18.402
DuCa (N = 7) [42] 6.74 3.83× 760.14 4.89× 0.9757 18.962
TeaCache (l = 0.8) [18] 7.21 3.58× 892.35 4.17× 0.8683 18.500
TaylorSeer (N = 5, O = 2) 7.46 3.46× 893.54 4.16× 0.9864 19.406
ClusCa (N = 6, O = 1, K = 16) 7.10 3.63× 748.48 4.96× 0.9762 19.533
ClusCa (N = 6, O = 2, K = 16) 7.12 3.62× 748.48 4.96× 0.9949 19.453

Table 2: Quantitative comparison in text-to-video generation on VBench

Method Latency(s)↓ Speed↑ FLOPs(T)↓ Speed↑ VBench Score(%)↑
Original 334.96 1.00× 29773.0 1.00× 80.66
DDIM-22% 87.01 3.85× 6550.1 4.55× 78.74
FORA (N = 5) 83.78 4.00× 5960.4 5.00× 78.83
ToCa (N = 5) 93.80 3.57× 7006.2 4.25× 78.86
DuCa (N = 5) 87.48 3.83× 6483.2 4.62× 78.72
TeaCache (l = 0.4) 70.43 4.76× 6550.1 4.55× 79.36
TeaCache (l = 0.5) 61.47 5.45× 5359.1 5.56× 78.32
TaylorSeer (N = 5, O = 1) 85.93 3.90× 5960.4 5.00× 79.93
TaylorSeer (N = 6, O = 1) 79.46 4.22× 5359.1 5.56× 79.78
ClusCa (N = 5, K = 32) 87.35 3.83× 5968.1 4.99× 79.99
ClusCa (N = 6, K = 32) 81.64 4.10× 5373.0 5.54× 79.96
ClusCa (N = 7, K = 32) 74.88 4.47× 4796.2 6.21× 79.60

Table 3: Quantitative comparison in class-to-image generation on ImageNet with DiT-XL/2.

Method Latency(s) ↓ FLOPs(T) ↓ Speed ↑ FID ↓ sFID ↓
DDIM-25 steps 0.230 11.87 2.00× 3.18 4.74
FORA (N = 2) 0.278 12.35 1.92× 2.66 4.88
ToCa (N = 3) 0.216 9.73 2.44× 2.87 4.76
DuCa (N = 3) 0.208 9.54 2.49× 2.85 4.64
ClusCa (N = 3, K = 16) 0.232 9.18 2.59× 2.38 4.76
ClusCa (N = 3, K = 32) 0.238 9.78 2.43× 2.38 4.69
DDIM-20 steps 0.191 9.49 2.50× 3.81 5.15
FORA (N = 3) 0.222 8.58 2.77× 3.55 6.36
ToCa (N = 4) 0.197 8.10 2.93× 3.42 5.12
DuCa (N = 4) 0.175 7.61 3.12× 3.42 4.94
ClusCa (N = 4, K = 16) 0.189 7.35 3.23× 2.51 5.03
ClusCa (N = 4, K = 32) 0.202 8.03 2.95× 2.50 4.91
DDIM-15 steps 0.151 7.12 3.33× 5.17 6.11
FORA (N = 4) 0.193 6.66 3.56× 4.75 8.43
ToCa (N = 5) 0.176 6.77 3.51× 6.20 7.17
DuCa (N = 5) 0.152 6.27 3.79× 6.06 6.72
TaylorSeer (N = 4, O = 1) 0.186 6.66 3.56× 2.71 5.45
ClusCa (N = 5, K = 16) 0.166 5.98 3.97× 2.65 5.13
ClusCa (N = 5, K = 32) 0.179 6.73 3.53× 2.78 5.01
DDIM-12 steps 0.128 5.70 4.16× 7.80 8.03
FORA (N = 5) 0.171 5.24 4.53× 6.58 11.29
ToCa (N = 6) 0.170 6.34 3.75× 6.55 7.10
DuCa (N = 6) 0.145 5.81 4.08× 6.40 6.71
ClusCa (N = 6, K = 8) 0.153 5.15 4.61× 3.20 5.93
ClusCa (N = 6, K = 16) 0.159 5.52 4.29× 3.16 5.65
DDIM-10 steps 0.112 4.75 5.00× 12.15 11.33
FORA (N = 6) 0.165 4.76 4.99× 9.15 14.84
ToCa (N = 9) 0.158 4.54 5.23× 12.87 12.82
DuCa (N = 9) 0.131 4.30 5.52× 12.05 11.82
TaylorSeer (N = 6, O = 1) 0.157 4.76 4.98× 3.62 7.41
ClusCa (N = 7, K = 4) 0.133 4.02 5.91× 3.59 6.28
ClusCa (N = 7, K = 8) 0.138 4.21 5.63× 3.56 6.05

Table 4: Ablation Study with Different Configurations on ImageNet with DiT-XL/2.

Configuration sFID↓ FID↓
N = 6 γ = 0.000 6.161 3.328
γ = 0.001 5.989 3.276
γ = 0.002 5.848 3.233
γ = 0.003 5.748 3.184
γ = 0.004 5.689 3.166
γ = 0.005 5.659 3.165
γ = 0.006 5.686 3.166
N = 7 γ = 0.000 6.444 3.751
γ = 0.001 6.181 3.677
γ = 0.002 5.971 3.619
γ = 0.003 5.830 3.572
γ = 0.004 5.761 3.541
γ = 0.005 5.722 3.524
γ = 0.006 5.754 3.547
γ = 0.007 5.853 3.591
N = 8 γ = 0.000 6.775 4.782
γ = 0.001 6.452 4.659
γ = 0.002 6.192 4.557
γ = 0.003 6.004 4.467
γ = 0.004 5.877 4.401
γ = 0.005 5.830 4.386
γ = 0.006 5.855 4.412
γ = 0.007 - 4.439

Table 5: Ablation study on number of clusters K.

Method FLOPs(T)↓ Speedup↑ FID↓ sFID↓
ClusCa(K = 4) 5.42 4.37 2.819 5.285
ClusCa(K = 8) 5.61 4.23 2.806 5.226
ClusCa(K = 16) 5.98 3.96 2.803 5.145
ClusCa(K = 32) 6.72 3.53 2.784 5.006

Table 6: Comparison of additional fidelity scores in video generation (HunyuanVideo).

Method FLOPs(T)↓ Speed↑ VBench Score(%)↑ PSNR↑ SSIM↑ LPIPS↓
Original: 50 steps 29773.0 1.00× 80.66 - - -
DDIM-22% 6550.1 4.55× 78.74 11.2576 0.4323 0.6473
FORA (N = 5) 5960.4 5.00× 78.83 11.6969 0.4343 0.6354
ToCa (N = 5) 7006.2 4.25× 78.86 11.3269 0.4243 0.6468
DuCa (N = 5) 6483.2 4.62× 78.72 11.3366 0.4279 0.6467
TeaCache (l = 0.4) 6550.1 4.55× 79.36 11.6859 0.4305 0.6177
TaylorSeer (N = 5, O = 1) 5960.4 5.00× 79.93 11.5505 0.4238 0.6303
ClusCa (N = 5, K = 32) 5968.1 4.99× 79.99 13.5328 0.4863 0.5413
ClusCa (N = 6, K = 32) 5373.0 5.54× 79.96 12.3300 0.4533 0.5957

Table 7: Comparison of VBench motion sub-metrics in video generation (HunyuanVideo).

Method Speed BG Cons. Temp. Flick. Motion Smooth. Dyn. Deg. Human Act. Temp. Style Overall Cons. Avg.
FORA (N = 5) 5.00× 0.9403 0.9771 0.9688 0.2764 0.907 0.6602 0.7209 0.7787
ToCa (N = 5) 4.25× 0.9399 0.9781 0.9708 0.2778 0.898 0.6615 0.7217 0.7783
DuCa (N = 5) 4.62× 0.9393 0.9798 0.9736 0.2778 0.905 0.6613 0.7184 0.7793
Taylor (N = 5) 5.00× 0.9416 0.9819 0.9739 0.2986 0.914 0.6720 0.7293 0.7873
ClusCa(N = 6, K = 32) 5.54× 0.9518 0.9865 0.9787 0.3000 0.914 0.6706 0.7254 0.7896

Table 8: Ablation study on cache cycle length N.

Method FLOPs(T)↓ Speedup↑ FID↓ sFID↓
ClusCa(N=3) 9.17 2.59 2.40 4.70
ClusCa(N=4) 7.35 3.23 2.51 5.03
ClusCa(N=5) 5.98 3.96 2.80 5.14
ClusCa(N=6) 5.52 4.29 3.17 5.66
ClusCa(N=7) 4.62 5.14 3.52 5.72
