DiTFastAttn: Attention Compression for Diffusion Transformer Models
TL;DR Summary
DiTFastAttn is presented as a post-training compression method to address computational bottlenecks in Diffusion Transformers. It effectively reduces spatial, temporal, and conditional redundancies, achieving up to a 76% reduction in attention FLOPs and up to a 1.8x end-to-end speedup for high-resolution generation, without retraining.
Abstract
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DiTFastAttn: Attention Compression for Diffusion Transformer Models". This title clearly indicates a focus on optimizing the computational efficiency of Diffusion Transformer models, specifically by compressing their attention mechanisms.
1.2. Authors
The authors are Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Their affiliations include:
- Tsinghua University (1)
- Infinigence AI (2)
- Shanghai Jiao Tong University (3)

This suggests a collaborative effort between academic institutions and an AI company, often indicating a blend of theoretical research and practical application.
1.3. Journal/Conference
The paper was posted on arXiv, a preprint server, on 2024-06-12. As an arXiv preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a highly influential platform for the rapid dissemination of research in physics, mathematics, computer science, and other fields, allowing researchers to share their work before formal publication. Many significant papers first appear on arXiv before being accepted into top-tier conferences or journals.
1.4. Publication Year
The paper was published in 2024.
1.5. Abstract
Diffusion Transformers (DiT) are powerful generative models for images and videos but face significant computational hurdles due to the quadratic complexity of their self-attention operations. This paper introduces DiTFastAttn, a post-training compression method designed to alleviate this computational bottleneck during DiT inference. The authors identify three primary redundancies in attention computation: (1) spatial redundancy, where many attention heads focus locally; (2) temporal redundancy, where attention outputs are highly similar across neighboring denoising steps; and (3) conditional redundancy, where conditional and unconditional inferences show high similarity, particularly with Classifier-Free Guidance (CFG).
To address these, DiTFastAttn proposes three techniques: (1) Window Attention with Residual Sharing (WA-RS) for spatial redundancy, which uses window attention and caches/reuses residual information from full attention to maintain performance; (2) Attention Sharing across Timesteps (AST) for temporal redundancy, by sharing attention outputs between similar neighboring steps; and (3) Attention Sharing across CFG (ASC) for conditional redundancy, by skipping redundant computations during unconditional generation.
Applied to DiT and PixArt-Sigma for image generation, and OpenSora for video generation, DiTFastAttn demonstrates substantial efficiency gains. For image generation, it achieves up to a 76% reduction in attention FLOPs and up to a 1.8x end-to-end speedup, especially at high resolutions (e.g., 2k x 2k).
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2406.08552
- PDF Link: https://arxiv.org/pdf/2406.08552v2.pdf

This paper is an arXiv preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem DiTFastAttn aims to solve is the substantial computational demand of Diffusion Transformer (DiT) models, particularly during inference for high-resolution image and video generation. While DiT models excel in generative quality, their underlying self-attention mechanism, a core component of Transformer architectures, has quadratic complexity $O(N^2)$ with respect to the input token length $N$. As image and video resolutions increase, the token length grows significantly, making attention computation the primary bottleneck. For instance, generating a 2Kx2K image can involve 16k tokens, leading to several seconds of attention computation even on powerful GPUs.
Further exacerbating this issue is the nature of diffusion model inference, which requires many sequential denoising steps and often employs Classifier-Free Guidance (CFG), effectively doubling the computational cost per step by performing both conditional and unconditional network evaluations.
Prior research on accelerating attention (e.g., Local Attention, Swin Transformer, GQA) often involved architectural changes that necessitate expensive retraining of the entire model. Given the massive data and computational resources required to train DiT models, there is a crucial need for post-training compression methods that can reduce computational costs without incurring retraining expenses or significant performance degradation. This paper's entry point is to identify and exploit intrinsic redundancies within the attention computation of pre-trained DiT models during inference.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of Three Key Redundancies: The authors rigorously identify and characterize three distinct types of redundancy in DiT attention computation during inference:
  - Spatial Redundancy: Many attention heads primarily focus on local information, with attention values for distant tokens being negligible.
  - Temporal Redundancy: Attention outputs of the same head exhibit high similarity across neighboring denoising steps.
  - Conditional Redundancy: During CFG, attention outputs from conditional and unconditional inferences show significant similarity for certain heads and timesteps.
- Proposal of Three Corresponding Compression Techniques: To address each identified redundancy, DiTFastAttn introduces novel post-training compression techniques:
  - Window Attention with Residual Sharing (WA-RS): Replaces full attention with window attention for spatially redundant layers and, crucially, caches and reuses the residual difference between the full and window attention outputs from previous steps to maintain long-range dependencies and preserve performance.
  - Attention Sharing across Timesteps (AST): Exploits temporal similarity by caching attention outputs from an earlier step and reusing them for subsequent similar steps, thereby skipping redundant computations.
  - Attention Sharing across CFG (ASC): Leverages conditional similarity by reusing attention outputs from the conditional neural network evaluation for the unconditional evaluation during CFG, effectively halving the attention cost of CFG.
- Development of a Greedy Compression Plan Decision Method: A simple greedy algorithm is proposed to select the most appropriate compression strategy (or combination of strategies) for each layer and timestep based on a predefined loss threshold, balancing compression against quality.
- Extensive Experimental Validation: DiTFastAttn is applied to various DiT models, including DiT-XL and PixArt-Sigma (image generation) and OpenSora (video generation).

The key conclusions and findings are:

- Significant Computational Savings: DiTFastAttn consistently reduces computational costs; for image generation, it achieves up to a 76% reduction in attention FLOPs.
- Substantial Speedup: It delivers an end-to-end speedup of up to 1.8x, most noticeable at high-resolution generation (e.g., 2048x2048 images with PixArt-Sigma).
- Quality Preservation: The method effectively preserves the generative performance and visual quality of the original DiT models, especially at higher resolutions and moderate compression levels (e.g., the D1-D4 configurations).
- Resolution-Dependent Efficiency: The higher the resolution of the generated content, the greater the computational savings and latency reduction achieved by DiTFastAttn.
- Effectiveness of Residual Sharing: The ablation study confirms that residual caching is crucial for window attention to maintain generative performance, preventing the significant quality drops seen with window attention alone.
- Variability of Redundancy: The compression plans produced by the greedy search show that the distribution of the different redundancy types varies across models, layers, and timesteps, validating the need for a tailored search method rather than a universal strategy.

These findings address the high computational cost of pre-trained DiT models, making them more efficient and practical to deploy, especially for high-resolution content generation, without requiring expensive retraining.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DiTFastAttn, a solid grasp of Diffusion Models, Transformers, and associated concepts like self-attention and Classifier-Free Guidance is essential.
3.1.1. Diffusion Models (DMs)
Diffusion Models are a class of generative models that learn to reverse a diffusion process. Imagine a clean image gradually corrupted by adding Gaussian noise over many timesteps until it becomes pure noise. A diffusion model learns to reverse this process: it's trained to predict and subtract the noise from a noisy image at each timestep to progressively denoise it back to a clean image.
- Denoising Process: During inference (generation), the model starts with random noise and iteratively applies a denoising neural network for a specified number of steps. In each step, the network takes the current noisy image and the current timestep (often encoded as an embedding) as input and outputs a prediction of the noise that was added. This predicted noise is then subtracted to get a slightly cleaner image, and the process repeats until a clean image is obtained.
- Timesteps: The denoising process is discretized into a number of timesteps (e.g., 50 or 1000). The model's behavior can change significantly across these timesteps, as it deals with different noise levels. Early steps remove large amounts of noise, while later steps refine details.
3.1.2. Transformers
Transformers are neural network architectures primarily known for their success in natural language processing (NLP) but have been widely adopted in computer vision (Vision Transformers or ViTs) and other domains. Their core innovation is the self-attention mechanism.
- Self-Attention Mechanism: Instead of processing data sequentially like Recurrent Neural Networks (RNNs), Transformers process all input elements (e.g., tokens in text, image patches in vision) in parallel. The self-attention mechanism allows each input element to weigh the importance of all other input elements when computing its own representation. This is achieved through three learned linear projections: Query (Q), Key (K), and Value (V) matrices.
  - Query (Q): Represents what an element is looking for.
  - Key (K): Represents what an element contains.
  - Value (V): Contains the information an element holds.

  The attention score between a query and a key determines how much value information from that key should be "attended to" by the query. The self-attention formula is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$: Query matrix, typically of shape (sequence length, $d_k$), where $d_k$ is the dimension of the keys.
  - $K$: Key matrix, typically of shape (sequence length, $d_k$).
  - $V$: Value matrix, typically of shape (sequence length, $d_v$), where $d_v$ is the dimension of the values.
  - $QK^T$: The dot product between the Query and Key matrices, calculating similarity scores. This results in a matrix of shape (sequence length, sequence length), which quantifies how much each token attends to every other token.
  - $\sqrt{d_k}$: A scaling factor to prevent large dot-product values from pushing the softmax function into regions with very small gradients, stabilizing training.
  - $\mathrm{softmax}$: Normalizes the attention scores so they sum to 1, representing probability distributions.
  - $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$: The final attention output, a weighted sum of Value vectors, where the weights are determined by the softmax of the query-key similarities.
- Multi-Head Attention (MHA): Instead of a single attention mechanism, Transformers use multiple attention heads operating in parallel. Each head learns different $Q$, $K$, $V$ linear projections, allowing the model to focus on different parts of the input or different relationships simultaneously. The outputs from all heads are then concatenated and linearly transformed to produce the final output.
- Quadratic Complexity: The computation of $QK^T$ involves multiplying two matrices where one dimension is the sequence length $N$. If $Q$ is $N \times d_k$ and $K^T$ is $d_k \times N$, their product is $N \times N$. This makes the computational complexity of self-attention $O(N^2)$, which becomes very expensive for long sequences (high-resolution images/videos).
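To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention (illustrative only; tensor names and shapes are our own, not the paper's code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)."""
    d_k = q.size(-1)
    # (batch, seq_len, seq_len) score matrix: this is the O(N^2) part
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # normalize each query's scores
    return torch.matmul(weights, v)       # weighted sum of value vectors

# Doubling seq_len roughly quadruples the size of the score matrix.
q = k = v = torch.randn(1, 1024, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (1, 1024, 64)
```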
3.1.3. Diffusion Transformers (DiT)
Diffusion Transformers (DiT) (Peebles & Xie, 2023) replace the traditional U-Net architecture, commonly used in early diffusion models (e.g., DDPM), with a Transformer backbone. Instead of operating on raw image pixels, DiT models often work in the latent space of an autoencoder, converting images into smaller latent tokens. These latent tokens are then processed by a Transformer network that incorporates timestep embeddings and conditional information (e.g., text embeddings) to predict the noise in the latent space. By leveraging Transformers, DiT models achieve better scalability and performance, especially for high-resolution content, but inherit the quadratic complexity bottleneck of self-attention.
3.1.4. Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) (Ho & Salimans, 2022) is a technique used to improve the quality and adherence to conditional inputs (e.g., text prompts) in diffusion models. Instead of relying on a separate classifier network, CFG combines the predictions from two parallel denoising network inferences at each timestep:
- An unconditional prediction: The denoising network predicts noise without any conditional input (e.g., an empty text prompt). This learns the general distribution of images.
- A conditional prediction: The denoising network predicts noise with the given conditional input (e.g., the actual text prompt). This guides the generation towards the specified condition.

The final noise prediction is then a weighted combination of the two, typically:
$ \hat{\epsilon}_\theta(x_t | c) = \epsilon_\theta(x_t | \emptyset) + w \cdot (\epsilon_\theta(x_t | c) - \epsilon_\theta(x_t | \emptyset)) $
Where:
- $\hat{\epsilon}_\theta(x_t | c)$: The final guided noise prediction.
- $\epsilon_\theta(x_t | c)$: The conditional noise prediction.
- $\epsilon_\theta(x_t | \emptyset)$: The unconditional noise prediction.
- $w$: The guidance scale, a hyperparameter that controls how strongly the conditional input influences the generation. A higher $w$ leads to stronger adherence to the prompt but can sometimes reduce diversity or quality.

The drawback of CFG is that it requires two forward passes through the denoising network for every timestep, effectively doubling the computational cost per step.
3.1.5. Performance Metrics
- FLOPs (Floating Point Operations): A measure of the total number of floating-point operations performed by a model. It quantifies the computational cost. Lower FLOPs indicate higher efficiency.
- Latency: The time it takes for a model to complete a specific task (e.g., generate one image). It measures speed. Lower latency means faster execution.
3.2. Previous Works
The paper contextualizes DiTFastAttn by discussing existing efforts in diffusion models, Vision Transformer compression, local attention, attention sharing, and other diffusion acceleration methods.
3.2.1. Diffusion Models
The paper notes the evolution from U-Net based diffusion models (Ho et al., 2020; Rombach et al., 2022) to Transformer architectures (DiT by Peebles & Xie, 2023). It highlights PixArt-Sigma (Chen et al., 2024) for high-resolution image generation and Sora (Brooks et al., 2024) for video generation as examples of DiT's capabilities.
3.2.2. Vision Transformer Compression
The computational overhead of attention has driven various compression techniques:
- FlashAttention (Dao, 2023): Optimizes attention computation by dividing input tokens into smaller tiles, improving memory access patterns and reducing latency. This is an algorithmic optimization rather than a model compression. DiTFastAttn builds upon FlashAttention-2.
- Token Pruning/Merging: These methods aim to reduce the sequence length by removing or combining less important tokens. DynamicViT (Rao et al., 2021) uses a prediction network to dynamically filter tokens. Adaptive Sparse ViT (Liu et al., 2022) filters tokens based on attention values and the L2 norm of features. Another work uses segmentation labels to guide token merging. Huang et al. (2023) downsample tokens before attention and then upsample. A further approach suggests filtering for deeper layers and merging for shallower layers. These methods typically require some form of retraining or fine-tuning to adapt to the pruned/merged token structures.
3.2.3. Local Attention
This paradigm restricts attention computation to a fixed-size window of neighboring tokens to reduce quadratic complexity to linear.
- Longformer (Beltagy et al.) introduced linear-scaling attention.
- BigBird (Zaheer et al., 2020) combines window attention with random and global attention for long-range dependencies.
- Swin Transformer (Liu et al., 2021) uses non-overlapping local windows and shifted windows across layers to capture global context.
- Twins Transformer (Chu et al., 2021), FasterViT (Vasu et al., 2023), and Neighborhood Attention Transformer (Hassani et al., 2023) also employ window-based attention with variations to capture global context.
- DiTFastAttn uses fixed-size window attention but enhances it with Residual Sharing to maintain long-range dependencies in a training-free manner.
3.2.4. Attention Sharing
This category exploits similarities in attention mechanisms or outputs to reduce computation.
- GQA (Grouped Query Attention) (Ainslie et al., 2023) groups query heads and shares key and value parameters within each group, reducing memory and improving efficiency. This is an architectural change made during training.
- PSViT (Chen et al., 2021) observed similarity in attention maps across different Transformer layers and proposed sharing them to reduce redundancy.
- DeepCache (Ma et al., 2023) noted similarity in the high-level features of U-Net-based diffusion models across timesteps and reused these features, skipping intermediate layers.
- TGATE (Zhang et al., 2024) found that cross-attention outputs in text-conditional diffusion models converge to a fixed point after several denoising steps, and caches and reuses this output.
- DiTFastAttn extends attention sharing by demonstrating similarity in attention outputs both CFG-wise and step-wise, and importantly, considers layer-specific and timestep-specific variations in similarity for selective sharing.
3.2.5. Other Methods to Accelerate Diffusion Models
- Network Quantization: Reduces the bitwidth of weights and activations (Shang et al., 2023; Zhao et al., 2024b, 2024a).
- Scheduler Optimization: Decreases the number of denoising steps (Song et al., 2020; Lu et al., 2022; Liu et al., 2023a).
- Distillation: Reduces the number of timesteps by training a smaller model or distilling knowledge from a larger one (Salimans & Ho, 2022; Meng et al., 2023; Liu et al., 2023b).

DiTFastAttn is complementary to these methods, as it operates independently of quantization, scheduler, or timestep settings.
3.3. Technological Evolution
The evolution of generative AI has seen a shift from Generative Adversarial Networks (GANs) (Creswell et al., 2018) to diffusion models for superior performance. Early diffusion models were predominantly built on U-Net architectures (e.g., DDPM). However, with the success of Transformers in other domains, Diffusion Transformers (DiT) emerged as a scalable and powerful alternative, replacing U-Nets with Transformer blocks. This transition brought enhanced capabilities, especially for high-resolution content, but also inherited the quadratic complexity of the self-attention mechanism, which quickly became a bottleneck.
Efforts to address Transformer computational costs generally began with architectural designs like Swin Transformer for Vision Transformers or memory-optimized implementations like FlashAttention. More recently, researchers have focused on identifying and exploiting redundancies within the attention mechanism itself, through methods like token pruning/merging or attention sharing across layers or timesteps.
DiTFastAttn fits into this timeline by focusing specifically on the DiT architecture and introducing a post-training compression approach. This is a significant distinction from many prior methods that require extensive retraining. By identifying spatial, temporal, and conditional redundancies specific to the DiT inference process, DiTFastAttn offers a practical solution to accelerate existing pre-trained DiT models without the prohibitive cost of retraining, marking a step towards more efficient deployment of these powerful generative models.
3.4. Differentiation Analysis
Compared to the main methods in related work, DiTFastAttn offers several core differences and innovations:
- Post-Training Compression: Many prior acceleration techniques (e.g., Swin Transformer, GQA, or token pruning/merging when integrated into the architecture) necessitate architectural changes or retraining. DiTFastAttn is explicitly a post-training method, meaning it can be applied directly to pre-trained DiT models without any further training or fine-tuning. This is a crucial practical advantage given the immense computational cost of training large DiT models.
- Comprehensive Redundancy Identification: DiTFastAttn systematically identifies three distinct, yet complementary, types of attention redundancy: spatial, temporal (across timesteps), and conditional (across CFG evaluations). This multi-faceted approach allows for more granular and effective compression than methods that target a single type of redundancy.
- Novel Residual Sharing for Window Attention: While local (window) attention has been explored (e.g., Swin Transformer), directly applying it post-training to DiT can degrade performance due to the loss of long-range dependencies. DiTFastAttn's Window Attention with Residual Sharing (WA-RS) addresses this by caching and reusing the difference (residual) between the full and window attention outputs from previous steps. This training-free mechanism effectively preserves crucial long-range dependencies without incurring the quadratic cost of full attention at every step.
- Targeted Attention Sharing for DiT Inference: Building on general attention-sharing ideas (PSViT, DeepCache, TGATE), DiTFastAttn specifically applies attention sharing along the timestep dimension (AST) and the CFG dimension (ASC), based on empirical observations of high output similarity in these contexts within DiTs. This is distinct from sharing attention maps across layers or caching cross-attention outputs.
- Dynamic Compression Plan: Unlike static compression methods, DiTFastAttn employs a simple greedy algorithm to determine the compression strategy for each layer and timestep. This tailored approach adapts to the varying distribution of redundancies across the model and the denoising process, maximizing compression while maintaining quality across diverse models and resolutions.
- Complementarity with Existing Methods: The paper highlights that DiTFastAttn is orthogonal to other diffusion acceleration methods such as quantization, scheduler optimization, and distillation, meaning it can potentially be combined with them for even greater overall acceleration.
4. Methodology
4.1. Principles
The core principle behind DiTFastAttn is to achieve significant computational savings during the inference of Diffusion Transformer (DiT) models by identifying and exploiting inherent redundancies in their self-attention computations. This is achieved through a post-training compression approach, meaning it does not require retraining the large DiT models, which is a major advantage. The authors pinpoint three specific types of redundancies:
- Spatial Redundancy: Many attention heads primarily focus on local patterns, making full self-attention (which considers all tokens) inefficient for these layers.
- Temporal Redundancy: The attention outputs of the same layer can be highly similar across consecutive denoising steps, suggesting that recomputing them repeatedly is redundant.
- Conditional Redundancy: When using Classifier-Free Guidance (CFG), the conditional and unconditional attention outputs can be very similar, leading to redundant computations.

Based on these observations, DiTFastAttn proposes three corresponding techniques: Window Attention with Residual Sharing (WA-RS) for spatial redundancy, Attention Sharing across Timesteps (AST) for temporal redundancy, and Attention Sharing across CFG (ASC) for conditional redundancy. A greedy search method is then used to apply these techniques layer-by-layer and step-by-step to maximize compression while minimizing performance degradation.
The following figure (Figure 2 from the original paper) provides a visual overview of the identified redundancies and their corresponding compression techniques:
This figure is a schematic showing the three techniques used in DiTFastAttn to mitigate redundancy in attention computation. The horizontal axis represents the step dimension; the figure marks spatial redundancy, similarity between timesteps, and similarity between conditional and unconditional inference, each corresponding to a compression technique such as Window Attention or Attention Sharing.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Window Attention with Residual Sharing (WA-RS)
The first redundancy exploited is spatial redundancy. The authors observe that in many Transformer layers of pre-trained DiTs, attention values are highly concentrated within a localized window around the diagonal of the attention matrix. This means that tokens primarily attend to nearby tokens, and attention values for spatially distant tokens are often close to zero. This phenomenon is illustrated in Figure 3(a) (left panel).
To leverage this, DiTFastAttn proposes using window attention instead of full self-attention for selected layers. Window attention restricts computation to tokens within a fixed-size window, drastically reducing computational cost from quadratic to linear with respect to sequence length.
However, simply replacing full attention with window attention can lead to performance degradation because some tokens still rely on a small set of spatially distant tokens for their complete contextual understanding. Discarding these long-range dependencies negatively impacts model performance. A naive solution would be to use a very large window size, but this would negate most of the computational savings.
To address this, DiTFastAttn introduces Cache and Reuse the Residual for Window Attention. The key insight comes from an observation shown in Figure 3(a) (right panel): the residual between the outputs of full attention and window attention exhibits much smaller variation across timesteps compared to the direct window attention output. This suggests that the "missing" long-range information (the residual) changes slowly and can be effectively cached and reused.
The following figure (Figure 3 from the original paper) illustrates the Window Attention with Residual Sharing technique:
This figure is a schematic showing the computation of the sliding-window attention mechanism: the left panel shows attention maps at different timesteps, and the right panel compares full attention with window attention and illustrates how the residual is computed as their difference.
Figure 3(b) illustrates the WA-RS computation flow:
For a specific set of timesteps $\mathbf{K}$ that share a residual value, with initial step $r = \min(\mathbf{K})$, the computation proceeds as follows:

- Compute Full Attention Output: The standard full attention is computed for step $r$ using the Query ($\mathbf{Q}_r$), Key ($\mathbf{K}_r$), and Value ($\mathbf{V}_r$) matrices.
  $ \mathbf{O}_r = \operatorname{Attention}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r) $
  - $\mathbf{O}_r$: The output of the full self-attention mechanism at timestep $r$.
  - $\operatorname{Attention}$: The standard full self-attention function.
  - $\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r$: The Query, Key, and Value matrices at timestep $r$, derived from the input features of the Transformer layer at that timestep.
- Compute Window Attention Output: The window attention is computed for step $r$.
  $ \mathbf{W}_r = \operatorname{WindowAttention}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r) $
  - $\mathbf{W}_r$: The output of the window attention mechanism at timestep $r$. WindowAttention is a variant of Attention that only considers tokens within a predefined local window.
- Calculate and Cache the Residual: The residual is calculated as the difference between the full attention output and the window attention output at step $r$. This residual captures the long-range dependencies that window attention misses, and it is cached.
  $ \mathbf{R}_r = \mathbf{O}_r - \mathbf{W}_r $
  - $\mathbf{R}_r$: The residual computed at timestep $r$, representing the difference between the full and window attention outputs. It is cached for reuse.

For any subsequent timestep $k \in \mathbf{K}$ within the same sharing set, the computation is simplified:

- Compute Window Attention Output: Only the window attention is computed for step $k$.
  $ \mathbf{W}_k = \operatorname{WindowAttention}(\mathbf{Q}_k, \mathbf{K}_k, \mathbf{V}_k) $
  - $\mathbf{W}_k$: The output of the window attention mechanism at timestep $k$. Note that $\mathbf{Q}_k$, $\mathbf{K}_k$, $\mathbf{V}_k$ are recomputed for step $k$, as they depend on the current timestep's input.
- Add the Cached Residual: The output for step $k$ is obtained by adding the previously cached residual (from step $r$) to the current window attention output. This effectively reintroduces the long-range dependencies that window attention would otherwise omit, without recomputing full attention.
  $ \mathbf{O}_k = \mathbf{W}_k + \mathbf{R}_r $
  - $\mathbf{O}_k$: The final estimated attention output for timestep $k$ using WA-RS.
  - $\mathbf{K}$: The set of steps that share the residual value $\mathbf{R}_r$.

By doing this, WA-RS significantly reduces computation in subsequent steps by only calculating window attention (linear complexity) and adding a cached value, while still preserving the crucial long-range information that plain window attention would lose.
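A minimal sketch of the WA-RS caching logic described above, assuming `full_attention` and `window_attention` callables that return tensors of the same shape (names and structure are ours, not the paper's implementation):

```python
class WindowAttentionWithResidualSharing:
    """Cache O_r - W_r at the first step of a sharing set; reuse it afterwards."""

    def __init__(self, full_attention, window_attention):
        self.full_attention = full_attention
        self.window_attention = window_attention
        self.cached_residual = None

    def forward(self, q, k, v, is_first_step_of_set):
        w = self.window_attention(q, k, v)        # cheap, local-window attention
        if is_first_step_of_set:
            o = self.full_attention(q, k, v)      # full attention only once per set
            self.cached_residual = o - w          # long-range part, changes slowly
            return o
        # Later steps: window attention plus the cached long-range residual
        return w + self.cached_residual
```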
4.2.2. Attention Sharing across Timesteps (AST)
The second redundancy addressed is the temporal similarity of attention outputs. During the sequential denoising process of diffusion models, the features and, consequently, the attention outputs of the same attention head can be highly similar across neighboring timesteps. Figure 4(a) demonstrates this by showing high cosine similarity between attention outputs of adjacent steps for certain layers. The similarity is not uniform; it varies across both timesteps and Transformer layers.
The Attention Sharing across Timesteps (AST) technique leverages this observation. For a group of timesteps whose attention outputs are similar to each other, the model computes the attention output only at the earliest timestep in that group. This cached attention output is then reused for all subsequent timesteps within that group, skipping the attention computation for those steps entirely. This significantly accelerates the denoising process by reducing redundant calculations over time, as sketched below.
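A minimal sketch of AST-style reuse (illustrative only; in DiTFastAttn the grouping of steps comes from the compression plan, not from this class):

```python
class AttentionSharingAcrossTimesteps:
    """Skip attention for steps whose output is reused from an earlier step."""

    def __init__(self, attention_fn):
        self.attention_fn = attention_fn
        self.cached_output = None

    def forward(self, q, k, v, reuse_cached):
        if reuse_cached and self.cached_output is not None:
            return self.cached_output             # AST: no attention computation at all
        self.cached_output = self.attention_fn(q, k, v)
        return self.cached_output
```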
4.2.3. Attention Sharing across CFG (ASC)
The third redundancy is conditional redundancy, specifically within Classifier-Free Guidance (CFG). CFG is a standard technique for conditional generation, which typically doubles the computational cost by requiring two neural network inferences at each timestep: one with a conditional input (e.g., text prompt) and one without (unconditional input).
The authors observe that for many Transformer layers and timesteps, the attention outputs generated during the conditional neural network evaluation are highly similar to those generated during the unconditional neural network evaluation. Figure 4(b) illustrates this significant similarity (e.g., high SSIM values) between the attention outputs of the conditional and unconditional evaluations.
The Attention Sharing across CFG (ASC) technique exploits this. Instead of performing two separate attention computations, ASC reuses the attention output from the conditional neural network evaluation for the unconditional neural network evaluation. This allows the model to skip the attention computation for the unconditional path, thereby reducing the attention overhead during CFG by approximately 50%.
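A minimal sketch of ASC, assuming the conditional and unconditional samples are concatenated along the batch dimension (a common CFG implementation pattern; the layout and function names are our assumptions):

```python
import torch

def attention_with_asc(attention_fn, q, k, v):
    """Compute attention only for the conditional half of a CFG batch and
    reuse that output for the unconditional half."""
    half = q.shape[0] // 2                           # [0:half] conditional, [half:] unconditional
    out_cond = attention_fn(q[:half], k[:half], v[:half])
    return torch.cat([out_cond, out_cond], dim=0)    # unconditional path skipped entirely
```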
The following figure (Figure 4 from the original paper) shows the similarity of attention outputs across timesteps and CFG dimension:
This figure is a chart showing the similarity of attention outputs at different layers. The left panel shows the cosine similarity of the attention outputs of layers 5 and 25 across timesteps; the right panel is a heatmap of the attention-output similarity between conditional and unconditional generation.
4.2.4. Method for Deciding the Compression Plan
As indicated by the varying similarity patterns in Figures 3 and 4, the optimal compression strategy is not universal; different layers and timesteps exhibit different types and degrees of redundancy. Therefore, a method is needed to determine which compression technique (or combination) should be applied to each Transformer layer at each timestep.
DiTFastAttn employs a simple greedy method to decide this compression plan. The method iteratively determines the best strategy for each timestep and then for each Transformer layer within that timestep.
The compression strategy list used in the greedy search is $S = \{\text{AST}, \text{WA-RS+ASC}, \text{WA-RS}, \text{ASC}\}$. These strategies are ordered by their ascending potential compression ratio, meaning the most aggressive compression methods are tried first for a given layer and step.
The following algorithm (Algorithm 1 from the original paper) outlines the process:
Algorithm 1: Method for Deciding the Compression Plan

```
Input : Transformer model M, total steps T, compression strategy list S, threshold δ
Output: dictionary dict that stores the selected compression technique per layer and step

Initialize dict
for step t in T do
    O ← compute the output of the uncompressed M
    for transformer layer i in M do
        for m ∈ S ordered by ascending compression ratio do
            compress layer i in step t using compression strategy m
            O' ← compute the output of M
            if L(O, O') < (i / |M|) · δ then
                update m as the selected strategy of layer i and step t in dict
                break
return dict
```
Algorithm 1 Step-by-Step Explanation:
- Inputs:
  - Transformer Model $M$: The pre-trained DiT model.
  - Total Step $T$: The total number of denoising steps (e.g., 50) in the diffusion process.
  - Compression Strategy List $S$: A list of available compression strategies, ordered by ascending compression ratio (e.g., AST, WA-RS+ASC, WA-RS, ASC).
  - $\delta$: A global scalar threshold that controls the maximum allowable quality degradation. A smaller $\delta$ means less compression but higher fidelity.
- Initialization:
  - dict: An empty dictionary is initialized to store the chosen compression strategy for each (layer, step) pair.
- Outer Loop (Iterating through Timesteps):
  - The algorithm iterates through each timestep $t$ from 1 to $T$.
  - For each timestep, the output of the uncompressed model, $O$, is computed. This serves as the reference for comparison.
- Inner Loop (Iterating through Transformer Layers):
  - For each timestep, the algorithm then iterates through each Transformer layer $i$ in the model $M$.
  - Innermost Loop (Iterating through Strategies):
    - For the current layer and timestep, the algorithm iterates through the compression strategy list $S$ in order of ascending compression ratio, i.e., the most aggressive (highest-saving) strategies are tried first.
    - Apply Compression: The current layer at timestep $t$ is temporarily compressed using the current strategy $m$.
    - Compute Compressed Output: The output of the model with this compression applied, $O'$, is computed.
    - Check Loss Condition: The loss (quality degradation) between the uncompressed output $O$ and the compressed output $O'$ is calculated using the function L(O, O').
    - This loss is then compared against a dynamically adjusted threshold $\frac{i}{|M|}\delta$:
      - $|M|$: The total number of Transformer layers in the model.
      - $\frac{i}{|M|}$: A fractional term that makes the allowable loss increase linearly with the layer index. Shallower layers (smaller $i$) are thus more strictly constrained, while deeper layers (larger $i$) can tolerate slightly more loss. This is a common heuristic, as deeper layers often contribute less to fine-grained details and more to abstract features, or may be more robust to small perturbations.
    - Select Strategy: If the calculated loss L(O, O') is less than the threshold, strategy $m$ is deemed acceptable for layer $i$ at timestep $t$. It is recorded in the dict, and the algorithm breaks out of the innermost loop (there is no need to fall back to less aggressive strategies once the most aggressive acceptable one has been found).
- Return Compression Plan: After iterating through all timesteps and layers, the dict contains the selected compression strategy for each (layer, step) pair, forming the complete compression plan.
Computational Complexity of Search:
The paper notes that this greedy search is substantially more expensive than a single DiT inference, since it runs one model evaluation per candidate strategy for every layer and timestep; its cost scales with the number of compression strategies, the number of denoising steps, the number of Transformer layers, and (quadratically) the sequence length. For example, for a DiT-XL-2-512 model, the search takes approximately 224 seconds, which is considered a reasonable one-time overhead for obtaining the compression plan.
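A simplified Python sketch of the greedy search in Algorithm 1 (function names such as `run_model`, `apply_strategy`, and `reset_layer` are placeholders; error handling and output caching are omitted):

```python
def decide_compression_plan(model, num_steps, strategies, delta,
                            run_model, apply_strategy, reset_layer, loss_fn):
    """Greedy plan: per (step, layer), pick the most aggressive strategy
    whose output error stays under a layer-dependent threshold."""
    plan = {}
    num_layers = model.num_layers
    for t in range(num_steps):
        reference = run_model(model, t)                        # uncompressed output O
        for i in range(num_layers):
            # strategies are ordered so the highest-saving candidates come first
            for m in strategies:
                apply_strategy(model, layer=i, step=t, strategy=m)
                candidate = run_model(model, t)                # compressed output O'
                if loss_fn(reference, candidate) < (i + 1) / num_layers * delta:
                    plan[(i, t)] = m                           # accept; stop trying weaker ones
                    break
                reset_layer(model, layer=i, step=t)            # revert if too lossy
    return plan
```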
Loss Function L(O, O') (Mean Relative Absolute Error):
The loss function L(O, O') used to evaluate the quality degradation is the mean relative absolute error. As provided in Appendix A.2, it is calculated as follows:
$ L(O, O') = \frac{1}{|O|} \sum_i \mathrm{clip}\left( \frac{|O_i - O_i'|}{\max(|O_i|, |O_i'|) + \epsilon}, 0, 10 \right) $
Where:

- $|O|$: The number of elements in the raw (uncompressed) output $O$. This term normalizes the sum to give the mean.
- $\sum_i$: Summation over each element $i$ of the outputs $O$ and $O'$.
- $O_i$: The value of the $i$-th element of the raw (uncompressed) output $O$.
- $O_i'$: The value of the $i$-th element of the compressed output $O'$.
- $|O_i - O_i'|$: The absolute difference between the raw and compressed output elements, quantifying the error.
- $\max(|O_i|, |O_i'|) + \epsilon$: The normalization factor. It uses the maximum absolute value of the raw and compressed elements, so the error is measured relative to the magnitude of the values. A small positive constant $\epsilon$ is added to the denominator to prevent division by zero or numerical instability when both $O_i$ and $O_i'$ are very small.
- $\mathrm{clip}(\cdot, 0, 10)$: A function that clips the resulting relative error ratio to the range [0, 10]. This prevents extreme outliers (e.g., a very small true value paired with a large predicted value, yielding a huge ratio) from dominating the overall mean error.

This loss function provides a normalized measure of the average relative deviation, with values ranging from 0 (perfect match) to 10 (maximum allowed clipped error). The parameter $\delta$ then sets the overall tolerance for this error. The configurations D1 ($\delta = 0.025$), D2 ($\delta = 0.05$), ..., D6 ($\delta = 0.15$) represent increasing levels of loss tolerance, with higher $\delta$ allowing more aggressive compression.
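A direct NumPy transcription of the mean relative absolute error above (the epsilon default is a placeholder, since the exact constant is only given in the paper's appendix):

```python
import numpy as np

def mean_relative_absolute_error(o, o_prime, eps=1e-6):
    """L(O, O'): mean of element-wise relative errors, each clipped to [0, 10]."""
    o = np.asarray(o, dtype=np.float64)
    o_prime = np.asarray(o_prime, dtype=np.float64)
    rel_err = np.abs(o - o_prime) / (np.maximum(np.abs(o), np.abs(o_prime)) + eps)
    return float(np.mean(np.clip(rel_err, 0.0, 10.0)))
```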
5. Experimental Setup
5.1. Datasets
The experiments evaluate DiTFastAttn on various Diffusion Transformer models across different generation tasks.
- Image Generation:
- DiT models (DiT-XL-2-512): Evaluated using the ImageNet dataset.
- Source: ImageNet is a large-scale hierarchical image database.
- Scale: Contains millions of images categorized into thousands of classes.
- Characteristics: Diverse range of natural images, widely used for image classification and generation benchmarks.
- Domain: General object recognition and natural scenes.
- Why chosen: Standard benchmark for generative models, especially for class-conditional image generation, allowing comparison with previous DiT model evaluations.
- PixArt-Sigma models (PixArt-Sigma-1024, PixArt-Sigma-2K): Evaluated using the MS-COCO dataset.
- Source: Microsoft Common Objects in Context (MS-COCO).
- Scale: Contains over 330K images with complex everyday scenes.
- Characteristics: Focuses on object detection, segmentation, and captioning, with challenging scenes and many objects per image.
- Domain: Object recognition in context, everyday scenes.
- Text Prompts: For PixArt-Sigma models, MS-COCO 2014 captions are used as text prompts for text-to-image generation.
- Why chosen: Standard benchmark for text-to-image generation, providing diverse and complex scenes with associated captions, suitable for evaluating conditional generation.
- Video Generation:
  - Open-Sora (Open-Sora, 2024): Evaluated for video generation tasks.
    - Source/Characteristics: Open-Sora is an open-source project aiming to reproduce Sora's capabilities. The specific dataset used for evaluation (beyond the model itself) is not explicitly detailed in the Settings section for OpenSora, but typical video generation benchmarks involve datasets like UCF101, Kinetics, or specific internal video collections. The context suggests general video generation.
    - Why chosen: Represents the application of DiT to the increasingly demanding domain of video generation.

The number of generated samples for the quality metrics:

- DiT models: 50,000 images.
- PixArt-Sigma models: 30,000 images.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
5.2.1. FID (Fréchet Inception Distance)
- Conceptual Definition: Fréchet Inception Distance (FID) is a metric used to assess the quality of images generated by generative models, particularly GANs and diffusion models. It quantifies the similarity between the distribution of generated images and the distribution of real (ground truth) images. FID is calculated by embedding both real and generated images into a feature space (typically using an Inception-v3 network pretrained on ImageNet) and then computing the Fréchet distance between the two multivariate Gaussian distributions fitted to these embeddings. A lower FID score indicates that the generated images are more similar to the real images, implying higher quality and diversity.
- Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
- Symbol Explanation:
  - $\mu_1$: The mean feature vector of the real image distribution.
  - $\mu_2$: The mean feature vector of the generated image distribution.
  - $||\mu_1 - \mu_2||^2_2$: The squared L2 norm (Euclidean distance) between the mean vectors.
  - $\Sigma_1$: The covariance matrix of the real image distribution.
  - $\Sigma_2$: The covariance matrix of the generated image distribution.
  - $\mathrm{Tr}$: The trace of a matrix, i.e., the sum of the elements on the main diagonal.
  - $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the two covariance matrices.
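A minimal sketch of the FID computation from precomputed Inception feature statistics (assumes features have already been extracted; uses SciPy for the matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_gen):
    """feats_*: (num_samples, feature_dim) arrays of Inception-v3 features."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))
```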
5.2.2. IS (Inception Score)
- Conceptual Definition: Inception Score (IS) is another metric for evaluating the quality of generated images. It aims to measure two aspects: the clarity (quality) and the diversity of the generated images, using a pre-trained Inception-v3 network to classify them.
  - Clarity: Measured by how confident the Inception model is in its classification of a generated image (low entropy in $p(y|x)$). A good model should generate images that are clearly recognizable as belonging to a specific class.
  - Diversity: Measured by how varied the predicted class labels are across a batch of generated images (high entropy in $p(y)$). A good generative model should produce diverse outputs, not just multiple variations of a single class.

  A higher IS score indicates better quality and diversity.
- Mathematical Formula: $ \mathrm{IS} = \exp(E_x \left[ D_{KL}(p(y|x) || p(y)) \right]) $
- Symbol Explanation:
  - $x$: A generated image.
  - $y$: The class label predicted for an image by the Inception-v3 network.
  - $p(y|x)$: The conditional probability distribution over class labels given a generated image $x$. This represents the Inception model's confidence in classifying $x$.
  - $p(y)$: The marginal probability distribution over class labels across all generated images. This represents the diversity of the generated image distribution.
  - $D_{KL}(p(y|x) || p(y))$: The Kullback-Leibler (KL) divergence between $p(y|x)$ and $p(y)$, which measures how one probability distribution diverges from the other.
  - $E_x$: The expectation (average) taken over all generated images $x$.
  - $\exp$: The exponential function, used to scale the KL divergence result.
5.2.3. CLIP Score
- Conceptual Definition: CLIP Score (Hessel et al., 2021) is a reference-free metric used to evaluate the semantic consistency between generated images and their corresponding text prompts in text-to-image generation tasks. It leverages the CLIP (Contrastive Language-Image Pre-training) model, which learns a joint embedding space where semantically similar images and texts are close together. The CLIP Score is computed as the cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the input text prompt. A higher CLIP Score indicates better alignment between the generated image and the text description.
- Mathematical Formula: The CLIP Score is not typically defined by a single compact formula beyond the cosine similarity calculation, as it depends on the internal workings of the CLIP model's embedding process. It can be expressed as:
  $ \mathrm{CLIPScore} = \mathrm{cosine\_similarity}(\mathrm{CLIP}_{\text{image}}(I), \mathrm{CLIP}_{\text{text}}(T)) $
- Symbol Explanation:
  - $\mathrm{CLIP}_{\text{image}}(I)$: The CLIP embedding (feature vector) of the generated image $I$.
  - $\mathrm{CLIP}_{\text{text}}(T)$: The CLIP embedding (feature vector) of the input text prompt $T$.
  - $\mathrm{cosine\_similarity}(a, b)$: The cosine similarity between two vectors $a$ and $b$, measuring the cosine of the angle between them. It ranges from -1 (opposite) to 1 (identical).
5.2.4. Attn FLOPs (Attention Floating Point Operations)
- Conceptual Definition: Attn FLOPs refers specifically to the number of floating-point operations performed within the multi-head attention modules of the Transformer model. This metric directly quantifies the computational cost of the attention mechanism itself, which is the primary bottleneck addressed by DiTFastAttn.
- Mathematical Formula: No single universal formula, as it depends on the specific attention implementation. The cost is dominated by the two large matrix multiplications: $QK^T$ and the product of the softmax-normalized scores with $V$.
- Symbol Explanation: In the results, Attn FLOPs is presented as a fraction (percentage) relative to the original uncompressed model's attention FLOPs.
5.2.5. Latency
- Conceptual Definition: Latency measures the real-world time taken to perform an operation, such as generating a single image or video, typically measured in seconds (s). Lower latency means the process is faster. The paper reports both the overall end-to-end latency of generation and the attention latency of the multi-head attention module specifically.
- Mathematical Formula: Not a calculated formula, but a direct measurement of time.
- Symbol Explanation: Expressed in seconds (s).
5.3. Baselines
The proposed DiTFastAttn method is compared against the raw (original) pre-trained versions of the Diffusion Transformer models it targets. These serve as the direct baselines, as the method is a post-training compression technique.
The specific baseline models are:

- DiT-XL-2-512 (image generation at 512x512 resolution)
- PixArt-Sigma-XL-1024 (image generation at 1024x1024 resolution)
- PixArt-Sigma-XL-2K (image generation at 2048x2048 resolution)
- OpenSora V1.1 (video generation at 240p resolution with 16 frames)

These baselines are representative because they are the original, uncompressed models whose computational bottlenecks DiTFastAttn aims to alleviate. The comparison directly demonstrates the trade-off between computational efficiency and generative quality achieved by the proposed compression method.
5.4. Other Settings
- Sampling Method & Steps:
  - For DiT and PixArt-Sigma models (image generation), a 50-step DPM-Solver (Lu et al., 2022), a fast ODE solver for diffusion models, is used.
  - For Open-Sora (video generation), a 200-step IDDPM (Improved Denoising Diffusion Probabilistic Models; Nichol & Dhariwal, 2021) sampler is used.
- Loss Thresholds for Compression Plan:
  - The threshold $\delta$ in Algorithm 1 (Method for Deciding the Compression Plan) is varied at intervals of 0.025.
  - These settings are denoted D1 ($\delta = 0.025$), D2 ($\delta = 0.05$), D3 ($\delta = 0.075$), D4 ($\delta = 0.1$), D5 ($\delta = 0.125$), and D6 ($\delta = 0.15$). Higher numbers (larger $\delta$) imply a higher tolerance for loss, leading to more aggressive compression.
- WA-RS Window Size: The window size for Window Attention with Residual Sharing (WA-RS) is set to 1/8 of the total token count (sequence length).
- Implementation Details: DiTFastAttn is implemented on top of FlashAttention-2 (Dao, 2023), a highly optimized attention implementation.
- Hardware: All latency measurements are performed on a single Nvidia A100 GPU.
- Batch Size:
  - DiT models: batch size of 8.
  - PixArt-Sigma models: batch size of 1.
  - OpenSora: batch size not explicitly stated for the latency measurement, but standard for video generation.
- Loss Function Details: The $\epsilon$ in the mean relative absolute error calculation is set to a small fixed constant. Other metrics such as LPIPS and SSIM were considered for the compression plan search but were discarded due to computational cost or insensitivity.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Results of Evaluation Metrics and Attention FLOPs on Image Generation
The following are the results from Table 1 of the original paper. Columns 2-4 are for DiT-XL-2 512x512 (IS, FID, Attn FLOPs); columns 5-8 for PixArt-Sigma-XL 1024x1024 (IS, FID, CLIP, Attn FLOPs); columns 9-12 for PixArt-Sigma-XL 2048x2048 (IS, FID, CLIP, Attn FLOPs):

| Config | IS | FID | Attn FLOPs | IS | FID | CLIP | Attn FLOPs | IS | FID | CLIP | Attn FLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Raw | 408.16 | 25.43 | 100% | 24.33 | 55.65 | 31.27 | 100% | 23.67 | 51.89 | 31.47 | 100% |
| D1 | 412.24 | 25.32 | 85% | 24.27 | 55.73 | 31.27 | 90% | 23.28 | 52.34 | 31.46 | 81% |
| D2 | 412.18 | 24.67 | 69% | 24.25 | 55.69 | 31.26 | 74% | 22.90 | 53.01 | 31.32 | 60% |
| D3 | 411.74 | 23.76 | 59% | 24.16 | 55.61 | 31.25 | 63% | 22.96 | 52.54 | 31.36 | 46% |
| D4 | 391.80 | 21.52 | 49% | 24.07 | 55.32 | 31.24 | 52% | 22.95 | 51.74 | 31.39 | 36% |
| D5 | 370.07 | 19.32 | 41% | 24.17 | 54.54 | 31.22 | 44% | 22.82 | 51.21 | 31.34 | 29% |
| D6 | 352.20 | 16.80 | 34% | 23.94 | 52.73 | 31.18 | 37% | 22.38 | 49.34 | 31.28 | 24% |
Analysis:

- DiT-XL-2 512x512: At lower compression levels (D1, D2, D3), DiTFastAttn achieves a significant Attn FLOPs reduction (down to 59% at D3) while maintaining or even slightly improving IS and FID compared to the Raw model. For instance, IS increases from 408.16 to 411.74 (D3), and FID decreases from 25.43 to 23.76 (D3). This suggests that DiTFastAttn can make the model more efficient without compromising quality; pruning redundant calculations may even yield minor improvements. As compression intensifies (D4-D6), Attn FLOPs reduce further (down to 34% at D6), but IS and FID start to degrade, though the model still produces acceptable images (see Figure 6).
- PixArt-Sigma-XL 1024x1024: Similar to DiT-XL-2, the D1 to D3 configurations keep IS, FID, and CLIP scores very close to the Raw model while achieving substantial Attn FLOPs reduction (down to 63% at D3). For D4-D6, Attn FLOPs reduce to 37% at D6, with a slight decrease in quality metrics that nonetheless remain high. The CLIP score shows remarkable resilience, staying very close to the Raw score even at D6.
- PixArt-Sigma-XL 2048x2048 (High Resolution): This model demonstrates the most pronounced benefits. At D6, Attn FLOPs are reduced to just 24% of the original (a 76% reduction), with minimal degradation in IS, FID, and CLIP compared to the Raw model. For instance, FID drops from 51.89 to 49.34, and CLIP from 31.47 to 31.28. This highlights a key finding: as image resolution increases, DiTFastAttn not only achieves greater compression but also better preserves the generative performance of the models. This is likely due to the quadratic complexity of self-attention, where the attention computation becomes a much larger fraction of the total FLOPs at higher resolutions, providing more scope for Attn FLOPs reduction.
6.1.2. Visualization of DiTFastAttn's Generation Results
The following figure (Figure 6 from the original paper) shows image generation samples:
This figure shows image generation samples produced at different compression ratios and resolutions, covering three resolutions (512x512, 1024x1024, and 2048x2048) and illustrating the detail and sharpness of the different samples.
Analysis: Figure 6 qualitatively supports the quantitative results.

- For the DiT-XL-2-512 and PixArt-Sigma-1024 models, the D1, D2, and D3 configurations produce images visually comparable to the originals, indicating minimal perceptual degradation despite significant FLOPs reduction.
- More aggressive configurations such as D4, D5, and D6 achieve higher compression but show slight variations in detail. The overall image quality nonetheless remains acceptable, demonstrating the robustness of DiTFastAttn.
- For PixArt-Sigma-2K, the image quality remains very close to the original up to D4, and even D5 and D6 (with up to 76% attention FLOPs reduction) generate high-quality outputs. This reinforces the finding that DiTFastAttn is particularly effective at higher resolutions, maintaining visual fidelity even with substantial compression.

The following figure (Figure 14 from the original paper) shows images generated by PixArt-Sigma-XL-1024 at different thresholds, with and without a negative prompt:

This figure shows images generated by PixArt-Sigma-XL-1024 at different thresholds, comparing results with and without a negative prompt; the upper part shows a vehicle and the lower part a blue room, with D2, D4, and D6 labeled.

Analysis: Figure 14 illustrates DiTFastAttn's compatibility with negative conditioning. Even with compression, the model effectively incorporates negative prompts (like "Low quality"), as seen in the clear differences between images generated with and without negative prompts across the D2, D4, and D6 configurations. This suggests that the compression does not interfere with the guidance mechanism crucial for refining generation quality.
6.1.3. Results on Video Generation
The following figure (Figure 7 from the original paper) shows video generation results from OpenSora:
This figure is a video-generation comparison showing 16-frame videos generated at 240p with OpenSora V1.1. The left shows a hot-air balloon scene, the middle a sea turtle in the ocean, and the right a city at night, illustrating generation quality across different scenes.
The following figure (Figure 10 from the original paper) provides a more detailed comparison of video generation:

Analysis:

- DiTFastAttn was applied to OpenSora V1.1 for video generation at 240p resolution with 16 frames.
- The Attn FLOPs reductions for the D1 through D6 configurations were 7.63%, 19.50%, 30.16%, 37.66%, and 40.52%, respectively (the text lists five values for six configurations, so one value appears to be missing, but the trend of increasing reduction is clear). This confirms DiTFastAttn's applicability and effectiveness in reducing computational costs for video models.
- Qualitatively (Figures 7 and 10), configurations D1 to D4 showed effective performance, balancing computational efficiency with retention of visual quality. The generated videos were smooth, with natural transitions and preserved details.
- Configurations D5 and D6 (more aggressive compression) resulted in noticeable deviations from the original video characteristics, but the videos remained smooth and coherent, representing the intended narrative with reasonable accuracy. This suggests that while aggressive compression can compromise detail, it can still be valuable in resource-constrained environments.
6.1.4. #FLOPs Reduction and Speedup
The following are the results from Table 2 of the original paper:
| Model | Seqlen | Metric | ASC | WA-RS | WA-RS+ASC | AST |
|---|---|---|---|---|---|---|
| DiT-XL-2 512x512 | 1024 | Attn FLOPs | 50% | 77% | 38% | 0% |
| DiT-XL-2 512x512 | 1024 | Attn Latency | 59% | 85% | 51% | 4% |
| PixArt-Sigma-XL 1024x1024 | 4096 | Attn FLOPs | 50% | 51% | 26% | 0% |
| PixArt-Sigma-XL 1024x1024 | 4096 | Attn Latency | 54% | 54% | 31% | 3% |
| PixArt-Sigma-XL 2048x2048 | 16384 | Attn FLOPs | 50% | 33% | 16% | 0% |
| PixArt-Sigma-XL 2048x2048 | 16384 | Attn Latency | 52% | 35% | 19% | 1% |
Analysis of Individual Techniques (Table 2; the values give each technique's attention FLOPs and latency relative to full attention):
- ASC (Attention Sharing across CFG): As expected, ASC consistently cuts Attn FLOPs to 50% across all models and resolutions, because it halves the CFG computation. The Attn Latency reduction is slightly less than 50% (the latency figures are 52-59% of full attention), indicating some fixed overheads not directly proportional to FLOPs.
- WA-RS (Window Attention with Residual Sharing): The Attn FLOPs reduction from WA-RS is highly dependent on sequence length. For DiT-XL-2 512x512 (Seqlen 1024), it reduces FLOPs only to 77%. For PixArt-Sigma-XL 1024x1024 (Seqlen 4096), it reaches 51%, and for PixArt-Sigma-XL 2048x2048 (Seqlen 16384) it achieves a substantial reduction to 33% (a 67% reduction). This trend confirms that WA-RS becomes much more effective at higher resolutions (longer sequence lengths), where the quadratic complexity of full attention is most dominant and window attention's linear complexity is most beneficial. The Attn Latency reduction follows the same trend, becoming more significant at higher resolutions (e.g., 35% at 2048x2048).
- WA-RS+ASC: This combination shows that WA-RS and ASC are orthogonal and can be applied simultaneously for compounding benefits. For PixArt-Sigma-XL 2048x2048, it reduces Attn FLOPs to a mere 16% and Attn Latency to 19%, representing 84% and 81% reductions, respectively.
- AST (Attention Sharing across Timesteps): The table reports essentially 0% Attn FLOPs and only 1-4% Attn Latency for AST, because when AST is applied the attention computation is skipped entirely and only a small memory-reuse overhead remains. Its end-to-end benefit therefore depends on how many layers and timesteps the search assigns it to; the overall speedup (Figure 8) and the ablation study (Figure 9) reflect this cumulative effect better than the per-application cost shown here. A minimal sketch of the caching mechanism follows this list.
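The sketch below illustrates the reuse pattern assumed above: a per-layer cache stores the most recent attention output, and at (layer, timestep) pairs that the plan marks as shareable, the layer returns the cached tensor instead of recomputing attention. Names such as `ASTCache` and `reuse` are illustrative, not taken from the paper's code.

```python
class ASTCache:
    """Sketch of Attention Sharing across Timesteps (AST)."""

    def __init__(self):
        self.cached = {}                # layer index -> attention output from the last computed step

    def __call__(self, attn_fn, x, layer: int, reuse: bool):
        if reuse and layer in self.cached:
            return self.cached[layer]   # skip the attention computation entirely
        out = attn_fn(x)                # compute normally and cache for later timesteps
        self.cached[layer] = out        # keeping these tensors is the extra VRAM cost noted in the limitations
        return out
```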
6.1.5. Overall Latency of DiTFastAttn
The following figure (Figure 8 from the original paper) shows the latency for image generation and attention:

Analysis:
- Figure 8 plots the end-to-end latency (blue line) and multi-head attention module latency (orange line) as Attn FLOPs decrease under DiTFastAttn compression. DiTFastAttn achieves end-to-end latency reduction across all models and compression settings.
- Resolution-dependent Performance: the benefits are more pronounced at higher resolutions.
  - For DiT-XL-2 512x512, at D6, end-to-end latency drops to 40% of raw (a 2.5x speedup) and attention latency to 31% of raw (~3.2x speedup).
  - For PixArt-Sigma-XL 1024x1024, at D6, end-to-end latency is 81% of raw and attention latency is 54% of raw.
  - For PixArt-Sigma-XL 2048x2048, at D6, end-to-end latency is 56% of raw (~1.8x speedup) and attention latency is 37% of raw (~2.7x speedup).
- The results clearly show that as resolution increases, DiTFastAttn becomes increasingly effective at reducing latency for both the overall generation process and the attention module specifically, confirming that it alleviates the quadratic-complexity bottleneck; the rough arithmetic below illustrates why.
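As a back-of-the-envelope check (not the paper's FLOPs accounting), the score and value matmuls of self-attention cost on the order of 4·N²·d FLOPs per layer, so quadrupling the sequence length multiplies attention cost by roughly 16x. The head count and head dimension below are illustrative placeholders.

```python
def attn_flops(seq_len: int, head_dim: int = 64, num_heads: int = 16) -> float:
    """Approximate FLOPs of one self-attention layer: Q @ K^T plus attn @ V,
    counting each multiply-add as 2 FLOPs."""
    d = head_dim * num_heads
    return 4.0 * seq_len ** 2 * d

# Sequence lengths taken from Table 2 above.
for n in (1024, 4096, 16384):
    print(f"seq_len={n:6d}  attn FLOPs per layer ~ {attn_flops(n):.2e}")
```

The 16x jump from 4,096 to 16,384 tokens is consistent with attention dominating total latency at 2048x2048 and with DiTFastAttn's savings growing with resolution.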
6.1.6. Compression Plan after Search
The following figure (Figure 5 from the original paper) shows the compression plan under the D6 setting:
This diagram shows the compression plans of DiT-XL-2, PixArt-Sigma-XL-1024, and PixArt-Sigma-XL-2K under the D6 configuration (threshold 0.15). For each layer and timestep it indicates which technique is applied: Full Attention (Full Attn), Window Attention with Residual Sharing (WA-RS), Attention Sharing across CFG (ASC), their combination (WA-RS+ASC), or Attention Sharing across Timesteps (AST).
Analysis: Figure 5 provides a heatmap visualization of the compression plan determined by the greedy search method under the D6 setting (highest compression tolerance) for DiT-XL-2-512, PixArt-Sigma-XL-1024, and PixArt-Sigma-XL-2K.
- Variability across Models: the distribution of compression techniques (Full Attn, WA-RS, ASC, WA-RS+ASC, AST) varies significantly across the three models. This confirms the paper's claim that a universal compression strategy is not optimal and that a tailored search is necessary.
- DiT-XL-2-512: shows AST and ASC primarily in early timesteps, with Full Attention more prevalent in the initial attention layers.
- PixArt-Sigma-XL-1024: employs AST sporadically in the first two layers and in middle attention layers during intermediate timesteps. The combination of WA-RS and ASC is notably predominant in the final timesteps.
- PixArt-Sigma-XL-2K: exhibits a unique pattern, suggesting a different redundancy distribution at very high resolutions.
- The heatmaps (and the additional ones in Appendix A.5, Figures 11, 12, and 13) visually validate the identified redundancies and the efficacy of the greedy search in adapting the compression plan to model-specific and resolution-specific characteristics.

The following figure (Figure 11 from the original paper) shows the compression plan for DiT-XL-2-512x512 at different thresholds:
This diagram shows the compression plans for the DiT-XL-2 512x512 model at different thresholds with the DPM-Solver steps set to 50, comparing Full Attention (Full Attn), Window Attention with Residual Sharing (WA-RS), Attention Sharing across CFG (ASC), WA-RS+ASC, and Attention Sharing across Timesteps (AST).
The following figure (Figure 12 from the original paper) shows the compression plan for PixArt-Sigma-XL-1024x1024 at different thresholds:
This figure shows the compression plans for PixArt-Sigma-XL-1024x1024 at different thresholds, illustrating how full attention and the reduced-attention strategies (WA-RS, ASC, WA-RS+ASC, AST) are distributed across timesteps and layers. The thresholds (0.025, 0.05, 0.075, 0.1, 0.125, 0.15) change how the techniques are assigned.
The following figure (Figure 13 from the original paper) shows the compression plan for PixArt-Sigma-XL-2K at different thresholds:
This diagram shows the compression plans for PixArt-Sigma-XL-2K at different thresholds (0.025, 0.05, 0.075, 0.1, 0.125, 0.15), depicting the attention pattern chosen for each timestep and layer and comparing the different strategies (Full Attn, WA-RS, ASC, and so on).
Analysis of Compression Plans (Figures 11, 12, 13): These figures, showing compression plans at various thresholds, further underscore the dynamic nature of the redundancy.
- As the threshold increases (from D1 to D6), more aggressive compression techniques (WA-RS, ASC, WA-RS+ASC, AST) are applied to a greater number of layers and timesteps. This is intuitive, as a higher threshold allows more error tolerance.
- The patterns of where AST, ASC, and WA-RS are applied (e.g., AST often in early/mid steps, WA-RS in later steps or in high-resolution models) suggest that different types of redundancy become dominant or tolerable at different stages of the denoising process and in different Transformer layers.
- For instance, WA-RS becomes more widely used for PixArt-Sigma-XL-2K (Figure 13), especially at higher settings, demonstrating its particular utility in high-resolution scenarios where spatial redundancy is a major factor.
6.2. Ablation Studies / Parameter Analysis
The following figure (Figure 9 from the original paper) shows the ablation study results:

Analysis (Figure 9 on DiT-XL-2-512):
6.2.1. DiTFastAttn Outperforms Single Methods (Left Panel)
- The left panel compares the quality (FID and IS) of DiTFastAttn against the individual compression techniques (ASC, WA-RS, AST) at the same Attention TFLOPs budget.
- DiTFastAttn consistently maintains better quality metrics (IS and FID) across the range of TFLOPs reductions than any single method, demonstrating the benefit of combining techniques and selecting them intelligently with the greedy search.
- Among the individual techniques, AST delivers the best generative quality for the initial reductions. However, beyond a certain level of compression (around 2.2 TFLOPs), pushing further with AST alone degrades performance significantly, causing the search algorithm to stop applying it. By combining AST with WA-RS and ASC, DiTFastAttn allows deeper compression while preserving better quality than any single method.
6.2.2. Higher Steps Improve DiTFastAttn's Performance (Middle Panel)
- The middle panel investigates the impact of the number of DPM-Solver steps (20, 30, 40, 50) on DiTFastAttn's performance.
- As the number of steps increases, DiTFastAttn can compress more computation (reach lower Attention TFLOPs) while maintaining comparable or even better quality (IS and FID).
- This suggests that with more denoising steps, the diffusion process has more opportunities to correct errors introduced by compression, and the redundancies (especially temporal) become more pronounced, allowing greater savings without noticeable quality loss.
6.2.3. The Residual Caching Technique is Essential (Right Panel)
- The right panel compares Window Attention with Residual Sharing (WA-RS) against plain Window Attention (WA), i.e., without residual sharing.
- WA-RS consistently maintains significantly better generative performance (IS and FID) than WA at the same compression ratio (the same Attention TFLOPs).
- This is a crucial finding: directly replacing full attention with window attention (WA) causes a substantial drop in performance (FID increases and IS decreases dramatically). The residual caching mechanism in WA-RS effectively mitigates the loss of long-range dependencies, confirming its essential role in enabling window attention for post-training DiT compression without severe quality degradation. A minimal sketch of the residual-sharing mechanism follows this list.
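To make the mechanism concrete, here is a hedged, framework-agnostic sketch of WA-RS as described above: at a step where full attention is still computed, the layer caches the difference between full and window attention (the long-range part), and at subsequent compressed steps it adds that cached residual to the cheap window-attention output. The function and cache names are illustrative, not the authors' implementation.

```python
def wa_rs_step(full_attn, window_attn, x, cache: dict, use_full: bool):
    """Sketch of Window Attention with Residual Sharing (WA-RS).

    full_attn / window_attn: callables for the layer's global and windowed attention.
    cache: holds the long-range residual from the most recent full-attention step.
    """
    if use_full:
        out_full = full_attn(x)
        cache["residual"] = out_full - window_attn(x)   # long-range contribution to reuse later
        return out_full
    # Compressed step: cheap local attention plus the cached long-range residual.
    return window_attn(x) + cache["residual"]
```

Dropping the cached residual recovers plain WA, the variant that degrades sharply in the right panel.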
6.3. Data Presentation (Tables)
The following are the results from Table 3 of the original paper:
| Model | Resolution | Config | Latency (s) | Attn Latency (s) | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 6.66 | 2.26 | 25.43 | 408.16 |
| | | D1 | 6.61 | 2.22 | 25.32 | 412.24 |
| | | D2 | 6.45 | 2.05 | 24.67 | 412.18 |
| | | D3 | 2.89 | 0.91 | 23.76 | 411.74 |
| | | D4 | 2.78 | 0.83 | 21.52 | 391.80 |
| | | D5 | 2.77 | 0.80 | 19.32 | 370.07 |
| | | D6 | 2.66 | 0.71 | 16.80 | 352.20 |
The following are the results from Table 4 of the original paper:
| Model | Config | Latency (s) | Attn Latency (s) | FID | IS | CLIP |
| --- | --- | --- | --- | --- | --- | --- |
| PixArt-Sigma-XL 1024x1024 | Raw | 12.76 | 5.30 | 24.33 | 55.65 | 31.27 |
| | D1 | 12.55 | 5.10 | 24.27 | 55.73 | 31.27 |
| | D2 | 11.98 | 4.49 | 24.25 | 55.69 | 31.26 |
| | D3 | 11.42 | 4.01 | 24.16 | 55.61 | 31.25 |
| | D4 | 11.06 | 3.60 | 24.07 | 55.32 | 31.24 |
| | D5 | 10.73 | 3.25 | 24.17 | 54.54 | 31.22 |
| | D6 | 10.31 | 2.85 | 23.94 | 52.74 | 31.18 |
| PixArt-Sigma-XL 2048x2048 | Raw | 39.86 | 27.57 | 23.67 | 51.89 | 31.47 |
| | D1 | 35.75 | 23.62 | 23.28 | 52.34 | 31.46 |
| | D2 | 31.44 | 19.29 | 22.90 | 53.01 | 31.32 |
| | D3 | 28.99 | 16.51 | 22.96 | 52.54 | 31.36 |
| | D4 | 26.18 | 13.88 | 22.95 | 51.74 | 31.39 |
| | D5 | 23.86 | 11.66 | 22.82 | 51.22 | 31.34 |
| | D6 | 22.27 | 10.13 | 22.38 | 49.34 | 31.28 |
The following are the results from Table 5 of the original paper:
| Model | Resolution | Config | Latency (s) | Attn Latency (s) | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 32.62 | 11.40 | 3.16 | 219.97 |
| | | D1 | 31.53 | 10.21 | 3.09 | 218.20 |
| | | D2 | 29.35 | 8.09 | 3.10 | 210.36 |
| | | D3 | 27.80 | 6.56 | 3.54 | 196.05 |
| | | D4 | 26.96 | 5.77 | 4.52 | 180.34 |
Analysis of Latency Values (Tables 3, 4, 5):
These tables provide detailed latency measurements for different models, resolutions, and compression configurations, complementing the FLOPs analysis.
- Overall Latency Reduction: across all models and resolutions, DiTFastAttn consistently reduces both total generation latency and attention latency. For PixArt-Sigma-XL 2048x2048, total latency falls from 39.86s (Raw) to 22.27s (D6) and attention latency from 27.57s to 10.13s, a significant practical speedup for high-resolution generation.
- Attention as Bottleneck: attention latency makes up a large portion of the total latency, especially for high-resolution models (e.g., 27.57s out of 39.86s for PixArt-Sigma-XL 2048x2048 Raw). This confirms the paper's premise that attention is the primary computational bottleneck, so compressing it yields substantial end-to-end speedups.
- Impact of Sampling Steps (Table 5): with the 250-step IDDPM sampler for DiT-XL-2-512x512, the raw latency is much higher (32.62s total, 11.40s Attn Latency) than with the 50-step DPM-Solver (6.66s total, 2.26s Attn Latency in Table 3). Longer denoising processes benefit even more from DiTFastAttn, since the savings accumulate over more steps: at D4, total latency drops to 26.96s (from 32.62s) and Attn Latency to 5.77s (from 11.40s). The implied speedups are worked out in the short snippet below.
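For convenience, the speedups implied by these tables can be computed directly from the reported latencies (values copied from Tables 3-5 above; this is just arithmetic, not additional measurement):

```python
# (case) -> (raw total, compressed total, raw attention, compressed attention), in seconds
cases = {
    "DiT-XL-2 512, 50-step DPM-Solver, D6": (6.66, 2.66, 2.26, 0.71),
    "PixArt-Sigma-XL 2048x2048, D6":        (39.86, 22.27, 27.57, 10.13),
    "DiT-XL-2 512, 250-step IDDPM, D4":     (32.62, 26.96, 11.40, 5.77),
}
for name, (t_raw, t_cmp, a_raw, a_cmp) in cases.items():
    print(f"{name}: end-to-end {t_raw / t_cmp:.2f}x, attention {a_raw / a_cmp:.2f}x")
```

This reproduces the roughly 2.5x and 1.8x end-to-end speedups and ~3.2x and ~2.7x attention speedups quoted earlier, plus a milder ~1.2x end-to-end gain for the 250-step IDDPM run at D4.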
6.3.1. Compression Plan Search Time
The following are the results from Table 6 of the original paper:
| Model | Resolution | Config | Plan Search Time |
| --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 04m39s |
| | | D2 | 04m08s |
| | | D4 | 03m49s |
| | | D6 | 03m14s |
| PixArt-Sigma-XL | 1024x1024 | Raw | 22m02s |
| | | D2 | 20m12s |
| | | D4 | 17m50s |
| | | D6 | 15m49s |
| PixArt-Sigma-XL | 2048x2048 | Raw | 1h50m13s |
| | | D2 | 1h46m04s |
| | | D4 | 1h22m53s |
| | | D6 | 1h23m01s |
Analysis:
- Table 6 shows the time taken to generate the compression plan with the greedy search method.
- The plan search time increases significantly with resolution, from a few minutes for DiT-XL-2 512x512 to well over an hour for PixArt-Sigma-XL 2048x2048. This is expected, since evaluating candidate strategies becomes more expensive at longer sequence lengths.
- Interestingly, for a given model, the plan search time decreases as the config (i.e., the threshold) moves from Raw toward D6. A higher threshold means a higher tolerance for approximation loss, so the greedy search, which tries strategies in ascending order of compression ratio, finds an acceptable strategy earlier (possibly a less aggressive one) and breaks out of the innermost loop sooner, reducing the overall search time. A minimal sketch of this loop follows the list.
- While the search time can be substantial for high-resolution models, it is a one-time cost to generate the plan, after which inference can be greatly accelerated.
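The sketch below shows the general shape of such a threshold-based greedy loop; the candidate ordering, the error measure, and how calibration outputs are obtained are all simplifications, not the paper's exact procedure.

```python
def greedy_plan_search(timesteps, layers, ordered_candidates, error_fn, threshold):
    """Hedged sketch of a greedy compression-plan search.

    ordered_candidates: compression strategies, assumed ordered so that the
        first acceptable one is the one the plan should keep.
    error_fn(t, layer, strategy): output error of applying `strategy` at
        (t, layer) on calibration prompts (details abstracted away).
    threshold: the tolerance; a larger value lets the inner loop break earlier.
    """
    plan = {}
    for t in timesteps:
        for layer in layers:
            plan[(t, layer)] = "full_attention"          # default: no compression
            for strategy in ordered_candidates:
                if error_fn(t, layer, strategy) <= threshold:
                    plan[(t, layer)] = strategy
                    break                                # accept and move on, as noted above
    return plan
```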
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiTFastAttn, a novel post-training compression method designed to accelerate Diffusion Transformer (DiT) models during inference by addressing the computational burden of their self-attention mechanisms. The core contribution lies in identifying and exploiting three types of redundancies: spatial redundancy (local focus of attention heads), temporal redundancy (similarity of attention outputs across neighboring steps), and conditional redundancy (similarity between conditional and unconditional inference outputs in CFG).
To tackle these, DiTFastAttn proposes three innovative techniques:
- Window Attention with Residual Sharing (WA-RS): employs window attention for spatial efficiency while preserving long-range dependencies through cached residuals.
- Attention Sharing across Timesteps (AST): skips redundant computation by reusing attention outputs across similar denoising steps.
- Attention Sharing across CFG (ASC): reduces CFG overhead by sharing attention outputs between conditional and unconditional evaluations.

A simple greedy search algorithm selects the combination of these techniques for each layer and timestep, balancing compression against generative quality. Experimental results on DiT-XL, PixArt-Sigma (image generation), and OpenSora (video generation) consistently demonstrate the effectiveness of DiTFastAttn. The method achieves significant attention FLOPs reduction (up to 76%) and considerable end-to-end speedup (up to 1.8x), particularly for high-resolution generation, while largely preserving generation quality. The ablation studies confirm the critical role of residual caching in WA-RS and the benefits of combining multiple compression strategies.
7.2. Limitations & Future Work
The authors acknowledge several limitations of DiTFastAttn:
- Post-Training Nature: as a post-training compression technique, DiTFastAttn cannot leverage the training process to avoid some of the performance drops that occur under aggressive compression. If DiTFastAttn were integrated during training, it might achieve even better compression-quality trade-offs.
- VRAM Increase for AST: when Attention Sharing across Timesteps (AST) is applied, the attention hidden states from previous timesteps must be kept in memory for reuse. This caching can increase VRAM usage, which may be a concern in memory-constrained environments and partially offsets the FLOPs benefits.
- Suboptimal Compression Plan: the greedy method used to decide the compression plan is described as "simple." While effective, it does not guarantee finding the globally optimal plan across all layers and timesteps, leaving room for further efficiency improvements.
- Limited Scope to Attention Module: the method focuses solely on reducing the computational cost of the attention module. Other computationally intensive components of Diffusion Transformers, such as feed-forward networks or convolutional layers (if present), are not addressed.

Based on these limitations, potential future research directions include:

- Exploring joint training/compression methods that integrate DiTFastAttn's principles into the model training pipeline.
- Developing VRAM-optimized variants of AST that manage cached states more efficiently, perhaps by selectively storing only critical information or recomputing less critical elements on the fly.
- Investigating more advanced optimization algorithms (e.g., reinforcement learning, neural architecture search) for finding the optimal compression plan, potentially yielding better compression-quality trade-offs.
- Extending the compression techniques to other modules within Diffusion Transformers or to other large generative models for broader end-to-end acceleration.
7.3. Personal Insights & Critique
DiTFastAttn presents a highly practical and well-motivated approach to accelerating Diffusion Transformers.
Strengths:
- Practicality of Post-Training: the focus on post-training compression is a significant strength. Retraining large DiT models is prohibitively expensive for most researchers and practitioners; a method that works out of the box on pre-trained models democratizes access to efficient DiT inference.
- Systematic Redundancy Identification: the clear identification of spatial, temporal, and conditional redundancies is elegant and insightful. This structured approach allows targeted compression strategies that address specific bottlenecks.
- Ingenious Residual Sharing: Window Attention with Residual Sharing (WA-RS) is particularly clever. Window attention is efficient locally, but the loss of long-range dependencies is its Achilles' heel. Reusing the residual from full attention effectively "patches" this gap without incurring the quadratic cost, demonstrating a deep understanding of Transformer behavior, and it mitigates a known problem of local attention without retraining.
- Resolution-Dependent Performance: the finding that DiTFastAttn yields greater benefits at higher resolutions is crucial. This is precisely where the quadratic complexity of attention becomes most problematic, making the method highly relevant for cutting-edge high-resolution image and video generation.
Critique & Potential Issues:
- Cost of Greedy Search: while a one-time cost, the search time for the compression plan can be substantial, especially for very large models and high resolutions (e.g., almost 2 hours for PixArt-Sigma-XL 2048x2048). This may be a barrier for users who need to adapt DiTFastAttn quickly to many configurations or models. More efficient search algorithms, perhaps ones that transfer compression plans across similar models or resolutions, could help.
- VRAM Impact of AST: the acknowledged VRAM increase for AST is a practical concern. In many high-resolution generative tasks, VRAM is already a limiting factor; the FLOPs savings may be partially offset if the extra VRAM usage forces smaller batch sizes or limits deployment on less powerful hardware. A more detailed analysis of the trade-off (e.g., VRAM usage vs. speedup) would strengthen the case for AST's overall utility.
- Loss Function Robustness: the mean relative absolute error loss L(O, O') uses |O| + ε in the denominator. While ε guards against division by zero, if both O and O' are very small (close to zero), small absolute differences can still produce large relative errors. Depending on the range of the attention outputs, this could make the loss sensitive to noise in very low-magnitude feature dimensions and guide the search suboptimally in such cases (a plausible form of this loss is written out after this list).
- Generality beyond Diffusion: although designed for Diffusion Transformers, the principles of spatial, temporal, and conditional redundancy (especially spatial and temporal) may apply to other Transformer-based generative or discriminative models that process sequential data or operate iteratively. Investigating such cross-domain applicability would be a valuable extension.
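For reference, one plausible element-wise form of the loss described above (my reading of the text, not an equation quoted from the paper) is:

$$
L(O, O') = \operatorname{mean}\!\left(\frac{\lvert O - O' \rvert}{\lvert O \rvert + \epsilon}\right)
$$

Under this form, entries where |O| is near zero are divided by roughly ε, so their relative error can grow to about |O − O'| / ε, which is exactly the sensitivity flagged above.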
Inspirations:
The paper offers valuable insights into how to systematically analyze and optimize large models post-training. The idea of identifying specific types of redundancies based on common model behaviors (local attention, iterative refinement, guidance mechanisms) and then designing targeted, lightweight techniques to exploit them is highly transferable. This approach could be applied to other computationally intensive blocks beyond attention, and in other iterative deep learning processes where intermediate states might exhibit high similarity. The success of the simple greedy search also suggests that sophisticated, high-overhead NAS (Neural Architecture Search) might not always be necessary; sometimes, well-motivated heuristics and a good loss function are sufficient for significant gains.