DiTFastAttn: Attention Compression for Diffusion Transformer Models
TL;DR Summary
DiTFastAttn is presented as a post-training compression method to address computational bottlenecks in Diffusion Transformers. It effectively reduces spatial, temporal, and conditional redundancies, achieving up to a 76% reduction in attention FLOPs and up to a 1.8x end-to-end speedup for high-resolution generation, without retraining.
Abstract
Diffusion Transformers (DiT) excel at image and video generation but face computational challenges due to the quadratic complexity of self-attention operators. We propose DiTFastAttn, a post-training compression method to alleviate the computational bottleneck of DiT. We identify three key redundancies in the attention computation during DiT inference: (1) spatial redundancy, where many attention heads focus on local information; (2) temporal redundancy, with high similarity between the attention outputs of neighboring steps; (3) conditional redundancy, where conditional and unconditional inferences exhibit significant similarity. We propose three techniques to reduce these redundancies: (1) Window Attention with Residual Sharing to reduce spatial redundancy; (2) Attention Sharing across Timesteps to exploit the similarity between steps; (3) Attention Sharing across CFG to skip redundant computations during conditional generation. We apply DiTFastAttn to DiT, PixArt-Sigma for image generation tasks, and OpenSora for video generation tasks. Our results show that for image generation, our method reduces up to 76% of the attention FLOPs and achieves up to 1.8x end-to-end speedup at high-resolution (2k x 2k) generation.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "DiTFastAttn: Attention Compression for Diffusion Transformer Models". This title clearly indicates a focus on optimizing the computational efficiency of Diffusion Transformer models, specifically by compressing their attention mechanisms.
1.2. Authors
The authors are Zhihang Yuan, Hanling Zhang, Pu Lu, Xuefei Ning, Linfeng Zhang, Tianchen Zhao, Shengen Yan, Guohao Dai, and Yu Wang. Their affiliations include:
- Tsinghua University (1)
- Infinigence AI (2)
- Shanghai Jiao Tong University (3)

This suggests a collaborative effort between academic institutions and an AI company, often indicating a blend of theoretical research and practical application.
1.3. Journal/Conference
The paper was posted on arXiv, a preprint server, on 2024-06-12. As an arXiv preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a highly influential platform for the rapid dissemination of research in physics, mathematics, computer science, and other fields, allowing researchers to share their work before formal publication. Many significant papers first appear on arXiv before being accepted into top-tier conferences or journals.
1.4. Publication Year
The paper was published in 2024.
1.5. Abstract
Diffusion Transformers (DiT) are powerful generative models for images and videos but face significant computational hurdles due to the quadratic complexity of their self-attention operations. This paper introduces DiTFastAttn, a post-training compression method designed to alleviate this computational bottleneck during DiT inference. The authors identify three primary redundancies in attention computation: (1) spatial redundancy, where many attention heads focus locally; (2) temporal redundancy, where attention outputs are highly similar across neighboring denoising steps; and (3) conditional redundancy, where conditional and unconditional inferences show high similarity, particularly with Classifier-Free Guidance (CFG).
To address these, DiTFastAttn proposes three techniques: (1) Window Attention with Residual Sharing (WA-RS) for spatial redundancy, which uses window attention and caches/reuses residual information from full attention to maintain performance; (2) Attention Sharing across Timesteps (AST) for temporal redundancy, by sharing attention outputs between similar neighboring steps; and (3) Attention Sharing across CFG (ASC) for conditional redundancy, by skipping redundant computations during unconditional generation.
Applied to DiT and PixArt-Sigma for image generation, and OpenSora for video generation, DiTFastAttn demonstrates substantial efficiency gains. For image generation, it achieves up to a 76% reduction in attention FLOPs and up to a 1.8x end-to-end speedup, especially at high resolutions (e.g., 2k x 2k).
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2406.08552
- PDF Link: https://arxiv.org/pdf/2406.08552v2.pdf

This paper is an arXiv preprint.
2. Executive Summary
2.1. Background & Motivation
The core problem DiTFastAttn aims to solve is the substantial computational demand of Diffusion Transformer (DiT) models, particularly during inference for high-resolution image and video generation. While DiT models excel in generative quality, their underlying self-attention mechanism, a core component of Transformer architectures, has quadratic complexity $O(N^2)$ with respect to the input token length $N$. As image and video resolutions increase, the token length grows significantly, making attention computation the primary bottleneck. For instance, generating a 2Kx2K image can involve 16k tokens, leading to several seconds of attention computation even on powerful GPUs.
Further exacerbating this issue is the nature of diffusion model inference, which requires many sequential denoising steps and often employs Classifier-Free Guidance (CFG), effectively doubling the computational cost per step by performing both conditional and unconditional network evaluations.
Prior research on accelerating attention (e.g., Local Attention, Swin Transformer, GQA) often involved architectural changes that necessitate expensive retraining of the entire model. Given the massive data and computational resources required to train DiT models, there is a crucial need for post-training compression methods that can reduce computational costs without incurring retraining expenses or significant performance degradation. This paper's entry point is to identify and exploit intrinsic redundancies within the attention computation of pre-trained DiT models during inference.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Identification of Three Key Redundancies: The authors rigorously identify and characterize three distinct types of redundancy in DiT attention computation during inference:
  - Spatial Redundancy: Many attention heads primarily focus on local information, with attention values for distant tokens being negligible.
  - Temporal Redundancy: Attention outputs of the same head exhibit high similarity across neighboring denoising steps.
  - Conditional Redundancy: During CFG, attention outputs from conditional and unconditional inferences show significant similarity for certain heads and timesteps.
- Proposal of Three Corresponding Compression Techniques: To address each identified redundancy, DiTFastAttn introduces novel post-training compression techniques:
  - Window Attention with Residual Sharing (WA-RS): Replaces full attention with window attention for spatially redundant layers and, crucially, caches and reuses the residual difference between the full and window attention outputs from previous steps to maintain long-range dependencies and preserve performance.
  - Attention Sharing across Timesteps (AST): Exploits temporal similarity by caching attention outputs from an earlier step and reusing them for subsequent similar steps, thereby skipping redundant computations.
  - Attention Sharing across CFG (ASC): Leverages conditional similarity by reusing attention outputs from the conditional neural network evaluation for the unconditional evaluation during CFG, effectively halving the attention cost of CFG.
- Development of a Greedy Compression Plan Decision Method: A simple greedy algorithm is proposed to select the most appropriate compression strategy (or combination of strategies) for each layer and timestep based on a predefined loss threshold, balancing compression against quality.
- Extensive Experimental Validation: DiTFastAttn is applied to various DiT models, including DiT-XL and PixArt-Sigma (image generation) and OpenSora (video generation).

The key conclusions and findings are:

- Significant Computational Savings: DiTFastAttn consistently reduces computational costs; for image generation, it achieves up to a 76% reduction in attention FLOPs.
- Substantial Speedup: It delivers an end-to-end speedup of up to 1.8x, most noticeable at high-resolution generation (e.g., 2048x2048 images with PixArt-Sigma).
- Quality Preservation: The method effectively preserves the generative performance and visual quality of the original DiT models, especially at higher resolutions and moderate compression levels (e.g., the D1-D4 configurations).
- Resolution-Dependent Efficiency: The higher the resolution of the generated content, the greater the computational savings and latency reduction achieved by DiTFastAttn.
- Effectiveness of Residual Sharing: The ablation study confirms that residual caching is crucial for window attention to maintain generative performance, preventing the significant quality drops seen with window attention alone.
- Variability of Redundancy: The compression plans produced by the greedy search show that the distribution of the different redundancy types varies across models, layers, and timesteps, validating the need for a tailored search method rather than a universal strategy.

These findings address the high computational cost of pre-trained DiT models, making them more efficient and practical to deploy, especially for high-resolution content generation, without requiring expensive retraining.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DiTFastAttn, a solid grasp of Diffusion Models, Transformers, and associated concepts like self-attention and Classifier-Free Guidance is essential.
3.1.1. Diffusion Models (DMs)
Diffusion Models are a class of generative models that learn to reverse a diffusion process. Imagine a clean image gradually corrupted by adding Gaussian noise over many timesteps until it becomes pure noise. A diffusion model learns to reverse this process: it's trained to predict and subtract the noise from a noisy image at each timestep to progressively denoise it back to a clean image.
- Denoising Process: During inference (generation), the model starts with random noise and iteratively applies a denoising neural network for a specified number of steps. In each step, the network takes the current noisy image and the current timestep (often encoded as an embedding) as input and outputs a prediction of the noise that was added. This predicted noise is then subtracted to get a slightly cleaner image, and the process repeats until a clean image is obtained.
- Timesteps: The denoising process is discretized into a number of timesteps (e.g., 50 or 1000). The model's behavior can change significantly across these timesteps, as it deals with different noise levels. Early steps remove large amounts of noise, while later steps refine details.
3.1.2. Transformers
Transformers are neural network architectures primarily known for their success in natural language processing (NLP) but have been widely adopted in computer vision (Vision Transformers or ViTs) and other domains. Their core innovation is the self-attention mechanism.
- Self-Attention Mechanism: Instead of processing data sequentially like Recurrent Neural Networks (RNNs), Transformers process all input elements (e.g., tokens in text, image patches in vision) in parallel. The self-attention mechanism allows each input element to weigh the importance of all other input elements when computing its own representation. This is achieved through three learned linear projections: Query (Q), Key (K), and Value (V) matrices.
  - Query (Q): Represents what an element is looking for.
  - Key (K): Represents what an element contains.
  - Value (V): Contains the information an element holds.

  The attention score between a query and a key determines how much value information from that key should be "attended to" by the query. The self-attention formula is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$: Query matrix, typically of shape (sequence length, $d_k$), where $d_k$ is the dimension of the keys.
  - $K$: Key matrix, typically of shape (sequence length, $d_k$).
  - $V$: Value matrix, typically of shape (sequence length, $d_v$), where $d_v$ is the dimension of the values.
  - $QK^T$: The dot product between the Query and Key matrices, calculating similarity scores. This results in a matrix of shape (sequence length, sequence length), which quantifies how much each token attends to every other token.
  - $\sqrt{d_k}$: A scaling factor to prevent large dot-product values from pushing the softmax function into regions with very small gradients, stabilizing training.
  - $\mathrm{softmax}$: Normalizes the attention scores so they sum to 1, representing probability distributions.
  - $\mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$: The final attention output, a weighted sum of Value vectors, where the weights are determined by the softmax of the query-key similarities.
- Multi-Head Attention (MHA): Instead of a single attention mechanism, Transformers use multiple attention heads operating in parallel. Each head learns different $Q$, $K$, $V$ linear projections, allowing the model to focus on different parts of the input or different relationships simultaneously. The outputs from all heads are then concatenated and linearly transformed to produce the final output.
- Quadratic Complexity: The computation of $QK^T$ involves multiplying two matrices where one dimension is the sequence length $N$. If $Q$ is $N \times d_k$ and $K^T$ is $d_k \times N$, their product is $N \times N$. This makes the computational complexity of self-attention $O(N^2)$, which becomes very expensive for long sequences (high-resolution images/videos).
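To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention (illustrative only; tensor names and shapes are our own, not the paper's code):

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    """q, k: (batch, seq_len, d_k); v: (batch, seq_len, d_v)."""
    d_k = q.size(-1)
    # (batch, seq_len, seq_len) score matrix: this is the O(N^2) part
    scores = torch.matmul(q, k.transpose(-2, -1)) / d_k ** 0.5
    weights = F.softmax(scores, dim=-1)   # normalize each query's scores
    return torch.matmul(weights, v)       # weighted sum of value vectors

# Doubling seq_len roughly quadruples the size of the score matrix.
q = k = v = torch.randn(1, 1024, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (1, 1024, 64)
```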
3.1.3. Diffusion Transformers (DiT)
Diffusion Transformers (DiT) (Peebles & Xie, 2023) replace the traditional U-Net architecture, commonly used in early diffusion models (e.g., DDPM), with a Transformer backbone. Instead of operating on raw image pixels, DiT models often work in the latent space of an autoencoder, converting images into smaller latent tokens. These latent tokens are then processed by a Transformer network that incorporates timestep embeddings and conditional information (e.g., text embeddings) to predict the noise in the latent space. By leveraging Transformers, DiT models achieve better scalability and performance, especially for high-resolution content, but inherit the quadratic complexity bottleneck of self-attention.
3.1.4. Classifier-Free Guidance (CFG)
Classifier-Free Guidance (CFG) (Ho & Salimans, 2022) is a technique used to improve the quality and adherence to conditional inputs (e.g., text prompts) in diffusion models. Instead of relying on a separate classifier network, CFG combines the predictions from two parallel denoising network inferences at each timestep:
- An unconditional prediction: The denoising network predicts noise without any conditional input (e.g., an empty text prompt). This learns the general distribution of images.
- A conditional prediction: The denoising network predicts noise with the given conditional input (e.g., the actual text prompt). This guides the generation towards the specified condition.

The final noise prediction is then a weighted combination of the two, typically:
$ \hat{\epsilon}_\theta(x_t | c) = \epsilon_\theta(x_t | \emptyset) + w \cdot (\epsilon_\theta(x_t | c) - \epsilon_\theta(x_t | \emptyset)) $
Where:
- $\hat{\epsilon}_\theta(x_t | c)$: The final guided noise prediction.
- $\epsilon_\theta(x_t | c)$: The conditional noise prediction.
- $\epsilon_\theta(x_t | \emptyset)$: The unconditional noise prediction.
- $w$: The guidance scale, a hyperparameter that controls how strongly the conditional input influences the generation. A higher $w$ leads to stronger adherence to the prompt but can sometimes reduce diversity or quality.

The drawback of CFG is that it requires two forward passes through the denoising network for every timestep, effectively doubling the computational cost per step.
3.1.5. Performance Metrics
- FLOPs (Floating Point Operations): A measure of the total number of floating-point operations performed by a model. It quantifies the computational cost. Lower FLOPs indicate higher efficiency.
- Latency: The time it takes for a model to complete a specific task (e.g., generate one image). It measures speed. Lower latency means faster execution.
3.2. Previous Works
The paper contextualizes DiTFastAttn by discussing existing efforts in diffusion models, Vision Transformer compression, local attention, attention sharing, and other diffusion acceleration methods.
3.2.1. Diffusion Models
The paper notes the evolution from U-Net based diffusion models (Ho et al., 2020; Rombach et al., 2022) to Transformer architectures (DiT by Peebles & Xie, 2023). It highlights PixArt-Sigma (Chen et al., 2024) for high-resolution image generation and Sora (Brooks et al., 2024) for video generation as examples of DiT's capabilities.
3.2.2. Vision Transformer Compression
The computational overhead of attention has driven various compression techniques:
- FlashAttention (Dao, 2023): Optimizes attention computation by dividing input tokens into smaller tiles, improving memory access patterns and reducing latency. This is an algorithmic optimization rather than a model compression. DiTFastAttn builds upon FlashAttention-2.
- Token Pruning/Merging: These methods aim to reduce the sequence length by removing or combining less important tokens. DynamicViT (Rao et al., 2021) uses a prediction network to dynamically filter tokens. Adaptive Sparse ViT (Liu et al., 2022) filters tokens based on attention values and the L2 norm of features. Another work uses segmentation labels to guide token merging. Huang et al. (2023) downsample tokens before attention and then upsample. A further approach suggests filtering for deeper layers and merging for shallower layers. These methods typically require some form of retraining or fine-tuning to adapt to the pruned/merged token structures.
3.2.3. Local Attention
This paradigm restricts attention computation to a fixed-size window of neighboring tokens to reduce quadratic complexity to linear.
- Longformer (Beltagy et al.) introduced linear-scaling attention.
- BigBird (Zaheer et al., 2020) combines window attention with random and global attention for long-range dependencies.
- Swin Transformer (Liu et al., 2021) uses non-overlapping local windows and shifted windows across layers to capture global context.
- Twins Transformer (Chu et al., 2021), FasterViT (Vasu et al., 2023), and Neighborhood Attention Transformer (Hassani et al., 2023) also employ window-based attention with variations to capture global context.
- DiTFastAttn uses fixed-size window attention but enhances it with Residual Sharing to maintain long-range dependencies in a training-free manner.
3.2.4. Attention Sharing
This category exploits similarities in attention mechanisms or outputs to reduce computation.
- GQA (Grouped Query Attention) (Ainslie et al., 2023) groups query heads and shares key and value parameters within each group, reducing memory and improving efficiency. This is an architectural change made during training.
- PSViT (Chen et al., 2021) observed similarity in attention maps across different Transformer layers and proposed sharing them to reduce redundancy.
- DeepCache (Ma et al., 2023) noted similarity in the high-level features of U-Net-based diffusion models across timesteps and reused these features, skipping intermediate layers.
- TGATE (Zhang et al., 2024) found that cross-attention outputs in text-conditional diffusion models converge to a fixed point after several denoising steps, and caches and reuses this output.
- DiTFastAttn extends attention sharing by demonstrating similarity in attention outputs both CFG-wise and step-wise, and importantly, considers layer-specific and timestep-specific variations in similarity for selective sharing.
3.2.5. Other Methods to Accelerate Diffusion Models
- Network Quantization: Reduces the bitwidth of weights and activations (Shang et al., 2023; Zhao et al., 2024b, 2024a).
- Scheduler Optimization: Decreases the number of denoising steps (Song et al., 2020; Lu et al., 2022; Liu et al., 2023a).
- Distillation: Reduces the number of timesteps by training a smaller model or distilling knowledge from a larger one (Salimans & Ho, 2022; Meng et al., 2023; Liu et al., 2023b).

DiTFastAttn is complementary to these methods, as it operates independently of quantization, scheduler, or timestep settings.
3.3. Technological Evolution
The evolution of generative AI has seen a shift from Generative Adversarial Networks (GANs) (Creswell et al., 2018) to diffusion models for superior performance. Early diffusion models were predominantly built on U-Net architectures (e.g., DDPM). However, with the success of Transformers in other domains, Diffusion Transformers (DiT) emerged as a scalable and powerful alternative, replacing U-Nets with Transformer blocks. This transition brought enhanced capabilities, especially for high-resolution content, but also inherited the quadratic complexity of the self-attention mechanism, which quickly became a bottleneck.
Efforts to address Transformer computational costs generally began with architectural designs like Swin Transformer for Vision Transformers or memory-optimized implementations like FlashAttention. More recently, researchers have focused on identifying and exploiting redundancies within the attention mechanism itself, through methods like token pruning/merging or attention sharing across layers or timesteps.
DiTFastAttn fits into this timeline by focusing specifically on the DiT architecture and introducing a post-training compression approach. This is a significant distinction from many prior methods that require extensive retraining. By identifying spatial, temporal, and conditional redundancies specific to the DiT inference process, DiTFastAttn offers a practical solution to accelerate existing pre-trained DiT models without the prohibitive cost of retraining, marking a step towards more efficient deployment of these powerful generative models.
3.4. Differentiation Analysis
Compared to the main methods in related work, DiTFastAttn offers several core differences and innovations:
- Post-Training Compression: Many prior acceleration techniques (e.g., Swin Transformer, GQA, or token pruning/merging when integrated into the architecture) necessitate architectural changes or retraining. DiTFastAttn is explicitly a post-training method, meaning it can be applied directly to pre-trained DiT models without any further training or fine-tuning. This is a crucial practical advantage given the immense computational cost of training large DiT models.
- Comprehensive Redundancy Identification: DiTFastAttn systematically identifies three distinct, yet complementary, types of attention redundancy: spatial, temporal (across timesteps), and conditional (across CFG evaluations). This multi-faceted approach allows for more granular and effective compression than methods that target a single type of redundancy.
- Novel Residual Sharing for Window Attention: While local (window) attention has been explored (e.g., Swin Transformer), directly applying it post-training to DiT can degrade performance due to the loss of long-range dependencies. DiTFastAttn's Window Attention with Residual Sharing (WA-RS) addresses this by caching and reusing the difference (residual) between the full and window attention outputs from previous steps. This training-free mechanism effectively preserves crucial long-range dependencies without incurring the quadratic cost of full attention at every step.
- Targeted Attention Sharing for DiT Inference: Building on general attention-sharing ideas (PSViT, DeepCache, TGATE), DiTFastAttn specifically applies attention sharing along the timestep dimension (AST) and the CFG dimension (ASC), based on empirical observations of high output similarity in these contexts within DiTs. This is distinct from sharing attention maps across layers or caching cross-attention outputs.
- Dynamic Compression Plan: Unlike static compression methods, DiTFastAttn employs a simple greedy algorithm to determine the compression strategy for each layer and timestep. This tailored approach adapts to the varying distribution of redundancies across the model and the denoising process, maximizing compression while maintaining quality across diverse models and resolutions.
- Complementarity with Existing Methods: The paper highlights that DiTFastAttn is orthogonal to other diffusion acceleration methods such as quantization, scheduler optimization, and distillation, meaning it can potentially be combined with them for even greater overall acceleration.
4. Methodology
4.1. Principles
The core principle behind DiTFastAttn is to achieve significant computational savings during the inference of Diffusion Transformer (DiT) models by identifying and exploiting inherent redundancies in their self-attention computations. This is achieved through a post-training compression approach, meaning it does not require retraining the large DiT models, which is a major advantage. The authors pinpoint three specific types of redundancies:
- Spatial Redundancy: Many attention heads primarily focus on local patterns, making full self-attention (which considers all tokens) inefficient for these layers.
- Temporal Redundancy: The attention outputs of the same layer can be highly similar across consecutive denoising steps, suggesting that recomputing them repeatedly is redundant.
- Conditional Redundancy: When using Classifier-Free Guidance (CFG), the conditional and unconditional attention outputs can be very similar, leading to redundant computations.

Based on these observations, DiTFastAttn proposes three corresponding techniques: Window Attention with Residual Sharing (WA-RS) for spatial redundancy, Attention Sharing across Timesteps (AST) for temporal redundancy, and Attention Sharing across CFG (ASC) for conditional redundancy. A greedy search method is then used to apply these techniques layer-by-layer and step-by-step to maximize compression while minimizing performance degradation.
The following figure (Figure 2 from the original paper) provides a visual overview of the identified redundancies and their corresponding compression techniques:
This figure is a schematic showing the three techniques used in DiTFastAttn to mitigate redundancy in attention computation. The horizontal axis represents the step dimension; the figure marks spatial redundancy, similarity between timesteps, and similarity between conditional and unconditional inference, each corresponding to a compression technique such as Window Attention or Attention Sharing.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Window Attention with Residual Sharing (WA-RS)
The first redundancy exploited is spatial redundancy. The authors observe that in many Transformer layers of pre-trained DiTs, attention values are highly concentrated within a localized window around the diagonal of the attention matrix. This means that tokens primarily attend to nearby tokens, and attention values for spatially distant tokens are often close to zero. This phenomenon is illustrated in Figure 3(a) (left panel).
To leverage this, DiTFastAttn proposes using window attention instead of full self-attention for selected layers. Window attention restricts computation to tokens within a fixed-size window, drastically reducing computational cost from quadratic to linear with respect to sequence length.
However, simply replacing full attention with window attention can lead to performance degradation because some tokens still rely on a small set of spatially distant tokens for their complete contextual understanding. Discarding these long-range dependencies negatively impacts model performance. A naive solution would be to use a very large window size, but this would negate most of the computational savings.
To address this, DiTFastAttn introduces Cache and Reuse the Residual for Window Attention. The key insight comes from an observation shown in Figure 3(a) (right panel): the residual between the outputs of full attention and window attention exhibits much smaller variation across timesteps compared to the direct window attention output. This suggests that the "missing" long-range information (the residual) changes slowly and can be effectively cached and reused.
The following figure (Figure 3 from the original paper) illustrates the Window Attention with Residual Sharing technique:
This figure is a schematic showing the computation of the sliding-window attention mechanism: the left panel shows attention maps at different timesteps, and the right panel compares full attention with window attention and illustrates how the residual is computed as their difference.
Figure 3(b) illustrates the WA-RS computation flow:
For a specific set of timesteps $\mathbf{K}$ that share a residual value, with initial step $r = \min(\mathbf{K})$, the computation proceeds as follows:

- Compute Full Attention Output: The standard full attention is computed for step $r$ using the Query ($\mathbf{Q}_r$), Key ($\mathbf{K}_r$), and Value ($\mathbf{V}_r$) matrices.
  $ \mathbf{O}_r = \operatorname{Attention}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r) $
  - $\mathbf{O}_r$: The output of the full self-attention mechanism at timestep $r$.
  - $\operatorname{Attention}$: The standard full self-attention function.
  - $\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r$: The Query, Key, and Value matrices at timestep $r$, derived from the input features of the Transformer layer at that timestep.
- Compute Window Attention Output: The window attention is computed for step $r$.
  $ \mathbf{W}_r = \operatorname{WindowAttention}(\mathbf{Q}_r, \mathbf{K}_r, \mathbf{V}_r) $
  - $\mathbf{W}_r$: The output of the window attention mechanism at timestep $r$. WindowAttention is a variant of Attention that only considers tokens within a predefined local window.
- Calculate and Cache the Residual: The residual is calculated as the difference between the full attention output and the window attention output at step $r$. This residual captures the long-range dependencies that window attention misses, and it is cached.
  $ \mathbf{R}_r = \mathbf{O}_r - \mathbf{W}_r $
  - $\mathbf{R}_r$: The residual computed at timestep $r$, representing the difference between the full and window attention outputs. It is cached for reuse.

For any subsequent timestep $k \in \mathbf{K}$ within the same sharing set, the computation is simplified:

- Compute Window Attention Output: Only the window attention is computed for step $k$.
  $ \mathbf{W}_k = \operatorname{WindowAttention}(\mathbf{Q}_k, \mathbf{K}_k, \mathbf{V}_k) $
  - $\mathbf{W}_k$: The output of the window attention mechanism at timestep $k$. Note that $\mathbf{Q}_k$, $\mathbf{K}_k$, $\mathbf{V}_k$ are recomputed for step $k$, as they depend on the current timestep's input.
- Add the Cached Residual: The output for step $k$ is obtained by adding the previously cached residual (from step $r$) to the current window attention output. This effectively reintroduces the long-range dependencies that window attention would otherwise omit, without recomputing full attention.
  $ \mathbf{O}_k = \mathbf{W}_k + \mathbf{R}_r $
  - $\mathbf{O}_k$: The final estimated attention output for timestep $k$ using WA-RS.
  - $\mathbf{K}$: The set of steps that share the residual value $\mathbf{R}_r$.

By doing this, WA-RS significantly reduces computation in subsequent steps by only calculating window attention (linear complexity) and adding a cached value, while still preserving the crucial long-range information that plain window attention would lose.
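A minimal sketch of the WA-RS caching logic described above, assuming `full_attention` and `window_attention` callables that return tensors of the same shape (names and structure are ours, not the paper's implementation):

```python
class WindowAttentionWithResidualSharing:
    """Cache O_r - W_r at the first step of a sharing set; reuse it afterwards."""

    def __init__(self, full_attention, window_attention):
        self.full_attention = full_attention
        self.window_attention = window_attention
        self.cached_residual = None

    def forward(self, q, k, v, is_first_step_of_set):
        w = self.window_attention(q, k, v)        # cheap, local-window attention
        if is_first_step_of_set:
            o = self.full_attention(q, k, v)      # full attention only once per set
            self.cached_residual = o - w          # long-range part, changes slowly
            return o
        # Later steps: window attention plus the cached long-range residual
        return w + self.cached_residual
```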
4.2.2. Attention Sharing across Timesteps (AST)
The second redundancy addressed is the temporal similarity of attention outputs. During the sequential denoising process of diffusion models, the features and, consequently, the attention outputs of the same attention head can be highly similar across neighboring timesteps. Figure 4(a) demonstrates this by showing high cosine similarity between attention outputs of adjacent steps for certain layers. The similarity is not uniform; it varies across both timesteps and Transformer layers.
The Attention Sharing across Timesteps (AST) technique leverages this observation. For a group of timesteps whose attention outputs are similar to each other, the model computes the attention output only at the earliest timestep in that group. This cached attention output is then reused for all subsequent timesteps within that group, skipping the attention computation for those steps entirely. This significantly accelerates the denoising process by reducing redundant calculations over time, as sketched below.
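A minimal sketch of AST-style reuse (illustrative only; in DiTFastAttn the grouping of steps comes from the compression plan, not from this class):

```python
class AttentionSharingAcrossTimesteps:
    """Skip attention for steps whose output is reused from an earlier step."""

    def __init__(self, attention_fn):
        self.attention_fn = attention_fn
        self.cached_output = None

    def forward(self, q, k, v, reuse_cached):
        if reuse_cached and self.cached_output is not None:
            return self.cached_output             # AST: no attention computation at all
        self.cached_output = self.attention_fn(q, k, v)
        return self.cached_output
```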
4.2.3. Attention Sharing across CFG (ASC)
The third redundancy is conditional redundancy, specifically within Classifier-Free Guidance (CFG). CFG is a standard technique for conditional generation, which typically doubles the computational cost by requiring two neural network inferences at each timestep: one with a conditional input (e.g., text prompt) and one without (unconditional input).
The authors observe that for many Transformer layers and timesteps, the attention outputs generated during the conditional neural network evaluation are highly similar to those generated during the unconditional neural network evaluation. Figure 4(b) illustrates this significant similarity (e.g., high SSIM values) between the attention outputs of the conditional and unconditional evaluations.
The Attention Sharing across CFG (ASC) technique exploits this. Instead of performing two separate attention computations, ASC reuses the attention output from the conditional neural network evaluation for the unconditional neural network evaluation. This allows the model to skip the attention computation for the unconditional path, thereby reducing the attention overhead during CFG by approximately 50%.
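A minimal sketch of ASC, assuming the conditional and unconditional samples are concatenated along the batch dimension (a common CFG implementation pattern; the layout and function names are our assumptions):

```python
import torch

def attention_with_asc(attention_fn, q, k, v):
    """Compute attention only for the conditional half of a CFG batch and
    reuse that output for the unconditional half."""
    half = q.shape[0] // 2                           # [0:half] conditional, [half:] unconditional
    out_cond = attention_fn(q[:half], k[:half], v[:half])
    return torch.cat([out_cond, out_cond], dim=0)    # unconditional path skipped entirely
```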
The following figure (Figure 4 from the original paper) shows the similarity of attention outputs across timesteps and CFG dimension:
This figure is a chart showing the similarity of attention outputs at different layers. The left panel shows the cosine similarity of the attention outputs of layers 5 and 25 across timesteps; the right panel is a heatmap of the attention-output similarity between conditional and unconditional generation.
4.2.4. Method for Deciding the Compression Plan
As indicated by the varying similarity patterns in Figures 3 and 4, the optimal compression strategy is not universal; different layers and timesteps exhibit different types and degrees of redundancy. Therefore, a method is needed to determine which compression technique (or combination) should be applied to each Transformer layer at each timestep.
DiTFastAttn employs a simple greedy method to decide this compression plan. The method iteratively determines the best strategy for each timestep and then for each Transformer layer within that timestep.
The compression strategy list used in the greedy search is $S = \{\text{AST}, \text{WA-RS+ASC}, \text{WA-RS}, \text{ASC}\}$. These strategies are ordered by their ascending potential compression ratio, meaning the most aggressive compression methods are tried first for a given layer and step.
The following algorithm (Algorithm 1 from the original paper) outlines the process:
Algorithm 1: Method for Deciding the Compression Plan

```
Input : Transformer model M, total steps T, compression strategy list S, threshold δ
Output: dictionary dict that stores the selected compression technique per layer and step

Initialize dict
for step t in T do
    O ← compute the output of the uncompressed M
    for transformer layer i in M do
        for m ∈ S ordered by ascending compression ratio do
            compress layer i in step t using compression strategy m
            O' ← compute the output of M
            if L(O, O') < (i / |M|) · δ then
                update m as the selected strategy of layer i and step t in dict
                break
return dict
```
Algorithm 1 Step-by-Step Explanation:
- Inputs:
  - Transformer Model $M$: The pre-trained DiT model.
  - Total Step $T$: The total number of denoising steps (e.g., 50) in the diffusion process.
  - Compression Strategy List $S$: A list of available compression strategies, ordered by ascending compression ratio (e.g., AST, WA-RS+ASC, WA-RS, ASC).
  - $\delta$: A global scalar threshold that controls the maximum allowable quality degradation. A smaller $\delta$ means less compression but higher fidelity.
- Initialization:
  - dict: An empty dictionary is initialized to store the chosen compression strategy for each (layer, step) pair.
- Outer Loop (Iterating through Timesteps):
  - The algorithm iterates through each timestep $t$ from 1 to $T$.
  - For each timestep, the output of the uncompressed model, $O$, is computed. This serves as the reference for comparison.
- Inner Loop (Iterating through Transformer Layers):
  - For each timestep, the algorithm then iterates through each Transformer layer $i$ in the model $M$.
  - Innermost Loop (Iterating through Strategies):
    - For the current layer and timestep, the algorithm iterates through the compression strategy list $S$ in order of ascending compression ratio, i.e., the most aggressive (highest-saving) strategies are tried first.
    - Apply Compression: The current layer at timestep $t$ is temporarily compressed using the current strategy $m$.
    - Compute Compressed Output: The output of the model with this compression applied, $O'$, is computed.
    - Check Loss Condition: The loss (quality degradation) between the uncompressed output $O$ and the compressed output $O'$ is calculated using the function L(O, O').
    - This loss is then compared against a dynamically adjusted threshold $\frac{i}{|M|}\delta$:
      - $|M|$: The total number of Transformer layers in the model.
      - $\frac{i}{|M|}$: A fractional term that makes the allowable loss increase linearly with the layer index. Shallower layers (smaller $i$) are thus more strictly constrained, while deeper layers (larger $i$) can tolerate slightly more loss. This is a common heuristic, as deeper layers often contribute less to fine-grained details and more to abstract features, or may be more robust to small perturbations.
    - Select Strategy: If the calculated loss L(O, O') is less than the threshold, strategy $m$ is deemed acceptable for layer $i$ at timestep $t$. It is recorded in the dict, and the algorithm breaks out of the innermost loop (there is no need to fall back to less aggressive strategies once the most aggressive acceptable one has been found).
- Return Compression Plan: After iterating through all timesteps and layers, the dict contains the selected compression strategy for each (layer, step) pair, forming the complete compression plan.
Computational Complexity of Search:
The paper notes that this greedy search is substantially more expensive than a single DiT inference, since it runs one model evaluation per candidate strategy for every layer and timestep; its cost scales with the number of compression strategies, the number of denoising steps, the number of Transformer layers, and (quadratically) the sequence length. For example, for a DiT-XL-2-512 model, the search takes approximately 224 seconds, which is considered a reasonable one-time overhead for obtaining the compression plan.
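A simplified Python sketch of the greedy search in Algorithm 1 (function names such as `run_model`, `apply_strategy`, and `reset_layer` are placeholders; error handling and output caching are omitted):

```python
def decide_compression_plan(model, num_steps, strategies, delta,
                            run_model, apply_strategy, reset_layer, loss_fn):
    """Greedy plan: per (step, layer), pick the most aggressive strategy
    whose output error stays under a layer-dependent threshold."""
    plan = {}
    num_layers = model.num_layers
    for t in range(num_steps):
        reference = run_model(model, t)                        # uncompressed output O
        for i in range(num_layers):
            # strategies are ordered so the highest-saving candidates come first
            for m in strategies:
                apply_strategy(model, layer=i, step=t, strategy=m)
                candidate = run_model(model, t)                # compressed output O'
                if loss_fn(reference, candidate) < (i + 1) / num_layers * delta:
                    plan[(i, t)] = m                           # accept; stop trying weaker ones
                    break
                reset_layer(model, layer=i, step=t)            # revert if too lossy
    return plan
```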
Loss Function L(O, O') (Mean Relative Absolute Error):
The loss function L(O, O') used to evaluate the quality degradation is the mean relative absolute error. As provided in Appendix A.2, it is calculated as follows:
$ L(O, O') = \frac{1}{|O|} \sum_i \mathrm{clip}\left( \frac{|O_i - O_i'|}{\max(|O_i|, |O_i'|) + \epsilon}, 0, 10 \right) $
Where:

- $|O|$: The number of elements in the raw (uncompressed) output $O$. This term normalizes the sum to give the mean.
- $\sum_i$: Summation over each element $i$ of the outputs $O$ and $O'$.
- $O_i$: The value of the $i$-th element of the raw (uncompressed) output $O$.
- $O_i'$: The value of the $i$-th element of the compressed output $O'$.
- $|O_i - O_i'|$: The absolute difference between the raw and compressed output elements, quantifying the error.
- $\max(|O_i|, |O_i'|) + \epsilon$: The normalization factor. It uses the maximum absolute value of the raw and compressed elements, so the error is measured relative to the magnitude of the values. A small positive constant $\epsilon$ is added to the denominator to prevent division by zero or numerical instability when both $O_i$ and $O_i'$ are very small.
- $\mathrm{clip}(\cdot, 0, 10)$: A function that clips the resulting relative error ratio to the range [0, 10]. This prevents extreme outliers (e.g., a very small true value paired with a large predicted value, yielding a huge ratio) from dominating the overall mean error.

This loss function provides a normalized measure of the average relative deviation, with values ranging from 0 (perfect match) to 10 (maximum allowed clipped error). The parameter $\delta$ then sets the overall tolerance for this error. The configurations D1 ($\delta = 0.025$), D2 ($\delta = 0.05$), ..., D6 ($\delta = 0.15$) represent increasing levels of loss tolerance, with higher $\delta$ allowing more aggressive compression.
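A direct NumPy transcription of the mean relative absolute error above (the epsilon default is a placeholder, since the exact constant is only given in the paper's appendix):

```python
import numpy as np

def mean_relative_absolute_error(o, o_prime, eps=1e-6):
    """L(O, O'): mean of element-wise relative errors, each clipped to [0, 10]."""
    o = np.asarray(o, dtype=np.float64)
    o_prime = np.asarray(o_prime, dtype=np.float64)
    rel_err = np.abs(o - o_prime) / (np.maximum(np.abs(o), np.abs(o_prime)) + eps)
    return float(np.mean(np.clip(rel_err, 0.0, 10.0)))
```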
5. Experimental Setup
5.1. Datasets
The experiments evaluate DiTFastAttn on various Diffusion Transformer models across different generation tasks.
- Image Generation:
- DiT models (DiT-XL-2-512): Evaluated using the ImageNet dataset.
- Source: ImageNet is a large-scale hierarchical image database.
- Scale: Contains millions of images categorized into thousands of classes.
- Characteristics: Diverse range of natural images, widely used for image classification and generation benchmarks.
- Domain: General object recognition and natural scenes.
- Why chosen: Standard benchmark for generative models, especially for class-conditional image generation, allowing comparison with previous DiT model evaluations.
- PixArt-Sigma models (PixArt-Sigma-1024, PixArt-Sigma-2K): Evaluated using the MS-COCO dataset.
- Source: Microsoft Common Objects in Context (MS-COCO).
- Scale: Contains over 330K images with complex everyday scenes.
- Characteristics: Focuses on object detection, segmentation, and captioning, with challenging scenes and many objects per image.
- Domain: Object recognition in context, everyday scenes.
- Text Prompts: For PixArt-Sigma models, MS-COCO 2014 captions are used as text prompts for text-to-image generation.
- Why chosen: Standard benchmark for text-to-image generation, providing diverse and complex scenes with associated captions, suitable for evaluating conditional generation.
- Video Generation:
  - Open-Sora (Open-Sora, 2024): Evaluated for video generation tasks.
    - Source/Characteristics: Open-Sora is an open-source project aiming to reproduce Sora's capabilities. The specific dataset used for evaluation (beyond the model itself) is not explicitly detailed in the Settings section for OpenSora, but typical video generation benchmarks involve datasets like UCF101, Kinetics, or specific internal video collections. The context suggests general video generation.
    - Why chosen: Represents the application of DiT to the increasingly demanding domain of video generation.

The number of generated samples for the quality metrics:

- DiT models: 50,000 images.
- PixArt-Sigma models: 30,000 images.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
5.2.1. FID (Fréchet Inception Distance)
- Conceptual Definition: Fréchet Inception Distance (FID) is a metric used to assess the quality of images generated by generative models, particularly GANs and diffusion models. It quantifies the similarity between the distribution of generated images and the distribution of real (ground truth) images. FID is calculated by embedding both real and generated images into a feature space (typically using an Inception-v3 network pretrained on ImageNet) and then computing the Fréchet distance between the two multivariate Gaussian distributions fitted to these embeddings. A lower FID score indicates that the generated images are more similar to the real images, implying higher quality and diversity.
- Mathematical Formula: $ \mathrm{FID} = ||\mu_1 - \mu_2||^2_2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}) $
- Symbol Explanation:
  - $\mu_1$: The mean feature vector of the real image distribution.
  - $\mu_2$: The mean feature vector of the generated image distribution.
  - $||\mu_1 - \mu_2||^2_2$: The squared L2 norm (Euclidean distance) between the mean vectors.
  - $\Sigma_1$: The covariance matrix of the real image distribution.
  - $\Sigma_2$: The covariance matrix of the generated image distribution.
  - $\mathrm{Tr}$: The trace of a matrix, i.e., the sum of the elements on the main diagonal.
  - $(\Sigma_1 \Sigma_2)^{1/2}$: The matrix square root of the product of the two covariance matrices.
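A minimal sketch of the FID computation from precomputed Inception feature statistics (assumes features have already been extracted; uses SciPy for the matrix square root):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_inception_distance(feats_real, feats_gen):
    """feats_*: (num_samples, feature_dim) arrays of Inception-v3 features."""
    mu1, mu2 = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    sigma1 = np.cov(feats_real, rowvar=False)
    sigma2 = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(sigma1 @ sigma2)
    if np.iscomplexobj(covmean):      # numerical noise can yield tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(sigma1 + sigma2 - 2 * covmean))
```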
5.2.2. IS (Inception Score)
- Conceptual Definition: Inception Score (IS) is another metric for evaluating the quality of generated images. It aims to measure two aspects: the clarity (quality) and the diversity of the generated images, using a pre-trained Inception-v3 network to classify them.
  - Clarity: Measured by how confident the Inception model is in its classification of a generated image (low entropy in $p(y|x)$). A good model should generate images that are clearly recognizable as belonging to a specific class.
  - Diversity: Measured by how varied the predicted class labels are across a batch of generated images (high entropy in $p(y)$). A good generative model should produce diverse outputs, not just multiple variations of a single class.

  A higher IS score indicates better quality and diversity.
- Mathematical Formula: $ \mathrm{IS} = \exp(E_x \left[ D_{KL}(p(y|x) || p(y)) \right]) $
- Symbol Explanation:
  - $x$: A generated image.
  - $y$: The class label predicted for an image by the Inception-v3 network.
  - $p(y|x)$: The conditional probability distribution over class labels given a generated image $x$. This represents the Inception model's confidence in classifying $x$.
  - $p(y)$: The marginal probability distribution over class labels across all generated images. This represents the diversity of the generated image distribution.
  - $D_{KL}(p(y|x) || p(y))$: The Kullback-Leibler (KL) divergence between $p(y|x)$ and $p(y)$, which measures how one probability distribution diverges from the other.
  - $E_x$: The expectation (average) taken over all generated images $x$.
  - $\exp$: The exponential function, used to scale the KL divergence result.
5.2.3. CLIP Score
- Conceptual Definition: CLIP Score (Hessel et al., 2021) is a reference-free metric used to evaluate the semantic consistency between generated images and their corresponding text prompts in text-to-image generation tasks. It leverages the CLIP (Contrastive Language-Image Pre-training) model, which learns a joint embedding space where semantically similar images and texts are close together. The CLIP Score is computed as the cosine similarity between the CLIP embedding of the generated image and the CLIP embedding of the input text prompt. A higher CLIP Score indicates better alignment between the generated image and the text description.
- Mathematical Formula: The CLIP Score is not typically defined by a single compact formula beyond the cosine similarity calculation, as it depends on the internal workings of the CLIP model's embedding process. It can be expressed as:
  $ \mathrm{CLIPScore} = \mathrm{cosine\_similarity}(\mathrm{CLIP}_{\text{image}}(I), \mathrm{CLIP}_{\text{text}}(T)) $
- Symbol Explanation:
  - $\mathrm{CLIP}_{\text{image}}(I)$: The CLIP embedding (feature vector) of the generated image $I$.
  - $\mathrm{CLIP}_{\text{text}}(T)$: The CLIP embedding (feature vector) of the input text prompt $T$.
  - $\mathrm{cosine\_similarity}(a, b)$: The cosine similarity between two vectors $a$ and $b$, measuring the cosine of the angle between them. It ranges from -1 (opposite) to 1 (identical).
5.2.4. Attn FLOPs (Attention Floating Point Operations)
- Conceptual Definition: Attn FLOPs refers specifically to the number of floating-point operations performed within the multi-head attention modules of the Transformer model. This metric directly quantifies the computational cost of the attention mechanism itself, which is the primary bottleneck addressed by DiTFastAttn.
- Mathematical Formula: No single universal formula, as it depends on the specific attention implementation. The cost is dominated by the two large matrix multiplications: $QK^T$ and the product of the softmax-normalized scores with $V$.
- Symbol Explanation: In the results, Attn FLOPs is presented as a fraction (percentage) relative to the original uncompressed model's attention FLOPs.
5.2.5. Latency
- Conceptual Definition: Latency measures the real-world time taken to perform an operation, such as generating a single image or video, typically measured in seconds (s). Lower latency means the process is faster. The paper reports both the overall end-to-end latency of generation and the attention latency of the multi-head attention module specifically.
- Mathematical Formula: Not a calculated formula, but a direct measurement of time.
- Symbol Explanation: Expressed in seconds (s).
5.3. Baselines
The proposed DiTFastAttn method is compared against the raw (original) pre-trained versions of the Diffusion Transformer models it targets. These serve as the direct baselines, as the method is a post-training compression technique.
The specific baseline models are:

- DiT-XL-2-512 (image generation at 512x512 resolution)
- PixArt-Sigma-XL-1024 (image generation at 1024x1024 resolution)
- PixArt-Sigma-XL-2K (image generation at 2048x2048 resolution)
- OpenSora V1.1 (video generation at 240p resolution with 16 frames)

These baselines are representative because they are the original, uncompressed models whose computational bottlenecks DiTFastAttn aims to alleviate. The comparison directly demonstrates the trade-off between computational efficiency and generative quality achieved by the proposed compression method.
5.4. Other Settings
- Sampling Method & Steps:
  - For DiT and PixArt-Sigma models (image generation), a 50-step DPM-Solver (Lu et al., 2022), a fast ODE solver for diffusion models, is used.
  - For Open-Sora (video generation), a 200-step IDDPM (Improved Denoising Diffusion Probabilistic Models; Nichol & Dhariwal, 2021) sampler is used.
- Loss Thresholds for Compression Plan:
  - The threshold $\delta$ in Algorithm 1 (Method for Deciding the Compression Plan) is varied at intervals of 0.025.
  - These settings are denoted D1 ($\delta = 0.025$), D2 ($\delta = 0.05$), D3 ($\delta = 0.075$), D4 ($\delta = 0.1$), D5 ($\delta = 0.125$), and D6 ($\delta = 0.15$). Higher numbers (larger $\delta$) imply a higher tolerance for loss, leading to more aggressive compression.
- WA-RS Window Size: The window size for Window Attention with Residual Sharing (WA-RS) is set to 1/8 of the total token count (sequence length).
- Implementation Details: DiTFastAttn is implemented on top of FlashAttention-2 (Dao, 2023), a highly optimized attention implementation.
- Hardware: All latency measurements are performed on a single Nvidia A100 GPU.
- Batch Size:
  - DiT models: batch size of 8.
  - PixArt-Sigma models: batch size of 1.
  - OpenSora: batch size not explicitly stated for the latency measurement, but standard for video generation.
- Loss Function Details: The $\epsilon$ in the mean relative absolute error calculation is set to a small fixed constant. Other metrics such as LPIPS and SSIM were considered for the compression plan search but were discarded due to computational cost or insensitivity.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Results of Evaluation Metrics and Attention FLOPs on Image Generation
The following are the results from Table 1 of the original paper. Columns 2-4 are for DiT-XL-2 512x512 (IS, FID, Attn FLOPs); columns 5-8 for PixArt-Sigma-XL 1024x1024 (IS, FID, CLIP, Attn FLOPs); columns 9-12 for PixArt-Sigma-XL 2048x2048 (IS, FID, CLIP, Attn FLOPs):

| Config | IS | FID | Attn FLOPs | IS | FID | CLIP | Attn FLOPs | IS | FID | CLIP | Attn FLOPs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Raw | 408.16 | 25.43 | 100% | 24.33 | 55.65 | 31.27 | 100% | 23.67 | 51.89 | 31.47 | 100% |
| D1 | 412.24 | 25.32 | 85% | 24.27 | 55.73 | 31.27 | 90% | 23.28 | 52.34 | 31.46 | 81% |
| D2 | 412.18 | 24.67 | 69% | 24.25 | 55.69 | 31.26 | 74% | 22.90 | 53.01 | 31.32 | 60% |
| D3 | 411.74 | 23.76 | 59% | 24.16 | 55.61 | 31.25 | 63% | 22.96 | 52.54 | 31.36 | 46% |
| D4 | 391.80 | 21.52 | 49% | 24.07 | 55.32 | 31.24 | 52% | 22.95 | 51.74 | 31.39 | 36% |
| D5 | 370.07 | 19.32 | 41% | 24.17 | 54.54 | 31.22 | 44% | 22.82 | 51.21 | 31.34 | 29% |
| D6 | 352.20 | 16.80 | 34% | 23.94 | 52.73 | 31.18 | 37% | 22.38 | 49.34 | 31.28 | 24% |
Analysis:

- DiT-XL-2 512x512: At lower compression levels (D1, D2, D3), DiTFastAttn achieves a significant Attn FLOPs reduction (down to 59% at D3) while maintaining or even slightly improving IS and FID compared to the Raw model. For instance, IS increases from 408.16 to 411.74 (D3), and FID decreases from 25.43 to 23.76 (D3). This suggests that DiTFastAttn can make the model more efficient without compromising quality; pruning redundant calculations may even yield minor improvements. As compression intensifies (D4-D6), Attn FLOPs reduce further (down to 34% at D6), but IS and FID start to degrade, though the model still produces acceptable images (see Figure 6).
- PixArt-Sigma-XL 1024x1024: Similar to DiT-XL-2, the D1 to D3 configurations keep IS, FID, and CLIP scores very close to the Raw model while achieving substantial Attn FLOPs reduction (down to 63% at D3). For D4-D6, Attn FLOPs reduce to 37% at D6, with a slight decrease in quality metrics that nonetheless remain high. The CLIP score shows remarkable resilience, staying very close to the Raw score even at D6.
- PixArt-Sigma-XL 2048x2048 (High Resolution): This model demonstrates the most pronounced benefits. At D6, Attn FLOPs are reduced to just 24% of the original (a 76% reduction), with minimal degradation in IS, FID, and CLIP compared to the Raw model. For instance, FID drops from 51.89 to 49.34, and CLIP from 31.47 to 31.28. This highlights a key finding: as image resolution increases, DiTFastAttn not only achieves greater compression but also better preserves the generative performance of the models. This is likely due to the quadratic complexity of self-attention, where the attention computation becomes a much larger fraction of the total FLOPs at higher resolutions, providing more scope for Attn FLOPs reduction.
6.1.2. Visualization of DiTFastAttn's Generation Results
The following figure (Figure 6 from the original paper) shows image generation samples:
This figure shows image generation samples produced at different compression ratios and resolutions, covering three resolutions (512x512, 1024x1024, and 2048x2048) and illustrating the detail and sharpness of the different samples.
Analysis: Figure 6 qualitatively supports the quantitative results.

- For the DiT-XL-2-512 and PixArt-Sigma-1024 models, the D1, D2, and D3 configurations produce images visually comparable to the originals, indicating minimal perceptual degradation despite significant FLOPs reduction.
- More aggressive configurations such as D4, D5, and D6 achieve higher compression but show slight variations in detail. The overall image quality nonetheless remains acceptable, demonstrating the robustness of DiTFastAttn.
- For PixArt-Sigma-2K, the image quality remains very close to the original up to D4, and even D5 and D6 (with up to 76% attention FLOPs reduction) generate high-quality outputs. This reinforces the finding that DiTFastAttn is particularly effective at higher resolutions, maintaining visual fidelity even with substantial compression.

The following figure (Figure 14 from the original paper) shows images generated by PixArt-Sigma-XL-1024 at different thresholds, with and without a negative prompt:

This figure shows images generated by PixArt-Sigma-XL-1024 at different thresholds, comparing results with and without a negative prompt; the upper part shows a vehicle and the lower part a blue room, with D2, D4, and D6 labeled.

Analysis: Figure 14 illustrates DiTFastAttn's compatibility with negative conditioning. Even with compression, the model effectively incorporates negative prompts (like "Low quality"), as seen in the clear differences between images generated with and without negative prompts across the D2, D4, and D6 configurations. This suggests that the compression does not interfere with the guidance mechanism crucial for refining generation quality.
6.1.3. Results on Video Generation
The following figure (Figure 7 from the original paper) shows video generation results from OpenSora:
This figure is a video-generation comparison showing 16-frame videos generated at 240p with OpenSora V1.1. The left shows a hot-air balloon scene, the middle a sea turtle in the ocean, and the right a city at night, illustrating generation quality across different scenes.
The following figure (Figure 10 from the original paper) provides a more detailed comparison of video generation:

Analysis:

- DiTFastAttn was applied to OpenSora V1.1 for video generation at 240p resolution with 16 frames.
- The Attn FLOPs reductions for the D1 through D6 configurations were 7.63%, 19.50%, 30.16%, 37.66%, and 40.52%, respectively (the text lists five values for six configurations, so one value appears to be missing, but the trend of increasing reduction is clear). This confirms DiTFastAttn's applicability and effectiveness in reducing computational costs for video models.
- Qualitatively (Figures 7 and 10), configurations D1 to D4 showed effective performance, balancing computational efficiency with retention of visual quality. The generated videos were smooth, with natural transitions and preserved details.
- Configurations D5 and D6 (more aggressive compression) resulted in noticeable deviations from the original video characteristics, but the videos remained smooth and coherent, representing the intended narrative with reasonable accuracy. This suggests that while aggressive compression can compromise detail, it can still be valuable in resource-constrained environments.
6.1.4. #FLOPs Reduction and Speedup
The following are the results from Table 2 of the original paper:
| Model | Seqlen | Metric | ASC | WA-RS | WA-RS+ASC | AST |
|---|---|---|---|---|---|---|
| DiT-XL-2 512x512 | 1024 | Attn FLOPs | 50% | 77% | 38% | 0% |
| DiT-XL-2 512x512 | 1024 | Attn Latency | 59% | 85% | 51% | 4% |
| PixArt-Sigma-XL 1024x1024 | 4096 | Attn FLOPs | 50% | 51% | 26% | 0% |
| PixArt-Sigma-XL 1024x1024 | 4096 | Attn Latency | 54% | 54% | 31% | 3% |
| PixArt-Sigma-XL 2048x2048 | 16384 | Attn FLOPs | 50% | 33% | 16% | 0% |
| PixArt-Sigma-XL 2048x2048 | 16384 | Attn Latency | 52% | 35% | 19% | 1% |
Analysis of Individual Techniques (Table 2; the values give each technique's attention FLOPs and latency relative to full attention):
- ASC (Attention Sharing across CFG): As expected, ASC consistently cuts Attn FLOPs to 50% across all models and resolutions, because it halves the CFG computation. The Attn Latency reduction is slightly less than 50% (the latency figures are 52-59% of full attention), indicating some fixed overheads not directly proportional to FLOPs.
- WA-RS (Window Attention with Residual Sharing): The Attn FLOPs reduction from WA-RS is highly dependent on sequence length. For DiT-XL-2 512x512 (Seqlen 1024), it reduces FLOPs only to 77%. For PixArt-Sigma-XL 1024x1024 (Seqlen 4096), it reaches 51%, and for PixArt-Sigma-XL 2048x2048 (Seqlen 16384) it achieves a substantial reduction to 33% (a 67% reduction). This trend confirms that WA-RS becomes much more effective at higher resolutions (longer sequence lengths), where the quadratic complexity of full attention is most dominant and window attention's linear complexity is most beneficial. The Attn Latency reduction follows the same trend, becoming more significant at higher resolutions (e.g., 35% at 2048x2048).
- WA-RS+ASC: This combination shows that WA-RS and ASC are orthogonal and can be applied simultaneously for compounding benefits. For PixArt-Sigma-XL 2048x2048, it reduces Attn FLOPs to a mere 16% and Attn Latency to 19%, representing 84% and 81% reductions, respectively.
- AST (Attention Sharing across Timesteps): The table reports essentially 0% Attn FLOPs and only 1-4% Attn Latency for AST, because when AST is applied the attention computation is skipped entirely and only a small memory-reuse overhead remains. Its end-to-end benefit therefore depends on how many layers and timesteps the search assigns it to; the overall speedup (Figure 8) and the ablation study (Figure 9) reflect this cumulative effect better than the per-application cost shown here. A minimal sketch of the caching mechanism follows this list.
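The sketch below illustrates the reuse pattern assumed above: a per-layer cache stores the most recent attention output, and at (layer, timestep) pairs that the plan marks as shareable, the layer returns the cached tensor instead of recomputing attention. Names such as `ASTCache` and `reuse` are illustrative, not taken from the paper's code.

```python
class ASTCache:
    """Sketch of Attention Sharing across Timesteps (AST)."""

    def __init__(self):
        self.cached = {}                # layer index -> attention output from the last computed step

    def __call__(self, attn_fn, x, layer: int, reuse: bool):
        if reuse and layer in self.cached:
            return self.cached[layer]   # skip the attention computation entirely
        out = attn_fn(x)                # compute normally and cache for later timesteps
        self.cached[layer] = out        # keeping these tensors is the extra VRAM cost noted in the limitations
        return out
```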
6.1.5. Overall Latency of DiTFastAttn
The following figure (Figure 8 from the original paper) shows the latency for image generation and attention:

Analysis:
- Figure 8 plots the end-to-end latency (blue line) and multi-head attention module latency (orange line) as Attn FLOPs decrease under DiTFastAttn compression. DiTFastAttn achieves end-to-end latency reduction across all models and compression settings.
- Resolution-dependent Performance: the benefits are more pronounced at higher resolutions.
  - For DiT-XL-2 512x512, at D6, end-to-end latency drops to 40% of raw (a 2.5x speedup) and attention latency to 31% of raw (~3.2x speedup).
  - For PixArt-Sigma-XL 1024x1024, at D6, end-to-end latency is 81% of raw and attention latency is 54% of raw.
  - For PixArt-Sigma-XL 2048x2048, at D6, end-to-end latency is 56% of raw (~1.8x speedup) and attention latency is 37% of raw (~2.7x speedup).
- The results clearly show that as resolution increases, DiTFastAttn becomes increasingly effective at reducing latency for both the overall generation process and the attention module specifically, confirming that it alleviates the quadratic-complexity bottleneck; the rough arithmetic below illustrates why.
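As a back-of-the-envelope check (not the paper's FLOPs accounting), the score and value matmuls of self-attention cost on the order of 4·N²·d FLOPs per layer, so quadrupling the sequence length multiplies attention cost by roughly 16x. The head count and head dimension below are illustrative placeholders.

```python
def attn_flops(seq_len: int, head_dim: int = 64, num_heads: int = 16) -> float:
    """Approximate FLOPs of one self-attention layer: Q @ K^T plus attn @ V,
    counting each multiply-add as 2 FLOPs."""
    d = head_dim * num_heads
    return 4.0 * seq_len ** 2 * d

# Sequence lengths taken from Table 2 above.
for n in (1024, 4096, 16384):
    print(f"seq_len={n:6d}  attn FLOPs per layer ~ {attn_flops(n):.2e}")
```

The 16x jump from 4,096 to 16,384 tokens is consistent with attention dominating total latency at 2048x2048 and with DiTFastAttn's savings growing with resolution.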
6.1.6. Compression Plan after Search
The following figure (Figure 5 from the original paper) shows the compression plan under the D6 setting:
This diagram shows the compression plans of DiT-XL-2, PixArt-Sigma-XL-1024, and PixArt-Sigma-XL-2K under the D6 configuration (threshold 0.15). For each layer and timestep it indicates which technique is applied: Full Attention (Full Attn), Window Attention with Residual Sharing (WA-RS), Attention Sharing across CFG (ASC), their combination (WA-RS+ASC), or Attention Sharing across Timesteps (AST).
Analysis: Figure 5 provides a heatmap visualization of the compression plan determined by the greedy search method under the D6 setting (highest compression tolerance) for DiT-XL-2-512, PixArt-Sigma-XL-1024, and PixArt-Sigma-XL-2K.
- Variability across Models: the distribution of compression techniques (Full Attn, WA-RS, ASC, WA-RS+ASC, AST) varies significantly across the three models. This confirms the paper's claim that a universal compression strategy is not optimal and that a tailored search is necessary.
- DiT-XL-2-512: shows AST and ASC primarily in early timesteps, with Full Attention more prevalent in the initial attention layers.
- PixArt-Sigma-XL-1024: employs AST sporadically in the first two layers and in middle attention layers during intermediate timesteps. The combination of WA-RS and ASC is notably predominant in the final timesteps.
- PixArt-Sigma-XL-2K: exhibits a unique pattern, suggesting a different redundancy distribution at very high resolutions.
- The heatmaps (and the additional ones in Appendix A.5, Figures 11, 12, and 13) visually validate the identified redundancies and the efficacy of the greedy search in adapting the compression plan to model-specific and resolution-specific characteristics.

The following figure (Figure 11 from the original paper) shows the compression plan for DiT-XL-2-512x512 at different thresholds:
This diagram shows the compression plans for the DiT-XL-2 512x512 model at different thresholds with the DPM-Solver steps set to 50, comparing Full Attention (Full Attn), Window Attention with Residual Sharing (WA-RS), Attention Sharing across CFG (ASC), WA-RS+ASC, and Attention Sharing across Timesteps (AST).
The following figure (Figure 12 from the original paper) shows the compression plan for PixArt-Sigma-XL-1024x1024 at different thresholds:
This figure shows the compression plans for PixArt-Sigma-XL-1024x1024 at different thresholds, illustrating how full attention and the reduced-attention strategies (WA-RS, ASC, WA-RS+ASC, AST) are distributed across timesteps and layers. The thresholds (0.025, 0.05, 0.075, 0.1, 0.125, 0.15) change how the techniques are assigned.
The following figure (Figure 13 from the original paper) shows the compression plan for PixArt-Sigma-XL-2K at different thresholds:
This diagram shows the compression plans for PixArt-Sigma-XL-2K at different thresholds (0.025, 0.05, 0.075, 0.1, 0.125, 0.15), depicting the attention pattern chosen for each timestep and layer and comparing the different strategies (Full Attn, WA-RS, ASC, and so on).
Analysis of Compression Plans (Figures 11, 12, 13): These figures, showing compression plans at various thresholds, further underscore the dynamic nature of the redundancy.
- As the threshold increases (from D1 to D6), more aggressive compression techniques (WA-RS, ASC, WA-RS+ASC, AST) are applied to a greater number of layers and timesteps. This is intuitive, as a higher threshold allows more error tolerance.
- The patterns of where AST, ASC, and WA-RS are applied (e.g., AST often in early/mid steps, WA-RS in later steps or in high-resolution models) suggest that different types of redundancy become dominant or tolerable at different stages of the denoising process and in different Transformer layers.
- For instance, WA-RS becomes more widely used for PixArt-Sigma-XL-2K (Figure 13), especially at higher settings, demonstrating its particular utility in high-resolution scenarios where spatial redundancy is a major factor.
6.2. Ablation Studies / Parameter Analysis
The following figure (Figure 9 from the original paper) shows the ablation study results:

Analysis (Figure 9 on DiT-XL-2-512):
6.2.1. DiTFastAttn Outperforms Single Methods (Left Panel)
- The left panel compares the quality (FID and IS) of DiTFastAttn against the individual compression techniques (ASC, WA-RS, AST) at the same Attention TFLOPs budget.
- DiTFastAttn consistently maintains better quality metrics (IS and FID) across the range of TFLOPs reductions than any single method, demonstrating the benefit of combining techniques and selecting them intelligently with the greedy search.
- Among the individual techniques, AST delivers the best generative quality for the initial reductions. However, beyond a certain level of compression (around 2.2 TFLOPs), pushing further with AST alone degrades performance significantly, causing the search algorithm to stop applying it. By combining AST with WA-RS and ASC, DiTFastAttn allows deeper compression while preserving better quality than any single method.
6.2.2. Higher Steps Improve DiTFastAttn's Performance (Middle Panel)
- The middle panel investigates the impact of the number of DPM-Solver steps (20, 30, 40, 50) on DiTFastAttn's performance.
- As the number of steps increases, DiTFastAttn can compress more computation (reach lower Attention TFLOPs) while maintaining comparable or even better quality (IS and FID).
- This suggests that with more denoising steps, the diffusion process has more opportunities to correct errors introduced by compression, and the redundancies (especially temporal) become more pronounced, allowing greater savings without noticeable quality loss.
6.2.3. The Residual Caching Technique is Essential (Right Panel)
- The right panel compares Window Attention with Residual Sharing (WA-RS) against plain Window Attention (WA), i.e., without residual sharing.
- WA-RS consistently maintains significantly better generative performance (IS and FID) than WA at the same compression ratio (the same Attention TFLOPs).
- This is a crucial finding: directly replacing full attention with window attention (WA) causes a substantial drop in performance (FID increases and IS decreases dramatically). The residual caching mechanism in WA-RS effectively mitigates the loss of long-range dependencies, confirming its essential role in enabling window attention for post-training DiT compression without severe quality degradation. A minimal sketch of the residual-sharing mechanism follows this list.
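To make the mechanism concrete, here is a hedged, framework-agnostic sketch of WA-RS as described above: at a step where full attention is still computed, the layer caches the difference between full and window attention (the long-range part), and at subsequent compressed steps it adds that cached residual to the cheap window-attention output. The function and cache names are illustrative, not the authors' implementation.

```python
def wa_rs_step(full_attn, window_attn, x, cache: dict, use_full: bool):
    """Sketch of Window Attention with Residual Sharing (WA-RS).

    full_attn / window_attn: callables for the layer's global and windowed attention.
    cache: holds the long-range residual from the most recent full-attention step.
    """
    if use_full:
        out_full = full_attn(x)
        cache["residual"] = out_full - window_attn(x)   # long-range contribution to reuse later
        return out_full
    # Compressed step: cheap local attention plus the cached long-range residual.
    return window_attn(x) + cache["residual"]
```

Dropping the cached residual recovers plain WA, the variant that degrades sharply in the right panel.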
6.3. Data Presentation (Tables)
The following are the results from Table 3 of the original paper:
| Model | Resolution | Config | Latency (s) | Attn Latency (s) | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 6.66 | 2.26 | 25.43 | 408.16 |
| | | D1 | 6.61 | 2.22 | 25.32 | 412.24 |
| | | D2 | 6.45 | 2.05 | 24.67 | 412.18 |
| | | D3 | 2.89 | 0.91 | 23.76 | 411.74 |
| | | D4 | 2.78 | 0.83 | 21.52 | 391.80 |
| | | D5 | 2.77 | 0.80 | 19.32 | 370.07 |
| | | D6 | 2.66 | 0.71 | 16.80 | 352.20 |
The following are the results from Table 4 of the original paper:
| Model | Config | Latency (s) | Attn Latency (s) | FID | IS | CLIP |
| --- | --- | --- | --- | --- | --- | --- |
| PixArt-Sigma-XL 1024x1024 | Raw | 12.76 | 5.30 | 24.33 | 55.65 | 31.27 |
| | D1 | 12.55 | 5.10 | 24.27 | 55.73 | 31.27 |
| | D2 | 11.98 | 4.49 | 24.25 | 55.69 | 31.26 |
| | D3 | 11.42 | 4.01 | 24.16 | 55.61 | 31.25 |
| | D4 | 11.06 | 3.60 | 24.07 | 55.32 | 31.24 |
| | D5 | 10.73 | 3.25 | 24.17 | 54.54 | 31.22 |
| | D6 | 10.31 | 2.85 | 23.94 | 52.74 | 31.18 |
| PixArt-Sigma-XL 2048x2048 | Raw | 39.86 | 27.57 | 23.67 | 51.89 | 31.47 |
| | D1 | 35.75 | 23.62 | 23.28 | 52.34 | 31.46 |
| | D2 | 31.44 | 19.29 | 22.90 | 53.01 | 31.32 |
| | D3 | 28.99 | 16.51 | 22.96 | 52.54 | 31.36 |
| | D4 | 26.18 | 13.88 | 22.95 | 51.74 | 31.39 |
| | D5 | 23.86 | 11.66 | 22.82 | 51.22 | 31.34 |
| | D6 | 22.27 | 10.13 | 22.38 | 49.34 | 31.28 |
The following are the results from Table 5 of the original paper:
| Model | Resolution | Config | Latency (s) | Attn Latency (s) | FID | IS |
| --- | --- | --- | --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 32.62 | 11.40 | 3.16 | 219.97 |
| | | D1 | 31.53 | 10.21 | 3.09 | 218.20 |
| | | D2 | 29.35 | 8.09 | 3.10 | 210.36 |
| | | D3 | 27.80 | 6.56 | 3.54 | 196.05 |
| | | D4 | 26.96 | 5.77 | 4.52 | 180.34 |
Analysis of Latency Values (Tables 3, 4, 5):
These tables provide detailed latency measurements for different models, resolutions, and compression configurations, complementing the FLOPs analysis.
- Overall Latency Reduction: across all models and resolutions, DiTFastAttn consistently reduces both total generation latency and attention latency. For PixArt-Sigma-XL 2048x2048, total latency falls from 39.86s (Raw) to 22.27s (D6) and attention latency from 27.57s to 10.13s, a significant practical speedup for high-resolution generation.
- Attention as Bottleneck: attention latency makes up a large portion of the total latency, especially for high-resolution models (e.g., 27.57s out of 39.86s for PixArt-Sigma-XL 2048x2048 Raw). This confirms the paper's premise that attention is the primary computational bottleneck, so compressing it yields substantial end-to-end speedups.
- Impact of Sampling Steps (Table 5): with the 250-step IDDPM sampler for DiT-XL-2-512x512, the raw latency is much higher (32.62s total, 11.40s Attn Latency) than with the 50-step DPM-Solver (6.66s total, 2.26s Attn Latency in Table 3). Longer denoising processes benefit even more from DiTFastAttn, since the savings accumulate over more steps: at D4, total latency drops to 26.96s (from 32.62s) and Attn Latency to 5.77s (from 11.40s). The implied speedups are worked out in the short snippet below.
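For convenience, the speedups implied by these tables can be computed directly from the reported latencies (values copied from Tables 3-5 above; this is just arithmetic, not additional measurement):

```python
# (case) -> (raw total, compressed total, raw attention, compressed attention), in seconds
cases = {
    "DiT-XL-2 512, 50-step DPM-Solver, D6": (6.66, 2.66, 2.26, 0.71),
    "PixArt-Sigma-XL 2048x2048, D6":        (39.86, 22.27, 27.57, 10.13),
    "DiT-XL-2 512, 250-step IDDPM, D4":     (32.62, 26.96, 11.40, 5.77),
}
for name, (t_raw, t_cmp, a_raw, a_cmp) in cases.items():
    print(f"{name}: end-to-end {t_raw / t_cmp:.2f}x, attention {a_raw / a_cmp:.2f}x")
```

This reproduces the roughly 2.5x and 1.8x end-to-end speedups and ~3.2x and ~2.7x attention speedups quoted earlier, plus a milder ~1.2x end-to-end gain for the 250-step IDDPM run at D4.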
6.3.1. Compression Plan Search Time
The following are the results from Table 6 of the original paper:
| Model | Resolution | Config | Plan Search Time |
| --- | --- | --- | --- |
| DiT-XL-2 | 512x512 | Raw | 04m39s |
| | | D2 | 04m08s |
| | | D4 | 03m49s |
| | | D6 | 03m14s |
| PixArt-Sigma-XL | 1024x1024 | Raw | 22m02s |
| | | D2 | 20m12s |
| | | D4 | 17m50s |
| | | D6 | 15m49s |
| PixArt-Sigma-XL | 2048x2048 | Raw | 1h50m13s |
| | | D2 | 1h46m04s |
| | | D4 | 1h22m53s |
| | | D6 | 1h23m01s |
Analysis:
- Table 6 shows the time taken to generate the compression plan with the greedy search method.
- The plan search time increases significantly with resolution, from a few minutes for DiT-XL-2 512x512 to well over an hour for PixArt-Sigma-XL 2048x2048. This is expected, since evaluating candidate strategies becomes more expensive at longer sequence lengths.
- Interestingly, for a given model, the plan search time decreases as the config (i.e., the threshold) moves from Raw toward D6. A higher threshold means a higher tolerance for approximation loss, so the greedy search, which tries strategies in ascending order of compression ratio, finds an acceptable strategy earlier (possibly a less aggressive one) and breaks out of the innermost loop sooner, reducing the overall search time. A minimal sketch of this loop follows the list.
- While the search time can be substantial for high-resolution models, it is a one-time cost to generate the plan, after which inference can be greatly accelerated.
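The sketch below shows the general shape of such a threshold-based greedy loop; the candidate ordering, the error measure, and how calibration outputs are obtained are all simplifications, not the paper's exact procedure.

```python
def greedy_plan_search(timesteps, layers, ordered_candidates, error_fn, threshold):
    """Hedged sketch of a greedy compression-plan search.

    ordered_candidates: compression strategies, assumed ordered so that the
        first acceptable one is the one the plan should keep.
    error_fn(t, layer, strategy): output error of applying `strategy` at
        (t, layer) on calibration prompts (details abstracted away).
    threshold: the tolerance; a larger value lets the inner loop break earlier.
    """
    plan = {}
    for t in timesteps:
        for layer in layers:
            plan[(t, layer)] = "full_attention"          # default: no compression
            for strategy in ordered_candidates:
                if error_fn(t, layer, strategy) <= threshold:
                    plan[(t, layer)] = strategy
                    break                                # accept and move on, as noted above
    return plan
```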
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces DiTFastAttn, a novel post-training compression method designed to accelerate Diffusion Transformer (DiT) models during inference by addressing the computational burden of their self-attention mechanisms. The core contribution lies in identifying and exploiting three types of redundancies: spatial redundancy (local focus of attention heads), temporal redundancy (similarity of attention outputs across neighboring steps), and conditional redundancy (similarity between conditional and unconditional inference outputs in CFG).
To tackle these, DiTFastAttn proposes three innovative techniques:
- Window Attention with Residual Sharing (WA-RS): employs window attention for spatial efficiency while preserving long-range dependencies through cached residuals.
- Attention Sharing across Timesteps (AST): skips redundant computation by reusing attention outputs across similar denoising steps.
- Attention Sharing across CFG (ASC): reduces CFG overhead by sharing attention outputs between conditional and unconditional evaluations.

A simple greedy search algorithm selects the combination of these techniques for each layer and timestep, balancing compression against generative quality. Experimental results on DiT-XL, PixArt-Sigma (image generation), and OpenSora (video generation) consistently demonstrate the effectiveness of DiTFastAttn. The method achieves significant attention FLOPs reduction (up to 76%) and considerable end-to-end speedup (up to 1.8x), particularly for high-resolution generation, while largely preserving generation quality. The ablation studies confirm the critical role of residual caching in WA-RS and the benefits of combining multiple compression strategies.
7.2. Limitations & Future Work
The authors acknowledge several limitations of DiTFastAttn:
- Post-Training Nature: as a post-training compression technique, DiTFastAttn cannot leverage the training process to avoid some of the performance drops that occur under aggressive compression. If DiTFastAttn were integrated during training, it might achieve even better compression-quality trade-offs.
- VRAM Increase for AST: when Attention Sharing across Timesteps (AST) is applied, the attention hidden states from previous timesteps must be kept in memory for reuse. This caching can increase VRAM usage, which may be a concern in memory-constrained environments and partially offsets the FLOPs benefits.
- Suboptimal Compression Plan: the greedy method used to decide the compression plan is described as "simple." While effective, it does not guarantee finding the globally optimal plan across all layers and timesteps, leaving room for further efficiency improvements.
- Limited Scope to Attention Module: the method focuses solely on reducing the computational cost of the attention module. Other computationally intensive components of Diffusion Transformers, such as feed-forward networks or convolutional layers (if present), are not addressed.

Based on these limitations, potential future research directions include:

- Exploring joint training/compression methods that integrate DiTFastAttn's principles into the model training pipeline.
- Developing VRAM-optimized variants of AST that manage cached states more efficiently, perhaps by selectively storing only critical information or recomputing less critical elements on the fly.
- Investigating more advanced optimization algorithms (e.g., reinforcement learning, neural architecture search) for finding the optimal compression plan, potentially yielding better compression-quality trade-offs.
- Extending the compression techniques to other modules within Diffusion Transformers or to other large generative models for broader end-to-end acceleration.
7.3. Personal Insights & Critique
DiTFastAttn presents a highly practical and well-motivated approach to accelerating Diffusion Transformers.
Strengths:
- Practicality of Post-Training: the focus on post-training compression is a significant strength. Retraining large DiT models is prohibitively expensive for most researchers and practitioners; a method that works out of the box on pre-trained models democratizes access to efficient DiT inference.
- Systematic Redundancy Identification: the clear identification of spatial, temporal, and conditional redundancies is elegant and insightful. This structured approach allows targeted compression strategies that address specific bottlenecks.
- Ingenious Residual Sharing: Window Attention with Residual Sharing (WA-RS) is particularly clever. Window attention is efficient locally, but the loss of long-range dependencies is its Achilles' heel. Reusing the residual from full attention effectively "patches" this gap without incurring the quadratic cost, demonstrating a deep understanding of Transformer behavior, and it mitigates a known problem of local attention without retraining.
- Resolution-Dependent Performance: the finding that DiTFastAttn yields greater benefits at higher resolutions is crucial. This is precisely where the quadratic complexity of attention becomes most problematic, making the method highly relevant for cutting-edge high-resolution image and video generation.
Critique & Potential Issues:
- Cost of Greedy Search: while a one-time cost, the search time for the compression plan can be substantial, especially for very large models and high resolutions (e.g., almost 2 hours for PixArt-Sigma-XL 2048x2048). This may be a barrier for users who need to adapt DiTFastAttn quickly to many configurations or models. More efficient search algorithms, perhaps ones that transfer compression plans across similar models or resolutions, could help.
- VRAM Impact of AST: the acknowledged VRAM increase for AST is a practical concern. In many high-resolution generative tasks, VRAM is already a limiting factor; the FLOPs savings may be partially offset if the extra VRAM usage forces smaller batch sizes or limits deployment on less powerful hardware. A more detailed analysis of the trade-off (e.g., VRAM usage vs. speedup) would strengthen the case for AST's overall utility.
- Loss Function Robustness: the mean relative absolute error loss L(O, O') uses |O| + ε in the denominator. While ε guards against division by zero, if both O and O' are very small (close to zero), small absolute differences can still produce large relative errors. Depending on the range of the attention outputs, this could make the loss sensitive to noise in very low-magnitude feature dimensions and guide the search suboptimally in such cases (a plausible form of this loss is written out after this list).
- Generality beyond Diffusion: although designed for Diffusion Transformers, the principles of spatial, temporal, and conditional redundancy (especially spatial and temporal) may apply to other Transformer-based generative or discriminative models that process sequential data or operate iteratively. Investigating such cross-domain applicability would be a valuable extension.
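For reference, one plausible element-wise form of the loss described above (my reading of the text, not an equation quoted from the paper) is:

$$
L(O, O') = \operatorname{mean}\!\left(\frac{\lvert O - O' \rvert}{\lvert O \rvert + \epsilon}\right)
$$

Under this form, entries where |O| is near zero are divided by roughly ε, so their relative error can grow to about |O − O'| / ε, which is exactly the sensitivity flagged above.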
Inspirations:
The paper offers valuable insights into how to systematically analyze and optimize large models post-training. The idea of identifying specific types of redundancies based on common model behaviors (local attention, iterative refinement, guidance mechanisms) and then designing targeted, lightweight techniques to exploit them is highly transferable. This approach could be applied to other computationally intensive blocks beyond attention, and in other iterative deep learning processes where intermediate states might exhibit high similarity. The success of the simple greedy search also suggests that sophisticated, high-overhead NAS (Neural Architecture Search) might not always be necessary; sometimes, well-motivated heuristics and a good loss function are sufficient for significant gains.