Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity
TL;DR Summary
Sparse VideoGen (SVG) enhances video generation efficiency by leveraging the inherent sparsity of 3D attention, classifying attention heads into spatial and temporal types. It achieves up to 2.33x acceleration while maintaining generation quality, with open-source code available.
Abstract
Diffusion Transformers (DiTs) dominate video generation but their high computational cost severely limits real-world applicability, usually requiring tens of minutes to generate a few seconds of video even on high-performance GPUs. This inefficiency primarily arises from the quadratic computational complexity of 3D Full Attention with respect to the context length. In this paper, we propose a training-free framework termed Sparse VideoGen (SVG) that leverages the inherent sparsity in 3D Full Attention to boost inference efficiency. We reveal that the attention heads can be dynamically classified into two groups depending on distinct sparse patterns: (1) Spatial Head, where only spatially-related tokens within each frame dominate the attention output, and (2) Temporal Head, where only temporally-related tokens across different frames dominate. Based on this insight, SVG proposes an online profiling strategy to capture the dynamic sparse patterns and predicts the type of attention head. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves up to 2.28x and 2.33x end-to-end speedup on CogVideoX-v1.5 and HunyuanVideo, respectively, while preserving generation quality. Our code is open-sourced and is available at https://github.com/svg-project/Sparse-VideoGen
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Sparse VideoGen: Accelerating Video Diffusion Transformers with Spatial-Temporal Sparsity." It focuses on improving the inference efficiency of video generation models by leveraging specific sparsity patterns inherent in their attention mechanisms.
1.2. Authors
The authors of the paper are: Haocheng Xi, Shuo Yang, Yilong Zhao, Chenfeng Xu, Muyang Li, Xiuyu Li, Yujun Lin, Han Cai, Jintao Zhang, Dacheng Li, Jianfei Chen, Ion Stoica, Kurt Keutzer, and Song Han.
Affiliations:
- University of California, Berkeley
- Massachusetts Institute of Technology (MIT)
- MIT-IBM Watson AI Lab
- ByteDance
The authors represent prominent research institutions and a major tech company, indicating a strong background in machine learning, deep learning efficiency, and possibly hardware acceleration.
1.3. Journal/Conference
The paper is published on arXiv (a preprint server), indicated by the Original Source Link and PDF Link pointing to arxiv.org. While arXiv is not a peer-reviewed journal or conference in itself, it is a widely recognized platform for disseminating early research findings in machine learning and other scientific fields. Papers published on arXiv often undergo peer review for later publication in prestigious conferences (e.g., CVPR, ICCV, NeurIPS, ICLR) or journals. The publication date suggests it is a very recent work.
1.4. Publication Year
The paper was published on February 3, 2025.
1.5. Abstract
The paper addresses the significant computational cost of Diffusion Transformers (DiTs) in video generation, which currently requires extensive time even on high-performance GPUs due to the quadratic complexity of 3D Full Attention. To mitigate this, the authors propose Sparse VideoGen (SVG), a training-free framework that exploits inherent sparsity in 3D Full Attention. They identify two dynamic attention head types: Spatial Heads, which focus on spatially-related tokens within frames, and Temporal Heads, which concentrate on temporally-related tokens across frames. SVG employs an online profiling strategy to dynamically classify these heads and predict their sparse patterns. Combined with a novel hardware-efficient tensor layout transformation and customized kernel implementations, SVG achieves substantial end-to-end speedups (up to 2.28x on CogVideoX-v1.5 and 2.33x on HunyuanVideo) while preserving generation quality. The code for SVG is open-sourced.
1.6. Original Source Link
https://arxiv.org/abs/2502.01776
1.7. PDF Link
https://arxiv.org/pdf/2502.01776v2.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem Sparse VideoGen aims to solve is the high computational cost and slow inference speed of Diffusion Transformers (DiTs) for video generation. While DiTs have achieved state-of-the-art results in generating high-fidelity and temporally consistent videos (e.g., Sora, Kling, Wan 2.1, CogVideo, HunyuanVideo), their practical real-world applicability is severely limited by their computational demands. Generating just a few seconds of video can take tens of minutes to an hour on powerful GPUs.
This problem is particularly critical because the 3D Full Attention mechanism, a cornerstone of DiTs for spatiotemporal modeling, exhibits quadratic computational complexity with respect to the context length (the total number of tokens processed). As video resolution and frame count increase, the context length grows, making attention an increasingly dominant bottleneck. For instance, in HunyuanVideo, attention can consume over 80% of the total runtime for a 5-second video.
Prior research on sparse attention has shown promise in reducing computation, especially in Large Language Models (LLMs), by identifying and only computing attention over "important" tokens. However, the existing methods developed for text data cannot be directly applied to video DiTs because video data possesses fundamentally different sparsity patterns.
The paper's innovative idea is to leverage the inherent spatial-temporal sparsity observed in 3D Full Attention within video DiTs. The authors hypothesize that attention heads can be dynamically categorized into Spatial Heads (focusing on intra-frame relationships) and Temporal Heads (focusing on inter-frame relationships). By exploiting these specific, video-centric sparse patterns, they aim to dramatically reduce redundant computations without sacrificing video generation quality.
2.2. Main Contributions / Findings
The primary contributions and key findings of the Sparse VideoGen (SVG) paper are:
- In-depth Analysis of Video DiTs' Sparse Patterns: The paper provides a novel analysis revealing two distinct and inherent sparse attention patterns in video Diffusion Transformers: the Spatial Head and the Temporal Head. These patterns are crucial for maintaining spatial and temporal consistency in generated videos, respectively. This insight forms the algorithmic foundation for SVG.
- Development of a Training-Free Sparse Attention Framework (SVG): The authors propose SVG, a comprehensive framework that comprises:
  - An Efficient Online Profiling Strategy: This strategy dynamically identifies the optimal sparse pattern (spatial or temporal) for each attention head during inference, with minimal overhead (around 3%). It achieves this by sampling a small subset of tokens and comparing the Mean Squared Error (MSE) of sparse attention outputs against full attention.
  - An Efficient Inference System: This system includes a novel hardware-efficient tensor layout transformation that reorders non-contiguous temporal sparsity patterns into a compact, hardware-friendly format, enabling better utilization of Tensor Cores on GPUs. It also integrates customized CUDA and Triton kernels for operations like QK-norm, RoPE, and block sparse attention using FlashInfer.
- Significant End-to-End Speedup with Quality Preservation: SVG demonstrates prominent efficiency improvements on state-of-the-art open-source video generative models:
  - Up to 2.28x end-to-end speedup on CogVideoX-v1.5.
  - Up to 2.33x end-to-end speedup on HunyuanVideo (1.92x without FP8 quantization, 2.33x with FP8).
  - Up to 1.51x end-to-end speedup on Wan 2.1.
  - Crucially, these speedups are achieved while preserving high generation quality, maintaining a PSNR above 29 and outperforming prior methods that often suffer from significant quality degradation.
- Compatibility with Quantization: SVG is shown to be compatible with FP8 quantization, enabling additional efficiency gains (up to a 1.3x throughput boost) with only a minimal accuracy drop.

These findings address the critical bottleneck of computational cost in video DiTs, paving the way for more practical and widespread applications of video generative models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Sparse VideoGen, a foundational grasp of Diffusion Models, Transformers, and the Attention mechanism, especially in the context of video processing, is essential.
3.1.1. Diffusion Models
Diffusion Models are a class of generative models that learn to create data (like images or videos) by reversing a gradual noise diffusion process.
- Core Idea: They work by iteratively denoising a noisy input towards a clean data sample. During training, a forward diffusion process gradually adds Gaussian noise to data until it becomes pure noise. The model then learns to reverse this process, predicting the noise added at each step to reconstruct the original data.
- Generative Process: To generate a new sample, the model starts with random noise and iteratively applies its learned denoising steps until a clean image or video is produced.
- Denoising Steps: The quality of generation often correlates with the number of denoising steps. More steps generally lead to better quality but require more computation.
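To make the reverse process concrete, here is a minimal DDPM-style sampling loop in PyTorch; it is only a sketch, where `model` and `betas` are placeholders for a trained noise-prediction network and a noise schedule, and the variance term is kept in its simplest form.

```python
import torch

def generate(model, shape, betas):
    # Minimal DDPM-style reverse process: start from pure noise and
    # iteratively remove the noise predicted by the model at each step.
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                # start from Gaussian noise
    for t in reversed(range(len(betas))):
        eps = model(x, t)                                  # predicted noise at step t
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()            # remove predicted noise
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)  # add sampling noise
    return x
```

Each loop iteration is one denoising step, which is why the number of steps directly trades off quality against runtime.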
3.1.2. Transformers
Transformers are neural network architectures that have revolutionized natural language processing and, more recently, computer vision and generation tasks.
- Core Component: The central innovation of transformers is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence (or tokens) when processing each part.
- Sequence Processing: Transformers process entire sequences simultaneously, rather than sequentially like Recurrent Neural Networks (RNNs), making them highly parallelizable and efficient for long sequences.
3.1.3. Attention Mechanism (Specifically 3D Full Attention)
The Attention mechanism is a core component of Transformers. It allows a model to focus on different parts of an input when making predictions, effectively assigning different "importance scores" to various elements.
- Standard Self-Attention: For a sequence of tokens, self-attention calculates how much each token should attend to every other token in the sequence. It involves three learned matrices:
  - Query (Q): Represents the current token being processed.
  - Key (K): Represents all other tokens that the current token might attend to.
  - Value (V): Represents the actual information content of all other tokens.

  The fundamental calculation for Self-Attention (as introduced in "Attention Is All You Need" by Vaswani et al., 2017) is:
  $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
  Where:
  - $Q$: Query matrix of shape $(S, d_k)$, where $S$ is the sequence length and $d_k$ is the dimension of keys and queries.
  - $K$: Key matrix of shape $(S, d_k)$.
  - $V$: Value matrix of shape $(S, d_v)$, where $d_v$ is the dimension of values.
  - $QK^T$: Dot product between queries and keys, resulting in attention scores of shape $(S, S)$.
  - $\sqrt{d_k}$: Scaling factor that prevents large dot-product values from pushing the softmax function into regions with tiny gradients.
  - $\mathrm{softmax}$: Normalizes the attention scores to probabilities.
  - Multiplying the normalized scores by $V$ gives the weighted sum of values, i.e., the attention output.
- 3D Full Attention for Video: In video DiTs, 2D attention (used for images) is extended to 3D Full Attention to handle the additional temporal dimension.
  - Tokens: A video is typically broken down into frames, and each frame is tokenized (e.g., into image patches which are then flattened into tokens). For a video with $N$ frames and $L$ tokens per frame, the total context length is $S = N \cdot L$.
  - Full Attention: 3D Full Attention means that each token in the entire video sequence (across all frames) can potentially attend to every other token. This allows the model to capture both spatial relationships within a frame and temporal relationships across frames.
  - Quadratic Complexity: Computing $QK^T$ multiplies matrices of size $(S, d_k)$ and $(d_k, S)$, producing an intermediate matrix of size $(S, S)$, so the computational cost scales quadratically with the total context length, i.e., $O(S^2)$. For videos, $S$ can be very large (tens of thousands of tokens or more), making 3D Full Attention computationally expensive (see the sketch below).
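To make the quadratic cost concrete, here is a minimal, unoptimized PyTorch sketch of scaled dot-product attention that materializes the full $(S, S)$ score matrix; the toy sizes are placeholders, far smaller than a real video DiT context.

```python
import torch

def full_attention(Q, K, V):
    # Q, K: (S, d_k), V: (S, d_v); S is the total context length.
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-1, -2) / d_k ** 0.5  # (S, S): quadratic in S
    return torch.softmax(scores, dim=-1) @ V       # (S, d_v)

# Toy sizes; a 720p multi-second video can reach tens of thousands of tokens,
# at which point the (S, S) score matrix dominates both memory and compute.
S, d = 4096, 64
Q, K, V = (torch.randn(S, d) for _ in range(3))
print(full_attention(Q, K, V).shape)  # torch.Size([4096, 64])
```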
3.1.4. Diffusion Transformers (DiTs)
DiTs combine the generative power of Diffusion Models with the architectural strengths of Transformers. Instead of using a U-Net architecture (common in traditional diffusion models) as the noise prediction backbone, DiTs use a Transformer.
- How it works: The Transformer takes noisy latent representations of images or videos as input, along with timestep embeddings, and predicts the noise that was added. By using self-attention, DiTs can effectively model long-range dependencies in the latent space, leading to higher quality and more scalable generative models. For video, 3D Full Attention is employed within the DiT architecture.
3.1.5. Sparsity in Attention
Sparsity in Attention refers to the observation that in many self-attention computations, not all token-to-token interactions are equally important. Often, only a small subset of Query-Key pairs contribute significantly to the final attention output.
- Goal: The goal of sparse attention methods is to identify these important (or "heavy-hitter") interactions and only perform computations for them, thereby reducing complexity without much loss in performance.
- Challenges: Identifying the relevant sparsity patterns dynamically and implementing them efficiently on hardware accelerators are the key challenges.
3.1.6. Evaluation Metrics
- Peak Signal-to-Noise Ratio (PSNR):
  - Conceptual Definition: PSNR is a quality metric used to quantify the difference between two images or videos. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR values indicate better quality and less distortion (i.e., the generated video is closer to the ground truth). A small numerical check of this formula appears after this list.
  - Mathematical Formula:
    $ MSE = \frac{1}{MN} \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
    $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
  - Symbol Explanation:
    - MSE: Mean Squared Error between the two images/frames.
    - I(i, j): The pixel value at position (i, j) in the original (ground truth) image/frame.
    - K(i, j): The pixel value at position (i, j) in the generated (approximated) image/frame.
    - M, N: Dimensions of the image (height and width).
    - $MAX_I$: The maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images).
- Structural Similarity Index Measure (SSIM):
  - Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images. Unlike PSNR, which measures absolute error, SSIM is designed to model human perception of image quality. It treats image degradation as a perceived change in structural information, and also incorporates luminance and contrast changes. Values range from -1 to 1, where 1 indicates perfect structural similarity. Higher SSIM values indicate better quality and perceptual similarity.
  - Mathematical Formula:
    $ SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  - Symbol Explanation:
    - x, y: Two image patches being compared.
    - $\mu_x$, $\mu_y$: Averages of $x$ and $y$.
    - $\sigma_x^2$, $\sigma_y^2$: Variances of $x$ and $y$.
    - $\sigma_{xy}$: Covariance of $x$ and $y$.
    - $c_1 = (k_1 R)^2$, $c_2 = (k_2 R)^2$: Small constants that avoid division by zero, where $R$ is the dynamic range of pixel values and $k_1$, $k_2$ are small constants.
- Learned Perceptual Image Patch Similarity (LPIPS):
  - Conceptual Definition: LPIPS is a perceptual similarity metric that uses deep features extracted from a pre-trained neural network (e.g., AlexNet, VGG, or ResNet) to compare image patches. Instead of pixel-wise differences, it measures the distance between feature representations. It is generally considered to correlate better with human judgment of image similarity than PSNR or SSIM. Lower LPIPS values indicate greater perceptual similarity (i.e., better quality).
  - Mathematical Formula: LPIPS does not have a simple closed-form formula like PSNR or SSIM because it relies on the internal feature representations of a deep neural network. Conceptually, it can be described as:
    $ LPIPS(\mathbf{x}, \mathbf{x_0}) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| \mathbf{w}_l \odot (\phi_l(\mathbf{x})_{h,w} - \phi_l(\mathbf{x_0})_{h,w}) \|_2^2 $
  - Symbol Explanation:
    - $\mathbf{x}$, $\mathbf{x_0}$: The two input images.
    - $\phi_l$: Feature extractor (e.g., a layer from AlexNet) at layer $l$.
    - $\phi_l(\mathbf{x})_{h,w}$: The feature vector at spatial location (h, w) in layer $l$ for image $\mathbf{x}$.
    - $\mathbf{w}_l$: A learned scaling vector for layer $l$.
    - $\odot$: Element-wise product.
    - $H_l$, $W_l$: Height and width of the feature map at layer $l$.
    - $\| \cdot \|_2^2$: Squared $\ell_2$ norm.
  - In essence, LPIPS calculates the squared distance between feature stacks (after scaling) at various layers of a pre-trained network.
- VBench Score (ImageQual, SubConsist):
  - Conceptual Definition: VBench is a comprehensive benchmark suite specifically designed for evaluating video generative models. It assesses various aspects of video generation quality, including visual quality, temporal consistency, motion, and alignment with text prompts. The paper specifically reports ImageQual (Image Quality) and SubConsist (Subject Consistency). These are typically composite scores derived from multiple sub-metrics, aiming to provide a holistic assessment aligned with human perception. Higher VBench scores indicate better performance.
  - Mathematical Formula: VBench scores are not given by a single formula but by an evaluation framework that computes various sub-metrics. The paper does not provide explicit formulas for ImageQual or SubConsist.
  - Symbol Explanation: As ImageQual and SubConsist are higher-level aggregated metrics from the VBench framework, they do not have simple mathematical symbols to explain beyond their conceptual meaning.
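As a quick sanity check of the PSNR definition above, the following NumPy snippet computes it for a synthetic reference/generated frame pair; the arrays are made up purely for illustration.

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    # Pixel-wise MSE followed by the log-ratio definition above.
    diff = np.asarray(ref, dtype=np.float64) - np.asarray(gen, dtype=np.float64)
    mse = np.mean(diff ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

rng = np.random.default_rng(0)
ref = np.full((64, 64), 128.0)                 # synthetic reference frame
gen = ref + rng.uniform(-5.0, 5.0, ref.shape)  # synthetic "generated" frame
print(round(psnr(ref, gen), 2))                # roughly 39 dB for this perturbation
```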
3.2. Previous Works
The paper contextualizes SVG by discussing prior efforts in efficient diffusion models and efficient attention methods.
3.2.1. Efficient Diffusion Models
Previous work to make diffusion models more efficient generally falls into three categories:
- Decreasing Denoising Steps:
  - Problem: Most diffusion models rely on Stochastic Differential Equations (SDEs) requiring many sampling steps (Song & Ermon, 2019; Ho et al., 2020; Meng et al., 2022).
  - Solutions: DDIM (Song et al., 2020) approximated SDEs with Ordinary Differential Equations (ODEs). Subsequent techniques refined ODE paths and solvers (Lu et al., 2022a;b; Liu et al., 2022; 2024c) or used consistency losses (Song et al., 2023; Luo et al., 2023) to achieve high quality with fewer steps.
  - Distillation: Methods like (Yin et al., 2024a;b) train simpler, few-step models by distilling knowledge from larger models.
  - Limitation (for SVG): These approaches often require expensive re-training or fine-tuning, which is impractical for many video generation use cases. SVG distinguishes itself by being a training-free framework, directly applicable to off-the-shelf pre-trained models.
- Diffusion Model Compression:
  - Problem: Diffusion models are large and memory-intensive.
  - Solutions: Weight compression through quantization (Li et al., 2023; Zhao et al., 2024a; Li* et al., 2025) reduces the precision of model weights (e.g., INT8, INT4, FP8). Other methods propose efficient architectures (Xie et al., 2024; Cai et al., 2024; Chen et al., 2025) or high-compression autoencoders (Chen et al., 2024a).
  - Relationship to SVG: SVG is orthogonal to these techniques, meaning it can be combined with them for additional efficiency gains. The paper demonstrates this by integrating FP8 quantization with SVG.
- Efficient System Implementation:
  - Problem: Optimizing the underlying software and hardware interactions for diffusion models.
  - Solutions: System-level optimizations include dynamic batching (Kodaira et al., 2023; Liang et al., 2024), caching strategies (Chen et al., 2024b; Zhao et al., 2024b), or hybrid approaches (Lv et al., 2024; Liu et al., 2024a). For example, PAB (Zhao et al., 2024b) reuses results from prior layers.
  - Limitation (for SVG): While these methods improve throughput, they often lead to a drop in output quality, with PSNR sometimes falling below 22. SVG significantly outperforms them in maintaining fidelity, preserving a PSNR above 30.
3.2.2. Efficient Attention Methods
The paper also reviews various strategies for making the attention mechanism more efficient.
- Sparse Attention in LLMs:
  - Problem: Self-attention's quadratic complexity is a major bottleneck in Large Language Models (LLMs) processing long contexts.
  - Solutions:
    - Temporal Locality: Methods like StreamingLLM (Xiao et al., 2023) and LM-Infinite (Han et al., 2023) observe that attention often concentrates on recent or initial tokens.
    - Heavy Hitter Tokens: H2O (Zhang et al., 2023b), Scissorhands (Liu et al., 2024d), and DoubleSparsity (Yang et al., 2024b) identify a small set of influential "heavy hitter" tokens.
    - Cross-Layer/Head Correlation: TidalDecode (Yang et al., 2024a) notes correlation across layers, while DuoAttention (Xiao et al., 2024a) and MInference (Jiang et al., 2024) identify distinct sparse patterns across different attention heads.
  - Limitation (for SVG): These methods primarily focus on token-level sparsity specific to text data and do not leverage the inherent redundancy and distinct spatial-temporal patterns unique to video data. SVG's video-specific sparsity patterns are a key differentiator.
- Linear and Low-bit Attention:
  - Linear Attention: Approaches like Linformer (Wang et al., 2020), Performer (Choromanski et al., 2020), and EfficientViT (Cai et al., 2023) aim to reduce attention complexity from quadratic to linear by using kernel methods or other approximations.
  - Low-bit Attention: Similar to model compression, this involves performing attention calculations at reduced precision (e.g., INT8 in SageAttention by Zhang et al., 2025a) to accelerate computation.
  - Relationship to SVG: SVG is orthogonal to both linear and low-bit attention. It can be combined with FP8 attention (a form of low-bit attention) for further gains, as demonstrated in the paper, because it addresses a different kind of sparsity.
3.3. Technological Evolution
The field of generative AI has seen a rapid evolution, moving from earlier generative adversarial networks (GANs) to the more stable and high-quality Diffusion Models. Within Diffusion Models, the architectural backbone has progressed from U-Nets to Transformers, leading to Diffusion Transformers (DiTs). DiTs have shown immense scalability and fidelity in image generation and have naturally extended to video generation, adapting from 2D attention to 3D Full Attention to model both spatial and temporal dynamics.
However, this increased capability comes with a substantial computational cost, particularly from the quadratic complexity of 3D Full Attention. The technological evolution in this space is now moving towards optimizing these powerful models for practical deployment. Early optimization efforts focused on general techniques like quantization or denoising step reduction. Concurrently, sparse attention emerged as a powerful optimization for Transformers in LLMs.
This paper's work, Sparse VideoGen, represents a crucial step in this evolution by adapting the concept of sparse attention to the unique challenges of video data. It moves beyond generic sparsity or text-specific patterns to identify and exploit video-specific spatial-temporal redundancies, combining this algorithmic insight with hardware-aware system optimizations. This positions SVG at the forefront of enabling efficient, high-quality video generation in real-world scenarios.
3.4. Differentiation Analysis
Compared to the main methods discussed in related work, Sparse VideoGen (SVG) offers several core differences and innovations:
- Video-Specific Sparsity Patterns:
  - Differentiation: Unlike sparse attention methods for LLMs (e.g., StreamingLLM, H2O, MInference, DuoAttention) that focus on token-level sparsity based on temporal locality or "heavy hitters" in text sequences, SVG specifically identifies and leverages spatial and temporal sparsity patterns within the 3D Full Attention of video data. This recognizes the unique structured redundancy present in video (within-frame spatial coherence, across-frame temporal consistency).
  - Impact: This video-specific approach allows SVG to preserve the critical structural and temporal integrity of generated videos, which general token-level sparsity methods often fail to do, leading to quality degradation (as shown by MInference's blurring and temporal inconsistencies in Figure 1 and Table 1).
- Training-Free Framework:
  - Differentiation: Many efficiency methods, especially those reducing denoising steps or involving distillation (e.g., DDIM, DPM-Solver, consistency models, distillation-based methods), require extensive re-training or fine-tuning of the diffusion model.
  - Impact: SVG is training-free, meaning it can be directly applied to any off-the-shelf pre-trained video DiT model without incurring the prohibitive cost of additional training, making it highly practical for deployment.
- Online Profiling Strategy:
  - Differentiation: SVG introduces an efficient online profiling mechanism to dynamically identify the optimal sparse pattern for each attention head at runtime. This addresses the challenge that sparsity patterns can vary across different denoising steps and input prompts. Other sparse attention methods might rely on static patterns or require prior analysis.
  - Impact: This dynamic adaptation ensures that the most appropriate sparse pattern is applied for maximum efficiency and quality preservation, with a negligible overhead of approximately 3%.
- Hardware-Efficient Layout Transformation:
  - Differentiation: SVG explicitly tackles the hardware inefficiency of certain sparsity patterns (specifically the non-contiguous nature of the Temporal Head) by proposing a novel tensor layout transformation. This is a system-level innovation beyond mere algorithmic sparsity identification.
  - Impact: By reordering data to be contiguous, SVG enables effective utilization of GPU Tensor Cores, translating theoretical sparsity gains into actual, measurable end-to-end speedups (e.g., a 1.7x additional speedup for temporal attention compared to naive sparse attention).
- Superior Quality Preservation:
  - Differentiation: Compared to other system-level optimizations or caching strategies (e.g., PAB) that improve throughput but often cause significant drops in output quality (PSNR below 22), SVG consistently maintains high visual fidelity (PSNR above 29).
  - Impact: SVG achieves a better balance between speed and quality, making it a more viable solution for high-stakes video generation applications.

In summary, SVG differentiates itself by providing a holistic, video-specific, and hardware-aware approach to sparse attention, addressing the unique challenges of video DiTs that previous, more general or text-focused methods could not.
4. Methodology
The Sparse VideoGen (SVG) framework is designed to accelerate video Diffusion Transformers (DiTs) by exploiting inherent sparsity in their 3D Full Attention mechanism. It tackles the challenges of dynamic sparsity patterns and hardware inefficiency through a novel online profiling strategy and a hardware-efficient tensor layout transformation, combined with customized kernel implementations.
4.1. Principles
The core idea behind SVG is the observation that 3D Full Attention in video DiTs does not distribute its attention uniformly across all tokens. Instead, attention heads exhibit distinct sparse patterns that are critical for different aspects of video generation:
- Spatial Head: These heads primarily focus their attention on tokens within the same frame or spatially adjacent frames. This pattern is crucial for maintaining the spatial consistency and structure of objects within the generated video. It results in a block-wise layout in the attention map.
- Temporal Head: These heads focus on tokens at the same spatial location across different frames. This pattern is essential for ensuring temporal consistency and smooth motion throughout the video. It exhibits a slash-wise layout with a constant interval in the attention map.

The principle is that by dynamically identifying and applying these specific sparse patterns to the relevant attention heads, SVG can significantly reduce computation without compromising the quality of the generated video, as most of the "unattended" tokens contribute negligibly to the output. Additionally, common to both head types, text prompt tokens and first-frame tokens are observed to hold significant attention scores and are always included.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. 3D Full Attention Shows Instinct Sparsity
The paper's methodology begins with a detailed analysis of the 3D Full Attention mechanism in video DiTs. The crucial finding is that attention heads are not uniform but can be dynamically categorized into two types based on their dominant sparse patterns, as illustrated in Figure 3.

Figure 3. The visualization from the paper shows the attention distribution of the spatial and temporal heads along with their corresponding correlations. The upper part displays spatial and temporal attention maps, while the lower part visualizes the spatial and temporal correlations in a six-frame video.
- Spatial Head (Figure 3(a-b)): This type of head primarily concentrates its attention scores on spatially-local tokens. When visualizing the attention map, this manifests as a block-wise layout. Since tokens within a single frame are contiguous in the sequence, a Spatial Head largely attends to tokens within the same frame and its immediate neighbors in the temporal dimension. This behavior is fundamental for preserving the spatial structures and consistency of objects and scenes across the video frames. The "block size" here corresponds to the number of tokens representing a single frame.
- Temporal Head (Figure 3(c-d)): In contrast, the Temporal Head exhibits a distinct slash-wise layout in the attention map, characterized by a constant interval. Given that each frame is tokenized into a fixed number of tokens $L$, tokens occupying the same spatial position across different frames are spaced apart by a stride of $L$ in the flattened token sequence. Consequently, Temporal Heads effectively capture information from tokens that share the same spatial coordinates but originate from different frames. This pattern is vital for maintaining temporal consistency and smooth motion throughout the video.

The authors also note that for both Spatial and Temporal Heads, tokens corresponding to the text prompts and the first frame consistently hold significant attention scores. Therefore, these tokens are always included in the sparse attention computation for both head types. (The two patterns are sketched as boolean masks below.)
4.2.2. Sparse Attention Achieves Lossless Accuracy (Oracle Method)
The paper empirically demonstrates that applying these identified sparse patterns (spatial or temporal) to the corresponding attention heads does not degrade the quality of generated videos. To prove this, they devised an "oracle" method: for each attention head and denoising step, they compute the full attention output and then compare it with the outputs produced by applying spatial sparse attention and temporal sparse attention. The sparse pattern that yields the lowest Mean Squared Error (MSE) relative to the full attention output is then chosen. This oracle approach, when applied to CogVideoX-v1.5 and HunyuanVideo, achieves a PSNR over 29, indicating high fidelity.
However, this oracle strategy is not practically efficient because it still requires the full attention computation to determine the best sparse pattern, negating any speedup. This highlights the need for a more efficient, real-time sparsity identification method, which SVG addresses next.
4.2.3. Sparse Attention Promises Theoretical Speedup
The theoretical advantage of sparse attention lies in its ability to significantly reduce the computational load by processing only the "important" tokens, as determined by the identified sparse patterns. The computational savings are analyzed as follows.

Given a model configuration with:

- $H$: Hidden dimension of the transformer.
- $L$: Number of tokens per frame.
- $N$: Total number of frames.
- $LN$: Total number of tokens in the video.

The total computation (in FLOPS) for each full attention operation is:
$ \text{FLOPS}_{\text{full}} = 2 \cdot 2 \cdot (LN)^2 \cdot H = 4L^2N^2H $
Where:

- The factor of $2 \cdot 2$ accounts for the two matrix multiplications, $QK^T$ and the multiplication of the attention weights with $V$, each involving 2 operations per element (a multiplication and an addition, roughly).
- $(LN)^2$ reflects the quadratic complexity with respect to the total number of tokens.
- $H$ is the hidden dimension, which also contributes to the computation.

For a Spatial Head, assuming each query token only attends to $c_s$ nearby frames (e.g., the current frame and a few adjacent frames), the computation is reduced. The "local" attention is computed over $L$ tokens per frame, scaled by the $c_s$ frames it attends to and the $N$ total frames:
$ \text{FLOPS}_{\text{spatial}} = (2 \cdot 2 \cdot L^2 H) \cdot c_s N $
The sparsity achieved by a Spatial Head is approximately:
$ \text{Sparsity}_{\text{spatial}} = \frac{c_s}{N} $
This indicates that the computation is reduced by a factor proportional to $c_s / N$.

For a Temporal Head, assuming each query token only attends to $c_t$ tokens across all frames (i.e., tokens at the same spatial position across frames), the computation is reduced. The "temporal" attention is computed over $N$ frames, scaled by the $c_t$ spatial positions it attends to and the $L$ total spatial positions:
$ \text{FLOPS}_{\text{temporal}} = (2 \cdot 2 \cdot N^2 H) \cdot c_t L $
The sparsity achieved by a Temporal Head is approximately:
$ \text{Sparsity}_{\text{temporal}} = \frac{c_t}{L} $
This indicates that the computation is reduced by a factor proportional to $c_t / L$.

Since both $c_s$ (the number of attended frames for the spatial head) and $c_t$ (the number of attended spatial tokens for the temporal head) are typically much smaller than $N$ and $L$ respectively, significant sparsity (e.g., retaining only about 30% of the full computation) can be achieved, leading to substantial theoretical computational savings. For example, CogVideoX-v1.5-T2V achieves 31% sparsity for both head types while maintaining a high PSNR.

However, a crucial point noted is that despite the theoretical speedup, the Temporal Head's non-contiguous memory access pattern can make it hardware-inefficient in practice. This is addressed in a later section. (Note: The text prompts and first frame are excluded from this simplified theoretical calculation for clarity, as their contribution is constant and small relative to the entire video sequence.)
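As a quick numerical check, the snippet below plugs in the HunyuanVideo configuration reported in Section 5 (33 frames, 3600 tokens per frame) together with the paper's sparsity budgets ($c_s = 10$, $c_t = 1200$); the hidden dimension is a placeholder, and it cancels out in the ratios anyway.

```python
# Rough FLOP estimate for one attention operation, following the formulas above.
H = 128                      # placeholder hidden/head dimension
L, N = 3600, 33              # tokens per frame, number of frames (HunyuanVideo config)
c_s, c_t = 10, 1200          # attended frames (spatial) / spatial positions (temporal)

flops_full = 4 * (L * N) ** 2 * H
flops_spatial = 4 * L ** 2 * H * c_s * N
flops_temporal = 4 * N ** 2 * H * c_t * L

print(f"spatial sparsity  c_s/N = {c_s / N:.2f}")          # ~0.30
print(f"temporal sparsity c_t/L = {c_t / L:.2f}")          # ~0.33
print(f"spatial FLOP ratio  = {flops_spatial / flops_full:.2f}")
print(f"temporal FLOP ratio = {flops_temporal / flops_full:.2f}")
```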
4.2.4. Online Profiling Strategy for Sparsity Identification
To overcome the overhead of the "oracle" method and dynamically identify the optimal sparse pattern for each attention head at runtime, SVG introduces an efficient online profiling strategy. This strategy determines whether an attention head should be classified as a Spatial Head or a Temporal Head on the fly, without needing to perform full attention computation across all tokens.
The online profiling strategy works as follows (detailed in Algorithm 1 and illustrated conceptually in Figure 4):

该图像是图表,展示了SVG注意力工作流程和每个头的在线配置。图中通过 计算注意力,其中区分了空间头和时间头的关系,表示了模型在生成过程中如何分类注意力头。
Figure 4. The diagram illustrates the SVG attention workflow and per-head online profiling. It shows the calculation of attention through , distinguishing the relationships between spatial and temporal heads, demonstrating how the model classifies attention heads during the generation process.
Algorithm 1 Online Profiling Strategy
# Q, K, V, O: [B, H, S, D] - query, key, value, output
# S: total token number, e.g., 18k
# t: sampled token number, e.g., 32
# Sample the Indices
indices = sample_indices(S, t)  # (t,)
Q_i = Q[:, :, indices, :]
# Get the attention masks
mask_spatial = gen_spatial_mask()[:, :, indices, :]
mask_temporal = gen_temporal_mask()[:, :, indices, :]
# Compute sampled attention score
# Shape: [B, H, t, D]
O_full = mask_attention(Q_i, K, V, None)
O_spatial = mask_attention(Q_i, K, V, mask_spatial)
O_temporal = mask_attention(Q_i, K, V, mask_temporal)
# Calculate MSE and get best mask
# Shape: [B, H]
MSE_s = ((O_full - O_spatial) ** 2).mean(dim=(2,3))
MSE_t = ((O_full - O_temporal) ** 2).mean(dim=(2,3))
best_mask_config = (MSE_s < MSE_t)
Step-by-step explanation:
- Input: The algorithm takes the Query ($Q$), Key ($K$), and Value ($V$) tensors, typically of shape [B, H, S, D], where:
  - B: Batch size.
  - H: Number of attention heads.
  - S: Total number of tokens (e.g., 18k).
  - D: Head dimension.
- Sampling a Subset of Queries: Instead of processing all query tokens, SVG randomly samples a small subset of $t$ indices (e.g., $t = 32$) out of the $S$ tokens.
  - indices = sample_indices(S, t): This function generates $t$ random indices from 0 to S-1.
  - Q_i = Q[:, :, indices, :]: Only the query vectors corresponding to these sampled indices are extracted. Q_i has the shape [B, H, t, D].
- Generating Sparse Attention Masks: For the sampled query tokens (Q_i), two types of attention masks are generated:
  - mask_spatial = gen_spatial_mask()[:, :, indices, :]: This mask represents the connections for a Spatial Head. It typically specifies that each sampled query token should only attend to other tokens within its own frame and a few adjacent frames. The mask is generated based on the structure of the video data (frames, tokens per frame) and then filtered to apply only to the sampled indices.
  - mask_temporal = gen_temporal_mask()[:, :, indices, :]: This mask represents the connections for a Temporal Head. It specifies that each sampled query token should only attend to tokens at the same spatial position across different frames. Similarly, this mask is generated and filtered for the sampled indices.
- Computing Sampled Attention Scores: With the sampled queries (Q_i) and the full Key (K) and Value (V) tensors, three attention computations are performed:
  - O_full = mask_attention(Q_i, K, V, None): This computes the full attention output only for the sampled query tokens. The None mask implies no sparsity restriction on this calculation. O_full has shape [B, H, t, D].
  - O_spatial = mask_attention(Q_i, K, V, mask_spatial): This computes the sparse attention output for the sampled query tokens, using mask_spatial.
  - O_temporal = mask_attention(Q_i, K, V, mask_temporal): This computes the sparse attention output for the sampled query tokens, using mask_temporal.
- Calculating Mean Squared Error (MSE) and Selecting the Best Mask: For each attention head (and each batch item), the MSE between the sparse attention outputs and the sampled full attention output is calculated.
  - MSE_s: The MSE between the full attention output for sampled queries and the spatial sparse attention output, computed as the mean squared difference over the t (sampled tokens) and D (head dimension) dimensions, giving one MSE value per batch item and per head, i.e., a tensor of shape [B, H].
  - MSE_t: The MSE between the full attention output for sampled queries and the temporal sparse attention output.
  - best_mask_config = (MSE_s < MSE_t): This compares the two MSE values for each head. If MSE_s is lower, the head is classified as a Spatial Head; otherwise, it is a Temporal Head. best_mask_config is a boolean tensor of shape [B, H], indicating the chosen sparse pattern for each head.

Effectiveness: The paper highlights that profiling only this tiny fraction of tokens (32 sampled query rows) achieves a PSNR of up to 31.1, comparable to the oracle method that profiles all tokens, while incurring a negligible runtime overhead of only about 3% compared to full attention. This demonstrates the efficiency and accuracy of the online profiling strategy. A self-contained sketch of this profiling step follows.
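Below is a self-contained PyTorch sketch of the profiling step, under simplifying assumptions: dense boolean masks and `torch.nn.functional.scaled_dot_product_attention` stand in for the paper's fused Triton/FlashInfer kernels, and the function names are illustrative rather than the authors' API.

```python
import torch
import torch.nn.functional as F

def profile_heads(Q, K, V, spatial_mask, temporal_mask, t=32):
    # Q, K, V: [B, H, S, D]; masks: [S, S] boolean (True = attend).
    B, H, S, D = Q.shape
    idx = torch.randperm(S)[:t]                    # sample t query rows
    Qi = Q[:, :, idx, :]                           # [B, H, t, D]

    def attn(mask):
        # mask: [t, S] boolean or None; broadcast over batch and heads.
        m = None if mask is None else mask[None, None]
        return F.scaled_dot_product_attention(Qi, K, V, attn_mask=m)

    O_full = attn(None)
    O_spatial = attn(spatial_mask[idx])
    O_temporal = attn(temporal_mask[idx])
    mse_s = (O_full - O_spatial).pow(2).mean(dim=(2, 3))    # [B, H]
    mse_t = (O_full - O_temporal).pow(2).mean(dim=(2, 3))   # [B, H]
    return mse_s < mse_t                            # True -> treat head as spatial
```

Combined with mask helpers like those sketched in Section 4.2.1, the returned [B, H] boolean tensor selects the spatial pattern where it is True and the temporal pattern otherwise.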
4.2.5. Hardware-Efficient Layout Transformation
A significant challenge in achieving real-world speedups from sparse attention, particularly for the Temporal Head, is hardware inefficiency. While NVIDIA Tensor Cores (used for matrix multiplication on GPUs) are powerful, they require data to be contiguous (e.g., at least 16 contiguous elements) along dimensions for optimal utilization. The Temporal Head's sparsity pattern, which connects tokens at the same spatial location across frames, inherently involves non-contiguous memory access with a stride equal to the number of tokens per frame ($L$). This prevents efficient use of Tensor Cores, limiting practical speedups.
To address this, SVG introduces a novel hardware-efficient layout transformation.

Figure 5. The visualization from the paper illustrates the hardware-efficient layout transformation. The left side (a) displays a non-contiguous sparsity layout of a temporal head, which is hardware inefficient. The right side (b) shows a contiguous layout generated by transposing the token-major tensor into a frame-major one, which can be efficiently handled by block sparse attention.
Explanation:
- Problem (Figure 5a, Non-Contiguous Layout): In the standard token-major representation, tokens from the same frame are contiguous, followed by tokens from the next frame. A Temporal Head needs to access the 0th token of frame 0, then the 0th token of frame 1, then the 0th token of frame 2, and so on. These tokens are spaced $L$ positions apart (the number of tokens per frame), making them non-contiguous in memory. This pattern is inefficient for Tensor Cores and memory access.
- Solution (Figure 5b, Contiguous Layout via Transposition): SVG proposes a layout transformation that transposes the tensor from a token-major layout to a frame-major layout.
  - Original Layout (Token-major): [Frame 0, Token 0], [Frame 0, Token 1], ..., [Frame 0, Token L-1], [Frame 1, Token 0], ...
  - Transformed Layout (Frame-major): [Frame 0, Token 0], [Frame 1, Token 0], ..., [Frame N-1, Token 0], [Frame 0, Token 1], ...

  By performing this transposition, all tokens corresponding to the same spatial position across all frames become contiguous in memory. For example, all "Token 0"s from all frames ($N$ of them) are grouped together, then all "Token 1"s from all frames, and so on.
- Hardware Efficiency: This frame-major layout converts the non-contiguous slash-wise access pattern of the Temporal Head into a contiguous block-wise access pattern. The contiguous layout is highly amenable to GPU Tensor Cores and enables efficient block sparse attention computations.
- Mathematical Equivalence: The transformation yields a mathematically equivalent output because it is a pure permutation of the tokens: reordering the data before computing attention (and reordering the outputs back) does not change the final result, only the efficiency of the underlying memory access and computation. This technique is crucial for translating theoretical speedups into practical, measurable gains. The effectiveness of this method is ablated in Section 5.5, showing significant speedup. A minimal sketch of the transposition appears below.
4.2.6. Other Optimizations
Beyond the core online profiling and layout transformation, SVG incorporates several system-level optimizations to further boost efficiency:
- Efficient Kernel Customization:
  - Problem: Standard PyTorch implementations of operations like QK-norm (normalization applied to the query and key vectors) and RoPE (Rotary Positional Embeddings, a common positional encoding technique) can suffer from performance issues, especially when attention head dimensions are small (e.g., 64 in CogVideoX-v1.5), due to limited parallelism in standard implementations for small dimensions.
  - Solution: SVG customizes these operations in CUDA using sub-warp reduction implementations. Sub-warp reduction is a CUDA programming technique for efficiently performing reductions (like sums or means) across threads within a warp (a group of 32 CUDA threads), leveraging shared memory and fast inter-thread communication.
  - Impact: This customization provides substantial speedups, up to 5x faster than the PyTorch implementations of QK-norm and RoPE (as detailed in Table 2). A plain PyTorch reference for RoPE is sketched after this list.
  - Overall Kernel Implementation: The entire SVG framework, including the fused online profiling strategy and layout transformation kernels, is prototyped using Triton (a DSL for GPU kernels by OpenAI) and FlashInfer (an efficient attention engine for LLM inference serving). Triton allows writing high-performance GPU kernels directly, while FlashInfer provides optimized block sparse attention kernels.
- Quantization:
  - Problem: Deep learning models, including DiTs, typically operate in FP32 or FP16 precision, which consumes significant memory and computational resources. Quantization reduces the numerical precision of weights and activations.
  - Solution: SVG is designed to be compatible with FP8 quantization (8-bit floating point). This technique, often used in efficient LLM inference (Zhang et al., 2025a; 2024; Zhao et al., 2024c), significantly reduces the memory footprint and enables faster arithmetic operations on compatible hardware.
  - Impact: FP8 quantization further boosts throughput by up to 1.3x with a minimal accuracy drop (around 0.1 PSNR on HunyuanVideo), as shown in Table 1. A customized attention kernel supporting both FP8 quantization and block sparse computation is also developed. FP8 quantization was not applied to CogVideoX-v1.5 because its small head dimension (64) limits the arithmetic intensity, so FP8 would not offer significant on-GPU speedups in that configuration.
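For reference, a plain PyTorch version of one common RoPE formulation is shown below; this is not SVG's CUDA kernel, only an illustration of the small, memory-bound operation (alongside QK-norm) that the customized kernels accelerate.

```python
import torch

def apply_rope(x, base=10000.0):
    # x: [B, H, S, D] with even head dimension D; rotates channel pairs by
    # position-dependent angles (the standard "rotate-half" rotary embedding).
    B, H, S, D = x.shape
    half = D // 2
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)     # [D/2]
    angles = torch.arange(S, dtype=torch.float32)[:, None] * freqs[None, :]  # [S, D/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```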
5. Experimental Setup
5.1. Datasets
The experiments evaluate SVG on prominent open-sourced video generation models and datasets to ensure representative benchmarking.
- CogVideoX-v1.5-I2V (Image-to-Video):
  - Description: This model generates video from an input image and a text prompt. It processes 11 frames with 4080 tokens per frame in its 3D Full Attention mechanism, producing 720p videos about 10 seconds long.
  - Data Source: For evaluation, SVG uses the VBench dataset (Huang et al., 2023) after prompt optimization, as suggested by CogVideoX (Yang et al., 2024c). VBench is a comprehensive benchmark suite for video generative models, evaluating various aspects of video quality.
  - Why Chosen: CogVideoX-v1.5 is a state-of-the-art open-source image-to-video model, providing a strong baseline for evaluating SVG's performance in translating static images into dynamic sequences while maintaining quality.
- CogVideoX-v1.5-T2V (Text-to-Video):
  - Description: Similar to the I2V version, but generates video purely from a text prompt. It also handles 11 frames with 4080 tokens per frame for 720p, 10-second videos.
  - Data Source: Evaluated using the VBench dataset with optimized prompts.
  - Why Chosen: CogVideoX-v1.5-T2V is a key text-to-video model, demonstrating SVG's ability to accelerate generation from abstract textual descriptions to concrete video content.
- HunyuanVideo-T2V (Text-to-Video):
  - Description: A large-scale video generative model that operates on 33 frames with 3600 tokens per frame for 720p videos, typically 5.33 seconds long.
  - Data Source: Benchmarked using prompts from the Penguin Video Benchmark released with HunyuanVideo (Kong et al., 2024).
  - Why Chosen: HunyuanVideo is another state-of-the-art open-source text-to-video model, representing a different architecture and scale of video generation, thus providing a broader validation of SVG's general applicability and efficiency.

Example of a Data Sample (Conceptual): Since the datasets are composed of video prompts and actual video outputs, a concrete data sample would be:

- Prompt (Text-to-Video): "A blue boat navigating the ocean with soft waves."
- Input Image (Image-to-Video): A still image of a blue boat on calm water.
- Generated Output: A 5-10 second video showing the blue boat gently rocking on the waves, moving across the ocean.

These datasets were chosen because they represent current state-of-the-art, publicly available video generation models, allowing for transparent and comparable evaluation of SVG's effectiveness in accelerating real-world video synthesis tasks.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to assess the quality of generated videos, covering both pixel-level fidelity and perceptual similarity, as well as high-level video quality attributes.
5.2.1. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. In image/video generation, it quantifies the reconstruction quality of a generated output compared to a ground-truth or reference video, focusing on pixel-wise differences. A higher PSNR indicates better quality.
- Mathematical Formula:
  $ MSE = \frac{1}{MN} \sum_{i=0}^{M-1}\sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
  $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
- Symbol Explanation:
  - MSE: Mean Squared Error, the average of the squared differences between the pixels of the original and generated images/frames.
  - I(i, j): The pixel value at coordinates (i, j) in the original image/frame.
  - K(i, j): The pixel value at coordinates (i, j) in the generated image/frame.
  - M, N: The dimensions (height and width) of the image/frame.
  - $MAX_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
5.2.2. Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is a perceptual metric designed to assess the perceived quality of an image by comparing it to a reference image, taking into account luminance, contrast, and structural information. It aims to reflect human visual perception better than PSNR. SSIM values range from -1 to 1, with 1 indicating perfect similarity. A higher SSIM suggests better perceived quality.
- Mathematical Formula:
  $ SSIM(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
  - x, y: Two image patches (e.g., from the original and generated frames) being compared.
  - $\mu_x$, $\mu_y$: The means of image patches $x$ and $y$.
  - $\sigma_x^2$, $\sigma_y^2$: The variances of image patches $x$ and $y$.
  - $\sigma_{xy}$: The covariance of image patches $x$ and $y$.
  - $c_1 = (k_1 R)^2$, $c_2 = (k_2 R)^2$: Small constants used to prevent division by zero and stabilize the formula, where $R$ is the dynamic range of pixel values and $k_1$, $k_2$ are small constants.
5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)
- Conceptual Definition: LPIPS quantifies the perceptual difference between two images. Unlike traditional metrics such as PSNR or SSIM, LPIPS uses features extracted from a pre-trained deep neural network (e.g., AlexNet, VGG) to measure distance in a perceptually meaningful feature space. A lower LPIPS score indicates that two images are perceptually more similar (better quality).
- Mathematical Formula: LPIPS does not have a simple, direct formula like PSNR or SSIM because its calculation is based on the internal activations of a deep learning model. Conceptually, it measures the weighted distance between feature maps extracted from different layers of a pre-trained network:
  $ LPIPS(\mathbf{x}, \mathbf{x_0}) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| \mathbf{w}_l \odot (\phi_l(\mathbf{x})_{h,w} - \phi_l(\mathbf{x_0})_{h,w}) \|_2^2 $
- Symbol Explanation:
  - $\mathbf{x}$, $\mathbf{x_0}$: The two input images (e.g., original and generated frames).
  - $\phi_l$: A specific layer (feature extractor) from a pre-trained deep neural network (e.g., AlexNet) at layer $l$.
  - $\phi_l(\mathbf{x})_{h,w}$: The feature vector extracted at spatial position (h, w) of layer $l$ for image $\mathbf{x}$.
  - $\mathbf{w}_l$: A learned weight vector applied to the features of layer $l$, which fine-tunes the perceptual distance.
  - $\odot$: The element-wise product.
  - $H_l$, $W_l$: The height and width of the feature map at layer $l$.
  - $\| \cdot \|_2^2$: The squared Euclidean (L2) norm, measuring the distance between the feature vectors.
5.2.4. VBench Score (ImageQual, SubConsist)
- Conceptual Definition: VBench is a comprehensive benchmark specifically designed to evaluate various aspects of video generative models. The paper reports two of its sub-metrics:
  - ImageQual (Image Quality): Assesses the overall visual fidelity and aesthetic quality of the individual frames within the generated video. A higher ImageQual score suggests that the frames are visually pleasing and high-resolution.
  - SubConsist (Subject Consistency): Evaluates how well the main subject or entity (e.g., an object, a character) within the video maintains its identity, appearance, and characteristics consistently across frames. A higher SubConsist score indicates better temporal coherence of the subject.
- Mathematical Formula: VBench metrics are calculated through a combination of several underlying quantitative and qualitative measures, potentially involving human evaluations or specialized models. The paper does not provide explicit formulas for ImageQual or SubConsist.
- Symbol Explanation: As ImageQual and SubConsist are composite scores from a benchmark suite, they are typically reported as percentages or scaled scores; no specific mathematical symbols are associated with them.
5.3. Baselines
SVG is compared against several representative sparse attention algorithms and a cache-based DiT acceleration algorithm:
- DiTFastAttn (Yuan et al., 2024): This method is described as primarily a "spatial-only" attention algorithm in the context of video DiTs. It likely focuses on optimizing spatial dependencies within frames, possibly using a fixed window or block-sparse approach.
- Temporal-only (manually implemented): To provide a direct comparison for SVG's dual-head approach, the authors manually implemented a baseline that only utilizes temporal sparse attention. This isolates the performance contribution and quality implications of focusing solely on temporal relationships.
- MInference (Jiang et al., 2024): A dynamic sparse attention algorithm, originally developed for LLMs, which identifies different sparse patterns across attention heads. The paper refers to a variant, MMInference for VLMs (Li et al., 2025), in the T2V section of Table 1. It uses a mean-pooling block-sparse mechanism.
- PAB (Pyramid Attention Broadcast) (Zhao et al., 2024b): A cache-based DiT acceleration algorithm. It speeds up inference by reusing results from prior layers or attention computations rather than recomputing them, and is primarily a system-level optimization for throughput.

Representativeness:

- DiTFastAttn and Temporal-only represent single-focus sparse attention strategies, helping to demonstrate the necessity of SVG's dual Spatial and Temporal Head approach.
- MInference represents a state-of-the-art dynamic sparse attention method from the LLM domain, showing how existing text-focused solutions may struggle with video data's unique patterns.
- PAB represents system-level optimizations that leverage caching, a common technique for efficiency, thus allowing comparison against a different class of acceleration methods.

These baselines provide a comprehensive comparison, highlighting SVG's advantages in balancing quality and efficiency by specifically addressing the unique sparse patterns of video DiTs.
5.4. Parameters
The experimental setup uses specific parameters for SVG and general practices for all baselines:
- Sparsity configurations for SVG:
  - For CogVideoX-v1.5: Spatial Heads attend to 4 frames; Temporal Heads attend to 1224 tokens.
  - For HunyuanVideo: Spatial Heads attend to 10 frames; Temporal Heads attend to 1200 tokens.
  - Implication: these configurations are chosen to achieve approximately 30% sparsity for both Spatial and Temporal Heads, a level the paper states is generally sufficient for "lossless generation."
- Online profiling ratio: SVG samples roughly 1% of the input rows (query tokens) to determine the optimal sparse pattern for each attention head.
  - Implication: this small ratio is critical for keeping the profiling overhead minimal (about 3% of runtime, per the ablation in Section 6.3.1) while still classifying attention heads effectively. A minimal sketch of such a profiling step is given after this list.
- Warm-up denoising steps: for all methods, acceleration is not applied to the first denoising steps.
  - Implication: this warm-up is a common practice in diffusion model acceleration (Zhao et al., 2024b; Li et al., 2024; Lv et al., 2024; Liu et al., 2024a), since the earliest steps are the most sensitive to approximation; because every method follows the same protocol, the comparison remains fair.
- Baselines' configurations: for MInference and PAB, the authors use the official configurations, i.e., the standard or recommended settings from the original papers.
- Hardware: the experiments were conducted on an H100-80GB-HBM3 GPU with CUDA 12.4.
- FlashAttention-2: all baselines (and SVG's remaining full-attention computations) adopt FlashAttention-2 (Dao et al., 2022), so the comparison is against an already highly optimized attention implementation.
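To make the online profiling step concrete, the following is a minimal PyTorch sketch (referenced in the list above). It is an illustration under simplifying assumptions: the strictly same-frame and same-offset candidate masks, the MSE selection criterion, and the helper name classify_heads are ours, not the paper's exact implementation.

```python
import torch

def classify_heads(q, k, v, frame_len, sample_ratio=0.01):
    """Sketch of SVG-style online profiling for one attention layer.

    q, k, v: [heads, seq, dim]; frame_len: tokens per frame.
    Returns a boolean tensor per head: True -> spatial head, False -> temporal head.
    """
    heads, seq, dim = q.shape
    n_sample = max(1, int(seq * sample_ratio))
    rows = torch.randperm(seq)[:n_sample]              # sampled query rows

    frame_id = torch.arange(seq) // frame_len           # which frame each token is in
    offset   = torch.arange(seq) %  frame_len           # position inside its frame
    same_frame  = frame_id[rows][:, None] == frame_id[None, :]   # spatial candidate
    same_offset = offset[rows][:, None]   == offset[None, :]     # temporal candidate

    scale = dim ** -0.5
    scores = torch.einsum("hsd,htd->hst", q[:, rows] * scale, k)  # [heads, n_sample, seq]
    ref = torch.softmax(scores, dim=-1) @ v                       # full-attention reference

    def masked_out(mask):
        # Apply a candidate sparse pattern only on the sampled rows.
        return torch.softmax(scores.masked_fill(~mask, float("-inf")), dim=-1) @ v

    err_spatial  = (masked_out(same_frame)  - ref).pow(2).mean(dim=(1, 2))
    err_temporal = (masked_out(same_offset) - ref).pow(2).mean(dim=(1, 2))
    return err_spatial <= err_temporal                  # per-head classification
```

Because only about 1% of query rows enter this comparison, the extra attention work is a small fraction of one full attention pass, which is consistent with the roughly 3% overhead reported in the ablation.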
6. Results & Analysis
The experimental results demonstrate Sparse VideoGen (SVG)'s significant advantages in both efficiency (speedup) and quality preservation compared to baseline methods on state-of-the-art video generation models.
6.1. Core Results Analysis
SVG consistently outperforms all baseline methods across all tested models (CogVideoX-v1.5-I2V, CogVideoX-v1.5-T2V, HunyuanVideo-T2V) on the fidelity metrics (PSNR, SSIM, LPIPS), remains on par with the dense model on the VBench scores (ImageQual, SubConsist), and simultaneously achieves the highest end-to-end speedups.
- Superior fidelity: SVG achieves an average PSNR of roughly 29.5 on HunyuanVideo and 30.0 on CogVideoX-v1.5-T2V, indicating accurate reconstruction of fine details. SSIM and LPIPS scores confirm SVG's ability to maintain high perceptual quality, significantly outperforming the baselines. For instance, on CogVideoX-v1.5-T2V, SVG achieves 29.989 PSNR and 0.112 LPIPS, while MMInference yields 22.451 PSNR and 0.304 LPIPS (lower LPIPS is better, so SVG is superior).
- Maintenance of spatial and temporal consistency: SVG's key innovation, adaptively applying Spatial and Temporal sparse patterns, is crucial for its quality (a toy construction of the two mask layouts is sketched below). Other baselines, particularly MInference, struggle here: its mean-pooling block-sparse approach cannot capture the slash-wise temporal sparsity, leading to a substantial drop in PSNR and artifacts such as blurring and temporal inconsistencies (visible in Figure 1). PAB, a cache-based method, also noticeably hurts quality by skipping 3D Full Attention computations.
- Leading efficiency: SVG achieves the highest end-to-end speedups: 2.23x for CogVideoX-v1.5-I2V, 2.28x for CogVideoX-v1.5-T2V, and 2.33x for HunyuanVideo (with FP8 quantization). Its algorithmic and system-level co-design thus effectively translates sparsity into practical acceleration.
- FP8 quantization compatibility: SVG is compatible with FP8 quantization, which boosts the HunyuanVideo speedup from 1.92x to 2.33x with only a minor ~0.1 PSNR drop, highlighting SVG's extensibility and potential for further gains. FP8 was not applied to CogVideoX-v1.5 because its smaller head dimension (64) limits arithmetic intensity and thus the benefit of FP8 on GPU.

Figure 1 visually supports these claims: SVG's generated videos maintain sharpness and temporal coherence, in contrast with the blurring and inconsistencies seen in MInference's output.
Figure 1. SVG accelerates video generation while maintaining high quality. On CogVideoX-v1.5-I2V and Hunyuan-T2V, our method achieves a 2.28x and 2.33x speedup with high PSNR. In contrast, MInference (Jiang et al., 2024) fails to maintain pixel fidelity (significant blurring in the first example) and temporal coherence (inconsistencies in the tree trunk in the second example).
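To make the two sparsity layouts concrete, here is a toy construction (referenced in the consistency bullet above) of the block-diagonal mask a Spatial Head would use and the strided "slash" mask a Temporal Head would use. The helper name and the exact mask shapes are illustrative assumptions; SVG's real masks also retain extra frames and sink tokens.

```python
import torch

def spatial_and_temporal_masks(num_frames, tokens_per_frame):
    """Toy version of the two sparsity layouts: spatial heads keep a
    block-diagonal mask (attend within the same frame); temporal heads keep
    strided diagonals (attend to the same spatial position in every frame)."""
    seq = num_frames * tokens_per_frame
    idx = torch.arange(seq)
    frame_id = idx // tokens_per_frame       # frame each token belongs to
    offset   = idx %  tokens_per_frame       # position inside its frame

    spatial  = frame_id[:, None] == frame_id[None, :]   # block-diagonal pattern
    temporal = offset[:, None]   == offset[None, :]     # strided "slash" pattern
    return spatial, temporal

s, t = spatial_and_temporal_masks(num_frames=8, tokens_per_frame=16)
print(f"spatial density:  {s.float().mean().item():.3f}")   # 1 / num_frames
print(f"temporal density: {t.float().mean().item():.3f}")   # 1 / tokens_per_frame
```

Both patterns are highly sparse, but the temporal one is strided across the sequence, which is exactly the structure that mean-pooling block-sparse methods such as MInference average away.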
Figure 6 provides further visual comparisons of SVG's generation quality across different models and prompts, reinforcing the claim of high fidelity.

Figure 6. Comparison of video generation results using Sparse VideoGen, including examples from CogVideoX-v1.5 and HunyuanVideo. Different prompts correspond to different video frames, such as a blue boat navigating the ocean and a book engulfed in flames.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Type | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | ImageQual ↑ | SubConsist ↑ | FLOPS ↓ | Latency ↓ | Speedup ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| I2V | CogVideoX-v1.5 (720p, 10s, 80 frames) | - | - | - | 70.09% | 95.37% | 147.87 PFLOPs | 528s | 1x |
| | DiTFastAttn (Spatial-only) | 24.591 | 0.836 | 0.167 | 70.44% | 95.29% | 78.86 PFLOPs | 338s | 1.56x |
| | Temporal-only | 23.839 | 0.844 | 0.157 | 70.37% | 95.13% | 70.27 PFLOPs | 327s | 1.61x |
| | MInference | 22.489 | 0.743 | 0.264 | 58.85% | 87.38% | 84.89 PFLOPs | 357s | 1.48x |
| | PAB | 23.234 | 0.842 | 0.145 | 69.18% | 95.42% | 105.88 PFLOPs | 374s | 1.41x |
| | Ours | 28.165 | 0.915 | 0.104 | 70.41% | 95.29% | 74.57 PFLOPs | 237s | 2.23x |
| T2V | CogVideoX-v1.5 (720p, 10s, 80 frames) | - | - | - | 62.42% | 98.66% | 147.87 PFLOPs | 528s | 1x |
| | DiTFastAttn (Spatial-only) | 23.202 | 0.741 | 0.256 | 62.22% | 96.95% | 78.86 PFLOPs | 338s | 1.56x |
| | Temporal-only | 23.804 | 0.811 | 0.198 | 62.12% | 98.53% | 70.27 PFLOPs | 327s | 1.61x |
| | MMInference | 22.451 | 0.691 | 0.304 | 54.87% | 91.52% | 84.89 PFLOPs | 357s | 1.48x |
| | PAB | 22.486 | 0.740 | 0.234 | 57.32% | 98.76% | 105.88 PFLOPs | 374s | 1.41x |
| | Ours | 29.989 | 0.910 | 0.112 | 63.01% | 98.67% | 74.57 PFLOPs | 232s | 2.28x |
| T2V | HunyuanVideo (720p, 5.33s, 128 frames) | - | - | - | 66.11% | 93.69% | 612.37 PFLOPs | 2253s | 1x |
| | DiTFastAttn (Spatial-only) | 21.416 | 0.646 | 0.331 | 67.33% | 90.10% | 260.48 PFLOPs | 1238s | 1.82x |
| | Temporal-only | 25.851 | 0.857 | 0.175 | 62.12% | 98.53% | 259.10 PFLOPs | 1231s | 1.83x |
| | MInference | 23.157 | 0.823 | 0.163 | 63.96% | 91.12% | 293.87 PFLOPs | 1417s | 1.59x |
| | Ours | 29.546 | 0.907 | 0.127 | 65.90% | 93.51% | 259.79 PFLOPs | 1171s | 1.92x |
| | Ours + FP8 | 29.452 | 0.906 | 0.128 | 65.70% | 93.51% | 259.79 PFLOPs | 968s | 2.33x |
Analysis of Table 1:
- Overall dominance of SVG: across all three evaluation scenarios (CogVideoX-v1.5 I2V, CogVideoX-v1.5 T2V, HunyuanVideo T2V), "Ours" (SVG) achieves the highest PSNR and SSIM and the lowest LPIPS and Latency, resulting in the highest Speedup, while its ImageQual and SubConsist scores remain close to the original (1x) model's. This validates SVG's claim of superior quality preservation and efficiency.
- Quality degradation in baselines:
  - MInference (and MMInference): significantly lower PSNR, SSIM, and ImageQual and higher LPIPS compared to SVG. For example, on CogVideoX-v1.5 T2V, MMInference has a PSNR of 22.451 and LPIPS of 0.304, while SVG achieves 29.989 and 0.112. This confirms the paper's argument that LLM-centric sparse attention methods fail to capture video's unique spatiotemporal dependencies.
  - PAB: also exhibits lower PSNR and ImageQual scores, indicating that its caching strategy comes at a cost to video generation quality.
  - DiTFastAttn (Spatial-only) and Temporal-only: while these specialized baselines perform better than MInference on some metrics, they still fall significantly short of SVG's combined performance, underscoring the necessity of SVG's dynamic approach that leverages both spatial and temporal sparsity rather than only one.
- Efficiency gains:
  - The FLOPS reduction for SVG is substantial, bringing compute down to roughly 50% or less of the original CogVideoX-v1.5 and HunyuanVideo models, which directly translates into reduced Latency and increased Speedup.
  - SVG achieves a 2.28x speedup on CogVideoX-v1.5 T2V and 1.92x on HunyuanVideo T2V without FP8. With FP8, the HunyuanVideo T2V speedup increases to 2.33x with only a minor PSNR drop (29.546 to 29.452), demonstrating the benefit of combining SVG with other optimizations.
- Context-length impact: the "1x" baseline for HunyuanVideo has a latency of 2253s for 128 frames, far higher than CogVideoX-v1.5's 528s for 80 frames, illustrating the severe impact of longer context (more frames) on full-attention computation. SVG's ability to bring HunyuanVideo down to 968s (with FP8) is therefore particularly impactful.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Online Profiling Strategy Ratios (Table 3)
The paper conducts a sensitivity test on the profiling ratio (the percentage of tokens sampled for online profiling) to demonstrate the robustness and efficiency of SVG's online profiling strategy.
The following are the results from Table 3 of the original paper:
| Ratios | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- |
| CogVideoX-v1.5-I2V (720p, 10s, 80 frames) | | | |
| profiling 0.1% | 30.791 | 0.941 | 0.0799 |
| profiling 1% | 31.118 | 0.945 | 0.0757 |
| profiling 5% | 31.008 | 0.944 | 0.0764 |
| profiling 100% | 31.324 | 0.947 | 0.0744 |
Analysis:
- Even with a very small profiling ratio of 0.1%, SVG achieves a high PSNR of 30.791.
- Increasing the ratio to 1% yields a PSNR of 31.118, very close to the oracle 100% profiling PSNR of 31.324; the LPIPS at 1% (0.0757) is likewise close to the 100% value (0.0744).
- This demonstrates that SVG's online profiling strategy is both effective and efficient: sampling only 1% of tokens is sufficient to match the generation quality of full-attention-based classification, at a negligible ~3% runtime overhead. This validates the design choice of using a small sampling ratio for dynamic sparsity identification.
6.3.2. Generation Quality Over Different Sparsity Ratios (Table 4)
The paper also explores the impact of varying the sparsity ratios (controlled by the number of frames attended by Spatial Heads and the number of tokens attended by Temporal Heads) on generation quality, specifically LPIPS for HunyuanVideo. This analysis demonstrates the trade-off between efficiency and accuracy and SVG's robustness across different sparsity levels.
The following are the results from Table 4 of the original paper:
| Sparsity ↓ | 0.13 | 0.18 | 0.35 | 0.43 | 0.52 |
| --- | --- | --- | --- | --- | --- |
| LPIPS ↓ | 0.154 | 0.135 | 0.141 | 0.129 | 0.116 |
Analysis:
- Here the Sparsity row reports the fraction of attention that is actually computed, so a lower value means a more aggressive (sparser) configuration. As this value decreases, LPIPS (lower is better) tends to increase, indicating a mild quality drop at higher compression.
- Nevertheless, SVG maintains decent generation quality even at high sparsity: at a sparsity of 0.13 (only 13% of attention connections computed), LPIPS is 0.154, still a reasonable score, and at 0.52 (less compression) it improves to 0.116.
- This confirms that SVG offers a flexible trade-off between efficiency and accuracy: users can adjust the spatial-frame and temporal-token budgets to match their speed/quality requirements (see the Amdahl-style estimate sketched below). The authors note that adaptive sparsity control is left for future work.
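As a rough illustration of how the kept-attention density translates into end-to-end gains, an Amdahl-style estimate can be used. The assumed attention share of total runtime (0.7 below) is a placeholder for illustration, not a number reported in the paper.

```python
def expected_speedup(attention_fraction, density):
    """Amdahl-style estimate: only the attention share of runtime shrinks,
    and it shrinks proportionally to the kept attention density."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction * density)

# Densities taken from the sparsity row of Table 4; attention_fraction is assumed.
for density in (0.13, 0.18, 0.35, 0.43, 0.52):
    print(density, round(expected_speedup(attention_fraction=0.7, density=density), 2))
```

The real end-to-end numbers also depend on the non-attention kernels and memory behavior, which is why the paper pairs the sparsity algorithm with layout and kernel optimizations.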
6.3.3. Hardware-Efficient Layout Transformation (Figure 8)
An ablation study was conducted to evaluate the effectiveness of the proposed hardware-efficient layout transformation for the Temporal Head.

Figure 8. Latency comparison of different implementations of sparse attention. Our hardware-efficient layout transformation optimizes the sparsity pattern of the temporal head for better contiguity, making it 1.7x faster than naive sparse attention (named original) and approaching the theoretical speedup.
Analysis of Figure 8:
- The figure compares the latency of three sparse-attention implementations at varying sparsity levels: Theoretical, Ours (with layout transformation), and Original (without layout transformation).
- The Theoretical line represents the ideal speedup based purely on the FLOPs reduction from sparsity.
- The Original implementation (naive sparse attention without layout transformation) falls significantly short of the theoretical speedup, especially as sparsity increases (i.e., as the fraction of computed attention shrinks), because the Temporal Head's non-contiguous memory access pattern is hardware-inefficient.
- Our method, which incorporates the hardware-efficient layout transformation, dramatically closes this gap and closely approaches the theoretical curve. A sketch of the underlying permutation follows this list.
- Quantitative impact: at a sparsity level of 10% (only 10% of total attention computed), our method achieves an additional 1.7x speedup over the Original approach, for a total 3.63x improvement over dense attention. This demonstrates that the layout transformation is critical for translating theoretical sparse-attention gains into practical GPU acceleration for Temporal Heads.
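Conceptually, the layout transformation is a permutation from frame-major to token-major ordering, so that the keys a Temporal Head needs (the same spatial position across frames) become contiguous rows. The following PyTorch sketch shows that idea under our assumptions; the paper's actual implementation fuses the transform into custom kernels rather than materializing it as separate ops.

```python
import torch

def to_token_major(x, num_frames, tokens_per_frame):
    """Permute a frame-major sequence [frame0 tokens, frame1 tokens, ...] into
    token-major order, so the num_frames keys sharing one spatial position
    form a contiguous block that block-sparse GPU kernels can read efficiently.

    x: [seq, dim] with seq = num_frames * tokens_per_frame."""
    seq, dim = x.shape
    assert seq == num_frames * tokens_per_frame
    return (x.view(num_frames, tokens_per_frame, dim)
             .transpose(0, 1)              # [tokens_per_frame, num_frames, dim]
             .reshape(seq, dim))

def from_token_major(x, num_frames, tokens_per_frame):
    """Inverse permutation, applied to the attention output."""
    seq, dim = x.shape
    return (x.view(tokens_per_frame, num_frames, dim)
             .transpose(0, 1)
             .reshape(seq, dim))

# Round-trip sanity check on random data.
x = torch.randn(8 * 16, 64)
assert torch.equal(from_token_major(to_token_major(x, 8, 16), 8, 16), x)
```

After this permutation, the strided "slash" pattern of a Temporal Head becomes block-diagonal, which is exactly the shape Tensor-Core-friendly block-sparse attention expects.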
6.3.4. Kernel-level Efficiency Benchmark (Table 2)
The paper benchmarks the performance of customized CUDA kernels for QK-norm and RoPE against their PyTorch implementations, specifically for CogVideoX-v1.5 configurations.
The following are the results from Table 2 of the original paper:
| Frame Number | 8 | 9 | 10 | 11 |
| --- | --- | --- | --- | --- |
| QK-norm speedup | 7.44x | 7.45x | 7.46x | 7.47x |
| RoPE speedup | 14.50x | 15.23x | 15.93x | 16.47x |
Analysis:
- The customized QK-norm and RoPE kernels consistently achieve significant speedups over their PyTorch counterparts across frame counts from 8 to 11.
- QK-norm shows an average speedup of roughly 7.4x (7.44x to 7.47x), while RoPE improves even more dramatically, averaging about 15.5x (14.50x to 16.47x).
- These results highlight the effectiveness of low-level CUDA optimization, particularly sub-warp reduction implementations, for small-dimension operations that are otherwise bottlenecks in PyTorch. These kernel optimizations contribute meaningfully to SVG's overall end-to-end speedup. A plain PyTorch reference of what these kernels compute is sketched below.
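For reference, the following plain PyTorch sketch shows the kind of computation the custom kernels replace: an RMS-style QK-norm over the head dimension and a rotary position embedding. The exact normalization placement and RoPE pairing convention vary by model, so treat this as an illustrative baseline rather than the kernels' exact specification.

```python
import torch

def rms_norm(x, weight, eps=1e-6):
    """RMS normalization over the last (head) dimension, the usual form of QK-norm."""
    rms = x.pow(2).mean(dim=-1, keepdim=True).add(eps).rsqrt()
    return x * rms * weight

def apply_rope(x, positions, theta=10000.0):
    """Rotary position embedding on interleaved channel pairs.
    x: [..., seq, dim] with even dim; positions: [seq]."""
    dim = x.shape[-1]
    freqs = 1.0 / (theta ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
    angles = positions.float()[:, None] * freqs[None, :]          # [seq, dim/2]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = (x1 * cos - x2 * sin).to(x.dtype)
    out[..., 1::2] = (x1 * sin + x2 * cos).to(x.dtype)
    return out

# Example usage with illustrative shapes (heads, seq, head_dim).
q = torch.randn(24, 1024, 64)
q = apply_rope(rms_norm(q, torch.ones(64)), torch.arange(1024))
```

Both operations are elementwise or reduce over a dimension of only 64-128 channels, which is why launch overhead and sub-warp reductions, rather than raw FLOPs, dominate their cost in eager PyTorch.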
6.3.5. End-to-End Runtime Breakdown (Figure 7)
The paper provides a detailed breakdown of the end-to-end inference time for HunyuanVideo to illustrate how each component of SVG contributes to the overall speedup.

Figure 7. The chart shows the breakdown of end-to-end runtime of HunyuanVideo when generating a 5.3s, 720p video. SVG effectively reduces the inference time from 2253 seconds to 968 seconds through system-algorithm co-design, achieving an overall speedup of 2.33x.
Analysis of Figure 7:
- The baseline HunyuanVideo inference takes 2253 seconds.
- Applying sparse attention reduces the end-to-end time to 1811 seconds. Note that the "1.81x" improvement quoted in the paper refers to the speedup attributable to sparse attention alone (as if it were the only optimization applied), not to the end-to-end latency at this stage, which is 2253s / 1811s ≈ 1.24x.
- Adding the layout transformation further reduces the time to 1343 seconds.
- Kernel optimization (the customized QK-norm and RoPE kernels) brings it down to 1171 seconds.
- Finally, FP8 quantization achieves the lowest latency of 968 seconds, for a total end-to-end speedup of 2.33x (2253s / 968s).
- This breakdown shows that SVG's performance gains come from system-algorithm co-design: sparse attention, layout transformation, kernel customization, and quantization each contribute significantly, rather than a single dominant factor. The cumulative arithmetic is reproduced below.
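The stage-wise arithmetic can be reproduced directly from the latencies quoted above; the stage labels below paraphrase the figure.

```python
# Latencies in seconds, taken from the breakdown above.
stages = {
    "baseline (FlashAttention-2)": 2253,
    "+ sparse attention":          1811,
    "+ layout transformation":     1343,
    "+ kernel optimization":       1171,
    "+ FP8 quantization":           968,
}
for name, latency in stages.items():
    print(f"{name:<30} {latency:>5d}s  cumulative speedup {2253 / latency:.2f}x")
```

Running this confirms the cumulative speedups of roughly 1.24x, 1.68x, 1.92x, and 2.33x after each successive optimization.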
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents Sparse VideoGen (SVG), a novel, training-free framework designed to accelerate video Diffusion Transformers (DiTs) by intelligently exploiting inherent spatial-temporal sparsity patterns within their 3D Full Attention mechanisms. The core innovation lies in the discovery that attention heads can be dynamically classified into Spatial Heads (focusing on intra-frame relationships) and Temporal Heads (focusing on inter-frame relationships).
SVG's key contributions include an efficient online profiling strategy that accurately identifies these dynamic sparse patterns with minimal overhead, and a hardware-efficient inference system. The latter is crucial, incorporating a tensor layout transformation to convert non-contiguous temporal sparsity into a hardware-friendly format, alongside customized CUDA and Triton kernels for optimized operations.
Through rigorous evaluation on state-of-the-art video DiTs like CogVideoX-v1.5 and HunyuanVideo, SVG achieves impressive end-to-end speedups (up to 2.33x) while meticulously preserving the high visual quality of generated videos. Furthermore, its compatibility with FP8 quantization offers additional efficiency benefits. This work significantly advances the practicality of video generative models for real-world applications by alleviating their substantial computational burden.
7.2. Limitations & Future Work
The authors explicitly mention one area for future work:
- Adaptive sparsity control: in Section 5.4 of the paper, during the sensitivity test on sparsity ratios, the authors state, "We leave the adaptive sparsity control for future work." This implies that while SVG currently uses fixed per-head budgets (the number of attended frames for Spatial Heads and attended tokens for Temporal Heads), and thus fixed sparsity ratios, a more advanced system could dynamically adjust these ratios based on content, complexity, or user-defined quality/speed preferences during generation.

Implicitly, other limitations can be inferred, though they are not explicitly stated by the authors:

- Profiling overhead (though minimal): while the online profiling overhead is only about 3%, for extremely latency-sensitive applications or extremely large models, any overhead might still be a factor.
- Generality of sparse patterns: the identified Spatial and Temporal heads are powerful for current video DiTs. However, as model architectures evolve, or for more complex spatiotemporal tasks, there might be other nuanced or hybrid sparsity patterns that could be exploited.
- Dependence on hardware features: the hardware-efficient layout transformation explicitly targets GPU Tensor Cores and their contiguity requirements. While this is effective for current NVIDIA GPUs, future hardware architectures might necessitate different optimization strategies.
7.3. Personal Insights & Critique
This paper presents a highly practical and impactful contribution to the field of generative AI, particularly for video generation.
Personal Insights:
- Deep understanding of video data: the core strength of SVG lies in its deep understanding of video data's inherent redundancy. Moving beyond generic sparse-attention strategies developed for text and identifying spatial and temporal heads specifically for video is a crucial insight, highlighting the importance of domain-specific algorithmic design for specialized applications.
- Algorithmic-system co-design: SVG's success is not solely an algorithmic breakthrough but also a meticulous system-algorithm co-design. The online profiling identifies opportunities, but the hardware-efficient layout transformation and customized kernels are equally vital in translating theoretical gains into practical speedups on real hardware. This holistic approach is often the key to significant real-world performance improvements in deep learning systems.
- Training-free nature: the training-free aspect is a massive advantage. In an era where training large generative models is prohibitively expensive, an inference-time optimization that works off-the-shelf with pre-trained models immediately delivers value and accelerates research and deployment for many users.
- Extensibility: the demonstrated compatibility with FP8 quantization suggests that SVG is a foundational optimization layer that can be combined with other efficiency techniques, creating a powerful stack for even greater performance.
Critique & Areas for Improvement:
- Robustness across content diversity: while the 1% profiling ratio is shown to be effective on the tested datasets, it would be interesting to see whether this holds for extremely diverse, challenging, or out-of-distribution video content. Could certain pathological cases lead to misclassification of head types and, consequently, quality degradation? A deeper analysis of the statistical properties of head-classification errors would be valuable.
- Dynamic and adaptive sparsity: as noted by the authors, adaptive sparsity control is left for future work. Currently the spatial-frame and temporal-token budgets are fixed parameters. An intelligent system that could dynamically adjust sparsity per layer, per head, or even per token based on real-time content complexity or desired fidelity targets could yield even greater and more robust efficiency. For instance, a complex, fast-moving scene might require less sparsity (more computation) than a static background.
- Beyond spatial/temporal heads: while the two head types are powerful, are there other latent "types" of attention patterns in video? Object-centric attention, or attention to specific motion vectors, could potentially be exploited for further sparsity.
- Cross-model transferability of profiling: the 1% profiling ratio works well, but is the best_mask_config (i.e., which attention heads are spatial vs. temporal) itself transferable across models or across checkpoints of the same model? If so, pre-computing a "default" classification for certain model families could further reduce profiling overhead.

The methods and conclusions of SVG could potentially transfer to other domains where Transformers process structured 3D data, such as:

- 3D medical imaging: accelerating Transformer-based models for 3D medical image segmentation or generation, where spatial (within a slice) and depth/temporal (across slices) correlations are crucial.
- 3D point clouds/meshes: optimizing Transformers for dynamic 3D scenes or sequences of point clouds, where similar structural and temporal redundancies exist.
- Scientific simulations: accelerating Transformer-based models used in physical simulations (e.g., fluid dynamics, climate modeling) that operate on spatiotemporal grids.

In conclusion, Sparse VideoGen is an elegant and highly effective solution to a critical problem. Its strength lies in a nuanced understanding of video data, coupled with smart, hardware-aware engineering, making it a significant step towards democratizing high-quality video generation.