Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation
TL;DR Summary
This paper presents SVG2, a training-free framework that enhances critical token identification accuracy through semantic-aware permutation, reducing computation waste and addressing efficiency bottlenecks in sparse attention for video generation, achieving up to 2.30x acceleration.
Abstract
Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at \href{https://github.com/svg-project/Sparse-VideoGen}{https://github.com/svg-project/Sparse-VideoGen}.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is accelerating video generation using sparse attention mechanisms, specifically by employing semantic-aware permutation. The title is Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation.
1.2. Authors
The authors are Shuo Yang*, Haocheng Xi*, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Their affiliations include the University of California, Berkeley, MIT, and Stanford University. This diverse authorship from leading academic institutions suggests a strong research background in deep learning, efficient AI, and hardware acceleration.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server, as indicated by the Original Source Link and PDF Link. While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in computer science, physics, mathematics, and other fields. Papers often appear on arXiv before formal publication at conferences (e.g., ICML, NeurIPS) or journals, allowing for early sharing and feedback. Many of the authors have a strong publication record in top-tier machine learning conferences (e.g., ICML, NeurIPS, CVPR), suggesting that this work is likely intended for such a venue.
1.4. Publication Year
The publication timestamp indicates 2025-05-24T21:30:29.000Z, implying a publication year of 2025.
1.5. Abstract
The paper addresses the significant latency bottleneck in Diffusion Transformers (DiTs) for video generation, primarily caused by the quadratic complexity of their attention mechanisms. While sparse attention offers a promising solution by computing only critical tokens, existing methods fall short in generation quality for a given computation budget. The authors identify two main issues: (1) Inaccurate critical token identification due to position-based clustering, leading to imprecise aggregated representations. (2) Excessive computation waste because scattered critical tokens cause inefficient processing on GPUs optimized for contiguous memory access.
To overcome these, the paper proposes SVG2, a training-free framework designed to maximize identification accuracy and minimize computation waste, achieving a Pareto frontier trade-off between quality and efficiency. The core of SVG2 is semantic-aware permutation, which uses k-means clustering to group and reorder tokens based on semantic similarity. This ensures precise cluster representation for accurate identification and a densified layout of critical tokens for efficient computation without padding. SVG2 also incorporates top-p dynamic budget control and customized kernel implementations. It achieves up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. The code is open-sourced.
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2505.18875. This is a preprint on arXiv.
1.7. PDF Link
The PDF link is https://arxiv.org/pdf/2505.18875v3.pdf. This is the third version of the preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the critical problem of high computational latency in Diffusion Transformers (DiTs) when applied to video generation. DiTs have proven highly effective in generating high-quality images and videos, but their 3D spatio-temporal attention mechanisms introduce a quadratic computational complexity with respect to the sequence length. This means that as videos get longer or have higher resolution (more tokens), the computation time increases quadratically, making it prohibitively expensive for practical deployment. For instance, generating a short video using HunyuanVideo on an NVIDIA A100 GPU can take nearly an hour, with attention operations consuming over 80% of the runtime.
Previous research has noted that self-attention mechanisms are often sparse, meaning only a small fraction of computations significantly influence the final output. This observation led to sparse attention methods, which aim to reduce computational costs by processing only the most critical tokens. Current approaches typically involve an identification step where token activations are used to estimate attention scores, and tokens with the highest scores are selected. To minimize overhead, this identification is often performed at a block granularity, treating consecutive tokens as a single unit.
However, the authors identify two significant challenges with existing sparse attention methods that prevent them from achieving optimal generation quality under a given computational budget:
- Inaccurate critical token identification: Existing block-wise identification methods cluster tokens based on their position in the sequence rather than their semantic similarity. This can group semantically diverse tokens into the same block, leading to an imprecise aggregated representation (e.g., using mean or max pooling for a block). Such imprecise representations result in inaccurate estimations of attention scores and, consequently, incorrect identification of critical tokens.
- Excessive computation waste: Even if critical tokens could be perfectly identified, their scattered distribution within the tensor leads to computation waste on modern ML accelerators like GPUs. These accelerators are optimized for dense matrix multiplications with contiguous memory layouts. When critical tokens are scattered, they must be padded with non-critical tokens to fit the hardware's contiguous processing units (e.g., tensor cores requiring 16x16x8 shapes), wasting computational resources on non-essential data. The paper notes that up to 80% of computation can be wasted this way.

The paper's innovative idea is to leverage semantic-aware permutation to address these two challenges simultaneously, aiming to bridge the gap between existing sparse attention methods and the theoretical upper bound of an oracle policy.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Proposed SVG2 Framework: SVG2 is introduced as a novel, training-free framework for sparse attention specifically designed to accelerate DiT-based video generation. It aims to maximize the accuracy of critical token identification and minimize computation waste.
- Semantic-Aware Permutation: The core innovation is semantic-aware permutation, which utilizes k-means clustering to group and reorder Query, Key, and Value tokens based on their semantic similarity (derived from their activations). This approach serves a dual purpose:
  - Improved Identification Accuracy: By creating semantically coherent clusters, the aggregated representations (centroids) become more precise, leading to more accurate estimation of attention scores and thus better identification of critical tokens.
  - Minimized Computation Waste: The permutation reorders scattered critical tokens into compact, dense blocks. This densified layout allows GPUs to process only critical tokens efficiently without needing padding, thereby reducing computation waste.
- Centroid-Based Top-p Dynamic Budget Control: SVG2 integrates a mechanism to dynamically adjust the computational budget. It uses cluster centroids to approximate attention scores and then employs a Top-p selection strategy to select critical clusters until a predefined attention score target is met. This enables flexible trade-offs between quality and efficiency without manual tuning.
- Efficient System-Algorithm Co-designs:
  - Fast k-means with Centroid Cache: To mitigate the computational overhead of k-means clustering, SVG2 implements a centroid cache that reuses centroids from previous denoising steps, significantly accelerating the clustering process (up to 76x speedup).
  - Customized Attention Kernel for Dynamic Block Sizes: Recognizing that semantic-aware permutation naturally produces clusters of varying sizes, SVG2 introduces a customized attention kernel capable of handling dynamic block sizes. This kernel supports FlashAttention-2 (FA2) and FlashAttention-3 (FA3) backends, enabling efficient sparse loading and dense computation without padding, achieving over 85% of theoretical maximum performance.
- Achieving Pareto Frontier Trade-off: SVG2 consistently outperforms existing methods, achieving a Pareto frontier in the quality-efficiency trade-off curve. This means that for any given computational budget (density), SVG2 delivers superior generation quality, and for any target quality, it offers higher efficiency.
- Significant Speedup and Quality Retention: The framework demonstrates substantial practical benefits, achieving an end-to-end speedup of up to 2.30x and 1.89x on HunyuanVideo and Wan 2.1 respectively, while maintaining high visual quality with PSNR values up to 30 and 26.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Diffusion Transformers (DiTs)
Diffusion Transformers (DiTs) are a class of generative models that combine diffusion models with the Transformer architecture.
- Diffusion Models: These are generative models that learn to reverse a gradual noising process. They start with random noise and progressively denoise it over several steps to generate a coherent data sample (e.g., an image or video). They are known for generating high-quality and diverse outputs.
- Transformers: Originally developed for natural language processing, Transformers are neural network architectures primarily based on the self-attention mechanism. They excel at processing sequential data by allowing each element in a sequence to "attend" to all other elements, capturing long-range dependencies effectively.
- DiTs for Video Generation: In the context of video generation, DiTs extend this concept to 3D spatio-temporal data. This means they process tokens that represent pixels/patches across both spatial dimensions (width, height) and the temporal dimension (frames). The Transformer part, especially its self-attention modules, is responsible for modeling the intricate spatio-temporal relationships within the video latent space, guiding the denoising process to generate coherent video content.
3.1.2. Self-Attention Mechanism
The self-attention mechanism is the core component of Transformer models. It allows a model to weigh the importance of different parts of the input sequence when processing a specific element.
For an input sequence of tokens, each token is transformed into three different vectors:
- Query (Q): Represents what the current token is "looking for."
- Key (K): Represents what information the current token "offers."
- Value (V): Contains the actual information that is "offered" by the current token.

The attention score between a Query and all Keys determines how much Value to aggregate from each token. The standard formula for self-attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
Where:
- $Q$ is the matrix of Query vectors.
- $K$ is the matrix of Key vectors.
- $V$ is the matrix of Value vectors.
- $QK^T$ calculates the dot product similarity between each query and all keys.
- $\sqrt{d_k}$ is a scaling factor (where $d_k$ is the dimension of the key vectors) used to prevent the dot products from becoming too large, which can push the softmax function into regions with tiny gradients.
- softmax is an activation function that converts raw scores into probabilities, ensuring that the weights sum to 1.
- The result is a weighted sum of Value vectors, where the weights are determined by the softmax output. A minimal code sketch of this computation is given below.
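To make the formula concrete, the following minimal PyTorch sketch (illustrative only, not taken from the paper; shapes and sizes are arbitrary) computes scaled dot-product attention for a single head:

```python
import torch

def scaled_dot_product_attention(Q, K, V):
    """Dense attention for one head: Q is (n_q, d_k); K and V are (n_kv, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5  # (n_q, n_kv) raw attention scores
    probs = torch.softmax(scores, dim=-1)          # each row sums to 1
    return probs @ V                               # weighted sum of Value vectors

Q, K, V = (torch.randn(16, 64) for _ in range(3))  # 16 tokens, head dimension 64
out = scaled_dot_product_attention(Q, K, V)        # shape (16, 64)
```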
3.1.3. Quadratic Complexity of Attention
The term quadratic complexity refers to how the computational cost of the attention mechanism scales with the input sequence length. If $n$ is the sequence length (number of tokens) and $d$ the hidden dimension, computing the $QK^T$ matrix involves multiplying an $n \times d$ matrix ($Q$) by a $d \times n$ matrix ($K^T$). This operation has a computational complexity of $O(n^2 d)$. Since $n$ can be very large in video generation (e.g., thousands of tokens per frame across multiple frames), the $n^2$ term leads to a rapid increase in computation time and memory usage, making it a significant bottleneck.
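As a back-of-the-envelope illustration using the sequence length reported later in this analysis for HunyuanVideo (33 frames with 3600 tokens per frame):
$ n = 33 \times 3600 = 118{,}800 \quad \Rightarrow \quad n^2 \approx 1.4 \times 10^{10} \ \text{Query-Key score entries per attention head}, $
so doubling the number of frames would roughly quadruple the attention cost.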
3.1.4. Sparse Attention
Sparse attention is a technique designed to mitigate the quadratic complexity of traditional self-attention. Instead of computing attention scores for all possible Query-Key pairs, sparse attention selectively computes only a subset of these pairs, focusing on the most "important" or "critical" ones. This reduces the number of operations from $O(n^2)$ to roughly $O(s \cdot n^2)$, where $s \ll 1$ is a sparsity factor (the fraction of Query-Key pairs actually computed), leading to significant computational savings. The challenge lies in accurately identifying these critical tokens without sacrificing model performance.
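A toy sketch of the general idea (not the identification strategy SVG2 uses) is to let each query attend only to its top-k highest-scoring keys:

```python
import torch

def topk_sparse_attention(Q, K, V, k=32):
    """Each query attends only to its k highest-scoring keys. Toy illustration:
    a practical method must avoid materializing the full score matrix used here."""
    d = Q.shape[-1]
    scores = Q @ K.T / d ** 0.5                   # (n_q, n_kv) full scores (toy shortcut)
    top = scores.topk(k, dim=-1)                  # the k "critical" keys per query
    probs = torch.softmax(top.values, dim=-1)     # normalize over the selected keys only
    return torch.einsum("qk,qkd->qd", probs, V[top.indices])
```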
3.1.5. K-Means Clustering
k-means clustering is a popular unsupervised machine learning algorithm used to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid), which serves as a prototype of the cluster.
- Algorithm Steps:
  - Initialization: Choose $k$ initial centroids (randomly or using a smarter method like k-means++).
  - Assignment: Assign each data point (e.g., token vector) to the cluster whose centroid is closest (typically measured by Euclidean distance).
  - Update: Recalculate the centroids as the mean of all data points assigned to each cluster.
  - Iteration: Repeat the assignment and update steps until the cluster assignments no longer change or a maximum number of iterations is reached.
- Application in SVG2: In SVG2, k-means is applied to the Query and Key token vectors to group tokens that have similar semantic activations, as in the sketch below.
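A minimal sketch of this clustering on token activations, assuming scikit-learn and random stand-in data (the paper's implementation is a custom GPU k-means, not shown here):

```python
import numpy as np
from sklearn.cluster import KMeans

# Stand-in for one attention head's Key activations: (num_tokens, head_dim).
keys = np.random.randn(4096, 64).astype(np.float32)

kmeans = KMeans(n_clusters=8, n_init=1, max_iter=20, random_state=0).fit(keys)
labels = kmeans.labels_               # cluster id per token; defines the reordering
centroids = kmeans.cluster_centers_   # one aggregated representative per cluster
```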
3.1.6. Peak Signal-to-Noise Ratio (PSNR)
PSNR is a widely used metric to quantify the quality of reconstruction of an image or video, often used to measure the quality of lossy compression codecs or in this case, the fidelity of generated content compared to a reference. A higher PSNR generally indicates higher quality.
The PSNR is defined as:
$
\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)
$
Where:
- $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
- $\mathrm{MSE}$ is the Mean Squared Error between the original and the reconstructed image. The MSE is calculated as: $ \mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
  - I(i,j) is the pixel value at position (i,j) in the original image.
  - K(i,j) is the pixel value at position (i,j) in the reconstructed (generated) image.
  - $m$ and $n$ are the dimensions (height and width) of the image.

These formulas translate directly into code, as shown below.
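A direct NumPy translation (an illustrative helper, not code from the paper):

```python
import numpy as np

def psnr(reference: np.ndarray, generated: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between two frames of identical shape."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```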
3.1.7. Structural Similarity Index Measure (SSIM)
SSIM is a perceptual metric that measures the similarity between two images. Unlike PSNR which measures absolute error, SSIM is designed to model the human visual system's perception of quality. It considers three key aspects: luminance, contrast, and structure. A value closer to 1 indicates higher similarity.
The SSIM between two images $x$ and $y$ is defined as:
$
\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
$
Where:
- $\mu_x$ and $\mu_y$ are the average (mean) pixel values of images $x$ and $y$, respectively.
- $\sigma_x^2$ and $\sigma_y^2$ are the variances of $x$ and $y$, respectively.
- $\sigma_{xy}$ is the covariance of $x$ and $y$.
- $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$ are two constants included to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1, k_2$ are small constants (e.g., 0.01 and 0.03).
3.1.8. Learned Perceptual Image Patch Similarity (LPIPS)
LPIPS is a metric that assesses the perceptual similarity between two images using features extracted from a pre-trained deep neural network. Instead of comparing pixels directly, LPIPS computes the Euclidean distance between feature representations of image patches from a VGG or AlexNet model. A lower LPIPS score indicates higher perceptual similarity (i.e., the images look more alike to a human observer).
3.1.9. FLOPs (Floating Point Operations)
FLOPs refers to the number of floating-point operations performed by a model. It's a common metric to estimate the computational cost of a neural network. A lower FLOPs count indicates higher computational efficiency.
3.1.10. Speedup
Speedup is a metric used to quantify the performance improvement achieved by an optimization. It's typically calculated as the ratio of the original execution time to the optimized execution time, or, conversely, the ratio of the optimized computation rate to the original computation rate. For example, a 2x speedup means the task completes twice as fast.
3.1.11. CUDA Kernels & FlashAttention
- CUDA Kernels: CUDA is a parallel computing platform and programming model developed by NVIDIA for GPUs. CUDA kernels are functions that run on the GPU and are designed to execute in parallel across many threads, enabling highly efficient computation for tasks like matrix multiplication, which are common in deep learning.
- FlashAttention (FA2, FA3): FlashAttention is a highly optimized attention mechanism implementation designed for NVIDIA GPUs. It significantly reduces memory I/O (read/write operations to GPU memory), which is often a bottleneck, by using tiling and reordering to keep intermediate softmax computations in fast SRAM (on-chip memory). FA2 and FA3 are subsequent versions with further optimizations. They are crucial for accelerating Transformer models.
- FlashInfer: FlashInfer is an efficient and customizable attention engine built on FlashAttention principles, specifically designed for LLM inference serving. It provides optimized kernels for various attention patterns.
3.2. Previous Works
The paper categorizes related work into several areas, focusing on sparse attention for DiTs and LLMs, linear attention, and long video generation/caching.
3.2.1. Sparse Attention for Video DiTs
Previous work in sparse attention for DiTs falls into two main categories:
- Static Methods: These methods pre-define sparse patterns or identify critical tokens based on fixed rules (e.g., always attending to recent tokens). Examples include Sparse VideoGen (SVG) [4] and methods by Zhang et al. [8].
  - Limitation: They lack adaptability to diverse sparsity patterns, leading to suboptimal performance across different generation tasks or video content.
- Dynamic Methods: These methods determine sparse patterns at runtime, usually through an additional identification step where attention scores are estimated. Examples include SpargeAttention [9], XAttention [10], and others [17, 18, 19, 20, 21, 22].
  - Limitation: The paper argues that existing dynamic methods fail to achieve both high identification accuracy and low computation waste, primarily due to position-based clustering and scattered critical tokens.
  - SpargeAttention [9]: Groups consecutive tokens into blocks and uses mean pooling to create an aggregated representation for each block, then approximates attention scores at the block level.
  - XAttention [10]: Also employs block sparse attention, but with an antidiagonal scoring mechanism.
3.2.2. Sparse Attention for Large Language Models (LLMs)
This area also has two categories:
- Memory-Efficient Methods: These focus on reducing memory load to accelerate decoding, crucial for LLMs with very long contexts. Examples include Quest [5], H2O [6], Attention Sinks [7], and Duoattention [24].
  - Relevance to DiTs: While beneficial for LLMs, these are often less effective for compute-bound DiT-based video generation, where the primary bottleneck is computational power rather than memory capacity.
- Compute-Efficient Methods: These focus on processing only critical tokens to reduce computation. Examples include Minference [25], FlexPrefill [26], SeerAttention [27], Inflm [28], and LM-Infinite [29].
  - Relevance to DiTs: These methods often cannot directly optimize video DiTs due to the unique spatio-temporal sparse patterns of video data.
  - MMInference [12]: Notably, this work introduces a modality-aware permutation for multi-modal LLMs. However, it is rule-based and designed for inter-modality tokens, differing from SVG2's semantic-aware clustering for intra-modality (video) tokens.
  - Tactic [11] and Twilight [15]: These LLM-oriented methods inspire SVG2's Top-p critical token selection strategy. Tactic uses adaptive sparse attention with clustering and distribution fitting, while Twilight employs hierarchical Top-p pruning.
3.2.3. Linear Attention for Diffusion Models
This line of research replaces the quadratic complexity of standard attention with linear complexity, making it highly efficient for long-context problems. Examples include Transformers are RNNs [30], Gated Linear Attention Transformers [31], and Gated Delta Networks [32]. Some works combine linear attention or state space models (SSMs) like Mamba [33, 34] with Transformers for video generation:
- Matten [35] uses Mamba for global information and attention for local information.
- LinGen [36] uses Mamba2 and Swin attention.
- M4V [37] proposes an MM-DiM block.
- SANA [40, 41, 42] and DC-AE [43, 44, 45] use Linear Attention and Deep-Compressed Auto-Encoders.
- Relevance to SVG2: While these methods offer significant complexity reduction, SVG2 focuses on optimizing the existing quadratic attention where it is still used, often for local or specific types of interactions, rather than replacing it entirely with linear attention.
3.2.4. Long Video Generation and Caching-Based Acceleration
These methods address challenges in generating minute-level videos and optimizing efficiency through KV cache reuse. Examples include CausVid [46], Self-Forcing [47], LongLive [48], Framepack [49], RifleX [50], and [51].
- RadialAttention [52], VMOBA [53], Mixture-of-Context [54], and VSA [55] adopt sparse attention in long-context fine-tuning.
- Caching-based methods [56, 57, 58, 59, 60] utilize redundancy between timesteps and classifier-free guidance (CFG) for efficiency.
- Relevance to SVG2: The paper states that these methods are orthogonal to SVG2 and can be integrated for even higher speedups, indicating that SVG2 focuses on the attention mechanism itself, while these methods optimize the overall denoising process or KV caching.
3.3. Technological Evolution
The evolution of DiT-based video generation has moved from full, dense attention (which has quadratic complexity) to sparse attention techniques to mitigate computational bottlenecks.
- Dense Attention: Initial Transformer models used dense attention, calculating interactions between all Query-Key pairs. This delivered high quality but suffered from quadratic scaling, making it impractical for long video sequences.
- Static Sparse Attention: Early attempts to reduce computation involved pre-defined sparse patterns, like focusing on local windows or fixed patterns (e.g., Sparse VideoGen). While offering some speedup, these methods lacked adaptability.
- Dynamic Sparse Attention (Position-based): The next step involved dynamically identifying critical tokens at runtime. Methods like SpargeAttention and XAttention clustered tokens into blocks based on their position in the sequence (e.g., consecutive tokens) and then estimated attention scores for these blocks. This reduced identification overhead but introduced inaccuracy due to mixing semantically distinct tokens within a block. Furthermore, the scattered nature of truly critical tokens led to computation waste on GPUs.
- SVG2 (Semantic-Aware Dynamic Sparse Attention): SVG2 represents an advancement by addressing the core limitations of position-based dynamic sparse attention. It introduces semantic-aware clustering using k-means to group tokens by actual semantic similarity, thus improving identification accuracy. Crucially, it then permutes these semantically similar (and often critical) tokens into a contiguous layout to maximize hardware efficiency and minimize computation waste. This places SVG2 at the forefront of dynamically adaptable and hardware-efficient sparse attention for video generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, SVG2's core differences and innovations are:
- From Position-based to Semantic-aware Clustering:
  - Existing Methods (e.g., SpargeAttention): Cluster tokens based on their position (e.g., consecutive tokens in a block). This leads to inaccurate identification because position does not guarantee semantic similarity.
  - SVG2's Innovation: Employs k-means clustering on Query and Key activations to group tokens based on their semantic similarity. This ensures that tokens within a cluster share similar semantics, leading to more precise aggregated representations (centroids) and significantly improved critical token identification accuracy.
- From Scattered to Densified Critical Token Layout:
  - Existing Methods: Even with perfect identification, critical tokens remain scattered across the tensor. This causes computation waste because GPUs (especially tensor cores) require contiguous inputs and perform padding on sparse, scattered data.
  - SVG2's Innovation: Introduces semantic-aware permutation, which physically reorders tokens within each k-means cluster into a contiguous layout. This densifies the sparse computation, allowing GPUs to process only critical tokens without padding, thereby maximizing hardware utilization and minimizing computation waste.
- Dynamic Budget Control with Centroid-based Estimation:
  - Existing Methods: Often rely on pre-defined sparsity percentages or less accurate estimation methods.
  - SVG2's Innovation: Uses cluster centroids to accurately approximate attention scores and integrates a Top-p selection strategy to dynamically adjust the number of selected critical tokens to meet a target attention recall or quality requirement. This offers greater flexibility and adaptability.
- Integrated System-Algorithm Co-design:
  - Existing Methods: May focus on either algorithmic sparsity or hardware optimization separately, sometimes leading to mismatches.
  - SVG2's Innovation: Proactively addresses hardware-software co-design challenges. It develops a centroid cache for fast k-means (addressing k-means overhead) and, critically, a customized attention kernel that efficiently handles the dynamic block sizes naturally arising from semantic-aware clustering. This ensures that the algorithmic benefits of semantic-aware permutation are fully realized on modern GPU architectures.

In essence, SVG2 differentiates itself by providing a comprehensive solution that simultaneously tackles both the accuracy of critical token identification and the efficiency of their hardware-aware processing, moving beyond the limitations of prior heuristic or position-based sparse attention methods.
4. Methodology
4.1. Principles
The core idea behind SVG2 is to overcome the limitations of existing sparse attention methods in DiT-based video generation by leveraging semantic similarity for both accurate identification of critical tokens and efficient hardware processing. The theoretical basis and intuition are rooted in the observation that attention in DiTs is inherently sparse and that semantically similar tokens are more likely to attend to each other. By grouping tokens based on their semantics rather than their arbitrary position, SVG2 aims to:
- Improve Identification Accuracy: Create more representative aggregated features for critical token selection. When tokens within a cluster are semantically similar, their centroid (or mean/max pooling) can more accurately represent their collective attention scores, leading to better choices of which tokens are truly important.
- Minimize Computation Waste: Facilitate efficient sparse computation on GPUs. If semantically similar tokens are grouped together, and these groups also tend to be critical, then reordering them contiguously allows ML accelerators to process these dense blocks without wasting computation on padding or scattered memory access. This aligns the sparse computation pattern with GPU architectural strengths.
4.2. Core Methodology In-depth (Layer by Layer)
The SVG2 framework operates training-free, meaning it does not require additional training of the DiT model itself. It integrates three key techniques: semantic-aware permutation with k-means clustering, centroid-based Top-p selection, and efficient system-algorithm co-designs. The overall workflow of SVG2 is visualized in Figure 5.
The process begins in a Diffusion Transformer (DiT) layer, where input activations with hidden dimension $d$ are transformed into Query (Q), Key (K), and Value (V) tensors. These tensors are fundamental for the self-attention operation.
The standard self-attention operation in DiTs is defined as:
$ P = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right), \qquad O = PV $
Where:
- $Q \in \mathbb{R}^{L_Q \times d}$ is the Query matrix, where $L_Q$ is the number of query tokens.
- $K \in \mathbb{R}^{L_K \times d}$ is the Key matrix, where $L_K$ is the number of key tokens.
- $V \in \mathbb{R}^{L_K \times d}$ is the Value matrix. Note that Key and Value often share the same number of tokens, $L_K$, as they come from the same source sequence in self-attention.
- $d$ is the hidden dimension (or the dimension of Key vectors, $d_k$, used for scaling).
- $QK^T$ computes the raw attention scores.
- $\mathrm{softmax}$ normalizes these scores along the last dimension to produce the attention probability matrix $P$. This matrix captures the relationships between Query tokens and Key tokens.
- $O$ is the final output matrix, representing the weighted sum of Value vectors.

The problem, as highlighted by the authors, is that computing $P$ has a quadratic complexity relative to the sequence length, making it a bottleneck. SVG2 addresses this by modifying how critical tokens are identified and processed.
4.2.1. Semantic-Aware Permutation with k-means Clustering
As discussed in Section 3.2, existing sparse attention methods suffer from inaccurate identification due to position-based clustering. To counter this, SVG2 introduces semantic-aware permutation by performing k-means clustering on the activations of the input tokens.
Clustering Step:
For each attention head and Transformer layer, k-means is applied independently to the Query tokens and the Key tokens.
- The Query tokens, $Q$, are clustered into $C_Q$ query clusters: $\{Q_1, \ldots, Q_{C_Q}\}$.
- The Key tokens, $K$, are clustered into $C_K$ key clusters: $\{K_1, \ldots, K_{C_K}\}$.

This approach ensures that tokens within each resulting cluster share similar semantics. This is crucial because it leads to more precise centroid representations, which are then used for more accurate critical token identification.
Permutation Step:
While k-means logically groups semantically similar tokens, these tokens are still physically scattered in the original $Q$, $K$, and $V$ tensors. This scattered layout is inefficient for ML accelerators. To address this, SVG2 performs a semantic-aware permutation based on the k-means clustering, reordering tokens so that each cluster occupies a contiguous range.

Let $P_Q$ be the permutation matrix for Query tokens and $P_K$ be the permutation matrix for Key and Value tokens (since $K$ and $V$ must be permuted consistently). These matrices satisfy $P_Q^T P_Q = I$ and $P_K^T P_K = I$, where $I$ is the identity matrix.

The permuted Query, Key, and Value tensors are then:
- $Q' = P_Q Q$
- $K' = P_K K$
- $V' = P_K V$

The permuted attention output is mathematically equivalent to the original attention output:
$ O = P_Q^T \, \mathrm{softmax}\left(\frac{Q'K'^T}{\sqrt{d}}\right) V' = P_Q^T P_Q \, \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) P_K^T P_K V = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d}}\right) V $
Here:
- The first expression computes attention using the permuted $Q'$, $K'$, $V'$ and then undoes the Query permutation with $P_Q^T$.
- The second step demonstrates the equivalence: since $P_Q^T P_Q = I$ and $P_K^T P_K = I$, the permutation matrices effectively cancel out, recovering the original attention calculation and output $O$. The permutation therefore does not change the mathematical result of the attention operation, only its computational layout.

The benefit of this cluster-wise contiguous layout is that it can be efficiently computed by the underlying ML accelerators, drastically reducing computation waste. A numeric sanity check of this equivalence is sketched below.
4.2.2. Centroid-Based Top-p Selection
After semantic-aware permutation creates semantic-coherent clusters, SVG2 needs to (1) efficiently estimate the criticality of these clusters and (2) dynamically determine how many critical clusters (and thus critical tokens) to select to meet desired accuracy.
Accurate and Efficient Estimation of Criticality:
SVG2 estimates the criticality of each cluster using a centroid-based estimation of attention scores. Instead of calculating the full score matrix $QK^T$, it uses the centroids of the Query and Key clusters to approximate these scores.

The raw pre-softmax score between Query cluster $i$ and Key cluster $j$ is estimated as:
$ s_{ij} = \frac{c_i^Q \cdot c_j^K}{\sqrt{d_k}} $
Where:
- $c_i^Q$ is the centroid vector of the $i$-th Query cluster.
- $c_j^K$ is the centroid vector of the $j$-th Key cluster.
- $d_k$ is the dimension of the Key vectors (used for scaling).

These pre-softmax scores are then used to calculate an approximate attention score $P'_{ij}$ for each Query cluster $i$ attending to Key cluster $j$. This approximation also accounts for the size of the Key cluster to better reflect its overall contribution:
$ P'_{ij} = \frac{|K_j| \exp(s_{ij})}{\sum_{j'=1}^{C_K} |K_{j'}| \exp(s_{ij'})} $
Where:
- $|K_j|$ is the number of tokens in Key cluster $j$.
- $\exp(\cdot)$ applies the exponential function to the raw score.
- The denominator sums over all Key clusters (from $j' = 1$ to $C_K$) to normalize the scores, similar to softmax.

Since tokens within the same cluster share similar semantics, their centroids provide highly accurate representations of the actual activations, making this estimation reliable. The computational overhead of this cluster-level approximation is negligible (less than 1% of full attention), as the number of clusters is typically small (e.g., less than 1024).
Dynamic Adjustment of Computation Budget (Top-p Selection):
To dynamically control the number of critical tokens, SVG2 employs a Top-p selection strategy.
- All potential Key clusters for a given Query cluster are sorted in descending order based on their approximated attention scores $P'_{ij}$.
- Key clusters are then sequentially selected from this sorted list until the cumulative sum of $P'_{ij}$ reaches a predefined target value $p$. This allows for dynamic allocation of the computational budget based on the desired level of attention recall or generation quality, without requiring manual adjustment of a fixed number of tokens. A sketch of this selection step is given below.
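The following minimal PyTorch sketch implements the centroid-based estimation and Top-p selection described above (the function name and shapes are hypothetical; this is not the paper's kernel code):

```python
import torch

def select_critical_clusters(q_centroids, k_centroids, k_sizes, p=0.9):
    """q_centroids: (Cq, d), k_centroids: (Ck, d), k_sizes: (Ck,) token counts per Key cluster.
    Returns a (Cq, Ck) boolean mask of the selected Key clusters per Query cluster."""
    d = q_centroids.shape[-1]
    s = q_centroids @ k_centroids.T / d ** 0.5            # centroid-level pre-softmax scores
    approx = k_sizes * torch.exp(s)                       # weight each Key cluster by its size
    approx = approx / approx.sum(dim=-1, keepdim=True)    # approximate attention scores P'_ij
    vals, idx = approx.sort(dim=-1, descending=True)
    keep_sorted = vals.cumsum(dim=-1) - vals < p          # keep clusters until the cumulative sum reaches p
    mask = torch.zeros_like(approx, dtype=torch.bool)
    return mask.scatter(1, idx, keep_sorted)
```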
4.2.3. Efficient System-Algorithm Co-design
To make SVG2 practical and efficient, the authors introduce several system-algorithm co-designs.
Fast k-means with Centroid Cache:
k-means clustering can be computationally intensive due to its iterative nature, potentially taking many iterations to converge. This can add significant latency, especially if performed for every attention head and layer at each denoising step.
- Observation: DiTs tend to exhibit similar latent space activations between consecutive denoising steps [59, 64]. This implies that the cluster centroids from one step are good initial guesses for the next.
- Solution: SVG2 implements a centroid cache. This cache stores the centroids computed in the previous denoising step and reuses them as initialization points for k-means in the current step. This "warm-starts" the k-means algorithm, drastically reducing the number of iterations required for convergence. The paper reports this technique can reduce k-means runtime by up to 76x. A toy sketch of the caching pattern follows.
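A toy sketch of the warm-start pattern, assuming a simple Lloyd-iteration k-means (the actual implementation is a fused GPU routine; the cache key and iteration counts here are illustrative assumptions):

```python
import torch

def lloyd_kmeans(x, centroids, iters):
    """A few Lloyd iterations; a warm start from cached centroids needs far fewer."""
    for _ in range(iters):
        assign = torch.cdist(x, centroids).argmin(dim=-1)   # assignment step
        for c in range(centroids.shape[0]):
            members = x[assign == c]
            if len(members):
                centroids[c] = members.mean(dim=0)           # update step
    return centroids, assign

centroid_cache = {}  # (layer, head) -> centroids reused across denoising steps

def cluster_tokens(x, layer, head, n_clusters=100):
    warm = (layer, head) in centroid_cache
    init = centroid_cache[(layer, head)] if warm else x[:n_clusters].clone()
    centroids, assign = lloyd_kmeans(x, init, iters=2 if warm else 20)
    centroid_cache[(layer, head)] = centroids
    return centroids, assign
```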
Efficient Sparse Attention Kernel for Varied Block-Sizes:
Standard efficient attention implementations (e.g., FlashAttention [65], FlexAttention [66], FlashInfer [16]) are optimized for block-wise sparse computation but typically assume static, fixed block sizes.
- Challenge: The semantic-aware permutation naturally generates clusters (blocks) of dynamic and diverse sizes. For example, a Query cluster might have 128 tokens, while a Key cluster it attends to might only have 32 tokens. If a fixed block size were used, the computation would need padding to fit, leading to 75% computation waste in this example.
- Solution: SVG2 implements a customized attention kernel that explicitly supports dynamic block sizes as input.
  - Hardware Compatibility: The kernel is designed to be compatible with both FlashAttention-2 (FA2) (for A100 GPUs) and FlashAttention-3 (FA3) (for H100 GPUs).
  - Sparse Loading and Dense Computation:
    - Query tokens are loaded contiguously from memory, as they are already grouped after permutation.
    - Key/Value tokens can still be scattered in global memory (relative to other clusters), so the kernel uses per-token address offsets to perform sparse loading. These sparsely loaded Key/Value tokens are then stored in shared memory in a contiguous layout.
  - MMA Instructions: This contiguous layout in shared memory allows the GPU to efficiently utilize Matrix Multiply Accumulate (MMA) instructions (e.g., wgmma (m64n64k16) for FA3) without the need for expensive padding. This design achieves over 85% of the theoretical maximum performance, providing significant efficiency gains.

Figure 5 shows the overall process: (a) The original attention map, with different colors representing different semantics; only tokens with similar semantics have high attention scores. (b) After k-means clustering, semantically similar tokens are grouped, and Query and Key centroids are used to represent cluster-level semantics; these centroids are then used to estimate attention scores for accurate critical token identification. (c) Combined with Top-p selection, critical tokens are dynamically identified in a contiguous layout due to the semantic-aware permutation, enabling efficient computation.
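A host-side PyTorch analogue of the "sparse load, dense compute" idea (purely illustrative; the real implementation is a CUDA kernel operating on shared memory):

```python
import torch

def attend_selected(Q_cluster, K, V, selected_token_idx):
    """Gather the scattered Key/Value tokens of the selected clusters into contiguous
    buffers (the analogue of per-token-offset sparse loading), then run one dense
    attention computation over those buffers."""
    K_ctg = K.index_select(0, selected_token_idx)   # scattered tokens -> contiguous layout
    V_ctg = V.index_select(0, selected_token_idx)
    scores = Q_cluster @ K_ctg.T / Q_cluster.shape[-1] ** 0.5
    return torch.softmax(scores, dim=-1) @ V_ctg
```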
5. Experimental Setup
5.1. Datasets
The experiments evaluate SVG2 on state-of-the-art video generation DiT models and benchmark datasets.
- Models for Video Generation:
  - Wan2.1-I2V/T2V-14B [2]: A large-scale video generative model by Ang Wang et al. (2025). The paper specifies 720p resolution for generated videos. When tokenized by the 3D-VAE (a 3D Variational Autoencoder used to encode video frames into a lower-dimensional latent space), Wan2.1 generates 21 frames with 3600 tokens per frame.
  - HunyuanVideo-T2V-13B [1]: A systematic framework for large video generative models by Weijie Kong et al. (2025). It also generates 720p resolution videos and processes 33 frames with 3600 tokens per frame.
- Benchmark Datasets/Prompts:
  - Text-to-Video (T2V) Generation: For T2V tasks, the authors adopt prompts from the Penguin Benchmark after the prompt optimization provided by the VBench team. A prompt is a textual description used to guide the video generation process.
  - Image-to-Video (I2V) Generation: For I2V tasks, the prompt-image pairs provided by VBench [67] are used. Images are cropped to a 16:9 ratio to match the 720p resolution target.
  - VBench [67]: A comprehensive benchmark suite for video generative models, used to evaluate various aspects of video quality.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to assess both the quality of generated videos and the efficiency of the sparse attention mechanisms.
5.2.1. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a quality metric that quantifies the difference between a generated video frame and a reference frame (e.g., from the original dense attention generation). It is calculated based on the Mean Squared Error (MSE) between the pixel values of the two images. Higher PSNR values indicate that the generated image is closer to the reference image, implying higher fidelity and less degradation.
- Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
- Symbol Explanation:
  - $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image, or for each color channel).
  - $\mathrm{MSE}$: The Mean Squared Error between the two images. For two images $I$ and $K$ of size $m \times n$: $ \mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
    - I(i,j): The pixel value at coordinates (i,j) in the original (reference) image.
    - K(i,j): The pixel value at coordinates (i,j) in the generated image.
    - m, n: The height and width of the images in pixels.
5.2.2. Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is a perceptual metric designed to assess the perceived quality of an image by modeling how the human visual system works. Unlike PSNR, which focuses on absolute errors, SSIM considers aspects like luminance, contrast, and structure. A value of 1 indicates perfect structural similarity, while values closer to 0 indicate less similarity.
- Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
  - x, y: The two image patches being compared.
  - $\mu_x$, $\mu_y$: The average (mean) pixel values of $x$ and $y$, respectively.
  - $\sigma_x^2$, $\sigma_y^2$: The variances of $x$ and $y$, respectively.
  - $\sigma_{xy}$: The covariance of $x$ and $y$.
  - $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$: Small constants that prevent division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1, k_2$ are small constants (e.g., 0.01 and 0.03).
5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)
- Conceptual Definition: LPIPS measures the perceptual distance between two images, aiming to align with human judgments of similarity. It works by extracting features from a pre-trained deep neural network (e.g., VGG or AlexNet) and then calculating the L2 distance between these feature representations. A lower LPIPS score indicates higher perceptual similarity.
5.2.4. VBench
- Conceptual Definition: VBench is a comprehensive benchmark suite specifically designed to evaluate various attributes of video generative models. It includes metrics for subject consistency (SubConsis), background consistency (BackConsis), motion smoothness (MotionSmooth), aesthetic quality (AesQual), and image quality (ImagQual). The Average score provides an overall quality assessment. Higher scores indicate better performance across these attributes.
5.2.5. Density
- Conceptual Definition: Density quantifies the computational budget used by a sparse attention mechanism. It is defined as the ratio of the sparse attention computation (e.g., FLOPs or number of Query-Key pairs computed) to the full attention computation. A lower density indicates greater efficiency in terms of the computational resources spent on attention.
5.2.6. FLOPs (Floating Point Operations)
- Conceptual Definition: FLOPs measure the total number of arithmetic operations (e.g., additions, multiplications) performed during video generation. In this context, it is used to quantify the overall computational cost. A lower FLOPs count means the generation process is more computationally efficient. The unit PFLOPs stands for PetaFLOPs, i.e., 10^15 floating-point operations.
5.2.7. Speedup
- Conceptual Definition: Speedup measures how much faster a task runs with the optimized method (SVG2) compared to the baseline (dense attention). It is calculated as the ratio of the baseline's execution time to SVG2's execution time. A 2x speedup means the process is twice as fast. The paper reports both end-to-end speedup (total time for video generation) and attention speedup (time saved specifically in the attention module).
5.3. Baselines
The proposed SVG2 method is compared against several state-of-the-art sparse attention algorithms:
- Static Method:
  - Sparse VideoGen (SVG) [4]: A prior work from some of the same authors, which accelerates video Diffusion Transformers using spatial-temporal sparsity based on static patterns.
- Dynamic Methods:
  - SpargeAttention [9]: A dynamic sparse attention method that focuses on accurate sparse attention for accelerating any model inference, typically using block-level approximation.
  - XAttention [10]: Another dynamic block sparse attention method, which employs antidiagonal scoring for efficient visual generation models. The paper notes that XAttention was not evaluated on Wan2.1 due to lack of support.

These baselines represent different strategies for implementing sparse attention, ranging from static patterns to dynamic, block-wise approaches, providing a comprehensive comparison for SVG2.
5.4. Implementations
- Framework: SVG2 is prototyped as an end-to-end framework.
- Customized Kernels: The customized kernels are built using FlashInfer [16], an efficient and customizable attention engine.
- Hardware: Benchmarking is conducted on an NVIDIA H100 GPU.
- Software: CUDA 12.8.
- Configuration:
  - For SVG2, the number of query clusters ($C_Q$) is set to 100 and key clusters ($C_K$) to 500. The rationale for this choice is discussed in the ablation study (Section D.2 of the appendix).
  - Experiments are conducted with sparse attention skipped during the first 30% of denoising steps for all methods. This warmup strategy is common in diffusion models [64, 68, 56, 59], as the initial steps are critical for overall generation quality. Results without warmup are provided in the appendix.
  - Various accuracy targets (i.e., attention score recall) are used to evaluate the trade-off between generation quality and efficiency. A single data point for detailed comparison is presented in Table 1.
6. Results & Analysis
6.1. Core Results Analysis
The experiments aim to validate SVG2's effectiveness in accelerating video generation while maintaining high quality, comparing it against state-of-the-art baselines.
6.1.1. Qualitative Evaluation
Figure 6 visually demonstrates the effectiveness of SVG2's semantic-aware permutation.
The following figure (Figure 6 from the original paper) shows the visualization of attention maps:
![Figure 6: Visualization of attention maps from different attention heads in Wan 2.1 when generating videos from VBench [67]. (a) Original attention maps with diverse sparse patterns…](/files/papers/691855af110b75dcc59ae14e/images/6.jpg)
This figure shows attention maps from different attention heads during video generation: (a) the original attention maps exhibit diverse sparse patterns, (b) the attention maps after semantic-aware permutation, and (c) the attention maps recovered after applying centroid-based Top-p selection. These results demonstrate the effectiveness of SVG2.
- (a) Original Attention Maps: These show diverse sparse patterns across different attention heads in Wan 2.1 during video generation. Critical tokens (highlighted in red) are often scattered.
- (b) Permuted Attention Maps: After semantic-aware permutation based on k-means clustering, the critical tokens are reorganized into a contiguous layout. This transformation is crucial for efficient block-wise computation on GPUs without computation waste.
- (c) Recovered Attention Maps: By applying centroid-based Top-p selection to the permuted map and then undoing the permutation, the attention map is recovered in its original layout. The high similarity between the recovered and original attention maps visually confirms SVG2's ability to accurately identify and process critical tokens while preserving the intended attention patterns.

Further qualitative comparison in the appendix (Figure 9 and Figure 10) shows that SVG2 generates videos with high pixel-level fidelity, closely matching the quality of dense attention for both HunyuanVideo and Wan 2.1. The following figure (Figure 9 from the original paper) shows the comparison of Dense Attention and SVG2 on HunyuanVideo and Wan 2.1 Text-to-Video generation:

This figure compares Dense Attention and SVG2 on HunyuanVideo and Wan 2.1 text-to-video generation. In each comparison, the left side shows the Dense Attention result and the right side shows SVG2, illustrating the differences in generation quality and efficiency.
The following figure (Figure 10 from the original paper) shows the comparison of Dense Attention and SVG2 on Wan 2.1 Image-to-Video generation:

This figure compares dense attention and SVG2 on the Wan 2.1 image-to-video generation task. Each column shows frames generated with dense attention and with SVG2, illustrating differences in generation quality.
6.1.2. Quantitative Evaluation of Quality and Efficiency
The quantitative results, presented in Table 1, compare SVG2 against baselines (SpargeAttn, SVG, XAttention) across Wan 2.1 and HunyuanVideo models. The table includes metrics for generation quality (PSNR, SSIM, LPIPS, VBench) and efficiency (Density, FLOPs, Speedup), with a 30% warmup setting.
The following are the results from Table 1 of the original paper:
| Model (Config) | Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VBench ↑ | Density ↓ | FLOPs ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|
| Wan 2.1 (14B, 720P, Image-to-Video) | Dense | - | - | - | 0.841 | 100% | 526.76 PFLOPs | 1× |
| | SpargeAttn | 21.181 | 0.665 | 0.333 | - | 38.99% | 366.80 PFLOPs | 1.47× |
| | SVG | 24.059 | 0.813 | 0.174 | 0.836 | 30.25% | 343.88 PFLOPs | 1.56× |
| | Ours | 26.562 | 0.861 | 0.138 | 0.838 | 31.28% | 346.59 PFLOPs | 1.58× |
| | Ours-Turbo | 24.510 | 0.812 | 0.179 | 0.836 | 14.13% | 301.62 PFLOPs | 1.84× |
| Wan 2.1 (14B, 720P, Text-to-Video) | Dense | - | - | - | 0.846 | 100% | 658.46 PFLOPs | 1× |
| | SpargeAttn | 20.519 | 0.623 | 0.343 | 0.820 | 42.03% | 468.46 PFLOPs | 1.44× |
| | SVG | 22.989 | 0.785 | 0.199 | 0.837 | 30.25% | 429.86 PFLOPs | 1.58× |
| | Ours | 25.808 | 0.854 | 0.138 | 0.842 | 29.51% | 427.43 PFLOPs | 1.60× |
| | Ours-Turbo | 23.682 | 0.789 | 0.196 | 0.838 | 12.87% | 372.89 PFLOPs | 1.89× |
| HunyuanVideo (13B, 720P, Text-to-Video) | Dense | - | - | - | 0.850 | 100% | 612.38 PFLOPs | 1× |
| | SpargeAttn | 27.892 | 0.884 | 0.151 | - | 42.62% | 399.16 PFLOPs | 1.53× |
| | XAttention | 28.892 | 0.898 | 0.120 | 0.839 | 39.32% | 386.90 PFLOPs | 1.56× |
| | SVG | 29.157 | 0.905 | 0.120 | 0.845 | 29.86% | 351.75 PFLOPs | 1.91× |
| | SVG + FP8 | 29.033 | 0.902 | 0.121 | 0.843 | 29.86% | 351.75 PFLOPs | 2.3× |
| | Ours | 30.452 | 0.910 | 0.117 | 0.852 | 25.45% | 335.36 PFLOPs | 2.30× |
| | Ours + FP8 | 30.389 | 0.908 | 0.118 | 0.851 | 25.45% | 335.36 PFLOPs | 2.55× |
Key observations:
- Superior Quality: SVG2 consistently achieves the highest quality scores across all metrics (PSNR, SSIM, LPIPS, VBench) for both Wan 2.1 and HunyuanVideo. For example, on HunyuanVideo, SVG2 achieves a PSNR of 30.452, SSIM of 0.910, and LPIPS of 0.117, outperforming all baselines.
- Highest Speedup: Despite its superior quality, SVG2 also achieves the highest end-to-end speedup. For HunyuanVideo, SVG2 provides a 2.30x speedup while using a density of 25.45%. When combined with FP8 (8-bit floating point precision), the speedup further increases to 2.55x.
- Pareto Frontier: The results demonstrate that SVG2 offers a superior trade-off, achieving better quality at comparable or lower density (i.e., less computation) than other methods. This positions SVG2 on the Pareto frontier of the quality-efficiency trade-off, meaning it provides the best possible quality for a given efficiency level.
- Ours-Turbo Configuration: The Ours-Turbo variants showcase SVG2's potential for even higher speedups by accepting a slight reduction in quality. For instance, Wan 2.1 I2V Ours-Turbo achieves a 1.84x speedup at a density of 14.13% while still maintaining a PSNR (24.510) comparable to or better than SVG (24.059) at a much lower density. This flexibility is valuable for different application scenarios.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Efficiency Evaluation for Fast k-means with Centroids Cache
To demonstrate the effectiveness of the centroids cache, the paper compares the runtime of k-means when varying the number of iterations required to achieve a 90% attention recall.
The following figure (Figure 7 from the original paper) shows the efficiency evaluation for fast k-means with centroids cache and customized attention kernel:

This figure shows the efficiency evaluation of fast k-means with the centroid cache and of the customized attention kernel. Panel (a) shows latency versus density across denoising steps, indicating that the centroid cache significantly reduces computation time. Panel (b) compares the dynamic and static approaches under different numbers of clusters, showing the efficiency gains of the proposed method.
- Figure 7(a) illustrates that enabling the centroid cache significantly reduces the end-to-end latency of k-means. With the cache enabled, k-means achieves comparable or even lower density (meaning better-quality identification) with a drastic reduction in execution time, showing a 76x speedup for k-means. This validates the importance of reusing centroids from previous denoising steps to mitigate the overhead of k-means clustering.
6.2.2. Efficiency Evaluation for Customized Attention Kernel
The efficiency of the customized attention kernel with dynamic block sizes is evaluated by comparing its computation FLOPs against FlashInfer [16] (a state-of-the-art attention library). This is done by varying combinations of Query clusters ($C_Q$) and Key clusters ($C_K$) while maintaining 90% attention recall.
- Figure 7(b) demonstrates that the customized kernels achieve an average 1.48x computation reduction. In the practical setup with $C_Q = 100$ and $C_K = 500$, SVG2 achieves a 1.88x reduction in computation waste. This highlights the advantage of SVG2's kernel in efficiently handling the dynamic block sizes resulting from semantic-aware permutation, avoiding the padding waste inherent in static-block kernels.

The following figure (Figure 11 from the original paper) shows the efficiency evaluation for the attention kernel when varying the number of key clusters:
This figure shows the throughput evaluation of the FA2 and FA3 backends under different block densities; the left panel shows FA2 and the right panel shows FA3. Comparing sparse and dense configurations shows how performance varies with density.
The following figure (Figure 12 from the original paper) shows the efficiency evaluation for our attention kernel, varying the number of query clusters:

This figure shows the throughput of the FA2 and FA3 backends under different block densities. FA2 throughput is around 340 TFLOPS, while FA3 reaches up to 550 TFLOPS. The figure marks the performance differences of the configurations at different row counts.
Figures 11 and 12 (in Appendix D.1) further detail the kernel performance. They show that kernel throughput drops sharply when $C_Q$ exceeds 200, suggesting that a balance between cluster granularity and hardware utilization must be maintained, whereas increasing $C_K$ has a less pronounced effect. This implies specific hardware constraints related to query processing and tensor core utilization.
6.2.3. Sensitivity Test on Quality-Efficiency Trade-off
To validate SVG2's ability to achieve a superior trade-off, a comprehensive evaluation on Wan2.1-I2V-14B is conducted across a wide range of computational budgets (density).
The following figure (Figure 2 from the original paper) shows the trade-off curves between generation quality (PSNR) and efficiency (density):

This figure shows the trade-off curves between generation quality (PSNR) and efficiency (density) for different methods. At the same density, SVG2 consistently surpasses existing methods and reaches the Pareto frontier, with a maximum PSNR close to 26 and up to a 2.3x density reduction.
- Figure 2 clearly shows that SVG2 consistently achieves better generation quality (higher PSNR) at any given density compared to the baseline methods. This positions SVG2 on the Pareto frontier of the quality-efficiency trade-off, indicating its optimal performance across various sparsity levels. Specifically, SVG2 can reduce density by up to 2.3x while maintaining the same PSNR as its counterparts.
6.2.4. Ablation Study on Semantic-Aware Permutation
Effectiveness on Improving Identification Accuracy:
The impact of semantic-aware permutation on identification accuracy is assessed by comparing attention recall (how many critical tokens are correctly identified) with and without permutation, keeping mean-pooling and cluster size consistent.
The following figure (Figure 8 from the original paper) shows attention recall across various densities:

This figure shows attention recall at different densities with k-means permutation enabled versus disabled. Red markers denote the configuration with k-means permutation enabled, and gray markers the configuration with it disabled. The results show that enabling the permutation generally yields higher attention recall.
- Figure 8 demonstrates that enabling semantic-aware permutation consistently achieves higher attention recall across various densities. This improvement is attributed to the formation of semantically coherent clusters, which provide more precise representations for identifying critical tokens, directly supporting the claim that semantic-aware clustering improves the accuracy of critical token identification.
Effectiveness on Reducing Computation Waste:
The impact on computation waste is evaluated by comparing the computational overhead with and without semantic-aware permutation, using the exact same set of critical tokens selected by centroid-based Top-p selection.
- The results show that enabling semantic-aware permutation reduces computational overhead by an average of 36%. This confirms that reordering scattered critical tokens into a contiguous layout significantly minimizes computation waste by better utilizing GPU hardware.
6.2.5. Ablation on the Number of Clusters
The paper investigates the impact of the number of Query clusters (C_q) and Key clusters (C_k) on both PSNR and end-to-end efficiency.
The following are the results from Table 5 of the original paper:
| C_q | C_k | PSNR | SSIM | LPIPS | Speedup |
|---|---|---|---|---|---|
| 100 | 250 | 25.497 | 0.801 | 0.182 | 1.90x |
| 100 | 1000 | 26.276 | 0.825 | 0.159 | 1.71x |
| 50 | 500 | 22.561 | 0.742 | 0.258 | 1.90x |
| 200 | 500 | 26.213 | 0.820 | 0.157 | 1.78x |
| 400 | 500 | 26.488 | 0.868 | 0.132 | 1.25x |
| 100 | 500 | 26.128 | 0.816 | 0.169 | 1.89x |
- Table 5 shows that setting C_q = 100 and C_k = 500 provides the best balance between generation quality and efficiency (1.89x speedup at PSNR 26.128).
- Increasing the number of clusters generally improves quality up to a point, but can degrade efficiency. This is because tensor cores on NVIDIA GPUs require minimum fixed input sizes (e.g., 64 for the m64n64k16 instruction shape). If the average cluster size falls below this threshold, the hardware is underutilized, reducing efficiency despite potential quality gains from finer-grained clustering. For example, C_q = 400 and C_k = 500 yields higher PSNR but a significantly lower speedup (1.25x) due to this underutilization; a quick arithmetic sketch of the effect follows below.
6.2.6. Ablation on Permutation
The authors investigate if Query and Key representations can share the same clustering strategy (i.e., using a single permutation matrix for both).
The following are the results from Table 6 of the original paper:
| Permutation used by Q | Permutation used by K | Density | PSNR |
|---|---|---|---|
| Q's own clustering | K's own clustering | 31.28% | 26.562 |
| Q's clustering (shared) | Q's clustering (shared) | 38.23% | 22.439 |
| K's clustering (shared) | K's clustering (shared) | 38.58% | 22.183 |
| Clustering of hidden states before QKV (shared) | Clustering of hidden states before QKV (shared) | 87.27% | 26.495 |
- Table 6 shows that using an independent Query permutation and Key permutation (31.28% density, 26.562 PSNR) yields superior performance compared to applying the same permutation to both Q and K (e.g., 38.23% density, 22.439 PSNR).
- Even clustering based on the hidden states before the QKV linear layer (a shared QK embedding) results in worse PSNR and much higher density (87.27%).
- The Adjusted Rand Index (ARI) between the Q clusters and K clusters is 0.345, indicating substantial differences in their permutation patterns. This suggests that Q and K capture distinct aspects of semantic relationships, so independent clustering for Q and K is necessary to preserve the full expressiveness of attention.
6.3. Performance Comparison in Warmup-free Setting
Appendix B (Table 2) provides results without the 30% warmup steps at the beginning of denoising.
The following are the results from Table 2 of the original paper:
| Model | Config / Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VBench ↑ | Density ↓ | FLOP ↓ | Attn Speedup ↑ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|---|
| Wan 2.1 | 14B, 720P, Image-to-Video | - | - | - | 0.841 | 100% | 526.76 PFLOPs | 1× | 1× |
| | SVG | 15.608 | 0.512 | 0.404 | 0.823 | 29.54% | 262.85 PFLOPs | 2.26× | 1.86× |
| | Ours | 18.276 | 0.615 | 0.317 | 0.832 | 29.34% | 262.10 PFLOPs | 2.95× | 2.10× |
| Wan 2.1 | 14B, 720P, Text-to-Video | - | - | - | 0.851 | 100% | 658.46 PFLOPs | 1× | 1× |
| | SVG | 13.294 | 0.407 | 0.512 | 0.849 | 29.54% | 328.56 PFLOPs | 2.28× | 1.89× |
| | Ours | 16.502 | 0.562 | 0.373 | 0.852 | 30.12% | 331.28 PFLOPs | 2.98× | 2.13× |
| Hunyuan | 13B, 720P, Text-to-Video | - | - | - | 0.820 | 100% | 612.38 PFLOPs | 1× | 1× |
| | SVG | 12.298 | 0.492 | 0.483 | 0.808 | 29.86% | 240.05 PFLOPs | 3.45× | 2.48× |
| | Ours | 19.879 | 0.735 | 0.260 | 0.816 | 28.94% | 235.16 PFLOPs | 4.06× | 2.69× |
- In the warmup-free setting, the absolute PSNR values of all sparse methods (measured against the dense-attention reference) are generally lower, indicating that the initial denoising steps are indeed crucial for quality (a minimal warmup-schedule sketch follows after this list).
- However, SVG2 consistently offers better quality (higher PSNR, SSIM, and VBench, lower LPIPS) than SVG at comparable densities. For example, on HunyuanVideo, SVG2 achieves 19.879 PSNR (vs SVG's 12.298) at 28.94% density (vs SVG's 29.86%), resulting in a 2.69x end-to-end speedup (vs SVG's 2.48x). This further reinforces SVG2's robustness and effectiveness even under more challenging conditions.
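The "warmup" being ablated here is simply running dense attention for the first fraction of denoising steps before switching to sparse attention. A minimal scheduling helper is sketched below; the 30% ratio matches the setting quoted in this section, while the 50-step count and the function name are illustrative assumptions.

```python
def use_dense_attention(step: int, total_steps: int, warmup_ratio: float = 0.30) -> bool:
    """Return True while we are still inside the dense-attention warmup window."""
    return step < int(total_steps * warmup_ratio)

total_steps = 50  # assumed number of denoising steps, for illustration only
schedule = ["dense" if use_dense_attention(s, total_steps) else "sparse"
            for s in range(total_steps)]
print(schedule[:20])  # first 15 steps dense, then sparse
```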
6.4. VBench Results
Appendix C (Table 3 and Table 4) provides the full VBench results for both warmup-free and 30% warmup settings.
The following are the results from Table 3 of the original paper:
| Model | Config / Method | SubConsis | BackConsis | MotionSmooth | AesQual | ImagQual | Average |
|---|---|---|---|---|---|---|---|
| Wan 2.1 | 14B, 720P, Image-to-Video | 0.946 | 0.956 | 0.979 | 0.618 | 0.709 | 0.841 |
| | SVG | 0.916 | 0.935 | 0.976 | 0.591 | 0.698 | 0.823 |
| | Ours | 0.936 | 0.946 | 0.977 | 0.597 | 0.700 | 0.832 |
| Wan 2.1 | 14B, 720P, Text-to-Video | 0.970 | 0.970 | 0.992 | 0.612 | 0.708 | 0.851 |
| | SVG | 0.963 | 0.969 | 0.991 | 0.612 | 0.708 | 0.849 |
| | Ours | 0.971 | 0.970 | 0.992 | 0.624 | 0.707 | 0.852 |
| Hunyuan | 13B, 720P, Text-to-Video | 0.888 | 0.938 | 0.994 | 0.594 | 0.685 | 0.820 |
| | SVG | 0.867 | 0.930 | 0.991 | 0.594 | 0.656 | 0.808 |
| | Ours | 0.888 | 0.935 | 0.994 | 0.589 | 0.675 | 0.816 |
The following are the results from Table 4 of the original paper:
| Model | Config / Method | SubConsis | BackConsis | MotionSmooth | AesQual | ImagQual | Average |
|---|---|---|---|---|---|---|---|
| Wan 2.1 | 14B, 720P, Image-to-Video | 0.946 | 0.956 | 0.979 | 0.618 | 0.709 | 0.841 |
| | SVG | 0.941 | 0.948 | 0.978 | 0.606 | 0.709 | 0.836 |
| | Ours | 0.943 | 0.951 | 0.977 | 0.606 | 0.709 | 0.838 |
| Wan 2.1 | 14B, 720P, Text-to-Video | 0.956 | 0.968 | 0.983 | 0.613 | 0.713 | 0.846 |
| | SpargeAttn | 0.927 | 0.948 | 0.978 | 0.567 | 0.684 | 0.820 |
| | SVG | 0.947 | 0.960 | 0.980 | 0.597 | 0.703 | 0.837 |
| | Ours | 0.954 | 0.965 | 0.982 | 0.602 | 0.709 | 0.842 |
| Hunyuan | 13B, 720P, Text-to-Video | 0.915 | 0.941 | 0.993 | 0.648 | 0.753 | 0.850 |
| | XAttention | 0.912 | 0.924 | 0.992 | 0.631 | 0.739 | 0.839 |
| | SVG | 0.914 | 0.928 | 0.993 | 0.652 | 0.739 | 0.845 |
| | Ours | 0.917 | 0.946 | 0.993 | 0.657 | 0.751 | 0.852 |
- The full VBench results confirm that SVG2 consistently matches or outperforms the other baselines across the individual quality aspects (SubConsis, BackConsis, MotionSmooth, AesQual, ImagQual) and the overall Average score, in both the warmup and warmup-free settings. This shows SVG2's ability to maintain not just pixel-level fidelity but also higher-level perceptual and temporal consistency in generated videos.
6.5. Performance Gap between HunyuanVideo and Wan 2.1
The paper addresses observed performance differences between HunyuanVideo and Wan 2.1.
6.5.1. Quality Difference
- Wan 2.1 generally exhibits worse PSNR, SSIM, and LPIPS values than HunyuanVideo across all methods.
- The stated reason is Wan 2.1's high sensitivity to precision variance and numerical changes, even across different backend implementations (e.g., FlexAttention, FlashAttention, Torch SDPA), whereas HunyuanVideo is more robust. As a result, SVG2, which introduces some numerical approximation via sparse attention, naturally achieves lower PSNR on the more sensitive Wan 2.1. This difference is model-specific and not indicative of SVG2's methodological performance.
6.5.2. Speedup Difference
- The end-to-end speedup on Wan 2.1 (1.89x) is generally lower than on HunyuanVideo (2.30x).
- This difference primarily stems from the different attention cost ratios of the two models, driven by different context lengths and model architectures. HunyuanVideo has a longer context length (118k tokens) than Wan 2.1 (75k tokens). Additionally, HunyuanVideo's layers consist mainly of self-attention and feed-forward networks, while Wan 2.1 includes an additional cross-attention block.
- Consequently, the attention module constitutes a larger proportion of HunyuanVideo's total runtime. Since SVG2 primarily accelerates the attention module, its overall speedup naturally scales with the attention module's share of total runtime (see the Amdahl's-law sketch below). This clarifies that SVG2's effectiveness is consistent, but its impact on end-to-end speedup varies with the baseline model's architecture.
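This is essentially Amdahl's law: if attention accounts for a fraction p of end-to-end runtime and the attention kernel itself is accelerated by a factor s, the overall speedup is 1 / ((1 - p) + p / s). The fractions and kernel speedups below are illustrative assumptions chosen only to reproduce the qualitative gap, not measurements from the paper.

```python
def end_to_end_speedup(attention_fraction: float, attention_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the attention portion is accelerated."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / attention_speedup)

# Illustrative attention-runtime fractions and kernel speedups (not measured values).
for model, frac, attn_speedup in [("HunyuanVideo", 0.80, 3.5), ("Wan 2.1", 0.70, 3.0)]:
    print(f"{model}: attention {frac:.0%} of runtime, {attn_speedup}x attention speedup "
          f"-> ~{end_to_end_speedup(frac, attn_speedup):.2f}x end-to-end")
```

With these assumed fractions the formula lands near the reported 2.30x and 1.89x, showing how a larger attention share translates directly into a larger end-to-end gain.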
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces SVG2, a novel, training-free framework designed to accelerate Diffusion Transformers (DiTs) for video generation by optimizing sparse attention mechanisms. The core innovation of SVG2 is semantic-aware permutation, which clusters tokens based on semantic similarity using k-means and then reorders them into a contiguous layout. This approach effectively addresses two key limitations of prior sparse attention methods: inaccurate critical token identification (by providing precise cluster representations) and excessive computation waste (by enabling efficient GPU processing without padding).
SVG2 further integrates centroid-based Top-p dynamic budget control for flexible quality-efficiency trade-offs and develops customized kernel implementations to support dynamic block sizes and leverage GPU capabilities. Comprehensive evaluations on HunyuanVideo and Wan 2.1 demonstrate that SVG2 consistently achieves a Pareto frontier trade-off, delivering superior generation quality at any given computational budget. It provides significant end-to-end speedups (up to 2.30x on HunyuanVideo and 1.89x on Wan 2.1) while maintaining high video quality (PSNR up to 30 and 26 respectively). By improving both the accuracy of sparsity identification and the efficiency of sparse computation, SVG2 makes DiT-based video generation more practical and accessible.
7.2. Limitations & Future Work
The major limitation explicitly stated by the authors is the lack of discussion and evaluation on whether the proposed methods can be extended to attention mechanisms other than DiTs. This suggests that while SVG2 is highly effective for DiT-based video generation, its applicability to other Transformer-based models (e.g., LLMs, image DiTs, or other vision Transformers) or different types of attention (e.g., cross-attention in multi-modal models beyond Wan 2.1's specific use) is not fully explored. Future work could involve adapting SVG2's principles to a broader range of Transformer architectures and attention mechanisms.
7.3. Personal Insights & Critique
This paper presents a rigorous and well-executed solution to a critical problem in generative AI. The dual approach of improving identification accuracy and minimizing computation waste through semantic-aware permutation is elegant and addresses fundamental mismatches between algorithmic sparsity and hardware architecture.
Inspirations and Applications:
- Generalizability of Semantic Clustering: The concept of using semantic clustering (e.g., k-means on QKV activations) to identify critical tokens could be highly beneficial beyond video DiTs. It could be applied to LLMs with long context windows, where position-based sparsity often struggles to capture global semantic relationships, leading to more accurate sparse attention in a variety of Transformer-based models.
- Hardware-Algorithm Co-design: The emphasis on system-algorithm co-design, particularly the customized kernel for dynamic block sizes, is a crucial takeaway. It highlights that optimizing AI models for real-world deployment requires not just algorithmic innovation but also a deep understanding of, and customization for, the underlying hardware. This principle applies to any domain where sparse operations run on parallel accelerators.
- Dynamic Sparsity Control: The centroid-based Top-p selection offers a flexible way to manage the quality-efficiency trade-off. This dynamic control, driven by semantic scores computed from cluster centroids, is a robust mechanism that could be adapted to other adaptive sparsity schemes, allowing users to define their desired performance envelope without extensive manual tuning (a simplified sketch of the selection logic follows below).
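A minimal sketch of centroid-based top-p selection, as a simplified reading of the idea rather than the paper's exact implementation: score each key-cluster centroid against a query centroid, convert the scores to a probability distribution, and keep the smallest set of clusters whose cumulative mass reaches the budget p. The centroid dimensions and cluster count below are illustrative.

```python
import torch

def top_p_clusters(query_centroid: torch.Tensor, key_centroids: torch.Tensor, p: float = 0.9):
    """Select key clusters whose cumulative softmax score reaches the budget p."""
    scores = key_centroids @ query_centroid              # centroid-level similarity
    probs = torch.softmax(scores, dim=-1)
    order = torch.argsort(probs, descending=True)
    cum = torch.cumsum(probs[order], dim=0)
    cutoff = int(torch.searchsorted(cum, torch.tensor(p)).item()) + 1
    return order[:cutoff]                                # indices of selected key clusters

q_centroid = torch.randn(64)
k_centroids = torch.randn(500, 64)                       # e.g., C_k = 500 key clusters
selected = top_p_clusters(q_centroid, k_centroids, p=0.9)
print(f"selected {selected.numel()} / {k_centroids.shape[0]} key clusters")
```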
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- k-means Overhead and Hyperparameter Sensitivity: While the centroid cache significantly reduces k-means latency, k-means itself still incurs some computational cost and, more importantly, requires choosing the numbers of clusters (C_q and C_k). The paper finds optimal values empirically (100 and 500), but these might be sensitive to different models, datasets, or video lengths. An adaptive method for determining C_q and C_k on the fly, perhaps based on the inherent density of latent-space activations, could further enhance robustness.
- Definition of "Semantic Similarity": The paper assumes that k-means on the Q and K vectors effectively captures "semantic similarity." While this is a reasonable heuristic, given that Q and K are used to compute attention scores, the exact nature of this "semantic" grouping may be complex. Further analysis of what the k-means clusters represent (e.g., specific objects, backgrounds, motion types) could provide deeper insights and potentially lead to more semantically informed clustering algorithms.
- Cold Start Problem for the Centroid Cache: The centroid cache is effective after the first few steps, but the initial denoising steps (the warmup period) or the very first video generation still incur the full k-means cost. While this is amortized over many steps, for very short generation processes the initial overhead might still be noticeable.
- Applicability to Cross-Attention: The stated limitation focuses on attention mechanisms other than DiTs. It would be interesting to see how semantic-aware permutation could be applied to cross-attention (e.g., between text embeddings and video tokens). While Wan 2.1 contains cross-attention, the paper's main focus is self-attention, and the differing roles of Query (from video) and Key/Value (from text) in cross-attention might require modifications to the clustering strategy.
- Theoretical Guarantees: The paper provides strong empirical evidence. A deeper theoretical analysis of why k-means clustering on Q and K activations leads to high attention recall, and of how the permutation maintains mathematical equivalence while being hardware-efficient, could strengthen the fundamental understanding of SVG2's performance.

Overall, SVG2 is a significant advancement in making high-quality video generation with DiTs more efficient, demonstrating an excellent balance between algorithmic innovation and practical system optimization.