AiPaper

Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation

Published: 05/25/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents SVG2, a training-free framework that enhances critical token identification accuracy through semantic-aware permutation, reducing computation waste and addressing efficiency bottlenecks in sparse attention for video generation, achieving up to 2.30x acceleration while preserving generation quality.

Abstract

Diffusion Transformers (DiTs) are essential for video generation but suffer from significant latency due to the quadratic complexity of attention. By computing only critical tokens, sparse attention reduces computational costs and offers a promising acceleration approach. However, we identify that existing methods fail to approach optimal generation quality under the same computation budget for two reasons: (1) Inaccurate critical token identification: current methods cluster tokens based on position rather than semantics, leading to imprecise aggregated representations. (2) Excessive computation waste: critical tokens are scattered among non-critical ones, leading to wasted computation on GPUs, which are optimized for processing contiguous tokens. In this paper, we propose SVG2, a training-free framework that maximizes identification accuracy and minimizes computation waste, achieving a Pareto frontier trade-off between generation quality and efficiency. The core of SVG2 is semantic-aware permutation, which clusters and reorders tokens based on semantic similarity using k-means. This approach ensures both a precise cluster representation, improving identification accuracy, and a densified layout of critical tokens, enabling efficient computation without padding. Additionally, SVG2 integrates top-p dynamic budget control and customized kernel implementations, achieving up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. Our code is open-sourced at \href{https://github.com/svg-project/Sparse-VideoGen}{https://github.com/svg-project/Sparse-VideoGen}.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is accelerating video generation using sparse attention mechanisms, specifically by employing semantic-aware permutation. The title is Sparse VideoGen2: Accelerate Video Generation with Sparse Attention via Semantic-Aware Permutation.

1.2. Authors

The authors are Shuo Yang*, Haocheng Xi*, Yilong Zhao, Muyang Li, Jintao Zhang, Han Cai, Yujun Lin, Xiuyu Li, Chenfeng Xu, Kelly Peng, Jianfei Chen, Song Han, Kurt Keutzer, and Ion Stoica. Their affiliations include the University of California, Berkeley, MIT, and Stanford University. This diverse authorship from leading academic institutions suggests a strong research background in deep learning, efficient AI, and hardware acceleration.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server, as indicated by the Original Source Link and PDF Link. While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in computer science, physics, mathematics, and other fields. Papers often appear on arXiv before formal publication at conferences (e.g., ICML, NeurIPS) or journals, allowing for early sharing and feedback. Many of the authors have a strong publication record in top-tier machine learning conferences (e.g., ICML, NeurIPS, CVPR), suggesting that this work is likely intended for such a venue.

1.4. Publication Year

The publication timestamp indicates 2025-05-24T21:30:29.000Z, implying a publication year of 2025.

1.5. Abstract

The paper addresses the significant latency bottleneck in Diffusion Transformers (DiTs) for video generation, primarily caused by the quadratic complexity of their attention mechanisms. While sparse attention offers a promising solution by computing only critical tokens, existing methods fall short in generation quality for a given computation budget. The authors identify two main issues: (1) Inaccurate critical token identification due to position-based clustering, leading to imprecise aggregated representations. (2) Excessive computation waste because scattered critical tokens cause inefficient processing on GPUs optimized for contiguous memory access.

To overcome these, the paper proposes SVG2, a training-free framework designed to maximize identification accuracy and minimize computation waste, achieving a Pareto frontier trade-off between quality and efficiency. The core of SVG2 is semantic-aware permutation, which uses k-means clustering to group and reorder tokens based on semantic similarity. This ensures precise cluster representation for accurate identification and a densified layout of critical tokens for efficient computation without padding. SVG2 also incorporates top-p dynamic budget control and customized kernel implementations. It achieves up to 2.30x and 1.89x speedup while maintaining a PSNR of up to 30 and 26 on HunyuanVideo and Wan 2.1, respectively. The code is open-sourced.

The original source link is https://arxiv.org/abs/2505.18875. This is a preprint on arXiv.

The PDF link is https://arxiv.org/pdf/2505.18875v3.pdf. This is the third version of the preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the critical problem of high computational latency in Diffusion Transformers (DiTs) when applied to video generation. DiTs have proven highly effective in generating high-quality images and videos, but their 3D spatio-temporal attention mechanisms introduce a quadratic computational complexity with respect to the sequence length. This means that as videos get longer or have higher resolution (more tokens), the computation time increases quadratically, making it prohibitively expensive for practical deployment. For instance, generating a short video using HunyuanVideo on an NVIDIA A100 GPU can take nearly an hour, with attention operations consuming over 80% of the runtime.

Previous research has noted that self-attention mechanisms are often sparse, meaning only a small fraction of computations significantly influence the final output. This observation led to sparse attention methods, which aim to reduce computational costs by processing only the most critical tokens. Current approaches typically involve an identification step where token activations are used to estimate attention scores, and tokens with the highest scores are selected. To minimize overhead, this identification is often performed at a block granularity, treating consecutive tokens as a single unit.

However, the authors identify two significant challenges with existing sparse attention methods that prevent them from achieving optimal generation quality under a given computational budget:

  1. Inaccurate critical token identification: Existing block-wise identification methods cluster tokens based on their position in the sequence rather than their semantic similarity. This can group semantically diverse tokens into the same block, leading to an imprecise aggregated representation (e.g., using mean or max pooling for a block). Such imprecise representations result in inaccurate estimations of attention scores and, consequently, incorrect identification of critical tokens.

  2. Excessive computation waste: Even if critical tokens could be perfectly identified, their scattered distribution within the tensor leads to computation waste on modern ML accelerators like GPUs. These accelerators are optimized for dense matrix multiplications with contiguous memory layouts. When critical tokens are scattered, they must be padded with non-critical tokens to fit the hardware's contiguous processing units (e.g., tensor cores requiring 16x16x8 shapes), wasting computational resources on non-essential data. The paper notes that up to 80% of computation can be wasted this way.

    The paper's innovative idea is to leverage semantic-aware permutation to address these two challenges simultaneously, aiming to bridge the gap between existing sparse attention methods and the theoretical upper bound of an oracle policy.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Proposed SVG2 Framework: SVG2 is introduced as a novel, training-free framework for sparse attention specifically designed to accelerate DiT-based video generation. It aims to maximize the accuracy of critical token identification and minimize computation waste.
  • Semantic-Aware Permutation: The core innovation is semantic-aware permutation, which utilizes k-means clustering to group and reorder Query, Key, and Value tokens based on their semantic similarity (derived from their activations). This approach serves a dual purpose:
    1. Improved Identification Accuracy: By creating semantically coherent clusters, the aggregated representations (centroids) become more precise, leading to more accurate estimation of attention scores and thus better identification of critical tokens.
    2. Minimized Computation Waste: The permutation reorders scattered critical tokens into compact, dense blocks. This densified layout allows GPUs to process only critical tokens efficiently without needing padding, thereby reducing computation waste.
  • Centroid-Based Top-p Dynamic Budget Control: SVG2 integrates a mechanism to dynamically adjust the computational budget. It uses cluster centroids to approximate attention scores and then employs a Top-p selection strategy to select critical clusters until a predefined attention score target is met. This enables flexible trade-offs between quality and efficiency without manual tuning.
  • Efficient System-Algorithm Co-designs:
    • Fast k-means with Centroid Cache: To mitigate the computational overhead of k-means clustering, SVG2 implements a centroid cache that reuses centroids from previous denoising steps, significantly accelerating the clustering process (up to 76x speedup).
    • Customized Attention Kernel for Dynamic Block Sizes: Recognizing that semantic-aware permutation naturally produces clusters of varying sizes, SVG2 introduces a customized attention kernel capable of handling dynamic block sizes. This kernel supports FlashAttention-2 (FA2) and FlashAttention-3 (FA3) backends, enabling efficient sparse loading and dense computation without padding, achieving over 85% of theoretical maximum performance.
  • Achieving Pareto Frontier Trade-off: SVG2 consistently outperforms existing methods, achieving a Pareto frontier in the quality-efficiency trade-off curve. This means for any given computational budget (density), SVG2 delivers superior generation quality, and for any target quality, it offers higher efficiency.
  • Significant Speedup and Quality Retention: The framework demonstrates substantial practical benefits, achieving an end-to-end speedup of up to 2.30x and 1.89x on HunyuanVideo and Wan 2.1 respectively, while maintaining high visual quality with PSNR values up to 30 and 26.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Diffusion Transformers (DiTs)

Diffusion Transformers (DiTs) are a class of generative models that combine diffusion models with the Transformer architecture.

  • Diffusion Models: These are generative models that learn to reverse a gradual noising process. They start with random noise and progressively denoise it over several steps to generate a coherent data sample (e.g., an image or video). They are known for generating high-quality and diverse outputs.
  • Transformers: Originally developed for natural language processing, Transformers are neural network architectures primarily based on the self-attention mechanism. They excel at processing sequential data by allowing each element in a sequence to "attend" to all other elements, capturing long-range dependencies effectively.
  • DiTs for Video Generation: In the context of video generation, DiTs extend this concept to 3D spatio-temporal data. This means they process tokens that represent pixels/patches across both spatial dimensions (width, height) and the temporal dimension (frames). The Transformer part, especially its self-attention modules, is responsible for modeling the intricate spatio-temporal relationships within the video latent space, guiding the denoising process to generate coherent video content.

3.1.2. Self-Attention Mechanism

The self-attention mechanism is the core component of Transformer models. It allows a model to weigh the importance of different parts of the input sequence when processing a specific element. For an input sequence of tokens, each token is transformed into three different vectors:

  • Query (Q): Represents what the current token is "looking for."
  • Key (K): Represents what information the current token "offers."
  • Value (V): Contains the actual information that is "offered" by the current token. The attention score between a Query and all Keys determines how much Value to aggregate from each token. The standard formula for self-attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
  • $Q$ is the matrix of Query vectors.
  • $K$ is the matrix of Key vectors.
  • $V$ is the matrix of Value vectors.
  • $QK^T$ calculates the dot-product similarity between each query and all keys.
  • $\sqrt{d_k}$ is a scaling factor (where $d_k$ is the dimension of the key vectors) used to prevent the dot products from becoming too large, which can push the softmax function into regions with tiny gradients.
  • softmax is an activation function that converts raw scores into probabilities, ensuring that the weights sum to 1.
  • The result is a weighted sum of Value vectors, where the weights are determined by the softmax output.
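For concreteness, here is a minimal PyTorch sketch of the scaled dot-product attention defined above (tensor shapes and names are illustrative, not taken from the paper's code):

```python
import torch
import torch.nn.functional as F

def self_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (N_q, N_k) raw similarities
    weights = F.softmax(scores, dim=-1)             # each row sums to 1
    return weights @ V                              # weighted sum of Value vectors

# Toy example: 8 tokens with hidden dimension 64
Q, K, V = (torch.randn(8, 64) for _ in range(3))
out = self_attention(Q, K, V)                       # shape (8, 64)
```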

3.1.3. Quadratic Complexity of Attention

The term quadratic complexity refers to how the computational cost of the attention mechanism scales with the input sequence length. If $N$ is the sequence length (number of tokens), computing the $QK^T$ matrix involves multiplying an $N \times d_k$ matrix ($Q$) by a $d_k \times N$ matrix ($K^T$). This operation has a computational complexity of $O(N^2 \cdot d_k)$. Since $N$ can be very large in video generation (e.g., thousands of tokens per frame across multiple frames), the $N^2$ term leads to a rapid increase in computation time and memory usage, making it a significant bottleneck.
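To make the quadratic scaling concrete, the back-of-the-envelope estimate below counts the matrix-multiply operations of one attention head for a video-scale sequence. The per-frame token count follows the HunyuanVideo setup described later in this analysis; the head dimension of 128 and the $4N^2 d$ FLOP approximation (for $QK^\top$ plus $PV$) are assumptions for illustration only:

```python
# Rough attention cost for a video-scale sequence (illustrative assumptions).
frames, tokens_per_frame, d = 33, 3600, 128   # d: assumed per-head dimension
N = frames * tokens_per_frame                  # 118,800 tokens
flops = 4 * N**2 * d                           # ~2*N^2*d for QK^T + ~2*N^2*d for PV
print(f"N = {N:,} tokens -> ~{flops / 1e12:.1f} TFLOPs per head per layer")
# Doubling N quadruples this cost, which is why attention dominates runtime.
```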

3.1.4. Sparse Attention

Sparse attention is a technique designed to mitigate the quadratic complexity of traditional self-attention. Instead of computing attention scores for all possible Query-Key pairs, sparse attention selectively computes only a subset of these pairs, focusing on the most "important" or "critical" ones. This reduces the number of operations from $O(N^2)$ to roughly $O(N \cdot S)$ or $O(N \log N)$, where $S$ is a sparsity factor, leading to significant computational savings. The challenge lies in accurately identifying these critical tokens without sacrificing model performance.

3.1.5. K-Means Clustering

k-means clustering is a popular unsupervised machine learning algorithm used to partition $n$ observations into $k$ clusters, where each observation belongs to the cluster with the nearest mean (centroid), which serves as a prototype of the cluster.

  • Algorithm Steps:
    1. Initialization: Choose $k$ initial centroids (randomly or using a smarter method like k-means++).
    2. Assignment: Assign each data point (e.g., token vector) to the cluster whose centroid is closest (typically measured by Euclidean distance).
    3. Update: Recalculate the centroids as the mean of all data points assigned to that cluster.
    4. Iteration: Repeat steps 2 and 3 until the cluster assignments no longer change or a maximum number of iterations is reached.
  • Application in SVG2: In SVG2, k-means is applied to the Query and Key token vectors to group tokens that have similar semantic activations.
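The following is a simplified NumPy sketch of the k-means loop applied to token vectors. SVG2's actual implementation runs on the GPU and uses the centroid cache described in Section 4.2.3, so this is only a conceptual illustration:

```python
import numpy as np

def kmeans(tokens, k, iters=10, seed=0):
    """Cluster token vectors of shape (N, d) into k groups by nearest centroid."""
    rng = np.random.default_rng(seed)
    centroids = tokens[rng.choice(len(tokens), k, replace=False)].copy()   # 1. initialization
    for _ in range(iters):
        # 2. assignment: nearest centroid by Euclidean distance
        dists = np.linalg.norm(tokens[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # 3. update: recompute each centroid as the mean of its members
        for c in range(k):
            if (labels == c).any():
                centroids[c] = tokens[labels == c].mean(axis=0)
    return labels, centroids

labels, centroids = kmeans(np.random.randn(3600, 64).astype(np.float32), k=16)
```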

3.1.6. Peak Signal-to-Noise Ratio (PSNR)

PSNR is a widely used metric to quantify the quality of reconstruction of an image or video, often used to measure the quality of lossy compression codecs or in this case, the fidelity of generated content compared to a reference. A higher PSNR generally indicates higher quality. The PSNR is defined as: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $ Where:

  • $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
  • $\mathrm{MSE}$ is the Mean Squared Error between the original and the reconstructed image. The MSE is calculated as: $ \mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ Where:
  • I(i,j) is the pixel value at position (i,j) in the original image.
  • K(i,j) is the pixel value at position (i,j) in the reconstructed (generated) image.
  • $m$ and $n$ are the dimensions (height and width) of the image.
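A minimal NumPy implementation of the PSNR formula above, assuming 8-bit frames:

```python
import numpy as np

def psnr(ref, gen, max_val=255.0):
    """Peak Signal-to-Noise Ratio between a reference and a generated frame."""
    mse = np.mean((ref.astype(np.float64) - gen.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")          # identical images
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.random.randint(0, 256, (720, 1280, 3), dtype=np.uint8)
gen = np.clip(ref + np.random.randint(-5, 6, ref.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(ref, gen):.2f} dB")
```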

3.1.7. Structural Similarity Index Measure (SSIM)

SSIM is a perceptual metric that measures the similarity between two images. Unlike PSNR, which measures absolute error, SSIM is designed to model the human visual system's perception of quality. It considers three key aspects: luminance, contrast, and structure. A value closer to 1 indicates higher similarity. The SSIM between two images $x$ and $y$ is defined as: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $ Where:

  • $\mu_x$ and $\mu_y$ are the average (mean) pixel values of image $x$ and $y$, respectively.
  • $\sigma_x^2$ and $\sigma_y^2$ are the variances of image $x$ and $y$, respectively.
  • $\sigma_{xy}$ is the covariance of image $x$ and $y$.
  • $c_1 = (K_1 L)^2$ and $c_2 = (K_2 L)^2$ are two constants included to avoid division by zero. $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $K_1, K_2$ are small constants (e.g., $K_1 = 0.01$, $K_2 = 0.03$).
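For illustration, the following is a simplified, single-window SSIM that applies the formula above to whole grayscale images; production implementations typically compute SSIM over sliding local windows and average the results:

```python
import numpy as np

def ssim_global(x, y, L=255.0, K1=0.01, K2=0.03):
    """Single-window SSIM between two grayscale images of equal size."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )
```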

3.1.8. Learned Perceptual Image Patch Similarity (LPIPS)

LPIPS is a metric that assesses the perceptual similarity between two images using features extracted from a pre-trained deep neural network. Instead of comparing pixels directly, LPIPS computes the Euclidean distance between feature representations of image patches from a VGG or AlexNet model. A lower LPIPS score indicates higher perceptual similarity (i.e., the images look more alike to a human observer).
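In practice, LPIPS is usually computed with the reference lpips PyTorch package; a typical usage sketch (images expected as tensors scaled to [-1, 1]) might look like the following:

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net="alex")            # AlexNet backbone; "vgg" is also available
img0 = torch.rand(1, 3, 256, 256) * 2 - 1    # random images scaled to [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img0, img1)               # lower = more perceptually similar
print(distance.item())
```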

3.1.9. FLOPs (Floating Point Operations)

FLOPs refers to the number of floating-point operations performed by a model. It's a common metric to estimate the computational cost of a neural network. A lower FLOPs count indicates higher computational efficiency.

3.1.10. Speedup

Speedup is a metric used to quantify the performance improvement achieved by an optimization. It's typically calculated as the ratio of the original execution time to the optimized execution time, or, conversely, the ratio of the optimized computation rate to the original computation rate. For example, a 2x speedup means the task completes twice as fast.

3.1.11. CUDA Kernels & FlashAttention

  • CUDA Kernels: CUDA is a parallel computing platform and programming model developed by NVIDIA for GPUs. CUDA kernels are functions that run on the GPU and are designed to execute in parallel across many threads, enabling highly efficient computation for tasks like matrix multiplication which are common in deep learning.
  • FlashAttention (FA2, FA3): FlashAttention is a highly optimized attention mechanism implementation designed for NVIDIA GPUs. It significantly reduces memory I/O (read/write operations to GPU memory), which is often a bottleneck, by using tiling and reordering to keep intermediate softmax computations in fast SRAM (on-chip memory). FA2 and FA3 are subsequent versions with further optimizations. They are crucial for accelerating Transformer models.
  • FlashInfer: FlashInfer is an efficient and customizable attention engine built on FlashAttention principles, specifically designed for LLM inference serving. It provides optimized kernels for various attention patterns.

3.2. Previous Works

The paper categorizes related work into several areas, focusing on sparse attention for DiTs and LLMs, linear attention, and long video generation/caching.

3.2.1. Sparse Attention for Video DiTs

Previous work in sparse attention for DiTs falls into two main categories:

  • Static Methods: These methods pre-define sparse patterns or identify critical tokens based on fixed rules (e.g., always attending to recent tokens). Examples include Sparse VideoGen (SVG) [4] and methods by Zhang et al. [8].
    • Limitation: They lack adaptability to diverse sparsity patterns, leading to suboptimal performance across different generation tasks or video content.
  • Dynamic Methods: These methods determine sparse patterns at runtime, usually through an additional identification step where attention scores are estimated. Examples include SpargeAttention [9], XAttention [10], and others [17, 18, 19, 20, 21, 22].
    • Limitation: The paper argues that existing dynamic methods fail to achieve both high identification accuracy and low computation waste, primarily due to position-based clustering and scattered critical tokens.
    • SpargeAttention [9]: Groups consecutive tokens into blocks and uses mean pooling to create an aggregated representation for each block, then approximates attention scores at the block level.
    • XAttention [10]: Also employs block sparse attention but with an antidiagonal scoring mechanism.

3.2.2. Sparse Attention for Large Language Models (LLMs)

This area also has two categories:

  • Memory-Efficient Methods: These focus on reducing memory load to accelerate decoding, crucial for LLMs with very long contexts. Examples include Quest [5], H2O [6], Attention Sinks [7], and Duoattention [24].
    • Relevance to DiTs: While beneficial for LLMs, these are often less effective for compute-bound DiT-based video generation, where the primary bottleneck is computational power rather than memory capacity.
  • Compute-Efficient Methods: These focus on processing only critical tokens to reduce computation. Examples include Minference [25], FlexPrefill [26], SeerAttention [27], Inflm [28], and LM-Infinite [29].
    • Relevance to DiTs: These methods often cannot directly optimize video DiTs due to the unique spatio-temporal sparse patterns of video data.
    • MMInference [12]: Notably, this work introduces a modality-aware permutation for multi-modal LLMs. However, it is rule-based and designed for inter-modality tokens, differing from SVG2's semantic-aware clustering for intra-modality (video) tokens.
    • Tactic [11] and Twilight [15]: These methods, for LLMs, inspire SVG2's Top-p critical token selection strategy. Tactic uses adaptive sparse attention with clustering and distribution fitting, while Twilight employs hierarchical Top-p pruning.

3.2.3. Linear Attention for Diffusion Models

This line of research replaces the quadratic complexity of standard attention with linear complexity, making it highly efficient for long-context problems. Examples include Transformers are RNNs [30], Gated Linear Attention Transformers [31], and Gated Delta Networks [32]. Some works combine linear attention or state space models (SSMs) like Mamba [33, 34] with Transformers for video generation:

  • Matten [35] uses Mamba for global information and attention for local.
  • LinGen [36] uses Mamba2 and Swin attention.
  • M4V [37] proposes an MM-DiM block.
  • SANA [40, 41, 42] and DC-AE [43, 44, 45] use Linear Attention and Deep-Compressed Auto-Encoders.
    • Relevance to SVG2: While these methods offer significant complexity reduction, SVG2 focuses on optimizing the existing quadratic attention where it's still used, often for local or specific types of interactions, rather than replacing it entirely with linear attention.

3.2.4. Long Video Generation and Caching-Based Acceleration

These methods address challenges in generating minute-level videos and optimizing efficiency through KV cache reuse. Examples include CausVid [46], Self-Forcing [47], LongLive [48], Framepack [49], RifleX [50], and FreeLong++ [51].

  • RadialAttention [52], VMOBA [53], Mixture-of-Context [54], and VSA [55] adopt sparse attention in long-context fine-tuning.
  • Caching-based methods [56, 57, 58, 59, 60] utilize redundancy between timesteps and classifier-free guidance (CFG) for efficiency.
    • Relevance to SVG2: The paper states that these methods are orthogonal to SVG2 and can be integrated for even higher speedups, indicating that SVG2 focuses on the attention mechanism itself, while these methods optimize the overall denoising process or KV caching.

3.3. Technological Evolution

The evolution of DiT-based video generation has moved from full, dense attention (which has quadratic complexity) to sparse attention techniques to mitigate computational bottlenecks.

  1. Dense Attention: Initial Transformer models used dense attention, calculating interactions between all Query-Key pairs. This delivered high quality but suffered from quadratic scaling, making it impractical for long video sequences.
  2. Static Sparse Attention: Early attempts to reduce computation involved pre-defined sparse patterns, like focusing on local windows or fixed patterns (e.g., Sparse VideoGen). While offering some speedup, these methods lacked adaptability.
  3. Dynamic Sparse Attention (Position-based): The next step involved dynamically identifying critical tokens at runtime. Methods like SpargeAttention and XAttention clustered tokens into blocks based on their position in the sequence (e.g., consecutive tokens) and then estimated attention scores for these blocks. This reduced identification overhead but introduced inaccuracy due to mixing semantically distinct tokens within a block. Furthermore, the scattered nature of truly critical tokens led to computation waste on GPUs.
  4. SVG2 (Semantic-Aware Dynamic Sparse Attention): SVG2 represents an advancement by addressing the core limitations of position-based dynamic sparse attention. It introduces semantic-aware clustering using k-means to group tokens by actual semantic similarity, thus improving identification accuracy. Crucially, it then permutes these semantically similar (and often critical) tokens into a contiguous layout to maximize hardware efficiency and minimize computation waste. This places SVG2 at the forefront of dynamically adaptable and hardware-efficient sparse attention for video generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, SVG2's core differences and innovations are:

  • From Position-based to Semantic-aware Clustering:

    • Existing Methods (e.g., SpargeAttention): Cluster tokens based on their position (e.g., consecutive tokens in a block). This leads to inaccurate identification because position does not guarantee semantic similarity.
    • SVG2's Innovation: Employs k-means clustering on Query and Key activations to group tokens based on their semantic similarity. This ensures that tokens within a cluster share similar semantics, leading to more precise aggregated representations (centroids) and significantly improved critical token identification accuracy.
  • From Scattered to Densified Critical Token Layout:

    • Existing Methods: Even with perfect identification, critical tokens remain scattered across the tensor. This causes computation waste because GPUs (especially tensor cores) require contiguous inputs and perform padding on sparse, scattered data.
    • SVG2's Innovation: Introduces semantic-aware permutation which physically reorders tokens within each k-means cluster into a contiguous layout. This densifies the sparse computation, allowing GPUs to process only critical tokens without padding, thereby maximizing hardware utilization and minimizing computation waste.
  • Dynamic Budget Control with Centroid-based Estimation:

    • Existing Methods: Often rely on pre-defined sparsity percentages or less accurate estimation methods.
    • SVG2's Innovation: Uses cluster centroids to accurately approximate attention scores and integrates a Top-p selection strategy to dynamically adjust the number of selected critical tokens to meet a target attention recall or quality requirement. This offers greater flexibility and adaptability.
  • Integrated System-Algorithm Co-design:

    • Existing Methods: May focus on either algorithmic sparsity or hardware optimization separately, sometimes leading to mismatches.

    • SVG2's Innovation: Proactively addresses hardware-software co-design challenges. It develops a centroid cache for fast k-means (addressing k-means overhead) and, critically, a customized attention kernel that efficiently handles the dynamic block sizes naturally arising from semantic-aware clustering. This ensures that the algorithmic benefits of semantic-aware permutation are fully realized on modern GPU architectures.

      In essence, SVG2 differentiates itself by providing a comprehensive solution that simultaneously tackles both the accuracy of critical token identification and the efficiency of their hardware-aware processing, moving beyond the limitations of prior heuristic or position-based sparse attention methods.

4. Methodology

4.1. Principles

The core idea behind SVG2 is to overcome the limitations of existing sparse attention methods in DiT-based video generation by leveraging semantic similarity for both accurate identification of critical tokens and efficient hardware processing. The theoretical basis and intuition are rooted in the observation that attention in DiTs is inherently sparse and that semantically similar tokens are more likely to attend to each other. By grouping tokens based on their semantics rather than their arbitrary position, SVG2 aims to:

  1. Improve Identification Accuracy: Create more representative aggregated features for critical token selection. When tokens within a cluster are semantically similar, their centroid (or mean/max pooling) can more accurately represent their collective attention scores, leading to better choices of which tokens are truly important.
  2. Minimize Computation Waste: Facilitate efficient sparse computation on GPUs. If semantically similar tokens are grouped together, and these groups also tend to be critical, then reordering them contiguously allows ML accelerators to process these dense blocks without wasting computation on padding or scattered memory access. This aligns the sparse computation pattern with GPU architectural strengths.

4.2. Core Methodology In-depth (Layer by Layer)

The SVG2 framework operates training-free, meaning it does not require additional training of the DiT model itself. It integrates three key techniques: semantic-aware permutation with k-means clustering, centroid-based Top-p selection, and efficient system-algorithm co-designs. The overall workflow of SVG2 is visualized in Figure 5.

The process begins in a Diffusion Transformer (DiT) layer, where input activations with a hidden dimension $d$ are transformed into Query (Q), Key (K), and Value (V) tensors. These tensors are fundamental for the self-attention operation. The standard self-attention operation in DiTs is defined as: $ O = P \times V, \quad P = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d}}, \mathrm{dim}=-1 \right) $ Where:

  • $Q \in \mathbb{R}^{N_q \times d}$ is the Query matrix, where $N_q$ is the number of query tokens.

  • $K \in \mathbb{R}^{N_k \times d}$ is the Key matrix, where $N_k$ is the number of key tokens.

  • $V \in \mathbb{R}^{N_k \times d}$ is the Value matrix. Note that Key and Value share the same number of tokens, $N_k$, as they come from the same source sequence in self-attention.

  • $d$ is the hidden dimension (or the dimension of the Key vectors, $d_k$, used for scaling).

  • $QK^{\top}$ computes the raw attention scores.

  • $\mathrm{softmax}(\cdot, \mathrm{dim}=-1)$ normalizes these scores along the last dimension to produce the attention probability matrix $P$, which captures the relationships between Query tokens and Key tokens.

  • $O \in \mathbb{R}^{N_q \times d}$ is the final output matrix, representing the weighted sum of Value vectors.

    The problem, as highlighted by the authors, is that computing $P$ has quadratic complexity relative to the sequence length, making it a bottleneck. SVG2 addresses this by modifying how critical tokens are identified and processed.

4.2.1. Semantic-Aware Permutation with k-means Clustering

As discussed in Section 3.2, existing sparse attention methods suffer from inaccurate identification due to position-based clustering. To counter this, SVG2 introduces semantic-aware permutation by performing k-means clustering on the activations of the input tokens.

Clustering Step: For each attention head and Transformer layer, k-means is applied independently to the Query tokens and Key tokens.

  • The Query tokens, $Q \in \mathbb{R}^{N_q \times d}$, are clustered into $C_q$ query clusters: $Q_1, \ldots, Q_{C_q}$.

  • The Key tokens, $K \in \mathbb{R}^{N_k \times d}$, are clustered into $C_k$ key clusters: $K_1, \ldots, K_{C_k}$.

    This approach ensures that tokens within each resulting cluster share similar semantics. This is crucial because it leads to more precise centroid representations, which are then used for more accurate critical token identification.

Permutation Step: While k-means logically groups semantically similar tokens, these tokens are still physically scattered in the original $Q$, $K$, and $V$ tensors. This scattered layout is inefficient for ML accelerators. To address this, SVG2 performs a semantic-aware permutation based on the k-means clustering, reordering tokens so that each cluster occupies a contiguous region.

Let $\pi_q \in \mathbb{R}^{N_q \times N_q}$ be the permutation matrix for Query tokens and $\pi_k \in \mathbb{R}^{N_k \times N_k}$ be the permutation matrix for Key and Value tokens (since $K$ and $V$ must be permuted consistently). These matrices satisfy $\pi_q \pi_q^{\top} = I$ and $\pi_k \pi_k^{\top} = I$, where $I$ is the identity matrix. The permuted Query, Key, and Value tensors are then:

  • $Q' = \pi_q Q$

  • $K' = \pi_k K$

  • $V' = \pi_k V$

    The permuted attention output $O'$ is mathematically equivalent to the original attention output $O$: $ O' = \pi_q^{\top} \mathrm{Attention}(Q', K', V') = \pi_q^{\top} \mathrm{softmax}\left( \frac{(\pi_q Q)(\pi_k K)^{\top}}{\sqrt{d}} \right) \pi_k V = (\pi_q^{\top} \pi_q)\, \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d}} \right) (\pi_k^{\top} \pi_k) V = \mathrm{softmax}\left( \frac{QK^{\top}}{\sqrt{d}} \right) V = O $ Here:

  • The first equality expresses $O'$ as attention over the permuted $Q'$, $K'$, $V'$, followed by undoing the Query permutation with $\pi_q^{\top}$.

  • The remaining equalities demonstrate the equivalence: since $\pi_q^{\top} \pi_q = I$ and $\pi_k^{\top} \pi_k = I$, the permutation matrices cancel out, recovering the original attention calculation and output $O$. This ensures that the permutation does not change the mathematical result of the attention operation, only its computational layout.

    The benefit of this cluster-wise contiguous layout is that it can be computed efficiently by the underlying ML accelerators, drastically reducing computation waste.
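This equivalence can be checked numerically. The sketch below applies random permutations (standing in for the cluster-based reordering) to $Q$, $K$, and $V$, runs standard attention, and undoes the query permutation to recover the original output; it is a minimal PyTorch check, not the paper's kernel:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    return F.softmax(Q @ K.transpose(-2, -1) / Q.shape[-1] ** 0.5, dim=-1) @ V

Nq, Nk, d = 64, 96, 32
Q, K, V = torch.randn(Nq, d), torch.randn(Nk, d), torch.randn(Nk, d)

perm_q, perm_k = torch.randperm(Nq), torch.randperm(Nk)   # stand-ins for cluster orderings
O = attention(Q, K, V)
O_perm = attention(Q[perm_q], K[perm_k], V[perm_k])        # attention on permuted tensors

# Undo the query permutation (output rows follow the query order)
O_recovered = torch.empty_like(O_perm)
O_recovered[perm_q] = O_perm

print(torch.allclose(O, O_recovered, atol=1e-5))           # True
```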

4.2.2. Centroid-Based Top-p Selection

After semantic-aware permutation creates semantic-coherent clusters, SVG2 needs to (1) efficiently estimate the criticality of these clusters and (2) dynamically determine how many critical clusters (and thus critical tokens) to select to meet desired accuracy.

Accurate and Efficient Estimation of Criticality: SVG2 estimates the criticality of each cluster using a centroid-based estimation of attention scores. Instead of calculating the full $QK^{\top}$, it uses the centroids of the Query and Key clusters to approximate these scores. The raw pre-softmax score $S_{ij}$ between Query cluster $i$ and Key cluster $j$ is estimated as: $ S_{ij} = \frac{\mathrm{centroid}(Q_i) \cdot \mathrm{centroid}(K_j)^{T}}{\sqrt{d_k}} \quad (1) $ Where:

  • $\mathrm{centroid}(Q_i)$ is the centroid vector of the $i$-th Query cluster.

  • $\mathrm{centroid}(K_j)$ is the centroid vector of the $j$-th Key cluster.

  • $d_k$ is the dimension of the Key vectors (used for scaling).

    These pre-softmax scores are then used to calculate an approximate attention score $P'_{ij}$ for each Query cluster $i$ attending to Key cluster $j$. This approximation also accounts for the size of the Key cluster $|K_j|$ to better reflect its overall contribution: $ P'_{ij} = \frac{|K_j| \exp(S_{ij})}{\sum_{k=1}^{C_k} |K_k| \exp(S_{ik})} \quad (2) $ Where:

  • $|K_j|$ is the number of tokens in Key cluster $j$.

  • $\exp(S_{ij})$ applies the exponential function to the raw score.

  • The denominator sums over all Key clusters (from $k=1$ to $C_k$) to normalize the scores, similar to softmax.

    Since tokens within the same cluster share similar semantics, their centroids provide highly accurate representations of the actual activations, making this estimation reliable. The computational overhead of this cluster-level approximation is negligible (less than 1% of full attention), as the number of clusters ($C_q$, $C_k$) is typically small (e.g., fewer than 1024).
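A small PyTorch sketch of equations (1) and (2), estimating cluster-level attention scores from centroids and key-cluster sizes (variable names and shapes are illustrative):

```python
import torch

def cluster_scores(q_centroids, k_centroids, k_sizes):
    """Approximate cluster-level attention scores P'_{ij} from centroids (Eqs. 1-2).
    q_centroids: (C_q, d), k_centroids: (C_k, d), k_sizes: (C_k,) token counts."""
    d_k = q_centroids.shape[-1]
    S = q_centroids @ k_centroids.T / d_k ** 0.5          # Eq. (1): centroid dot products
    S = S - S.max(dim=-1, keepdim=True).values            # stabilize exp (does not change Eq. 2)
    weighted = k_sizes[None, :] * torch.exp(S)            # weight by key-cluster size |K_j|
    return weighted / weighted.sum(dim=-1, keepdim=True)  # Eq. (2): normalize per query cluster

Cq, Ck, d = 100, 500, 128
P_approx = cluster_scores(torch.randn(Cq, d), torch.randn(Ck, d),
                          torch.randint(1, 64, (Ck,)).float())   # shape (C_q, C_k)
```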

Dynamic Adjustment of Computation Budget (Top-p Selection): To dynamically control the number of critical tokens, SVG2 employs a Top-p selection strategy.

  1. All potential Key clusters for a given Query cluster are sorted in descending order based on their approximated attention scores P'_{ij}.
  2. Key clusters are then sequentially selected from this sorted list until the cumulative sum of $P'_{ij}$ reaches a predefined target value $p$. This allows the computational budget to be allocated dynamically based on the desired level of attention recall or generation quality, without manual adjustment of a fixed number of tokens.
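Continuing the previous sketch, a minimal implementation of the top-p step for one query cluster might look like this (the threshold value and tensor shapes are illustrative):

```python
import torch

def top_p_clusters(P_row, p=0.9):
    """Select key clusters for one query cluster until their cumulative score reaches p."""
    scores, order = torch.sort(P_row, descending=True)
    cum = torch.cumsum(scores, dim=0)
    keep = int((cum < p).sum().item()) + 1   # smallest prefix whose cumulative sum >= p
    return order[:keep]                      # indices of the selected key clusters

# Example: one row of approximate cluster-level attention scores (sums to 1)
row = torch.softmax(torch.randn(500), dim=0)
selected = top_p_clusters(row, p=0.9)
```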

4.2.3. Efficient System-Algorithm Co-design

To make SVG2 practical and efficient, the authors introduce several system-algorithm co-designs.

Fast k-means with Centroid Cache: k-means clustering can be computationally intensive due to its iterative nature, potentially taking many iterations to converge. This can add significant latency, especially if performed for every attention head and layer at each denoising step.

  • Observation: DiTs tend to exhibit similar latent space activations between consecutive denoising steps [59, 64]. This implies that the cluster centroids from one step might be good initial guesses for the next.
  • Solution: SVG2 implements a centroids cache. This cache stores the centroids computed in the previous denoising step and reuses them as initialization points for k-means in the current step. This "warm-starts" the k-means algorithm, drastically reducing the number of iterations required for convergence. The paper reports this technique can reduce k-means runtime by up to 76x.
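The centroid-cache idea can be sketched as warm-starting k-means with the centroids from the previous denoising step; the cache keying by (layer, head) and the iteration counts below are assumptions for illustration, not details from the paper's code:

```python
import numpy as np

centroid_cache = {}  # (layer, head) -> centroids from the previous denoising step

def kmeans_warm(tokens, k, init=None, iters=20, seed=0):
    """k-means that can start from cached centroids; warm starts need few iterations."""
    rng = np.random.default_rng(seed)
    centroids = init.copy() if init is not None else \
        tokens[rng.choice(len(tokens), k, replace=False)].copy()
    for _ in range(iters):
        labels = np.linalg.norm(tokens[:, None] - centroids[None], axis=-1).argmin(1)
        for c in range(k):
            if (labels == c).any():
                centroids[c] = tokens[labels == c].mean(0)
    return labels, centroids

def cluster_with_cache(tokens, k, layer, head):
    init = centroid_cache.get((layer, head))                    # warm start if available
    labels, centroids = kmeans_warm(tokens, k, init=init,
                                    iters=2 if init is not None else 20)
    centroid_cache[(layer, head)] = centroids                   # reuse at the next denoising step
    return labels, centroids
```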

Efficient Sparse Attention Kernel for Varied Block Sizes: Standard efficient attention implementations (e.g., FlashAttention [65], FlexAttention [66], FlashInfer [16]) are optimized for block-wise sparse computation but typically assume static, fixed block sizes (e.g., $128 \times 128$).

  • Challenge: The semantic-aware permutation naturally generates clusters (blocks) of dynamic and diverse sizes. For example, a Query cluster might have 128 tokens, while a Key cluster it attends to might only have 32 tokens. If a fixed block size (e.g., $128 \times 128$) were used, the $128 \times 32$ computation would need padding to fit, leading to 75% computation waste.
  • Solution: SVG2 implements a customized attention kernel that explicitly supports dynamic block sizes as input.
    • Hardware Compatibility: This kernel is designed to be compatible with both FlashAttention-2 (FA2) (for A100 GPUs) and FlashAttention-3 (FA3) (for H100 GPUs).

    • Sparse Loading and Dense Computation:

      • For Query tokens: These are loaded contiguously from memory, as they are already grouped after permutation.
      • For Key/Value tokens: Since Key/Value clusters can still be scattered in global memory (relative to other clusters), the kernel uses per-token address offsets to perform sparse loading. These sparsely loaded Key/Value tokens are then stored in shared memory in a contiguous layout.
    • MMA Instructions: This contiguous layout in shared memory allows the GPU to efficiently utilize Matrix Multiply Accumulate (MMA) instructions (e.g., wgmma (m64n64k16) for FA3) without the need for expensive padding. This design achieves over 85% of the theoretical maximum performance, providing significant efficiency gains.

      Figure 5 shows the overall process: (a) Original attention map with different colors representing different semantics. Only tokens with similar semantics have high attention scores. (b) After k-means clustering, semantically similar tokens are grouped, and Query and Key centroids are used to represent cluster-level semantics. These centroids are then used to estimate attention scores for accurate critical token identification. (c) Combined with Top-p selection, critical tokens are dynamically identified in a contiguous layout due to the semantic-aware permutation, enabling efficient computation.
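To illustrate the sparse-loading-plus-dense-computation idea at a high level, the PyTorch emulation below gathers each query cluster's selected key/value tokens into a contiguous buffer and runs dense attention on it; the real SVG2 kernel performs this gather at the shared-memory and MMA-instruction level inside a CUDA kernel, so this is only a conceptual stand-in:

```python
import torch
import torch.nn.functional as F

def block_sparse_attention(Q, K, V, q_cluster_ids, selected_k_tokens):
    """Emulate cluster-wise sparse attention for one head.
    q_cluster_ids: (N_q,) cluster id per query token.
    selected_k_tokens: dict {cluster_id: LongTensor of selected key/value token indices}."""
    O = torch.zeros_like(Q)
    for cid, k_idx in selected_k_tokens.items():
        q_idx = (q_cluster_ids == cid).nonzero(as_tuple=True)[0]  # contiguous after permutation
        Kc, Vc = K[k_idx], V[k_idx]                               # "sparse load" into a dense buffer
        attn = F.softmax(Q[q_idx] @ Kc.T / Q.shape[-1] ** 0.5, dim=-1)
        O[q_idx] = attn @ Vc                                      # dense computation, no padding
    return O

# Toy usage: 4 query clusters, each attending to 25% of the key tokens
N, d = 256, 64
Q, K, V = torch.randn(N, d), torch.randn(N, d), torch.randn(N, d)
cluster_ids = torch.randint(0, 4, (N,))
selection = {c: torch.randperm(N)[:64] for c in range(4)}
out = block_sparse_attention(Q, K, V, cluster_ids, selection)
```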

5. Experimental Setup

5.1. Datasets

The experiments evaluate SVG2 on state-of-the-art video generation DiT models and benchmark datasets.

  • Models for Video Generation:

    • Wan2.1-I2V/T2V-14B [2]: A large-scale video generative model by Ang Wang et al. (2025). The paper specifies 720p resolution for generated videos. When tokenized by 3D-VAE (a 3D Variational Autoencoder used for encoding video frames into a lower-dimensional latent space), Wan2.1 generates 21 frames with 3600 tokens per frame.
    • HunyuanVideo-T2V-13B [1]: A systematic framework for large video generative models by Weijie Kong et al. (2025). Also generates 720p resolution videos. This model processes 33 frames with 3600 tokens per frame.
  • Benchmark Datasets/Prompts:

    • Text-to-Video (T2V) Generation: For T2V tasks, the authors adopt prompts from the Penguin Benchmark after prompt optimization provided by the VBench team. A prompt is a textual description used to guide the video generation process.
    • Image-to-Video (I2V) Generation: For I2V tasks, the prompt-image pairs provided by VBench [67] are used. Images are cropped to 16:9 ratios to match the 720p resolution target.
    • VBench [67]: A comprehensive benchmark suite for video generative models, used to evaluate various aspects of video quality.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to assess both the quality of generated videos and the efficiency of the sparse attention mechanisms.

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a quality metric that quantifies the difference between a generated video frame and a reference frame (e.g., from the original dense attention generation). It is calculated based on the Mean Squared Error (MSE) between the pixel values of the two images. Higher PSNR values indicate that the generated image is closer to the reference image, implying higher fidelity and less degradation.
  • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
  • Symbol Explanation:
    • $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image, or for each color channel).
    • $\mathrm{MSE}$: The Mean Squared Error between the two images. For two images $I$ and $K$ of size $m \times n$: $ \mathrm{MSE} = \frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
      • I(i,j): The pixel value at coordinates (i,j) in the original (reference) image.
      • K(i,j): The pixel value at coordinates (i,j) in the generated image.
      • m, n: The height and width of the images in pixels.

5.2.2. Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is a perceptual metric designed to assess the perceived quality of an image by modeling how the human visual system works. Unlike PSNR which focuses on absolute errors, SSIM considers aspects like luminance, contrast, and structure. A value of 1 indicates perfect structural similarity, while values closer to 0 indicate less similarity.
  • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  • Symbol Explanation:
    • x, y: Two image patches being compared.
    • $\mu_x, \mu_y$: The average (mean) pixel values of $x$ and $y$, respectively.
    • $\sigma_x^2, \sigma_y^2$: The variances of $x$ and $y$, respectively.
    • $\sigma_{xy}$: The covariance of $x$ and $y$.
    • $c_1 = (K_1 L)^2, c_2 = (K_2 L)^2$: Small constants to prevent division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $K_1, K_2$ are small constants (e.g., $K_1 = 0.01$, $K_2 = 0.03$).

5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)

  • Conceptual Definition: LPIPS measures the perceptual distance between two images, aiming to align with human judgments of similarity. It works by extracting features from a pre-trained deep neural network (e.g., VGG or AlexNet) and then calculating the L2 distance between these feature representations. A lower LPIPS score indicates higher perceptual similarity.

5.2.4. VBench

  • Conceptual Definition: VBench is a comprehensive benchmark suite specifically designed to evaluate various attributes of video generative models. It includes metrics for subject consistency (SubConsis), background consistency (BackConsis), motion smoothness (MotionSmooth), aesthetic quality (AesQual), and image quality (ImagQual). The Average score provides an overall quality assessment. Higher scores indicate better performance across these attributes.

5.2.5. Density

  • Conceptual Definition: Density quantifies the computational budget used by a sparse attention mechanism. It is defined as the ratio of the sparse attention computation (e.g., FLOPs or number of Query-Key pairs computed) to the full attention computation. A lower density indicates greater efficiency in terms of computational resources used for attention.

5.2.6. FLOPs (Floating Point Operations)

  • Conceptual Definition: FLOPs measure the total number of arithmetic operations (e.g., additions, multiplications) performed during video generation. In this context, it is used to quantify the overall computational cost. A lower FLOPs count means the generation process is more computationally efficient. The unit PFLOPs stands for PetaFLOPs, or $10^{15}$ floating-point operations.

5.2.7. Speedup

  • Conceptual Definition: Speedup measures how much faster a task runs with the optimized method (SVG2) compared to the baseline (dense attention). It is calculated as the ratio of the baseline's execution time to SVG2's execution time, or inversely, the ratio of the FLOPs saved. A 2x speedup means the process is twice as fast. The paper reports both end-to-end speedup (total time for video generation) and attention speedup (time saved specifically in the attention module).

5.3. Baselines

The proposed SVG2 method is compared against several state-of-the-art sparse attention algorithms:

  • Static Method:

    • Sparse VideoGen (SVG) [4]: This is a prior work from some of the same authors, which accelerates video Diffusion Transformers using spatial-temporal sparsity based on static patterns.
  • Dynamic Methods:

    • SpargeAttention [9]: A dynamic sparse attention method that focuses on accurate sparse attention for accelerating any model inference, typically using block-level approximation.

    • XAttention [10]: Another dynamic block sparse attention method, which employs antidiagonal scoring for efficient visual generation models. The paper notes that XAttention was not evaluated on Wan2.1 due to lack of support.

      These baselines represent different strategies for implementing sparse attention, ranging from static patterns to dynamic, block-wise approaches, providing a comprehensive comparison for SVG2.

5.4. Implementations

  • Framework: SVG2 is prototyped as an end-to-end framework.
  • Customized Kernels: The customized kernels are built using FlashInfer [16], an efficient and customizable attention engine.
  • Hardware: Benchmarking is conducted on an NVIDIA H100 GPU.
  • Software: CUDA 12.8.
  • Configuration:
    • For SVG2, the number of query clusters ($C_q$) is set to 100 and the number of key clusters ($C_k$) to 500. The rationale for this choice is discussed in the ablation study (Section D.2 of the appendix).
    • Experiments are conducted with sparse attention skipped during the first 30% of denoising steps for all methods. This warmup strategy is common in diffusion models [64, 68, 56, 59], as initial steps are critical for overall generation quality. Results without warmup are provided in the appendix.
    • Various accuracy targets (i.e., attention score recall) are used to evaluate the trade-off between generation quality and efficiency. A single data point for detailed comparison is presented in Table 1.

6. Results & Analysis

6.1. Core Results Analysis

The experiments aim to validate SVG2's effectiveness in accelerating video generation while maintaining high quality, comparing it against state-of-the-art baselines.

6.1.1. Qualitative Evaluation

Figure 6 visually demonstrates the effectiveness of SVG2's semantic-aware permutation. The following figure (Figure 6 from the original paper) shows the visualization of attention maps:

Figure 6: Visualization of attention maps from different attention heads in Wan 2.1 when generating videos from VBench [67]. (a) Original attention maps with diverse sparse patterns; (b) attention maps after semantic-aware permutation; (c) attention maps recovered after applying centroid-based top-p selection, illustrating the effectiveness of SVG2.

  • (a) Original Attention Maps: These show diverse sparse patterns across different attention heads in Wan2.1 during video generation. Critical tokens (highlighted in red) are often scattered.

  • (b) Permuted Attention Maps: After semantic-aware permutation based on k-means clustering, the critical tokens are reorganized into a contiguous layout. This transformation is crucial for efficient block-wise computation on GPUs without computation waste.

  • (c) Recovered Attention Maps: By applying centroid-based Top-p selection to the permuted map and then undoing the permutation, the attention map is recovered to its original layout. The high similarity between the recovered and original attention maps visually confirms SVG2's ability to accurately identify and process critical tokens while preserving the intended attention patterns.

    Further qualitative comparison in the appendix (Figure 9 and Figure 10) shows that SVG2 generates videos with high pixel-level fidelity, closely matching the quality of dense attention for both HunyuanVideo and Wan 2.1. The following figure (Figure 9 from the original paper) shows the comparison of Dense Attention and SVG2 on HunyuanVideo and Wan 2.1 Text-to-Video generation:

Figure 9: Comparison of Dense Attention and SVG2 on HunyuanVideo and Wan 2.1 Text-to-Video generation. In each pair, the left side shows the Dense Attention result and the right side shows SVG2.

The following figure (Figure 10 from the original paper) shows the comparison of Dense Attention and SVG2 on Wan 2.1 Image-to-Video generation:

Figure 10: Comparison of Dense Attention and SVG2 on Wan 2.1 Image-to-Video generation.

6.1.2. Quantitative Evaluation of Quality and Efficiency

The quantitative results, presented in Table 1, compare SVG2 against baselines (SpargeAttn, SVG, XAttention) across Wan 2.1 and HunyuanVideo models. The table includes metrics for generation quality (PSNR, SSIM, LPIPS, VBench) and efficiency (Density, FLOPs, Speedup), with a 30% warmup setting.

The following are the results from Table 1 of the original paper:

| Model | Config | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VBench ↑ | Density ↓ | FLOPs ↓ | Speedup ↑ |
|---|---|---|---|---|---|---|---|---|
| Wan 2.1 (14B, 720P, Image-to-Video) | Dense | - | - | - | 0.841 | 100% | 526.76 PFLOPs | - |
| | SpargeAttn | 21.181 | 0.665 | 0.333 | - | 38.99% | 366.80 PFLOPs | 1.47× |
| | SVG | 24.059 | 0.813 | 0.174 | 0.836 | 30.25% | 343.88 PFLOPs | 1.56× |
| | Ours | 26.562 | 0.861 | 0.138 | 0.838 | 31.28% | 346.59 PFLOPs | 1.58× |
| | Ours-Turbo | 24.510 | 0.812 | 0.179 | 0.836 | 14.13% | 301.62 PFLOPs | 1.84× |
| Wan 2.1 (14B, 720P, Text-to-Video) | Dense | - | - | - | 0.846 | 100% | 658.46 PFLOPs | - |
| | SpargeAttn | 20.519 | 0.623 | 0.343 | 0.820 | 42.03% | 468.46 PFLOPs | 1.44× |
| | SVG | 22.989 | 0.785 | 0.199 | 0.837 | 30.25% | 429.86 PFLOPs | 1.58× |
| | Ours | 25.808 | 0.854 | 0.138 | 0.842 | 29.51% | 427.43 PFLOPs | 1.60× |
| | Ours-Turbo | 23.682 | 0.789 | 0.196 | 0.838 | 12.87% | 372.89 PFLOPs | 1.89× |
| HunyuanVideo (13B, 720P, Text-to-Video) | Dense | - | - | - | 0.850 | 100% | 612.38 PFLOPs | - |
| | SpargeAttn | 27.892 | 0.884 | 0.151 | - | 42.62% | 399.16 PFLOPs | 1.53× |
| | XAttention | 28.892 | 0.898 | 0.120 | 0.839 | 39.32% | 386.90 PFLOPs | 1.56× |
| | SVG | 29.157 | 0.905 | 0.120 | 0.845 | 29.86% | 351.75 PFLOPs | 1.91× |
| | SVG + FP8 | 29.033 | 0.902 | 0.121 | 0.843 | 29.86% | 351.75 PFLOPs | 2.3× |
| | Ours | 30.452 | 0.910 | 0.117 | 0.852 | 25.45% | 335.36 PFLOPs | 2.30× |
| | Ours + FP8 | 30.389 | 0.908 | 0.118 | 0.851 | 25.45% | 335.36 PFLOPs | 2.55× |

Key observations:

  • Superior Quality: SVG2 consistently achieves the highest quality scores across all metrics (PSNR, SSIM, LPIPS, VBench) for both Wan 2.1 and HunyuanVideo models. For example, on HunyuanVideo, SVG2 achieves a PSNR of 30.452, SSIM of 0.910, and LPIPS of 0.117, outperforming all baselines.
  • Highest Speedup: Despite its superior quality, SVG2 also achieves the highest end-to-end speedup. For HunyuanVideo, SVG2 provides a 2.30x speedup while using a density of 25.45%. When combined with FP8 (8-bit floating point precision), the speedup further increases to 2.55x.
  • Pareto Frontier: The results demonstrate that SVG2 offers a superior trade-off, achieving better quality at comparable or lower density (i.e., less computation) than other methods. This positions SVG2 on the Pareto frontier of the quality-efficiency trade-off, meaning it provides the best possible quality for a given efficiency level.
  • Ours-Turbo Configuration: The Ours-Turbo variants showcase SVG2's potential for even higher speedups by accepting a slight reduction in quality. For instance, Wan 2.1 I2V Ours-Turbo achieves 1.84x speedup with a density of 14.13%, still maintaining a PSNR (24.510) comparable to or better than SVG (24.059) at a much lower density. This flexibility is valuable for different application scenarios.

6.2. Ablation Studies / Parameter Analysis

6.2.1. Efficiency Evaluation for Fast k-means with Centroids Cache

To demonstrate the effectiveness of the centroids cache, the paper compares the runtime of k-means when varying the number of iterations required to achieve a 90% attention recall.

The following figure (Figure 7 from the original paper) shows the efficiency evaluation for fast k-means with centroids cache and customized attention kernel:

Figure 7: Efficiency evaluation for fast k-means with centroids cache and customized attention kernel. (a) Latency versus density across denoising steps, showing that enabling the centroids cache significantly reduces computation time; (b) comparison of the dynamic-block kernel against static-block baselines across different cluster counts.

  • Figure 7(a) illustrates that enabling the centroids cache significantly reduces the end-to-end latency of k-means. With the cache enabled, k-means achieves comparable or even lower density (meaning better quality identification) with a drastic reduction in execution time, showing a 76x speedup for k-means. This validates the importance of reusing centroids from previous denoising steps to mitigate the overhead of k-means clustering.

6.2.2. Efficiency Evaluation for Customized Attention Kernel

The efficiency of the customized attention kernel with dynamic block sizes is evaluated by comparing its computation FLOPs against FlashInfer [16] (a state-of-the-art attention library), varying combinations of Query clusters ($C_q$) and Key clusters ($C_k$) while maintaining 90% attention recall.

  • Figure 7(b) demonstrates that the customized kernels achieve an average of 1.48x computation reduction. In a practical setup with $C_q = 100$ and $C_k = 500$, SVG2 achieves a 1.88x reduction in computation waste. This highlights the advantage of SVG2's kernel in efficiently handling the dynamic block sizes produced by semantic-aware permutation, avoiding the padding waste inherent in static-block kernels.

    The following figure (Figure 11 from the original paper) shows the efficiency evaluation for our attention kernel, varying the number of key clusters:

    Figure 11: Efficiency evaluation for our attention kernel. We fix the number of query clusters and vary the number of key clusters. The plot shows the throughput of FA2 (left) and FA3 (right) at different block densities, comparing sparse and dense configurations.

The following figure (Figure 12 from the original paper) shows the efficiency evaluation for our attention kernel, varying the number of query clusters:

Figure 12: Efficiency evaluation for our attention kernel, varying the number of query clusters. The plot shows the throughput of FA2 and FA3 at different block densities; FA2 sustains roughly 340 TFLOPS while FA3 reaches up to 550 TFLOPS, with performance varying across row counts.

Figures 11 and 12 (in Appendix D.1) further detail the kernel performance. They show that kernel throughput drops sharply when $C_q$ exceeds 200, whereas increasing $C_k$ has a far less pronounced effect. This suggests that cluster granularity must be balanced against hardware utilization: once query clusters become too small, the tensor cores processing query rows can no longer be kept fully occupied.
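
To illustrate why scattered critical tokens waste computation on block-based kernels, here is a toy accounting of key tokens processed (the scattering pattern, sequence length, and 25% density are hypothetical; this is not the paper's kernel or measurement):

```python
import random

def tokens_computed_fixed_blocks(critical_positions, block_size=128):
    """Tokens actually processed when rounding up to fixed-size key blocks."""
    touched_blocks = {pos // block_size for pos in critical_positions}
    return len(touched_blocks) * block_size

def tokens_computed_dynamic_blocks(critical_positions):
    """Tokens processed when critical tokens are packed contiguously."""
    return len(critical_positions)

# ~25% of a 75k-token sequence marked critical and scattered uniformly
# (a hypothetical worst case for scattering).
random.seed(0)
scattered = random.sample(range(75_000), 18_750)

fixed = tokens_computed_fixed_blocks(scattered)
dynamic = tokens_computed_dynamic_blocks(scattered)
print(f"fixed-block compute:   {fixed} key tokens")
print(f"dynamic-block compute: {dynamic} key tokens")
print(f"waste factor of fixed blocks: {fixed / dynamic:.1f}x")
```

Real critical tokens are not uniformly scattered, so the measured waste is smaller than this worst case (the 1.88x reduction cited above), but the mechanism is the same: fixed blocks pay for every non-critical token they contain, while cluster-sized blocks after permutation do not.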

6.2.3. Sensitivity Test on Quality-Efficiency Trade-off

To validate SVG2's ability to achieve a superior trade-off, a comprehensive evaluation on Wan2.1-I2V-14B is conducted across a wide range of computational budgets (density).

The following figure (Figure 2 from the original paper) shows the trade-off curves between generation quality (PSNR) and efficiency (density):

Figure 2: Trade-off curves between generation quality (PSNR) and efficiency (density). SVG2 consistently surpasses existing methods given the same density, achieving a Pareto frontier.
The plot shows PSNR versus density for each method; SVG2 lies on the Pareto frontier, reaching a PSNR close to 26 and allowing a density reduction of up to 2.3x at matched quality.

  • Figure 2 clearly shows that SVG2 consistently achieves better generation quality (higher PSNR) at any given density compared to baseline methods. This positions SVG2 on the Pareto frontier of the quality-efficiency trade-off, indicating its optimal performance across various sparsity levels. Specifically, SVG2 can reduce density by up to 2.3x while maintaining the same PSNR as its counterparts.

6.2.4. Ablation Study on Semantic-Aware Permutation

Effectiveness on Improving Identification Accuracy: The impact of semantic-aware permutation on identification accuracy is assessed by comparing attention recall (how many critical tokens are correctly identified) with and without permutation, keeping mean-pooling and cluster size consistent.

The following figure (Figure 8 from the original paper) shows attention recall across various densities:

Figure 8: Attention recall across various densities. Enabling permutation consistently surpasses disabling permutation.
The plot shows attention recall across densities with the k-means-based permutation enabled (red triangles) and disabled (gray triangles); recall is consistently higher when the permutation is enabled.

  • Figure 8 demonstrates that enabling semantic-aware permutation consistently achieves higher attention recall across various densities. This improvement is attributed to the formation of semantic-coherent clusters, which provide more precise representations for identifying critical tokens. This directly supports the claim that semantic-aware clustering improves the accuracy of critical token identification.
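
As a reference for how such a recall number can be computed, below is a hedged sketch of an attention-recall metric, defined as the fraction of each query's attention mass that falls on the selected keys (the paper's exact definition may differ in normalization or per-head aggregation):

```python
import torch

def attention_recall(q, k, selected_mask):
    """Mean fraction of each query's attention mass that falls on selected keys.

    q: (Nq, d), k: (Nk, d), selected_mask: (Nq, Nk) boolean mask of kept pairs.
    """
    attn = torch.softmax(q @ k.T / q.size(-1) ** 0.5, dim=-1)   # full attention map
    recalled = (attn * selected_mask).sum(dim=-1)               # mass on kept keys
    return recalled.mean().item()

# Toy check with random features and a fixed ~30% key budget per query.
q = torch.randn(256, 64)
k = torch.randn(1024, 64)
mask = torch.zeros(256, 1024, dtype=torch.bool)
mask[:, :300] = True
print(f"attention recall at ~30% density: {attention_recall(q, k, mask):.3f}")
```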

Effectiveness on Reducing Computation Waste: The impact on computation waste is evaluated by comparing the computational overhead with and without semantic-aware permutation, using the exact same set of critical tokens selected by centroid-based Top-p selection.

  • The results show that enabling semantic-aware permutation reduces computational overhead by an average of 36%. This confirms that reordering scattered critical tokens into a contiguous layout significantly minimizes computation waste by better utilizing GPU hardware.
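
The reordering itself is just an index permutation and its inverse, so it changes the memory layout without changing the attention result. A minimal sketch with toy shapes and cluster labels:

```python
import torch

tokens = torch.randn(10, 4)                             # (N, d) token features
labels = torch.tensor([2, 0, 1, 2, 0, 1, 2, 0, 1, 0])   # cluster id per token

perm = torch.argsort(labels, stable=True)               # cluster-sorted order
permuted = tokens[perm]                                 # contiguous per-cluster rows

# ... block-sparse attention would operate on `permuted` here ...

inverse_perm = torch.argsort(perm)                      # undo the reordering
restored = permuted[inverse_perm]
assert torch.equal(restored, tokens)                    # the permutation is lossless
```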

6.2.5. Ablation on the Number of Clusters

The paper investigates the impact of the number of Query clusters ($C_q$) and Key clusters ($C_k$) on both PSNR and end-to-end efficiency.

The following are the results from Table 5 of the original paper:

| $C_q$ | $C_k$ | PSNR | SSIM | LPIPS | Speedup |
| --- | --- | --- | --- | --- | --- |
| 100 | 250 | 25.497 | 0.801 | 0.182 | 1.90x |
| 100 | 1000 | 26.276 | 0.825 | 0.159 | 1.71x |
| 50 | 500 | 22.561 | 0.742 | 0.258 | 1.90x |
| 200 | 500 | 26.213 | 0.820 | 0.157 | 1.78x |
| 400 | 500 | 26.488 | 0.868 | 0.132 | 1.25x |
| 100 | 500 | 26.128 | 0.816 | 0.169 | 1.89x |

  • Table 5 shows that setting $C_q = 100$ and $C_k = 500$ provides the best balance between generation quality and efficiency (1.89x speedup at 26.128 PSNR).
  • Increasing the number of clusters generally improves quality up to a point but can degrade efficiency. This is because tensor cores on NVIDIA GPUs require minimum fixed input tile sizes (e.g., 64 rows for the m64n64k16 shape). If the average cluster size falls below this threshold (i.e., $N_q / C_q < 64$), the hardware is underutilized, reducing efficiency despite the quality gains from finer-grained clustering. For example, $C_q = 400$ with $C_k = 500$ yields a higher PSNR but a much lower speedup (1.25x).
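
The tile-size argument can be illustrated with a small simulation (the cluster-size distribution, token count, and 64-row tile below are illustrative assumptions; real k-means cluster sizes and kernel behavior will differ):

```python
import numpy as np

# Each cluster of s query rows occupies ceil(s / 64) * 64 rows of tensor-core
# work when tiled with a 64-row MMA shape, so small clusters waste rows.
rng = np.random.default_rng(0)
N_q = 75_000                                  # hypothetical query-token count

for C_q in (50, 100, 200, 400, 800):
    # Uneven cluster sizes, loosely mimicking what k-means produces.
    sizes = rng.multinomial(N_q, rng.dirichlet(np.ones(C_q)))
    issued_rows = (np.ceil(sizes / 64) * 64).sum()
    print(f"C_q={C_q:4d}  useful rows / issued rows = {N_q / issued_rows:.2f}")
```

Under this simulation, row utilization stays high for the cluster counts used in the paper and degrades as clusters shrink toward the tile size, matching the trend in Table 5.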

6.2.6. Ablation on Permutation

The authors investigate if Query and Key representations can share the same clustering strategy (i.e., using a single permutation matrix for both).

The following are the results from Table 6 of the original paper:

| Permutation used by $Q$ | Permutation used by $K$ | Density | PSNR |
| --- | --- | --- | --- |
| $\pi_Q$ | $\pi_K$ | 31.28% | 26.562 |
| $\pi_Q$ | $\pi_Q$ | 38.23% | 22.439 |
| $\pi_K$ | $\pi_K$ | 38.58% | 22.183 |
| $\pi_S$ | $\pi_S$ | 87.27% | 26.495 |

  • Table 6 shows that using an independent Query permutation ($\pi_Q$) and Key permutation ($\pi_K$) (31.28% density, 26.562 PSNR) yields superior performance compared to applying the same permutation to both $Q$ and $K$ (e.g., $\pi_Q$ for both: 38.23% density, 22.439 PSNR).
  • Even clustering based on the hidden states before the QKV linear layer (a shared permutation $\pi_S$) gives a slightly lower PSNR at a much higher density (87.27%).
  • The Adjusted Rand Index (ARI) between the Q clusters and K clusters is 0.345, indicating substantial differences in their partitions. This suggests that $Q$ and $K$ capture distinct aspects of semantic relationships, so clustering them independently is necessary to preserve the full expressiveness of attention.
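
For readers who want to reproduce this kind of ARI comparison on their own activations, a minimal sketch follows (the random features, cluster counts, and use of scikit-learn's KMeans are stand-ins; the paper clusters real Q/K activations):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
q = rng.standard_normal((8_192, 128)).astype(np.float32)   # stand-in Q features
k = rng.standard_normal((8_192, 128)).astype(np.float32)   # stand-in K features

q_labels = KMeans(n_clusters=100, n_init=1, random_state=0).fit_predict(q)
k_labels = KMeans(n_clusters=500, n_init=1, random_state=0).fit_predict(k)

# ARI near 0 means the two partitions are unrelated; the paper reports 0.345
# on real activations, i.e., only partial agreement between Q and K clusters.
print("ARI(Q clusters, K clusters) =", adjusted_rand_score(q_labels, k_labels))
```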

6.3. Performance Comparison in Warmup-free Setting

Appendix B (Table 2) provides results without the 30% warmup steps at the beginning of denoising.

The following are the results from Table 2 of the original paper:

| Model | Config / Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | VBench ↑ | Density ↓ | FLOPs ↓ | Attn Speedup ↑ | Speedup ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wan 2.1 | 14B, 720P, Image-to-Video (dense) | - | - | - | 0.841 | 100% | 526.76 PFLOPs | - | - |
| | SVG | 15.608 | 0.512 | 0.404 | 0.823 | 29.54% | 262.85 PFLOPs | 2.26× | 1.86× |
| | Ours | 18.276 | 0.615 | 0.317 | 0.832 | 29.34% | 262.10 PFLOPs | 2.95× | 2.10× |
| Wan 2.1 | 14B, 720P, Text-to-Video (dense) | - | - | - | 0.851 | 100% | 658.46 PFLOPs | - | - |
| | SVG | 13.294 | 0.407 | 0.512 | 0.849 | 29.54% | 328.56 PFLOPs | 2.28× | 1.89× |
| | Ours | 16.502 | 0.562 | 0.373 | 0.852 | 30.12% | 331.28 PFLOPs | 2.98× | 2.13× |
| HunyuanVideo | 13B, 720P, Text-to-Video (dense) | - | - | - | 0.820 | 100% | 612.38 PFLOPs | - | - |
| | SVG | 12.298 | 0.492 | 0.483 | 0.808 | 29.86% | 240.05 PFLOPs | 3.45× | 2.48× |
| | Ours | 19.879 | 0.735 | 0.260 | 0.816 | 28.94% | 235.16 PFLOPs | 4.06× | 2.69× |

  • In the warmup-free setting, the absolute PSNR values of all sparse methods are markedly lower than in the 30% warmup setting, indicating that the initial denoising steps are indeed crucial for quality.
  • However, SVG2 consistently offers better quality (higher PSNR, SSIM, VBench, lower LPIPS) than SVG at comparable densities. For example, on HunyuanVideo, SVG2 achieves 19.879 PSNR (vs SVG's 12.298) with 28.94% density (vs SVG's 29.86%), resulting in a 2.69x end-to-end speedup (vs SVG's 2.48x). This further reinforces SVG2's robustness and effectiveness even under more challenging conditions.
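
For clarity, here is a minimal sketch of the warmup schedule this comparison removes: dense attention for the first 30% of denoising steps, sparse attention afterwards. The `denoise_step` hook, its `sparse_attention` flag, and the step count are placeholders, not APIs from the SVG2 codebase.

```python
def run_denoising(model, latents, num_steps=50, warmup_ratio=0.30):
    """Dense attention for the first 30% of steps, sparse attention afterwards."""
    warmup_steps = int(num_steps * warmup_ratio)
    for step in range(num_steps):
        use_sparse = step >= warmup_steps      # warmup-free setting: always sparse
        latents = model.denoise_step(latents, step, sparse_attention=use_sparse)
    return latents
```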

6.4. VBench Results

Appendix C (Table 3 and Table 4) provides the full VBench results for both warmup-free and 30% warmup settings.

The following are the results from Table 3 of the original paper (warmup-free setting):

| Model | Config / Method | SubConsis | BackConsis | MotionSmooth | AesQual | ImagQual | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wan 2.1 | 14B, 720P, Image-to-Video (dense) | 0.946 | 0.956 | 0.979 | 0.618 | 0.709 | 0.841 |
| | SVG | 0.916 | 0.935 | 0.976 | 0.591 | 0.698 | 0.823 |
| | Ours | 0.936 | 0.946 | 0.977 | 0.597 | 0.700 | 0.832 |
| Wan 2.1 | 14B, 720P, Text-to-Video (dense) | 0.970 | 0.970 | 0.992 | 0.612 | 0.708 | 0.851 |
| | SVG | 0.963 | 0.969 | 0.991 | 0.612 | 0.708 | 0.849 |
| | Ours | 0.971 | 0.970 | 0.992 | 0.624 | 0.707 | 0.852 |
| HunyuanVideo | 13B, 720P, Text-to-Video (dense) | 0.888 | 0.938 | 0.994 | 0.594 | 0.685 | 0.820 |
| | SVG | 0.867 | 0.930 | 0.991 | 0.594 | 0.656 | 0.808 |
| | Ours | 0.888 | 0.935 | 0.994 | 0.589 | 0.675 | 0.816 |

The following are the results from Table 4 of the original paper (30% warmup setting):

| Model | Config / Method | SubConsis | BackConsis | MotionSmooth | AesQual | ImagQual | Average |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wan 2.1 | 14B, 720P, Image-to-Video (dense) | 0.946 | 0.956 | 0.979 | 0.618 | 0.709 | 0.841 |
| | SVG | 0.941 | 0.948 | 0.978 | 0.606 | 0.709 | 0.836 |
| | Ours | 0.943 | 0.951 | 0.977 | 0.606 | 0.709 | 0.838 |
| Wan 2.1 | 14B, 720P, Text-to-Video (dense) | 0.956 | 0.968 | 0.983 | 0.613 | 0.713 | 0.846 |
| | SpargeAttn | 0.927 | 0.948 | 0.978 | 0.567 | 0.684 | 0.820 |
| | SVG | 0.947 | 0.960 | 0.980 | 0.597 | 0.703 | 0.837 |
| | Ours | 0.954 | 0.965 | 0.982 | 0.602 | 0.709 | 0.842 |
| HunyuanVideo | 13B, 720P, Text-to-Video (dense) | 0.915 | 0.941 | 0.993 | 0.648 | 0.753 | 0.850 |
| | XAttention | 0.912 | 0.924 | 0.992 | 0.631 | 0.739 | 0.839 |
| | SVG | 0.914 | 0.928 | 0.993 | 0.652 | 0.739 | 0.845 |
| | Ours | 0.917 | 0.946 | 0.993 | 0.657 | 0.751 | 0.852 |

  • The full VBench results confirm that SVG2 consistently outperforms other baselines across various quality aspects (e.g., SubConsis, BackConsis, MotionSmooth, AesQual, ImagQual) and overall Average score, in both warmup and warmup-free settings. This shows SVG2's ability to maintain not just pixel-level fidelity but also higher-level perceptual and temporal consistency in generated videos.

6.5. Performance Gap between HunyuanVideo and Wan 2.1

The paper addresses observed performance differences between HunyuanVideo and Wan 2.1.

6.5.1. Quality Difference

  • Wan 2.1 generally exhibits lower PSNR and SSIM and higher LPIPS (i.e., worse fidelity to the dense reference) than HunyuanVideo across all sparse methods.
  • The reason stated is Wan 2.1's high sensitivity to precision variance and numerical changes, even across different backend implementations (e.g., FlexAttention, FlashAttention, Torch SDPA). HunyuanVideo is more robust. This means SVG2, which introduces some numerical approximations via sparse attention, naturally achieves lower PSNR on the more sensitive Wan 2.1. This difference is model-specific and not indicative of SVG2's methodological performance.

6.5.2. Speedup Difference

  • The end-to-end speedup on Wan 2.1 (up to 1.89x) is lower than on HunyuanVideo (up to 2.30x).
  • This difference primarily stems from varying attention cost ratios in the two models, driven by different context lengths and model architectures. HunyuanVideo has a longer context length (118k) compared to Wan 2.1's (75k). Additionally, HunyuanVideo's layers consist mainly of Self-Attention and Feed-Forward Networks, while Wan 2.1 includes an additional cross-attention block.
  • Consequently, the attention module constitutes a larger proportion of HunyuanVideo's total runtime. Since SVG2 primarily accelerates the attention module, its overall speedup naturally scales with the attention module's contribution to the total runtime. This explanation clarifies that SVG2's effectiveness is consistent, but its impact on end-to-end speedup varies depending on the baseline model's architecture.
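
This relationship is essentially Amdahl's law. The sketch below makes it concrete with assumed attention-time fractions; the 0.80 and 0.65 figures and the 3.0x attention speedup are illustrative, not measurements from the paper.

```python
def end_to_end_speedup(attention_fraction, attention_speedup):
    """Amdahl's law: only the attention fraction of runtime is accelerated."""
    return 1.0 / ((1.0 - attention_fraction) + attention_fraction / attention_speedup)

# Assumed attention-time fractions for illustration only.
for name, frac in [("HunyuanVideo-like (118k context, self-attn + FFN)", 0.80),
                   ("Wan 2.1-like (75k context, extra cross-attn)", 0.65)]:
    print(f"{name}: 3.0x attention speedup -> "
          f"{end_to_end_speedup(frac, 3.0):.2f}x end-to-end")
```

The larger the attention fraction, the closer the end-to-end speedup tracks the attention-level speedup, which is why the longer-context HunyuanVideo benefits more.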

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces SVG2, a novel, training-free framework designed to accelerate Diffusion Transformers (DiTs) for video generation by optimizing sparse attention mechanisms. The core innovation of SVG2 is semantic-aware permutation, which clusters tokens based on semantic similarity using k-means and then reorders them into a contiguous layout. This approach effectively addresses two key limitations of prior sparse attention methods: inaccurate critical token identification (by providing precise cluster representations) and excessive computation waste (by enabling efficient GPU processing without padding).

SVG2 further integrates centroid-based Top-p dynamic budget control for flexible quality-efficiency trade-offs and develops customized kernel implementations to support dynamic block sizes and leverage GPU capabilities. Comprehensive evaluations on HunyuanVideo and Wan 2.1 demonstrate that SVG2 consistently achieves a Pareto frontier trade-off, delivering superior generation quality at any given computational budget. It provides significant end-to-end speedups (up to 2.30x on HunyuanVideo and 1.89x on Wan 2.1) while maintaining high video quality (PSNR up to 30 and 26 respectively). By improving both the accuracy of sparsity identification and the efficiency of sparse computation, SVG2 makes DiT-based video generation more practical and accessible.

7.2. Limitations & Future Work

The major limitation explicitly stated by the authors is the lack of discussion and evaluation on whether the proposed methods can be extended to attention mechanisms other than DiTs. This suggests that while SVG2 is highly effective for DiT-based video generation, its applicability to other Transformer-based models (e.g., LLMs, image DiTs, or other vision Transformers) or different types of attention (e.g., cross-attention in multi-modal models beyond Wan 2.1's specific use) is not fully explored. Future work could involve adapting SVG2's principles to a broader range of Transformer architectures and attention mechanisms.

7.3. Personal Insights & Critique

This paper presents a rigorous and well-executed solution to a critical problem in generative AI. The dual approach of improving identification accuracy and minimizing computation waste through semantic-aware permutation is elegant and addresses fundamental mismatches between algorithmic sparsity and hardware architecture.

Inspirations and Applications:

  • Generalizability of Semantic Clustering: The concept of using semantic clustering (e.g., k-means on QKV activations) to identify critical tokens could be highly beneficial beyond video DiTs. It could be applied to LLMs for long-context windows, where position-based sparsity often struggles to capture global semantic relationships. This could lead to more accurate sparse attention in various Transformer-based models.
  • Hardware-Algorithm Co-design: The emphasis on system-algorithm co-design, particularly the customized kernel for dynamic block sizes, is a crucial takeaway. It highlights that optimizing AI models for real-world deployment requires not just algorithmic innovation but also deep understanding and customization for underlying hardware. This principle is applicable to any domain where sparse operations are performed on parallel accelerators.
  • Dynamic Sparsity Control: The centroid-based Top-p selection offers a flexible way to manage the quality-efficiency trade-off. This dynamic control, driven by learned semantic scores, is a robust mechanism that could be adapted to other adaptive sparsity schemes, allowing users to define their desired performance envelope without extensive manual tuning.
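
As an illustration of how such a top-p budget might be applied at the cluster level, here is a hedged sketch that keeps the highest-scoring key clusters until their cumulative softmax-normalized score reaches p (SVG2's actual selection procedure, e.g., any weighting by cluster size, may differ):

```python
import torch

def top_p_cluster_selection(centroid_scores, p=0.9):
    """Keep key clusters, highest score first, while cumulative probability <= p."""
    probs = torch.softmax(centroid_scores, dim=-1)
    sorted_probs, order = probs.sort(descending=True)
    cumulative = sorted_probs.cumsum(dim=-1)
    keep = cumulative <= p
    keep[0] = True                              # always keep at least the top cluster
    return order[keep]                          # indices of selected clusters

scores = torch.randn(500)                       # hypothetical scores for 500 key clusters
selected = top_p_cluster_selection(scores, p=0.9)
print(f"selected {selected.numel()} / {scores.numel()} clusters")
```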

Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • k-means Overhead and Hyperparameter Sensitivity: While the centroid cache significantly reduces k-means latency, k-means itself still incurs some computational cost and, more importantly, requires choosing the numbers of clusters ($C_q$, $C_k$). The paper finds optimal values empirically (100 and 500), but these might be sensitive to different models, datasets, or video lengths. An adaptive method for determining $C_q$ and $C_k$ on the fly, perhaps based on the inherent density of latent-space activations, could further enhance robustness.

  • Definition of "Semantic Similarity": The paper assumes k-means on $Q$ and $K$ vectors effectively captures "semantic similarity." While this is a reasonable heuristic given that $Q$ and $K$ are used to compute attention scores, the exact nature of this "semantic" grouping might be complex. Further analysis into what the k-means clusters represent (e.g., specific objects, backgrounds, motion types) could provide deeper insights and potentially lead to more semantically informed clustering algorithms.

  • Cold Start Problem for Centroid Cache: The centroid cache is effective after the first few steps. However, the initial denoising steps (the warmup period) or the very first video generation still incur the full k-means cost. While this is amortized over many steps, for very short generation processes, this initial overhead might still be noticeable.

  • Applicability to Cross-Attention: The stated limitation focuses on attention mechanisms other than DiTs. It would be interesting to see how semantic-aware permutation could be applied to cross-attention (e.g., between text embeddings and video tokens). While Wan 2.1 has cross-attention, the paper's main focus is self-attention. The differing roles of Query (from video) and Key/Value (from text) in cross-attention might require modifications to the clustering strategy.

  • Theoretical Guarantees: The paper provides strong empirical evidence. A deeper theoretical analysis of why k-means clustering on $Q$ and $K$ activations yields high attention recall, and of how the permutation preserves mathematical equivalence while remaining hardware-efficient, could strengthen the fundamental understanding of SVG2's performance.

    Overall, SVG2 is a significant advancement in making high-quality video generation with DiTs more efficient, demonstrating an excellent balance between algorithmic innovation and practical system optimization.
