Bidirectional Sparse Attention for Faster Video Diffusion Training
TL;DR Summary
Bidirectional Sparse Attention (BSA) is proposed to accelerate video DiT training by dynamically sparsifying both Queries and Key-Value pairs in 3D attention. BSA selects informative queries using semantic similarity and retains salient KV blocks via dynamic thresholds. This significantly improves training and inference efficiency, reducing FLOPs by up to 20x and accelerating attention training by 17.79x while preserving or surpassing the generative quality of full attention.
Abstract
Video diffusion Transformer (DiT) models excel in generative quality but hit major computational bottlenecks when producing high-resolution, long-duration videos. The quadratic complexity of full attention leads to prohibitively high training and inference costs. Full attention inefficiency stems from two key challenges: excessive computation due to the inherent sparsity of Queries and Key-Value pairs, and redundant computation as fixed sparse patterns fail to leverage DiT's dynamic attention. To overcome this limitation, we propose a Bidirectional Sparse Attention (BSA) framework for faster video DiT training, the first to dynamically sparsify both Queries and Key-Value pairs within 3D full attention, thereby substantially improving training and inference efficiency. BSA addresses these issues through two key components. Query sparsity is optimized by selecting the most informative query tokens via semantic similarity and with a dynamic spatial-time training strategy, while KV sparsity is achieved by computing a statistical dynamic threshold to retain only the most salient KV blocks for computation. Extensive experiments demonstrate that BSA significantly accelerates DiT training across long sequences, reducing FLOPs by up to 20x and achieving 17.79x faster attention training, while preserving or even surpassing the generative quality of full attention.
Analysis
1. Bibliographic Information
- Title: Bidirectional Sparse Attention for Faster Video Diffusion Training
- Authors: Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, and Hao Zhang, all affiliated with ByteDance.
- Journal/Conference: The paper is available as a preprint on arXiv (identifier 2509.01085, indicating a September 2025 submission). No venue is specified; its presence on arXiv suggests it is under review or intended for a major computer vision or machine learning conference (e.g., CVPR, ICCV, NeurIPS).
- Publication Year: 2025, per the arXiv identifier. The work builds on research from 2024-2025.
- Abstract: The abstract introduces the problem of high computational cost in Video Diffusion Transformer (DiT) models for generating long, high-resolution videos, attributing it to the quadratic complexity of full attention. The authors identify two inefficiencies: inherent sparsity in Query and Key-Value (KV) pairs, and the failure of fixed sparse patterns to adapt to DiT's dynamic attention. They propose Bidirectional Sparse Attention (BSA), a framework that dynamically sparsifies both Queries and KV pairs. Query sparsity is achieved by selecting informative tokens based on semantic similarity, while KV sparsity uses a dynamic statistical threshold to retain salient KV blocks. Experiments show BSA reduces FLOPs by up to 20x and accelerates attention training by 17.79x, while maintaining or improving generative quality compared to full attention.
- Original Source Link: https://arxiv.org/pdf/2509.01085
2. Executive Summary
- Background & Motivation (Why):
  - Core Problem: Training state-of-the-art video generation models, particularly Video Diffusion Transformers (DiTs), is extremely computationally expensive. The primary bottleneck is the self-attention mechanism, whose computational cost and memory usage grow quadratically with the number of video tokens (pixels × frames). This makes generating high-resolution, long-duration videos prohibitively slow and costly.
  - Importance & Gaps: As demand for high-quality video generation grows, overcoming this computational barrier is critical. Previous sparse attention methods, often adapted from language models, have key limitations in the video domain: they typically use fixed sparsity patterns, which fail to adapt to the dynamic, content-dependent nature of attention in videos, and most focus only on reducing redundancy in the Key-Value (KV) pairs, ignoring significant redundancy on the Query (Q) side.
  - Fresh Angle: This paper introduces Bidirectional Sparse Attention (BSA), whose key innovation is dynamic, adaptive sparsification of both the Query side and the Key-Value side simultaneously. This "bidirectional" approach is the first of its kind for video DiTs and is designed to tackle redundancy more comprehensively than prior work.
- Main Contributions / Findings (What):
- Novel Framework (BSA): The paper proposes BSA, a trainable framework that orthogonally sparsifies Queries and Key-Value pairs in 3D full attention for video DiTs.
- Dynamic Sparsity Strategies: It introduces distinct dynamic strategies for each side:
- Dynamic Query Sparsity: Selects the most informative query tokens by measuring semantic similarity within blocks, effectively pruning redundant queries.
- Dynamic KV Sparsity: Uses a statistical threshold based on attention scores to adaptively select the most relevant KV blocks for each query, moving beyond fixed top-k or static patterns.
- Significant Performance Gains: Extensive experiments demonstrate that BSA achieves massive efficiency improvements. It reduces floating-point operations (FLOPs) by up to 20x and accelerates attention training by up to 17.79x.
- Preserved or Improved Quality: Despite the aggressive sparsification, BSA maintains or even surpasses the generative quality of the original full-attention model, as measured by standard video generation benchmarks like VBench.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Diffusion Models: A class of generative models that learn to create data by reversing a gradual noising process. They start with random noise and iteratively denoise it, guided by a learned model (and often a text prompt), to produce a clean sample like an image or video.
- Transformer: A neural network architecture originally designed for natural language processing, which relies heavily on the self-attention mechanism. It excels at modeling long-range dependencies between elements in a sequence.
- Diffusion Transformer (DiT): An architecture that replaces the commonly used U-Net backbone in diffusion models with a Transformer. This has proven highly effective and scalable for image and video generation. In a video DiT, a video is treated as a sequence of patches (tokens) in both space and time.
- Self-Attention Mechanism: The core component of a Transformer. For each token in a sequence (the Query), it computes an "attention score" against every other token (the Keys). These scores are then used to create a weighted sum of all tokens' Values, allowing the model to focus on the most relevant parts of the sequence when updating a token's representation. Its complexity is $O(N^2)$, where $N$ is the sequence length.
- Sparse Attention: A family of techniques designed to reduce the quadratic complexity of self-attention. Instead of allowing every token to attend to every other token, sparse attention restricts each query to attend to only a small subset of key-value pairs, making the computation far more efficient and often approaching linear or near-linear complexity.
- Previous Works:
- General Sparse Attention (in LLMs): Methods like Longformer and LongNet use fixed, predefined sparsity patterns (e.g., local windows, global tokens) to handle long text sequences. Others like MoBA and NSA introduce trainable dynamic sparsity but focus only on the KV pairs and often rely on fixed selection rules that don't adapt well to different data distributions.
- Sparse Attention for Video Diffusion: Early attempts transferred methods from LLMs to video DiTs, but these often targeted inference speed-up rather than training. Fixed patterns, as in VSA, can lead to visual artifacts because video attention is highly dynamic. VMoBA improves on this by using thresholding instead of a fixed top-k selection but is sensitive to hyperparameters. Crucially, all of these methods focus exclusively on pruning KV pairs, ignoring the redundancy on the Query side.
- Differentiation: BSA distinguishes itself from prior work in two key ways:
- Bidirectional Sparsification: It is the first framework to explicitly sparsify both Queries and KV pairs. This addresses redundancy more holistically, as the authors show that many query tokens (e.g., from static background across frames) are semantically repetitive and can be pruned.
- Dynamic and Adaptive Strategies: BSA's sparsity is not fixed. For Queries, it selects tokens based on semantic content. For KV pairs, it uses a statistical threshold derived from the attention scores themselves, rather than a fixed number or a static threshold. This allows the model to adaptively decide how sparse the attention should be for different content and at different stages of training.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Bidirectional Sparse Attention (BSA) framework, which is designed to replace the standard full-attention mechanism in a video DiT. It consists of three main stages, as illustrated in the figure below.
The figure is a schematic of the Bidirectional Sparse Attention (BSA) framework for accelerating video DiT training, showing how Queries and Key-Value pairs are dynamically sparsified to improve efficiency. The input sequence is first divided by (a) 3D block partition into blocks of a fixed number of tokens; (b) the Query-Sparse stage then selects the most informative query tokens via semantic similarity, while (c) the KV-Sparse stage selects the most salient KV blocks for computation via a statistical dynamic threshold.
- Principles: The underlying principle of BSA is to exploit the inherent redundancy in video data. Spatially, nearby pixels are often similar. Temporally, consecutive frames often contain static backgrounds or slow-moving objects. This redundancy translates into a sparse attention matrix where each query only needs to interact with a small subset of key-value pairs. BSA makes this process efficient and adaptive by pruning redundant information from both queries and keys.
4.1. Review of Sparse Attention
For a single attention head with Query ($Q$), Key ($K$), and Value ($V$) matrices of size $N \times d$ (where $N$ is the sequence length and $d$ is the feature dimension), the standard attention output is
$$\mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d}}\right)V.$$
Sparse attention improves efficiency by computing this only over a selected subset of keys ($K_s$) and values ($V_s$):
$$\mathrm{Attn}(Q, K_s, V_s) = \mathrm{softmax}\!\left(\frac{QK_s^{\top}}{\sqrt{d}}\right)V_s.$$
BSA extends this by also sparsifying the queries, using a subset $Q_s$.
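To make the notation concrete, here is a minimal PyTorch sketch of attention restricted to a query subset $Q_s$ and a key/value subset $K_s, V_s$; the index sets below are arbitrary placeholders for illustration, not BSA's actual selection rules.

```python
import torch

def sparse_attn(Q, K, V, q_idx, kv_idx):
    """Attention over a query subset Q_s = Q[q_idx] and a KV subset
    K_s = K[kv_idx], V_s = V[kv_idx], as in the sparse formulation above."""
    d = Q.shape[-1]
    Q_s, K_s, V_s = Q[q_idx], K[kv_idx], V[kv_idx]
    scores = Q_s @ K_s.T / d ** 0.5            # (|Q_s|, |K_s|) instead of (N, N)
    return torch.softmax(scores, dim=-1) @ V_s

N, d = 4096, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
# Placeholder index sets: keep every 2nd query and every 8th key/value pair.
out = sparse_attn(Q, K, V, q_idx=torch.arange(0, N, 2), kv_idx=torch.arange(0, N, 8))
```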
4.2. BSA Structure
The BSA framework involves three steps:
Step 1: 3D Block Partition
To manage the long sequence of video tokens efficiently, the input latent tensor of shape (T, H, W) (Time, Height, Width) is first divided into non-overlapping 3D blocks of size $(B_t, B_h, B_w)$.
- The total number of tokens is $N = T \times H \times W$.
- Each block contains $B = B_t \times B_h \times B_w$ tokens.
- This partitioning allows for coarse, block-level computations before fine-grained token-level attention, reducing overhead. Tokens within each block are average-pooled to obtain block-level representations (pooled queries, keys, and values).
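A minimal sketch of this partitioning and pooling step is shown below, assuming a channel-last latent whose dimensions are divisible by the block size; the function name and shapes are illustrative rather than taken from the paper's code.

```python
import torch

def partition_3d(x, bt, bh, bw):
    """Split a latent of shape (T, H, W, d) into non-overlapping 3D blocks of
    size (bt, bh, bw); return the per-block token groups and their
    average-pooled block-level representations."""
    T, H, W, d = x.shape
    blocks = x.view(T // bt, bt, H // bh, bh, W // bw, bw, d)
    blocks = blocks.permute(0, 2, 4, 1, 3, 5, 6)      # (nT, nH, nW, bt, bh, bw, d)
    blocks = blocks.reshape(-1, bt * bh * bw, d)      # (num_blocks, tokens_per_block, d)
    pooled = blocks.mean(dim=1)                       # block-level representation
    return blocks, pooled

# Toy latent: T=8, H=16, W=16, d=64 with 4x4x4 blocks -> 32 blocks of 64 tokens.
blocks, pooled = partition_3d(torch.randn(8, 16, 16, 64), 4, 4, 4)
print(blocks.shape, pooled.shape)  # torch.Size([32, 64, 64]) torch.Size([32, 64])
```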
Step 2: Query-Sparse
This component aims to eliminate redundant queries. The intuition is that many tokens, especially in static areas of a video, provide similar semantic information and thus have similar query vectors.
- For each block $i$, the method identifies a "center token" $q_c^{(i)}$ to act as a representative.
- It then computes the cosine similarity between this center token and all other tokens within the same block. Tokens highly similar to the center are considered redundant.
- A fixed portion of the least similar (most unique) tokens is retained, controlled by a retention ratio $\rho$. The set of sparsified queries is formed by combining the retained tokens from all blocks. The selection can be written as
  $$Q_s = \bigcup_{i=1}^{M} \left\{ q_j^{(i)} \in Q^{(i)} \;\middle|\; \operatorname{rank}_i\!\left(1 - \operatorname{sim}\!\left(q_j^{(i)}, q_c^{(i)}\right)\right) \le \rho \cdot |Q^{(i)}| \right\}$$
  - $Q_s$: the final set of sparsified queries.
  - $M$: the total number of blocks.
  - $Q^{(i)}$: the set of query tokens in block $i$.
  - $q_c^{(i)}$: the center query token of block $i$.
  - $q_j^{(i)}$: an individual query token in block $i$.
  - $\operatorname{sim}(\cdot,\cdot)$: the cosine similarity function; $1 - \operatorname{sim}$ measures dissimilarity.
  - $\operatorname{rank}_i(\cdot)$: ranks the tokens within block $i$ in descending order of dissimilarity.
  - $\rho$: the retention ratio (e.g., $\rho = 0.5$ keeps 50% of the most unique tokens).
  - $|Q^{(i)}|$: the number of tokens in block $i$.
- Window-based Enhancement: To better preserve critical information, each block can be further divided into smaller windows, and a center token is selected from each window. This helps capture local variations more effectively.
Figure 2 illustrates the effect of Query Sparsity. The heatmaps for "Full Attention" on Frame 3 and Frame 12 are very similar, indicating that the queries are redundant. After applying "Query Sparsity," the heatmaps become more distinct, showing that the method successfully filtered out redundant semantic features and retained the most critical queries.
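The query-selection rule can be sketched as follows; choosing the block's middle token as the "center token" is an assumption made for illustration, since the analysis above does not spell out the paper's precise center-selection rule.

```python
import torch
import torch.nn.functional as F

def query_sparse(block_q, rho):
    """Within each block, keep the rho fraction of query tokens that are LEAST
    similar to the block's center token (i.e., the most unique ones).
    block_q: (num_blocks, tokens_per_block, d)."""
    nb, bt, d = block_q.shape
    center = block_q[:, bt // 2, :]                    # assumed center token: middle of the block
    dissim = 1 - F.cosine_similarity(block_q, center.unsqueeze(1), dim=-1)  # (nb, bt)
    keep = max(1, int(rho * bt))
    idx = dissim.argsort(dim=-1, descending=True)[:, :keep]                 # most dissimilar first
    q_s = torch.gather(block_q, 1, idx.unsqueeze(-1).expand(-1, -1, d))
    return q_s, idx

q_blocks = torch.randn(32, 64, 128)
q_s, kept = query_sparse(q_blocks, rho=0.5)
print(q_s.shape)  # torch.Size([32, 32, 128]) -- half of each block's queries retained
```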
Step 3: KV-Sparse
After sparsifying the queries, this component adaptively selects the most relevant KV blocks for each remaining query block. This avoids the rigidity of fixed sparsity patterns.
- Statistical Dynamic Threshold: Instead of selecting a fixed number of top-k KV blocks, BSA computes a dynamic threshold for each computation from the statistics of the inter-block attention scores, of the form
  $$\tau = \mu + \sigma \cdot \Phi^{-1}\!\left(1 - \frac{k}{n}\right)$$
  - $S$: the set of inter-block attention scores (over which $\mu$ and $\sigma$ are computed).
  - $\mu$: the mean of the attention scores.
  - $\sigma$: the standard deviation of the attention scores.
  - $n$: the total number of inter-block attention scores.
  - $k$: the desired number of key samples to select. The value of $k$ is annealed during training, starting high and gradually decreasing.
  - $\Phi^{-1}$: the quantile function (inverse of the cumulative distribution function), used here to find a threshold that corresponds to selecting the top $k$ scores.
- Dynamic Selection of Key KV Pairs: For each query block $i$, the method selects the smallest set of KV block indices $\Omega_i$ such that the cumulative softmax attention score meets the dynamic threshold $\tau$:
  $$\Omega_i = \operatorname*{arg\,min}_{|\Omega|} \left\{ \Omega \;\middle|\; \sum_{j \in \Omega} \operatorname{softmax}\!\left(\frac{\bar{q}_i \bar{k}_j^{\top}}{\sqrt{d}}\right) \ge \tau \right\}$$
  - $\operatorname*{arg\,min}_{|\Omega|}$: an operation that returns the minimal index set $\Omega$.
  - $\Omega_i$: the set of indices of selected KV blocks for query block $i$.
  - $\bar{q}_i$: the (pooled) representation of query block $i$.
  - $\bar{k}_j$: the (pooled) representation of key block $j$.
  - $\tau$: the dynamic threshold computed previously. This ensures that enough "attention mass" is captured while minimizing computation.
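A rough sketch of this selection is given below. The threshold construction (mean plus standard deviation times a normal quantile, so that roughly $k$ of the $n$ block-level scores exceed it) and the cumulative-mass stopping rule are assumptions that follow the description above only loosely, not the authors' exact formula.

```python
import torch

def kv_sparse_select(q_block, k_blocks, k_target):
    """Pick the KV blocks one query block should attend to.
    q_block: (d,) pooled query of the block; k_blocks: (M, d) pooled keys.
    k_target: desired number of blocks (annealed during training in the paper)."""
    d = q_block.shape[-1]
    scores = (q_block @ k_blocks.T) / d ** 0.5               # (M,) inter-block attention scores
    mu, sigma, n = scores.mean(), scores.std(), scores.numel()
    # Assumed threshold: roughly the top k_target of n scores under a normal fit.
    z = torch.distributions.Normal(0.0, 1.0).icdf(torch.tensor(1.0 - k_target / n))
    probs = torch.softmax(scores, dim=0)
    tau = probs[scores >= mu + sigma * z].sum()              # attention mass above the threshold
    # Minimal set of block indices whose cumulative softmax mass reaches tau.
    order = probs.argsort(descending=True)
    keep = int((probs[order].cumsum(0) < tau).sum().item()) + 1
    return order[:keep]

# Example: 32 blocks with 128-dim pooled representations, aiming for roughly 8 of them.
selected = kv_sparse_select(torch.randn(128), torch.randn(32, 128), k_target=8)
```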
4.3. Computation Cost and Kernel Design
- Final Computation: The final sparse attention is computed using the sparsified query matrix $Q_s$ and the selected key/value matrices $K_s$ and $V_s$:
  $$\mathrm{BSA}(Q_s, K_s, V_s) = \mathrm{softmax}\!\left(\frac{Q_s K_s^{\top}}{\sqrt{d}}\right)V_s$$
- Overhead: The similarity computation and sorting required for sparsification are negligible, scaling with the number of blocks rather than the full sequence length and amounting to less than 0.1% of total FLOPs.
- Kernel Design: The authors implemented custom forward and backward kernels in Triton (a language for writing efficient GPU code) to translate the block-sparse structure into real hardware speedups, achieving performance comparable to FlashAttention.
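For intuition, the sparse computation itself reduces to ordinary attention over the gathered subsets, as in this PyTorch sketch (a hypothetical helper, not the authors' Triton kernel); the reported wall-clock gains come from fusing the gather and attention into block-sparse kernels rather than materializing the subsets as below.

```python
import torch
import torch.nn.functional as F

def sparse_block_attention(q_s, k, v, kv_block_idx, tokens_per_block):
    """Attention over sparsified queries and the selected KV blocks only.
    q_s: (Nq, d) retained query tokens; k, v: (M * tokens_per_block, d) keys and
    values laid out block by block; kv_block_idx: indices of selected KV blocks."""
    token_idx = (kv_block_idx.unsqueeze(-1) * tokens_per_block
                 + torch.arange(tokens_per_block)).reshape(-1)   # block indices -> token indices
    k_s, v_s = k[token_idx], v[token_idx]
    # Dense attention over the reduced sets; a fused block-sparse kernel avoids
    # the explicit gather and delivers the actual speedup.
    return F.scaled_dot_product_attention(
        q_s.unsqueeze(0), k_s.unsqueeze(0), v_s.unsqueeze(0)).squeeze(0)

# Example: 512 retained queries, 32 KV blocks of 64 tokens each, keep 8 blocks.
out = sparse_block_attention(torch.randn(512, 128), torch.randn(32 * 64, 128),
                             torch.randn(32 * 64, 128), torch.arange(8), 64)
```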
5. Experimental Setup
- Datasets: The experiments used 300k videos selected from Vchitect T2V DataVerse. The videos underwent a three-step preprocessing pipeline:
  - Shot segmentation: to ensure each clip contains a single scene.
  - Temporal truncation: extracting 5-second clips.
  - Caption generation: using the Tarsier2 model to create text descriptions.
  The data was processed at multiple resolutions (e.g., 448×832 and 768×1280) to test scalability.
- Evaluation Metrics:
- Training Efficiency:
- FLOPs (Floating Point Operations): A measure of total computational cost. Lower is better.
- Speedup Ratio: The ratio of training time of the baseline (full attention) to the proposed method. Higher is better.
- Generation Quality:
- VBench: A comprehensive benchmark for video generation models. The paper uses five of its dimensions:
- Text Consistency: How well the generated video matches the input text prompt.
- BG Consistency: The stability and consistency of the background across frames.
- Image Quality: The visual fidelity and realism of individual frames.
- Sub Consistency: The consistency of the main subject's appearance and identity throughout the video.
- Dynamic Degree: The plausibility and amount of motion in the video.
- For all VBench metrics, a higher score is better. The paper does not provide the mathematical formulas for these metrics, as they are complex and defined by the VBench benchmark itself, often relying on other pretrained models for evaluation.
- Baselines:
- Full Attention: The standard, non-sparse attention mechanism within the same backbone model.
- MoBA (Mixture-of-Block Attention): A trainable sparse attention method that focuses on KV sparsity.
- VSA (Trainable Sparse Attention): Another trainable sparse attention method for video models, also focusing on KV sparsity with fixed block sizes.
6. Results & Analysis
- Core Results:
  The main results are summarized in the table below (transcribed from Table 1 in the paper). BSA is compared against the Full Attention baseline on the Wan2.1-1.3B model.

  | Seq. len | Method | Sparsity | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | 61×448×832 (23,296 tokens) | Full Attention | - | 32.71% | 95.12% | 64.33% | 92.34% | 1.51 × 10¹² | - |
  | | Sparse Attention (Ours) | 0.93 | 32.79% | 95.22% | 64.29% | 92.39% | 1.05 × 10¹¹ | 12.85x |
  | 157×768×1280 (153,600 tokens) | Full Attention | - | 34.76% | 93.26% | 65.91% | 93.79% | 6.99 × 10¹³ | - |
  | | Sparse Attention (Ours) | 0.95 | 34.93% | 93.41% | 66.03% | 94.13% | 3.49 × 10¹² | 17.79x |

  Analysis:
- BSA (Sparse Attention (Ours)) achieves massive speedups (12.85x and 17.79x) and drastically reduces FLOPs (by over 90%) compared to full attention.
- Crucially, this acceleration comes with no loss in quality. In fact, BSA slightly outperforms full attention on most VBench metrics, suggesting that pruning redundant information may act as a helpful regularizer.
- The benefits are more pronounced for longer sequences (153k tokens), confirming the method's scalability.
Figure 1 visually summarizes these findings, showing the dramatic 17.79x speedup and FLOPs reduction in (a), while (b) confirms that quality metrics are maintained or slightly improved.
Figure 4 shows that the training and validation loss curves for BSA and Full Attention are nearly identical, with BSA sometimes achieving slightly lower loss. This provides strong evidence that the training dynamics are not harmed by the sparsification.
- Training on Longer Sequences:
Figure 6 demonstrates a clear trend: as the input sequence length increases from 23k to 153k tokens, the speedup provided by BSA grows from 12.85x to 17.79x. This is expected, as the quadratic cost of full attention becomes a more dominant bottleneck for longer sequences, making the benefits of sparsity more impactful.
- Sparse Adaptation and Trade-off:
Figure 7 explores the trade-off between efficiency and accuracy. As sparsity increases, FLOPs decrease linearly. The validation loss remains stable and comparable to full attention (sparsity=0) up to a sparsity level of 0.93. Beyond this point, performance degrades sharply. This indicates an optimal operating point where maximum efficiency is achieved without sacrificing quality. The dynamic thresholding in BSA helps operate near this optimal point automatically.
- Qualitative Results:
Figure 5 provides a side-by-side visual comparison of videos generated by Full Attention and BSA. Across various prompts, resolutions, and content types (portraits, landscapes, complex scenes), the outputs from BSA are visually indistinguishable from the full attention baseline, confirming the quantitative results that generative quality is preserved.
- Comparison with Other Training-based Attentions:
  The following table is transcribed from the unnumbered table in Section 4.4, which the text refers to as Table 3.

  | Seq. len | Method | Sparsity | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
  | --- | --- | --- | --- | --- | --- | --- | --- | --- |
  | 61×448×832 (23,296 tokens) | MoBA [15] | 0.80 | 32.56% | 95.14% | 64.14% | 92.05% | 3.02 × 10¹¹ | 1.2x |
  | | VSA [33] | 0.87 | 32.65% | 95.03% | 64.25% | 92.21% | 1.96 × 10¹¹ | 4.5x |
  | | Sparse Attention (Ours) | 0.93 | 32.79% | 95.22% | 64.29% | 92.39% | 1.05 × 10¹¹ | 12.85x |
  | 157×768×1280 (153,600 tokens) | MoBA [15] | 0.80 | 34.34% | 93.05% | 65.34% | 93.49% | 2.62 × 10¹² | 2.3x |
  | | VSA [33] | 0.87 | 34.72% | 93.22% | 65.87% | 93.72% | 4.54 × 10¹¹ | 6.2x |
  | | Sparse Attention (Ours) | 0.95 | 34.93% | 93.41% | 66.03% | 94.13% | 3.49 × 10¹² | 17.79x |

  Analysis: BSA significantly outperforms previous trainable sparse attention methods like MoBA and VSA. It achieves much higher speedups (e.g., 12.85x vs. 1.2x and 4.5x) and superior generation quality across all metrics. This highlights the benefit of the bidirectional and dynamic sparsity approach.
- Ablation Study:
The following table is transcribed from Table 2 in the paper. It dissects the contributions of each component of BSA.
| Method | Settings | Sparsity | Validation Loss | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Query-sparse | Original | 0.5 | 0.211 | 32.83% | 95.25% | 64.34% | 92.44% | 7.5 × 10¹¹ | 1.96x |
| Query-sparse | w/ Window | 0.5 | 0.208 | 32.85% | 95.29% | 64.36% | 92.44% | 7.5 × 10¹¹ | 1.98x |
| KV-sparse | Original | 0.86 | 0.210 | 32.84% | 95.24% | 64.30% | 92.41% | 2.1 × 10¹¹ | 6.05x |
| KV-sparse | w/ Statistic | 0.89 | 0.209 | 32.82% | 95.25% | 64.28% | 92.42% | 1.67 × 10¹¹ | 6.12x |
| Full Attention | - | 0 | 0.213 | 32.71% | 95.12% | 64.33% | 92.34% | 1.51 × 10¹² | - |
| Query-sparse + KV-sparse | - | 0.93 | 0.212 | 32.79% | 95.22% | 64.29% | 92.39% | 1.73 × 10¹¹ | 12.85x |

Analysis:
- Query-Sparse alone: Pruning 50% of queries (retention ratio $\rho = 0.5$) already provides a ~2x speedup with better performance than full attention. The window-based selection method further improves results.
- KV-Sparse alone: The dynamic statistical threshold achieves a ~6x speedup while maintaining quality.
- Combined: When both Query-Sparse and KV-Sparse are used together, their effects are complementary, leading to the highest sparsity (0.93) and speedup (12.85x) without compromising quality. This confirms the orthogonality and effectiveness of the bidirectional design.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces BSA, a novel trainable sparse attention framework for video DiTs. By being the first to dynamically and bidirectionally sparsify both Queries and Key-Value pairs, BSA effectively addresses the computational bottleneck of full attention. It achieves state-of-the-art training acceleration (up to 17.79x speedup) and FLOPs reduction (up to 20x) on long-sequence video generation tasks. Critically, these efficiency gains are realized while preserving or even slightly improving the generative quality compared to the original full-attention model.
- Limitations & Future Work:
- The paper does not discuss limitations explicitly. However, one potential area for exploration is the sensitivity to the block partitioning scheme. The choice of block size might influence performance, and an adaptive method for setting this could be beneficial.
- The annealed sparsity schedule, while effective, is still a manually defined hyperparameter. A fully adaptive schedule learned during training could be a direction for future work.
- The framework is evaluated on a text-to-video model. Its applicability and performance on other video tasks (e.g., video understanding, prediction) could be investigated.
- Personal Insights & Critique:
- Significance: This work represents a significant practical advancement for training large-scale video generation models. The ability to drastically cut training costs without sacrificing quality makes it feasible to train on higher resolutions and longer durations, pushing the boundaries of what is possible in video synthesis.
- Novelty: The core idea of bidirectional sparsity is both intuitive and powerful. The observation that query-side redundancy is a major untapped source of inefficiency is a key insight that sets this work apart from previous sparse attention methods in the video domain.
- Practical Impact: The provision of custom Triton kernels is crucial. Many academic works on sparse attention fail to translate theoretical FLOPs reduction into real-world speedups due to hardware inefficiencies. By co-designing the algorithm and its low-level implementation, the authors demonstrate impressive, practical acceleration. This makes the method highly attractive for both researchers and industry practitioners.
- Open Questions: Could the query selection criteria be learned instead of being based on cosine similarity? For example, a small neural network could predict which queries are most important. This might capture more complex notions of redundancy and lead to even better performance.