- Title: Bidirectional Sparse Attention for Faster Video Diffusion Training
- Authors: Chenlu Zhan, Wen Li, Chuyu Shen, Jun Zhang, Suhui Wu, Hao Zhang. All authors are affiliated with ByteDance.
- Journal/Conference: The paper is an arXiv preprint; its identifier (2509.01085) indicates a September 2025 submission. No venue is specified, but its presence on arXiv suggests it is under review or intended for a major computer vision or machine learning conference (e.g., CVPR, ICCV, NeurIPS).
- Publication Year: 2025 (preprint); the content is contemporary with research from 2024-2025.
- Abstract: The abstract introduces the problem of high computational cost in Video Diffusion Transformer (DiT) models for generating long, high-resolution videos, attributing it to the quadratic complexity of full attention. The authors identify two inefficiencies: inherent sparsity in Query and Key-Value (KV) pairs, and the failure of fixed sparse patterns to adapt to DiT's dynamic attention. They propose Bidirectional Sparse Attention (BSA), a framework that dynamically sparsifies both Queries and KV pairs. Query sparsity is achieved by selecting informative tokens based on semantic similarity, while KV sparsity uses a dynamic statistical threshold to retain salient KV blocks. Experiments show BSA reduces FLOPs by up to 20x and accelerates attention training by 17.79x, while maintaining or improving generative quality compared to full attention.
- Original Source Link: https://arxiv.org/pdf/2509.01085
4. Methodology (Core Technology & Implementation)
The core of the paper is the Bidirectional Sparse Attention (BSA) framework, which is designed to replace the standard full-attention mechanism in a video DiT. It consists of three main stages, as illustrated in the figure below.
This figure is a schematic of the Bidirectional Sparse Attention (BSA) framework for accelerating video DiT training, showing how efficiency is gained by dynamically sparsifying Query and Key-Value pairs. The input sequence is first divided via (a) 3D block partition, with block size $B = C_t \times C_h \times C_w$ tokens; then (b) the Query-Sparse branch selects the most informative query tokens by semantic similarity, while (c) the KV-Sparse branch selects the most salient KV blocks for computation via a statistical dynamic threshold.
- Principles: The underlying principle of BSA is to exploit the inherent redundancy in video data. Spatially, nearby pixels are often similar. Temporally, consecutive frames often contain static backgrounds or slow-moving objects. This redundancy translates into a sparse attention matrix where each query only needs to interact with a small subset of key-value pairs. BSA makes this process efficient and adaptive by pruning redundant information from both queries and keys.
4.1. Review of Sparse Attention
For a single attention head with Query (Q), Key (K), and Value (V) matrices of size L×d (where L is sequence length and d is feature dimension), the standard attention output O is:
$$S = \frac{QK^\top}{\sqrt{d_k}}, \qquad O = \mathrm{Softmax}(S)\,V$$
Sparse attention aims to improve efficiency by computing this only for a selected subset of keys (Ks) and values (Vs):
$$S = \frac{QK_s^\top}{\sqrt{d_k}}, \qquad O = \mathrm{Softmax}(S)\,V_s$$
BSA extends this by also sparsifying the queries, using a subset $Q_s$.
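To make the notation concrete, here is a minimal NumPy sketch (not the authors' implementation) contrasting full attention with a KV-sparse variant that restricts computation to a chosen subset of key/value rows; the index array `kv_idx` is a hypothetical stand-in for whatever selection rule produces $K_s$ and $V_s$.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def full_attention(Q, K, V):
    # Standard attention: every query attends to every key (O(L^2 d)).
    d_k = Q.shape[-1]
    S = Q @ K.T / np.sqrt(d_k)
    return softmax(S) @ V

def kv_sparse_attention(Q, K, V, kv_idx):
    # Sparse variant: queries attend only to the selected key/value rows.
    d_k = Q.shape[-1]
    Ks, Vs = K[kv_idx], V[kv_idx]
    S = Q @ Ks.T / np.sqrt(d_k)
    return softmax(S) @ Vs

# Toy usage: keep only a quarter of the KV positions.
L, d = 64, 16
Q, K, V = np.random.default_rng(0).normal(size=(3, L, d))
out_sparse = kv_sparse_attention(Q, K, V, kv_idx=np.arange(0, L, 4))
```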
4.2. BSA Structure
The BSA framework involves three steps:
Step 1: 3D Block Partition
To manage the long sequence of video tokens efficiently, the input latent tensor of shape $(T, H, W)$ (Time, Height, Width) is first divided into non-overlapping 3D blocks of size $(C_t, C_h, C_w)$.
- The total number of tokens is $L = T \times H \times W$.
- Each block contains $B = C_t \times C_h \times C_w$ tokens.
- This partitioning allows coarse, block-level computations before fine-grained token-level attention, reducing overhead. Tokens within each block are average-pooled to obtain block-level representations $Q_c$, $K_c$, $V_c$.
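As a minimal sketch under stated assumptions (per-token features already arranged on a $(T, H, W, d)$ grid; the helper name `block_pool` is ours, not from the paper), the partition-and-pool step might look like this:

```python
import numpy as np

def block_pool(x, Ct, Ch, Cw):
    """Partition a (T, H, W, d) token grid into non-overlapping 3D blocks
    of size (Ct, Ch, Cw) and average-pool each block into one block-level token.

    Returns an array of shape (T//Ct, H//Ch, W//Cw, d)."""
    T, H, W, d = x.shape
    assert T % Ct == 0 and H % Ch == 0 and W % Cw == 0, "dims must divide block size"
    x = x.reshape(T // Ct, Ct, H // Ch, Ch, W // Cw, Cw, d)
    # Average over the three intra-block axes (Ct, Ch, Cw).
    return x.mean(axis=(1, 3, 5))

# Toy usage: an 8x16x16 latent grid with 4x4x4 blocks (B = 64 tokens per block).
tokens = np.random.default_rng(0).normal(size=(8, 16, 16, 64))
Qc = block_pool(tokens, Ct=4, Ch=4, Cw=4)   # shape (2, 4, 4, 64)
```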
Step 2: Query-Sparse
This component aims to eliminate redundant queries. The intuition is that many tokens, especially in static areas of a video, provide similar semantic information and thus have similar query vectors.
- For each block $b$, the method identifies a "center token" $q_c^{(b)}$ to act as a representative.
- It then computes the cosine similarity between this center token and all other tokens $q_i$ within the same block. Tokens highly similar to the center are considered redundant.
- A fixed portion of the least similar (most unique) tokens is retained, controlled by a retention ratio $r$. The set of sparsified queries $Q_s$ is formed by combining the retained tokens from all blocks. The selection is defined by:

$$Q_s = \bigcup_{b=1}^{N} \left\{ q_i \in Q_c^{(b)} \;\middle|\; \mathrm{rank}_b\!\left(1 - \cos\!\left(q_c^{(b)}, q_i\right)\right) \le \left\lceil r \cdot \left|Q_c^{(b)}\right| \right\rceil \right\}$$
- $Q_s$: The final set of sparsified queries.
- $N$: The total number of blocks.
- $Q_c^{(b)}$: The set of query tokens in block $b$.
- $q_c^{(b)}$: The center query token of block $b$.
- $q_i$: An individual query token in block $b$.
- $\cos(\cdot,\cdot)$: Cosine similarity; $1 - \cos(\cdot,\cdot)$ measures dissimilarity.
- $\mathrm{rank}_b(\cdot)$: Ranks the tokens within block $b$ in descending order of dissimilarity.
- $r$: The retention ratio (e.g., 0.5 means keep the 50% most unique tokens).
- $|Q_c^{(b)}|$: The number of tokens in block $b$.
- Window-based Enhancement: To better preserve critical information, each block can be further divided into smaller windows, and a center token is selected from each window. This helps capture local variations more effectively.
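A minimal sketch of the per-block query selection (not the authors' code): the block mean is used here as a stand-in for the center token, which is an assumption since the paper does not spell out how the center is chosen; dissimilarity to it is ranked, and the top fraction $r$ of most-dissimilar queries is kept.

```python
import numpy as np

def query_sparse_block(Qb, r):
    """Select the most informative queries within one block.

    Qb : (B, d) query tokens of a single block.
    r  : retention ratio in (0, 1].

    The 'center' query is taken to be the block mean (an assumption).
    Tokens least similar to the center are treated as most informative."""
    center = Qb.mean(axis=0)
    # Cosine similarity between each token and the center.
    cos = (Qb @ center) / (np.linalg.norm(Qb, axis=1) * np.linalg.norm(center) + 1e-8)
    dissimilarity = 1.0 - cos
    keep = int(np.ceil(r * len(Qb)))
    # Indices of the `keep` most dissimilar (most unique) tokens.
    kept_idx = np.argsort(-dissimilarity)[:keep]
    return Qb[kept_idx], kept_idx

# Toy usage: keep 50% of the queries in a 64-token block.
Qb = np.random.default_rng(0).normal(size=(64, 32))
Qs_block, idx = query_sparse_block(Qb, r=0.5)   # 32 retained queries
```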
Figure 2 illustrates the effect of Query Sparsity. The heatmaps for "Full Attention" on Frame 3 and Frame 12 are very similar, indicating that the queries are redundant. After applying "Query Sparsity," the heatmaps become more distinct, showing that the method successfully filtered out redundant semantic features and retained the most critical queries.
Step 3: KV-Sparse
After sparsifying the queries, this component adaptively selects the most relevant KV blocks for each remaining query block. This avoids the rigidity of fixed sparsity patterns.
- Statistical Dynamic Threshold: Instead of selecting a fixed number of top-k KV blocks, BSA computes a dynamic threshold p for each computation. This threshold is based on the statistics of the inter-block attention scores.
$$p = \mathrm{mean}(S_b) + \mathrm{std}(S_b) \cdot U(1 - k/n)$$
- $S_b$: The set of inter-block attention scores.
- $\mathrm{mean}(S_b)$: The mean of the attention scores.
- $\mathrm{std}(S_b)$: The standard deviation of the attention scores.
- $n$: The total number of inter-block attention scores.
- $k$: The desired number of key samples to select; $k$ is annealed during training, starting high and gradually decreasing.
- $U(\cdot)$: The quantile function (inverse of the cumulative distribution function), used here to find a threshold corresponding to selecting the top $k$ scores.
- Dynamic Selection of Key KV Pairs: For each query block $i$, the method selects the smallest set of KV block indices $S_i$ such that the cumulative softmax attention score meets the dynamic threshold $p$:

$$\gamma\!\left(\min |S_i| \;\;\text{s.t.}\;\; \sum_{(i,j) \in S_i} \frac{\exp\!\left(Q_i K_j^\top\right)}{\sum_{j'} \exp\!\left(Q_i K_{j'}^\top\right)} \ge p \right)$$
- $\gamma$: An operation that returns the minimal index set $S_i$.
- $S_i$: The set of indices of selected KV blocks for query block $i$.
- $Q_i$: Query block $i$.
- $K_j$: Key block $j$.
- $p$: The dynamic threshold computed above.
This ensures that enough "attention mass" is captured while minimizing computation.
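The NumPy/SciPy sketch below mirrors the two formulas above under the assumption that $U$ is the standard normal quantile (the text only says "quantile function"); the function names and the way block scores are pooled are illustrative, and how the score-scale threshold is mapped onto the cumulative-mass budget is left to the paper's kernel implementation.

```python
import numpy as np
from scipy.stats import norm

def dynamic_threshold(block_scores, k):
    """Statistical dynamic threshold p = mean(S_b) + std(S_b) * U(1 - k/n),
    taking U to be the standard normal quantile (an assumption)."""
    n = block_scores.size
    return block_scores.mean() + block_scores.std() * norm.ppf(1.0 - k / n)

def select_kv_blocks(q_block, k_blocks, p):
    """For one query block, return the minimal set of KV-block indices whose
    cumulative softmax attention mass reaches the budget p (expected in (0, 1])."""
    scores = k_blocks @ q_block
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                               # most salient blocks first
    cum = np.cumsum(probs[order])
    n_keep = int(np.searchsorted(cum, min(p, cum[-1])) + 1)  # minimal prefix reaching p
    return np.sort(order[:n_keep])

# Toy usage on pooled block representations; the cumulative budget is set
# explicitly here rather than derived from dynamic_threshold.
rng = np.random.default_rng(0)
Kc = rng.normal(size=(32, 64))             # 32 pooled key blocks, dim 64
qc = rng.normal(size=64)                   # one pooled query block
p_stat = dynamic_threshold(Kc @ qc, k=8)   # score-scale threshold (formula above)
kv_idx = select_kv_blocks(qc, Kc, p=0.9)   # blocks covering 90% of attention mass
```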
4.3. Computation Cost and Kernel Design
- Final Computation: The final sparse attention is computed using the sparsified query matrix $Q_s$ and the selected key/value matrices $K_S$ and $V_S$:

$$S_s = \frac{Q_s K_S^\top}{\sqrt{d_k}}, \qquad O_s = \mathrm{Softmax}(S_s)\,V_S$$
- Overhead: The overhead for calculating similarity and sorting for sparsification is negligible ($O(L \log L)$ for Query-sparse and $O(N)$ for KV-sparse, where $N$ is the number of blocks), amounting to less than 0.1% of total FLOPs.
- Kernel Design: The authors implemented custom forward and backward kernels in Triton (a language for writing efficient GPU code) to translate the block-sparse structure into real hardware speedups, achieving performance comparable to FlashAttention.
5. Experimental Setup
- Datasets: The experiments used a dataset of 300k videos selected from Vchitect T2V DataVerse. The videos underwent a three-step preprocessing:
- Shot segmentation: To ensure each clip has a single scene.
- Temporal truncation: Extracting 5-second clips.
- Caption generation: Using the Tarsier2 model to create text descriptions.
The data was processed at multiple resolutions (e.g., 448×832 and 768×1280) to test scalability.
- Evaluation Metrics:
- Training Efficiency:
- FLOPs (Floating Point Operations): A measure of total computational cost. Lower is better.
- Speedup Ratio: The ratio of training time of the baseline (full attention) to the proposed method. Higher is better.
- Generation Quality:
- VBench: A comprehensive benchmark for video generation models. The paper uses five of its dimensions:
- Text Consistency: How well the generated video matches the input text prompt.
- BG Consistency: The stability and consistency of the background across frames.
- Image Quality: The visual fidelity and realism of individual frames.
- Sub Consistency: The consistency of the main subject's appearance and identity throughout the video.
- Dynamic Degree: The plausibility and amount of motion in the video.
- For all VBench metrics, a higher score is better. The paper does not provide the mathematical formulas for these metrics, as they are complex and defined by the VBench benchmark itself, often relying on other pretrained models for evaluation.
- Baselines:
- Full Attention: The standard, non-sparse attention mechanism within the same backbone model.
- MoBA (Mixture-of-Block Attention): A trainable sparse attention method that focuses on KV sparsity.
- VSA (Trainable Sparse Attention): Another trainable sparse attention method for video models, also focusing on KV sparsity with fixed block sizes.
6. Results & Analysis
- Core Results:
The main results are summarized in the table below (transcribed from Table 1 in the paper). BSA is compared against the Full Attention baseline on the Wan2.1-1.3B model.
| Seq. len | Method | Sparsity | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 61×448×832 (23,296 tokens) | Full Attention | - | 32.71% | 95.12% | 64.33% | 92.34% | 1.51 × 10¹² | - |
| 61×448×832 (23,296 tokens) | Sparse Attention (Ours) | 0.93 | 32.79% | 95.22% | 64.29% | 92.39% | 1.05 × 10¹¹ | 12.85x |
| 157×768×1280 (153,600 tokens) | Full Attention | - | 34.76% | 93.26% | 65.91% | 93.79% | 6.99 × 10¹³ | - |
| 157×768×1280 (153,600 tokens) | Sparse Attention (Ours) | 0.95 | 34.93% | 93.41% | 66.03% | 94.13% | 3.49 × 10¹² | 17.79x |
Analysis:
- BSA (Sparse Attention (Ours)) achieves massive speedups (12.85x and 17.79x) and drastically reduces FLOPs (by over 90%) compared to full attention.
- Crucially, this acceleration comes with no loss in quality. In fact, BSA slightly outperforms full attention on most VBench metrics, suggesting that pruning redundant information may act as a helpful regularizer.
- The benefits are more pronounced for longer sequences (153k tokens), confirming the method's scalability.
![Figure 1. (a) Speedup ratio and computational cost comparison between Sparse Attention and Full Attention. (b) Comparison of generation quality across four consistency metrics on VBench \[10\].](/files/papers/68f0a36cde19bb55d0742a87/images/1.jpg)
Figure 1 visually summarizes these findings, showing the dramatic 17.79x speedup and FLOPs reduction in (a), while (b) confirms that quality metrics are maintained or slightly improved.

Figure 4 shows that the training and validation loss curves for BSA and Full Attention are nearly identical, with BSA sometimes achieving slightly lower loss. This provides strong evidence that the training dynamics are not harmed by the sparsification.
- Training on Longer Sequences:

Figure 6 demonstrates a clear trend: as the input sequence length increases from 23k to 153k tokens, the speedup provided by BSA grows from 12.85x to 17.79x. This is expected, as the quadratic cost of full attention becomes a more dominant bottleneck for longer sequences, making the benefits of sparsity more impactful.
- Sparse Adaptation and Trade-off:

Figure 7 explores the trade-off between efficiency and accuracy. As sparsity increases, FLOPs decrease linearly. The validation loss remains stable and comparable to full attention (sparsity=0) up to a sparsity level of 0.93. Beyond this point, performance degrades sharply. This indicates an optimal operating point where maximum efficiency is achieved without sacrificing quality. The dynamic thresholding in BSA helps operate near this optimal point automatically.
- Qualitative Results:

Figure 5 provides a side-by-side visual comparison of videos generated by Full Attention and BSA. Across various prompts, resolutions, and content types (portraits, landscapes, complex scenes), the outputs from BSA are visually indistinguishable from the full attention baseline, confirming the quantitative results that generative quality is preserved.
- Comparison with Other Training-based Attentions:
The following table is transcribed from the unnumbered table in Section 4.4, which the text refers to as Table 3.
| Seq. len | Method | Sparsity | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| 61×448×832 (23,296 tokens) | MoBA [15] | 0.80 | 32.56% | 95.14% | 64.14% | 92.05% | 3.02 × 10¹¹ | 1.2x |
| 61×448×832 (23,296 tokens) | VSA [33] | 0.87 | 32.65% | 95.03% | 64.25% | 92.21% | 1.96 × 10¹¹ | 4.5x |
| 61×448×832 (23,296 tokens) | Sparse Attention (Ours) | 0.93 | 32.79% | 95.22% | 64.29% | 92.39% | 1.05 × 10¹¹ | 12.85x |
| 157×768×1280 (153,600 tokens) | MoBA [15] | 0.80 | 34.34% | 93.05% | 65.34% | 93.49% | 2.62 × 10¹² | 2.3x |
| 157×768×1280 (153,600 tokens) | VSA [33] | 0.87 | 34.72% | 93.22% | 65.87% | 93.72% | 4.54 × 10¹¹ | 6.2x |
| 157×768×1280 (153,600 tokens) | Sparse Attention (Ours) | 0.95 | 34.93% | 93.41% | 66.03% | 94.13% | 3.49 × 10¹² | 17.79x |
Analysis: BSA significantly outperforms previous trainable sparse attention methods like MoBA and VSA. It achieves much higher speedups (e.g., 12.85x vs. 1.2x and 4.5x) and superior generation quality across all metrics. This highlights the benefit of the bidirectional and dynamic sparsity approach.
- Ablation Study:
The following table is transcribed from Table 2 in the paper. It dissects the contributions of each component of BSA.
| Method | Settings | Sparsity | Validation Loss | TextConsis ↑ | BGConsis ↑ | ImageQual ↑ | SubConsist ↑ | FLOPs ↓ | SpeedUp ↑ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| Query-sparse | Original | 0.5 | 0.211 | 32.83% | 95.25% | 64.34% | 92.44% | 7.5 × 10¹¹ | 1.96x |
| Query-sparse | w/ Window | 0.5 | 0.208 | 32.85% | 95.29% | 64.36% | 92.44% | 7.5 × 10¹¹ | 1.98x |
| KV-sparse | Original | 0.86 | 0.210 | 32.84% | 95.24% | 64.30% | 92.41% | 2.1 × 10¹¹ | 6.05x |
| KV-sparse | w/ Statistic | 0.89 | 0.209 | 32.82% | 95.25% | 64.28% | 92.42% | 1.67 × 10¹¹ | 6.12x |
| Full Attention | - | 0 | 0.213 | 32.71% | 95.12% | 64.33% | 92.34% | 1.51 × 10¹² | - |
| Query-sparse + KV-sparse | - | 0.93 | 0.212 | 32.79% | 95.22% | 64.29% | 92.39% | 1.73 × 10¹¹ | 12.85x |
Analysis:
- Query-Sparse alone: Pruning 50% of queries (Sparsity=0.5) already provides a ~2x speedup with better performance than full attention. The window-based selection method further improves results.
- KV-Sparse alone: The dynamic statistical threshold achieves a ~6x speedup while maintaining quality.
- Combined: When both Query-Sparse and KV-Sparse are used together, their effects are complementary, leading to the highest sparsity (0.93) and speedup (12.85x) without compromising quality. This confirms the orthogonality and effectiveness of the bidirectional design.
7. Conclusion & Reflections
- Conclusion Summary: The paper introduces BSA, a novel trainable sparse attention framework for video DiTs. By being the first to dynamically and bidirectionally sparsify both Queries and Key-Value pairs, BSA effectively addresses the computational bottleneck of full attention. It achieves state-of-the-art training acceleration (up to 17.79x attention speedup) and FLOPs reduction (up to 20x) on long-sequence video generation tasks. Critically, these efficiency gains are realized while preserving or even slightly improving generative quality compared to the original full-attention model.
- Limitations & Future Work:
- The paper does not discuss limitations explicitly. However, one potential area for exploration is the sensitivity to the block partitioning scheme. The choice of block size $(C_t, C_h, C_w)$ might influence performance, and an adaptive method for setting this could be beneficial.
- The annealed sparsity schedule, while effective, is still a manually defined hyperparameter. A fully adaptive schedule learned during training could be a direction for future work.
- The framework is evaluated on a text-to-video model. Its applicability and performance on other video tasks (e.g., video understanding, prediction) could be investigated.
- Personal Insights & Critique:
- Significance: This work represents a significant practical advancement for training large-scale video generation models. The ability to drastically cut training costs without sacrificing quality makes it feasible to train on higher resolutions and longer durations, pushing the boundaries of what is possible in video synthesis.
- Novelty: The core idea of bidirectional sparsity is both intuitive and powerful. The observation that query-side redundancy is a major untapped source of inefficiency is a key insight that sets this work apart from previous sparse attention methods in the video domain.
- Practical Impact: The provision of custom Triton kernels is crucial. Many academic works on sparse attention fail to translate theoretical FLOPs reduction into real-world speedups due to hardware inefficiencies. By co-designing the algorithm and its low-level implementation, the authors demonstrate impressive, practical acceleration. This makes the method highly attractive for both researchers and industry practitioners.
- Open Questions: Could the query selection criteria be learned instead of being based on cosine similarity? For example, a small neural network could predict which queries are most important. This might capture more complex notions of redundancy and lead to even better performance.