
Fast Video Generation with Sliding Tile Attention

Published: 02/07/2025

TL;DR Summary

The study introduces Sliding Tile Attention (STA) to reduce the computational bottleneck in video generation, achieving 58.79% Model FLOPs Utilization while decreasing latency to 501 seconds without quality loss, demonstrating significant efficiency improvements over existing methods.

Abstract

Diffusion Transformers (DiTs) with 3D full attention power state-of-the-art video generation, but suffer from prohibitive compute cost -- when generating just a 5-second 720P video, attention alone takes 800 out of 945 seconds of total inference time. This paper introduces sliding tile attention (STA) to address this challenge. STA leverages the observation that attention scores in pretrained video diffusion models predominantly concentrate within localized 3D windows. By sliding and attending over the local spatial-temporal region, STA eliminates redundancy from full attention. Unlike traditional token-wise sliding window attention (SWA), STA operates tile-by-tile with a novel hardware-aware sliding window design, preserving expressiveness while being hardware-efficient. With careful kernel-level optimizations, STA offers the first efficient 2D/3D sliding-window-like attention implementation, achieving 58.79% MFU. Precisely, STA accelerates attention by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). On the leading video DiT, HunyuanVideo, STA reduces end-to-end latency from 945s (FA3) to 685s without quality degradation, requiring no training. Enabling finetuning further lowers latency to 268s with only a 0.09% drop on VBench. We make our codebase public at https://github.com/hao-ai-lab/FastVideo.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is Fast Video Generation with Sliding Tile Attention. It introduces a novel attention mechanism to accelerate the inference process of state-of-the-art video generation models.

1.2. Authors

The authors of this paper are:

  • Peiyuan Zhang

  • Yongqi Chen

  • Runlong Su

  • Hangliang Ding

  • Ion Stoica

  • Zhengzhong Liu

  • Hao Zhang

    Their affiliations include universities and research labs, with multiple authors from hao-ai-lab.

1.3. Journal/Conference

The paper is published on arXiv, a preprint server. While arXiv is highly influential in the research community for rapid dissemination of scientific work, it hosts preprints that have not yet undergone formal peer review, or are undergoing review at conferences or journals. The publication timestamp is 2025-02-06T21:17:09.000Z, indicating it's a recent submission.

1.4. Publication Year

The paper was published in 2025.

1.5. Abstract

Diffusion Transformers (DiTs) are currently the leading models for state-of-the-art video generation, primarily due to their use of 3D full attention. However, this mechanism comes with a prohibitive computational cost; for instance, generating a 5-second 720P video can take 945 seconds, with attention alone consuming 800 seconds. To tackle this, the paper introduces Sliding Tile Attention (STA). STA is based on the observation that attention scores in pretrained video diffusion models are highly concentrated within localized 3D spatial-temporal windows. By sliding and attending over these local regions, STA effectively eliminates the redundancy associated with full attention.

Unlike conventional token-wise sliding window attention (SWA), STA operates tile-by-tile using a novel hardware-aware sliding window design. This approach preserves the model's expressiveness while significantly enhancing hardware efficiency. Through careful kernel-level optimizations, STA provides the first efficient implementation of 2D/3D sliding-window-like attention, reaching 58.79% Model FLOPs Utilization (MFU). Specifically, STA accelerates attention computation by 2.8-17x over FlashAttention-2 (FA2) and 1.6-10x over FlashAttention-3 (FA3). When applied to HunyuanVideo, a leading video DiT, STA reduces end-to-end latency from 945s (with FA3) to 685s without any degradation in video quality and without additional training. Further finetuning reduces latency to 268s with only a marginal 0.09% drop in VBench score. The authors have made their codebase publicly available.

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is the prohibitive computational cost of 3D full attention in Diffusion Transformers (DiTs) for state-of-the-art video generation. While DiTs with 3D full attention excel at synthesizing high-resolution, long-duration, and visually coherent videos, the quadratic complexity of attention with respect to sequence length leads to immense computational overhead, making both training and inference exceedingly slow. For example, generating a 5-second 720P video with HunyuanVideo on a high-end H100 GPU still takes 16 minutes (945 seconds), with attention computation dominating 800 seconds of that time. This bottleneck severely impedes the practical deployment and accessibility of advanced video generation models.

The paper identifies a critical observation: video data inherently contains high redundancy. Adjacent frames show minimal differences, and spatially close pixels exhibit strong correlations. This suggests that allowing every token to attend to every other token (as in 3D full attention) might be overly expensive and unnecessary. The authors hypothesize that much of this redundancy is carried within the 3D full attention mechanism of pretrained video diffusion models. Visualizations of HunyuanVideo's attention scores confirm a strong 3D locality pattern, where queries primarily attend to spatially and temporally nearby keys. Quantitatively, a local window covering only ~15.52% of the total token space accounts for 70% of the total attention score. This striking locality pattern motivates the need for a more efficient attention mechanism that can exploit this characteristic without sacrificing quality.

Existing sliding window attention (SWA) implementations for 2D or 3D data, such as NATTEN and CLEAR, have failed to translate theoretical FLOP reductions into proportional wall-clock speedups. This inefficiency stems from the irregular attention masks they generate, which create numerous "mixed blocks" that are computationally inefficient for modern GPU architectures like those optimized by FlashAttention. These mixed blocks lead to wasted computations and significant masking overhead, resulting in poor hardware utilization. The paper's entry point is to overcome this hardware-unfriendliness of traditional SWA by rethinking its computation via a system-algorithm co-design.

2.2. Main Contributions / Findings

The paper makes the following primary contributions:

  1. Identification and Quantification of 3D Locality and Head Specialization: The authors rigorously identify and quantify a pronounced 3D locality pattern and head specialization in state-of-the-art video Diffusion Transformers (DiTs). This reveals substantial redundancy in the full 3D attention mechanisms of these models. They show that different attention heads exhibit specialized locality patterns, focusing on varying scales, and importantly, this head specialization is consistent across different prompts.
  2. Introduction of SLIDING TILE ATTENTION (STA) with Optimized Kernel: The paper introduces SLIDING TILE ATTENTION (STA), a novel tile-based sliding window attention mechanism. Unlike previous Sliding Window Attention (SWA) implementations, STA operates tile-by-tile and leverages a hardware-aware design that eliminates mixed blocks, ensuring that computations are dense and GPU-friendly. The optimized kernel-level implementation, based on ThunderKittens and FlashAttention3, achieves minimal overhead compared to FlashAttention 3 itself, with a high Model FLOPs Utilization (MFU) of 58.79%. This makes STA the first higher-order sliding-window-like attention to achieve wall-clock speedups proportional to sparsity.
  3. Significant Acceleration of Video Generation with Quality Preservation: STA demonstrates substantial acceleration of attention computations, achieving speedups of over 10x. This translates to an end-to-end video generation speedup of up to 3.53x. Critically, these efficiency gains are achieved with minimal or no degradation in output quality. This is demonstrated through both a training-free application (which uses a heuristic mask search) and a finetuning approach for even greater sparsity, significantly reducing HunyuanVideo's inference latency from 945s to 268s.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Diffusion Models

Diffusion Models are a class of generative models that learn to reverse a diffusion process. They work by iteratively denoising a noisy input to generate a clean sample (e.g., an image or video). In the forward diffusion process, noise is progressively added to data until it becomes pure noise. In the reverse process, the model learns to gradually remove this noise, transforming noisy data back into coherent samples. This iterative denoising process allows them to generate high-quality and diverse outputs.

3.1.2. Transformers

Transformers are neural network architectures that have revolutionized sequence modeling, particularly in natural language processing and computer vision. Their core innovation is the self-attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element. Unlike recurrent neural networks, Transformers can process all input elements in parallel, making them highly efficient for long sequences.

3.1.3. Diffusion Transformers (DiTs)

Diffusion Transformers (DiTs) combine the generative power of diffusion models with the architectural strengths of Transformers. In DiTs, the U-Net architecture typically used in diffusion models for noise prediction is replaced by a Transformer block. For video generation, DiTs utilize 3D attention mechanisms to model both spatial relationships within frames and temporal relationships across frames. This involves flattening the 3D video data (e.g., time, height, width) into a sequence of visual tokens, then applying attention to this sequence. This approach has led to state-of-the-art results in video generation.

3.1.4. Attention Mechanism

The attention mechanism is a crucial component of Transformers. It allows a model to dynamically weigh the importance of different parts of an input sequence when computing a representation for a specific element. The basic idea is that for each element (query), the model looks at all other elements (keys) in the sequence, calculates a similarity score between the query and each key, and then uses these scores to create a weighted sum of corresponding value vectors.

The standard scaled dot-product attention operation is defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequences, typically derived from the same input features through linear transformations: $Q \in \mathbb{R}^{N \times d_k}$, $K \in \mathbb{R}^{N \times d_k}$, $V \in \mathbb{R}^{N \times d_v}$, where $N$ is the sequence length, $d_k$ is the dimension of queries and keys, and $d_v$ is the dimension of values.

  • $QK^T$ computes the dot products (similarity scores) between each query and all keys.

  • $\sqrt{d_k}$ is a scaling factor that prevents the dot products from becoming too large in high dimensions, which could push the softmax function into regions with very small gradients.

  • $\mathrm{softmax}$ is applied row-wise to normalize the scores into probability distributions, yielding the attention weights $A$.

  • The value matrix $V$ is multiplied by the attention weights $A$ to produce the final output, a weighted sum of the value vectors that emphasizes the most relevant parts of the input.

    In 3D attention for video, the video frames are flattened into a unified sequence of visual tokens. For a video latent of shape $(L_t, L_h, L_w)$ (time, height, width), the total sequence length becomes $N = L_t \times L_h \times L_w$. This leads to $\mathcal{O}(N^2)$ computational complexity and memory footprint, which is extremely high for high-resolution, long-duration videos (a minimal code sketch follows).
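To make the flattening and the quadratic cost concrete, the following minimal PyTorch sketch (illustrative only, with toy dimensions rather than the paper's latent sizes) flattens a $(L_t, L_h, L_w)$ latent into a token sequence and applies scaled dot-product attention; note that both $S$ and $A$ are $N \times N$:

```python
import torch
import torch.nn.functional as F

# Illustrative sizes (much smaller than HunyuanVideo's 30 x 48 x 80 latent).
Lt, Lh, Lw, d_k = 6, 8, 8, 64
N = Lt * Lh * Lw                        # sequence length after flattening

x = torch.randn(1, Lt, Lh, Lw, d_k)     # video latent features
tokens = x.reshape(1, N, d_k)           # flatten (t, h, w) into one token axis

# Per-head projections (weights would normally be learned).
Wq, Wk, Wv = (torch.randn(d_k, d_k) for _ in range(3))
Q, K, V = tokens @ Wq, tokens @ Wk, tokens @ Wv

# Scaled dot-product attention: S and A are N x N matrices,
# which is what makes 3D full attention quadratic in Lt*Lh*Lw.
S = Q @ K.transpose(-2, -1) / d_k ** 0.5    # (1, N, N)
A = F.softmax(S, dim=-1)
O = A @ V                                   # (1, N, d_k)
print(O.shape, S.shape)
```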

3.1.5. FlashAttention

FlashAttention is a highly optimized attention algorithm designed to address the memory and computational bottlenecks of standard attention, particularly the $\mathcal{O}(N^2)$ memory footprint caused by materializing the attention score matrix ($S$) and attention weight matrix ($A$) in High Bandwidth Memory (HBM). It achieves this by:

  • Tiling: Breaking down the input sequences (Q, K, V) into smaller blocks (tiles).

  • Online Softmax: Performing the softmax computation incrementally and on-chip (in faster SRAM) for these blocks, avoiding the need to write the large intermediate $S$ and $A$ matrices to HBM. Each GPU Streaming Multiprocessor (SM) loads a block of Q, K, V into SRAM, computes partial attention outputs, and combines them to form the final result, which is then written back to HBM.

  • Reduced Data Transfer: By keeping intermediate computations in SRAM, FlashAttention significantly reduces the number of reads and writes to slower HBM, leading to substantial speedups.

    FlashAttention-2 (FA2) further optimizes parallelism and work partitioning for even better performance, and FlashAttention-3 (FA3) introduces asynchrony and support for low-precision computations (e.g., FP8) for additional speedups.
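The actual FlashAttention kernels are CUDA implementations; the sketch below is only a PyTorch illustration (written for this analysis) of the tiling and online-softmax idea: key/value blocks are processed one at a time while a running maximum and normalizer are maintained, so the full $N \times N$ score matrix is never materialized.

```python
import torch

def tiled_attention(Q, K, V, block_size=64):
    """Blockwise attention with an online softmax (illustrative only)."""
    N, d = Q.shape
    scale = d ** -0.5
    out = torch.zeros_like(Q)
    row_max = torch.full((N, 1), float("-inf"))   # running max per query row
    row_sum = torch.zeros(N, 1)                   # running softmax normalizer

    for start in range(0, N, block_size):         # loop over K/V blocks
        Kb, Vb = K[start:start + block_size], V[start:start + block_size]
        S = (Q @ Kb.T) * scale                    # scores for this block only
        new_max = torch.maximum(row_max, S.max(dim=-1, keepdim=True).values)
        correction = torch.exp(row_max - new_max) # rescale previous accumulators
        P = torch.exp(S - new_max)
        out = out * correction + P @ Vb
        row_sum = row_sum * correction + P.sum(dim=-1, keepdim=True)
        row_max = new_max

    return out / row_sum

# Check the blockwise result against a naive full-attention reference.
Q, K, V = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax((Q @ K.T) / 64 ** 0.5, dim=-1) @ V
assert torch.allclose(tiled_attention(Q, K, V), ref, atol=1e-4)
```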

3.1.6. Model FLOPs Utilization (MFU)

Model FLOPs Utilization (MFU) is a metric used to quantify how effectively a GPU kernel uses the hardware's compute capability. It is defined as the ratio of the FLOPs throughput the kernel actually achieves to the GPU's theoretical peak FLOPs throughput. A higher MFU (closer to 100%) indicates that the kernel keeps the GPU's arithmetic units busy with useful work. A low MFU suggests the kernel spends much of its time waiting on memory transfers or on overhead such as mask evaluation rather than on computation, indicating inefficiency (a worked example follows).
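As a rough worked example (all numbers below except the sequence length, head dimension, and head count are assumptions for illustration, not measurements from the paper), attention FLOPs can be approximated as $4N^2 d$ per head ($\approx 2N^2 d$ for $QK^T$ plus $\approx 2N^2 d$ for $AV$), and MFU is the achieved FLOPs rate divided by the GPU's peak rate:

```python
# Rough MFU estimate for one attention forward pass (illustrative numbers only).
N     = 115_200           # tokens for a 720P, 5 s HunyuanVideo latent (30 * 48 * 80)
d     = 128               # head dimension
heads = 24
flops = 4 * N**2 * d * heads         # ~2 N^2 d for Q K^T plus ~2 N^2 d for A V
peak  = 989e12                       # assumed H100 SXM peak dense BF16 FLOPs/s
latency_s = 0.30                     # hypothetical measured kernel time

achieved = flops / latency_s         # useful FLOPs per second actually delivered
print(f"MFU = {achieved / peak:.1%}")
```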

3.2. Previous Works

3.2.1. Sliding Window Attention (SWA)

Sliding Window Attention (SWA) is a sparse attention mechanism commonly used to reduce the quadratic complexity of full attention. Instead of allowing each query token to attend to all key tokens in the sequence, SWA restricts attention such that a query only attends to key tokens within a fixed-size window centered around itself. By stacking multiple attention layers, the effective receptive field can still expand beyond the window size. SWA has been widely applied in Natural Language Processing (NLP) (e.g., Longformer) to handle very long sequences.

For 2D or 3D data (like images or videos), SWA becomes 2D SWA or 3D SWA. The challenge in these higher dimensions is managing the window boundaries and ensuring efficient computation.

3.2.2. NATTEN and Tiled NATTEN

NATTEN (Neighborhood Attention Transformer) is an implementation of sliding window attention specifically designed for image and video data. It aims to improve efficiency by restricting attention to local neighborhoods.

  • Mechanism: NATTEN shifts window centers at image/video boundaries to ensure each query attends to a constant number of keys.
  • Inefficiency: As discussed in the paper, NATTEN creates highly irregular attention masks due to individual tokens attending to distinct sets of keys. This results in a large number of mixed blocks when implemented with FlashAttention. Mixed blocks require full computation before masking, incurring overhead and failing to translate FLOP reductions into actual speedups. Tiled NATTEN is an optimized variant that attempts to reorder inputs to increase the number of dense blocks and improve kernel efficiency through input tiling and kernel fusion. However, even with these optimizations, a significant portion of blocks still remains mixed, limiting its overall efficiency.

3.2.3. CLEAR

CLEAR (Conv-like Linearization Revs Pre-trained Diffusion Transformers Up) is another approach that aims for linear attention by using a circular window-based attention mechanism. Each query token only attends to key-value tokens within a defined radius rr. It maintains the scaled dot-product attention formula but restricts its computation to these local windows. The authors implement CLEAR with FlexAttention. Similar to NATTEN, CLEAR also faces challenges in translating theoretical FLOP reductions into proportional wall-clock speedups due to the complexities of mask generation and handling mixed blocks in higher dimensions.

3.2.4. Swin Transformer

Swin Transformer is a hierarchical vision transformer that introduces a shifted window-based attention mechanism.

  • Mechanism: Instead of global self-attention, Swin partitions the input (e.g., image) into non-overlapping local windows and applies attention within these windows. To enable interaction between windows, a key innovation is the alternating window partitioning strategy: one layer uses standard window partitioning, while the next shifts the windows to create new connections.
  • Limitations: While efficient, Swin attention disrupts local connectivity within a single attention layer if adjacent tokens fall into different windows. This can violate the 3D locality property that is crucial for video diffusion models, potentially leading to degraded quality. It is typically used in a train-from-scratch setting, and applying it to a pretrained model like HunyuanVideo without specific adaptation can harm performance.

3.2.5. Δ-DiT

`Δ-DiT` (Delta-DiT: A Training-Free Acceleration Method Tailored for Diffusion Transformers) is a training-free inference acceleration method that optimizes speed by caching feature offsets instead of full feature maps.
*   **Mechanism:** It uses a staged caching strategy in which residuals from later `DiT` blocks are stored for early-step sampling, while residuals from earlier blocks are cached for later steps. Key parameters include the residual cache interval ($N$), the number of cached blocks ($N_c$), and a timestep boundary ($b$) determining the cache position.
*   **Differentiation from STA:** `Δ-DiT` focuses on caching intermediate features to reduce overall computation, while `STA` directly optimizes the attention mechanism itself by exploiting locality and hardware efficiency.

### 3.2.6. Linear Attention Methods
These methods, such as `Linformer`, aim to reduce the quadratic complexity of attention to linear complexity by decomposing the `softmax` operation using kernel or gate functions. While promising in theory, they have not yet achieved state-of-the-art performance in `video DiTs` compared to `full 3D attention`.

### 3.2.7. Alternating Spatial and Temporal Attention
Some approaches accelerate video diffusion by decomposing `3D attention` into separate `spatial` and `temporal` components (e.g., `Make-A-Video`, `LaVie`).
*   **Mechanism:** Instead of a single 3D attention block, they might apply 2D spatial attention within each frame, followed by 1D temporal attention across frames.
*   **Limitations:** The paper argues that these methods fail to capture interactions between tokens that are offset in *both* spatial and temporal dimensions. For example, a token at $(1, 1, 1)$ cannot directly attend to a token at $(2, 2, 2)$ if the attention is strictly separated (see the sketch below). This disrupts the `3D locality` pattern observed in video diffusion models, which is why `full 3D attention` often outperforms them in state-of-the-art models.
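A minimal PyTorch sketch (illustrative, not taken from any of the cited systems) of this factorization; because spatial and temporal attention each operate on reshaped sequences, no attention score is ever computed between tokens that differ in both space and time:

```python
import torch
import torch.nn.functional as F

B, T, H, W, d = 1, 4, 6, 6, 32
x = torch.randn(B, T, H, W, d)

# 2D spatial attention: each frame is its own sequence of H*W tokens.
xs = x.reshape(B * T, H * W, d).unsqueeze(1)          # (B*T, heads=1, H*W, d)
xs = F.scaled_dot_product_attention(xs, xs, xs).squeeze(1)

# 1D temporal attention: each spatial location is its own sequence of T tokens.
xt = xs.reshape(B, T, H * W, d).permute(0, 2, 1, 3)   # (B, H*W, T, d)
xt = xt.reshape(B * H * W, T, d).unsqueeze(1)
xt = F.scaled_dot_product_attention(xt, xt, xt).squeeze(1)

out = xt.reshape(B, H * W, T, d).permute(0, 2, 1, 3).reshape(B, T, H, W, d)
print(out.shape)   # torch.Size([1, 4, 6, 6, 32])
```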

## 3.3. Technological Evolution
The evolution of attention mechanisms in video generation `Diffusion Transformers` can be summarized as follows:
1.  **Full 3D Attention:** Initially, `DiTs` adopted `full 3D attention` for video, flattening video frames into a sequence and applying quadratic attention. This achieved high quality but suffered from prohibitive computational cost.
2.  **Early Sparsity Attempts (e.g., Alternating S/T Attention, Swin):** Researchers explored decomposing 3D attention or using `window-based` approaches (like `Swin`) to reduce cost. However, these often compromised `3D locality` or generated suboptimal quality, leading `full 3D attention` to remain dominant for state-of-the-art.
3.  **Hardware-Optimized Dense Attention (FlashAttention):** The advent of `FlashAttention` significantly boosted the efficiency of `dense attention` by optimizing memory access and on-chip computation, making `full 3D attention` more feasible, but still costly for high resolutions.
4.  **Hardware-Unaware Sparse Attention (NATTEN, CLEAR):** Attempts to apply `sliding window attention` to 2D/3D data (`NATTEN`, `CLEAR`) demonstrated theoretical `FLOP` reductions but struggled with practical `wall-clock speedups` due to generating `irregular attention masks` and `mixed blocks` that were inefficient for `FlashAttention`-style GPU kernels.
5.  **Hardware-Aware Sparse Attention (STA):** This paper's `Sliding Tile Attention (STA)` represents a significant step forward by specifically addressing the hardware inefficiency of previous `SWA` implementations. By co-designing the algorithm with `FlashAttention`'s underlying architecture, `STA` ensures only `dense blocks` are computed, thereby translating `FLOP` reductions directly into proportional `wall-clock speedups` while preserving crucial `3D locality`.

## 3.4. Differentiation Analysis
Compared to the main methods in related work, `Sliding Tile Attention (STA)` offers several core differences and innovations:

*   **vs. Full 3D Attention (e.g., FA2, FA3):**
    *   **Core Difference:** `STA` introduces sparsity by restricting attention to local `3D windows`, whereas `full 3D attention` allows every token to attend to every other token.
    *   **Innovation:** `STA` leverages the observed `3D locality` in pretrained models, drastically reducing redundant computations while maintaining quality. It achieves significant speedups (over 10x for attention, 3.53x end-to-end) that `FlashAttention` alone cannot provide for sparse patterns.

*   **vs. Traditional Token-wise Sliding Window Attention (SWA) (e.g., NATTEN, CLEAR):**
    *   **Core Difference:** Traditional SWA operates `token-by-token`, with each query attending to a distinct window, creating irregular attention patterns. `STA` operates `tile-by-tile`, where all queries within a tile attend to the same set of key tiles.
    *   **Innovation:** This `tile-by-tile` design allows `STA` to generate *only* `dense blocks` and `empty blocks`, completely eliminating inefficient `mixed blocks` that plague `NATTEN` and `CLEAR`. This hardware-aware design ensures `STA` translates theoretical `FLOP` reductions into proportional `wall-clock speedups`, unlike previous SWA implementations that fail to do so. `STA` is the first to achieve high `MFU` (`58.79%`) for 3D SWA.

*   **vs. Swin Transformer:**
    *   **Core Difference:** `Swin` uses non-overlapping, shifted windows, which can disrupt local connectivity if adjacent tokens fall into different windows within a single layer. `STA` uses a sliding window that allows any query to attend to its local neighborhood, preserving continuous `3D locality`.
    *   **Innovation:** `STA` is designed to be plug-and-play with pretrained models and maintains `3D locality`, which `Swin` inherently struggles with in its vanilla form when applied to models like `DiTs`. `STA` preserves quality better while achieving higher speedups, and fine-tuning further enhances its performance.

*   **vs. Δ-DiT:**
    *   **Core Difference:** `Δ-DiT` accelerates inference by caching intermediate features (residuals) to reduce redundant computation across diffusion steps, whereas `STA` accelerates inference by optimizing the attention mechanism itself through sparsity.
    *   **Innovation:** `STA` addresses a different bottleneck (attention computation) with a different strategy (hardware-efficient sparse attention). The paper shows that `STA` consistently outperforms `Δ-DiT` in the quality-efficiency tradeoff, achieving higher speedups with better or comparable video quality. The two methods are also complementary.

*   **vs. Alternating Spatial and Temporal Attention:**
    *   **Core Difference:** Alternating attention mechanisms decompose `3D interactions` into separate 2D spatial and 1D temporal components. `STA` maintains a unified `3D local window`, allowing direct interactions between tokens offset in *both* spatial and temporal dimensions.
    *   **Innovation:** `STA` explicitly preserves the observed `3D locality` pattern, which is crucial for high-quality video generation and has led `full 3D attention` to largely supersede alternating approaches in state-of-the-art models.

        In essence, `STA`'s core innovation lies in its `system-algorithm co-design`, which makes `sliding window attention` truly hardware-efficient for higher-dimensional data by completely eliminating `mixed blocks` and leveraging asynchronous data loading, something previous sparse attention methods for 2D/3D data failed to achieve effectively.

# 4. Methodology

The core of `Sliding Tile Attention (STA)` lies in its innovative approach to `sliding window attention (SWA)` that ensures hardware efficiency on GPUs while preserving the essential `3D locality` observed in video diffusion models. This is achieved by moving from a `token-by-token` sliding window to a `tile-by-tile` mechanism, meticulously designed to avoid `mixed blocks` and maximize `Model FLOPs Utilization (MFU)`.

## 4.1. Attention in Video DiTs
State-of-the-art `Video Diffusion Transformers (DiTs)`, such as `HunyuanVideo`, employ `3D full attention` to enable signal mixing across all tokens. This allows each visual token to attend to any other token in the video. When a video latent of shape `(L, L, L)` (where $L$ denotes the size of each temporal and spatial dimension after VAE encoding) is processed, it is flattened into a 1D sequence of length $N = L^3$. `Full bidirectional attention` is then applied to this sequence.

The attention operation is formally defined as:
$$ S = \frac{QK^T}{\sqrt{d_k}}, \quad A = \operatorname{Softmax}(S + M), \quad O = AV $$

Where:
*   $Q, K, V \in \mathbb{R}^{N \times d}$ represent the query, key, and value matrices for a single attention head, where $N = L^3$ is the sequence length and $d$ is the dimension of each head.
*   $M \in \{-\infty, 0\}^{N \times N}$ is the `attention mask`. A value of $-\infty$ in $M$ masks out a key position by driving its `softmax` probability to zero; a value of $0$ means no masking.
*   $S$ is the unnormalized attention score matrix.
*   $A$ is the attention weight matrix, obtained by applying `Softmax` to the scaled scores $S$ plus the mask $M$.
*   $O$ is the output matrix, a weighted sum of the value vectors.

    A naive implementation of this operation would materialize the full $S$ and $A$ matrices, each of size $N \times N$, in `GPU High Bandwidth Memory (HBM)`. This leads to both $\mathcal{O}(N^2)$ memory overhead and extensive data transfers between `HBM` and the faster `SRAM`, making it prohibitively expensive for large $N$.

`FlashAttention` mitigates this by tiling the input sequences into smaller blocks. It uses an `online softmax` approach in which each `GPU Streaming Multiprocessor (SM)` loads blocks of queries, keys, and values into `SRAM`, performs the necessary computations, and writes the final result directly to `HBM`, thus avoiding materialization of the full $A$ and $S$ matrices in `HBM`. To introduce sparsity and reduce computation cost, an `attention mask` can be applied. This mask is computed on-chip per block, avoiding the $\mathcal{O}(N^2)$ memory cost of a global mask. Sparsity can theoretically reduce latency by skipping masked-out attention regions; however, as the paper demonstrates, achieving this in practice for complex 2D/3D masking patterns is challenging.

## 4.2. Inefficiency of Sliding Window Attention (SWA)
Implementing `2D/3D Sliding Window Attention (SWA)` with `FlashAttention` requires defining an appropriate `attention mask`. `FlashAttention` calculates and applies masks at the block level. Based on how an attention mask impacts a block, `attention blocks` can be categorized into three types, as illustrated in Figure 4:
*   **Dense Blocks:** All attention scores within the block are retained (no masking). These are the most computationally efficient for `FlashAttention`.
*   **Empty Blocks:** All attention scores within the block are masked out (set to $-\infty$). These blocks can be entirely skipped, leading to `FLOP` reduction and speedup.
*   **Mixed Blocks:** Some attention scores within the block are retained, while others are masked out. These blocks are the primary source of inefficiency in traditional `SWA` for higher dimensions.

*Figure 4 (from the original paper). The attention maps of NATTEN, Tiled NATTEN, and STA, plotted for a 24x24 image with a 12x12 local window and a 4x4 tile size. (a) NATTEN creates many mixed blocks that are very inefficient for FlashAttention computation. (b) Tiled NATTEN increases the share of dense blocks but still leaves many mixed blocks. (c) STA produces only dense and empty blocks. The 2D scenario is shown for better illustration.*
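To make this dense/empty/mixed taxonomy concrete, the following small sketch (a 1D illustration written for this analysis, not the paper's kernel code) builds a token-wise sliding-window mask, classifies each FlashAttention-style block, and contrasts it with a tile-aligned mask that produces only dense and empty blocks:

```python
import torch

def classify_blocks(mask: torch.Tensor, block: int):
    """Count dense / mixed / empty (block x block) tiles of a boolean attention mask."""
    n = mask.shape[0]
    counts = {"dense": 0, "mixed": 0, "empty": 0}
    for i in range(0, n, block):
        for j in range(0, n, block):
            sub = mask[i:i + block, j:j + block]
            if sub.all():
                counts["dense"] += 1
            elif sub.any():
                counts["mixed"] += 1
            else:
                counts["empty"] += 1
    return counts

# Token-wise 1D sliding window: each query attends to keys within +/- window // 2.
n, window, block = 64, 17, 8
idx = torch.arange(n)
swa_mask = (idx[:, None] - idx[None, :]).abs() <= window // 2
print(classify_blocks(swa_mask, block))   # many mixed blocks

# Tile-aligned window (the STA idea): compare whole tiles instead of tokens.
tile = idx // block
sta_mask = (tile[:, None] - tile[None, :]).abs() <= 1   # 3-tile-wide window
print(classify_blocks(sta_mask, block))   # only dense and empty blocks
```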

`Mixed blocks` introduce two major inefficiencies:
1.  **No FLOP Reduction for Sparsity:** Despite being sparser than `dense blocks`, `mixed blocks` often require `full computation` for the entire block *before* applying the masks to retain or discard scores. This means `unnecessary computations` are performed for the masked-out parts.
2.  **Mask Evaluation Overhead:** The attention kernel must calculate the mask value based on the `SWA` pattern and the block's position to determine which positions to retain or discard. This `mask calculation` introduces substantial overhead. For example, `FlexAttention` reports a 15% overhead for a simple `causal mask`. For more complex patterns like `sliding windows`, this overhead can dominate latency.

    In traditional `SWA` (like `NATTEN`), each query attends to a distinct set of keys, which results in a `zigzag pattern` in the attention map, generating a large number of `mixed blocks` (Figure 4a). Even `Tiled NATTEN`, which tries to reorder inputs to increase `dense blocks`, still leaves a significant portion as `mixed blocks`. The table below (Table 1 from the paper) quantitatively shows the block ratios:

    
| Attention | Window Size | Dense Block | Mixed Block |
|---|---|---|---|
| Tiled NATTEN | (11, 11, 11) | 0.06% | 7.17% |
| STA | (12, 12, 12) | 1.56% | 0.0% |
| STA | (20, 20, 20) | 7.23% | 0.0% |

Table 1. Ratio of dense and mixed blocks for Tiled NATTEN and STA with tile size (4, 4, 4) and video size (48, 48, 48). STA generates only dense blocks, which are more computationally friendly on GPUs than mixed blocks.

As evident, `Tiled NATTEN` has a very low percentage of `dense blocks` and a significant percentage of `mixed blocks`, highlighting its inefficiency.

## 4.3. Sliding Tile Attention (STA) Principles

### 4.3.1. Core Idea and Tile-by-Tile Operation

`STA` addresses the inefficiencies of traditional `SWA` with a `hardware-aware` approach that eliminates `mixed blocks`. The core idea is to change the unit of sliding from individual tokens to `tiles`. A `tile` is a contiguous group of tokens forming a spatial-temporal cube (e.g., a $T \times T \times T$ cube in 3D), and its size is chosen to correspond to the `block size` used in `FlashAttention`.

In `STA`, queries and keys are organized into these `tiles`. Crucially, all queries within the *same tile* attend to the *same set of keys* within their common local window. This structured attention pattern is key to creating efficient `dense blocks`. By arranging queries in a tile with consecutive token indices and setting the tile volume equal to the `FlashAttention` block size, `STA` ensures that when a `query tile` attends to a `key tile` within its window, the result is a dense `FlashAttention` block. This design leaves *no mixed blocks*, only `dense` or `empty` blocks, as shown in Figure 4c and Table 1.

For 3D `STA`, given a video of dimension `(L, L, L)` and a `FlashAttention` block size of `(B, B)`, `STA` sets the tile size $T$ such that $B = T^3$. It further assumes that both the video size $L$ and the window size $W$ are integer multiples of $T$. The video is partitioned into non-overlapping tiles of size `(T, T, T)`, which are then flattened into a 1D sequence such that tokens *within the same tile* have consecutive sequence indices. Conceptually, `STA` slides the attention window with a step size of `(T, T, T)` (i.e., tile by tile). For each step, it computes attention between the central `query tile` and all `key tiles` within its local window. This process generates $\left(\frac{W}{T}\right)^3$ dense attention blocks without any mixed blocks.

The following figure (Figure 5 from the original paper) illustrates the 2D `Sliding Tile Attention` mechanism:

*Figure 5. 2D SLIDING TILE ATTENTION with tile size (2, 2) and window size (6, 6). After attending to all the key tiles, each query tile will generate nine 4x4 dense blocks in the attention map. 2D STA is shown for better illustration; 3D STA can be inferred similarly.*

In this 2D example, a tile size of (2, 2) and a window size of (6, 6) are used. Each `query tile` (a 2x2 group of tokens) attends to a 6x6 `key window`, which comprises nine key tiles arranged 3x3. Each of these nine query-key tile interactions results in a 4x4 dense block in the attention map (a 2x2 query tile attending to a 2x2 key tile forms a 4-token by 4-token block, and the `FlashAttention` block size satisfies $B = T^2 = 4$ in 2D).
### 4.3.2. Quantitative Block Analysis

The paper provides formulas to quantitatively measure the types of blocks generated by `Tiled NATTEN` and `STA` in 3D.

**Theorem 3.1 (Tiled NATTEN Block Counts):** Consider a `tiled NATTEN` configuration with tile size `(T, T, T)`, window size `(W, W, W)`, and video size `(L, L, L)`. Let the `FlashAttention` block size be `(B, B)`, where $B = T^3$. Ignoring boundary effects, the number of dense blocks is

$$ N_{\mathrm{dense}} = \left( \max\left( 2 \left\lfloor \frac{W + 1}{2T} \right\rfloor - 1 ,\ 0 \right) \right)^3 \cdot \left( \frac{L}{T} \right)^3 , $$

and the number of mixed blocks is

$$ N_{\mathrm{mix}} = \left( 2 \left\lceil \frac{W - 1}{2T} \right\rceil + 1 \right)^3 \cdot \left( \frac{L}{T} \right)^3 - N_{\mathrm{dense}} . $$

*   **Intuition:** For a block in `NATTEN` to be dense, the window size $W$ must be large enough to cover at least twice the tile size $T$, such that the leftmost query in a tile can attend to the rightmost query. However, the leftmost query in a tile can still attend to keys far to its left, up to $W - 1$ tiles away, which often results in mixed blocks near window edges.

**Theorem 3.2 (STA Block Counts):** With the same notation, if $W$ is an integer multiple of $T$, the number of dense blocks in `SLIDING TILE ATTENTION` is

$$ S_{\mathrm{dense}} = \left( \frac{W}{T} \right)^3 \cdot \left( \frac{L}{T} \right)^3 . $$

All remaining blocks are empty, and there are *no mixed blocks*.

*   **Intuition:** Each query tile in `STA` only attends to key tiles within its local window. If the window size $W$ is a multiple of the tile size $T$, the number of key tiles within the window is precisely $(W/T)^3$. Each of the $(L/T)^3$ query tiles therefore generates $(W/T)^3$ dense blocks corresponding to these key tiles. All other query-key tile interactions fall outside the window, forming empty blocks.
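As a sanity check (a short script written for this analysis, not code from the paper), plugging the Table 1 configuration of tile size (4, 4, 4) and video size (48, 48, 48) into Theorems 3.1 and 3.2 reproduces the reported block ratios up to rounding:

```python
from math import floor, ceil

def natten_blocks(T, W, L):
    """Dense/mixed block counts for tiled NATTEN (Theorem 3.1), cube-shaped inputs."""
    q_tiles = (L // T) ** 3
    dense = max(2 * floor((W + 1) / (2 * T)) - 1, 0) ** 3 * q_tiles
    mixed = (2 * ceil((W - 1) / (2 * T)) + 1) ** 3 * q_tiles - dense
    return dense, mixed

def sta_dense_blocks(T, W, L):
    """Dense block count for STA (Theorem 3.2); all remaining blocks are empty."""
    return (W // T) ** 3 * (L // T) ** 3

T, L = 4, 48
total = ((L // T) ** 3) ** 2            # total (query tile, key tile) block pairs

d, m = natten_blocks(T, 11, L)
print(f"Tiled NATTEN (11,11,11): dense {d/total:.2%}, mixed {m/total:.2%}")
for W in (12, 20):
    print(f"STA ({W},{W},{W}): dense {sta_dense_blocks(T, W, L)/total:.2%}, mixed 0.00%")
```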
### 4.3.3. Kernel-level Optimization

`STA`'s design is inherently GPU-friendly because it generates only dense and empty blocks. The authors implement their attention kernels on top of `ThunderKittens` and `FlashAttention3`, which provide the functionality to skip empty blocks and avoid adding unnecessary intra-block masks on dense blocks. The implementation uses a consumer-producer paradigm:

*   The `threadblock` is split into `compute warpgroups` and `data warpgroups`.
*   `Compute warpgroups` are responsible for calculating query blocks. Each query block always resides in SRAM (following the `Split-Q` approach).
*   `Data warpgroups` are responsible for asynchronously loading key and value (KV) blocks from HBM to SRAM.
*   The inter-block mask logic (determining which KV blocks a query block should attend to in `STA`) is entirely managed by the data warpgroups, which calculate the mask and load only the relevant KV blocks.
*   Crucially, since data warpgroups operate asynchronously, the overhead of computing this inter-block mask and deciding which data to load can be hidden by overlapping it with computation.
*   The compute warpgroups remain oblivious to the sparse attention pattern. They perform dense attention computation on KV blocks already present in shared memory (loaded by the data warpgroups). Once all data from the circular cache is consumed, the computation for that query block is complete.

This separation of concerns and asynchronous loading allows `STA` to achieve high MFU and wall-clock speedups proportional to sparsity.

## 4.4. Applying STA to Video Diffusion Model

`STA` can be integrated into video Diffusion Transformers in two main ways: training-free or with finetuning.

### 4.4.1. Training-free Application

The training-free approach leverages two key observations from Figure 3:

1.  **3D Locality:** Pretrained video diffusion models exhibit a strong 3D locality pattern, meaning queries primarily attend to nearby keys.
2.  **Head Specialization:** Different attention heads specialize in different locality patterns (some focus on fine details in small areas, others capture broader context). Importantly, this head specialization is largely consistent across different input prompts.

These properties make it possible to find suitable window sizes for each head without extensive retraining. The paper proposes a simple heuristic (Algorithm 1), sketched in code at the end of this subsection:

**Algorithm 1: STA Mask Search**

Input: Transformer model $M$, total steps $T$, mask pattern list $\mathcal{P}$, keep full attention for the first $T_0$ timesteps
Output: Dictionary `dict` storing the selected mask pattern for each head

1.  Initialize `dict`.
2.  For $t = T_0 + 1$ to $T$:
3.      For each layer-head combination `(l, h)` in $M$:
4.          Get $O$ (attention output of the original `(l, h)` with full attention).
5.          Initialize `minimum_loss` $\leftarrow \infty$.
6.          Initialize `best_pattern` $\leftarrow$ null.
7.          For each $p$ in $\mathcal{P}$:
8.              Mask head $h$ of layer $l$ using mask pattern $p$.
9.              Get $O'$ (attention output of $M$ after masking).
10.             Compute $\mathrm{loss} \leftarrow \mathrm{MSE}(O, O')$.
11.             If loss < `minimum_loss`:
12.                 `minimum_loss` $\leftarrow$ loss.
13.                 `best_pattern` $\leftarrow p$.
14.         Record `best_pattern` for `(t, l, h)` in `dict`.
15. Return `dict`.

*   **Explanation:** The algorithm iterates over diffusion timesteps and attention heads. For each head, it compares the full-attention output ($O$) with the output ($O'$) obtained under each candidate mask pattern (i.e., window size) from a predefined list $\mathcal{P}$. The mean squared error (MSE) between $O$ and $O'$ serves as the mask-search loss, and the pattern with the lowest loss (i.e., the one that best approximates full attention) is selected for that head and timestep.
*   **Practical Application:** In practice, the final configuration is decided by averaging the mask-search loss over a small number of prompts (e.g., 16). Full attention is retained for the first $T_0$ timesteps (a common practice to stabilize early denoising), and `STA` is applied for the remaining timesteps.
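The following simplified sketch captures the core of Algorithm 1 for a single head (random tensors stand in for the model's per-head queries, keys, and values; in the actual method the search runs inside the transformer, across timesteps, with the loss averaged over several prompts):

```python
import torch
import torch.nn.functional as F

def windowed_mask(coords, window):
    """Boolean (N, N) mask: key allowed if within +/- window // 2 of the query on every axis."""
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    return (diff <= torch.tensor(window) // 2).all(dim=-1)

def search_best_window(Q, K, V, coords, candidate_windows):
    """Pick the window whose masked attention output best matches full attention (MSE loss)."""
    full = F.scaled_dot_product_attention(Q, K, V)
    best, best_loss = None, float("inf")
    for w in candidate_windows:
        masked = F.scaled_dot_product_attention(Q, K, V, attn_mask=windowed_mask(coords, w))
        loss = F.mse_loss(masked, full).item()
        if loss < best_loss:
            best, best_loss = w, loss
    return best, best_loss

# Toy stand-in for one attention head: a (t, h, w) = (4, 6, 6) token grid, head dim 32.
t, h, w, d = 4, 6, 6, 32
coords = torch.stack(torch.meshgrid(torch.arange(t), torch.arange(h), torch.arange(w),
                                    indexing="ij"), dim=-1).reshape(-1, 3)
Q, K, V = (torch.randn(1, 1, coords.shape[0], d) for _ in range(3))
print(search_best_window(Q, K, V, coords, [(2, 2, 2), (4, 4, 4), (4, 6, 6)]))
```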
### 4.4.2. Finetuning Application

Beyond the training-free approach, `STA` can also be applied by fixing a window size (potentially with higher sparsity) and then finetuning the model to adapt to the new sparse attention pattern.

*   **Efficiency of Adaptation:** Since `STA` already aligns with the 3D locality property, this adaptation can be learned with minimal training overhead (e.g., 8 hours on 8 H100 GPUs, negligible compared to the original pretraining cost of video diffusion models).
*   **Receptive Field Expansion:** Even with restricted local windows in each attention layer, the receptive field expands through the stacking of multiple transformer layers, allowing the Diffusion Transformer to generate globally coherent videos.

The finetuning process employs a combined objective with three loss terms (a code sketch of this objective follows below):

1.  **Attention Distillation Loss ($\mathcal{L}_{attn}$):** This loss directly supervises the intermediate attention patterns of the `STA` model (student) to match the behavior of the original dense-attention model (teacher).
    $$ \mathcal{L}_{attn} = \frac{1}{N} \sum_{i=1}^{N} \Vert f_{\phi}^{(i)}(x_t, t, c) - f_{\psi}^{(i)}(x_t, t, c) \Vert_2^2 , $$
    Where:
    *   $f_{\phi}^{(i)}(x_t, t, c)$ is the output of the $i$-th transformer layer of the sliding tile model (student).
    *   $f_{\psi}^{(i)}(x_t, t, c)$ is the output of the $i$-th transformer layer of the original dense-attention teacher.
    *   $x_t$ is the noised latent at diffusion step $t$.
    *   $c$ denotes the text embedding (conditioning).
    *   $N$ is the total number of transformer layers.
    *   $\Vert \cdot \Vert_2^2$ is the squared L2 norm, measuring the difference between student and teacher outputs.
    *   This loss ensures that each sparse attention layer approximates its corresponding dense-attention teacher layer.
2.  **Final Layer Loss ($\mathcal{L}_{final}$):** This loss aligns the final output of the `STA` student with the final output of the dense-attention teacher.
    $$ \mathcal{L}_{final} = \Vert f_{\phi}(x_t, t, c) - f_{\psi}(x_t, t, c) \Vert_2^2 $$
    Where:
    *   $f_{\phi}(x_t, t, c)$ is the final output of the sliding tile model.
    *   $f_{\psi}(x_t, t, c)$ is the final output of the original attention teacher.
3.  **Data Loss ($\mathcal{L}_{data}$):** A standard loss term following the flow-matching formulation used in diffusion models.
    $$ \mathcal{L}_{data} = \Vert (f - x_0) - f_{\phi}(x_t, t, c) \Vert_2^2 , $$
    Where:
    *   $x_0$ represents the VAE latent of the original input frame (the target clean data).
    *   $x_t$ is the noised latent at diffusion step $t$.
    *   $c$ denotes the text embedding.
    *   $f$ is a target velocity field (derived from $x_0$ and $x_t$) that the model $f_{\phi}$ is trained to predict, guiding denoising towards $x_0$.

The complete objective combines these terms with weighting coefficients:

$$ \min_{\phi} \ \mathbb{E}_{x \sim p(x),\, c,\, t} \left[ \alpha \mathcal{L}_{data} + \beta \mathcal{L}_{final} + \gamma \mathcal{L}_{attn} \right] $$

Where:
*   $\alpha, \beta, \gamma$ are hyperparameters controlling the relative importance of each loss term.
*   The expectation is taken over data $x$ sampled from the data distribution `p(x)`, text embeddings $c$, and diffusion timesteps $t$.
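A schematic PyTorch sketch of the combined objective (the tensors below are stand-ins; in the actual setup $f_\phi$ is the STA student, $f_\psi$ is the frozen dense-attention teacher, and the velocity target follows the flow-matching formulation; `F.mse_loss` averages rather than sums, so it matches the formulas above only up to a constant):

```python
import torch
import torch.nn.functional as F

def finetune_loss(student_layers, teacher_layers, student_out, teacher_out,
                  velocity_target, alpha=1.0, beta=1.0, gamma=1.0):
    """alpha * L_data + beta * L_final + gamma * L_attn (schematic version)."""
    l_attn = torch.stack([F.mse_loss(s, t)                      # per-layer distillation
                          for s, t in zip(student_layers, teacher_layers)]).mean()
    l_final = F.mse_loss(student_out, teacher_out)              # final-output alignment
    l_data = F.mse_loss(student_out, velocity_target)           # flow-matching data term
    return alpha * l_data + beta * l_final + gamma * l_attn

# Toy shapes: 4 transformer layers, latent of 128 tokens with hidden size 64.
layers_s = [torch.randn(1, 128, 64, requires_grad=True) for _ in range(4)]
layers_t = [torch.randn(1, 128, 64) for _ in range(4)]
loss = finetune_loss(layers_s, layers_t, layers_s[-1], layers_t[-1],
                     velocity_target=torch.randn(1, 128, 64))
loss.backward()
```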
## 4.5. Further Details of SLIDING TILE ATTENTION (Appendix A)

### 4.5.1. Mask Definition of 3D NATTEN (Algorithm 2)

`3D NATTEN` computes an attention mask based on individual token coordinates and a shifted window concept.

**Algorithm 2: Mask Definition of 3D NATTEN**

Input: Query coordinates $(q_t, q_h, q_w)$, key coordinates $(k_t, k_h, k_w)$, video size $(L_t, L_h, L_w)$, window size $(W_t, W_h, W_w)$

1.  **Compute window center:** For each query $(q_t, q_h, q_w)$, the center of its attention window $(qc_t, qc_h, qc_w)$ is calculated. The center is shifted inwards if the query is near the video boundaries so that the window stays within bounds.
    $$ qc_t \leftarrow \max\left(\min\left(q_t,\ L_t - 1 - \tfrac{W_t}{2}\right),\ \tfrac{W_t}{2}\right) $$
    $$ qc_h \leftarrow \max\left(\min\left(q_h,\ L_h - 1 - \tfrac{W_h}{2}\right),\ \tfrac{W_h}{2}\right) $$
    $$ qc_w \leftarrow \max\left(\min\left(q_w,\ L_w - 1 - \tfrac{W_w}{2}\right),\ \tfrac{W_w}{2}\right) $$
    *   **Explanation:** These formulas clamp the window center: $\min(q, L - 1 - W/2)$ keeps the window from extending past the right/bottom edge, while $\max(\cdot, W/2)$ keeps it from extending past the left/top edge.
2.  **Compute masks:** The mask enforces spatial-temporal constraints on the distances between the window center and the key coordinates.
    $$ \mathrm{time\_constraint} \leftarrow |qc_t - k_t| \leq \tfrac{W_t}{2} $$
    $$ \mathrm{hori\_constraint} \leftarrow |qc_h - k_h| \leq \tfrac{W_h}{2} $$
    $$ \mathrm{vert\_constraint} \leftarrow |qc_w - k_w| \leq \tfrac{W_w}{2} $$
    *   **Explanation:** For a key $(k_t, k_h, k_w)$ to be included in the attention window of query $(q_t, q_h, q_w)$, its temporal, horizontal, and vertical distances from the window center must each be within half of the corresponding window dimension.
3.  Return $\mathrm{time\_constraint} \land \mathrm{hori\_constraint} \land \mathrm{vert\_constraint}$.
    *   **Explanation:** The mask is true (attention allowed) if and only if all three constraints are met.

### 4.5.2. Mask Definition of 3D STA (Algorithm 3)

`3D STA` operates in a tile-based coordinate framework.

**Algorithm 3: Mask Definition of 3D STA**

Input: Query coordinates $(q_t, q_h, q_w)$, key coordinates $(k_t, k_h, k_w)$, video size $(L_t, L_h, L_w)$, kernel size $(W_t, W_h, W_w)$, tile size $(T_t, T_h, T_w)$

1.  **Compute QK coordinates in tiles:** Query and key coordinates are mapped to their tile coordinates.
    $$ q_{t,\mathrm{tile}} \leftarrow q_t \,//\, T_t, \quad q_{h,\mathrm{tile}} \leftarrow q_h \,//\, T_h, \quad q_{w,\mathrm{tile}} \leftarrow q_w \,//\, T_w $$
    $$ k_{t,\mathrm{tile}} \leftarrow k_t \,//\, T_t, \quad k_{h,\mathrm{tile}} \leftarrow k_h \,//\, T_h, \quad k_{w,\mathrm{tile}} \leftarrow k_w \,//\, T_w $$
    *   **Explanation:** The `//` operator performs integer division, giving the tile index in each dimension; queries and keys within the same tile share the same tile ID.
2.  **Compute window size in tiles:** The window size is converted to tile units.
    $$ W_{t,\mathrm{tile}} \leftarrow W_t \,//\, T_t, \quad W_{h,\mathrm{tile}} \leftarrow W_h \,//\, T_h, \quad W_{w,\mathrm{tile}} \leftarrow W_w \,//\, T_w $$
3.  **Compute window center:** As in `NATTEN`, the window center is computed, but now in tile coordinates.
    $$ qc_{t,\mathrm{tile}} \leftarrow \max\left(\min\left(q_{t,\mathrm{tile}},\ (L_t \,//\, T_t - 1) - \tfrac{W_{t,\mathrm{tile}}}{2}\right),\ \tfrac{W_{t,\mathrm{tile}}}{2}\right) $$
    $$ qc_{h,\mathrm{tile}} \leftarrow \max\left(\min\left(q_{h,\mathrm{tile}},\ (L_h \,//\, T_h - 1) - \tfrac{W_{h,\mathrm{tile}}}{2}\right),\ \tfrac{W_{h,\mathrm{tile}}}{2}\right) $$
    $$ qc_{w,\mathrm{tile}} \leftarrow \max\left(\min\left(q_{w,\mathrm{tile}},\ (L_w \,//\, T_w - 1) - \tfrac{W_{w,\mathrm{tile}}}{2}\right),\ \tfrac{W_{w,\mathrm{tile}}}{2}\right) $$
    *   **Explanation:** This computes the clamped window center in tile indices.
4.  **Compute masks:** The mask checks whether the key tile falls within the window defined by the query tile's center and the window size in tiles.
    $$ \mathrm{time\_constraint} \leftarrow |qc_{t,\mathrm{tile}} - k_{t,\mathrm{tile}}| \leq \tfrac{W_{t,\mathrm{tile}}}{2} $$
    $$ \mathrm{hori\_constraint} \leftarrow |qc_{h,\mathrm{tile}} - k_{h,\mathrm{tile}}| \leq \tfrac{W_{h,\mathrm{tile}}}{2} $$
    $$ \mathrm{vert\_constraint} \leftarrow |qc_{w,\mathrm{tile}} - k_{w,\mathrm{tile}}| \leq \tfrac{W_{w,\mathrm{tile}}}{2} $$
5.  Return $\mathrm{time\_constraint} \land \mathrm{hori\_constraint} \land \mathrm{vert\_constraint}$.
    *   **Explanation:** The mask is true if the key tile lies within the calculated attention window in all three dimensions.

The key difference is that `STA` computes these masks at the tile level, ensuring that all tokens within a query tile attend to all tokens within a key tile whenever the tile-level constraints are met. This guarantees dense blocks.
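A direct Python transcription of this tile-level check (a sketch for clarity; the actual kernel evaluates the mask per block inside the data warpgroups rather than per token):

```python
def sta_mask(q, k, video_size, window_size, tile_size):
    """True if key k is visible to query q under 3D STA (Algorithm 3); coords are (t, h, w)."""
    for qi, ki, L, W, T in zip(q, k, video_size, window_size, tile_size):
        q_tile, k_tile = qi // T, ki // T            # Step 1: tile coordinates
        w_tile = W // T                              # Step 2: window size in tiles
        n_tiles = L // T
        # Step 3: clamp the window center (in tiles) so the window stays inside the video.
        qc = max(min(q_tile, (n_tiles - 1) - w_tile // 2), w_tile // 2)
        # Step 4: per-axis constraint at tile granularity.
        if abs(qc - k_tile) > w_tile // 2:
            return False
    return True                                      # Step 5: all three constraints met.

# Every token of a query tile sees exactly the same key tiles -> only dense/empty blocks.
video, window, tile = (12, 12, 12), (4, 4, 4), (4, 4, 4)
keys = [(t, h, w) for t in range(12) for h in range(12) for w in range(12)]
tile0_queries = [(t, h, w) for t in range(4) for h in range(4) for w in range(4)]
key_sets = {frozenset(k for k in keys if sta_mask(q, k, video, window, tile))
            for q in tile0_queries}
assert len(key_sets) == 1
```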
### 4.5.3. Tiling in STA

The following figure (Figure 8 from the original paper) illustrates the token tiling and ordering mechanism in `STA` for a 2D scenario, which extends naturally to 3D:

*Figure 8. Left: Conventional zigzag flattening strategy. Right: STA's sequence flattening strategy. The plot assumes a (9, 9) image with a (3, 3) tile size.*

*   **Conventional Flattening (Left):** Typically, 2D/3D data is flattened into a 1D sequence in zigzag or row-major/column-major order. This can cause tokens that are spatially or temporally adjacent to receive non-consecutive 1D sequence indices, disrupting locality.
*   **STA Sequence Flattening (Right):** `STA` organizes tokens into tiles (3x3 tiles in this 2D example), and tokens *within the same tile* receive consecutive sequence IDs. This preserves locality: when one query tile attends to a key tile, all participating sequence IDs remain consecutive, which translates directly into a dense block in the attention map, since `FlashAttention` operates on contiguous blocks of memory.

### 4.5.4. Visualization of 2D SWA

The following figure (Figure 9 from the original paper) visualizes how query tokens attend to their window key tokens in traditional 2D sliding window attention:

*Figure 9. 2D Sliding Window Attention visualization.*

In this visualization, the window slides token by token: for each central query (green point), `SWA` computes attention with all keys (magma-colored region) within its local window. This token-wise sliding is what produces the irregular mixed blocks when mapped onto `FlashAttention`'s block-based computation; `STA` avoids this by sliding tile by tile.

# 5. Experimental Setup

## 5.1. Datasets

The primary evaluation of `STA` is conducted on video generation with `HunyuanVideo`, a state-of-the-art open video `DiT`.

*   **HunyuanVideo:**
    *   **Task:** Generating 5-second videos with 117 frames at a resolution of $1280 \times 768$.
    *   **Latent Space:** After VAE compression and tokenization, this corresponds to a latent video of shape (30, 48, 80), where 30 is the temporal dimension and 48x80 are the spatial dimensions. The total number of tokens is $30 \times 48 \times 80 = 115{,}200$.
*   **FLUX:**
    *   **Task:** Image super-resolution.
    *   **Purpose:** Used to demonstrate `STA`'s effectiveness on 2D image diffusion models, indicating applicability beyond video.
*   **Mixkit Dataset:**
    *   **Purpose:** Used to source prompts for generating 2,000 synthetic videos from `HunyuanVideo` for finetuning.
    *   **Characteristics:** These videos are generated at $1280 \times 768$ resolution with 117 frames.

## 5.2. Evaluation Metrics

The paper evaluates `STA` on both efficiency and generated video quality.

### 5.2.1. Efficiency Metrics

1.  **Latency (ms/s):**
    *   **Conceptual Definition:** Latency measures the wall-clock time required to complete a specific operation (e.g., an attention computation or end-to-end video generation); it is a direct measure of how fast a process runs.
    *   **Mathematical Formula:** No specific formula; measured as $\mathrm{Time} = \mathrm{EndTime} - \mathrm{StartTime}$.
    *   **Symbol Explanation:** Latency is typically expressed in milliseconds (ms) for kernel performance or seconds (s) for end-to-end tasks.
2.  **Speedup ($\times$):**
    *   **Conceptual Definition:** Speedup quantifies how much faster a new method is compared to a baseline; it is the ratio of the baseline's latency to the new method's latency.
    *   **Mathematical Formula:**
        $$ \mathrm{Speedup} = \frac{\mathrm{Latency}_{\mathrm{Baseline}}}{\mathrm{Latency}_{\mathrm{New\ Method}}} $$
    *   **Symbol Explanation:** $\mathrm{Latency}_{\mathrm{Baseline}}$ is the time taken by the baseline method, and $\mathrm{Latency}_{\mathrm{New\ Method}}$ is the time taken by the new method.
3.  **Model FLOPs Utilization (MFU) (%):**
    *   **Conceptual Definition:** MFU measures how effectively a GPU kernel uses the hardware's compute capability. A higher MFU (closer to 100%) indicates that the kernel keeps the GPU's arithmetic units busy with useful work rather than stalling on memory traffic between High Bandwidth Memory (HBM) and SRAM or on masking overhead, which is critical for attention kernels.
    *   **Mathematical Formula:**
        $$ \mathrm{MFU} = \frac{\mathrm{Achieved\ FLOPs\ throughput}}{\mathrm{Theoretical\ peak\ FLOPs\ throughput}} \times 100\% $$
    *   **Symbol Explanation:** The achieved FLOPs throughput is the useful floating-point work per second performed during kernel execution; the theoretical peak is the GPU's maximum FLOPs rate.
4.  **Kernel Efficiency (%):**
    *   **Conceptual Definition:** Kernel efficiency assesses how well a sparse attention kernel translates its theoretical FLOP reductions into actual MFU improvements; it is the ratio of the sparse kernel's MFU to that of full attention.
    *   **Mathematical Formula:**
        $$ \mathrm{Kernel\ Efficiency} = \frac{\mathrm{MFU}_{\mathrm{Sparse\ Kernel}}}{\mathrm{MFU}_{\mathrm{Full\ Attention}}} \times 100\% $$
    *   **Symbol Explanation:** $\mathrm{MFU}_{\mathrm{Sparse\ Kernel}}$ is the MFU of the sparse attention kernel, and $\mathrm{MFU}_{\mathrm{Full\ Attention}}$ is that of the full attention kernel.
5.  **TFLOPs / PFLOPs (Tera / Peta Floating-Point Operations):**
    *   **Conceptual Definition:** FLOPs (floating-point operations) count the arithmetic operations performed in a computation. TFLOPs ($10^{12}$ FLOPs) and PFLOPs ($10^{15}$ FLOPs) quantify the computational cost of a task; fewer FLOPs indicate less theoretical computation.
    *   **Mathematical Formula:** Not a single formula, but the sum of all floating-point operations (additions, multiplications, etc.) in a given computation.
    *   **Symbol Explanation:** TFLOPs or PFLOPs are a direct measure of computational workload.

### 5.2.2. Video Quality Metrics

1.  **Human Evaluation:**
    *   **Conceptual Definition:** Human evaluators assess the overall quality of generated videos, typically through pairwise comparisons; this is a highly reliable measure of subjective quality.
    *   **Methodology:** The paper sampled 200 prompts from the `MovieGen Bench` and conducted pairwise comparisons in which evaluators selected the video with higher overall quality or marked a tie.
    *   **Output:** Reported as win rate, tie rate, and loss rate percentages.
2.  **VBench:**
    *   **Conceptual Definition:** `VBench` is a comprehensive benchmark suite for evaluating video generative models. It assesses multiple aspects of video quality and content (e.g., imaging quality, motion smoothness, semantic alignment, temporal consistency).
    *   **Mathematical Formula:** No single formula, as it is a suite of metrics. The paper reports a total VBench score, sub-scores for "Quality Score" and "Semantic Score," and more granular dimensions (e.g., Appearance Style, Subject Consistency, Temporal Flickering, Imaging Quality, Object Classification, Color) in Appendix Tables 8 and 9.
    *   **Symbol Explanation:** Scores are typically percentages, where higher is generally better.
## 5.3. Baselines

The `STA` method is compared against several state-of-the-art and commonly used attention mechanisms:

1.  **FlashAttention-2 (FA2) & FlashAttention-3 (FA3):**
    *   **Description:** Highly optimized dense attention implementations that serve as the standard for efficient full attention on modern GPUs. They represent the state of the art in accelerating unmasked or simply masked attention computations.
    *   **Representativeness:** They are the direct baselines for measuring kernel-level speedups for full attention, as `STA` aims to replace or augment them with sparsity.
2.  **CLEAR (Liu et al., 2024):**
    *   **Description:** A `circular window-based attention` method implemented with `FlexAttention`, aiming to linearize `Diffusion Transformers` by restricting attention to a local radius.
    *   **Representativeness:** A recent sparse attention method designed for diffusion models, making it a relevant comparison for quality and efficiency.
3.  **NATTEN (Hassani et al., 2023) & Tiled NATTEN:**
    *   **Description:** `NATTEN` implements `neighborhood attention` (sliding window attention) for 2D/3D data. `Tiled NATTEN` is an optimized variant.
    *   **Representativeness:** Direct competitors in the domain of `sliding window attention` for higher-dimensional data, highlighting the hardware-efficiency challenges `STA` aims to overcome.
4.  **Swin Transformer (Liu et al., 2021b):**
    *   **Description:** Uses hierarchical, `shifted window-based attention` for vision transformers.
    *   **Representativeness:** A widely recognized efficient vision transformer architecture. Its inclusion tests how well a non-overlapping window strategy performs in a `DiT` context, especially when applied to a pretrained model.
5.  **Δ-DiT (Chen et al., 2024):**
    *   **Description:** A `training-free acceleration method` for `Diffusion Transformers` that uses `feature caching`.
    *   **Representativeness:** A different type of acceleration method (caching vs. sparse attention) that is also training-free, providing a complementary benchmark for the overall efficiency-quality tradeoff.

# 6. Results & Analysis

## 6.1. Core Results Analysis

The experiments rigorously evaluate `STA`'s efficiency and video generation quality against various baselines.

### 6.1.1. Efficiency of SLIDING TILE ATTENTION (Tables 2 & 7)

The paper benchmarks the efficiency of various attention algorithms, assuming the generation of 720P 5-second videos with `HunyuanVideo`. The benchmark uses `seq_len = 115K`, $d_{head} = 128$, and 24 heads. `ThunderKittens`' `FA3` is used as the baseline for speedup calculations.
The following are the results from Table 2 of the original paper:
| Methods | Implementation | Config | Sparsity | TFLOPS | Latency (ms) | MFU | Kernel Efficiency | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA3 | ThunderKittens | - | 0.00% | 164.03 | 265.28 | 62.49% | 100.00% | 1.00× |
| FA3 | CUDA | - | 0.00% | 164.03 | 256.59 | 64.61% | 103.39% | 1.03× |
| CLEAR | FlexAttention | r=16 | 90.46% | 15.65 | 307.44 | 5.15% | 8.24% | 0.86× |
| NATTEN | FlexAttention | w=(19,25,25) | 89.69% | 16.91 | 313.92 | 5.44% | 8.71% | 0.85× |
| Tiled NATTEN | CUDA | w=(19,25,25) | 89.69% | 16.91 | 458.36 | 3.73% | 5.97% | 0.58× |
| Tiled NATTEN | FlexAttention | w=(19,25,25) | 89.69% | 16.91 | 208.36 | 8.20% | 13.12% | 1.27× |
| Swin | FlexAttention | w=(24,32,32) | 87.42% | 20.64 | 47.90 | 43.55% | 69.69% | 5.54× |
| STA | FlexAttention | w=(18,24,24) | 91.00% | 14.76 | 36.36 | 41.03% | 65.66% | 7.30× |
| STA | ThunderKittens | w=(30,40,40) | 58.33% | 68.35 | 111.73 | 61.82% | 98.93% | 2.37× |
| STA | ThunderKittens | w=(18,24,24) | 91.00% | 14.76 | 25.38 | 58.79% | 94.09% | 10.45× |

Table 2. Speedup with sparse attention kernels on H100 ($N = 115K$ seq_len, $d_{head} = 128$, 24 heads). Config controls the window size of each sparse attention.

**Analysis of Table 2:**

*   **Inefficiency of Existing Sparse Methods:** `CLEAR` and the `NATTEN` variants (even `Tiled NATTEN` with `FlexAttention`) show poor `kernel efficiency` (5.97% to 13.12%) despite large `TFLOPS` reductions (up to ~90% sparsity). `CLEAR` and vanilla `NATTEN` are actually slower than `FA3` (speedup < 1×) due to mask overhead and mixed blocks. `Tiled NATTEN` achieves a modest `1.27×` speedup, but its `MFU` remains very low (8.20%). This confirms the paper's claim that existing SWA implementations fail to translate FLOP reductions into proportional wall-clock speedups.
*   **Swin's Moderate Efficiency:** `Swin` achieves better `MFU` (43.55%) and `kernel efficiency` (69.69%) with a `5.54×` speedup. However, as discussed in the methodology, its non-overlapping windows can hurt expressiveness and quality.
*   **STA's Superior Efficiency:**
    *   `STA` implemented with `FlexAttention` already performs strongly, achieving a `7.30×` speedup at 91.00% sparsity with an `MFU` of 41.03% (65.66% `kernel efficiency`).
    *   The `ThunderKittens`-optimized `STA` is even more impressive: at 91.00% sparsity it achieves an outstanding `10.45×` speedup over `FA3` while maintaining a high `MFU` of 58.79%, i.e., `94.09% kernel efficiency`. `STA` thus translates sparsity into wall-clock speedup while nearly matching `FA3`'s utilization.
    *   Even at lower sparsity (58.33%), `STA` with `ThunderKittens` still provides a `2.37×` speedup with an `MFU` of 61.82%, very close to `FA3`'s `MFU` for dense attention. This flexibility allows larger window sizes while remaining highly efficient.
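
To make the "dense or empty blocks" property concrete, the sketch below prototypes a tile-wise 3D sliding-window mask with PyTorch's `FlexAttention` interface (the same interface used for several baselines above). Because the mask depends only on tile coordinates, and the tile size is chosen here to match FlexAttention's 128-token block, every attention block is either fully kept or fully skipped. The latent size, tile size, window radius, and tile-major token ordering are illustrative assumptions; this is a prototype of the idea, not the paper's ThunderKittens kernel, and it omits STA's boundary handling. It assumes PyTorch ≥ 2.5 and a CUDA GPU.

```python
# Sketch: a tile-wise 3D sliding-window mask expressed with FlexAttention.
# Assumes tokens are flattened in tile-major order, i.e. the 128 tokens of each
# (4, 8, 4) tile are contiguous, so tile boundaries align with mask blocks.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

T, H, W = 12, 16, 16          # toy latent dimensions
tT, tH, tW = 4, 8, 4          # tile size; 4 * 8 * 4 = 128 tokens per tile
wT, wH, wW = 1, 1, 1          # window radius in tiles (attend to +/- 1 tile)
nT, nH, nW = T // tT, H // tH, W // tW
tile_numel = tT * tH * tW
seq_len = T * H * W

def tile_coords(token_idx):
    """Map a flattened (tile-major) token index to its 3D tile coordinates."""
    tile_idx = token_idx // tile_numel
    tw = tile_idx % nW
    th = (tile_idx // nW) % nH
    tt = tile_idx // (nW * nH)
    return tt, th, tw

def sliding_tile_mask(b, h, q_idx, kv_idx):
    qt, qh, qw = tile_coords(q_idx)
    kt, kh, kw = tile_coords(kv_idx)
    return ((qt - kt).abs() <= wT) & ((qh - kh).abs() <= wH) & ((qw - kw).abs() <= wW)

block_mask = create_block_mask(sliding_tile_mask, B=None, H=None,
                               Q_LEN=seq_len, KV_LEN=seq_len, device="cuda")
q, k, v = (torch.randn(1, 4, seq_len, 64, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)
print(out.shape)  # (1, 4, 3072, 64)
```

The paper's actual kernel exploits the same dense-or-empty structure directly in ThunderKittens with the consumer-producer design mentioned in Section 7.1, which is where the additional efficiency over the FlexAttention variant comes from.
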
The following are the results from Table 7 of the original paper, showing performance at around 56% sparsity:

| Methods | Implementation | Config | Sparsity | TFLOPS | Latency (ms) | MFU | Kernel Efficiency | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA3 | ThunderKittens | - | 0.00% | 164.03 | 265.28 | 62.49% | 100.00% | 1.00× |
| FA3 | CUDA | - | 0.00% | 164.03 | 256.59 | 64.61% | 103.39% | 1.03× |
| CLEAR | FlexAttention | r=32 | 56.23% | 71.80 | 675.05 | 10.75% | 17.20% | 0.39× |
| NATTEN | FlexAttention | w=(30,41,41) | 56.22% | 71.81 | 804.62 | 9.02% | 14.43% | 0.33× |
| Tiled NATTEN | CUDA | w=(29,41,41) | 57.68% | 69.4 | 1173.57 | 4.04% | 6.47% | 0.15× |
| Tiled NATTEN | FlexAttention | w=(30,41,41) | 56.22% | 71.81 | 409.89 | 17.70% | 28.33% | 0.65× |
| Swin | FlexAttention | w=(48,64,64) | 55.81% | 72.49 | 127.51 | 57.46% | 91.95% | 2.08× |
| STA | FlexAttention | w=(30,40,40) | 58.33% | 68.35 | 174.17 | 39.66% | 63.46% | 1.52× |
| STA | ThunderKittens | w=(30,40,40) | 58.33% | 68.35 | 111.73 | 61.82% | 98.93% | 2.37× |

Table 7. Speedup with sparse attention kernels on H100.

**Analysis of Table 7:**

*   Even at this lower sparsity (around 56%), the trends from Table 2 persist. `CLEAR` and `NATTEN` still perform very poorly, with sub-1× speedups (i.e., slowdowns) compared to `FA3` and extremely low `MFU` and `kernel efficiency`.
*   `Swin` maintains good `MFU` (57.46%) and `kernel efficiency` (91.95%) at this sparsity, achieving a `2.08×` speedup.
*   `STA` (ThunderKittens) again leads with a `2.37×` speedup, maintaining an `MFU` of 61.82% and `kernel efficiency` of 98.93%. This confirms `STA`'s robust efficiency across different sparsity levels.

These results establish `STA` as the first `sliding-window sparse attention` that achieves both `3D locality` and `hardware efficiency`, effectively translating theoretical `FLOP` reductions into practical `wall-clock speedups`.

### 6.1.2. Human Evaluations (Figure 7)

The paper assesses human preference across five models (HunyuanVideo, STA-tf-1.89x, STA-t-2.43x, and two variants of Δ-DiT) using 200 prompts from the `MovieGen Bench`.

The following figure (Figure 7 from the original paper) shows the human evaluation results:

![Comparison of videos generated with STA and baselines](/files/papers/691859e2110b75dcc59ae191/images/10.jpg)

*The image is a schematic comparing videos generated with Sliding Tile Attention (STA) against HunyuanVideo, along with each model's generation time. The prompt describes an astronaut walking among stone buildings. The generation times are 15 min 45 s, 8 min 21 s, 6 min 29 s, and 11 min 34 s, respectively.*

Figure 7. Human evaluation on 200 prompts from the MovieGen Bench (Polyak et al., 2024). STA achieves a 1.89× end-to-end speedup while maintaining performance comparable to the original HunyuanVideo. Additionally, STA consistently outperforms Δ-DiT across different inference budgets.

**Analysis of Figure 7:**

  • STA vs. Δ-DiT: STA-t-2.43x (finetuned STA with a 2.43× speedup) decisively outperforms Δ-DiT-1.8x (1.8× speedup), achieving a dominant 70.0% win rate versus 11.0%, despite STA having the higher speedup. Similarly, STA-tf-1.89x (training-free STA with a 1.89× speedup) surpasses Δ-DiT-1.36x with a 66.5% win rate against 10.0%. This clearly indicates STA's superior quality-efficiency tradeoff compared to Δ-DiT.

  • STA vs. Original HunyuanVideo: STA-tf-1.89x maintains competitive quality compared to the original HunyuanVideo, achieving an 83.0% tie rate. While it has a 7.0 percentage point lower win rate than its loss rate, this is a very strong outcome given the 1.89x speedup, demonstrating excellent quality preservation.

    These human evaluation results are crucial as they confirm that the efficiency gains of STA do not come at the cost of perceived video quality, solidifying its practical value.

### 6.1.3. Training-free Results (Table 3)

The paper evaluates the mask-search `STA` (training-free) and Δ-DiT on `VBench` prompts, examining robustness across different sampling steps. `HunyuanVideo` outputs at the same step count are used as a reference.

The following are the results from Table 3 of the original paper:

| Model | SSIM ↑ | PSNR ↑ | CD-FVD ↓ | Latency | Speedup |
| --- | --- | --- | --- | --- | --- |
| **steps = 50** | | | | | |
| Δ-DiT | 72.86 | 18.09 | 122.74 | 693s | 1.36× |
| STA | 87.67 | 28.76 | 66.12 | 501s | 1.89× |
| **steps = 25** | | | | | |
| Δ-DiT | 77.91 | 19.86 | 196.25 | 352s | 1.34× |
| STA | 88.96 | 28.99 | 76.34 | 250s | 1.89× |
| **steps = 10** | | | | | |
| Δ-DiT | 83.19 | 21.20 | 201.24 | 144s | 1.32× |
| STA | 87.84 | 27.14 | 84.80 | 105s | 1.76× |

Table 3. Training-free performance with varying sampling steps. Δ-DiT shows consistently worse quality than STA.

**Analysis of Table 3:**

*   **STA's Superiority over Δ-DiT:** The `training-free STA` consistently outperforms Δ-DiT across all sampling steps (50, 25, 10), even while delivering significantly higher speedups.
    *   At 50 steps: STA achieves 14.81 higher SSIM (87.67 vs. 72.86), 10.67 higher PSNR (28.76 vs. 18.09), and a dramatically lower CD-FVD (66.12 vs. 122.74, where lower is better). STA provides a 1.89× speedup, while Δ-DiT reaches only 1.36×.
    *   This performance gap widens with fewer steps: at 25 steps, the CD-FVD difference is 119.91; at 10 steps, it is 116.44.
*   **Qualitative Observation:** The paper notes that Δ-DiT consistently produces visually degraded outputs (compromised structural similarity, diminished fine details), whereas STA maintains high fidelity to the original model.

This section reinforces `STA`'s effectiveness in a `training-free` setting, delivering substantial efficiency gains without compromising quality, significantly outperforming a contemporary `training-free acceleration method`.

### 6.1.4. Finetuning Results (Table 4)

This section examines the impact of replacing full attention with sparse attention, both without and with finetuning. VBench is the primary metric for quality.

The following are the results from Table 4 of the original paper:

| Methods | Config | VBench Quality | VBench Semantic | VBench Total | Attn Sparsity | PFLOPS | Latency | Speedup |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA2 | - | 85.34% | 72.17% | 82.71% | 0.00% | 574.16 | 1496s | 0.63× |
| FA3 | - | 85.34% | 72.17% | 82.71% | 0.00% | 574.16 | 945s | 1.00× |
| **w.o training** | | | | | | | | |
| CLEAR | r=32 | 84.41% | 74.20% | 82.37% | 56.23% | 280.90 | 2567s | 0.37× |
| Tiled NATTEN | w=(30,41,41) | 84.61% | 75.00% | 82.69% | 58.33% | 269.92 | 1858s | 0.51× |
| Swin | w=(48,64,64) | 80.91% | 71.35% | 79.00% | 55.81% | 283.11 | 762s | 1.24× |
| Swin | w=(30,40,40) | 78.84% | 72.28% | 77.53% | 76.49% | 175.20 | 497s | 1.90× |
| STA | w=(30,40,40) | 84.63% | 73.83% | 82.46% | 58.33% | 269.92 | 527s | 1.79× |
| STA | w=(18,24,24) | 81.47% | 77.03% | 80.58% | 91.00% | 99.54 | 268s | 3.53× |
| **w. training** | | | | | | | | |
| Swin | w=(30,40,40) | 77.50% | 67.39% | 75.48% | 55.81% | 283.08 | 497s | 1.90× |
| STA | w=(30,24,40) | 85.37% | 73.52% | 83.00% | 75.00% | 182.99 | 388s | 2.44× |
| STA | w=(18,24,24) | 84.76% | 74.05% | 82.62% | 91.00% | 99.54 | 268s | 3.53× |

Table 4. End-to-end performance and speedup comparison with HunyuanVideo. VBench Score is calculated across 1000 prompts. Speedup is relative to FA3. Latency is for 50 steps inference of 117 frames (1280x768).

**Analysis of Table 4:**

  • Inefficiency of CLEAR and Tiled NATTEN: CLEAR and Tiled NATTEN, despite significant attention sparsity (56.23% and 58.33% respectively), actually increase end-to-end latency (0.37x and 0.51x speedup, meaning slowdowns) compared to FA3. While their VBench Total scores (82.37%, 82.69%) are close to FA3 (82.71%) in the training-free setting, their practical utility is diminished by the speed regression.
  • Swin's Limitations: Swin achieves moderate speedups (1.24x to 1.90x). However, its rigid, non-overlapping window partitions violate 3D locality, leading to a degraded VBench Total score (e.g., 79.00% at 1.24x speedup). Crucially, even finetuning with Swin fails to recover performance and further lowers the VBench Total score to 75.48%. This highlights Swin's unsuitability for direct application or finetuning in this context.
  • STA's Strong Performance (w.o training):
    • With a window configuration of w=(30,40,40) (58.33% sparsity), STA achieves a 1.79× speedup with a VBench Total of 82.46%, very close to FA3's 82.71%.
    • At a higher sparsity of 91.00% (w=(18,24,24)), STA delivers a remarkable 3.53× speedup (latency 268s) with a VBench Total of 80.58%, demonstrating minimal quality tradeoff for significant efficiency gains.
  • STA's Enhanced Performance (w. training):
    • Finetuning further improves STA's quality. For w=(30,24,40) (75.00% sparsity), STA achieves a 2.44× speedup and a VBench Total of 83.00%, surpassing FA3's score.

    • For the highest sparsity w=(18,24,24) (91.00%), finetuning raises the VBench Total to 82.62% (from 80.58% without training), nearly matching FA3's baseline while maintaining the 3.53× speedup.

      These results confirm STA's ability to achieve a superior quality-efficiency tradeoff, effectively leveraging sparsity without significant quality loss, and even improving quality with targeted finetuning (a generic sketch of such a distillation-style finetuning objective follows).
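
Since the finetuned variants are trained to match the original full-attention model, a distillation-style objective is the natural fit. The sketch below shows one generic way such a loss can be written, as an MSE between student and teacher hidden states plus an MSE on the final output; the targets, weighting, and granularity here are assumptions for illustration rather than the paper's exact recipe.

```python
# Generic distillation-style finetuning objective (illustrative, not the paper's
# exact loss): match intermediate hidden states of a frozen full-attention
# teacher, and its final output, with the sparse-attention student.
import torch
import torch.nn.functional as F

def distillation_loss(student_hiddens, teacher_hiddens, student_out, teacher_out,
                      alpha: float = 0.25):
    hidden_loss = sum(F.mse_loss(s, t.detach())
                      for s, t in zip(student_hiddens, teacher_hiddens))
    hidden_loss = hidden_loss / max(len(student_hiddens), 1)
    output_loss = F.mse_loss(student_out, teacher_out.detach())
    return alpha * hidden_loss + (1.0 - alpha) * output_loss

# Toy usage with random tensors standing in for per-block activations:
s_h = [torch.randn(2, 16, 64, requires_grad=True) for _ in range(3)]
t_h = [torch.randn(2, 16, 64) for _ in range(3)]
s_out, t_out = torch.randn(2, 16, 64, requires_grad=True), torch.randn(2, 16, 64)
distillation_loss(s_h, t_h, s_out, t_out).backward()
```
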

### 6.1.5. Results on Image Super-Resolution (Table 6)

The paper also applies STA to speed up image super-resolution with SDEdit using the FLUX model.

The following are the results from Table 6 of the original paper:

| Methods | SSIM | PSNR | Sparsity | Latency | Speedup |
| --- | --- | --- | --- | --- | --- |
| **1K→2K** | | | | | |
| CLEAR r=16 | 0.9291 | 28.1142 | 96.12% | 13s | 1.54× |
| CLEAR r=32 | 0.9443 | 29.6722 | 85.94% | 15s | 1.33× |
| STA w=(48,72) | 0.9357 | 29.1086 | 81.25% | 14s | 1.43× |
| **2K→4K** | | | | | |
| CLEAR r=16 | 0.9394 | 29.0463 | 98.98% | 67s | 2.90× |
| CLEAR r=32 | 0.9455 | 30.0742 | 96.08% | 92s | 2.11× |
| STA w=(48,72) | 0.9470 | 30.1939 | 95.31% | 57s | 3.40× |

Table 6. Image super-resolution results with FLUX (Black-Forest, 2023) on 1,000 captions randomly sampled from the COCO-2014 (Lin et al., 2015) validation set.

**Analysis of Table 6:**

  • STA with w=(48,72) achieves comparable or slightly better quality (SSIM, PSNR) than CLEAR while generally offering higher efficiency (lower latency, higher speedup), especially for the more demanding 2K→4K super-resolution task.
  • For 2K→4K, STA achieves a 3.40× speedup with SSIM 0.9470 and PSNR 30.1939, outperforming CLEAR r=16 (2.90× speedup, lower quality) and CLEAR r=32 (2.11× speedup, slightly lower quality). This demonstrates STA's versatility and effectiveness not just in 3D video generation but also in 2D image tasks, hinting at its broader applicability.

## 6.2. Ablation Studies / Parameter Analysis

The paper conducts implicit ablation studies and parameter analyses through its various experimental setups:

### 6.2.1. Training-free vs. Finetuning

This is a direct comparison of STA's performance under two different application modes.

  • Training-free: Demonstrates STA's immediate plug-and-play benefit by leveraging inherent 3D locality and head specialization with a heuristic mask search (a toy sketch of this profiling idea follows this list). This yields significant speedups with minimal quality loss (e.g., STA w=(18,24,24) achieves a 3.53× speedup with 80.58% VBench Total in Table 4).
  • Finetuning: Shows that with a small amount of additional training, STA can adapt to more aggressive sparsity levels and even recover or surpass the original model's quality (STA w=(18,24,24) with training achieves a 3.53× speedup and 82.62% VBench Total, almost matching FA3's 82.71%). This highlights the model's adaptability and the value of targeted training.
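
The mask search itself is described only at a high level here, but the profiling idea can be illustrated with a toy sketch: densely compute a head's attention on a small latent, measure how much attention mass each candidate 3D window would retain, and keep the smallest window that exceeds a recall threshold (heads that are not local enough keep full attention). Everything below, including the candidate windows and the threshold, is a hypothetical illustration of that general idea, not the paper's actual search procedure.

```python
# Toy sketch of profiling-based window selection for one attention head.
# The attention map here is random; in practice it would come from running the
# pretrained model on a few prompts at a reduced latent size.
import torch

T, H, W = 6, 8, 10
L = T * H * W
attn = torch.softmax(torch.randn(L, L), dim=-1)   # stand-in for a profiled head

coords = torch.stack(torch.meshgrid(
    torch.arange(T), torch.arange(H), torch.arange(W), indexing="ij"), dim=-1).reshape(L, 3)

def window_recall(attn_map, window):
    """Average fraction of attention mass inside the 3D window around each query."""
    wt, wh, ww = window
    diff = (coords[:, None, :] - coords[None, :, :]).abs()          # (L, L, 3)
    inside = (diff[..., 0] <= wt // 2) & (diff[..., 1] <= wh // 2) & (diff[..., 2] <= ww // 2)
    return (attn_map * inside).sum(dim=-1).mean().item()

candidates = [(3, 3, 3), (3, 5, 5), (5, 5, 5), (5, 7, 7)]           # odd window sizes
threshold = 0.9
chosen = next((w for w in candidates if window_recall(attn, w) >= threshold),
              "full attention")  # non-local heads fall back to full attention
print("selected window:", chosen)
```
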

### 6.2.2. Impact of Sparsity / Window Size

Tables 2 and 4 show how different window sizes (e.g., w=(30,40,40) vs. w=(18,24,24) for STA) lead to varying levels of attention sparsity and, consequently, different latency and speedup values.

  • Higher sparsity (e.g., 91.00% with w=(18,24,24)) leads to greater speedups (3.53×) but might initially cause a slight quality drop (VBench Total of 80.58% without training).
  • Lower sparsity (e.g., 58.33% with w=(30,40,40)) yields moderate speedups (1.79×) with almost no quality degradation (82.46% VBench Total). The ability to finetune allows the model to achieve high sparsity and high quality simultaneously. (The short sketch after this list shows how these sparsity figures follow from the window and latent sizes.)
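
These sparsity figures follow directly from the window and latent sizes, under the assumption that every query attends to a fixed window of tokens (which the reported 58.33% and 91.00% values imply). A quick check:

```python
# Sparsity implied by a fixed per-query window over the (30, 48, 80) latent.
latent = (30, 48, 80)

def sta_sparsity(window, latent=latent):
    attended = window[0] * window[1] * window[2]
    total = latent[0] * latent[1] * latent[2]
    return 1.0 - attended / total

print(f"{sta_sparsity((30, 40, 40)):.2%}")  # 58.33%
print(f"{sta_sparsity((18, 24, 24)):.2%}")  # 91.00%
```
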

### 6.2.3. Detailed VBench Results (Tables 8 & 9)

These tables provide a granular view of how different VBench dimensions are affected by STA and finetuning.

The following are the results from Table 8 of the original paper:

| Model | Appearance Style | Subject Consistency | Background Consistency | Temporal Flickering | Motion Smoothness | Dynamic Degree | Aesthetic Quality | Imaging Quality | Overall Consistency |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA3 | 18.43% | 94.22% | 96.74% | 99.21% | 99.15% | 75.00% | 64.63% | 67.97% | 25.96% |
| **w.o training** | | | | | | | | | |
| CLEAR | 18.73% | 93.63% | 96.51% | 98.99% | 99.01% | 68.06% | 63.75% | 68.35% | 26.23% |
| Tiled NATTEN | 18.79% | 94.59% | 96.61% | 98.75% | 98.85% | 70.83% | 63.79% | 68.16% | 26.53% |
| Swin w=(48,64,64) | 20.85% | 91.74% | 95.48% | 98.67% | 97.77% | 77.78% | 51.01% | 62.22% | 25.27% |
| Swin w=(30,40,40) | 20.62% | 90.33% | 93.09% | 98.78% | 96.53% | 75.00% | 48.10% | 61.89% | 25.62% |
| STA w=(30,40,40) | 18.79% | 94.75% | 96.50% | 98.82% | 98.83% | 69.44% | 64.18% | 68.39% | 26.47% |
| STA w=(18,24,24) | 21.25% | 89.66% | 91.64% | 98.46% | 97.27% | 83.33% | 59.75% | 64.23% | 26.61% |
| **w. training** | | | | | | | | | |
| Swin w=(30,40,40) | 20.07% | 89.78% | 94.93% | 98.86% | 96.64% | 70.83% | 44.91% | 55.99% | 26.00% |
| STA w=(30,24,40) | 18.90% | 94.90% | 97.60% | 99.68% | 99.23% | 73.61% | 63.77% | 66.21% | 26.58% |
| STA w=(18,24,24) | 18.90% | 94.64% | 96.76% | 99.22% | 99.11% | 69.44% | 64.52% | 66.67% | 26.09% |

Table 8. Model Performance Comparison - Part 1

The following are the results from Table 9 of the original paper:

| Model | Object Classification | Multiple Objects | Human Action | Color | Spatial Relationship | Scene | Quality Score | Semantic Score | Final Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FA3 | 85.76% | 70.12% | 90.00% | 88.66% | 71.28% | 35.25% | 85.34% | 72.17% | 82.71% |
| **w.o training** | | | | | | | | | |
| CLEAR | 88.13% | 77.97% | 88.00% | 91.10% | 77.49% | 32.85% | 84.41% | 74.20% | 82.37% |
| Tiled NATTEN | 83.54% | 72.18% | 94.00% | 92.28% | 81.21% | 37.94% | 84.61% | 75.00% | 82.69% |
| Swin w=(48,64,64) | 78.16% | 58.54% | 87.00% | 93.68% | 77.45% | 37.79% | 80.91% | 71.35% | 79.00% |
| Swin w=(30,40,40) | 79.19% | 60.44% | 88.00% | 93.68% | 77.24% | 35.54% | 78.84% | 72.28% | 77.53% |
| STA w=(30,40,40) | 80.54% | 71.19% | 93.00% | 89.81% | 79.25% | 36.77% | 84.63% | 73.83% | 82.47% |
| STA w=(18,24,24) | 88.13% | 75.46% | 91.00% | 91.61% | 82.52% | 42.15% | 81.47% | 77.03% | 80.58% |
| **w. training** | | | | | | | | | |
| Swin w=(30,40,40) | 77.14% | 48.86% | 73.00% | 87.00% | 63.38% | 39.03% | 77.50% | 67.39% | 75.48% |
| STA w=(30,24,40) | 91.77% | 68.45% | 86.00% | 89.59% | 72.76% | 39.53% | 85.37% | 73.52% | 83.00% |
| STA w=(18,24,24) | 92.96% | 74.16% | 93.00% | 84.50% | 73.41% | 38.23% | 84.76% | 74.05% | 82.62% |

Table 9. Model Performance Comparison - Part 2

**Analysis of Tables 8 & 9:**

  • STA vs. Other Baselines: STA generally surpasses Swin in video quality metrics like Imaging Quality and Multiple Objects, and achieves comparable or superior scores to CLEAR and Tiled NATTEN.
  • Impact of Sparsity (Training-Free STA):
    • As sparsity increases in training-free STA (e.g., from w=(30,40,40) to w=(18,24,24)), there is a systematic degradation in quality-related metrics such as Temporal Flickering, Motion Smoothness, Aesthetic Quality, and Imaging Quality. This is expected, as reducing the attention range can impact fine-grained visual coherence.
    • Conversely, semantic-aligned dimensions—including Appearance Style, Color, and Spatial Relationships—tend to improve under higher sparsity regimes. The paper hypothesizes this is because text embeddings' role in attention computation becomes amplified when spatial-temporal attention is sparsified, leading to stronger text-to-video alignment in semantic aspects.
  • Efficacy of Training: The finetuned STA models demonstrate significant gains in video quality metrics over their untrained counterparts. For example, STA w=(18,24,24) (trained) improves its Imaging Quality from 64.23% to 66.67% and Aesthetic Quality from 59.75% to 64.52% compared to its untrained version, while maintaining semantic coherence at comparable levels. This underscores that finetuning is highly effective at refining low-level visual fidelity without compromising text-video alignment, making STA a robust solution.

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

This paper introduces SLIDING TILE ATTENTION (STA), a novel attention mechanism designed to significantly accelerate video Diffusion Transformers (DiTs) while preserving high video generation quality. The core innovation of STA is its hardware-aware tile-by-tile sliding window design, which effectively exploits the 3D locality observed in pretrained video diffusion models. Unlike previous sliding window attention (SWA) implementations that suffer from mixed blocks and poor GPU utilization, STA guarantees the generation of only dense and empty blocks, leading to proportional wall-clock speedups aligned with theoretical FLOP reductions.

Through careful kernel-level optimizations (leveraging ThunderKittens and FlashAttention-3's consumer-producer paradigm), STA achieves an impressive Model FLOPs Utilization (MFU) of 58.79% and attention speedups of up to 10.45× over FlashAttention-3. Applied to HunyuanVideo, a leading video DiT, STA reduces end-to-end inference latency from 945 seconds to 501 seconds without any quality degradation in a training-free manner. Further finetuning enables even greater efficiency, lowering latency to 268 seconds with only a marginal 0.09% drop in VBench score. The human evaluations and quantitative metrics (SSIM, PSNR, CD-FVD, VBench) consistently demonstrate STA's superior quality-efficiency tradeoff compared to existing sparse attention and acceleration methods such as NATTEN, CLEAR, Swin, and Δ-DiT.

## 7.2. Limitations & Future Work
The authors highlight that `STA` is `orthogonal` to other acceleration techniques, such as `caching` (Δ-DiT) and `consistency distillation` (methods that reduce the number of sampling steps in diffusion models). This suggests that STA can potentially be combined with these methods for even greater efficiency gains. The paper explicitly states that exploring their combined effectiveness is planned as future work.

While not explicitly stated as limitations, the current approach involves:

  • A mask search algorithm for optimal window sizes in the training-free setting, which requires a small amount of profiling.
  • The assumption that video content exhibits strong 3D locality. While verified for state-of-the-art models like HunyuanVideo, this might not universally hold for all types of video content or future diffusion architectures.
  • The need for finetuning to achieve the highest sparsity and quality, which, while minimal, adds a training step.

## 7.3. Personal Insights & Critique

This paper presents a highly impactful contribution to the field of efficient video generation. The core insight—that previous sliding window attention implementations fail due to hardware-unfriendly mixed blocks, and that a tile-by-tile approach can resolve this—is elegant and effective. The system-algorithm co-design philosophy is a crucial takeaway, demonstrating that theoretical efficiency gains often require deep understanding and optimization at the hardware/kernel level to translate into practical speedups.

Inspirations & Transferability:

  • The tile-by-tile strategy, coupled with asynchronous data loading, could be broadly applicable to other sparse attention patterns or any high-dimensional sparse computation where locality is a strong characteristic. This could extend beyond video generation to other 3D data processing tasks in vision (e.g., medical imaging, point clouds) or even very long 1D sequences where local attention is desired.
  • The concept of head specialization and training-free mask search is a pragmatic approach for deploying efficient models without costly retraining, which is vital for large, pretrained models. This could inspire similar profiling-based optimization techniques for other architectural components.
  • The attention distillation loss in finetuning is a robust way to adapt pretrained models to sparser architectures while retaining performance, a valuable technique for model compression and acceleration.

Potential Issues & Areas for Improvement:

  • Fixed Window Sizes: The current approach uses fixed window sizes (or fixed sets of window sizes) determined by profiling. Future work could explore adaptive window sizing that dynamically adjusts window dimensions based on content or attention patterns during inference, potentially leading to even more optimal sparsity.

  • Dynamic Masking: While STA eliminates mixed blocks, the inter-block mask is still determined by data warpgroups. Investigating more dynamic, content-aware masking at this level could yield further gains.

  • Dependency on FlashAttention-like Kernels: STA's efficiency heavily relies on FlashAttention's underlying architecture (tiling, online softmax, consumer-producer). While FlashAttention is a de-facto standard, alternative hardware architectures might require different system-algorithm co-designs.

  • Generality of 3D Locality: While the paper provides strong evidence for 3D locality in HunyuanVideo, it's an empirical observation. Whether this holds universally across all future video DiTs, diverse datasets, and various generation tasks (e.g., highly dynamic, non-local motions) remains to be seen. If locality breaks down, STA's performance might degrade, necessitating fallback to denser attention or more global sparse patterns.

  • "Semantic Score Improvement with Sparsity": The observation that semantic-aligned dimensions sometimes improve with higher sparsity because text embeddings' role is amplified is intriguing. This could imply a "feature suppression" effect where too much local visual attention might distract from global semantic alignment, or that the model relies more heavily on the text signal when local visual context is pruned. Further research could explore this phenomenon to balance visual fidelity and semantic coherence more explicitly.

    Overall, STA is a practical and well-engineered solution that significantly pushes the boundaries of efficient video generation, making advanced video AI systems more accessible.
