
SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference

Published: 02/25/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SpargeAttention introduces a training-free, universal sparse attention mechanism to accelerate model inference. It employs a two-stage online filter to predict the attention map and skip matrix multiplications, significantly enhancing performance across various tasks while maintaining end-to-end accuracy.

Abstract

An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference. It focuses on developing an efficient and universal sparse attention mechanism that can accelerate various large models without requiring retraining or sacrificing accuracy.

1.2. Authors

The authors are:

  • Jintao Zhang

  • Chendong Xiang

  • Haofeng Huang

  • Jia Wei

  • Haocheng Xi

  • Jun Zhu

  • Jianfei Chen

    The affiliations are primarily associated with Tsinghua University, indicated by superscripts 1 and 2, with 3 likely an external affiliation. The full affiliations are not explicitly detailed in the provided abstract/header, but the thu-ml organization in the GitHub link points to Tsinghua University's Machine Learning group.

1.3. Journal/Conference

The paper is published as a preprint on arXiv, with the most recent version (v8) published on 2025-02-25T12:02:17.000Z. The specific journal or conference where it might be formally published is not stated, but the submission date suggests it is intended for a top-tier ML conference or journal in 2025. Given the arXiv status and the early 2025 date, it is likely awaiting peer review or has been accepted to a major conference.

1.4. Publication Year

The paper was published on arXiv on 2025-02-25.

1.5. Abstract

The abstract introduces the problem of quadratic time complexity in attention mechanisms, which is a major bottleneck for large models, especially with long sequence lengths. It notes that attention maps commonly exhibit sparsity, meaning many values are near zero, which can be exploited for computational efficiency. While existing sparse attention methods accelerate specific models by leveraging certain sparse patterns, a universal solution that guarantees both speedup and end-to-end performance across diverse models remains elusive.

This paper proposes SpargeAttn, a universal, training-free, sparse, and quantized attention mechanism. It employs a two-stage online filter:

  1. First Stage: Rapidly and accurately predicts the attention map, enabling the skip of some matrix multiplications in attention ($QK^\top$ and $PV$). This stage uses selective token compression based on token similarity within blocks.

  2. Second Stage: Introduces an online softmax-aware filter that incurs no extra overhead, further skipping matrix multiplications ($\tilde{P}_{ij} V_j$). This stage leverages the properties of online softmax to identify negligible values.

    SpargeAttn is integrated with the SageAttention framework for additional quantization-based acceleration. Experiments demonstrate that SpargeAttn significantly accelerates diverse models (language, image, and video generation) without compromising end-to-end performance metrics. The code is publicly available.

https://arxiv.org/abs/2502.18137 (Preprint)

https://arxiv.org/pdf/2502.18137v8.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The core problem SpargeAttn addresses is the computational inefficiency of the attention mechanism in large deep learning models. The standard self-attention operation has a time complexity that is quadratic with respect to the sequence length (typically $O(N^2 d)$, where $N$ is the sequence length and $d$ is the head dimension). As models handle increasingly longer sequences (e.g., 45K-128K tokens in video generation and language models), this quadratic complexity leads to a significant portion of inference latency.

The paper identifies two key challenges with existing sparse attention methods:

  • L1. Universality: Most methods are tailored to specific tasks (e.g., language modeling) by exploiting predefined patterns (e.g., sliding windows, attention sinks). However, attention patterns vary significantly across different model types (language, image, video), making these task-specific methods non-universal.

  • L2. Usability: Achieving both accuracy (precise prediction of important attention map regions) and efficiency (minimal overhead for prediction) is difficult. Aggressive approximation can lead to loss of critical information, while complex prediction can negate speedup benefits. For example, some methods require very large sequence lengths to show noticeable speedup.

    The paper's entry point or innovative idea is to create a universal, training-free sparse attention operator that can accelerate any model inference across various modalities (language, image, video) and sequence lengths, without compromising the model's end-to-end performance. It aims to achieve this by dynamically identifying and skipping negligible computations in the attention map using a novel two-stage filtering approach.

2.2. Main Contributions / Findings

The primary contributions of SpargeAttn are:

  1. A Universal Two-Stage Online Filtering Mechanism:

    • First Stage (Sparse Prediction): A novel selective token compression algorithm is proposed. It accurately predicts sparse blocks in the attention map by compressing blocks of query (Q) and key (K) that exhibit high self-similarity into single representative tokens. This allows for skipping of corresponding matrix multiplications ($Q_i K_j^\top$) and partial value multiplications ($\tilde{P}_{ij} V_j$). This stage is pattern-free and dynamically adapts to input.
    • Second Stage (Sparse Warp Online Softmax): An additional softmax-aware filter is designed that incurs no extra overhead. It leverages the properties of online softmax to further identify and skip negligible value multiplications ($\tilde{P}_{ij} V_j$) when the row maximum of the attention score is significantly smaller than the running maximum, indicating that the softmax output will be near zero for those elements.
  2. Integration with Quantized Attention (SageAttention): SpargeAttn is seamlessly integrated into the 8-bit quantized SageAttention framework, further accelerating inference. The authors note that quantization and sparse operations are orthogonal, allowing for their combined benefits.

  3. HilbertCurve Permutation for Visual Models: For image and video models, SpargeAttn introduces HilbertCurve permutation to enhance the self-similarity of adjacent tokens. By reordering 3D visual tokens along a Hilbert Curve, it preserves locality, increases block self-similarity, and thus improves sparsity without accuracy loss.

  4. Demonstrated Universality and Performance:

    • Universality: SpargeAttn is shown to be effective across diverse generative models, including language modeling (Llama3.1), text-to-image (Flux, Stable-Diffusion3.5), and text-to-video (CogvideoX, Mochi, Open-Sora-Plan), and across various sequence lengths.

    • Accuracy: It robustly retains model end-to-end performance with almost no metric loss compared to Full-Attention, significantly outperforming existing sparse attention baselines (MInference, FlexPrefill) which often incur performance degradation.

    • Efficiency: SpargeAttn achieves 2.5x to 5x faster speeds compared to dense and existing sparse attention models. For instance, it achieves 1.83x speedup on Mochi without video quality loss. The sparse block prediction overhead is shown to be minimal, especially for longer sequences.

      These findings highlight SpargeAttn's ability to overcome the limitations of prior sparse attention methods by providing a training-free, universal, accurate, and highly efficient solution for accelerating large model inference.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand SpargeAttn, a reader needs to be familiar with the following core concepts:

  • Attention Mechanism: At its heart, attention in transformer models allows the model to weigh the importance of different parts of the input sequence when processing each element.

    • Query (Q), Key (K), Value (V) Matrices: In self-attention, an input sequence $X$ is transformed into three matrices: Query ($Q$), Key ($K$), and Value ($V$). These are typically derived from the input embeddings through linear transformations.
    • Attention Score Calculation: The core operation computes a similarity score between each query and all keys via the matrix multiplication $QK^\top$. The scores are then scaled by $\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) to prevent very large values from pushing softmax into regions with tiny gradients.
    • Softmax: The scaled scores are then passed through a softmax function to produce an attention map ($P$). Softmax normalizes these scores into a probability distribution, where each value indicates the "attention weight" a query should place on a specific key. The formula for softmax for a vector $x = [x_1, ..., x_n]$ is: $ \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $
    • Weighted Sum of Values: Finally, the attention map ($P$) is multiplied by the Value matrix ($V$) to produce the output. This effectively creates a weighted sum of value vectors, where the weights are determined by the attention map. $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $
    • Quadratic Complexity: The matrix multiplication $QK^\top$ results in an $N \times N$ attention map for a sequence of length $N$. This operation has a computational complexity of $O(N^2 d_k)$, which is the primary reason for the performance bottleneck at long sequence lengths.
  • Sparsity in Attention Maps: The softmax function tends to concentrate probability mass on a few dominant attention scores, making most other scores very close to zero. This phenomenon is known as sparsity in the attention map. SpargeAttn and other sparse attention methods aim to exploit this sparsity by skipping computations for these near-zero values, thus reducing computational load.

  • FlashAttention: FlashAttention (Dao et al., 2022; Dao, 2024) is a highly optimized implementation of exact attention. Instead of computing the full $N \times N$ attention map in device memory, FlashAttention uses a tiling strategy. It divides the Q, K, V matrices into smaller blocks and computes the attention output progressively using an online softmax algorithm. This significantly reduces memory usage and improves speed by optimizing I/O awareness between fast on-chip memory (SRAM) and slower off-chip memory (HBM). SpargeAttn builds directly on the FlashAttention framework.

  • Online Softmax: In traditional softmax, the normalization term (the sum in the denominator) requires summing over all elements. For long sequences, this sum can become very large or very small, leading to numerical instability. Online softmax is a technique used in FlashAttention that computes the softmax in a numerically stable and incremental manner. It maintains a running maximum ($m$) and a running sum ($l$) of the exponents. When processing a new block of attention scores, it updates these running statistics to calculate the softmax for that block, effectively normalizing across the entire sequence seen so far. This avoids storing the full $N \times N$ attention score matrix (a minimal sketch of this procedure appears after this list).

  • Quantization: Quantization is a technique used to reduce the precision of numerical representations (e.g., from 32-bit floating-point to 8-bit integer) in neural networks. This reduces memory footprint and can accelerate computations, especially on hardware optimized for lower precision arithmetic. SpargeAttn integrates with SageAttention, a quantized attention framework, to further enhance speed. SageAttention specifically focuses on 8-bit quantization for plug-and-play inference acceleration.

  • Cosine Similarity: Cosine similarity is a measure of similarity between two non-zero vectors of an inner product space. It measures the cosine of the angle between them. A value of 1 means the vectors are identical in direction, 0 means they are orthogonal, and -1 means they are opposite in direction. In SpargeAttn, it's used to measure the self-similarity of tokens within a block of $Q$ or $K$. $ \mathrm{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $

  • Cumulative Distribution Function (CDF) Concept (for TopCdf): A cumulative distribution function describes the probability that a random variable takes on a value less than or equal to a given value. In SpargeAttn, TopCdf uses a similar idea: it selects tokens whose cumulative attention scores (when sorted) reach a certain percentage ($\tau$) of the total sum. This effectively identifies the most important attention scores.

  • GPU Warps: A warp is a group of threads that execute the same instruction in parallel on a GPU (Graphics Processing Unit). GPUs achieve high parallelism by executing many warps simultaneously. SpargeAttn's second stage (sparse warp online softmax) leverages this warp-level parallelism to further optimize computation by allowing individual warps to skip calculations if their assigned part of the attention map is deemed negligible.
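To make the online softmax bookkeeping concrete, here is a minimal single-head PyTorch sketch (not the paper's CUDA kernel; block size and tensor shapes are illustrative) that streams key/value blocks while maintaining the running maximum and sum, and matches the dense formulation:

    import torch

    def online_softmax_attention(Q, K, V, block=64):
        # Minimal single-head sketch of FlashAttention-style online softmax:
        # keep a running row max (m) and row sum (l) while streaming key/value blocks.
        N, d = Q.shape
        scale = d ** -0.5
        m = torch.full((N, 1), float("-inf"))
        l = torch.zeros(N, 1)
        O = torch.zeros(N, d)
        for j in range(0, K.shape[0], block):
            S = (Q @ K[j:j + block].T) * scale                     # scores for this key block
            m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
            P = torch.exp(S - m_new)                               # shifted, unnormalized probs
            alpha = torch.exp(m - m_new)                           # rescale previous accumulators
            l = alpha * l + P.sum(dim=-1, keepdim=True)
            O = alpha * O + P @ V[j:j + block]
            m = m_new
        return O / l

    # Sanity check against the standard dense formulation.
    Q, K, V = (torch.randn(256, 64) for _ in range(3))
    ref = torch.softmax(Q @ K.T / 64 ** 0.5, dim=-1) @ V
    assert torch.allclose(online_softmax_attention(Q, K, V), ref, atol=1e-4)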

3.2. Previous Works

The paper categorizes existing sparse attention methods into three types based on how the sparse mask (which indicates important non-zero entries to compute) is generated:

  1. Pattern-based methods: These rely on fixed, predefined sparsity patterns derived from empirical observations, often specific to certain tasks.

    • Examples: sliding windows (H2O, InfLLM, DuoAttention), attention sinks (SampleAttention, MoA, StreamingLLM). DitFastAttn combines sliding windows with similarity between attention maps but is restricted to Diffusion Transformers.
    • Limitation: Lack universality. As attention patterns vary significantly across different models (language, image, video, see Figure 2 in the paper), these fixed patterns do not generalize well. For example, DitFastAttn is incompatible with language models and MMDiT models.
  2. Dynamic sparse methods: These methods compute the sparse mask on-the-fly based on the input, making them potentially more universal.

    • Channel compression: Methods like SparQAttn and LokiAttn reduce the dimensionality of $Q$ and $K$ while performing full attention.
      • Limitation: Limited speedup potential, as the head dimension is often already small (e.g., 64, 128).
    • Token compression: Methods like MInference and FlexPrefill compress blocks of tokens into single representative tokens and compute attention on this shorter sequence.
      • Limitation: Can be too aggressive. Missing important blocks is possible if their attention scores are not high on the compressed sequence. MInference may require very large sequence lengths (e.g., 100K) for noticeable speedup.
    • Training-required dynamic methods: SeerAttention requires training additional parameters to learn sparse attention.
      • Limitation: Expensive to use due to the need for retraining.
    • General Limitation: Many dynamic sparse methods are designed primarily for language models, and their applicability to other modalities (like diffusion models) is uncertain.
  3. Training-based methods: These methods modify the attention computation logic and require retraining the entire model from scratch.

    • Examples: Reformer, FastAttention.

    • Limitation: Extremely expensive and impractical to apply to pre-trained large models without significant resources.

      Beyond sparse attention, other orthogonal acceleration techniques include:

  • Kernel Optimization: Optimizing the underlying CUDA kernels (e.g., FlashAttention, FlashAttention-2, FlashAttention-3). SpargeAttn builds on FlashAttention.
  • Quantization: Reducing numerical precision (e.g., SageAttention). SpargeAttn integrates with this.
  • Workload Distribution: Distributing attention computation across multiple devices (e.g., RingAttention).
  • Linear Time Attention: Designing attention mechanisms with linear time complexity (e.g., Linformer, Performer).

3.3. Technological Evolution

The evolution of attention mechanisms has moved from the original Transformer's self-attention with its quadratic complexity to various optimizations. Initially, the focus was on exact attention with memory and speed optimizations, notably FlashAttention, which addresses I/O bottlenecks. Concurrently, researchers recognized the inherent sparsity in attention maps and began developing sparse attention methods. Early sparse attention often relied on predefined patterns, which limited their universality. Later, dynamic sparse methods emerged to adapt to input, but still faced challenges in balancing accuracy and efficiency without retraining, and were often modality-specific. Quantization also emerged as an orthogonal technique for further acceleration.

SpargeAttn fits into this timeline by building on the FlashAttention framework and addressing the universality and usability limitations of prior dynamic sparse methods. It aims to provide a training-free, universal, and accurate solution by combining selective token compression with softmax-aware filtering and integrating quantization, effectively merging the benefits of FlashAttention's efficiency with a robust dynamic sparsity approach across diverse models.

3.4. Differentiation Analysis

SpargeAttn differentiates itself from previous sparse attention methods primarily in its universality, training-freeness, and accuracy preservation across diverse model types:

  • Universality (vs. Pattern-based & Modality-specific Dynamic Methods):

    • Unlike pattern-based methods (e.g., sliding windows, attention sinks) which are model or task-specific, SpargeAttn employs a pattern-free selective token compression that dynamically adapts to input similarities. This allows it to work across language, image, and video generation models without requiring task-specific knowledge.
    • Many dynamic sparse methods (e.g., MInference, FlexPrefill, SeerAttention) were primarily designed for language models or required training. SpargeAttn explicitly demonstrates applicability across various modalities and does not require training of additional parameters.
  • Training-free (vs. Training-based & some Dynamic Methods):

    • SpargeAttn is training-free, meaning it can be directly applied to pre-trained models without the prohibitive cost and effort of retraining the entire model (unlike Reformer or FastAttention, or SeerAttention). This makes it highly usable for existing large models.
  • Accuracy Preservation (vs. Aggressive Token Compression):

    • Token compression methods like MInference and FlexPrefill can be too aggressive, potentially missing important attention regions and leading to accuracy degradation. SpargeAttn mitigates this through selective token compression (only compressing blocks with high self-similarity) and a two-stage online filtering approach. The self-similarity judge ensures that non-self-similar blocks (which might contain critical information) are always computed, preventing loss of critical information. This is validated by its ability to robustly retain model end-to-end performance.
  • Efficiency and Overhead (vs. Baselines):

    • SpargeAttn achieves significant speedups (2.5x to 5x) while maintaining accuracy, outperforming baselines like MInference and FlexPrefill which often show accuracy degradation at comparable sparsity levels or require specific conditions (e.g., very long sequences for MInference).
    • The two-stage online filter is designed to have minimal prediction overhead. The softmax-aware filter in the second stage incurs no extra overhead, making the overall approach very efficient.
  • Novel Components:

    • The selective token compression based on block self-similarity is a novel pattern-free approach.

    • The sparse warp online softmax is a new mechanism for fine-grained pruning within the FlashAttention online process.

    • The HilbertCurve permutation is a novel technique to enhance sparsity specifically for 3D visual tokens in image and video models, demonstrating SpargeAttn's thoughtful design for multimodal application.

      In essence, SpargeAttn seeks to provide a plug-and-play solution for attention acceleration that is broadly applicable, highly accurate, and computationally very efficient, bridging the gap left by prior methods that were either too specific, too expensive to implement, or compromised accuracy.

4. Methodology

The core methodology of SpargeAttn revolves around a two-stage online filtering mechanism integrated into the FlashAttention framework, further combined with quantization and a HilbertCurve permutation for visual tasks. The goal is to dynamically identify and skip computations for sparse regions of the attention map without sacrificing accuracy.

4.1. Principles

The fundamental principle of SpargeAttn is to exploit the inherent sparsity of the attention map (where many values are near zero) to accelerate computations. It operates on the FlashAttention tiling strategy and aims to skip matrix multiplications in two stages:

  1. Macro-level Block Pruning (First Stage): Predict which blocks of the $N \times N$ attention map will contribute negligibly to the final output. This is done by compressing $Q$ and $K$ blocks that are highly self-similar and computing a coarse-grained attention map.

  2. Micro-level Warp Pruning (Second Stage): Within the remaining, unpruned blocks, further identify and skip individual value contributions that are negligible during the online softmax computation, leveraging GPU warp parallelism.

    A crucial underlying intuition is that neighboring tokens in $Q$ and $K$ often exhibit high similarity, especially in visual data. By leveraging this, SpargeAttn can compress these similar regions for quick sparse prediction, while carefully ensuring that non-self-similar (and potentially critical) regions are always fully computed to maintain accuracy.

4.2. Core Methodology In-depth (Layer by Layer)

SpargeAttn modifies the FlashAttention algorithm, which processes the query, key, and value matrices (Q, K, V) in blocks. Let $N$ be the sequence length and $d$ the head dimension; Q, K, V have shape $N \times d$. The attention score matrix $S = QK^\top / \sqrt{d}$ and the attention map $P = \sigma(S)$ are $N \times N$, and the output is $O = PV$. FlashAttention divides Q, K, V into blocks $\{Q_i\}, \{K_j\}, \{V_j\}$ with block sizes $b_q, b_k, b_v$ (where $b_k = b_v$).

The workflow of SpargeAttn is illustrated in Figure 3.

As depicted in Figure 3, the SpargeAttn workflow is structured into three main phases:

  1. Sparse Block Online Prediction (Step 1): This phase computes a coarse-grained attention map using compressed $Q$ and $K$ blocks to identify which full attention blocks can be skipped.

  2. Sparse Block Online Masking (Step 2): Based on the prediction from Step 1, binary masks $M_g$ are generated and applied to skip computations for entire attention blocks.

  3. Sparse Warp Online Softmax (Step 3): Within the remaining (unmasked) attention blocks, a finer-grained filtering mechanism is applied at the GPU warp level to further skip value vector updates if their contribution is negligible.

    The complete algorithm for SpargeAttn is provided in Algorithm 1.

4.2.1. Sparse FlashAttention Basics (Section 3.1)

SpargeAttn leverages the tiling strategy and online softmax of FlashAttention. The online softmax mechanism progressively computes each block of the output $O_i$ for a query block $Q_i$ by iterating over key-value blocks $K_j, V_j$. The standard FlashAttention equations are:

$ S_{ij} = Q_i K_j^\top / \sqrt{d} $

Here, $S_{ij}$ represents the attention scores for the $i$-th query block $Q_i$ against the $j$-th key block $K_j$, with the $\sqrt{d}$ scaling applied.

The online softmax mechanism updates the running maximums ($m_{ij}$) and normalization terms ($l_{ij}$) iteratively: $ (m_{ij}, \widetilde{P}_{ij}) = \widetilde{\sigma}(m_{i, j-1}, S_{ij}) $, $ l_{ij} = \exp(m_{i, j-1} - m_{ij})\, l_{i, j-1} + \mathrm{rowsum}(\widetilde{P}_{ij}) $, $ O_{ij} = \mathrm{diag}\left(\exp(m_{i, j-1} - m_{ij})\right) O_{i, j-1} + \widetilde{P}_{ij} V_j $

where:

  • $m_{i, j-1}$ and $l_{i, j-1}$ are the running maximum and running sum up to the (j-1)-th key block.

  • $m_{ij}$ is a $b_q \times 1$ vector holding the maximum value of each row of $S_{ij}$ over all key blocks up to $j$. It is updated as $m_{ij} = \max\{m_{i, j-1}, \mathrm{rowmax}(S_{ij})\}$.

  • $\widetilde{P}_{ij} = \exp(S_{ij} - m_{ij})$ is the unnormalized attention map block, shifted by the current row maximum $m_{ij}$ for numerical stability.

  • $l_{ij}$ is a $b_q \times 1$ vector holding the running row sum of the softmax denominator.

  • $O_{ij}$ is the $i$-th output block accumulating contributions from value blocks up to $j$.

  • $\mathrm{diag}(\cdot)$ creates a diagonal matrix from a vector.

  • $\mathrm{rowsum}(\cdot)$ computes the sum of elements along rows.

  • Finally, the complete $i$-th output block is $O_i = \mathrm{diag}(l_{i, T_n})^{-1} O_{i, T_n}$, where $T_n = N/b_k$ is the total number of key blocks.

    SpargeAttn defines sparse attention by introducing block masks:

Definition 1 (Block Masks): Let $M_g$ and $M_{pv}$ be binary masks of dimensions $\lceil N/b_q \rceil \times \lceil N/b_k \rceil$, where each value is either 0 or 1. These masks determine which computations are skipped.

  • $M_g$: Mask for skipping both the $Q_i K_j^\top$ and $\widetilde{P}_{ij} V_j$ computations.
  • $M_{pv}$: Mask for skipping only the $\widetilde{P}_{ij} V_j$ computation.

Definition 2 (Sparse FlashAttention): The computation rules for sparse FlashAttention based on the masks are:

  • $Q_i K_j^\top$ and $\widetilde{P}_{ij} V_j$ are skipped if $M_g[i, j] = 0$.
  • $\widetilde{P}_{ij} V_j$ is skipped if $M_{pv}[i, j] = 0$.
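As a complement to these definitions, below is a hedged single-head PyTorch sketch of Definition 2: the inner loop simply skips key-value blocks whose $M_g$ entry is 0. The second-stage warp filter, the $M_{pv}$ mask, and the quantization of Algorithm 1 are omitted, and the block sizes are illustrative rather than taken from the paper.

    import torch

    def sparse_flash_attention(Q, K, V, M_g, bq=64, bk=64):
        # Definition 2 (sketch): block (i, j) is skipped entirely when M_g[i, j] == 0.
        # Assumes M_g keeps at least one key block per query row (TopCdf plus fix blocks).
        N, d = Q.shape
        scale = d ** -0.5
        O = torch.zeros_like(Q)
        for i in range(0, N, bq):
            Qi = Q[i:i + bq]
            m = torch.full((Qi.shape[0], 1), float("-inf"))
            l = torch.zeros(Qi.shape[0], 1)
            Oi = torch.zeros_like(Qi)
            for j in range(0, N, bk):
                if M_g[i // bq, j // bk] == 0:
                    continue                                  # skip Qi Kj^T and P~ij Vj
                S = (Qi @ K[j:j + bk].T) * scale
                m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
                P = torch.exp(S - m_new)
                alpha = torch.exp(m - m_new)
                l = alpha * l + P.sum(dim=-1, keepdim=True)
                Oi = alpha * Oi + P @ V[j:j + bk]
                m = m_new
            O[i:i + bq] = Oi / l
        return O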

4.2.2. Selective Token Compression for Sparse Prediction (Section 3.2)

This is the first stage of the two-stage online filter, focusing on generating the global mask $M_g$.

Key Idea: The core observation is that many neighboring tokens within $Q$ and $K$ blocks are highly similar. If a block of tokens in $Q$ or $K$ is highly uniform (high self-similarity), it can be effectively represented by a single mean token. This compression allows for a much faster, coarse-grained computation of a compressed attention map ($\hat{P}$), which then indicates which full attention blocks ($Q_i K_j^\top$) are likely to be sparse and can be skipped.

Prediction Steps (Algorithm 1, Lines 4-6):

  1. Block Compression: For each block $Q_i$ and $K_j$, compute a single representative token by taking the mean across tokens in that block: $ q = \{q_i\} = \{\mathrm{mean}(Q_i, \mathrm{axis}=0)\} $, $ k = \{k_j\} = \{\mathrm{mean}(K_j, \mathrm{axis}=0)\} $, where $Q_i \in \mathbb{R}^{b_q \times d}$ becomes $q_i \in \mathbb{R}^{1 \times d}$ and $K_j \in \mathbb{R}^{b_k \times d}$ becomes $k_j \in \mathbb{R}^{1 \times d}$.

  2. Block Self-Similarity Measurement: Calculate the cosine similarity within each $Q_i$ and $K_j$ block to assess its self-similarity: $ s_{qi} = \mathrm{CosSim}(Q_i) $, $ s_{kj} = \mathrm{CosSim}(K_j) $. The function $\mathrm{CosSim}(X)$ is defined as $ \mathrm{CosSim}(X) = \mathrm{mean}\left(\frac{X X^\top}{|\mathrm{max}(X X^\top)|}\right) $. This measures the average cosine similarity between all pairs of tokens within the block $X$; a higher value indicates greater self-similarity.

  3. Compressed Attention Score Calculation: Compute a compressed attention score matrix $\hat{S}$ using the compressed tokens $q$ and $k$: $ \hat{S} = q k^\top $

  4. Handling Non-Self-Similar (Fix) Blocks: This is a crucial step for accuracy. If a key block $K_j$ is not self-similar (i.e., $s_{kj} < \theta$, where $\theta$ is a hyper-parameter), its corresponding column in $\hat{S}$ is set to $-\infty$: $ \hat{S}[:, j] = -\infty \quad \text{if } s_{kj} < \theta $. This ensures that these non-self-similar blocks will not be pruned based on the compressed attention map and will always be considered for full computation.

  5. Compressed Attention Map: Apply softmax to each row of $\hat{S}$ to obtain the compressed attention map $\hat{P}$: $ \hat{P}[i] = \mathrm{Softmax}(\hat{S}[i]) $

  6. Mask Generation ($M_g$): For each row of $\hat{P}$ (corresponding to a query block $Q_i$), the TopCdf function selects the most important key blocks: $ M_g[i, :] = \mathrm{TopCdf}(\hat{P}[i], \tau) $, where $\tau$ is a hyper-parameter between 0 and 1. TopCdf selects the positions of key blocks whose cumulative sum of attention scores in $\hat{P}[i]$ reaches $\tau \cdot \sum \hat{P}[i]$. These selected positions are set to 1 in $M_g$, indicating they should be computed. All other positions are set to 0.

    The Top_Cdf pseudocode is:

    import torch

    def Top_Cdf(P_i, tau):
        # P_i: one row of the compressed attention map; tau in (0, 1)
        sorted_P, idx = torch.sort(P_i, descending=True)
        cusum_P = torch.cumsum(sorted_P, dim=0)
        mask = cusum_P <= tau * P_i.sum()
        M_i = torch.zeros_like(mask)
        M_i[idx] = mask          # scatter the selection back to the original block order
        return M_i
    
    • P_i: a row of the compressed attention map $\hat{P}$, representing the attention of $Q_i$ to all $K_j$.
    • tau: a threshold hyper-parameter (e.g., 0.9).
    • sorted_P, idx: the values of P_i sorted in descending order, together with their original indices.
    • cusum_P: the cumulative sum of the sorted attention scores.
    • mask: a boolean mask where True indicates that the cumulative sum up to that point is at most tau times the total sum of P_i.
    • M_i: a zero-initialized mask of the same shape as P_i.
    • M_i[idx] = mask: populates M_i with 1s at the original indices corresponding to the selected top values.
  7. Ensuring Fix Blocks are Computed: Finally, to prevent non-self-similar blocks (fix blocks) from being skipped, any row $i$ in $M_g$ corresponding to a non-self-similar query block ($s_{qi} < \theta$) and any column $j$ corresponding to a non-self-similar key block ($s_{kj} < \theta$) is entirely set to 1, so computations involving these critical blocks are never omitted: $ M_g[i, :] = 1 \quad \text{if } s_{qi} < \theta$; $\quad M_g[:, j] = 1 \quad \text{if } s_{kj} < \theta $
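Putting steps 1-7 together, here is a hedged PyTorch sketch of the first-stage mask prediction. It reuses the Top_Cdf helper defined above; standard pairwise cosine similarity is used as a stand-in for the paper's CosSim normalization, and the block sizes are illustrative.

    import torch
    import torch.nn.functional as F

    def cos_sim(X):
        # Average pairwise token similarity inside a block (stand-in for CosSim in Sec. 3.2).
        Xn = F.normalize(X, dim=-1)
        return (Xn @ Xn.T).mean()

    def build_Mg(Q, K, bq, bk, tau, theta):
        # Compress each block to its mean token, measure block self-similarity,
        # softmax the compressed scores, keep TopCdf blocks, and force fix blocks on.
        q = torch.stack([Q[i:i + bq].mean(0) for i in range(0, Q.shape[0], bq)])
        k = torch.stack([K[j:j + bk].mean(0) for j in range(0, K.shape[0], bk)])
        s_q = torch.stack([cos_sim(Q[i:i + bq]) for i in range(0, Q.shape[0], bq)])
        s_k = torch.stack([cos_sim(K[j:j + bk]) for j in range(0, K.shape[0], bk)])

        S_hat = q @ k.T
        S_hat[:, s_k < theta] = float("-inf")       # non-self-similar K blocks are never pruned here
        P_hat = torch.softmax(S_hat, dim=-1)

        M_g = torch.zeros_like(P_hat, dtype=torch.bool)
        for i in range(P_hat.shape[0]):
            M_g[i] = Top_Cdf(P_hat[i], tau)         # per query block, keep the TopCdf key blocks
        M_g[s_q < theta, :] = True                  # fix blocks are always computed
        M_g[:, s_k < theta] = True
        return M_g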

4.2.3. Masking of the First Stage (Section 3.3)

The block mask $M_g$ generated in the previous stage is directly applied within the FlashAttention inner loop.

Specifically, in the inner loop where attention between a query block $Q_i$ and a key-value block pair ($K_j, V_j$) is computed (Algorithm 1, Line 9), if $M_g[i, j] = 0$, then the matrix multiplications $Q_i K_j^\top$ and $\widetilde{P}_{ij} V_j$ (and the loading of $V_j$) are entirely skipped. This directly reduces the computational load and memory access: $ \text{skip } Q_i K_j^\top \text{ and } \widetilde{P}_{ij} V_j \quad \text{if } M_g[i, j] = 0 $

This corresponds to the "if M[i, j] != 0 then" check in Algorithm 1, Line 10.

4.2.4. Sparse Warp Online Softmax (Section 3.4)

This is the second stage of the two-stage online filter, providing finer-grained pruning.

Key Idea: Even within the blocks that are not entirely skipped by $M_g$, some individual rows of attention scores might still be negligible. This stage identifies and skips value vector updates ($\widetilde{P}_{ij} V_j$) for these negligible rows without extra overhead.

Condition for Skipping: The mechanism relies on the online softmax properties. Recall that $O_{ij}$ is updated as $ O_{ij} = \mathrm{diag}\left(\exp(m_{i, j-1} - m_{ij})\right) O_{i, j-1} + \widetilde{P}_{ij} V_j $, where $m_{ij} = \max\{m_{i, j-1}, \mathrm{rowmax}(S_{ij})\}$. If $\mathrm{rowmax}(S_{ij}) < m_{ij}$, the maximum value in the current score block $S_{ij}$ is smaller than the running maximum $m_{i, j-1}$ (i.e., $m_{ij}$ takes the value of $m_{i, j-1}$). In this case, the values in $\widetilde{P}_{ij} = \exp(S_{ij} - m_{ij})$ will be small, because $S_{ij} - m_{ij}$ is negative. If $\mathrm{rowmax}(S_{ij})$ is significantly smaller than $m_{ij}$, then all values in $\widetilde{P}_{ij}$ will be close to 0, making $\widetilde{P}_{ij} V_j$ negligible.

This condition is formalized as: $ O_{ij} \approx O_{i, j-1} \quad \text{if } \max\left(\exp(S_{ij} - m_{ij})\right) \to 0 $. This equivalence holds when $\max(m_{\mathrm{local}} - m_{ij}) < \lambda$, where $\lambda$ is a negative hyper-parameter (e.g., -5). If the local maximum $m_{\mathrm{local}}$ (i.e., $\mathrm{rowmax}(S_{ij})$) is sufficiently smaller than the global running maximum $m_{ij}$, the exponent term $\exp(S_{ij} - m_{ij})$ will be tiny, making the contribution negligible.

Warp-level Application (Algorithm 1, Lines 14-17): The computation of $S_{ij}$ is typically split across multiple GPU warps for parallelism. Each warp $w$ processes a sub-block of rows $I_w = [\frac{i_w \cdot b_q}{c_w} : \frac{(i_w+1) \cdot b_q}{c_w}]$ within $S_{ij}$, where $c_w$ is the number of GPU warps. A warp $w$ skips its part of the $\widetilde{P}_{ij} V_j$ computation (i.e., updating $O_{ij}[I_w]$) if the maximum value in its local $S_{ij}$ sub-block, after being shifted by the row maximum $m_{ij}$, falls below the threshold $\lambda$: $ \text{if } \max(m_{\mathrm{local}}[I_w] - m_{ij}[I_w]) < \lambda \text{ then skip } \widetilde{P}_{ij}[I_w] V_j $. This ensures that only meaningful contributions update the output block $O_{ij}$, achieving fine-grained sparsity at the warp level with no additional overhead, since the maximum values ($m_{\mathrm{local}}, m_{ij}$) are already computed as part of online softmax.
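The following is a hedged Python simulation of this per-warp decision (not CUDA): the tensors are assumed to come from the block inner loop sketched earlier, and c_w and lambda are illustrative values.

    import torch

    def warp_filtered_pv_update(S, P, alpha, Oi, Vj, m_new, c_w=4, lam=-5.0):
        # Each of c_w "warps" owns a contiguous slice of the query-block rows.
        # If its local row max is more than |lam| below the running max m_new,
        # exp(S - m_new) is ~0 there, so the P~ij @ Vj update for that slice is skipped.
        m_local = S.max(dim=-1, keepdim=True).values      # rowmax(S_ij)
        rows = S.shape[0] // c_w
        for w in range(c_w):
            I_w = slice(w * rows, (w + 1) * rows)
            Oi[I_w] = alpha[I_w] * Oi[I_w]                # rescaling by exp(m_{i,j-1} - m_ij)
            if (m_local[I_w] - m_new[I_w]).max() < lam:
                continue                                  # negligible rows: skip the matmul
            Oi[I_w] = Oi[I_w] + P[I_w] @ Vj
        return Oi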

4.2.5. Combined with SageAttention (Section 3.5)

To further boost performance, SpargeAttn integrates with SageAttention (Zhang et al., 2025a;d;g;b;e), a framework for quantized attention.

Integration:

  • SpargeAttn's sparse operations are orthogonal to quantization operations. This means they can be directly combined.

  • The quantization is performed per block (Algorithm 1, Line 3: Quant(Q_i, K_j) returns the quantized $\hat{Q}_i, \hat{K}_j$ and dequantization scales $\delta_Q, \delta_K$). The matrix multiplication for $S_{ij}$ then uses these quantized values, with the scales applied for dequantization (Algorithm 1, Line 12: $\mathrm{Matmul}(\hat{Q}_i, \hat{K}_j^\top) \cdot \delta_Q \cdot \delta_K$).

  • The prediction overhead for sparse block selection is minimized by implementing it in CUDA and applying kernel fusion techniques.

    Algorithm 1 provides the full integrated process:

  • Lines 1-2: Input and block division.

  • Line 3: Quantization of $Q_i, K_j$.

  • Lines 4-6: Selective Token Compression for Sparse Prediction (generating $M_g$).

  • Lines 7-22: Main FlashAttention loop.

    • Line 10: First stage block pruning based on $M_g$. If $M_g[i, j] = 0$, skip the entire inner loop for this block.
    • Lines 11-13: Load and compute $S_{ij}$; update $m_{ij}, \widetilde{P}_{ij}, l_{ij}$.
    • Lines 14-17: Second stage sparse warp online softmax. If the condition $\max(m_{\mathrm{local}}[I_w] - m_{ij}[I_w]) < \lambda$ is met, skip the $\widetilde{P}_{ij}[I_w] V_j$ update for that warp.
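For intuition, here is a hedged sketch of per-block 8-bit quantization and the dequantized score computation. Symmetric per-block scaling is an assumption for illustration; real SageAttention kernels run the matmul on INT8 tensor cores, which is emulated here with a float matmul.

    import torch

    def quant_int8(X):
        # Per-block symmetric INT8 quantization: returns the quantized block and its scale.
        scale = X.abs().max() / 127.0 + 1e-8
        return torch.round(X / scale).to(torch.int8), scale

    def quantized_scores(Qi, Kj):
        # S_ij from quantized operands, rescaled by delta_Q * delta_K (cf. Algorithm 1, Line 12).
        Qi_q, dQ = quant_int8(Qi)
        Kj_q, dK = quant_int8(Kj)
        S = Qi_q.float() @ Kj_q.float().T                 # INT8 tensor-core matmul in practice
        return S * dQ * dK / Qi.shape[-1] ** 0.5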

4.2.6. Hyper-parameters Determination for Model Layer (Section 3.6)

SpargeAttn introduces three hyper-parameters: $\tau$, $\theta$, and $\lambda$. These need to be determined for each attention layer in a model.

Hyper-parameters:

  • $\tau \in (0, 1)$: Controls the sparsity of the first-stage TopCdf mask $M_g$. A higher $\tau$ means more elements are kept, leading to lower sparsity.
  • $\theta \in (-1, 1)$: Threshold for block self-similarity. Blocks with cosine similarity below $\theta$ are considered non-self-similar (fix blocks) and are always fully computed, preserving accuracy.
  • $\lambda < 0$: Threshold for the second-stage sparse warp online softmax. A less negative $\lambda$ makes the skipping condition $\max(m_{\mathrm{local}} - m_{ij}) < \lambda$ easier to satisfy, allowing more aggressive skipping and higher sparsity.

Determination Process: The goal is to find hyper-parameters that maximize attention sparsity while keeping the attention error within strict bounds.

  1. Error Metric: The Relative L1 distance is used to evaluate attention accuracy: $ L1 = \frac{\sum |O - O'|}{\sum |O|} $, where $O$ is the output of Full-Attention (ground truth) and $O'$ is the output of SpargeAttn.

  2. Two-Stage Grid Search:

    • Stage 1 (for $\tau, \theta$): A grid search is performed over $\tau$ and $\theta$ to find the optimal pair that maximizes sparsity while ensuring $L1 < l_1$ (for example, $l_1 = 0.05$).

    • Stage 2 (for $\lambda$): After fixing $\tau$ and $\theta$, another grid search is performed over $\lambda$ to further maximize sparsity while ensuring $L1 < l_2$, with $l_2$ being a slightly looser bound (e.g., $l_2 = 0.06$). This two-stage approach allows sequential optimization of sparsity while controlling accuracy (a sketch of the search appears after the threshold list below).

      Example thresholds for $l_1, l_2$ are provided for different models:

  • Llama3.1: $(0.08, 0.09)$
  • CogvideoX and Mochi: $(0.05, 0.06)$
  • Stable-Diffusion3.5 and Flux: $(0.07, 0.08)$
  • Open-Sora-Plan: $(0.03, 0.035)$
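A hedged sketch of this two-stage search for one layer is shown below. Here run_sparse is a hypothetical calibration helper (it would run SpargeAttn with the given hyper-parameters on calibration inputs and return the attention output and measured sparsity), O_full is the Full-Attention reference output, and the grid values are illustrative.

    import itertools

    def tune_layer(run_sparse, O_full, l1, l2):
        # Relative L1 distance between sparse and full attention outputs (torch tensors).
        rel_l1 = lambda O, O_ref: ((O - O_ref).abs().sum() / O_ref.abs().sum()).item()

        # Stage 1: pick (tau, theta) maximizing sparsity under L1 < l1 (lambda disabled).
        best, best_sparsity = (0.95, 0.9), -1.0
        for tau, theta in itertools.product([0.5, 0.7, 0.9, 0.95], [0.3, 0.5, 0.7, 0.9]):
            O, sparsity = run_sparse(tau, theta, lam=float("-inf"))
            if rel_l1(O, O_full) < l1 and sparsity > best_sparsity:
                best, best_sparsity = (tau, theta), sparsity
        tau, theta = best

        # Stage 2: with (tau, theta) fixed, pick lambda under the looser bound l2.
        best_lam = float("-inf")
        for lam in [-7.0, -6.0, -5.0, -4.0]:
            O, sparsity = run_sparse(tau, theta, lam=lam)
            if rel_l1(O, O_full) < l2 and sparsity > best_sparsity:
                best_lam, best_sparsity = lam, sparsity
        return tau, theta, best_lam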

4.2.7. HilbertCurve Permutation (Section 3.7)

This technique is specifically designed to enhance sparsity for visual models (image and video).

Key Idea: The performance of SpargeAttn benefits from high self-similarity within $Q$ and $K$ blocks, as this allows more blocks to be compressed and participate in TopCdf selection, leading to higher sparsity. Attention is computationally invariant to token permutations, meaning reordering tokens does not change the final output as long as the original positions are properly mapped back. Therefore, if tokens can be permuted to increase the similarity of adjacent tokens, SpargeAttn can achieve better sparsity.

Application to 3D Visual Tokens: For 3D visual tokens (e.g., from video frames) represented as $\mathbf{\bar{\Lambda}}_{Q,K,V} \in \mathbb{R}^{T \times H \times W \times d}$ (Time, Height, Width, Dimension), SpargeAttn uses the Hilbert Curve to reorder them.

  • The Hilbert Curve is a space-filling curve that effectively preserves locality: it visits every cell of a multi-dimensional grid while keeping spatially close cells close along the curve, rather than breaking spatial relationships the way a simple row-by-row scan does.

  • By flattening the 3D visual tokens along a Hilbert Curve into a 1D sequence of shape $\mathbb{R}^{L \times d}$ (where $L = T \times H \times W$), tokens that were originally spatially or temporally close remain adjacent or close in the permuted sequence.

  • This significantly increases the similarity of adjacent tokens in the 1D sequence, which in turn leads to higher self-similarity within the blocks of $Q$ and $K$. Higher block self-similarity allows more blocks to be compressed (fewer fix blocks) and thus increases the overall sparsity achievable by SpargeAttn.

    Figure 5 illustrates this for a $1 \times 6 \times 6$ space, showing how a Hilbert Curve maintains locality better than row-major ordering; a small numerical check of the permutation invariance follows the figure caption.

    Figure 5. Illustration of different token permutation methods in a $1 \times 6 \times 6$ space, with block size of 4. The figure shows the arrangement of visual tokens, the intra-block and inter-block similarity, the reordering path (arrows), and the differing block properties (shading).
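As a small numerical check of the invariance claim (valid for bidirectional attention without a causal mask), the sketch below permutes the tokens, runs attention, and un-permutes the output; a random permutation stands in for a Hilbert-curve ordering, and the shapes are illustrative.

    import torch

    def attention(Q, K, V):
        return torch.softmax(Q @ K.T / Q.shape[-1] ** 0.5, dim=-1) @ V

    T, H, W, d = 4, 8, 8, 64
    Q, K, V = (torch.randn(T * H * W, d) for _ in range(3))
    perm = torch.randperm(T * H * W)          # stand-in for a Hilbert-curve ordering
    inv = torch.argsort(perm)

    # Permute tokens, run attention, un-permute: the output matches the original ordering.
    out_ref = attention(Q, K, V)
    out_perm = attention(Q[perm], K[perm], V[perm])[inv]
    assert torch.allclose(out_ref, out_perm, atol=1e-5)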

5. Experimental Setup

5.1. Datasets

SpargeAttn is evaluated on a diverse range of generative tasks across different modalities:

  • Language Models:

    • Llama3.1 (8B): A large language model (LLM).
    • WikiText (Merity et al., 2017): A dataset of Wikipedia articles, used to assess the model's prediction confidence, typically measured by perplexity.
    • Longbench (Bai et al., 2024): A benchmark for long context understanding capabilities. It includes various tasks to comprehensively evaluate how well LLMs handle extended input sequences.
    • En.MC of InfiniteBench (Zhang et al., 2024): Another benchmark for long context understanding, specifically evaluating multi-choice question answering over very long contexts.
    • Needle-in-a-Haystack (NIAH) task (Kamradt, 2023): A specific task designed to test an LLM's ability to retrieve a small piece of critical information ("needle") embedded within a very long, irrelevant context ("haystack"). This assesses retrieval ability and long-context robustness.
  • Text-to-Video Models:

    • CogvideoX (2B), Mochi (Team, 2024), Open-Sora-Plan (Lin et al., 2024): These are models for generating video from text descriptions.
    • Open-Sora prompt sets (Zheng et al., 2024c): These are text prompts used to generate videos for evaluation, covering a variety of scenarios and content.
  • Text-to-Image Models:

    • Flux (.1-dev) (Black Forest Labs, 2023), Stable-Diffusion3.5 (large) (Stability AI, 2023): These are models for generating images from text descriptions.

    • COCO annotations (Lin et al., 2014): A large-scale object detection, segmentation, and captioning dataset. Its annotations are used as prompts for text-to-image generation and for evaluating the quality of generated images.

      The choice of these datasets and models reflects a comprehensive evaluation strategy, covering different tasks (generation, comprehension, retrieval) and modalities (text, image, video), which is crucial for validating the universality claim of SpargeAttn.

5.2. Evaluation Metrics

For every evaluation metric mentioned in the paper, here is a detailed explanation:

5.2.1. Language Models

  • Perplexity (PPL):

    • Conceptual Definition: Perplexity is a common metric for evaluating language models. It quantifies how well a probability distribution or language model predicts a sample. Lower perplexity indicates a better model, as it means the model is more confident and accurate in predicting the next word in a sequence. It can be interpreted as the inverse probability of the test set, normalized by the number of words (a minimal computation sketch appears after this metric list).
    • Mathematical Formula: $ \mathrm{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})\right) $
    • Symbol Explanation:
      • $W = (w_1, w_2, \dots, w_N)$: A sequence of $N$ words.
      • $N$: The total number of words in the sequence.
      • $P(w_i | w_1, \dots, w_{i-1})$: The probability of the $i$-th word given the preceding $i-1$ words, as predicted by the language model.
      • log\log: Natural logarithm.
  • Longbench score:

    • Conceptual Definition: Longbench (Bai et al., 2024) is a benchmark designed to evaluate long context understanding of language models across various tasks (e.g., multi-choice QA, summarization, few-shot tasks). The Longbench score typically refers to an aggregated average performance across these diverse tasks, with higher scores indicating better long context understanding. The specific calculation method for the aggregated score is usually detailed in the Longbench paper itself and can vary by task type. Given its nature as a composite score, a universal formula cannot be provided without referring to the original Longbench paper, but generally, it's an average of accuracy or F1 scores across sub-tasks.
  • InfiniteBench (En.MC) score:

    • Conceptual Definition: InfiniteBench (Zhang et al., 2024) extends long context evaluation beyond 100K tokens. En.MC likely refers to the English Multi-Choice task within this benchmark. The score would typically be the accuracy of correctly answering multi-choice questions that require understanding information spread across very long contexts. Higher scores mean better performance.
    • Mathematical Formula (for Accuracy): $ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
    • Symbol Explanation:
      • Number of correct predictions: The count of instances where the model's output matches the ground truth.
      • Total number of predictions: The total number of instances evaluated.
  • Needle-in-a-Haystack (NIAH) Retrieval Accuracy:

    • Conceptual Definition: This metric specifically assesses how well an LLM can retrieve a single, specific piece of information (the "needle") hidden within a large volume of irrelevant text (the "haystack"). Retrieval accuracy is the percentage of times the model correctly identifies and extracts the "needle." Higher accuracy indicates better long-context retrieval.
    • Mathematical Formula (for Accuracy): Same as Accuracy above.
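A minimal sketch of the perplexity definition above, computed from next-token logits; random tensors are used as stand-ins for real model outputs on WikiText.

    import torch
    import torch.nn.functional as F

    def perplexity(logits, targets):
        # Perplexity = exp(mean token-level negative log-likelihood).
        # logits: (N, vocab) next-token predictions; targets: (N,) ground-truth token ids.
        nll = F.cross_entropy(logits, targets, reduction="mean")
        return torch.exp(nll)

    logits = torch.randn(128, 32000)
    targets = torch.randint(0, 32000, (128,))
    print(perplexity(logits, targets))   # very large for random logits; a trained LM is far lower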

5.2.2. Text-to-Video Models

  • CLIPSIM (CLIP Similarity):

    • Conceptual Definition: CLIPSIM measures the text-video alignment. It uses the CLIP (Contrastive Language-Image Pre-training) model to extract embeddings for the input text prompt and the generated video frames. The cosine similarity between the text embedding and the average (or maximum) of the video frame embeddings (or a video embedding generated by a video CLIP variant) quantifies how well the video content matches the text description. Higher CLIPSIM indicates better alignment.
    • Mathematical Formula (conceptual, as exact implementation varies): $ \mathrm{CLIPSIM} = \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{text}}, \mathrm{Embed}_{\text{video}}) $
    • Symbol Explanation:
      • $\mathrm{Embed}_{\text{text}}$: The CLIP embedding of the input text prompt.
      • $\mathrm{Embed}_{\text{video}}$: A single embedding representing the entire video, often derived by averaging CLIP embeddings of individual frames or using a video-specific CLIP model.
  • CLIP-T (CLIP Temporal Consistency):

    • Conceptual Definition: CLIP-T measures the temporal consistency or smoothness of the generated video frames in terms of their CLIP embeddings. It calculates the average cosine similarity between CLIP embeddings of adjacent frames in the video. High CLIP-T suggests that consecutive frames are semantically similar and transition smoothly.
    • Mathematical Formula (conceptual): $ \mathrm{CLIP\text{-}T} = \frac{1}{F-1} \sum_{i=1}^{F-1} \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{frame}_i}, \mathrm{Embed}_{\text{frame}_{i+1}}) $ (a small sketch of CLIPSIM and CLIP-T appears after this list)
    • Symbol Explanation:
      • $F$: Total number of frames in the video.
      • $\mathrm{Embed}_{\text{frame}_i}$: The CLIP embedding of the $i$-th frame.
  • VQA-a (Video Quality Assessment - aesthetic quality):

    • Conceptual Definition: VQA-a assesses the aesthetic quality of the generated video. This is typically evaluated using a trained quality prediction model that outputs a score based on visual appeal, composition, lighting, etc. Higher scores indicate more aesthetically pleasing videos. The specific model used for VQA-a is often a neural network trained on human aesthetic judgments.
  • VQA-t (Video Quality Assessment - technical quality):

    • Conceptual Definition: VQA-t assesses the technical quality of the generated video, focusing on aspects like resolution, clarity, absence of artifacts, and overall visual fidelity. Similar to VQA-a, this is often determined by a specialized model trained to predict technical quality. Higher scores mean better technical quality.
  • Flow-score (FScore):

    • Conceptual Definition: Flow-score is a metric for temporal consistency, often specifically related to the realism and smoothness of motion. It typically involves computing optical flow between frames and then evaluating its quality or consistency. A higher Flow-score indicates more realistic and consistent motion in the video. The exact calculation depends on the underlying optical flow and evaluation method (e.g., comparing generated flow to ground truth or assessing consistency).
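A hedged sketch of CLIPSIM and CLIP-T given precomputed CLIP embeddings; the exact aggregation varies across papers, and comparing the prompt embedding to the mean frame embedding is an assumption used here for illustration.

    import torch
    import torch.nn.functional as F

    def clipsim(text_embed, frame_embeds):
        # Cosine similarity between the prompt embedding and the mean frame embedding.
        t = F.normalize(text_embed, dim=-1)
        v = F.normalize(frame_embeds.mean(dim=0), dim=-1)
        return (t * v).sum()

    def clip_t(frame_embeds):
        # Mean cosine similarity between CLIP embeddings of adjacent frames.
        e = F.normalize(frame_embeds, dim=-1)
        return (e[:-1] * e[1:]).sum(dim=-1).mean()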

5.2.3. Text-to-Image Models

  • FID (Frechet Inception Distance):

    • Conceptual Definition: FID is a widely used metric to evaluate the quality of images generated by generative models. It measures the statistical distance between the feature distributions of generated images and real images. It uses features extracted from an Inception-v3 network. Lower FID scores indicate higher quality and diversity of generated images, meaning they are more similar to real images.
    • Mathematical Formula (conceptual): $ \mathrm{FID} = ||\mu_x - \mu_g||^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
    • Symbol Explanation:
      • $\mu_x$: Mean feature vector of real images.
      • $\mu_g$: Mean feature vector of generated images.
      • $\Sigma_x$: Covariance matrix of features of real images.
      • $\Sigma_g$: Covariance matrix of features of generated images.
      • $\|\cdot\|^2$: Squared L2 norm (Euclidean distance).
      • $\mathrm{Tr}(\cdot)$: Trace of a matrix.
  • Clipscore (CLIP):

    • Conceptual Definition: Clipscore (Hessel et al., 2021) evaluates the text-image alignment by measuring the cosine similarity between the CLIP embeddings of a generated image and its corresponding text prompt. Higher Clipscore indicates that the generated image semantically matches the text description more closely.
    • Mathematical Formula (conceptual): $ \mathrm{Clipscore} = \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{image}}, \mathrm{Embed}_{\text{text}}) $
    • Symbol Explanation:
      • $\mathrm{Embed}_{\text{image}}$: The CLIP embedding of the generated image.
      • $\mathrm{Embed}_{\text{text}}$: The CLIP embedding of the input text prompt.
  • ImageReward (IR):

    • Conceptual Definition: ImageReward (Xu et al., 2024) is a metric that predicts human preference for text-to-image generations. It's often trained on human feedback data to provide a score that correlates with how much humans would prefer one generated image over another, given a text prompt. Higher ImageReward scores indicate images that are more likely to be preferred by humans.

5.2.4. General Performance Metrics

  • Speed ($1/t$):

    • Conceptual Definition: Measures the inference speed, defined as the inverse of latency. It's calculated as the total number of operations in a standard attention computation divided by the latency. A higher value indicates faster execution.
    • Mathematical Formula: $ \mathrm{Speed} = \frac{O(\mathrm{attn})}{t} $
    • Symbol Explanation:
      • $O(\mathrm{attn})$: The total number of operations required for a standard attention computation (fixed for a given input size).
      • $t$: The latency (time in seconds) from the input Q, K, V to the attention output.
  • Sparsity:

    • Conceptual Definition: Sparsity quantifies the proportion of computations that are skipped by the sparse attention method. It is the ratio of skipped matrix multiplications ($\bar{Q}_i K_j^\top$ and $\tilde{P}_{ij} \tilde{V}_j$) to the total number of such multiplications required in full attention. Higher sparsity indicates greater computational savings.
    • Mathematical Formula: $ \mathrm{Sparsity} = \frac{\text{Number of skipped operations}}{\text{Total number of operations in full attention}} $
    • Symbol Explanation:
      • Number of skipped operations: The count of $\bar{Q}_i K_j^\top$ and $\tilde{P}_{ij} \tilde{V}_j$ computations that were omitted.
      • Total number of operations in full attention: The total count of $\bar{Q}_i K_j^\top$ and $\tilde{P}_{ij} \tilde{V}_j$ computations if no sparsity were applied.
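Given the block masks of Definition 1, a hedged sketch of this ratio at block granularity is shown below; the second-stage warp filter skips at a finer granularity than whole blocks, so this is only an approximation.

    import torch

    def block_sparsity(M_g, M_pv):
        # Each (i, j) block normally costs one Qi Kj^T and one P~ij Vj multiplication;
        # M_g == 0 skips both, and M_pv == 0 additionally skips only the second.
        total = 2 * M_g.numel()
        skipped = 2 * (M_g == 0).sum() + ((M_g == 1) & (M_pv == 0)).sum()
        return (skipped.float() / total).item()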

5.3. Baselines

The paper compares SpargeAttn against two representative dynamic sparse attention methods:

  • MInference (Jiang et al., 2024): A blocksparse method that accelerates pre-filling for long-context LLMs. It likely uses some form of token compression or block pruning.

    • Configuration: Evaluated at two sparsity settings, shown in parentheses in Table 1 (e.g., 0.3 and 0.5 for Llama3.1). The paper reports sparsity as the fraction of operations skipped, so MInference (0.5) means 50% of the operations are skipped.
  • FlexPrefill (Lai et al., 2025): A context-aware sparse attention mechanism also designed for efficient long-sequence inference. It is likely another token compression or dynamic masking approach.

    • Configuration: Evaluated with gamma values of 0.95 and 0.99, which control how aggressively blocks are pruned; the resulting sparsity varies by model and is shown in parentheses in Table 1.

      These baselines were chosen because they represent recent dynamic sparse attention approaches that do not require retraining and are relevant for accelerating LLMs, making them suitable comparators for SpargeAttn's claims of universality and training-freeness. The paper specifically notes that methods applicable across different model types are limited, making these LLM-focused baselines the most appropriate for comparison.

5.4. Implementation and Hyper-parameters

  • Implementation: The method is implemented using CUDA, indicating a focus on GPU efficiency and low-level optimization.
  • Hyper-parameters: As described in Section 3.6, the hyper-parameters $\tau, \theta, \lambda$ for each attention layer are determined through a two-stage grid search process. The Relative L1 distance (L1) is used as the accuracy metric to constrain error.
    • Specific L1 thresholds ($l_1, l_2$) used:
      • Llama3.1: $(l_1 = 0.08, l_2 = 0.09)$
      • CogvideoX and Mochi: $(l_1 = 0.05, l_2 = 0.06)$
      • Stable-Diffusion3.5 and Flux: $(l_1 = 0.07, l_2 = 0.08)$
      • Open-Sora-Plan: $(l_1 = 0.03, l_2 = 0.035)$
    • These values indicate that different models have different L1 error tolerances for maintaining end-to-end quality; lower thresholds (as for Open-Sora-Plan) imply a stricter accuracy requirement.

6. Results & Analysis

6.1. Core Results Analysis

The experiments aim to demonstrate SpargeAttn's universality, accuracy, and efficiency across diverse models compared to Full-Attention and existing sparse attention baselines.

The following are the results from Table 1 of the original paper:

Llama3.1 (128K):

| Attention (Sparsity) | Speed (1/t) ↑ | WikiText (Ppl.) ↓ | Longbench ↑ | InfiniteBench ↑ | NIAH ↑ |
| --- | --- | --- | --- | --- | --- |
| Full-Attention | 156.9 | 6.013 | 38.682 | 0.6594 | 0.907 |
| MInference (0.5) | 140.1 | 6.310 | 28.860 | 0.5152 | 0.832 |
| FlexPrefill (0.5) | 240.6 | 6.476 | 38.334 | 0.6460 | 0.858 |
| MInference (0.3) | 115.7 | 6.705 | 34.074 | 0.6532 | 0.870 |
| FlexPrefill (0.42) | 206.9 | 6.067 | 38.334 | 0.6581 | 0.878 |
| SpargeAttn (0.54) | 708.1 | 6.020 | 39.058 | 0.6638 | 0.909 |

CogvideoX (17K):

| Attention (Sparsity) | Speed (1/t) ↑ | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Full-Attention | 166.0 | 0.1819 | 0.9976 | 80.384 | 75.946 | 5.342 |
| MInference (0.5) | 264.6 | 0.1728 | 0.9959 | 70.486 | 62.410 | 2.808 |
| FlexPrefill (0.6) | 175.3 | 0.1523 | 0.9926 | 1.5171 | 4.5034 | 1.652 |
| MInference (0.3) | 196.9 | 0.1754 | 0.9964 | 77.326 | 63.525 | 3.742 |
| FlexPrefill (0.45) | 142.0 | 0.1564 | 0.9917 | 7.7259 | 8.8426 | 2.089 |
| SpargeAttn (0.46) | 507.9 | 0.1798 | 0.9974 | 78.276 | 74.846 | 5.030 |

Mochi (22K):

| Attention (Sparsity) | Speed (1/t) ↑ | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Full-Attention | 164.2 | 0.1725 | 0.9990 | 56.472 | 67.663 | 1.681 |
| MInference (0.5) | 202.4 | 0.1629 | 0.9891 | 6.668 | 50.839 | 0.653 |
| FlexPrefill (0.48) | 191.3 | 0.1667 | 0.9898 | 0.582 | 0.0043 | X |
| MInference (0.3) | 147.7 | 0.1682 | 0.9889 | 14.541 | 42.956 | 0.833 |
| FlexPrefill (0.4) | 171.7 | 0.1677 | 0.9909 | 2.941 | 0.7413 | X |
| SpargeAttn (0.47) | 582.4 | 0.1720 | 0.9990 | 54.179 | 67.219 | 1.807 |

Open-Sora-Plan (38K) (speed not reported in the source table):

| Attention (Sparsity) | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
| --- | --- | --- | --- | --- | --- |
| Full-Attention | 0.1650 | 0.9994 | 81.40 | 80.60 | 0.847 |
| SpargeAttn (0.34) | 0.1686 | 0.9985 | 77.59 | 76.91 | 0.839 |

Flux (4.5K):

| Attention (Sparsity) | Speed (1/t) ↑ | FID ↓ | CLIP ↑ | IR ↑ |
| --- | --- | --- | --- | --- |
| Full-Attention | 158.2 | 166.103 | 31.217 | 0.8701 |
| MInference (0.5) | 151.8 | 180.650 | 30.235 | 0.4084 |
| FlexPrefill (0.48) | 47.7 | 443.928 | 18.3377 | -2.2657 |
| MInference (0.3) | 118.9 | 170.221 | 31.001 | 0.7701 |
| FlexPrefill (0.41) | 40.9 | 405.043 | 19.5591 | -2.2362 |
| SpargeAttn (0.38) | 280.3 | 163.982 | 31.448 | 0.9207 |

Stable-Diffusion3.5 (4.5K):

| Attention (Sparsity) | Speed (1/t) ↑ | FID ↓ | CLIP ↑ | IR ↑ |
| --- | --- | --- | --- | --- |
| Full-Attention | 164.2 | 166.101 | 32.007 | 0.9699 |
| MInference (0.5) | 186.4 | 348.930 | 18.3024 | -2.2678 |
| FlexPrefill (0.37) | 23.1 | 350.497 | 18.447 | -2.2774 |
| MInference (0.3) | 150.3 | 337.530 | 18.099 | - |
| FlexPrefill (0.35) | 22.7 | 348.612 | 18.147 | -2.2647 |
| SpargeAttn (0.31) | 293.0 | 166.193 | 32.114 | 0.9727 |

6.1.1. End-to-End Metrics and Speed (Table 1)

Key Observations:

  • Llama3.1 (128K sequence length):

    • SpargeAttn (0.54 sparsity) achieves a Speed (1/t) of 708.1, which is ~4.5x faster than Full-Attention (156.9).
    • Crucially, SpargeAttn maintains or even slightly improves end-to-end metrics: NIAH accuracy (0.909 vs 0.907), WikiText PPL (6.020 vs 6.013, lower is better), Longbench (39.058 vs 38.682), and InfiniteBench (0.6638 vs 0.6594). This demonstrates excellent accuracy preservation while providing substantial speedup.
    • Baselines (MInference, FlexPrefill) show speedup, but consistently come with metric degradation. For instance, MInference (0.5) has a NIAH of 0.832 (vs 0.907 for Full-Attention) and FlexPrefill (0.5) has 0.858. Even at lower sparsity (0.3), MInference (0.3) still lags SpargeAttn in NIAH (0.870) and Speed (115.7 vs 708.1).
  • CogvideoX (17K sequence length):

    • SpargeAttn (0.46 sparsity) reaches a Speed of 507.9, approximately 3x faster than Full-Attention (166.0).
    • It largely preserves video quality metrics: CLIPSIM (0.1798 vs 0.1819), CLIP-T (0.9974 vs 0.9976), VQA-a (78.276 vs 80.384), VQA-t (74.846 vs 75.946), and FScore (5.030 vs 5.342). The slight differences are well within acceptable bounds for visual generation.
    • Baselines (MInference, FlexPrefill) show severe metric degradation. FlexPrefill (0.6) particularly performs poorly on VQA-a (1.5171) and VQA-t (4.5034), indicating poor video quality.
  • Mochi (22K sequence length):

    • SpargeAttn (0.47 sparsity) achieves a Speed of 582.4, which is ~3.5x faster than Full-Attention (164.2).
    • It maintains video quality very well: CLIPSIM (0.1720 vs 0.1725), CLIP-T (0.9990 vs 0.9990), VQA-a (54.179 vs 56.472), VQA-t (67.219 vs 67.663), and FScore (1.807 vs 1.681).
    • Baselines again show significant metric degradation, with FlexPrefill (0.48) having extremely low VQA-a (0.582) and VQA-t (0.0043), and some FScore values marked as X (the metric could not be computed for the generated output).
  • Flux (4.5K sequence length):

    • SpargeAttn (0.38 sparsity) provides a Speed of 280.3, about 1.77x faster than Full-Attention (158.2).
    • It nearly matches Full-Attention in FID (163.982 vs 166.103, lower is better), CLIP (31.448 vs 31.217), and IR (0.9207 vs 0.8701).
    • Baselines MInference and FlexPrefill show much worse performance on image generation metrics, with much higher FID (e.g., 443.928 for FlexPrefill) and lower CLIP/IR scores (e.g., -2.2657 for FlexPrefill).
  • Stable-Diffusion3.5 (4.5K sequence length):

    • SpargeAttn (0.31 sparsity) reaches a Speed of 293.0, roughly 1.78x faster than Full-Attention (164.2).
    • It maintains FID (166.193 vs 166.101), CLIP (32.114 vs 32.007), and IR (0.9727 vs 0.9699) comparable to Full-Attention.
    • Similar to Flux, baselines exhibit severe metric degradation.
  • Open-Sora-Plan (38K sequence length):

    • For Open-Sora-Plan, only end-to-end latency and video quality metrics are reported. SpargeAttn (0.34 sparsity) has a latency of 393s compared to Full-Attention's 629s, indicating a significant speedup (~1.6x) while maintaining competitive quality metrics.

      Overall Summary: SpargeAttn consistently demonstrates superior performance across all evaluated models and modalities. It achieves substantial speedups (ranging from ~1.6x to ~4.5x) compared to Full-Attention while robustly retaining end-to-end model quality. In contrast, existing sparse attention baselines (MInference, FlexPrefill) frequently lead to significant degradation in model performance metrics, especially in image and video generation, making them less suitable for practical application where quality is paramount.
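
The speedup factors quoted above follow directly from the Speed (1/t) column of Table 1; a quick check:

```python
# Speed (1/t) pairs from Table 1: (Full-Attention, SpargeAttn).
speed = {
    "Llama3.1 (128K)":            (156.9, 708.1),
    "CogvideoX (17K)":            (166.0, 507.9),
    "Mochi (22K)":                (164.2, 582.4),
    "Flux (4.5K)":                (158.2, 280.3),
    "Stable-Diffusion3.5 (4.5K)": (164.2, 293.0),
}
for model, (full, sparge) in speed.items():
    print(f"{model}: {sparge / full:.2f}x")
# -> 4.51x, 3.06x, 3.55x, 1.77x, 1.78x respectively
```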

6.1.2. Visible Examples (Figures 6, 7, 8, 9, 12, 13)

The paper provides visual comparisons that reinforce the quantitative results:

The following figure (Figure 6 from the original paper) shows visible examples on CogvideoX using SpargeAttention:

Figure 6. Visible examples on CogvideoX using SpargeAttention. (Full Attention outputs on the left, SpargeAttention outputs on the right, across traffic, nature, and portrait scenes.)

Figure 6. Visible examples on CogvideoX using SpargeAttention. This figure shows that SpargeAttn generates videos of comparable visual quality to Full Attention for various scenes (e.g., traffic, nature, models), demonstrating that the speedup does not compromise visual fidelity.

The following figure (Figure 7 from the original paper) shows comparison examples on Flux and Stable-Diffusion3.5:

Figure 7. Comparison examples on Flux and Stable-Diffusion3.5. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.38, 0.3, and 0.4 on Flux and 0.31, 0.3, and 0.35 on Stable-Diffusion3.5.

Figure 7. Comparison examples on Flux and Stable-Diffusion3.5. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.38, 0.3, and 0.4 on Flux and 0.31, 0.3, and 0.35 on Stable-Diffusion3.5. This figure visually confirms the metric degradation of baselines. Images generated by MInference and FlexPrefill (e.g., for "a teddy bear sitting on a swing") show clear artifacts or lack of detail compared to Full-Attention and SpargeAttn. SpargeAttn outputs are visually indistinguishable from Full-Attention, indicating effective accuracy preservation.

The following figure (Figure 8 from the original paper) shows comparison examples on Mochi:

Figure 8. Comparison examples on Mochi. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.47, 0.3, and 0.4.

Figure 8. Comparison examples on Mochi. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.47, 0.3, and 0.4. Similar to CogvideoX, SpargeAttn on Mochi produces videos (a cute fluffy monster, cartoon style) with quality comparable to Full-Attention, while baselines show noticeable visual degradation.

The following figure (Figure 9 from the original paper) shows a Needle-in-a-Haystack comparison example on Llama3.1:

Figure 9. A Needle-in-a-Haystack comparison example on Llama3.1. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.5, 0.5, and 0.54.

Figure 9. A Needle-in-a-Haystack comparison example on Llama3.1. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.5, 0.5, and 0.54. This figure demonstrates SpargeAttn's superior retrieval accuracy in long-context LLM tasks. While Full-Attention correctly identifies the "needle," MInference and FlexPrefill fail to retrieve the correct information, whereas SpargeAttn succeeds, often even outperforming Full-Attention slightly, as seen in Table 1.

The following figure (Figure 12 from the original paper) shows visible examples on Open-sora-Plan:

Figure 12. Visible examples on Open-Sora-Plan. (Full Attention on the left, sparse attention on the right, for two scenes: a panda playing guitar and a model ship.)

Figure 12. Visible examples on Open-sora-Plan. Here, SpargeAttn generated videos ("Panda playing guitar," "model ship in an old bottle") are visually consistent with those from Full-Attention, reinforcing the claim of no quality loss.

The following figure (Figure 13 from the original paper) shows a visible comparison example on Mochi:

Figure 13. Comparison examples on Mochi. (Side-by-side outputs of Full Attention, SpargeAttention, MInference, and FlexPrefill.)

Figure 13. Comparison examples on Mochi. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.47, 0.3, and 0.4. This figure provides additional visual examples from Mochi, further corroborating that SpargeAttn retains high visual quality while the baselines visibly degrade.

6.1.3. Kernel Speed Comparison (Figure 10)

The following figure (Figure 10 from the original paper) shows kernel speed comparison under varying sparsity:

Figure 10. Kernel speed comparison under varying sparsity. Input tensors have a sequence length of 22K and a head dimension of 128. SpargeAttn+FA2 means deploying our method on FlashAttention2.

Figure 10. Kernel speed comparison under varying sparsity. Input tensors have a sequence length of 22K and a head dimension of 128. SpargeAttn+FA2 means deploying our method on FlashAttention2. This figure directly compares kernel-level efficiency. SpargeAttn+FA2 shows the highest speed (1/t) across sparsity levels and significantly outperforms Sage and Sage2, demonstrating that the sparse mechanism translates into kernel-level speedups beyond what quantization alone offers; the speed advantage grows with increasing sparsity.
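
For intuition on how a precomputed block mask plugs into a FlashAttention2-style loop, here is a minimal NumPy sketch of block-sparse attention with online softmax. It is a conceptual illustration of the mechanism, not the paper's CUDA kernel; `block_mask` stands in for the first-stage prediction, and the second-stage PV filter is omitted.

```python
import numpy as np

def block_sparse_attention(Q, K, V, block_mask, block=64):
    """Block-sparse attention with online softmax (conceptual NumPy sketch).

    Q, K, V: (N, d) arrays; block_mask: (N//block, N//block) booleans,
    True where the (query block, key block) tile is computed. Skipped tiles
    contribute nothing to the running max/denominator, mirroring how a
    FlashAttention-style kernel would simply not load those tiles.
    Assumes every query block keeps at least one key block (e.g., the diagonal).
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    for qb in range(N // block):
        q = Q[qb * block:(qb + 1) * block] * scale
        m = np.full(block, -np.inf)   # running row-wise max
        l = np.zeros(block)           # running softmax denominator
        acc = np.zeros((block, d))    # running (unnormalized) output
        for kb in range(N // block):
            if not block_mask[qb, kb]:
                continue              # predicted-negligible tile: skip both matmuls
            k = K[kb * block:(kb + 1) * block]
            v = V[kb * block:(kb + 1) * block]
            s = q @ k.T
            m_new = np.maximum(m, s.max(axis=1))
            corr = np.exp(m - m_new)
            p = np.exp(s - m_new[:, None])
            l = l * corr + p.sum(axis=1)
            acc = acc * corr[:, None] + p @ v
            m = m_new
        O[qb * block:(qb + 1) * block] = acc / l[:, None]
    return O
```

In this sketch every skipped tile removes both a QK^T and a PV matmul, which is why kernel time in Figure 10 falls roughly in proportion to the number of surviving tiles.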

6.2. Data Presentation (Tables)

The following are the results from Table 2 of the original paper:

Model GPU Original SageAttn SpargeAttn
CogvideoX RTX4090 87 s 68 s 53 s
Mochi L40 1897 s 1544 s 1037 s
Llama3.1 (24K) RTX4090 4.01 s 3.53 s 2.6 s
Llama3.1(128K) L40 52 s 42s 29.98 s

The following are the results from Table 3 of the original paper:

Sequence Len Prediction (ms) Full Attention (ms) Overhead
8k 0.251 6.649 3.78%
16k 0.487 26.83 1.82%
32k 0.972 106.68 0.911%
64k 2.599 424.24 0.612%
128k 8.764 1696.2 0.516%

The following are the results from Table 4 of the original paper:

Method Sim-q ↑ Sim-k ↑ L1 ↓ Sparsity ↑
Random 0.321 0.019 0.0414 0.048
Rowmajor 0.551 0.390 0.0307 0.363
Timemajor 0.514 0.367 0.0342 0.338
HilbertCurve 0.572 0.479 0.0389 0.392

The following are the results from Table 5 of the original paper:

Method VQA-a ↑ VQA-t ↑ FScore ↑
W/o. self-sim Judge 34.664 44.722 1.138
With self-sim Judge 54.179 67.219 1.807

The following are the results from Table 6 of the original paper:

Strategy only Mg only Mpv Mg +Mpv
Sparsity 51.2% 27.7% 54%

The following are the results from Table 7 of the original paper:

Sequence Len 8K 16K 24K 48K 128K
Sparsity 6.8% 26.4% 35.7% 49.8% 54%

The following are the results from Table 8 of the original paper:

Method Detailed Description
Random Random permutation of tokens, the order is recorded to perform inverse permutation.
Rowmajor Permutation following row-major order. Tokens are continuous along the W dimension.
Columnmajor Permutation following column-major order. Tokens are continuous along the H dimension.
Timemajor Permutation following time-major order. Tokens are continuous along the T dimension.
HilbertCurve Permutation following a Hilbert curve.

The following are the results from Table 9 of the original paper:

| Method | Sim-q ↑ (CogvideoX) | Sim-q ↑ (Mochi) | Sim-k ↑ (CogvideoX) | Sim-k ↑ (Mochi) | Precision (L1) ↓ (CogvideoX) | Precision (L1) ↓ (Mochi) | Sparsity ↑ (CogvideoX) | Sparsity ↑ (Mochi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 0.502 | 0.321 | 0.025 | 0.019 | 0.0348 | 0.0414 | 0.027 | 0.048 |
| Rowmajor | 0.676 | 0.551 | 0.435 | 0.390 | 0.0265 | 0.0307 | 0.242 | 0.363 |
| Columnmajor | 0.633 | 0.547 | 0.335 | 0.394 | 0.0274 | 0.0342 | 0.198 | 0.366 |
| Timemajor | 0.692 | 0.514 | 0.479 | 0.367 | 0.0294 | 0.0342 | 0.238 | 0.338 |
| HilbertCurve | 0.709 | 0.572 | 0.523 | 0.479 | 0.0323 | 0.0389 | 0.265 | 0.392 |

The following are the results from Table 10 of the original paper:

| Metric | w/ judge (CogvideoX) | w/ judge (Mochi) | w/o judge (CogvideoX) | w/o judge (Mochi) | filter w/ judge (CogvideoX) | filter w/ judge (Mochi) | filter w/o judge (CogvideoX) | filter w/o judge (Mochi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 error ↓ | 0.0316 | 0.0343 | 0.0325 | 0.0365 | 0.0843 | 0.0555 | 0.214 | 0.154 |
| Sparsity ↑ | 0.199 | 0.301 | 0.203 | 0.305 | 0.242 | 0.371 | 0.275 | 0.392 |

The following are the results from Table 11 of the original paper:

| Model (seq_len) | Attention (Sparsity) | Speed (TOPS) ↑ | NIAH ↑ |
| --- | --- | --- | --- |
| Llama3.1 (24K) | Full-Attention | 156.9 | 0.838 |
| Llama3.1 (24K) | MInference (0.5) | 122.5 | 0.635 |
| Llama3.1 (24K) | FlexPrefill (0.6) | 179.6 | 0.776 |
| Llama3.1 (24K) | MInference (0.3) | 102.3 | 0.652 |
| Llama3.1 (24K) | FlexPrefill (0.3) | 117.6 | 0.797 |
| Llama3.1 (24K) | SpargeAttn (0.36) | 443.6 | 0.863 |

6.1.4. SpargeAttn Enhances LLM Performance (Figures 9 and 11, Tables 1 and 11)

The following figure (Figure 11 from the original paper) shows a comparison example on Llama3.1 (128K):

Figure 11. A Needle-in-a-Haystack comparison example on Llama3.1. (Heatmaps over token limit and depth for Full Attention, SpargeAttention, MInference, and FlexPrefill; SpargeAttention attains the highest overall score, 0.863.)

Figure 11. A Needle-in-a-Haystack comparison example on Llama3.1. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.5, 0.5, and 0.54. From Table 1 and Table 11, and visually in Figures 9 and 11, SpargeAttn consistently demonstrates that it not only matches but can sometimes slightly enhance LLM performance in long-context tasks (e.g., NIAH score for Llama3.1 (128K): SpargeAttn 0.909 vs Full-Attention 0.907). This counter-intuitive improvement, where sparsity leads to better accuracy, suggests that sparse attention helps the LLM to focus on more relevant information by effectively pruning irrelevant context, leading to a more robust understanding and retrieval.

6.3. Ablation Studies / Parameter Analysis

6.3.1. Overhead of Sparse Block Prediction (Table 3)

The following are the results from Table 3 of the original paper:

Sequence Len Prediction (ms) Full Attention (ms) Overhead
8k 0.251 6.649 3.78%
16k 0.487 26.83 1.82%
32k 0.972 106.68 0.911%
64k 2.599 424.24 0.612%
128k 8.764 1696.2 0.516%

This table shows the overhead incurred by the sparse block prediction stage (the first stage of SpargeAttn's filter). The prediction time (in ms) is very small compared to the full attention latency. More importantly, the overhead percentage decreases significantly as the sequence length increases (from 3.78% for 8K to 0.516% for 128K). This indicates that the prediction mechanism is highly efficient and scales well with long contexts, making SpargeAttn particularly advantageous for very large sequence lengths where the savings from sparse attention far outweigh the prediction cost.
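
The overhead column is simply the prediction time divided by the full-attention time; recomputing it shows why the ratio shrinks with length (the prediction operates on block-level representations and grows far more slowly than the quadratic full-attention cost):

```python
# (sequence length in K tokens, prediction ms, full attention ms) from Table 3
rows = [(8, 0.251, 6.649), (16, 0.487, 26.83), (32, 0.972, 106.68),
        (64, 2.599, 424.24), (128, 8.764, 1696.2)]
for seq_k, pred_ms, attn_ms in rows:
    print(f"{seq_k}K: overhead = {100 * pred_ms / attn_ms:.2f}%")
# -> 3.78%, 1.82%, 0.91%, 0.61%, 0.52%
```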

6.3.2. End-to-End Speedup (Table 2)

The following are the results from Table 2 of the original paper:

Model GPU Original SageAttn SpargeAttn
CogvideoX RTX4090 87 s 68 s 53 s
Mochi L40 1897 s 1544 s 1037 s
Llama3.1 (24K) RTX4090 4.01 s 3.53 s 2.6 s
Llama3.1(128K) L40 52 s 42s 29.98 s

This table presents the end-to-end latency for generative tasks.

  • Original refers to Full-Attention.

  • SageAttn is the quantized version (Zhang et al., 2025a;d;g;b;e).

  • SpargeAttn integrates the sparse mechanism with SageAttn.

    For Mochi, SpargeAttn reduces the latency from 1897 s (Original) to 1037 s, a speedup of about 1.83x ($1897/1037 \approx 1.83$). Similarly, for Llama3.1 (128K), SpargeAttn reduces latency from 52 s to 29.98 s, roughly a 1.73x speedup. These results are significant because they represent real-world inference speedups for the entire generative process, not just the attention kernel. The benefits of quantization (Original vs. SageAttn) are further compounded by sparsity (SageAttn vs. SpargeAttn), confirming that the two acceleration techniques are orthogonal.

6.3.3. Effect of Hilbert Curve Permutation (Table 4 and Table 9)

The following are the results from Table 4 of the original paper:

Method Sim-q ↑ Sim-k ↑ L1 ↓ Sparsity ↑
Random 0.321 0.019 0.0414 0.048
Rowmajor 0.551 0.390 0.0307 0.363
Timemajor 0.514 0.367 0.0342 0.338
HilbertCurve 0.572 0.479 0.0389 0.392

The following are the results from Table 9 of the original paper:

| Method | Sim-q ↑ (CogvideoX) | Sim-q ↑ (Mochi) | Sim-k ↑ (CogvideoX) | Sim-k ↑ (Mochi) | Precision (L1) ↓ (CogvideoX) | Precision (L1) ↓ (Mochi) | Sparsity ↑ (CogvideoX) | Sparsity ↑ (Mochi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 0.502 | 0.321 | 0.025 | 0.019 | 0.0348 | 0.0414 | 0.027 | 0.048 |
| Rowmajor | 0.676 | 0.551 | 0.435 | 0.390 | 0.0265 | 0.0307 | 0.242 | 0.363 |
| Columnmajor | 0.633 | 0.547 | 0.335 | 0.394 | 0.0274 | 0.0342 | 0.198 | 0.366 |
| Timemajor | 0.692 | 0.514 | 0.479 | 0.367 | 0.0294 | 0.0342 | 0.238 | 0.338 |
| HilbertCurve | 0.709 | 0.572 | 0.523 | 0.479 | 0.0323 | 0.0389 | 0.265 | 0.392 |

These tables show the impact of different token permutation methods on block self-similarity (Sim-q, Sim-k), L1 error, and Sparsity for Mochi and CogvideoX.

  • Self-Similarity: HilbertCurve consistently yields the highest Sim-q and Sim-k (e.g., for Mochi: Sim-q 0.572, Sim-k 0.479 vs Rowmajor 0.551, 0.390). This confirms its effectiveness in preserving locality and increasing similarity among adjacent tokens within blocks.
  • Sparsity: Higher self-similarity directly translates to higher sparsity. HilbertCurve achieves the highest sparsity (e.g., for Mochi: 0.392 vs Rowmajor 0.363).
  • Accuracy: While HilbertCurve shows slightly higher L1 error than Rowmajor (e.g., 0.0389 vs 0.0307 for Mochi), the error remains within acceptable bounds (below the $l_1 = 0.05$ / $l_2 = 0.06$ thresholds). The HilbertCurve permutation is therefore a beneficial technique for visual models, significantly increasing sparsity without critically compromising accuracy; Random permutation, as expected, yields very low self-similarity and sparsity. (A small 2D sketch of a Hilbert-curve ordering follows below.)
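
For intuition, the sketch below builds a Hilbert-curve ordering for a 2D token grid using the standard bit-manipulation construction; the paper applies a 3D variant to (frame, height, width) video tokens, so this is only an illustration of how the permutation keeps spatially adjacent tokens adjacent in the sequence.

```python
import numpy as np

def xy_to_hilbert(n: int, x: int, y: int) -> int:
    """Index of (x, y) along a Hilbert curve on an n x n grid (n a power of two)."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) > 0 else 0
        ry = 1 if (y & s) > 0 else 0
        d += s * s * ((3 * rx) ^ ry)
        if ry == 0:                      # rotate the quadrant
            if rx == 1:
                x, y = s - 1 - x, s - 1 - y
            x, y = y, x
        s //= 2
    return d

def hilbert_permutation(h: int, w: int) -> np.ndarray:
    """Permutation mapping row-major token order to Hilbert-curve order."""
    n = 1
    while n < max(h, w):
        n *= 2
    keys = [xy_to_hilbert(n, x, y) for y in range(h) for x in range(w)]
    return np.argsort(keys)

# tokens: (h*w, d) features stored row-major; reorder before attention and
# apply the inverse permutation to the output afterwards.
# tokens_hilbert = tokens[hilbert_permutation(h, w)]
```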

6.3.4. Ablation of Self-Similarity Judge (Table 5 and Table 10)

The following are the results from Table 5 of the original paper:

Method VQA-a ↑ VQA-t ↑ FScore ↑
W/o. self-sim Judge 34.664 44.722 1.138
With self-sim Judge 54.179 67.219 1.807

The following are the results from Table 10 of the original paper:

| Metric | w/ judge (CogvideoX) | w/ judge (Mochi) | w/o judge (CogvideoX) | w/o judge (Mochi) | filter w/ judge (CogvideoX) | filter w/ judge (Mochi) | filter w/o judge (CogvideoX) | filter w/o judge (Mochi) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| L1 error ↓ | 0.0316 | 0.0343 | 0.0325 | 0.0365 | 0.0843 | 0.0555 | 0.214 | 0.154 |
| Sparsity ↑ | 0.199 | 0.301 | 0.203 | 0.305 | 0.242 | 0.371 | 0.275 | 0.392 |

These tables highlight the critical role of the self-similarity judge (controlled by the parameter $\theta$) in SpargeAttn.

  • Comparing "W/o. self-sim Judge" with "With self-sim Judge" (Table 5 on Mochi) shows a dramatic drop in VQA-a, VQA-t, and FScore when the judge is absent. For instance, VQA-a drops from 54.179 to 34.664. This confirms that omitting computations for fix blocks (non-self-similar blocks) without the judge results in the loss of critical information and severe metric degradation.
  • Table 10 provides a more detailed ablation. "w/ judge" refers to the full SpargeAttn with the self-similarity judge enabled. "filter w/ judge" means only using the TopCdf filter on all blocks without the judge to ensure critical blocks are always computed. "filter w/o judge" means applying TopCdf blindly to all blocks.
    • The L1 error is significantly higher in "filter w/o judge" (e.g., 0.214 for CogvideoX) compared to "w/ judge" (0.0316). This reinforces that simply applying a TopCdf filter without first identifying and preserving non-self-similar blocks leads to unacceptable accuracy loss.
    • While the self-similarity judge slightly reduces sparsity (e.g., 0.301 vs 0.305 on Mochi, or 0.199 vs 0.203 on CogvideoX), this minor reduction is a necessary trade-off for end-to-end accuracy, validating the design choice of compressing blocks selectively based on their self-similarity (a small sketch of such a judge follows below).
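
A rough sketch of what such a self-similarity judge can look like is given below: each block's tokens are compared (by cosine similarity) to the block mean, and only blocks above the threshold $\theta$ are allowed to be represented by their mean in the coarse attention-map prediction. The exact similarity measure and thresholds in the paper may differ; this is an illustration of the mechanism, not its implementation.

```python
import numpy as np

def block_self_similarity(x_block: np.ndarray) -> float:
    """Mean cosine similarity between each token in a block and the block mean."""
    mean = x_block.mean(axis=0)
    num = x_block @ mean
    denom = np.linalg.norm(x_block, axis=1) * np.linalg.norm(mean) + 1e-12
    return float((num / denom).mean())

def judge_blocks(X: np.ndarray, block: int, theta: float) -> np.ndarray:
    """True for blocks self-similar enough to be compressed to their mean
    for the coarse QK^T prediction; False blocks are never pruned."""
    n_blocks = X.shape[0] // block
    return np.array([
        block_self_similarity(X[i * block:(i + 1) * block]) > theta
        for i in range(n_blocks)
    ])

# Applied to Q and K separately: (query block, key block) pairs where both
# blocks pass the judge are candidates for pruning via the predicted attention
# map; pairs involving a failing block are always computed, which is what
# Tables 5 and 10 show is needed to keep the L1 error low.
```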

6.3.5. Analysis of Sparsity from $M_g$ and $M_{pv}$ (Table 6)

The following are the results from Table 6 of the original paper:

Strategy only Mg only Mpv Mg +Mpv
Sparsity 51.2% 27.7% 54%

This table breaks down the contribution of each masking stage to the total sparsity on Llama3.1 (128K seq_len).

  • The first stage mask ($M_g$, from selective token compression) alone achieves a sparsity of 51.2%.

  • The second stage mask ($M_{pv}$, from the sparse-warp online softmax) alone achieves a sparsity of 27.7%.

  • When both stages are combined ($M_g + M_{pv}$), the total sparsity reaches 54%.

    This shows that both stages contribute to the overall sparsity, with the first stage ($M_g$) being the dominant contributor. The second stage ($M_{pv}$) adds a further 2.8 percentage points (54% - 51.2%) at no extra overhead, demonstrating its efficiency (a short reconciliation of these numbers follows below).
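
One hedged way to reconcile these numbers, assuming the second-stage filter only acts on blocks that survive the first-stage mask (an assumption, since the 27.7% figure is measured with $M_{pv}$ alone):

$$ s_{\text{total}} = s_g + (1 - s_g)\, s_{pv \mid g} \;\Rightarrow\; 0.54 = 0.512 + (1 - 0.512)\, s_{pv \mid g} \;\Rightarrow\; s_{pv \mid g} \approx 0.057. $$

That is, the softmax-aware filter removes roughly a further 5.7% of the blocks the first stage keeps, which corresponds to the 2.8-percentage-point gain noted above.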

6.3.6. Sparsity Increases with Sequence Length (Table 7)

The following are the results from Table 7 of the original paper:

Sequence Len 8K 16K 24K 48K 128K
Sparsity 6.8% 26.4% 35.7% 49.8% 54%

This table, based on Llama3.1, reveals a crucial trend: sparsity increases significantly with sequence length under a constant accuracy bound. For an 8K sequence, sparsity is only 6.8%, but it rises to 54% for a 128K sequence. This implies that the longer the context, the more irrelevant information can be pruned, and thus the higher the potential speedup that SpargeAttn can achieve. This makes SpargeAttn particularly effective and scalable for long-context models.

6.3.7. Sparsity Analysis Over Diffusion Model (Figures 14, 15, 16, 17)

The paper includes an in-depth sparsity analysis for CogvideoX, a diffusion model, providing insights into where sparsity manifests.

The following figure (Figure 14 from the original paper) shows layer-wise sparsity of CogvideoX:

Figure 14. Layer-wise sparsity of CogvideoX. (Mean sparsity per layer index; the global average is 0.27, and layer L5 reaches about 0.60.)

Figure 14. Layer-wise sparsity of CogvideoX. This figure shows that sparsity varies considerably across different layers of the model. Some layers (e.g., L5) exhibit very high sparsity (~0.6), while others are much lower (e.g., L0-L1 are around 0.15). This heterogeneity confirms the need to set hyper-parameters (and thus sparsity levels) independently for each attention layer to maximize efficiency while preserving accuracy.

The following figure (Figure 15 from the original paper) shows timestep-wise sparsity of CogvideoX:

Figure 15. Timestep-wise sparsity of CogvideoX. (Mean sparsity versus diffusion timestep, rising from about 0.2 to nearly 0.3.)

Figure 15. Timestep-wise sparsity of CogvideoX. For diffusion models, the sparsity increases with sample timesteps. Early timesteps (e.g., around 0-20) show lower sparsity (around 0.2), while later timesteps (e.g., after 60) show higher sparsity (approaching 0.3). This suggests that attention is denser and more critical in the initial stages of the diffusion process (when generating coarse structures) and becomes sparser as the model refines details.

The following figure (Figure 16 from the original paper) shows sample-wise sparsity of CogvideoX:

Figure 16. Sample-wise sparsity of CogvideoX. (Mean sparsity per prompt, averaged over timesteps, blocks, batches, and heads; overall mean 0.271.)

Figure 16. Sample-wise sparsity of CogvideoX. This figure shows sparsity varying across different input samples (prompts). While there's a mean sparsity of 0.271, individual samples can have higher or lower sparsity. This highlights the dynamic nature of SpargeAttn and its ability to adapt sparsity to the specific input, rather than relying on a fixed pattern.

The following figure (Figure 17 from the original paper) shows head-wise sparsity of CogvideoX:

Figure 17. Head-wise sparsity of CogvideoX. (One panel per layer, showing per-head sparsity within that layer.)

Figure 17. Head-wise sparsity of CogvideoX. This figure (composed of multiple charts, one for each layer) illustrates that sparsity also varies across different attention heads within a single layer. Some heads are much sparser than others. Similar to layer-wise sparsity, this indicates that setting hyper-parameters for each attention head independently could yield further optimization.

These detailed sparsity analyses provide strong evidence for the effectiveness of SpargeAttn's dynamic and adaptive approach, as it can leverage varied sparsity patterns across layers, heads, timesteps, and inputs, rather than relying on a single fixed pattern.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces SpargeAttn, a novel training-free, universal, sparse, and quantized attention mechanism designed to significantly accelerate inference for a wide array of large models without sacrificing end-to-end performance. The core of SpargeAttn is a two-stage online filtering process. The first stage employs selective token compression and a TopCdf mechanism to rapidly predict and prune coarse-grained attention blocks based on block self-similarity. The second stage implements a softmax-aware filter at the GPU warp level, further skipping negligible value updates with no additional overhead. SpargeAttn is integrated into the SageAttention framework for 8-bit quantization benefits, and for visual models, it incorporates a HilbertCurve permutation to enhance token locality and sparsity.

Extensive experiments on diverse language (Llama3.1), image (Flux, Stable-Diffusion3.5), and video generation (CogvideoX, Mochi, Open-Sora-Plan) models demonstrate SpargeAttn's robust performance. It consistently achieves substantial speedups (up to 4.5x) compared to Full-Attention and significantly outperforms existing sparse attention baselines (MInference, FlexPrefill) that often incur severe metric degradation. The sparse prediction overhead is shown to be minimal, especially for longer sequences, and the sparsity itself increases with sequence length, making SpargeAttn highly beneficial for long-context models. The self-similarity judge and HilbertCurve permutation are validated as critical components for maintaining accuracy and improving sparsity respectively.

7.2. Limitations & Future Work

The paper doesn't explicitly dedicate a section to limitations or future work, but some points can be inferred:

  • Hyper-parameter Tuning Overhead: While SpargeAttn is training-free in terms of model weights, it requires a per-layer grid search for the hyper-parameters ($\tau$, $\theta$, $\lambda$) to ensure accuracy within specified bounds. This calibration, though offline, could be time-consuming for models with many attention layers and heads. Future work could explore automated or more efficient hyper-parameter search strategies.
  • Generality of HilbertCurve: The HilbertCurve permutation is specifically applied to 3D visual tokens. While effective, its direct applicability to language models (which are inherently 1D sequences) or other data modalities is not discussed. Future work might investigate generalizable permutation strategies for improving token similarity across various data types.
  • Scope of "Any Model Inference": While "any model" is a strong claim, the experiments focus on generative models. The performance on discriminative tasks or other architectures might warrant further investigation to fully support the "any model" claim.
  • Theoretical Bounds on Accuracy: The paper empirically shows that accuracy is preserved, but a deeper theoretical understanding of why the L1 error remains low with these pruning strategies could provide stronger guarantees and guide further improvements.
  • Further Optimization of the Second Stage: Although the softmax-aware filter has no extra overhead, its contribution to total sparsity is smaller than the first stage. Exploring ways to make this fine-grained pruning even more aggressive without accuracy loss could be a direction.

7.3. Personal Insights & Critique

SpargeAttn presents a highly compelling solution to the attention bottleneck, particularly for the increasingly common paradigm of long-context and multimodal large models. Its training-free nature is a major strength, as it allows immediate application to powerful, pre-trained models without the immense computational cost of retraining. The combination of selective token compression with a softmax-aware online filter is elegant, addressing the accuracy-efficiency trade-off by being dynamic and adaptive rather than relying on fixed patterns.

The empirical results are very strong, showcasing consistent speedups across language, image, and video generation tasks while maintaining virtually identical end-to-end performance. The visual comparisons are particularly convincing in highlighting the quality degradation of baselines versus SpargeAttn's fidelity. The analysis of sparsity increasing with sequence length is a key insight, indicating that SpargeAttn's benefits will only grow as context windows expand further.

A potential area for improvement or further investigation lies in the hyper-parameter tuning for each layer. While the proposed grid search for τ,θ,λ\tau, \theta, \lambda is effective, it still represents a calibration step that could become cumbersome for extremely deep models or frequent model updates. Research into adaptive hyper-parameter strategies that learn or dynamically adjust these values on-the-fly without a separate search phase would enhance SpargeAttn's usability even further. Additionally, exploring the interaction between quantization and sparsity more deeply, perhaps even co-optimizing them, could yield further gains.

The paper's claim of "accelerating any model inference" is ambitious, and while well-supported for generative models, its application to other domains (e.g., discriminative tasks like classification, or different architectural variants beyond standard transformers) would solidify this claim even more. Nevertheless, SpargeAttn stands as a significant advancement in practical attention acceleration, offering a robust and readily deployable solution for a critical challenge in large model inference. The code release further enhances its practical impact.
