SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference
TL;DR Summary
SpargeAttention introduces a training-free, universal sparse attention mechanism to accelerate model inference. It employs a two-stage online filter to predict attention maps and skip matrix multiplications, significantly enhancing performance across various tasks while maintaining end-to-end accuracy.
Abstract
An efficient attention implementation is essential for large models due to its quadratic time complexity. Fortunately, attention commonly exhibits sparsity, i.e., many values in the attention map are near zero, allowing for the omission of corresponding computations. Many studies have utilized the sparse pattern to accelerate attention. However, most existing works focus on optimizing attention within specific models by exploiting certain sparse patterns of the attention map. A universal sparse attention that guarantees both the speedup and end-to-end performance of diverse models remains elusive. In this paper, we propose SpargeAttn, a universal sparse and quantized attention for any model. Our method uses a two-stage online filter: in the first stage, we rapidly and accurately predict the attention map, enabling the skip of some matrix multiplications in attention. In the second stage, we design an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications. Experiments show that our method significantly accelerates diverse models, including language, image, and video generation, without sacrificing end-to-end metrics. The code is available at https://github.com/thu-ml/SpargeAttn.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title states the paper's central topic: SpargeAttention, an accurate and training-free sparse attention method for accelerating inference of any model. The focus is on developing an efficient, universal sparse attention mechanism that speeds up diverse large models without retraining and without sacrificing accuracy.
1.2. Authors
The authors are:
- Jintao Zhang
- Chendong Xiang
- Haofeng Huang
- Jia Wei
- Haocheng Xi
- Jun Zhu
- Jianfei Chen

Affiliations are primarily associated with Tsinghua University (THU), indicated by superscripts 1 and 2, with the third likely an external affiliation. The full affiliations are not explicitly detailed in the provided abstract/header, but the thu-ml organization in the GitHub link points to Tsinghua University's Machine Learning group.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, with the most recent version published on 2025-02-25T12:02:17.000Z. The specific journal or conference where it might be formally published is not stated, but the submission date suggests it is intended for a top-tier ML conference or journal in 2025. Given the arXiv status and the early 2025 date, it is likely awaiting peer review or has been accepted to a major conference.
1.4. Publication Year
The paper was published on arXiv on 2025-02-25.
1.5. Abstract
The abstract introduces the problem of quadratic time complexity in attention mechanisms, which is a major bottleneck for large models, especially with long sequence lengths. It notes that attention maps commonly exhibit sparsity, meaning many values are near zero, which can be exploited for computational efficiency. While existing sparse attention methods accelerate specific models by leveraging certain sparse patterns, a universal solution that guarantees both speedup and end-to-end performance across diverse models remains elusive.
This paper proposes SpargeAttn, a universal, training-free, sparse, and quantized attention mechanism. It employs a two-stage online filter:
- First Stage: Rapidly and accurately predicts the attention map, enabling the skipping of some matrix multiplications (QK^T and PV). This stage uses selective token compression based on token similarity within blocks.
- Second Stage: Introduces an online softmax-aware filter that incurs no extra overhead and further skips some matrix multiplications (PV). This stage leverages the properties of online softmax to identify negligible values.

SpargeAttn is integrated with the SageAttention framework for additional quantization-based acceleration. Experiments demonstrate that SpargeAttn significantly accelerates diverse models (language, image, and video generation) without compromising end-to-end performance metrics. The code is publicly available.
1.6. Original Source Link
https://arxiv.org/abs/2502.18137 (Preprint)
1.7. PDF Link
https://arxiv.org/pdf/2502.18137v8.pdf (Preprint)
2. Executive Summary
2.1. Background & Motivation
The core problem SpargeAttn addresses is the computational inefficiency of the attention mechanism in large deep learning models. Standard self-attention has a time complexity quadratic in the sequence length (typically $O(N^2 d)$, where $N$ is the sequence length and $d$ the head dimension). As models handle increasingly long sequences (e.g., 45K-128K tokens in video generation and language models), this quadratic cost accounts for a large share of inference latency.
The paper identifies two key challenges with existing sparse attention methods:
- 1. Universality: Most methods are tailored to specific tasks (e.g., language modeling) by exploiting predefined patterns (e.g., sliding windows, attention sinks). However, attention patterns vary significantly across model types (language, image, video), so these task-specific methods are not universal.
- 2. Usability: Achieving both accuracy (precisely predicting the important regions of the attention map) and efficiency (minimal prediction overhead) is difficult. Aggressive approximation can lose critical information, while complex prediction can negate the speedup; for example, some methods require very large sequence lengths before showing noticeable speedup.

The paper's entry point is to build a universal, training-free sparse attention operator that accelerates inference for any model across modalities (language, image, video) and sequence lengths, without compromising end-to-end performance. It achieves this by dynamically identifying and skipping negligible computations in the attention map with a novel two-stage filtering approach.
2.2. Main Contributions / Findings
The primary contributions of SpargeAttn are:
- A Universal Two-Stage Online Filtering Mechanism:
  - First Stage (Sparse Prediction): A novel selective token compression algorithm accurately predicts sparse blocks in the attention map by compressing blocks of query (Q) and key (K) that exhibit high self-similarity into single representative tokens. This allows skipping the corresponding matrix multiplications (QK^T) and partial value multiplications (PV). This stage is pattern-free and dynamically adapts to the input.
  - Second Stage (Sparse Warp Online Softmax): An additional softmax-aware filter is designed that incurs no extra overhead. It leverages the properties of online softmax to further identify and skip negligible value multiplications (PV) when the row maximum of the attention score is significantly smaller than the running maximum, indicating that the softmax output will be near zero for those elements.
- Integration with Quantized Attention (SageAttention): SpargeAttn is seamlessly integrated into the 8-bit quantized SageAttention framework, further accelerating inference. The authors note that quantization and sparse operations are orthogonal, so their benefits can be combined.
- Hilbert Curve Permutation for Visual Models: For image and video models, SpargeAttn introduces a Hilbert curve permutation to enhance the self-similarity of adjacent tokens. By reordering 3D visual tokens along a Hilbert curve, it preserves locality, increases block self-similarity, and thus improves sparsity without accuracy loss.
- Demonstrated Universality and Performance:
  - Universality: SpargeAttn is shown to be effective across diverse generative models, including language modeling (Llama3.1), text-to-image (Flux, Stable-Diffusion3.5), and text-to-video (CogvideoX, Mochi, Open-Sora-Plan), and across various sequence lengths.
  - Accuracy: It robustly retains end-to-end model performance with almost no metric loss compared to Full-Attention, significantly outperforming existing sparse attention baselines (MInference, FlexPrefill), which often incur performance degradation.
  - Efficiency: SpargeAttn achieves 2.5x to 5x faster speeds compared to dense and existing sparse attention implementations. For instance, it achieves a 1.83x speedup on Mochi without video quality loss. The sparse block prediction overhead is minimal, especially for longer sequences.

These findings highlight SpargeAttn's ability to overcome the limitations of prior sparse attention methods by providing a training-free, universal, accurate, and highly efficient solution for accelerating large model inference.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand SpargeAttn, a reader needs to be familiar with the following core concepts:
- Attention Mechanism: At its heart, attention in transformer models allows the model to weigh the importance of different parts of the input sequence when processing each element.
  - Query (Q), Key (K), Value (V) Matrices: In self-attention, an input sequence is transformed into three matrices, Query (Q), Key (K), and Value (V), typically derived from the input embeddings through linear transformations.
  - Attention Score Calculation: The core operation computes a similarity score between each query and all keys via the matrix multiplication $QK^\top$. The scores are then scaled by $1/\sqrt{d_k}$ (where $d_k$ is the dimension of the key vectors) to prevent very large values from pushing softmax into regions with tiny gradients.
  - Softmax: The scaled scores are passed through a softmax function to produce an attention map ($P$). Softmax normalizes the scores into a probability distribution, where each value indicates the "attention weight" a query places on a specific key. For a vector $x$ of length $n$: $ \mathrm{softmax}(x)_i = \frac{e^{x_i}}{\sum_{j=1}^{n} e^{x_j}} $
  - Weighted Sum of Values: Finally, the attention map ($P$) is multiplied by the Value matrix ($V$) to produce the output, effectively a weighted sum of value vectors with weights given by the attention map: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V $
  - Quadratic Complexity: The $QK^\top$ multiplication produces an $N \times N$ attention map for a sequence of length $N$, giving a computational complexity of $O(N^2 d)$, which is the primary performance bottleneck at long sequence lengths.
- Sparsity in Attention Maps: The softmax function tends to concentrate probability mass on a few dominant attention scores, making most other scores very close to zero. This phenomenon is known as sparsity in the attention map. SpargeAttn and other sparse attention methods exploit this sparsity by skipping computations for these near-zero values, reducing computational load.
- FlashAttention: FlashAttention (Dao et al., 2022; Dao, 2024) is a highly optimized implementation of exact attention. Instead of materializing the full attention map in device memory, FlashAttention uses a tiling strategy: it divides the Q, K, V matrices into smaller blocks and computes the attention output progressively with an online softmax algorithm. This significantly reduces memory usage and improves speed through I/O awareness between fast on-chip memory (SRAM) and slower off-chip memory (HBM). SpargeAttn builds directly on the FlashAttention framework.
- Online Softmax: In a naive softmax, the normalization term (the sum in the denominator) requires summing over all elements, which can overflow or underflow for long sequences. Online softmax is the technique used in FlashAttention to compute softmax in a numerically stable, incremental manner. It maintains a running maximum ($m$) and a running sum ($l$) of the exponents; when a new block of attention scores arrives, it updates these statistics and rescales previously accumulated results, effectively normalizing across the entire sequence seen so far without storing the full attention score matrix (see the sketch after this list).
- Quantization: Quantization reduces the precision of numerical representations (e.g., from 32-bit floating point to 8-bit integer) in neural networks, which lowers memory footprint and can accelerate computation, especially on hardware optimized for low-precision arithmetic. SpargeAttn integrates with SageAttention, a quantized attention framework focused on 8-bit quantization for plug-and-play inference acceleration, to further enhance speed.
- Cosine Similarity: Cosine similarity measures the similarity between two non-zero vectors as the cosine of the angle between them: 1 means identical direction, 0 means orthogonal, and -1 means opposite. In SpargeAttn, it is used to measure the self-similarity of tokens within a block of Q or K. $ \mathrm{CosineSimilarity}(A, B) = \frac{A \cdot B}{\|A\| \|B\|} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \sqrt{\sum_{i=1}^{n} B_i^2}} $
- Cumulative Distribution Function (CDF) Concept (for TopCdf): A cumulative distribution function gives the probability that a random variable takes a value less than or equal to a given value. SpargeAttn's TopCdf uses a similar idea: it selects tokens whose cumulative attention scores (when sorted in descending order) reach a certain fraction (tau) of the total sum, thereby identifying the most important attention scores.
- GPU Warps: A warp is a group of threads that execute the same instruction in parallel on a GPU (Graphics Processing Unit). GPUs achieve high throughput by executing many warps simultaneously. SpargeAttn's second stage (sparse warp online softmax) leverages this warp-level parallelism: individual warps skip computation when their assigned part of the attention map is deemed negligible.
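To ground the online-softmax recurrence that the rest of this analysis builds on, here is a minimal PyTorch sketch of blockwise attention with a running maximum and running sum. The block size and function name are illustrative; it omits the masking, quantization, and kernel-level details of FlashAttention/SpargeAttn.

```python
import torch

def online_softmax_attention(Q, K, V, block_size=64):
    """Blockwise exact attention via the online-softmax recurrence (single head)."""
    N, d = Q.shape
    O = torch.zeros_like(Q)
    m = torch.full((N, 1), float("-inf"))   # running row maximum
    l = torch.zeros(N, 1)                   # running softmax denominator
    for j in range(0, N, block_size):
        Kj, Vj = K[j:j + block_size], V[j:j + block_size]
        S = Q @ Kj.T / d ** 0.5             # scores of all queries vs. this key block
        m_new = torch.maximum(m, S.max(dim=-1, keepdim=True).values)
        P = torch.exp(S - m_new)            # shifted, unnormalized probabilities
        scale = torch.exp(m - m_new)        # rescales previously accumulated state
        l = scale * l + P.sum(dim=-1, keepdim=True)
        O = scale * O + P @ Vj
        m = m_new
    return O / l                            # final normalization

# Matches dense attention up to floating-point error:
Q, K, V = (torch.randn(256, 64) for _ in range(3))
ref = torch.softmax(Q @ K.T / 64 ** 0.5, dim=-1) @ V
assert torch.allclose(online_softmax_attention(Q, K, V), ref, atol=1e-4)
```

SpargeAttn's two filters decide, inside exactly this kind of loop, which (query block, key block) products and which P~V updates can be skipped.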
3.2. Previous Works
The paper categorizes existing sparse attention methods into three types based on how the sparse mask (which indicates important non-zero entries to compute) is generated:
- Pattern-based methods: These rely on fixed, predefined sparsity patterns derived from empirical observations, often specific to certain tasks.
  - Examples: sliding windows (H2O, InfLLM, DuoAttention), attention sinks (SampleAttention, MoA, StreamingLLM). DitFastAttn combines sliding windows with similarity between attention maps but is restricted to Diffusion Transformers.
  - Limitation: They lack universality. Since attention patterns vary significantly across different models (language, image, video; see Figure 2 in the paper), fixed patterns do not generalize well. For example, DitFastAttn is incompatible with language models and MMDiT models.
- Dynamic sparse methods: These compute the sparse mask on-the-fly based on the input, making them potentially more universal.
  - Channel compression: Methods like SparQAttn and LokiAttn reduce the dimensionality of Q and K while performing full attention.
    - Limitation: Limited speedup potential, as the head dimension is often already small (e.g., 64 or 128).
  - Token compression: Methods like MInference and FlexPrefill compress blocks of tokens into single representative tokens and compute attention on this shorter sequence.
    - Limitation: Can be too aggressive; important blocks may be missed if their attention scores are not high on the compressed sequence. MInference may require very large sequence lengths (e.g., 100K) to show noticeable speedup.
  - Training-required dynamic methods: SeerAttention requires training additional parameters to learn sparse attention.
    - Limitation: Expensive to use due to the need for retraining.
  - General Limitation: Many dynamic sparse methods are designed primarily for language models, and their applicability to other modalities (such as diffusion models) is uncertain.
- Training-based methods: These methods modify the attention computation logic and require retraining the entire model from scratch.
  - Examples: Reformer, FastAttention.
  - Limitation: Extremely expensive and impractical to apply to pre-trained large models without significant resources.

Beyond sparse attention, other orthogonal acceleration techniques include:
- Kernel Optimization: Optimizing the underlying CUDA kernels (e.g., FlashAttention, FlashAttention-2, FlashAttention-3). SpargeAttn builds on FlashAttention.
- Quantization: Reducing numerical precision (e.g., SageAttention). SpargeAttn integrates with this.
- Workload Distribution: Distributing attention computation across multiple devices (e.g., RingAttention).
- Linear Time Attention: Designing attention mechanisms with linear time complexity (e.g., Linformer, Performer).
3.3. Technological Evolution
The evolution of attention mechanisms has moved from the original Transformer's self-attention with its quadratic complexity to various optimizations. Initially, the focus was on exact attention with memory and speed optimizations, notably FlashAttention, which addresses I/O bottlenecks. Concurrently, researchers recognized the inherent sparsity in attention maps and began developing sparse attention methods. Early sparse attention often relied on predefined patterns, which limited their universality. Later, dynamic sparse methods emerged to adapt to input, but still faced challenges in balancing accuracy and efficiency without retraining, and were often modality-specific. Quantization also emerged as an orthogonal technique for further acceleration.
SpargeAttn fits into this timeline by building on the FlashAttention framework and addressing the universality and usability limitations of prior dynamic sparse methods. It aims to provide a training-free, universal, and accurate solution by combining selective token compression with softmax-aware filtering and integrating quantization, effectively merging the benefits of FlashAttention's efficiency with a robust dynamic sparsity approach across diverse models.
3.4. Differentiation Analysis
SpargeAttn differentiates itself from previous sparse attention methods primarily in its universality, training-freeness, and accuracy preservation across diverse model types:
- Universality (vs. pattern-based and modality-specific dynamic methods):
  - Unlike pattern-based methods (e.g., sliding windows, attention sinks), which are model- or task-specific, SpargeAttn employs a pattern-free selective token compression that dynamically adapts to input similarities. This allows it to work across language, image, and video generation models without task-specific knowledge.
  - Many dynamic sparse methods (e.g., MInference, FlexPrefill, SeerAttention) were designed primarily for language models or require training. SpargeAttn explicitly demonstrates applicability across modalities and does not require training additional parameters.
- Training-free (vs. training-based and some dynamic methods):
  - SpargeAttn is training-free, so it can be applied directly to pre-trained models without the prohibitive cost of retraining the entire model (unlike Reformer, FastAttention, or SeerAttention). This makes it highly usable for existing large models.
- Accuracy preservation (vs. aggressive token compression):
  - Token compression methods like MInference and FlexPrefill can be too aggressive, potentially missing important attention regions and degrading accuracy. SpargeAttn mitigates this through selective token compression (only compressing blocks with high self-similarity) and a two-stage online filter. The self-similarity check ensures that non-self-similar blocks (which might contain critical information) are always computed, preventing loss of critical information. This is validated by its ability to robustly retain end-to-end model performance.
- Efficiency and overhead (vs. baselines):
  - SpargeAttn achieves significant speedups (2.5x to 5x) while maintaining accuracy, outperforming baselines like MInference and FlexPrefill, which often show accuracy degradation at comparable sparsity levels or require specific conditions (e.g., very long sequences for MInference).
  - The two-stage online filter is designed to have minimal prediction overhead, and the softmax-aware filter in the second stage incurs no extra overhead, making the overall approach very efficient.
- Novel components:
  - The selective token compression based on block self-similarity is a novel pattern-free approach.
  - The sparse warp online softmax is a new mechanism for fine-grained pruning within the FlashAttention online process.
  - The Hilbert curve permutation is a novel technique to enhance sparsity specifically for 3D visual tokens in image and video models, demonstrating SpargeAttn's thoughtful design for multimodal application.

In essence, SpargeAttn seeks to provide a plug-and-play solution for attention acceleration that is broadly applicable, highly accurate, and computationally efficient, bridging the gap left by prior methods that were too specific, too expensive to apply, or compromised accuracy.
4. Methodology
The core methodology of SpargeAttn revolves around a two-stage online filtering mechanism integrated into the FlashAttention framework, further combined with quantization and a HilbertCurve permutation for visual tasks. The goal is to dynamically identify and skip computations for sparse regions of the attention map without sacrificing accuracy.
4.1. Principles
The fundamental principle of SpargeAttn is to exploit the inherent sparsity of the attention map (where many values are near zero) to accelerate computations. It operates on the FlashAttention tiling strategy and aims to skip matrix multiplications in two stages:
- Macro-level Block Pruning (First Stage): Predict which blocks of the attention map will contribute negligibly to the final output. This is done by compressing Q and K blocks that are highly self-similar and computing a coarse-grained attention map.
- Micro-level Warp Pruning (Second Stage): Within the remaining, unpruned blocks, further identify and skip individual value contributions that are negligible during the online softmax computation, leveraging GPU warp parallelism.

A crucial underlying intuition is that neighboring tokens in Q and K often exhibit high similarity, especially in visual data. By leveraging this, SpargeAttn can compress these similar regions for quick sparse prediction, while carefully ensuring that non-self-similar (and potentially critical) regions are always fully computed to maintain accuracy.
4.2. Core Methodology In-depth (Layer by Layer)
SpargeAttn modifies the FlashAttention algorithm, which processes the query, key, and value matrices (Q, K, V) in blocks. Let $N$ be the sequence length and $d$ the head dimension, so Q, K, V have dimensions $N \times d$. The attention score matrix $S$ and the attention map $P = \sigma(S)$ are $N \times N$, and the output $O$ is $N \times d$. FlashAttention divides Q, K, V into blocks of a fixed block size along the sequence dimension, so each query block is processed against each key-value block.
The workflow of SpargeAttn is illustrated in Figure 3.
As depicted in Figure 3, the SpargeAttn workflow is structured into three main phases:
- Sparse Block Online Prediction (Step 1): Computes a coarse-grained attention map using compressed Q and K blocks to identify which full attention blocks can be skipped.
- Sparse Block Online Masking (Step 2): Based on the prediction from Step 1, binary masks are generated and applied to skip computations for entire attention blocks.
- Sparse Warp Online Softmax (Step 3): Within the remaining (unmasked) attention blocks, a finer-grained filter is applied at the GPU warp level to further skip value vector updates whose contribution is negligible.

The complete algorithm for SpargeAttn is provided in Algorithm 1.
4.2.1. Sparse FlashAttention Basics (Section 3.1)
SpargeAttn leverages the tiling strategy and online softmax of FlashAttention. The online softmax mechanism progressively computes each output block $O_i$ for a query block $Q_i$ by iterating over the key-value blocks $(K_j, V_j)$. The standard FlashAttention equations are:
$ S_{ij} = Q_i K_j^\top / \sqrt{d} $
Here, $S_{ij}$ represents the attention scores of the $i$-th query block $Q_i$ against the $j$-th key block $K_j$, with the $1/\sqrt{d}$ scaling applied.
The online softmax mechanism updates the running maximums ($m_{ij}$) and normalization terms ($l_{ij}$) iteratively:
$ (m_{ij}, \widetilde{P}_{ij}) = \widetilde{\sigma}(m_{i, j-1}, S_{ij}) $
$ l_{ij} = \exp(m_{i, j-1} - m_{ij}) \, l_{i, j-1} + \mathrm{rowsum}(\widetilde{P}_{ij}) $
$ O_{ij} = \mathrm{diag}\left(\exp(m_{i, j-1} - m_{ij})\right) O_{i, j-1} + \widetilde{P}_{ij} V_j $
where:
- $m_{i,j-1}$ and $l_{i,j-1}$ are the running maximum and running sum up to the (j-1)-th key block.
- $m_{ij}$ is a vector holding the maximum value of each row of $S$ over all key blocks up to $j$; it is updated as $m_{ij} = \max(m_{i,j-1}, \mathrm{rowmax}(S_{ij}))$.
- $\widetilde{P}_{ij} = \exp(S_{ij} - m_{ij})$ is the unnormalized attention map block, shifted by the current row maximum for numerical stability.
- $l_{ij}$ is a vector holding the running row sum of the softmax denominator.
- $O_{ij}$ is the $i$-th output block accumulating contributions from the value blocks up to $j$.
- $\mathrm{diag}(\cdot)$ creates a diagonal matrix from a vector.
- $\mathrm{rowsum}(\cdot)$ computes the sum of elements along rows.

Finally, the complete $i$-th output block is obtained by normalizing the last accumulator with the final row sums, $O_i = \mathrm{diag}(l_{i,T})^{-1} O_{i,T}$, where $T$ is the total number of key blocks. SpargeAttn defines sparse attention by introducing block masks:
Definition 1 (Block Masks):
Two binary masks are defined over the grid of (query block, key block) pairs, with each entry either 0 or 1; they determine which computations are skipped.
- The global mask $M_g$: controls skipping of both the $Q_i K_j^\top$ and $\widetilde{P}_{ij} V_j$ computations.
- A second, finer-grained mask (realized by the warp-level filter of the second stage): controls skipping of only the $\widetilde{P}_{ij} V_j$ computation.
Definition 2 (Sparse FlashAttention):
The computation rules for sparse FlashAttention based on these masks are:
- $Q_i K_j^\top$ and $\widetilde{P}_{ij} V_j$ are skipped if $M_g[i, j] = 0$.
- $\widetilde{P}_{ij} V_j$ alone is skipped if the second mask is 0 at position $[i, j]$.
4.2.2. Selective Token Compression for Sparse Prediction (Section 3.2)
This is the first stage of the two-stage online filter, focusing on generating the global mask $M_g$.
Key Idea: The core observation is that many neighboring tokens within Q and K blocks are highly similar. If a block of tokens in Q or K is highly uniform (high self-similarity), it can be effectively represented by a single mean token. This compression allows a much faster, coarse-grained computation of a compressed attention map ($\hat{P}$), which then indicates which full attention blocks are likely to be sparse and can be skipped.
Prediction Steps (Algorithm 1, Lines 4-6):
- Block Compression: For each block $Q_i$ and $K_j$, compute a single representative token by taking the mean across the tokens in that block: $ q = \{q_i\} = \{\mathrm{mean}(Q_i, \mathrm{axis}=0)\} $ and $ k = \{k_j\} = \{\mathrm{mean}(K_j, \mathrm{axis}=0)\} $, so the compressed $q$ contains one token per query block and $k$ one token per key block.
- Block Self-Similarity Measurement: Calculate the cosine similarity within each $Q_i$ and $K_j$ block to assess its self-similarity: $ s_{qi} = \mathrm{CosSim}(Q_i) $, $ s_{kj} = \mathrm{CosSim}(K_j) $, where $ \mathrm{CosSim}(X) = \mathrm{mean}\left(\frac{X X^\top}{|\mathrm{max}(X X^\top)|}\right) $. This measures the average cosine similarity between all pairs of tokens within the block $X$; a higher value indicates greater self-similarity.
- Compressed Attention Score Calculation: Compute a compressed attention score matrix from the compressed tokens $q$ and $k$: $ \hat{S} = q k^\top $.
- Handling Non-Self-Similar (Fix) Blocks: This is a crucial step for accuracy. If a key block $K_j$ is not self-similar (i.e., $s_{kj} < \theta$, where $\theta$ is a hyper-parameter), its corresponding column in $\hat{S}$ is set to $-\infty$: $ \hat{S}[:, j] = -\infty \ \mathrm{if}\ s_{kj} < \theta $. This ensures that these non-self-similar blocks are not pruned based on the compressed attention map and are always considered for full computation.
- Compressed Attention Map: Apply softmax to each row of $\hat{S}$ to obtain the compressed attention map $\hat{P}$: $ \hat{P}[i] = \mathrm{Softmax}(\hat{S}[i]) $.
- Mask Generation ($M_g$): For each row of $\hat{P}$ (corresponding to a query block $Q_i$), the TopCdf function selects the most important key blocks: $ M_g[i, :] = \mathrm{TopCdf}(\hat{P}[i], \tau) $, where $\tau$ is a hyper-parameter between 0 and 1. TopCdf selects the positions of key blocks whose cumulative sum of attention scores in $\hat{P}[i]$ reaches $\tau$ of the row total. These selected positions are set to 1 in $M_g$ (they should be computed); all other positions are set to 0. The Top_Cdf pseudocode is:

```python
import torch

def Top_Cdf(P_i, tau):
    # P_i: one row of the compressed attention map (attention of q_i to all k_j)
    sorted_P, idx = torch.sort(P_i, descending=True)   # sort scores in descending order
    cusum_P = torch.cumsum(sorted_P, dim=0)            # cumulative sum of sorted scores
    mask = cusum_P <= tau * P_i.sum()                  # keep positions within tau of the total mass
    M_i = torch.zeros_like(mask)
    M_i[idx] = mask                                    # scatter back to the original block order
    return M_i
```

  - P_i (written P[i] in the paper): a row from the compressed attention map, representing the attention of $q_i$ to all $k_j$.
  - tau: a threshold hyper-parameter (e.g., 0.9).
  - torch.sort: sorts the values of P_i in descending order and returns their original indices.
  - torch.cumsum: computes the cumulative sum of the sorted attention scores.
  - mask: a boolean mask that is True where the cumulative sum up to that point is at most tau times the total sum of P_i.
  - M_i: a zero mask of the same shape as P_i, populated with 1s at the original indices of the selected top values.
- Ensuring Fix Blocks are Computed: Finally, to prevent non-self-similar blocks (fix blocks) from being skipped, any row of $M_g$ corresponding to a non-self-similar query block and any column corresponding to a non-self-similar key block is entirely set to 1, so computations involving these critical blocks are never omitted: $ M_g[i, :] = 1 \ \mathrm{if}\ s_{qi} < \theta $ and $ M_g[:, j] = 1 \ \mathrm{if}\ s_{kj} < \theta $.
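To make the first stage concrete, the following is a minimal PyTorch sketch of the block compression, self-similarity check, and mask generation described above. It reuses the Top_Cdf routine from the pseudocode, substitutes a simplified mean pairwise cosine similarity for the paper's CosSim, and uses illustrative block size and thresholds; it is not the paper's fused CUDA implementation.

```python
import torch

def cos_sim(X):
    # Simplified self-similarity: mean pairwise cosine similarity of the rows of X.
    Xn = X / X.norm(dim=-1, keepdim=True).clamp_min(1e-6)
    return (Xn @ Xn.T).mean()

def predict_block_mask(Q, K, block=64, theta=0.7, tau=0.9):
    """First-stage prediction: returns M_g of shape (num_q_blocks, num_k_blocks).
    Assumes the sequence length is divisible by `block` for brevity."""
    nq, nk = Q.shape[0] // block, K.shape[0] // block
    Qb, Kb = Q.view(nq, block, -1), K.view(nk, block, -1)
    q, k = Qb.mean(dim=1), Kb.mean(dim=1)            # one representative token per block
    s_q = torch.stack([cos_sim(b) for b in Qb])      # per-block self-similarity
    s_k = torch.stack([cos_sim(b) for b in Kb])
    S_hat = q @ k.T / q.shape[-1] ** 0.5             # compressed attention scores
    S_hat[:, s_k < theta] = float("-inf")            # non-self-similar key blocks are never pruned here
    P_hat = torch.softmax(S_hat, dim=-1)             # compressed attention map
    M_g = torch.zeros(nq, nk, dtype=torch.bool)
    for i in range(nq):
        M_g[i] = Top_Cdf(P_hat[i], tau)              # keep key blocks covering tau of the row's mass
    M_g[s_q < theta, :] = True                       # fix blocks are always computed
    M_g[:, s_k < theta] = True
    return M_g
```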
4.2.3. Masking of the First Stage (Section 3.3)
The block mask $M_g$ generated in the previous stage is applied directly within the FlashAttention inner loop.
Specifically, in the inner loop where attention between a query block $Q_i$ and a key-value block pair $(K_j, V_j)$ is computed (Algorithm 1, Line 9), if $M_g[i, j] = 0$, then the matrix multiplication $Q_i K_j^\top$ and the subsequent $\widetilde{P}_{ij} V_j$ (along with the loading of $K_j, V_j$) are skipped entirely. This directly reduces both computation and memory access.
$
\mathrm{Skip}\ Q_i K_j^\top \ \mathrm{and}\ \widetilde{P}_{ij} V_j, \quad \mathrm{if}\ M_g[i, j] = 0
$
This corresponds to the check in Algorithm 1, Line 10.
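A rough sketch of how such a block mask could be consumed inside the online-softmax loop shown earlier follows (illustrative Python, not the fused CUDA kernel; rows that keep no block at all simply retain a zero output here):

```python
import torch

def sparse_block_attention(Q, K, V, M_g, block=64):
    """Online-softmax attention that skips (Q_i, K_j) pairs where M_g[i, j] is False."""
    N, d = Q.shape
    O = torch.zeros_like(Q)
    m = torch.full((N, 1), float("-inf"))
    l = torch.zeros(N, 1)
    for j in range(N // block):
        rows = M_g[:, j].repeat_interleave(block)       # query rows that keep this key block
        if not rows.any():
            continue                                     # this key block is skipped for every query block
        Kj = K[j * block:(j + 1) * block]
        Vj = V[j * block:(j + 1) * block]
        S = Q[rows] @ Kj.T / d ** 0.5                    # only unmasked (Q_i, K_j) products are computed
        m_new = torch.maximum(m[rows], S.max(dim=-1, keepdim=True).values)
        P = torch.exp(S - m_new)
        scale = torch.exp(m[rows] - m_new)
        l[rows] = scale * l[rows] + P.sum(dim=-1, keepdim=True)
        O[rows] = scale * O[rows] + P @ Vj
        m[rows] = m_new
    return O / l.clamp_min(1e-20)
```

In SpargeAttn the same skip happens inside the fused kernel, so skipped blocks also avoid loading K_j and V_j from memory.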
4.2.4. Sparse Warp Online Softmax (Section 3.4)
This is the second stage of the two-stage online filter, providing finer-grained pruning.
Key Idea: Even within the blocks that are not entirely skipped by $M_g$, some rows of attention scores may still be negligible. This stage identifies and skips the corresponding value updates ($\widetilde{P}_{ij} V_j$) for these negligible rows without extra overhead.
Condition for Skipping: The mechanism relies on the online softmax properties. Recall that $O_{ij}$ is updated as:
$
O_{ij} = \mathrm{diag}\left(\exp(m_{i, j-1} - m_{ij})\right) O_{i, j-1} + \widetilde{P}_{ij} V_j
$
where $\widetilde{P}_{ij} = \exp(S_{ij} - m_{ij})$.
If the row maximum of $S_{ij}$ is smaller than the running maximum $m_{i,j-1}$, then $m_{ij}$ keeps the value of $m_{i,j-1}$, and the values in $\widetilde{P}_{ij} = \exp(S_{ij} - m_{ij})$ are small because $S_{ij} - m_{ij}$ is negative. If $S_{ij}$ is significantly smaller than $m_{ij}$, all values in $\widetilde{P}_{ij}$ are close to 0, making $\widetilde{P}_{ij} V_j$ negligible.
This condition is formalized as:
$
O_{ij} \approx O_{i, j-1}, \quad \mathrm{if}\ \max\left(\exp(S_{ij} - m_{ij})\right) \to 0
$
This approximation is triggered when the shifted local maximum falls below a negative hyper-parameter $\lambda$ (e.g., -5): if the local maximum of $S_{ij}$ is sufficiently far below the global running maximum $m_{ij}$, the exponent term is tiny and the contribution is negligible.
Warp-level Application (Algorithm 1, Lines 14-17):
The computation of $\widetilde{P}_{ij} V_j$ is typically split across multiple GPU warps for parallelism, with each warp processing a sub-block of rows $I_w$ within $\widetilde{P}_{ij}$.
A warp skips its part of the computation (i.e., updating $O_{ij}[I_w]$) if the maximum value of its local score sub-block, after being shifted by the row maximum $m_{ij}$, falls below the threshold $\lambda$:
$
\mathrm{if}\ \max(m_{\mathrm{local}}[I_w] - m_{ij}[I_w]) < \lambda \quad \mathrm{then\ skip}\ \widetilde{P}_{ij}[I_w] V_j
$
This ensures that only meaningful contributions update the output block $O_i$, achieving fine-grained sparsity at the warp level with no additional overhead, since the maxima ($m_{\mathrm{local}}$, $m_{ij}$) are already computed as part of online softmax.
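A tiny sketch of this second-stage check, applied per warp-sized row group inside an unmasked block, might look as follows; the lambda value, row grouping, and the assumption that the accumulator has already been rescaled are illustrative simplifications of the kernel logic:

```python
import torch

def warp_skip_update(O_prev, S_ij, m_ij, V_j, lam=-5.0, warp_rows=32):
    """Skip a row group's P~V_j update when its shifted local maxima fall below lam.
    O_prev: (rows, d) accumulator, assumed already rescaled by exp(m_prev - m_ij).
    S_ij:   (rows, cols) scores for the current (Q_i, K_j) pair.
    m_ij:   (rows, 1) running row maxima after this block."""
    O = O_prev.clone()
    for r in range(0, S_ij.shape[0], warp_rows):
        I_w = slice(r, r + warp_rows)
        m_local = S_ij[I_w].max(dim=-1).values            # per-row max of this score sub-block
        if (m_local - m_ij[I_w].squeeze(-1)).max() < lam:
            continue                                       # exp(S - m) ~ 0 here, so P~V_j is negligible
        P_tilde = torch.exp(S_ij[I_w] - m_ij[I_w])         # shifted, unnormalized attention weights
        O[I_w] = O[I_w] + P_tilde @ V_j
    return O
```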
4.2.5. Combined with SageAttention (Section 3.5)
To further boost performance, SpargeAttn integrates with SageAttention (Zhang et al., 2025a;d;g;b;e), a framework for quantized attention.
Integration:
- SpargeAttn's sparse operations are orthogonal to quantization operations, so they can be directly combined.
- The quantization is performed per-block (Algorithm 1, Line 3: Quant(Qi, Kj) returns the quantized blocks and dequantization scales). The $Q_i K_j^\top$ matrix multiplication then uses these quantized values, with the scales applied for dequantization (Algorithm 1, Line 12).
- The prediction overhead for sparse block selection is minimized by implementing it in CUDA and applying kernel fusion techniques.

Algorithm 1 provides the full integrated process:
- Lines 1-2: Input and block division.
- Line 3: Quantization of Q and K.
- Lines 4-6: Selective token compression for sparse prediction (generating $M_g$).
- Lines 7-22: Main FlashAttention loop.
  - Line 10: First-stage block pruning based on $M_g$. If $M_g[i, j] = 0$, the entire inner-loop body for this block is skipped.
  - Lines 11-13: Load the block, compute $S_{ij}$, and update the online softmax statistics.
  - Lines 14-17: Second-stage sparse warp online softmax. If the skip condition is met, the $\widetilde{P}_{ij} V_j$ update for that warp is skipped.
4.2.6. Hyper-parameters Determination for Model Layer (Section 3.6)
SpargeAttn introduces three hyper-parameters: $\tau$, $\theta$, and $\lambda$. These need to be determined for each attention layer in a model.
Hyper-parameters:
- $\tau$: Controls the sparsity of the first-stage TopCdf mask $M_g$. A higher $\tau$ keeps more elements, leading to lower sparsity.
- $\theta$: Threshold for block self-similarity. Blocks with cosine similarity below $\theta$ are considered non-self-similar (fix blocks) and are always fully computed, preserving accuracy.
- $\lambda$: Threshold for the second-stage sparse warp online softmax. A warp's $\widetilde{P}_{ij} V_j$ update is skipped when its shifted local maximum falls below $\lambda$, so $\lambda$ controls how aggressively the second stage prunes.
Determination Process:
The goal is to find hyper-parameters that maximize attention sparsity while keeping the attention error within strict bounds.
- Error Metric: The Relative L1 distance is used to evaluate attention accuracy: $ L1 = \frac{\sum |O - O'|}{\sum |O|} $, where $O$ is the output of Full-Attention (ground truth) and $O'$ is the output of SpargeAttn.
- Two-Stage Grid Search:
  - Stage 1 (for $\tau$ and $\theta$): A grid search over $\tau$ and $\theta$ finds the pair that maximizes sparsity while keeping the Relative L1 error below a preset per-model bound.
  - Stage 2 (for $\lambda$): After fixing $\tau$ and $\theta$, another grid search over $\lambda$ further maximizes sparsity while keeping the error below a slightly looser bound. This two-stage approach allows sequential optimization of sparsity while controlling accuracy.

The paper lists the specific error bounds used for each model family (Llama3.1; CogvideoX and Mochi; Stable-Diffusion3.5 and Flux; Open-Sora-Plan).
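A schematic of the stage-1 search could look like the following; the candidate grids, the error bound, and the run_sparse hook are placeholders rather than the paper's released tuning code (stage 2 would repeat the same loop over candidate $\lambda$ values under the looser bound):

```python
def relative_l1(O_ref, O_sparse):
    # Relative L1 distance between full-attention and sparse-attention outputs.
    return ((O_ref - O_sparse).abs().sum() / O_ref.abs().sum()).item()

def tune_stage1(run_sparse, O_ref, l1_bound, taus=(0.8, 0.9, 0.95), thetas=(0.5, 0.7, 0.9)):
    """Grid-search (tau, theta) to maximize sparsity subject to relative_l1 < l1_bound.
    run_sparse(tau, theta) -> (output, sparsity) is a placeholder hook into one attention layer."""
    best = None
    for tau in taus:
        for theta in thetas:
            O, sparsity = run_sparse(tau, theta)
            if relative_l1(O_ref, O) < l1_bound and (best is None or sparsity > best[0]):
                best = (sparsity, tau, theta)
    return best  # (sparsity, tau, theta), or None if no setting satisfies the bound
```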
4.2.7. HilbertCurve Permutation (Section 3.7)
This technique is specifically designed to enhance sparsity for visual models (image and video).
Key Idea: The performance of SpargeAttn benefits from high self-similarity within Q and K blocks, since this allows more blocks to be compressed and to participate in TopCdf selection, leading to higher sparsity. Attention is computationally invariant to token permutations: reordering tokens does not change the final output as long as the original positions are properly mapped back. Therefore, if tokens can be permuted to increase the similarity of adjacent tokens, SpargeAttn can achieve better sparsity.
Application to 3D Visual Tokens:
For 3D visual tokens (e.g., from video frames) represented as (Time, Height, Width, Dimension), SpargeAttn uses the Hilbert Curve to reorder them.
- The Hilbert curve is a space-filling curve that effectively preserves locality: it traverses a multi-dimensional space, visiting every cell, while keeping spatially close cells close along the curve.
- By flattening the 3D visual tokens along a Hilbert curve into a 1D sequence of length T × H × W, tokens that were originally spatially or temporally close remain adjacent or nearby in the permuted sequence.
- This significantly increases the similarity of adjacent tokens in the 1D sequence, which in turn raises the self-similarity within the blocks of Q and K. Higher block self-similarity allows more blocks to be compressed (fewer fix blocks) and thus increases the overall sparsity achievable by SpargeAttn.

Figure 5 illustrates this for a 1 × 6 × 6 space, showing how a Hilbert curve maintains locality better than row-major ordering.
This image illustrates different token permutation methods in a 1 × 6 × 6 space with a block size of 4. It shows the arrangement of the visual tokens and the similarity within and across blocks; arrows indicate the reordering path, and shading indicates the different block properties.
Figure 5. Illustration of different token permutation methods in a 1 × 6 × 6 space, with a block size of 4.
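The permutation itself is model-agnostic: as long as the same reordering is applied to Q, K, V and inverted on the output, the attention result is unchanged. Below is a small sketch of that bookkeeping with an arbitrary stand-in permutation; a real implementation would compute a Hilbert-curve ordering of the (t, h, w) grid instead of a random one.

```python
import torch

def attention_with_permutation(Q, K, V, perm):
    """perm: 1D LongTensor giving the new token order (e.g., a Hilbert-curve ordering)."""
    inv = torch.argsort(perm)                    # inverse permutation to restore the original order
    Qp, Kp, Vp = Q[perm], K[perm], V[perm]       # reorder tokens so spatial neighbors share blocks
    Op = torch.softmax(Qp @ Kp.T / Q.shape[-1] ** 0.5, dim=-1) @ Vp
    return Op[inv]                               # map outputs back to the original token order

# Any permutation leaves the result unchanged (up to floating-point error):
N, d = 128, 64
Q, K, V = (torch.randn(N, d) for _ in range(3))
perm = torch.randperm(N)
ref = torch.softmax(Q @ K.T / d ** 0.5, dim=-1) @ V
assert torch.allclose(attention_with_permutation(Q, K, V, perm), ref, atol=1e-5)
```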
5. Experimental Setup
5.1. Datasets
SpargeAttn is evaluated on a diverse range of generative tasks across different modalities:
- Language Models:
  - Llama3.1 (8B): A large language model (LLM).
  - WikiText (Merity et al., 2017): A dataset of Wikipedia articles, used to assess the model's prediction confidence, typically measured by perplexity.
  - Longbench (Bai et al., 2024): A benchmark for long-context understanding, with varied tasks to comprehensively evaluate how well LLMs handle extended input sequences.
  - En.MC of InfiniteBench (Zhang et al., 2024): Another long-context benchmark, specifically evaluating multi-choice question answering over very long contexts.
  - Needle-in-a-Haystack (NIAH) task (Kamradt, 2023): A task designed to test an LLM's ability to retrieve a small piece of critical information (the "needle") embedded within a very long, irrelevant context (the "haystack"), assessing retrieval ability and long-context robustness.
- Text-to-Video Models:
  - CogvideoX (2B), Mochi (Team, 2024), Open-Sora-Plan (Lin et al., 2024): Models for generating video from text descriptions.
  - Open-Sora prompt sets (Zheng et al., 2024c): Text prompts used to generate videos for evaluation, covering a variety of scenarios and content.
- Text-to-Image Models:
  - Flux (.1-dev) (Black Forest Labs, 2023), Stable-Diffusion3.5 (large) (Stability AI, 2023): Models for generating images from text descriptions.
  - COCO annotations (Lin et al., 2014): A large-scale object detection, segmentation, and captioning dataset; its annotations serve as prompts for text-to-image generation and for evaluating the quality of the generated images.

The choice of these datasets and models reflects a comprehensive evaluation strategy, covering different tasks (generation, comprehension, retrieval) and modalities (text, image, video), which is crucial for validating SpargeAttn's universality claim.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a detailed explanation:
5.2.1. Language Models
- Perplexity (PPL):
  - Conceptual Definition: Perplexity is a common metric for evaluating language models. It quantifies how well a probability model predicts a sample; lower perplexity indicates a better model, meaning it is more confident and accurate in predicting the next word. It can be interpreted as the inverse probability of the test set, normalized by the number of words (a small numerical sketch follows this subsection).
  - Mathematical Formula: $ \mathrm{PPL}(W) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(w_i | w_1, \dots, w_{i-1})\right) $
  - Symbol Explanation:
    - $W = (w_1, \dots, w_N)$: A sequence of words.
    - $N$: The total number of words in the sequence.
    - $P(w_i | w_1, \dots, w_{i-1})$: The probability of the $i$-th word given the preceding $i-1$ words, as predicted by the language model.
    - $\log$: Natural logarithm.
- Longbench score:
  - Conceptual Definition: Longbench (Bai et al., 2024) is a benchmark designed to evaluate long-context understanding of language models across various tasks (e.g., multi-choice QA, summarization, few-shot tasks). The Longbench score is typically an aggregated average across these tasks, with higher scores indicating better long-context understanding. The exact aggregation is detailed in the Longbench paper and varies by task type; generally it is an average of accuracy or F1 scores across sub-tasks.
- InfiniteBench (En.MC) score:
  - Conceptual Definition: InfiniteBench (Zhang et al., 2024) extends long-context evaluation beyond 100K tokens. En.MC refers to the English multi-choice task within this benchmark; the score is the accuracy of correctly answering multi-choice questions that require understanding information spread across very long contexts. Higher is better.
  - Mathematical Formula (Accuracy): $ \mathrm{Accuracy} = \frac{\text{Number of correct predictions}}{\text{Total number of predictions}} $
  - Symbol Explanation: "Number of correct predictions" is the count of instances where the model's output matches the ground truth; "Total number of predictions" is the number of instances evaluated.
- Needle-in-a-Haystack (NIAH) Retrieval Accuracy:
  - Conceptual Definition: This metric assesses how well an LLM can retrieve a single, specific piece of information (the "needle") hidden within a large volume of irrelevant text (the "haystack"). Retrieval accuracy is the percentage of times the model correctly identifies and extracts the needle; higher accuracy indicates better long-context retrieval. The formula is the same as Accuracy above.
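As a small numerical illustration of the perplexity formula, the logits and targets below are random stand-ins for a real language model's next-token predictions over a test sequence:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(100, 32000)            # (num_tokens, vocab_size) model outputs (dummy values)
targets = torch.randint(0, 32000, (100,))   # ground-truth next-token ids (dummy values)
nll = F.cross_entropy(logits, targets)      # mean negative log-likelihood per token
ppl = torch.exp(nll)                        # perplexity = exp(mean NLL)
print(float(ppl))
```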
5.2.2. Text-to-Video Models
- CLIPSIM (CLIP Similarity):
  - Conceptual Definition: CLIPSIM measures text-video alignment. It uses the CLIP (Contrastive Language-Image Pre-training) model to embed the input text prompt and the generated video frames; the cosine similarity between the text embedding and the video embedding quantifies how well the video content matches the text description. Higher CLIPSIM indicates better alignment.
  - Mathematical Formula (conceptual, exact implementations vary): $ \mathrm{CLIPSIM} = \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{text}}, \mathrm{Embed}_{\text{video}}) $
  - Symbol Explanation: $\mathrm{Embed}_{\text{text}}$ is the CLIP embedding of the input text prompt; $\mathrm{Embed}_{\text{video}}$ is a single embedding representing the entire video, often derived by averaging CLIP embeddings of individual frames or using a video-specific CLIP model.
- CLIP-T (CLIP Temporal Consistency):
  - Conceptual Definition: CLIP-T measures the temporal consistency (smoothness) of the generated video in CLIP embedding space. It is the average cosine similarity between CLIP embeddings of adjacent frames; high CLIP-T suggests consecutive frames are semantically similar and transition smoothly.
  - Mathematical Formula (conceptual): $ \mathrm{CLIP\text{-}T} = \frac{1}{F-1} \sum_{i=1}^{F-1} \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{frame}_i}, \mathrm{Embed}_{\text{frame}_{i+1}}) $
  - Symbol Explanation: $F$ is the total number of frames in the video; $\mathrm{Embed}_{\text{frame}_i}$ is the CLIP embedding of the $i$-th frame.
- VQA-a (Video Quality Assessment, aesthetic quality):
  - Conceptual Definition: VQA-a assesses the aesthetic quality of the generated video, typically via a quality-prediction model trained on human aesthetic judgments that scores visual appeal, composition, lighting, and so on. Higher scores indicate more aesthetically pleasing videos.
- VQA-t (Video Quality Assessment, technical quality):
  - Conceptual Definition: VQA-t assesses the technical quality of the generated video, focusing on resolution, clarity, absence of artifacts, and overall visual fidelity, again determined by a specialized quality-prediction model. Higher scores mean better technical quality.
- Flow-score (FScore):
  - Conceptual Definition: Flow-score is a temporal-consistency metric focused on the realism and smoothness of motion. It typically computes optical flow between frames and evaluates its quality or consistency; a higher Flow-score indicates more realistic, consistent motion. The exact calculation depends on the underlying optical-flow and evaluation method.
5.2.3. Text-to-Image Models
- FID (Frechet Inception Distance):
  - Conceptual Definition: FID evaluates the quality of generated images by measuring the statistical distance between the feature distributions of generated and real images, using features extracted from an Inception-v3 network. Lower FID indicates higher quality and diversity, i.e., generated images more closely resemble real ones.
  - Mathematical Formula (conceptual): $ \mathrm{FID} = \|\mu_x - \mu_g\|^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\right) $
  - Symbol Explanation: $\mu_x$ and $\mu_g$ are the mean feature vectors of real and generated images; $\Sigma_x$ and $\Sigma_g$ are the corresponding covariance matrices; $\|\cdot\|^2$ is the squared L2 norm (Euclidean distance); $\mathrm{Tr}(\cdot)$ is the matrix trace.
- Clipscore (CLIP):
  - Conceptual Definition: Clipscore (Hessel et al., 2021) evaluates text-image alignment as the cosine similarity between the CLIP embeddings of a generated image and its corresponding text prompt. Higher Clipscore indicates that the generated image semantically matches the text description more closely.
  - Mathematical Formula (conceptual): $ \mathrm{Clipscore} = \mathrm{CosineSimilarity}(\mathrm{Embed}_{\text{image}}, \mathrm{Embed}_{\text{text}}) $
  - Symbol Explanation: $\mathrm{Embed}_{\text{image}}$ is the CLIP embedding of the generated image; $\mathrm{Embed}_{\text{text}}$ is the CLIP embedding of the input text prompt.
- ImageReward (IR):
  - Conceptual Definition: ImageReward (Xu et al., 2024) predicts human preference for text-to-image generations. It is trained on human feedback so that its score correlates with how much humans would prefer one generated image over another for a given prompt. Higher ImageReward scores indicate images more likely to be preferred by humans.
5.2.4. General Performance Metrics
- Speed (1/t):
  - Conceptual Definition: Measures inference speed as the inverse of latency: the total number of operations in a standard attention computation divided by the latency. A higher value indicates faster execution.
  - Mathematical Formula: $ \mathrm{Speed} = \frac{O(\mathrm{attn})}{t} $
  - Symbol Explanation: $O(\mathrm{attn})$ is the total number of operations required for a standard attention computation (fixed for a given input size); $t$ is the latency in seconds from the Q, K, V inputs to the attention output.
- Sparsity:
  - Conceptual Definition: Sparsity quantifies the proportion of computations skipped by the sparse attention method: the ratio of skipped matrix multiplications ($Q_iK_j^\top$ and $\widetilde{P}_{ij}V_j$) to the total number of such multiplications in full attention. Higher sparsity means greater computational savings.
  - Mathematical Formula: $ \mathrm{Sparsity} = \frac{\text{Number of skipped operations}}{\text{Total number of operations in full attention}} $
  - Symbol Explanation: "Number of skipped operations" counts the omitted $Q_iK_j^\top$ and $\widetilde{P}_{ij}V_j$ computations; "Total number of operations in full attention" counts all such computations when no sparsity is applied.
5.3. Baselines
The paper compares SpargeAttn against two representative dynamic sparse attention methods:
- MInference (Jiang et al., 2024): A block-sparse method that accelerates pre-filling for long-context LLMs, using a form of token compression or block pruning.
  - Configuration: Evaluated at roughly 30% and 70% sparsity levels. Sparsity is reported as the fraction of operations skipped, so MInference (0.5) means 50% of operations are skipped.
- FlexPrefill (Lai et al., 2025): A context-aware sparse attention mechanism also designed for efficient long-sequence inference, likely another token compression or dynamic masking approach.
  - Configuration: Evaluated with gamma values of 0.95 and 0.99; gamma controls the aggressiveness of pruning (different gamma values imply different pruning thresholds).

These baselines were chosen because they represent recent dynamic sparse attention approaches that do not require retraining and are relevant for accelerating LLMs, making them suitable comparators for SpargeAttn's claims of universality and training-freeness. The paper notes that methods applicable across different model types are limited, making these LLM-focused baselines the most appropriate comparison.
5.4. Implementation and Hyper-parameters
- Implementation: The method is implemented using
CUDA, indicating a focus on GPU efficiency and low-level optimization.
- Hyper-parameters: As described in Section 3.6, the hyper-parameters for each attention layer are determined through a two-stage grid search, with the Relative L1 distance used as the accuracy constraint. The specific L1 thresholds differ per model (Llama3.1, CogvideoX and Mochi, Stable-Diffusion3.5 and Flux, Open-Sora-Plan), reflecting different error tolerances for maintaining end-to-end quality; a lower threshold (as used for Open-Sora-Plan) implies a stricter accuracy requirement.
6. Results & Analysis
6.1. Core Results Analysis
The experiments aim to demonstrate SpargeAttn's universality, accuracy, and efficiency across diverse models compared to Full-Attention and existing sparse attention baselines.
The following are the results from Table 1 of the original paper:
Llama3.1 (128K):

| Attention (Sparsity) | Speed (1/t) ↑ | WikiText (Ppl.) ↓ | Longbench ↑ | InfiniteBench ↑ | NIAH ↑ |
|---|---|---|---|---|---|
| Full-Attention | 156.9 | 6.013 | 38.682 | 0.6594 | 0.907 |
| MInference (0.5) | 140.1 | 6.310 | 28.860 | 0.5152 | 0.832 |
| FlexPrefill (0.5) | 240.6 | 6.476 | 38.334 | 0.6460 | 0.858 |
| MInference (0.3) | 115.7 | 6.705 | 34.074 | 0.6532 | 0.870 |
| FlexPrefill (0.42) | 206.9 | 6.067 | 38.334 | 0.6581 | 0.878 |
| SpargeAttn (0.54) | 708.1 | 6.020 | 39.058 | 0.6638 | 0.909 |

CogvideoX (17K):

| Attention (Sparsity) | Speed (1/t) ↑ | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
|---|---|---|---|---|---|---|
| Full-Attention | 166.0 | 0.1819 | 0.9976 | 80.384 | 75.946 | 5.342 |
| MInference (0.5) | 264.6 | 0.1728 | 0.9959 | 70.486 | 62.410 | 2.808 |
| FlexPrefill (0.6) | 175.3 | 0.1523 | 0.9926 | 1.5171 | 4.5034 | 1.652 |
| MInference (0.3) | 196.9 | 0.1754 | 0.9964 | 77.326 | 63.525 | 3.742 |
| FlexPrefill (0.45) | 142.0 | 0.1564 | 0.9917 | 7.7259 | 8.8426 | 2.089 |
| SpargeAttn (0.46) | 507.9 | 0.1798 | 0.9974 | 78.276 | 74.846 | 5.030 |

Mochi (22K):

| Attention (Sparsity) | Speed (1/t) ↑ | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
|---|---|---|---|---|---|---|
| Full-Attention | 164.2 | 0.1725 | 0.9990 | 56.472 | 67.663 | 1.681 |
| MInference (0.5) | 202.4 | 0.1629 | 0.9891 | 6.668 | 50.839 | 0.653 |
| FlexPrefill (0.48) | 191.3 | 0.1667 | 0.9898 | 0.582 | 0.0043 | X |
| MInference (0.3) | 147.7 | 0.1682 | 0.9889 | 14.541 | 42.956 | 0.833 |
| FlexPrefill (0.4) | 171.7 | 0.1677 | 0.9909 | 2.941 | 0.7413 | X |
| SpargeAttn (0.47) | 582.4 | 0.1720 | 0.9990 | 54.179 | 67.219 | 1.807 |

Open-Sora-Plan (38K):

| Attention (Sparsity) | CLIPSIM ↑ | CLIP-T ↑ | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
|---|---|---|---|---|---|
| Full-Attention | 0.1650 | 0.9994 | 81.40 | 80.60 | 0.847 |
| SpargeAttn (0.34) | 0.1686 | 0.9985 | 77.59 | 76.91 | 0.839 |

Flux (4.5K):

| Attention (Sparsity) | Speed (1/t) ↑ | FID ↓ | CLIP ↑ | IR ↑ |
|---|---|---|---|---|
| Full-Attention | 158.2 | 166.103 | 31.217 | 0.8701 |
| MInference (0.5) | 151.8 | 180.650 | 30.235 | 0.4084 |
| FlexPrefill (0.48) | 47.7 | 443.928 | 18.3377 | -2.2657 |
| MInference (0.3) | 118.9 | 170.221 | 31.001 | 0.7701 |
| FlexPrefill (0.41) | 40.9 | 405.043 | 19.5591 | -2.2362 |
| SpargeAttn (0.38) | 280.3 | 163.982 | 31.448 | 0.9207 |

Stable-Diffusion3.5 (4.5K):

| Attention (Sparsity) | Speed (1/t) ↑ | FID ↓ | CLIP ↑ | IR ↑ |
|---|---|---|---|---|
| Full-Attention | 164.2 | 166.101 | 32.007 | 0.9699 |
| MInference (0.5) | 186.4 | 348.930 | 18.3024 | -2.2678 |
| FlexPrefill (0.37) | 23.1 | 350.497 | 18.447 | -2.2774 |
| MInference (0.3) | 150.3 | 337.530 | 18.099 | - |
| FlexPrefill (0.35) | 22.7 | 348.612 | 18.147 | -2.2647 |
| SpargeAttn (0.31) | 293.0 | 166.193 | 32.114 | 0.9727 |
6.1.1. End-to-End Metrics and Speed (Table 1)
Key Observations:
-
Llama3.1 (128K sequence length):
SpargeAttn(0.54 sparsity) achieves aSpeed (1/t)of 708.1, which is ~4.5x faster thanFull-Attention(156.9).- Crucially,
SpargeAttnmaintains or even slightly improvesend-to-end metrics:NIAHaccuracy (0.909 vs 0.907),WikiText PPL(6.020 vs 6.013, lower is better),Longbench(39.058 vs 38.682), andInfiniteBench(0.6638 vs 0.6594). This demonstrates excellentaccuracy preservationwhile providing substantial speedup. - Baselines (
MInference,FlexPrefill) showspeedup, but consistently come withmetric degradation. For instance,MInference(0.5) has aNIAHof 0.832 (vs 0.907 for Full-Attention) andFlexPrefill(0.5) has 0.858. Even at lower sparsity (0.3),MInference(0.3) still lagsSpargeAttninNIAH(0.870) andSpeed(115.7 vs 708.1).
-
CogvideoX (17K sequence length):
SpargeAttn(0.46 sparsity) reaches aSpeedof 507.9, approximately 3x faster thanFull-Attention(166.0).- It largely preserves video quality metrics:
CLIPSIM(0.1798 vs 0.1819),CLIP-T(0.9974 vs 0.9976),VQA-a(78.276 vs 80.384),VQA-t(74.846 vs 75.946), andFScore(5.030 vs 5.342). The slight differences are well within acceptable bounds for visual generation. - Baselines (
MInference,FlexPrefill) show severemetric degradation.FlexPrefill(0.6) particularly performs poorly onVQA-a(1.5171) andVQA-t(4.5034), indicating poor video quality.
- Mochi (22K sequence length):
  - SpargeAttn (0.47 sparsity) achieves a Speed of 582.4, roughly 3.5x faster than Full-Attention (164.2).
  - It maintains video quality very well: CLIPSIM (0.1720 vs 0.1725), CLIP-T (0.9990 vs 0.9990), VQA-a (54.179 vs 56.472), VQA-t (67.219 vs 67.663), and FScore (1.807 vs 1.681).
  - Baselines again show significant metric degradation, with FlexPrefill (0.48) having extremely low VQA-a (0.582) and VQA-t (0.0043) and some FScore values that could not be computed (inability to generate results or an invalid metric).
- Flux (4.5K sequence length):
  - SpargeAttn (0.38 sparsity) provides a Speed of 280.3, about 1.77x faster than Full-Attention (158.2).
  - It nearly matches Full-Attention in FID (163.982 vs 166.103, lower is better), CLIP (31.448 vs 31.217), and IR (0.9207 vs 0.8701).
  - Baselines MInference and FlexPrefill perform much worse on image generation metrics, with much higher FID (e.g., 443.928 for FlexPrefill) and lower CLIP/IR scores (e.g., IR of -2.2657 for FlexPrefill).
- Stable-Diffusion3.5 (4.5K sequence length):
  - SpargeAttn (0.31 sparsity) reaches a Speed of 293.0, roughly 1.78x faster than Full-Attention (164.2).
  - It maintains FID (166.193 vs 166.101), CLIP (32.114 vs 32.007), and IR (0.9727 vs 0.9699) comparable to Full-Attention.
  - Similar to Flux, the baselines exhibit severe metric degradation.
- Open-Sora-Plan (38K sequence length):
  - For Open-Sora-Plan, only end-to-end latency and video quality metrics are reported. SpargeAttn (0.34 sparsity) has a latency of 393 s compared to Full-Attention's 629 s, a significant speedup (~1.6x) while maintaining competitive quality metrics.

Overall Summary: SpargeAttn consistently demonstrates superior performance across all evaluated models and modalities. It achieves substantial speedups (ranging from ~1.6x to ~4.5x) over Full-Attention while robustly retaining end-to-end model quality; the arithmetic behind these ratios is spelled out in the brief sketch below. In contrast, existing sparse attention baselines (MInference, FlexPrefill) frequently cause significant degradation in model quality metrics, especially for image and video generation, which makes them less suitable for practical applications where quality is paramount.
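Since the Speed column reports throughput (higher is better), each speedup quoted above is simply the ratio of the SpargeAttn and Full-Attention values. Below is a minimal plain-Python sketch of that arithmetic; the speedup helper and the dictionary layout are our own, and the numbers are copied from Table 1.

```python
# Speed values copied from Table 1 (throughput, higher is better):
# (SpargeAttn, Full-Attention) per model; speedup() is our own helper name.
def speedup(sparge: float, full: float) -> float:
    return sparge / full

cases = {
    "Llama3.1 (128K)":            (708.1, 156.9),
    "CogvideoX (17K)":            (507.9, 166.0),
    "Mochi (22K)":                (582.4, 164.2),
    "Flux (4.5K)":                (280.3, 158.2),
    "Stable-Diffusion3.5 (4.5K)": (293.0, 164.2),
}

for model, (sparge, full) in cases.items():
    print(f"{model}: {speedup(sparge, full):.2f}x")
# ~4.51x, ~3.06x, ~3.55x, ~1.77x, ~1.78x
```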
6.1.2. Visible Examples (Figures 6, 7, 8, 9, 12, 13)
The paper provides visual comparisons that reinforce the quantitative results:
The following figure (Figure 6 from the original paper) shows visible examples on CogvideoX using SpargeAttention:
This image is a schematic comparison of results generated with SpargeAttention, with Full Attention on the left and SpargeAttention on the right. It covers different generation results for traffic, natural scenery, and model scenes, illustrating SpargeAttention's behavior across a variety of scenarios.
Figure 6. Visible examples on CogvideoX using SpargeAttention.
This figure shows that SpargeAttn generates videos of comparable visual quality to Full Attention for various scenes (e.g., traffic, nature, models), demonstrating that the speedup does not compromise visual fidelity.
The following figure (Figure 7 from the original paper) shows comparison examples on Flux and Stable-Diffusion3.5:
This image is a comparison diagram showing results on Flux and Stable-Diffusion3.5 using Full Attention, SpargeAttention, MInference, and FlexPrefill. The outputs of the different methods differ noticeably in visual quality; SpargeAttention preserves detail while effectively reducing computational cost.
Figure 7. Comparison examples on Flux and Stable-Diffusion3.5. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.38, 0.3, and 0.4 on Flux and 0.31, 0.3, and 0.35 on Stable-Diffusion3.5.
This figure visually confirms the metric degradation of baselines. Images generated by MInference and FlexPrefill (e.g., for "a teddy bear sitting on a swing") show clear artifacts or lack of detail compared to Full-Attention and SpargeAttn. SpargeAttn outputs are visually indistinguishable from Full-Attention, indicating effective accuracy preservation.
The following figure (Figure 8 from the original paper) shows comparison examples on Mochi:
This image is a chart showing comparison examples of four attention mechanisms: Full Attention, SpargeAttention, MInference, and FlexPrefill. SpargeAttention has a sparsity of 0.47, while the other methods use 0.3 and 0.4 respectively.
Figure 8. Comparison examples on Mochi. The sparsity of SpargeAttn, MInference and FlexPrefill is 0.47, 0.3, and 0.4.
Similar to CogvideoX, SpargeAttn on Mochi produces videos (a cute fluffy monster, cartoon style) with quality comparable to Full-Attention, while baselines show noticeable visual degradation.
The following figure (Figure 9 from the original paper) shows a Needle-in-a-Haystack comparison example on Llama3.1:
This image is a heatmap comparing the performance of different attention mechanisms, showing depth (%) and scores for Full Attention, SpargeAttention, MInference, and FlexPrefill under different token limits. SpargeAttention achieves an overall score of 0.909, outperforming the other methods.
Figure 9. A Needle-in-a-Haystack comparison example on Llama3.1. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.5, 0.5, and 0.54.
This figure demonstrates SpargeAttn's superior retrieval accuracy in long-context LLM tasks. While Full-Attention correctly identifies the "needle," MInference and FlexPrefill fail to retrieve the correct information, whereas SpargeAttn succeeds, often even outperforming Full-Attention slightly, as seen in Table 1.
The following figure (Figure 12 from the original paper) shows visible examples on Open-sora-Plan:
This image is a comparison of full attention and sparse attention in two scenes. The first row shows a panda playing a guitar along with part of its background, and the second row shows a model ship. In each scene, the image produced by full attention is on the left and the one produced by sparse attention is on the right, contrasting the two attention mechanisms.
Figure 12. Visible examples on Open-sora-Plan.
Here, SpargeAttn generated videos ("Panda playing guitar," "model ship in an old bottle") are visually consistent with those from Full-Attention, reinforcing the claim of no quality loss.
The following figure (Figure 13 from the original paper) shows a visible comparison example on Mochi:
This image is a comparison of the outputs of different attention mechanisms, including Full Attention, SpargeAttention, MInference, and FlexPrefill, across several generation scenes, illustrating SpargeAttention's efficiency and quality advantages.
Figure 13. Comparison examples on Mochi. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.47, 0.3, and 0.4.
This figure likely provides more detailed visual examples from Mochi, further corroborating the findings that SpargeAttn retains high visual quality while baselines degrade.
6.1.3. Kernel Speed Comparison (Figure 10)
The following figure (Figure 10 from the original paper) shows kernel speed comparison under varying sparsity:
This image is a chart showing a kernel speed comparison under different sparsity levels. It plots the speed of methods including SpargeAttn+FA2, Sage, and Sage2, with sparsity on the horizontal axis and speed (1/t) on the vertical axis.
Figure 10. Kernel speed comparison under varying sparsity. Input tensors have a sequence length of 22K and a head dimension of 128. SpargeAttn means deploying our method on FlashAttention2.
This figure directly compares kernel-level efficiency. SpargeAttn+FA2 (SpargeAttn deployed on FlashAttention2) shows the highest speed (1/t) across the sparsity range. It significantly outperforms Sage and Sage2, demonstrating that the sparse mechanism translates into kernel-level speedups beyond what quantization alone offers, and the speed advantage grows with increasing sparsity.
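The actual kernel is fused CUDA and is not reproduced here. The following PyTorch sketch is only our own illustration of the block-skipping idea such a kernel exploits: given a per-block mask, a blocked attention loop with online softmax never computes the QK^T and PV products of masked-out blocks, so the saved work grows roughly in proportion to the mask's sparsity. With an all-True mask it reduces to ordinary attention, which the final check verifies.

```python
import torch

def block_sparse_attention(q, k, v, block_mask, block=64):
    """Single-head block-sparse attention with online softmax.
    block_mask[i, j] == False means query block i never attends to key block j,
    so neither Q_i K_j^T nor P_ij V_j is computed. Illustrative loop only --
    a fused CUDA kernel (FlashAttention-style) is what delivers real speed.
    Assumes every query block keeps at least one key block."""
    n, d = q.shape
    scale = d ** -0.5
    out = torch.empty_like(q)
    n_blocks = (n + block - 1) // block
    for i in range(n_blocks):
        qi = q[i * block:(i + 1) * block] * scale
        m = torch.full((qi.shape[0], 1), float("-inf"))   # running row max
        l = torch.zeros(qi.shape[0], 1)                   # running softmax denominator
        acc = torch.zeros(qi.shape[0], d)                 # running numerator (P @ V)
        for j in range(n_blocks):
            if not block_mask[i, j]:
                continue                                  # skipped block: no work at all
            kj = k[j * block:(j + 1) * block]
            vj = v[j * block:(j + 1) * block]
            s = qi @ kj.T
            m_new = torch.maximum(m, s.max(dim=-1, keepdim=True).values)
            p = torch.exp(s - m_new)
            corr = torch.exp(m - m_new)
            l = l * corr + p.sum(dim=-1, keepdim=True)
            acc = acc * corr + p @ vj
            m = m_new
        out[i * block:(i + 1) * block] = acc / l
    return out

# With an all-True mask this reduces to ordinary softmax attention.
n, d, blk = 256, 64, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
dense_mask = torch.ones(n // blk, n // blk, dtype=torch.bool)
ref = torch.softmax(q @ k.T * d ** -0.5, dim=-1) @ v
print((block_sparse_attention(q, k, v, dense_mask, blk) - ref).abs().max())  # ~1e-6
```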
6.2. Data Presentation (Tables)
The following are the results from Table 2 of the original paper:
| Model | GPU | Original | SageAttn | SpargeAttn |
|---|---|---|---|---|
| CogvideoX | RTX4090 | 87 s | 68 s | 53 s |
| Mochi | L40 | 1897 s | 1544 s | 1037 s |
| Llama3.1 (24K) | RTX4090 | 4.01 s | 3.53 s | 2.6 s |
| Llama3.1 (128K) | L40 | 52 s | 42 s | 29.98 s |
The following are the results from Table 3 of the original paper:
| Sequence Len | Prediction (ms) | Full Attention (ms) | Overhead |
|---|---|---|---|
| 8k | 0.251 | 6.649 | 3.78% |
| 16k | 0.487 | 26.83 | 1.82% |
| 32k | 0.972 | 106.68 | 0.911% |
| 64k | 2.599 | 424.24 | 0.612% |
| 128k | 8.764 | 1696.2 | 0.516% |
The following are the results from Table 4 of the original paper:
| Method | Sim-q ↑ | Sim-k ↑ | L1 ↓ | Sparsity ↑ |
|---|---|---|---|---|
| Random | 0.321 | 0.019 | 0.0414 | 0.048 |
| Rowmajor | 0.551 | 0.390 | 0.0307 | 0.363 |
| Timemajor | 0.514 | 0.367 | 0.0342 | 0.338 |
| HilbertCurve | 0.572 | 0.479 | 0.0389 | 0.392 |
The following are the results from Table 5 of the original paper:
| Method | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
|---|---|---|---|
| W/o. self-sim Judge | 34.664 | 44.722 | 1.138 |
| With self-sim Judge | 54.179 | 67.219 | 1.807 |
The following are the results from Table 6 of the original paper:
| Strategy | only Mg | only Mpv | Mg +Mpv |
|---|---|---|---|
| Sparsity | 51.2% | 27.7% | 54% |
The following are the results from Table 7 of the original paper:
| Sequence Len | 8K | 16K | 24K | 48K | 128K |
|---|---|---|---|---|---|
| Sparsity | 6.8% | 26.4% | 35.7% | 49.8% | 54% |
The following are the results from Table 8 of the original paper:
| Method | Detailed Description |
|---|---|
| Random | Random permutation of tokens, the order is recorded to perform inverse permutation. |
| Rowmajor | Permutation following row-major order. Tokens are continuous along the W dimension. |
| Columnmajor | Permutation following column-major order. Tokens are continuous along the H dimension. |
| Timemajor | Permutation following time-major order. Tokens are continuous along the T dimension. |
| HilbertCurve | Permutation following a Hilbert curve. |
The following are the results from Table 9 of the original paper:
| Method | Sim-q↑ (CogvideoX) | Sim-q↑ (Mochi) | Sim-k↑ (CogvideoX) | Sim-k↑ (Mochi) | Precision(L1)↓ (CogvideoX) | Precision(L1)↓ (Mochi) | Sparsity↑ (CogvideoX) | Sparsity↑ (Mochi) |
|---|---|---|---|---|---|---|---|---|
| Random | 0.502 | 0.321 | 0.025 | 0.019 | 0.0348 | 0.0414 | 0.027 | 0.048 |
| Rowmajor | 0.676 | 0.551 | 0.435 | 0.390 | 0.0265 | 0.0307 | 0.242 | 0.363 |
| Columnmajor | 0.633 | 0.547 | 0.335 | 0.394 | 0.0274 | 0.0342 | 0.198 | 0.366 |
| Timemajor | 0.692 | 0.514 | 0.479 | 0.367 | 0.0294 | 0.0342 | 0.238 | 0.338 |
| HilbertCurve | 0.709 | 0.572 | 0.523 | 0.479 | 0.0323 | 0.0389 | 0.265 | 0.392 |
The following are the results from Table 10 of the original paper:
| Metric | w/ judge (CogvideoX) | w/ judge (Mochi) | w/o judge (CogvideoX) | w/o judge (Mochi) | filter w/ judge (CogvideoX) | filter w/ judge (Mochi) | filter w/o judge (CogvideoX) | filter w/o judge (Mochi) |
|---|---|---|---|---|---|---|---|---|
| L1 error↓ | 0.0316 | 0.0343 | 0.0325 | 0.0365 | 0.0843 | 0.0555 | 0.214 | 0.154 |
| Sparsity ↑ | 0.199 | 0.301 | 0.203 | 0.305 | 0.242 | 0.371 | 0.275 | 0.392 |
The following are the results from Table 11 of the original paper:
| Model (seq_len) | Attention (Sparsity) | Speed (TOPS)↑ | NIAH ↑ |
|---|---|---|---|
| Llama3.1 (24K) | Full-Attention | 156.9 | 0.838 |
|  | Minference (0.5) | 122.5 | 0.635 |
|  | FlexPrefill (0.6) | 179.6 | 0.776 |
|  | Minference (0.3) | 102.3 | 0.652 |
|  | FlexPrefill (0.3) | 117.6 | 0.797 |
|  | SpargeAttn (0.36) | 443.6 | 0.863 |
6.1.4. SpargeAttn Enhances LLM Performance (Figures 9 and 11, Tables 1 and 11)
The following figure (Figure 11 from the original paper) shows a comparison example on Llama3.1 (24K):
This image is a heatmap comparing the performance of different attention mechanisms in four panels: Full Attention, SpargeAttention, MInference, and FlexPrefill. Each panel is plotted over Token Limit and Depth and reports an overall score; SpargeAttention has the highest overall score at 0.863.
Figure 11. A Needle-in-a-Haystack comparison example on Llama3.1. The sparsity of SpargeAttn, MInference, and FlexPrefill is 0.5, 0.5, and 0.54.
From Table 1 and Table 11, and visually in Figures 9 and 11, SpargeAttn consistently demonstrates that it not only matches but can sometimes slightly enhance LLM performance in long-context tasks (e.g., NIAH score for Llama3.1 (128K): SpargeAttn 0.909 vs Full-Attention 0.907). This counter-intuitive improvement, where sparsity leads to better accuracy, suggests that sparse attention helps the LLM to focus on more relevant information by effectively pruning irrelevant context, leading to a more robust understanding and retrieval.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Overhead of Sparse Block Prediction (Table 3)
The following are the results from Table 3 of the original paper:
| Sequence Len | Prediction (ms) | Full Attention (ms) | Overhead |
|---|---|---|---|
| 8k | 0.251 | 6.649 | 3.78% |
| 16k | 0.487 | 26.83 | 1.82% |
| 32k | 0.972 | 106.68 | 0.911% |
| 64k | 2.599 | 424.24 | 0.612% |
| 128k | 8.764 | 1696.2 | 0.516% |
This table shows the overhead incurred by the sparse block prediction stage (the first stage of SpargeAttn's filter). The prediction time (in ms) is very small compared to the full attention latency. More importantly, the overhead percentage decreases significantly as the sequence length increases (from 3.78% for 8K to 0.516% for 128K). This indicates that the prediction mechanism is highly efficient and scales well with long contexts, making SpargeAttn particularly advantageous for very large sequence lengths where the savings from sparse attention far outweigh the prediction cost.
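As a sanity check, the Overhead column is simply Prediction divided by Full Attention. The tiny sketch below (numbers copied from Table 3, arithmetic ours) reproduces it and makes the scaling argument explicit: the full-attention latency grows roughly quadratically with sequence length, while the prediction cost grows much more slowly, so the relative overhead keeps shrinking.

```python
# (sequence length in K tokens, prediction ms, full-attention ms), copied from Table 3
rows = [(8, 0.251, 6.649), (16, 0.487, 26.83), (32, 0.972, 106.68),
        (64, 2.599, 424.24), (128, 8.764, 1696.2)]

for n_k, pred, full in rows:
    print(f"{n_k}k: overhead = {100 * pred / full:.2f}%")
# 3.78%, 1.82%, 0.91%, 0.61%, 0.52% -- matching the table up to rounding.
# Doubling the sequence length roughly quadruples the full-attention time
# (26.83 -> 106.68 -> 424.24 -> 1696.2) but increases the prediction time
# far less, which is why the ratio keeps falling.
```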
6.3.2. End-to-End Speedup (Table 2)
The following are the results from Table 2 of the original paper:
| Model | GPU | Original | SageAttn | SpargeAttn |
|---|---|---|---|---|
| CogvideoX | RTX4090 | 87 s | 68 s | 53 s |
| Mochi | L40 | 1897 s | 1544 s | 1037 s |
| Llama3.1 (24K) | RTX4090 | 4.01 s | 3.53 s | 2.6 s |
| Llama3.1 (128K) | L40 | 52 s | 42 s | 29.98 s |
This table presents the end-to-end latency for generative tasks.
- Original refers to Full-Attention.
- SageAttn is the quantized version (Zhang et al., 2025a;d;g;b;e).
- SpargeAttn integrates the sparse mechanism with SageAttn.

For Mochi, SpargeAttn reduces the latency from 1897 s (Original) to 1037 s, a 1.83x speedup (1897/1037 ≈ 1.83). Similarly, for Llama3.1 (128K), SpargeAttn reduces the latency from 52 s to 29.98 s, roughly a 1.73x speedup. These results are significant because they represent real-world inference speedups for the entire generative process, not just the attention kernel. The benefits of quantization (comparing Original to SageAttn) are further compounded by sparsity (comparing SageAttn to SpargeAttn), confirming that these two acceleration techniques are orthogonal.
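The speedup factors quoted above follow directly from the latencies in Table 2. A short sketch (values copied from the table, ratios ours) also makes the compounding of quantization and sparsity explicit:

```python
# end-to-end latencies in seconds, copied from Table 2: (Original, SageAttn, SpargeAttn)
latency = {
    "CogvideoX (RTX4090)":    (87.0, 68.0, 53.0),
    "Mochi (L40)":            (1897.0, 1544.0, 1037.0),
    "Llama3.1 24K (RTX4090)": (4.01, 3.53, 2.6),
    "Llama3.1 128K (L40)":    (52.0, 42.0, 29.98),
}

for name, (orig, sage, sparge) in latency.items():
    print(f"{name}: quantization alone {orig / sage:.2f}x, "
          f"quantization + sparsity {orig / sparge:.2f}x")
# e.g. Mochi: 1.23x -> 1.83x; Llama3.1 128K: 1.24x -> 1.73x
```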
6.3.3. Effect of Hilbert Curve Permutation (Table 4 and Table 9)
The following are the results from Table 4 of the original paper:
| Method | Sim-q ↑ | Sim-k ↑ | L1 ↓ | Sparsity ↑ |
|---|---|---|---|---|
| Random | 0.321 | 0.019 | 0.0414 | 0.048 |
| Rowmajor | 0.551 | 0.390 | 0.0307 | 0.363 |
| Timemajor | 0.514 | 0.367 | 0.0342 | 0.338 |
| HilbertCurve | 0.572 | 0.479 | 0.0389 | 0.392 |
The following are the results from Table 9 of the original paper:
| Method | Sim-q↑ (CogvideoX) | Sim-q↑ (Mochi) | Sim-k↑ (CogvideoX) | Sim-k↑ (Mochi) | Precision(L1)↓ (CogvideoX) | Precision(L1)↓ (Mochi) | Sparsity↑ (CogvideoX) | Sparsity↑ (Mochi) |
|---|---|---|---|---|---|---|---|---|
| Random | 0.502 | 0.321 | 0.025 | 0.019 | 0.0348 | 0.0414 | 0.027 | 0.048 |
| Rowmajor | 0.676 | 0.551 | 0.435 | 0.390 | 0.0265 | 0.0307 | 0.242 | 0.363 |
| Columnmajor | 0.633 | 0.547 | 0.335 | 0.394 | 0.0274 | 0.0342 | 0.198 | 0.366 |
| Timemajor | 0.692 | 0.514 | 0.479 | 0.367 | 0.0294 | 0.0342 | 0.238 | 0.338 |
| HilbertCurve | 0.709 | 0.572 | 0.523 | 0.479 | 0.0323 | 0.0389 | 0.265 | 0.392 |
These tables show the impact of different token permutation methods on block self-similarity (Sim-q, Sim-k), L1 error, and Sparsity for Mochi and CogvideoX.
- Self-Similarity: HilbertCurve consistently yields the highest Sim-q and Sim-k (e.g., for Mochi: Sim-q 0.572 and Sim-k 0.479 vs Rowmajor's 0.551 and 0.390). This confirms its effectiveness in preserving locality and increasing similarity among adjacent tokens within blocks.
- Sparsity: Higher self-similarity directly translates to higher sparsity. HilbertCurve achieves the highest sparsity (e.g., for Mochi: 0.392 vs Rowmajor's 0.363).
- Accuracy: While HilbertCurve shows slightly higher L1 error than Rowmajor (e.g., 0.0389 vs 0.0307 for Mochi), the error remains within the specified accuracy thresholds. This demonstrates that the HilbertCurve permutation is a beneficial technique for visual models, significantly increasing sparsity without critically compromising accuracy. Random permutation, as expected, leads to very low self-similarity and sparsity. A small sketch of how such a permutation is applied and then undone follows below.
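The paper's kernel-level integration of this permutation is not reproduced here; the PyTorch sketch below (ours) only illustrates the mechanism and why it is safe: reorder the flattened (T, H, W) tokens with some locality-preserving ordering, run attention on the reordered Q/K/V, and apply the inverse permutation to the output. Because rows and columns are permuted consistently, softmax(QK^T)V yields exactly the same per-token result, so only the block structure, and hence the achievable block sparsity, changes. The toy_order helper is a hypothetical stand-in (here just a random shuffle) for a real Hilbert-curve ordering; any permutation preserves the output, and the Hilbert curve is simply the choice that maximizes within-block similarity.

```python
import torch

def toy_order(t, h, w):
    """Hypothetical stand-in for a locality-preserving ordering of the (T, H, W)
    token grid (here just a random shuffle); a real implementation would return
    the Hilbert-curve visit order instead."""
    return torch.randperm(t * h * w)

def permuted_attention(q, k, v, order, attn_fn):
    """Run attention on reordered tokens, then undo the reordering on the output."""
    inverse = torch.empty_like(order)
    inverse[order] = torch.arange(order.numel())
    out = attn_fn(q[order], k[order], v[order])
    return out[inverse]

def dense_attention(q, k, v):
    d = q.shape[-1]
    return torch.softmax(q @ k.T * d ** -0.5, dim=-1) @ v

t, h, w, d = 2, 4, 4, 16
q, k, v = (torch.randn(t * h * w, d) for _ in range(3))
out = permuted_attention(q, k, v, toy_order(t, h, w), dense_attention)
print((out - dense_attention(q, k, v)).abs().max())  # ≈0: the permutation is output-preserving
```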
6.3.4. Ablation of Self-Similarity Judge (Table 5 and Table 10)
The following are the results from Table 5 of the original paper:
| Method | VQA-a ↑ | VQA-t ↑ | FScore ↑ |
|---|---|---|---|
| W/o. self-sim Judge | 34.664 | 44.722 | 1.138 |
| With self-sim Judge | 54.179 | 67.219 | 1.807 |
The following are the results from Table 10 of the original paper:
| Metric | w/ judge (CogvideoX) | w/ judge (Mochi) | w/o judge (CogvideoX) | w/o judge (Mochi) | filter w/ judge (CogvideoX) | filter w/ judge (Mochi) | filter w/o judge (CogvideoX) | filter w/o judge (Mochi) |
|---|---|---|---|---|---|---|---|---|
| L1 error↓ | 0.0316 | 0.0343 | 0.0325 | 0.0365 | 0.0843 | 0.0555 | 0.214 | 0.154 |
| Sparsity ↑ | 0.199 | 0.301 | 0.203 | 0.305 | 0.242 | 0.371 | 0.275 | 0.392 |
These tables highlight the critical role of the self-similarity judge in SpargeAttn.
- Comparing "W/o. self-sim Judge" with "With self-sim Judge" (Table 5, on Mochi) shows a dramatic drop in VQA-a, VQA-t, and FScore when the judge is absent. For instance, VQA-a drops from 54.179 to 34.664. This confirms that compressing and skipping computations for non-self-similar blocks without the judge loses critical information and causes severe metric degradation.
- Table 10 provides a more detailed ablation. "w/ judge" refers to the full SpargeAttn with the self-similarity judge enabled; "filter w/ judge" means using only the TopCdf filter while keeping the judge, so that non-self-similar blocks are still fully computed; "filter w/o judge" means applying TopCdf blindly to all blocks.
  - The L1 error is significantly higher in "filter w/o judge" (e.g., 0.214 for CogvideoX) than in "w/ judge" (0.0316). This reinforces that simply applying a TopCdf filter without first identifying and preserving non-self-similar blocks leads to unacceptable accuracy loss.
  - While the self-similarity judge slightly reduces sparsity (e.g., Mochi 0.301 vs 0.305 without the judge, or CogvideoX 0.199 vs 0.203), this minor reduction is a necessary trade-off to ensure end-to-end accuracy. It validates the design choice of selectively compressing blocks based on their self-similarity; a sketch of one plausible self-similarity score follows below.
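The paper's exact self-similarity score and threshold are not reproduced here. The sketch below (ours) shows one plausible formulation of the quantity being ablated: score each block by the mean cosine similarity of its tokens to the block mean, and let an arbitrary threshold tau decide whether the block is similar enough to be compressed for the coarse attention-map prediction; blocks that fail the test would always be computed in full.

```python
import torch

def block_self_similarity(x, block=64):
    """Mean cosine similarity between each token and its block's mean token.
    x: (N, d) queries or keys; returns one score per block -- one plausible way
    to quantify how well a block is represented by a single pooled vector."""
    n, d = x.shape
    n_blocks = n // block
    blocks = x[: n_blocks * block].reshape(n_blocks, block, d)
    centroid = blocks.mean(dim=1, keepdim=True)                       # (n_blocks, 1, d)
    sim = torch.cosine_similarity(blocks, centroid.expand_as(blocks), dim=-1)
    return sim.mean(dim=1)                                            # (n_blocks,)

def judge_compressible(q, k, tau=0.8, block=64):
    """Hypothetical judge: a block may be compressed (and take part in the coarse
    attention-map prediction) only if both its Q and K tokens are sufficiently
    self-similar; otherwise the block is always computed exactly. tau is an
    arbitrary illustrative threshold, not the paper's calibrated value."""
    return (block_self_similarity(q, block) > tau) & (block_self_similarity(k, block) > tau)

q, k = torch.randn(1024, 128), torch.randn(1024, 128)
print(judge_compressible(q, k).float().mean().item())  # fraction of compressible blocks
```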
6.3.5. Analysis of Sparsity from Mg and Mpv (Table 6)
The following are the results from Table 6 of the original paper:
| Strategy | only Mg | only Mpv | Mg +Mpv |
|---|---|---|---|
| Sparsity | 51.2% | 27.7% | 54% |
This table breaks down the contribution of each masking stage to the total sparsity on Llama3.1 (128K seq_len).
- The first stage mask (Mg, from selective token compression) alone achieves a sparsity of 51.2%.
- The second stage mask (Mpv, from the sparse warp online softmax) alone achieves a sparsity of 27.7%.
- When both stages are combined (Mg + Mpv), the total sparsity reaches 54%.

This shows that both stages contribute to the overall sparsity, with the first stage (Mg) being the dominant contributor. The second stage (Mpv) provides an additional 2.8% sparsity (54% - 51.2%) on top of it, which is achieved without extra overhead, demonstrating its efficiency. The toy sketch below illustrates how the two skip masks combine.
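For illustration only, both stages can be viewed as boolean skip decisions over the same grid of block computations, so the combined sparsity is the fraction of block pairs skipped by either mask. The toy sketch below (ours, with made-up random masks matched to the reported per-stage rates) shows that two statistically independent masks at 51.2% and 27.7% would combine to roughly 65%; that the measured combination is only 54% indicates the two filters overlap heavily, consistent with Mpv adding just 2.8 points on top of Mg.

```python
import torch

def combined_sparsity(skip_a: torch.Tensor, skip_b: torch.Tensor) -> float:
    """Fraction of (query block, key block) computations skipped when two boolean
    skip masks are combined: work is saved whenever either mask fires."""
    return (skip_a | skip_b).float().mean().item()

torch.manual_seed(0)
n = 512  # toy number of blocks per side
skip_g  = torch.rand(n, n) < 0.512   # made-up mask at the reported Mg rate
skip_pv = torch.rand(n, n) < 0.277   # made-up mask at the reported Mpv rate

print(f"Mg alone:  {skip_g.float().mean().item():.3f}")
print(f"Mpv alone: {skip_pv.float().mean().item():.3f}")
print(f"combined (if independent): {combined_sparsity(skip_g, skip_pv):.3f}")  # ~0.65
```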
6.3.6. Sparsity Increases with Sequence Length (Table 7)
The following are the results from Table 7 of the original paper:
| Sequence Len | 8K | 16K | 24K | 48K | 128K |
|---|---|---|---|---|---|
| Sparsity | 6.8% | 26.4% | 35.7% | 49.8% | 54% |
This table, based on Llama3.1, reveals a crucial trend: sparsity increases significantly with sequence length under a constant accuracy bound. For an 8K sequence, sparsity is only 6.8%, but it rises to 54% for a 128K sequence. This implies that the longer the context, the more irrelevant information can be pruned, and thus the higher the potential speedup that SpargeAttn can achieve. This makes SpargeAttn particularly effective and scalable for long-context models.
6.3.7. Sparsity Analysis Over Diffusion Model (Figures 14, 15, 16, 17)
The paper includes an in-depth sparsity analysis for CogvideoX, a diffusion model, providing insights into where sparsity manifests.
The following figure (Figure 14 from the original paper) shows layer-wise sparsity of CogvideoX:
This image is a chart showing a layer-wise sparsity analysis of the CogvideoX model. The x-axis is the layer index and the y-axis is the mean sparsity; a red dashed line marks the global average of 0.27. Sparsity varies considerably across layers, with layer L5 being the sparsest at 0.60.
Figure 14. Layer-wise sparsity of CogvideoX.
This figure shows that sparsity varies considerably across different layers of the model. Some layers (e.g., L5) exhibit very high sparsity (~0.6), while others are much lower (e.g., L0-L1 are around 0.15). This heterogeneity confirms the need to set hyper-parameters (and thus sparsity levels) independently for each attention layer to maximize efficiency while preserving accuracy.
The following figure (Figure 15 from the original paper) shows timestep-wise sparsity of CogvideoX:
This image is a chart showing how the mean sparsity changes as the timestep increases. Sparsity rises gradually from about 0.2 to nearly 0.3, indicating that attention becomes sparser as the diffusion timesteps progress.
Figure 15. Timestep-wise sparsity of CogvideoX.
For diffusion models, the sparsity increases with sample timesteps. Early timesteps (e.g., around 0-20) show lower sparsity (around 0.2), while later timesteps (e.g., after 60) show higher sparsity (approaching 0.3). This suggests that attention is denser and more critical in the initial stages of the diffusion process (when generating coarse structures) and becomes sparser as the model refines details.
The following figure (Figure 16 from the original paper) shows sample-wise sparsity of CogvideoX:
This image is a chart showing sparsity per prompt (averaged over timesteps, blocks, batches, and heads). The x-axis is the prompt index and the y-axis is the mean sparsity, comparing each prompt's sparsity against the overall mean of 0.271.
Figure 16. Sample-wise sparsity of CogvideoX.
This figure shows sparsity varying across different input samples (prompts). While there's a mean sparsity of 0.271, individual samples can have higher or lower sparsity. This highlights the dynamic nature of SpargeAttn and its ability to adapt sparsity to the specific input, rather than relying on a fixed pattern.
The following figure (Figure 17 from the original paper) shows head-wise sparsity of CogvideoX:
This image is a set of charts showing head-wise sparsity across multiple layers of CogvideoX. Each chart corresponds to one layer and highlights the maximum sparsity values within that layer, giving an intuitive view of the model's sparsity characteristics.
Figure 17. Head-wise sparsity of CogvideoX.
This figure (composed of multiple charts, one for each layer) illustrates that sparsity also varies across different attention heads within a single layer. Some heads are much sparser than others. Similar to layer-wise sparsity, this indicates that setting hyper-parameters for each attention head independently could yield further optimization.
These detailed sparsity analyses provide strong evidence for the effectiveness of SpargeAttn's dynamic and adaptive approach, as it can leverage varied sparsity patterns across layers, heads, timesteps, and inputs, rather than relying on a single fixed pattern.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces SpargeAttn, a novel training-free, universal, sparse, and quantized attention mechanism designed to significantly accelerate inference for a wide array of large models without sacrificing end-to-end performance. The core of SpargeAttn is a two-stage online filtering process. The first stage employs selective token compression and a TopCdf mechanism to rapidly predict and prune coarse-grained attention blocks based on block self-similarity. The second stage implements a softmax-aware filter at the GPU warp level, further skipping negligible value updates with no additional overhead. SpargeAttn is integrated into the SageAttention framework for 8-bit quantization benefits, and for visual models, it incorporates a HilbertCurve permutation to enhance token locality and sparsity.
Extensive experiments on diverse language (Llama3.1), image (Flux, Stable-Diffusion3.5), and video generation (CogvideoX, Mochi, Open-Sora-Plan) models demonstrate SpargeAttn's robust performance. It consistently achieves substantial speedups (up to 4.5x) compared to Full-Attention and significantly outperforms existing sparse attention baselines (MInference, FlexPrefill) that often incur severe metric degradation. The sparse prediction overhead is shown to be minimal, especially for longer sequences, and the sparsity itself increases with sequence length, making SpargeAttn highly beneficial for long-context models. The self-similarity judge and HilbertCurve permutation are validated as critical components for maintaining accuracy and improving sparsity respectively.
7.2. Limitations & Future Work
The paper doesn't explicitly dedicate a section to limitations or future work, but some points can be inferred:
- Hyper-parameter Tuning Overhead: While SpargeAttn is training-free in terms of model weights, it requires a per-layer grid search for its hyper-parameters to keep accuracy within specified bounds. This calibration process, though offline, could be time-consuming for models with many attention layers and heads. Future work could explore automated or more efficient hyper-parameter search strategies.
- Generality of HilbertCurve: The HilbertCurve permutation is specifically applied to 3D visual tokens. While effective, its direct applicability to language models (which are inherently 1D sequences) or other data modalities is not discussed. Future work might investigate generalizable permutation strategies for improving token similarity across various data types.
- Scope of "Any Model Inference": While "any model" is a strong claim, the experiments focus on generative models. Performance on discriminative tasks or other architectures might warrant further investigation to fully support the "any model" claim.
- Theoretical Bounds on Accuracy: The paper empirically shows that accuracy is preserved, but a deeper theoretical understanding of why the L1 error remains low under these pruning strategies could provide stronger guarantees and guide further improvements.
- Further Optimization of the Second Stage: Although the softmax-aware filter has no extra overhead, its contribution to total sparsity is smaller than that of the first stage. Exploring ways to make this fine-grained pruning more aggressive without accuracy loss could be a direction.
7.3. Personal Insights & Critique
SpargeAttn presents a highly compelling solution to the attention bottleneck, particularly for the increasingly common paradigm of long-context and multimodal large models. Its training-free nature is a major strength, as it allows immediate application to powerful, pre-trained models without the immense computational cost of retraining. The combination of selective token compression with a softmax-aware online filter is elegant, addressing the accuracy-efficiency trade-off by being dynamic and adaptive rather than relying on fixed patterns.
The empirical results are very strong, showcasing consistent speedups across language, image, and video generation tasks while maintaining virtually identical end-to-end performance. The visual comparisons are particularly convincing in highlighting the quality degradation of baselines versus SpargeAttn's fidelity. The analysis of sparsity increasing with sequence length is a key insight, indicating that SpargeAttn's benefits will only grow as context windows expand further.
A potential area for improvement or further investigation lies in the hyper-parameter tuning for each layer. While the proposed per-layer grid search for these hyper-parameters is effective, it still represents a calibration step that could become cumbersome for extremely deep models or frequent model updates. Research into adaptive hyper-parameter strategies that learn or dynamically adjust these values on-the-fly, without a separate search phase, would enhance SpargeAttn's usability even further. Additionally, exploring the interaction between quantization and sparsity more deeply, perhaps even co-optimizing them, could yield further gains.
The paper's claim of "accelerating any model inference" is ambitious, and while well-supported for generative models, its application to other domains (e.g., discriminative tasks like classification, or different architectural variants beyond standard transformers) would solidify this claim even more. Nevertheless, SpargeAttn stands as a significant advancement in practical attention acceleration, offering a robust and readily deployable solution for a critical challenge in large model inference. The code release further enhances its practical impact.