SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention
TL;DR Summary
SampleAttention is introduced as an adaptive, near-lossless sparse attention method for long-context LLMs, significantly reducing Time-to-First-Token latency while maintaining accuracy, achieving up to 2.42x TTFT reduction compared to FlashAttention.
Abstract
Large language models (LLMs) now support extremely long context windows, but the quadratic complexity of vanilla attention results in significantly long Time-to-First-Token (TTFT) latency. Existing approaches to address this complexity require additional pretraining or finetuning, and often sacrifice model accuracy. In this paper, we first provide both theoretical and empirical foundations for near-lossless sparse attention. We find dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial. To address this, we propose SampleAttention, an adaptive structured and near-lossless sparse attention. Leveraging observed significant sparse patterns, SampleAttention attends to a fixed percentage of adjacent tokens to capture local window patterns, and employs a two-stage query-guided key-value filtering approach, which adaptively selects a minimum set of key-values with low overhead, to capture column stripe patterns. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, and reduces TTFT by up to 2.42x compared with FlashAttention.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention". This title clearly indicates the paper's focus on improving the inference efficiency of Large Language Models (LLMs) with very long input sequences by employing a novel, adaptive sparse attention mechanism that aims to maintain high accuracy.
1.2. Authors
The authors are Qianchao Zhu, Jiangfei Duan, Chang Chen, Xiuhong Li, Siran Liu, Guanyu Feng, Xin Lv, Chuanfu Xiao, Dahua Lin, and Chao Yang. Their affiliations are indicated by numerical superscripts, with most authors affiliated with institution 1 and some with 2 or 3. The paper itself does not explicitly list the full names of the institutions, which is common in preprint versions.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, under the identifier arxiv:2406.15486. arXiv is a widely recognized platform for sharing research preprints in physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. Papers posted on arXiv are not peer-reviewed by arXiv itself, but they are often subsequently submitted to and published in reputable conferences or journals.
1.4. Publication Year
The paper was posted on arXiv on 2024-06-17 at 11:05 UTC.
1.5. Abstract
The paper addresses the challenge of Time-to-First-Token (TTFT) latency in Large Language Models (LLMs) due to the quadratic complexity of vanilla attention, especially with long context windows. Existing sparse attention methods often require additional pretraining/finetuning and sacrifice accuracy. The authors first provide theoretical and empirical foundations for near-lossless sparse attention, emphasizing the importance of dynamically capturing head-specific sparse patterns at runtime. To this end, they propose SampleAttention, an adaptive structured and near-lossless sparse attention mechanism. SampleAttention leverages significant sparse patterns by attending to a fixed percentage of adjacent tokens (for local window patterns) and employs a two-stage query-guided key-value filtering approach to adaptively select a minimum set of key-values (for column stripe patterns) with low overhead. Comprehensive evaluations show that SampleAttention can seamlessly replace vanilla attention in off-the-shelf LLMs with nearly no accuracy loss, reducing TTFT by up to 2.42x compared to FlashAttention (this figure is updated in the introduction of the main paper).
1.6. Original Source Link
The original source link is https://arxiv.org/abs/2406.15486, and the PDF link is https://arxiv.org/pdf/2406.15486v3.pdf. Its publication status is a preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The rapid advancement in Large Language Models (LLMs) has led to models supporting exceptionally long context windows, some exceeding 1 million tokens. While this enables more complex applications like document analysis, code copilot, and prolonged conversations, it introduces a significant challenge: the quadratic computational complexity of the conventional attention mechanism. This quadratic scaling of computation and memory with sequence length results in drastically increased Time-to-First-Token (TTFT) latency, especially during the prefill phase of inference. For instance, processing a 1-million-token context on ChatGLM3-6B can take over 1500 seconds, dominating more than 90% of the TTFT. Such prohibitive latency makes real-time interaction impractical and hinders the deployment of long-context LLMs in many real-world applications.
Prior research has consistently shown that attention scores in LLMs exhibit high sparsity, meaning only a small fraction of query-key pairs significantly contribute to the final attention output. This inherent sparsity presents a promising avenue for optimizing the attention mechanism by computing attention selectively. However, existing sparse attention approaches face two key limitations:
- Static Patterns: Many methods use static sparse patterns (e.g., fixed windows, global tokens), failing to capture the dynamic and adaptive nature of attention sparsity across different attention heads, input contents, and model architectures. This often leads to significant accuracy degradation.
- Dynamic Methods' Drawbacks: More dynamic approaches either incur substantial computational overhead for index selection, rely on coarse-grained selection that struggles with accuracy, or use predefined patterns and fixed budgets that do not adapt to varying sparsity ratios and dynamic patterns.
The paper's entry point is to effectively exploit this inherent high attention sparsity while overcoming the challenges posed by its adaptive ratio and dynamic patterns. It seeks to develop a runtime-efficient sparse attention mechanism that can seamlessly integrate into off-the-shelf LLMs without compromising accuracy.
2.2. Main Contributions / Findings
The primary contributions and findings of the paper are:
- Empirical and Theoretical Foundation for Near-Lossless Sparse Attention: The paper provides foundational insights demonstrating that dynamically capturing head-specific sparse patterns at runtime with low overhead is crucial for achieving near-lossless sparse attention. It reveals that sparsity ratios are adaptive across attention heads, input contents, and model architectures, and attention patterns are dynamic, often combining typical column and slash patterns.
- Introduction of Cumulative Residual Attention (CRA): A novel and robust metric, CRA, is proposed to measure the percentage of attention recall, serving as a reliable indicator for guiding the trade-off between efficiency and accuracy in sparse attention. A consistent positive correlation between the CRA threshold and model accuracy is shown empirically.
- Proposal of SampleAttention: An adaptive, structured, and near-lossless sparse attention mechanism is introduced. SampleAttention dynamically determines sparse ratios and patterns at runtime.
- Novel Two-Stage Query-Guided Key-Value Filtering: This core methodology allows SampleAttention to efficiently identify important column and slash patterns.
  - Stage 1: Query-Guided Chunked Sampling: Estimates the full attention score by sampling attention scores for a few queries across equal segments, enabling more accurate pattern detection than prior methods.
  - Stage 2: Score-Based Key-Value Filtering: Decomposes the CRA threshold into separate thresholds for columns ($\alpha_c$) and slashes ($\alpha_s$) to independently filter a minimal set of key-value blocks, which are then merged into a final sparse mask.
- Automated Hyperparameter Tuning: A method is developed for offline tuning of SampleAttention's hyperparameters ($\alpha_c$, $\alpha_s$, chunk_n) on small profiling datasets across different length ranges, optimizing the accuracy-efficiency trade-off.
- Hardware-Efficient Implementation: SampleAttention is implemented with IO-awareness through operator fusion and modification of FlashAttention2 kernels to achieve substantial wall-clock time speedups.
- Significant Performance Gains: SampleAttention seamlessly replaces vanilla attention in off-the-shelf LLMs (ChatGLM, YI, InternLM) with nearly no accuracy loss (consistently above 99% of full attention scores). It substantially reduces TTFT compared to FlashAttention2 on long contexts (e.g., 1 million tokens), establishing a new Pareto frontier in the accuracy-efficiency trade-off.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand SampleAttention, a foundational understanding of Large Language Models (LLMs), the Transformer architecture, and the attention mechanism is essential.
3.1.1. Large Language Models (LLMs) and the Transformer Architecture
Large Language Models (LLMs) are advanced artificial intelligence models, typically based on the Transformer architecture, that are trained on vast amounts of text data to understand, generate, and process human language. They excel at tasks like translation, summarization, question answering, and text generation. The Transformer architecture, introduced by Vaswani et al. (2017), revolutionized natural language processing by largely relying on self-attention mechanisms rather than recurrent or convolutional layers. It consists of an encoder-decoder structure (for sequence-to-sequence tasks) or a decoder-only structure (common for generative LLMs). Each Transformer block contains an attention layer (often Self-Attention) followed by a feed-forward network (MLP).
3.1.2. Attention Mechanism
The attention mechanism is the core component of Transformers. It allows the model to weigh the importance of different tokens in an input sequence when processing a specific token. Instead of processing tokens sequentially, attention can process all tokens in parallel and dynamically adjust their influence on the output. For each token, the model generates three vectors: a query (Q), a key (K), and a value (V). The similarity between a query and all keys determines the attention scores, which are then used to weigh the values.
3.1.3. Quadratic Complexity of Attention
The standard attention mechanism involves computing pairwise interactions between all query tokens and all key tokens. If the input sequence has length $S$, computing the attention score matrix ($S \times S$) involves multiplying an $S \times d$ query matrix with a $d \times S$ key matrix. This leads to a computational complexity of $O(S^2)$ with respect to the sequence length $S$. Similarly, the memory required to store this attention score matrix is also $O(S^2)$. For short sequences, this is manageable, but for very long contexts (e.g., 100K or 1M tokens), this quadratic scaling becomes a severe bottleneck, leading to extremely high computational costs and memory requirements.
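To make the quadratic growth concrete, here is a small back-of-the-envelope calculation (ours, not from the paper) of the memory a single head's attention score matrix would occupy if it were materialized in fp16:

```python
# Back-of-the-envelope illustration (not from the paper): memory needed to
# materialize a single head's S x S attention-score matrix in fp16.
BYTES_FP16 = 2

def score_matrix_gib(seq_len: int) -> float:
    """GiB required for one S x S attention-score matrix stored in fp16."""
    return seq_len * seq_len * BYTES_FP16 / 2**30

for s in (4_096, 131_072, 1_048_576):
    print(f"S = {s:>9,d} tokens -> {score_matrix_gib(s):10,.2f} GiB per head")
# Grows quadratically: ~0.03 GiB at 4K, 32 GiB at 128K, 2,048 GiB at 1M,
# which is why long-context kernels never materialize the full matrix.
```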
3.1.4. Time-to-First-Token (TTFT) and Prefill Latency
LLM inference typically operates in two phases:
- Prefill Phase: This is when the model processes the entire input prompt (e.g., a long document) in parallel. During this phase, all query tokens from the prompt attend to all key tokens. This phase generates the first output token and populates the Key-Value (KV) cache for all input tokens. The quadratic complexity of attention during this phase directly translates to TTFT latency, which is the time taken to generate the very first token of the response. For long contexts, prefill latency dominates the overall response time.
- Decoding Phase: After the first token is generated, subsequent tokens are generated one by one (autoregressively). In this phase, each new query token attends to all previously generated tokens and the KV cache of the original prompt. Each new token therefore costs $O(S)$, where $S$ is the current sequence length, but the prompt's KV cache memory still grows with the total sequence length.

TTFT (or prefill latency) is a critical metric for user experience, especially in interactive applications, as it dictates how long a user waits for any response to appear.
3.1.5. KV Cache
The Key-Value (KV) cache is a memory optimization technique used during the decoding phase of LLM inference. Instead of recomputing the keys and values for all previously generated tokens at each step, these vectors are stored in memory after the prefill phase and appended with new key and value vectors from subsequently generated tokens. This significantly speeds up the decoding phase by avoiding redundant computations. However, for extremely long contexts, the KV cache itself can consume a substantial amount of memory.
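As a rough illustration of the mechanism (a hypothetical sketch, not the paper's or any framework's implementation), a per-layer KV cache can be kept as key/value tensors that are appended along the sequence dimension at every step:

```python
# Minimal sketch of a per-layer KV cache, assuming PyTorch tensors of shape
# (batch, heads, seq_len, head_dim); names and structure are illustrative only.
import torch

class KVCache:
    def __init__(self):
        self.k = None  # cached keys accumulated so far
        self.v = None  # cached values accumulated so far

    def append(self, k_new: torch.Tensor, v_new: torch.Tensor):
        """Append keys/values for newly processed tokens along the sequence axis."""
        if self.k is None:
            self.k, self.v = k_new, v_new
        else:
            self.k = torch.cat([self.k, k_new], dim=2)
            self.v = torch.cat([self.v, v_new], dim=2)
        return self.k, self.v

# Prefill: cache the whole prompt once; decoding: append one token per step.
cache = KVCache()
k_prompt = torch.randn(1, 8, 1024, 128)   # keys for a 1024-token prompt
v_prompt = torch.randn(1, 8, 1024, 128)
cache.append(k_prompt, v_prompt)
k_step = torch.randn(1, 8, 1, 128)        # keys for one newly generated token
v_step = torch.randn(1, 8, 1, 128)
k_all, v_all = cache.append(k_step, v_step)  # attention now uses all 1025 positions
```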
3.1.6. Sparse Attention
Sparse attention is a general strategy to address the quadratic complexity of full attention. The core idea is that not all query-key interactions are equally important. By identifying and only computing a subset of the attention scores (i.e., making the attention matrix sparse), sparse attention aims to reduce computational and memory costs from $O(S^2)$ to a lower complexity, ideally close to linear in the sequence length. Various approaches define different sparse patterns, such as attending only to a fixed window of tokens, a few global tokens, or randomly selected tokens.
3.1.7. FlashAttention
FlashAttention (Dao et al., 2022) is an optimized implementation of the full attention mechanism. It significantly reduces the memory footprint and speeds up computation by reordering the attention computation and using tiling and fusing operations. Specifically, it computes the softmax in an online fashion, avoiding the need to materialize the large attention scores matrix in high-bandwidth memory. While FlashAttention addresses the memory bottleneck and improves the constant factor for computation, it does not change the fundamental computational complexity of full attention. FlashAttention2 further optimizes this by improving parallelism and work partitioning.
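FlashAttention itself is a fused GPU kernel, but at the usage level the same idea is exposed by PyTorch's built-in fused attention, which avoids materializing the score matrix and, on supported GPUs, can dispatch to a FlashAttention-style backend; the snippet below is only an illustrative stand-in, not the paper's setup:

```python
# Illustrative only: PyTorch's fused scaled_dot_product_attention computes full
# attention without materializing the S x S score matrix in global memory; on
# supported GPUs it can dispatch to a FlashAttention-style kernel (exact dispatch
# depends on the PyTorch version and hardware).
import torch
import torch.nn.functional as F

q = torch.randn(1, 8, 2048, 128)   # (batch, heads, seq_len, head_dim)
k = torch.randn(1, 8, 2048, 128)
v = torch.randn(1, 8, 2048, 128)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 2048, 128])
```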
3.2. Previous Works
The paper discusses various prior approaches to sparse attention and KV cache compression, highlighting their limitations that SampleAttention aims to overcome.
3.2.1. Full Attention Formula
The fundamental attention mechanism used in Transformers is Scaled Dot-Product Attention. For a single attention head, given query matrix $\mathbf{Q} \in \mathbb{R}^{S_q \times d}$, key matrix $\mathbf{K} \in \mathbb{R}^{S_k \times d}$, and value matrix $\mathbf{V} \in \mathbb{R}^{S_k \times d}$, where $S_q$ is the query sequence length, $S_k$ is the key/value sequence length, and $d$ is the head dimension, the output is computed as:
$ \mathbf{P} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right) \in [0, 1]^{S_q \times S_k} $

$ \mathbf{O} = \mathbf{P}\mathbf{V} \in \mathbb{R}^{S_q \times d} $
Where:
- $\mathbf{Q}$ represents the queries.
- $\mathbf{K}$ represents the keys.
- $\mathbf{V}$ represents the values.
- $S_q$ is the number of queries (often the sequence length of the current input).
- $S_k$ is the number of keys/values (often the sequence length of the context).
- $d$ is the dimension of the query, key, and value vectors (head dimension).
- $\mathbf{Q}\mathbf{K}^T$ computes the dot-product similarity between each query and each key.
- $\sqrt{d}$ is a scaling factor that prevents large dot-product values from pushing the softmax function into regions with tiny gradients.
- softmax is applied row-wise (across the $S_k$ dimension for each query), normalizing the attention scores to sum to 1 for each query. This produces the attention weights matrix $\mathbf{P}$.
- $\mathbf{P}$ is the attention weights matrix, indicating the importance of each key token for each query token.
- $\mathbf{O}$ is the final output of the attention layer, a weighted sum of the value vectors based on the attention weights.
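For reference, a minimal (and memory-hungry) PyTorch rendering of the two equations above might look as follows; it is a didactic sketch that materializes the full score matrix, which is exactly what optimized kernels avoid:

```python
# Didactic sketch of P = softmax(QK^T / sqrt(d)) and O = PV. It materializes the
# full S_q x S_k score matrix, so it is only usable for short sequences.
import math
import torch

def naive_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q: (S_q, d), k: (S_k, d), v: (S_k, d)
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)   # (S_q, S_k) logits
    p = torch.softmax(scores, dim=-1)                 # row-wise attention weights
    return p @ v                                      # (S_q, d) output

q, k, v = (torch.randn(16, 64) for _ in range(3))
print(naive_attention(q, k, v).shape)  # torch.Size([16, 64])
```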
3.2.2. Static Sparse Attention Methods
These approaches define fixed patterns for attention, often based on structural assumptions:
- LongFormer (Beltagy et al., 2020): Combines local windowed attention with task-specific global attention.
- BigBird (Zaheer et al., 2020): Uses a combination of local window attention, global tokens, and random attention to achieve linear complexity. The paper assigns BigBird a window size ratio of 8% and a global ratio of 8% for baselines.
- LongNet (Ding et al., 2023): Replaces full attention with dilated attention to cover longer distances with sparse connections.
- StreamingLLM (Xiao et al., 2023b): Focuses on infinite-length generation by keeping attention sinks (a few initial tokens) and a fixed number of recent tokens. The paper sets its initial attention sink at 4 tokens for baselines.

Limitation: These methods struggle to capture the dynamic and adaptive nature of attention patterns across different heads and inputs, often leading to accuracy degradation when applied to pretrained LLMs without fine-tuning.
3.2.3. Dynamic Sparse Attention Methods
These methods attempt to adapt the sparse pattern at runtime but face their own challenges:
- DSA (Liu et al., 2022): Approximates attention patterns using low-rank hidden dimensions, but incurs significant computational overhead for long contexts.
- HyperAttention (Han et al., 2023): Employs Locality Sensitive Hashing (LSH) to identify important attention scores. Its coarse-grained selection struggles to maintain model accuracy. The paper sets both the bucket size and the number of sampled columns to 256 for baselines.
- MInference (Jiang et al., 2024): A hybrid approach that predefines several sparse patterns (A-shape, Vertical-Slash, Block-Sparse) and matches each attention head to its optimal pattern offline. It then dynamically searches for sparse indices during the prefill phase within a fixed sparsification budget.

Limitation: DSA has high overhead, HyperAttention is coarse-grained, and MInference relies on predefined patterns and fixed budgets, failing to capture the varying sparsity ratios and dynamic patterns that adapt to different input contents and heads at runtime.
3.2.4. KV Cache Compression Methods
These works focus on reducing KV cache memory consumption, which is orthogonal to SampleAttention's focus on prefill computation speedup but can be combined:
- StreamingLLM (Xiao et al., 2023b): Also applies here by pruning older KV entries.
- H2O (Zhang et al., 2024c): Dynamically retains a balance of recent and heavy-hitter tokens during decoding.
- FastGen (Ge et al., 2023): Adaptively constructs the KV cache based on head-specific policies.
- SnapKV (Li et al., 2024): Compresses the KV cache by selecting clustered critical positions.
- CHAI (Agarwal et al., 2024): Head pruning to reduce KV cache overhead at the attention head level.
- Quantization approaches (Duanmu et al., 2024; Xiao et al., 2023a; Zhao et al., 2023) lower KV cache precision.
3.3. Technological Evolution
The evolution of LLM inference efficiency for long contexts can be seen as a progression:
- Full Attention: The foundational Transformer mechanism, inherently quadratic.
- Optimized Full Attention: Implementations like FlashAttention that improve memory and speed constants but retain $O(S^2)$ complexity.
- Static Sparse Attention: Early attempts to break the quadratic barrier by applying predefined, fixed sparse patterns (e.g., LongFormer, BigBird, LongNet). These often required retraining or fine-tuning and incurred accuracy penalties.
- Dynamic Sparse Attention: More recent efforts to adapt sparse patterns at runtime (DSA, HyperAttention, MInference). While more flexible, they often introduced significant overhead, lacked fine-grained adaptivity, or still relied on some static assumptions.
- Adaptive Structured Sparse Attention (SampleAttention): The current paper's contribution, aiming for a truly dynamic and adaptive approach that determines optimal sparsity ratios and patterns per head, per input, at runtime, guided by an accuracy metric (CRA), and implemented efficiently. This represents a step towards combining the best of dynamic adaptation with near-lossless accuracy.
3.4. Differentiation Analysis
Compared to main methods in related work, SampleAttention introduces several core differences and innovations:
- Adaptive Sparsity Ratio: Unlike MInference (Jiang et al., 2024) and DuoAttention (Xiao et al., 2024), which use fixed sparsification budgets or predefined patterns, SampleAttention dynamically determines the sparsity ratio at runtime for each individual attention head and input prompt. This directly addresses the observation that optimal sparsity varies greatly across heads, contents, and models.
- Dynamic Sparse Pattern Flexibility: While MInference acknowledges column and slash patterns, it classifies attention heads into fixed categories offline. SampleAttention's two-stage query-guided key-value filtering adaptively identifies critical column and slash patterns on the fly, allowing for flexible, content-dependent combinations of these patterns (e.g., Figure 4(c-d)). This avoids the limitations of static pattern classification.
- Cumulative Residual Attention (CRA) as an Accuracy Proxy: SampleAttention introduces CRA as a principled, robust metric to guide the selection of sparse indices. This allows for a direct, data-driven way to balance efficiency and accuracy, ensuring "near-lossless" performance without extensive re-training or fine-tuning. Previous dynamic methods often lacked such a direct, runtime-evaluable accuracy indicator.
- Query-Guided Chunked Sampling: Instead of relying solely on the last few query tokens (like some prior works), SampleAttention partitions queries into chunks and samples from each, providing a more comprehensive and accurate estimation of the attention distribution across the entire attention matrix, especially for complex hybrid patterns.
- Hardware-Efficient Implementation: SampleAttention is built upon FlashAttention2 with IO-awareness, including operator fusion and custom kernels, ensuring that the dynamic selection overhead is minimized and substantial wall-clock time speedups are achieved. This contrasts with some dynamic methods that have high computational overhead for pattern discovery.

In essence, SampleAttention stands out by offering a highly adaptive, runtime-efficient, and accuracy-preserving sparse attention mechanism that is guided by a novel metric, making it seamlessly applicable to off-the-shelf LLMs for long-context inference.
4. Methodology
SampleAttention is designed to efficiently accelerate long context LLM inference by exploiting the inherent sparsity of the attention mechanism in a near-lossless manner. The core idea is to dynamically identify and retain only the most important key-value pairs (column and slash patterns) that contribute significantly to the attention scores, guided by a robust accuracy metric called Cumulative Residual Attention (CRA).
4.1. Principles
The fundamental principle behind SampleAttention rests on two key observations:
- Adaptive Sparsity: The optimal degree of sparsity (sparsity ratio) and the specific patterns of important attention scores are not static. They vary dynamically across different attention heads, input contents, and LLM architectures. This means a fixed, predefined sparse pattern or budget will inevitably lead to suboptimal performance or accuracy degradation.
- Structured Patterns: Despite their dynamism, significant attention scores often coalesce into identifiable structured patterns, primarily column stripes (capturing global context, like attention sinks) and slash stripes (capturing local context, like window patterns). These patterns, and their combinations, account for the majority of the attention recall.

SampleAttention proposes to address these observations by:

- Introducing Cumulative Residual Attention (CRA) as a principled metric to balance efficiency and accuracy. CRA measures the percentage of attention recall, indicating how much of the original attention mass is preserved after sparsification.
- Employing a novel two-stage query-guided key-value filtering method to dynamically identify and select these critical column and slash patterns at runtime with low overhead.
- Implementing hardware-efficient kernels to minimize the computational cost of this dynamic selection and the subsequent sparse attention computation.

The goal is to achieve maximal efficiency (high speedup) by aggressively sparsifying the attention computation, while maintaining accuracy by ensuring that the selected sparse patterns capture sufficient attention recall, as guided by CRA.
4.2. Core Methodology In-depth (Layer by Layer)
The SampleAttention methodology operates in two main stages: Query-Guided Chunked Sampling and Score-Based Key-Value Filtering. These stages work in conjunction to construct a dynamic, adaptive sparse mask that guides the sparse FlashAttention computation.
4.2.1. Initial Full Attention & Sparse Attention Formulation
Before diving into SampleAttention's specific stages, let's reiterate the foundational full attention and the general formulation for sparse attention as presented in the paper.
For a given attention head, with query matrix $\mathbf{Q} \in \mathbb{R}^{S_q \times d}$, key matrix $\mathbf{K} \in \mathbb{R}^{S_k \times d}$, and value matrix $\mathbf{V} \in \mathbb{R}^{S_k \times d}$, the full attention output is:
$ \mathbf{P} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}}\right) \in [0, 1]^{S_q \times S_k} $

$ \mathbf{O} = \mathbf{P}\mathbf{V} \in \mathbb{R}^{S_q \times d} $
Where:

- $\mathbf{Q} \in \mathbb{R}^{S_q \times d}$: Query tensor.
- $\mathbf{K} \in \mathbb{R}^{S_k \times d}$: Key tensor.
- $\mathbf{V} \in \mathbb{R}^{S_k \times d}$: Value tensor.
- $S_q$: Query sequence length.
- $S_k$: Key sequence length.
- $d$: Head dimension.
- $\mathbf{P}$: Attention scores matrix after softmax.
- $\mathbf{O}$: Output tensor of the attention layer.
- softmax: Row-wise softmax function.

To approximate this with sparse attention, a binary mask matrix $\mathbf{M} \in \{0, 1\}^{S_q \times S_k}$ is introduced. The sparse attention scores are computed as:
$ \hat{\mathbf{P}} = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^T}{\sqrt{d}} - c(1 - \mathbf{M})\right) $
Where:

- $\hat{\mathbf{P}}$: The sparse attention scores matrix.
- $c$: A large positive constant.
- $\mathbf{M}$: A binary mask matrix, where $\mathbf{M}_{ij} = 1$ if the attention from query $i$ to key $j$ is computed, and $\mathbf{M}_{ij} = 0$ if it is masked out.

The term $-c(1 - \mathbf{M})$ effectively sets the unselected attention logits to a very small negative number, causing their softmax probabilities to become nearly zero. The sparsity ratio is the percentage of attention scores that are masked (i.e., the fraction of entries with $\mathbf{M}_{ij} = 0$).
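A toy PyTorch sketch of this masked formulation, assuming a precomputed binary mask M (this is not the paper's kernel, which skips masked blocks rather than penalizing them), could be:

```python
# Toy sketch of sparse attention via an additive penalty -c * (1 - M).
# A real implementation would skip masked blocks instead of computing them.
import math
import torch

def masked_sparse_attention(Q, K, V, M, c=1e9):
    # Q: (S_q, d), K/V: (S_k, d), M: (S_q, S_k) binary mask (1 = keep, 0 = drop)
    d = Q.shape[-1]
    logits = Q @ K.T / math.sqrt(d) - c * (1.0 - M)   # push dropped logits to -c
    P_hat = torch.softmax(logits, dim=-1)             # ~0 probability on masked pairs
    return P_hat @ V

S, d = 128, 64
Q, K, V = (torch.randn(S, d) for _ in range(3))
M = torch.tril(torch.ones(S, S))                      # e.g., keep only the causal half
sparsity = 1.0 - M.mean().item()                      # fraction of masked scores
print(masked_sparse_attention(Q, K, V, M).shape, f"sparsity={sparsity:.2%}")
```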
4.2.2. Cumulative Residual Attention (CRA)
The paper introduces Cumulative Residual Attention (CRA) as a key insight to guide the balance between efficiency and accuracy. CRA is defined as the minimum sum of remaining attention probabilities per query after sparsification. In simpler terms, if you sum up the attention probabilities of the selected (unmasked) key-value pairs for each query, CRA quantifies how much of the original attention mass (which sums to 1 for each query in full attention) is retained.
The relationship between the CRA threshold and model accuracy is empirically observed to be a consistent positive correlation (e.g., Figure 5). This means that by setting a target CRA threshold, the model can dynamically select a minimal set of attention indices that achieve this threshold, thus maximizing computational efficiency while preserving accuracy.
The core challenge is that precisely calculating CRA for a given mask requires computing the full attention score matrix first, which is computationally expensive. SampleAttention overcomes this by using its two-stage sampling and filtering process to estimate and select important column and slash patterns that approximate the desired CRA threshold. The insight (illustrated in Figure 7) is that selecting an appropriate number of key column and slash strips can accurately approximate the CRA.
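For intuition, CRA can be written down directly in code when the full attention matrix is available; the sketch below is an illustrative definition (ours), and computing it this way is exactly the expensive step that SampleAttention's sampling avoids:

```python
# Illustrative-only computation of CRA: the minimum retained attention mass per
# query under a given mask. Computing it exactly needs the full P matrix.
import math
import torch

def cumulative_residual_attention(Q, K, M):
    # Q: (S_q, d), K: (S_k, d), M: (S_q, S_k) binary mask of kept positions
    d = Q.shape[-1]
    P = torch.softmax(Q @ K.T / math.sqrt(d), dim=-1)  # full attention probabilities
    retained = (P * M).sum(dim=-1)                      # kept mass for each query
    return retained.min().item()                        # CRA = worst case over queries

Q, K = torch.randn(256, 64), torch.randn(256, 64)
rows = torch.arange(256)[:, None]
cols = torch.arange(256)[None, :]
# Example mask: a few "sink" columns plus a local window of 64 adjacent tokens.
M = ((cols < 4) | ((rows - cols >= 0) & (rows - cols < 64))).float()
print(f"CRA = {cumulative_residual_attention(Q, K, M):.3f}")
```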
4.2.3. Stage 1: Query-Guided Chunked Sampling
The first stage, Query-Guided Chunked Sampling, aims to efficiently estimate the full attention score distribution without incurring the cost of computing all query-key interactions.
Motivation: Prior methods often sample only a few query tokens (e.g., the last few queries) to estimate attention patterns. However, SampleAttention observes that attention patterns can be complex and vary across the entire attention matrix (e.g., Figure 4(c) shows a column pattern significant in the upper half but fading in the lower half). A single, concentrated sampling location might introduce bias and miss important patterns.
Procedure:
- Query Partitioning: The full query sequence (of length $S_q$) is partitioned into chunk_n equal segments or chunks.
- Segment Selection: For each chunk, a small subset of queries (a query block) is selected to form $\mathbf{Q}_{slice}$: for each chunk $i$, a block of queries ending at position $i \cdot itv$ (where $itv$ is the interval between chunks) and starting $blk$ tokens before it is sampled. This equidistant sampling across chunks is described as low-overhead and more stable than random or bottom sampling, effectively capturing varying patterns across the attention matrix.
- Attention Score Computation: For these selected query blocks ($\mathbf{Q}_{slice}$), attention scores are accurately computed against all key tokens $\mathbf{K}$. This still involves a matrix multiplication, but with a significantly reduced number of queries, making it much faster than the full $\mathbf{Q}\mathbf{K}^T$. The causal mask ($\mathbf{m}_{causal}$) is applied for auto-regressive decoding, where queries can only attend to preceding keys:
  $ \dot{\mathbf{A}} \leftarrow \mathrm{softmax}(\mathbf{Q}_{slice} \mathbf{K}^T / \sqrt{d} + \mathbf{m}_{causal}) $
  Where:
  - $\mathbf{Q}_{slice}$: The sampled query blocks.
  - $\mathbf{K}$: All key tokens.
  - $\mathbf{m}_{causal}$: The causal mask applied to prevent future information leakage.
  - $\dot{\mathbf{A}}$: The attention scores for the sampled query blocks.
- Block Reduction: The resulting token-level attention scores are then reduced at a block-size ($blk$) granularity in both column and slash directions. This means grouping adjacent attention scores into blocks and summarizing them (e.g., summing or averaging) to obtain block-level scores:
  $ \dot{\mathbf{A}}_c, \dot{\mathbf{A}}_s \leftarrow \mathrm{block\_reduction}(\dot{\mathbf{A}}, blk) $
  Where:
  - $\dot{\mathbf{A}}_c$: Block-reduced scores representing column importance.
  - $\dot{\mathbf{A}}_s$: Block-reduced scores representing slash importance.
  - $blk$: The block size for reduction.

These block-level scores are then passed to the second stage for key-value filtering.
4.2.4. Stage 2: Score-Based Key-Value Filtering
The second stage, Score-Based Key-Value Filtering, takes the block-level scores from Stage 1 to dynamically select the most important key-value blocks that meet the desired CRA thresholds.
Motivation: Selecting column and slash strips jointly would require evaluating a combinatorial number of column-slash combinations, which is computationally prohibitive. SampleAttention decomposes this problem.
Procedure:

- Threshold Decomposition: The single global CRA threshold is decomposed into two separate thresholds: $\alpha_c$ for columns and $\alpha_s$ for slashes. This turns the joint selection into two independent, one-dimensional filtering problems.
- Independent Selection of Column Strips:
  - The block-reduced column scores $\dot{\mathbf{A}}_c$ are sorted.
  - A cumulative sum (cumsum) is performed on the sorted column scores.
  - The minimum number of key-value blocks ($k_c$) whose cumulative sum meets the threshold $\alpha_c$ is determined. This represents the "quota" of column blocks needed.
  - The top-k operation is then applied to the original $\dot{\mathbf{A}}_c$ to identify the indices ($\mathbf{I}_c$) of these most important column blocks.
  $ k_c \leftarrow \mathrm{find\_k}(\mathrm{cumsum}(\mathrm{sort}(\dot{\mathbf{A}}_c)), \alpha_c) $
  $ \mathbf{I}_c \leftarrow \mathrm{arg\_topk}(\dot{\mathbf{A}}_c, k_c) $
- Independent Selection of Slash Strips: A similar process is followed for slash strips, using $\dot{\mathbf{A}}_s$ and $\alpha_s$ to determine $k_s$ and $\mathbf{I}_s$.
  $ k_s \leftarrow \mathrm{find\_k}(\mathrm{cumsum}(\mathrm{sort}(\dot{\mathbf{A}}_s)), \alpha_s) $
  $ \mathbf{I}_s \leftarrow \mathrm{arg\_topk}(\dot{\mathbf{A}}_s, k_s) $
  Where:
  - find_k: A function that determines the minimum number of top-k elements needed to exceed a cumulative-sum threshold.
  - arg_topk: A function that returns the indices of the top-k elements.
- Extension and Merging of Mask: The identified block indices $\mathbf{I}_c$ and $\mathbf{I}_s$ are then "extended" to cover the full attention matrix according to their respective column and slash patterns. These extended patterns (masks) are then merged to create the final block-sparse mask $\hat{\mathbf{M}}$:
  $ \hat{\mathbf{M}} \leftarrow \mathrm{merge\_index}(\mathbf{I}_c, \mathbf{I}_s, itv) $
  Note: the $itv$ parameter likely refers to the interval or chunk size used in the Query-Guided Chunked Sampling stage, ensuring correct alignment during merging. This merged mask is designed to be a near-lossless approximation, capturing critical attention sinks (via columns) and local window patterns (via slashes).
4.2.5. Final Sparse Attention Computation
Finally, the constructed mask $\hat{\mathbf{M}}$ is used in a modified FlashAttention kernel to compute the sparse attention output:

$ \mathbf{O} \leftarrow \mathrm{sparse\_flash\_attn}(\mathbf{Q}, \mathbf{K}, \mathbf{V}, \hat{\mathbf{M}}) $
This leverages the hardware-efficient kernels (discussed next) to compute only the unmasked attention scores, significantly reducing computation and IO overhead.
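For intuition, block-sparse attention can be emulated at the PyTorch level by expanding the block mask and discarding masked scores; this functional sketch (ours) produces the same output a block-sparse kernel would, but without the compute and IO savings that come from skipping masked tiles entirely:

```python
# Functional sketch only: emulate block-sparse attention by expanding a block
# mask into a token-level mask. A real kernel (e.g., a modified FlashAttention2)
# skips the masked tiles instead of computing and then discarding them.
import math
import torch

def block_sparse_attention(q, k, v, block_mask, blk=64):
    # q: (S_q, d), k/v: (S_k, d), block_mask: (S_q//blk, S_k//blk) of {0, 1}
    d = q.shape[-1]
    token_mask = block_mask.repeat_interleave(blk, 0).repeat_interleave(blk, 1)
    scores = q @ k.T / math.sqrt(d)
    scores = scores.masked_fill(token_mask == 0, float("-inf"))  # drop masked blocks
    return torch.softmax(scores, dim=-1) @ v

q = torch.randn(256, 64)
k = torch.randn(256, 64)
v = torch.randn(256, 64)
mask = torch.tril(torch.ones(4, 4))            # toy 4x4 block mask (causal-ish)
print(block_sparse_attention(q, k, v, mask).shape)  # torch.Size([256, 64])
```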
The entire two-stage process is summarized in Algorithm 1, which details the flow for a single attention head. The process is applied across all heads and layers of the LLM.
The following figure (Figure 6 from the original paper) visually summarizes the two-stage methodology of SampleAttention:

Algorithm 1: Two-stage Implementation of SampleAttention
| Input: Q, K, V, αc, αs [0, 1], chunkn # Stagel: Query-Guided Chunked Attention Sampling Sq |
| Qslice ← [Q[i*itv−blk:i*itv] for i in range(1, chunkn +1)] Å ← softmax (QsliceK T /√d + mcasual) Åc, Ås ← block_reduction (Å, blk) |
| # Stage2: Score-Based Key-Value Block Filtering kc ← find_k (cumsum(sort(Åc), α) |
| Ic ← arg_topk (Åc, kc) |
| ks ← find_k (cumsum(sort(Ås)), αs) Is ← arg_topk (Ås, ks) |
| # Extend and Merge Block-sparse Mask across Each Head |
| ¯ ← merge_index(Ic, Is, itv) # Final Sparse FlashAttention with Block Index O ← sparse_flash_attn (Q, K, V, M) |
Note on Algorithm 1 representation: The algorithm snippet provided in the paper is not fully formatted in a typical pseudocode style (e.g., missing specific variable definitions for itv, blk, mcasual, and precise function signatures). However, I have presented it exactly as it appears in the paper. The explanation above clarifies the logical flow and implied operations.
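To make the flow concrete, here is a rough Python paraphrase of Algorithm 1 for a single head (ours, not the authors' code). For brevity it filters only column stripes against an $\alpha_c$-style threshold and always keeps a fixed local window of adjacent tokens in place of slash filtering (which would follow the same sort/cumsum/top-k recipe on diagonally reduced scores), and the final step uses a dense masked softmax instead of the paper's sparse FlashAttention kernel:

```python
# Simplified paraphrase of Algorithm 1 for one head; not the authors' code.
import math
import torch

def sample_attention_sketch(Q, K, V, alpha_c=0.95, chunk_n=2, blk=64):
    S, d = Q.shape
    itv = S // chunk_n

    # Stage 1: query-guided chunked sampling (a block of `blk` queries per chunk).
    q_idx = torch.cat([torch.arange(i * itv - blk, i * itv)
                       for i in range(1, chunk_n + 1)])
    logits = Q[q_idx] @ K.T / math.sqrt(d)
    causal = q_idx[:, None] >= torch.arange(S)[None, :]
    A = torch.softmax(logits.masked_fill(~causal, float("-inf")), dim=-1)
    A_c = A.reshape(len(q_idx), S // blk, blk).sum(dim=(0, 2))  # per key-block mass

    # Stage 2: keep the fewest column blocks whose mass reaches alpha_c.
    sorted_mass, _ = torch.sort(A_c, descending=True)
    csum = torch.cumsum(sorted_mass / A_c.sum(), dim=0)
    k_c = int(torch.searchsorted(csum, torch.tensor(alpha_c)).item()) + 1
    I_c = torch.topk(A_c, min(k_c, A_c.numel())).indices

    # Merge selected column blocks with a local window into a token-level mask.
    rows = torch.arange(S)[:, None]
    cols = torch.arange(S)[None, :]
    M = (rows - cols >= 0) & (rows - cols < blk)        # local window of adjacent tokens
    for c in I_c:
        M[:, c * blk:(c + 1) * blk] = True              # selected column stripes
    M &= cols <= rows                                    # keep causality

    # Dense stand-in for the paper's sparse FlashAttention kernel.
    scores = (Q @ K.T / math.sqrt(d)).masked_fill(~M, float("-inf"))
    return torch.softmax(scores, dim=-1) @ V

Q = K = V = torch.randn(512, 64)
print(sample_attention_sketch(Q, K, V).shape)  # torch.Size([512, 64])
```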
4.2.6. Hyperparameter Tuning
SampleAttention introduces three key hyperparameters that control the trade-off between efficiency and accuracy:
- $\alpha_c$: The desired CRA threshold for columns.
- $\alpha_s$: The desired CRA threshold for slashes.
- chunk_n: The number of sampling chunks.

These hyperparameters are tuned offline using a small profiling dataset. The tuning process involves:

- Length-based Segmentation: Discretizing sequence lengths into distinct intervals (e.g., <16K, [16K, 48K), etc.).
- Multi-task Tuning: Performing tuning within each segment to find the $(\alpha_c, \alpha_s)$ configurations that maximize computational efficiency while maintaining accuracy (e.g., relative to FlashAttention2's accuracy and latency). Larger $\alpha_c$/$\alpha_s$ values generally increase accuracy at the cost of speedup, while chunk_n affects sampling overhead and accuracy; an appropriate chunk_n is crucial to capture sparse structures without excessive overhead.
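A toy version of such an offline search (with a hypothetical evaluate() stand-in; the paper's actual tuning procedure and scoring are more involved) could look like:

```python
# Toy offline grid search over (alpha_c, alpha_s, chunk_n) for one length bucket.
# `evaluate(...)` is a hypothetical stand-in that would run a small profiling set
# and return (accuracy, speedup) relative to the FlashAttention2 baseline.
import itertools

def tune(evaluate, min_accuracy=0.99):
    grid = itertools.product([0.85, 0.90, 0.95, 0.98],   # alpha_c candidates
                             [0.85, 0.90, 0.95, 0.98],   # alpha_s candidates
                             [1, 2, 4])                  # chunk_n candidates
    best, best_speedup = None, 0.0
    for alpha_c, alpha_s, chunk_n in grid:
        accuracy, speedup = evaluate(alpha_c, alpha_s, chunk_n)
        # Keep the fastest configuration that stays near-lossless.
        if accuracy >= min_accuracy and speedup > best_speedup:
            best, best_speedup = (alpha_c, alpha_s, chunk_n), speedup
    return best

# Example with a fake evaluator: higher thresholds -> more accuracy, less speedup.
fake = lambda ac, as_, cn: (min(1.0, 0.9 + 0.1 * ac), 3.0 - ac - as_ - 0.1 * cn)
print(tune(fake, min_accuracy=0.98))
```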
4.2.7. Hardware-Efficient Implementation
To ensure practical speedups, SampleAttention is designed with IO-awareness and optimized for hardware:
- Operator Fusion: The dynamic selection process involves a series of small operations (bmm, mask_fill, softmax, reduction). These are fused to minimize IO overhead, which is critical for performance on modern GPUs.
- Custom Kernel Modification: SampleAttention modifies FlashAttention2 kernels to implement its adaptive structured sparse attention. This allows it to leverage the underlying hardware efficiency of FlashAttention2 while incorporating the dynamic masking logic.

This integrated approach ensures that the overhead of dynamic pattern identification is mitigated, translating to significant wall-clock time speedups.
5. Experimental Setup
5.1. Datasets
SampleAttention is evaluated on a diverse suite of benchmarks designed for long-context LLMs.
- RULER (Hsieh et al., 2024):
  - Description: A comprehensive benchmark for evaluating long-context language models. It provides flexible configurations for various sequence lengths.
  - Characteristics: Extends the traditional needle-in-a-haystack test by incorporating diverse types and quantities of "needles" and introduces new tasks like multi-hop tracing and aggregation. This evaluates more complex behaviors beyond simple retrieval.
  - Scale: Encompasses 13 distinct tasks.
  - Usage: Used for comprehensive evaluation of understanding capabilities and for tuning hyperparameters of SampleAttention at specific lengths (e.g., 16K, 32K, 64K, 128K). Synthetic data from RULER is also used for micro-benchmarks.
- LongBench (Bai et al., 2023):
  - Description: A multi-task benchmark for long-context understanding.
  - Characteristics: Comprises single- and multi-document QA, summarization, few-shot learning, synthetic tasks, and code completion.
  - Scale: Over 4,750 test cases with task lengths ranging from 4K to 35K.
  - Domain: Bilingual.
- InfiniteBench (Zhang et al., 2024b):
  - Description: A benchmark specifically designed to evaluate language models' ability to handle, understand, and reason in contexts exceeding an average length of 200K tokens.
  - Characteristics: Comprises 10 unique tasks, each crafted to assess different aspects of language processing and comprehension in extended contexts.
Example Data Sample (Conceptual - as the paper doesn't provide specific examples for these benchmarks):
For RULER's needle-in-a-haystack, a data sample might involve a very long document (the "haystack") with a specific, rare piece of information (the "needle") embedded within it, and the model is prompted to retrieve this information.
For LongBench's multi-document QA, a query might ask a question that requires synthesizing information from several long documents provided as context.
For InfiniteBench, a task could involve complex reasoning over an extremely long code base or a dense legal document.
These datasets are chosen because they collectively cover a wide range of sequence lengths (from 4K to over 200K, and even up to 1M with RULER's flexibility) and diverse tasks, effectively validating the method's performance across different long-context scenarios and ensuring robust evaluation.
5.2. Evaluation Metrics
The paper uses accuracy, speedup, and Time-to-First-Token (TTFT) to evaluate SampleAttention.
5.2.1. Accuracy
Conceptual Definition: Accuracy, in the context of these benchmarks, refers to how well the LLM performs on specific tasks (e.g., answering questions correctly, generating coherent summaries, retrieving information accurately). For SampleAttention, a key goal is to achieve "near-lossless" accuracy, meaning the performance of the sparse attention mechanism should be almost identical to that of full attention. The specific calculation varies by task (e.g., F1 score for QA, ROUGE for summarization, exact match for retrieval), but the paper often reports a normalized score or a percentage relative to the full attention baseline. The Cumulative Residual Attention (CRA) metric internally helps guide this by ensuring a high percentage of attention recall.
Mathematical Formula: The paper does not provide a single overarching formula for "accuracy" as it is a composite measure across diverse tasks. Instead, it refers to "scores" from benchmarks. For specific tasks within RULER, LongBench, and InfiniteBench, common metrics would be:
- Exact Match (EM): $ EM = \frac{\text{Number of predictions exactly matching the ground truth}}{\text{Total number of predictions}} $
- F1 Score: The harmonic mean of precision and recall, often used for QA tasks.
$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
- ROUGE (Recall-Oriented Understudy for Gisting Evaluation): Used for summarization tasks, measures the overlap of n-grams, word sequences, and word pairs between the generated summary and reference summaries. (No single simple formula, involves comparing n-gram counts).
Symbol Explanation:
- Number of predictions exactly matching the ground truth: Count of model outputs that are identical to the correct answer.
- Total number of predictions: Total number of questions or tasks the model attempted.
- True Positives: Correctly identified positive instances.
- False Positives: Incorrectly identified positive instances.
- False Negatives: Incorrectly identified negative instances.
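As a toy illustration of how EM and a token-level F1 might be computed (not the benchmarks' official scorers):

```python
# Toy illustration of Exact Match and token-level F1; simplified scorers only.
from collections import Counter

def exact_match(pred: str, gold: str) -> float:
    return float(pred.strip().lower() == gold.strip().lower())

def token_f1(pred: str, gold: str) -> float:
    p, g = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(p) & Counter(g)).values())  # common tokens
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "paris"))                         # 1.0
print(round(token_f1("the capital is Paris", "Paris"), 2))   # 0.4
```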
5.2.2. Time-to-First-Token (TTFT)
Conceptual Definition: TTFT is the time elapsed from when a prompt is submitted to an LLM until the first output token of the response is generated. It directly reflects the prefill latency and is a critical measure of responsiveness for interactive applications. A lower TTFT indicates faster initial response.
Mathematical Formula: $ TTFT = T_{prefill} $

Where:
- $T_{prefill}$: The total time taken by the model to process the entire input prompt and generate the first output token.

Symbol Explanation:
- $T_{prefill}$: Time taken for the prefill phase.
5.2.3. Speedup
Conceptual Definition: Speedup quantifies the performance improvement of a proposed method relative to a baseline method. It is calculated as the ratio of the baseline's execution time to the proposed method's execution time. A speedup of $k$ means the proposed method is $k$ times faster.
Mathematical Formula: $ \text{Speedup} = \frac{TTFT_{Baseline}}{TTFT_{ProposedMethod}} $

Where:
- $TTFT_{Baseline}$: The Time-to-First-Token of the baseline method (e.g., FlashAttention2).
- $TTFT_{ProposedMethod}$: The Time-to-First-Token of the proposed method (SampleAttention).

Symbol Explanation:
- $TTFT_{Baseline}$: The TTFT of the comparison method.
- $TTFT_{ProposedMethod}$: The TTFT of the SampleAttention method.
5.2.4. Sparsity Ratio
Conceptual Definition: The sparsity ratio measures the percentage of attention scores that are masked or not computed. A higher sparsity ratio implies greater computational savings, but must be balanced against accuracy.
Mathematical Formula: $ \text{Sparsity Ratio} = \frac{\text{Number of masked attention scores}}{\text{Total possible attention scores}} \times 100\% $

Where:
- Number of masked attention scores: The count of query-key pairs where the attention weight is set to (effectively) zero by the mask.
- Total possible attention scores: The total number of query-key pairs, which is $S_q \times S_k$ for a single head.
5.3. Baselines
SampleAttention is compared against full attention and several representative sparse attention methods. All experiments are conducted on a single NVIDIA-A100 GPU (80GB).
- Full Attention (Gold Baseline):
  - Description: The standard attention mechanism without any sparsification. It serves as the upper bound for accuracy and the reference for FlashAttention2 speedup comparisons.
- FlashAttention2 (Dao, 2023):
  - Description: An optimized implementation of full attention that significantly improves speed and memory efficiency, but still has $O(S^2)$ computational complexity. This is the primary baseline for speedup comparisons for SampleAttention.
- MInference (Jiang et al., 2024):
  - Description: A dynamic sparse attention method that uses pre-profiling to determine optimal predefined sparse patterns for each attention head.
  - Representativeness: Represents state-of-the-art dynamic sparse methods that attempt to adapt patterns.
- BigBird (Zaheer et al., 2020):
  - Description: A static sparse attention method that combines local window attention, global tokens, and random attention.
  - Settings: Assigned a window size ratio of 8% and a global ratio of 8%.
  - Representativeness: Represents prominent static sparse methods.
- StreamingLLM (Xiao et al., 2023b):
  - Description: A method for efficient long-context LLM inference that maintains attention sinks and a fixed number of recent tokens. Primarily for decoding, but evaluated for prefill performance in this context.
  - Settings: Sets its initial attention sink at 4 tokens.
  - Representativeness: Represents methods focused on efficient KV cache management and streaming.
- HyperAttention (Han et al., 2023):
  - Description: Utilizes Locality Sensitive Hashing (LSH) to identify important entries on the attention map.
  - Settings: Both the bucket size and the number of sampled columns are set to 256.
  - Representativeness: Represents LSH-based approximation methods.
- Hash-Sparse (Pagliardini et al., 2023):
  - Description: Another sparse attention method that likely uses hashing to approximate attention. (Specific details on its mechanism are not elaborated in this paper, but it is listed as a baseline.)
  - Representativeness: A generalized sparse attention baseline.

For SampleAttention's evaluation, hyperparameters are tuned using small-scale tasks from the RULER benchmark at specific lengths (e.g., 16K, 32K, 64K, 128K). These tuned parameters are then applied to tasks across different length ranges to achieve an optimal balance.
5.4. Backbones
The SampleAttention method is evaluated on three widely used open-source LLM variants:
- ChatGLM4-9B: Based on GLM (Du et al., 2021; GLM et al., 2024), with a 1M context window.
- YI-9B: Features a 200K context window (Young et al., 2024).
- InternLM2-7B: Also with a 200K context window (Cai et al., 2024).

All models are decoder-only transformers (Radford et al., 2018), pretrained via causal language modeling. They share architectural components like rotary positional encoding (Su et al., 2024) and grouped-query attention (Ainslie et al., 2023). Notably, some models achieve long context through continued training with extended sequences (e.g., ChatGLM4), while others use RoPE scaling for length extrapolation (e.g., YI, InternLM2).
SampleAttention and baseline methods only replace the full-attention implementation during the prompt prefill stage. The KV cache remains uncompressed, and dense attention is used in the decoding phase. This isolates the evaluation to prefill latency reduction.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results highlight SampleAttention's superior accuracy-efficiency trade-off and significant TTFT reduction compared to existing sparse attention methods and FlashAttention2.
The following figure (Figure 8 from the original paper) illustrates the trade-offs between accuracy and speedup for various methods:

Analysis of Figure 8 (Accuracy-Efficiency Trade-off):
Figure 8 (from the original paper) presents a Pareto frontier analysis, comparing the relative speedup (vs. FlashAttention2) against the accuracy for different sparse attention methods on the RULER benchmark.
- SampleAttention consistently achieves scores above 99% of full attention accuracy, demonstrating its "near-lossless" efficiency across various models and sequence lengths.
- Crucially, SampleAttention establishes a new Pareto frontier, indicating that it achieves higher speedups for a given accuracy level, or higher accuracy for a given speedup, compared to existing methods.
- For example, in the GLM4-9B model (top row):
  - At 32K context, SampleAttention achieves near full-attention accuracy with a speedup of approximately 1.7x. MInference also achieves high accuracy but with no speedup. BigBird offers some speedup but with noticeable accuracy degradation. StreamingLLM and HyperAttention show significant accuracy drops despite some speedup.
  - The trend holds for longer contexts (64K and 128K), where SampleAttention consistently delivers the best combination of high accuracy and considerable speedup.
- The results show that methods like StreamingLLM and HyperAttention suffer from significant performance degradation across tasks, suggesting their inability to capture critical KV elements during the prefill stage effectively. BigBird shows varying degradation but provides a relatively stable speedup due to its static pattern.

The following are the results from Table 2 of the original paper:
| Benchmark | Baseline | Single-Doc QA | Multi-Doc QA | Summarization | Few-shot Learning | Synthetic Tasks | Code Completion | Total Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LongBench | Full Attention | 213.12 | 174.35 | 109.69 | 273.87 | 231.49 | 121.52 | 1124.04 |
| | Ours | 214.53 | 174.42 | 108.92 | 278.33 | 234.55 | 125.18 | 1135.93 |
| | MInference | 212.14 | 173.37 | 110.02 | 274.45 | 231.87 | 124.37 | 1126.22 |
| | BigBird | 207.57 | 146.45 | 95.64 | 272.17 | 161.60 | 117.38 | 1000.81 |
| | StreamingLLM | 142.79 | 129.36 | 89.71 | 168.13 | 19.70 | 98.43 | 648.12 |
| | HyperAttention | 125.74 | 119.08 | 88.05 | 206.35 | 32.69 | 86.35 | 658.26 |

| Benchmark | Baseline | En.Sum | En.QA | En.MC | En.Dia | Zh.QA | Code.Debug | Math.Find | Retr.PassKey | Retr.Number | Retr.KV |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| InfiniteBench | Full Attention | 28.30 | 12.17 | 58.95 | 34.00 | 13.22 | 30.71 | 37.71 | 100 | 100 | 44.0 |
| | Ours | 28.30 | 16.52 | 61.57 | 31.50 | 14.28 | 31.40 | 37.14 | 100 | 100 | 49.6 |
| | MInference | 28.00 | 11.39 | 60.26 | 28.70 | 14.81 | 31.70 | 39.43 | 100 | 100 | 43.0 |
Analysis of Table 2 (Accuracy on LongBench and InfiniteBench):
Table 2 details the accuracy scores on LongBench and InfiniteBench for SampleAttention ("Ours") and various baselines.
- LongBench: SampleAttention (Ours) consistently achieves scores comparable to or even slightly exceeding Full Attention across almost all task types (e.g., Single-Doc QA, Multi-Doc QA, Few-shot Learning, Synthetic Tasks, Code Completion). Its total score (1135.93) is even slightly higher than Full Attention's (1124.04), potentially due to subtle regularization effects or simply statistical noise. MInference also shows strong accuracy, very close to Full Attention, but as noted in Figure 8, it does not provide speedup. BigBird shows a noticeable drop in accuracy compared to Full Attention (e.g., Multi-Doc QA 146.45 vs. 174.35, Summarization 95.64 vs. 109.69). StreamingLLM and HyperAttention exhibit significant accuracy degradation, especially in Synthetic Tasks and Single-Doc QA, demonstrating their limitations for prefill accuracy.
- InfiniteBench: For tasks specifically designed for extremely long contexts, SampleAttention again maintains performance very close to Full Attention. For some tasks like En.QA, Retr.Number, and Retr.KV, SampleAttention even marginally outperforms Full Attention. MInference also performs well on InfiniteBench, similar to SampleAttention, maintaining high accuracy. These results robustly demonstrate that SampleAttention achieves its "near-lossless" accuracy claim across diverse tasks and extremely long sequence lengths.
The following figure (Figure 11 from the original paper) presents the acceleration speedup benchmarking:

Analysis of Figure 11 (Acceleration Speedup Benchmarking):
- Figure 11(a) - Sampling Overhead: This chart shows the time breakdown for a full 40-layer model of ChatGLM4-9B across sequence lengths from 32K to 128K. It differentiates between time spent on sampling (dynamic selection of sparse patterns) and sparse computation.
  - As the sequence length increases, the relative contribution of sampling overhead to the total execution time diminishes. This is crucial because it implies that the overhead of dynamic pattern identification, while present, becomes proportionally smaller for the scenarios where SampleAttention provides the most benefit (very long contexts).
  - For shorter sequence lengths, the sampling overhead can be a larger fraction, which explains why the performance gains are less pronounced for short sequences (e.g., at 8K, SampleAttention latency is similar to FlashAttention2).
- Figure 11(b) - TTFT Speedup: This chart compares the TTFT of SampleAttention against FlashAttention2 as the baseline, scaling up to 1M sequence length.
  - TTFT is significantly reduced with SampleAttention.
  - At 1M sequence length, SampleAttention achieves a remarkable TTFT reduction compared to FlashAttention2. This is a substantial speedup, enabling much faster initial responses for extremely long contexts.
  - The speedup increases with sequence length, demonstrating the scalability and effectiveness of SampleAttention for the target use case.
6.2. Ablation Studies and Parameter Analysis
The paper conducts ablation studies and analyzes the impact of SampleAttention's hyperparameters, specifically the CRA thresholds $\alpha_c$ and $\alpha_s$, and the number of sampling chunks chunk_n.
The following figure (Figure 9 from the original paper) illustrates the impact of CRA thresholds:
The image is a heatmap showing the influence of different $\alpha_c$ and $\alpha_s$ values on accuracy, in three panels: (a) GLM4-9B 32K, (b) GLM4-9B 128K, (c) YI-9B 128K. Color intensity encodes accuracy, and each cell is annotated with the corresponding score and its percentage change.
Analysis of Figure 9 (Impact of $\alpha_c$ and $\alpha_s$ on Accuracy):
Figure 9 displays heatmaps showing how varying the CRA thresholds for column ($\alpha_c$) and slash ($\alpha_s$) patterns influences accuracy under different conditions (GLM4-9B at 32K and 128K, and YI-9B at 128K).
- General Trend: In all cases, increasing either $\alpha_c$ or $\alpha_s$ generally improves accuracy. This is expected, as higher thresholds imply retaining more attention mass, reducing sparsity but improving attention recall.
- Trade-off: This improvement comes at the expense of increased computational load (lower speedup), as more key-value pairs are selected.
- Model/Pattern Specificity:
  - The YI model at 128K (Figure 9(c)) shows significantly greater sensitivity to changes in the slash threshold $\alpha_s$, indicating that slash patterns may be more critical for maintaining accuracy in this specific model/context.
  - Figure 9(a) (GLM4-9B at 32K) demonstrates a more balanced response to changes in both thresholds, suggesting column and slash patterns contribute more equally.
- Efficiency at Lower Thresholds: Figure 9(b) (GLM4-9B at 128K) shows that even smaller thresholds can deliver sufficiently good scores, allowing for higher speedup while maintaining acceptable accuracy. This emphasizes the importance of tuning for the optimal balance.

The following are the results from Table 3 of the original paper:
| Model | $(\alpha_c, \alpha_s)$ | chunk_n = 1 | chunk_n = 2 | chunk_n = 4 | chunk_n = 6 |
| --- | --- | --- | --- | --- | --- |
| GLM (128K) | (0.90, 0.90) | 82.89 / 1.92 | 84.70 / 1.89 | 84.02 / 1.70 | 83.62 / 1.53 |
| GLM (128K) | (0.95, 0.95) | 84.17 / 1.64 | 84.04 / 1.60 | 83.14 / 1.46 | 83.73 / 1.33 |
| Yi (128K) | (0.95, 0.95) | 52.81 / 2.17 | 54.54 / 2.12 | 53.10 / 1.89 | 54.58 / 1.71 |
| Yi (128K) | (0.98, 0.98) | 56.24 / 1.29 | 58.58 / 1.25 | 59.37 / 1.21 | 59.36 / 1.13 |

Entries are reported as accuracy / speedup.
Analysis of Table 3 (Impact of chunk_n):
Table 3 investigates the impact of chunk_n (1, 2, 4, 6) on accuracy (score) and speedup for GLM (128K) and YI (128K) under fixed CRA thresholds.

- Optimal chunk_n: An appropriate chunk count can yield more cost-effective results. For instance, for GLM (128K) at $(\alpha_c=0.90, \alpha_s=0.90)$, increasing chunk_n from 1 (82.89/1.92) to 2 (84.70/1.89) enhances accuracy (from 82.89 to 84.70) with only a slight reduction in speedup (from 1.92x to 1.89x). This indicates that 2 chunks provide a better representation of the attention pattern without significantly increasing overhead.
- Excessive chunk_n: However, an excessively large chunk_n may not further improve accuracy and can decrease speedup due to increased sampling overhead. For example, for GLM (128K) with $(\alpha_c=0.90, \alpha_s=0.90)$, increasing chunk_n from 2 to 4 or 6 slightly reduces accuracy and consistently reduces speedup. This highlights the trade-off inherent in sampling: more samples mean more accurate pattern detection but also more computation.

The following are the results from Table 4 of the original paper:
| Sequence Length | Average Sparsity in ChatGLM-6B | Average Sparsity in InternLM-7B |
| --- | --- | --- |
| 4K | 88.00% | 91.13% |
| 8K | 90.74% | 92.72% |
| 16K | 92.52% | 93.89% |
| 32K | 93.88% | 94.83% |
| 64K | 94.89% | 95.89% |
| 128K | 95.84% | 96.67% |
Analysis of Table 4 (Sparsity Analysis for "Needle in a Haystack" task):
This table quantifies the inherent sparsity in ChatGLM-6B and InternLM-7B as sequence length increases.
- Increasing Sparsity with Length: As the sequence length increases, the average sparsity ratio consistently increases for both models. For ChatGLM-6B, it goes from 88.00% at 4K to 95.84% at 128K. For InternLM-7B, it increases from 91.13% to 96.67%.
- Implication for Optimization: This finding provides strong empirical evidence for the potential of sparse attention methods. It suggests that for longer contexts, a smaller proportion of KV elements is truly essential, validating the core premise of SampleAttention. The paper notes that with each doubling of length, the proportion of needed KV elements (to maintain the same CRA threshold) decreases by approximately 20%.

The following are the results from Table 5 of the original paper:
| Head | 2.5% / 100% | 2.5% / 0.4% | 10% / 100% | 10% / 0.4% | 20% / 100% | 20% / 0.4% | 40% / 100% | 40% / 0.4% | 80% / 100% | 80% / 0.4% |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| HEAD-1 | 16.35% | 12.74% | 26.91% | 23.33% | 45.99% | 42.14% | 58.21% | 55.34% | 96.30% | 93.65% |
| HEAD-2 | 55.43% | 48.40% | 63.89% | 58.63% | 85.92% | 81.98% | 89.07% | 84.21% | 99.15% | 98.08% |
| HEAD-3 | 93.20% | 90.44% | 98.32% | 97.62% | 99.14% | 98.43% | 99.41% | 99.12% | 99.98% | 99.66% |

Column labels are "ratio of top-k / sampling ratio"; entries are the resulting CRA.
Analysis of Table 5 (Effectiveness of Sampling):
This table validates the effectiveness of SampleAttention's chunked sampling method by comparing the CRA achieved using full sampling (100% sampling ratio) against a low sampling ratio (0.4%, with ) across different top-k stripe ratios for three different heads.
- Accurate Approximation: The CRA obtained with the low 0.4% sampling ratio is consistently very close to the CRA obtained from the full attention scores. For instance, in HEAD-3 at a 40% top-k ratio, full sampling yields 99.41% CRA while 0.4% sampling yields 99.12%, a negligible difference.
- Decreasing Difference with More Stripes: The gap in CRA between the two sampling ratios shrinks as the ratio of top-k stripes increases, implying that as more important stripes are selected, the approximation becomes even more accurate.
- Efficiency and Stability: This demonstrates that chunked sampling is both simple and efficient for accurately estimating CRA, providing a reliable basis for the Score-Based Key-Value Filtering stage. A toy sketch of the sampling idea appears after this list.
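As a rough illustration of the chunked-sampling idea validated in Table 5, the sketch below estimates per-key (column) importance from a few contiguous query chunks and checks how much of the full attention mass the estimated top-k columns cover. It is a simplified toy (no causal mask, random tensors, illustrative function names), not SampleAttention's actual kernel:

```python
import torch

def sampled_column_scores(q, k, num_chunks=2, chunk_size=64):
    """Estimate per-key importance from a few contiguous query chunks
    instead of computing the full quadratic attention map."""
    n, d = q.shape
    starts = torch.linspace(0, n - chunk_size, num_chunks).long()
    rows = torch.cat([torch.arange(int(s), int(s) + chunk_size) for s in starts])
    probs = torch.softmax(q[rows] @ k.T / d ** 0.5, dim=-1)  # (sampled rows, n)
    return probs.mean(dim=0)                                 # column profile

def topk_coverage(est_scores, full_scores, top_ratio=0.4):
    """CRA-style check: attention mass that the estimated top-k columns
    cover under the full (reference) column profile."""
    k_top = max(1, int(top_ratio * est_scores.numel()))
    idx = est_scores.topk(k_top).indices
    return (full_scores[idx].sum() / full_scores.sum()).item()

n, d = 4096, 64
q, k = torch.randn(n, d), torch.randn(n, d)
full = torch.softmax(q @ k.T / d ** 0.5, dim=-1).mean(dim=0)    # reference profile
est = sampled_column_scores(q, k, num_chunks=2, chunk_size=64)  # ~3% of the rows
print(f"coverage of sampled top-40% columns: {topk_coverage(est, full):.3f}")
```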

Analysis of Figure 10 (Cross-Task Robustness):
Figure 10 demonstrates the cross-task robustness of SampleAttention's hyperparameters. It shows results for GLM4-9B (a) and InternLM2-7B (b), where hyperparameters were tuned offline on subsets of the RULER benchmark and then applied to LongBench and InfiniteBench tasks whose sequence lengths fall within the same range.
- Generalization: Hyperparameters tuned on a subset of tasks (from RULER) generalize effectively across diverse benchmarks (LongBench, InfiniteBench). This is a crucial finding, indicating that SampleAttention's tuning method produces robust configurations.
- Near-Lossless Performance: With accuracy-optimized settings, GLM4 and InternLM2 maintained near-lossless performance on different tasks from LongBench and InfiniteBench, confirming robust cross-domain adaptability.
- Consistent Speedup: Consistent speedup gains were observed for the same model under identical sequence lengths across tasks, further validating the tuning approach. This implies that users do not need to retune for every new task within a sequence length range, simplifying adoption.
Summary of Hyperparameter Tuning Results (Table 6 from Appendix): The following are the results from Table 6 of the original paper:
| range of sequence length | < 16K | < 16K | [16K, 48K) | [16K, 48K) | [48K, 80K) | [48K, 80K) | [80K, 112K) | [80K, 112K) | >= 112K | >= 112K |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| model (Acc / Spd) | GLM4 | InternLM2 | GLM4 | InternLM2 | GLM4 | InternLM2 | GLM4 | InternLM2 | GLM4 | InternLM2 |
| CRA Column | 0.98 / 0.85 | 0.98 / 0.85 | 0.95 / 0.85 | 0.95 / 0.80 | 0.95 / 0.80 | 0.92 / 0.80 | 0.90 / 0.80 | 0.92 / 0.80 | 0.95 / 0.80 | 0.95 / 0.80 |
| CRA Slash | 0.90 / 0.85 | 0.95 / 0.85 | 0.95 / 0.80 | 0.90 / 0.80 | 0.95 / 0.80 | 0.92 / 0.80 | 0.90 / 0.80 | 0.90 / 0.80 | 0.85 / 0.80 | 0.90 / 0.80 |
| Num of Chunks | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 1 / 1 | 2 / 1 | 2 / 1 | 1 / 1 | 2 / 1 |

Each cell lists the accuracy-optimized / speed-optimized value.
Table 6 (from Appendix) provides the specific hyperparameter settings (CRA Column (\alpha_c), CRA Slash (\alpha_s), and the number of sampled chunks) for GLM4 and InternLM2 across different sequence length ranges, optimized for the highest speedups while retaining at least 90% accuracy.
- The values show dynamic adaptation: for instance, (\alpha_c) and (\alpha_s) tend to vary across length ranges and between models, reflecting the adaptive nature of sparsity.
- The number of sampled chunks is often set to 1, indicating that even a single chunk can be effective, but for some longer ranges (e.g., [80K, 112K) for GLM4 and InternLM2, and >= 112K for InternLM2), it is increased to 2, suggesting the need for more granular sampling to maintain accuracy at greater lengths. A sketch of how such offline-tuned settings could be looked up at runtime follows below.
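To illustrate how offline-tuned settings like these might be consumed at inference time, here is a hypothetical lookup sketch populated with the accuracy-optimized GLM4 values as reconstructed in Table 6 above. The constant names, the 1K = 1024-token convention, and the lookup API are my own assumptions, not the paper's implementation:

```python
from bisect import bisect_right

# Upper bound of each sequence-length range (assuming 1K = 1024 tokens)
# mapped to (alpha_c, alpha_s, num_chunks); accuracy-optimized GLM4 column of Table 6.
GLM4_ACC_SETTINGS = [
    (16 * 1024,    (0.98, 0.90, 1)),  # < 16K
    (48 * 1024,    (0.95, 0.95, 1)),  # [16K, 48K)
    (80 * 1024,    (0.95, 0.95, 1)),  # [48K, 80K)
    (112 * 1024,   (0.90, 0.90, 2)),  # [80K, 112K)
    (float("inf"), (0.95, 0.85, 1)),  # >= 112K
]

def settings_for_length(seq_len: int):
    """Pick the (alpha_c, alpha_s, num_chunks) tuple for a given prompt length."""
    bounds = [upper for upper, _ in GLM4_ACC_SETTINGS]
    return GLM4_ACC_SETTINGS[bisect_right(bounds, seq_len)][1]

print(settings_for_length(100_000))  # -> (0.90, 0.90, 2) for a ~100K-token prompt
```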
6.3. Qualitative Analysis (Attention Visualizations)
The following figures (Figures 12 and 13 from the original paper) show visualizations of attention patterns:
The figure is a schematic showing attention patterns at different layers (Layer 0, Layer 4, Layer 8, Layer 12) as the sequence length varies. Each subplot presents the corresponding layer's attention matrix, where the color variations reflect attention intensity and sparsity.

Analysis of Figures 12 & 13 (Visualization of Attention Score):
Figures 12 and 13 display heatmaps of sparse attention patterns across different attention heads in ChatGLM3-6B (28 layers × 32 heads) for a 61K sequence length. These visualizations are based on row-by-row filtering with a fixed CRA threshold.
- Column Stripes: Figure 12 (Layer 0, Head 10) clearly shows prominent column stripes, particularly towards the beginning of the context. These embody crucial global contextual information (e.g., attention sinks or important initial tokens).
- Slash Stripes: Figure 12 (Layer 4, Head 19) and Figure 13 (Layer 16, Head 4) illustrate slash patterns, which maintain connections between contexts at regular intervals, typically capturing local window patterns or dependencies over recent tokens.
- Hybrid Patterns: Figure 12 (Layer 8, Head 24) and Figure 13 (Layer 20, Head 29) demonstrate intricate combinations of both column and slash patterns on a single head. This provides strong visual evidence for the paper's claim that attention patterns are complex and often a composition of these two primary types.
- Head-Specific Variation: The visualizations clearly show that the specific sparse patterns vary significantly across different attention heads and layers, reinforcing the need for an adaptive, head-specific approach like SampleAttention.

These visualizations serve as strong qualitative evidence supporting the paper's core hypothesis about the existence of dynamic and structured column and slash patterns in LLM attention, which SampleAttention is designed to exploit.
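For readers who want to reproduce heatmaps like Figures 12 and 13 for their own models, the following is a small plotting sketch (my own helper, not the paper's code): it binarizes a single head's attention map by keeping only the largest entries, which makes column and slash stripes stand out. Here it is fed a random toy causal map, so it will not show the specific stripes discussed above; in practice `attn` would come from a forward hook on one attention head.

```python
import torch
import matplotlib.pyplot as plt

def plot_attention_pattern(attn: torch.Tensor, keep: float = 0.02, ax=None):
    """Show the positions of the top `keep` fraction of attention entries,
    which highlights column and slash stripes in a single head's map."""
    ax = ax or plt.gca()
    flat = attn.flatten()
    thresh = flat.topk(max(1, int(keep * flat.numel()))).values[-1]
    ax.imshow((attn >= thresh).float().numpy(), cmap="Greys", interpolation="nearest")
    ax.set_xlabel("key position")
    ax.set_ylabel("query position")

# Toy causal attention map (random weights); replace with a hooked head's map.
n, d = 512, 32
q, k = torch.randn(n, d), torch.randn(n, d)
scores = q @ k.T / d ** 0.5
causal_mask = torch.triu(torch.ones(n, n, dtype=torch.bool), diagonal=1)
scores.masked_fill_(causal_mask, float("-inf"))
plot_attention_pattern(torch.softmax(scores, dim=-1))
plt.show()
```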
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper, "SampleAttention: Near-Lossless Acceleration of Long Context LLM Inference with Adaptive Structured Sparse Attention", successfully addresses the critical challenge of high Time-to-First-Token (TTFT) latency in Large Language Models (LLMs) operating with extremely long context windows. The authors identify that the inherent sparsity of the attention mechanism can be leveraged, but only if the adaptive sparsity ratios and dynamic structured patterns (primarily column and slash stripes) across different attention heads and input contents are effectively captured at runtime.
SampleAttention is proposed as an innovative solution that adaptively determines these sparse patterns. Its key contributions include the introduction of Cumulative Residual Attention (CRA) as a robust indicator for guiding accuracy-efficiency trade-offs, and a novel two-stage query-guided key-value filtering approach. This two-stage process efficiently estimates the attention distribution through chunked sampling and then filters a minimal set of critical key-value blocks based on decomposed CRA thresholds for column and slash patterns. Complemented by automated hyperparameter tuning and hardware-efficient FlashAttention2 kernels, SampleAttention achieves remarkable performance. Experimental results demonstrate that it can seamlessly replace vanilla attention in off-the-shelf LLMs (ChatGLM, Yi, InternLM) with consistently near-lossless accuracy (often >99% of full attention) while significantly reducing TTFT for contexts of up to 1 million tokens.
7.2. Limitations & Future Work
The paper implicitly or explicitly points to several limitations and potential avenues for future work:
- Sampling Overhead for Short Sequences: As shown in Figure 11(a), for shorter sequence lengths (e.g., 8K), the sampling overhead of SampleAttention can be significant enough that its latency remains nearly identical to FlashAttention2. This implies that SampleAttention's benefits are primarily realized for very long contexts, and it may not be the optimal choice for shorter prefill prompts.
- Hyperparameter Tuning Complexity: While the paper proposes an automated offline tuning method, it still requires a small profiling dataset and multi-task tuning across different length ranges. This process, though automated, adds an initial setup cost for each new model or architecture. The paper mentions "cost-effective threshold values," implying a balance that might not be universally simple to find for all scenarios.
- Orthogonality with KV Cache Compression: The paper explicitly states that SampleAttention focuses on prefill computation overhead and is orthogonal to KV cache eviction or compression approaches. This means that while SampleAttention improves TTFT, it does not directly address the memory consumption of the KV cache during the decoding phase.
- Generalization of CRA: While CRA is shown to correlate with accuracy, its universal applicability and optimal threshold setting across all possible LLM tasks and architectures might require further theoretical backing or extensive empirical validation beyond the benchmarks presented.
- Dynamic Chunk Count: Currently, the number of sampled chunks is a tuned hyperparameter. Future work could explore making it dynamically adaptive at runtime, similar to (\alpha_c) and (\alpha_s), to further optimize the sampling overhead.
7.3. Personal Insights & Critique
The paper presents a highly practical and effective solution to a major bottleneck in long context LLM inference.
Personal Insights:
- Principled Accuracy Guidance: The introduction of CRA as a robust indicator of attention recall is a significant insight. It provides a principled, quantifiable way to navigate the accuracy-efficiency trade-off, moving beyond heuristic or fixed sparse patterns. This is a crucial differentiator from many prior sparse attention methods that often sacrificed accuracy arbitrarily.
- Adaptive Structured Sparsity: The empirical observation that attention patterns are head-specific, content-aware, and model-aware, and often manifest as combinations of column and slash stripes, is very valuable. SampleAttention's ability to dynamically identify and combine these patterns at runtime is a robust approach. The visualizations in the appendix strongly support this claim.
- Practicality and Generalizability: The method's ability to seamlessly replace vanilla attention in off-the-shelf LLMs without additional pretraining or finetuning is a huge advantage for practical adoption. The cross-task robustness of tuned hyperparameters further enhances its usability.
- Hardware-Aware Design: Integrating with FlashAttention2 and implementing operator fusion demonstrates a deep understanding of practical system-level optimization, ensuring that the theoretical gains translate into real-world wall-clock speedups.
Critique / Areas for Improvement:
- Defining CRA more concretely: While the definition is given, a more precise mathematical formulation of CRA itself (beyond "minimum sum of remaining attention probabilities") would benefit theoretical rigor: how exactly is the "remaining attention probability" computed after sparsification, and how does it relate to the softmax in the sparse attention formula? (One possible formalization is sketched after this list.)
- Clarity on block_reduction and merge_index: Algorithm 1 is concise but lacks specific details on the block_reduction and merge_index functions. For a truly beginner-friendly explanation, the exact operations (e.g., sum, max, or average for reduction; logical OR for merging) would be helpful.
- Sensitivity to the blk parameter: The block size blk is mentioned in Algorithm 1 but not explicitly discussed as a hyperparameter or tuned. Its impact on the granularity of pattern detection and on overhead could be significant.
- Comparison to Hash-Sparse: Hash-Sparse is listed as a baseline, but its performance is not explicitly detailed in the main Result & Analysis tables (Table 2 and Figure 8 do not include it). Providing its numbers would complete the comparison with all listed baselines.
- Energy Efficiency: While speedup is crucial, long context inference also has significant energy consumption implications. An analysis of SampleAttention's energy efficiency compared to FlashAttention2 and other baselines would be a valuable addition.
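On the first critique point, one plausible formalization consistent with the quoted phrase "minimum sum of remaining attention probabilities" is sketched below. This is my own reading for illustration, not an equation taken from the paper; here A is the full softmax attention matrix and S_i is the set of key indices retained for query i after sparsification:

```latex
% Illustrative reading of CRA, not the paper's own equation:
% the worst-case (minimum over query rows) retained attention mass.
A_{ij} = \operatorname{softmax}\!\left(\frac{Q_i K^{\top}}{\sqrt{d}}\right)_{j},
\qquad
\mathrm{CRA}(S) = \min_{1 \le i \le n} \sum_{j \in S_i} A_{ij}
```

Under this reading, enforcing CRA(S) ≥ α would guarantee that every query row keeps at least an α fraction of its attention mass, which is consistent with how the thresholds (\alpha_c) and (\alpha_s) are used as accuracy knobs.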
Overall, SampleAttention is an impressive piece of work that makes significant strides in accelerating long context LLM inference while maintaining high accuracy, pushing the boundaries of what is possible for real-time, high-context LLM applications. Its adaptive and principled approach provides a strong foundation for future research in sparse attention.