
Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

Published: 02/05/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces the `Twilight` framework, which employs `top-p` sampling for adaptive attention sparsity to accelerate long-context LLMs. Results demonstrate up to 98% token pruning, achieving a 15.4x speedup in self-attention and a 3.9x speedup in end-to-end latency without compromising accuracy.

Abstract

Leveraging attention sparsity to accelerate long-context large language models (LLMs) has been a hot research topic. However, current algorithms such as sparse attention or key-value (KV) cache compression tend to use a fixed budget, which presents a significant challenge during deployment because it fails to account for the dynamic nature of real-world scenarios, where the optimal balance between accuracy and efficiency can vary greatly. In this paper, we find that borrowing top-$p$ sampling (nucleus sampling) to sparse attention can surprisingly achieve adaptive budgeting. Based on this, we propose Twilight, a framework to bring adaptive sparsity to any existing sparse attention algorithm without sacrificing their accuracy. Empirical results show that Twilight can adaptively prune at most 98% of redundant tokens, leading to $15.4\times$ acceleration in self-attention operations and $3.9\times$ acceleration in end-to-end per token latency in long context LLM decoding.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Twilight: Adaptive Attention Sparsity with Hierarchical Top-$p$ Pruning

1.2. Authors

Chaofan Lin, Jiaming Tang, Shuo Yang, Hanshuo Wang, Tian Tang, Boyu Tian, Ion Stoica, Song Han, Mingyu Gao.

Affiliations: Tsinghua University, Massachusetts Institute of Technology, University of California, Berkeley.

1.3. Journal/Conference

This paper was published as a preprint on arXiv (the listing gives "Published at (UTC): 2025-02-04T23:26:10.000Z"). ArXiv is a widely recognized platform for sharing academic research prior to, or in parallel with, formal peer review and publication in journals or conference proceedings, allowing rapid dissemination of scientific findings.

1.4. Publication Year

2025

1.5. Abstract

The paper addresses the challenge of accelerating long-context large language models (LLMs) by leveraging attention sparsity. Existing sparse attention algorithms and key-value (KV) cache compression methods typically rely on a fixed budget for token selection, which is problematic in real-world deployments due to the dynamic nature of accuracy-efficiency trade-offs. The authors propose that adopting the concept of top-p sampling (also known as nucleus sampling), originally used in LLM generation, can surprisingly achieve adaptive budgeting for sparse attention. Based on this insight, they introduce Twilight, a framework designed to bring adaptive sparsity to any existing sparse attention algorithm without compromising accuracy. Empirical results demonstrate that Twilight can adaptively prune up to 98% of redundant tokens, leading to a 15.4x acceleration in self-attention operations and a 3.9x acceleration in end-to-end per token latency during long context LLM decoding.

https://arxiv.org/abs/2502.02770

https://arxiv.org/pdf/2502.02770v5.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the inefficiency of long-context large language models (LLMs), particularly during the decoding stage, due to the high computational and memory costs associated with the attention mechanism. Specifically, the key-value (KV) cache grows rapidly with longer token sequences, leading to significant latency overheads from memory accesses and increased GPU memory consumption.

This problem is highly important in the current field because LLMs with long-context capabilities are crucial for a wide range of advanced applications, including retrieval-based tasks, document summarization, code generation, video language models (VLMs), and large reasoning models that require extensive chain-of-thought (CoT) reasoning. The increasing demand for models supporting context windows up to 1 million or 10 million tokens further highlights the necessity of efficient long-context processing.

A significant challenge in prior research on attention sparsity (or KV cache sparsity) is the reliance on a fixed budget for selecting "critical tokens." This top-k approach presents two major issues:

  1. Dynamic Optimal Budget: The ideal number of tokens to select (budget B) varies dynamically at runtime. Different attention heads exhibit different attention weight distributions—some are "focused" (peaked, requiring fewer tokens), while others are "diffuse" (flat, requiring more tokens). A fixed budget leads to either over-selection (wasting computation) or under-selection (losing accuracy).

  2. Algorithm-Specific Over-selection: Existing sparse attention algorithms require different degrees of over-selection to compensate for inaccuracies in estimating token importance, necessitating tedious offline calibration for each algorithm.

    The paper's entry point or innovative idea is to draw an analogy between this top-k budget selection dilemma in sparse attention and a similar problem previously encountered in LLM sampling. Just as top-p sampling (also known as nucleus sampling) addresses the limitations of top-k sampling in generating diverse and coherent text, the authors propose that applying top-p principles to sparse attention can enable efficient and adaptive budget decisions by directly imposing a threshold on the cumulative sum of attention weights.

2.2. Main Contributions / Findings

The primary contributions of this paper are:

  1. Identification of the top-k budget dilemma: The paper thoroughly investigates the critical challenge of selecting an optimal, fixed budget in top-k sparse attention due to the dynamic nature of attention weight distributions.

  2. Introduction of top-p sampling for sparse attention: It proposes using top-p sampling to dynamically determine the KV cache budget at runtime, offering a more intrinsic and adaptive approach compared to top-k.

  3. Proposal of Twilight framework: The authors introduce Twilight, a hierarchical KV cache pruning framework with a Select-then-Prune architecture. This framework enhances any existing sparse attention algorithm with adaptive budget selection capabilities without sacrificing accuracy, positioning it as an optimizer for existing methods rather than a new sparse attention algorithm itself.

  4. Efficient kernel implementations: Twilight includes optimized kernel implementations for its SpGEMV (Sparse General Matrix-Vector Multiplication) operations with 4-bit quantization of the Key cache and an efficient top-p mechanism via binary search, along with load balancing for head dynamism.

    The key conclusions and findings are:

  • Twilight can adaptively prune up to 98% of redundant tokens with nearly no accuracy loss across both long- and medium-context benchmarks.

  • It achieves significant efficiency improvements:

    • Up to 15.8x speedup for the self-attention operator compared to FlashAttention2.
    • 1.4x speedup over state-of-the-art sparse attention mechanisms (e.g., Quest).
    • 3.9x acceleration in end-to-end per token latency in long context LLM decoding compared to FlashInfer.
    • 1.35x speedup in end-to-end decoding compared to Quest without Twilight.
  • Twilight offers a more reasonable and tunable hyperparameter $p$ compared to the highly sensitive $k$ in top-k methods.

  • The hierarchical architecture and 4-bit quantization of the K cache contribute to its efficiency, especially in memory-bound scenarios.

    These findings address the problem of dynamic budget selection in sparse attention, significantly improving the efficiency of long-context LLMs while maintaining accuracy, making their deployment more practical and cost-effective.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand the paper, several fundamental concepts related to Large Language Models (LLMs) and their attention mechanisms are essential:

  • Large Language Models (LLMs): LLMs are deep learning models, typically based on the Transformer architecture, trained on vast amounts of text data to understand, generate, and process human language. They have billions of parameters and excel at various Natural Language Processing (NLP) tasks. The paper focuses on long-context LLMs, which are models designed to process and reason over very long sequences of text (e.g., tens of thousands to millions of tokens), crucial for tasks like summarizing long documents or complex reasoning.

  • Attention Mechanism: The attention mechanism is a core component of Transformers. It allows the model to weigh the importance of different parts of the input sequence when processing each token. Instead of processing a fixed-size window of text, attention enables the model to "attend" to relevant parts of the entire input, regardless of their position. This is particularly important for long contexts where dependencies can span many tokens.

    The fundamental self-attention calculation for a single head is often represented as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing linear projections of the input embedding (or previous layer's output). Each row corresponds to a token in the sequence.
    • $Q \in \mathbb{R}^{L \times d_k}$, $K \in \mathbb{R}^{L \times d_k}$, $V \in \mathbb{R}^{L \times d_v}$.
    • $L$ is the sequence length (number of tokens).
    • $d_k$ is the dimension of the key and query vectors.
    • $d_v$ is the dimension of the value vectors.
    • $K^T$ is the transpose of the key matrix.
    • The product $QK^T$ calculates attention scores (or logits), representing how much each token in the sequence should attend to every other token.
    • $\sqrt{d_k}$ is a scaling factor to prevent the dot products from becoming too large, which can push the softmax function into regions with very small gradients.
    • softmax is an activation function that converts the attention scores into probability distributions, ensuring they sum to 1.
    • The final multiplication with $V$ produces a weighted sum of value vectors, where the weights are the attention probabilities.
  • KV Cache: During the decoding stage of LLMs (when generating text token by token), the Key (K) and Value (V) vectors for previously generated tokens are stored. This stored collection is called the KV cache. When a new token is generated, its query vector $Q$ is used to attend to all previously computed $K$ and $V$ vectors in the cache. Storing them avoids recomputing them for every new token, significantly speeding up generation. However, the size of the KV cache grows linearly with the sequence length, leading to substantial memory consumption and memory bandwidth bottlenecks for long-context LLMs.

  • Attention Sparsity / KV Cache Sparsity: This refers to techniques that reduce the computational and memory cost of the attention mechanism by only computing attention over a subset of the KV cache (i.e., a subset of past tokens). The idea is that not all past tokens are equally important for predicting the next token; only a few "critical tokens" (often called "heavy hitters") contribute significantly. By identifying and using only these critical tokens, the overall computation and memory access can be greatly reduced.

  • Top-k Sampling (in LLM generation context): In LLM text generation, after a model predicts a probability distribution over the entire vocabulary for the next token, top-k sampling involves selecting the next token only from the $k$ most probable tokens. This helps to make the generation process less repetitive than always picking the most probable token, but it uses a fixed $k$ regardless of the probability distribution's shape.

  • Top-p Sampling / Nucleus Sampling (in LLM generation context): Also used in LLM text generation, top-p sampling addresses the limitation of top-k. Instead of fixing $k$, it selects the smallest set of tokens whose cumulative probability exceeds a threshold $p$. This makes the sampling process adaptive: if the probability distribution is sharp (few tokens have high probability), $p$ will select a small $k$; if it is flat (many tokens have similar, lower probabilities), $p$ will select a larger $k$. This approach leads to more diverse and contextually appropriate generations. Twilight borrows this adaptive concept for attention sparsity (see the sketch following this list).
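To make the contrast concrete, the following PyTorch sketch (illustrative only; the tensors and values are made up, not from the paper) selects indices from a weight vector under a fixed top-k budget versus an adaptive top-p threshold, showing how the effective budget shrinks for a peaked distribution and grows for a flat one.

```python
import torch

def select_top_k(weights: torch.Tensor, k: int) -> torch.Tensor:
    # Fixed budget: always keep exactly k indices, no matter how peaked
    # or flat the distribution is.
    return torch.topk(weights, k).indices

def select_top_p(weights: torch.Tensor, p: float) -> torch.Tensor:
    # Adaptive budget: keep the smallest prefix (in descending order)
    # whose cumulative weight reaches p.
    sorted_w, sorted_idx = torch.sort(weights, descending=True)
    cum = torch.cumsum(sorted_w, dim=-1)
    budget = int((cum < p).sum().item()) + 1
    return sorted_idx[:budget]

# A peaked distribution covers p = 0.9 with only 2 tokens ...
peaked = torch.softmax(torch.tensor([8.0, 7.5, 0.1, 0.0, -0.2, -0.5]), dim=-1)
# ... while a flat one needs all 6.
flat = torch.softmax(torch.zeros(6), dim=-1)

print(select_top_k(peaked, 3), select_top_p(peaked, 0.9))
print(select_top_k(flat, 3), select_top_p(flat, 0.9))
```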

3.2. Previous Works

The paper categorizes previous works into several groups, highlighting the landscape of research on KV cache optimization and sparse attention:

  • Top-k Sparse Attention: These methods aim to select a fixed number $k$ of critical tokens.

    • KV Cache Compression (token dropping/eviction): These approaches decide which tokens to evict from the KV cache in a static or query-agnostic manner. Examples include H2O [8], StreamingLLM [17], and SnapKV [18]. They physically remove non-critical tokens from memory.
    • Token Selection (retaining all tokens, selecting for computation): These methods keep all tokens in GPU memory but selectively compute attention only with critical tokens. Examples include SparQ [19], Quest [9], Double Sparsity (DS) [12], and HShare [20]. More advanced algorithms like RetrievalAttention [21] and PQCache [22] focus on better estimating token criticality. Some, like NSA [23] and MoBA [24], explore trainable sparse attention. The common thread among all these is their reliance on top-k selection, which Twilight aims to improve.
  • Dynamic Budget: Recent studies have shown that the optimal KV cache budget varies significantly across different layers [25, 26], attention heads [27, 10, 28], and prompts/tasks [29]. These works often focus on one specific aspect of dynamism. Twilight argues that the root cause of this dynamism is the varying distributions of attention weights.

  • Non-top-k Sparse Attention:

    • MagicPIG [30] uses locality-sensitive hash (LSH) sampling to estimate attention weights instead of dropping tokens. It requires complex algorithm-system co-design.
    • SampleAttention [31] also features adaptive sparsity but is primarily focused on the prefill stage.
    • Tactic [32] is a concurrent work that also explores top-p sparsity but uses function fitting to estimate weight distributions, which can sometimes overestimate the budget.
  • Other KV Cache Optimizations: These methods complement sparsification by addressing different aspects of KV cache management.

    • Quantization [33, 34, 35, 36]: Reducing the precision of KV cache elements (e.g., from FP16 to INT8 or INT4) to save memory and bandwidth.
    • Linear Attention [37, 38]: Approximating the softmax operation to reduce the quadratic complexity of attention to linear.
    • Memory-efficient Attention [39, 40, 41]: Techniques like FlashAttention and SageAttention optimize the underlying attention computation kernels to reduce memory access and improve throughput. Twilight is orthogonal to these methods and can be combined with them.

3.3. Technological Evolution

The evolution of LLM efficiency has moved from full attention (where all tokens attend to all others, leading to quadratic complexity in sequence length) to sparse attention (which reduces this by selecting a subset of tokens). Early sparse attention methods often relied on heuristic-based token dropping or fixed-budget top-k selection. While these methods provided significant gains, they introduced a new challenge: how to optimally choose the fixed $k$ budget. The observation that attention weight distributions are highly dynamic across different parts of the model and different inputs led to research on dynamic budgeting, but often focusing on specific dimensions like layer-wise or head-wise changes.

This paper's work (Twilight) fits into the technological timeline by addressing the fundamental limitation of fixed-budget top-k sparse attention. It proposes a more general and adaptive solution by borrowing the concept of top-p sampling (successfully used in LLM output generation) and applying it to the internal attention mechanism. This represents a shift from static, fixed-size pruning to dynamic, probability-mass-based pruning, directly tackling the root cause of budget dynamism (varying attention weight distributions).

3.4. Differentiation Analysis

Compared to the main methods in related work, Twilight introduces several core differences and innovations:

  • Adaptive Budgeting via Top-p: The most significant innovation is the direct application of top-p sampling to determine the KV cache budget. Unlike top-k methods (e.g., Quest, DS, H2O, StreamingLLM), which select a fixed number of tokens, Twilight dynamically adjusts the number of selected tokens to accumulate a certain attention weight probability mass. This intrinsically adapts to focused versus diffuse attention distributions (as shown in Figure 3), preventing both over-selection and under-selection that plague top-k methods.

  • Framework for Existing Algorithms: Instead of proposing a completely new sparse attention algorithm, Twilight is designed as an optimizer framework for existing methods. Its Select-then-Prune architecture allows it to be integrated with any existing sparse attention algorithm (e.g., Quest, DS), enhancing them with adaptive budget selection capabilities without requiring a complete redesign. This makes Twilight highly versatile and potentially future-proof.

  • Improved Accuracy-Efficiency Trade-off: By dynamically pruning redundant tokens based on attention weight contribution, Twilight achieves higher efficiency (significant speedups) while maintaining or even improving accuracy compared to the base algorithms operating with fixed budgets. Top-p provides a theoretical upper bound on error, allowing for more precise control over the accuracy-efficiency trade-off.

  • Precision Requirement Consideration: The paper explicitly considers the precision requirements for attention weight estimation for top-p, finding that 4-bit quantization strikes a suitable balance, unlike top-k methods that might push to extremely low 1-2 bits or full attention which uses 8-bit or FP16.

  • System-Algorithm Co-design: Twilight emphasizes efficient kernel implementations, including SpGEMV with 4-bit quantized Key cache and a binary search-based top-p kernel, which is crucial for practical deployment on GPUs. This contrasts with some non-top-k methods like MagicPIG that also require algorithm-system co-design but use different underlying mechanisms.

    In essence, Twilight innovates by providing a general, adaptive, and theoretically grounded solution to the persistent problem of budget selection in sparse attention, leveraging a proven concept from LLM sampling to enhance existing and future sparse attention techniques.

4. Methodology

4.1. Principles

The core idea behind Twilight is to address the limitations of fixed-budget top-k sparse attention by borrowing the adaptive nature of top-p sampling (also known as nucleus sampling), a technique widely used in LLM text generation. The intuition is that the goal of sparse attention is not merely to select a fixed number of tokens, but rather to identify the minimum set of tokens whose cumulative attention weights contribute significantly to the overall attention output, thereby approximating the full attention computation with minimal error.

The theoretical basis stems from the observation that the error introduced by sparse attention can be bounded by the sum of the attention weights of the unselected tokens. Therefore, by ensuring that the selected tokens collectively account for a sufficient proportion ($p$) of the total attention weight, the error can be controlled. This approach naturally adapts to the dynamic attention weight distributions observed across different attention heads, layers, and queries (as illustrated in Figure 3). If attention weights are focused on a few tokens, top-p will select a small number of tokens. If they are diffuse, top-p will select more tokens until the cumulative probability threshold $p$ is met. This adaptive behavior is crucial for balancing accuracy and efficiency in dynamic LLM inference scenarios.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation of Sparse Attention

The paper begins by formally defining sparse attention during the decoding phase. Given a query vector $\mathbf{q} \in \mathbb{R}^{1 \times d}$ and a KV cache $\mathbf{K}, \mathbf{V} \in \mathbb{R}^{n \times d}$, where $d$ is the head dimension and $n$ is the context length.

Sparse attention calculates an approximate output $\hat{\mathbf{o}}$ by selecting a subset of indices $\mathcal{T}$. The formula for sparse attention is: $ \hat{\mathbf{o}} = \mathrm{softmax}\left( \frac{\mathbf{q} \cdot \mathbf{K}^T}{\sqrt{d}} \right) \pmb{\Lambda}_{\mathcal{T}} \mathbf{V} = \mathbf{W} \pmb{\Lambda}_{\mathcal{T}} \mathbf{V} \in \mathbb{R}^{1 \times d} $ Where:

  • $\hat{\mathbf{o}}$ is the approximate output vector of dimensions $1 \times d$.

  • $\mathrm{softmax}\left( \frac{\mathbf{q} \cdot \mathbf{K}^T}{\sqrt{d}} \right)$ computes the attention weights for all $n$ tokens based on the dot product similarity between the query and keys, scaled by $\sqrt{d}$, and then normalized by softmax.

  • $\mathbf{W} \in \mathbb{R}^{1 \times n}$ represents the full attention weights vector (the softmax output), where $\mathbf{W}[i]$ is the attention weight for the $i$-th token.

  • $\pmb{\Lambda}_{\mathcal{T}}$ is a diagonal matrix of size $n \times n$. Its diagonal elements are 1 for indices $i \in \mathcal{T}$ and 0 otherwise. This effectively masks out non-critical tokens.

  • $\mathbf{V}$ is the value matrix from the KV cache.

    The goal is to minimize the error between the full attention output $\mathbf{o}$ and the sparse attention output $\hat{\mathbf{o}}$. The full attention output is: $ \mathbf{o} = \mathrm{softmax}\left( \frac{\mathbf{q} \cdot \mathbf{K}^T}{\sqrt{d}} \right) \mathbf{V} \in \mathbb{R}^{1 \times d} $ The error $\lVert \mathbf{o} - \hat{\mathbf{o}} \rVert$ can be bounded using the sub-multiplicative property of the Frobenius norm: $ \mathcal{L} = \| \mathbf{o} - \hat{\mathbf{o}} \| = \| \mathbf{W} ( \pmb{\Lambda}_{\mathcal{T}} - \mathbf{1}^{n \times n} ) \mathbf{V} \| \leq \| \mathbf{W} ( \pmb{\Lambda}_{\mathcal{T}} - \mathbf{1}^{n \times n} ) \| \cdot \| \mathbf{V} \|_F $ Where:

  • $\mathcal{L}$ is the error in the output vector.

  • $\mathbf{1}^{n \times n}$ is an $n \times n$ matrix of all ones. The term $(\pmb{\Lambda}_{\mathcal{T}} - \mathbf{1}^{n \times n})$ represents a diagonal matrix where elements for $i \in \mathcal{T}$ are 0 and elements for $i \notin \mathcal{T}$ are -1. Essentially, it represents the portion of attention weights that are not selected.

  • $\| \cdot \|$ denotes a vector norm.

  • $\| \cdot \|_F$ denotes the Frobenius norm of a matrix.

  • Since the distribution of $\mathbf{V}$ is generally smooth, $\| \mathbf{V} \|_F$ can be considered a constant.

    Therefore, the objective simplifies to minimizing $\| \mathbf{W} ( \pmb{\Lambda}_{\mathcal{T}} - \mathbf{1}^{n \times n} ) \|$, which is equivalent to maximizing $\sum_{i \in \mathcal{T}} \mathbf{W}[i]$. This means selecting a subset of tokens whose attention weights sum to the largest possible value.
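As a quick numerical illustration of this bound (not from the paper; the sizes and the chosen subset are arbitrary), the following NumPy sketch compares the true output error against the unselected attention mass multiplied by $\| \mathbf{V} \|_F$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1024, 64                        # context length and head dimension
q = rng.standard_normal((1, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))

logits = q @ K.T / np.sqrt(d)
W = np.exp(logits - logits.max())
W = W / W.sum()                        # full attention weights, shape (1, n)
o = W @ V                              # full attention output

T = np.argsort(W[0])[::-1][:128]       # keep the 128 heaviest tokens
mask = np.zeros(n)
mask[T] = 1.0
o_hat = (W * mask) @ V                 # sparse attention output

err = np.linalg.norm(o - o_hat)
bound = (1.0 - W[0, T].sum()) * np.linalg.norm(V)   # (1 - selected mass) * ||V||_F
print(err, bound, err <= bound)
```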

4.2.2. Oracle Top-k Sparse Attention

If the size of the subset $\mathcal{T}$ is fixed to a budget $B$, this leads to Oracle Top-k Sparse Attention: $ \mathcal{T} = \arg\max_{\mathcal{T}} \sum_{i \in \mathcal{T}} \mathbf{W}[i] \quad \mathrm{s.t.}\ |\mathcal{T}| = B $ Where:

  • $\mathcal{T}$ is the set of selected token indices.
  • $|\mathcal{T}|$ is the size of the set $\mathcal{T}$.
  • $B$ is the predefined fixed budget (number of tokens to select). This represents the theoretical upper bound for top-k methods.

4.2.3. Oracle Top-p Sparse Attention

The paper argues that the fixed budget $B$ in top-k is problematic because attention weight distributions are dynamic (Figure 3). Some distributions are peaked (focused attention), where a few tokens have high weights, while others are flat (diffuse attention), where many tokens have similar, lower weights.

The following figure (Figure 3 from the original paper) shows the diverse distributions observed in attention weights, illustrating "flat" (diffuse) and "peaked" (focused) distributions.

Figure 3: Diverse distributions observed in attention weights. The leftmost image illustrates a "flat" distribution (diffuse attention), where the weights are close to uniformly distributed. The middle image depicts a "peaked" distribution (focused attention), where the weights are concentrated on the tokens at the two sides. When overlaid as in the rightmost image, the differences between these distributions become readily apparent.

To address this, Twilight introduces Oracle Top-p Sparse Attention, inspired by nucleus sampling. Instead of fixing the number of tokens, it fixes a cumulative probability threshold $p$: $ \mathcal{T} = \arg\min_{\mathcal{T}} |\mathcal{T}| \quad \mathrm{s.t.}\ \sum_{i \in \mathcal{T}} \mathbf{W}[i] \geq p $ Where:

  • $\mathcal{T}$ is the set of selected token indices.
  • $|\mathcal{T}|$ is the size of the set $\mathcal{T}$.
  • $p$ is the cumulative attention weight threshold (e.g., 0.9). This means selecting the minimum number of tokens whose attention weights sum to at least $p$. This approach ensures that the error is bounded by $(1 - p) \cdot \| \mathbf{V} \|_F$, directly linking the threshold $p$ to the theoretical error. Figure 4 demonstrates how top-p can select a much smaller budget ($B = 97$ for $p = 0.8$) to achieve an error comparable to a large fixed budget ($B = 1024$).
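Realized directly on a full weight vector, this oracle is just sort, cumulative sum, and cut at $p$; the brute-force NumPy sketch below (illustrative, not the paper's kernel) is exactly the approach that the kernel design in Section 4.2.5 later replaces with a GPU-friendly binary search over quantized weight estimates.

```python
import numpy as np

def oracle_top_p(W: np.ndarray, p: float) -> np.ndarray:
    """Smallest index set whose attention weights sum to at least p."""
    order = np.argsort(W)[::-1]                 # heaviest tokens first
    csum = np.cumsum(W[order])
    budget = int(np.searchsorted(csum, p)) + 1  # first prefix reaching p
    return order[:budget]

W = np.array([0.50, 0.02, 0.25, 0.01, 0.15, 0.07])
print(oracle_top_p(W, 0.8))   # -> [0 2 4]: three tokens already cover 0.9
```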

The following figure (Figure 4 from the original paper) shows cumulative attention scores of different budget selections in one example attention head.

Figure 4: Cumulative attention scores of different budget selections in one example attention head. The x-axis is the token position and the y-axis is the cumulative attention score; the annotated points, such as (97, 0.80) and (1024, 0.87), mark the scores reached at specific budgets.

4.2.4. Twilight Architecture: Hierarchical Pruning with Select-then-Prune

Twilight is designed as a framework to enhance existing sparse attention algorithms. It employs a two-step, hierarchical pruning process, dubbed the Select-then-Prune architecture.

The following figure (Figure 5 from the original paper) illustrates the Twilight architecture.

Figure 5: Twilight architecture. Twilight incorporates a certain existing base sparse attention algorithm and further optimizes it. It computes self-attention in three steps. First, the Token Selector selects critical tokens using the base algorithm under a conservative budget. Then, the Twilight Pruner prunes the selected token subset via top-$p$ thresholding. Finally, the pruned token indices are passed to the Sparse Attention Kernel to perform the attention computation.

The self-attention computation proceeds in three steps (a simplified sketch of the flow follows the list):

  1. Token Selector: This component is a black-box representation of an existing base sparse attention algorithm (e.g., Quest, DS). It's configured to use a conservative budget (e.g., 1/4 sparsity), meaning it initially selects a relatively large subset of critical tokens. This step ensures that important tokens are not prematurely discarded and provides a pool of candidates for further refinement.
  2. Twilight Pruner: This is the core innovation. It takes the subset of tokens selected by the Token Selector and further prunes them. It applies top-p thresholding, retaining only the minimum number of tokens whose attention weight sum exceeds the specified threshold $p$. This step adaptively determines the final budget for each query/head.
  3. Sparse Attention Kernel: The final sparse attention computation is performed only on the indices of the tokens retained by the Twilight Pruner. This computation is optimized for sparsity and leverages the reduced number of tokens.
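A high-level PyTorch sketch of this Select-then-Prune flow is given below; `base_selector` and `sparse_attention` are placeholder callables standing in for the base algorithm and the sparse kernel, and all names and defaults are illustrative rather than the paper's API.

```python
import torch

def twilight_decode_step(q, K, V, base_selector, sparse_attention,
                         p: float = 0.85, conservative_budget: int = 4096):
    """One decoding step under a Select-then-Prune flow (illustrative sketch)."""
    # Step 1: Token Selector -- the base sparse-attention algorithm (e.g. Quest)
    # returns a conservative candidate set of token indices.
    cand = base_selector(q, K, budget=conservative_budget)

    # Step 2: Twilight Pruner -- estimate attention weights on the candidates
    # (the real system uses a 4-bit quantized K cache here) and keep the
    # smallest subset whose weights sum to at least p.
    logits = (q @ K[cand].T) / K.shape[-1] ** 0.5
    w = torch.softmax(logits, dim=-1).squeeze(0)
    order = torch.argsort(w, descending=True)
    keep = order[: int((torch.cumsum(w[order], 0) < p).sum()) + 1]
    pruned = cand[keep]

    # Step 3: Sparse Attention Kernel -- exact attention over the pruned indices.
    return sparse_attention(q, K[pruned], V[pruned])
```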

4.2.5. Efficient Kernel Implementations

To make Twilight practical and efficient, the authors develop specialized kernel implementations.

4.2.5.1. Efficient SpGEMV with 4-bit Quantization of Key Cache

Estimating attention weights ($\mathbf{q} \cdot \mathbf{K}^T$) without loading the full KV cache is critical. This involves computing a Sparse General Matrix-Vector Multiplication (SpGEMV). Since loading $\mathbf{K}$ is often memory-bound, Twilight reduces memory access cost by quantizing $\mathbf{K}$ into lower precision.

The paper empirically finds that 4-bit precision for K cache quantization offers the best balance between accuracy and efficiency for top-p, lying between the 1-2 bits used by some top-k methods and 8-bit or FP16 for full attention. 2-bit quantization leads to significant attention weight sum drops, while 4-bit and 8-bit maintain stability (Figure 6).

The following figure (Figure 6 from the original paper) shows the sums of normalized attention weights for selected tokens under different quantization precisions, with $p = 0.85$.

Figure 6: Sums of normalized attention weights for the selected tokens under different quantization precisions (2-bit, 4-bit, and 8-bit), with $p = 0.85$. The x-axis is the layer index and the y-axis is the attention weight sum.

The implementation details from Appendix B.1 include (a simplified sketch follows the list):

  • Calculation Process: Adapts FlashInfer [47]'s attention decoding kernel. It involves asynchronously loading and dequantizing INT4 K cache from global memory to shared memory using cp.async. A two-stage pipeline overlaps data loading with computation. FP16 is used for dequantized K cache to balance accuracy and computation cost.
  • Dequantization: Uses per-head, dynamic KV quantization. FP16 scale and zero-point are stored for each head. Dequantization is performed on-the-fly using parameters inspired by QServe [61] and a fast algorithm from [62] (NVIDIA's FasterTransformer [63]) using custom PTX assembly instructions for INT4 to FP16 conversion.
  • Bit-packing: INT4 K elements are packed into an uint8_t buffer (two 4-bit elements per 8-bit byte). An offset of +128 is added to each INT4 element before packing to convert it to an unsigned value. Address calculation is remapped for 4-bit granularity by halving the effective byte offset.
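The per-head dynamic quantization and on-the-fly dequantization described above can be mimicked at a high level in PyTorch; the sketch below is illustrative only (it skips bit-packing and the PTX-level INT4-to-FP16 conversion, and uses plain float math so it also runs on CPU).

```python
import torch

def quantize_k_int4(K: torch.Tensor):
    """Asymmetric 4-bit quantization of one head's K cache slice (n, d)."""
    kmin, kmax = K.min(), K.max()
    scale = (kmax - kmin) / 15.0            # 16 levels for 4 bits
    zero = kmin
    q = torch.clamp(torch.round((K - zero) / scale), 0, 15).to(torch.uint8)
    return q, scale.half(), zero.half()     # FP16 scale and zero-point per head

def estimate_weights(q_vec: torch.Tensor, Kq, scale, zero):
    """Approximate softmax(q K^T / sqrt(d)) from the quantized K cache."""
    K_deq = Kq.float() * scale.float() + zero.float()   # dequantize on the fly
    logits = (q_vec @ K_deq.T) / K_deq.shape[-1] ** 0.5
    return torch.softmax(logits, dim=-1)

K = torch.randn(4096, 128)                  # one head: 4096 cached keys, d = 128
q = torch.randn(1, 128)
Kq, s, z = quantize_k_int4(K)
w_est = estimate_weights(q, Kq, s, z)       # estimated weights fed to the pruner
```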

A brute-force approach to top-p sampling (sorting all elements and accumulating) is inefficient on GPUs. Twilight implements an efficient top-p kernel by modifying FlashInfer's top-p sampling kernel, utilizing a parallel-friendly binary search method.

The algorithm is presented as Algorithm 1 (Top-$p$ via Binary Search). Input: normalized attention weights $\mathbf{W} \in \mathbb{R}^{\mathrm{BS} \times \mathrm{H} \times \mathrm{N}}$, top-$p$ threshold $p$, hyper-parameter $\epsilon$. Output: indices $\mathcal{I}$, mask $\mathcal{M} \in \{0, 1\}^{\mathrm{BS} \times \mathrm{H} \times \mathrm{N}}$.

  1. Initialize the search range: $l = 0$, $r = \max(\mathbf{W})$, $m = (l + r) / 2$;
  2. repeat
  3. $\mathbf{W}_0 = \mathrm{where}(\mathbf{W} < m,\ 0.0,\ \mathbf{W})$;

The remaining loop body compares the retained mass $\mathrm{sum}(\mathbf{W}_0)$ with $p$ to move $l$ or $r$ toward $m$, recomputes $m = (l + r) / 2$, and repeats until $r - l < \epsilon$; the final mask $\mathcal{M}$ (and the indices $\mathcal{I}$) keep the tokens whose weights reach the converged threshold. A runnable sketch of this search is given below.
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad 
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad 
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad 
\quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \quad \begin{quad} \end{array} \end{quote}
  • $\mathbf{W}$ is the raw attention weights output before any selection.
  • BS is the batch size.
  • $H$ is the number of attention heads.
  • $N$ is the sequence length (context length).
  • $p$ is the top-$p$ threshold.
  • $\epsilon$ is a small hyper-parameter for numerical stability, defining the convergence criterion for the binary search.
  • $l$ (left) and $r$ (right) define the search range for the minimum weight value that satisfies the top-$p$ condition.
  • $m$ is the midpoint of the search range.
  • $W_0 = \mathsf{where}(W < m, \ 0.0, \ W)$ creates a temporary vector in which all weights less than $m$ are set to 0.0, while the others retain their original values.
  • $W_1 = \mathsf{where}(W \leq l, \ \mathrm{INF}, \ W)$ creates a temporary vector used to track the lower bound of selected weights.
  • $W_2 = \mathsf{where}(W > r, \ -\mathrm{INF}, \ W)$ creates a temporary vector used to track the upper bound of selected weights.
  • The binary search iteratively adjusts $l$ or $r$ depending on whether the sum of attention weights greater than or equal to $m$ (i.e., $\mathrm{sum}(W_0)$) meets the threshold $p$.
  • The loop continues until the gap between the maximum of $W_2$ and the minimum of $W_1$ falls below $\epsilon$, converging to the correct minimum weight value.
  • Finally, the indices $\mathcal{I}$ and mask $\mathcal{M}$ are selected where $W \geq l$. The paper notes that element-wise operations such as max, where, and sum can be fused into a single loop and tensorized on the GPU, avoiding materializing intermediate variables. (A simplified sketch of this procedure is given below.)
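To make the procedure concrete, here is a minimal PyTorch sketch of the binary-search top-$p$ selection described above. It assumes the estimated attention weights are already softmax-normalized per row and uses a simplified per-row convergence check ($r - l < \epsilon$) instead of the $W_1$/$W_2$ bookkeeping; the names and shapes are illustrative, not the paper's fused GPU kernel.

```python
import torch

def topp_select(W: torch.Tensor, p: float = 0.95, eps: float = 1e-4) -> torch.Tensor:
    """Binary-search top-p selection (illustrative sketch, not the paper's fused kernel).

    W:   estimated attention weights of shape [BS * H, N], assumed softmax-normalized
         so that each row sums to 1.
    p:   top-p threshold (accumulated probability to keep).
    eps: convergence tolerance for the binary search over the weight threshold.

    Returns a boolean mask of shape [BS * H, N] marking the selected tokens per head.
    """
    l = torch.zeros(W.shape[0], 1, device=W.device, dtype=W.dtype)  # lower bound of threshold
    r = W.max(dim=-1, keepdim=True).values                          # upper bound of threshold
    while bool(torch.any(r - l >= eps)):
        m = (l + r) / 2
        W0 = torch.where(W < m, torch.zeros_like(W), W)   # zero out weights below the midpoint
        covered = W0.sum(dim=-1, keepdim=True)            # accumulated probability still kept
        keep_raising = covered >= p                       # threshold can be raised further
        l = torch.where(keep_raising, m, l)
        r = torch.where(keep_raising, r, m)
    return W >= l  # tokens whose weight is at least the converged threshold
```

In the actual kernel, the element-wise max/where/sum steps are fused into a single tensorized loop, so intermediates like $W_0$ are never materialized.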

4.2.5.3. Load Balancing with Awareness of Head Dynamism

The top-p Pruner leads to head-wise dynamic budgets, which can cause load imbalance in the attention kernel (different heads process different numbers of tokens). Twilight reuses the load balancing algorithm from FlashInfer [47] (originally for requests with dynamic lengths) to address this head-wise dynamism by flattening the head dimension.
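As a rough illustration of this flattening (a sketch under the assumption that the kernel consumes a FlashInfer-style variable-length layout; the helper below is hypothetical, not Twilight's actual interface), each (batch, head) pair can be treated as one variable-length "request" whose length is that head's adaptive budget:

```python
import torch

def flatten_head_budgets(budgets: torch.Tensor) -> torch.Tensor:
    """Build a varlen index pointer over the flattened (batch * head) dimension.

    budgets: int tensor of shape [BS * H], the number of tokens selected per head
             by the top-p Pruner.
    Returns indptr of shape [BS * H + 1]; the KV indices selected for head i occupy
    positions indptr[i]:indptr[i+1] in a flattened, head-major index buffer, which a
    variable-length attention kernel can then load-balance like requests of unequal length.
    """
    zero = torch.zeros(1, dtype=budgets.dtype, device=budgets.device)
    return torch.cat([zero, budgets.cumsum(dim=0)])
```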

In Appendix B.2, the paper details how Group Query Attention (GQA) (e.g., in LLaMA 3) is tackled. GQA maps a group of query heads to a single key-value head, creating an incompatibility with query-aware sparse attention that relies on individual query vectors. A brute-force solution (loading tokens independently for each query head) is inefficient due to repeated memory reads. Twilight addresses this by operating at the granularity of query groups: the selected tokens for a query group are the union of tokens identified by all query heads within that group. This shifts head-wise dynamism to group-wise dynamism for GQA, where all heads in a group share a common token budget. This is a deliberate trade-off for implementation efficiency and compatibility.
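The group-level union itself is simple to picture; below is a minimal sketch (shapes and the helper name are illustrative assumptions, not the paper's implementation):

```python
import torch

def group_union_mask(head_masks: torch.Tensor, group_size: int) -> torch.Tensor:
    """Merge per-query-head token selections into per-group selections for GQA.

    head_masks: bool tensor of shape [num_query_heads, N]; True marks a token kept
                by that query head.
    group_size: number of query heads sharing one key-value head.
    Returns a bool tensor of shape [num_query_heads // group_size, N]: a token is kept
    for a group if any query head in the group selected it (the union described above).
    """
    H, N = head_masks.shape
    assert H % group_size == 0, "query heads must divide evenly into groups"
    return head_masks.view(H // group_size, group_size, N).any(dim=1)
```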

The following figure (Figure 13 from the original paper) shows the head-wise/group-wise varlen attention with flattened paged KV cache in Twilight and a comparison among different attention methods.

Figure 13: Left: Head-wise/group-wise varlen attention with flattened paged KV cache in Twilight. Right: Comparison among the three attention methods on a real budget distribution of a LLaMA-3.1-7B layer on a 16k retrieval task. Here "Padded" means padding all heads to the maximum budget length; "Head Varlen" loads KV at the head granularity, which causes repeated loading; and "Group Varlen" strikes a balance between the two.

4.2.6. Overhead Analysis and Discussion

4.2.6.1. Execution Time

The total execution time for Twilight is the sum of three parts: $T_{\mathrm{TokenSel}} + T_{\mathrm{Pruner}} + T_{\mathrm{SparseAttn}}$. Compared to a baseline sparse attention method (which only incurs $T_{\mathrm{TokenSel}}$ and $T_{\mathrm{SparseAttn}}$), Twilight adds the extra latency term $T_{\mathrm{Pruner}}$ but reduces $T_{\mathrm{SparseAttn}}$ through more aggressive pruning. The hierarchical sparsity, in which the number of tokens shrinks as the precision increases, is what makes this pay off. Let $B_0$ be the budget from the Token Selector and $B_1$ the budget after the Twilight Pruner; with the selector's estimation cost approximated as $N/16$ and the low-bit Pruner pass over $B_0$ tokens costing roughly $B_0/4$, the theoretical speedup over the baseline is approximately $ \frac{N/16 + B_0}{N/16 + B_0/4 + B_1} $. If $B_0 = N/4$ and $B_1 = N/64$, the speedup is approximately $2\times$ (see the worked calculation below). The overhead of the top-$p$ kernel is considered minor because SpGEMV dominates the latency when $B_0$ is around $N/8$ to $N/4$.
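A quick numeric check of this estimate (illustrative arithmetic only; the context length is an assumed value):

```python
# Worked example of the theoretical speedup formula above.
N = 65536                             # assumed context length for illustration
B0 = N // 4                           # conservative budget kept by the Token Selector
B1 = N // 64                          # adaptive budget left after the Twilight Pruner
baseline = N / 16 + B0                # estimation cost + sparse attention over B0 tokens
twilight = N / 16 + B0 / 4 + B1       # estimation + low-bit pruning pass + attention over B1
print(round(baseline / twilight, 2))  # ~2.22, i.e. roughly a 2x speedup
```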

4.2.6.2. Memory Overheads

Twilight introduces an extra INT4-quantized K cache. This adds a $1/2 \times 1/4 = 1/8$ additional KV cache memory overhead: INT4 occupies a quarter of the bits of FP16, and the quantized copy is kept only for $K$, which is half of the KV cache (see the short arithmetic sketch after the list below). However, this cost can be mitigated:

  • Some base algorithms (e.g., DS [12]) already use INT4 K cache.
  • Efforts like SageAttention2 [41] explore INT4 full-attention, allowing direct reuse of estimated attention weights without needing the original FP16 K cache.
  • Techniques like offloading or selective quantization (e.g., INT4 K cache only for hot tokens) could further reduce memory if it becomes a bottleneck.
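The overhead fraction can be sanity-checked with a couple of lines (illustrative arithmetic only):

```python
# Extra memory of the INT4-quantized K cache, relative to the full FP16 KV cache.
int4_vs_fp16 = 4 / 16      # INT4 uses a quarter of the bits of FP16
k_share_of_kv = 1 / 2      # only K (not V) gets the extra quantized copy
extra_fraction = k_share_of_kv * int4_vs_fp16
print(extra_fraction)      # 0.125, i.e. a 1/8 overhead on top of the KV cache
```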

4.2.6.3. Integration with LLM Serving Systems

Twilight's design is compatible with PagedAttention [44], allowing seamless integration into LLM serving systems like vLLM [44] and SGLang [45]. It also works with techniques like prefix sharing and multi-phase attention [48, 45, 49, 50, 51] due to its page-level or token-level sparse operations.

5. Experimental Setup

5.1. Datasets

The experiments evaluate Twilight on a variety of benchmarks covering different context lengths and task types.

  • Long-Context Benchmarks:

    • Longbench [1]: A bilingual, multitask benchmark for long context understanding. Twilight is evaluated on 12 tasks covering all task types. For efficiency evaluation, specific tasks such as Qasper [56] (QA), GovReport [57] (summarization), and LCC [58] (coding) are used, with prompts ranging from 10k to 30k tokens.
    • RULER [16]: A benchmark specifically designed to assess the true context size of long-context language models. It includes specialized tests like CWE/FWE for comprehensive non-retrieval accuracy evaluation.
  • Medium-Context Benchmarks (500 to 2k tokens):

    • GSM8K [13]: A dataset of math word problems designed to test LLMs' reasoning abilities.

    • COQA [14]: A conversational question answering challenge dataset.

    • PG-19 [15]: A large dataset of books used for long-range sequence modeling (used here for perplexity evaluation, which measures how well a language model predicts a sample of text).

      The datasets were chosen to rigorously test Twilight's performance across diverse tasks and context lengths, from typical medium-length interactions to very long documents, validating its effectiveness in various real-world scenarios.

5.2. Evaluation Metrics

The paper employs several evaluation metrics to assess both the accuracy and efficiency of Twilight.

  • Accuracy Metrics:

    • Score (Longbench, RULER): For Longbench and RULER, the paper reports "scores." These are typically task-specific accuracy metrics (e.g., F1 score, Exact Match, Rouge score, or specialized metrics for coding/reasoning tasks) that quantify the performance of the LLM on each task. A higher score indicates better performance. The paper reports the average score across tasks.
    • GSM8K (flexible/strict): For GSM8K, flexible and strict accuracy metrics are used. These measure the correctness of the model's answers to math word problems, with strict typically requiring an exact match and flexible allowing for minor variations. Higher values are better.
    • COQA (em/f1): For COQA, Exact Match (EM) and F1 score are used.
      • Conceptual Definition (Exact Match - EM): EM measures the percentage of predictions that match the ground truth answer exactly. It is a strict measure of accuracy.
      • Mathematical Formula (Exact Match - EM): It is commonly defined as: $ \mathrm{EM} = \frac{\text{Number of exact matches}}{\text{Total number of questions}} \times 100\% $ Where:
        • "Number of exact matches" refers to cases where the model's generated answer string is identical to the reference answer string.
        • "Total number of questions" is the total count of questions in the dataset.
      • Conceptual Definition (F1 Score): The F1 score is the harmonic mean of precision and recall. It is a more robust metric than EM for QA tasks, especially when answers can be phrased differently but convey the same meaning. It measures the overlap between the predicted answer and the ground truth answer at the word level.
      • Mathematical Formula (F1 Score): $ \mathrm{F1} = 2 \times \frac{\mathrm{Precision} \times \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $ Where:
        • Precision = $\frac{\text{Number of shared words between prediction and ground truth}}{\text{Number of words in prediction}}$
        • Recall = $\frac{\text{Number of shared words between prediction and ground truth}}{\text{Number of words in ground truth}}$
        • For a single question, Precision and Recall are calculated based on the word overlap between the predicted answer and the (set of) ground truth answers. The F1 score for the dataset is typically the average F1 over all questions. For both EM and F1, higher values are better. (A short worked sketch of EM and word-level F1 is given after this metrics list.)
    • PG-19 Perplexity (PPL):
      • Conceptual Definition: Perplexity is a common intrinsic evaluation metric for language models. It quantifies how well a probability model predicts a sample. A lower perplexity score indicates that the model is better at predicting the given text, implying higher accuracy and a better fit to the data. It's essentially the exponentiated average negative log-likelihood of a sequence.
      • Mathematical Formula: $ \mathrm{PPL}(W) = \exp \left( -\frac{1}{N} \sum_{i=1}^N \log P(w_i | w_1, \dots, w_{i-1}) \right) $ Or, equivalently, using the inverse product of probabilities: $ \mathrm{PPL}(W) = \sqrt[N]{\prod_{i=1}^N \frac{1}{P(w_i | w_1, \dots, w_{i-1})}} $ Where:
        • $W = (w_1, w_2, \dots, w_N)$ is a sequence of $N$ words (or tokens).
        • $N$ is the total number of words/tokens in the sequence.
        • $P(w_i \mid w_1, \dots, w_{i-1})$ is the probability of the $i$-th word given the preceding words, as predicted by the language model.
        • $\log P(\cdot)$ is the natural logarithm of the probability.
        • $\exp(\cdot)$ is the exponential function. A lower perplexity value is better.
  • Efficiency Metrics:

    • Latency ($\mu s$): Measures the time taken for a specific operation, such as a self-attention operator. Lower latency is better.
    • Speedup ($\times$): A ratio indicating how many times faster Twilight is compared to a baseline. A higher speedup is better.
    • Time-Per-Output-Token (TPOT): Measures the average time (e.g., in milliseconds) required to generate a single output token during end-to-end decoding. This is a crucial metric for LLM serving systems. Lower TPOT is better.
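As flagged above, the following is a minimal sketch of how word-level EM and F1 are typically computed for QA answers (a generic illustration of the definitions, not the benchmarks' official scoring scripts, which additionally normalize casing and punctuation):

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the predicted answer string equals the reference exactly, else 0.0."""
    return float(prediction.strip() == reference.strip())

def f1_score(prediction: str, reference: str) -> float:
    """Word-level F1: harmonic mean of precision and recall over shared words."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)   # multiset intersection
    num_shared = sum(common.values())
    if num_shared == 0:
        return 0.0
    precision = num_shared / len(pred_tokens)
    recall = num_shared / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("Paris", "Paris"), f1_score("the city of Paris", "Paris"))  # 1.0 0.4
```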

5.3. Baselines

Twilight is primarily presented as a framework to optimize existing sparse attention algorithms. Therefore, its performance is evaluated against various baselines, including full attention and state-of-the-art sparse attention methods, both with and without Twilight integration.

  • Full Attention Baselines:

    • PyTorch's Scaled-Dot-Product-Attention (SDPA): Standard PyTorch implementation of the attention mechanism.
    • FlashAttention2 (FA2) [39]: A highly optimized memory-efficient attention implementation, serving as a backend for SDPA. It's a strong baseline for full attention performance.
    • Memory-Efficient Attention [59]: Another backend for SDPA focused on memory optimization.
    • FlashInfer [47]: A high-performance kernel library specifically designed for LLM serving, providing highly optimized full attention kernels.
  • State-of-the-Art Top-k Sparse Attention Methods:

    • Quest [9]: A query-aware sparsity method for efficient long-context LLM inference. It's cited as achieving state-of-the-art runtime performance among sparse attention methods.
    • Double Sparsity (DS) [12]: A method applying post-training sparse attention with double sparsity. The authors use optimized configurations tuned for each model, provided by its official repository.
  • State-of-the-Art Non-Top-k Sparse Attention Method:

    • MagicPIG [30]: Uses locality-sensitive hashing (LSH) sampling for attention weight estimation. It relies on configurable parameters $K$ and $L$ instead of a fixed budget. The paper uses two standard configurations from the original MagicPIG paper. (Note: MagicPIG was not evaluated on LLaMA-2 due to lack of official support.)
  • Token Dropping Methods (for comparison of sparsity types):

    • StreamingLLM [17]: An efficient streaming language model with attention sinks, typically for managing KV cache in streaming scenarios.

    • SnapKV [18]: A KV cache compression method that selects the important KV entries before generation begins, based on the observation that the LLM knows what to look for before generating.

      These baselines are representative of current state-of-the-art approaches in LLM efficiency, covering both full attention and various sparse attention strategies (top-k, non-top-k, token dropping). This comprehensive comparison allows Twilight to demonstrate its advantages as an adaptive optimizer.

5.4. Models

The experiments use three widely adopted LLM models, covering different architectures and context length capabilities:

  • Longchat-7B-v1.5-32k [52]: A 7-billion parameter model with a 32k context length. This model provides a strong baseline for long-context evaluation.

  • LLaMA2-7B-Chat [53]: A 7-billion parameter LLaMA-2 model fine-tuned for chat applications.

  • LLaMA-3.1-8B-Instruct [54]: An 8-billion parameter LLaMA-3.1 model, instructed for various tasks, with 128k context length. This model utilizes Group Query Attention (GQA).

    These models cover two mainstream attention implementations: Multi-Head Attention (MHA) (e.g., Longchat, LLaMA2) and Group Query Attention (GQA) (e.g., LLaMA-3.1), allowing for a broad evaluation of Twilight's applicability. For fair comparison, sparse methods are not applied to the first two layers of the models, as is common practice in many baselines.

5.5. Hyperparameters

  • Twilight's $p$ threshold: The hyperparameter $p$ is set to 0.95 for the LLaMA-2/3 models and 0.85 for Longchat. These values were determined through the sensitivity study discussed in Section 6.2.1 to balance accuracy and efficiency (an illustrative configuration sketch follows after this list).
  • Base Algorithm Budget for Token Selector: For Quest and DS, the budget ($k$) in the Token Selector (the first step of Twilight) is varied from 256 to 8192. When integrated with Twilight, a conservative budget (e.g., $1/4$ sparsity) is chosen for the Token Selector as an initial, larger subset of tokens, which the Twilight Pruner then refines.
  • MagicPIG Parameters: MagicPIG uses parameters $K$ and $L$. The paper adopts two standard configurations: $K=8, L=75$ and $K=10, L=150$.
  • Medium-Context Baseline Budget: For medium-context tasks, the baseline Quest and DS configurations use a fixed budget of 128, chosen to be comparable to the average budget left after Twilight's pruning in those scenarios.
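A hypothetical configuration sketch summarizing these settings (the dictionary keys and structure are invented for illustration; they are not Twilight's actual API):

```python
# Hypothetical experiment configuration mirroring the hyperparameters above.
TWILIGHT_CONFIGS = {
    "Longchat-7B-v1.5-32k":  {"top_p": 0.85, "selector_sparsity": 1 / 4},
    "LLaMA-2-7B-Chat":       {"top_p": 0.95, "selector_sparsity": 1 / 4},
    "LLaMA-3.1-8B-Instruct": {"top_p": 0.95, "selector_sparsity": 1 / 4},
}

BASELINE_SETTINGS = {
    "Quest": [256, 1024, 4096, 8192],      # fixed top-k budgets swept for the baselines
    "DS":    [256, 1024, 4096, 8192],
    "MagicPIG": [{"K": 8, "L": 75}, {"K": 10, "L": 150}],
    "medium_context_fixed_budget": 128,    # Quest/DS baseline budget on GSM8K/COQA/PG-19
}
```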

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Accuracy Evaluation

6.1.1.1. Results on Longbench

The Longbench evaluation (Table 2 and the detailed Table 5 in Appendix C) demonstrates Twilight's effectiveness in long-context scenarios. For Longchat-7B-v1.5-32k, Twilight improves the average score over the corresponding base algorithm by up to 5.7% (when optimizing DS) while pruning up to 98% of the tokens over-selected by the base algorithm, delivering significant efficiency gains alongside an accuracy improvement. For LLaMA-3.1-8B-Instruct, Twilight achieves nearly zero accuracy loss (<1%) with a slight increase in budget usage, suggesting that knowledge in LLaMA-3.1 might be more compressed. This indicates Twilight can adapt to different model characteristics.

The following are the results from Table 2 of the original paper:

| Method | Budget | Longchat-7B-v1.5-32k Score | Pruned % (Avg. Budget) | LLaMA-3.1-8B-Instruct Score | Pruned % (Avg. Budget) |
| --- | --- | --- | --- | --- | --- |
| Full | 32k | 36.78 | - | 52.01 | - |
| + Twilight | adaptive | 38.52 (+4.7%) | 99.6% (141.4) | 51.64 (-0.7%) | 99.6% (478) |
| MagicPIG | K=8, L=75 | - | - | 51.70 | - |
| MagicPIG | K=10, L=150 | - | - | 51.32 | - |
| Quest | 256 | 31.26 | - | 38.20 | - |
| Quest | 1024 | 36.85 | - | 47.79 | - |
| Quest | 4096 | 37.33 | - | 50.79 | - |
| Quest | 8192 | 37.10 | - | 51.44 | - |
| + Twilight | adaptive | 38.04 (+2.5%) | 98.4% (131) | 51.57 (+0.3%) | 99.5% (427) |
| DS | 256 | 35.32 | - | 45.74 | - |
| DS | 1024 | 35.96 | - | 49.43 | - |
| DS | 4096 | 36.31 | - | 50.98 | - |
| DS | 8192 | 36.62 | - | 51.14 | - |
| + Twilight | adaptive | 38.71 (+5.7%) | 98.5% (126) | 51.73 (+1.2%) | 99.5% (446) |

6.1.1.2. Results on RULER

On the RULER benchmark (Table 3), Twilight significantly boosts the performance of base sparse attention methods. Quest-Twi achieves scores comparable to the SOTA non-top-k method MagicPIG. More remarkably, DS-Twi sets new performance records, surpassing all existing methods. This indicates Twilight's ability to not only recover lost accuracy from fixed-budget methods but also push SOTA performance further.

The following are the results from Table 3 of the original paper:

| Method | Budget | 16k | 32k | 64k | 96k | Avg. |
| --- | --- | --- | --- | --- | --- | --- |
| Full | 100% | 92.88 | 89.42 | 84.64 | 83.10 | 87.49 |
| + Twilight | adaptive | 93.13 | 89.10 | 85.17 | 85.23 | 88.18 |
| MagicPIG | K=8, L=75 | 92.22 | 89.37 | 84.07 | 82.58 | 87.06 |
| MagicPIG | K=10, L=150 | 91.38 | 88.20 | 83.34 | 82.02 | 86.23 |
| Quest | 4% | 79.35 | 79.8 | 78.64 | 73.22 | 77.75 |
| Quest | 8% | 87.31 | 83.06 | 80.82 | 75.28 | 81.62 |
| + Twilight | adaptive | 91.53 | 87.97 | 84.12 | 82.96 | 86.65 |
| DS | 4% | 92.04 | 88.11 | 84.43 | 82.56 | 86.79 |
| DS | 8% | 92.89 | 88.70 | 84.39 | 82.72 | 87.18 |
| + Twilight | adaptive | 93.54 | 89.24 | 85.91 | 82.81 | 87.88 |

6.1.1.3. Results on Medium-Context Tasks

For medium-context tasks (GSM8K, COQA, PG-19 Perplexity in Table 4), Twilight demonstrates that its Pruner does not negatively impact performance. Twilight significantly outperforms Quest and DS (when they use a fixed budget of 128) and shows nearly zero accuracy loss compared to full attention. The average budgets after Twilight's pruning are shown to be quite small (e.g., 90.82 for GSM8K with LLaMA-2-7B-Chat), confirming its aggressive but accurate pruning capabilities.

The following are the results from Table 4 of the original paper:

LLaMA-2-7B-Chat:

| Method | GSM8K (flexible/strict) ↑ | COQA (em/f1) ↑ | PG-19 Perplexity ↓ |
| --- | --- | --- | --- |
| Full | 0.2290/0.2282 | 0.5935/0.7511 | 7.503 |
| Quest | 0.0523/0.0508 | 0.5710/0.7425 | 14.15 |
| DS | 0.2191/0.2190 | 0.5855/0.7401 | 7.622 |
| Twilight | 0.2153/0.2115 | 0.6088/0.7642 | 7.600 |
| (Twilight Avg. Budget) | 90.82 | 91.86 | 102.58 |

LLaMA-3.1-8B-Instruct:

| Method | GSM8K (flexible/strict) ↑ | COQA (em/f1) ↑ | PG-19 Perplexity ↓ |
| --- | --- | --- | --- |
| Full | 0.7726/0.7475 | 0.6363/0.7882 | 7.490 |
| Quest | 0.3639/0.3533 | 0.6007/0.7554 | 19.00 |
| DS | 0.6194/0.6027 | 0.6455/0.7964 | 7.967 |
| Twilight | 0.7771/0.7604 | 0.6325/0.7869 | 7.529 |
| (Twilight Avg. Budget) | 112.40 | 86.85 | 110.98 |

6.1.2. Efficiency Evaluation

6.1.2.1. Self-Attention Speedup

Figure 7 shows the self-attention operator speedups. FlashInfer-Twi and Quest-Twi achieve impressive speedups, up to $6.5\times$ and $15.8\times$ respectively, compared to FlashAttention2. Furthermore, they accelerate their respective base algorithms (FlashInfer and Quest) by $2.4\times$ and $1.4\times$. This highlights Twilight's ability to significantly improve the core attention operation itself by reducing the number of tokens processed.

The following figure (Figure 7 from the original paper) shows latencies and speedups of self-attention at different sequence lengths and batch sizes.

Figure 7: Latencies and speedups of self-attention at different sequence lengths and batch sizes (8, 16, 32, 64), comparing FA2, FlashInfer, Quest, FlashInfer-Twi, and Quest-Twi.

6.1.2.2. End-to-End Decoding Speedup

In end-to-end decoding scenarios (Figure 8), Quest-Twi achieves up to a $3.9\times$ speedup compared to FlashInfer and a $1.35\times$ speedup compared to Quest without Twilight. This demonstrates the practical benefits of Twilight in LLM serving systems by reducing the Time-Per-Output-Token (TPOT).

The following figure (Figure 8 from the original paper) shows Time-Per-Output-Token (TPOT) improvements in end-to-end serving scenarios.

Figure 8: Time-Per-Output-Token (TPOT) improvements in end-to-end serving scenarios across batch sizes (32, 64, 128, 256) and sequence lengths (10k, 20k, 30k), comparing FlashInfer, Quest, and Quest-Twi.

6.1.2.3. Accuracy Comparison with Token Dropping Methods

Table 6 compares DS-Twilight against token-dropping methods like StreamingLLM and SnapKV on Longbench. DS-Twilight consistently achieves notably better performance across all tasks, with an average score of 38.71 compared to 32.69 for StreamingLLM and 36.22 for SnapKV. This reinforces the argument that token-selecting (which Twilight enhances) generally outperforms token-dropping because dropping tokens leads to irreversible information loss.

The following are the results from Table 6 of the original paper:

| Dataset | StreamingLLM (Budget=4096) | SnapKV (Budget=4096) | DS-Twilight |
| --- | --- | --- | --- |
| Qasper | 26.39 | 29.44 | 32.34 |
| MF-en | 33.2 | 40.03 | 43.89 |
| HotpotQA | 24.29 | 33.67 | 34.67 |
| 2WikiMQA | 20.1 | 24.13 | 25.43 |
| Musique | 10.87 | 13.45 | 13.84 |
| GovReport | 26.92 | 26.09 | 31.88 |
| QMSum | 20.8 | 22.53 | 23.01 |
| MultiNews | 26.46 | 25.61 | 26.32 |
| TriviaQA | 75.6 | 80.82 | 85.29 |
| PR-en | 24.17 | 30.25 | 35.50 |
| LCC | 52.47 | 52.62 | 55.03 |
| Repobench-P | 51.02 | 55.99 | 57.27 |
| Avg. | 32.69 | 36.22 | 38.71 |

6.1.2.4. Efficiency Evaluation in Offloading Scenarios

Table 7 demonstrates that Twilight yields even more significant gains in memory-offloading scenarios where the per-token loading cost is dominant. Quest-Twi achieves up to $16\times$ speedups compared to Quest. This is because Twilight effectively reduces the number of tokens that need to be loaded, directly addressing the bottleneck in offloading.

The following are the results from Table 7 of the original paper:

| Method | 10k | 20k | 30k |
| --- | --- | --- | --- |
| Quest | 3038.98 | 5990.75 | 8490.95 |
| Quest-Twi | 415.89 | 480.61 | 527.77 |

6.2. Ablation Studies / Parameter Analysis

6.2.1. Sensitivity to Threshold pp

The paper investigates the sensitivity of Twilight to the hyperparameter $p$. While $p$ is a new parameter, the authors argue it is more reasonable and tunable than $k$ because it represents accumulated probability, making it less affected by the varying attention weight distributions across heads, layers, and queries. Figure 9 illustrates the trade-off between PG-19 perplexity (accuracy) and sparse attention latency (efficiency) at different $p$ values. The results show that a balance between accuracy and efficiency is struck at $p \approx 0.85$. This informed the choice of $p=0.85$ for Longchat-7B-v1.5-32k and $p=0.95$ for the LLaMA-2/3 models.

The following figure (Figure 9 from the original paper) shows PG-19 perplexity (accuracy) and sparse attention latency (efficiency) under different threshold $p$ values.

Figure 9: PG-19 perplexity (accuracy, lower is better) and sparse attention latency (in microseconds) under different threshold $p$ values.

6.2.2. Time Breakdown for Twilight

Figure 10 provides an insightful time breakdown of the self-attention operation for Quest and Quest-Twi at different batch sizes. For a 32k retrieval task, Quest uses a budget of 8192 ($1/4$ sparsity), while Quest-Twi prunes this down to an average of 256 tokens. The breakdown confirms the theoretical cost model from Section 4.2.6.1: Twilight significantly reduces the time for the sparse attention kernel (the final computation step) while introducing minor overheads for the Token Selector and Twilight Pruner components. For instance, at batch size 64, Quest-Twi outperforms Quest by approximately $2\times$, demonstrating that the cost of pruning is well-justified by the reduced computation in the final attention step.

The following figure (Figure 10 from the original paper) shows the time breakdown of self-attention. At batch size 64, Quest-Twi outperforms Quest by about $2\times$.

Figure 10: Time breakdown of self-attention for Quest and Quest-Twi at batch sizes 16, 32, and 64. At batch size 64, Quest-Twi outperforms Quest by about $2\times$.

6.3. Full Results on Longbench

The detailed results on Longbench are provided in Table 5 in Appendix C of the original paper. These results corroborate the summary provided in Section 6.1.1.1, showing Twilight's ability to improve or maintain accuracy while significantly reducing the active budget across a diverse set of long-context tasks for both Longchat and LLaMA-3.1 models.

The following are the results from Table 5 of the original paper:

Longchat-7B-v1.5-32k:

| Method | Budget | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | Few-shot | Synthetic | LCC | Repobench-P | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | 32k | 29.48 | 42.11 | 30.97 | 23.74 | 13.11 | 31.03 | 22.77 | 26.09 | 83.25 | 30.50 | 52.70 | 55.62 | 36.78 |
| Twilight | (Avg. 141.4) | 31.95 | 43.91 | 33.59 | 25.65 | 13.93 | 32.19 | 23.15 | 26.30 | 85.14 | 34.50 | 54.98 | 57.12 | 38.52 (+4.7%) |
| Quest | 256 | 26.00 | 32.83 | 23.23 | 22.14 | 7.45 | 22.64 | 20.98 | 25.05 | 67.40 | 33.60 | 48.70 | 45.07 | 31.26 |
| Quest | 1024 | 31.63 | 42.36 | 30.47 | 24.42 | 10.11 | 29.94 | 22.70 | 26.39 | 84.21 | 34.5 | 51.52 | 53.95 | 36.85 |
| Quest | 4096 | 29.77 | 42.71 | 32.94 | 23.94 | 13.24 | 31.54 | 22.86 | 26.45 | 84.37 | 31.50 | 53.17 | 55.52 | 37.33 |
| Quest | 8192 | 29.34 | 41.70 | 33.27 | 23.46 | 13.51 | 31.18 | 23.02 | 26.48 | 84.70 | 30.00 | 53.02 | 55.57 | 37.10 |
| Twilight | (Avg. 131) | 31.95 | 43.28 | 31.62 | 24.87 | 13.48 | 32.21 | 22.79 | 26.33 | 84.93 | 33.50 | 54.86 | 56.70 | 38.04 (+2.5%) |
| DS | 256 | 28.28 | 39.78 | 27.10 | 20.75 | 9.34 | 29.68 | 21.79 | 25.69 | 83.97 | 32.00 | 52.01 | 53.44 | 35.32 |
| DS | 1024 | 30.55 | 41.27 | 30.85 | 21.87 | 7.27 | 26.82 | 22.95 | 26.51 | 83.22 | 31.50 | 53.23 | 55.50 | 35.96 |
| DS | 4096 | 28.95 | 41.90 | 32.52 | 23.65 | 8.07 | 29.68 | 22.75 | 26.55 | 83.34 | 30.00 | 52.7 | 55.48 | 36.31 |
| DS | 8192 | 29.05 | 41.42 | 31.79 | 22.95 | 12.50 | 30.44 | 22.50 | 26.43 | 83.63 | 30.50 | 52.87 | 55.33 | 36.62 |
| Twilight | (Avg. 126) | 32.34 | 43.89 | 34.67 | 25.43 | 13.84 | 31.88 | 23.01 | 26.32 | 85.29 | 35.50 | 55.03 | 57.27 | 38.71 (+5.7%) |

LLaMA-3.1-8B-Instruct:

| Method | Budget | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | Few-shot | Synthetic | LCC | Repobench-P | Avg. Score |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full | 128k | 46.17 | 53.33 | 55.36 | 43.95 | 27.08 | 35.01 | 25.24 | 27.37 | 91.18 | 99.50 | 62.17 | 57.76 | 52.01 |
| Twilight | (Avg. 478) | 43.08 | 52.99 | 52.22 | 44.83 | 25.79 | 34.21 | 25.47 | 26.98 | 91.85 | 100.00 | 64.06 | 58.22 | 51.64 (-0.7%) |
| MagicPIG | K=8, L=75 | 45.03 | 54.24 | 56.46 | 47.34 | 26.58 | 33.63 | 24.98 | 26.70 | 92.13 | 100.00 | 61.94 | 51.40 | 51.70 |
| MagicPIG | K=10, L=150 | 44.68 | 53.63 | 56.19 | 47.18 | 26.79 | 33.31 | 25.13 | 26.22 | 91.89 | 99.50 | 60.07 | 51.15 | 51.32 |
| Quest | 256 | 24.65 | 37.50 | 30.12 | 23.60 | 12.93 | 27.53 | 20.11 | 26.59 | 65.34 | 95.00 | 49.70 | 45.27 | 38.20 |
| Quest | 1024 | 38.47 | 49.32 | 47.43 | 38.48 | 20.59 | 33.71 | 23.67 | 26.60 | 81.94 | 99.50 | 60.78 | 52.96 | 47.79 |
| Quest | 4096 | 43.97 | 53.64 | 51.94 | 42.54 | 24.00 | 34.34 | 24.36 | 26.75 | 90.6 | 99.50 | 62.03 | 55.49 | 50.79 |
| Quest | 8192 | 44.34 | 53.25 | 54.72 | 44.84 | 25.98 | 34.62 | 24.98 | 26.70 | 91.61 | 100.00 | 62.02 | 54.20 | 51.44 |
| Twilight | (Avg. 427) | 43.44 | 53.2 | 53.77 | 43.56 | 25.42 | 34.39 | 25.23 | 26.99 | 91.25 | 100.0 | 63.55 | 58.06 | 51.57 (+0.3%) |
| DS | 256 | 38.24 | 49.58 | 43.38 | 31.98 | 15.52 | 33.40 | 24.06 | 26.86 | 84.41 | 99.50 | 53.28 | 48.64 | 45.74 |
| DS | 1024 | 42.97 | 54.65 | 51.5 | 33.92 | 20.39 | 34.50 | 24.92 | 26.71 | 92.81 | 99.50 | 62.66 | 48.37 | 49.43 |
| DS | 4096 | 43.50 | 53.17 | 54.21 | 44.70 | 23.14 | 34.73 | 25.40 | 26.71 | 92.78 | 99.50 | 62.59 | 51.31 | 50.98 |
| DS | 8192 | 43.82 | 53.71 | 54.19 | 45.13 | 23.72 | 34.27 | 24.98 | 26.69 | 91.61 | 100.00 | 62.40 | 52.87 | 51.14 |
| Twilight | (Avg. 446) | 43.08 | 52.89 | 54.68 | 44.86 | 24.88 | 34.09 | 25.20 | 27.00 | 91.20 | 100.00 | 62.40 | 52.87 | 51.73 (+1.2%) |

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively highlights a fundamental flaw in existing top-k sparse attention methods: their inability to dynamically adapt KV cache budgets to the varying attention weight distributions encountered in LLM inference. To address this, the authors introduce Twilight, a novel framework that leverages top-p sampling from LLM generation to bring adaptive sparsity to any existing sparse attention algorithm.

Twilight employs a hierarchical Select-then-Prune architecture. It first utilizes a base sparse attention algorithm to conservatively select a large token subset, which is then refined by the Twilight Pruner using top-p thresholding. This adaptive approach allows Twilight to prune up to 98% of redundant tokens while maintaining or even improving accuracy across various long-context and medium-context benchmarks.

Empirically, Twilight demonstrates substantial efficiency gains: up to a $15.4\times$ speedup for the self-attention operator and a $3.9\times$ reduction in end-to-end per-token latency in long-context LLM decoding. Compared to the base sparse attention algorithms it enhances, Twilight provides an additional $1.4\times$ speedup. The framework also features efficient kernel implementations, including a 4-bit quantized K cache and a binary-search-based top-p kernel.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

  • Estimation Overheads: While Twilight significantly reduces the sparse attention kernel computation, the estimation overheads (for Token Selector and Twilight Pruner) are still non-negligible. This makes Twilight particularly beneficial in scenarios where loading tokens from the KV cache is the dominant cost, such as LLM serving with large batch sizes or offloading to slower memory. Future work could focus on optimizing these estimation methods to further improve end-to-end latency and throughput.
  • Group Query Attention (GQA) Compatibility: The paper notes that head-wise dynamism (a key feature enabled by top-p) is not fully compatible with GQA architectures. This leads to a compromise where dynamism transitions to group-wise granularity, meaning all heads within a GQA group share a common token budget. Future research should explore how to more effectively integrate Twilight with emerging model architectures like GQA and Multi-Head Latent Attention (MLA) [64] to leverage head-wise dynamism without compromising efficiency.

7.3. Personal Insights & Critique

This paper presents a highly intuitive and impactful idea: transplanting the well-understood concept of top-p sampling from LLM output generation to the often more opaque realm of attention sparsity. The analogy is elegant and powerful, addressing a critical bottleneck in LLM efficiency in a principled way.

One of the strongest aspects of Twilight is its design as a framework rather than a standalone algorithm. This Select-then-Prune architecture makes it broadly applicable, allowing existing and future sparse attention methods to gain adaptive capabilities without extensive re-engineering. This modularity is a significant advantage in the rapidly evolving LLM landscape. The empirical results are compelling, demonstrating substantial speedups with minimal to no accuracy loss, which is the holy grail for LLM optimization.

The explicit consideration of 4-bit quantization for the K cache and efficient GPU kernel implementations (like the binary search for top-p) highlights a strong system-algorithm co-design approach. This pragmatic focus on deployment challenges ensures that the theoretical gains translate into real-world performance improvements. The detailed ablation studies on the $p$ threshold and the time breakdown further strengthen the paper's rigor, providing transparency into Twilight's costs and benefits.

A potential area for deeper exploration, as implicitly raised by the authors' GQA limitation, is the interaction between top-p pruning and the architectural nuances of different LLMs. While top-p is adaptive to attention weight distributions, the fundamental "importance" of tokens might not always perfectly align with their attention scores, especially in highly specialized attention heads (e.g., retrieval heads). Future work could investigate whether learned mechanisms could dynamically adjust the $p$ threshold, or whether top-p could be combined with other importance estimation signals for even more nuanced adaptive sparsity.

Overall, Twilight offers a promising direction for adaptive attention sparsity, emphasizing the importance of dynamic resource allocation in LLM inference. Its methods and conclusions could potentially be transferred to other domains requiring adaptive pruning or resource management in deep learning, especially where input feature importance varies dynamically.
