- Title: Native Sparse Attention: Hardware-Aligned and Natively Trainable Sparse Attention
- Authors: Jingyang Yuan, Huazuo Gao, Damai Dai, Junyu Luo, Liang Zhao, Zhengyan Zhang, Zhenda Xie, Y. X. Wei, Lean Wang, Zhiping Xiao, Yuqing Wang, Chong Ruan, Ming Zhang, Wenfeng Liang, Wangding Zeng.
- Affiliations: The authors are from DeepSeek-AI, Peking University, and the University of Washington. This indicates a collaboration between a leading AI industry lab and academic institutions.
- Journal/Conference: The paper is available as a preprint on arXiv. Preprints are common in fast-moving fields like AI, allowing for rapid dissemination of results before formal peer review.
- Publication Year: 2025. The arXiv preprint was submitted in February 2025.
- Abstract: The abstract introduces NSA (Natively trainable Sparse Attention), a mechanism designed to make long-context language models more efficient. The core problem is the high computational cost of standard attention. NSA addresses this with a dynamic, hierarchical sparse strategy that combines coarse-grained token compression with fine-grained token selection. The key innovations are (1) achieving significant speedups through hardware-aligned algorithm design and (2) enabling end-to-end training without performance loss. Experiments show NSA models perform on par with or better than Full Attention models on various benchmarks while achieving substantial speedups (e.g., on 64k-length sequences) in decoding, forward propagation, and backward propagation.
- Original Source Link:
2. Executive Summary
4. Methodology (Core Technology & Implementation)
NSA replaces the standard attention mechanism with a dynamic, multi-branch architecture that constructs a compact set of key-value pairs $(\tilde{K}_t, \tilde{V}_t)$ for each query $q_t$.
The overall framework is shown in Image 1. For a given query $q_t$, the preceding keys and values $(k_{:t}, v_{:t})$ are processed through three parallel branches.

The outputs of these three branches are combined using a learned gating mechanism:
$$o_t^* = \sum_{c \in \mathcal{C}} g_t^c \cdot \text{Attn}\left(q_t, \tilde{K}_t^c, \tilde{V}_t^c\right)$$
- Explanation:
- $\mathcal{C} = \{\text{cmp}, \text{slc}, \text{win}\}$ represents the three branches: compression, selection, and sliding window.
- $\tilde{K}_t^c, \tilde{V}_t^c$ are the keys and values generated by each branch $c$.
- $g_t^c$ is a gate score (between 0 and 1) that learns how much to weigh the output of each branch. It is computed by a small MLP.
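To make the gating concrete, here is a minimal PyTorch sketch, assuming the three per-branch attention outputs have already been computed; the gate MLP, tensor shapes, and names are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn as nn

def gated_combine(q, branch_outputs, gate_mlp):
    """Combine per-branch attention outputs with learned gates.

    q: [d_model] query; branch_outputs: dict of branch name -> [d_model] output."""
    gates = torch.sigmoid(gate_mlp(q))                 # one score in (0, 1) per branch
    out = torch.zeros_like(q)
    for i, c in enumerate(("cmp", "slc", "win")):
        out = out + gates[i] * branch_outputs[c]       # g_t^c * Attn(q_t, K_t^c, V_t^c)
    return out

d_model = 64
gate_mlp = nn.Linear(d_model, 3)                       # stand-in for the small gating MLP
q_t = torch.randn(d_model)
branch_out = {c: torch.randn(d_model) for c in ("cmp", "slc", "win")}
o_star = gated_combine(q_t, branch_out, gate_mlp)      # final output o_t^*
```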
4.1. Branch 1: Token Compression (cmp)
This branch creates a coarse-grained, global view of the context.
- Principle: It groups sequential tokens into blocks and compresses each block into a single representative key/value vector. This captures the gist of long-range information at a low computational cost.
- Formula: The compressed keys are formed as:
$$\tilde{K}_t^{\text{cmp}} = f_K^{\text{cmp}}(k_{:t}) = \left\{ \varphi\left(k_{id+1:id+l}\right) \;\middle|\; 0 \leq i \leq \left\lfloor \frac{t-l}{d} \right\rfloor \right\}$$
- Explanation:
- $l$ is the block length (e.g., 32 tokens).
- $d$ is the stride (e.g., 16 tokens). A stride smaller than the block length creates overlapping blocks to avoid information loss at the boundaries.
- $\varphi$ is a learnable MLP that takes a block of key vectors and outputs a single compressed key vector.
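A minimal sketch of this branch, assuming $\varphi$ is a plain MLP over the flattened block of key vectors (the paper's $\varphi$ may additionally use intra-block position information); block length, stride, and dimensions follow the hyperparameters above, and all names are illustrative:

```python
import torch
import torch.nn as nn

class BlockCompressor(nn.Module):
    """Compress each block of l keys into a single key vector via a small MLP (phi)."""

    def __init__(self, d_k: int, block_len: int):
        super().__init__()
        self.block_len = block_len
        self.phi = nn.Sequential(
            nn.Linear(block_len * d_k, d_k), nn.GELU(), nn.Linear(d_k, d_k)
        )

    def forward(self, keys: torch.Tensor, stride: int) -> torch.Tensor:
        """keys: [t, d_k] -> compressed keys: [num_blocks, d_k]."""
        t = keys.shape[0]
        blocks = []
        # i ranges over 0 <= i <= floor((t - l) / d), matching the formula above;
        # a stride smaller than the block length makes consecutive blocks overlap.
        for i in range((t - self.block_len) // stride + 1):
            start = i * stride
            block = keys[start:start + self.block_len]      # [l, d_k]
            blocks.append(self.phi(block.reshape(-1)))      # one compressed key [d_k]
        return torch.stack(blocks)                          # [num_blocks, d_k]

compressor = BlockCompressor(d_k=64, block_len=32)
k_cmp = compressor(torch.randn(256, 64), stride=16)          # -> [15, 64]
```

The same compression would be applied to values; with $l = 32$ and $d = 16$, consecutive blocks overlap by half, matching the boundary-preserving design described above.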
4.2. Branch 2: Token Selection (slc)
This branch preserves fine-grained details by selecting the most important token blocks.
- Principle: Instead of selecting individual tokens (which is hardware-inefficient), NSA selects contiguous blocks of tokens. The selection is guided by the attention scores from the compression branch, which provides a cheap way to estimate block importance.
- Steps & Procedures:
- Importance Score Computation: The attention scores computed between the query $q_t$ and the compressed keys $\tilde{K}_t^{\text{cmp}}$ are used as a proxy for the importance of the corresponding regions in the original sequence.
$$p_t^{\text{cmp}} = \text{Softmax}\left(q_t^{T} \tilde{K}_t^{\text{cmp}}\right)$$
- Score Aggregation for GQA: For GQA models, the importance scores from all query heads within a group are summed. This ensures all heads in the group select the same KV blocks, which is crucial for efficient memory access during decoding.
$$p_t^{\text{slc}'} = \sum_{h=1}^{H} p_t^{\text{slc},(h)}$$
- Top-$n$ Block Selection: The model selects the top $n$ blocks with the highest aggregated importance scores. The original (uncompressed) tokens from these selected blocks form the key-value set $\tilde{K}_t^{\text{slc}}, \tilde{V}_t^{\text{slc}}$.
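A minimal sketch of the selection step, assuming for simplicity that compression blocks and selection blocks coincide one-to-one (the paper handles mismatched block sizes, e.g. $l = 32$ vs. $l' = 64$, with a score-remapping step omitted here); all names and shapes are illustrative:

```python
import torch

def select_blocks(q_group, k_cmp, keys, values, block_size, n_select):
    """q_group: [H, d_k] query heads of one GQA group; k_cmp: [num_blocks, d_k];
    keys/values: [t, d_k] original cache. Returns the KV of the top-n blocks."""
    d_k = q_group.shape[-1]
    # Importance of each block per head, from attention over the compressed keys...
    scores = torch.softmax(q_group @ k_cmp.T / d_k ** 0.5, dim=-1)   # [H, num_blocks]
    # ...then summed over the group, so every head selects the same blocks.
    group_scores = scores.sum(dim=0)                                  # [num_blocks]
    top = torch.topk(group_scores, k=min(n_select, group_scores.numel())).indices
    # Gather the original (uncompressed) tokens of the selected blocks.
    idx = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                     for b in top.tolist()])
    idx = idx[idx < keys.shape[0]]                                    # drop out-of-range tail
    return keys[idx], values[idx]

k_slc, v_slc = select_blocks(
    q_group=torch.randn(4, 64), k_cmp=torch.randn(8, 64),
    keys=torch.randn(512, 64), values=torch.randn(512, 64),
    block_size=64, n_select=2,
)
```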
4.3. Branch 3: Sliding Window (win)
This branch explicitly focuses on local context.
- Principle: LLMs often rely heavily on recent tokens. By dedicating a separate branch to a fixed-size sliding window of the most recent w tokens (e.g., w=512), the model can handle local dependencies without them "shortcutting" or dominating the learning process in the other two branches. This architectural separation helps the compression and selection branches specialize in learning long-range patterns.
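A minimal sketch of the window branch, which simply restricts attention to the last $w$ cached tokens; single-head, unbatched shapes are assumed for clarity:

```python
import torch
import torch.nn.functional as F

def window_attention(q, keys, values, w=512):
    """q: [d_k]; keys/values: [t, d_k]. Attend only to the most recent w tokens."""
    k_win, v_win = keys[-w:], values[-w:]              # local slice of the KV cache
    scores = (k_win @ q) / q.shape[-1] ** 0.5          # [min(t, w)]
    return F.softmax(scores, dim=-1) @ v_win           # [d_k]

o_win = window_attention(torch.randn(64), torch.randn(1024, 64), torch.randn(1024, 64))
```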
4.4. Kernel Design for Hardware Alignment
The authors designed a custom Triton kernel to ensure the algorithmic ideas translate to real-world speed. Image 2 illustrates the data flow.

- Key Optimizations:
- Group-Centric Data Loading: Instead of loading queries one by one, the kernel loads all queries belonging to the same GQA group into fast on-chip SRAM at once.
- Shared KV Fetching: Since all queries in a GQA group attend to the same sparse set of KV blocks, these blocks are fetched from the slower High Bandwidth Memory (HBM) only once and shared among the queries. This minimizes redundant memory transfers, the primary bottleneck in decoding.
- Optimized Loop Scheduling: The workload is balanced across the GPU's processing units to maximize utilization.
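The actual kernel is written in Triton; the plain-PyTorch sketch below only illustrates the group-centric idea during decoding, where the selected KV blocks are gathered once and reused by every query head in a GQA group (names, shapes, and block indices are illustrative assumptions):

```python
import torch

def group_decode_step(q_group, k_cache, v_cache, block_ids, block_size):
    """q_group: [H, d_k] heads of one GQA group; k_cache/v_cache: [t, d_k]."""
    # Gather the selected sparse KV blocks once and share them across every
    # head in the group (the kernel performs this HBM -> SRAM fetch only once).
    idx = torch.cat([torch.arange(b * block_size, (b + 1) * block_size)
                     for b in block_ids])
    k_blk, v_blk = k_cache[idx], v_cache[idx]                   # [n*block_size, d_k]
    scores = q_group @ k_blk.T / q_group.shape[-1] ** 0.5       # [H, n*block_size]
    return torch.softmax(scores, dim=-1) @ v_blk                # [H, d_k]

out = group_decode_step(torch.randn(4, 64), torch.randn(4096, 64),
                        torch.randn(4096, 64), block_ids=[0, 3, 7], block_size=64)
```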
5. Experimental Setup
- Model: The experiments use a 27B-parameter Transformer that combines Mixture-of-Experts (MoE) with Grouped-Query Attention (GQA), activating 3B parameters per forward pass.
- Training:
- The models (both NSA and Full Attention baseline) were pretrained on 270B tokens of 8k-length text.
- They were then further adapted for long contexts using 32k-length text with the YaRN position encoding scaling method.
- For the reasoning task, the models were fine-tuned on 10B tokens of mathematical reasoning data.
- NSA Hyperparameters:
- Compression block size l = 32
- Compression stride d = 16
- Selection block size l' = 64
- Number of selected blocks n = 16
- Sliding window size w = 512
- Datasets:
- General Benchmarks: MMLU, MMLU-PRO (knowledge), BBH, GSM8K, MATH (reasoning), MBPP, HumanEval (coding).
- Long-Context Benchmarks: LongBench (various long-document tasks) and Needle-in-a-Haystack (testing information retrieval from specific locations in a long context).
- Reasoning Benchmark: AIME (American Invitational Mathematics Examination) for evaluating complex chain-of-thought reasoning.
- Evaluation Metrics: Standard metrics for each benchmark, such as Accuracy, F1-score, and Pass@1 for code generation.
- Baselines:
- Full Attention: A model with the same architecture but using standard, non-sparse attention.
- State-of-the-art inference-only sparse methods: H2O, InfLLM, Quest, and Exact-Top (an oracle method that computes full attention first and then selects the top scores).
6. Results & Analysis
The results robustly demonstrate that NSA achieves both high performance and high efficiency.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces NSA, a sparse attention mechanism that is both highly efficient and performant. By co-designing a hierarchical sparse algorithm with a hardware-aligned kernel and enabling native, end-to-end training, NSA overcomes the key limitations of prior work. It delivers substantial speedups across the entire model lifecycle (training, prefilling, decoding) while maintaining or even improving upon the capabilities of standard Full Attention models, particularly in long-context and complex reasoning tasks.
- Limitations & Future Work:
- Hyperparameter Sensitivity: The performance of NSA depends on several hyperparameters (l, d, l', n, w). The paper uses a fixed set of values, but the optimal configuration might vary across different model sizes, tasks, or data distributions. Future work could explore methods to learn these parameters dynamically.
- Architectural Complexity: The three-branch architecture with gating is more complex than standard attention. While efficient, this adds complexity to the model implementation and understanding.
- Generalization to Other Modalities: The work is focused on language models. Investigating the applicability of NSA to other domains like vision or multimodal models, where attention patterns might differ, would be a valuable direction.
- Personal Insights & Critique:
- A Paradigm Shift: NSA represents a significant step forward for sparse attention. The emphasis on native trainability and hardware co-design is a crucial lesson for the field. It moves beyond simply approximating full attention at inference to building fundamentally more efficient architectures from the ground up.
- Practical Impact: The demonstrated speedups are not just theoretical; they are practical and substantial. A 6-9x reduction in training cost for long sequences could dramatically lower the barrier to developing powerful long-context models. Similarly, a >10x speedup in decoding makes such models far more viable for real-world applications.
- The "Inductive Bias" of Sparsity: The finding that NSA sometimes outperforms Full Attention is fascinating. It suggests that building sparsity into the architecture acts as a powerful inductive bias, forcing the model to learn more robust and generalizable representations by filtering out noise. This challenges the long-held assumption that sparse attention is merely a compromise for efficiency.