Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
TL;DR Summary
A systematic study using 1.7B dense and 15B MoE LLMs demonstrates that a simple head-specific sigmoid gate applied after SDPA consistently boosts performance, training stability, and long-context extrapolation. The gains come from introducing non-linearity and sparse, query-dependent gating, which mitigates the "attention sink" and enhances length generalization.
Abstract
Gating mechanisms have been widely utilized, from early models like LSTMs and Highway Networks to recent state space models, linear attention, and softmax attention. Yet, existing literature rarely examines the specific effects of gating. In this work, we conduct comprehensive experiments to systematically investigate gating-augmented softmax attention variants. Specifically, we compare 30 variants of 15B Mixture-of-Experts (MoE) models and 1.7B dense models trained on a 3.5 trillion token dataset. Our central finding is that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA), consistently improves performance. This modification also enhances training stability, tolerates larger learning rates, and improves scaling properties. By comparing various gating positions and computational variants, we attribute this effectiveness to two key factors: (1) introducing non-linearity upon the low-rank mapping in the softmax attention, and (2) applying query-dependent sparse gating scores to modulate the SDPA output. Notably, we find this sparse gating mechanism mitigates 'attention sink' and enhances long-context extrapolation performance, and we also release related code and models to facilitate future research.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
- Authors: Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, Dayiheng Liu, Jingren Zhou, Junyang Lin.
- Affiliations: The authors are primarily from the Qwen Team at Alibaba Group, with collaborators from the University of Edinburgh, Stanford University, MIT, and Tsinghua University. This indicates a strong industry-led research effort with academic collaboration.
- Journal/Conference: This paper is available as a preprint on arXiv. Preprints are common in the fast-moving field of AI for rapid dissemination of results, often before or during submission to a peer-reviewed conference or journal.
- Publication Year: The paper was submitted to arXiv in May 2025 (the arXiv identifier 2505.06708 and citations such as Lin et al., 2025 indicate it was prepared for submission to a 2025 conference).
- Abstract: The paper systematically investigates the effects of adding gating mechanisms to the standard softmax attention in large language models (LLMs). Through extensive experiments on 1.7B dense and 15B Mixture-of-Experts (MoE) models trained on 3.5 trillion tokens, the authors find that a simple modification, applying a head-specific sigmoid gate after the Scaled Dot-Product Attention (SDPA) output, consistently improves performance, training stability, and scaling properties. They attribute this success to two factors: (1) introducing non-linearity to the attention's low-rank value-to-output transformation and (2) applying query-dependent sparse gating scores that modulate the SDPA output. Notably, this sparse gating mechanism is shown to eliminate the "attention sink" phenomenon and improve the model's ability to generalize to longer contexts.
- Original Source Link: https://arxiv.org/abs/2505.06708 (PDF: http://arxiv.org/pdf/2505.06708v1)
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Gating mechanisms are widely used in neural networks, from LSTMs to modern Transformers and state-space models, but their specific contributions are often not well understood or isolated from other architectural changes. For example, recent methods like Switch Heads use gating for routing, but the paper finds the gate itself provides value even without routing.
- Importance & Gaps: Without a clear understanding of gating, it's difficult to assess its true value or design more effective architectures. Prior work has not systematically studied where and how to apply gates within the standard softmax attention block to maximize benefits. Furthermore, LLMs suffer from issues like training instability and the "attention sink" phenomenon (where initial tokens receive disproportionately high attention), which hinder performance, especially in long-context scenarios.
- Innovation: This paper provides the first large-scale, systematic investigation into various gating configurations for softmax attention. Its innovation lies in isolating the effects of gating and providing a clear, evidence-backed explanation for its effectiveness.
- Main Contributions / Findings (What):
- Optimal Gating Configuration: The authors identify that applying a head-specific, multiplicative sigmoid gate directly after the Scaled Dot-Product Attention (SDPA) output (the G₁ position) is the most effective configuration, consistently improving model performance across various benchmarks.
- Explanation of Effectiveness: The performance boost is attributed to two key mechanisms:
- Non-linearity: The gate breaks the linearity of the consecutive value projection ($W_v$) and output projection ($W_o$) matrices, which otherwise form a low-rank bottleneck, thereby increasing the model's expressive power.
- Query-Dependent Sparsity: The gate learns to produce sparse, input-dependent scores that selectively filter the SDPA output for each attention head, effectively acting as a dynamic information filter.
- Elimination of Attention Sink: The sparse gating mechanism is shown to completely mitigate the "attention sink" problem. Models with this gate no longer concentrate attention on initial tokens.
- Improved Training and Scaling: Gated attention enhances training stability, reduces loss spikes, and allows for the use of larger learning rates and batch sizes, which in turn leads to better model performance.
- Enhanced Long-Context Extrapolation: By eliminating the reliance on attention sinks, the gated models demonstrate significantly better performance when their context window is extended beyond what they were trained on.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Multi-Head Softmax Attention: The core component of the Transformer architecture. It allows a model to jointly attend to information from different representation subspaces at different positions. It involves projecting an input sequence into Queries (Q), Keys (K), and Values (V), calculating attention scores via dot products between Q and K, normalizing these scores with softmax to get weights, and finally computing a weighted sum of the Values.
- Gating Mechanism: A technique in neural networks that controls the flow of information. It typically involves an element-wise multiplication of an input tensor with a "gate" tensor, whose values are usually between 0 and 1 (produced by a sigmoid or similar function). This allows the network to dynamically decide what information to keep or discard.
- Mixture-of-Experts (MoE): A model architecture that uses a routing network to selectively activate only a subset of "expert" sub-networks for each input token. This allows for a massive increase in model parameters while keeping the computational cost per token constant.
- Attention Sink: A phenomenon observed in many LLMs where an outsized portion of attention scores is consistently allocated to the very first few tokens in a sequence (e.g., the start-of-sequence token), regardless of their relevance to the current query. This is believed to be a learned mechanism to store global information, but it can be inefficient and problematic for context extension.
- Previous Works & Differentiation:
- Early Gating: LSTMs and GRUs used gates to manage memory and combat vanishing gradients in recurrent networks. Highway Networks extended this to deep feedforward networks. SwiGLU introduced gating into the FFN layer of Transformers, which is now a standard practice.
- Modern Gating: Recent architectures like state-space models (Mamba) and linear attention variants (RetNet, FLASH) also heavily rely on gating to modulate their token-mixing outputs.
- Confounded Gating: Architectures like Switch Heads and Native Sparse Attention (NSA) use sigmoid-based gating for routing or selecting attention heads. However, their performance gains are a mix of the routing/sparsity mechanism and the intrinsic value of the gate itself. This paper's contribution is to disentangle these effects by showing that a simple, non-routing gate provides substantial benefits on its own.
- Attention Sink Mitigation: Previous works identified the attention sink problem and proposed solutions like modifying the softmax function or adding "key biases". This paper offers a novel solution: the attention sink is a side effect of the model needing a mechanism to discard irrelevant information, and query-dependent sparse gating provides a much more direct and effective way to achieve this, thereby rendering the sink unnecessary.
4. Methodology (Core Technology & Implementation)
The paper's methodology revolves around augmenting the standard multi-head attention layer with a gating mechanism and systematically evaluating different configurations.
- Principles: The core idea is that the standard attention mechanism has two potential weaknesses: (1) a low-rank bottleneck between the value ($W_v$) and output ($W_o$) projections, and (2) an inability to "turn off" or ignore irrelevant information from the SDPA output before it enters the residual stream. A gating mechanism can address both by introducing non-linearity and sparsity.
- Steps & Procedures: The standard multi-head attention layer consists of four stages:
- QKV Linear Projections: $Q = XW_q$, $K = XW_k$, $V = XW_v$.
- Scaled Dot-Product Attention (SDPA): $\mathrm{SDPA}(Q, K, V) = \mathrm{softmax}\big(QK^\top/\sqrt{d_k}\big)\,V$.
- Multi-Head Concatenation: Outputs from the h heads are concatenated.
- Final Output Layer: The concatenated output is projected by $W_o$.
The gating mechanism is formalized as $Y' = Y \odot \sigma(XW_\theta)$, where $Y$ is the input to be gated, $X$ is the input used to compute the gate scores (often the same hidden state from which $Y$ is derived), $W_\theta$ are learnable gate parameters, and $\sigma$ is an activation function like sigmoid. (A minimal code sketch of the best-performing variant follows the list of variants below.)
The authors explore several variants of this mechanism:
- Gating Position: As shown in Figure 1, gates are inserted at five different locations:
- G₁: After the SDPA output (found to be most effective).
- G₂: After the value projection ($V = XW_v$).
- G₃: After the key projection ($K = XW_k$).
- G₄: After the query projection ($Q = XW_q$).
- G₅: After the final output projection ($W_o$).
Figure 1 (description): a schematic of the SDPA block showing the five candidate gating positions (G1 through G5), with bar charts of average PPL and MMLU for each position and a training-loss curve comparing the baseline against the G1 (post-SDPA-output) gate. The G1 gate gives the largest improvement and a more stable, lower training loss.
- Granularity:
- Elementwise: Each dimension of the output vector gets its own gating score.
- Headwise: A single scalar gate score is applied to the entire output of an attention head.
- Head Specificity:
- Head-Specific: Each attention head has its own gating parameters $W_\theta$.
- Head-Shared: Gating parameters are shared across all heads.
- Application:
- Multiplicative: $Y' = Y \odot \sigma(XW_\theta)$ (standard).
- Additive: $Y' = Y + \sigma(XW_\theta)$.
- Activation Function: Sigmoid (output in $[0, 1]$) and SiLU (unbounded) are primarily tested.
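To make the best-performing configuration concrete, here is a minimal PyTorch sketch of multi-head attention with an elementwise, query-dependent sigmoid gate at the G₁ position (after SDPA, before concatenation and the output projection). This is an illustrative reconstruction, not the authors' released code; names such as `GatedAttention` and `gate_proj` are assumptions, and details like RoPE, KV caching, and padding masks are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedAttention(nn.Module):
    """Multi-head self-attention with an elementwise, head-specific sigmoid gate
    applied to the SDPA output (the G1 position). Illustrative sketch only."""

    def __init__(self, d_model: int, n_heads: int):
        super().__init__()
        assert d_model % n_heads == 0
        self.n_heads, self.d_head = n_heads, d_model // n_heads
        self.q_proj = nn.Linear(d_model, d_model, bias=False)
        self.k_proj = nn.Linear(d_model, d_model, bias=False)
        self.v_proj = nn.Linear(d_model, d_model, bias=False)
        self.o_proj = nn.Linear(d_model, d_model, bias=False)
        # Gate scores are computed from the same hidden state as the query,
        # so the gate is query-dependent; the projection produces one score
        # per head dimension (elementwise) with separate weights per head.
        self.gate_proj = nn.Linear(d_model, d_model, bias=False)

    def _split(self, t: torch.Tensor) -> torch.Tensor:
        b, n, _ = t.shape
        return t.view(b, n, self.n_heads, self.d_head).transpose(1, 2)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, n, d = x.shape
        q, k, v = self._split(self.q_proj(x)), self._split(self.k_proj(x)), self._split(self.v_proj(x))
        # Standard causal scaled dot-product attention.
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        # G1 gate: sigmoid scores multiply the SDPA output before concatenation.
        y = y * torch.sigmoid(self._split(self.gate_proj(x)))
        # Concatenate heads and apply the final output projection W_o.
        return self.o_proj(y.transpose(1, 2).reshape(b, n, d))

x = torch.randn(2, 16, 256)                        # (batch, seq_len, d_model)
out = GatedAttention(d_model=256, n_heads=8)(x)
print(out.shape)                                   # torch.Size([2, 16, 256])
```

In this sketch the gate is query-dependent because `gate_proj` reads the same hidden state that produces the query, and it is head-specific and elementwise because it outputs one score per head dimension; a headwise variant would instead produce a single scalar per head.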
5. Experimental Setup
- Datasets:
- Training: A large, high-quality dataset of 3.5 trillion tokens, including multilingual text, mathematical content, and general knowledge.
- Evaluation:
- Language Modeling: Perplexity (PPL) is calculated on diverse held-out test sets (English, Chinese, Code, Math, etc.).
- Downstream Benchmarks: Few-shot performance is measured on Hellaswag (common sense), MMLU (general knowledge), GSM8k (math reasoning), HumanEval (coding), C-eval, and CMMLU (Chinese proficiency).
- Long Context: The RULER benchmark is used to evaluate performance on sequences longer than the training context length.
- Evaluation Metrics:
- Perplexity (PPL): Measures how well a probability model predicts a sample. It is the exponentiated average negative log-likelihood: $\mathrm{PPL}(X) = \exp\big(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\big)$, where $X = (x_1, \dots, x_N)$ is the sequence of tokens and $p(x_i \mid x_{<i})$ is the probability of the $i$-th token given the preceding tokens. A lower PPL indicates the model is less "surprised" by the test data and thus has a better understanding of the language. (A toy numerical example follows this list.)
- Accuracy: Used for multiple-choice tasks like MMLU, Hellaswag, and C-eval. It is the ratio of correct predictions to the total number of examples.
- Pass@k: Used for code generation tasks like HumanEval. It measures the percentage of problems for which at least one of the top-k generated code samples passes all unit tests.
- Exact Match: Used for math reasoning tasks like GSM8k, where the generated answer must exactly match the ground truth.
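As a sanity check on the perplexity formula above, here is a toy computation from per-token probabilities; the numbers are made up purely for illustration.

```python
import math

# Per-token probabilities a model assigns to a 4-token sequence (made-up numbers).
token_probs = [0.25, 0.10, 0.50, 0.05]

avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
ppl = math.exp(avg_nll)
print(f"PPL = {ppl:.2f}")  # exp of the average negative log-likelihood; lower is better
```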
- Baselines:
- A standard 1.7B parameter dense Transformer.
- A 15B parameter MoE model (15A2B, meaning 2.54B parameters are activated per token).
- Parameter-equivalent Baselines: To ensure gains are not just from adding parameters, baselines are augmented by increasing the number of Q/K/V heads or MoE experts to match or exceed the parameter count of the gating mechanism.
- Stability Baselines: For the high-learning-rate experiments, the baseline is compared with a model using Sandwich Norm, another technique known to improve training stability.
6. Results & Analysis
Core Results
The experiments consistently show that gating, particularly at the G₁ position (after SDPA), provides significant benefits.
- MoE Model Comparison (Table 1): The following is a transcription of Table 1 from the paper.

| Method | Act Func | Score Shape | Added Param (M) | Avg PPL | Hellaswag | MMLU | GSM8k | C-eval |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Reference Baselines | | | | | | | | |
| (1) Baseline | - | - | 0 | 6.026 | 73.07 | 58.79 | 52.92 | 60.26 |
| (2) k = 8 | - | - | 50 | 5.979 | 73.51 | 59.78 | 52.16 | 62.26 |
| (3) q = 48 | - | - | 201 | 5.953 | 73.59 | 58.45 | 53.30 | 59.67 |
| (4) Add 4 Experts | - | - | 400 | 5.964 | 73.19 | 58.84 | 52.54 | 63.19 |
| Gating Position Variants | | | | | | | | |
| (5) SDPA Elementwise G1 | sigmoid | n × q × dk | 201 | 5.761 | 74.64 | 60.82 | 55.27 | 62.20 |
| (6) v Elementwise G2 | sigmoid | n × k × dk | 25 | 5.820 | 74.38 | 59.17 | 53.97 | 61.0 |
| (7) k Elementwise G3 | sigmoid | n × k × dk | 25 | 6.016 | 72.88 | 59.18 | 50.49 | 61.74 |
| (8) q Elementwise G4 | sigmoid | n × q × dk | 201 | 5.981 | 73.01 | 58.74 | 53.97 | 62.14 |
| (9) Dense Output G5 | sigmoid | n × dmodel | 100 | 6.017 | 73.32 | 59.41 | 50.87 | 59.43 |
| Gating Granularity Variants | | | | | | | | |
| (10) SDPA Headwise G1 | sigmoid | n × q | 1.6 | 5.792 | 74.50 | 60.05 | 54.44 | 62.61 |
| (11) v Headwise G2 | sigmoid | n × q | 0.2 | 5.808 | 74.38 | 59.32 | 53.53 | 62.61 |
| Head-Specific vs. Head-Shared | | | | | | | | |
| (12) SDPA Head-Shared G1 | sigmoid | n × dk | 201 | 5.801 | 74.34 | 60.06 | 53.15 | 61.01 |
| (13) v Head-Shared G2 | sigmoid | n × dk | 25 | 5.867 | 74.10 | 59.02 | 53.03 | 60.61 |
| Multiplicative vs. Additive | | | | | | | | |
| (14) SDPA Additive G1 | SiLU | n × q × dk | 201 | 5.821 | 74.81 | 60.06 | 53.30 | 60.98 |
| Activation Variants | | | | | | | | |
| (15) SDPA Elementwise G1 | SiLU | n × q × dk | 201 | 5.822 | 74.22 | 60.49 | 54.59 | 62.34 |

- Key Insight: SDPA Elementwise G1 (row 5) provides the best overall performance, significantly outperforming the baseline and all parameter-equivalent baselines. Gating at the value layer (G2) is also effective but less so than G1. Gating Q, K, or the final output has little to no positive effect. Even a very cheap SDPA Headwise G1 gate (row 10, only 1.6M params) gives a huge performance boost.
- Dense Model and Stability (Table 2): Experiments on dense models confirm the benefits and highlight improved stability. The gated models can be trained with a learning rate of 8.0e-3, which causes the baseline model to diverge. This tolerance for higher learning rates leads to even better final performance. The training loss curve in Figure 1 (right) shows the gated model is much more stable, with fewer loss spikes.
Analysis: Why Gating Works
- 1. Non-linearity: The paper argues that in standard attention, the value projection $W_v$ (mapping the hidden state into each head's value space) and the head-specific slice of the output projection $W_o$ (mapping the head output back to the model dimension) are two consecutive linear transformations. Their composition, $W_v W_o$, is a low-rank matrix (rank at most the head dimension), limiting the expressiveness of the value transformation. By inserting a non-linear function (like a gate or even RMSNorm) between these two steps, the model breaks this linearity and increases its expressive power, as the small numerical sketch below illustrates. Writing $A$ for the softmax attention weights:
- Gating at G₂: each head's output becomes $\big(A\,(\sigma(XW_\theta) \odot XW_v)\big)\,W_o$, placing the non-linearity between $W_v$ and $W_o$.
- Gating at G₁: each head's output becomes $\big(\sigma(XW_\theta) \odot (A\,XW_v)\big)\,W_o$, which likewise separates the two linear maps.
This explains why G₁ and G₂ are effective, while G₅ (applied after $W_o$) is not.
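The low-rank argument can be checked numerically. The sketch below uses illustrative dimensions (not the paper's) and shows that the composed map $W_v W_o$ has rank at most the head dimension, while inserting an elementwise gate between the two projections makes the value-to-output transformation non-linear and thus no longer a single low-rank matrix.

```python
import torch

d_model, d_head = 64, 8
W_v = torch.randn(d_model, d_head)   # value projection for one head
W_o = torch.randn(d_head, d_model)   # that head's slice of the output projection

# Composed linearly, the value-to-output map collapses to a single matrix
# whose rank is bounded by the head dimension.
composite = W_v @ W_o                                  # (d_model, d_model)
print(torch.linalg.matrix_rank(composite).item())      # <= 8: low-rank bottleneck

# With an elementwise sigmoid gate inserted between the two projections, the
# overall map is no longer one linear transformation, so it is not constrained
# to this low-rank form.
x = torch.randn(5, d_model)
gate = torch.sigmoid(torch.randn(5, d_head))           # query-dependent gate scores
y = (x @ W_v * gate) @ W_o
print(y.shape)                                         # torch.Size([5, 64])
```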
- 2. Sparsity: The analysis reveals that the most effective gating configurations produce highly sparse gate scores (i.e., many values close to zero). (Table 4 / Figure 3 description: a table comparing gating methods by activation function, gate score, two attention statistics (M-Act, F-Attn), and task metrics (PPL, Hellaswag, MMLU, GSM8k), plus three histograms of the gate-score distributions and their means, reflecting the sparsity and non-linearity characteristics of the different gating strategies.)
- Key Insight (Table 4 and Figure 3): The SDPA Elementwise Gate has a mean score of just 0.116. This sparsity is input-dependent and head-specific, allowing each head to dynamically filter out irrelevant information from the weighted value vectors for the current query token.
- Query-Dependency is Crucial: Gating at G₁ is query-dependent (the gate is computed from the query's hidden state), whereas gating at G₂ is not (it is computed from the key/value hidden states). The superior performance of G₁ suggests that filtering information based on the query's needs is more effective. Experiments with input-independent gates or a less-sparse "NS-sigmoid" function confirm that query-dependent sparsity is a key driver of performance. (A short measurement sketch follows this list.)
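A simple way to quantify this kind of sparsity is to inspect the distribution of gate scores. The sketch below uses random stand-in scores shifted toward zero to mimic a sparse gate; in a real analysis the scores would be logged from a trained model's gating layer, and the tensor shape and 0.1 threshold are assumptions for illustration.

```python
import torch

# Stand-in gate scores with shape (batch, n_heads, seq_len, d_head); the -2 shift
# just makes most random sigmoid outputs small, mimicking a sparse gate.
gate_scores = torch.sigmoid(torch.randn(2, 16, 128, 64) - 2.0)

print(f"mean gate score: {gate_scores.mean().item():.3f}")
print(f"fraction of scores below 0.1: {(gate_scores < 0.1).float().mean().item():.2%}")
print("per-head mean scores:", gate_scores.mean(dim=(0, 2, 3))[:4].tolist())
```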
- 3. Attention-Sink-Free: The paper provides compelling evidence that sparse gating at the SDPA output eliminates the attention sink. (Figure 2 description: first-token attention scores across layers for the baseline and the gated (atten_output_gate) model; the baseline's scores are high and fluctuate noticeably, while the gated model's are low and uniform. Attention heat maps at layers 21 and 23 show the baseline concentrating attention on the first token, whereas the gated model's attention is far more dispersed, indicating the gate reduces the attention-sink effect.)
- Key Insight (Figure 2): The baseline model allocates an average of 46.7% of its attention to the first token. The gated model allocates only 4.8%. The attention maps show the baseline has a strong vertical line at the first-token position, which vanishes in the gated model. (A small measurement sketch follows this list.)
- Sparsity is the Cause: The authors hypothesize that the attention sink is a crude, learned mechanism for the model to ignore irrelevant information (by dumping attention scores on a "dummy" token). The sparse gate provides a much more precise, query-dependent tool for this filtering, making the attention sink obsolete.
- Massive Activations: While gating also reduces massive hidden-state activations, it is not the primary cause of sink elimination. Gating at G₂ reduces activations but not the sink, decoupling the two phenomena.
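The first-token attention share discussed above can be measured directly from a model's attention weights. A minimal sketch follows; the attention tensor here is random, whereas in practice it would be captured from the model, for example via a forward hook on the attention module.

```python
import torch

def first_token_attention_share(attn_weights: torch.Tensor) -> float:
    """attn_weights: (batch, n_heads, q_len, k_len) with rows summing to 1.
    Returns the average attention mass placed on the first key position."""
    return attn_weights[..., 0].mean().item()

# Toy check with random attention weights standing in for real model outputs.
attn = torch.softmax(torch.randn(1, 16, 256, 256), dim=-1)
print(f"first-token attention share: {first_token_attention_share(attn):.1%}")
```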
- 4. Long-Context Extrapolation: Eliminating the attention sink has a profound benefit for extending the model's context length.
- Transcription of Table 5: Performance across varying sequence lengths

| Method | 4k | 8k | 16k | 32k | 64k | 128k |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline | 88.89 | 85.88 | 83.15 | 79.50 | - | - |
| SDPA-Gate | 90.56 | 87.11 | 84.61 | 79.77 | - | - |
| YaRN Extended | | | | | | |
| Baseline | 82.90 (-6.0) | 71.52 (-14.4) | 61.23 (-21.9) | 37.94 (-41.56) | 37.51 | 31.65 |
| SDPA-Gate | 88.13 (-2.4) | 80.01 (-7.1) | 76.74 (-7.87) | 72.88 (-6.89) | 66.60 | 58.82 |

- Key Insight: When using YaRN to extend the context window to 128k, the baseline model's performance collapses dramatically, even within its original 32k training length. The gated model degrades much less and maintains strong performance at 64k and 128k, outperforming the baseline by a huge margin. The hypothesis is that the baseline's reliance on the attention-sink pattern is brittle and does not adapt well to the RoPE modifications used for context extension, whereas the gated model's dynamic filtering is more robust.
7. Conclusion & Reflections
- Conclusion Summary: This work provides a rigorous and comprehensive empirical study of gating in softmax attention. The central finding is that a simple, head-specific sigmoid gate applied after the SDPA output is a highly effective and low-cost modification for LLMs. This improvement is driven by the introduction of non-linearity and query-dependent sparsity. A major practical benefit is the complete elimination of the "attention sink", which not only improves model behavior but also significantly enhances its ability to generalize to sequence lengths far beyond its training window. The authors release their attention-sink-free models to facilitate further research.
- Limitations & Future Work: The authors acknowledge that while they provide strong empirical evidence, their work has limitations:
- The theoretical understanding of how non-linearity and sparsity affect the training dynamics and final capabilities of LLMs is still incomplete.
- A formal theoretical explanation for why eliminating attention sinks leads to better long-context generalization is not provided.
- Personal Insights & Critique:
- Strengths: This is an excellent example of impactful, practical research. The paper addresses a clear gap in understanding, the experiments are extensive and well-designed (e.g., using parameter-equivalent baselines), and the analysis is insightful and convincing. The connection drawn between sparsity, attention sinks, and long-context performance is a significant contribution. The proposed modification is simple to implement and has clear benefits for performance and stability, making it likely to be adopted in future models.
- Critique: The work is heavily empirical, and as the authors note, it lacks a deep theoretical foundation. While the intuitions about non-linearity and sparsity are compelling, a more formal analysis could strengthen the claims. However, given the current state of LLM theory, this is a minor critique. The paper's strength lies in its clear, actionable findings.
- Future Impact: This research could have a significant impact on LLM architecture design. It provides a strong justification for making gated attention a standard component. The finding that attention sinks can be eliminated without performance loss (and with gains in other areas) may lead to a re-evaluation of how models handle global information and context. The released attention-sink-free models will be a valuable resource for the community.