Kimi Linear: An Expressive, Efficient Attention Architecture
TL;DR Summary
Kimi Linear, a hybrid of Kimi Delta Attention (KDA) and multi-head latent attention (MLA), improves both performance and efficiency in short-context, long-context, and RL settings. A model with 3B activated parameters (48B total) outperforms a full-attention baseline while achieving up to 6x decoding throughput and a 75% KV cache reduction at 1M context.
Abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
In-depth Reading
1. Bibliographic Information
1.1. Title
Kimi Linear: An Expressive, Efficient Attention Architecture
The title clearly states the paper's central topic: the introduction of a new attention architecture named Kimi Linear. It highlights its two main claimed properties: being expressive (capable of high performance) and efficient (computationally less demanding).
1.2. Authors
The authors are listed as "Kimi Team" from Moonshot AI, along with several external collaborators. The full list of contributors is provided in Appendix A, indicating a large-scale team effort typical of major industry AI labs. Key contributors include Yu Zhang, Zongyu Lin, Xingcheng Yao, and many others. This collaborative structure suggests a project with significant engineering and research resources behind it.
1.3. Journal/Conference
The paper is presented as a technical report on arXiv, a preprint server. This means it has not yet undergone a formal peer-review process for a conference or journal. Publishing on arXiv is a common practice in the fast-paced field of AI/ML to quickly disseminate new findings to the research community. The metadata indicates a publication date in the future (October 2025), which suggests this analysis is based on a preliminary or hypothetical version of the paper.
1.4. Publication Year
2025 (according to the provided metadata and references within the paper).
1.5. Abstract
The abstract introduces Kimi Linear, a hybrid linear attention architecture that claims to outperform standard full attention in a wide range of scenarios (short-context, long-context, and RL). The core innovation is Kimi Delta Attention (KDA), an extension of Gated DeltaNet that uses a finer-grained gating mechanism for more effective memory management. The authors also highlight a custom, hardware-efficient chunkwise algorithm based on a specialized Diagonal-Plus-Low-Rank (DPLR) formulation. They present a 3B parameter model that outperforms a full-attention baseline across all tasks while significantly reducing KV cache usage (by 75%) and improving decoding throughput (up to 6x for 1M context). The paper concludes by announcing the open-sourcing of the KDA kernel, vLLM integration, and model checkpoints to encourage further research.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.26692
- PDF Link: https://arxiv.org/pdf/2510.26692v2.pdf
- Publication Status: The paper is an unreviewed preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The primary challenge this paper addresses is the computational bottleneck of the standard Transformer architecture, whose self-attention mechanism has a computational and memory complexity that scales quadratically with the input sequence length $n$, i.e., $O(n^2)$. This quadratic scaling makes processing very long sequences (e.g., millions of tokens) prohibitively expensive, which is a major obstacle for next-generation applications like advanced AI agents, reinforcement learning (RL) over long trajectories, and deep document analysis.
While linear attention methods, which scale linearly ($O(n)$), have been proposed as a solution, they have historically struggled to match the performance and expressivity of standard softmax attention, often lagging even on short-sequence tasks. This performance gap has prevented their widespread adoption.
The paper's entry point is to bridge this gap by designing a new linear attention mechanism that is not only efficient but also more expressive than its predecessors. The innovative idea is to enhance an existing linear attention framework (Gated DeltaNet) with more precise, fine-grained control over its internal memory state, allowing it to selectively retain and forget information more effectively.
2.2. Main Contributions / Findings
The paper presents three main contributions:
- Kimi Delta Attention (KDA): A novel linear attention module that serves as the core of the new architecture. It improves upon Gated DeltaNet by replacing its coarse, scalar "forget gate" with a fine-grained, channel-wise gating mechanism. This allows the model to control the decay of information in its memory state with much higher precision for each feature dimension, enhancing its expressiveness.
- The Kimi Linear Architecture: A hybrid model design that strategically interleaves efficient KDA layers with a small number of standard full attention layers (Multi-Head Latent Attention, or MLA). The paper finds an optimal 3:1 ratio (three KDA layers for every one MLA layer), creating a balance that preserves global information flow while drastically reducing the memory footprint and computational cost associated with long sequences.
- Fair Empirical Validation at Scale: The authors conduct extensive experiments, training a Kimi Linear model with 3B activated parameters on 1.4 trillion tokens and comparing it against a full-attention baseline with an identical training recipe. Their findings show that Kimi Linear outperforms the full-attention model across a diverse set of benchmarks, including short- and long-context tasks and RL-style evaluations. This demonstrates that Kimi Linear can serve as a "drop-in replacement" for full attention, offering superior performance and efficiency. To support these claims, the authors are open-sourcing their custom CUDA kernels, vLLM integration, and pre-trained models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Transformer and Self-Attention
The Transformer, introduced by Vaswani et al. in 2017, is a neural network architecture that has become the foundation for most modern Large Language Models (LLMs). Its key innovation is the self-attention mechanism.
- Conceptual Definition: Self-attention allows a model to weigh the importance of different words (tokens) in an input sequence when processing a specific word. For each token, it generates a representation by attending to all other tokens in the sequence, including itself. This allows the model to capture long-range dependencies and contextual relationships.
- Mechanism: For each input token, we create three vectors: a Query ($Q$), a Key ($K$), and a Value ($V$). The Query vector represents the current token's request for information. The Key vectors of all other tokens act as labels or identifiers. The Value vectors contain the actual information of each token.
  - Score: The similarity between the Query of the current token and the Key of every other token is calculated, typically using a dot product.
  - Scale: The scores are scaled by the square root of the key dimension ($\sqrt{d_k}$) to stabilize gradients.
  - Softmax: A softmax function is applied to the scaled scores to convert them into a probability distribution (attention weights), ensuring they sum to 1.
  - Weighted Sum: The Value vectors are multiplied by their corresponding attention weights and summed to produce the final output representation for the current token.
- Mathematical Formula: The standard scaled dot-product attention is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
- Symbol Explanation:
  - $Q$: The matrix of query vectors.
  - $K$: The matrix of key vectors.
  - $V$: The matrix of value vectors.
  - $d_k$: The dimension of the key vectors.
  - $K^T$: The transpose of the key matrix.
- Complexity: The calculation of the $QK^T$ matrix, which has dimensions $n \times n$ (sequence length × sequence length), is the source of the $O(n^2)$ computational and memory complexity, where $n$ is the sequence length.
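To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention with a causal mask. The array shapes and the `causal` flag are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) matrix: the O(n^2) bottleneck
    if causal:
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 4)
```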
3.1.2. Linear Attention
- Conceptual Definition: Linear attention is a class of methods designed to reduce the complexity of self-attention from quadratic ($O(n^2)$) to linear ($O(n)$). The core idea is to re-order the computations to avoid explicitly forming the large attention matrix.
- Mechanism: Instead of using the softmax function, linear attention methods apply an associative kernel feature map $\phi$ to the queries and keys. This allows the calculation to be re-ordered: $ \mathrm{Attention} \approx (\phi(Q)\phi(K)^\top)V = \phi(Q)\left(\phi(K)^\top V\right) $ By first computing $\phi(K)^\top V$ (a small, fixed-size matrix) and then multiplying by $\phi(Q)$, the explicit $n \times n$ attention matrix is never created. This formulation also has an equivalent recurrent (RNN-like) representation, which is highly efficient for autoregressive decoding. The state can be updated as: $ \mathbf{S}_t = \mathbf{S}_{t-1} + \phi(\mathbf{k}_t) \mathbf{v}_t^\top, \quad \mathbf{o}_t = \mathbf{S}_t^\top \phi(\mathbf{q}_t) $ Here, the state $\mathbf{S}_t$ accumulates key-value information over time, and the output is computed by querying this state. This requires only constant time and memory per step during inference.
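A minimal sketch of the recurrent view, using a simple elementwise feature map ($\phi = \mathrm{elu} + 1$, a common but illustrative choice); normalization terms used by some linear-attention variants are omitted.

```python
import numpy as np

def phi(x):
    # ELU + 1: a common positive feature map for linear attention (illustrative choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_decode(qs, ks, vs):
    """qs, ks: (n, d_k); vs: (n, d_v). O(n) time, O(d_k * d_v) state."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))                 # running state S_t
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = S + np.outer(phi(k), v)          # S_t = S_{t-1} + phi(k_t) v_t^T
        outs.append(S.T @ phi(q))            # o_t = S_t^T phi(q_t)
    return np.stack(outs)

rng = np.random.default_rng(0)
out = linear_attention_decode(rng.normal(size=(8, 4)),
                              rng.normal(size=(8, 4)),
                              rng.normal(size=(8, 3)))
print(out.shape)  # (8, 3)
```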
3.1.3. State Space Models (SSMs)
- Conceptual Definition: SSMs are a class of sequence models inspired by classical control theory. They map an input sequence x(t) to an output sequence y(t) through a latent (hidden) state h(t).
- Mathematical Formula: A continuous-time SSM is defined by the linear ordinary differential equations: $ \frac{dh(t)}{dt} = \mathbf{A}h(t) + \mathbf{B}x(t), \quad y(t) = \mathbf{C}h(t) + \mathbf{D}x(t) $ In practice, these are discretized for use in deep learning.
- Symbol Explanation:
  - h(t): The latent state vector.
  - x(t): The input vector.
  - y(t): The output vector.
  - $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}$: State matrices that are learned during training.
- Relevance: Recent models like Mamba have shown that by making the state matrices ($\mathbf{A}, \mathbf{B}, \mathbf{C}$) data-dependent (i.e., functions of the input $x_t$), SSMs can achieve highly selective and expressive sequence modeling. This connects them closely to linear attention, as both can be formulated as efficient, recurrent systems.
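A minimal sketch of a discretized diagonal SSM recurrence under a zero-order-hold-style discretization. The step size, diagonal A, and random matrices are illustrative simplifications, not the parameterization used by Mamba or this paper.

```python
import numpy as np

def diagonal_ssm(x, A_diag, B, C, dt=0.01):
    """x: (n, d_in). Discretize dh/dt = A h + B x with step dt, then scan over time.
    A_diag: (d_state,) negative entries for stability; B: (d_state, d_in); C: (d_out, d_state)."""
    A_bar = np.exp(dt * A_diag)              # exact ZOH transition for a diagonal A
    B_bar = (A_bar - 1.0) / A_diag           # per-dimension input scaling from ZOH
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * (B @ x_t)    # h_k = A_bar h_{k-1} + B_bar B x_k
        ys.append(C @ h)                      # y_k = C h_k
    return np.stack(ys)

rng = np.random.default_rng(0)
y = diagonal_ssm(rng.normal(size=(16, 2)),
                 A_diag=-np.linspace(0.1, 1.0, 8),
                 B=rng.normal(size=(8, 2)),
                 C=rng.normal(size=(3, 8)))
print(y.shape)  # (16, 3)
```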
3.1.4. Mixture of Experts (MoE)
- Conceptual Definition: MoE is an architectural technique to increase model capacity without a proportional increase in computational cost. Instead of a single, dense feed-forward network (FFN) layer, an MoE layer consists of multiple "expert" FFNs and a "gating network" or "router."
- Mechanism: For each input token, the router dynamically selects a small number of experts (e.g., 2 out of 64) to process it. The outputs of the selected experts are then combined, often via a weighted sum determined by the router. This means that while the total number of parameters in the model can be very large (e.g., 48B in this paper), the number of activated parameters used for any given token remains small (e.g., 3B).
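A minimal sketch of top-k expert routing for a single token. The expert count, top-k value, and softmax-renormalized gate weights are illustrative choices and do not reflect Kimi Linear's actual MoE configuration.

```python
import numpy as np

def moe_forward(x, router_W, expert_weights, top_k=2):
    """x: (d,); router_W: (num_experts, d); expert_weights: list of (d, d) matrices.
    Only the top_k selected experts are evaluated for this token."""
    logits = router_W @ x                                  # router score per expert
    top = np.argsort(logits)[-top_k:]                      # indices of the selected experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                            # renormalize over selected experts
    # Weighted sum of the selected experts' outputs (each toy expert is a single matrix).
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
out = moe_forward(rng.normal(size=d),
                  rng.normal(size=(n_experts, d)),
                  [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)  # (8,)
```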
3.2. Previous Works
3.2.1. DeltaNet and Gated DeltaNet (GDN)
- DeltaNet: This work reinterprets linear attention from a "fast weight programmer" perspective. It frames the state update as performing online gradient descent on a reconstruction loss objective $\mathcal{L}_t(\mathbf{S}) = \frac{1}{2}\|\mathbf{S}^\top \mathbf{k}_t - \mathbf{v}_t\|^2$. This means the model continually tries to update its memory to correctly map the current key $\mathbf{k}_t$ to the current value $\mathbf{v}_t$. The update rule, known as the classical delta rule, is: $ \mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top $ This provides a more stable learning dynamic than simple additive updates but retains all information indefinitely.
- Gated DeltaNet (GDN): GDN improves DeltaNet by introducing a simple but effective "forgetting" mechanism. It adds a scalar forget gate $\alpha_t \in (0, 1)$ that acts as a form of weight decay on the memory state, allowing the model to forget outdated information. The update rule becomes: $ \mathbf{S}_t = \alpha_t (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top $ This scalar gate is applied uniformly across all dimensions of the state, which is the key limitation KDA aims to address.
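A minimal sketch contrasting the two recurrences above, with the per-step scalars alpha and beta supplied externally; this is an illustration of the update rules, not the authors' kernel.

```python
import numpy as np

def deltanet_step(S, k, v, beta):
    """Classical delta rule: S_t = (I - beta k k^T) S_{t-1} + beta k v^T."""
    return S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, alpha, beta):
    """Gated DeltaNet: S_t = alpha (I - beta k k^T) S_{t-1} + beta k v^T, with scalar alpha."""
    S_decayed = alpha * S                                   # uniform decay of the whole state
    return S_decayed - beta * np.outer(k, k @ S_decayed) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
S = gated_deltanet_step(S, k, v, alpha=0.95, beta=0.5)
print(S.shape)  # (4, 3)
```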
3.2.2. Mamba and DPLR
- Mamba: A highly successful SSM that introduced a selection mechanism through data-dependent state matrices. It can be seen as an RNN with gates that control how much information flows from the input into the state and how much the state is forgotten at each step. Its efficiency comes from a parallel scan algorithm for training and a recurrent mode for inference.
- Diagonal-Plus-Low-Rank (DPLR): A technique for structuring the state transition matrix (the $\mathbf{A}$ matrix in SSMs or the state-transition matrix in linear attention) to be both expressive and computationally efficient. A DPLR matrix is the sum of a diagonal matrix and a low-rank matrix (e.g., $\mathbf{D} + \mathbf{a}\mathbf{b}^\top$). This structure is more powerful than a purely diagonal matrix (as in some earlier SSMs) but more structured and efficient than a dense matrix. KDA's transition matrix is a specialized, constrained form of DPLR.
3.2.3. Hybrid Attention Architectures
The idea of combining different types of attention is not new.
- Intra-layer hybrids: Some models mix attention mechanisms within the same layer, for example by running some standard attention heads in parallel with Mamba-style heads.
- Inter-layer hybrids: Models like Jamba and the one in this paper alternate between different types of layers.
Kimi Linear's approach of regularly interleaving KDA and full MLA layers simplifies the architecture and KV cache management compared to more complex hybrid designs.
3.3. Technological Evolution
The field has evolved from the monolithic full attention mechanism to a diverse ecosystem of efficient alternatives.
- Full Attention: The original Transformer, powerful but quadratically expensive.
- Early Efficient Attention: Researchers explored sparse attention (attending to only a subset of tokens) and linear attention (reordering computations). These often traded performance for efficiency.
- Rise of SSMs: Models like S4 and especially Mamba demonstrated that structured state-space models could achieve performance competitive with Transformers at linear complexity, revitalizing interest in RNN-like architectures.
- Modern Hybrids: Recognizing that full attention and linear/SSM methods have complementary strengths (global lookup vs. efficient state compression), recent work has focused on creating hybrid architectures. Kimi Linear fits into this latest stage, proposing a specific and highly optimized hybrid of its novel KDA linear attention and standard full attention.
3.4. Differentiation Analysis
Compared to prior works, Kimi Linear's innovations are:
- vs. GDN/Mamba2: KDA introduces channel-wise (diagonal) gating instead of scalar/head-wise gating. This allows for more nuanced control over memory, as different feature dimensions can have different decay rates.
- vs. General DPLR: KDA uses a constrained DPLR formulation where the low-rank update vectors $\mathbf{a}_t$ and $\mathbf{b}_t$ are tied to the key vector $\mathbf{k}_t$. This simplifies computation and improves hardware efficiency and numerical stability, avoiding issues that require workarounds in more general DPLR models like GLA.
- vs. Other Hybrids: Kimi Linear proposes a simple, regular 3:1 interleaving of KDA and full attention layers. This is less complex than intra-layer hybrids and simplifies implementation and optimization.
- vs. RoPE-based models: The Kimi Linear architecture deliberately uses No Position Embedding (NoPE) in its full attention layers, forcing the model to rely on the
KDA layers to dynamically learn and encode positional information and recency bias. This is a strong design choice that differs from the common practice of applying RoPE everywhere.
4. Methodology
4.1. Principles
The core principle of Kimi Linear is to enhance the expressiveness of linear attention by giving it more precise control over its finite-state memory. Standard Gated DeltaNet (GDN) uses a single scalar value per head to control forgetting, which treats all information within that head's memory state equally. Kimi Linear posits that this is too coarse. By introducing a channel-wise gating mechanism, different features within the memory can be retained or forgotten at different rates. This fine-grained control allows the model to better manage its limited memory, preserving crucial long-term information while discarding irrelevant details. This enhanced expressiveness is achieved while maintaining and even improving hardware efficiency through a specialized, computationally cheaper variant of the DPLR matrix structure.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Kimi Delta Attention (KDA)
KDA is the foundational block of the Kimi Linear model. It refines the Gated DeltaNet (GDN) update rule by introducing a diagonalized gate, enabling fine-grained control over memory decay.
The state update for KDA is defined by the following recurrence relation, which is a modification of the GDN rule.
- Step 1: The KDA Recurrence Relation At each time step $t$, the model updates its memory state matrix $\mathbf{S}_{t-1}$ to a new state $\mathbf{S}_t$. This update rule is presented in Equation 1 of the paper: $ \mathbf{S}_t = \left( \mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top \right) \mathrm{Diag}\left( \alpha_t \right) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top \in \mathbb{R}^{d_k \times d_v} $ After updating the state, the output vector is computed as: $ \mathbf{o}_t = \mathbf{S}_t^\top \mathbf{q}_t \in \mathbb{R}^{d_v} $
- Formula Explanation:
  - $\mathbf{S}_t$: The memory state matrix at time step $t$. It stores key-value associations.
  - $\mathbf{S}_{t-1}$: The memory state from the previous time step.
  - $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$: The query, key, and value vectors for the current token, with dimensions $d_k$, $d_k$, and $d_v$ respectively.
  - $\beta_t$: A scalar learning rate or update gate, determining the magnitude of the update.
  - $\mathbf{I}$: The identity matrix.
  - $\mathbf{k}_t \mathbf{k}_t^\top$: An outer product that creates a rank-1 matrix. The term $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)$ is the "delta rule" update, which corrects the memory based on the current key.
  - $\mathrm{Diag}(\alpha_t)$: This is the core innovation of KDA. It is a diagonal matrix whose diagonal elements are given by the vector $\alpha_t \in [0, 1]^{d_k}$. This matrix applies a channel-wise decay to the previous state $\mathbf{S}_{t-1}$. This contrasts with GDN, which uses a scalar $\alpha_t$ that multiplies the entire state matrix, applying the same decay to all channels.
  - $\beta_t \mathbf{k}_t \mathbf{v}_t^\top$: This term writes the new key-value association into the memory.

This recurrent formulation is efficient for autoregressive generation (inference) but slow for training on parallel hardware like GPUs. Therefore, a parallel chunkwise algorithm is needed.
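A minimal NumPy sketch of the per-token KDA recurrence in Equation 1, to make the channel-wise decay concrete. It is a naive reference loop written for clarity, not the chunkwise kernel the paper actually uses, and the dimensions and gate values are made up.

```python
import numpy as np

def kda_step(S, q, k, v, alpha, beta):
    """One KDA recurrence step (cf. Eq. 1).
    S: (d_k, d_v) state; q, k, alpha: (d_k,); v: (d_v,); beta: scalar in [0, 1].
    Returns the new state and the output o_t = S_t^T q_t."""
    S_decayed = alpha[:, None] * S                          # Diag(alpha_t) S_{t-1}: channel-wise decay
    S_new = S_decayed - beta * np.outer(k, k @ S_decayed)   # apply (I - beta k k^T) to the decayed state
    S_new = S_new + beta * np.outer(k, v)                   # write the new key-value association
    return S_new, S_new.T @ q                               # o_t = S_t^T q_t

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
for _ in range(5):
    q, k = rng.normal(size=d_k), rng.normal(size=d_k)
    k = k / np.linalg.norm(k)                               # keys are L2-normalized in the paper
    v = rng.normal(size=d_v)
    alpha = rng.uniform(0.8, 1.0, size=d_k)                 # fine-grained, per-channel decay
    S, o = kda_step(S, q, k, v, alpha, beta=0.5)
print(o.shape)  # (3,)
```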
4.2.2. Hardware-Efficient Chunkwise Algorithm
To train efficiently, the input sequence is split into chunks of size , and computations are parallelized within and across these chunks. The paper provides a complex but efficient formulation based on the WY representation of Householder transformations.
- Step 2: Chunkwise Formulation and Parallelization The paper derives a parallel algorithm for updating the state across a chunk. The state at the end of a chunk, $\mathbf{S}_{[t+1]}$, can be calculated from the state at the beginning of the chunk, $\mathbf{S}_{[t]}$, without a sequential loop. The final formula for this update is given in Equation 6: $ \mathbf{S}_{[t+1]} = \mathrm{Diag}(\gamma_{[t]}^{C}) \mathbf{S}_{[t]} + (\mathbf{T}_{[t]}^{i \to C} \odot \mathbf{K}_{[t]})^{\top} (\mathbf{U}_{[t]} - \mathbf{W}_{[t]} \mathbf{S}_{[t]}) \in \mathbb{R}^{d_k \times d_v} $ To get to this point, several intermediate auxiliary matrices, $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$, must be computed. These are derived using the UT transform, an efficient method for accumulating Householder transformations. The computation of $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$ is given in Equation 5: $ \mathbf{M}_{[t]} = \left( \mathbf{I} + \mathrm{StrictTri}\left( \mathrm{Diag}(\beta_{[t]}) (\mathbf{T}_{[t]}^{1 \to C} \odot \mathbf{K}_{[t]}) \left( \frac{\mathbf{K}_{[t]}}{\mathbf{T}_{[t]}^{1 \to C}} \right)^{\top} \right) \right)^{-1} \mathrm{Diag}(\beta_{[t]}) $ $ \mathbf{W}_{[t]} = \mathbf{M}_{[t]} (\mathbf{T}_{[t]}^{1 \to C} \odot \mathbf{K}_{[t]}), \qquad \mathbf{U}_{[t]} = \mathbf{M}_{[t]} \mathbf{V}_{[t]} $
- Formula Explanation:
  - $\mathbf{S}_{[t]}$: State at the beginning of chunk $t$.
  - $\mathbf{K}_{[t]}, \mathbf{V}_{[t]}$: Matrices containing all key and value vectors for chunk $t$.
  - $\gamma_{[t]}^{C}$: The cumulative decay over the entire chunk of size $C$.
  - $\mathbf{T}_{[t]}^{1 \to C}$ (and $\mathbf{T}_{[t]}^{i \to C}$): Matrices of cumulative decays within the chunk.
  - $\odot$: Element-wise multiplication.
  - $\mathrm{StrictTri}(\cdot)$: A lower-triangular mask that excludes the diagonal.
  - The matrix inversion in $\mathbf{M}_{[t]}$ looks computationally expensive, but because the matrix being inverted is lower-triangular with a unit diagonal, its inverse can be applied efficiently via forward substitution.
  - $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$ are intermediate matrices that effectively pre-compute the intra-chunk interactions in a parallel-friendly manner.
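To illustrate why the inverse in Equation 5 is cheap, here is a minimal sketch that applies $(\mathbf{I} + \mathbf{L})^{-1}$ to a matrix by forward substitution when $\mathbf{L}$ is strictly lower-triangular. The matrix sizes are arbitrary illustrative choices.

```python
import numpy as np

def apply_unit_lower_inverse(L_strict, B):
    """Compute X = (I + L_strict)^{-1} B by forward substitution.
    L_strict: (C, C) strictly lower-triangular; B: (C, d)."""
    C = L_strict.shape[0]
    X = np.zeros_like(B)
    for i in range(C):
        X[i] = B[i] - L_strict[i, :i] @ X[:i]   # row i depends only on rows < i
    return X

rng = np.random.default_rng(0)
C, d = 8, 4
L_strict = np.tril(rng.normal(size=(C, C)), k=-1)
B = rng.normal(size=(C, d))
X = apply_unit_lower_inverse(L_strict, B)
assert np.allclose((np.eye(C) + L_strict) @ X, B)   # verifies the triangular solve
print(X.shape)  # (8, 4)
```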
4.2.3. Efficiency Analysis: KDA as a Constrained DPLR
The paper shows that KDA's transition matrix is a special case of the general Diagonal-Plus-Low-Rank (DPLR) structure, which brings significant computational benefits.
- Step 3: Relating KDA to DPLR A general DPLR update has the form $\mathbf{S}_t = (\mathrm{Diag}(\alpha_t) + \mathbf{a}_t \mathbf{b}_t^\top)\,\mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top$, with freely parameterized low-rank vectors $\mathbf{a}_t$ and $\mathbf{b}_t$. The KDA update rule can be rewritten to match this form, as shown in Equation 14: $ \mathbf{S}_t = (\mathbf{D} - \mathbf{a}_t \mathbf{b}_t^\top) \mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top, \quad \mathrm{s.t.} \quad \mathbf{D} = \mathrm{Diag}(\alpha_t), \ \mathbf{a}_t = \beta_t \mathbf{k}_t, \ \mathbf{b}_t = \mathbf{k}_t \odot \alpha_t $
- Explanation of the Constraint: In a general DPLR model, the vectors $\mathbf{a}_t$ and $\mathbf{b}_t$ would be parameterized independently. In KDA, both are tied to the key vector $\mathbf{k}_t$ and the decay vector $\alpha_t$.
- Benefit of the Constraint: This constraint is crucial for efficiency. As the authors explain, general DPLR formulations can suffer from numerical precision issues (especially with division by decay terms) that require computationally expensive workarounds like secondary chunking. By tying $\mathbf{a}_t$ and $\mathbf{b}_t$ to $\mathbf{k}_t$, KDA's algorithm avoids these issues and significantly reduces the number of matrix multiplications in the chunkwise computation. This yields a kernel that is roughly twice as fast (about 100% faster) than a general DPLR implementation, as shown in Figure 2.
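A small numeric check of the identity behind Equation 14, namely that $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathrm{Diag}(\alpha_t) = \mathrm{Diag}(\alpha_t) - (\beta_t \mathbf{k}_t)(\mathbf{k}_t \odot \alpha_t)^\top$. The dimension and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 6
k = rng.normal(size=d_k)
alpha = rng.uniform(0.0, 1.0, size=d_k)
beta = 0.7

# KDA transition written as in Eq. 1: (I - beta k k^T) Diag(alpha)
lhs = (np.eye(d_k) - beta * np.outer(k, k)) @ np.diag(alpha)

# The same transition in constrained-DPLR form: D - a b^T with D = Diag(alpha), a = beta*k, b = k*alpha
rhs = np.diag(alpha) - np.outer(beta * k, k * alpha)

assert np.allclose(lhs, rhs)
print("KDA transition matches the constrained DPLR form")
```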
The following figure (Figure 2 from the original paper) shows the execution time comparison of the general DPLR kernel versus the specialized KDA kernel.
The figure compares the execution time of the DPLR and KDA kernels across input lengths (batch size 1, 16 heads). The curves show that KDA's execution time is significantly lower than DPLR's for long inputs, reflecting its superior computational efficiency.
4.2.4. The Kimi Linear Model Architecture
Kimi Linear is not just the KDA module but a full model architecture built around it.
The overall architecture is depicted in the following figure from the paper.
The figure is a schematic of the Kimi Linear model architecture, showing the layer structure and data flow of Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and the core Kimi Delta Attention (KDA) modules, including the combination of shared and routed experts.
- Step 4: Neural Parameterization The inputs to the KDA module ($\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t, \alpha_t, \beta_t$) are generated from the token input representation $\mathbf{x}_t$ using small neural networks. The paper specifies the following parameterizations (Equation 7): $ \begin{aligned} \mathbf{q}_t^h, \mathbf{k}_t^h &= \mathrm{L2Norm}(\mathrm{Swish}(\mathrm{ShortConv}(\mathbf{W}_{q/k}^h \mathbf{x}_t))) \in \mathbb{R}^{d_k} \\ \mathbf{v}_t^h &= \mathrm{Swish}(\mathrm{ShortConv}(\mathbf{W}_v^h \mathbf{x}_t)) \in \mathbb{R}^{d_v} \\ \alpha_t^h &= f(\mathbf{W}_\alpha^{\uparrow} \mathbf{W}_\alpha^{\downarrow} \mathbf{x}_t) \in [0, 1]^{d_k} \\ \beta_t^h &= \mathrm{Sigmoid}(\mathbf{W}_\beta^h \mathbf{x}_t) \in [0, 1] \end{aligned} $
- Formula Explanation:
  - A short 1D convolution (ShortConv) followed by a Swish activation is applied to the projected inputs for q, k, v. This helps capture local context.
  - $\mathbf{q}_t^h$ and $\mathbf{k}_t^h$ are L2-normalized to stabilize training.
  - The channel-wise decay vector $\alpha_t^h$ is parameterized via a low-rank projection ($\mathbf{W}_\alpha^{\uparrow} \mathbf{W}_\alpha^{\downarrow}$), which is a parameter-efficient way to generate a high-dimensional vector.
  - The scalar update gate $\beta_t^h$ is generated via a linear projection followed by a Sigmoid activation.
- Step 5: Output Gating and Hybrid Structure The output of the KDA module is further processed before being passed to the next layer. This is described in Equation 8 (a toy sketch of this parameterization and gating appears after this list): $ \mathbf{o}_t = \mathbf{W}_o \left( \mathrm{Sigmoid}\left( \mathbf{W}_g^{\uparrow} \mathbf{W}_g^{\downarrow} \mathbf{x}_t \right) \odot \mathrm{RMSNorm}\left( \mathrm{KDA}\left( \mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t, \alpha_t, \beta_t \right) \right) \right) $
- Formula Explanation:
  - The raw output of the KDA module is first normalized using RMSNorm.
  - A data-dependent output gate, generated from the input $\mathbf{x}_t$ via a low-rank projection and a Sigmoid activation, is applied element-wise ($\odot$). This helps the model control the information flow and can mitigate issues like the "attention sink".
  - Finally, the gated output is projected back to the model's hidden dimension via $\mathbf{W}_o$.
- Architectural Choices:
  - Hybrid Model: The final Kimi Linear model interleaves three KDA layers with one full attention layer (Multi-Head Latent Attention, or MLA). This 3:1 ratio was found to provide the best trade-off between performance and efficiency.
  - MoE Integration: After the attention block (either KDA or MLA), the architecture uses a Mixture of Experts (MoE) layer for channel mixing, following the Moonlight model design.
  - No Position Embeddings (NoPE): A crucial design choice is that the full attention (MLA) layers do not use any explicit position embeddings like RoPE. This delegates the entire responsibility of encoding positional information and recency bias to the recurrent dynamics of the KDA layers, which the authors argue act as a form of learnable, dynamic position embedding.
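A minimal sketch of the gate parameterizations in Equations 7-8, restricted to the low-rank decay $\alpha_t$, the scalar $\beta_t$, and the sigmoid output gate. The projection sizes, the choice of Sigmoid for $f$, and the RMSNorm epsilon are illustrative assumptions; the short convolutions and multi-head structure are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2) + eps)

def kda_gates(x, W_alpha_dn, W_alpha_up, w_beta):
    """Channel-wise decay via a low-rank projection and a scalar update gate (cf. Eq. 7)."""
    alpha = sigmoid(W_alpha_up @ (W_alpha_dn @ x))   # in [0, 1]^{d_k}: one decay per channel
    beta = sigmoid(w_beta @ x)                       # scalar in [0, 1]
    return alpha, beta

def output_gate(x, kda_out, W_g_dn, W_g_up, W_o):
    """Sigmoid output gate applied to the RMS-normalized KDA output (cf. Eq. 8)."""
    gate = sigmoid(W_g_up @ (W_g_dn @ x))
    return W_o @ (gate * rmsnorm(kda_out))

rng = np.random.default_rng(0)
d_model, d_k, d_v, r = 16, 8, 8, 4
x, kda_out = rng.normal(size=d_model), rng.normal(size=d_v)
alpha, beta = kda_gates(x, rng.normal(size=(r, d_model)), rng.normal(size=(d_k, r)),
                        rng.normal(size=d_model))
y = output_gate(x, kda_out, rng.normal(size=(r, d_model)), rng.normal(size=(d_v, r)),
                rng.normal(size=(d_model, d_v)))
print(alpha.shape, float(beta), y.shape)  # (8,) <scalar> (16,)
```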
5. Experimental Setup
5.1. Datasets
The paper uses a comprehensive set of benchmarks to evaluate the model across various capabilities.
- Synthetic Tasks: Used to test the core memory and retrieval capabilities of the linear attention mechanisms.
  - Palindrome: The model must reverse a sequence of random tokens. This tests its ability to store and recall a sequence in order from a compressed memory state.
  - Multi-Query Associative Recall (MQAR): The model is given a sequence of key-value pairs and then queried with some of the keys, for which it must retrieve the corresponding values. This tests associative memory.
  - State Tracking: A task involving multiple stacks where the model must process PUSH and POP operations and predict the correct element when a POP occurs. This tests complex state management.
- General Language Understanding and Reasoning:
  - Hellaswag, ARC-Challenge, Winogrande, TriviaQA: Standard commonsense reasoning and question-answering benchmarks.
  - MMLU, MMLU-Redux, MMLU-Pro: Massive multitask language understanding benchmarks testing knowledge across dozens of subjects.
  - GPQA-Diamond, BBH: Challenging reasoning benchmarks designed to be difficult for current LLMs.
- Code Generation: LiveCodeBench, EvalPlus, CRUXEval: Benchmarks for evaluating a model's ability to generate functionally correct code from natural language descriptions.
- Math & Reasoning: AIME, MATH, HMMT, PolyMath-en, GSM8K: Benchmarks testing mathematical problem-solving and logical reasoning capabilities, from grade-school level to competition math.
- Long-context Evaluation: MRCR, RULER, Frames, HELMET-ICL, RepoQA, Long Code Arena, LongBench v2: A suite of benchmarks specifically designed to evaluate performance on tasks requiring long-context understanding (up to 1M tokens).
- Chinese Language Understanding: C-Eval, CMMLU: Benchmarks for evaluating performance on Chinese language tasks.

The choice of these datasets is effective because they cover a wide spectrum of abilities: core memory (synthetic), general knowledge, complex reasoning (math, code), and the primary target application of long-context processing.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate model performance.
5.2.1. Perplexity (PPL)
- Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In language modeling, it quantifies the model's uncertainty in predicting the next token in a sequence. A lower perplexity indicates that the model is more confident and accurate in its predictions. It is the exponentiated average negative log-likelihood.
- Mathematical Formula: $ \text{PPL}(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, \dots, w_{i-1}) \right) $
- Symbol Explanation:
  - $W$: A sequence of tokens $w_1, \dots, w_N$.
  - $N$: The total number of tokens in the sequence.
  - $p(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the model to the token $w_i$, given the preceding tokens.
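A minimal sketch of computing perplexity from per-token log-probabilities; the toy probabilities below are made up for illustration.

```python
import numpy as np

def perplexity(token_log_probs):
    """token_log_probs: iterable of log p(w_i | w_<i), in natural log."""
    lps = np.asarray(token_log_probs, dtype=float)
    return float(np.exp(-lps.mean()))   # exponentiated average negative log-likelihood

# Toy example: a 4-token sequence with these conditional probabilities under the model.
probs = [0.25, 0.50, 0.10, 0.40]
print(perplexity(np.log(probs)))  # ~3.76; lower is better
```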
5.2.2. Accuracy (Acc.)
- Conceptual Definition: Accuracy is the proportion of correct predictions among the total number of predictions made. It is a straightforward measure of performance on classification or multiple-choice tasks.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation: This is self-explanatory.
5.2.3. Pass@k
- Conceptual Definition: Pass@k is a metric used primarily for evaluating code generation models. It measures the probability that at least one of $k$ generated code samples for a given problem will pass a set of predefined unit tests. It rewards models that can produce a correct solution within a few attempts.
- Mathematical Formula: To estimate Pass@k, one generates $n$ samples per problem ($n \ge k$), finds that $c$ of them are correct, and then calculates the unbiased estimator: $ \text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] $
- Symbol Explanation:
  - $n$: The total number of code samples generated for a single problem.
  - $k$: The number of samples considered for a "pass" (e.g., Pass@1, Pass@10).
  - $c$: The number of generated samples that correctly pass the unit tests.
  - $\binom{n}{k}$: The binomial coefficient, "n choose k".
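A minimal implementation of the estimator above; the example counts (n=20, c=3) are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct (requires n >= k)."""
    if n - c < k:
        return 1.0                       # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 3 of which pass the unit tests.
print(pass_at_k(n=20, c=3, k=1))        # ~0.15
print(round(pass_at_k(n=20, c=3, k=10), 4))
```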
5.2.4. Avg@k
The paper mentions using Avg@k for benchmarks with high variance.
- Conceptual Definition: This typically refers to the practice of running the evaluation $k$ times with different random seeds and reporting the average score. This helps to produce a more stable and reliable estimate of the model's performance by smoothing out randomness in the generation or evaluation process.
5.3. Baselines
The primary baselines used for comparison are:
- MLA (Multi-Head Latent Attention): The full-attention baseline model, built on the Moonlight architecture. It uses standard quadratic-time attention in all layers. This comparison is crucial to validate the core claim that Kimi Linear can outperform full attention.
- GDN-H (Gated DeltaNet-Hybrid): A hybrid model similar to Kimi Linear but using the original Gated DeltaNet (with scalar gating) instead of KDA. This baseline serves to isolate the benefit of KDA's fine-grained gating mechanism over its direct predecessor.
- Kimi Linear (RoPE): A variant of the Kimi Linear model that uses RoPE in its global attention layers instead of NoPE. This baseline is used in the long-context evaluation to specifically test the effectiveness of the NoPE design choice.

These baselines are well-chosen because they allow for controlled comparisons that ablate the key architectural contributions of the paper: the linear attention mechanism itself (KDA vs. GDN), the hybrid structure (Kimi Linear vs. MLA), and the positional encoding strategy (NoPE vs. RoPE).
6. Results & Analysis
6.1. Core Results Analysis
The paper's experimental results consistently support the central claim that Kimi Linear is a superior architecture, providing both performance gains and efficiency improvements over standard full attention.
6.1.1. Synthetic Task Performance
The results on synthetic tasks (Figure 4) serve as a controlled test of the memory capabilities of different linear attention mechanisms.
The following figure (Figure 4 from the original paper) shows these results.
The figure consists of three sub-plots showing how the accuracy of different models on the synthetic tasks (Palindrome, Multi-Query Associative Recall, and State Tracking) varies with sequence length and training steps.
- Analysis: KDA consistently achieves the highest accuracy across all three tasks (Palindrome, MQAR, State Tracking), especially as the sequence length increases. Its strong performance on Palindrome and MQAR, which heavily rely on precise storage and retrieval, suggests that its fine-grained gating mechanism is more effective at managing memory than the scalar gates of GDN and Mamba2. It can more precisely decide which pieces of information to keep and which to discard.
6.1.2. Scaling Law Analysis
The scaling law experiments (Figure 5) compare how Kimi Linear and the full-attention MLA baseline improve as more computational resources are used for training.
The following figure (Figure 5 from the original paper) shows the fitted scaling law curves.
The figure shows the fitted scaling-law curves for MLA and Kimi Linear, with training compute (PFLOP/s-days) on the horizontal axis and loss on the vertical axis; Kimi Linear achieves a lower loss than MLA at the same compute.
- Analysis: The curves show that for a given amount of training compute (PFLOP/s-days), Kimi Linear achieves a lower validation loss than MLA. The paper quantifies this as a ~1.16x improvement in computational efficiency. This means Kimi Linear learns more effectively from the same amount of computation, suggesting its architecture is inherently more efficient at modeling the data.
6.1.3. Main Results on Pre-trained and Instruction-Tuned Models
The core comparison is between Kimi Linear, MLA, and GDN-H models trained on 1.4T tokens. The results demonstrate Kimi Linear's superiority in standard short-context evaluations.
- Pre-training Results (Base Models): In Table 3, Kimi Linear outperforms both MLA and GDN-H on nearly all benchmarks, including general knowledge (MMLU, BBH), reasoning (GSM8K, CRUXEval), and Chinese tasks (CEval, CMMLU). This is a powerful result, as it shows that even before fine-tuning, the architecture is more capable than a standard Transformer.
- Post-training Results (Instruct Models): This trend continues after instruction tuning (Table 4). Kimi Linear leads on challenging benchmarks like MMLU-Pro and GPQA-Diamond, and on difficult math/code tasks like AIME, HMMT, and LiveCodeBench. This confirms that the architectural advantages translate directly to improved capabilities in instruction-following models.
6.1.4. Long-Context and RL Performance
- Long-Context: Table 5 shows Kimi Linear achieving the highest average score across a suite of long-context benchmarks. It notably excels on RULER and RepoQA. Interestingly, the Kimi Linear (RoPE) variant performs worse on average, validating the authors' design choice to use NoPE in the global attention layers and rely on KDA for positional awareness. This suggests that forcing the KDA layers to handle position information creates a more robust system for long-context generalization.
- Reinforcement Learning (RL): The results in Figure 8 are particularly compelling. In RL training for mathematical reasoning, Kimi Linear not only starts at a higher accuracy but also improves faster and reaches a higher peak performance than the MLA baseline. This is significant because RL involves long generation trajectories, where the efficiency and stable memory management of Kimi Linear provide a clear advantage over the computationally heavy and potentially less stable full attention.

The following figure (Figure 8 from the original paper) shows the RL training curves.
The figure consists of three comparison line charts showing the accuracy of Kimi Linear versus MLA in different settings (training, the MATH 500 test set, and AIME 2025); Kimi Linear outperforms MLA at every evaluation point.
6.1.5. Efficiency Comparison
The efficiency gains are a cornerstone of the paper's contribution.
- Prefill & Decoding Speed: As shown in Figure 1 and Figure 9, Kimi Linear's efficiency advantage becomes dramatic at long sequence lengths. While comparable to MLA at short lengths (4k-16k), its linear scaling means it becomes significantly faster for long contexts. For a 1M-token context, it achieves:
  - 2.9x faster prefill (initial processing of the prompt).
  - 6.3x faster decoding (generating one token at a time).
- KV Cache Reduction: Because KDA layers have a fixed-size state and do not need a KV cache that grows with sequence length, the hybrid model's total KV cache is drastically smaller (up to 75% reduction). This is a massive practical advantage, as it enables million-token contexts on less hardware. A back-of-the-envelope sketch of this reduction follows the figure below.
The following figure (Figure 9 from the original paper) shows the prefill and decoding performance.
The figure consists of two line charts comparing the prefill latency and decoding throughput (TPOT) of Kimi Linear against MLA and GDN-H across context lengths, showing Kimi Linear's advantage of 2.9x lower latency and 2.2x higher throughput at 1M context length.
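A back-of-the-envelope sketch of where the roughly 75% KV-cache saving comes from under the 3:1 layout: only the MLA layers keep a cache that grows with context, while each KDA layer holds a fixed-size state. The layer count and per-token cache size below are placeholder numbers, not the real model configuration.

```python
def kv_cache_reduction(num_layers=48, kda_per_mla=3, bytes_per_token_per_layer=4096,
                       context=1_000_000):
    """Compare a full-attention stack to a 3:1 KDA:MLA hybrid (toy numbers)."""
    mla_layers = num_layers // (kda_per_mla + 1)        # 1 of every 4 layers is MLA
    full_cache = num_layers * context * bytes_per_token_per_layer
    hybrid_cache = mla_layers * context * bytes_per_token_per_layer  # KDA state is O(1) in context
    return 1.0 - hybrid_cache / full_cache

print(kv_cache_reduction())  # 0.75 -> a 75% reduction in the growing KV cache
```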
6.2. Data Presentation (Tables)
6.2.1. Ablation Study Results
The following are the results from Table 1 of the original paper:
| | | Training PPL (↓) | Validation PPL (↓) |
|---|---|---|---|
| Hybrid ratio | 0:1 | 9.45 | 5.77 |
| | 1:1 | 9.29 | 5.66 |
| | 3:1 | 9.23 | 5.65 |
| | 7:1 | 9.23 | 5.70 |
| | 15:1 | 9.34 | 5.82 |
| | w/o output gate | - | 5.67 |
| | w/ swish output gate | - | 5.81 |
| | w/o convolution layer | - | 5.70 |
- Analysis: This table validates several key design choices. The 3:1 hybrid ratio (3 KDA layers for every 1 MLA layer) achieves the best validation perplexity, indicating an optimal balance. Removing the Sigmoid output gate or replacing it with Swish hurts performance, confirming its importance. Finally, removing the ShortConv layer also degrades performance, showing that even in a hybrid model, these local convolutions are beneficial.
6.2.2. Main Results (Base Models @ 1.4T)
The following are the results from Table 3 of the original paper:
| Type | Benchmark | MLA | GDN-H | Kimi Linear |
|---|---|---|---|---|
| | Trained Tokens | 1.4T | 1.4T | 1.4T |
| General | HellaSwag | 81.7 | 82.2 | 82.9 |
| | ARC-challenge | 64.6 | 66.5 | 67.3 |
| | Winogrande | 78.1 | 77.9 | 78.6 |
| | BBH | 71.6 | 70.6 | 72.9 |
| | MMLU | 71.6 | 72.2 | 73.8 |
| | MMLU-Pro | 47.2 | 47.9 | 51.0 |
| | TriviaQA | 68.9 | 70.1 | 71.7 |
| Math & Code | GSM8K | 83.7 | 81.7 | 83.9 |
| | MATH | 54.7 | 54.1 | 54.7 |
| | EvalPlus | 59.5 | 63.1 | 60.2 |
| | CRUXEval-I-cot | 51.6 | 56.0 | 56.6 |
| | CRUXEval-O-cot | 61.5 | 58.1 | 62.0 |
| Chinese | CEval | 79.3 | 79.1 | 79.5 |
| | CMMLU | 79.5 | 80.7 | 80.8 |
6.2.3. Long Context Performance Comparison
The following are the results from Table 5 of the original paper:
| Model | RULER | MRCR | HELMET-ICL | LongBench V2 | Frames | RepoQA | Long Code Arena (Lib) | Long Code Arena (Commit) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| MLA | 81.3 | 22.6 | 88.0 | 36.1 | 60.5 | 63.0 | 32.8 | 33.2 | 52.2 |
| GDN-H | 80.5 | 23.9 | 85.5 | 32.6 | 58.7 | 63.0 | 34.7 | 30.5 | 51.2 |
| Kimi Linear (RoPE) | 78.8 | 22.0 | 88.0 | 35.4 | 59.9 | 66.5 | 31.3 | 32.5 | 51.8 |
| Kimi Linear | 84.3 | 29.6 | 90.0 | 35.0 | 58.8 | 68.5 | 37.1 | 32.7 | 54.5 |
6.3. Ablation Studies / Parameter Analysis
The ablation studies were crucial for justifying the paper's design choices:
- Hybrid Ratio (Table 1): The experiments tested ratios from 0:1 (all MLA) to 15:1. The 3:1 ratio provided the best validation PPL, suggesting that while the global attention layers are important, a small number of them is sufficient. Too many linear layers (e.g., 7:1, 15:1) start to degrade performance, likely due to loss of global information flow.
- Output Gating (Table 1): Removing the final Sigmoid gate on the KDA output or replacing it with Swish degrades performance. This shows the gate is a critical component for stabilizing the model and controlling information flow, consistent with findings in other recent work.
- Positional Encoding (Table 5): The comparison between Kimi Linear (with NoPE) and Kimi Linear (RoPE) is a key ablation. The NoPE version achieves a much better average score (54.5 vs. 51.8) on long-context tasks. The authors argue this is because RoPE biases the global attention layers towards local/short-range patterns, making them less flexible. By removing RoPE, the KDA layers are forced to handle positional information, which they do more dynamically and effectively for long-range dependencies.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Kimi Linear, a hybrid attention architecture that marks a significant step towards solving the performance-efficiency trade-off in LLMs. The core of this architecture, Kimi Delta Attention (KDA), enhances linear attention with a fine-grained, channel-wise gating mechanism that provides superior memory control. By combining KDA with a small number of full-attention layers in a 3:1 ratio, Kimi Linear achieves a remarkable feat: it outperforms a standard full-attention baseline in fair, large-scale comparisons across short-context, long-context, and reinforcement learning tasks. Simultaneously, it delivers substantial efficiency gains, including up to a 75% reduction in KV cache and a 6.3x speedup in decoding for million-token contexts. The work provides a compelling case for Kimi Linear as a practical, scalable, and high-performing drop-in replacement for traditional Transformer architectures.
7.2. Limitations & Future Work
The paper, being an industry technical report, focuses on showcasing achievements and does not include a dedicated limitations section. However, some potential limitations and future research directions can be inferred:
- Empirically-Driven Design: Many key design choices, such as the 3:1 hybrid ratio, are determined empirically. While effective, a deeper theoretical understanding of why this specific ratio is optimal would be valuable.
- Architectural Generalization: The experiments are conducted on a specific MoE-based architecture (Moonlight). While the results are strong, further research is needed to confirm whether the benefits of Kimi Linear generalize across different LLM backbones (e.g., non-MoE models, different sizes).
- Static Hybridization: The 3:1 interleaving is static. Future work could explore dynamic hybridization, where the model might learn to route tokens through linear or full-attention layers based on context or task requirements.
- Synergy with Sparse Attention: The paper positions KDA as an alternative to sparse attention. An interesting future direction would be to combine KDA with sparse attention mechanisms, potentially reaping the benefits of both a constant-memory state (KDA) and sub-linear attention patterns (sparse attention).
7.3. Personal Insights & Critique
- Key Innovation: The most insightful contribution is the move from a scalar/head-wise gate to a channel-wise gate in KDA. It's a conceptually simple change that yields significant expressive power by allowing the model to treat different features in its memory state heterogeneously. This highlights a recurring theme in deep learning: increasing the "granularity" of control mechanisms often unlocks better performance.
- Pragmatism and Co-design: The paper is an excellent example of pragmatic co-design. The KDA algorithm isn't just theoretically elegant; it's a constrained variant of DPLR specifically designed for hardware efficiency and numerical stability. This focus on making the method fast in practice is what makes it a viable industry solution, not just an academic curiosity. The open-sourcing of optimized kernels and vLLM integration further underscores this practical commitment.
- A Paradigm Shift for Long Context: This work, along with others in the hybrid space, signals a potential paradigm shift. Instead of a single, uniform attention mechanism, the future of LLMs may lie in heterogeneous architectures that combine specialized components for different purposes (e.g., KDA for efficient state-tracking, full attention for global integration). The
NoPE design choice is particularly bold and suggests that we may be able to delegate more implicit reasoning, including positional awareness, to well-designed recurrent mechanisms.
- Critique: The claim of being the "first" to outperform full attention under "fair comparisons" is very strong. While the authors' setup appears rigorous, "fairness" in LLM comparison is notoriously complex and depends on many factors in the "training recipe." As an industry report without peer review, the results should be interpreted with the understanding that they represent the findings from a single, highly capable team. Independent replication and analysis by the broader research community will be essential to fully validate these impressive claims. The futuristic publication date also remains an unusual artifact of the source material.