Kimi Linear: An Expressive, Efficient Attention Architecture
TL;DR Summary
Kimi Linear, a hybrid of Kimi Delta Attention (KDA) and multi-head latent attention (MLA), improves both performance and efficiency in short-context, long-context, and RL settings. A model with 3B activated parameters (48B total) outperforms a full-attention baseline while achieving up to 6x decoding throughput and a 75% KV cache reduction at 1M context.
Abstract
We introduce Kimi Linear, a hybrid linear attention architecture that, for the first time, outperforms full attention under fair comparisons across various scenarios -- including short-context, long-context, and reinforcement learning (RL) scaling regimes. At its core lies Kimi Delta Attention (KDA), an expressive linear attention module that extends Gated DeltaNet with a finer-grained gating mechanism, enabling more effective use of limited finite-state RNN memory. Our bespoke chunkwise algorithm achieves high hardware efficiency through a specialized variant of the Diagonal-Plus-Low-Rank (DPLR) transition matrices, which substantially reduces computation compared to the general DPLR formulation while remaining more consistent with the classical delta rule. We pretrain a Kimi Linear model with 3B activated parameters and 48B total parameters, based on a layerwise hybrid of KDA and Multi-Head Latent Attention (MLA). Our experiments show that with an identical training recipe, Kimi Linear outperforms full MLA by a sizeable margin across all evaluated tasks, while reducing KV cache usage by up to 75% and achieving up to 6x decoding throughput for a 1M context. These results demonstrate that Kimi Linear can be a drop-in replacement for full attention architectures with superior performance and efficiency, including tasks with longer input and output lengths. To support further research, we open-source the KDA kernel and vLLM implementations, and release the pre-trained and instruction-tuned model checkpoints.
In-depth Reading
1. Bibliographic Information
1.1. Title
Kimi Linear: An Expressive, Efficient Attention Architecture
The title clearly states the paper's central topic: the introduction of a new attention architecture named Kimi Linear. It highlights its two main claimed properties: being expressive (capable of high performance) and efficient (computationally less demanding).
1.2. Authors
The authors are listed as "Kimi Team" from Moonshot AI, along with several external collaborators. The full list of contributors is provided in Appendix A, indicating a large-scale team effort typical of major industry AI labs. Key contributors include Yu Zhang, Zongyu Lin, Xingcheng Yao, and many others. This collaborative structure suggests a project with significant engineering and research resources behind it.
1.3. Journal/Conference
The paper is presented as a technical report on arXiv, a preprint server. This means it has not yet undergone a formal peer-review process for a conference or journal. Publishing on arXiv is a common practice in the fast-paced field of AI/ML to quickly disseminate new findings to the research community. The metadata indicates a publication date in the future (October 2025), which suggests this analysis is based on a preliminary or hypothetical version of the paper.
1.4. Publication Year
2025 (according to the provided metadata and references within the paper).
1.5. Abstract
The abstract introduces Kimi Linear, a hybrid linear attention architecture that claims to outperform standard full attention in a wide range of scenarios (short-context, long-context, and RL). The core innovation is Kimi Delta Attention (KDA), an extension of Gated DeltaNet that uses a finer-grained gating mechanism for more effective memory management. The authors also highlight a custom, hardware-efficient chunkwise algorithm based on a specialized Diagonal-Plus-Low-Rank (DPLR) formulation. They present a 3B parameter model that outperforms a full-attention baseline across all tasks while significantly reducing KV cache usage (by 75%) and improving decoding throughput (up to 6x for 1M context). The paper concludes by announcing the open-sourcing of the KDA kernel, vLLM integration, and model checkpoints to encourage further research.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2510.26692
- PDF Link: https://arxiv.org/pdf/2510.26692v2.pdf
- Publication Status: The paper is an unreviewed preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The primary challenge this paper addresses is the computational bottleneck of the standard Transformer architecture, whose self-attention mechanism has a computational and memory complexity that scales quadratically with the input sequence length $n$, i.e., $O(n^2)$. This quadratic scaling makes processing very long sequences (e.g., millions of tokens) prohibitively expensive, which is a major obstacle for next-generation applications like advanced AI agents, reinforcement learning (RL) over long trajectories, and deep document analysis.
While linear attention methods, which scale linearly ($O(n)$), have been proposed as a solution, they have historically struggled to match the performance and expressivity of standard softmax attention, often lagging even on short-sequence tasks. This performance gap has prevented their widespread adoption.
The paper's entry point is to bridge this gap by designing a new linear attention mechanism that is not only efficient but also more expressive than its predecessors. The innovative idea is to enhance an existing linear attention framework (Gated DeltaNet) with more precise, fine-grained control over its internal memory state, allowing it to selectively retain and forget information more effectively.
2.2. Main Contributions / Findings
The paper presents three main contributions:
- Kimi Delta Attention (KDA): A novel linear attention module that serves as the core of the new architecture. It improves upon Gated DeltaNet by replacing its coarse, scalar "forget gate" with a fine-grained, channel-wise gating mechanism. This allows the model to control the decay of information in its memory state with much higher precision for each feature dimension, enhancing its expressiveness.
- The Kimi Linear Architecture: A hybrid model design that strategically interleaves efficient KDA layers with a small number of standard full attention layers (Multi-Head Latent Attention, or MLA). The paper finds an optimal 3:1 ratio (three KDA layers for every one MLA layer), creating a balance that preserves global information flow while drastically reducing the memory footprint and computational cost associated with long sequences.
- Fair Empirical Validation at Scale: The authors conduct extensive experiments, training a Kimi Linear model with 3B activated parameters on 1.4 trillion tokens and comparing it against a full-attention baseline with an identical training recipe. Their findings show that Kimi Linear outperforms the full-attention model across a diverse set of benchmarks, including short- and long-context tasks and RL-style evaluations. This demonstrates that Kimi Linear can serve as a "drop-in replacement" for full attention, offering superior performance and efficiency. To support these claims, the authors are open-sourcing their custom CUDA kernels, vLLM integration, and pre-trained models.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Transformer and Self-Attention
The Transformer, introduced by Vaswani et al. in 2017, is a neural network architecture that has become the foundation for most modern Large Language Models (LLMs). Its key innovation is the self-attention mechanism.
- Conceptual Definition: Self-attention allows a model to weigh the importance of different words (tokens) in an input sequence when processing a specific word. For each token, it generates a representation by attending to all other tokens in the sequence, including itself. This allows the model to capture long-range dependencies and contextual relationships.
- Mechanism: For each input token, we create three vectors: a Query ($Q$), a Key ($K$), and a Value ($V$). The Query vector represents the current token's request for information. The Key vectors of all other tokens act as labels or identifiers. The Value vectors contain the actual information of each token.
  - Score: The similarity between the Query of the current token and the Key of every other token is calculated, typically using a dot product.
  - Scale: The scores are scaled by the square root of the key dimension ($\sqrt{d_k}$) to stabilize gradients.
  - Softmax: A softmax function is applied to the scaled scores to convert them into a probability distribution (attention weights), ensuring they sum to 1.
  - Weighted Sum: The Value vectors are multiplied by their corresponding attention weights and summed to produce the final output representation for the current token.
- Mathematical Formula: The standard scaled dot-product attention is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
- Symbol Explanation:
  - $Q$: The matrix of query vectors.
  - $K$: The matrix of key vectors.
  - $V$: The matrix of value vectors.
  - $d_k$: The dimension of the key vectors.
  - $K^T$: The transpose of the key matrix.
- Complexity: The calculation of the $QK^T$ matrix, which has dimensions $n \times n$ (sequence length × sequence length), is the source of the $O(n^2)$ computational and memory complexity, where $n$ is the sequence length.
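To make the mechanism concrete, here is a minimal NumPy sketch of scaled dot-product attention with a causal mask. The array shapes and the `causal` flag are illustrative choices, not taken from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, causal=True):
    """Q, K: (n, d_k); V: (n, d_v). Returns (n, d_v)."""
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)          # (n, n) matrix: the O(n^2) bottleneck
    if causal:
        scores = np.where(np.tril(np.ones((n, n), dtype=bool)), scores, -np.inf)
    weights = softmax(scores, axis=-1)        # each row sums to 1
    return weights @ V

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(6, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (6, 4)
```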
3.1.2. Linear Attention
- Conceptual Definition: Linear attention is a class of methods designed to reduce the complexity of self-attention from quadratic ($O(n^2)$) to linear ($O(n)$). The core idea is to re-order the computations to avoid explicitly forming the large attention matrix.
- Mechanism: Instead of using the softmax function, linear attention methods apply an associative kernel feature map $\phi$ to the queries and keys. This allows the calculation to be re-ordered: $ \mathrm{Attention} \approx (\phi(Q)\phi(K)^\top)V = \phi(Q)\left(\phi(K)^\top V\right) $ By first computing $\phi(K)^\top V$ (a small, fixed-size matrix) and then multiplying by $\phi(Q)$, the explicit $n \times n$ attention matrix is never created. This formulation also has an equivalent recurrent (RNN-like) representation, which is highly efficient for autoregressive decoding. The state can be updated as: $ \mathbf{S}_t = \mathbf{S}_{t-1} + \phi(\mathbf{k}_t) \mathbf{v}_t^\top, \quad \mathbf{o}_t = \mathbf{S}_t^\top \phi(\mathbf{q}_t) $ Here, the state $\mathbf{S}_t$ accumulates key-value information over time, and the output is computed by querying this state. This requires only constant time and memory per step during inference.
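A minimal sketch of the recurrent view, using a simple elementwise feature map ($\phi = \mathrm{elu} + 1$, a common but illustrative choice); normalization terms used by some linear-attention variants are omitted.

```python
import numpy as np

def phi(x):
    # ELU + 1: a common positive feature map for linear attention (illustrative choice)
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_decode(qs, ks, vs):
    """qs, ks: (n, d_k); vs: (n, d_v). O(n) time, O(d_k * d_v) state."""
    d_k, d_v = ks.shape[1], vs.shape[1]
    S = np.zeros((d_k, d_v))                 # running state S_t
    outs = []
    for q, k, v in zip(qs, ks, vs):
        S = S + np.outer(phi(k), v)          # S_t = S_{t-1} + phi(k_t) v_t^T
        outs.append(S.T @ phi(q))            # o_t = S_t^T phi(q_t)
    return np.stack(outs)

rng = np.random.default_rng(0)
out = linear_attention_decode(rng.normal(size=(8, 4)),
                              rng.normal(size=(8, 4)),
                              rng.normal(size=(8, 3)))
print(out.shape)  # (8, 3)
```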
3.1.3. State Space Models (SSMs)
- Conceptual Definition: SSMs are a class of sequence models inspired by classical control theory. They map an input sequence x(t) to an output sequence y(t) through a latent (hidden) state h(t).
- Mathematical Formula: A continuous-time SSM is defined by the linear ordinary differential equations: $ \frac{dh(t)}{dt} = \mathbf{A}h(t) + \mathbf{B}x(t), \quad y(t) = \mathbf{C}h(t) + \mathbf{D}x(t) $ In practice, these are discretized for use in deep learning.
- Symbol Explanation:
  - h(t): The latent state vector.
  - x(t): The input vector.
  - y(t): The output vector.
  - $\mathbf{A}, \mathbf{B}, \mathbf{C}, \mathbf{D}$: State matrices that are learned during training.
- Relevance: Recent models like Mamba have shown that by making the state matrices ($\mathbf{A}, \mathbf{B}, \mathbf{C}$) data-dependent (i.e., functions of the input $x_t$), SSMs can achieve highly selective and expressive sequence modeling. This connects them closely to linear attention, as both can be formulated as efficient, recurrent systems.
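A minimal sketch of a discretized diagonal SSM recurrence under a zero-order-hold-style discretization. The step size, diagonal A, and random matrices are illustrative simplifications, not the parameterization used by Mamba or this paper.

```python
import numpy as np

def diagonal_ssm(x, A_diag, B, C, dt=0.01):
    """x: (n, d_in). Discretize dh/dt = A h + B x with step dt, then scan over time.
    A_diag: (d_state,) negative entries for stability; B: (d_state, d_in); C: (d_out, d_state)."""
    A_bar = np.exp(dt * A_diag)              # exact ZOH transition for a diagonal A
    B_bar = (A_bar - 1.0) / A_diag           # per-dimension input scaling from ZOH
    h = np.zeros_like(A_diag)
    ys = []
    for x_t in x:
        h = A_bar * h + B_bar * (B @ x_t)    # h_k = A_bar h_{k-1} + B_bar B x_k
        ys.append(C @ h)                      # y_k = C h_k
    return np.stack(ys)

rng = np.random.default_rng(0)
y = diagonal_ssm(rng.normal(size=(16, 2)),
                 A_diag=-np.linspace(0.1, 1.0, 8),
                 B=rng.normal(size=(8, 2)),
                 C=rng.normal(size=(3, 8)))
print(y.shape)  # (16, 3)
```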
3.1.4. Mixture of Experts (MoE)
- Conceptual Definition: MoE is an architectural technique to increase model capacity without a proportional increase in computational cost. Instead of a single, dense feed-forward network (FFN) layer, an MoE layer consists of multiple "expert" FFNs and a "gating network" or "router."
- Mechanism: For each input token, the router dynamically selects a small number of experts (e.g., 2 out of 64) to process it. The outputs of the selected experts are then combined, often via a weighted sum determined by the router. This means that while the total number of parameters in the model can be very large (e.g., 48B in this paper), the number of activated parameters used for any given token remains small (e.g., 3B).
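A minimal sketch of top-k expert routing for a single token. The expert count, top-k value, and softmax-renormalized gate weights are illustrative choices and do not reflect Kimi Linear's actual MoE configuration.

```python
import numpy as np

def moe_forward(x, router_W, expert_weights, top_k=2):
    """x: (d,); router_W: (num_experts, d); expert_weights: list of (d, d) matrices.
    Only the top_k selected experts are evaluated for this token."""
    logits = router_W @ x                                  # router score per expert
    top = np.argsort(logits)[-top_k:]                      # indices of the selected experts
    gates = np.exp(logits[top] - logits[top].max())
    gates = gates / gates.sum()                            # renormalize over selected experts
    # Weighted sum of the selected experts' outputs (each toy expert is a single matrix).
    return sum(g * (expert_weights[i] @ x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 8, 16
out = moe_forward(rng.normal(size=d),
                  rng.normal(size=(n_experts, d)),
                  [rng.normal(size=(d, d)) for _ in range(n_experts)])
print(out.shape)  # (8,)
```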
3.2. Previous Works
3.2.1. DeltaNet and Gated DeltaNet (GDN)
- DeltaNet: This work reinterprets linear attention from a "fast weight programmer" perspective. It frames the state update as performing online gradient descent on a reconstruction loss objective $\mathcal{L}_t(\mathbf{S}) = \frac{1}{2}\|\mathbf{S}^\top \mathbf{k}_t - \mathbf{v}_t\|^2$. This means the model continually tries to update its memory to correctly map the current key $\mathbf{k}_t$ to the current value $\mathbf{v}_t$. The update rule, known as the classical delta rule, is: $ \mathbf{S}_t = (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top $ This provides a more stable learning dynamic than simple additive updates but retains all information indefinitely.
- Gated DeltaNet (GDN): GDN improves DeltaNet by introducing a simple but effective "forgetting" mechanism. It adds a scalar forget gate $\alpha_t \in (0, 1)$ that acts as a form of weight decay on the memory state, allowing the model to forget outdated information. The update rule becomes: $ \mathbf{S}_t = \alpha_t (\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top $ This scalar gate is applied uniformly across all dimensions of the state, which is the key limitation KDA aims to address.
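A minimal sketch contrasting the two recurrences above, with the per-step scalars alpha and beta supplied externally; this is an illustration of the update rules, not the authors' kernel.

```python
import numpy as np

def deltanet_step(S, k, v, beta):
    """Classical delta rule: S_t = (I - beta k k^T) S_{t-1} + beta k v^T."""
    return S - beta * np.outer(k, k @ S) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, alpha, beta):
    """Gated DeltaNet: S_t = alpha (I - beta k k^T) S_{t-1} + beta k v^T, with scalar alpha."""
    S_decayed = alpha * S                                   # uniform decay of the whole state
    return S_decayed - beta * np.outer(k, k @ S_decayed) + beta * np.outer(k, v)

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
k, v = rng.normal(size=d_k), rng.normal(size=d_v)
S = gated_deltanet_step(S, k, v, alpha=0.95, beta=0.5)
print(S.shape)  # (4, 3)
```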
3.2.2. Mamba and DPLR
- Mamba: A highly successful SSM that introduced a selection mechanism through data-dependent state matrices. It can be seen as an RNN with gates that control how much information flows from the input into the state and how much the state is forgotten at each step. Its efficiency comes from a parallel scan algorithm for training and a recurrent mode for inference.
- Diagonal-Plus-Low-Rank (DPLR): A technique for structuring the state transition matrix (the $\mathbf{A}$ matrix in SSMs or the state-transition matrix in linear attention) to be both expressive and computationally efficient. A DPLR matrix is the sum of a diagonal matrix and a low-rank matrix (e.g., $\mathbf{D} + \mathbf{a}\mathbf{b}^\top$). This structure is more powerful than a purely diagonal matrix (as in some earlier SSMs) but more structured and efficient than a dense matrix. KDA's transition matrix is a specialized, constrained form of DPLR.
3.2.3. Hybrid Attention Architectures
The idea of combining different types of attention is not new.
- Intra-layer hybrids: Some models mix attention mechanisms within the same layer, for example by running some standard attention heads in parallel with Mamba-style heads.
- Inter-layer hybrids: Models like Jamba and the one in this paper alternate between different types of layers.
Kimi Linear's approach of regularly interleaving KDA and full MLA layers simplifies the architecture and KV cache management compared to more complex hybrid designs.
3.3. Technological Evolution
The field has evolved from the monolithic full attention mechanism to a diverse ecosystem of efficient alternatives.
- Full Attention: The original Transformer, powerful but quadratically expensive.
- Early Efficient Attention: Researchers explored sparse attention (attending to only a subset of tokens) and linear attention (reordering computations). These often traded performance for efficiency.
- Rise of SSMs: Models like S4 and especially Mamba demonstrated that structured state-space models could achieve performance competitive with Transformers at linear complexity, revitalizing interest in RNN-like architectures.
- Modern Hybrids: Recognizing that full attention and linear/SSM methods have complementary strengths (global lookup vs. efficient state compression), recent work has focused on creating hybrid architectures. Kimi Linear fits into this latest stage, proposing a specific and highly optimized hybrid of its novel KDA linear attention and standard full attention.
3.4. Differentiation Analysis
Compared to prior works, Kimi Linear's innovations are:
- vs. GDN/Mamba2: KDA introduces channel-wise (diagonal) gating instead of scalar/head-wise gating. This allows for more nuanced control over memory, as different feature dimensions can have different decay rates.
- vs. General DPLR: KDA uses a constrained DPLR formulation where the low-rank update vectors $\mathbf{a}_t$ and $\mathbf{b}_t$ are tied to the key vector $\mathbf{k}_t$. This simplifies computation and improves hardware efficiency and numerical stability, avoiding issues that require workarounds in more general DPLR models like GLA.
- vs. Other Hybrids: Kimi Linear proposes a simple, regular 3:1 interleaving of KDA and full attention layers. This is less complex than intra-layer hybrids and simplifies implementation and optimization.
- vs. RoPE-based models: The Kimi Linear architecture deliberately uses No Position Embedding (NoPE) in its full attention layers, forcing the model to rely on the
KDA layers to dynamically learn and encode positional information and recency bias. This is a strong design choice that differs from the common practice of applying RoPE everywhere.
4. Methodology
4.1. Principles
The core principle of Kimi Linear is to enhance the expressiveness of linear attention by giving it more precise control over its finite-state memory. Standard Gated DeltaNet (GDN) uses a single scalar value per head to control forgetting, which treats all information within that head's memory state equally. Kimi Linear posits that this is too coarse. By introducing a channel-wise gating mechanism, different features within the memory can be retained or forgotten at different rates. This fine-grained control allows the model to better manage its limited memory, preserving crucial long-term information while discarding irrelevant details. This enhanced expressiveness is achieved while maintaining and even improving hardware efficiency through a specialized, computationally cheaper variant of the DPLR matrix structure.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Kimi Delta Attention (KDA)
KDA is the foundational block of the Kimi Linear model. It refines the Gated DeltaNet (GDN) update rule by introducing a diagonalized gate, enabling fine-grained control over memory decay.
The state update for KDA is defined by the following recurrence relation, which is a modification of the GDN rule.
- Step 1: The KDA Recurrence Relation At each time step $t$, the model updates its memory state matrix $\mathbf{S}_{t-1}$ to a new state $\mathbf{S}_t$. This update rule is presented in Equation 1 of the paper: $ \mathbf{S}_t = \left( \mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top \right) \mathrm{Diag}\left( \alpha_t \right) \mathbf{S}_{t-1} + \beta_t \mathbf{k}_t \mathbf{v}_t^\top \in \mathbb{R}^{d_k \times d_v} $ After updating the state, the output vector is computed as: $ \mathbf{o}_t = \mathbf{S}_t^\top \mathbf{q}_t \in \mathbb{R}^{d_v} $
- Formula Explanation:
  - $\mathbf{S}_t$: The memory state matrix at time step $t$. It stores key-value associations.
  - $\mathbf{S}_{t-1}$: The memory state from the previous time step.
  - $\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t$: The query, key, and value vectors for the current token, with dimensions $d_k$, $d_k$, and $d_v$ respectively.
  - $\beta_t$: A scalar learning rate or update gate, determining the magnitude of the update.
  - $\mathbf{I}$: The identity matrix.
  - $\mathbf{k}_t \mathbf{k}_t^\top$: An outer product that creates a rank-1 matrix. The term $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)$ is the "delta rule" update, which corrects the memory based on the current key.
  - $\mathrm{Diag}(\alpha_t)$: This is the core innovation of KDA. It is a diagonal matrix whose diagonal elements are given by the vector $\alpha_t \in [0, 1]^{d_k}$. This matrix applies a channel-wise decay to the previous state $\mathbf{S}_{t-1}$. This contrasts with GDN, which uses a scalar $\alpha_t$ that multiplies the entire state matrix, applying the same decay to all channels.
  - $\beta_t \mathbf{k}_t \mathbf{v}_t^\top$: This term writes the new key-value association into the memory.

This recurrent formulation is efficient for autoregressive generation (inference) but slow for training on parallel hardware like GPUs. Therefore, a parallel chunkwise algorithm is needed.
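A minimal NumPy sketch of the per-token KDA recurrence in Equation 1, to make the channel-wise decay concrete. It is a naive reference loop written for clarity, not the chunkwise kernel the paper actually uses, and the dimensions and gate values are made up.

```python
import numpy as np

def kda_step(S, q, k, v, alpha, beta):
    """One KDA recurrence step (cf. Eq. 1).
    S: (d_k, d_v) state; q, k, alpha: (d_k,); v: (d_v,); beta: scalar in [0, 1].
    Returns the new state and the output o_t = S_t^T q_t."""
    S_decayed = alpha[:, None] * S                          # Diag(alpha_t) S_{t-1}: channel-wise decay
    S_new = S_decayed - beta * np.outer(k, k @ S_decayed)   # apply (I - beta k k^T) to the decayed state
    S_new = S_new + beta * np.outer(k, v)                   # write the new key-value association
    return S_new, S_new.T @ q                               # o_t = S_t^T q_t

rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = np.zeros((d_k, d_v))
for _ in range(5):
    q, k = rng.normal(size=d_k), rng.normal(size=d_k)
    k = k / np.linalg.norm(k)                               # keys are L2-normalized in the paper
    v = rng.normal(size=d_v)
    alpha = rng.uniform(0.8, 1.0, size=d_k)                 # fine-grained, per-channel decay
    S, o = kda_step(S, q, k, v, alpha, beta=0.5)
print(o.shape)  # (3,)
```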
4.2.2. Hardware-Efficient Chunkwise Algorithm
To train efficiently, the input sequence is split into chunks of size , and computations are parallelized within and across these chunks. The paper provides a complex but efficient formulation based on the WY representation of Householder transformations.
- Step 2: Chunkwise Formulation and Parallelization The paper derives a parallel algorithm for updating the state across a chunk. The state at the end of a chunk, $\mathbf{S}_{[t+1]}$, can be calculated from the state at the beginning of the chunk, $\mathbf{S}_{[t]}$, without a sequential loop. The final formula for this update is given in Equation 6: $ \mathbf{S}_{[t+1]} = \mathrm{Diag}(\gamma_{[t]}^{C}) \mathbf{S}_{[t]} + (\mathbf{T}_{[t]}^{i \to C} \odot \mathbf{K}_{[t]})^{\top} (\mathbf{U}_{[t]} - \mathbf{W}_{[t]} \mathbf{S}_{[t]}) \in \mathbb{R}^{d_k \times d_v} $ To get to this point, several intermediate auxiliary matrices, $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$, must be computed. These are derived using the UT transform, an efficient method for accumulating Householder transformations. The computation of $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$ is given in Equation 5: $ \mathbf{M}_{[t]} = \left( \mathbf{I} + \mathrm{StrictTri}\left( \mathrm{Diag}(\beta_{[t]}) (\mathbf{T}_{[t]}^{1 \to C} \odot \mathbf{K}_{[t]}) \left( \frac{\mathbf{K}_{[t]}}{\mathbf{T}_{[t]}^{1 \to C}} \right)^{\top} \right) \right)^{-1} \mathrm{Diag}(\beta_{[t]}) $ $ \mathbf{W}_{[t]} = \mathbf{M}_{[t]} (\mathbf{T}_{[t]}^{1 \to C} \odot \mathbf{K}_{[t]}), \qquad \mathbf{U}_{[t]} = \mathbf{M}_{[t]} \mathbf{V}_{[t]} $
- Formula Explanation:
  - $\mathbf{S}_{[t]}$: State at the beginning of chunk $t$.
  - $\mathbf{K}_{[t]}, \mathbf{V}_{[t]}$: Matrices containing all key and value vectors for chunk $t$.
  - $\gamma_{[t]}^{C}$: The cumulative decay over the entire chunk of size $C$.
  - $\mathbf{T}_{[t]}^{1 \to C}$ (and $\mathbf{T}_{[t]}^{i \to C}$): Matrices of cumulative decays within the chunk.
  - $\odot$: Element-wise multiplication.
  - $\mathrm{StrictTri}(\cdot)$: A lower-triangular mask that excludes the diagonal.
  - The matrix inversion in $\mathbf{M}_{[t]}$ looks computationally expensive, but because the matrix being inverted is lower-triangular with a unit diagonal, its inverse can be applied efficiently via forward substitution.
  - $\mathbf{W}_{[t]}$ and $\mathbf{U}_{[t]}$ are intermediate matrices that effectively pre-compute the intra-chunk interactions in a parallel-friendly manner.
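To illustrate why the inverse in Equation 5 is cheap, here is a minimal sketch that applies $(\mathbf{I} + \mathbf{L})^{-1}$ to a matrix by forward substitution when $\mathbf{L}$ is strictly lower-triangular. The matrix sizes are arbitrary illustrative choices.

```python
import numpy as np

def apply_unit_lower_inverse(L_strict, B):
    """Compute X = (I + L_strict)^{-1} B by forward substitution.
    L_strict: (C, C) strictly lower-triangular; B: (C, d)."""
    C = L_strict.shape[0]
    X = np.zeros_like(B)
    for i in range(C):
        X[i] = B[i] - L_strict[i, :i] @ X[:i]   # row i depends only on rows < i
    return X

rng = np.random.default_rng(0)
C, d = 8, 4
L_strict = np.tril(rng.normal(size=(C, C)), k=-1)
B = rng.normal(size=(C, d))
X = apply_unit_lower_inverse(L_strict, B)
assert np.allclose((np.eye(C) + L_strict) @ X, B)   # verifies the triangular solve
print(X.shape)  # (8, 4)
```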
4.2.3. Efficiency Analysis: KDA as a Constrained DPLR
The paper shows that KDA's transition matrix is a special case of the general Diagonal-Plus-Low-Rank (DPLR) structure, which brings significant computational benefits.
- Step 3: Relating KDA to DPLR A general DPLR update has the form $\mathbf{S}_t = (\mathrm{Diag}(\alpha_t) + \mathbf{a}_t \mathbf{b}_t^\top)\,\mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top$, with freely parameterized low-rank vectors $\mathbf{a}_t$ and $\mathbf{b}_t$. The KDA update rule can be rewritten to match this form, as shown in Equation 14: $ \mathbf{S}_t = (\mathbf{D} - \mathbf{a}_t \mathbf{b}_t^\top) \mathbf{S}_{t-1} + \mathbf{k}_t \mathbf{v}_t^\top, \quad \mathrm{s.t.} \quad \mathbf{D} = \mathrm{Diag}(\alpha_t), \ \mathbf{a}_t = \beta_t \mathbf{k}_t, \ \mathbf{b}_t = \mathbf{k}_t \odot \alpha_t $
- Explanation of the Constraint: In a general DPLR model, the vectors $\mathbf{a}_t$ and $\mathbf{b}_t$ would be parameterized independently. In KDA, both are tied to the key vector $\mathbf{k}_t$ and the decay vector $\alpha_t$.
- Benefit of the Constraint: This constraint is crucial for efficiency. As the authors explain, general DPLR formulations can suffer from numerical precision issues (especially with division by decay terms) that require computationally expensive workarounds like secondary chunking. By tying $\mathbf{a}_t$ and $\mathbf{b}_t$ to $\mathbf{k}_t$, KDA's algorithm avoids these issues and significantly reduces the number of matrix multiplications in the chunkwise computation. This yields a kernel that is roughly twice as fast (about 100% faster) than a general DPLR implementation, as shown in Figure 2.
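A small numeric check of the identity behind Equation 14, namely that $(\mathbf{I} - \beta_t \mathbf{k}_t \mathbf{k}_t^\top)\,\mathrm{Diag}(\alpha_t) = \mathrm{Diag}(\alpha_t) - (\beta_t \mathbf{k}_t)(\mathbf{k}_t \odot \alpha_t)^\top$. The dimension and random values are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
d_k = 6
k = rng.normal(size=d_k)
alpha = rng.uniform(0.0, 1.0, size=d_k)
beta = 0.7

# KDA transition written as in Eq. 1: (I - beta k k^T) Diag(alpha)
lhs = (np.eye(d_k) - beta * np.outer(k, k)) @ np.diag(alpha)

# The same transition in constrained-DPLR form: D - a b^T with D = Diag(alpha), a = beta*k, b = k*alpha
rhs = np.diag(alpha) - np.outer(beta * k, k * alpha)

assert np.allclose(lhs, rhs)
print("KDA transition matches the constrained DPLR form")
```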
The following figure (Figure 2 from the original paper) shows the execution time comparison of the general DPLR kernel versus the specialized KDA kernel.
The figure compares the execution time of the DPLR and KDA kernels across input lengths (batch size 1, 16 heads). The curves show that KDA's execution time is significantly lower than DPLR's for long inputs, reflecting its superior computational efficiency.
4.2.4. The Kimi Linear Model Architecture
Kimi Linear is not just the KDA module but a full model architecture built around it.
The overall architecture is depicted in the following figure from the paper.
The figure is a schematic of the Kimi Linear model architecture, showing the layer structure and data flow of Multi-Head Latent Attention (MLA), Mixture of Experts (MoE), and the core Kimi Delta Attention (KDA) modules, including the combination of shared and routed experts.
- Step 4: Neural Parameterization The inputs to the KDA module ($\mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t, \alpha_t, \beta_t$) are generated from the token input representation $\mathbf{x}_t$ using small neural networks. The paper specifies the following parameterizations (Equation 7): $ \begin{aligned} \mathbf{q}_t^h, \mathbf{k}_t^h &= \mathrm{L2Norm}(\mathrm{Swish}(\mathrm{ShortConv}(\mathbf{W}_{q/k}^h \mathbf{x}_t))) \in \mathbb{R}^{d_k} \\ \mathbf{v}_t^h &= \mathrm{Swish}(\mathrm{ShortConv}(\mathbf{W}_v^h \mathbf{x}_t)) \in \mathbb{R}^{d_v} \\ \alpha_t^h &= f(\mathbf{W}_\alpha^{\uparrow} \mathbf{W}_\alpha^{\downarrow} \mathbf{x}_t) \in [0, 1]^{d_k} \\ \beta_t^h &= \mathrm{Sigmoid}(\mathbf{W}_\beta^h \mathbf{x}_t) \in [0, 1] \end{aligned} $
- Formula Explanation:
  - A short 1D convolution (ShortConv) followed by a Swish activation is applied to the projected inputs for q, k, v. This helps capture local context.
  - $\mathbf{q}_t^h$ and $\mathbf{k}_t^h$ are L2-normalized to stabilize training.
  - The channel-wise decay vector $\alpha_t^h$ is parameterized via a low-rank projection ($\mathbf{W}_\alpha^{\uparrow} \mathbf{W}_\alpha^{\downarrow}$), which is a parameter-efficient way to generate a high-dimensional vector.
  - The scalar update gate $\beta_t^h$ is generated via a linear projection followed by a Sigmoid activation.
- Step 5: Output Gating and Hybrid Structure The output of the KDA module is further processed before being passed to the next layer. This is described in Equation 8 (a toy sketch of this parameterization and gating appears after this list): $ \mathbf{o}_t = \mathbf{W}_o \left( \mathrm{Sigmoid}\left( \mathbf{W}_g^{\uparrow} \mathbf{W}_g^{\downarrow} \mathbf{x}_t \right) \odot \mathrm{RMSNorm}\left( \mathrm{KDA}\left( \mathbf{q}_t, \mathbf{k}_t, \mathbf{v}_t, \alpha_t, \beta_t \right) \right) \right) $
- Formula Explanation:
  - The raw output of the KDA module is first normalized using RMSNorm.
  - A data-dependent output gate, generated from the input $\mathbf{x}_t$ via a low-rank projection and a Sigmoid activation, is applied element-wise ($\odot$). This helps the model control the information flow and can mitigate issues like the "attention sink".
  - Finally, the gated output is projected back to the model's hidden dimension via $\mathbf{W}_o$.
- Architectural Choices:
  - Hybrid Model: The final Kimi Linear model interleaves three KDA layers with one full attention layer (Multi-Head Latent Attention, or MLA). This 3:1 ratio was found to provide the best trade-off between performance and efficiency.
  - MoE Integration: After the attention block (either KDA or MLA), the architecture uses a Mixture of Experts (MoE) layer for channel mixing, following the Moonlight model design.
  - No Position Embeddings (NoPE): A crucial design choice is that the full attention (MLA) layers do not use any explicit position embeddings like RoPE. This delegates the entire responsibility of encoding positional information and recency bias to the recurrent dynamics of the KDA layers, which the authors argue act as a form of learnable, dynamic position embedding.
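A minimal sketch of the gate parameterizations in Equations 7-8, restricted to the low-rank decay $\alpha_t$, the scalar $\beta_t$, and the sigmoid output gate. The projection sizes, the choice of Sigmoid for $f$, and the RMSNorm epsilon are illustrative assumptions; the short convolutions and multi-head structure are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rmsnorm(x, eps=1e-6):
    return x / np.sqrt(np.mean(x ** 2) + eps)

def kda_gates(x, W_alpha_dn, W_alpha_up, w_beta):
    """Channel-wise decay via a low-rank projection and a scalar update gate (cf. Eq. 7)."""
    alpha = sigmoid(W_alpha_up @ (W_alpha_dn @ x))   # in [0, 1]^{d_k}: one decay per channel
    beta = sigmoid(w_beta @ x)                       # scalar in [0, 1]
    return alpha, beta

def output_gate(x, kda_out, W_g_dn, W_g_up, W_o):
    """Sigmoid output gate applied to the RMS-normalized KDA output (cf. Eq. 8)."""
    gate = sigmoid(W_g_up @ (W_g_dn @ x))
    return W_o @ (gate * rmsnorm(kda_out))

rng = np.random.default_rng(0)
d_model, d_k, d_v, r = 16, 8, 8, 4
x, kda_out = rng.normal(size=d_model), rng.normal(size=d_v)
alpha, beta = kda_gates(x, rng.normal(size=(r, d_model)), rng.normal(size=(d_k, r)),
                        rng.normal(size=d_model))
y = output_gate(x, kda_out, rng.normal(size=(r, d_model)), rng.normal(size=(d_v, r)),
                rng.normal(size=(d_model, d_v)))
print(alpha.shape, float(beta), y.shape)  # (8,) <scalar> (16,)
```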
5. Experimental Setup
5.1. Datasets
The paper uses a comprehensive set of benchmarks to evaluate the model across various capabilities.
- Synthetic Tasks: Used to test the core memory and retrieval capabilities of the linear attention mechanisms.
  - Palindrome: The model must reverse a sequence of random tokens. This tests its ability to store and recall a sequence in order from a compressed memory state.
  - Multi-Query Associative Recall (MQAR): The model is given a sequence of key-value pairs and then queried with some of the keys, for which it must retrieve the corresponding values. This tests associative memory.
  - State Tracking: A task involving multiple stacks where the model must process PUSH and POP operations and predict the correct element when a POP occurs. This tests complex state management.
- General Language Understanding and Reasoning:
  - Hellaswag, ARC-Challenge, Winogrande, TriviaQA: Standard commonsense reasoning and question-answering benchmarks.
  - MMLU, MMLU-Redux, MMLU-Pro: Massive multitask language understanding benchmarks testing knowledge across dozens of subjects.
  - GPQA-Diamond, BBH: Challenging reasoning benchmarks designed to be difficult for current LLMs.
- Code Generation: LiveCodeBench, EvalPlus, CRUXEval: Benchmarks for evaluating a model's ability to generate functionally correct code from natural language descriptions.
- Math & Reasoning: AIME, MATH, HMMT, PolyMath-en, GSM8K: Benchmarks testing mathematical problem-solving and logical reasoning capabilities, from grade-school level to competition math.
- Long-context Evaluation: MRCR, RULER, Frames, HELMET-ICL, RepoQA, Long Code Arena, LongBench v2: A suite of benchmarks specifically designed to evaluate performance on tasks requiring long-context understanding (up to 1M tokens).
- Chinese Language Understanding: C-Eval, CMMLU: Benchmarks for evaluating performance on Chinese language tasks.

The choice of these datasets is effective because they cover a wide spectrum of abilities: core memory (synthetic), general knowledge, complex reasoning (math, code), and the primary target application of long-context processing.
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate model performance.
5.2.1. Perplexity (PPL)
- Conceptual Definition: Perplexity is a measure of how well a probability model predicts a sample. In language modeling, it quantifies the model's uncertainty in predicting the next token in a sequence. A lower perplexity indicates that the model is more confident and accurate in its predictions. It is the exponentiated average negative log-likelihood.
- Mathematical Formula: $ \text{PPL}(W) = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log p(w_i | w_1, \dots, w_{i-1}) \right) $
- Symbol Explanation:
  - $W$: A sequence of tokens $w_1, \dots, w_N$.
  - $N$: The total number of tokens in the sequence.
  - $p(w_i | w_1, \dots, w_{i-1})$: The probability assigned by the model to the token $w_i$, given the preceding tokens.
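A minimal sketch of computing perplexity from per-token log-probabilities; the toy probabilities below are made up for illustration.

```python
import numpy as np

def perplexity(token_log_probs):
    """token_log_probs: iterable of log p(w_i | w_<i), in natural log."""
    lps = np.asarray(token_log_probs, dtype=float)
    return float(np.exp(-lps.mean()))   # exponentiated average negative log-likelihood

# Toy example: a 4-token sequence with these conditional probabilities under the model.
probs = [0.25, 0.50, 0.10, 0.40]
print(perplexity(np.log(probs)))  # ~3.76; lower is better
```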
5.2.2. Accuracy (Acc.)
- Conceptual Definition: Accuracy is the proportion of correct predictions among the total number of predictions made. It is a straightforward measure of performance on classification or multiple-choice tasks.
- Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
- Symbol Explanation: This is self-explanatory.
5.2.3. Pass@k
- Conceptual Definition: Pass@k is a metric used primarily for evaluating code generation models. It measures the probability that at least one of $k$ generated code samples for a given problem will pass a set of predefined unit tests. It rewards models that can produce a correct solution within a few attempts.
- Mathematical Formula: To estimate Pass@k, one generates $n$ samples per problem ($n \ge k$), finds that $c$ of them are correct, and then calculates the unbiased estimator: $ \text{Pass@k} = \mathbb{E}_{\text{problems}} \left[ 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \right] $
- Symbol Explanation:
  - $n$: The total number of code samples generated for a single problem.
  - $k$: The number of samples considered for a "pass" (e.g., Pass@1, Pass@10).
  - $c$: The number of generated samples that correctly pass the unit tests.
  - $\binom{n}{k}$: The binomial coefficient, "n choose k".
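A minimal implementation of the estimator above; the example counts (n=20, c=3) are made up.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator of pass@k given n samples with c correct (requires n >= k)."""
    if n - c < k:
        return 1.0                       # not enough incorrect samples to fill k draws
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 samples per problem, 3 of which pass the unit tests.
print(pass_at_k(n=20, c=3, k=1))        # ~0.15
print(round(pass_at_k(n=20, c=3, k=10), 4))
```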
5.2.4. Avg@k
The paper mentions using Avg@k for benchmarks with high variance.
- Conceptual Definition: This typically refers to the practice of running the evaluation $k$ times with different random seeds and reporting the average score. This helps to produce a more stable and reliable estimate of the model's performance by smoothing out randomness in the generation or evaluation process.
5.3. Baselines
The primary baselines used for comparison are:
- MLA (Multi-Head Latent Attention): The full-attention baseline model, built on the Moonlight architecture. It uses standard quadratic-time attention in all layers. This comparison is crucial to validate the core claim that Kimi Linear can outperform full attention.
- GDN-H (Gated DeltaNet-Hybrid): A hybrid model similar to Kimi Linear but using the original Gated DeltaNet (with scalar gating) instead of KDA. This baseline serves to isolate the benefit of KDA's fine-grained gating mechanism over its direct predecessor.
- Kimi Linear (RoPE): A variant of the Kimi Linear model that uses RoPE in its global attention layers instead of NoPE. This baseline is used in the long-context evaluation to specifically test the effectiveness of the NoPE design choice.

These baselines are well-chosen because they allow for controlled comparisons that ablate the key architectural contributions of the paper: the linear attention mechanism itself (KDA vs. GDN), the hybrid structure (Kimi Linear vs. MLA), and the positional encoding strategy (NoPE vs. RoPE).
6. Results & Analysis
6.1. Core Results Analysis
The paper's experimental results consistently support the central claim that Kimi Linear is a superior architecture, providing both performance gains and efficiency improvements over standard full attention.
6.1.1. Synthetic Task Performance
The results on synthetic tasks (Figure 4) serve as a controlled test of the memory capabilities of different linear attention mechanisms.
The following figure (Figure 4 from the original paper) shows these results.
The figure consists of three sub-plots showing how the accuracy of different models on the synthetic tasks (Palindrome, Multi-Query Associative Recall, and State Tracking) varies with sequence length and training steps.
- Analysis: KDA consistently achieves the highest accuracy across all three tasks (Palindrome, MQAR, State Tracking), especially as the sequence length increases. Its strong performance on Palindrome and MQAR, which heavily rely on precise storage and retrieval, suggests that its fine-grained gating mechanism is more effective at managing memory than the scalar gates of GDN and Mamba2. It can more precisely decide which pieces of information to keep and which to discard.
6.1.2. Scaling Law Analysis
The scaling law experiments (Figure 5) compare how Kimi Linear and the full-attention MLA baseline improve as more computational resources are used for training.
The following figure (Figure 5 from the original paper) shows the fitted scaling law curves.
The figure shows the fitted scaling-law curves for MLA and Kimi Linear, with training compute (PFLOP/s-days) on the horizontal axis and loss on the vertical axis; Kimi Linear achieves a lower loss than MLA at the same compute.
- Analysis: The curves show that for a given amount of training compute (PFLOP/s-days), Kimi Linear achieves a lower validation loss than MLA. The paper quantifies this as a ~1.16x improvement in computational efficiency. This means Kimi Linear learns more effectively from the same amount of computation, suggesting its architecture is inherently more efficient at modeling the data.
6.1.3. Main Results on Pre-trained and Instruction-Tuned Models
The core comparison is between Kimi Linear, MLA, and GDN-H models trained on 1.4T tokens. The results demonstrate Kimi Linear's superiority in standard short-context evaluations.
- Pre-training Results (Base Models): In Table 3, Kimi Linear outperforms both MLA and GDN-H on nearly all benchmarks, including general knowledge (MMLU, BBH), reasoning (GSM8K, CRUXEval), and Chinese tasks (CEval, CMMLU). This is a powerful result, as it shows that even before fine-tuning, the architecture is more capable than a standard Transformer.
- Post-training Results (Instruct Models): This trend continues after instruction tuning (Table 4). Kimi Linear leads on challenging benchmarks like MMLU-Pro and GPQA-Diamond, and on difficult math/code tasks like AIME, HMMT, and LiveCodeBench. This confirms that the architectural advantages translate directly to improved capabilities in instruction-following models.
6.1.4. Long-Context and RL Performance
- Long-Context: Table 5 shows Kimi Linear achieving the highest average score across a suite of long-context benchmarks. It notably excels on RULER and RepoQA. Interestingly, the Kimi Linear (RoPE) variant performs worse on average, validating the authors' design choice to use NoPE in the global attention layers and rely on KDA for positional awareness. This suggests that forcing the KDA layers to handle position information creates a more robust system for long-context generalization.
- Reinforcement Learning (RL): The results in Figure 8 are particularly compelling. In RL training for mathematical reasoning, Kimi Linear not only starts at a higher accuracy but also improves faster and reaches a higher peak performance than the MLA baseline. This is significant because RL involves long generation trajectories, where the efficiency and stable memory management of Kimi Linear provide a clear advantage over the computationally heavy and potentially less stable full attention.

The following figure (Figure 8 from the original paper) shows the RL training curves.
The figure consists of three comparison line charts showing the accuracy of Kimi Linear versus MLA in different settings (training, the MATH 500 test set, and AIME 2025); Kimi Linear outperforms MLA at every evaluation point.
6.1.5. Efficiency Comparison
The efficiency gains are a cornerstone of the paper's contribution.
- Prefill & Decoding Speed: As shown in Figure 1 and Figure 9, Kimi Linear's efficiency advantage becomes dramatic at long sequence lengths. While comparable to MLA at short lengths (4k-16k), its linear scaling means it becomes significantly faster for long contexts. For a 1M-token context, it achieves:
  - 2.9x faster prefill (initial processing of the prompt).
  - 6.3x faster decoding (generating one token at a time).
- KV Cache Reduction: Because KDA layers have a fixed-size state and do not need a KV cache that grows with sequence length, the hybrid model's total KV cache is drastically smaller (up to 75% reduction). This is a massive practical advantage, as it enables million-token contexts on less hardware. A back-of-the-envelope sketch of this reduction follows the figure below.
The following figure (Figure 9 from the original paper) shows the prefill and decoding performance.
The figure consists of two line charts comparing the prefill latency and decoding throughput (TPOT) of Kimi Linear against MLA and GDN-H across context lengths, showing Kimi Linear's advantage of 2.9x lower latency and 2.2x higher throughput at 1M context length.
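A back-of-the-envelope sketch of where the roughly 75% KV-cache saving comes from under the 3:1 layout: only the MLA layers keep a cache that grows with context, while each KDA layer holds a fixed-size state. The layer count and per-token cache size below are placeholder numbers, not the real model configuration.

```python
def kv_cache_reduction(num_layers=48, kda_per_mla=3, bytes_per_token_per_layer=4096,
                       context=1_000_000):
    """Compare a full-attention stack to a 3:1 KDA:MLA hybrid (toy numbers)."""
    mla_layers = num_layers // (kda_per_mla + 1)        # 1 of every 4 layers is MLA
    full_cache = num_layers * context * bytes_per_token_per_layer
    hybrid_cache = mla_layers * context * bytes_per_token_per_layer  # KDA state is O(1) in context
    return 1.0 - hybrid_cache / full_cache

print(kv_cache_reduction())  # 0.75 -> a 75% reduction in the growing KV cache
```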
6.2. Data Presentation (Tables)
6.2.1. Ablation Study Results
The following are the results from Table 1 of the original paper:
| | | Training PPL (↓) | Validation PPL (↓) |
|---|---|---|---|
| Hybrid ratio | 0:1 | 9.45 | 5.77 |
| | 1:1 | 9.29 | 5.66 |
| | 3:1 | 9.23 | 5.65 |
| | 7:1 | 9.23 | 5.70 |
| | 15:1 | 9.34 | 5.82 |
| | w/o output gate | - | 5.67 |
| | w/ swish output gate | - | 5.81 |
| | w/o convolution layer | - | 5.70 |
- Analysis: This table validates several key design choices. The 3:1 hybrid ratio (3 KDA layers for every 1 MLA layer) achieves the best validation perplexity, indicating an optimal balance. Removing the Sigmoid output gate or replacing it with Swish hurts performance, confirming its importance. Finally, removing the ShortConv layer also degrades performance, showing that even in a hybrid model, these local convolutions are beneficial.
6.2.2. Main Results (Base Models @ 1.4T)
The following are the results from Table 3 of the original paper:
| Type | Benchmark | MLA | GDN-H | Kimi Linear |
|---|---|---|---|---|
| | Trained Tokens | 1.4T | 1.4T | 1.4T |
| General | HellaSwag | 81.7 | 82.2 | 82.9 |
| | ARC-challenge | 64.6 | 66.5 | 67.3 |
| | Winogrande | 78.1 | 77.9 | 78.6 |
| | BBH | 71.6 | 70.6 | 72.9 |
| | MMLU | 71.6 | 72.2 | 73.8 |
| | MMLU-Pro | 47.2 | 47.9 | 51.0 |
| | TriviaQA | 68.9 | 70.1 | 71.7 |
| Math & Code | GSM8K | 83.7 | 81.7 | 83.9 |
| | MATH | 54.7 | 54.1 | 54.7 |
| | EvalPlus | 59.5 | 63.1 | 60.2 |
| | CRUXEval-I-cot | 51.6 | 56.0 | 56.6 |
| | CRUXEval-O-cot | 61.5 | 58.1 | 62.0 |
| Chinese | CEval | 79.3 | 79.1 | 79.5 |
| | CMMLU | 79.5 | 80.7 | 80.8 |
6.2.3. Long Context Performance Comparison
The following are the results from Table 5 of the original paper:
| Model | RULER | MRCR | HELMET-ICL | LongBench V2 | Frames | RepoQA | Long Code Arena (Lib) | Long Code Arena (Commit) | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| MLA | 81.3 | 22.6 | 88.0 | 36.1 | 60.5 | 63.0 | 32.8 | 33.2 | 52.2 |
| GDN-H | 80.5 | 23.9 | 85.5 | 32.6 | 58.7 | 63.0 | 34.7 | 30.5 | 51.2 |
| Kimi Linear (RoPE) | 78.8 | 22.0 | 88.0 | 35.4 | 59.9 | 66.5 | 31.3 | 32.5 | 51.8 |
| Kimi Linear | 84.3 | 29.6 | 90.0 | 35.0 | 58.8 | 68.5 | 37.1 | 32.7 | 54.5 |
6.3. Ablation Studies / Parameter Analysis
The ablation studies were crucial for justifying the paper's design choices:
- Hybrid Ratio (Table 1): The experiments tested ratios from 0:1 (all MLA) to 15:1. The 3:1 ratio provided the best validation PPL, suggesting that while the global attention layers are important, a small number of them is sufficient. Too many linear layers (e.g., 7:1, 15:1) start to degrade performance, likely due to loss of global information flow.
- Output Gating (Table 1): Removing the final Sigmoid gate on the KDA output or replacing it with Swish degrades performance. This shows the gate is a critical component for stabilizing the model and controlling information flow, consistent with findings in other recent work.
- Positional Encoding (Table 5): The comparison between Kimi Linear (with NoPE) and Kimi Linear (RoPE) is a key ablation. The NoPE version achieves a much better average score (54.5 vs. 51.8) on long-context tasks. The authors argue this is because RoPE biases the global attention layers towards local/short-range patterns, making them less flexible. By removing RoPE, the KDA layers are forced to handle positional information, which they do more dynamically and effectively for long-range dependencies.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Kimi Linear, a hybrid attention architecture that marks a significant step towards solving the performance-efficiency trade-off in LLMs. The core of this architecture, Kimi Delta Attention (KDA), enhances linear attention with a fine-grained, channel-wise gating mechanism that provides superior memory control. By combining KDA with a small number of full-attention layers in a 3:1 ratio, Kimi Linear achieves a remarkable feat: it outperforms a standard full-attention baseline in fair, large-scale comparisons across short-context, long-context, and reinforcement learning tasks. Simultaneously, it delivers substantial efficiency gains, including up to a 75% reduction in KV cache and a 6.3x speedup in decoding for million-token contexts. The work provides a compelling case for Kimi Linear as a practical, scalable, and high-performing drop-in replacement for traditional Transformer architectures.
7.2. Limitations & Future Work
The paper, being an industry technical report, focuses on showcasing achievements and does not include a dedicated limitations section. However, some potential limitations and future research directions can be inferred:
- Empirically-Driven Design: Many key design choices, such as the 3:1 hybrid ratio, are determined empirically. While effective, a deeper theoretical understanding of why this specific ratio is optimal would be valuable.
- Architectural Generalization: The experiments are conducted on a specific MoE-based architecture (Moonlight). While the results are strong, further research is needed to confirm whether the benefits of Kimi Linear generalize across different LLM backbones (e.g., non-MoE models, different sizes).
- Static Hybridization: The 3:1 interleaving is static. Future work could explore dynamic hybridization, where the model might learn to route tokens through linear or full-attention layers based on context or task requirements.
- Synergy with Sparse Attention: The paper positions KDA as an alternative to sparse attention. An interesting future direction would be to combine KDA with sparse attention mechanisms, potentially reaping the benefits of both a constant-memory state (KDA) and sub-linear attention patterns (sparse attention).
7.3. Personal Insights & Critique
- Key Innovation: The most insightful contribution is the move from a scalar/head-wise gate to a channel-wise gate in KDA. It's a conceptually simple change that yields significant expressive power by allowing the model to treat different features in its memory state heterogeneously. This highlights a recurring theme in deep learning: increasing the "granularity" of control mechanisms often unlocks better performance.
- Pragmatism and Co-design: The paper is an excellent example of pragmatic co-design. The KDA algorithm isn't just theoretically elegant; it's a constrained variant of DPLR specifically designed for hardware efficiency and numerical stability. This focus on making the method fast in practice is what makes it a viable industry solution, not just an academic curiosity. The open-sourcing of optimized kernels and vLLM integration further underscores this practical commitment.
- A Paradigm Shift for Long Context: This work, along with others in the hybrid space, signals a potential paradigm shift. Instead of a single, uniform attention mechanism, the future of LLMs may lie in heterogeneous architectures that combine specialized components for different purposes (e.g., KDA for efficient state-tracking, full attention for global integration). The
NoPE design choice is particularly bold and suggests that we may be able to delegate more implicit reasoning, including positional awareness, to well-designed recurrent mechanisms.
- Critique: The claim of being the "first" to outperform full attention under "fair comparisons" is very strong. While the authors' setup appears rigorous, "fairness" in LLM comparison is notoriously complex and depends on many factors in the "training recipe." As an industry report without peer review, the results should be interpreted with the understanding that they represent the findings from a single, highly capable team. Independent replication and analysis by the broader research community will be essential to fully validate these impressive claims. The futuristic publication date also remains an unusual artifact of the source material.