
MAMBA-3: IMPROVED SEQUENCE MODELING USING STATE SPACE PRINCIPLES

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Mamba-3 proposes a novel State Space Model architecture, leveraging a more expressive recurrence, complex state updates, and a multi-input, multi-output formulation, to overcome the Transformer's inference inefficiencies and improve hardware utilization. It achieves significant performance gains in language modeling, retrieval, and state-tracking, setting a new Pareto frontier for quality under a fixed inference budget.

Abstract

Under review as a conference paper at ICLR 2026. The recent scaling of test-time compute for LLMs has restricted the practical deployment of models to those with strong capabilities that can generate high-quality outputs in an inference-efficient manner. While current Transformer-based models are the standard, their quadratic compute and linear memory bottlenecks have spurred the development of sub-quadratic models with linear-scaling compute and constant memory requirements. However, many recent linear-style models lack certain capabilities or lag behind in quality, and even their linear-time inference is not hardware-efficient. Guided by an inference-first perspective, we introduce three core methodological improvements inspired by the state-space model viewpoint of linear models. We combine a 1) more expressive recurrence […]


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: MAMBA-3: IMPROVED SEQUENCE MODELING USING STATE SPACE PRINCIPLES
  • Authors: Anonymous authors (Paper under double-blind review).
  • Journal/Conference: The paper is currently under double-blind review, so the publication venue is not yet determined.
  • Publication Year: The citations in the paper suggest it was written in or around 2025.
  • Abstract: The paper addresses the challenge of creating Large Language Models (LLMs) that are both powerful and inference-efficient. It identifies that while Transformer models are standard, their quadratic compute and linear memory scaling are bottlenecks. Existing sub-quadratic models often compromise on quality, capabilities, or hardware efficiency. The authors propose Mamba-3, a model built on three core improvements inspired by State Space Model (SSM) principles: 1) a more expressive recurrence through trapezoidal discretization, 2) a complex state update rule for better state-tracking, and 3) a multi-input, multi-output (MIMO) formulation to improve hardware utilization during decoding. Mamba-3 is shown to achieve superior performance on retrieval, state-tracking, and language modeling tasks, setting a new Pareto-frontier for performance under a fixed inference budget.
  • Original Source Link: /files/papers/68ecb843346a19cdf79de85e/paper.pdf. This is a local file path, indicating the paper is likely a preprint or under review.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Modern AI progress is increasingly driven by "test-time compute" (e.g., chain-of-thought), making the inference efficiency of LLMs a critical bottleneck for practical deployment. Standard Transformer models are inefficient due to their self-attention mechanism, which has compute costs that grow quadratically with sequence length and memory costs (for the KV cache) that grow linearly.
    • Gaps in Prior Work: While sub-quadratic models like Mamba-2 and Gated DeltaNet offer linear-time inference and constant memory, they often sacrifice performance. Specifically, Mamba-2 simplified its underlying SSM for speed, which hurt model quality. Many linear models also lack crucial capabilities like state-tracking (e.g., solving simple arithmetic or parity tasks) and are not "hardware-efficient" during decoding because their operations have low arithmetic intensity, leaving hardware underutilized.
    • Fresh Angle: The paper adopts an "inference-first" perspective. Instead of designing a model purely for training efficiency or theoretical elegance, the authors focus on principles that directly enhance performance and hardware utilization during the critical decoding phase. They draw inspiration from classical state-space model theory to systematically improve upon the Mamba-2 architecture.
  • Main Contributions / Findings (What):

    • The paper introduces Mamba-3, a new sequence model architecture with three core methodological innovations:
      1. Trapezoidal Discretization: A more accurate method for converting the continuous-time SSM to a discrete recurrence, making the model more expressive. This effectively replaces the need for a separate short convolution layer.
      2. Complex-Valued SSMs: By allowing the SSM state to be complex-valued, the model can represent "rotational" dynamics. This is implemented efficiently using a data-dependent Rotary Position Embedding (RoPE), which restores the state-tracking capabilities lost in previous real-valued SSMs.
      3. Multi-Input, Multi-Output (MIMO) Formulation: The state update is changed from an outer product to a matrix multiplication. This increases the number of computations (FLOPs) per step without increasing the memory I/O, thereby boosting arithmetic intensity and making decoding more hardware-efficient.
    • Key Findings:
      • Better Quality: Mamba-3 outperforms Transformer, Mamba-2, and Gated DeltaNet baselines on standard language modeling benchmarks at various scales (180M to 1.5B parameters).
      • Better Capability: The complex-valued SSM enables Mamba-3 to solve synthetic state-tracking tasks (like parity and modular arithmetic) that Mamba-2 cannot.
      • Better Inference Efficiency: The MIMO variant improves model quality without increasing the state size (which is the main driver of latency), thus pushing the Pareto-frontier of performance versus inference speed.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Transformer Models: Architectures that have become the standard for NLP. Their core component is the self-attention mechanism, which allows every token in a sequence to look at every other token to compute its representation. While powerful, this all-to-all comparison leads to $O(T^2)$ computation time and $O(T)$ memory for the Key-Value (KV) cache during generation, where $T$ is the sequence length.
    • State Space Models (SSMs): A class of models from classical control theory used to describe dynamical systems. In deep learning, they model a sequence by updating a hidden state h at each timestep. A continuous-time SSM is defined by differential equations, which are then "discretized" for use with discrete data like text. Their key advantage is efficiency: inference is recurrent, taking O(T) time and requiring only constant memory to store the state, regardless of sequence length.
    • Mamba Architecture: A recent family of SSM-based models that became competitive with Transformers. Its key innovation is selectivity, where the SSM parameters ($A$, $B$, $C$, $\Delta$) are functions of the input data. This allows the model to selectively focus on or ignore information, mimicking some of attention's capabilities. Mamba-2 simplified Mamba-1's design for faster training.
    • Linear Attention: An approximation of self-attention that reduces the complexity from quadratic to linear. However, these models often struggle with tasks requiring precise memory or state-tracking.
    • Inference Efficiency & Arithmetic Intensity: During model inference, performance can be limited by either computation (compute-bound) or data transfer from memory (memory-bound). Arithmetic intensity is the ratio of floating-point operations (FLOPs) to memory bytes transferred. To fully utilize modern GPUs, an operation's arithmetic intensity must be high enough to keep the processing cores busy while data is being fetched. Low arithmetic intensity means the model is memory-bound and hardware is idle.
  • Previous Works:

    • Mamba-2 & Gated DeltaNet: These are cited as popular, inference-efficient models. However, the paper argues they made architectural trade-offs that sacrificed quality and capability. Mamba-2, for instance, simplified its SSM parameterization to a single scalar, which hindered its expressiveness.
    • Linear Attention Models: While fast, these models are known to have poor state-tracking abilities, as highlighted by work from Grazzi et al. (2024).
    • SSM Expressivity Studies: Recent work (Merrill et al., 2025; Grazzi et al., 2025) has shown that simplifying SSMs (e.g., by restricting their transition matrices to have only real eigenvalues) limits their ability to solve certain algorithmic tasks like parity, which require rotational dynamics. This provides the theoretical motivation for Mamba-3's complex-valued states.
  • Technological Evolution: The field has been moving from the powerful but inefficient Transformer architecture toward more efficient sub-quadratic models. Mamba-3 represents a shift in this search, moving from a training-centric design to an "inference-first" philosophy, where design choices are explicitly guided by the need for high-quality and hardware-efficient decoding.

  • Differentiation: Mamba-3's key distinction from Mamba-2 lies in its reversal of simplification. Where Mamba-2 simplified the SSM for speed, Mamba-3 introduces more complex and expressive machinery (trapezoidal rule, complex states, MIMO) drawn from classical SSM theory, but does so in a way that is still hardware-efficient and in some cases even faster.

4. Methodology (Core Technology & Implementation)

Mamba-3 builds upon the Mamba-2 architecture with three core innovations derived from a state-space perspective.

4.1 Trapezoidal Discretization

  • Principles: SSMs are often defined in continuous time but must be converted (discretized) to work on discrete sequences. Mamba-2 uses Euler's method, a simple first-order approximation. Mamba-3 proposes using a generalized trapezoidal rule, a more accurate second-order method. This rule approximates the state update by using a weighted average of the function's values at both ends of the time interval, rather than just one endpoint.

  • Steps & Procedures: The continuous-time SSM hidden state update is given by an integral.

    • Euler's Rule (Mamba-2): Approximates the integral using the value at the right endpoint, leading to the recurrence $h_t \approx e^{\Delta_t A_t} h_{t-1} + \Delta_t B_t x_t$.
    • Generalized Trapezoidal Rule (Mamba-3): Uses a data-dependent convex combination of the endpoints, controlled by a learned parameter $\lambda_t$. This yields a more expressive recurrence.
  • Mathematical Formulas & Key Details (Proposition 1): The Mamba-3 recurrence is: $h_t = \alpha_t h_{t-1} + \beta_t B_{t-1} x_{t-1} + \gamma_t B_t x_t$

    • Symbol Explanation:

      • $h_t$: The hidden state at time $t$.
      • $x_t$: The input at time $t$.
      • $B_t$: The input projection matrix at time $t$.
      • $\alpha_t := e^{\Delta_t A_t}$: The decay term, the same as in Mamba-2.
      • $\beta_t := (1 - \lambda_t)\,\Delta_t\, e^{\Delta_t A_t}$: The weight on the previous input $x_{t-1}$.
      • $\gamma_t := \lambda_t \Delta_t$: The weight on the current input $x_t$.
      • $\lambda_t \in [0, 1]$: A data-dependent scalar that balances the influence of the current and previous inputs. When $\lambda_t = 1$, the rule reduces to Mamba-2's Euler method.
    • Impact: This recurrence acts like a small, data-dependent convolution of size two applied to the input stream. As shown in Figure 1, it can be represented in matrix form as the product of Mamba-2's decay mask and a new convolutional mask. This added expressivity, combined with a learned bias on the B and C projections, makes the explicit short convolution layer used in Mamba-1 and other models redundant. A minimal numerical sketch of the recurrence follows the figure caption below.

      Figure 1: Left: the structured mask induced by the generalized trapezoidal rule is a product of the decay mask and a convolutional mask. Right: Euler's rule (hold the right endpoint) versus the trapezoidal rule (average the endpoints) as numerical approximations of the integral $\int_{t_{k-1}}^{t_k} e^{A_k(t_k-\tau)} B(\tau) x(\tau)\, d\tau$, highlighting how each method fills in the area under the integrand.
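The following is a minimal NumPy sketch of the trapezoidal recurrence above, run on random toy data. The shapes, the sigmoid parameterization of $\lambda_t$, and the sign/scale conventions for $A_t$ and $\Delta_t$ are illustrative assumptions, not the paper's implementation.

```python
# Sketch of the generalized trapezoidal recurrence
#   h_t = alpha_t * h_{t-1} + beta_t * B_{t-1} x_{t-1} + gamma_t * B_t x_t
# Shapes and parameterizations below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
T, d_state, d_in = 6, 4, 3                           # seq length, state size, input width

x     = rng.standard_normal((T, d_in))               # inputs x_t
B     = rng.standard_normal((T, d_state, d_in))      # input projections B_t
A     = -np.abs(rng.standard_normal(T))              # negative values -> decay
delta = np.abs(rng.standard_normal(T))               # step sizes Delta_t > 0
lam   = 1 / (1 + np.exp(-rng.standard_normal(T)))    # lambda_t in [0, 1]

alpha = np.exp(delta * A)                 # decay term, as in Mamba-2
beta  = (1 - lam) * delta * alpha         # weight on the previous input
gamma = lam * delta                       # weight on the current input

h = np.zeros(d_state)
for t in range(T):
    h = alpha[t] * h + gamma[t] * (B[t] @ x[t])
    if t > 0:                             # trapezoidal term uses the t-1 endpoint
        h += beta[t] * (B[t - 1] @ x[t - 1])
# With lam == 1, beta == 0 and the loop reduces to Mamba-2's Euler update.
```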

4.2 Complex-Valued SSMs

  • Principles: Real-valued SSMs, especially those with scalar or diagonal transition matrices, cannot produce "rotational" dynamics in their hidden state. This prevents them from solving simple state-tracking tasks such as counting modulo n (e.g., parity, which is counting modulo 2). To restore this capability, Mamba-3 allows the SSM's state transition matrix to have complex eigenvalues.

  • Steps & Procedures:

    1. Start with a complex-valued SSM whose state transition matrix is diagonal with complex entries: $\mathrm{Diag}(A(t) + i\theta(t))$.
    2. Discretize this system. The complex exponential $e^{i\theta}$ corresponds to a 2D rotation matrix in real coordinates.
    3. The result (Proposition 2) is a real-valued SSM in which the state dimension is doubled and the state transition involves a block-diagonal matrix of $2 \times 2$ rotation matrices.
    4. Crucially, these rotations can be absorbed into the input (B) and output (C) projections via a "RoPE trick" (Proposition 3). This means the core recurrence can remain simple (scalar decay), while the B and C vectors are rotated at each step based on a cumulative rotation.
  • Mathematical Formulas & Key Details (Proposition 3): The complex SSM is shown to be equivalent to a real-valued SSM of the form: $h_t = e^{\Delta_t A_t} h_{t-1} + \left(\prod_{i=0}^{t} R_i^\top\right) B_t x_t$ and $y_t = \left(\left(\prod_{i=0}^{t} R_i^\top\right) C_t\right)^\top h_t$

    • Symbol Explanation:
      • $h_t, y_t, A_t, B_t, C_t, x_t$: Standard real-valued SSM components.
      • $R_i$: A block-diagonal rotation matrix at time step $i$; its angles are data-dependent.
      • $\prod_{i=0}^{t} R_i^\top$: A cumulative rotation matrix representing the total rotation since the start of the sequence.
    • Connection to RoPE: This is analogous to Rotary Position Embeddings (RoPE) used in Transformers, which applies rotations to queries and keys. Here, the rotations are data-dependent and applied to the B and C components of the SSM. This modification is lightweight and efficiently restores state-tracking capabilities.
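As an illustration of the "RoPE trick" in Proposition 3, the sketch below checks numerically that rotating the hidden state at each step is equivalent to keeping a plain scalar-decay recurrence and applying the cumulative rotation to the B and C vectors instead. The single 2-D state pair, the random angles, and the decay parameterization are toy assumptions, not the paper's code.

```python
# Equivalence of (a) a rotational state recurrence and (b) the "RoPE" form
# where cumulative rotations are absorbed into B and C.  Toy dimensions.
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

rng = np.random.default_rng(1)
T = 8
x     = rng.standard_normal(T)                     # scalar input channel
B     = rng.standard_normal((T, 2))                # one 2-D state pair
C     = rng.standard_normal((T, 2))
decay = np.exp(-np.abs(rng.standard_normal(T)))    # e^{Delta_t A_t}
theta = rng.standard_normal(T)                     # data-dependent angles

# (a) Rotational recurrence: the state itself is rotated at every step.
h = np.zeros(2); y_state_rot = []
for t in range(T):
    h = decay[t] * (rot(theta[t]) @ h) + B[t] * x[t]
    y_state_rot.append(C[t] @ h)

# (b) RoPE form: scalar decay only; the cumulative rotation moves onto B and C.
h = np.zeros(2); y_rope = []; cum = np.eye(2)
for t in range(T):
    cum = cum @ rot(theta[t])                      # R_0 R_1 ... R_t
    h = decay[t] * h + (cum.T @ B[t]) * x[t]
    y_rope.append((cum.T @ C[t]) @ h)

print(np.allclose(y_state_rot, y_rope))            # True: the two forms match
```

Because 2-D rotations commute, the cumulative product can be maintained as a single running rotation (or just a running angle), which is what keeps this modification lightweight.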

4.3 Multi-Input, Multi-Output (MIMO)

  • Principles: Standard Single-Input, Single-Output (SISO) SSMs used in Mamba-2 have a state update that is an outer product. This operation has very low arithmetic intensity, making it memory-bound during decoding and leading to inefficient hardware use. Mamba-3 generalizes this to a MIMO formulation where the update is a matrix multiplication, which has much higher arithmetic intensity.

  • Steps & Procedures:

    • SISO update (low arithmetic intensity): $H_t = a_t H_{t-1} + b_t \otimes x_t$. Here, $b_t$ and $x_t$ are vectors.

    • MIMO update (high arithmetic intensity): $H_t = a_t H_{t-1} + B_t X_t^\top$. Here, $B_t$ and $X_t$ are matrices of rank $r$, where $r$ is the MIMO rank.

    • By increasing the rank $r$, one can increase the FLOPs of the decoding step without increasing the size of the state $H_t$ that needs to be loaded from memory. This shifts the operation from being memory-bound toward being compute-bound, resulting in better hardware utilization and faster overall decoding, even with more computations.

      Figure 2: Arithmetic intensity for (a) SISO, (b) MIMO; batch and head dimensions cancel out. (Note: the provided image shows only the SISO case. The paper contrasts it with a corresponding table for MIMO, where arithmetic intensity scales with the rank $r$.)
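To make the arithmetic-intensity argument concrete, here is a back-of-the-envelope calculation for a single head's state update during one decoding step. The dimensions, the bf16 byte count, and the assumption that memory traffic is dominated by reading and writing the state are simplifying assumptions for illustration, not figures from the paper.

```python
# Rough arithmetic intensity (FLOPs per byte) of one decode-step state update,
# comparing the SISO outer product (r = 1) with a MIMO matmul (r > 1).
d_state, d_head = 128, 64        # assumed state rows and head width
bytes_per_elem = 2               # bf16

def intensity(rank):
    # State update: H <- a*H + B @ X^T with B: (d_state, r), X: (d_head, r).
    flops = 2 * d_state * d_head * rank + d_state * d_head   # matmul + decay
    # Traffic assumed dominated by reading and writing the state H once each.
    mem = 2 * d_state * d_head * bytes_per_elem
    return flops / mem

print("SISO (r=1):", intensity(1))    # few FLOPs per byte -> memory-bound
print("MIMO (r=8):", intensity(8))    # FLOPs grow with r, state traffic does not
```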

4.4 Mamba-3 Architecture

The complete Mamba-3 block integrates these changes. Figure 4 below contrasts it with Mamba-2.

Figure 4: Contrasting the Mamba-2 and Mamba-3 architectures. Key updates in Mamba-3 include a trapezoidal-discretization SSM block, data-dependent RoPE embeddings, MIMO projections, QK normalization, and learnable biases, aimed at improving state tracking and hardware parallelism; colors in the diagram distinguish linear projections, sequence transformations, MIMO projections, and nonlinear operations.

  • The core SSM block is replaced by the Trapezoidal SSM.
  • A RoPE module is applied to the inputs of the B and C projections to implement the complex state dynamics.
  • The normalization layer is moved to resemble QK-Norm in Transformers.
  • The Conv layer is now optional, as its function is largely covered by the trapezoidal discretization and added biases.
  • The MIMO Projection is an optional toggle to improve inference efficiency.

5. Experimental Setup

  • Datasets:

    • Pretraining: Models were trained on 100B tokens from the FineWeb-Edu dataset.
    • Downstream Language Modeling: Standard benchmarks including LAMBADA, HellaSwag, PIQA, Arc-Easy/Challenge, WinoGrande, and OpenBookQA.
    • Retrieval: Real-world cloze-style tasks (SWDE, SQUAD, FDA, TriviaQA, NQ, DROP) and synthetic Needle-In-A-Haystack (NIAH) tasks.
    • State-Tracking: Synthetic tasks from the Chomsky hierarchy: Parity and Modular Arithmetic (with and without brackets).
  • Evaluation Metrics:

    1. Perplexity (ppl):
      • Conceptual Definition: A measure of how well a probability model predicts a sample. In language modeling, it quantifies the model's uncertainty when predicting the next token in a sequence; lower perplexity indicates a better model.
      • Mathematical Formula: For a test set $X = (x_1, x_2, \dots, x_N)$, perplexity is the exponential of the average cross-entropy loss: $\mathrm{PPL}(X) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log P(x_i \mid x_{<i}; \theta)\right)$ (a minimal computation sketch follows this list).
      • Symbol Explanation:
        • $N$: The total number of tokens in the test set.
        • $x_i$: The $i$-th token in the sequence.
        • $x_{<i}$: The context preceding the $i$-th token.
        • $P(x_i \mid x_{<i}; \theta)$: The probability the model with parameters $\theta$ assigns to token $x_i$ given the context.
    2. Accuracy (acc):
      • Conceptual Definition: The proportion of correct predictions out of the total number of predictions made. It is a straightforward measure of performance on classification-style tasks.
      • Mathematical Formula: $\mathrm{Accuracy} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}}$
      • Symbol Explanation: The terms are self-explanatory. This is used for tasks where there is a single correct answer (e.g., multiple-choice questions in Arc-E).
  • Baselines: The main models used for comparison are Transformer, Gated DeltaNet, and Mamba-2, all trained under the same conditions for fair comparison.
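As referenced in the perplexity definition above, here is a minimal sketch of computing perplexity from per-token log-probabilities; the toy values are made up for illustration.

```python
# Perplexity = exp of the average negative log-likelihood over N tokens.
import math

log_probs = [-2.1, -0.4, -1.3, -0.9]           # log P(x_i | x_<i), toy values
ppl = math.exp(-sum(log_probs) / len(log_probs))
print(round(ppl, 2))                            # lower is better
```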

6. Results & Analysis

6.1 Core Results: Language Modeling and Retrieval

  • Language Modeling (Table 1): The paper provides a comprehensive table showing Mamba-3's performance against baselines across four model sizes (180M, 440M, 820M, 1.5B). Only the 1.5B rows are reproduced below (manual transcription of Table 1).

    | Model | FW-Edu ppl ↓ | LAMB. ppl ↓ | LAMB. acc ↑ | HellaS. acc_n ↑ | PIQA acc ↑ | Arc-E acc ↑ | Arc-C acc_n ↑ | WinoGr. acc ↑ | OBQA acc_n ↑ | Average acc ↑ |
    |---|---|---|---|---|---|---|---|---|---|---|
    | Transformer-1.5B | 10.51 | 11.1 | 50.3 | 60.6 | 73.8 | 74.0 | 40.4 | 58.7 | 29.6 | 55.4 |
    | Gated DeltaNet-1.5B | 10.51 | 10.8 | 49.9 | 60.5 | 74.3 | 73.3 | 40.4 | 61.5 | 30.4 | 55.7 |
    | Mamba-2-1.5B | 10.47 | 12.0 | 47.8 | 61.4 | 73.6 | 75.3 | 41.8 | 57.5 | 32.6 | 55.7 |
    | Mamba-3-1.5B | 10.35 | 10.9 | 49.4 | 61.9 | 73.6 | 75.9 | 42.7 | 59.4 | 32.0 | 56.4 |
    • Analysis: At every model scale, Mamba-3 consistently achieves the best average accuracy and generally the lowest perplexity. This demonstrates the effectiveness of its design changes for general language modeling quality.
  • Retrieval Capabilities (Table 2): (Manual transcription of Table 2)

    | Model (1.5B) | SWDE | SQUAD | FDA | TQA | NQ | DROP | NIAH-Single-1 (1024) | NIAH-Single-1 (2048) | NIAH-Single-1 (4096) | NIAH-Single-2 (1024) | NIAH-Single-2 (2048) | NIAH-Single-2 (4096) | NIAH-Single-3 (1024) | NIAH-Single-3 (2048) | NIAH-Single-3 (4096) |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | Transformer | 48.9 | 46.6 | 58.4 | 67.5 | 31.7 | 26.4 | 100.0 | 100.0 | 0.0 | 92.2 | 100.0 | 0.0 | 98.6 | 99.4 | 0.0 |
    | Gated DeltaNet | 32.7 | 40.0 | 28.3 | 63.5 | 25.7 | 24.5 | 100.0 | 100.0 | 93.8 | 99.8 | 100.0 | 49.8 | 83.8 | 68.4 | 34.2 |
    | Mamba-2 | 30.7 | 39.1 | 23.7 | 64.3 | 25.1 | 28.5 | 100.0 | 99.6 | 62.0 | 100.0 | 53.8 | 11.8 | 95.8 | 87.4 | 13.4 |
    | Mamba-3 | 28.5 | 40.1 | 23.4 | 64.5 | 26.5 | 27.4 | 100.0 | 100.0 | 88.2 | 100.0 | 95.4 | 50.6 | 92.4 | 81.4 | 34.2 |
    • Analysis: Mamba-3 shows strong performance on synthetic NIAH tasks, especially in out-of-distribution contexts (4096 tokens, when trained on 2048), significantly outperforming Mamba-2. However, on real-world retrieval, it (like other linear models) still lags behind the Transformer, which can perfectly recall information via its KV cache.

6.2 Inference Efficiency

  • Raw Speed (Table 3): (Manual transcription of Table 3)

    | Model | FP32, d_state=64 | FP32, d_state=128 | BF16, d_state=64 | BF16, d_state=128 |
    |---|---|---|---|---|
    | Mamba-2 | 0.295 | 0.409 | 0.127 | 0.203 |
    | Gated DeltaNet | 0.344 | 0.423 | 0.176 | 0.257 |
    | Mamba-3 (SISO) | 0.261 | 0.356 | 0.106 | 0.152 |
    | Mamba-3 (MIMO) | 0.285 | 0.392 | 0.136 | 0.185 |
    • Analysis: Despite its more complex recurrence, Mamba-3 (SISO) is faster than both Mamba-2 and Gated DeltaNet in a standard setting (bf16, d_state=128). The MIMO variant is slightly slower than SISO but still faster than the baselines, showing that the increased FLOPs do not compromise runtime significantly due to better hardware utilization.
  • Pareto Frontier (Figure 3):

    Figure 3: State size (a proxy for inference speed) versus pretraining perplexity (a proxy for performance). Mamba-3 MIMO pushes the Pareto frontier without increasing the state size, achieving the best pretraining perplexity at each relative total state size.

    • Analysis: This plot shows the trade-off between inference speed (proxied by state size) and performance (perplexity). Mamba-3 consistently achieves lower perplexity for a given state size compared to Mamba-2 and Gated DeltaNet. Mamba-3 MIMO pushes this frontier even further, achieving the best performance without increasing the state size, confirming that MIMO is an effective way to get more quality for the same inference budget.

6.3 SSM-Centric Methodological Ablations

  • Component Ablation (Table 4a): (Manual transcription of Table 4a)

    | Model Variant | ppl ↓ |
    |---|---|
    | Mamba-3 - bias - trap | 16.68 |
    | Mamba-3 - bias | 16.49 |
    | Mamba-3 | 15.72 |
    | Mamba-3 + conv | 15.85 |
    • Analysis: Removing both the trapezoidal discretization and the BC bias significantly hurts performance. Adding them back provides a large gain. The final line shows that adding back the short convolution layer on top of the full Mamba-3 model provides no benefit (and slightly hurts performance), confirming that the new discretization and bias make it redundant.
  • State-Tracking Capability (Table 4b): (Manual transcription of Table 4b)

    | Model | Parity ↑ | Arith. w/o brackets ↑ | Arith. w/ brackets ↑ |
    |---|---|---|---|
    | Mamba-3 | 100.00 | 98.51 | 87.75 |
    | Mamba-3 (w/o RoPE) | 2.27 | 1.49 | 0.72 |
    | Mamba-2 | 0.90 | 47.81 | 0.88 |
    | Gated DeltaNet [-1,1] | 100.00 | 99.25 | 93.50 |
    • Analysis: This is a crucial validation. Mamba-3 with its data-dependent RoPE (the implementation of complex states) perfectly solves the Parity task and performs very well on modular arithmetic. In contrast, Mamba-3 without RoPE fails across all three tasks, and Mamba-2 fails on parity and bracketed arithmetic. This directly shows that complexifying the SSM restores the state-tracking capability missing from simpler real-valued SSMs; an illustrative sketch of why rotations suffice for parity follows below.
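To build intuition for why rotational dynamics suffice for parity, here is a hand-constructed toy example (not a trained model): a 2-D state rotated by $\pi$ for every '1' token keeps the running parity in the sign of its first coordinate, something a purely decaying real-valued scalar state cannot represent.

```python
# Parity via rotation: rotate a 2-D state by pi for each '1' token; the sign of
# the first coordinate then encodes the running parity.  Hand-built illustration.
import numpy as np

def rot(theta):
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

bits = [1, 0, 1, 1, 0, 1]
h = np.array([1.0, 0.0])
for b in bits:
    h = rot(np.pi * b) @ h           # data-dependent angle: pi if the bit is 1
print(int(h[0] < 0), sum(bits) % 2)  # both report the parity (0 for this string)
```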

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully introduces Mamba-3, a new SSM-based sequence model designed with an "inference-first" philosophy. By incorporating three principles from classical SSM theory—trapezoidal discretization, complex-valued states, and a MIMO formulation—Mamba-3 achieves significant improvements in quality, capability, and inference efficiency. It establishes a new state-of-the-art on the performance-efficiency Pareto-frontier for sub-quadratic models, outperforming strong baselines like Mamba-2 and Gated DeltaNet.

  • Limitations & Future Work: The authors acknowledge that, like other fixed-state models, Mamba-3 still lags behind attention-based models on certain retrieval tasks that require perfect recall of long-distance information. They propose that hybrid architectures integrating explicit retrieval mechanisms with the Mamba-3 backbone could be a promising direction for future research.

  • Personal Insights & Critique:

    • The "inference-first" design philosophy is a powerful and pragmatic approach. It moves the focus from purely theoretical complexity to practical hardware performance, which is highly relevant for the deployment of large-scale AI systems.
    • The paper does an excellent job of connecting modern deep learning architectures back to classical control theory. The systematic improvements—moving from a first-order to a second-order solver, from real to complex numbers, and from SISO to MIMO—are well-motivated and demonstrate a deep understanding of the underlying principles.
    • The "RoPE trick" for implementing complex SSMs is particularly elegant. It provides a theoretically-grounded way to enhance model capability with minimal computational overhead, drawing a compelling parallel to how RoPE works in Transformers.
    • While Mamba-3 pushes the frontier for recurrent models, the retrieval limitation remains a fundamental challenge for any architecture with a fixed-size state. Overcoming this may require fundamentally new approaches beyond simply improving the recurrent block, as the authors suggest with hybrid models. The paper's strength lies in optimizing the recurrent component to its fullest potential under current paradigms.
