MambaOut: Do We Really Need Mamba for Vision?
TL;DR Summary
This paper questions Mamba's necessity in vision, hypothesizing that its core component (SSM) is crucial only for long-sequence tasks. By creating `MambaOut` (Mamba without SSM), experiments showed it surpasses visual Mamba models in image classification but underperforms in detection/segmentation, confirming that the SSM is valuable only for long-sequence visual tasks.
Abstract
Mamba, an architecture with RNN-like token mixer of state space model (SSM), was recently introduced to address the quadratic complexity of the attention mechanism and subsequently applied to vision tasks. Nevertheless, the performance of Mamba for vision is often underwhelming when compared with convolutional and attention-based models. In this paper, we delve into the essence of Mamba, and conceptually conclude that Mamba is ideally suited for tasks with long-sequence and autoregressive characteristics. For vision tasks, as image classification does not align with either characteristic, we hypothesize that Mamba is not necessary for this task; Detection and segmentation tasks are also not autoregressive, yet they adhere to the long-sequence characteristic, so we believe it is still worthwhile to explore Mamba's potential for these tasks. To empirically verify our hypotheses, we construct a series of models named MambaOut through stacking Mamba blocks while removing their core token mixer, SSM. Experimental results strongly support our hypotheses. Specifically, our MambaOut model surpasses all visual Mamba models on ImageNet image classification, indicating that Mamba is indeed unnecessary for this task. As for detection and segmentation, MambaOut cannot match the performance of state-of-the-art visual Mamba models, demonstrating the potential of Mamba for long-sequence visual tasks. The code is available at https://github.com/yuweihao/MambaOut
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: MambaOut: Do We Really Need Mamba for Vision?
- Authors:
- Weihao Yu (National University of Singapore)
- Xinchao Wang (National University of Singapore)
- Journal/Conference: The paper is available on arXiv, a preprint server. This means it has not yet undergone formal peer review for a conference or journal, but it allows for rapid dissemination of research findings.
- Publication Year: 2024 (Initial submission in May 2024).
- Abstract: The paper investigates the Mamba architecture, which uses a State Space Model (SSM) as its token mixer, for vision tasks. The authors observe that Mamba's performance in vision is often lackluster compared to established CNN and attention-based models. They conceptually argue that Mamba is best suited for tasks that are both long-sequence and autoregressive. Since image classification on ImageNet has neither characteristic, they hypothesize Mamba's core component (SSM) is unnecessary. For detection and segmentation, which are long-sequence but not autoregressive, they suggest Mamba still holds potential. To test this, they create MambaOut, a model that removes the SSM from Mamba blocks. Experiments show MambaOut surpasses visual Mamba models in ImageNet classification but falls short in detection and segmentation, thus supporting their hypotheses.
- Original Source Link:
- arXiv Link: https://arxiv.org/abs/2405.07992
- PDF Link: http://arxiv.org/pdf/2405.07992v3
- Publication Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: The Mamba architecture, originally successful in natural language processing for its linear-time complexity in handling long sequences, has been adapted for computer vision. However, these "visual Mamba" models have generally failed to outperform state-of-the-art Convolutional Neural Networks (CNNs) and Transformers.
- Importance & Gap: This performance gap raises a fundamental question: Is the core mechanism of Mamba, the State Space Model (SSM), truly suitable for standard vision tasks? Prior work focused on how to adapt Mamba for vision, but this paper asks if we should adapt it at all, or at least, for which tasks.
- Innovation: Instead of proposing a new, better visual Mamba, the paper takes a step back to perform a critical, first-principles analysis. The innovation lies in deconstructing the Mamba block, identifying its essential properties (long-sequence handling, autoregressive nature), and systematically evaluating whether these properties align with the demands of different vision tasks. The creation of MambaOut serves as an elegant ablation study to isolate and test the contribution of the SSM itself.
- Main Contributions / Findings (What):
- Conceptual Analysis: The paper provides a clear conceptual framework, arguing that Mamba's strengths are best utilized in tasks characterized by long sequences and an autoregressive nature (where output at a given step depends only on previous inputs).
- Task-Specific Hypotheses: It applies this framework to vision, leading to two key hypotheses:
- Hypothesis 1: The SSM is unnecessary for ImageNet classification, which involves short sequences and does not require autoregressive modeling.
- Hypothesis 2: The SSM may be beneficial for object detection and segmentation, which involve processing high-resolution images (long sequences) even though they are not autoregressive.
- Empirical Validation via MambaOut: The authors introduce MambaOut, a simple model architecture that is identical to a visual Mamba but with the core SSM component removed. Experiments validate the hypotheses:
- On ImageNet classification, MambaOut outperforms all existing visual Mamba models, suggesting the SSM adds no value and can even be detrimental.
- On COCO detection and ADE20K segmentation, MambaOut is outperformed by the best visual Mamba models, confirming that the SSM's long-sequence modeling capability is indeed valuable for these tasks.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Transformer: An influential deep learning architecture, originally from NLP, that relies on a mechanism called self-attention to process sequences of data (like words in a sentence or patches of an image). Its main drawback is that the computational cost of self-attention grows quadratically with the sequence length, O(L²), making it inefficient for very long sequences.
- RNN (Recurrent Neural Network): A type of neural network designed for sequential data. It processes data one step at a time, maintaining a hidden state (or "memory") that summarizes past information. This makes it efficient for long sequences (linear complexity, O(L)), but it can struggle with long-range dependencies and is harder to train in parallel compared to Transformers.
- State Space Model (SSM): A concept from control theory adapted for deep learning. It models a sequence by mapping an input signal to an output signal through a latent hidden state. Modern SSMs, like those in Mamba, are structured to be computationally efficient while capturing long-range dependencies, combining the strengths of RNNs and CNNs.
- Mamba: An architecture that uses a selective SSM as its core building block. It achieves linear-time complexity like an RNN but can be trained in parallel like a Transformer. Its "selective" nature means it can dynamically decide which information to focus on or ignore, making it powerful for modeling complex data.
- Gated CNN (Gated Convolutional Network): A convolutional architecture that uses gating mechanisms (similar to those in LSTMs/GRUs) to control the flow of information through the network. The paper highlights that the Mamba block is an extension of the Gated CNN block, with the SSM being the key addition.
- Previous Works & Technological Evolution:
- The field of computer vision backbones has evolved from CNNs (e.g., ResNet, ConvNeXt) to Transformers (e.g., ViT, Swin Transformer). The main driver has been the search for more powerful ways to model relationships between different parts of an image.
- Transformers brought global context modeling but at a high computational cost. This led to a wave of "efficient Transformers" that tried to reduce this cost.
- More recently, Mamba emerged as a promising alternative from the NLP world, offering a potential "best of both worlds" solution: the linear scaling of RNNs and the modeling power of Transformers.
- This prompted a flurry of research to apply Mamba to vision, resulting in models like Vision Mamba (Vim), VMamba, and LocalMamba. These works focused on adapting Mamba's 1D sequence processing for 2D images, often by "flattening" image patches into a sequence and scanning them in different directions.
- Differentiation:
- Unlike previous works that aimed to build better visual Mamba models, this paper asks a more fundamental question: Is Mamba even the right tool for the job?
- It stands out by performing a critical analysis rather than a constructive one. The proposed MambaOut model is not meant to be a new state-of-the-art architecture but rather an experimental tool (a baseline) to rigorously test their hypotheses about the necessity of SSMs in vision. This approach follows the principle of Occam's razor: do not use a more complex model (Mamba) if a simpler one (Gated CNN, i.e., MambaOut) suffices.
4. Methodology (Core Technology & Implementation)
The paper's methodology is divided into a conceptual discussion followed by the proposal of MambaOut for empirical verification.
- Principles: What Tasks is Mamba Suitable For? The authors argue that Mamba's core component, the SSM, is fundamentally an RNN-like mechanism. This gives it two defining characteristics that determine its ideal use case.
- Long-Sequence Processing:
- The SSM updates its hidden state based on the previous state $h_{t-1}$ and the current input $x_t$. This recurrent update has constant computational and memory cost per step, regardless of how long the sequence gets.
- In contrast, causal attention must store and access the keys and values of all previous tokens, so its memory and computation grow with the sequence length.
- Conclusion: Mamba's advantage over attention becomes significant only when dealing with long sequences, where attention's quadratic complexity becomes prohibitive. (A toy comparison of the two memory mechanisms is sketched below.)
- Figure (from the paper): a schematic contrasting causal attention with RNN-like models in terms of memory. Causal attention keeps the key-value pairs (k, v) of all previous tokens, which is lossless but makes computation grow with sequence length; RNN-like models compress history into a fixed-size hidden state, which is lossy but has a per-step cost independent of sequence length, making them well suited to long sequences.
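To make the memory contrast concrete, here is a toy sketch (not from the paper; the matrices and sizes are illustrative, and real Mamba uses input-dependent, selective SSM parameters) comparing a fixed-size recurrent state with a causal-attention cache that grows with the sequence:

```python
import torch

d_state, d_model, seq_len = 16, 64, 1000

# RNN/SSM-style memory: a fixed-size hidden state, updated once per token.
A = torch.randn(d_state, d_state) * 0.01   # illustrative state-transition matrix
B = torch.randn(d_state, d_model) * 0.01   # illustrative input projection
C = torch.randn(d_model, d_state) * 0.01   # illustrative output projection

h = torch.zeros(d_state)                   # memory size stays constant: d_state
for t in range(seq_len):
    x_t = torch.randn(d_model)
    h = A @ h + B @ x_t                    # O(1) memory per step (lossy compression)
    y_t = C @ h

# Causal-attention-style memory: keep every past token in a cache.
kv_cache = []
for t in range(seq_len):
    x_t = torch.randn(d_model)
    kv_cache.append(x_t)                   # memory grows linearly with t (lossless)
    context = torch.stack(kv_cache)        # attending costs O(t) per step
    attn = torch.softmax(context @ x_t / d_model**0.5, dim=0)
    y_t = attn @ context

print(h.numel(), len(kv_cache))            # 16 vs 1000: fixed state vs growing cache
```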
- Autoregressive (Causal) Nature:
- The recurrent nature of the SSM means the output at any step can only depend on the current and past inputs ($x_1, \ldots, x_t$). This is known as the causal mode of token mixing. This mode is essential for generative tasks like language modeling, where the next word is predicted based on preceding words.
- However, vision tasks are typically understanding tasks, where the model can see the entire image at once. The optimal approach here is the fully-visible mode, where every output token can draw information from all input tokens.
- Imposing a causal constraint on an understanding task is unnecessarily restrictive and can hurt performance, as demonstrated by the paper's experiment in Figure 3(b), where a ViT with causal attention performs worse than one with default fully-visible attention. (A mask-level sketch of the two modes is given below.)
- Conclusion: Mamba is inherently suited for tasks requiring causal token mixing.
- Figure (from the paper): (a) illustrates the two token mixing modes: in the fully-visible mode, each output token can access all input tokens (as in BERT and ViT attention); in the causal mode, each output token depends only on the current and preceding input tokens (as in GPT attention and Mamba's SSM). (b) shows that switching ViT's attention from fully-visible to causal degrades ImageNet classification accuracy, indicating that causal mixing is unnecessary for understanding tasks.
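For illustration, a minimal sketch (not the paper's code) of how the two modes differ only in the attention mask applied to the same scores:

```python
import torch

N = 6  # number of tokens

# Fully-visible mode (BERT/ViT-style): every token may attend to every token.
fully_visible_mask = torch.zeros(N, N)

# Causal mode (GPT/Mamba-style): token t may only attend to tokens <= t.
causal_mask = torch.full((N, N), float("-inf")).triu(diagonal=1)

scores = torch.randn(N, N)                                   # toy attention logits
attn_fully_visible = torch.softmax(scores + fully_visible_mask, dim=-1)
attn_causal = torch.softmax(scores + causal_mask, dim=-1)

print(attn_causal[0])          # row 0 attends only to token 0; later columns are zero
print(attn_fully_visible[0])   # row 0 spreads attention over all 6 tokens
```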
- Steps & Procedures: Analyzing Vision Tasks. The paper then analyzes standard vision benchmarks against these two characteristics.
- ImageNet Classification:
- Sequence Length: With a standard $224^2$ image and a $16^2$ patch size, the sequence length is only $14^2 = 196$ tokens. The paper's heuristic threshold (based on where attention's quadratic term begins to dominate its linear term) classifies this as a short sequence.
- Autoregressive: Classification is an understanding task. The model sees the whole image. It is not autoregressive.
- Verdict: Fails on both characteristics.
- COCO Detection & ADE20K Segmentation:
- Sequence Length: These tasks use higher-resolution inputs (e.g., around $800 \times 1280$ for COCO inference), resulting in sequence lengths of roughly 4,000 tokens. This qualifies as a long sequence. (A quick token-count check is sketched after this list.)
- Autoregressive: Like classification, these are understanding tasks and are not autoregressive.
- Verdict: Meets the long-sequence characteristic but not the autoregressive one.
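A quick back-of-the-envelope check of these sequence lengths, assuming $16 \times 16$ patches (the resolutions below are commonly used inference settings and are meant as illustrative approximations):

```python
def num_tokens(height, width, patch=16):
    """Number of patch tokens for an image of the given resolution."""
    return (height // patch) * (width // patch)

print(num_tokens(224, 224))    # ImageNet classification: 14 * 14 = 196 tokens (short)
print(num_tokens(800, 1280))   # COCO-style detection input: 50 * 80 = 4000 tokens (long)
print(num_tokens(512, 2048))   # ADE20K-style input: 32 * 128 = 4096 tokens (long)
```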
- Hypotheses: This analysis leads to the paper's two central hypotheses:
- Hypothesis 1: SSM is not necessary for image classification on ImageNet.
- Hypothesis 2: SSM may be beneficial for detection and segmentation tasks due to their long-sequence nature.
- Mathematical Formulas & Key Details: The MambaOut Model. To test the hypotheses, the authors construct MambaOut. The key insight is that a Mamba block is a Gated CNN block plus an SSM; MambaOut is simply a model built by stacking Gated CNN blocks.
- Figure (from the paper): (a) shows the structures of the Gated CNN block and the Mamba block, where the Mamba block adds a state space model (SSM) on top of the Gated CNN block; (b) compares MambaOut with various visual Mamba models on ImageNet classification in terms of accuracy, compute (MACs), and model size, showing that MambaOut, which removes the SSM, surpasses the other Mamba models in accuracy.
The meta-architecture shared by both blocks is:
$$X' = \mathrm{Norm}(X)$$
$$Y = \big( \mathrm{TokenMixer}(X'W_1) \odot \sigma(X'W_2) \big)\, W_3 + X$$
Where:
- $X \in \mathbb{R}^{N \times D}$ is the input tensor with $N$ tokens and $D$ channels.
- $\mathrm{Norm}(\cdot)$ is a normalization layer (e.g., LayerNorm).
- $W_1$, $W_2$, $W_3$ are learnable weight matrices for linear projections.
- $\sigma$ is an activation function (e.g., GELU).
- $\odot$ denotes element-wise multiplication (the gating mechanism).
- $\mathrm{TokenMixer}(\cdot)$ is the module that mixes information across tokens.
The only difference between the two blocks is the definition of the TokenMixer:
- For Gated CNN (MambaOut): $\mathrm{TokenMixer}(Z) = \mathrm{Conv}(Z)$
- For Mamba: $\mathrm{TokenMixer}(Z) = \mathrm{SSM}\big(\sigma(\mathrm{Conv}(Z))\big)$
MambaOut's architecture follows a standard hierarchical design, similar to ResNet and Swin Transformer, with four stages of decreasing spatial resolution and increasing channel depth.
Figure (from the paper): (a) shows the overall MambaOut framework for visual recognition: the input image passes through four stages of hierarchical downsampling and Gated CNN blocks, with the channel dimension growing stage by stage; (b) shows the structure of the Gated CNN block, which contains linear projections, a depthwise convolution, normalization, and a gating mechanism, differing from the Mamba block only in lacking the SSM.
The paper provides a simple PyTorch implementation of the Gated CNN block in Algorithm 1, reproduced below with the imports it needs.
```python
import torch
import torch.nn as nn
from functools import partial


class GatedCNNBlock(nn.Module):
    def __init__(self, dim, expension_ratio=8/3, kernel_size=7, conv_ratio=1.0,
                 norm_layer=partial(nn.LayerNorm, eps=1e-6), act_layer=nn.GELU,
                 drop_path=0.):
        super().__init__()
        self.norm = norm_layer(dim)
        hidden = int(expension_ratio * dim)
        # fc1 produces both the gating branch and the value branch.
        self.fc1 = nn.Linear(dim, hidden * 2)
        self.act = act_layer()
        conv_channels = int(conv_ratio * dim)
        # Split fc1's output into gate g, identity part i, and conv part c.
        self.split_indices = (hidden, hidden - conv_channels, conv_channels)
        # Depthwise convolution is the token mixer (no SSM).
        self.conv = nn.Conv2d(conv_channels, conv_channels, kernel_size=kernel_size,
                              padding=kernel_size // 2, groups=conv_channels)
        self.fc2 = nn.Linear(hidden, dim)
        # Note: drop_path is accepted for API compatibility but omitted in this
        # simplified transcription.

    def forward(self, x):  # x: [B, H, W, C] (channels-last)
        shortcut = x
        x = self.norm(x)
        g, i, c = torch.split(self.fc1(x), self.split_indices, dim=-1)
        c = c.permute(0, 3, 1, 2)  # [B, H, W, C] -> [B, C, H, W]
        c = self.conv(c)
        c = c.permute(0, 2, 3, 1)  # [B, C, H, W] -> [B, H, W, C]
        x = self.fc2(self.act(g) * torch.cat((i, c), dim=-1))
        return x + shortcut
```
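As a quick sanity check (a usage sketch, not from the paper; the channel width and input size are arbitrary), the block maps a channels-last feature map to an output of the same shape:

```python
block = GatedCNNBlock(dim=96)       # hypothetical channel width
x = torch.randn(2, 14, 14, 96)      # [B, H, W, C], channels-last as the block expects
y = block(x)
print(y.shape)                      # torch.Size([2, 14, 14, 96])
```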
5. Experimental Setup
- Datasets:
- ImageNet-1K: A large-scale dataset for image classification with ~1.3 million training images and 50,000 validation images across 1,000 object categories. It is the standard benchmark for pre-training vision models.
- COCO 2017: A benchmark for object detection and instance segmentation. It contains over 118k training images and 5k validation images with 80 object categories. It is challenging due to multiple objects per image, varying scales, and complex scenes.
- ADE20K: A scene parsing dataset for semantic segmentation, containing 20k training images and 2k validation images with 150 semantic categories (e.g., wall, sky, car).
- Evaluation Metrics:
- Image Classification (ImageNet):
- Conceptual Definition: Top-1 Accuracy measures the percentage of test images for which the model's prediction with the highest confidence score is the correct label. It is a straightforward measure of classification correctness (a minimal computation sketch follows this metric block).
- Mathematical Formula: $\text{Top-1 Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%$
- Symbol Explanation: $N_{\text{correct}}$ is the number of images whose highest-confidence prediction matches the ground-truth label; $N_{\text{total}}$ is the total number of test images.
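A minimal sketch of how Top-1 accuracy is computed from model logits (illustrative values, not the paper's evaluation code):

```python
import torch

logits = torch.tensor([[2.0, 0.5, 0.1],     # predicted class 0
                       [0.2, 0.1, 3.0],     # predicted class 2
                       [1.0, 2.5, 0.3]])    # predicted class 1
labels = torch.tensor([0, 2, 2])            # ground-truth classes

top1 = (logits.argmax(dim=-1) == labels).float().mean() * 100
print(f"Top-1 accuracy: {top1:.1f}%")       # 2 of 3 correct -> 66.7%
```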
- Object Detection & Instance Segmentation (COCO):
- Conceptual Definition: Average Precision (AP) is the primary metric. It is the area under the precision-recall curve, calculated for each class and then averaged. A higher AP indicates better performance. The paper reports box AP ($\mathrm{AP}^b$) for object detection and mask AP ($\mathrm{AP}^m$) for instance segmentation. Variants like $\mathrm{AP}_{50}$ and $\mathrm{AP}_{75}$ refer to AP calculated at a specific Intersection over Union (IoU) threshold of 0.5 and 0.75, respectively. The main AP metric is averaged over multiple IoU thresholds (from 0.5 to 0.95). A simplified computation sketch follows this metric block.
- Mathematical Formula (for a single class): $\mathrm{AP} = \sum_{k=1}^{N} P(k)\,\Delta r(k)$
- Symbol Explanation: $N$ is the total number of ranked detections, $P(k)$ is the precision over the top-$k$ detections, and $\Delta r(k)$ is the change in recall from the $(k-1)$-th to the $k$-th detection, computed after sorting predictions by confidence.
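A simplified sketch of AP for a single class at a single IoU threshold (COCO's official metric additionally uses 101-point interpolation and averages over IoU thresholds from 0.5 to 0.95; detections here are assumed to be already matched to ground truth):

```python
import numpy as np

# Detections for one class, sorted by descending confidence;
# True marks a detection matched to a ground-truth box at the chosen IoU threshold.
is_true_positive = np.array([True, True, False, True, False])
num_ground_truth = 4

tp = np.cumsum(is_true_positive)
fp = np.cumsum(~is_true_positive)
precision = tp / (tp + fp)              # P(k): precision over the top-k detections
recall = tp / num_ground_truth          # r(k): recall over the top-k detections

# AP = sum_k P(k) * (r(k) - r(k-1)), the area under the precision-recall curve.
ap = np.sum(precision * np.diff(recall, prepend=0))
print(f"AP: {ap:.3f}")
```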
- Semantic Segmentation (ADE20K):
- Conceptual Definition: mean Intersection over Union (mIoU) is the standard metric. For each class, IoU is the ratio of the area of overlap to the area of union between the predicted segmentation mask and the ground truth mask. mIoU is the average of these IoU values across all classes (a minimal computation sketch follows this metric block).
- Mathematical Formula: $\mathrm{mIoU} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c + FN_c}$
- Symbol Explanation: $C$ is the number of classes; $TP_c$, $FP_c$, and $FN_c$ are the number of true positive, false positive, and false negative pixels for class $c$, respectively.
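A minimal sketch of per-class IoU and mIoU computed from predicted and ground-truth label maps (illustrative; real evaluators typically also handle an ignore label):

```python
import numpy as np

num_classes = 3
pred = np.array([[0, 0, 1],
                 [2, 1, 1],
                 [2, 2, 0]])
gt   = np.array([[0, 1, 1],
                 [2, 1, 1],
                 [2, 2, 2]])

ious = []
for c in range(num_classes):
    tp = np.sum((pred == c) & (gt == c))   # pixels correctly labeled as class c
    fp = np.sum((pred == c) & (gt != c))   # pixels wrongly labeled as class c
    fn = np.sum((pred != c) & (gt == c))   # class-c pixels the model missed
    ious.append(tp / (tp + fp + fn) if (tp + fp + fn) > 0 else np.nan)

miou = np.nanmean(ious)
print(ious, f"mIoU: {miou:.3f}")
```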
- Baselines: The paper compares MambaOut against a comprehensive set of models:
- Visual Mamba Models: Vim, VMamba, LocalMamba, PlainMamba, EfficientVMamba. These represent the direct competitors that use SSMs.
- CNN Models: ConvNeXt, VAN, InternImage, HorNet. These are modern, high-performing convolutional models.
- Attention-based Models (Transformers): DeiT, Swin, CSWin, Focal. These are the dominant vision backbones that Mamba aims to challenge.
- Hybrid Models (Conv + Attn): CoAtNet, CAFormer, TransNeXt. These models combine the strengths of both convolutions and attention.
6. Results & Analysis
The experimental results are presented in three main tables, which I will transcribe and analyze.
- Core Results: ImageNet Classification (Table 1). This table compares MambaOut with a wide range of models on ImageNet (manual transcription of Table 1 from the paper; all models are tested at $224^2$ resolution, and '-' marks values not reported).

| Model | Token Mixing Type | Param (M) | MAC (G) | Top-1 Acc (%) |
| --- | --- | --- | --- | --- |
| VAN-B0 [28] | Conv | 4 | 0.9 | 75.4 |
| MogaNet-T [45] | Conv | 5 | 1.1 | 79.0 |
| FasterNet-T1 [7] | Conv | 8 | 0.9 | 76.2 |
| InceptionNeXt-A [93] | Conv | 4 | 0.5 | 75.3 |
| DeiT-Ti [73] | Attn | 6 | 1.3 | 72.2 |
| T2T-ViT-7 [94] | Attn | 4 | 1.1 | 71.7 |
| PVTv2-B0 [80] | Conv + Attn | 3 | 0.6 | 70.5 |
| MobileViTv3-XS [77] | Conv + Attn | 3 | 0.9 | 76.7 |
| EMO-6M [101] | Conv + Attn | 6.5 | 1.0 | 79.0 |
| Vim-Ti [104] | Conv + SSM | 7 | 1.5 | 76.1 |
| LocalVim-T [37] | Conv + SSM | 8 | 1.5 | 76.2 |
| EfficientVMamba-T [58] | Conv + SSM | 6 | 0.8 | 76.5 |
| EfficientVMamba-S [58] | Conv + SSM | 11 | 1.3 | 78.7 |
| MambaOut-Femto | Conv | 7 | 1.2 | 78.9 |
| PoolFormer-S24 [91] | Pool | 21 | 3.4 | 80.3 |
| ConvNeXt-T [52] | Conv | 29 | 4.5 | 82.1 |
| VAN-B2 [28] | Conv | 27 | 5.0 | 82.8 |
| ConvFormer-S18 [92] | Conv | 27 | 3.9 | 83.0 |
| MogaNet-S [45] | Conv | 25 | 5.0 | 83.4 |
| InternImage-T [79] | Conv | 30 | 5 | 83.5 |
| InceptionNeXt-T [93] | Conv | 28 | 4.2 | 82.3 |
| DeiT-S [73] | Attn | 22 | 4.6 | 79.8 |
| T2T-ViT-14 [94] | Attn | 22 | 4.8 | 81.5 |
| Swin-T [51] | Attn | 29 | 4.5 | 81.3 |
| Focal-Tiny [90] | Attn | 29 | 4.9 | 82.2 |
| CSWin-T [22] | Attn | 23 | 4.3 | 82.7 |
| CoAtNet-0 [16] | Conv + Attn | 25 | 4.2 | 81.6 |
| iFormer-S [70] | Conv + Attn | 20 | 4.8 | 83.4 |
| MOAT-0 [87] | Conv + Attn | 28 | 5.7 | 83.3 |
| CAFormer-S18 [92] | Conv + Attn | 26 | 4.1 | 83.6 |
| SG-Former-S [65] | Conv + Attn | 23 | 4.8 | 83.2 |
| TransNeXt-Tiny [69] | Conv + Attn | 28 | 5.7 | 84.0 |
| Vim-S [104] | Conv + SSM | 26 | 5.1 | 80.5 |
| VMamba-T [50] | Conv + SSM | 22 | 5.6 | 82.2 |
| Mamba-2D-S [44] | Conv + SSM | 24 | - | 81.7 |
| LocalVim-S [37] | Conv + SSM | 28 | 4.8 | 81.2 |
| LocalVMamba-T [37] | Conv + SSM | 26 | 5.7 | 82.7 |
| EfficientVMamba-B [58] | Conv + SSM | 33 | 4.0 | 81.8 |
| PlainMamba-L1 [88] | Conv + SSM | 7 | 3.0 | 77.9 |
| VMambaV9-T* [50] | Conv + SSM | 31 | 4.9 | 82.5 |
| ConvNeXt-S [52] | Conv | 50 | 8.7 | 83.1 |
| VAN-B3 [28] | Conv | 45 | 9.0 | 83.9 |
| ConvFormer-S36 [92] | Conv | 40 | 7.6 | 84.1 |
| InternImage-S [79] | Conv | 50 | 8 | 84.2 |
| MogaNet-B [45] | Conv | 44 | 9.9 | 84.3 |
| T2T-ViT-19 [94] | Attn | 39 | 8.5 | 81.9 |
| Swin-S [51] | Attn | 50 | 8.7 | 83.0 |
| Focal-Small [90] | Attn | 51 | 9.1 | 83.5 |
| CSWin-S [22] | Attn | 35 | 6.9 | 83.6 |
| MViTv2-S [46] | Attn | 35 | 7.0 | 83.6 |
| CoAtNet-1 [16] | Conv + Attn | 42 | 8.4 | 83.3 |
| UniFormer-B [43] | Conv + Attn | 50 | 8.3 | 83.9 |
| CAFormer-S36 [92] | Conv + Attn | 39 | 8.0 | 84.5 |
| SG-Former-M [65] | Conv + Attn | 39 | 7.5 | 84.1 |
| TransNeXt-Small [69] | Conv + Attn | 50 | 10.3 | 84.7 |
| VMamba-S [50] | Conv + SSM | 44 | 11.2 | 83.5 |
| LocalVMamba-S [37] | Conv + SSM | 50 | 11.4 | 83.7 |
| PlainMamba-L2 [88] | Conv + SSM | 25 | 8.1 | 81.6 |
| VMambaV9-S [50] | Conv + SSM | 50 | 8.7 | 83.6 |
| MambaOut-Small | Conv | 48 | 9.0 | 84.1 |
| ConvNeXt-B [52] | Conv | 89 | 15.4 | 83.8 |
| RepLKNet-31B [21] | Conv | 79 | 15.3 | 83.5 |
| ConvFormer-M36 [92] | Conv | 57 | 12.8 | 84.5 |
| HorNet-B [64] | Conv | 88 | 15.5 | 84.3 |
| MogaNet-L [45] | Conv | 83 | 15.9 | 84.7 |
| InternImage-B [79] | Conv | 97 | 16 | 84.9 |
| DeiT-B [73] | Attn | 86 | 17.5 | 81.8 |
| T2T-ViT-24 [94] | Attn | 64 | 13.8 | 82.3 |
| Swin-B [51] | Attn | 88 | 15.4 | 83.5 |
| CSwin-B [22] | Attn | 78 | 15.0 | 84.2 |
| MViTv2-B [46] | Attn | 52 | 10.2 | 84.4 |
| CoAtNet-2 [16] | Conv + Attn | 75 | 15.7 | 84.1 |
| iFormer-L [70] | Conv + Attn | 87 | 14.0 | 84.8 |
| MOAT-2 [87] | Conv + Attn | 73 | 17.2 | 84.7 |
| CAFormer-M36 [92] | Conv + Attn | 56 | 13.2 | 85.2 |
| TransNeXt-Base [69] | Conv + Attn | 90 | 18.4 | 84.8 |
| VMamba-B [50] | Conv + SSM | 75 | 18.0 | 83.7 |
| Mamba-2D-B [44] | Conv + SSM | 92 | - | 83.0 |
| PlainMamba-L3 [88] | Conv + SSM | 50 | 14.4 | 82.3 |
| VMambaV9-B [50] | Conv + SSM | 89 | 15.4 | 83.9 |