
Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model

Published: 01/18/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Vision Mamba (Vim) introduces an efficient visual backbone using bidirectional State Space Models and position embeddings, eliminating self-attention. It solves Vision Transformers' high-resolution computational bottlenecks, achieving superior performance with significant speed and memory gains.

Abstract

Recently the state space models (SSMs) with efficient hardware-aware designs, i.e., the Mamba deep learning model, have shown great potential for long sequence modeling. Meanwhile building efficient and generic vision backbones purely upon SSMs is an appealing direction. However, representing visual data is challenging for SSMs due to the position-sensitivity of visual data and the requirement of global context for visual understanding. In this paper, we show that the reliance on self-attention for visual representation learning is not necessary and propose a new generic vision backbone with bidirectional Mamba blocks (Vim), which marks the image sequences with position embeddings and compresses the visual representation with bidirectional state space models. On ImageNet classification, COCO object detection, and ADE20k semantic segmentation tasks, Vim achieves higher performance compared to well-established vision transformers like DeiT, while also demonstrating significantly improved computation & memory efficiency. For example, Vim is 2.8× faster than DeiT and saves 86.8% GPU memory when performing batch inference to extract features on images with a resolution of 1248×1248. The results demonstrate that Vim is capable of overcoming the computation & memory constraints on performing Transformer-style understanding for high-resolution images and it has great potential to be the next-generation backbone for vision foundation models. Code is available at https://github.com/hustvl/Vim.


In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: Vision Mamba: Efficient Visual Representation Learning with Bidirectional State Space Model
  • Authors: The author list is not fully captured in the provided text, but the paper originates from a team affiliated with HUST-VL (Huazhong University of Science and Technology Vision Lab).
  • Journal/Conference: The paper's formatting and style are typical of major computer vision or machine learning conferences like CVPR, ICCV, or ICLR. The provided text does not specify the exact venue.
  • Publication Year: The most recent citations are from 2023 and early 2024, suggesting the paper was published or released as a preprint around early 2024.
  • Abstract: The paper addresses the computational challenges of Vision Transformers (ViTs), particularly their quadratic complexity with image resolution. The authors propose Vision Mamba (Vim), a generic vision backbone built purely on State Space Models (SSMs), specifically the Mamba architecture. To adapt Mamba for vision, they introduce position embeddings for spatial awareness and a bidirectional SSM to capture global context. Experiments on ImageNet classification, COCO object detection, and ADE20k semantic segmentation show that Vim outperforms established ViTs like DeiT while being significantly more efficient in terms of speed and memory, especially for high-resolution images. The authors position Vim as a potential next-generation backbone for vision foundation models.
  • Original Source Link: /files/papers/68e3c225c83c981895f3bd33/paper.pdf. This appears to be a preprint version of the paper.

2. Executive Summary

  • Background & Motivation (Why):

    • Core Problem: Vision Transformers (ViTs) have become a dominant architecture in computer vision due to their ability to model global, data-dependent context using self-attention. However, the self-attention mechanism has a computational and memory complexity that is quadratic ($O(N^2)$) with respect to the number of image patches ($N$). This makes ViTs prohibitively expensive for high-resolution images, which are crucial for tasks like medical imaging, satellite imagery analysis, and fine-grained object detection.
    • Importance & Gaps: While many have tried to make Transformers more efficient, these solutions often introduce image-specific inductive biases (e.g., local windows in Swin Transformer) that break the pure, sequential, and modality-agnostic nature of the original ViT. Recently, Mamba, a State Space Model (SSM), has shown great promise in natural language processing by modeling long sequences with linear complexity (O(N)) while maintaining strong performance. However, Mamba is inherently unidirectional and lacks the positional awareness needed for 2D visual data.
    • Innovation: This paper explores whether the reliance on self-attention is necessary for high-performance vision models. It introduces Vision Mamba (Vim), the first generic vision backbone built purely on SSMs. Vim adapts Mamba for vision by (1) incorporating position embeddings to make the model spatially aware and (2) designing a bidirectional SSM block to capture context from all directions, mimicking the global receptive field of self-attention.
  • Main Contributions / Findings (What):

    • A Novel Vision Backbone (Vim): The paper proposes Vim, a pure SSM-based architecture that processes images as sequences of patches, retaining the modality-agnostic design of ViTs without using self-attention.
    • Superior Efficiency: Vim demonstrates linear complexity in both computation and memory with respect to sequence length. For high-resolution images (e.g., 1248×1248), Vim is 2.8× faster and uses 86.8% less GPU memory than the highly optimized DeiT model.
    • Strong Performance: Despite its efficiency, Vim achieves better performance than DeiT across multiple benchmarks: ImageNet classification, COCO object detection, and ADE20k semantic segmentation. This proves that SSMs can be as powerful as Transformers for visual representation learning.
    • Potential for Future Models: Vim's efficiency opens the door for training vision models on extremely high-resolution images or very long video sequences, making it a strong candidate for next-generation vision foundation models.

3. Prerequisite Knowledge & Related Work

  • Foundational Concepts:

    • Convolutional Neural Networks (CNNs): The traditional backbone for computer vision (e.g., ResNet). CNNs use fixed-weight filters (kernels) to scan images, capturing local patterns. Their strength is spatial locality and parameter efficiency, but they struggle to model long-range dependencies explicitly.
    • Vision Transformer (ViT): An architecture that treats an image as a sequence of patches (like words in a sentence) and processes them using a Transformer. Its core component is the self-attention mechanism, which allows every patch to interact with every other patch, creating a global, data-dependent receptive field. This is powerful but computationally expensive.
    • State Space Model (SSM): A type of model originating from control theory. In deep learning, SSMs like S4 and Mamba are used to model sequences. They maintain a hidden state ($h$) that evolves over time based on the current input ($x$), and this state is used to produce an output ($y$). They can be formulated as either a recurrent system (fast for inference) or a global convolution (fast for parallel training), and they typically have linear or near-linear complexity.
    • Mamba: A recent and highly effective SSM that introduces selection mechanisms. This means its parameters are input-dependent, allowing it to selectively focus on or ignore parts of the input sequence, similar to the attention mechanism but with linear complexity.
  • Previous Works:

    • The paper positions itself in the lineage of generic vision backbones, starting from ConvNets (ResNet, ConvNeXt) to Transformers (ViT, DeiT, Swin Transformer).
    • It acknowledges that many ViT variants like Swin Transformer improve efficiency by reintroducing CNN-like priors (local attention windows), but this sacrifices the pure sequential modeling of the original ViT.
    • It also situates itself within the growing field of SSMs, citing foundational works like S4 and its successors, which paved the way for Mamba.
    • Finally, it differentiates itself from prior vision-based SSMs, which were either used for specific tasks (e.g., video), combined with CNNs/attention in hybrid models (e.g., U-Mamba), or developed concurrently (e.g., VMamba).
  • Differentiation:

    • Pure SSM Backbone: Unlike hybrid models, Vim is built entirely on SSM blocks, making it a direct competitor to pure Transformer architectures like ViT and DeiT.
    • Sequential Modeling for Vision: Vim processes images as 1D sequences, just like the original ViT. This is in contrast to models that use 2D scanning or hierarchical structures, preserving a modality-agnostic design that is beneficial for multi-modal applications.
    • Bidirectional Context: Standard Mamba is unidirectional. Vim's key innovation is its bidirectional SSM block, which processes the sequence in both forward and backward directions to aggregate global information, a crucial requirement for image understanding where context is not sequential.

4. Methodology (Core Technology & Implementation)

The core of Vim is its adaptation of the Mamba SSM for 2D visual data.

4.1. Preliminaries of State Space Models (SSMs)

SSMs are inspired by continuous-time systems that map an input function $x(t)$ to an output function $y(t)$ through a latent state $h(t)$. The dynamics are defined by:

  1. State Equation: $h'(t) = \mathbf{A}h(t) + \mathbf{B}x(t)$

  2. Output Equation: $y(t) = \mathbf{C}h(t)$

    Here, $\mathbf{A}$ is the state transition matrix, and $\mathbf{B}$ and $\mathbf{C}$ are projection matrices.

For use in deep learning, this system is discretized with a timescale parameter $\Delta$. A common method is the zero-order hold (ZOH), which transforms the continuous parameters $(\mathbf{A}, \mathbf{B})$ into discrete parameters $(\overline{\mathbf{A}}, \overline{\mathbf{B}})$:

$$\overline{\mathbf{A}} = \exp(\Delta \mathbf{A})$$

$$\overline{\mathbf{B}} = (\Delta \mathbf{A})^{-1}(\exp(\Delta \mathbf{A}) - \mathbf{I}) \cdot \Delta \mathbf{B}$$

The discrete system then operates recurrently at each timestep $t$:

$$h_t = \overline{\mathbf{A}}h_{t-1} + \overline{\mathbf{B}}x_t$$

$$y_t = \mathbf{C}h_t$$

This recurrent formulation can also be expressed as a single global convolution over the entire input sequence $\mathbf{x}$ with a structured kernel $\overline{\mathbf{K}}$:

$$\mathbf{y} = \mathbf{x} * \overline{\mathbf{K}}, \quad \text{where } \overline{\mathbf{K}} = (\mathbf{C}\overline{\mathbf{B}}, \mathbf{C}\overline{\mathbf{A}}\overline{\mathbf{B}}, \dots, \mathbf{C}\overline{\mathbf{A}}^{M-1}\overline{\mathbf{B}})$$

  • $M$ is the sequence length. Mamba's innovation is making $\Delta$, $\mathbf{B}$, and $\mathbf{C}$ data-dependent, which makes the model more expressive but breaks the convolutional equivalence, requiring a hardware-aware parallel scan for efficient computation (a minimal numerical sketch of the discretization and scan follows below).
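
To make the discretization and recurrence above concrete, here is a minimal NumPy/SciPy sketch of the ZOH step and the recurrent scan. The function names, the toy random system, and the scalar-input shape are illustrative assumptions rather than the paper's implementation (which uses diagonal state matrices and a hardware-aware parallel scan):

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order hold: A_bar = exp(dA), B_bar = (dA)^{-1} (exp(dA) - I) dB."""
    dA = delta * A
    A_bar = expm(dA)
    B_bar = np.linalg.solve(dA, A_bar - np.eye(A.shape[0])) @ (delta * B)
    return A_bar, B_bar

def ssm_scan(A_bar, B_bar, C, x):
    """Recurrent form: h_t = A_bar h_{t-1} + B_bar x_t,  y_t = C h_t."""
    h = np.zeros((A_bar.shape[0], 1))
    ys = []
    for x_t in x:                      # x is a length-M sequence of scalar inputs
        h = A_bar @ h + B_bar * x_t
        ys.append(float(C @ h))
    return np.array(ys)

# Toy usage: a random, roughly stable 4-state system, sequence length M = 8.
rng = np.random.default_rng(0)
N, M = 4, 8
A = -np.eye(N) + 0.1 * rng.standard_normal((N, N))
B = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
A_bar, B_bar = zoh_discretize(A, B, delta=0.1)
print(ssm_scan(A_bar, B_bar, C, rng.standard_normal(M)).shape)  # (8,)
```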

4.2. Vision Mamba (Vim) Overall Architecture

The overall architecture of Vim, shown in Image 2, closely follows the ViT design but replaces Transformer blocks with Vim blocks.

Figure (Image 2): Overview of the Vision Mamba (Vim) architecture. Left: the input image is split into patches, linearly projected, and tagged with position embeddings to form patch tokens, which are then fed to the Vim encoder. Right: the encoder's internal structure, with a normalization layer and bidirectional convolution plus bidirectional state space model (SSM) branches whose forward and backward information flows are fused to produce the output features.

  1. Image to Sequence: An input image is divided into non-overlapping patches (e.g., 16x16 pixels). These patches are flattened into vectors.
  2. Patch Embedding: The flattened patch vectors are linearly projected into a D-dimensional embedding space.
  3. Positional and Class Tokens: A learnable class token ($\mathbf{t}_{cls}$) is prepended to the sequence of patch embeddings. Learnable position embeddings ($\mathbf{E}_{pos}$) are added to all tokens to provide spatial information (a minimal sketch of this tokenization step follows the list). The initial token sequence $\mathbf{T}_0$ is: $$\mathbf{T}_0 = [\mathbf{t}_{cls}; \mathbf{t}_p^1\mathbf{W}; \mathbf{t}_p^2\mathbf{W}; \dots; \mathbf{t}_p^{\mathcal{J}}\mathbf{W}] + \mathbf{E}_{pos}$$
    • $\mathbf{t}_p^j$ is the $j$-th image patch.
    • $\mathbf{W}$ is the learnable projection matrix.
  4. Vim Encoder: The token sequence is passed through a series of $L$ Vim blocks. Each block applies a residual connection: $\mathbf{T}_l = \text{Vim}(\mathbf{T}_{l-1}) + \mathbf{T}_{l-1}$.
  5. Classification Head: After the final block, the output class token is normalized and fed into a simple MLP head to produce the final prediction.
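
The patch-to-token pipeline in steps 1-3 can be sketched as follows. This is a hedged illustration, not the authors' code: the class name VimPatchEmbed, the Tiny-scale sizes (16×16 patches, embedding dimension 192), and the use of a strided Conv2d for the linear projection are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn

class VimPatchEmbed(nn.Module):
    """Patchify an image, linearly project the patches, prepend a class token,
    and add learnable position embeddings (illustrative sizes)."""
    def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=192):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        # A strided conv both flattens and linearly projects the patches (the W matrix).
        self.proj = nn.Conv2d(in_chans, embed_dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))                # t_cls
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))  # E_pos

    def forward(self, x):                       # x: (B, 3, H, W)
        x = self.proj(x)                        # (B, D, H/P, W/P)
        x = x.flatten(2).transpose(1, 2)        # (B, J, D) patch tokens
        cls = self.cls_token.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)          # prepend the class token
        return x + self.pos_embed               # T_0 = [t_cls; t_p W; ...] + E_pos

tokens = VimPatchEmbed()(torch.randn(2, 3, 224, 224))
print(tokens.shape)                             # torch.Size([2, 197, 192])
```

Note that the ablation in Section 6.2 places the class token in the middle of the sequence rather than at the head; the prepended version above simply follows the formula in this section.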

4.3. The Vim Block

The key innovation is the Vim block, which processes visual tokens bidirectionally. See Algorithm 1 and the right panel of Image 2.

For an input token sequence $\mathbf{T}_{l-1}$:

  1. The input is first normalized.
  2. Two linear projections create intermediate representations $\mathbf{x}$ and $\mathbf{z}$; $\mathbf{z}$ is used later for a gating mechanism.
  3. Bidirectional Processing: The representation $\mathbf{x}$ is processed in parallel through a forward path and a backward path. The backward path simply processes a reversed copy of the sequence.
  4. Inside each path (forward/backward):
    • A 1D convolution (Conv1d) is applied to $\mathbf{x}$.
    • The result is passed through a SiLU activation function.
    • This activated output is then used to dynamically generate the parameters for the SSM: $\Delta$, $\mathbf{B}$, and $\mathbf{C}$. This is the "selective" part of Mamba. $\mathbf{A}$ is a fixed, learnable parameter.
    • The SSM is computed using the hardware-aware parallel scan algorithm, producing an output sequence $\mathbf{y}_{forward}$ or $\mathbf{y}_{backward}$.
  5. Gating and Merging: The forward and backward outputs are modulated (gated) by the representation $\mathbf{z}$ (passed through a SiLU activation) and then added together: $$\mathbf{y}'_{forward} = \mathbf{y}_{forward} \odot \text{SiLU}(\mathbf{z}), \quad \mathbf{y}'_{backward} = \mathbf{y}_{backward} \odot \text{SiLU}(\mathbf{z})$$
  6. Final Projection: The combined output ($\mathbf{y}'_{forward} + \mathbf{y}'_{backward}$) is projected back to the original dimension $D$ and added to the input via a residual connection to produce the final output $\mathbf{T}_l$ (a structural sketch of the block follows this list).
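
Below is a structural sketch of the Vim block in PyTorch, under stated assumptions: the class names (SelectiveSSM, VimBlock), the depthwise kernel size, the expansion factor, and the simplified Euler-style discretization of B are illustrative choices, and the plain Python loop stands in for Mamba's hardware-aware parallel scan (it would be far too slow in practice).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelectiveSSM(nn.Module):
    """Stand-in for Mamba's selective scan: diagonal A, data-dependent (delta, B, C).
    A naive sequential loop replaces the hardware-aware parallel scan."""
    def __init__(self, dim, state_dim=16):
        super().__init__()
        self.A_log = nn.Parameter(torch.randn(dim, state_dim))    # fixed (learnable) A
        self.to_params = nn.Linear(dim, 1 + 2 * state_dim)        # generates delta, B, C from the input

    def forward(self, x):                                          # x: (B, M, E)
        Bsz, M, E = x.shape
        N = self.A_log.shape[1]
        delta, B, C = torch.split(self.to_params(x), [1, N, N], dim=-1)
        delta = F.softplus(delta)                                  # (B, M, 1), positive timescale
        A = -torch.exp(self.A_log)                                 # (E, N), negative diagonal = stable
        A_bar = torch.exp(delta.unsqueeze(-1) * A)                 # ZOH for A: (B, M, E, N)
        B_bar = delta.unsqueeze(-1) * B.unsqueeze(2)               # simplified (Euler) B: (B, M, 1, N)
        h = x.new_zeros(Bsz, E, N)
        ys = []
        for t in range(M):                                         # sequential scan, illustrative only
            h = A_bar[:, t] * h + B_bar[:, t] * x[:, t].unsqueeze(-1)
            ys.append((h * C[:, t].unsqueeze(1)).sum(-1))          # y_t = C h_t, per channel
        return torch.stack(ys, dim=1)                              # (B, M, E)

class VimBlock(nn.Module):
    """Norm -> (x, z) -> {forward, backward} Conv1d + SiLU + SSM
    -> gate each direction with SiLU(z) -> sum -> project -> residual."""
    def __init__(self, dim, expand=2, state_dim=16):
        super().__init__()
        inner = expand * dim
        self.norm = nn.LayerNorm(dim)
        self.in_proj = nn.Linear(dim, 2 * inner)                   # produces x and z
        self.conv_f = nn.Conv1d(inner, inner, 3, padding=1, groups=inner)
        self.conv_b = nn.Conv1d(inner, inner, 3, padding=1, groups=inner)
        self.ssm_f = SelectiveSSM(inner, state_dim)
        self.ssm_b = SelectiveSSM(inner, state_dim)
        self.out_proj = nn.Linear(inner, dim)

    def forward(self, tokens):                                     # tokens: (B, M, D)
        x, z = self.in_proj(self.norm(tokens)).chunk(2, dim=-1)

        def path(conv, ssm, seq):                                  # Conv1d -> SiLU -> selective SSM
            seq = F.silu(conv(seq.transpose(1, 2)).transpose(1, 2))
            return ssm(seq)

        y_f = path(self.conv_f, self.ssm_f, x)                     # forward direction
        y_b = path(self.conv_b, self.ssm_b, x.flip(1)).flip(1)     # backward: reversed copy, then re-flip
        gate = F.silu(z)
        y = y_f * gate + y_b * gate                                # gate each path, then merge
        return tokens + self.out_proj(y)                           # residual connection

out = VimBlock(dim=192)(torch.randn(2, 197, 192))
print(out.shape)                                                   # torch.Size([2, 197, 192])
```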

4.4. Efficiency Analysis

Vim inherits Mamba's efficiency, which stems from its linear complexity and hardware-aware implementation.

  • Computational Efficiency: Self-attention has a complexity of $O(M^2 D)$, where $M$ is the sequence length and $D$ is the embedding dimension. The SSM in Vim has a complexity of $O(MDN)$, where $N$ is the small, fixed state dimension (e.g., 16). Since $N \ll M$, the complexity is effectively linear with respect to the sequence length $M$ (see the back-of-the-envelope comparison after this list): $$\Omega(\text{self-attention}) = 4MD^2 + 2M^2D, \qquad \Omega(\text{SSM}) = 3M(2D)N + M(2D)N$$
  • Memory and IO Efficiency: The implementation avoids materializing the large intermediate state matrices of size (B, M, E, N) in GPU HBM (slow memory). Instead, it loads the inputs into the much faster SRAM, performs the computation, and writes only the final output back to HBM. This, combined with recomputation during the backward pass, drastically reduces memory usage, especially for long sequences.
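
As a sanity check on the two complexity expressions above, the following back-of-the-envelope script (with illustrative values for $D$ and $N$, not measurements from the paper) shows how quickly the self-attention cost outgrows the SSM cost as the token count $M$ increases:

```python
def attn_flops(M, D):
    """Self-attention term from Section 4.4: 4*M*D^2 + 2*M^2*D."""
    return 4 * M * D**2 + 2 * M**2 * D

def ssm_flops(M, D, N=16):
    """SSM term from Section 4.4: 3*M*(2D)*N + M*(2D)*N = 8*M*D*N."""
    return 3 * M * (2 * D) * N + M * (2 * D) * N

D = 192  # assumed Tiny-scale embedding dimension, for illustration only
for side in (224, 512, 1248):          # square input resolutions
    M = (side // 16) ** 2 + 1          # 16x16 patches plus one class token
    ratio = attn_flops(M, D) / ssm_flops(M, D)
    print(f"resolution {side}: M = {M}, attention/SSM FLOP ratio ~ {ratio:.1f}")
```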

5. Experimental Setup

  • Datasets:
    • ImageNet-1K: A large-scale dataset with 1.28 million training images across 1000 classes, used for image classification pre-training and benchmarking.
    • ADE20K: A challenging scene parsing dataset with 20k training images, used for semantic segmentation.
    • COCO 2017: A large-scale dataset for object detection and instance segmentation, containing 118k training images.
  • Evaluation Metrics:
    • Top-1 Accuracy: For ImageNet, the percentage of validation images where the model's top prediction is correct.
    • Mean Intersection over Union (mIoU): For semantic segmentation, the average IoU over all classes, measuring pixel-level prediction accuracy (a minimal computation sketch follows this list).
    • Average Precision (AP): For object detection and instance segmentation, the standard metric for evaluating the accuracy of bounding boxes (APbox) and segmentation masks (APmask).
  • Baselines:
    • The primary baseline is DeiT (Data-efficient image Transformer), a well-established and highly optimized ViT variant.
    • Other baselines include ConvNets (ResNet), the original ViT, and another SSM-based model (S4ND-ViT).
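
For reference, the mIoU metric mentioned above can be computed from predicted and ground-truth label maps roughly as follows; this is a minimal sketch, not the exact implementation of the evaluation toolkit used in the paper's experiments.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    """mIoU: average over classes of |pred ∩ gt| / |pred ∪ gt| on integer label maps."""
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:                  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return float(np.mean(ious))

pred = np.random.randint(0, 3, size=(4, 4))   # toy 4x4 label maps with 3 classes
gt = np.random.randint(0, 3, size=(4, 4))
print(mean_iou(pred, gt, num_classes=3))
```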

6. Results & Analysis

Vim consistently outperforms DeiT in both accuracy and efficiency.

Figure (Image 1): Three-panel comparison of Vim-Ti and DeiT-Ti. (a) Vim-Ti outperforms DeiT-Ti on classification, semantic segmentation, object detection, and instance segmentation. (b) Inference speed at different resolutions, where Vim-Ti is clearly faster, up to a 2.8× speedup. (c) GPU memory usage, where Vim-Ti saves substantially; at resolution 1248, DeiT-Ti fails with an out-of-memory error. Overall, Vim is superior in accuracy, speed, and memory efficiency.

6.1. Core Results

  • ImageNet Classification (Table 1):

    • Vim models surpass their DeiT counterparts at all scales (Tiny, Small, Base). For instance, Vim-S achieves 80.3% top-1 accuracy, outperforming DeiT-S (79.8%).
    • With "long sequence fine-tuning" (using a smaller patch stride to create longer sequences), performance is further boosted. Vim-S† reaches 81.4%, approaching the performance of the much larger DeiT-B (81.8%).
  • Efficiency on High-Resolution Images (Image 1, 3, 4):

    • Speed: As shown in Image 3, Vim's inference speed (FPS) scales much better than DeiT's as image resolution increases. At 1248x1248 resolution, Vim is 2.8x faster.

    • Memory: As shown in Image 4, DeiT's GPU memory usage explodes quadratically, leading to an Out-of-Memory (OOM) error at 1248x1248. In contrast, Vim's memory usage grows linearly, saving 86.8% of memory at that resolution. This is Vim's most significant practical advantage.

      Figure (Image 3): Line chart of FPS (log scale) versus image resolution (512 to 1248) for DeiT and Vim. Vim's FPS is higher at every resolution; at 1248 the chart reads about 1.70 for Vim versus 1.25 for DeiT, corresponding to roughly a 2.8× speedup. A "Faster" arrow on the left highlights Vim's advantage.

      Figure (Image 4): GPU memory (GB) versus image resolution for DeiT and Vim. As resolution increases, DeiT's memory demand rises sharply and hits an out-of-memory (OOM) error at 1248, while Vim's grows slowly, needing only 22.59 GB at 1248 (a saving of about 73.2%), reflecting Vim's better memory efficiency at high resolution.

  • Semantic Segmentation (Table 2):

    • On the ADE20K dataset using the UperNet framework, Vim again outperforms DeiT. Vim-S achieves 44.9 mIoU, surpassing DeiT-S (44.0 mIoU) and matching the much larger ResNet-101.
  • Object Detection and Instance Segmentation (Table 3):

    • On the COCO dataset, Vim-Ti achieves 45.7 APbox and 39.2 APmask, outperforming DeiT-Ti (44.4 and 38.1, respectively).
    • Crucially, for this high-resolution task (1024x1024), the DeiT baseline had to be modified with windowed attention (ViTDet), injecting a 2D prior. Vim, thanks to its efficiency, can process the full-resolution sequences directly without such modifications, demonstrating its power as a pure sequential model.
  • Qualitative Results (Image 5):

    • The visualization shows that Vim-Ti produces a more complete and accurate segmentation mask for an airplane compared to DeiT-Ti, highlighting its superior representation learning capabilities.

      Figure (Image 5): Three views of the same airplane scene: the ground-truth label (GT) on the left, DeiT-Ti's prediction in the middle, and Vim-Ti's prediction on the right. Vim-Ti covers the airplane region more completely and with higher confidence (86%), illustrating its advantage in visual representation learning.

6.2. Ablation Studies

  • Bidirectional SSM (Table 4): This study confirms the importance of the bidirectional design.

    • A unidirectional Mamba (None) performs decently on classification but poorly on segmentation, which requires global context.
    • The final design (Bidirectional SSM + Conv1d), which processes the sequence forward and backward within each block, provides the best results on both tasks, achieving a 3.6 mIoU gain on segmentation over the unidirectional baseline.
  • Classification Design (Table 5): This study explores where to place the class token.

    • Strategies like mean pooling or a head class token (like in ViT) perform well.
    • However, placing the class token in the middle of the sequence (Middle class token) yields the best result (76.1% accuracy). The authors hypothesize this allows the token to best aggregate information from both directions due to the recurrent nature of SSMs and aligns with the central object bias in ImageNet photos.

7. Conclusion & Reflections

  • Conclusion Summary: The paper successfully demonstrates that self-attention is not a prerequisite for high-performing, scalable vision backbones. The proposed Vision Mamba (Vim), a pure SSM-based architecture, matches or exceeds the performance of the strong DeiT baseline across major vision tasks. Its key strengths are its linear complexity and hardware-aware efficiency, which enable practical processing of high-resolution images where Transformers fail. By incorporating bidirectional modeling and position embeddings, Vim effectively adapts the sequential power of Mamba to the spatial nature of vision.

  • Limitations & Future Work:

    • The authors suggest several promising future directions:
    • Unsupervised Pre-training: Vim's architecture is well-suited for masked image modeling (like BEiT) or contrastive pre-training (like CLIP).
    • High-Resolution Applications: Its efficiency makes it ideal for downstream tasks that inherently involve large inputs, such as medical imaging (pathology slides), remote sensing, and long-form video analysis.
    • Multimodal Learning: The shared architectural principles with language-based Mamba models could simplify the creation of powerful vision-language models.
  • Personal Insights & Critique:

    • A Paradigm Shift?: Vim, along with concurrent works like VMamba, marks a potentially significant shift in vision architectures. For years, the field has oscillated between ConvNets and Transformers. SSMs offer a compelling third way that combines the global context modeling of Transformers with linear, rather than quadratic, scaling in sequence length.
    • Plain vs. Hierarchical: The main Vim model is a "plain" backbone with a uniform structure, similar to ViT. In the appendix, the authors also explore a hierarchical version (Hier-Vim), which shows even stronger performance compared to hierarchical models like Swin Transformer. This suggests the bidirectional SSM block is a versatile component applicable to different architectural designs.
    • Practical Implications: The dramatic reduction in memory usage is not just an academic benchmark win; it has real-world consequences. It makes fine-tuning and deploying large vision models on consumer-grade or less powerful hardware more feasible, democratizing access to state-of-the-art vision AI. Vim's efficiency could be the key to unlocking the next scale of vision foundation models.
