Paper status: completed

EAMamba: Efficient All-Around Vision State Space Model for Image Restoration

Published: 06/27/2025

TL;DR Summary

EAMamba integrates a Multi-Head Selective Scan Module and an all-around scanning mechanism to tackle the computational complexity and local pixel forgetting issues in image restoration. Experiments show EAMamba significantly reduces FLOPs by 31-89% while maintaining comparable performance.

Abstract

Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.


In-depth Reading


1. Bibliographic Information

1.1. Title

The central topic of this paper is EAMamba: Efficient All-Around Vision State Space Model for Image Restoration. It focuses on developing an advanced Vision Mamba framework to improve image restoration tasks by enhancing efficiency and comprehensively capturing visual information.

1.2. Authors

The authors of this paper are:

  • Yu-Cheng Lin

  • Yu-Syuan Xu

  • Hao-Wei Chen

  • Hsien-Kai Kuo

  • Chun-Yi Lee

    Their affiliations indicate a collaboration between academia and industry:

  • National Tsing Hua University

  • National Taiwan University

  • MediaTek Inc.

1.3. Journal/Conference

The paper is published as a preprint on arXiv (https://arxiv.org/abs/2506.22246), with a publication timestamp of 2025-06-27 14:12:58 UTC. As a preprint, it has not yet undergone formal peer review in a journal or conference. However, arXiv is a widely respected platform for disseminating research in fields like computer science, often serving as a precursor to formal publication in top-tier conferences (e.g., CVPR, ICCV, ECCV) or journals. The 2025 publication date indicates a very recent work.

1.4. Publication Year

2025

1.5. Abstract

Image restoration is a crucial task in low-level computer vision, aiming to reconstruct high-quality images from degraded inputs. Vision Mamba, inspired by the state space model Mamba, has shown promise in modeling long-range dependencies with linear complexity, a significant advantage for image restoration. However, Vision Mamba faces challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and a phenomenon called "local pixel forgetting."

To overcome these limitations, this study introduces Efficient All-Around Mamba (EAMamba). EAMamba incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. The MHSSM efficiently aggregates multiple scanning sequences without increasing computational complexity or parameter count. The all-around scanning strategy uses multiple patterns to capture holistic information, effectively resolving the local pixel forgetting issue.

Experimental evaluations validate these innovations across various restoration tasks, including super-resolution, denoising, deblurring, and dehazing. The results show that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.

arXiv page: https://arxiv.org/abs/2506.22246. PDF link: https://arxiv.org/pdf/2506.22246v1.pdf. This is a preprint published on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the efficient and effective reconstruction of high-quality images from degraded inputs, a task known as image restoration. This is a fundamental challenge in low-level computer vision, which deals with tasks that operate directly on pixels, such as denoising, deblurring, and super-resolution.

This problem is important because real-world images are frequently affected by various degradations (noise, blur, low-resolution, haze), and high-quality image reconstruction is vital for numerous applications, from medical imaging to autonomous driving and consumer photography.

Prior research has seen the dominance of Convolutional Neural Networks (CNNs) for image restoration. While successful, CNNs inherently struggle to capture global information due to their localized receptive fields. The emergence of Vision Transformers (ViT) addressed this by using multi-head self-attention to model relationships across all image pixels, thereby capturing global dependencies effectively. However, ViTs suffer from a major drawback: their computational complexity scales quadratically with the number of pixels, making them computationally intensive and often infeasible for high-resolution images, especially in resource-constrained environments.

More recently, Mamba, an advanced state space model (SSM), has shown promise in Natural Language Processing (NLP) for its ability to model long-range dependencies with linear computational complexity. This led to the adaptation of Mamba for vision tasks, giving rise to Vision Mamba models. These models aim to combine ViT's global information capturing capabilities with Mamba's linear computational scaling.

Despite these advancements, existing Vision Mamba models face specific challenges in low-level vision tasks:

  1. Computational Complexity with Multiple Scanning Sequences: Traditional Vision Mamba methods, like those using two-dimensional Selective Scan (2DSS), generate multiple flattened one-dimensional sequences (e.g., four sequences from horizontal and vertical scans). Each sequence requires its own selective scan with distinct parameters, leading to increased computational overhead and parameter count proportional to the number of scanning directions. This limits the scalability of incorporating more diverse scanning patterns.

  2. Local Pixel Forgetting: When a two-dimensional image is flattened into a one-dimensional sequence, spatially adjacent pixels can become distantly separated in the token sequence. This phenomenon, termed "local pixel forgetting," causes a loss of crucial local spatial relationships, which are paramount for accurate image restoration tasks. Existing scanning strategies have not adequately addressed this.

    The paper's entry point is to leverage the strengths of Vision Mamba while directly tackling these two specific limitations, proposing an efficient and comprehensive solution for image restoration.

2.2. Main Contributions / Findings

The paper introduces Efficient All-Around Mamba (EAMamba) with two primary architectural innovations to address the identified challenges:

  1. Multi-Head Selective Scan Module (MHSSM):
    • Contribution: EAMamba proposes MHSSM to efficiently process and aggregate flattened 1D sequences. Instead of performing separate selective scans on full feature channels for each direction, MHSSM employs a channel grouping strategy. This allows multiple scanning sequences to be aggregated without incurring the typical computational complexity and parameter count overhead associated with increasing the number of sequences.
    • Benefit: This design significantly improves the scalability and efficiency of Vision Mamba frameworks, enabling the integration of more complex scanning patterns.
  2. All-Around Scanning Strategy:
    • Contribution: Benefiting from MHSSM's efficiency, EAMamba introduces an all-around scanning mechanism. This strategy goes beyond conventional two-dimensional scanning by executing selective scanning across multiple directions, including horizontal, vertical, diagonal, flipped diagonal, and their respective reversed orientations.
    • Benefit: This multi-directional approach effectively captures holistic image information, directly addressing the local pixel forgetting issue by ensuring broader neighborhood coverage and strengthening spatial context understanding.
    • Insight: The paper provides insights through Effective Receptive Field (ERF) visualizations, demonstrating how all-around scanning enhances spatial dependency preservation, especially for diagonal information often missed by 2D scans.

Key Conclusions / Findings:

  • Significant Efficiency Gains: EAMamba achieves a remarkable 31-89% reduction in FLOPs (floating point operations), a measure of computational cost, compared to existing low-level Vision Mamba methods. This indicates a substantial improvement in computational efficiency.
  • Maintained Favorable Performance: Despite significant FLOPs reduction, EAMamba maintains favorable performance in terms of image quality metrics (PSNR and SSIM) across various image restoration tasks (super-resolution, denoising, deblurring, and dehazing). This demonstrates that the efficiency gains do not come at a significant cost to restoration quality.
  • Validation Across Tasks: The innovations are validated through extensive experiments on widely adopted benchmark datasets, demonstrating EAMamba's versatility and effectiveness across a range of degradation types.
  • Scalability for Vision Mamba: MHSSM provides a pathway for Vision Mamba models to incorporate more sophisticated scanning patterns without prohibitive computational costs, enhancing their applicability to complex visual data.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand EAMamba, a basic grasp of several foundational concepts in deep learning and computer vision is essential:

  • Image Restoration: This is a broad field in low-level computer vision that deals with recovering a high-quality (clean, sharp, high-resolution) image from a degraded (noisy, blurry, low-resolution, hazy) input image. Common tasks include:

    • Super-resolution (SR): Increasing the resolution of an image.
    • Denoising: Removing noise from an image.
    • Deblurring: Removing blur (e.g., motion blur, out-of-focus blur) from an image.
    • Dehazing: Removing haze or fog from an image. These tasks are often ill-posed, meaning multiple high-quality images could theoretically lead to the same degraded input, making the reconstruction challenging.
  • Convolutional Neural Networks (CNNs): These are a class of deep neural networks commonly used for analyzing visual imagery. They are characterized by convolutional layers that apply filters (kernels) to input data, pooling layers for dimensionality reduction, and often fully connected layers at the end.

    • Local Processing: CNNs excel at capturing local patterns and features due to their small, localized receptive fields (the area of the input image that a neuron "sees"). This is beneficial for tasks requiring local detail preservation but can be a limitation for capturing long-range dependencies across an entire image.
  • Vision Transformers (ViTs): Introduced as an adaptation of the Transformer architecture (originally for NLP) to computer vision.

    • Self-Attention Mechanism: The core of ViTs is the self-attention mechanism, specifically Multi-Head Self-Attention (MHSA). MHSA allows the model to weigh the importance of different parts of the input (or "tokens") relative to each other, thus capturing global dependencies across the entire image.
    • Image Patches: ViTs typically divide an image into fixed-size patches, which are then treated as "tokens" similar to words in an NLP sequence.
    • Quadratic Complexity: A major drawback of traditional MHSA is its computational complexity, which scales quadratically with the number of input tokens (i.e., image patches or pixels for high-resolution images). This makes it computationally expensive for high-resolution image processing.
  • State Space Models (SSMs) / Mamba: A class of models that aim to efficiently model long-range dependencies, inspired by control theory.

    • Selective State Spaces: Mamba introduces a selective state space mechanism. Unlike traditional SSMs or attention mechanisms, Mamba's parameters are data-dependent, allowing the model to selectively propagate or discard information based on the input. This selectivity is key to its efficiency and ability to model long-range dependencies.
    • Linear Complexity: A defining feature of Mamba is its linear computational complexity with respect to the sequence length, making it highly efficient for processing long sequences compared to quadratic-complexity models like ViTs.
    • Scanning Mechanism: Mamba processes sequences through a scanning mechanism that aggregates information sequentially. For 2D data like images, this typically involves flattening the image into 1D sequences and processing them in different directions (e.g., horizontal, vertical).
  • UNet Architecture: A widely used encoder-decoder convolutional network architecture, particularly popular for image-to-image tasks like segmentation and restoration.

    • Encoder: Gradually reduces the spatial dimensions of the input while increasing feature channels, capturing context.
    • Decoder: Gradually increases spatial dimensions while reducing feature channels, reconstructing the output.
    • Skip Connections: Crucially, UNet includes skip connections that directly transfer feature maps from the encoder to the corresponding decoder layers. These connections help preserve fine-grained details lost during the encoding (downsampling) process, which is critical for image restoration.
  • Layer Normalization (LN): A normalization technique applied across the features of an individual sample within a layer. It helps stabilize training, especially in deep networks, by normalizing the activations.

  • Sigmoid Linear Unit (SiLU): Also known as Swish, it is an activation function defined as f(x) = x \cdot \mathrm{sigmoid}(x). It is a smooth, non-monotonic function that has been shown to perform well in deep networks (a small code sketch of these building blocks follows this list).

  • Depth-wise Convolution (DWConv2D): A type of convolution that applies a single filter to each input channel independently. This significantly reduces the number of parameters and computational cost compared to standard convolutions, making models more efficient.

  • Multilayer Perceptron (MLP): A class of feedforward artificial neural networks. In deep learning architectures, an MLP typically refers to a block of fully connected layers with non-linear activation functions, used for feature transformation or projection.
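To make the last three building blocks concrete, here is a minimal PyTorch sketch (not the authors' code) showing Layer Normalization, SiLU, and a depth-wise convolution applied to a feature map; the tensor shapes and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 32, 32)              # a feature map of shape (N, C, H, W)

# LayerNorm normalizes across the channel dimension for each spatial position;
# we permute to (N, H, W, C) so nn.LayerNorm can act on the last axis.
ln = nn.LayerNorm(64)
x_ln = ln(x.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

# SiLU (Swish): f(x) = x * sigmoid(x)
silu = nn.SiLU()

# Depth-wise convolution: groups equal to the channel count, so each channel
# gets its own 3x3 filter, which keeps parameters and FLOPs low.
dwconv = nn.Conv2d(64, 64, kernel_size=3, padding=1, groups=64)

y = silu(dwconv(x_ln))
print(y.shape)                               # torch.Size([1, 64, 32, 32])
```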

3.2. Previous Works

The paper contextualizes EAMamba within the evolution of image restoration techniques, highlighting the shift from CNNs to Transformers and now to Mamba-based models.

  • CNN-based Approaches (Early Developments):

    • Examples: [1-3, 7-13, 14, 18].
    • Contribution: Demonstrated success in various image restoration benchmarks.
    • Limitations: Inherent limitations in capturing global information due to their focus on local pixel relationships.
  • Vision Transformer (ViT)-based Architectures:

    • Examples: ViT [40], SwinIR [48], Uformer [50], Restormer [51].
    • Contribution: Employed multi-head self-attention mechanisms to model relationships across all image pixels, effectively capturing global dependencies and achieving promising results.
    • Limitations: Quadratic computational complexity with respect to pixel count, rendering high-resolution image processing infeasible. To mitigate this, approaches like SwinIR and Uformer used window-based attention, and Restormer introduced multi-Dconv head transposed attention and gated-Dconv feed-forward.
  • Vision Mamba Models:

    • Inspired by the Mamba framework [53] from NLP, Vision Mamba models adapt SSMs for vision tasks, aiming for linear computational scaling and efficient long-range dependency modeling.

    • Pioneering Works [54, 55]:

      • VMamba [54]: Introduced cross-scan and cross-merge techniques to effectively aggregate spatial information from different scanning directions.
      • Vision Mamba [55]: Processed image patches in both forward and backward directions to capture spatial information more comprehensively.
      • Common Strategy: These works transform 2D feature maps into flattened 1D sequences through dual-direction scanning (e.g., horizontal and vertical scans), as illustrated in Fig. 3.
      • Challenges Identified by EAMamba:
        • Computational Overhead: As shown in Fig. 2(a), 2D Selective Scan (2DSS) often generates multiple 1D sequences (e.g., four sequences from horizontal and vertical scans). Each sequence requires its own selective scan with distinct parameters, leading to increased computational complexity and parameter count that scales linearly with the number of sequences. This makes incorporating more scanning directions inefficient.
        • Local Pixel Forgetting: When 2D spatial information is flattened into 1D sequences, spatially adjacent pixels can become distantly separated. This leads to a loss of local context, which is crucial for image restoration. Fig. 4(a) illustrates this phenomenon.
    • Vision Mamba for Image Restoration:

      • MambaIR [57]: Integrated the vision state space module and a modified MLP block to mitigate local pixel forgetting and channel redundancy issues found in vanilla Mamba.
      • VMambaIR [58]: Proposed an Omni Selective Scan (OSS) module, which conducts four-directional spatial scanning along with channel scanning to leverage both spatial and channel-wise information.

3.3. Technological Evolution

The evolution of image restoration models reflects a continuous effort to better capture global dependencies while maintaining computational efficiency.

  1. CNNs (Local Focus): Initial success relied on CNNs, leveraging their ability to extract local features. However, their limited receptive fields restricted their capacity for understanding global context, leading to suboptimal performance in tasks requiring a broader understanding of the image.
  2. ViTs (Global Focus, High Cost): Transformers brought the powerful self-attention mechanism, enabling models to attend to distant pixels and capture global context effectively. This marked a significant leap in performance but introduced a critical bottleneck: quadratic computational complexity, making them unsuitable for high-resolution image processing due to memory and time constraints. Efforts like window-based attention aimed to reduce this but often compromised some global context.
  3. Vision Mamba (Global Focus, Linear Cost): Mamba emerged as a promising alternative, offering linear computational complexity for long-range dependency modeling. This addressed the efficiency issues of ViTs while retaining the ability to capture global information. However, initial adaptations to vision (Vision Mamba, MambaIR, VMambaIR) faced new challenges: the computational burden of multiple scanning sequences and the problem of local pixel forgetting when flattening 2D images to 1D sequences.

3.4. Differentiation Analysis

EAMamba differentiates itself from previous Vision Mamba methods, particularly 2DSS-based approaches, by directly tackling their two main limitations:

  • Against 2DSS (e.g., VMamba, MambaIR's core scanning strategy):

    • Efficiency of Multiple Scans: Traditional 2DSS (Fig. 2a) generates separate 1D sequences for each scanning direction (e.g., horizontal, vertical, and their reverses), with each requiring its own selective scan parameters. This leads to computational overhead and increased parameter count proportional to the number of scanning directions.
    • EAMamba's Innovation (MHSSM): EAMamba replaces 2DSS with Multi-Head Selective Scan (MHSS) (Fig. 2b). MHSS partitions input feature channels into nn groups and performs scanning within these groups. This channel-split approach allows for the efficient aggregation of multiple scanning sequences without escalating computational complexity or parameter count. It enables the use of more diverse scanning patterns without a linear increase in overhead.
  • Against Local Pixel Forgetting (an issue in all 1D sequence flattening):

    • Problem: Previous 2D scanning strategies (e.g., horizontal, vertical) inherently separate spatially adjacent pixels when converting a 2D image into a 1D sequence (Fig. 4a), leading to a loss of critical local spatial context. While MambaIR used supplementary local convolution operations, it still faced challenges.

    • EAMamba's Innovation (All-Around Scanning): EAMamba introduces an all-around scanning strategy (Fig. 3). This strategy systematically combines multiple scanning directions (horizontal, vertical, diagonal, flipped diagonal, and their reversed orientations). By covering more directions, it ensures holistic spatial information capture, significantly mitigating local pixel forgetting. The ERF visualizations (Fig. 4b, Fig. 11) confirm that this strategy captures broader and more complete neighborhood information, especially in diagonal directions, which is crucial for image restoration.

      In essence, EAMamba proposes a more scalable and comprehensive way to leverage Vision Mamba's linear complexity for image restoration, overcoming the efficiency limitations of adding more scan patterns and the critical issue of local context loss.

4. Methodology

4.1. Principles

The core idea behind EAMamba is to enhance the Vision Mamba framework for image restoration by simultaneously improving both computational efficiency and the comprehensiveness of spatial information capture. This is achieved through two main principles:

  1. Efficient Aggregation of Multiple Scanning Sequences: To overcome the computational and parameter overhead associated with increasing the number of scanning directions in previous Vision Mamba models, EAMamba introduces a Multi-Head Selective Scan (MHSS) module. This module uses a channel-splitting strategy to process multiple scanning sequences in parallel without linearly scaling computational costs or parameters.

  2. Holistic Spatial Context Understanding: To address the local pixel forgetting issue that arises when 2D images are flattened into 1D sequences, EAMamba proposes an all-around scanning mechanism. This mechanism integrates diverse scanning patterns (horizontal, vertical, diagonal, etc.) to ensure that comprehensive neighborhood information is captured, thereby preserving crucial local spatial relationships.

    By combining these principles, EAMamba aims to create a Vision Mamba model that is both highly efficient and exceptionally effective at reconstructing high-quality images from degraded inputs.

4.2. Core Methodology In-depth (Layer by Layer)

The EAMamba framework adopts a UNet-like architecture [65], a common and effective design for image-to-image translation tasks due to its ability to capture both high-level semantic information and low-level spatial details.

The following figure (Figure 5 from the original paper) illustrates the overall EAMamba framework and the structure of its key components:

Figure 5. Diagram of the EAMamba framework and its key components, including the MambaFormer block and the MHSSM. H × W × C denotes the shape of the feature map, and the structure of the Multi-Head Selective Scan Module is shown. These modules address the local pixel forgetting issue in low-level vision tasks through an efficient all-around scanning mechanism.

4.2.1. Overview of the EAMamba Framework (Section 3.1)

The restoration process begins with a low-quality image I^{LQ} \in \mathbb{R}^{H \times W \times 3}, where H is the height, W is the width, and 3 represents the RGB color channels.

  1. Encoder Modules: EAMamba processes this input through three MambaFormer encoder modules. These modules operate at different scales, progressively reducing the spatial resolution and expanding the feature channels (e.g., H × W × C to H/2 × W/2 × 2C, then H/4 × W/4 × 4C, etc.). The encoder extracts hierarchical feature embeddings, capturing increasingly abstract representations of the input.

  2. Bottleneck Module: After the encoder, a bottleneck module processes the most abstract features at the lowest spatial resolution. This module typically consists of several MambaFormer blocks to further refine these high-level features.

  3. Decoder Modules: Following the bottleneck, three MambaFormer blocks serve as decoder modules. These modules progressively upsample the features, increasing spatial resolution while reducing feature channels, aiming to reconstruct the image details. Skip connections, characteristic of UNet-like architectures, would typically link corresponding encoder and decoder layers to transfer fine-grained spatial information, although not explicitly detailed in the text for EAMamba's UNet.

  4. Refinement Module: A final refinement module, containing additional MambaFormer blocks, processes the reconstructed features to produce a residual image I^{HQ}_r \in \mathbb{R}^{H \times W \times 3}.

  5. Final Output: The final high-quality image I^{HQ} \in \mathbb{R}^{H \times W \times 3} is obtained by element-wise addition of the residual image I^{HQ}_r and the original low-quality input I^{LQ}: I^{HQ} = I^{LQ} + I^{HQ}_r. This residual learning approach is common in image restoration, as it is often easier for a model to learn the degradation (residual) than to directly map the degraded image to the clean image (see the sketch after this overview).

    The core innovations enabling this process are the MambaFormer architecture and the Multi-Head Selective Scan Module (MHSSM), which facilitate efficient scanning and holistic spatial information capture.
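The residual formulation I^{HQ} = I^{LQ} + I^{HQ}_r can be summarized in a short sketch. The class below and the placeholder backbone are illustrative assumptions: a plain convolution stands in for the MambaFormer-based UNet (encoder, bottleneck, decoder, refinement) described above.

```python
import torch
import torch.nn as nn

class ResidualRestorer(nn.Module):
    """Minimal sketch of the residual restoration scheme: the network predicts
    a residual image that is added back onto the degraded input."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone

    def forward(self, i_lq: torch.Tensor) -> torch.Tensor:
        residual = self.backbone(i_lq)       # I^HQ_r, same shape as the input
        return i_lq + residual               # I^HQ = I^LQ + I^HQ_r

# Usage with a dummy backbone (assumption: any H x W x 3 -> H x W x 3 network fits).
model = ResidualRestorer(nn.Conv2d(3, 3, 3, padding=1))
out = model(torch.randn(1, 3, 128, 128))     # restored image, 1 x 3 x 128 x 128
```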

4.2.2. MambaFormer Block (Section 3.2)

The MambaFormer block is a fundamental building block of EAMamba, utilized throughout the encoder, decoder, bottleneck, and refinement stages. Its structure is designed to capture long-range spatial dependencies while refining features.

As depicted in Figure 5(a), each MambaFormer block comprises two main components:

  1. Multi-Head Selective Scan Module (MHSSM): This component is responsible for token mixing, which refers to the process of aggregating information across different spatial locations (tokens) to capture long-range dependencies.

  2. Channel Multilayer Perceptron (channel MLP): This component is used for feature refinement, processing features across channels to enhance their representation.

    The data flow within a MambaFormer block follows a common Transformer-like structure with Layer Normalization (LN) and residual connections:

First, the input feature X is normalized using Layer Normalization and then passed to the MHSSM. The output of the MHSSM is added back to the input X via a residual connection to produce X': X' = X + \mathbf{MHSSM}(\mathbf{LN}(X)) Where:

  • X: The input feature map to the MambaFormer block. It typically has dimensions H × W × C (Height × Width × Channels).

  • LN(·): The Layer Normalization operation [66]. It normalizes the activations across the feature dimension for each input sample, stabilizing training.

  • MHSSM(·): The Multi-Head Selective Scan Module, which captures long-range spatial dependencies and performs token mixing.

  • X': The intermediate feature map after the MHSSM and the first residual connection.

    Second, the intermediate feature map X' is again normalized using Layer Normalization and then passed to the Channel MLP. The output of the Channel MLP is added back to X' via another residual connection to produce the final output X'': X'' = X' + \mathbf{Channel~MLP}(\mathbf{LN}(X')) Where:

  • X': The input feature map to the Channel MLP branch.

  • Channel MLP(·): The Channel Multilayer Perceptron, which refines features along the channel dimension.

  • X'': The final output feature map of the MambaFormer block.

    In this architecture, MHSSM focuses on capturing long-range spatial dependencies within input features, while the channel MLP refines these features to enhance the representation. The LN layers precede each component, and residual connections help prevent vanishing gradients and aid in training deeper networks.
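A minimal PyTorch-style sketch of the MambaFormer block defined by the two equations above follows. The MHSSM is injected as a module, and the channel MLP is assumed to be a plain two-layer feed-forward network with a GELU activation; these choices and the (N, H, W, C) tensor layout are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn as nn

class MambaFormerBlock(nn.Module):
    """Sketch: X' = X + MHSSM(LN(X)), X'' = X' + ChannelMLP(LN(X')).
    Tensors are kept in (N, H, W, C) layout so LayerNorm acts on channels."""
    def __init__(self, dim: int, mhssm: nn.Module, mlp_ratio: int = 2):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mhssm = mhssm
        self.norm2 = nn.LayerNorm(dim)
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = x + self.mhssm(self.norm1(x))          # token mixing branch
        x = x + self.channel_mlp(self.norm2(x))    # channel refinement branch
        return x

# Usage with an identity stand-in for the MHSSM:
block = MambaFormerBlock(dim=64, mhssm=nn.Identity())
out = block(torch.randn(1, 32, 32, 64))            # (N, H, W, C) in and out
```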

4.2.3. Multi-Head Selective Scan Module (MHSSM) with All-Around Scanning (Section 3.3)

The Multi-Head Selective Scan Module (MHSSM) is a critical innovation within the MambaFormer block that replaces the conventional two-dimensional Selective Scan (2DSS) module, enhancing efficiency and spatial information capture.

As illustrated in Figure 5(b), the MHSSM processes an input feature X \in \mathbb{R}^{H \times W \times C} through two parallel branches, conceptually similar to a gated mechanism:

  1. Left Branch (Information Propagation Branch):

    • The input feature X first passes through a Linear layer, which expands its feature channels to λC, where λ is a pre-defined channel expansion factor (e.g., 2).
    • This expanded feature then undergoes a depth-wise convolution (DWConv2D) operation, which applies a separate filter to each input channel, maintaining channel independence while capturing local spatial patterns efficiently.
    • The output of DWConv2D is activated by a Sigmoid Linear Unit (SiLU) [67], a non-linear activation function.
    • The activated feature then enters the Multi-Head Selective Scan (MHSS) module, which is the core component for capturing long-range dependencies using the all-around scanning strategy.
    • Finally, the output of MHSS is normalized by Layer Normalization (LN) [66], producing the feature Y.
  2. Right Branch (Gating Branch):

    • The input feature X also passes through a Linear layer, expanding its channels by the same factor λ.
    • The output is then activated by a SiLU function, producing the feature Z.
  3. Combination and Projection:

    • The outputs from both branches, Y and Z, are combined through element-wise multiplication (⊗). This gating mechanism allows the model to selectively pass information, similar to how gates operate in LSTMs or Gated Linear Units.

    • A final linear projection reduces the merged output back to the original channel dimension C, producing the block's output X_out.

      The complete MHSSM process is mathematically formulated as: Y = \mathrm{LN}(\mathrm{MHSS}(\mathrm{SiLU}(\mathrm{DWConv2D}(\mathrm{Linear}(X))))), Z = \mathrm{SiLU}(\mathrm{Linear}(X)), X_{out} = \mathrm{Linear}(Y \otimes Z). Where:

  • X: The input feature map to the MHSSM, with shape \mathbb{R}^{H \times W \times C}.
  • Linear(·): A linear projection (fully connected layer) that typically maps the input channels to λC channels or back to C channels.
  • DWConv2D(·): A depth-wise convolutional layer.
  • SiLU(·): The Sigmoid Linear Unit activation function.
  • MHSS(·): The Multi-Head Selective Scan operation, detailed below.
  • LN(·): Layer Normalization.
  • Y: The output of the left branch before the final projection.
  • Z: The output of the right branch (gating signal).
  • ⊗: Element-wise multiplication.
  • X_out: The final output feature map of the MHSSM block, with shape \mathbb{R}^{H \times W \times C}.
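The two-branch gated structure above can be sketched as follows. This is a hedged reconstruction from the formulas, not the released implementation; the expansion factor, the (N, H, W, C) layout, and the `mhss` stand-in module are assumptions.

```python
import torch
import torch.nn as nn

class MHSSM(nn.Module):
    """Sketch of the MHSSM gating structure:
      Y = LN(MHSS(SiLU(DWConv2D(Linear(X)))));  Z = SiLU(Linear(X));  X_out = Linear(Y * Z)."""
    def __init__(self, dim: int, mhss: nn.Module, expand: int = 2):
        super().__init__()
        hidden = dim * expand                       # lambda * C channels
        self.in_proj = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.gate_proj = nn.Linear(dim, hidden)
        self.mhss = mhss                            # treated as a black box here
        self.norm = nn.LayerNorm(hidden)
        self.out_proj = nn.Linear(hidden, dim)
        self.act = nn.SiLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # left branch: Linear -> DWConv2D -> SiLU -> MHSS -> LN
        y = self.in_proj(x)                                          # (N, H, W, lambda*C)
        y = self.dwconv(y.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)   # depth-wise conv
        y = self.norm(self.mhss(self.act(y)))
        # right branch: Linear -> SiLU (gating signal)
        z = self.act(self.gate_proj(x))
        # gated fusion and projection back to C channels
        return self.out_proj(y * z)

# Usage with an identity stand-in for the MHSS:
mhssm = MHSSM(dim=64, mhss=nn.Identity())
out = mhssm(torch.randn(1, 32, 32, 64))             # output keeps shape (1, 32, 32, 64)
```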

4.2.3.1. Multi-Head Selective Scan (MHSS) (Section 3.3.1)

The Multi-Head Selective Scan (MHSS) is a key innovation for efficient long-range spatial information capture within the MHSSM. It deviates from traditional 2DSS by employing a multi-head approach with grouped feature processing.

The following figure (Figure 6 from the original paper) illustrates the MHSS mechanism:

Figure 6. Illustration of the Multi-Head Selective Scan (MHSS) with the proposed All-Around Scanning strategy, depicting the channel split, transform, selective scan, and inverse transform steps that efficiently aggregate multiple scanning sequences.

Here's how MHSS operates:

  1. Channel Splitting (Split): The input feature MHSS_{in} (the output of \mathrm{SiLU}(\mathrm{DWConv2D}(\mathrm{Linear}(X))) within the MHSSM) is partitioned into n groups along the channel dimension. Instead of a single scan operating on all C' channels (where C' = λC), the scan operates on n smaller groups of channels, each of size C'/n. This is the "Multi-Head" aspect, as each group can be thought of as a "head."

  2. Transformation (Transform): Each of these n groups undergoes a transformation function. This Transform step is where the all-around scanning strategy (detailed next) is applied. It converts the 2D feature map within each group into one or more flattened 1D sequences, denoted as SS_{in}^i for group i.

  3. Selective Scanning (SelectiveScan): Each flattened 1D sequence SS_{in}^i is then processed independently by a selective scanning mechanism. This is the core Mamba operation that models long-range dependencies with data-dependent parameters, producing a corresponding output SS_{out}^i for each group.

  4. Inverse Transformation (InverseTransform): The 1D outputs SS_{out}^i from the selective scan are then reshaped back into 2D feature maps using an InverseTransform function, reversing the flattening operation.

  5. Concatenation (Concat): Finally, the 2D outputs from all n groups are concatenated along the channel dimension to form the comprehensive output feature MHSS_{out}.

    The MHSS operations are formulated as: SS_{in}^{N} = \mathrm{Transform}(\mathrm{Split}(MHSS_{in})), SS_{out}^{N} = \mathrm{SelectiveScan}(SS_{in}^{N}), MHSS_{out} = \mathrm{Concat}(\mathrm{InverseTransform}(SS_{out}^{N})). Where:

  • MHSS_{in}: The input feature map to the MHSS module.
  • n: The number of groups (heads) the input channels are split into.
  • Split(·): Operation that partitions MHSS_{in} into n channel groups.
  • Transform(·): The function that converts each 2D channel group into a flattened 1D sequence (or multiple sequences) for scanning. This is where the all-around scanning strategy is implemented.
  • SS_{in}^{N}: The collection of flattened 1D input sequences for all groups N = {1, 2, ..., n}.
  • SelectiveScan(·): The core Mamba selective scanning operation applied to each 1D sequence.
  • SS_{out}^{N}: The collection of flattened 1D output sequences from the selective scan for all groups.
  • InverseTransform(·): The function that reshapes the 1D output sequences back into 2D feature maps.
  • Concat(·): Operation that concatenates the 2D outputs from all groups back along the channel dimension.
  • MHSS_{out}: The final output feature map of the MHSS module.

Key Advantage of MHSS: Unlike 2DSS, which would typically duplicate parameters and increase computational load for each additional scanning direction, MHSS maintains computational complexity comparable to a standard selective scan while processing multiple sequences. By splitting channels, it processes different scanning directions or patterns on different channel groups in parallel, then aggregates their results. This avoids the linear increase in computational and parameter overhead traditionally associated with incorporating more scanning directions, making it highly efficient.
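A sketch of this channel-split idea is below. It treats the selective scan as an opaque sequence-to-sequence operator and leaves the direction-specific flattening to the `transforms`/`inverses` callables (concrete examples of those appear in the next subsection). Whether scan parameters are shared across groups is an implementation detail of the paper; reusing one operator here is an assumption.

```python
import torch
import torch.nn as nn

class MHSS(nn.Module):
    """Sketch of the Multi-Head Selective Scan: split channels into n groups,
    flatten each group along its own scanning pattern, run the selective scan,
    reshape back, and concatenate the groups along the channel dimension."""
    def __init__(self, selective_scan: nn.Module, transforms, inverses):
        super().__init__()
        assert len(transforms) == len(inverses)
        self.scan = selective_scan          # stand-in for the Mamba selective scan
        self.transforms = transforms        # each: (N, H, W, c) -> (N, L, c)
        self.inverses = inverses            # each: (N, L, c), (H, W) -> (N, H, W, c)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        _, h, w, _ = x.shape
        groups = torch.chunk(x, len(self.transforms), dim=-1)   # Split
        outs = []
        for g, fwd, inv in zip(groups, self.transforms, self.inverses):
            seq = fwd(g)                     # Transform: flatten to a 1D sequence
            seq = self.scan(seq)             # SelectiveScan on the sequence
            outs.append(inv(seq, (h, w)))    # InverseTransform back to 2D
        return torch.cat(outs, dim=-1)       # Concat along channels

# Usage with one trivial "head": identity scan and row-major flattening.
mhss = MHSS(nn.Identity(),
            transforms=[lambda t: t.reshape(t.shape[0], -1, t.shape[-1])],
            inverses=[lambda s, hw: s.reshape(s.shape[0], *hw, s.shape[-1])])
y = mhss(torch.randn(1, 8, 8, 16))           # shape preserved: (1, 8, 8, 16)
```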

4.2.3.2. All-Around Scanning (Section 3.3.2)

The all-around scanning strategy is the specific transformation function implemented within the MHSS. It is designed to overcome the local pixel forgetting issue and enable holistic spatial dependency understanding.

The following figure (Figure 3 from the original paper) illustrates the concept of all-around scanning:

Figure 3. Illustration of an all-around scanning approach that combines two-dimensional scanning and diagonal scanning. The left portion depicts the two-dimensional scan and the right portion the diagonal scan, which together aim to capture holistic image information.

And Figure 4 further highlights the issue of local pixel forgetting and the benefit of all-around scanning:

Figure 4. (a) Illustration of the local pixel forgetting phenomenon, where spatially adjacent pixels become distantly separated in the one-dimensional token sequence during scanning. The target pixel (highlighted in a red square) and its adjacent pixels demonstrate how different scanning patterns affect spatial relationships. (b) The ERF visualization results averaged across the SIDD dataset [59], which depict improved spatial dependency preservation with the proposed all-around scanning approach.

Here's how it works:

  1. Multi-directional Scanning: Instead of relying solely on traditional horizontal and vertical scans (which constitute two-dimensional scanning), the all-around strategy executes selective scanning across a wider range of directions. These include:

    • Horizontal: Left-to-right, Right-to-left.
    • Vertical: Top-to-bottom, Bottom-to-top.
    • Diagonal: Top-left to Bottom-right, Bottom-right to Top-left.
    • Flipped Diagonal: Top-right to Bottom-left, Bottom-left to Top-right.
    • And their respective reversed orientations. This diverse set of scanning patterns ensures that information from all cardinal and intercardinal directions relative to a pixel is considered.
  2. Addressing Local Pixel Forgetting: The core problem is that flattening a 2D image into a 1D sequence can separate spatially adjacent pixels (Fig. 4a). For example, a pixel's diagonal neighbors might be far apart in a purely horizontal or vertical scan sequence. By incorporating diagonal and flipped diagonal scans, the all-around strategy ensures that these crucial neighboring pixels remain "close" in at least one of the scanning sequences, thus preserving their relationship.

  3. Holistic Information Capture: The combination of these multiple scans provides more complete neighborhood coverage. This strengthens spatial context understanding, especially for Effective Receptive Field (ERF) expansion (Fig. 4b). The paper notes that 2D scanning (as in MambaIR) struggles to capture information from diagonal directions, even with local convolution. All-around scanning explicitly addresses this gap, leading to a larger and more isotropic receptive field that is critical for detailed image restoration tasks.

    Synergy with MHSS: The efficiency of MHSS is crucial here. Without it, incorporating such a comprehensive set of scanning directions would lead to prohibitive computational overhead and parameter counts. MHSS allows EAMamba to implement this all-around scanning strategy without excessive computational cost, making the holistic capture of visual information feasible for image restoration.
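One way to realize the extra scanning directions is as index permutations over the flattened pixel grid, as in the sketch below. The exact traversal order used by the paper is not specified here, so these orderings are assumptions that merely illustrate horizontal, vertical, diagonal, and flipped-diagonal scans and show that the flattening is exactly invertible (the InverseTransform step).

```python
import torch

def diagonal_order(h: int, w: int) -> torch.Tensor:
    """Permutation of 0..h*w-1 visiting pixels diagonal by diagonal
    (top-left toward bottom-right). Flipping the result gives the reversed scan."""
    idx = torch.arange(h * w).reshape(h, w)
    order = [idx[i, d - i] for d in range(h + w - 1)
             for i in range(max(0, d - w + 1), min(h, d + 1))]
    return torch.stack(order)

def flatten_with_order(x: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Turn a (N, H, W, C) feature map into a (N, L, C) sequence following `order`."""
    n, h, w, c = x.shape
    return x.reshape(n, h * w, c)[:, order, :]

def unflatten_with_order(seq: torch.Tensor, order: torch.Tensor, h: int, w: int):
    """Invert flatten_with_order, scattering the sequence back to (N, H, W, C)."""
    out = torch.empty_like(seq)
    out[:, order, :] = seq
    return out.reshape(seq.shape[0], h, w, seq.shape[-1])

h, w = 4, 4
horizontal   = torch.arange(h * w)                                  # row-major scan
vertical     = torch.arange(h * w).reshape(h, w).t().reshape(-1)    # column-major scan
diagonal     = diagonal_order(h, w)                                 # diagonal scan
flipped_diag = torch.arange(h * w).reshape(h, w).flip(1).reshape(-1)[diagonal]
reversed_scans = [torch.flip(o, dims=[0])                           # reversed orientations
                  for o in (horizontal, vertical, diagonal, flipped_diag)]

# Round trip: flattening along any direction is exactly invertible.
x = torch.randn(2, h, w, 8)
seq = flatten_with_order(x, diagonal)
assert torch.equal(unflatten_with_order(seq, diagonal, h, w), x)
```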

5. Experimental Setup

5.1. Datasets

EAMamba's performance was evaluated across a range of image restoration tasks using several benchmark datasets.

  • Image Denoising:

    • Training Data (for Synthetic Gaussian Color Denoising):
      • DIV2K [82]: A high-quality image dataset commonly used for various restoration tasks.
      • Flickr2K [83]: Another large-scale dataset of diverse images.
      • WED (Waterloo Exploration Database) [84]: Provides a rich set of natural images.
      • BSD (Berkeley Segmentation Dataset) [71]: Contains images widely used for segmentation and image processing research.
      • Usage: A single EAMamba model was trained on noise levels σ ∈ [0, 50] using a combination of these datasets.
    • Evaluation Data (for Synthetic Gaussian Color Denoising):
      • CBSD68 [71]: Contains 68 images, often used for benchmarking denoising algorithms.
      • Kodak24 [73]: Comprises 24 high-quality images.
      • McMaster [72]: Another standard dataset for image quality assessment and restoration benchmarks.
      • Usage: Evaluated across multiple noise levels σ = 15, 25, and 50.
    • Evaluation Data (for Real-world Denoising):
      • SIDD (Smartphone Image Denoising Dataset) [59]: A high-quality denoising dataset specifically collected from smartphone cameras, capturing realistic noise patterns.
      • Usage: EAMamba was trained and evaluated on this dataset for real-world denoising scenarios. A sample from the SIDD dataset would typically be a pair of images: a noisy photograph taken by a smartphone and its corresponding clean (ground truth) version.
  • Image Super-Resolution:

    • Evaluation Data:
      • RealSR [61]: A benchmark dataset specifically designed for real-world single image super-resolution, containing image pairs of low-resolution and high-resolution images captured by real cameras.
      • Usage: Evaluated for scaling factors ×2, ×3, and ×4.
  • Image Deblurring:

    • Training Data:
      • GoPro [62]: A dataset containing sharp and blurry image pairs, specifically designed for motion deblurring, with videos of real-world scenes.
    • Evaluation Data:
      • GoPro [62]: Used for evaluation.
      • HIDE (Human-aware motion deblurring) [86]: Another benchmark dataset for motion deblurring.
  • Image Dehazing:

    • Training & Evaluation Data:
      • RESIDE (REalistic Single Image DEhazing) [63]: A large-scale benchmark dataset for image dehazing, including synthetic hazy images and corresponding ground truth clear images. Subsets like SOTS (Synthetic Objective Testing Set) are commonly used.
      • Usage: Trained and evaluated on this dataset. A sample would be a hazy image (e.g., a city street obscured by fog) and its clear counterpart.
  • Ablation Studies:

    • Urban100 [87]: A dataset of 100 images commonly used for super-resolution and other image enhancement tasks, serving as a robust test set for ablation studies on Gaussian color image denoising.

      These datasets were chosen because they are widely recognized benchmarks in their respective fields, ensuring a fair and comparable evaluation against state-of-the-art methods. They represent diverse degradation types and real-world scenarios, making them effective for validating the proposed method's performance.

5.2. Evaluation Metrics

The performance of EAMamba is primarily evaluated using two widely accepted objective image quality metrics: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), along with Floating Point Operations (FLOPs) for computational efficiency.

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed in terms of the logarithmic decibel scale. A higher PSNR value indicates a higher quality image, meaning less distortion or noise relative to the original. It is often used as a quality measurement for reconstruction of lossy compression codecs or for evaluating restoration algorithms.
  • Mathematical Formula: The PSNR is defined as: $ \mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) $ where \mathrm{MSE} is the Mean Squared Error between the original (ground truth) image and the reconstructed image: $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
  • Symbol Explanation:
    • \mathrm{PSNR}: Peak Signal-to-Noise Ratio, measured in decibels (dB).
    • \log_{10}: Base-10 logarithm.
    • \mathrm{MAX}_I: The maximum possible pixel value of the image. For an 8-bit image this is 255; for images with pixel values normalized to [0, 1], it is 1. The paper evaluates on the RGB channels or the Y channel, implying values in 0-255 or 0-1.
    • \mathrm{MSE}: Mean Squared Error.
    • m: Number of rows (height) in the image.
    • n: Number of columns (width) in the image.
    • I(i,j): The pixel value at coordinates (i,j) in the original (ground truth) image.
    • K(i,j): The pixel value at coordinates (i,j) in the reconstructed (restored) image.
    • [I(i,j) - K(i,j)]^2: The squared difference between the corresponding pixel values.
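A small NumPy sketch of the PSNR computation defined above; the sample images are synthetic and only for illustration.

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """Compute PSNR in dB. `max_val` is 255 for 8-bit images, or 1.0 for
    images with pixel values normalized to [0, 1]."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: a synthetic noisy 8-bit image versus its ground truth.
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
noisy = np.clip(gt + np.random.normal(0, 15, gt.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(gt, noisy):.2f} dB")
```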

5.2.2. Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is a perceptual metric that quantifies the perceived degradation in image quality caused by processing. Unlike PSNR, which focuses on absolute errors, SSIM attempts to measure the structural similarity between two images, mimicking the human visual system's perception. It considers three key factors: luminance, contrast, and structure. An SSIM value closer to 1 indicates higher similarity (better quality).
  • Mathematical Formula: The SSIM between two windows x and y of common size N × N is defined as: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  • Symbol Explanation:
    • \mathrm{SSIM}(x, y): Structural Similarity Index Measure between image patches x and y.
    • \mu_x: The average (mean) of pixel values in window x.
    • \mu_y: The average (mean) of pixel values in window y.
    • \sigma_x: The standard deviation of pixel values in window x (a measure of contrast).
    • \sigma_y: The standard deviation of pixel values in window y (a measure of contrast).
    • \sigma_{xy}: The covariance between pixel values in windows x and y (a measure of structural correlation).
    • c_1 = (K_1 L)^2 and c_2 = (K_2 L)^2: Two constants included to avoid division by zero when the denominators are very small.
      • L: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
      • K_1, K_2: Small constant values, typically K_1 = 0.01 and K_2 = 0.03. For a full image, SSIM is often calculated over many local windows and then averaged to produce a single value.
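For illustration, the sketch below evaluates the SSIM expression using global image statistics. Standard benchmark scores use local (typically 11×11 Gaussian) windows and average the per-window values, so this simplified global version is an assumption meant only to show the formula.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, dynamic_range: float = 255.0) -> float:
    """Simplified SSIM computed from global statistics of two grayscale images."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (0.01 * dynamic_range) ** 2
    c2 = (0.03 * dynamic_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2) /
            ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)))

identical = np.random.randint(0, 256, (64, 64))
print(ssim_global(identical, identical))       # 1.0 for identical images
```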

5.2.3. Floating Point Operations (FLOPs)

  • Conceptual Definition: FLOPs (Floating Point Operations) is a common metric used to estimate the computational complexity of a model. It counts the number of floating-point arithmetic operations (additions, multiplications, divisions, etc.) required to process a single input. A lower FLOPs count indicates a more computationally efficient model, requiring less processing power and time.
  • Measurement Context: The paper calculates FLOPs using fvcore [70] for all Vision Mamba methods, at a resolution of 256 × 256 for all experiments, ensuring a fair comparison across models.
  • Significance: This metric is crucial for evaluating the practicality of deep learning models, especially for deployment on edge devices or in applications with strict latency requirements.
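A minimal sketch of how such a measurement can be made with fvcore's FlopCountAnalysis. The single-convolution model is a placeholder assumption; plugging in a restoration network yields its operation count for one 256 × 256 input.

```python
import torch
from fvcore.nn import FlopCountAnalysis

model = torch.nn.Conv2d(3, 64, kernel_size=3, padding=1)   # placeholder network
dummy_input = torch.randn(1, 3, 256, 256)                  # one RGB image at 256 x 256
flops = FlopCountAnalysis(model, dummy_input)
print(f"{flops.total() / 1e9:.2f} GFLOPs")                 # total count in giga-ops
```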

5.3. Baselines

EAMamba is compared against a comprehensive set of baseline models, including traditional CNN-based methods, Transformer-based methods, and other Vision Mamba variants.

  • General Image Restoration (various tasks):

    • SwinIR [48] (ViT-based)
    • Restormer [51] (ViT-based)
    • UFormer-B [50] (ViT-based)
    • MPRNet [36]
    • HINet [37]
    • IPT [49] (ViT-based)
    • MAXIM-3S [85]
  • Image Denoising Specific:

    • IRCNN [10]
    • FFDNet [11]
    • DnCNN [9]
    • BRDNet* [12]
    • DRUNet [13]
    • BM3D [74]
    • CBDNet* [8]
    • RIDNet* [75]
    • VDN [76]
    • SADNet* [77]
    • DANet* [78]
    • CycleISP* [79]
    • DeamNet* [80]
    • DAGL [81]
  • Vision Mamba Baselines (primary comparison for efficiency):

    • MambaIR* [57] (Vision Mamba-based)
    • MambaIR-UNet [57] (Vision Mamba-based, UNet variant)
    • VMambaIR [58] (Vision Mamba-based)
  • Image Deblurring Specific:

    • DeblurGAN-v2 [16]
    • SRN [15]
    • DBGAN [17]
    • DMPHN [18]
    • SPAIR [38]
    • MIMO-UNet+ [19]
    • Stripformer [43] (ViT-based)
    • SFNet [39]
  • Image Dehazing Specific:

    • Dehamer (from table image, likely a Transformer-based model)

    • MAXIM-2S (from table image, related to MAXIM)

    • DehazeFormer-L (from table image, likely a Transformer-based model)

      These baselines are representative because they cover a wide spectrum of approaches, from classical CNNs to cutting-edge Transformers and the latest Vision Mamba models. Comparing against them allows EAMamba to demonstrate its advancements in both performance and, crucially, computational efficiency within the rapidly evolving field of image restoration. Methods marked with * indicate those that train separate models for each noise level or use additional training data, providing context for comparison.

5.4. Training Details

The paper outlines a robust training protocol to ensure fair and effective evaluation of EAMamba.

  • Training Iterations: The models are trained for a total of 450,000 iterations.
  • Learning Rate Schedule: An initial learning rate of 3 × 10^{-4} is used and gradually decreased to 1 × 10^{-6} using a cosine annealing schedule. Cosine annealing is a technique where the learning rate is scheduled to decrease following a cosine curve, often leading to better convergence and performance (see the configuration sketch after this list).
  • Optimizer: AdamW [69] is employed as the optimizer. AdamW is an Adam variant that decouples weight decay from the optimization step, which often improves regularization and performance.
    • Parameters: β1 = 0.9, β2 = 0.999 (default values for Adam-like optimizers).
    • Weight decay: 1 × 10^{-4}.
  • Loss Function: L1 loss (Mean Absolute Error) is used. L1 loss is less sensitive to outliers compared to L2 (Mean Squared Error) and often promotes sharper images in image restoration tasks.
  • Progressive Training Strategy: Following the approach in [51] (Restormer), a progressive training strategy is adopted. This means training starts with smaller image patches and larger batch sizes, then gradually increases patch sizes and decreases batch sizes. This strategy helps the model learn fine details from smaller patches efficiently and then generalize to larger contexts.
    • Initial: Patches of 128 × 128 pixels with a batch size of 64.
    • Progression: These parameters are progressively adjusted at specific iteration milestones:
      • [160, 40] at 138K iterations
      • [192, 32] at 234K iterations
      • [256, 16] at 306K iterations
      • [320, 8] at 360K iterations
      • [384, 8] at 414K iterations. The format [patch_size, batch_size] indicates the dimensions of the training patches and the number of samples processed in each batch.
  • Data Augmentation: To improve model generalization and prevent overfitting, standard data augmentation techniques are applied:
    • Random horizontal flipping
    • Random vertical flipping
    • 90° rotation
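The optimization recipe above maps to a short PyTorch configuration sketch. The model, the dummy batches, and the tiny loop length are placeholders; the progressive patch/batch schedule and the augmentations would be handled by the data pipeline.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

model = torch.nn.Conv2d(3, 3, 3, padding=1)          # placeholder restoration network
optimizer = AdamW(model.parameters(), lr=3e-4,
                  betas=(0.9, 0.999), weight_decay=1e-4)
total_iterations = 450_000                            # paper's training length
scheduler = CosineAnnealingLR(optimizer, T_max=total_iterations, eta_min=1e-6)
criterion = torch.nn.L1Loss()                         # L1 (mean absolute error) loss

for step in range(3):                                 # a few illustrative steps
    lq = torch.randn(8, 3, 128, 128)                  # stand-in for degraded patches
    hq = torch.randn(8, 3, 128, 128)                  # stand-in for ground-truth patches
    loss = criterion(model(lq), hq)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()                                  # cosine decay toward 1e-6
```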

5.5. Architecture details

The EAMamba framework utilizes a four-level UNet architecture [65]. This architecture is commonly used in image-to-image tasks due to its effectiveness in capturing multi-scale features and local details via skip connections.

  • MambaFormer Blocks: The number of MambaFormer blocks (the core building blocks combining MHSSM and Channel MLP) varies at different levels of the UNet:
    • Encoder levels: [4, 6, 6, 7] MambaFormer blocks at the respective levels. This means 4 blocks in the first encoder stage, 6 in the second, and so on.
    • Refinement stage: Incorporates two MambaFormer blocks.
  • Channel Dimension (C): The base channel dimension C is maintained at a constant value of 64 throughout the network. This implies that feature maps at deeper encoder/decoder levels would typically have channel dimensions that are multiples of C (e.g., 2C, 4C).
  • Channel MLP: A simple feed-forward network (FFN) [68] is utilized as the default channel MLP within the MambaFormer blocks. This choice is later justified by ablation studies as providing an optimal balance between performance and computational efficiency.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results validate EAMamba's effectiveness and efficiency across various image restoration tasks.

The following figure (Figure 1 from the original paper) presents a high-level overview of computational efficiency versus image quality:

Figure 1. Computational efficiency versus image quality across model architectures. EAMamba (denoted by a red star) demonstrates superior efficiency compared to other Vision Mamba-based methods and existing approaches, establishing a new efficiency frontier for Vision Mamba-based image restoration.

Analysis of Figure 1: Figure 1 graphically illustrates the primary claim of the paper: EAMamba achieves a new efficiency frontier for Vision Mamba-based image restoration. The x-axis represents Computational Efficiency (likely inverse FLOPs or similar, implying higher values mean more efficient), and the y-axis represents Image Quality (e.g., PSNR, higher is better).

  • Other Vision Mamba-based methods: These points typically fall below EAMamba, indicating either lower quality for similar efficiency or lower efficiency for similar quality.
  • Existing approaches: These represent non-Mamba methods (e.g., CNNs, Transformers). They generally spread across the graph, but EAMamba consistently positions itself in the upper-right region, signifying a superior balance of high quality and high efficiency.
  • EAMamba (red star): EAMamba's points are consistently high on the quality axis and far to the right on the efficiency axis, clearly demonstrating its advantage. This figure serves as a compelling visual summary of the paper's main contribution.

6.1.1. Image Denoising (Section 4.2)

EAMamba is evaluated on both synthetic Gaussian color denoising and real-world denoising.

The following are the results from Table 1 of the original paper:

| Method | Param. (M) ↓ | FLOPs (G) ↓ | CBSD68 [71] σ=15 | CBSD68 σ=25 | CBSD68 σ=50 | Kodak24 [73] σ=15 | Kodak24 σ=25 | Kodak24 σ=50 | McMaster [72] σ=15 | McMaster σ=25 | McMaster σ=50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IRCNN [10] | - | - | 33.86 | 31.16 | 27.86 | 34.69 | 32.18 | 28.93 | 34.58 | 32.18 | 28.91 |
| FFDNet [11] | - | - | 33.87 | 31.21 | 27.96 | 34.63 | 32.13 | 28.98 | 34.66 | 32.35 | 29.18 |
| DnCNN [9] | - | - | 33.90 | 31.24 | 27.95 | 34.60 | 32.14 | 28.95 | 33.45 | 31.52 | 28.62 |
| BRDNet* [12] | - | - | 34.10 | 31.43 | 28.16 | 34.88 | 32.41 | 29.22 | 35.08 | 32.75 | 29.52 |
| DRUNet [13] | 32.6 | 144 | 34.30 | 31.69 | 28.51 | 35.31 | 32.89 | 29.86 | 35.40 | 33.14 | 30.08 |
| SwinIR* [48] | 11.5 | 788 | 34.42 | 31.78 | 28.56 | 35.34 | 32.89 | 29.79 | 35.61 | 33.20 | 30.22 |
| Restormer [51] | 26.1 | 141 | 34.39 | 31.78 | 28.59 | 35.44 | 33.02 | 30.00 | 35.55 | 33.31 | 30.29 |
| MambaIR* [57] | 15.8 | 1290 | 34.43 | 31.80 | 28.61 | 35.34 | 32.91 | 29.85 | 35.62 | 33.35 | 30.31 |
| EAMamba (Ours) | 25.3 | 137 | 34.43 | 31.81 | 28.62 | 35.36 | 32.95 | 29.91 | 35.59 | 33.34 | 30.31 |

Analysis of Table 1 (Synthetic Gaussian Denoising):

  • Computational Efficiency: EAMamba (137 GFLOPs) is significantly more efficient than its Vision Mamba counterpart, MambaIR* (1290 GFLOPs). EAMamba uses approximately 11% of the FLOPs of MambaIR*. This demonstrates a massive reduction in computational cost, directly validating the efficiency claims of MHSSM. It is also more efficient than SwinIR* (788 GFLOPs) and comparable to Restormer (141 GFLOPs) and DRUNet (144 GFLOPs), while generally outperforming them in quality.

  • Image Quality (PSNR): EAMamba achieves competitive or slightly better PSNR values across all datasets (CBSD68, Kodak24, McMaster) and noise levels (σ=15,25,50\sigma = 15, 25, 50) compared to MambaIR*. For instance, on CBSD68, EAMamba matches MambaIR* at σ=15\sigma=15 (34.43 dB) and slightly surpasses it at σ=25\sigma=25 (31.81 vs 31.80 dB) and σ=50\sigma=50 (28.62 vs 28.61 dB). This shows that EAMamba maintains favorable performance despite the drastic reduction in FLOPs.

  • Parameters: EAMamba has 25.3M parameters, which is higher than MambaIR*'s 15.8M. However, the FLOPs reduction is much more substantial, indicating that EAMamba's parameter usage is more efficient in terms of computational operations.

    The following are the results from Table 2 of the original paper:

    MethodParam. (M) ↓FLOPs (G) ↓SIDD [59]
    PSNR ↑SSIM↑
    DnCNN [9]-23.660.583
    BM3D [74]25.650.685
    CBDNet* [8]30.780.801
    RIDNet* [75]1.59838.710.951
    VDN [76]7.84439.280.956
    SADNet* [77]--39.460.956
    DANet* [78]63.03039.470.957
    CycleISP* [79]2.818439.520.957
    MIRNet [35]31.878539.720.959
    DeamNet* [80]2.314739.470.957
    MPRNet [36]15.758839.710.958
    DAGL [81]5.727338.940.953
    HINet [37]88.717139.990.958
    IPT* [49]115.338039.100.954
    MAXIM-3S [85]22.233939.960.960
    UFormer-B [50]50.98939.890.960
    Restormer [51]26.114140.020.960
    MambaIR-UNet [57]26.823039.890.960
    EAMamba (Ours)25.313739.870.960

Analysis of Table 2 (Real-world Denoising on SIDD):

  • Computational Efficiency: EAMamba (137 GFLOPs) achieves a 41% reduction in FLOPs compared to MambaIR-UNet (230 GFLOPs), while having a similar parameter count (25.3M vs 26.8M). This further reinforces the efficiency benefits of MHSSM. EAMamba is also more FLOPs-efficient than many high-performing Transformer-based models like Restormer (141 GFLOPs) and HINet (171 GFLOPs).

  • Image Quality (PSNR & SSIM): EAMamba achieves a PSNR of 39.87 dB and SSIM of 0.960. While MambaIR-UNet has a slightly higher PSNR of 39.89 dB (a marginal difference of 0.02 dB), EAMamba matches its SSIM. This demonstrates that EAMamba maintains strong perceptual quality and structural fidelity with significantly reduced computational cost. Restormer leads in PSNR at 40.02 dB, showing that EAMamba is very competitive with state-of-the-art methods across different architectures.

    The following figure (Figure 7 from the original paper) presents qualitative results for real-world denoising:

    该图像是图表,展示了不同图像恢复方法的PSNR值。左侧为真实图像PSNR值为33.11,接下来是低质量图像以及MPRNet(39.62)、UFormer-B(39.73)、Restormer(40.47)、MambaIR-UNet(37.45)等方法的恢复结果,最终是我们的EAMamba方法,PSNR值达到40.99。 该图像是图表,展示了不同图像恢复方法的PSNR值。左侧为真实图像PSNR值为33.11,接下来是低质量图像以及MPRNet(39.62)、UFormer-B(39.73)、Restormer(40.47)、MambaIR-UNet(37.45)等方法的恢复结果,最终是我们的EAMamba方法,PSNR值达到40.99。

Figure 7. The image is a chart showing the PSNR values of different image restoration methods. On the left is the ground truth PSNR value of 33.11, followed by the low-quality image and the restoration results of MPRNet (39.62), UFormer-B (39.73), Restormer (40.47), MambaIR-UNet (37.45), and finally our EAMamba method with a PSNR value of 40.99.

Analysis of Figure 7 (Qualitative Results on SIDD): The qualitative comparison shows normalized difference maps, where brighter areas indicate larger deviations from the ground truth. EAMamba's output (PSNR 40.99) shows closer correspondence to ground truth compared to other methods, indicating better detail preservation and fewer artifacts. This visual evidence supports the quantitative metrics, confirming EAMamba's effectiveness in denoising. (Note: The PSNR values in the figure's description are for a specific image, not the average dataset PSNR).

6.1.2. Image Super-Resolution (Section 4.3)

The following are the results from Table 3 of the original paper:

Method|Param. Flops | (M)↓ (G) ↓|x2x3x4
| PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑ PSNR ↑ SSIM ↑
Restormer [51]26.115534.330.92931.160.87429.540.836
MambaIR-UNet [57]]26.823034.200.92731.160.87229.530.835
VMambaIR [58]26.320034.160.92731.140.87229.560.836
EAMamba (Ours)25.313734.180.92731.110.87229.600.835

Analysis of Table 3 (Super-Resolution on RealSR):

  • Computational Efficiency: EAMamba demonstrates superior computational efficiency with the lowest parameter count (25.3M) and lowest FLOPs (137 GFLOPs) among all compared methods. This is notably lower than MambaIR-UNet (230 GFLOPs) and VMambaIR (200 GFLOPs), and also more efficient than Restormer (155 GFLOPs).
  • Image Quality (PSNR & SSIM):
    • ×4\times 4 Scaling: EAMamba achieves the superior PSNR performance (29.60 dB) at the ×4\times 4 scaling factor, outperforming all other methods including VMambaIR (29.56 dB) and Restormer (29.54 dB).
    • ×2\times 2 and ×3\times 3 Scaling: EAMamba maintains a minimal PSNR gap (less than 0.05 dB) compared to other Vision Mamba methods and Restormer. For ×2\times 2, EAMamba (34.18 dB) is slightly lower than MambaIR-UNet (34.20 dB) and Restormer (34.33 dB). For ×3\times 3, EAMamba (31.11 dB) is slightly lower than MambaIR-UNet (31.16 dB) and Restormer (31.16 dB). SSIM values are generally competitive. These results solidify EAMamba's position as an efficient solution that can also deliver leading performance in super-resolution, particularly at higher scaling factors.

The following figure (Figure 8 from the original paper) presents qualitative results for image super-resolution:

该图像是展示EAMamba在图像恢复任务中比较不同方法性能的示意图,包括真实图像、低质量图像与其他恢复方法以及EAMamba的结果,最后一列显示了EAMamba获得的最高PSNR值35.04。 该图像是展示EAMamba在图像恢复任务中比较不同方法性能的示意图,包括真实图像、低质量图像与其他恢复方法以及EAMamba的结果,最后一列显示了EAMamba获得的最高PSNR值35.04。

Figure 8. The image is a diagram illustrating the performance comparison of different methods in the image restoration task using EAMamba, including the ground truth image, low-quality image, and the results of other restoration methods, with EAMamba achieving the highest PSNR value of 35.04 in the last column.

Analysis of Figure 8 (Qualitative Results for Super-Resolution): The cropped difference results (normalized differences from ground truth) visually demonstrate EAMamba's superior capability in structural preservation and detail reconstruction. It appears to surpass other Vision Mamba baselines by producing outputs that are visually closer to the ground truth, validating its qualitative advantages.

6.1.3. Image Deblurring (Section 4.4)

The following are the results from Table 4 of the original paper:

MethodParam. (M) ↓FLOPs (G) ↓GoPro [62]HIDE [86]
PSNR ↑SSIM ↑PSNR ↑SSIM↑
DeblurGAN-v2 [16]--29.550.93426.610.875
SRN [15].-30.260.93428.360.915
DBGAN [17]--31.100.94228.940.915
DMPHN [18]21.719531.200.94029.090.924
SPAIR [38]--32.060.95330.290.931
MIMO-UNet+ [19]16.115132.450.95729.990.930
MPRNet [36]20.176032.660.95930.960.939
HINet [37]88.717132.710.95930.320.932
IPT [49]115.338032.52--
MAXIM-3S [85]22.233932.860.96132.830.956
UFormer-B [50]50.98933.060.96730.900.953
Restormer [51]26.114132.920.96131.220.942
Stripformer [43]19.715533.080.96231.030.940
SFNet [39]13.312533.270.96331.100.941
EAMamba (Ours)25.313733.580.96631.420.944

Analysis of Table 4 (Image Deblurring on GoPro and HIDE):

  • Performance on GoPro: EAMamba achieves superior performance on the GoPro benchmark, with a PSNR of 33.58 dB and SSIM of 0.966. This surpasses the second-best result (SFNet, 33.27 dB) by 0.31 dB in terms of PSNR, highlighting its strong deblurring capabilities. Its FLOPs (137 GFLOPs) are also competitive, being higher than UFormer-B (89 GFLOPs) and SFNet (125 GFLOPs), but lower than MPRNet (760 GFLOPs) and MAXIM-3S (339 GFLOPs).

  • Performance on HIDE: On the HIDE benchmark, EAMamba (PSNR 31.42 dB, SSIM 0.944) ranks as the second-best method, closely behind MAXIM-3S (PSNR 32.83 dB, SSIM 0.956). This indicates robust performance across different deblurring datasets.

    The following figure (Figure 9 from the original paper) presents qualitative results for image deblurring:

    该图像是图表,展示了EAMamba在图像重建任务中与其他方法(包括MPRNet、MAXIM-3S、Restormer和SFNet)的PSNR(峰值信噪比)性能对比。在各方法中,EAMamba的PSNR达到36.52,显示了其在低级视觉任务中的优越性。 该图像是图表,展示了EAMamba在图像重建任务中与其他方法(包括MPRNet、MAXIM-3S、Restormer和SFNet)的PSNR(峰值信噪比)性能对比。在各方法中,EAMamba的PSNR达到36.52,显示了其在低级视觉任务中的优越性。

Figure 9. The image is a chart that illustrates the PSNR (Peak Signal-to-Noise Ratio) performance comparison of EAMamba with other methods, including MPRNet, MAXIM-3S, Restormer, and SFNet, in image restoration tasks. Among the methods, EAMamba achieves a PSNR of 36.52, demonstrating its superiority in low-level vision tasks.

Analysis of Figure 9 (Qualitative Results for Deblurring): The qualitative comparisons suggest that EAMamba excels in preserving details and producing images that are nearly indistinguishable from the ground truth. This is observed from the nearest objects (e.g., car bumper) to broader environmental contexts (e.g., brick pavement), reinforcing its strong deblurring capabilities.

6.1.4. Image Dehazing (Section 4.5)

The following are the results from Table 5 of the original paper:

MethodParam. (M) ↓FLOPs (G) ↓SOTS-Indoor [63]SOTS-Outdoor [63]
PSNR ↑SSIM ↑PSNR ↑SSIM ↑
DehazeNet [28]--21.050.79320.300.771
AOD-Net [29]--22.250.81720.890.785
GridDehazeNet [30]--30.730.94735.530.984
MSBDN [31]--31.430.95136.190.986
FFANet [32]--31.540.95036.310.987
DCP [33]--31.950.95336.940.988
DeamNet [34]--32.730.95637.100.988
MAXIM-3S [85]22.233932.550.95637.280.988
Restormer [51]26.114132.890.95737.420.989
EAMamba (Ours)25.313733.050.95837.520.989

Analysis of Table 5 (Image Dehazing on RESIDE):

  • Computational Efficiency: EAMamba again demonstrates excellent efficiency with 25.3M parameters and 137 GFLOPs. It is slightly more efficient than Restormer (141 GFLOPs) and significantly more efficient than MAXIM-3S (339 GFLOPs).

  • Image Quality (PSNR & SSIM): EAMamba achieves the highest PSNR on both SOTS-Indoor (33.05 dB) and SOTS-Outdoor (37.52 dB) benchmarks, surpassing all other methods, including Restormer (32.89 dB and 37.42 dB respectively) and MAXIM-3S (32.55 dB and 37.28 dB respectively). It also achieves the highest or tied-highest SSIM values (0.958 for SOTS-Indoor, 0.989 for SOTS-Outdoor). This confirms EAMamba's superior dehazing capabilities while maintaining high efficiency.

    The following figure (Figure 10 from the original paper) presents qualitative results for image dehazing:

    该图像是一个比较展示,左侧为真实图像,右侧展示了低质量图像及多种图像恢复方法的结果,包括Dehamer、MAXIM-2S、DehazeFormer-L和我们的EAMamba方法,显示出EAMamba在图像恢复中取得的最大PSNR值46.28,显著高于其他方法。 该图像是一个比较展示,左侧为真实图像,右侧展示了低质量图像及多种图像恢复方法的结果,包括Dehamer、MAXIM-2S、DehazeFormer-L和我们的EAMamba方法,显示出EAMamba在图像恢复中取得的最大PSNR值46.28,显著高于其他方法。

Figure 10. The image is a comparative display, with the left side showing the ground truth image and the right side presenting low-quality images along with the results of various image restoration methods, including Dehamer, MAXIM-2S, DehazeFormer-L, and our EAMamba method, demonstrating that EAMamba achieves the highest PSNR value of 46.28 in image restoration, significantly outperforming other methods.

Analysis of Figure 10 (Qualitative Results for Dehazing): The qualitative results show that EAMamba exhibits superior detail preservation with minimal deviation from ground truth. The reconstructed image by EAMamba appears clearer and more faithful to the original scene compared to other methods, reinforcing its quantitative advantages.

6.2. Effectiveness of Various Scanning Strategies (Section 4.6)

The paper further investigates the impact of different scanning strategies, particularly highlighting the benefits of the proposed all-around scanning.

The following figure (Figure 11 from the original paper) illustrates the ERF results for different scanning strategies:

Figure 11. Illustration of the ERF results for different scanning strategies, including two-dimensional scan, diagonal scan, zigzag scan, Z-order scan, Hilbert scan, and our all-around scan with reversing and flipping. 该图像是图表,展示了不同扫描策略(如2D扫描、对角线扫描、锯齿形扫描、Z字形扫描、Hilbert扫描和全方位扫描)的ERF结果。ERF结果展示了各方法在不同指标上的表现,以验证EAMamba的优越性。

Figure 11. Illustration of the ERF results for different scanning strategies, including two-dimensional scan, diagonal scan, zigzag scan, Z-order scan, Hilbert scan, and our all-around scan with reversing and flipping.

Analysis of Figure 11 (ERF Results for Different Scanning Strategies): The Effective Receptive Field (ERF) visualization shows which input pixels contribute to the output of a target pixel through gradient flow analysis.

  • (a) 2D Scan: Shows strong horizontal and vertical dependencies but weaker diagonal influence.

  • (b) Diagonal Scan: Clearly captures global information along diagonal paths.

  • (c) Zigzag Scan: Also captures diagonal information.

  • (d) Z-order Scan: Gathers global information but with discontinuities.

  • (e) Hilbert Scan: Appears to have a less organized or less global information capture pattern in this visualization, indicating it might struggle with broad context.

  • (f) All-Around Scan (2D + Diagonal + Reversing & Flipping): This combines multiple patterns, resulting in a significantly broader and more isotropic (uniform in all directions) ERF. This directly supports the claim that all-around scanning captures both global and local contextual information more comprehensively, especially covering diagonal directions missed by simple 2D scans. This expanded ERF is crucial for preserving local information around target pixels, which is vital for image restoration.

    The following are the results from Table 6 of the original paper:

    2DDiagonalZigzagZ-orderHilbertAll-around
    39.8039.7939.7739.7439.7439.87

Analysis of Table 6 (Average PSNR (dB) on SIDD for various scan strategies): This table quantifies the benefits of all-around scanning.

  • Individual scanning strategies (2D, Diagonal, Zigzag, Z-order, Hilbert) yield PSNR values ranging from 39.74 dB to 39.80 dB.

  • The All-around scanning strategy (which combines multiple patterns) achieves the highest PSNR of 39.87 dB. This represents an improvement of 0.07 dB to 0.15 dB over individual scanning patterns, quantitatively confirming its effectiveness in enhancing image quality by capturing more comprehensive spatial information and mitigating local pixel forgetting.

    The following are the results from Table 7 of the original paper:

    Dataset2D + Diagonal2D + Z-order2D + Hilbert2D + Diagonal + Z-order
    SIDD [59]39.8739.8239.8339.83
    RealSRx4 [61]29.6029.5829.5129.57
    OPro [62]33.5833.5133.6633.56
    SOTS-Indoor [63]43.1943.2043.0743.37

Analysis of Table 7 (Average PSNR (dB) on various image restoration datasets with different combinations of scanning strategies): This table explores different combinations of scanning strategies within the all-around approach.

  • The combination of 2D + Diagonal scanning consistently yields good performance across all tested datasets (SIDD, RealSRx4, GoPro, SOTS-Indoor), achieving values like 39.87 dB, 29.60 dB, 33.58 dB, and 43.19 dB respectively. This combination is chosen as the default configuration for EAMamba, striking a good balance between complexity and performance.
  • Other combinations like 2D + Z-order and 2D + Hilbert show slightly lower or comparable performance. 2D + Diagonal + Z-order sometimes performs very well (e.g., SOTS-Indoor with 43.37 dB), indicating that specific combinations might be optimal for particular tasks. The flexibility of the MHSSM to seamlessly incorporate novel scanning strategies is highlighted, allowing for further optimization based on specific use cases.

6.3. Ablation Studies (Section 4.7)

6.3.1. The Effectiveness of MHSS and All-Around Scan

The following are the results from Table 8 of the original paper:

MethodParam. (M) ↓FLOPs (G) ↓Urban100 [87]
σ = 15σ = 25σ = 50
Baseline31.128635.1533.0030.08
+ MHSSM25.313735.0632.8929.95
+ all-around scan25.313735.1032.9330.01

Analysis of Table 8 (Average PSNR (dB) on Urban100 with Gaussian color image denoising for different design choices): This ablation study quantifies the individual contributions of MHSSM and the all-around scanning strategy.

  • Baseline: The Baseline model uses 2DSSM [57] with conventional 2D scanning. It has 31.1M parameters and 286 GFLOPs.

  • + MHSSM: Substituting 2DSSM with MHSSM (while still using 2D scanning) drastically reduces computational cost: FLOPs are almost halved from 286 GFLOPs to 137 GFLOPs. Parameters also decrease from 31.1M to 25.3M. The PSNR values experience a negligible decrease (e.g., 35.15 to 35.06 dB at σ=15\sigma=15). This quantitatively demonstrates MHSSM's capacity for significant computational savings with minimal detriment to quality.

  • + all-around scan: Starting from the + MHSSM configuration, incorporating the all-around scanning strategy (instead of just 2D scanning) improves the PSNR values (e.g., 35.06 to 35.10 dB at σ=15\sigma=15). This improvement occurs without any increase in parameters or FLOPs (remaining at 25.3M and 137 GFLOPs). This confirms that all-around scanning delivers enhanced image quality compared to conventional 2D scanning by better capturing spatial information.

    In summary, this study strongly supports EAMamba's design choices: MHSSM provides substantial efficiency, and all-around scanning boosts performance without additional computational burden.

6.3.2. Comparison of the various channel MLP

The following are the results from Table 9 of the original paper:

Channel MLPParam. (M) ↓FLOPs (G) ↓Urban100 [87]
σ = 15σ = 25σ = 50
None16.59034.9832.7929.82
FFN28.315335.1032.9330.01
GDFN34.518935.1532.9830.08
Simple FFN25.313735.1032.9330.01
CA28.312335.0532.8829.95

Analysis of Table 9 (Average PSNR (dB) on Urban100 with Gaussian color image denoising for different channel MLPs choices): This ablation study investigates the impact of different channel MLP designs within the MambaFormer Block. The base configuration here likely includes MHSSM and all-around scanning.

  • None: Excluding the Channel MLP entirely leads to a performance decline greater than 0.1 dB in PSNR (e.g., 34.98 dB at σ=15\sigma=15), demonstrating that the channel MLP is essential for feature refinement. It also has the lowest parameters and FLOPs, as expected.

  • FFN (Vanilla Feed-Forward Network): A standard FFN yields good performance (35.10 dB) but with higher parameters (28.3M) and FLOPs (153 GFLOPs).

  • GDFN (Gated-Dconv FFN) [51]: This design demonstrates superior performance (35.15 dB at σ=15\sigma=15), achieving the highest PSNR values. However, it also has the highest parameter count (34.5M) and FLOPs (189 GFLOPs).

  • Simple FFN [68]: This is the chosen default configuration for EAMamba. It achieves the same PSNR as the vanilla FFN (35.10 dB) but with fewer parameters (25.3M vs 28.3M) and lower FLOPs (137 GFLOPs vs 153 GFLOPs). This indicates that Simple FFN strikes an optimal balance between performance and computational efficiency.

  • CA (Channel Attention) [6]: Channel Attention also offers good efficiency (123 GFLOPs) but slightly lower performance (35.05 dB) compared to Simple FFN.

    The choice of Simple FFN as the default channel MLP is justified by its ability to provide strong performance while maintaining a significantly reduced computational footprint, aligning with EAMamba's overall goal of efficiency.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces EAMamba, an Efficient All-Around Vision State Space Model specifically designed for image restoration. EAMamba significantly advances existing Vision Mamba frameworks by addressing two critical limitations: the computational overhead associated with multiple scanning sequences and the problem of local pixel forgetting when processing 2D images as 1D sequences.

The key innovations are:

  1. Multi-Head Selective Scan Module (MHSSM): This module efficiently processes and aggregates multiple flattened 1D sequences through a channel-splitting strategy. It dramatically improves the scalability and computational efficiency of Vision Mamba frameworks by avoiding the linear increase in computational and parameter costs that would otherwise arise from incorporating additional scanning directions.

  2. All-Around Scanning Mechanism: By integrating diverse scanning patterns (horizontal, vertical, diagonal, and their reversed orientations), this mechanism captures holistic spatial information. This effectively resolves the local pixel forgetting issue, ensuring that crucial local pixel relationships are preserved, which is paramount for high-quality image restoration.

    Extensive experimental evaluations across various image restoration tasks (super-resolution, denoising, deblurring, and dehazing) consistently demonstrate EAMamba's efficacy. The model achieves a significant 31-89% reduction in FLOPs compared to existing low-level Vision Mamba methods, while simultaneously maintaining or even improving favorable performance in terms of PSNR and SSIM metrics. Both qualitative and quantitative results validate EAMamba's architectural innovations, positioning it as a highly efficient and effective solution for diverse image restoration challenges.

7.2. Limitations & Future Work

The paper does not explicitly state a dedicated "Limitations" section. However, based on the results and the nature of the work, some potential aspects could be considered limitations or areas for future work:

  • Computational Overhead of Transformation/Inverse Transformation: While MHSSM efficiently aggregates selective scans, the Transform and InverseTransform steps (flattening 2D to 1D and vice-versa) for multiple directions might introduce some overhead, especially for very high-resolution images or numerous scanning patterns. The paper focuses on the FLOPs of the Mamba core, but the overall efficiency includes these reshaping operations.
  • Optimality of Scanning Patterns: The "all-around" strategy is a fixed set of patterns (e.g., 2D + Diagonal). It's possible that for specific degradation types or image content, other adaptive or learned scanning patterns could yield further improvements. The paper acknowledges this by stating "Other combinations can be employed for specific use cases" and "offers the flexibility to seamlessly incorporate novel scanning strategies with MHSSM," implying this is an open area.
  • Generalizability to Other Vision Tasks: While validated on image restoration, EAMamba's innovations could potentially benefit other low-level or even high-level vision tasks (e.g., semantic segmentation, object detection) where long-range dependencies and local context are crucial. The current evaluation is limited to restoration.
  • Hardware-Specific Optimizations: The efficiency gains are measured in FLOPs. Actual runtime performance can depend heavily on hardware architecture and memory access patterns. Further work could explore hardware-aware optimizations for the multi-head selective scan.
  • Hyperparameter Sensitivity: The choice of the channel expansion factor λ\lambda and the number of groups nn for MHSS are critical design decisions. The paper does not provide an ablation for these specific hyperparameters, which could influence the optimal balance of efficiency and performance.

7.3. Personal Insights & Critique

EAMamba presents a highly commendable advancement in the application of State Space Models to computer vision, particularly for image restoration.

Personal Insights:

  • Elegant Solution to Mamba's 2D Challenges: The paper provides an elegant and practical solution to two core problems faced by initial Vision Mamba adaptations: the inefficiency of adding more scanning directions and the loss of local spatial context. MHSSM's channel-splitting approach is a clever way to parallelize scanning operations without linearly increasing computational cost, while the all-around scanning strategy is a direct and effective counter to local pixel forgetting.
  • Efficiency Frontier: The quantitative results, especially the dramatic reduction in FLOPs (31-89%) while maintaining or even surpassing performance, are genuinely impressive. This is crucial for real-world deployment where computational resources are often constrained. EAMamba truly seems to establish a new "efficiency frontier" as claimed.
  • Versatility: Demonstrating strong performance across four diverse image restoration tasks (super-resolution, denoising, deblurring, dehazing) highlights the robustness and generalizability of the proposed architecture. This suggests that the underlying principles of efficient long-range dependency modeling and comprehensive local context capture are broadly applicable.
  • Foundation for Future SSM-Vision Research: EAMamba provides a solid blueprint for how to effectively adapt Mamba's linear complexity to the inherently 2D nature of images. The MHSSM could become a standard component in future Vision Mamba architectures seeking higher efficiency and more comprehensive spatial understanding.

Critique:

  • Lack of Detailed Failure Analysis: While qualitative results show EAMamba performing well, a deeper analysis of specific failure cases or common artifacts produced (or prevented) by EAMamba compared to baselines could offer further insights. What kinds of degradations does it struggle with? Where does it still fall short of the ground truth, and why?

  • "All-Around" Specificity: The paper states "the combination of the 2D scan and the diagonal scan generally yields good performance and is set as the default configuration." While this is practical, a more in-depth discussion on why this specific combination is optimal, or if it can be dynamically adapted, could be beneficial. The term "all-around" suggests a maximally comprehensive approach, but in practice, a subset was chosen for efficiency.

  • Parameter Explanation in Formulas: While the paper generally explains symbols in formulas, a more explicit breakdown of how the Transform and InverseTransform functions work (e.g., which specific scanning patterns are used for each group, how the 2D-to-1D flattening is done) would be helpful for a beginner to fully grasp the technical details without referring to external Mamba implementations.

  • Comparison to Non-Mamba SOTA: While EAMamba is clearly superior to other Vision Mamba methods in efficiency, and competitive with SOTA Transformer methods, a more direct comparison showing specific scenarios where it outright beats the best Transformer models in terms of quality (e.g., Restormer, MAXIM-3S) would further strengthen its position, beyond just showing FLOPs reduction. In some tables, Restormer or MAXIM-3S still have higher PSNR.

    Overall, EAMamba is a significant contribution, pushing the boundaries of efficient and effective image restoration using state space models. Its innovations are well-motivated and empirically validated, offering a promising direction for future research in computer vision.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.