EAMamba: Efficient All-Around Vision State Space Model for Image Restoration
TL;DR Summary
EAMamba integrates a Multi-Head Selective Scan Module and an all-around scanning mechanism to tackle the computational complexity and local pixel forgetting issues in image restoration. Experiments show EAMamba reduces FLOPs by 31-89% while maintaining comparable performance to existing low-level Vision Mamba methods.
Abstract
Image restoration is a key task in low-level computer vision that aims to reconstruct high-quality images from degraded inputs. The emergence of Vision Mamba, which draws inspiration from the advanced state space model Mamba, marks a significant advancement in this field. Vision Mamba demonstrates excellence in modeling long-range dependencies with linear complexity, a crucial advantage for image restoration tasks. Despite its strengths, Vision Mamba encounters challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and local pixel forgetting. To address these limitations, this study introduces Efficient All-Around Mamba (EAMamba), an enhanced framework that incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. MHSSM efficiently aggregates multiple scanning sequences, which avoids increases in computational complexity and parameter count. The all-around scanning strategy implements multiple patterns to capture holistic information and resolves the local pixel forgetting issue. Our experimental evaluations validate these innovations across several restoration tasks, including super resolution, denoising, deblurring, and dehazing. The results validate that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of this paper is EAMamba: Efficient All-Around Vision State Space Model for Image Restoration. It focuses on developing an advanced Vision Mamba framework to improve image restoration tasks by enhancing efficiency and comprehensively capturing visual information.
1.2. Authors
The authors of this paper are:

- Yu-Cheng Lin
- Yu-Syuan Xu
- Hao-Wei Chen
- Hsien-Kai Kuo
- Chun-Yi Lee

Their affiliations indicate a collaboration between academia and industry:

- National Tsing Hua University
- National Taiwan University
- MediaTek Inc.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (published 2025-06-27 UTC, https://arxiv.org/abs/2506.22246). As a preprint, it has not yet undergone formal peer review in a journal or conference. However, arXiv is a widely respected platform for disseminating research in fields like computer science, often serving as a precursor to formal publication in top-tier conferences (e.g., CVPR, ICCV, ECCV) or journals. The 2025 publication date indicates that it is a very recent work.
1.4. Publication Year
2025
1.5. Abstract
Image restoration is a crucial task in low-level computer vision, aiming to reconstruct high-quality images from degraded inputs. Vision Mamba, inspired by the state space model Mamba, has shown promise in modeling long-range dependencies with linear complexity, a significant advantage for image restoration. However, Vision Mamba faces challenges in low-level vision tasks, including computational complexity that scales with the number of scanning sequences and a phenomenon called "local pixel forgetting."
To overcome these limitations, this study introduces Efficient All-Around Mamba (EAMamba). EAMamba incorporates a Multi-Head Selective Scan Module (MHSSM) with an all-around scanning mechanism. The MHSSM efficiently aggregates multiple scanning sequences without increasing computational complexity or parameter count. The all-around scanning strategy uses multiple patterns to capture holistic information, effectively resolving the local pixel forgetting issue.
Experimental evaluations validate these innovations across various restoration tasks, including super-resolution, denoising, deblurring, and dehazing. The results show that EAMamba achieves a significant 31-89% reduction in FLOPs while maintaining favorable performance compared to existing low-level Vision Mamba methods.
1.6. Original Source Link
https://arxiv.org/abs/2506.22246 PDF Link: https://arxiv.org/pdf/2506.22246v1.pdf This is a preprint published on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the efficient and effective reconstruction of high-quality images from degraded inputs, a task known as image restoration. This is a fundamental challenge in low-level computer vision, which deals with tasks that operate directly on pixels, such as denoising, deblurring, and super-resolution.
This problem is important because real-world images are frequently affected by various degradations (noise, blur, low-resolution, haze), and high-quality image reconstruction is vital for numerous applications, from medical imaging to autonomous driving and consumer photography.
Prior research has seen the dominance of Convolutional Neural Networks (CNNs) for image restoration. While successful, CNNs inherently struggle to capture global information due to their localized receptive fields. The emergence of Vision Transformers (ViT) addressed this by using multi-head self-attention to model relationships across all image pixels, thereby capturing global dependencies effectively. However, ViTs suffer from a major drawback: their computational complexity scales quadratically with the number of pixels, making them computationally intensive and often infeasible for high-resolution images, especially in resource-constrained environments.
More recently, Mamba, an advanced state space model (SSM), has shown promise in Natural Language Processing (NLP) for its ability to model long-range dependencies with linear computational complexity. This led to the adaptation of Mamba for vision tasks, giving rise to Vision Mamba models. These models aim to combine ViT's global information capturing capabilities with Mamba's linear computational scaling.
Despite these advancements, existing Vision Mamba models face specific challenges in low-level vision tasks:
- Computational Complexity with Multiple Scanning Sequences: Traditional Vision Mamba methods, like those using two-dimensional Selective Scan (2DSS), generate multiple flattened one-dimensional sequences (e.g., four sequences from horizontal and vertical scans). Each sequence requires its own selective scan with distinct parameters, leading to increased computational overhead and parameter count proportional to the number of scanning directions. This limits the scalability of incorporating more diverse scanning patterns.
- Local Pixel Forgetting: When a two-dimensional image is flattened into a one-dimensional sequence, spatially adjacent pixels can become distantly separated in the token sequence. This phenomenon, termed "local pixel forgetting," causes a loss of crucial local spatial relationships, which are paramount for accurate image restoration tasks. Existing scanning strategies have not adequately addressed this.
The paper's entry point is to leverage the strengths of Vision Mamba while directly tackling these two specific limitations, proposing an efficient and comprehensive solution for image restoration.
2.2. Main Contributions / Findings
The paper introduces Efficient All-Around Mamba (EAMamba) with two primary architectural innovations to address the identified challenges:
- Multi-Head Selective Scan Module (MHSSM):
- Contribution: EAMamba proposes MHSSM to efficiently process and aggregate flattened 1D sequences. Instead of performing separate selective scans on full feature channels for each direction, MHSSM employs a channel grouping strategy. This allows multiple scanning sequences to be aggregated without incurring the typical computational complexity and parameter count overhead associated with increasing the number of sequences.
- Benefit: This design significantly improves the scalability and efficiency of Vision Mamba frameworks, enabling the integration of more complex scanning patterns.
- All-Around Scanning Strategy:
- Contribution: Benefiting from MHSSM's efficiency, EAMamba introduces an all-around scanning mechanism. This strategy goes beyond conventional two-dimensional scanning by executing selective scanning across multiple directions, including horizontal, vertical, diagonal, flipped diagonal, and their respective reversed orientations.
- Benefit: This multi-directional approach effectively captures holistic image information, directly addressing the local pixel forgetting issue by ensuring broader neighborhood coverage and strengthening spatial context understanding.
- Insight: The paper provides insights through Effective Receptive Field (ERF) visualizations, demonstrating how all-around scanning enhances spatial dependency preservation, especially for diagonal information often missed by 2D scans.
Key Conclusions / Findings:
- Significant Efficiency Gains: EAMamba achieves a remarkable 31-89% reduction in FLOPs (floating point operations), a measure of computational cost, compared to existing low-level Vision Mamba methods. This indicates a substantial improvement in computational efficiency.
- Maintained Favorable Performance: Despite the significant FLOPs reduction, EAMamba maintains favorable performance in terms of image quality metrics (PSNR and SSIM) across various image restoration tasks (super-resolution, denoising, deblurring, and dehazing). This demonstrates that the efficiency gains do not come at a significant cost to restoration quality.
- Validation Across Tasks: The innovations are validated through extensive experiments on widely adopted benchmark datasets, demonstrating EAMamba's versatility and effectiveness across a range of degradation types.
- Scalability for Vision Mamba: MHSSM provides a pathway for Vision Mamba models to incorporate more sophisticated scanning patterns without prohibitive computational costs, enhancing their applicability to complex visual data.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand EAMamba, a basic grasp of several foundational concepts in deep learning and computer vision is essential:
- Image Restoration: This is a broad field in low-level computer vision that deals with recovering a high-quality (clean, sharp, high-resolution) image from a degraded (noisy, blurry, low-resolution, hazy) input image. Common tasks include:
  - Super-resolution (SR): Increasing the resolution of an image.
  - Denoising: Removing noise from an image.
  - Deblurring: Removing blur (e.g., motion blur, out-of-focus blur) from an image.
  - Dehazing: Removing haze or fog from an image.
  These tasks are often ill-posed, meaning multiple high-quality images could theoretically lead to the same degraded input, making the reconstruction challenging.
- Convolutional Neural Networks (CNNs): These are a class of deep neural networks commonly used for analyzing visual imagery. They are characterized by convolutional layers that apply filters (kernels) to input data, pooling layers for dimensionality reduction, and often fully connected layers at the end.
  - Local Processing: CNNs excel at capturing local patterns and features due to their small, localized receptive fields (the area of the input image that a neuron "sees"). This is beneficial for tasks requiring local detail preservation but can be a limitation for capturing long-range dependencies across an entire image.
- Vision Transformers (ViTs): Introduced as an adaptation of the Transformer architecture (originally for NLP) to computer vision.
  - Self-Attention Mechanism: The core of ViTs is the self-attention mechanism, specifically Multi-Head Self-Attention (MHSA). MHSA allows the model to weigh the importance of different parts of the input (or "tokens") relative to each other, thus capturing global dependencies across the entire image.
  - Image Patches: ViTs typically divide an image into fixed-size patches, which are then treated as "tokens" similar to words in an NLP sequence.
  - Quadratic Complexity: A major drawback of traditional MHSA is its computational complexity, which scales quadratically with the number of input tokens (i.e., image patches or pixels for high-resolution images). This makes it computationally expensive for high-resolution image processing.
- State Space Models (SSMs) / Mamba: A class of models that aim to efficiently model long-range dependencies, inspired by control theory.
  - Selective State Spaces: Mamba introduces a selective state space mechanism. Unlike traditional SSMs or attention mechanisms, Mamba's parameters are data-dependent, allowing the model to selectively propagate or discard information based on the input. This selectivity is key to its efficiency and ability to model long-range dependencies.
  - Linear Complexity: A defining feature of Mamba is its linear computational complexity with respect to the sequence length, making it highly efficient for processing long sequences compared to quadratic-complexity models like ViTs.
  - Scanning Mechanism: Mamba processes sequences through a scanning mechanism that aggregates information sequentially. For 2D data like images, this typically involves flattening the image into 1D sequences and processing them in different directions (e.g., horizontal, vertical).
- UNet Architecture: A widely used encoder-decoder convolutional network architecture, particularly popular for image-to-image tasks like segmentation and restoration.
  - Encoder: Gradually reduces the spatial dimensions of the input while increasing feature channels, capturing context.
  - Decoder: Gradually increases spatial dimensions while reducing feature channels, reconstructing the output.
  - Skip Connections: Crucially, UNet includes skip connections that directly transfer feature maps from the encoder to the corresponding decoder layers. These connections help preserve fine-grained details lost during the encoding (downsampling) process, which is critical for image restoration.
- Layer Normalization (LN): A normalization technique applied across the features of an individual sample within a layer. It helps stabilize training, especially in deep networks, by normalizing the activations.
- Sigmoid Linear Unit (SiLU): Also known as Swish, it is an activation function defined as $\mathrm{SiLU}(x) = x \cdot \sigma(x)$, where $\sigma$ denotes the sigmoid function. It is a smooth, non-monotonic function that has been shown to perform well in deep networks.
- Depth-wise Convolution (DWConv2D): A type of convolution that applies a single filter to each input channel independently. This significantly reduces the number of parameters and computational cost compared to standard convolutions, making models more efficient.
- Multilayer Perceptron (MLP): A class of feedforward artificial neural networks. In deep learning architectures, an MLP typically refers to a block of fully connected layers with non-linear activation functions, used for feature transformation or projection.
3.2. Previous Works
The paper contextualizes EAMamba within the evolution of image restoration techniques, highlighting the shift from CNNs to Transformers and now to Mamba-based models.
- CNN-based Approaches (Early Developments):
- Examples: [1-3, 7-13, 14, 18].
- Contribution: Demonstrated success in various image restoration benchmarks.
- Limitations: Inherent limitations in capturing global information due to their focus on local pixel relationships.
- Vision Transformer (ViT)-based Architectures:
- Examples: ViT [40], SwinIR [48], Uformer [50], Restormer [51].
- Contribution: Employed multi-head self-attention mechanisms to model relationships across all image pixels, effectively capturing global dependencies and achieving promising results.
- Limitations: Quadratic computational complexity with respect to pixel count, rendering high-resolution image processing infeasible. To mitigate this, approaches like SwinIR and Uformer used window-based attention, and Restormer introduced multi-Dconv head transposed attention and gated-Dconv feed-forward networks.
- Vision Mamba Models:
  - Inspired by the Mamba framework [53] from NLP, Vision Mamba models adapt SSMs for vision tasks, aiming for linear computational scaling and efficient long-range dependency modeling.
  - Pioneering Works [54, 55]:
    - VMamba [54]: Introduced cross-scan and cross-merge techniques to effectively aggregate spatial information from different scanning directions.
    - Vision Mamba [55]: Processed image patches in both forward and backward directions to capture spatial information more comprehensively.
    - Common Strategy: These works transform 2D feature maps into flattened 1D sequences through dual-direction scanning (e.g., horizontal and vertical scans), as illustrated in Fig. 3.
  - Challenges Identified by EAMamba:
    - Computational Overhead: As shown in Fig. 2(a), 2D Selective Scan (2DSS) often generates multiple 1D sequences (e.g., four sequences from horizontal and vertical scans). Each sequence requires its own selective scan with distinct parameters, leading to increased computational complexity and parameter count that scales linearly with the number of sequences. This makes incorporating more scanning directions inefficient.
    - Local Pixel Forgetting: When 2D spatial information is flattened into 1D sequences, spatially adjacent pixels can become distantly separated. This leads to a loss of local context, which is crucial for image restoration. Fig. 4(a) illustrates this phenomenon.
- Vision Mamba for Image Restoration:
  - MambaIR [57]: Integrated the vision state space module and a modified MLP block to mitigate local pixel forgetting and channel redundancy issues found in vanilla Mamba.
  - VMambaIR [58]: Proposed an Omni Selective Scan (OSS) module, which conducts four-directional spatial scanning along with channel scanning to leverage both spatial and channel-wise information.
3.3. Technological Evolution
The evolution of image restoration models reflects a continuous effort to better capture global dependencies while maintaining computational efficiency.
- CNNs (Local Focus): Initial success relied on CNNs, leveraging their ability to extract local features. However, their limited receptive fields restricted their capacity for understanding global context, leading to suboptimal performance in tasks requiring a broader understanding of the image.
- ViTs (Global Focus, High Cost): Transformers brought the powerful self-attention mechanism, enabling models to attend to distant pixels and capture global context effectively. This marked a significant leap in performance but introduced a critical bottleneck: quadratic computational complexity, making them unsuitable for high-resolution image processing due to memory and time constraints. Efforts like window-based attention aimed to reduce this but often compromised some global context.
- Vision Mamba (Global Focus, Linear Cost): Mamba emerged as a promising alternative, offering linear computational complexity for long-range dependency modeling. This addressed the efficiency issues of ViTs while retaining the ability to capture global information. However, initial adaptations to vision (Vision Mamba, MambaIR, VMambaIR) faced new challenges: the computational burden of multiple scanning sequences and the problem of local pixel forgetting when flattening 2D images to 1D sequences.
3.4. Differentiation Analysis
EAMamba differentiates itself from previous Vision Mamba methods, particularly 2DSS-based approaches, by directly tackling their two main limitations:
- Against 2DSS (e.g., VMamba, MambaIR's core scanning strategy):
  - Efficiency of Multiple Scans: Traditional 2DSS (Fig. 2a) generates separate 1D sequences for each scanning direction (e.g., horizontal, vertical, and their reverses), with each requiring its own selective scan parameters. This leads to computational overhead and increased parameter count proportional to the number of scanning directions.
  - EAMamba's Innovation (MHSSM): EAMamba replaces 2DSS with Multi-Head Selective Scan (MHSS) (Fig. 2b). MHSS partitions input feature channels into groups and performs scanning within these groups. This channel-split approach allows for the efficient aggregation of multiple scanning sequences without escalating computational complexity or parameter count. It enables the use of more diverse scanning patterns without a linear increase in overhead.
- Against Local Pixel Forgetting (an issue in all 1D sequence flattening):
  - Problem: Previous 2D scanning strategies (e.g., horizontal, vertical) inherently separate spatially adjacent pixels when converting a 2D image into a 1D sequence (Fig. 4a), leading to a loss of critical local spatial context. While MambaIR used supplementary local convolution operations, it still faced challenges.
  - EAMamba's Innovation (All-Around Scanning): EAMamba introduces an all-around scanning strategy (Fig. 3). This strategy systematically combines multiple scanning directions (horizontal, vertical, diagonal, flipped diagonal, and their reversed orientations). By covering more directions, it ensures holistic spatial information capture, significantly mitigating local pixel forgetting. The ERF visualizations (Fig. 4b, Fig. 11) confirm that this strategy captures broader and more complete neighborhood information, especially in diagonal directions, which is crucial for image restoration.

In essence, EAMamba proposes a more scalable and comprehensive way to leverage Vision Mamba's linear complexity for image restoration, overcoming the efficiency limitations of adding more scan patterns and the critical issue of local context loss.
4. Methodology
4.1. Principles
The core idea behind EAMamba is to enhance the Vision Mamba framework for image restoration by simultaneously improving both computational efficiency and the comprehensiveness of spatial information capture. This is achieved through two main principles:
- Efficient Aggregation of Multiple Scanning Sequences: To overcome the computational and parameter overhead associated with increasing the number of scanning directions in previous Vision Mamba models, EAMamba introduces a Multi-Head Selective Scan (MHSS) module. This module uses a channel-splitting strategy to process multiple scanning sequences in parallel without linearly scaling computational costs or parameters.
- Holistic Spatial Context Understanding: To address the local pixel forgetting issue that arises when 2D images are flattened into 1D sequences, EAMamba proposes an all-around scanning mechanism. This mechanism integrates diverse scanning patterns (horizontal, vertical, diagonal, etc.) to ensure that comprehensive neighborhood information is captured, thereby preserving crucial local spatial relationships.

By combining these principles, EAMamba aims to create a Vision Mamba model that is both highly efficient and exceptionally effective at reconstructing high-quality images from degraded inputs.
4.2. Core Methodology In-depth (Layer by Layer)
The EAMamba framework adopts a UNet-like architecture [65], a common and effective design for image-to-image translation tasks due to its ability to capture both high-level semantic information and low-level spatial details.
The following figure (Figure 5 from the original paper) illustrates the overall EAMamba framework and the structure of its key components:
Figure 5. Diagram illustrating the EAMamba framework and its key components, including MambaFormer and MHSSM; feature map shapes are annotated, and the structure of the Multi-Head Selective Scan Module is shown. These modules address the local pixel forgetting issue in low-level vision tasks through an efficient all-around scanning mechanism.
4.2.1. Overview of the EAMamba Framework (Section 3.1)
The restoration process begins with a low-quality image $I_{LQ} \in \mathbb{R}^{H \times W \times 3}$, where $H$ is the height, $W$ is the width, and 3 represents the RGB color channels.
- Encoder Modules: EAMamba processes this input through three MambaFormer encoder modules. These modules operate at different scales, progressively reducing the spatial resolution and expanding the feature channels (e.g., from $C$ to $2C$, then $4C$). The encoder extracts hierarchical feature embeddings, capturing increasingly abstract representations of the input.
- Bottleneck Module: After the encoder, a bottleneck module processes the most abstract features at the lowest spatial resolution. This module typically consists of several MambaFormer blocks to further refine these high-level features.
- Decoder Modules: Following the bottleneck, three MambaFormer blocks serve as decoder modules. These modules progressively upsample the features, increasing spatial resolution while reducing feature channels, aiming to reconstruct the image details. Skip connections, characteristic of UNet-like architectures, would typically link corresponding encoder and decoder layers to transfer fine-grained spatial information, although not explicitly detailed in the text for EAMamba's UNet.
- Refinement Module: A final refinement module, containing additional MambaFormer blocks, processes the reconstructed features to produce a residual image $I_R$.
- Final Output: The final high-quality image is obtained by element-wise addition of the residual image and the original low-quality input: $I_{HQ} = I_R + I_{LQ}$. This residual learning approach is common in image restoration, as it is often easier for a model to learn the degradation (residual) than to directly map the degraded image to the clean image.

The core innovations enabling this process are the MambaFormer architecture and the Multi-Head Selective Scan Module (MHSSM), which facilitate efficient scanning and holistic spatial information capture.
4.2.2. MambaFormer Block (Section 3.2)
The MambaFormer block is a fundamental building block of EAMamba, utilized throughout the encoder, decoder, bottleneck, and refinement stages. Its structure is designed to capture long-range spatial dependencies while refining features.
As depicted in Figure 5(a), each MambaFormer block comprises two main components:
- Multi-Head Selective Scan Module (MHSSM): This component is responsible for token mixing, which refers to the process of aggregating information across different spatial locations (tokens) to capture long-range dependencies.
- Channel Multilayer Perceptron (channel MLP): This component is used for feature refinement, processing features across channels to enhance their representation.

The data flow within a MambaFormer block follows a common Transformer-like structure with Layer Normalization (LN) and residual connections:
First, the input feature $X$ is normalized using Layer Normalization and then passed to the MHSSM. The output of the MHSSM is added back to the input via a residual connection to produce $X'$:

$ X' = \mathrm{MHSSM}(\mathrm{LN}(X)) + X $

Where:

- $X$: The input feature map to the MambaFormer block. It typically has dimensions $H \times W \times C$ (Height x Width x Channels).
- $\mathrm{LN}(\cdot)$: The Layer Normalization operation [66]. It normalizes the activations across the feature dimension for each input sample, stabilizing training.
- $\mathrm{MHSSM}(\cdot)$: The Multi-Head Selective Scan Module, which captures long-range spatial dependencies and performs token mixing.
- $X'$: The intermediate feature map after the MHSSM and the first residual connection.

Second, the intermediate feature map $X'$ is again normalized using Layer Normalization and then passed to the channel MLP. The output of the channel MLP is added back to $X'$ via another residual connection to produce the final output $Y$:

$ Y = \mathrm{MLP}(\mathrm{LN}(X')) + X' $

Where:

- $X'$: The input feature map to the channel MLP branch.
- $\mathrm{MLP}(\cdot)$: The Channel Multilayer Perceptron, which refines features along the channel dimension.
- $Y$: The final output feature map of the MambaFormer block.

In this architecture, MHSSM focuses on capturing long-range spatial dependencies within input features, while the channel MLP refines these features to enhance the representation. The LN layers precede each component, and residual connections help prevent vanishing gradients and aid in training deeper networks.
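The two pre-norm residual equations translate directly into code. Below is a minimal PyTorch sketch, not the released EAMamba implementation; the `MambaFormerBlock` name, the GELU activation, and the `mlp_ratio` expansion factor are assumptions, and any module mapping a (B, H, W, C) tensor to the same shape can stand in for the MHSSM.

```python
import torch
import torch.nn as nn

class MambaFormerBlock(nn.Module):
    """Pre-norm block: token mixing via a selective-scan module, then a channel MLP."""

    def __init__(self, dim: int, mhssm: nn.Module, mlp_ratio: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.mhssm = mhssm                        # stand-in for the MHSSM
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(                 # simple channel feed-forward network
            nn.Linear(dim, dim * mlp_ratio),
            nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) feature map, channels last
        x = x + self.mhssm(self.norm1(x))         # X' = MHSSM(LN(X)) + X
        x = x + self.mlp(self.norm2(x))           # Y  = MLP(LN(X')) + X'
        return x

# Usage with an identity stand-in for the MHSSM (illustration only):
block = MambaFormerBlock(dim=64, mhssm=nn.Identity())
print(block(torch.randn(1, 32, 32, 64)).shape)    # torch.Size([1, 32, 32, 64])
```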
4.2.3. Multi-Head Selective Scan Module (MHSSM) with All-Around Scanning (Section 3.3)
The Multi-Head Selective Scan Module (MHSSM) is a critical innovation within the MambaFormer block that replaces the conventional two-dimensional Selective Scan (2DSS) module, enhancing efficiency and spatial information capture.
As illustrated in Figure 5(b), the MHSSM processes an input feature $X$ through two parallel branches, conceptually similar to a gated mechanism:

- Left Branch (Information Propagation Branch):
  - The input feature $X$ first passes through a Linear layer, which expands its feature channels from $C$ to $\gamma C$, where $\gamma$ is a pre-defined channel expansion factor (e.g., 2).
  - This expanded feature then undergoes a depth-wise convolution (DWConv2D) operation, which applies a separate filter to each input channel, maintaining channel independence while capturing local spatial patterns efficiently.
  - The output of DWConv2D is activated by a Sigmoid Linear Unit (SiLU) [67], a non-linear activation function.
  - The activated feature then enters the Multi-Head Selective Scan (MHSS) module, which is the core component for capturing long-range dependencies using the all-around scanning strategy.
  - Finally, the output of MHSS is normalized by Layer Normalization (LN) [66], producing the feature $X_1$.
- Right Branch (Gating Branch):
  - The input feature $X$ also passes through a Linear layer, expanding its channels by the same factor $\gamma$.
  - The output is then activated by a SiLU function, producing the feature $X_2$.
- Combination and Projection:
  - The outputs from both branches, $X_1$ and $X_2$, are combined through element-wise multiplication ($\odot$). This gating mechanism allows the model to selectively pass information, similar to how gates operate in LSTMs or Gated Linear Units.
  - A final linear projection reduces the merged output back to the original channel dimension $C$, producing the block's output $Y$.

The complete MHSSM process is mathematically formulated as:

$ X_1 = \mathrm{LN}(\mathrm{MHSS}(\mathrm{SiLU}(\mathrm{DWConv2D}(\mathrm{Linear}(X))))) $

$ X_2 = \mathrm{SiLU}(\mathrm{Linear}(X)) $

$ Y = \mathrm{Linear}(X_1 \odot X_2) $

Where:

- $X$: The input feature map to the MHSSM, with shape $H \times W \times C$.
- $\mathrm{Linear}(\cdot)$: A linear projection (fully connected layer) that maps $C$ input channels to $\gamma C$ channels, or $\gamma C$ channels back to $C$ channels.
- $\mathrm{DWConv2D}(\cdot)$: A depth-wise convolutional layer.
- $\mathrm{SiLU}(\cdot)$: The Sigmoid Linear Unit activation function.
- $\mathrm{MHSS}(\cdot)$: The Multi-Head Selective Scan operation, detailed below.
- $\mathrm{LN}(\cdot)$: Layer Normalization.
- $X_1$: The output of the left branch before the final projection.
- $X_2$: The output of the right branch (gating signal).
- $\odot$: Element-wise multiplication.
- $Y$: The final output feature map of the MHSSM block, with shape $H \times W \times C$.
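A compact sketch of this gated two-branch structure is shown below. It is an assumption-laden illustration rather than the paper's code: the `MHSSM` class name, the 3x3 kernel size, and the `mhss` stand-in module are placeholders.

```python
import torch
import torch.nn as nn

class MHSSM(nn.Module):
    """Gated two-branch module: Linear -> DWConv2D -> SiLU -> MHSS -> LN on the
    left branch, Linear -> SiLU on the gating branch, element-wise product,
    then a linear projection back to C channels."""

    def __init__(self, dim: int, mhss: nn.Module, expand: int = 2):
        super().__init__()
        hidden = dim * expand                       # gamma * C channels
        self.in_proj = nn.Linear(dim, hidden)       # left-branch expansion
        self.dwconv = nn.Conv2d(hidden, hidden, kernel_size=3,
                                padding=1, groups=hidden)  # depth-wise conv
        self.act = nn.SiLU()
        self.mhss = mhss                            # stand-in selective-scan module
        self.norm = nn.LayerNorm(hidden)
        self.gate_proj = nn.Linear(dim, hidden)     # right-branch (gating) expansion
        self.out_proj = nn.Linear(hidden, dim)      # project back to C channels

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C), channels last
        left = self.in_proj(x)
        left = self.dwconv(left.permute(0, 3, 1, 2)).permute(0, 2, 3, 1)
        left = self.norm(self.mhss(self.act(left)))            # X1
        gate = self.act(self.gate_proj(x))                      # X2
        return self.out_proj(left * gate)                       # Y = Linear(X1 ⊙ X2)

# Illustration with an identity stand-in for the MHSS:
module = MHSSM(dim=64, mhss=nn.Identity())
print(module(torch.randn(1, 32, 32, 64)).shape)                 # torch.Size([1, 32, 32, 64])
```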
4.2.3.1. Multi-Head Selective Scan (MHSS) (Section 3.3.1)
The Multi-Head Selective Scan (MHSS) is a key innovation for efficient long-range spatial information capture within the MHSSM. It deviates from traditional 2DSS by employing a multi-head approach with grouped feature processing.
The following figure (Figure 6 from the original paper) illustrates the MHSS mechanism:
Figure 6. Illustration of the Multi-Head Selective Scan (MHSS) with our proposed All-Around Scanning strategy.
Here's how MHSS operates:
- Channel Splitting (Split): The input feature $X$ (the activated feature from the MHSSM's left branch) is partitioned into $K$ groups along the channel dimension. This means that instead of a single scan operating on all channels at once, each scan operates on a smaller group of channels of size $C/K$ (with $C$ denoting the channel count of the MHSS input). This is the "Multi-Head" aspect, as each group can be thought of as a "head."
- Transformation (Transform): Each of these groups undergoes a transformation function. This Transform step is where the all-around scanning strategy (detailed next) is applied. It converts the 2D feature map within each group into one or more flattened 1D sequences, denoted as $S_i$ for group $i$.
- Selective Scanning (SelectiveScan): Each flattened 1D sequence from each group is then processed independently by a selective scanning mechanism. This is the core Mamba operation that models long-range dependencies with data-dependent parameters. It produces a corresponding output $S'_i$ for each group.
- Inverse Transformation (InverseTransform): The 1D outputs from the selective scan for each group are then reshaped back into 2D feature maps using an InverseTransform function, reversing the flattening operation.
- Concatenation (Concat): Finally, the 2D outputs from all groups are concatenated back along the channel dimension to form the comprehensive output feature $Y$.

The MHSS operations are formulated as:

$ \{X_i\}_{i=1}^{K} = \mathrm{Split}(X) $

$ S_i = \mathrm{Transform}(X_i), \quad S'_i = \mathrm{SelectiveScan}(S_i) $

$ Y = \mathrm{Concat}(\{\mathrm{InverseTransform}(S'_i)\}_{i=1}^{K}) $

Where:

- $X$: The input feature map to the MHSS module.
- $K$: The number of groups (heads) the input channels are split into.
- $\mathrm{Split}(\cdot)$: Operation that partitions $X$ into $K$ channel groups $X_i$.
- $\mathrm{Transform}(\cdot)$: The function that converts each 2D channel group into a flattened 1D sequence (or multiple sequences) for scanning. This is where the all-around scanning strategy is implemented.
- $S_i$: A flattened 1D input sequence for group $i$.
- $\mathrm{SelectiveScan}(\cdot)$: The core Mamba selective scanning operation applied to each 1D sequence.
- $S'_i$: The flattened 1D output sequence from the selective scan for group $i$.
- $\mathrm{InverseTransform}(\cdot)$: The function that reshapes the 1D output sequences back into 2D feature maps.
- $\mathrm{Concat}(\cdot)$: Operation that concatenates the 2D outputs from all groups back along the channel dimension.
- $Y$: The final output feature map of the MHSS module.
Key Advantage of MHSS: Unlike 2DSS, which would typically duplicate parameters and increase computational load for each additional scanning direction, MHSS maintains computational complexity comparable to a standard selective scan while processing multiple sequences. By splitting channels, it processes different scanning directions or patterns on different channel groups in parallel, then aggregates their results. This avoids the linear increase in computational and parameter overhead traditionally associated with incorporating more scanning directions, making it highly efficient.
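To illustrate the channel-grouping idea concretely, here is a hedged Python sketch. The `selective_scan` callable is a placeholder for the core Mamba scan, and the simple transpose used to vary the scan direction per head only approximates the richer all-around patterns described in the next subsection.

```python
import torch

def mhss(x: torch.Tensor, selective_scan, num_heads: int = 4) -> torch.Tensor:
    """Channel-grouped selective scan: split channels into K heads, flatten each
    head along its own scan direction, run a 1D selective scan per head, undo
    the flattening, and concatenate the heads back together.

    x: (B, H, W, C); selective_scan: callable mapping (B, L, C/K) -> (B, L, C/K).
    """
    b, h, w, _ = x.shape
    groups = torch.chunk(x, num_heads, dim=-1)        # K channel groups ("heads")
    outputs = []
    for i, g in enumerate(groups):
        # Transform: alternate between row-major and column-major flattening as a
        # stand-in for the paper's richer all-around scan directions.
        if i % 2 == 1:
            g = g.transpose(1, 2)                     # swap H and W (vertical scan)
        gh, gw = g.shape[1], g.shape[2]
        seq = g.reshape(b, gh * gw, -1)               # flatten to a 1D token sequence
        seq = selective_scan(seq)                     # core Mamba scan on this head
        g = seq.reshape(b, gh, gw, -1)                # inverse transform
        if i % 2 == 1:
            g = g.transpose(1, 2)                     # undo the swap
        outputs.append(g)
    return torch.cat(outputs, dim=-1)                 # concatenate heads along channels

# Illustration with an identity scan:
out = mhss(torch.randn(1, 8, 8, 64), selective_scan=lambda s: s)
print(out.shape)                                      # torch.Size([1, 8, 8, 64])
```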
4.2.3.2. All-Around Scanning (Section 3.3.2)
The all-around scanning strategy is the specific transformation function implemented within the MHSS. It is designed to overcome the local pixel forgetting issue and enable holistic spatial dependency understanding.
The following figure (Figure 3 from the original paper) illustrates the concept of all-around scanning:
Figure 3. Illustration of an all-around scanning approach that combines two-dimensional scanning and diagonal scanning.
And Figure 4 further highlights the issue of local pixel forgetting and the benefit of all-around scanning:
Figure 4. (a) Illustration of the local pixel forgetting phenomenon, where spatially adjacent pixels become distantly separated in the one-dimensional token sequence during scanning. The target pixel (highlighted in red square) and its adjacent pixels demonstrate how different scanning patterns affect spatial relationships. (b) The ERF visualization results averaged across the SIDD dataset [59], which depict improved spatial dependency preservation with the proposed all-around scanning approach.
Here's how it works:
- Multi-directional Scanning: Instead of relying solely on traditional horizontal and vertical scans (which constitute two-dimensional scanning), the all-around strategy executes selective scanning across a wider range of directions. These include:
  - Horizontal: Left-to-right, Right-to-left.
  - Vertical: Top-to-bottom, Bottom-to-top.
  - Diagonal: Top-left to Bottom-right, Bottom-right to Top-left.
  - Flipped Diagonal: Top-right to Bottom-left, Bottom-left to Top-right.
  - And their respective reversed orientations.
  This diverse set of scanning patterns ensures that information from all cardinal and intercardinal directions relative to a pixel is considered.
- Addressing Local Pixel Forgetting: The core problem is that flattening a 2D image into a 1D sequence can separate spatially adjacent pixels (Fig. 4a). For example, a pixel's diagonal neighbors might be far apart in a purely horizontal or vertical scan sequence. By incorporating diagonal and flipped diagonal scans, the all-around strategy ensures that these crucial neighboring pixels remain "close" in at least one of the scanning sequences, thus preserving their relationship.
- Holistic Information Capture: The combination of these multiple scans provides more complete neighborhood coverage. This strengthens spatial context understanding, especially for Effective Receptive Field (ERF) expansion (Fig. 4b). The paper notes that 2D scanning (as in MambaIR) struggles to capture information from diagonal directions, even with local convolution. All-around scanning explicitly addresses this gap, leading to a larger and more isotropic receptive field that is critical for detailed image restoration tasks.

Synergy with MHSS: The efficiency of MHSS is crucial here. Without it, incorporating such a comprehensive set of scanning directions would lead to prohibitive computational overhead and parameter counts. MHSS allows EAMamba to implement this all-around scanning strategy without excessive computational cost, making the holistic capture of visual information feasible for image restoration.
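One plausible way to realize the Transform step for these directions is to precompute index permutations for each scan pattern, as in the PyTorch sketch below; the exact diagonal traversal used by the paper may differ, so treat this as an assumption-labeled illustration.

```python
import torch

def scan_orders(h: int, w: int):
    """Index permutations that flatten an HxW grid along horizontal, vertical,
    diagonal, and flipped-diagonal directions, plus their reversed orientations."""
    idx = torch.arange(h * w).reshape(h, w)
    horizontal = idx.flatten()                                       # row-major
    vertical = idx.t().flatten()                                     # column-major
    diagonal = torch.cat([idx.diagonal(o) for o in range(-(h - 1), w)])
    flipped = torch.cat([torch.fliplr(idx).diagonal(o) for o in range(-(h - 1), w)])
    base = [horizontal, vertical, diagonal, flipped]
    return base + [torch.flip(o, dims=[0]) for o in base]            # reversed scans

def flatten_by(x: torch.Tensor, order: torch.Tensor) -> torch.Tensor:
    """Flatten a (B, H, W, C) feature map into a (B, L, C) sequence along `order`."""
    b, h, w, c = x.shape
    return x.reshape(b, h * w, c)[:, order, :]

def unflatten_by(seq: torch.Tensor, order: torch.Tensor, h: int, w: int) -> torch.Tensor:
    """Inverse of flatten_by: scatter the sequence back into its 2D layout."""
    out = torch.empty_like(seq)
    out[:, order, :] = seq
    return out.reshape(seq.shape[0], h, w, seq.shape[2])

# Example: round-trip a feature map through the diagonal scan order.
orders = scan_orders(4, 4)
x = torch.randn(1, 4, 4, 8)
assert torch.equal(unflatten_by(flatten_by(x, orders[2]), orders[2], 4, 4), x)
```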
5. Experimental Setup
5.1. Datasets
EAMamba's performance was evaluated across a range of image restoration tasks using several benchmark datasets.
- Image Denoising:
  - Training Data (for Synthetic Gaussian Color Denoising):
    - DIV2K [82]: A high-quality image dataset commonly used for various restoration tasks.
    - Flickr2K [83]: Another large-scale dataset of diverse images.
    - WED (Waterloo Exploration Database) [84]: Provides a rich set of natural images.
    - BSD (Berkeley Segmentation Dataset) [71]: Contains images widely used for segmentation and image processing research.
    - Usage: A single EAMamba model was trained to handle multiple noise levels using a combination of these datasets.
  - Evaluation Data (for Synthetic Gaussian Color Denoising):
    - CBSD68 [71]: Contains 68 images, often used for benchmarking denoising algorithms.
    - Kodak24 [73]: Comprises 24 high-quality images.
    - McMaster [72]: Another standard dataset for image quality assessment and restoration benchmarks.
    - Usage: Evaluated across multiple noise levels (σ = 15, 25, 50).
  - Evaluation Data (for Real-world Denoising):
    - SIDD (Smartphone Image Denoising Dataset) [59]: A high-quality denoising dataset specifically collected from smartphone cameras, capturing realistic noise patterns.
    - Usage: EAMamba was trained and evaluated on this dataset for real-world denoising scenarios. A sample from the SIDD dataset would typically be a pair of images: a noisy photograph taken by a smartphone and its corresponding clean (ground truth) version.
- Image Super-Resolution:
  - Evaluation Data:
    - RealSR [61]: A benchmark dataset specifically designed for real-world single image super-resolution, containing image pairs of low-resolution and high-resolution images captured by real cameras.
    - Usage: Evaluated for scaling factors x2, x3, and x4.
- Image Deblurring:
  - Training Data:
    - GoPro [62]: A dataset containing sharp and blurry image pairs, specifically designed for motion deblurring, with videos of real-world scenes.
  - Evaluation Data:
    - GoPro [62]: Used for evaluation.
    - HIDE (Human-aware motion deblurring) [86]: Another benchmark dataset for motion deblurring.
- Image Dehazing:
  - Training & Evaluation Data:
    - RESIDE (REalistic Single Image DEhazing) [63]: A large-scale benchmark dataset for image dehazing, including synthetic hazy images and corresponding ground truth clear images. Subsets like SOTS (Synthetic Objective Testing Set) are commonly used.
    - Usage: Trained and evaluated on this dataset. A sample would be a hazy image (e.g., a city street obscured by fog) and its clear counterpart.
- Ablation Studies:
  - Urban100 [87]: A dataset of 100 images commonly used for super-resolution and other image enhancement tasks, serving as a robust test set for ablation studies on Gaussian color image denoising.

These datasets were chosen because they are widely recognized benchmarks in their respective fields, ensuring a fair and comparable evaluation against state-of-the-art methods. They represent diverse degradation types and real-world scenarios, making them effective for validating the proposed method's performance.
5.2. Evaluation Metrics
The performance of EAMamba is primarily evaluated using two widely accepted objective image quality metrics: Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM), along with Floating Point Operations (FLOPs) for computational efficiency.
5.2.1. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed in terms of the logarithmic decibel scale. A higher PSNR value indicates a higher quality image, meaning less distortion or noise relative to the original. It is often used as a quality measurement for reconstruction of lossy compression codecs or for evaluating restoration algorithms.
- Mathematical Formula: The PSNR is defined as: $ \mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) $ where $\mathrm{MSE}$ is the Mean Squared Error between the original (ground truth) image and the reconstructed image: $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
- Symbol Explanation:
  - $\mathrm{PSNR}$: Peak Signal-to-Noise Ratio, measured in decibels (dB).
  - $\log_{10}$: Base-10 logarithm.
  - $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For an 8-bit grayscale image, this is 255. For color images where pixel values are normalized to [0, 1], $\mathrm{MAX}_I$ would be 1. The paper evaluates on RGB channels or the Y channel, implying values in the range 0-255 or 0-1.
  - $\mathrm{MSE}$: Mean Squared Error.
  - $m$: Number of rows (height) in the image.
  - $n$: Number of columns (width) in the image.
  - $I(i,j)$: The pixel value at coordinates $(i,j)$ in the original (ground truth) image.
  - $K(i,j)$: The pixel value at coordinates $(i,j)$ in the reconstructed (restored) image.
  - $[I(i,j) - K(i,j)]^2$: The squared difference between the corresponding pixel values.
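As a straightforward illustration of the formula above (not the paper's evaluation script), PSNR can be computed in a few lines of NumPy:

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between a ground-truth image and a restored image.
    max_val is 255 for 8-bit images or 1.0 for images scaled to [0, 1]."""
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

# Example: PSNR between a flat gray image and a noisy copy of it.
clean = np.full((64, 64), 128, dtype=np.uint8)
noisy = np.clip(clean + np.random.normal(0, 25, clean.shape), 0, 255).astype(np.uint8)
print(f"PSNR: {psnr(clean, noisy):.2f} dB")      # roughly 20 dB for sigma = 25
```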
5.2.2. Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is a perceptual metric that quantifies the perceived degradation in image quality caused by processing. Unlike PSNR, which focuses on absolute errors, SSIM attempts to measure the structural similarity between two images, mimicking the human visual system's perception. It considers three key factors: luminance, contrast, and structure. An SSIM value closer to 1 indicates higher similarity (better quality).
- Mathematical Formula: The SSIM between two windows $x$ and $y$ of common size is defined as: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
  - $\mathrm{SSIM}(x, y)$: Structural Similarity Index Measure between image patches $x$ and $y$.
  - $\mu_x$: The average (mean) of pixel values in window $x$.
  - $\mu_y$: The average (mean) of pixel values in window $y$.
  - $\sigma_x$: The standard deviation of pixel values in window $x$ (measure of contrast).
  - $\sigma_y$: The standard deviation of pixel values in window $y$ (measure of contrast).
  - $\sigma_{xy}$: The covariance between pixel values in windows $x$ and $y$ (measure of structural correlation).
  - $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$: Two constants included to avoid division by zero when the denominators are very small.
  - $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
  - $k_1, k_2$: Small constant values, typically $k_1 = 0.01$ and $k_2 = 0.03$. For a full image, SSIM is often calculated over various windows and then averaged to produce a single value.
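The following is a simplified, single-window NumPy sketch of the formula; benchmark implementations instead slide a Gaussian window over the image and average the local SSIM values, so treat this purely as an illustration of the equation.

```python
import numpy as np

def ssim_global(x: np.ndarray, y: np.ndarray, data_range: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """Single-window SSIM over whole grayscale images (illustrative only)."""
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

print(ssim_global(np.full((8, 8), 100.0), np.full((8, 8), 100.0)))  # 1.0 for identical images
```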
5.2.3. Floating Point Operations (FLOPs)
- Conceptual Definition: FLOPs (Floating Point Operations) is a common metric used to estimate the computational complexity of a model. It counts the number of floating-point arithmetic operations (additions, multiplications, divisions, etc.) required to process a single input. A lower FLOPs count indicates a more computationally efficient model, requiring less processing power and time.
- Measurement Context: The paper calculates FLOPs using fvcore [70] for all Vision Mamba methods, at a fixed input resolution for all experiments, ensuring a fair comparison across models (see the sketch after this list).
5.3. Baselines
EAMamba is compared against a comprehensive set of baseline models, including traditional CNN-based methods, Transformer-based methods, and other Vision Mamba variants.
- General Image Restoration (various tasks):
  - SwinIR [48] (ViT-based)
  - Restormer [51] (ViT-based)
  - UFormer-B [50] (ViT-based)
  - MPRNet [36]
  - HINet [37]
  - IPT [49] (ViT-based)
  - MAXIM-3S [85]
- Image Denoising Specific:
  - IRCNN [10]
  - FFDNet [11]
  - DnCNN [9]
  - BRDNet [12]
  - DRUNet [13]
  - BM3D [74]
  - CBDNet [8]
  - RIDNet [75]
  - VDN [76]
  - SADNet [77]
  - DANet [78]
  - CycleISP [79]
  - DeamNet [80]
  - DAGL [81]
- Vision Mamba Baselines (primary comparison for efficiency):
  - MambaIR [57] (Vision Mamba-based)
  - MambaIR-UNet [57] (Vision Mamba-based, UNet variant)
  - VMambaIR [58] (Vision Mamba-based)
- Image Deblurring Specific:
  - DeblurGAN-v2 [16]
  - SRN [15]
  - DBGAN [17]
  - DMPHN [18]
  - SPAIR [38]
  - MIMO-UNet+ [19]
  - Stripformer [43] (ViT-based)
  - SFNet [39]
- Image Dehazing Specific:
  - Dehamer (from table image, likely a Transformer-based model)
  - MAXIM-2S (from table image, related to MAXIM)
  - DehazeFormer-L (from table image, likely a Transformer-based model)

These baselines are representative because they cover a wide spectrum of approaches, from classical CNNs to cutting-edge Transformers and the latest Vision Mamba models. Comparing against them allows EAMamba to demonstrate its advancements in both performance and, crucially, computational efficiency within the rapidly evolving field of image restoration. Methods marked with * indicate those that train separate models for each noise level or use additional training data, providing context for comparison.
5.4. Training Details
The paper outlines a robust training protocol to ensure fair and effective evaluation of EAMamba.
- Training Iterations: The models are trained for a total of 450,000 iterations.
- Learning Rate Schedule: The initial learning rate is gradually decreased over training using a cosine annealing schedule. Cosine annealing is a technique where the learning rate is scheduled to decrease following a cosine curve, often leading to better convergence and performance.
- Optimizer: AdamW [69] is employed as the optimizer. AdamW is an Adam variant that decouples weight decay from the optimization step, which often improves regularization and performance.
  - Parameters: β1 and β2 are kept at the default values for Adam-like optimizers.
  - Weight decay is applied as part of the AdamW update.
- Loss Function: L1 loss (Mean Absolute Error) is used. L1 loss is less sensitive to outliers compared to L2 (Mean Squared Error) and often promotes sharper images in image restoration tasks.
- Progressive Training Strategy: Following the approach in [51] (Restormer), a progressive training strategy is adopted. This means training starts with smaller image patches and larger batch sizes, then gradually increases patch sizes and decreases batch sizes. This strategy helps the model learn fine details from smaller patches efficiently and then generalize to larger contexts (a code sketch of this schedule appears after this list).
  - Initial: Training begins with small patches and a batch size of 64.
  - Progression: These parameters are progressively adjusted at specific iteration milestones:
    - [160, 40] at 138K iterations
    - [192, 32] at 234K iterations
    - [256, 16] at 306K iterations
    - [320, 8] at 360K iterations
    - [384, 8] at 414K iterations
    The format [patch_size, batch_size] indicates the side length of the training patches and the number of samples processed in each batch.
- Data Augmentation: To improve model generalization and prevent overfitting, standard data augmentation techniques are applied:
  - Random horizontal flipping
  - Random vertical flipping
  - 90° rotation
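The sketch below shows one way to express this training protocol in PyTorch. The initial patch size, learning-rate endpoints, and weight decay value are placeholders (they are not specified in this summary), so treat the numbers marked as "assumed" accordingly.

```python
import torch
from torch.optim import AdamW
from torch.optim.lr_scheduler import CosineAnnealingLR

# Milestones from the list above as (start_iteration, patch_size, batch_size).
PROGRESSIVE_SCHEDULE = [
    (0,       128, 64),   # assumed initial patch size
    (138_000, 160, 40),
    (234_000, 192, 32),
    (306_000, 256, 16),
    (360_000, 320, 8),
    (414_000, 384, 8),
]
TOTAL_ITERS = 450_000

def patch_and_batch(iteration: int):
    """Return the (patch_size, batch_size) pair active at a given iteration."""
    current = PROGRESSIVE_SCHEDULE[0][1:]
    for start, patch, batch in PROGRESSIVE_SCHEDULE:
        if iteration >= start:
            current = (patch, batch)
    return current

model = torch.nn.Conv2d(3, 3, kernel_size=3, padding=1)                 # stand-in model
optimizer = AdamW(model.parameters(), lr=3e-4, weight_decay=1e-4)       # assumed values
scheduler = CosineAnnealingLR(optimizer, T_max=TOTAL_ITERS, eta_min=1e-6)  # assumed floor

print(patch_and_batch(0))          # (128, 64) under the assumed initial setting
print(patch_and_batch(240_000))    # (192, 32)
```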
5.5. Architecture details
The EAMamba framework utilizes a four-level UNet architecture [65]. This architecture is commonly used in image-to-image tasks due to its effectiveness in capturing multi-scale features and local details via skip connections.
- MambaFormer Blocks: The number of MambaFormer blocks (the core building blocks combining MHSSM and Channel MLP) varies at different levels of the UNet:
  - Encoder levels: [4, 6, 6, 7] MambaFormer blocks at the respective levels. This means 4 blocks in the first encoder stage, 6 in the second, and so on.
  - Refinement stage: Incorporates two MambaFormer blocks.
- Channel Dimension (C): The base channel dimension is maintained at a constant value of C = 64 throughout the network. This implies that feature maps at deeper encoder/decoder levels would typically have channel dimensions that are multiples of C (e.g., 2C, 4C).
- Channel MLP: A simple feed-forward network (FFN) [68] is utilized as the default channel MLP within the MambaFormer blocks. This choice is later justified by ablation studies as providing an optimal balance between performance and computational efficiency.
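The architecture details above can be summarized as a small configuration object; the dictionary below is purely illustrative, with key names that are our own shorthand rather than the authors' configuration format.

```python
# Illustrative configuration mirroring the architecture details above.
EAMAMBA_CONFIG = {
    "base_channels": 64,              # constant base channel dimension C
    "encoder_blocks": [4, 6, 6, 7],   # MambaFormer blocks per UNet level
    "refinement_blocks": 2,           # extra MambaFormer blocks in the refinement stage
    "channel_mlp": "simple_ffn",      # default channel MLP choice [68]
    "scan_directions": [              # all-around scanning patterns
        "horizontal", "vertical", "diagonal", "flipped_diagonal",
        "horizontal_reversed", "vertical_reversed",
        "diagonal_reversed", "flipped_diagonal_reversed",
    ],
}
```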
6. Results & Analysis
6.1. Core Results Analysis
The experimental results validate EAMamba's effectiveness and efficiency across various image restoration tasks.
The following figure (Figure 1 from the original paper) presents a high-level overview of computational efficiency versus image quality:
Figure 1. Computational efficiency versus image quality across model architectures. Our method (denoted by a red star) demonstrates superior efficiency compared to other Vision Mamba-based methods and existing approaches. EAMamba establishes a new efficiency frontier for Vision Mamba-based image restoration.
Analysis of Figure 1:
Figure 1 graphically illustrates the primary claim of the paper: EAMamba achieves a new efficiency frontier for Vision Mamba-based image restoration. The x-axis represents Computational Efficiency (likely inverse FLOPs or similar, implying higher values mean more efficient), and the y-axis represents Image Quality (e.g., PSNR, higher is better).
- Other Vision Mamba-based methods: These points typically fall below EAMamba, indicating either lower quality for similar efficiency or lower efficiency for similar quality.
- Existing approaches: These represent non-Mamba methods (e.g., CNNs, Transformers). They generally spread across the graph, but EAMamba consistently positions itself in the upper-right region, signifying a superior balance of high quality and high efficiency.
- EAMamba (red star): EAMamba points are consistently high on the quality axis and far to the right on the efficiency axis, clearly demonstrating its advantage. This figure serves as a compelling visual summary of the paper's main contribution.
6.1.1. Image Denoising (Section 4.2)
EAMamba is evaluated on both synthetic Gaussian color denoising and real-world denoising.
The following are the results from Table 1 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | CBSD68 [71] σ=15 | CBSD68 σ=25 | CBSD68 σ=50 | Kodak24 [73] σ=15 | Kodak24 σ=25 | Kodak24 σ=50 | McMaster [72] σ=15 | McMaster σ=25 | McMaster σ=50 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IRCNN [10] | - | - | 33.86 | 31.16 | 27.86 | 34.69 | 32.18 | 28.93 | 34.58 | 32.18 | 28.91 |
| FFDNet [11] | - | - | 33.87 | 31.21 | 27.96 | 34.63 | 32.13 | 28.98 | 34.66 | 32.35 | 29.18 |
| DnCNN [9] | - | - | 33.90 | 31.24 | 27.95 | 34.60 | 32.14 | 28.95 | 33.45 | 31.52 | 28.62 |
| BRDNet* [12] | - | - | 34.10 | 31.43 | 28.16 | 34.88 | 32.41 | 29.22 | 35.08 | 32.75 | 29.52 |
| DRUNet [13] | 32.6 | 144 | 34.30 | 31.69 | 28.51 | 35.31 | 32.89 | 29.86 | 35.40 | 33.14 | 30.08 |
| SwinIR* [48] | 11.5 | 788 | 34.42 | 31.78 | 28.56 | 35.34 | 32.89 | 29.79 | 35.61 | 33.20 | 30.22 |
| Restormer [51] | 26.1 | 141 | 34.39 | 31.78 | 28.59 | 35.44 | 33.02 | 30.00 | 35.55 | 33.31 | 30.29 |
| MambaIR* [57] | 15.8 | 1290 | 34.43 | 31.80 | 28.61 | 35.34 | 32.91 | 29.85 | 35.62 | 33.35 | 30.31 |
| EAMamba (Ours) | 25.3 | 137 | 34.43 | 31.81 | 28.62 | 35.36 | 32.95 | 29.91 | 35.59 | 33.34 | 30.31 |
Analysis of Table 1 (Synthetic Gaussian Denoising):
- Computational Efficiency: EAMamba (137 GFLOPs) is significantly more efficient than its Vision Mamba counterpart, MambaIR* (1290 GFLOPs). EAMamba uses approximately 11% of the FLOPs of MambaIR*. This demonstrates a massive reduction in computational cost, directly validating the efficiency claims of MHSSM. It is also more efficient than SwinIR* (788 GFLOPs) and comparable to Restormer (141 GFLOPs) and DRUNet (144 GFLOPs), while generally outperforming them in quality.
- Image Quality (PSNR): EAMamba achieves competitive or slightly better PSNR values across all datasets (CBSD68, Kodak24, McMaster) and noise levels (σ = 15, 25, 50) compared to MambaIR*. For instance, on CBSD68, EAMamba matches MambaIR* at σ = 15 (34.43 dB) and slightly surpasses it at σ = 25 (31.81 vs 31.80 dB) and σ = 50 (28.62 vs 28.61 dB). This shows that EAMamba maintains favorable performance despite the drastic reduction in FLOPs.
- Parameters: EAMamba has 25.3M parameters, which is higher than MambaIR*'s 15.8M. However, the FLOPs reduction is much more substantial, indicating that EAMamba's parameter usage is more efficient in terms of computational operations.
The following are the results from Table 2 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | SIDD [59] PSNR ↑ | SSIM ↑ |
| --- | --- | --- | --- | --- |
| DnCNN [9] | - | - | 23.66 | 0.583 |
| BM3D [74] | - | - | 25.65 | 0.685 |
| CBDNet* [8] | - | - | 30.78 | 0.801 |
| RIDNet* [75] | 1.5 | 98 | 38.71 | 0.951 |
| VDN [76] | 7.8 | 44 | 39.28 | 0.956 |
| SADNet* [77] | - | - | 39.46 | 0.956 |
| DANet* [78] | 63.0 | 30 | 39.47 | 0.957 |
| CycleISP* [79] | 2.8 | 184 | 39.52 | 0.957 |
| MIRNet [35] | 31.8 | 785 | 39.72 | 0.959 |
| DeamNet* [80] | 2.3 | 147 | 39.47 | 0.957 |
| MPRNet [36] | 15.7 | 588 | 39.71 | 0.958 |
| DAGL [81] | 5.7 | 273 | 38.94 | 0.953 |
| HINet [37] | 88.7 | 171 | 39.99 | 0.958 |
| IPT* [49] | 115.3 | 380 | 39.10 | 0.954 |
| MAXIM-3S [85] | 22.2 | 339 | 39.96 | 0.960 |
| UFormer-B [50] | 50.9 | 89 | 39.89 | 0.960 |
| Restormer [51] | 26.1 | 141 | 40.02 | 0.960 |
| MambaIR-UNet [57] | 26.8 | 230 | 39.89 | 0.960 |
| EAMamba (Ours) | 25.3 | 137 | 39.87 | 0.960 |
Analysis of Table 2 (Real-world Denoising on SIDD):
- Computational Efficiency: EAMamba (137 GFLOPs) achieves a 41% reduction in FLOPs compared to MambaIR-UNet (230 GFLOPs), while having a similar parameter count (25.3M vs 26.8M). This further reinforces the efficiency benefits of MHSSM. EAMamba is also more FLOPs-efficient than many high-performing Transformer-based models like Restormer (141 GFLOPs) and HINet (171 GFLOPs).
- Image Quality (PSNR & SSIM): EAMamba achieves a PSNR of 39.87 dB and SSIM of 0.960. While MambaIR-UNet has a slightly higher PSNR of 39.89 dB (a marginal difference of 0.02 dB), EAMamba matches its SSIM. This demonstrates that EAMamba maintains strong perceptual quality and structural fidelity with significantly reduced computational cost. Restormer leads in PSNR at 40.02 dB, showing that EAMamba is very competitive with state-of-the-art methods across different architectures.
The following figure (Figure 7 from the original paper) presents qualitative results for real-world denoising:
Figure 7. The image is a chart showing the PSNR values of different image restoration methods. On the left is the ground truth PSNR value of 33.11, followed by the low-quality image and the restoration results of MPRNet (39.62), UFormer-B (39.73), Restormer (40.47), MambaIR-UNet (37.45), and finally our EAMamba method with a PSNR value of 40.99.
Analysis of Figure 7 (Qualitative Results on SIDD):
The qualitative comparison shows normalized difference maps, where brighter areas indicate larger deviations from the ground truth. EAMamba's output (PSNR 40.99) shows closer correspondence to ground truth compared to other methods, indicating better detail preservation and fewer artifacts. This visual evidence supports the quantitative metrics, confirming EAMamba's effectiveness in denoising. (Note: The PSNR values in the figure's description are for a specific image, not the average dataset PSNR).
6.1.2. Image Super-Resolution (Section 4.3)
The following are the results from Table 3 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | x2 PSNR ↑ | x2 SSIM ↑ | x3 PSNR ↑ | x3 SSIM ↑ | x4 PSNR ↑ | x4 SSIM ↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Restormer [51] | 26.1 | 155 | 34.33 | 0.929 | 31.16 | 0.874 | 29.54 | 0.836 |
| MambaIR-UNet [57] | 26.8 | 230 | 34.20 | 0.927 | 31.16 | 0.872 | 29.53 | 0.835 |
| VMambaIR [58] | 26.3 | 200 | 34.16 | 0.927 | 31.14 | 0.872 | 29.56 | 0.836 |
| EAMamba (Ours) | 25.3 | 137 | 34.18 | 0.927 | 31.11 | 0.872 | 29.60 | 0.835 |
Analysis of Table 3 (Super-Resolution on RealSR):
- Computational Efficiency: EAMamba demonstrates superior computational efficiency with the lowest parameter count (25.3M) and the lowest FLOPs (137 GFLOPs) among all compared methods. This is notably lower than MambaIR-UNet (230 GFLOPs) and VMambaIR (200 GFLOPs), and also more efficient than Restormer (155 GFLOPs).
- Image Quality (PSNR & SSIM):
  - ×4 Scaling: EAMamba achieves the best PSNR (29.60 dB) at the ×4 scaling factor, outperforming all other methods, including VMambaIR (29.56 dB) and Restormer (29.54 dB).
  - ×2 and ×3 Scaling: EAMamba maintains a minimal PSNR gap (at most 0.05 dB) relative to the other Vision Mamba methods. For ×2, EAMamba (34.18 dB) is slightly lower than MambaIR-UNet (34.20 dB) and Restormer (34.33 dB). For ×3, EAMamba (31.11 dB) is slightly lower than MambaIR-UNet and Restormer (both 31.16 dB). SSIM values are broadly competitive. These results position EAMamba as an efficient solution that can also deliver leading super-resolution performance, particularly at higher scaling factors.
The following figure (Figure 8 from the original paper) presents qualitative results for image super-resolution:
Figure 8. The image is a diagram illustrating the performance comparison of different methods in the image restoration task using EAMamba, including the ground truth image, low-quality image, and the results of other restoration methods, with EAMamba achieving the highest PSNR value of 35.04 in the last column.
Analysis of Figure 8 (Qualitative Results for Super-Resolution):
The cropped difference results (normalized differences from ground truth) visually demonstrate EAMamba's superior capability in structural preservation and detail reconstruction. It appears to surpass other Vision Mamba baselines by producing outputs that are visually closer to the ground truth, validating its qualitative advantages.
6.1.3. Image Deblurring (Section 4.4)
The following are the results from Table 4 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | GoPro [62] PSNR ↑ | GoPro SSIM ↑ | HIDE [86] PSNR ↑ | HIDE SSIM ↑ |
| DeblurGAN-v2 [16] | - | - | 29.55 | 0.934 | 26.61 | 0.875 |
| SRN [15] | - | - | 30.26 | 0.934 | 28.36 | 0.915 |
| DBGAN [17] | - | - | 31.10 | 0.942 | 28.94 | 0.915 |
| DMPHN [18] | 21.7 | 195 | 31.20 | 0.940 | 29.09 | 0.924 |
| SPAIR [38] | - | - | 32.06 | 0.953 | 30.29 | 0.931 |
| MIMO-UNet+ [19] | 16.1 | 151 | 32.45 | 0.957 | 29.99 | 0.930 |
| MPRNet [36] | 20.1 | 760 | 32.66 | 0.959 | 30.96 | 0.939 |
| HINet [37] | 88.7 | 171 | 32.71 | 0.959 | 30.32 | 0.932 |
| IPT [49] | 115.3 | 380 | 32.52 | - | - | - |
| MAXIM-3S [85] | 22.2 | 339 | 32.86 | 0.961 | 32.83 | 0.956 |
| UFormer-B [50] | 50.9 | 89 | 33.06 | 0.967 | 30.90 | 0.953 |
| Restormer [51] | 26.1 | 141 | 32.92 | 0.961 | 31.22 | 0.942 |
| Stripformer [43] | 19.7 | 155 | 33.08 | 0.962 | 31.03 | 0.940 |
| SFNet [39] | 13.3 | 125 | 33.27 | 0.963 | 31.10 | 0.941 |
| EAMamba (Ours) | 25.3 | 137 | 33.58 | 0.966 | 31.42 | 0.944 |
Analysis of Table 4 (Image Deblurring on GoPro and HIDE):
- Performance on GoPro: EAMamba achieves superior performance on the GoPro benchmark, with a PSNR of 33.58 dB and an SSIM of 0.966. This surpasses the second-best result (SFNet, 33.27 dB) by 0.31 dB in PSNR, highlighting its strong deblurring capabilities. Its FLOPs (137 GFLOPs) are also competitive: higher than UFormer-B (89 GFLOPs) and SFNet (125 GFLOPs), but far lower than MPRNet (760 GFLOPs) and MAXIM-3S (339 GFLOPs).
- Performance on HIDE: On the HIDE benchmark, EAMamba (PSNR 31.42 dB, SSIM 0.944) ranks second, behind MAXIM-3S (PSNR 32.83 dB, SSIM 0.956). This indicates robust performance across different deblurring datasets.
The following figure (Figure 9 from the original paper) presents qualitative results for image deblurring:
Figure 9. The image is a chart that illustrates the PSNR (Peak Signal-to-Noise Ratio) performance comparison of EAMamba with other methods, including MPRNet, MAXIM-3S, Restormer, and SFNet, in image restoration tasks. Among the methods, EAMamba achieves a PSNR of 36.52, demonstrating its superiority in low-level vision tasks.
Analysis of Figure 9 (Qualitative Results for Deblurring):
The qualitative comparisons suggest that EAMamba excels in preserving details and producing images that are nearly indistinguishable from the ground truth. This is observed from the nearest objects (e.g., car bumper) to broader environmental contexts (e.g., brick pavement), reinforcing its strong deblurring capabilities.
6.1.4. Image Dehazing (Section 4.5)
The following are the results from Table 5 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | SOTS-Indoor [63] PSNR ↑ | SOTS-Indoor SSIM ↑ | SOTS-Outdoor [63] PSNR ↑ | SOTS-Outdoor SSIM ↑ |
| DehazeNet [28] | - | - | 21.05 | 0.793 | 20.30 | 0.771 |
| AOD-Net [29] | - | - | 22.25 | 0.817 | 20.89 | 0.785 |
| GridDehazeNet [30] | - | - | 30.73 | 0.947 | 35.53 | 0.984 |
| MSBDN [31] | - | - | 31.43 | 0.951 | 36.19 | 0.986 |
| FFANet [32] | - | - | 31.54 | 0.950 | 36.31 | 0.987 |
| DCP [33] | - | - | 31.95 | 0.953 | 36.94 | 0.988 |
| DeamNet [34] | - | - | 32.73 | 0.956 | 37.10 | 0.988 |
| MAXIM-3S [85] | 22.2 | 339 | 32.55 | 0.956 | 37.28 | 0.988 |
| Restormer [51] | 26.1 | 141 | 32.89 | 0.957 | 37.42 | 0.989 |
| EAMamba (Ours) | 25.3 | 137 | 33.05 | 0.958 | 37.52 | 0.989 |
Analysis of Table 5 (Image Dehazing on RESIDE):
- Computational Efficiency: EAMamba again demonstrates excellent efficiency with 25.3M parameters and 137 GFLOPs. It is slightly more efficient than Restormer (141 GFLOPs) and significantly more efficient than MAXIM-3S (339 GFLOPs).
- Image Quality (PSNR & SSIM): EAMamba achieves the highest PSNR on both the SOTS-Indoor (33.05 dB) and SOTS-Outdoor (37.52 dB) benchmarks, surpassing all other methods, including Restormer (32.89 dB and 37.42 dB, respectively) and MAXIM-3S (32.55 dB and 37.28 dB, respectively). It also achieves the highest or tied-highest SSIM values (0.958 for SOTS-Indoor, 0.989 for SOTS-Outdoor). This confirms EAMamba's superior dehazing capabilities while maintaining high efficiency.
The following figure (Figure 10 from the original paper) presents qualitative results for image dehazing:
Figure 10. The image is a comparative display, with the left side showing the ground truth image and the right side presenting low-quality images along with the results of various image restoration methods, including Dehamer, MAXIM-2S, DehazeFormer-L, and our EAMamba method, demonstrating that EAMamba achieves the highest PSNR value of 46.28 in image restoration, significantly outperforming other methods.
Analysis of Figure 10 (Qualitative Results for Dehazing):
The qualitative results show that EAMamba exhibits superior detail preservation with minimal deviation from ground truth. The reconstructed image by EAMamba appears clearer and more faithful to the original scene compared to other methods, reinforcing its quantitative advantages.
6.2. Effectiveness of Various Scanning Strategies (Section 4.6)
The paper further investigates the impact of different scanning strategies, particularly highlighting the benefits of the proposed all-around scanning.
The following figure (Figure 11 from the original paper) illustrates the ERF results for different scanning strategies:
Figure 11. Illustration of the ERF results for different scanning strategies, including two-dimensional scan, diagonal scan, zigzag scan, Z-order scan, Hilbert scan, and our all-around scan with reversing and flipping.
Analysis of Figure 11 (ERF Results for Different Scanning Strategies):
The Effective Receptive Field (ERF) visualization shows which input pixels contribute to the output of a target pixel through gradient flow analysis.
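A minimal sketch of this gradient-flow ERF measurement is given below, assuming a generic PyTorch restoration model whose output has the same spatial size as its input; using a single random input (rather than averaging over many natural images) and the max-normalization are simplifications for illustration.

```python
import torch

def effective_receptive_field(model: torch.nn.Module, size: int = 128) -> torch.Tensor:
    """Gradient of a central output pixel with respect to the input image.

    The returned (size, size) map is brighter where input pixels influence the
    central output pixel more strongly; in practice this is averaged over many
    real images rather than the single random input used here.
    """
    model.eval()
    x = torch.randn(1, 3, size, size, requires_grad=True)
    y = model(x)                                    # (1, C, H, W) restored output
    y[0, :, size // 2, size // 2].sum().backward()  # scalar target: central output pixel
    erf = x.grad.detach().abs().mean(dim=1)[0]      # (H, W) magnitude of input influence
    return erf / (erf.max() + 1e-12)                # normalize for visualization
```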
-
(a) 2D Scan: Shows strong horizontal and vertical dependencies but weaker diagonal influence.
-
(b) Diagonal Scan: Clearly captures global information along diagonal paths.
-
(c) Zigzag Scan: Also captures diagonal information.
-
(d) Z-order Scan: Gathers global information but with discontinuities.
-
(e) Hilbert Scan: Appears to have a less organized, less global information capture pattern in this visualization, indicating it may struggle with broad context.
(f) All-Around Scan (2D + Diagonal + Reversing & Flipping): This combines multiple patterns, resulting in a significantly broader and more isotropic (uniform in all directions) ERF. This directly supports the claim that all-around scanning captures both global and local contextual information more comprehensively, especially covering diagonal directions missed by simple 2D scans. The expanded ERF is crucial for preserving local information around target pixels, which is vital for image restoration.
The following are the results from Table 6 of the original paper:
| 2D | Diagonal | Zigzag | Z-order | Hilbert | All-around |
| 39.80 | 39.79 | 39.77 | 39.74 | 39.74 | 39.87 |
Analysis of Table 6 (Average PSNR (dB) on SIDD for various scan strategies):
This table quantifies the benefits of all-around scanning.
-
Individual scanning strategies (2D, Diagonal, Zigzag, Z-order, Hilbert) yield PSNR values ranging from 39.74 dB to 39.80 dB.
-
The All-around scanning strategy (which combines multiple patterns) achieves the highest PSNR of 39.87 dB. This represents an improvement of 0.07 dB to 0.13 dB over the individual scanning patterns, quantitatively confirming its effectiveness in enhancing image quality by capturing more comprehensive spatial information and mitigating local pixel forgetting.
The following are the results from Table 7 of the original paper:
| Dataset | 2D + Diagonal | 2D + Z-order | 2D + Hilbert | 2D + Diagonal + Z-order |
| SIDD [59] | 39.87 | 39.82 | 39.83 | 39.83 |
| RealSR ×4 [61] | 29.60 | 29.58 | 29.51 | 29.57 |
| GoPro [62] | 33.58 | 33.51 | 33.66 | 33.56 |
| SOTS-Indoor [63] | 43.19 | 43.20 | 43.07 | 43.37 |
Analysis of Table 7 (Average PSNR (dB) on various image restoration datasets with different combinations of scanning strategies):
This table explores different combinations of scanning strategies within the all-around approach.
- The combination of 2D + Diagonal scanning consistently yields good performance across all tested datasets (SIDD, RealSR ×4, GoPro, SOTS-Indoor), achieving 39.87 dB, 29.60 dB, 33.58 dB, and 43.19 dB, respectively. This combination is chosen as the default configuration for EAMamba, striking a good balance between complexity and performance.
- Other combinations such as 2D + Z-order and 2D + Hilbert show slightly lower or comparable performance. 2D + Diagonal + Z-order sometimes performs very well (e.g., SOTS-Indoor with 43.37 dB), indicating that specific combinations may be optimal for particular tasks. The flexibility of the MHSSM to seamlessly incorporate novel scanning strategies is highlighted, allowing further optimization for specific use cases; a minimal sketch of how such scan orderings can be constructed as index permutations is given below.
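To make these scanning patterns concrete, the following sketch constructs a few scan orderings as index permutations of an H×W grid and shows the flatten/unflatten round trip (the Transform / InverseTransform idea); the helper names and the specific diagonal traversal are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def scan_orders(h: int, w: int) -> dict:
    """Flattening orders (permutations of pixel indices) for a few scan patterns."""
    idx = np.arange(h * w).reshape(h, w)
    orders = {
        "2d_row": idx.reshape(-1),        # row-major raster scan
        "2d_col": idx.T.reshape(-1),      # column-major raster scan
        "diagonal": np.concatenate(       # anti-diagonals from top-left to bottom-right
            [np.diag(np.fliplr(idx), k) for k in range(w - 1, -h, -1)]
        ),
    }
    # "Reversing" a pattern simply traverses the same path backwards.
    orders.update({f"{name}_rev": order[::-1] for name, order in list(orders.items())})
    return orders

# Transform / InverseTransform round trip with a chosen order.
h, w = 4, 4
feat = np.arange(h * w).reshape(h, w)
order = scan_orders(h, w)["diagonal"]
seq = feat.reshape(-1)[order]            # Transform: 2D map -> scan-ordered 1D sequence
restored = np.empty_like(seq)
restored[order] = seq                    # InverseTransform: 1D sequence -> raster order
assert (restored.reshape(h, w) == feat).all()
```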
6.3. Ablation Studies (Section 4.7)
6.3.1. The Effectiveness of MHSS and All-Around Scan
The following are the results from Table 8 of the original paper:
| Method | Param. (M) ↓ | FLOPs (G) ↓ | Urban100 [87]: σ = 15 | σ = 25 | σ = 50 |
| Baseline | 31.1 | 286 | 35.15 | 33.00 | 30.08 |
| + MHSSM | 25.3 | 137 | 35.06 | 32.89 | 29.95 |
| + all-around scan | 25.3 | 137 | 35.10 | 32.93 | 30.01 |
Analysis of Table 8 (Average PSNR (dB) on Urban100 with Gaussian color image denoising for different design choices):
This ablation study quantifies the individual contributions of MHSSM and the all-around scanning strategy.
- Baseline: The Baseline model uses 2DSSM [57] with conventional 2D scanning. It has 31.1M parameters and 286 GFLOPs.
- + MHSSM: Substituting 2DSSM with MHSSM (while still using 2D scanning) drastically reduces the computational cost: FLOPs are more than halved, from 286 GFLOPs to 137 GFLOPs, and parameters decrease from 31.1M to 25.3M. The PSNR values show only a negligible decrease (e.g., 35.15 to 35.06 dB at σ = 15). This quantitatively demonstrates MHSSM's capacity for significant computational savings with minimal detriment to quality.
- + all-around scan: Starting from the + MHSSM configuration, incorporating the all-around scanning strategy (instead of only 2D scanning) improves the PSNR values (e.g., 35.06 to 35.10 dB at σ = 15) without any increase in parameters or FLOPs (remaining at 25.3M and 137 GFLOPs). This confirms that all-around scanning delivers better image quality than conventional 2D scanning by capturing spatial information more completely.
In summary, this study strongly supports EAMamba's design choices: MHSSM provides substantial efficiency, and all-around scanning boosts performance without additional computational burden. A minimal code sketch of the channel-splitting idea behind MHSSM follows.
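The following is a minimal, hedged sketch of the channel-splitting idea: the channels are divided into groups, each group is flattened along a different scan order, passed through its own 1D sequence module (a simple Conv1d stands in for the selective-scan core here), and the groups are re-assembled and mixed. Module and argument names are assumptions for illustration; the point is only that the total channel width, and hence the per-layer cost, stays roughly constant as scan directions are added.

```python
import torch
import torch.nn as nn

class MultiHeadSelectiveScanSketch(nn.Module):
    """Channel-splitting sketch: each channel group is scanned along its own path."""

    def __init__(self, channels: int, orders: list, make_scan=None):
        super().__init__()
        assert channels % len(orders) == 0
        self.orders = orders                           # one 1D index ordering per group
        group_ch = channels // len(orders)
        make_scan = make_scan or (lambda c: nn.Conv1d(c, c, 3, padding=1))
        # One sequence model per group; stand-in for the 1D selective-scan (SSM) core.
        self.scans = nn.ModuleList([make_scan(group_ch) for _ in orders])
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)  # aggregate the groups

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape
        groups = x.chunk(len(self.orders), dim=1)
        outs = []
        for xg, scan, order in zip(groups, self.scans, self.orders):
            seq = xg.flatten(2)[..., order]            # Transform: group -> scan-ordered sequence
            seq = scan(seq)                            # group-wise 1D sequence modeling
            yg = torch.empty_like(seq)
            yg[..., order] = seq                       # InverseTransform: back to raster order
            outs.append(yg.view(b, -1, h, w))
        return self.proj(torch.cat(outs, dim=1))       # channel-wise aggregation

# Illustrative usage: two groups scanned in row-major and column-major order.
h = w = 16
row = torch.arange(h * w)
col = torch.arange(h * w).reshape(h, w).t().reshape(-1)
mhss = MultiHeadSelectiveScanSketch(channels=64, orders=[row, col])
out = mhss(torch.randn(1, 64, h, w))   # same shape as the input: (1, 64, 16, 16)
```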
6.3.2. Comparison of the Various Channel MLPs
The following are the results from Table 9 of the original paper:
| Channel MLP | Param. (M) ↓ | FLOPs (G) ↓ | Urban100 [87]: σ = 15 | σ = 25 | σ = 50 |
| None | 16.5 | 90 | 34.98 | 32.79 | 29.82 |
| FFN | 28.3 | 153 | 35.10 | 32.93 | 30.01 |
| GDFN | 34.5 | 189 | 35.15 | 32.98 | 30.08 |
| Simple FFN | 25.3 | 137 | 35.10 | 32.93 | 30.01 |
| CA | 28.3 | 123 | 35.05 | 32.88 | 29.95 |
Analysis of Table 9 (Average PSNR (dB) on Urban100 with Gaussian color image denoising for different channel MLP choices):
This ablation study investigates the impact of different channel MLP designs within the MambaFormer Block. The base configuration here likely includes MHSSM and all-around scanning.
- None: Excluding the channel MLP entirely leads to a performance decline of more than 0.1 dB in PSNR (e.g., 34.98 dB at σ = 15), demonstrating that the channel MLP is essential for feature refinement. It also has the lowest parameters and FLOPs, as expected.
- FFN (Vanilla Feed-Forward Network): A standard FFN yields good performance (35.10 dB) but with higher parameters (28.3M) and FLOPs (153 GFLOPs).
- GDFN (Gated-Dconv FFN) [51]: This design demonstrates the best performance (35.15 dB at σ = 15), achieving the highest PSNR values, but it also has the highest parameter count (34.5M) and FLOPs (189 GFLOPs).
- Simple FFN [68]: This is the chosen default configuration for EAMamba. It achieves the same PSNR as the vanilla FFN (35.10 dB) but with fewer parameters (25.3M vs 28.3M) and lower FLOPs (137 GFLOPs vs 153 GFLOPs), indicating that Simple FFN strikes an optimal balance between performance and computational efficiency.
- CA (Channel Attention) [6]: Channel Attention also offers good efficiency (123 GFLOPs) but slightly lower performance (35.05 dB) compared to Simple FFN.
The choice of Simple FFN as the default channel MLP is justified by its ability to provide strong performance with a significantly reduced computational footprint, aligning with EAMamba's overall goal of efficiency. A hedged code sketch contrasting a vanilla FFN with a gated simple variant follows.
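To ground the comparison, below is a hedged sketch of a vanilla FFN alongside a gated "simple" variant operating as a per-pixel channel MLP; the expansion factor and the split-and-multiply gate are assumptions made for illustration and need not match the exact design of [68], but they show why such a variant can match a vanilla FFN with fewer parameters.

```python
import torch
import torch.nn as nn

class VanillaFFN(nn.Module):
    """Per-pixel channel MLP: 1x1 expand -> nonlinearity -> 1x1 project."""
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.act = nn.GELU()
        self.fc2 = nn.Conv2d(hidden, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.fc2(self.act(self.fc1(x)))

class SimpleGatedFFN(nn.Module):
    """Gated variant: split the expanded features in half and multiply the halves.

    The multiplicative gate replaces the explicit activation, and the projection
    only sees half the expanded width, trimming parameters and FLOPs relative to
    VanillaFFN at the same expansion factor (a sketch, not the exact design of [68]).
    """
    def __init__(self, channels: int, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.fc1 = nn.Conv2d(channels, hidden, kernel_size=1)
        self.fc2 = nn.Conv2d(hidden // 2, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        a, b = self.fc1(x).chunk(2, dim=1)   # split expanded features into two halves
        return self.fc2(a * b)               # element-wise gate, then project back

# Parameter comparison for a 64-channel block: the gated variant is smaller.
count = lambda m: sum(p.numel() for p in m.parameters())
print(count(VanillaFFN(64)), count(SimpleGatedFFN(64)))
```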
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces EAMamba, an Efficient All-Around Vision State Space Model specifically designed for image restoration. EAMamba significantly advances existing Vision Mamba frameworks by addressing two critical limitations: the computational overhead associated with multiple scanning sequences and the problem of local pixel forgetting when processing 2D images as 1D sequences.
The key innovations are:
-
Multi-Head Selective Scan Module (MHSSM): This module efficiently processes and aggregates multiple flattened 1D sequences through a channel-splitting strategy. It dramatically improves the scalability and computational efficiency of Vision Mamba frameworks by avoiding the linear increase in computational and parameter costs that would otherwise arise from incorporating additional scanning directions.
-
All-Around Scanning Mechanism: By integrating diverse scanning patterns (horizontal, vertical, diagonal, and their reversed orientations), this mechanism captures holistic spatial information. This effectively resolves the local pixel forgetting issue, ensuring that crucial local pixel relationships are preserved, which is paramount for high-quality image restoration.
Extensive experimental evaluations across various image restoration tasks (super-resolution, denoising, deblurring, and dehazing) consistently demonstrate EAMamba's efficacy. The model achieves a significant 31-89% reduction in FLOPs compared to existing low-level Vision Mamba methods, while maintaining, and in several cases improving, performance in terms of PSNR and SSIM. Both qualitative and quantitative results validate EAMamba's architectural innovations, positioning it as a highly efficient and effective solution for diverse image restoration challenges.
7.2. Limitations & Future Work
The paper does not explicitly state a dedicated "Limitations" section. However, based on the results and the nature of the work, some potential aspects could be considered limitations or areas for future work:
- Computational Overhead of Transformation/Inverse Transformation: While MHSSM efficiently aggregates selective scans, the Transform and InverseTransform steps (flattening 2D to 1D and vice versa) for multiple directions might introduce some overhead, especially for very high-resolution images or numerous scanning patterns. The paper focuses on the FLOPs of the Mamba core, but the overall efficiency includes these reshaping operations.
- Optimality of Scanning Patterns: The "all-around" strategy is a fixed set of patterns (e.g., 2D + Diagonal). It is possible that for specific degradation types or image content, other adaptive or learned scanning patterns could yield further improvements. The paper acknowledges this by stating "Other combinations can be employed for specific use cases" and that it "offers the flexibility to seamlessly incorporate novel scanning strategies with MHSSM," implying this is an open area.
- Generalizability to Other Vision Tasks: While validated on image restoration, EAMamba's innovations could potentially benefit other low-level or even high-level vision tasks (e.g., semantic segmentation, object detection) where long-range dependencies and local context are crucial. The current evaluation is limited to restoration.
- Hardware-Specific Optimizations: The efficiency gains are measured in FLOPs. Actual runtime performance can depend heavily on hardware architecture and memory access patterns. Further work could explore hardware-aware optimizations for the multi-head selective scan.
- Hyperparameter Sensitivity: The choice of the channel expansion factor and the number of groups for MHSS are critical design decisions. The paper does not provide an ablation for these specific hyperparameters, which could influence the optimal balance of efficiency and performance.
7.3. Personal Insights & Critique
EAMamba presents a highly commendable advancement in the application of State Space Models to computer vision, particularly for image restoration.
Personal Insights:
- Elegant Solution to Mamba's 2D Challenges: The paper provides an elegant and practical solution to two core problems faced by initial Vision Mamba adaptations: the inefficiency of adding more scanning directions and the loss of local spatial context. MHSSM's channel-splitting approach is a clever way to parallelize scanning operations without linearly increasing computational cost, while the all-around scanning strategy is a direct and effective counter to local pixel forgetting.
- Efficiency Frontier: The quantitative results, especially the dramatic reduction in FLOPs (31-89%) while maintaining or even surpassing performance, are genuinely impressive. This is crucial for real-world deployment where computational resources are often constrained. EAMamba truly seems to establish a new "efficiency frontier" as claimed.
- Versatility: Demonstrating strong performance across four diverse image restoration tasks (super-resolution, denoising, deblurring, dehazing) highlights the robustness and generalizability of the proposed architecture. This suggests that the underlying principles of efficient long-range dependency modeling and comprehensive local context capture are broadly applicable.
- Foundation for Future SSM-Vision Research: EAMamba provides a solid blueprint for how to effectively adapt Mamba's linear complexity to the inherently 2D nature of images. The MHSSM could become a standard component in future Vision Mamba architectures seeking higher efficiency and more comprehensive spatial understanding.
Critique:
-
Lack of Detailed Failure Analysis: While qualitative results show EAMamba performing well, a deeper analysis of specific failure cases or common artifacts produced (or prevented) by EAMamba compared to baselines could offer further insights. What kinds of degradations does it struggle with? Where does it still fall short of the ground truth, and why?
-
"All-Around" Specificity: The paper states "the combination of the 2D scan and the diagonal scan generally yields good performance and is set as the default configuration." While this is practical, a more in-depth discussion on why this specific combination is optimal, or if it can be dynamically adapted, could be beneficial. The term "all-around" suggests a maximally comprehensive approach, but in practice, a subset was chosen for efficiency.
-
Parameter Explanation in Formulas: While the paper generally explains the symbols in its formulas, a more explicit breakdown of how the Transform and InverseTransform functions work (e.g., which specific scanning patterns are used for each group, and how the 2D-to-1D flattening is done) would help a beginner fully grasp the technical details without referring to external Mamba implementations.
Comparison to Non-Mamba SOTA: While EAMamba is clearly superior to other Vision Mamba methods in efficiency, and competitive with SOTA Transformer methods, a more direct comparison showing specific scenarios where it outright beats the best Transformer models in terms of quality (e.g., Restormer, MAXIM-3S) would further strengthen its position, beyond just showing FLOPs reduction. In some tables, Restormer or MAXIM-3S still have higher PSNR.
Overall, EAMamba is a significant contribution, pushing the boundaries of efficient and effective image restoration using state space models. Its innovations are well-motivated and empirically validated, offering a promising direction for future research in computer vision.