Paper status: completed

$\text{S}^{3}$Mamba: Arbitrary-Scale Super-Resolution via Scaleable State Space Model

Published: 11/16/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

S³Mamba uses a scalable state space model and scale-aware attention for arbitrary-scale image super-resolution, overcoming high complexity and weak long-range modeling, achieving state-of-the-art performance with linear complexity and strong generalization.

Abstract

Arbitrary scale super-resolution (ASSR) aims to super-resolve low-resolution images to high-resolution images at any scale using a single model, addressing the limitations of traditional super-resolution methods that are restricted to fixed-scale factors (e.g., ×2, ×4). The advent of Implicit Neural Representations (INR) has brought forth a plethora of novel methodologies for ASSR, which facilitate the reconstruction of original continuous signals by modeling a continuous representation space for coordinates and pixel values, thereby enabling arbitrary-scale super-resolution. Consequently, the primary objective of ASSR is to construct a continuous representation space derived from low-resolution inputs. However, existing methods, primarily based on CNNs and Transformers, face significant challenges such as high computational complexity and inadequate modeling of long-range dependencies, which hinder their effectiveness in real-world applications. To overcome these limitations, we propose a novel arbitrary-scale super-resolution method, called S³Mamba, to construct a scalable continuous representation space. Specifically, we propose a Scalable State Space Model (SSSM) to modulate the state transition matrix and the sampling matrix of step size during the discretization process, achieving scalable and continuous representation modeling with linear computational complexity. Additionally, we propose a novel scale-aware self-attention mechanism to further enhance the network's ability to perceive global important features at different scales, thereby building the S³Mamba to achieve superior arbitrary-scale super-resolution. Extensive experiments on both synthetic and real-world benchmarks demonstrate that our method achieves state-of-the-art performance and superior generalization capabilities at arbitrary super-resolution scales.

In-depth Reading

English Analysis

1. Bibliographic Information

  • Title: S³Mamba: Arbitrary-Scale Super-Resolution via Scaleable State Space Model
  • Authors: Peizhe Xia, Long Peng, Xin Di, Renjing Pei, Yang Wang, Yang Cao, Zheng-Jun Zha.
  • Affiliations: The authors are affiliated with the University of Science and Technology of China and Huawei Noah's Ark Lab.
  • Journal/Conference: This paper is a preprint available on arXiv. As of its publication date, it has not yet been published in a peer-reviewed conference or journal. arXiv is a common platform for researchers to share their work early.
  • Publication Year: 2024 (Published on arXiv on November 16, 2024).
  • Abstract: The paper tackles Arbitrary-Scale Super-Resolution (ASSR), which aims to use a single model to upscale images to any desired resolution. Current methods, particularly those based on Implicit Neural Representations (INR), often rely on CNNs or Transformers, which suffer from high computational complexity and limited ability to model long-range dependencies. To address this, the authors propose S³Mamba, a novel method built upon a Scalable State Space Model (SSSM). The SSSM modulates its internal matrices based on the desired scale factor, enabling scalable and continuous image representation with linear computational complexity. Additionally, a scale-aware self-attention mechanism is introduced to improve the model's perception of global features at different scales. The authors claim that extensive experiments show their method achieves state-of-the-art (SOTA) performance and better generalization on both synthetic and real-world datasets.

2. Executive Summary

Background & Motivation (Why)

The core problem is the inflexibility of traditional image super-resolution (SR) models, which are trained for and restricted to fixed integer scale factors (e.g., ×2, ×4). Arbitrary-Scale Super-Resolution (ASSR) aims to overcome this by creating a single model that can upscale an image to any continuous scale factor (e.g., ×2.7, ×5.1).

A popular and effective approach for ASSR is using Implicit Neural Representations (INR), which learn a continuous function that maps pixel coordinates to their corresponding color values. However, existing INR-based methods face significant challenges:

  1. MLP-based methods (like LIIF) have a limited receptive field. They process each coordinate independently, failing to capture the global context and long-range dependencies within the image, which can lead to artifacts and less detailed reconstructions.

  2. Transformer-based methods (like CiaoSR) can model global context effectively using self-attention, but this comes at the cost of quadratic computational complexity ($O(N^2)$), making them slow and resource-intensive, especially for high-resolution images.

    This creates a gap: a need for an efficient model that can capture global context with linear complexity while being adaptable to arbitrary scales.
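For a rough sense of the size of that gap, here is a back-of-the-envelope operation count; the 128×128 feature-map size is an illustrative assumption, not a figure from the paper.

```python
# Illustrative comparison of how self-attention and an SSM scan grow with the
# number of feature tokens; the 128x128 feature-map size is an assumption.
n_tokens = 128 * 128                 # tokens in a hypothetical 128x128 LR feature map
attention_pairs = n_tokens ** 2      # quadratic: ~2.68e8 pairwise interactions
ssm_scan_steps = n_tokens            # linear: ~1.64e4 recurrent scan steps
print(f"self-attention pairs: {attention_pairs:,}  vs.  SSM scan steps: {ssm_scan_steps:,}")
```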

Main Contributions / Findings (What)

The paper introduces S³Mamba to fill this gap. Its main contributions are:

  1. Pioneering Use of SSM in ASSR: The paper is the first to introduce State Space Models (SSMs), specifically the Mamba architecture, to the task of arbitrary-scale super-resolution.
  2. A Novel Scalable State Space Model (SSSM): This is the core technical innovation. The SSSM is designed to be scale-aware. It dynamically modulates its internal parameters, namely the state transition matrix $A$ and the sampling matrix $B$, based on the target scale and coordinate information. This allows the model to create a consistent continuous representation of an image across different scales, all while maintaining the linear computational complexity ($O(N)$) of SSMs.
  3. A Scale-Aware Self-Attention Mechanism: Built upon the SSSM, this mechanism generates a global attention map that is conditioned on the scale and coordinates. This further enhances the network's ability to focus on relevant features for the specific target resolution, improving reconstruction quality.
  4. State-of-the-Art Performance: The combined S³Mamba framework is shown to achieve superior or competitive performance on the synthetic DIV2K benchmark and demonstrates strong generalization and artifact reduction on the real-world COZ benchmark, outperforming previous methods in both quantitative metrics and visual quality.

3. Prerequisite Knowledge & Related Work

Foundational Concepts

  • Image Super-Resolution (SR): The task of generating a high-resolution (HR) image from a low-resolution (LR) input. Traditional methods train a separate model for each integer scale factor (e.g., a ×2 model, a ×4 model).

  • Arbitrary-Scale Super-Resolution (ASSR): A more advanced form of SR where a single model can upscale an LR image by any non-integer or large integer scale factor. This is more practical for real-world applications where a specific output resolution is required.

  • Implicit Neural Representations (INR): A technique where a neural network, typically a Multi-Layer Perceptron (MLP), learns a continuous function $f(\text{coord}) \rightarrow \text{value}$. For images, this means mapping a 2D coordinate (x, y) to its RGB color. By querying this function at any coordinate in a grid, an image of any resolution can be generated. As shown in Figure 1(b), INR aims to model the continuous signal of the real world from discrete LR pixels.

    Figure 1 shows: (a) the real-world imaging process where a continuous scene is discretized into a digital image; (b) the ideal goal of ASSR using INR to reconstruct a continuous representation space; (c) the limitation of MLP-based methods, which have a limited receptive field; (d) the proposed S³Mamba method, which uses the SSSM to model the continuous space more effectively.

  • State Space Models (SSM): A class of models originating from control theory used to model sequences. An SSM maps an input sequence x(t) to an output sequence y(t) through a hidden state vector h(t). The core idea is that the state at the current timestep is an evolution of the state from the previous timestep plus an influence from the current input. Recent advancements like Mamba have made SSMs highly effective and efficient for deep learning, as they can capture long-range dependencies with linear complexity, unlike Transformers.

Previous Works

The paper positions itself within the evolution of ASSR methods:

  1. Meta-Learning Approaches: MetaSR was an early method that used a "meta-upscale module" to generate weights for an upscaling network dynamically based on the scale factor.
  2. INR-based MLP Methods: LIIF was a landmark paper that introduced INR to ASSR. It uses an MLP to predict the RGB value at a given continuous coordinate based on features from the four nearest pixels in the LR feature map. However, its point-wise nature limits its ability to see global context. LTE and LINF tried to improve on this by incorporating Fourier (frequency domain) information to better represent textures.
  3. INR-based Transformer Methods: To overcome the local limitations of MLPs, CiaoSR and CLIT incorporated Transformers. Transformers use a self-attention mechanism to weigh the importance of all input features for each output, allowing them to capture global context. This significantly improved performance but introduced quadratic computational complexity, making them inefficient for practical use.
  4. Real-World ASSR: COZ introduced a benchmark dataset specifically for real-world ASSR, featuring images captured with continuous optical zoom, providing a more realistic testbed than synthetically downscaled images.

Differentiation

S³Mamba differentiates itself from prior work in a critical way:

  • vs. MLP-based methods (LIIF): It overcomes the limited receptive field by using an SSM, which can theoretically model dependencies across the entire input sequence (image scan line).
  • vs. Transformer-based methods (CiaoSR): It achieves global context modeling with linear computational complexity, making it much more efficient and scalable than the quadratically complex Transformer models.
  • vs. standard SSMs (Mamba): It introduces the Scalable State Space Model (SSSM), which makes the model explicitly aware of the target super-resolution scale. Standard SSMs are not designed for this, as their dynamics are only input-dependent. The SSSM modulates its internal state transitions based on the scale, which is the key to creating a consistent continuous representation.

4. Methodology (Core Technology & Implementation Details)

The core of the paper is the proposed S³Mamba framework, which is built on the novel Scalable State Space Model (SSSM).

Figure 2. (a) Illustration of the proposed S³Mamba framework. (b) The SSSM Block consists of the SSSM, along with multiple instance representation modeling with lin… Figure 2 illustrates: (a) The overall S³Mamba architecture, which takes an LR image and generates an HR image at an arbitrary scale. (b) The SSSM Block, which processes input features. (c) The core SSSM mechanism, showing how scale and coordinates are used to modulate the discretization step size $\Delta$ and the input matrix $B$.

Principles: From SSM to Scalable SSM (SSSM)

First, the paper reviews the standard State Space Model (SSM). A continuous SSM is defined by a linear Ordinary Differential Equation (ODE):
$$
\dot{h}(t) = A h(t) + B x(t), \qquad y(t) = C h(t) + D x(t).
$$

  • x(t): The input signal (e.g., a pixel feature).

  • $h(t) \in \mathbb{R}^N$: The latent state vector of size $N$.

  • y(t): The output signal.

  • $A \in \mathbb{R}^{N \times N}$: The state transition matrix, which governs the internal dynamics of the system.

  • $B \in \mathbb{R}^{N \times 1}$: The input matrix, which controls how the input affects the state.

  • $C \in \mathbb{R}^{1 \times N}$: The output matrix, which controls how the state is projected to the output.

  • $D \in \mathbb{R}$: A direct feed-through term.

    For digital processing, this continuous system is discretized. A common method is the zero-order hold, which transforms the continuous parameters $(A, B)$ into discrete parameters $(\bar{A}, \bar{B})$ using a sampling step size $\Delta$:
$$
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1} \left( \exp(\Delta A) - I \right) \Delta B.
$$
    This leads to the discrete recurrence relation:
$$
h_k = \bar{A} h_{k-1} + \bar{B} x_k, \qquad y_k = C h_k + D x_k,
$$

  • $h_k$: The state at discrete step $k$.

  • $x_k$: The input at step $k$.

  • $y_k$: The output at step $k$.

    The key insight of the paper is that in standard SSMs used in vision (like Mamba), the step size $\Delta$ is learned from the input data $x_k$ but is unaware of the image scale. This is a problem for ASSR, as the "physical distance" and correlation between adjacent pixels change with the scale factor. To solve this, the authors propose the Scalable State Space Model (SSSM).
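To make the standard discretization concrete, here is a minimal NumPy/SciPy sketch of the zero-order hold and the resulting linear-time recurrence. The matrix values and the toy 1-D input are illustrative placeholders, not the paper's learned parameters.

```python
# Minimal sketch of a discretized SSM (zero-order hold) and its linear-time scan.
import numpy as np
from scipy.linalg import expm

N = 4                                 # state dimension
A = -np.diag(np.arange(1, N + 1.0))   # stable toy state transition matrix
B = np.ones((N, 1))                   # input matrix
C = np.random.randn(1, N)             # output matrix
D = np.zeros((1, 1))                  # feed-through term
delta = 0.1                           # sampling step size

# A_bar = exp(delta*A),  B_bar = (delta*A)^{-1} (exp(delta*A) - I) delta*B
A_bar = expm(delta * A)
B_bar = np.linalg.inv(delta * A) @ (A_bar - np.eye(N)) @ (delta * B)

# Recurrence: h_k = A_bar h_{k-1} + B_bar x_k,  y_k = C h_k + D x_k
x = np.random.randn(16, 1)            # toy 1-D input sequence
h = np.zeros((N, 1))
ys = []
for k in range(len(x)):
    h = A_bar @ h + B_bar * x[k]
    ys.append((C @ h + D * x[k]).item())
print(ys[:3])
```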

The SSSM makes the model scale-aware by modulating both $\Delta$ and $B$:

  1. Modulating the Step Size $\Delta$:
$$
\Delta_{x_k} = \omega(x_k), \qquad \Delta_{x_k}^{scale} = \sigma(scale, coord_{x_k}), \qquad \Delta'_{x_k} = \Delta_{x_k} \cdot \Delta_{x_k}^{scale}.
$$

    • $\omega$ and $\sigma$ are small MLPs.
    • $\Delta_{x_k}$ is the standard input-dependent step size.
    • $\Delta_{x_k}^{scale}$ is a new scale modulation factor generated from the target scale and the current coordinate.
    • $\Delta'_{x_k}$ is the final, scale-aware step size. This allows the model to adjust how much influence the previous state has on the current state based on the magnification level.
  2. Modulating the Input Matrix $B$: The matrix $B$ controls how the input $x_k$ influences the state $h_k$. The authors apply the same modulation logic to $B$ to make the input mapping scale-aware:
$$
B_{x_k}, C_{x_k}, \Delta_{x_k} = \omega(x_k), \qquad B_{x_k}^{scale}, \Delta_{x_k}^{scale} = \sigma(scale, coord_{x_k}),
$$
$$
\Delta'_{x_k} = \Delta_{x_k} \cdot \Delta_{x_k}^{scale}, \qquad B'_{x_k} = B_{x_k} \cdot B_{x_k}^{scale}.
$$
The new scale-aware parameters $\Delta'_{x_k}$ and $B'_{x_k}$ are then used in the discretization formula to obtain $\bar{A}'_{x_k}$ and $\bar{B}'_{x_k}$, which are finally used in the discrete state update. This ensures the entire state evolution process is conditioned on the target scale.
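The modulation itself reduces to a few element-wise products. Below is a hedged NumPy sketch of the equations above: the random linear maps `W_omega` and `W_sigma` stand in for the small MLPs $\omega$ and $\sigma$, the softplus activation, layer shapes, and (scale, coord) format are assumptions, and the $C$ output of $\omega$ is omitted for brevity.

```python
# Sketch of scale-aware modulation: Delta'_{x_k} = Delta_{x_k} * Delta^{scale},
# B'_{x_k} = B_{x_k} * B^{scale}. Shapes and activations are illustrative.
import numpy as np

rng = np.random.default_rng(0)
d_model, d_state = 8, 4

W_omega = rng.standard_normal((d_model, d_state + 1))  # stand-in for omega(x_k) -> [B_xk | Delta_xk]
W_sigma = rng.standard_normal((3, d_state + 1))        # stand-in for sigma(scale, coord) -> modulation

def softplus(z):
    return np.log1p(np.exp(z))

def sssm_params(x_k, scale, coord):
    out = x_k @ W_omega                                 # omega(x_k)
    B_xk, delta_xk = out[:d_state], softplus(out[-1])
    cond = np.array([scale, *coord]) @ W_sigma          # sigma(scale, coord_{x_k})
    B_scale, delta_scale = cond[:d_state], softplus(cond[-1])
    delta_prime = delta_xk * delta_scale                # scale-aware step size
    B_prime = B_xk * B_scale                            # scale-aware input matrix
    return delta_prime, B_prime

delta_p, B_p = sssm_params(rng.standard_normal(d_model), scale=2.7, coord=(0.25, 0.5))
print(delta_p, B_p.shape)
```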

Steps & Procedures: The S³Mamba Architecture

As shown in Figure 2(a), the full S³Mamba architecture works as follows:

  1. Feature Extraction: An LR image is passed through a backbone feature extractor (e.g., EDSR, RDN) to obtain a feature map $F_{LR}$.
  2. Local and Global Feature Fusion:
    • Local features $F_{LR}^{local}$ are extracted using an Unfold operation, which gathers features from a local neighborhood (similar to LIIF).
    • Global features $F_{LR}^{global}$ are extracted by passing $F_{LR}$ through the proposed SSSM.
    • These features are concatenated: $F_{fusion} = \text{concat}(F_{LR}^{local}, F_{LR}^{global})$.
  3. Scale-Aware Self-Attention and Rendering: This is the decoding stage:
$$
\alpha_{weight} = \mathrm{SSSM}(coord_{HR}, scale), \qquad F'_{HR} = \mathrm{SSSM}(\alpha_{weight} \cdot F_{HR}), \qquad RGB_{HR} = \mathrm{SSSM}(F'_{HR}).
$$
    • First, an SSSM takes the target HR coordinates ($coord_{HR}$) and the scale as input to generate a global attention map $\alpha_{weight}$. This map represents the importance of different spatial locations for the given scale.

    • This attention map is multiplied with the HR features ($F_{HR}$, derived from $F_{fusion}$). The result is processed by another SSSM.

    • Finally, another SSSM block processes the refined features to produce the final output RGB values ($RGB_{HR}$).

      This design ensures that both the feature encoding and the final rendering are deeply integrated with scale information, allowing for high-quality, continuous super-resolution.
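Putting the pieces together, the data flow of steps 2 and 3 can be summarized by the following Python sketch. The `sssm` function is an identity-like placeholder for the paper's Scalable SSM block, and the coordinate-based resampling of $F_{fusion}$ is a crude stand-in; only the ordering of operations is meant to mirror the equations above.

```python
# High-level sketch of the S^3Mamba decoding pipeline (placeholders, not the real modules).
import numpy as np

def sssm(x, scale=None, coord=None):
    """Placeholder for an SSSM block; the real block runs a scale-conditioned selective scan."""
    return x

def super_resolve(feat_lr, coord_hr, scale):
    # Step 2: local/global feature fusion (Unfold-based local features approximated by identity)
    f_local = feat_lr
    f_global = sssm(feat_lr)
    f_fusion = np.concatenate([f_local, f_global], axis=-1)          # F_fusion

    # Step 3: scale-aware self-attention and rendering
    n_hr = coord_hr.shape[0]
    idx = np.linspace(0, len(f_fusion) - 1, n_hr).astype(int)        # crude resampling to the HR grid
    f_hr = f_fusion[idx]                                              # F_HR
    alpha = sssm(coord_hr, scale=scale).mean(axis=-1, keepdims=True)  # alpha_weight from (coord_HR, scale)
    f_hr_refined = sssm(alpha * f_hr)                                 # F'_HR
    rgb_hr = sssm(f_hr_refined)[..., :3]                              # RGB_HR
    return rgb_hr

lr_feats = np.random.randn(16, 8)     # 16 LR positions with 8-dim features
hr_coords = np.random.rand(48, 2)     # 48 HR query coordinates in [0, 1]
print(super_resolve(lr_feats, hr_coords, scale=3.0).shape)  # (48, 3)
```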

5. Experimental Setup

  • Datasets:

    • DIV2K: A standard high-quality dataset used for synthetic SR tasks. It contains 800 2K-resolution images for training. LR images are generated by bicubic downsampling with a random scale factor $s$ drawn from a uniform distribution $U(1, 4)$.
    • COZ (Continuous Optical Zoom): A real-world benchmark dataset for ASSR. It consists of 153 training images and 37 test images captured with a real camera's optical zoom, providing a more realistic and challenging scenario with complex, non-ideal degradations.
  • Evaluation Metrics (a minimal computation sketch follows this list):

    1. PSNR (Peak Signal-to-Noise Ratio): Measures the ratio between the maximum possible power of a signal and the power of corrupting noise. It is a logarithmic metric measured in decibels (dB), where higher values indicate better reconstruction quality.

      • Conceptual Definition: It quantifies the pixel-wise difference between the ground truth and the reconstructed image. It is sensitive to large errors but may not perfectly align with human visual perception.
      • Mathematical Formula:
$$
\text{PSNR} = 20 \cdot \log_{10}(\text{MAX}_I) - 10 \cdot \log_{10}(\text{MSE})
$$
      • Symbol Explanation:
        • $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image).
        • $\text{MSE}$: The Mean Squared Error between the ground truth image $I$ and the reconstructed image $K$, calculated as $\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$ for images of size $m \times n$.
    2. SSIM (Structural Similarity Index): A metric that measures the perceptual similarity between two images. It is considered to be more aligned with human vision than PSNR.

      • Conceptual Definition: SSIM compares three aspects of the images: luminance, contrast, and structure. It produces a value between -1 and 1, where 1 indicates perfect similarity.
      • Mathematical Formula:
$$
\text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
$$
      • Symbol Explanation:
        • $\mu_x, \mu_y$: The means of images $x$ and $y$.
        • $\sigma_x^2, \sigma_y^2$: The variances of images $x$ and $y$.
        • $\sigma_{xy}$: The covariance of $x$ and $y$.
        • $c_1, c_2$: Small constants to stabilize the division.
  • Baselines: The proposed S³Mamba is compared against eight state-of-the-art ASSR models: MetaSR, LIIF, LTE, LINF, SRNO, LIT, CiaoSR, and LMI. The experiments use two different SR backbones, EDSR and RDN, to show the general applicability of the proposed method.
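As a concrete reference for the two formulas above, here is a minimal NumPy implementation of PSNR and of the single-window (global) form of SSIM exactly as written; published results typically use a sliding-window SSIM (e.g., skimage's structural_similarity), and the stabilizing constants below use the common K1=0.01, K2=0.03 choice, which is an assumption rather than a detail from the paper.

```python
# Minimal PSNR and global-statistics SSIM, following the formulas above.
import numpy as np

def psnr(gt, pred, max_val=255.0):
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return 20 * np.log10(max_val) - 10 * np.log10(mse)

def ssim_global(x, y, max_val=255.0):
    c1, c2 = (0.01 * max_val) ** 2, (0.03 * max_val) ** 2   # assumed standard constants
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

gt = np.random.randint(0, 256, (64, 64)).astype(np.float64)      # toy ground-truth patch
pred = np.clip(gt + np.random.normal(0, 5, gt.shape), 0, 255)    # toy reconstruction
print(f"PSNR: {psnr(gt, pred):.2f} dB, SSIM (global): {ssim_global(gt, pred):.3f}")
```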

6. Results & Analysis

Core Results

1. Results on the Real-World COZ Dataset: The following is a transcription of Table 1 from the paper.

Manually transcribed from the paper since no image was provided.

Values are PSNR (dB) / SSIM; ×3–×4 are in-scale, ×5–×6 are out-of-scale.

| Backbone | Method | ×3 | ×3.5 | ×4 | ×5 | ×5.5 | ×6 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| EDSR [34] | MetaSR [23] | 26.65/0.767 | 25.80/0.752 | 25.22/0.740 | 24.39/0.720 | 24.09/0.711 | 23.31/0.678 |
| EDSR [34] | LIIF [10] | 26.61/0.767 | 25.76/0.752 | 25.16/0.741 | 24.32/0.721 | 24.01/0.711 | 23.23/0.679 |
| EDSR [34] | LTE [30] | 26.55/0.767 | 25.71/0.752 | 25.15/0.740 | 24.37/0.720 | 24.05/0.712 | 23.26/0.679 |
| EDSR [34] | LINF [68] | 26.53/0.762 | 25.66/0.750 | 25.10/0.737 | 24.29/0.719 | 23.99/0.711 | 23.21/0.677 |
| EDSR [34] | SRNO [61] | 26.59/0.766 | 25.70/0.752 | 25.15/0.741 | 24.31/0.722 | 24.05/0.712 | 23.25/0.680 |
| EDSR [34] | LIT [7] | 26.58/0.766 | 25.71/0.753 | 25.16/0.741 | 24.35/0.721 | 24.00/0.712 | 23.19/0.679 |
| EDSR [34] | CiaoSR [2] | 26.56/0.770 | 25.65/0.755 | 25.13/0.746 | 24.31/0.725 | 23.96/0.721 | 23.23/0.709 |
| EDSR [34] | LMI [18] | 26.71/0.773 | 25.84/0.755 | 25.27/0.746 | 24.39/0.726 | 24.09/0.723 | 23.34/0.709 |
| EDSR [34] | Ours | 26.66/0.768 | 25.78/0.752 | 25.22/0.741 | 24.39/0.722 | 24.08/0.713 | 23.29/0.680 |
| RDN [73] | MetaSR [23] | 26.65/0.767 | 25.80/0.752 | 25.22/0.740 | 24.39/0.720 | 24.09/0.711 | 23.31/0.678 |
| RDN [73] | LIIF [10] | 26.69/0.766 | 25.83/0.752 | 25.23/0.740 | 24.39/0.718 | 24.13/0.711 | 23.28/0.679 |
| RDN [73] | LTE [30] | 26.64/0.767 | 25.74/0.752 | 25.17/0.740 | 24.40/0.719 | 24.10/0.709 | 23.28/0.676 |
| RDN [73] | LINF [68] | 26.60/0.762 | 25.73/0.750 | 25.15/0.737 | 24.32/0.719 | 24.03/0.711 | 23.28/0.677 |
| RDN [73] | SRNO [61] | 26.67/0.766 | 25.73/0.752 | 25.19/0.741 | 24.40/0.722 | 24.09/0.712 | 23.28/0.680 |
| RDN [73] | LIT [7] | 26.66/0.766 | 25.79/0.753 | 25.19/0.741 | 24.36/0.721 | 24.03/0.712 | 23.25/0.679 |
| RDN [73] | CiaoSR [2] | 26.61/0.772 | 25.76/0.756 | 25.22/0.746 | 24.38/0.727 | 24.06/0.721 | 23.36/0.710 |
| RDN [73] | LMI [18] | 26.74/0.769 | 25.86/0.753 | 25.30/0.742 | 24.48/0.723 | 24.14/0.714 | 23.37/0.682 |
| RDN [73] | Ours | 26.74/0.777 | 25.92/0.760 | 25.34/0.749 | 24.50/0.728 | 24.15/0.724 | 23.39/0.710 |
  • Analysis: On the RDN backbone, the proposed method (Ours) consistently outperforms all other methods across all scales, in both PSNR and SSIM. For example, at ×3.5, it achieves 25.92/0.760, surpassing the next best LMI (25.86/0.753). The strong SSIM scores suggest that S³Mamba is particularly effective at reconstructing structural details, which is crucial for visual quality in real-world images. The visual results in Figure 3 confirm this, showing that the proposed method produces cleaner textures and fewer artifacts compared to others.

    Figure 3. Visual comparison with existing methods on the real COZ dataset (×3). Please zoom in for a better view. Figure 3 shows that for ×3 SR on the real-world COZ dataset, S³Mamba (Ours) reconstructs the text and grid patterns more clearly and with fewer artifacts than other methods like CiaoSR and LIIF.

    2. Results on the Synthetic DIV2K Dataset: The following is a transcription of Table 2 from the paper.

Manually transcribed from the paper since no image was provided.

Values are PSNR (dB); ×2–×4 are in-scale, ×6–×30 are out-of-scale.

| Backbone | Method | ×2 | ×3 | ×4 | ×6 | ×12 | ×18 | ×24 | ×30 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| – | Bicubic | 31.01 | 28.22 | 26.66 | 24.82 | 22.27 | 21.00 | 20.19 | 19.59 |
| EDSR [34] | EDSR-baseline [34] | 34.55 | 30.90 | 28.94 | – | – | – | – | – |
| EDSR [34] | MetaSR [24] | 34.64 | 30.93 | 28.92 | 26.61 | 23.55 | 22.03 | 21.06 | 20.37 |
| EDSR [34] | LIIF [10] | 34.67 | 30.96 | 29.00 | 26.75 | 23.71 | 22.17 | 21.18 | 20.48 |
| EDSR [34] | ITSRN [67] | 34.71 | 30.95 | 29.03 | 26.77 | 23.71 | 22.17 | 21.18 | 20.49 |
| EDSR [34] | LMI [18] | 34.59 | 30.90 | 28.94 | 26.69 | 23.68 | 22.18 | 21.23 | 20.55 |
| EDSR [34] | LTE [30] | 34.72 | 31.02 | 29.04 | 26.81 | 23.78 | 22.23 | 21.24 | 20.53 |
| EDSR [34] | CLIT [7] | 34.81 | 31.12 | 29.15 | 26.92 | 23.83 | 22.29 | 21.26 | 20.53 |
| EDSR [34] | SRNO [61] | 34.85 | 31.11 | 29.16 | 26.90 | 23.84 | 22.29 | 21.27 | 20.56 |
| EDSR [34] | CiaoSR [2] | 34.91 | 31.15 | 29.23 | 26.95 | 23.88 | 22.32 | 21.32 | 20.59 |
| EDSR [34] | Ours | 34.93 | 31.13 | 29.24 | 26.97 | 23.89 | 22.32 | 21.30 | 20.59 |
| RDN [73] | RDN-baseline [73] | 34.94 | 31.22 | 29.19 | – | – | – | – | – |
| RDN [73] | MetaSR [24] | 35.00 | 31.27 | 29.25 | 26.88 | 23.73 | 22.18 | 21.17 | 20.47 |
| RDN [73] | LIIF [10] | 34.99 | 31.26 | 29.27 | 26.99 | 23.89 | 22.34 | 21.31 | 20.59 |
| RDN [73] | ITSRN [67] | 35.09 | 31.36 | 29.38 | 27.06 | 23.93 | 22.36 | 21.32 | 20.61 |
| RDN [73] | LMI [18] | 34.74 | 31.03 | 29.07 | 26.81 | 23.79 | 22.29 | 21.31 | 20.63 |
| RDN [73] | LTE [30] | 35.04 | 31.32 | 29.33 | 27.04 | 23.95 | 22.40 | 21.36 | 20.64 |
| RDN [73] | CLIT [7] | 35.10 | 31.39 | 29.39 | 27.12 | 24.01 | 22.45 | 21.38 | 20.64 |
| RDN [73] | SRNO [61] | 35.16 | 31.42 | 29.42 | 27.12 | 24.03 | 22.46 | 21.41 | 20.68 |
| RDN [73] | CiaoSR [2] | 35.15 | 31.42 | 29.45 | 27.16 | 24.06 | 22.48 | 21.43 | 20.70 |
| RDN [73] | Ours | 35.17 | 31.40 | 29.47 | 27.17 | 24.07 | 22.50 | 21.43 | 20.68 |
  • Analysis: On the DIV2K dataset, S³Mamba achieves SOTA performance in most scenarios, especially for smaller to medium scales (e.g., ×2, ×4, ×6) and large out-of-scale factors (e.g., ×12, ×18). For instance, with the RDN backbone, it scores the highest PSNR at ×2, ×4, ×6, ×12, and ×18. It is highly competitive with CiaoSR but, as noted in the supplementary material, achieves this with only half the computational complexity. The visual comparison in Figure 4 demonstrates that S³Mamba reconstructs fine textures more faithfully than other methods, including CiaoSR, which tends to produce some artifacts.

    Figure 4. Visual comparison with existing methods on the DIV2K dataset (×4). Please zoom in for a better view. Figure 4 shows a close-up of a ×4 super-resolved image. The result from "Ours" is visually closest to the Ground Truth (GT), correctly rendering the grid pattern on the building, whereas other methods blur or distort it.

Ablation Study

The ablation studies validate the effectiveness of the core components of S³Mamba.

1. Effectiveness of SSSM: Table 3 compares the performance of the INR module using a standard MLP, a standard SSM, and the proposed SSSM.

Manually transcribed from the paper since no image was provided.

| MLP | SSM | Our SSSM | PSNR on ×2 | PSNR on ×4 |
| --- | --- | --- | --- | --- |
| ✓ | ✗ | ✗ | 34.78 | 29.09 |
| ✗ | ✓ | ✗ | 34.85 | 29.17 |
| ✗ | ✗ | ✓ | 34.91 | 29.24 |
  • Analysis: The results clearly show a performance hierarchy: SSSM > SSM > MLP.
    • Replacing MLP with a standard SSM boosts PSNR (e.g., from 34.78 to 34.85 at ×2), confirming that the global modeling capability of SSMs is superior to the point-wise MLP approach.
    • Using the proposed SSSM instead of a standard SSM provides another significant boost (from 34.85 to 34.91 at ×2). This directly demonstrates the importance of making the SSM scale-aware through the modulation of Δ\Delta and BB.

2. Effectiveness of GFE and SFAtt: Table 4 analyzes the contribution of the Global Feature Extraction (GFE) and Scale-aware Self-Attention (SFAtt) modules within the S³Mamba architecture.

Manually transcribed from the paper since no image was provided.

| GFE | SFAtt | PSNR on ×2 | PSNR on ×3 | PSNR on ×4 |
| --- | --- | --- | --- | --- |
| ✗ | ✗ | 34.71 | 30.98 | 29.06 |
| ✗ | ✓ | 34.78 | 31.03 | 29.12 |
| ✓ | ✗ | 34.85 | 31.09 | 29.19 |
| ✓ | ✓ | 34.91 | 31.13 | 29.24 |
  • Analysis: Both modules contribute positively to the final performance.
    • Adding SFAtt (scale-aware self-attention) to the baseline improves PSNR by 0.07 dB at ×2, showing that the scale-conditioned attention mechanism helps the network adapt its rendering.
    • Adding GFE (global feature extraction via SSSM) provides a larger boost of 0.14 dB (34.85 vs. 34.71), highlighting the crucial role of global information in constructing a robust continuous representation.
    • Combining both (GFE + SFAtt) yields the best results, showing that their contributions are complementary.

7. Conclusion & Personal Thoughts

Conclusion Summary

The paper introduces S³Mamba, a novel and efficient architecture for arbitrary-scale super-resolution. The core innovation is the Scalable State Space Model (SSSM), which for the first time adapts the internal dynamics of an SSM to the target super-resolution scale. By modulating the step size and input matrix during discretization, the SSSM achieves a consistent, continuous image representation with only linear computational complexity. Complemented by a scale-aware self-attention mechanism, the full S³Mamba framework effectively captures both global context and scale-specific details. Experimental results confirm that the method achieves state-of-the-art performance on synthetic and real-world benchmarks, outperforming previous MLP and Transformer-based approaches in both accuracy and efficiency.

Limitations & Future Work

The paper presents a very strong case, but some areas could be explored further:

  • Authors' Stated Limitations: The authors do not explicitly state limitations in the main body of the paper, focusing on the method's strengths. The conclusion suggests it "paves a new way," implying future work will build upon this foundation.
  • Potential Limitations:
    • Extreme "Out-of-Scale" Performance: While the model performs well up to ×30, its performance degradation at extreme scales could be further analyzed. CiaoSR actually edges ahead at the very largest scale (e.g., ×30 on RDN, 20.70 vs. 20.68 dB), suggesting there might be a limit to the current approach's generalization.
    • Complexity of SSSM: While computationally linear, the SSSM introduces additional MLPs and modulations. A detailed analysis of parameter counts and real-world inference speed (ms/image) compared to CiaoSR would be valuable, beyond just computational complexity.
    • Scan Direction: The paper doesn't detail the scanning mechanism (e.g., four-directional scanning like in Visual Mamba). This implementation detail can have a significant impact on performance for non-sequential 2D data like images.

Personal Insights & Critique

  • A Powerful and Transferable Idea: The core concept of modulating SSM parameters (Δ\Delta and BB) with external conditioning information (scale, coordinates) is highly innovative and elegant. This mechanism is not limited to super-resolution; it could be a general-purpose tool for any conditional sequence modeling task where the input-output relationship depends on external context (e.g., conditional audio generation, style-conditioned text generation, or even video frame prediction conditioned on motion vectors).
  • The Right Tool for the Job: This work is a perfect example of the ongoing trend of finding efficient alternatives to the Transformer. By identifying the core limitation of Transformers (quadratic complexity) and finding an architecture (SSM) that remedies it while preserving the key advantage (long-range dependency modeling), the authors provide a practical and powerful solution.
  • Solid Experimental Validation: The paper's strength is its rigorous evaluation. Testing on both synthetic (DIV2K) and real-world (COZ) data, along with comprehensive ablation studies, provides convincing evidence for the method's effectiveness. The superior performance on the COZ dataset is particularly noteworthy, as it proves the model's robustness to real-world image degradations.
  • Critique: While the quantitative gains over the previous SOTA (CiaoSR) are sometimes modest (e.g., 0.02 dB PSNR at ×2 on EDSR), the claimed halving of computational complexity is a massive practical advantage. This efficiency-for-performance trade-off is a huge win and should perhaps have been emphasized even more as a primary contribution. The S³Mamba name is also clever, succinctly capturing its key components: Scalable State Space Mamba. Overall, this paper makes a significant and timely contribution to the field of image restoration.
