
NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution

Published: 06/03/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper reviews the NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution, detailing the proposed methods for restoring RAW images degraded by noise and blur and for 2x upscaling of Bayer images, with 230 registered participants and 45 submitting results.

Abstract

This paper reviews the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, highlighting the proposed solutions and results. New methods for RAW Restoration and Super-Resolution could be essential in modern Image Signal Processing (ISP) pipelines; however, this problem is not as explored as in the RGB domain. The goal of this challenge is twofold: (i) restore RAW images with blur and noise degradations, and (ii) upscale RAW Bayer images by 2x, considering unknown noise and blur. In the challenge, a total of 230 participants registered, and 45 submitted results during the challenge period. This report presents the current state-of-the-art in RAW Restoration.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is the NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution. It focuses on reviewing the solutions and results from this challenge.

1.2. Authors

The paper lists a large number of authors, as is typical of challenge reports, where both the organizers (marked with *) and the participating teams are credited. The authors include:

  • Marcos V. Conde *†

  • Radu Timofte *

  • Zihao Lu *

  • Xiangyu Kong

  • Xiaoxia Xing

  • Fan Wang

  • Suejin Han

  • MinKyu Park

  • Tianyu Zhang

  • Xin Luo

  • Yeda Chen

  • Dong Liu

  • Li Pang

  • Yuhang Yang

  • Hongzhong Wang

  • Xiangyong Cao

  • Ruixuan Jiang

  • Senyan Xu

  • Siyuan Jiang

  • Xueyang Fu

  • Zheng-Jun Zha

  • Tianyu Hao

  • Yuhong He

  • Ruoqi Li

  • Yueqi Yang

  • Xiang Yu

  • Guanlan Hoong

  • Minmin Yi

  • Yuanjia Chen

  • Liwen Zhang

  • Zijie Jin

  • Cheng Li

  • Lian Liu

  • Wei Song

  • Heng Sun

  • Yubo Wang

  • Jinghua Wang

  • Jiajie Lu

  • Watchara Ruangsang

    Their affiliations include academic institutions (e.g., University of Science and Technology of China, Xi'an Jiaotong University, Nanjing University, Harbin Institute of Technology, Huazhong University of Science and Technology, Northeastern University, Dalian University of Technology, Politecnico di Milano, Chulalongkorn University) and industry research labs (e.g., Samsung R&D Institute China - Beijing (SRC-B), the Camera Innovation Group of Samsung Electronics, Shanghai Shuangshen Information Technology Co., Ltd., the Ministry of Education Key Laboratory of Intelligent Networks and Network Security, Xiaomi Inc., E-surfing Vision Technology Co., Ltd, China, and the Institute of Automation, Chinese Academy of Sciences). This mix of academic and industry participants highlights the practical relevance of and research interest in RAW image processing.

1.3. Journal/Conference

The paper is published at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2025. CVPR is one of the premier conferences in computer vision, known for presenting cutting-edge research. Workshop papers, like this one, often focus on specific challenges or emerging topics, providing a snapshot of the current state-of-the-art in a specialized area.

1.4. Publication Year

2025

1.5. Abstract

The abstract introduces the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge, emphasizing its relevance to modern Image Signal Processing (ISP) pipelines. It notes that RAW image restoration is less explored than its RGB counterpart. The challenge had two main goals: (i) restoring RAW images affected by blur and noise, and (ii) upscaling RAW Bayer images by 2× despite unknown noise and blur. A total of 230 participants registered, with 45 teams submitting results. The report aims to present the current state-of-the-art in RAW Restoration.

https://arxiv.org/abs/2506.02197

https://arxiv.org/pdf/2506.02197v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses, via the NTIRE 2025 Challenge, is the restoration and super-resolution of RAW images. This problem is crucial for modern Image Signal Processing (ISP) pipelines, particularly in portable camera devices.

Why is this problem important?

  • Physical Limitations of Portable Devices: Smaller sensors in portable devices lead to reduced light collection, lower signal-to-noise ratios (SNR), and limited optical resolution. This makes it challenging to achieve high resolution, low noise, and sharpness simultaneously. Image restoration tasks are thus indispensable.
  • RAW vs. sRGB Processing: RAW images contain more pristine sensor data with minimal processing (linear amplification, white balance), preserving a nearly linear response. In contrast, images processed through the ISP pipeline (demosaicing, tone mapping, gamma correction, color adjustment) undergo strong nonlinear operations that can introduce irreversible information loss, amplify errors, and limit the performance of tasks like denoising and super-resolution when conducted in the sRGB domain. While some methods attempt to reconstruct RAW from sRGB, they often fail to recover original detail patterns.
  • Generalization and Robustness: sRGB images from the same sensor can vary significantly across manufacturers due to different ISP-tuning standards and stylistic preferences. This makes cross-sensor model generalization and robustness difficult when relying on sRGB inputs. Processing in the RAW domain offers greater consistency.
  • Computational Constraints: Portable devices have limited computational resources, making model size and computational complexity critical factors in model design.

What is the paper's entry point or innovative idea? The paper's entry point is the NTIRE 2025 Challenge, which explicitly focuses on RAW image restoration and super-resolution. By organizing this challenge, the authors aim to:

  1. Push the boundaries of research in the less-explored RAW domain.
  2. Provide a standardized platform and dataset for developing and benchmarking new methods.
  3. Address the practical needs of ISP pipelines in portable devices by considering both restoration quality and computational efficiency.
  4. Highlight the advantages of operating directly on RAW data before complex ISP transformations.

2.2. Main Contributions / Findings

The paper's primary contributions stem from the organization and reporting of the NTIRE 2025 RAW Image Restoration and Super-Resolution Challenge:

  • Challenge Design and Setup: The paper details the two-fold challenge: (i) RAW image restoration from blur and noise, and (ii) 2x RAW Bayer image super-resolution with unknown noise and blur. It defines the degradation models, dataset properties, and evaluation protocols for both tracks.
  • Comprehensive Dataset Provision: It describes the use of existing datasets like BSRAW and Adobe MIT5K, and new data collected from diverse smartphone sensors (Samsung Galaxy S9, iPhone X, Google Pixel 10, Vivo X90, Samsung S21) to create robust training and testing sets for both tasks. Pre-processing steps like normalization and RGGB Bayer pattern conversion are also detailed.
  • Benchmark of State-of-the-Art Solutions: The paper presents a comprehensive benchmark of the solutions submitted by 45 teams (out of 230 registered participants). These solutions represent the current state-of-the-art in RAW image restoration and super-resolution, covering a wide range of architectures from efficient to general models.
  • Detailed Analysis of Top Solutions: It provides in-depth descriptions of several top-performing methods, highlighting their architectural designs (e.g., RawRTSR, SMFFRaw, Multi-PromptIR, ERIRNet), training strategies (e.g., knowledge distillation, multi-stage training, reparameterization), and efficiency considerations.
  • Key Findings:
    • Significant Quality Improvement: The proposed methods demonstrate great ability to increase RAW image quality and resolution, reduce blurriness and noise, and avoid detectable color artifacts, even for full-resolution 12MP outputs.
    • RAW vs. sRGB Advantages: Reinforces the idea that processing in the RAW domain offers superior performance compared to sRGB due to the preservation of more detail information and linearity.
    • Efficiency vs. Performance Trade-off: The challenge included efficient tracks with parameter constraints, pushing participants to design lightweight yet high-performing models. Teams like Samsung AI Camera excelled in both general and efficient tracks.
    • Remaining Challenges: Identifies more realistic downsampling and the difficulty of simultaneously tackling denoising and deblurring effectively as open research problems.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with several fundamental concepts in imaging and deep learning:

  • RAW Images: RAW images are unprocessed or minimally processed data captured directly by a digital camera's sensor. Unlike JPEG or sRGB images, they retain much more information, including a wider dynamic range, higher bit depth (e.g., 10-14 bits per pixel), and a linear response to light. They are often saved in proprietary formats or standardized as DNG (Digital Negative). Processing RAW images directly avoids information loss introduced by an Image Signal Processor (ISP).
  • Bayer Pattern: Most digital camera sensors use a Bayer filter array (or Bayer pattern), which is a color filter mosaic that arranges red, green, and blue color filters on a square grid of photosensors. Typically, there are twice as many green filters as red or blue, as the human eye is more sensitive to green light. A common arrangement is RGGB. RAW images are essentially Bayer-patterned images before demosaicing. The challenge often "packs" these Bayer patterns into a 4-channel representation for easier processing by deep learning models (e.g., RGGB pixels are grouped into a 2×2 block, forming a 4-channel pixel).
  • Image Signal Processor (ISP) Pipeline: The ISP pipeline is a series of computational imaging steps that transform the raw sensor data into a viewable image (e.g., JPEG or sRGB). Key steps include:
    • Demosaicing (Debayering): Interpolating missing color information at each pixel from its neighbors to create a full-color RGB image.
    • White Balance: Adjusting color temperatures to ensure white objects appear white.
    • Noise Reduction: Removing sensor noise.
    • Sharpening: Enhancing edge details.
    • Tone Mapping/Gamma Correction: Adjusting the image's brightness and contrast to fit the display's dynamic range and human perception (which is non-linear).
    • Color Adjustment: Applying stylistic color enhancements. The paper argues that performing restoration tasks before these non-linear ISP operations is beneficial.
  • Image Restoration: A general term for improving the quality of an image that has been degraded by various factors. In this paper, it specifically refers to:
    • Denoising: Removing unwanted random variations in image intensity (noise) that can obscure details. Common types include Gaussian noise and shot noise.
    • Deblurring: Reversing the effects of blur, which can be caused by camera motion (motion blur), subject motion, or misfocus (defocus blur). This often involves estimating and reversing a Point Spread Function (PSF), which describes how a point of light is spread out in the image.
  • Super-Resolution (SR): The process of enhancing the resolution of an image, typically by generating a high-resolution (HR) image from a low-resolution (LR) input. This involves hallucinating plausible details that were lost during downsampling. The challenge focuses on 2x upscaling.
  • Convolutional Neural Networks (CNNs): A class of deep neural networks widely used for image processing tasks. They consist of layers that perform convolutional operations (applying learnable filters to detect patterns), pooling (downsampling), and activation functions. CNNs are effective at learning hierarchical features from images.
  • Transformers (in Computer Vision): Originally developed for natural language processing, Transformers have been adapted for computer vision. Key features include:
    • Self-Attention: A mechanism that allows the model to weigh the importance of different parts of the input sequence (or image patches) when processing a particular element. This enables capturing long-range dependencies.
    • Multi-Head Attention: Multiple self-attention mechanisms running in parallel, each learning different relationships.
    • Feed-Forward Networks (FFNs): Simple fully connected layers applied independently to each position. Transformer blocks often incorporate Layer Normalization for stable training.
  • Peak Signal-to-Noise Ratio (PSNR): A common metric for evaluating the quality of image reconstruction. It quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR values indicate better image quality.
    • Conceptual Definition: PSNR is widely used to measure the quality of reconstruction of lossy compression codecs or image restoration algorithms. It is defined as the ratio of the maximum possible power of a signal to the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed in terms of the logarithmic decibel (dB) scale. A higher PSNR generally indicates a better reconstruction.
    • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
    • Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255. For floating-point RAW images normalized to [0, 1], it is 1.
      • $\mathrm{MSE}$: Mean Squared Error between the original (ground truth) image $I$ and the reconstructed image $K$. $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ where $m$ and $n$ are the dimensions of the image.
  • Structural Similarity Index Measure (SSIM): Another common metric for image quality assessment, designed to be more perceptually relevant than PSNR. It measures the similarity between two images based on luminance, contrast, and structure. Values range from -1 to 1, with 1 indicating perfect similarity.
    • Conceptual Definition: SSIM is a perceptual metric that quantifies the perceived quality degradation caused by processing operations such as data compression or loss. It is designed to model the human visual system's perception of structural information in an image. Instead of comparing absolute pixel differences, SSIM considers three key comparison components: luminance, contrast, and structure.
    • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • $x$: Reference image (ground truth).
      • $y$: Processed image.
      • $\mu_x$: Average of $x$.
      • $\mu_y$: Average of $y$.
      • $\sigma_x^2$: Variance of $x$.
      • $\sigma_y^2$: Variance of $y$.
      • $\sigma_{xy}$: Covariance of $x$ and $y$.
      • $c_1 = (K_1 L)^2$, $c_2 = (K_2 L)^2$: Small constants to avoid division by zero. $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit grayscale images). $K_1$ and $K_2$ are small default constants (e.g., $K_1 = 0.01$, $K_2 = 0.03$).
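For concreteness, a minimal sketch of how these two fidelity metrics could be computed on packed RAW patches normalized to [0, 1] (so $\mathrm{MAX}_I = 1$). The SSIM call uses scikit-image and is only an illustration, not the challenge's evaluation script:

```python
import numpy as np
from skimage.metrics import structural_similarity

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 1.0) -> float:
    """PSNR in dB; for RAW images normalized to [0, 1], max_val is 1.0."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float(10.0 * np.log10(max_val ** 2 / mse))

# Example on a packed 4-channel RAW patch (H, W, 4) in [0, 1]
gt = np.random.rand(256, 256, 4).astype(np.float32)
pred = np.clip(gt + np.random.normal(0, 0.01, gt.shape), 0, 1).astype(np.float32)
print("PSNR:", psnr(gt, pred))
print("SSIM:", structural_similarity(gt, pred, data_range=1.0, channel_axis=-1))
```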

3.2. Previous Works

The paper references several prior works, both as foundational concepts for the challenge and as baselines for comparison.

  • BSRAW [18]: This is a key reference for the NTIRE 2025 challenge, used for both the dataset and the degradation pipeline. BSRAW (Improving Blind RAW Image Super-Resolution) likely introduced methods for synthesizing realistic low-resolution (LR) RAW images from high-resolution (HR) RAW images by modeling various degradations (noise, blur, downsampling). Its contributions likely include a robust degradation pipeline and a RAWSR dataset, which are directly utilized here.
  • NTIRE 2024 RAWSR Challenge [19]: The current challenge builds directly on its predecessor. The NTIRE 2024 RAWSR Challenge already explored deep RAW image super-resolution, and its top-performing methods (RBSFormer, BSRAW) are used as baselines, providing a performance context for the 2025 challenge.
  • PMRID [65]: (Practical Deep Raw Image Denoising on Mobile Devices) This model is a baseline for the RAW Image Restoration (RAWIR) track. As its name suggests, it focuses on efficient RAW denoising suitable for mobile devices, indicating a practical, lightweight approach.
  • MOFA [9]: (A Model Simplification Roadmap for Image Restoration on Mobile Devices) Another baseline for RAWIR, MOFA emphasizes model simplification for image restoration on mobile devices. This aligns with the challenge's focus on efficiency.
  • NAFNet [7]: (Simple Baselines for Image Restoration) NAFNet is a widely recognized lightweight U-Net-like model that achieves strong image restoration performance without complex attention mechanisms. It's a key baseline for RAWIR and is often a component or inspiration for submitted solutions due to its efficiency and effectiveness. Many solutions in this challenge (e.g., Team Samsung AI, Team NJU, Team ER-NAFNet) build upon or adapt NAFNet's NafBlock architecture.
  • XRestormer [10]: (A Comparative Study of Image Restoration Networks for General Backbone Network Design) This is used as a teacher model in knowledge distillation strategies by Team Samsung AI for both RAWSR and RAWIR. XRestormer likely represents a strong, potentially more complex, image restoration network that provides good features for student models to learn from.
  • RBSFormer [32, 33]: (Enhanced Transformer Network for Raw Image Super-Resolution) This Transformer-based network was a top performer in the NTIRE 2024 RAWSR Challenge and serves as a strong baseline in NTIRE 2025 RAWSR. Team USTC-VIDAR explicitly uses a streamlined version of RBSFormer. It's known for its Transformer blocks and efficient feature extraction.
  • SwinFIR-Tiny [76]: (Revisiting the SwinIR with Fast Fourier Convolution and Improved Training for Image Super-Resolution) Team Miers bases their RAWIR solution on SwinFIR-Tiny, which is a variant of SwinIR (a Transformer-based model for image restoration) optimized for efficiency.
  • MPRNet [74]: (Multi-Stage Progressive Image Restoration) Team WIRTeam uses MPRNet as a baseline for their LMPR-Net in RAWIR. MPRNet is a multi-stage network that decomposes image restoration into multiple sub-tasks, processing images progressively to handle various degradations.
  • PromptIR [51]: (Prompting for All-in-One Image Restoration) Team WIRTeam's Multi-PromptIR for RAWIR is based on PromptIR, which uses a prompt mechanism to guide the restoration process, potentially improving performance in complex scenarios.
  • Restormer [75]: (Efficient Transformer for High-Resolution Image Restoration) Also referenced by Team WIRTeam for Multi-PromptIR, Restormer is known for its efficient Transformer architecture for high-resolution image restoration.
  • SYEnet [23]: (A Simple Yet Effective Network for Multiple Low-Level Vision Tasks with Real-Time Performance on Mobile Device) Team EiffLowCVer bases their RepRawSR on SYEnet, emphasizing lightweight and real-time performance for mobile devices.

3.3. Technological Evolution

The field of image restoration has seen a significant evolution, moving from traditional signal processing methods to deep learning approaches, and more recently, from sRGB domain processing to RAW domain processing, with a growing emphasis on efficiency for mobile deployment.

  • Early Days (Traditional Methods): Initially, image restoration relied on classical signal processing techniques, often model-based (e.g., Wiener filtering for denoising, iterative deconvolution for deblurring). These methods were mathematically rigorous but often struggled with complex, real-world degradations and generalization.

  • Rise of Deep Learning: The advent of Convolutional Neural Networks (CNNs) revolutionized image restoration. CNNs could learn complex mappings from degraded to clean images, significantly outperforming traditional methods. Early CNNs like SRCNN for super-resolution demonstrated the power of end-to-end learning.

  • U-Net Architectures and Attention: Architectures like U-Net became popular for their ability to capture multi-scale features through encoder-decoder structures with skip connections, which are vital for pixel-level restoration tasks. Later, attention mechanisms, inspired by Transformers, were integrated into CNNs (e.g., Squeeze-and-Excitation (SE) blocks, Channel Attention) or used as full Transformer blocks to capture global dependencies and improve feature learning.

  • Shift to RAW Domain Processing: Historically, most image restoration research focused on sRGB images, as these are the common output format. However, researchers realized the limitations of sRGB due to irreversible information loss from the ISP pipeline. This led to a critical shift towards processing RAW images directly, leveraging their linear response and richer data. Papers like BSRAW and challenges like NTIRE RAWSR are at the forefront of this shift.

  • Efficiency for Mobile/Edge Devices: As deep learning models grew in complexity, their deployment on resource-constrained mobile and edge devices became a major challenge. This spurred research into lightweight network design, knowledge distillation, reparameterization, and quantization. Many solutions in NTIRE 2025 explicitly address these efficiency concerns with efficient tracks and parameter constraints.

    This paper's work fits squarely within the current wave of RAW domain deep learning with a strong focus on efficiency. It benchmarks the latest CNN- and Transformer-based architectures tailored for RAW data and mobile deployment.

3.4. Differentiation Analysis

Compared to the main methods in related work, the NTIRE 2025 Challenge and its submitted solutions offer several core differentiations and innovations:

  • Exclusive Focus on RAW Domain: While sRGB image restoration is mature, RAW domain processing is less explored. This challenge specifically pushes the boundaries of restoration and super-resolution directly on RAW Bayer data. This is a crucial differentiation, as RAW data offers higher fidelity and a more linear representation, potentially leading to superior results compared to sRGB-based methods that suffer from ISP-induced information loss.
  • Combined Restoration and Super-Resolution: The challenge tackles denoising, deblurring, and super-resolution concurrently on RAW images. This integrated approach is more aligned with real-world ISP pipeline needs, where multiple degradations often coexist. The solutions need to be robust to unknown noise and blur.
  • Emphasis on Efficiency: The introduction of an Efficient Track (e.g., max 200K parameters) explicitly drives innovation in lightweight model design for mobile devices. This is a significant differentiator from many academic papers that prioritize absolute performance without stringent efficiency constraints. Techniques like reparameterization, knowledge distillation, narrow-and-deep architectures, and depthwise separable convolutions are heavily utilized and benchmarked here.
  • Diverse Real-World Degradations: The challenge uses sophisticated degradation pipelines (e.g., BSRAW, AISP) that simulate realistic noise profiles and multiple blur kernels. Crucially, participants are encouraged to develop their own degradation pipelines to generate more realistic training data, pushing models beyond simplistic synthetic degradations. The RAWIR track even incorporates dark frame noise captured from multiple mobile sensors.
  • Benchmarking of Modern Architectures: The challenge serves as a proving ground for contemporary deep learning architectures adapted to RAW data. This includes U-Net variants (NAFNet, ER-NAFNet), Transformer-based models (RBSFormer, SwinFIR-Tiny variants, PromptIR/Restormer variants), and novel combinations (MambaIRv2). The results provide a direct comparison of how these different architectural philosophies perform in the RAW domain under challenging conditions.
  • Addressing Cross-Sensor Generalization: By using datasets from diverse camera sensors (Canon, Nikon, Samsung, iPhone, Google Pixel, Vivo), the challenge implicitly encourages the development of models that are robust and generalizable across different hardware, mitigating the ISP-tuning variability seen in sRGB images.

4. Methodology

4.1. Principles

The core principle behind the methods used in the NTIRE 2025 Challenge is to leverage deep learning models to directly process RAW image data for restoration (denoising and deblurring) and super-resolution. The intuition is that by operating on the raw, linear sensor data before the complex, non-linear transformations of an Image Signal Processor (ISP) pipeline, more original information is preserved, leading to higher quality restoration and upscaling.

The general theoretical basis involves learning a complex mapping function $F$ that takes a degraded low-resolution (LR) RAW image $I_{LR}$ and outputs a high-resolution (HR) clean RAW image $I_{HR}$, i.e., $I_{HR} = F(I_{LR})$. This function $F$ is typically parameterized by a deep neural network.

For RAW Super-Resolution (RAWSR), the principle is to learn to:

  1. Remove noise and blur: Tackle the degradations present in the LR RAW image.

  2. Upscale: Increase the resolution, often by a factor of 2×, by inferring missing pixel information.

  3. Maintain RAW properties: Ensure the output remains a RAW Bayer image with correct color patterns and linear response.

    For RAW Image Restoration (RAWIR), the principle is to learn to:

  1. Remove noise: Address various noise profiles, including shot noise and heteroscedastic Gaussian noise.

  2. Remove blur: Undo the effects of various blur kernels (PSFs).

  3. Produce a clean RAW image: Output a RAW image that is free from degradation but retains original detail.

    Both tasks emphasize efficiency for deployment on mobile devices, requiring models with a low parameter count and computational complexity.

4.2. Core Methodology In-depth (Layer by Layer)

The challenge is structured into two main tracks: RAW Image Super-Resolution (RAWSR) and RAW Image Restoration (RAWIR).

4.2.1. NTIRE 2025 RAWSR Challenge Setup

Degradation Pipeline: The low-resolution (LR) degraded images for training are generated online using the degradation pipeline proposed in BSRAW [18]. This pipeline simulates realistic degradations, considering:

  • Different noise profiles.

  • Multiple blur kernels (PSFs).

  • A simple downsampling strategy to synthesize LR RAW images.

    Participants are allowed to apply other augmentation techniques or expand this degradation pipeline to generate more realistic training data.
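A toy sketch of such an online degradation (per-channel PSF blur, 2× downsampling, and signal-dependent noise on a packed RAW tensor) is given below. The Gaussian PSF and the shot/read-noise levels are illustrative placeholders, not the actual BSRAW parameters:

```python
import torch
import torch.nn.functional as F

def degrade_raw(hr: torch.Tensor, psf: torch.Tensor, shot: float = 0.01,
                read: float = 0.002) -> torch.Tensor:
    """Toy BSRAW-style degradation on a packed RAW tensor (B, 4, H, W) in [0, 1]:
    per-channel PSF blur -> 2x average-pool downsampling -> shot + read noise."""
    k = psf.expand(4, 1, -1, -1).contiguous()              # one kernel per channel
    blurred = F.conv2d(hr, k, padding=psf.shape[-1] // 2, groups=4)
    lr = F.avg_pool2d(blurred, kernel_size=2)              # simple 2x downsampling
    noise = torch.randn_like(lr) * torch.sqrt(shot * lr.clamp_min(0) + read ** 2)
    return (lr + noise).clamp(0.0, 1.0)

# Usage with a 7x7 Gaussian-like PSF (hypothetical values, not the challenge kernels)
coords = torch.arange(7) - 3
g = torch.exp(-(coords ** 2) / (2 * 1.5 ** 2))
psf = (g[:, None] * g[None, :])
psf = psf / psf.sum()
lr = degrade_raw(torch.rand(1, 4, 1024, 1024), psf)
print(lr.shape)   # torch.Size([1, 4, 512, 512])
```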

Data Pre-processing: The challenge dataset is based on BSRAW [18] and the NTIRE 2024 RAWSR Challenge [19], using images from the Adobe MIT5K dataset [3].

  1. Filtering: Images are manually filtered for diversity, natural properties, and sharpness (removing extremely dark, overexposed, or blurry images; only in-focus, sharp images with low ISO are considered).

  2. Normalization: All RAW images are normalized based on their black level and bit-depth (e.g., 10, 12, 14 bits per pixel).

  3. Bayer Pattern Conversion: Images are converted ("packed") into the RGGB Bayer pattern (4 channels). This is crucial as it allows transformations and degradations to be applied without damaging the original color pattern information, treating each 2×2 Bayer block as a 4-channel pixel (a minimal sketch of this packing is given after this list).

    Training Data: 1064 clean high-resolution (HR) RAW images of resolution 1024×1024×4 are provided.
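As referenced in the pre-processing steps above, here is a minimal sketch of black-level normalization and RGGB packing. The black and white levels are illustrative; the actual values come from each camera's metadata:

```python
import numpy as np

def normalize_raw(raw: np.ndarray, black_level: float, white_level: float) -> np.ndarray:
    """Map sensor values to [0, 1] using the black level and the bit-depth maximum."""
    return np.clip((raw.astype(np.float32) - black_level) / (white_level - black_level), 0.0, 1.0)

def pack_rggb(raw: np.ndarray) -> np.ndarray:
    """Pack an (H, W) RGGB Bayer mosaic into an (H/2, W/2, 4) tensor: R, G1, G2, B."""
    return np.stack([raw[0::2, 0::2],   # R  (top-left of each 2x2 block)
                     raw[0::2, 1::2],   # G1 (top-right)
                     raw[1::2, 0::2],   # G2 (bottom-left)
                     raw[1::2, 1::2]],  # B  (bottom-right)
                    axis=-1)

# Example: a 12-bit sensor with black level 64 (illustrative values)
mosaic = np.random.randint(64, 4096, size=(1024, 1024)).astype(np.float32)
packed = pack_rggb(normalize_raw(mosaic, black_level=64, white_level=4095))
print(packed.shape)   # (512, 512, 4)
```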

Testing: Three testing splits are used:

  1. Validation: 40 1024px images for model development.
  2. Test 1MP: 200 images of 1024 px resolution.
  3. Full-resolution Test: The same 200 test images at full resolution (≈12 MP). Participants process LR RAW images (e.g., 512×512×4) and submit their HR results, without access to the ground-truth images.

4.2.2. NTIRE 2025 RAW Image Restoration (RAWIR) Challenge Setup

Degradation Pipeline: Participants use the baseline degradation pipeline [18] to simulate realistic degradations. They are also encouraged to develop their own degradation pipelines to simulate more realistic blur and noise. The core components include:

  • Different noise profiles.

  • Multiple blur kernels (PSFs).

    Training Data: 2139 clean patches of dimension 512×512×4 are provided for training.

Data Pre-processing: The dataset uses images from diverse sensors (Samsung Galaxy S9, iPhone X from RAW2RAW [1], plus Google Pixel 10, Vivo X90, and Samsung S21).

  1. Filtering: Images are manually filtered for high quality, sharpness, and clear details (captured under low ISO (≤ 400), in-focus, and with proper exposure). Original RAW files are DNG format, unprocessed by ISP.

  2. Normalization: All images are normalized based on camera blacklevel and bit depth.

  3. Bayer Pattern Conversion: Images are converted to the RGGB Bayer pattern (4-channels).

  4. Cropping: Images are cropped into non-overlapping (packed) patches of dimension 512×512×4.

    Testing: The synthetic test dataset is generated by applying the challenge's degradation pipeline at three different levels:

  • Test Level 1: Degradation is only sampled noise from real noise profiles: y = x + n, where x is the clean image and n is the noise.
  • Test Level 2: Degradation is noise and/or blur, with 0.3 probability of blur and 0.5 of real noise: y = (x * k) + n, where k is a blur kernel.
  • Test Level 3: All images have realistic blur and noise: y = (x * k) + n.
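A minimal sketch of how these three levels could be sampled is shown below. A Gaussian blur stands in for the challenge's PSF kernels, and the heteroscedastic noise parameters are placeholders:

```python
import random
import torch
from torchvision.transforms.functional import gaussian_blur

def sample_noise(x: torch.Tensor, shot: float = 0.01, read: float = 0.002) -> torch.Tensor:
    """Signal-dependent (heteroscedastic) Gaussian noise: var = shot * x + read^2."""
    return torch.randn_like(x) * torch.sqrt(shot * x.clamp_min(0) + read ** 2)

def degrade(x: torch.Tensor, level: int) -> torch.Tensor:
    """x: clean packed RAW tensor (B, 4, H, W) in [0, 1]; returns a degraded y."""
    blur = lambda t: gaussian_blur(t, kernel_size=7, sigma=1.5)   # stand-in for PSF blur
    if level == 1:                                   # y = x + n
        y = x + sample_noise(x)
    elif level == 2:                                 # blur with p=0.3, noise with p=0.5
        y = blur(x) if random.random() < 0.3 else x
        y = y + sample_noise(y) if random.random() < 0.5 else y
    else:                                            # level 3: y = (x * k) + n, always
        y = blur(x) + sample_noise(x)
    return y.clamp(0.0, 1.0)

x = torch.rand(1, 4, 256, 256)
print([degrade(x, lv).shape for lv in (1, 2, 3)])
```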

4.2.3. Top Solutions for RAWSR

4.2.3.1. RawRTSR: Raw Real-Time Super Resolution (Team Samsung AI)

Method Description: The model structure is based on CASR [71]. To meet the parameter and inference-time requirements, it adopts knowledge distillation and reparameterization. XRestormer [10] acts as the teacher model. Two student models (Efficient and General) are trained from this shared teacher. Re-parameterized convolution blocks are used during training and are then fused into a single convolution layer for deployment. The task involves denoising, super-resolution, and deblurring, which are decomposed into two fundamental processes: denoising and detail enhancement. Both networks employ a straight-through architecture integrating these modules.

  • Efficient Model (RawRTSR): (see Figure 2, student-model part)

    • Denoising Module: Reduces image resolution via PixelUnshuffle downsampling to capture global information for noise removal. Processes features through four convolutions, then restores resolution via upsampling.
    • Detail Enhancement Module: Employs five convolutions to recover fine textures. Residual connections from the original input prevent excessive detail loss during denoising.
    • Upscaling: The final output is upscaled through pixel shuffle operations to achieve super-resolution.
    • Channel Count: Maintains a maximum feature channel number of 48 throughout both modules.
  • General Model (RawRTSR-L): (see Figure 3)

    • Channel Count: Increases the number of feature channels from 48 to 64 to enhance representational capacity compared to RawRTSR.
    • Channel Attention: Incorporates a channel attention mechanism to adaptively recalibrate feature responses, preventing information redundancy during the denoising stage due to channel expansion.

Implementation Details:

  • Synthetic Degradation: Two methods are used to obtain low quality (LQ) images:

    1. Randomly add noise and blur multiple times at RAW domain.
    2. Convert RAW to RGB, add motion blur and noise, then convert back to RAW.
  • Training Steps (Three steps): All steps use PyTorch and A100 GPU.

    1. Separate Training: Teacher and student models are trained separately. LQ patches are cropped to 256×256. AdamW [47] optimizer (β1 = 0.9, β2 = 0.999, weight decay 0.0001) with a learning rate of 0.0005 for 800 epochs. Uses L1 loss.
    2. Feature Distillation: Model initialized with weights from step 1. Feature distillation is used, with learning rate 0.00005 for 800 epochs. Uses L2 loss.
    3. Fine-tuning: Model initialized with weights from step 2. LR patches are 512×512. (Details on loss or epochs for this step are not fully elaborated beyond initialization.)
  • Reparameterization: The final submitted model is the student model after reparameterization to ensure the parameter and inference-time requirements are met (a generic sketch of this branch fusion is given after the figures below).

    The following figure (Figure 2 from the original paper) shows the network architecture of the student model and the submitted model for RAW image super-resolution processing:

    Figure 2. Network architecture of the teacher model, the student model (training mode), and the submitted model (inference mode) for RAW image super-resolution. In the teacher model, the low-resolution input LR passes through several convolution layers and SSAB modules to progressively produce the high-resolution output SR.

The following figure (Figure 3 from the original paper) illustrates the overall structure of the RawRTSR-L network:

Figure 3. The overall structure of the RawRTSR-L network: 0.311M parameters, running at 4.44 ms on the A100 GPU. The LR image passes through the denoising module and the detail-enhancement module, built from convolution and activation layers, to produce the SR image.
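The report only states that the re-parameterized convolution branches are fused into a single convolution for deployment. A generic RepVGG-style sketch of this kind of fusion (parallel 3×3 and 1×1 branches plus an identity path, not the team's exact block) looks like:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RepConvBlock(nn.Module):
    """Training-time block with parallel 3x3 and 1x1 branches plus identity;
    fuse() merges them into a single 3x3 convolution for inference."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv1 = nn.Conv2d(channels, channels, 1)

    def forward(self, x):
        return self.conv3(x) + self.conv1(x) + x

    def fuse(self) -> nn.Conv2d:
        c = self.conv3.out_channels
        fused = nn.Conv2d(c, c, 3, padding=1)
        w1 = F.pad(self.conv1.weight, [1, 1, 1, 1])      # pad 1x1 kernel to 3x3
        w_id = torch.zeros_like(self.conv3.weight)        # identity as a 3x3 kernel
        for i in range(c):
            w_id[i, i, 1, 1] = 1.0
        fused.weight.data = self.conv3.weight.data + w1 + w_id
        fused.bias.data = self.conv3.bias.data + self.conv1.bias.data
        return fused

# Sanity check: the fused conv reproduces the training-time block
block = RepConvBlock(8).eval()
x = torch.randn(1, 8, 32, 32)
print(torch.allclose(block(x), block.fuse()(x), atol=1e-5))
```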

4.2.3.2. Streamlined Transformer Network for RealTime Raw Image Super Resolution (Team USTC-VIDAR)

Method Description: The overall framework (Figure 4) is a streamlined version of RBSFormer [33], designed for efficient processing.

  • Main Branch: Consists of a 3×3 convolution, N cascaded transformer blocks, and an upsample block.
  • Residual Branch: Contains only an upsample block.
  • Upsample Block: Employs a 3×3 convolution followed by a PixelShuffle operation [58] to upscale features by 2×.

Computational Complexity Mitigation:

  • Transformers: Self-attention modules and feed-forward networks are primary sources of complexity.
  • InceptionNeXt [72]: Incorporated for efficient spatial feature extraction during Q, K, and V projection by leveraging partial convolution and depth-wise convolution.
  • ShuffleNet [78]: Adopted for feed-forward networks with G channel groups to reduce input projection parameters while maintaining cross-channel communication.
  • Output Projection: Streamlined using element-wise multiplication with a depth-wise convolution gate (inspired by [33, 48]).
  • Parameters: N = 8 transformer blocks and G = 4 channel groups.

Implementation Details:

  • Dataset: Trained exclusively on the provided NTIRE 2025 dataset.

  • Augmentation: Random horizontal flips, vertical flips, and transpositions.

  • Degradation: Degraded images simulated using the BSRAW degradation pipeline with additional PSF kernels [26].

  • Training (Two stages):

    1. Stage 1: 300k steps, batch size 8, patch size 192. Learning rate decayed from 2e-4 to 1e-6. ~12 hours on NVIDIA RTX 3090 GPUs.
    2. Stage 2: 147k steps, batch size 64, patch size 256. Learning rate decayed from 1e-4 to 1e-6. ~31 hours on A800 GPUs.
  • Optimizer: Adam with default hyperparameters.

  • Loss Function: Combination of Charbonnier loss and a Frequency loss [49], with the latter assigned a weight of 0.5 (a minimal sketch of this combined loss is given after the figure below).

    The following figure (Figure 4 from the original paper) shows Team USTC framework for RAW image super resolution:

    Figure 4. Team USTC framework for RAW image super-resolution: the input low-resolution RAW image is processed by a stack of Transformer blocks and upsampled to produce the high-resolution RAW output.
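Several teams combine a Charbonnier loss with a frequency-domain (FFT) loss. A minimal sketch of such a combination, using the 0.5 weight stated by Team USTC-VIDAR for the frequency term (the exact formulation of the Frequency loss in [49] may differ):

```python
import torch

def charbonnier_loss(pred, target, eps: float = 1e-3):
    """Smooth L1-like penalty: sqrt((x - y)^2 + eps^2)."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

def frequency_loss(pred, target):
    """L1 distance between 2D FFT spectra (one common formulation)."""
    fp = torch.fft.fft2(pred, norm="ortho")
    ft = torch.fft.fft2(target, norm="ortho")
    return (fp - ft).abs().mean()

def total_loss(pred, target, freq_weight: float = 0.5):
    return charbonnier_loss(pred, target) + freq_weight * frequency_loss(pred, target)

# Example on a batch of packed RAW patches (B, 4, H, W)
pred, target = torch.rand(2, 4, 64, 64), torch.rand(2, 4, 64, 64)
print(total_loss(pred, target).item())
```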

4.2.3.3. SMFFRaw: Simplified Multi-Level Feature Fusion Network for RAW Image Super-Resolution (Team XJTU)

Method Description: SMFFRaw is a computationally efficient network based on MFFSSR [41], designed to bridge the gap between high restoration quality and computational overhead. It employs a novel iterative training strategy. The architecture (Figure 5) has three main components:

  • Shallow Feature Extraction: A simple 3×3 convolutional operation extracts shallow features $F_0$ from the degraded input image $I_{LR}$.
  • Deep Feature Extraction: Deep features are extracted using a sequence of Hybrid Attention Feature Extraction Block (HAFEB) modules. Each HAFEB consists of:
    • Point-wise Convolution (Pconv)
    • Depthwise Convolution (DWconv)
    • Reparameterized Convolution (RepConv) (no reparameterization during inference)
    • Channel Attention (CA)
    • Large Kernel Attention (LKA)
  • Reconstruction: The feature map is first upsampled using a 3×3 convolutional layer with pixel shuffle [58]. Then, it is added to the bilinearly interpolated input to produce the final super-resolved result $I_{SR}$. This design reduces training complexity while enhancing SR performance.

Implementation Details:

  • Dataset: Solely uses the provided challenge dataset.

  • Augmentations: Common augmentations (rotation, flipping) along with mixup [76].

  • Degradation: BSRAW degradation pipeline [18] generates RAW-degraded image pairs.

  • Training (Five stages): (See Table 5)

    • Each phase gradually introduces more complex degradations or increases patch size.
    • Adam optimizer [35] with initial learning rate 1e-3, decaying to 1e-6 using Cosine Annealing.
    • Loss function: Combination of Charbonnier loss [64] and frequency loss [31] for the first four stages; MSE and frequency loss for the final stage.
  • Hardware: PyTorch on RTX 4090 GPUs.

    The following figure (Figure 5 from the original paper) illustrates the overall framework of the SMFFRaw model:

    Figure 5. Overall framework of the SMFFRaw model, showing shallow feature extraction, deep feature extraction, and image reconstruction, along with the Hybrid Attention Feature Extraction Block (HAFEB) and its Channel Attention (CA) and Large Kernel Attention (LKA) mechanisms.

4.2.3.4. An Enhanced Transformer Network for Raw Image Super-Resolution (Team EGROUP)

Method Description: This approach leverages the RBSFormer [32] architecture, directly processing RAW images for super-resolution. Operating in the RAW domain avoids complexities of non-linear transformations from ISP. The pipeline maintains a three-component structure:

  1. Shallow Feature Extraction: Given a low-resolution RAW image $I_{LR} \in \mathbb{R}^{H \times W \times 4}$, shallow features $F_s$ are extracted: $ F_s = \mathbf{Conv}_{3 \times 3}(I_{LR}) $ where $\mathbf{Conv}_{3 \times 3}$ is a 3×3 convolutional operation.

  2. Deep Feature Extraction: Transformer blocks are used to extract deep features: $ F_i = \mathcal{H}_{tb_i}(F_{i-1}), \quad i = 1, 2, ..., K $ and $ F_d = \mathrm{Conv}_{3 \times 3}(F_K) $, where $\mathcal{H}_{tb_i}$ represents the $i$-th transformer block and $K$ is the total number of blocks. $F_K$ is the output of the last transformer block, which is then processed by another 3×3 convolution to get $F_d$.

  3. Reconstruction: The high-resolution (HR) image $I_{HR}$ is reconstructed by aggregating features: $ I_{HR} = \mathcal{H}_{rec}(I_{LR}, F_d) = \mathbf{Up}(F_s + F_d) $, where $\mathcal{H}_{rec}$ is the reconstruction head and $\mathbf{Up}$ denotes an upsampling operation, typically PixelShuffle or similar, combined with a residual connection between the shallow and deep features.
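The three components above map directly to a short forward pass. The sketch below uses plain convolutional blocks as stand-ins for the RBSFormer transformer blocks (the EXCA attention and EGFN modules are not reproduced), so it illustrates the pipeline structure only:

```python
import torch
import torch.nn as nn

class RawSRPipeline(nn.Module):
    """Shallow conv -> K (placeholder) blocks -> conv -> residual -> 2x PixelShuffle."""
    def __init__(self, channels: int = 48, num_blocks: int = 6, scale: int = 2):
        super().__init__()
        self.shallow = nn.Conv2d(4, channels, 3, padding=1)                  # F_s
        self.blocks = nn.Sequential(*[
            nn.Sequential(nn.Conv2d(channels, channels, 3, padding=1), nn.GELU())
            for _ in range(num_blocks)])                                     # stand-in for transformer blocks
        self.conv_d = nn.Conv2d(channels, channels, 3, padding=1)            # F_d
        self.up = nn.Sequential(
            nn.Conv2d(channels, 4 * scale ** 2, 3, padding=1),
            nn.PixelShuffle(scale))                                          # Up(.)

    def forward(self, lr):                                                   # lr: (B, 4, H, W)
        fs = self.shallow(lr)
        fd = self.conv_d(self.blocks(fs))
        return self.up(fs + fd)                                              # I_HR = Up(F_s + F_d)

print(RawSRPipeline()(torch.rand(1, 4, 64, 64)).shape)  # (1, 4, 128, 128)
```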

Implementation Details:

  • Dataset: Official training dataset provided by organizers.

  • Augmentation: Random noise and blur degradation patterns in the RAW domain.

  • Optimizer: AdamW with β1 = 0.9, β2 = 0.999.

  • Learning Rate: Initial 7e-4, cosine annealing to 1e-6.

  • Hardware: PyTorch 1.11.0 with two NVIDIA 4090 GPUs.

  • Training Parameters: Batch size 8, crop size 192.

  • Loss Function: Trained with L1 loss for 100k iterations, then fine-tuned with FFT loss for 20k iterations.

    The following figure (Figure 6 from the original paper) shows the architecture of the RBSFormer [32] used by Team EGROUP for RAW image super-resolution:

    Figure 6. The architecture of the RBSFormer [32] used by Team EGROUP for RAW image super-resolution: the RAW LR input passes through a stack of Transformer blocks built from enhanced cross-covariance attention (EXCA) and an enhanced gated feed-forward network (EGFN) to produce the RAW HR output.

4.2.3.5. A fast neural network to do super-resolution based on NAFSSR (Team NJU)

Method Description: Team NJU RSR proposes a CNN framework for RAW image super-resolution based on the NAFBlock from NAFSSR [14]. It adopts reparameterization during inference, fusing Batch Normalization parameters into CNN layers for efficient inference. The architecture (Figure 7) redesigns the NAFBlock by modifying the SimpleGate component with a CNN layer and GeLU activation function, and removing the FFN component to constrain parameters. Layer Normalization is replaced with Batch Normalization for easier training and more efficient inference (as BN can be fused).

Computational Cost: For a 4-channel RGGB RAW image patch of size 256×256, NAFBN has 11.90 GFLOPS and 189K trainable parameters after BN fusion. Inference time on an NVIDIA RTX 3090 is 7.19 ms (with BN fusion) and 5.19 ms (half precision), compared to 9.49 ms (without fusion).
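The BN fusion mentioned above is standard algebra. A generic sketch (not the team's code) of folding a BatchNorm layer into the preceding convolution:

```python
import torch
import torch.nn as nn

def fuse_conv_bn(conv: nn.Conv2d, bn: nn.BatchNorm2d) -> nn.Conv2d:
    """Fold BN statistics into the conv: y = gamma*(Wx + b - mu)/sigma + beta."""
    fused = nn.Conv2d(conv.in_channels, conv.out_channels, conv.kernel_size,
                      stride=conv.stride, padding=conv.padding, bias=True)
    std = torch.sqrt(bn.running_var + bn.eps)
    scale = bn.weight / std                                      # gamma / sigma
    fused.weight.data = conv.weight.data * scale.reshape(-1, 1, 1, 1)
    conv_bias = conv.bias.data if conv.bias is not None else torch.zeros_like(bn.running_mean)
    fused.bias.data = (conv_bias - bn.running_mean) * scale + bn.bias.data
    return fused

# Sanity check in eval mode (BN uses running statistics)
conv, bn = nn.Conv2d(4, 8, 3, padding=1), nn.BatchNorm2d(8)
bn.eval()
x = torch.randn(1, 4, 32, 32)
print(torch.allclose(bn(conv(x)), fuse_conv_bn(conv, bn)(x), atol=1e-5))
```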

Implementation Details:

  • Dataset: NTIRE 2025 RAW Image Super Resolution challenge data, with degradation pipeline [18].

  • Framework: PyTorch on single vGPU-32.

  • Model: 12 NAFBlocks of width 48.

  • Optimizer: AdamW with β1 = 0.9, β2 = 0.99.

  • Batch Normalization: Momentum set to 0.03.

  • Learning Rate: Initial 1e-3, cosine annealing to 1e-6.

  • Training: 50K iterations, ~7 hours.

  • Data Augmentation:

    • Random patch cropping (32×32).
    • Random white balance (unit interval normalized images).
    • Random horizontal or vertical flips.
    • Random right-angle rotations.
    • Exposure adjustment (linear scaling in [-0.1, 0.1]).
    • Random downsampling with AvgPool2d and bicubic interpolation (0.3 probability) for lower-resolution images. All augmentations are applied with probability 0.5.
  • Loss Function: L1 loss.

    The following figure (Figure 7 from the original paper) shows NAFBN proposed by Team NJU RSR:

    Figure 7. NAFBN proposed by Team NJU RSR: the LR image is processed by three convolution layers and a stack of NAFBlock modules, and the result is converted to the high-resolution SR image via Pixel Shuffle.

The following figure (Figure 8 from the original paper) shows adopted NAFBlock used by Team NJU RSR:

Figure 8. Adopted NAFBlock used by Team NJU RSR, combining convolution and batch-normalization layers with depthwise convolution and channel attention.

4.2.3.6. An efficient neural network baseline report using Mamba (Team TYSL)

Method Description: Team TYSL implemented MambaIRv2 [24] on RAW data to provide a baseline from a different perspective. The architecture (Figure 9) is a simplified, lightweight version of MambaIRv2 (model size under 0.2M parameters).

  • Key Components: embedding dim = 32, m = 4, n = 2.
  • Motivation: Chosen for its potential for lightweighting and its novel application to RAW data.
  • Downsampling Analysis: Explored various downsampling methods including direct bicubic downsampling on each channel, AvgPool2D (as provided in the competition), and bicubic downsampling with bias. AvgPool2D performed significantly better. The team highlights the impact of downsampling methods if test set images are synthetic.

Proposed Downsampling Method (Figure 10): Based on the intuition that the central pixel value of a region is closer to the region's average.

  • For a 4×4 unit downsampled to 2×2 (RGGB pattern), a red pixel after downsampling represents the average of the top-left 2×2 pixels before downsampling.
  • If direct bicubic downsampling or AvgPool2D is applied to the red channel, the resulting value approximates the pixel at coordinate (1,1) in the original 4×4 image, which ends up at the bottom-right of the downsampled red pixel, not its center.
  • Therefore, they propose bicubic interpolation using the nearest 16 points, with the interpolation point positioned at the center of the downsampled red pixel, to achieve more precise downsampling.
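A sketch contrasting the downsampling options discussed above on a packed (B, 4, H, W) tensor is given below. The center-offset bicubic variant approximates the idea in the last bullet via grid_sample, using the +0.25-pixel offset derived for the red channel in the text; the offsets for the other channels in the actual method may differ:

```python
import torch
import torch.nn.functional as F

def downsample_avgpool(x):
    """AvgPool2d on the packed (B, 4, H, W) RAW tensor, as provided in the challenge."""
    return F.avg_pool2d(x, kernel_size=2)

def downsample_bicubic(x):
    """Direct per-channel bicubic downsampling (reported to perform worse)."""
    return F.interpolate(x, scale_factor=0.5, mode="bicubic", align_corners=False)

def downsample_bicubic_offset(x, offset: float = 0.25):
    """Bicubic sampling at the block centre rather than the grid-aligned position."""
    b, c, h, w = x.shape
    oh, ow = h // 2, w // 2
    ys = torch.arange(oh, dtype=x.dtype) * 2 + offset      # target pixel coords in input
    xs = torch.arange(ow, dtype=x.dtype) * 2 + offset
    # Convert pixel coordinates to grid_sample's [-1, 1] range (align_corners=False).
    gy = (ys + 0.5) / h * 2 - 1
    gx = (xs + 0.5) / w * 2 - 1
    grid = torch.stack(torch.meshgrid(gy, gx, indexing="ij"), dim=-1)[..., [1, 0]]
    grid = grid.unsqueeze(0).expand(b, -1, -1, -1)
    return F.grid_sample(x, grid, mode="bicubic", align_corners=False)

x = torch.rand(1, 4, 512, 512)
print(downsample_avgpool(x).shape, downsample_bicubic_offset(x).shape)
```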

Implementation Details:

  • Hardware: Server with several A100 GPUs, PyTorch framework.

  • Batch Size: 64.

  • Learning Rate: 8e-4.

  • Training Time: ~26 hours (slow due to image degradation process).

  • Dataset: Provided training set, adhering to the given degradation pipeline (except for downsampling), no image enhancement.

    The following figure (Figure 9 from the original paper) illustrates the structure of MambaIRv2:

    Figure 9. MambaIRv2 structure, showing its modules and connections, including W-MSA and ASSM, and the merge and reconstruction steps.

The following figure (Figure 10 from the original paper) shows Downsampling structure:

Figure 10. Downsampling structure: (a) standard bicubic interpolation and average pooling; (b) bicubic interpolation with an offset and its per-color computation.

4.2.3.7. RepRawSR: Accelerating Raw Image Super-Resolution with Reparameterization (Team EiffLowCVer)

Method Description: Team EiffLowCVer designed two lightweight variants of SYEnet [23] for RAW image super-resolution, incorporating structural reparameterization and efficient network design.

  • RepTiny-21k:

    • Increases the number of feature extraction modules to four (from SYEnet's one).
    • Introduces skip connections (red arrows in Figure 11) to mitigate gradient vanishing and stabilize training.
    • Channel number set to 16 for efficiency.
    • Achieves 5.65G FLOPs and 21k parameters.
  • RepLarge-97k:

    • Increases the channel width to 32.

    • Uses only one Feature Extraction Module.

    • Incorporates the FEBlock (a pre-processing module from SYEnet) for super-resolution.

    • Parameters and FLOPs increase significantly compared to Tiny-21k.

      The team observed that increasing feature extraction modules with skip connections provides an optimal balance for performance and speed. An additional tail generates a second predicted image from intermediate feature maps, included in the loss calculation during training for stability, but removed during inference for efficiency.

Implementation Details:

  • Optimizer: Adam with initial learning rate 8e-4.
  • Learning Rate Scheduler: CosineAnnealingRestartLR.
  • Hardware: NVIDIA GeForce RTX 3090 24Gb.
  • Datasets: 1,064 RAW images provided by organizers, 40 images randomly selected for validation.
  • Training Time: 22 hours for Tiny-21k, 26 hours for Large-97k.
  • Training Strategies (Multi-stage):
    1. Stage 1: 100,000 training steps. Randomly cropped 256×256 GT patches, random rotation and flipping. LQ images generated using the organizer's degradation pipeline.

    2. Stage 2: Patch size increased to 384×384. Training continued for an additional 50,000 steps.

      The following figure (Figure 11 from the original paper) shows the main branch of RepRawSR proposed by Team EffiLowCVer:

      Figure 11. Main branch of RepRawSR proposed by Team EffiLowCVer, showing the feature extraction modules and the channel attention mechanism; the main operations consist of convolution layers and batch normalization, with the feature extraction module repeated four times in Tiny-21k.

4.2.3.8. ECAN: Efficient Channel Attention Network for RAW Image Super-Resolution (Team CUEE-MDAP)

Method Description: ECAN aims to be an efficient super-resolution algorithm for the NTIRE 2025 Efficient Track (parameter limit of 0.2M). It is a CNN-based model, trained end-to-end on the NTIRE 2025 RAW training dataset without external pre-trained models. The ECAN architecture (Figure 12) uses four stages:

  1. Shallow Feature Extraction: 3×3 convolution on the 4-channel RAW input.

  2. Deep Feature Extraction: 8 EfficientResidualBlocks with a global skip connection. Each EfficientResidualBlock uses:

    • An inverted residual structure with depthwise separable convolutions (inspired by MobileNetV2 [57]).
    • A Squeeze-and-Excitation (SE) block [27] for channel attention. This focus on channel interaction aligns with the group's work on Multi-FusNet [55] and channel attention networks [79].
  3. Upsampling: PixelShuffle.

  4. Reconstruction: 3×3 convolution to the 4-channel RAW output.

    Efficiency: Only 93,092 parameters (≈0.093M), well below the 0.2M limit. Computational cost is estimated at 21.82 GMACs (or 43.65 GFLOPs) for a 512×512×4 input (scaling to a 1MP output). Inference time is ~8.25 ms per output megapixel on an NVIDIA RTX 4090.
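A hedged sketch of what such an EfficientResidualBlock could look like (an inverted residual with a depthwise separable convolution and an SE block, in the spirit of the description above; channel widths and expansion ratio are illustrative, not the team's exact configuration):

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: global pooling -> bottleneck MLP -> channel gating."""
    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1), nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1), nn.Sigmoid())

    def forward(self, x):
        return x * self.fc(x)

class EfficientResidualBlock(nn.Module):
    """Inverted residual with depthwise separable conv + SE, as described for ECAN."""
    def __init__(self, channels: int = 32, expansion: int = 2):
        super().__init__()
        hidden = channels * expansion
        self.body = nn.Sequential(
            nn.Conv2d(channels, hidden, 1), nn.ReLU(inplace=True),        # expand
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden),       # depthwise
            nn.ReLU(inplace=True),
            SEBlock(hidden),
            nn.Conv2d(hidden, channels, 1))                               # project

    def forward(self, x):
        return x + self.body(x)

print(EfficientResidualBlock(32)(torch.rand(1, 32, 64, 64)).shape)
```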

Implementation Details:

  • Framework: PyTorch.

  • Optimizer: AdamW (β1 = 0.9, β2 = 0.999), weight decay 1e-4.

  • Learning Rate: Initial 4e-4, cosine annealing to 1e-7.

  • Hardware: NVIDIA RTX 4090 (24GB).

  • Datasets: NTIRE 2025 RAW training set. No extra data.

  • Augmentation: Random 90/180/270 rotations, horizontal flips.

  • Degradation: Gaussian blur (σ ≤ 4.0, p = 0.7) + Gaussian noise (level ≤ 0.04, p = 0.95).

  • Training Time: 600 epochs (~1.6 hours).

  • Training Strategies: End-to-end from scratch, Automatic Mixed Precision (AMP). Input patch size 128×128, batch size 64. L1 loss. Gradient clipping at 1.0.

    The following figure (Figure 12 from the original paper) illustrates the efficient residual blocks stack structure in RAW image processing:

    Figure 12. The efficient residual block stack used in ECAN for RAW image processing, including the global skip connection, PixelShuffle upsampling, and Squeeze-and-Excitation (SE) blocks, from input feature extraction to output reconstruction.

4.2.4. Top Solutions for RAWIR

4.2.4.1. Efficient RAW Image Restoration (Team SamsungAI)

Method Description: This solution for the RAW Restoration Challenge is primarily based on NAFNet [7], a lightweight network. To satisfy parameter constraints and preserve performance, Samsung AI reduced NAFNet parameters and implemented a distillation strategy. X-Restormer [10] is adopted as the teacher model. Two student models (ERIRNet-S and ERIRNet-T) are trained through knowledge distillation from this shared teacher. The method addresses coupled degradation processes (noise suppression and blur correction). They select the NafBlock architecture from NAFNet [7] as a pivotal component due to its SOTA efficacy in joint denoising and deblurring. To achieve parameter efficiency, they designed networks based on a narrow-and-deep architectural principle, reducing channel dimensions while preserving layer depth.

  • ERIRNet-S (Figure 15): A simplified version of NAFNet with reduced channel width and fewer encoder-decoder blocks for improved efficiency.
  • ERIRNet-T (Figure 16): Further reduces complexity by decreasing the number of blocks and using smaller FFN expansion ratios. It replaces PixelUnshuffle layers with ConvTranspose, enabling deeper architectures under a strict parameter budget.

Implementation Details:

  • Training Process (Three stages - Figure 14):

    1. Stage 1 - Train Base Model: Each ERIRNet variant (S and T) is trained independently using the original ground truth supervision.
    2. Stage 2 - Train Teacher Model: A Teacher Model based on X-Restormer is trained and fine-tuned on each mobile device.
    3. Stage 3 - Distillation with Teacher Model: Knowledge distillation is applied. Models are initialized with weights from Stage 1. The teacher's outputs are used as targets to guide both ERIRNet-S and ERIRNet-T (a minimal sketch of this step is given after the figures below).
  • Framework: PyTorch on A100 GPU.

  • Optimizer: Adam [35] with β1 = 0.5, β2 = 0.999.

  • Loss Function: L1 loss.

  • Learning Rate: Initial 1e-4 (Stage 1), reduced to 1e-5 (Stage 3).

  • Scheduler: MultiStepLR scheduler for learning rate decay.

  • Training Epochs: 1000 (Stage 1), 2000 (Stage 2), 1000 (Stage 3).

  • Batch Size: 16.

    The following figure (Figure 14 from the original paper) shows Training Stage Description by Samsung AI:

    Figure 14. Training stage description by Samsung AI: the upper part shows Stage 2 (training the X-Restormer teacher), the lower part shows Stage 1 (training the ERIRNet base model), and the right side shows Stage 3 with the distillation loss.

The following figure (Figure 15 from the original paper) shows Architectures of ERIRNet-S, with reduced channels and fewer blocks. Proposal by Samsung AI:

Figure 15. Architecture of ERIRNet-S, with reduced channels and fewer blocks: NafBlocks with PixelShuffle modules and down/upsampling convolution layers. Proposal by Samsung AI.

The following figure (Figure 16 from the original paper) shows Architectures of ERIRNet-T, with ConvTranspose and reduced FFN expansion. Proposal by Samsung AI:

Figure 16. Architecture of ERIRNet-T, with ConvTranspose and reduced FFN expansion: different NafBlock configurations with downsampling and transposed-convolution upsampling; annotations show the per-layer output sizes and channel counts. Proposal by Samsung AI.
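A minimal sketch of the Stage-3 distillation update described above, with placeholder conv stacks standing in for ERIRNet and the X-Restormer teacher. The Stage-3 learning rate of 1e-5 and the Adam betas match the stated implementation details; everything else is illustrative:

```python
import torch
import torch.nn as nn

def distillation_step(student: nn.Module, teacher: nn.Module,
                      lq: torch.Tensor, optimizer: torch.optim.Optimizer) -> float:
    """One Stage-3 update: the frozen teacher's restored output is the target."""
    teacher.eval()
    with torch.no_grad():
        target = teacher(lq)                      # teacher restoration as pseudo-GT
    pred = student(lq)
    loss = nn.functional.l1_loss(pred, target)    # L1 distillation loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with placeholder conv models standing in for ERIRNet / X-Restormer
student = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 4, 3, padding=1))
teacher = nn.Sequential(nn.Conv2d(4, 16, 3, padding=1), nn.ReLU(), nn.Conv2d(16, 4, 3, padding=1))
opt = torch.optim.Adam(student.parameters(), lr=1e-5, betas=(0.5, 0.999))
print(distillation_step(student, teacher, torch.rand(2, 4, 128, 128), opt))
```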

4.2.4.2. Modified SwinFIR-Tiny for Raw Image Restoration (Team Miers)

Method Description: The method is an improvement of SwinFIR-Tiny [76]. It enhances the model's feature representation capability by aggregating outputs of different RSTB (Residual Swin Transformer Block) modules. It also incorporates the HAB (Hybrid Attention Block) module from HAT [8] via zero convolution and applies reparameterization techniques [23]. The final model has 4.76M parameters. The complete network architecture is systematically illustrated in Figure 17.

  • Baseline: SwinFIR-Tiny [76].
  • RSTBs: Four Residual Swin Transformer Blocks for hierarchical feature extraction. Each RSTB contains 5 or 6 Hybrid Attention Blocks (HABs) and 1 HSFB.
  • HABs: Integrate multi-head self-attention and a locally constrained attention module to capture multi-scale contextual information.
  • FeatureFusion Strategy: A hierarchical FeatureFusion strategy systematically aggregates outputs from each RSTB block to mitigate shallow feature degradation during deep feature extraction.
  • Enhancements: Integrates Channel Attention Block (CAB [8]) and CovRep5 [23] module to improve noise robustness and blur resilience.

Implementation Details:

  • Framework: PyTorch, modified from SwinFIR project.

  • Dataset: Divided into training (2,099 samples) and validation (40 samples).

  • Data Degradation: BSRAW degradation pipeline [18], with enhanced noise levels:

    • log_max_shot_noise increased from -3 to -2.
    • sigma_1 range (heteroscedastic Gaussian noise) extended from (5e-3, 5e-2) to (5e-3, 1e-1).
    • sigma_2 range extended from (1e-3, 1e-2) to (1e-3, 5e-2).
  • Hardware: Four H800 80G GPUs.

  • Data Augmentation: mixup.

  • Optimizer: Adam.

  • Loss Function: Charbonnier loss.

  • Development Stages (Four stages):

    1. Baseline Training: Original SwinFIR-Tiny with original data degradation. Initial learning rate 2e-4, batch size 8, input size 180×180. Trained for 250K iterations.
    2. Module Addition: Added Feature Fusion module, Channel Attention module, and ConvRep5 module. Initialized with weights from stage 1. Original data degradation. Initial learning rate 2e-4, batch size 8, input size 180×180. Trained for 170K iterations.
    3. CAB and Noise Intensity: Introduced CAB module using zero convolution and increased noise intensity. Initial learning rate 3e-5, batch size 8, input size 180×180. Trained for 140K iterations.
    4. Fine-tuning: Further adjusted initial learning rate to 2e-5, reduced batch size to 2, increased input size to 360×360. Trained for 15K iterations.
  • Final Model: Utilized reparameterization to convert ConvRep5 module into a standard 5×55 \times 5 convolution.

    The following figure (Figure 17 from the original paper) shows CABATTSwinFIR proposed method by Team Miers (Xiaomi Inc.):

    Figure 17. CABATTSwinFIR proposed method by Team Miers (Xiaomi Inc.). The diagram shows the overall architecture and its components, including the RSTB, HAB, and feature fusion blocks used to process and restore the RAW image.
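
Regarding the enhanced noise levels listed in the data-degradation settings above, a hedged sketch of a heteroscedastic Gaussian noise model is given below. It does not reproduce the exact BSRAW [18] parameterization; it assumes the common Poissonian-Gaussian approximation in which $\sigma_1$ scales the signal-dependent (shot-like) term and $\sigma_2$ the signal-independent (read-like) term, with the ranges extended as stated in the text.

```python
import numpy as np

rng = np.random.default_rng(0)

def heteroscedastic_gaussian_noise(x, sigma1_range=(5e-3, 1e-1), sigma2_range=(1e-3, 5e-2)):
    """Adds signal-dependent Gaussian noise to a clean RAW image x in [0, 1].

    Assumed form (Poisson-Gaussian approximation): var(x) = sigma1^2 * x + sigma2^2.
    The sampling ranges mirror the extended values reported by the team.
    """
    sigma1 = rng.uniform(*sigma1_range)
    sigma2 = rng.uniform(*sigma2_range)
    std = np.sqrt(sigma1 ** 2 * x + sigma2 ** 2)
    noisy = x + rng.normal(size=x.shape) * std
    return np.clip(noisy, 0.0, 1.0)

# Example on a packed 4-channel RGGB patch
clean = rng.uniform(0.0, 1.0, size=(512, 512, 4)).astype(np.float32)
noisy = heteroscedastic_gaussian_noise(clean)
```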

4.2.4.3. Multi-PromptIR: Multi-scale Prompt-base Raw Image Restoration (Team WIRTeam)

Method Description: Building on PromptIR [51] and Restormer [75], Multi-PromptIR transforms a degraded RAW image into a high-quality, clear image. The core is a four-layer encoder-decoder architecture with Transformer Blocks [75] for cross-channel global feature extraction.

  • Encoding Stage: Incorporates images at reduced resolutions (1/2, 1/4, and 1/8 of original size) to enrich the encoding process, drawing from multi-resolution image success [11].
  • Decoding Phase: Adopts a specialized prompt mechanism [51], consisting of a Prompt Generation Module (PGM) and a Prompt Interaction Module (PIM). The model combines CNNs and Transformers as shown in Figure 18. The total number of parameters is 39.92M.
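
A minimal sketch of the prompt mechanism described above, following the general PromptIR [51] formulation: the Prompt Generation Module (PGM) mixes a small bank of learnable prompt components using weights predicted from the incoming features, and the Prompt Interaction Module (PIM) fuses the generated prompt back into the decoder features. Component counts and dimensions are illustrative, and the fusion here is simplified to a convolution rather than the Transformer block used in PromptIR.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptGenerationModule(nn.Module):
    """Mixes N learnable prompt components with input-conditioned weights (PGM sketch)."""
    def __init__(self, channels: int, num_components: int = 5, prompt_size: int = 16):
        super().__init__()
        self.components = nn.Parameter(torch.randn(num_components, channels, prompt_size, prompt_size))
        self.weight_head = nn.Linear(channels, num_components)

    def forward(self, feat: torch.Tensor) -> torch.Tensor:
        pooled = feat.mean(dim=(2, 3))                       # (B, C) global context
        w = self.weight_head(pooled).softmax(dim=1)          # (B, N) mixing weights
        prompt = torch.einsum('bn,nchw->bchw', w, self.components)
        # Resize the prompt to the spatial size of the current decoder level.
        return F.interpolate(prompt, size=feat.shape[-2:], mode='bilinear', align_corners=False)

class PromptInteractionModule(nn.Module):
    """Fuses the generated prompt with decoder features (PIM sketch, conv-based)."""
    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        return self.fuse(torch.cat([feat, prompt], dim=1))

# Usage at one decoder level (hypothetical sizes)
feat = torch.randn(1, 48, 64, 64)
pgm, pim = PromptGenerationModule(48), PromptInteractionModule(48)
out = pim(feat, pgm(feat))
```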

Implementation Details:

  • Optimizer: AdamW (initial learning rate $2 \times 10^{-4}$, reduced to $1 \times 10^{-6}$ with cosine annealing).

  • Hardware: 1 × NVIDIA A100 (80G).

  • Datasets: Provided datasets only. PSF blur and noise added to original images for synthetic degradations.

  • Training: End-to-end manner for 700 epochs.

  • Optimizer Parameters: $\beta_1 = 0.9$, $\beta_2 = 0.99$.

  • Data Augmentation: Horizontal and vertical flips.

  • Inference: Splits input degraded images into 256×256 patches, restores them, then merges the processed patches (a minimal tiling sketch is given after Figure 18 below).

  • Training Time: ~12 hours.

    The following figure (Figure 18 from the original paper) shows Overall architecture proposed by the Team WIRTeam:

    Figure 18. Overall architecture proposed by the Team WIRTeam. The diagram shows the input RAW image (Input I) on the left, the restored image on the right, and the processing path in between, with several downsampling and Transformer stages. Key components are the Prompt Interaction Module (PIM) and the Prompt Generation Module (PGM), which exploit information from different depths to progressively restore the image and handle blur and noise.
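
The tiled inference mentioned in the implementation details can be sketched as follows: the degraded RAW image is split into 256×256 patches (padding the borders if needed), each patch is restored independently, and the outputs are stitched back. The `model` here is any restoration network; overlap/blending strategies that some teams may use to avoid seams are omitted.

```python
import torch
import torch.nn.functional as F

def tiled_inference(model, x: torch.Tensor, tile: int = 256) -> torch.Tensor:
    """Restores a (B, C, H, W) RAW tensor tile-by-tile with non-overlapping patches."""
    _, _, h, w = x.shape
    # Pad so H and W are multiples of the tile size.
    pad_h, pad_w = (-h) % tile, (-w) % tile
    x_pad = F.pad(x, (0, pad_w, 0, pad_h), mode='reflect')
    out = torch.zeros_like(x_pad)
    with torch.no_grad():
        for i in range(0, x_pad.shape[2], tile):
            for j in range(0, x_pad.shape[3], tile):
                out[:, :, i:i + tile, j:j + tile] = model(x_pad[:, :, i:i + tile, j:j + tile])
    return out[:, :, :h, :w]   # crop back to the original size

# Example with an identity "model" on a 4-channel RAW input
restored = tiled_inference(torch.nn.Identity(), torch.randn(1, 4, 600, 800))
```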

4.2.4.4. LMPR-Net: Lightweight Multi-Stage Progressive RAW Restoration (Team WIRTeam)

Method Description: Considering the potential of multi-stage feature interaction, LMPR-Net is a lightweight model for RAW image restoration based on MPRNet [74]. The multi-stage model decomposes the RAW image restoration task into multiple subtasks to address various degradation information (noise, blurring, unknown degradations).

  • Original Resolution Block (ORB): Composed of convolution and channel attention mechanism for cross-channel key feature extraction.
  • SAM (Stage-wise Attention Module): Efficiently refines incoming features at each stage.
  • Lightness: Simplified components, hidden dimension set to 8.
  • Depthwise Overparameterized Convolution [4]: Introduced to increase training speed and improve expressive power without significantly increasing computational cost.
  • Parameter Count: 0.19M parameters, 2.63 GFLOPs.
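
The depthwise over-parameterized convolution [4] mentioned above trains two kernels, a conventional one and a depthwise one, and folds them into a single standard convolution at inference, so the extra capacity costs nothing at deployment. The sketch below assumes the usual DO-Conv formulation (depth multiplier D_mul ≥ K·K); shapes are illustrative and this is not the team's code.

```python
import torch
import torch.nn.functional as F

# Hypothetical layer sizes: K x K kernel with depth multiplier D_mul >= K*K.
C_in, C_out, K, D_mul = 8, 16, 3, 9
W = torch.randn(C_out, C_in, D_mul)      # conventional kernel (trained)
D = torch.randn(C_in, D_mul, K * K)      # depthwise over-parameterizing kernel (trained)

# At inference the two kernels collapse into one ordinary K x K convolution:
# W_folded[o, c, s] = sum_d W[o, c, d] * D[c, d, s]
W_folded = torch.einsum('ocd,cds->ocs', W, D).reshape(C_out, C_in, K, K)

x = torch.randn(1, C_in, 32, 32)
y = F.conv2d(x, W_folded, padding=K // 2)   # behaves like a single standard conv
```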

Implementation Details:

  • Framework: PyTorch.

  • Optimizer: AdamW (initial learning rate $2 \times 10^{-4}$, reduced to $1 \times 10^{-6}$ with cosine annealing).

  • Hardware: 1 × NVIDIA GeForce RTX 4090 (24G).

  • Datasets: Provided datasets only. PSF blur and noise added for synthetic degradations.

  • Data Augmentation: Horizontal and vertical flips.

  • Training: End-to-end for 600 epochs.

  • Optimizer Parameters: $\beta_1 = 0.9$, $\beta_2 = 0.99$.

  • Loss Function: Charbonnier loss [5] to avoid overly smooth restored images (see the definition sketched after Figure 19 below).

  • Inference: Partitions input degraded images into 256×256 patches, processes them, then seamlessly merges them.

  • Training Time: ~10 hours.

    The following figure (Figure 19 from the original paper) shows Overall architecture of the LMPR-Net method proposed by the Team WIRTeam:

    Figure 19. Overall architecture of the LMPR-Net method proposed by the Team WIRTeam. The diagram shows the network processing a degraded RAW image through multiple stages, using convolutions, attention modules, and a U-Net structure to perform RAW restoration and super-resolution.
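
The Charbonnier loss [5] used by this team (and by Team Miers above) is a smooth, differentiable variant of the $L_1$ loss, $\mathcal{L} = \sqrt{(\hat{x} - x)^2 + \varepsilon^2}$ averaged over pixels, which penalizes errors more gently than $L_2$ and tends to avoid over-smoothed restorations. A minimal sketch, with $\varepsilon$ as an assumed small constant:

```python
import torch

def charbonnier_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-3) -> torch.Tensor:
    """Charbonnier loss: a smooth, differentiable approximation of the L1 loss."""
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()

# Example on random 4-channel RAW patches
loss = charbonnier_loss(torch.rand(1, 4, 64, 64), torch.rand(1, 4, 64, 64))
```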

4.2.4.5. ER-NAFNet Raw Restoration (Team ER-NAFNet)

Method Description: ER-NAFNet is a U-shaped framework for Efficient RAW image Restoration. Its architecture and compression mechanisms are built upon the NAFNet block proposed in NAFNet [6]. It enhances efficiency and performance. The network is trained directly from 4-channel RGGB RAW data with a complex blurring and noise degradation pipeline from AISP [20].

Dataset Degradation Method: Inspired by [53], [61], [77].

  • Bayer pattern RAW images are cropped (512×512) to speed up training and learn real degradation.
  • Sequence of degradation operations: multiple blur operations, hardware-specific noise injection, and random digital gain via luma or exposure compensation.
  • Noise Handling: Dark frames captured using multiple mobile sensors to model real dark-current noise.
  • Highlight Handling: Highlight-area pixels are removed to prevent a highlight color cast.

Model Framework (Figure 20):

  • Shallow Feature Extraction: A 3×3 convolutional filter extracts shallow feature encodings $F^s$ from the low-quality RAW image $I^{LQ} \in \mathbb{R}^{H \times W \times 4}$.
  • Deep Feature Extraction: A classic U-shaped architecture with skip connections performs deep feature extraction.
  • NAFNet Block: Plays a pivotal role in the U-shaped architecture, addressing attention mechanism limitations. Integrates 1×1 and 3×3 dilated convolutions to capture fine-grained details and large-scale patterns. Eliminates reliance on complex attention mechanisms for superior performance with reduced computational overhead.
  • SimpleGate and Simple Channel Attention (SCA) modules: Incorporated to focus on relevant features and suppress irrelevant ones, enhancing feature representations (a minimal sketch follows this list).
  • Raw Reconstruction: A 3×3 convolution reconstructs the high-quality (HQ) image.
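
The SimpleGate and Simplified Channel Attention (SCA) modules referenced above follow the formulation in NAFNet [6]: SimpleGate splits the feature map into two halves along the channel dimension and multiplies them, while SCA reweights channels using a global average pool followed by a 1×1 convolution. A minimal sketch:

```python
import torch
import torch.nn as nn

class SimpleGate(nn.Module):
    """Splits channels in half and multiplies the halves (replaces nonlinear activations)."""
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1, x2 = x.chunk(2, dim=1)
        return x1 * x2

class SimplifiedChannelAttention(nn.Module):
    """Reweights channels with a global average pool followed by a 1x1 convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x * self.proj(self.pool(x))

# Usage: the gate halves the channel count, SCA keeps it.
x = torch.randn(1, 32, 64, 64)
y = SimplifiedChannelAttention(16)(SimpleGate()(x))
```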

Implementation Details:

  • Dataset: NTIRE 2024 official challenge data [20], Development Phase submission set for validation.

  • Framework: PyTorch.

  • Model Parameters: Width 16, [2, 2, 4, 8] encoder blocks and [2, 2, 2, 2] decoder blocks at each stage, with the middle block number set to 6 (a hedged configuration sketch follows Figure 20 below).

  • Data Augmentation: Simple horizontal or vertical flips, channel shifts, and mixup augmentations for stability, convergence, and performance.

  • Loss Function: $L_2$ loss.

  • Optimization: AdamW [46] optimizer ($\beta_1 = 0.9$, $\beta_2 = 0.999$, weight decay 0.00001) with a cosine annealing strategy.

  • Learning Rate: Initial $3 \times 10^{-4}$ decayed to $1 \times 10^{-6}$ over 300,000 iterations.

  • Training Parameters: Batch size 12, patch size 512.

  • Hardware: A100 GPUs.

    The following figure (Figure 20 from the original paper) shows The overall network architecture proposed by Team ER-NAFNet:

    Figure 20. The overall network architecture proposed by Team ER-NAFNet. The diagram shows the data flow through multiple NAFNet blocks together with the corresponding layer normalization and convolution operations.
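
The hyperparameters listed under Model Parameters map naturally onto the public NAFNet implementation. The snippet below is only a hypothetical configuration mirroring those numbers; the keyword names follow the public NAFNet codebase's constructor and are not confirmed by the report.

```python
# Hypothetical configuration mirroring the reported hyperparameters.
ernafnet_cfg = dict(
    img_channel=4,               # packed RGGB input
    width=16,                    # reported channel width
    middle_blk_num=6,            # reported middle block count
    enc_blk_nums=[2, 2, 4, 8],   # encoder blocks per stage
    dec_blk_nums=[2, 2, 2, 2],   # decoder blocks per stage
)
# model = NAFNet(**ernafnet_cfg)   # assuming the reference NAFNet implementation is available
```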

5. Experimental Setup

5.1. Datasets

The challenge utilized datasets designed to represent real-world RAW image characteristics and degradations, focusing on both Super-Resolution (RAWSR) and Restoration (RAWIR).

5.1.1. RAWSR Challenge Dataset

  • Source: Based on BSRAW [18] and the NTIRE 2024 RAWSR Challenge [19]. Images are from the Adobe MIT5K dataset [3], which includes images captured by multiple Canon and Nikon DSLR cameras.
  • Characteristics:
    • Manually filtered for diversity and natural properties, removing extremely dark or overexposed images.
    • Only in-focus, sharp images with low ISO are considered.
  • Pre-processing:
    1. All RAW images are normalized based on their black level and bit-depth.
    2. Images are converted ("packed") into the well-known RGGB Bayer pattern (4-channels). This allows transformations and degradations to be applied without damaging original color pattern information (a minimal packing sketch is given after Figure 1 below).
  • Scale:
    • Training: 1064 clean high-resolution (HR) RAW images of resolution 1024×1024×4.
    • Low-Resolution (LR) Degradation Generation: LR degraded images are generated on-line during training using the degradation pipeline proposed in BSRAW [18]. This pipeline includes different noise profiles, multiple blur kernels (PSFs), and a simple downsampling strategy. Participants can expand this.
  • Testing:
    • Validation: 40 1024px images (used during model development).

    • Test 1MP: 200 images of 1024px resolution.

    • Full-resolution Test: The same 200 test images at full resolution (≈12MP).

    • Participants process corresponding LR RAW images (e.g., 512×512×4) without access to ground truth.

      The following figure (Figure 1 from the original paper) shows samples of the NTIRE 2025 RAW Image Super-Resolution Challenge testing set:

      Figure 1. Samples of the NTIRE 2025 RAW Image Super-Resolution Challenge testing set. The samples compare high-resolution (HR) ground truth with the low-resolution (LR) inputs: the upper part shows a flower-bed scene and the lower part an urban storefront, both highlighting the detail differences to be recovered during restoration.
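
The packing step described in the pre-processing list above converts a single-channel Bayer mosaic into a half-resolution 4-channel RGGB tensor after black-level and bit-depth normalization. A minimal sketch, assuming an RGGB CFA layout and illustrative black/white levels:

```python
import numpy as np

def normalize_and_pack_rggb(bayer: np.ndarray, black_level: float, white_level: float) -> np.ndarray:
    """Normalizes a Bayer mosaic (H, W) and packs it into an (H/2, W/2, 4) RGGB tensor."""
    norm = (bayer.astype(np.float32) - black_level) / (white_level - black_level)
    norm = np.clip(norm, 0.0, 1.0)
    r  = norm[0::2, 0::2]   # R  at even rows, even cols (assumed RGGB layout)
    g1 = norm[0::2, 1::2]   # G1 at even rows, odd cols
    g2 = norm[1::2, 0::2]   # G2 at odd rows, even cols
    b  = norm[1::2, 1::2]   # B  at odd rows, odd cols
    return np.stack([r, g1, g2, b], axis=-1)

# Example: a 14-bit sensor with black level 512 (values are illustrative)
mosaic = np.random.randint(512, 2 ** 14, size=(1024, 1024), dtype=np.uint16)
packed = normalize_and_pack_rggb(mosaic, black_level=512, white_level=2 ** 14 - 1)  # (512, 512, 4)
```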

5.1.2. RAWIR Challenge Dataset

  • Source: Uses images from diverse sensors, including:
    • Samsung Galaxy S9 and iPhone X (from RAW2RAW [1]).
    • Additionally, images collected from Google Pixel 10, Vivo X90, and Samsung S21 to enrich diversity.
  • Characteristics:
    • Covers various scenes (indoor and outdoor), lighting conditions (day and night), and subjects.
    • Manually filtered to ensure high quality, sharpness, and clear details (captured under low ISO (≤ 400), in-focus, proper exposure).
    • Original RAW files saved in DNG format, unprocessed by smartphones' ISPs.
  • Pre-processing:
    1. All images normalized based on camera black level and bit depth (e.g., 10, 12, 14 bits per pixel).
    2. Images converted to the RGGB Bayer pattern (4-channels).
    3. Images cropped into non-overlapping (packed) patches of dimension 512×512×4.
  • Scale:
    • Training: 2139 clean patches are provided for training models.
    • Degradation Generation: Participants use the baseline degradation pipeline [18] to simulate realistic degradations and may also develop their own pipelines, whose core components include different noise profiles and multiple blur kernels (PSFs).
  • Testing:
    • Synthetic test dataset generated by applying a degradation pipeline at three levels (a minimal sketch of this three-level pipeline is given after Figure 13 below):
      • Test Level 1: $y = x + n$ (only sampled noise from real noise profiles).

      • Test Level 2: $y = (x * k) + n$ (noise and/or blur, with 0.3 probability of blur and 0.5 of real noise).

      • Test Level 3: $y = (x * k) + n$ (all images have realistic blur and noise).

        The following figure (Figure 13 from the original paper) shows RAW image samples from RawIR Dataset:

        Figure 13. RAW image samples from the RawIR Dataset. The samples show scenes from different environments, including signage, plants, and streets, illustrating the diversity covered by the restoration and super-resolution challenge.
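
A minimal sketch of the three test levels described above: Level 1 adds noise only, Level 2 applies blur with probability 0.3 and real-profile noise with probability 0.5, and Level 3 always applies both. The PSF and noise samplers here are placeholders; the actual kernels and noise profiles come from the challenge pipeline [18].

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def add_noise(x):
    """Placeholder for sampling from the challenge's real noise profiles."""
    return np.clip(x + rng.normal(scale=rng.uniform(0.005, 0.05), size=x.shape), 0, 1)

def apply_blur(x):
    """Placeholder PSF: a Gaussian blur applied per channel of an (H, W, C) array."""
    return gaussian_filter(x, sigma=(rng.uniform(0.5, 2.0),) * 2 + (0,))

def degrade(x, level: int):
    if level == 1:                       # y = x + n
        return add_noise(x)
    if level == 2:                       # y = (x * k) + n, blur w.p. 0.3, noise w.p. 0.5
        y = apply_blur(x) if rng.random() < 0.3 else x
        return add_noise(y) if rng.random() < 0.5 else y
    return add_noise(apply_blur(x))      # level 3: always blur and noise

clean = rng.uniform(0, 1, size=(512, 512, 4)).astype(np.float32)
test_levels = [degrade(clean, lvl) for lvl in (1, 2, 3)]
```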

5.2. Evaluation Metrics

The primary quantitative evaluation metrics used in the challenge are Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index Measure (SSIM). All fidelity metrics are calculated in the RAW domain.

  • Peak Signal-to-Noise Ratio (PSNR):

    • Conceptual Definition: PSNR is a common objective metric used to quantify the quality of image reconstruction. It compares the maximum possible power of a signal to the power of distorting noise. It is expressed in decibels (dB), with higher values indicating better quality (less distortion or noise). PSNR is widely used in image compression and restoration to assess the fidelity of a processed image to its original, ground-truth version.
    • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
    • Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For RAW images normalized to [0, 1], this value is 1.
      • $\mathrm{MSE}$: Mean Squared Error, which is the average of the squared intensity differences between the reference image and the processed image. $ \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ Where:
      • $I$: The ground-truth (original) image.
      • $K$: The restored or super-resolved image.
      • $m$, $n$: The dimensions (height and width) of the image.
      • $I(i,j)$ and $K(i,j)$: The pixel values at coordinates $(i,j)$ in the ground-truth and processed images, respectively.
  • Structural Similarity Index Measure (SSIM):

    • Conceptual Definition: SSIM is a perceptual metric that measures the similarity between two images. Unlike PSNR which measures absolute error, SSIM attempts to quantify the perceived quality based on three independent comparisons: luminance, contrast, and structure. A value of 1 indicates perfect structural similarity, while values closer to 0 indicate less similarity. It often correlates better with human visual perception than PSNR.
    • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • $x$: The reference (ground-truth) image patch.

      • $y$: The comparison (processed) image patch.

      • $\mu_x$: The mean (average) of image patch $x$.

      • $\mu_y$: The mean (average) of image patch $y$.

      • $\sigma_x^2$: The variance of image patch $x$.

      • $\sigma_y^2$: The variance of image patch $y$.

      • $\sigma_{xy}$: The covariance of image patches $x$ and $y$.

      • $c_1 = (K_1 L)^2$: A small constant to prevent division by zero in the luminance term. $K_1$ is a small scalar (e.g., 0.01), and $L$ is the dynamic range of pixel values (e.g., 1 for normalized RAW images).

      • $c_2 = (K_2 L)^2$: A small constant to prevent division by zero in the contrast term. $K_2$ is a small scalar (e.g., 0.03).

        In addition to these fidelity metrics, the challenge also reports implementation details such as the number of parameters (in millions, Par.), MACs (Multiply-Accumulate Operations, in Giga) or FLOPs (Floating Point Operations, in Giga), and inference time (in ms), especially for the Efficient tracks.
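
For reference, both fidelity metrics can be computed directly on normalized 4-channel RAW tensors with scikit-image (recent versions); this mirrors the definitions above with a data range of 1.0 for images normalized to [0, 1], though the challenge's exact evaluation script may differ.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

gt   = np.random.rand(512, 512, 4).astype(np.float32)   # ground-truth RAW (normalized)
pred = np.clip(gt + np.random.normal(scale=0.01, size=gt.shape), 0, 1).astype(np.float32)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB, SSIM: {ssim:.4f}")
```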

5.3. Baselines

For both challenge tracks, several existing and top-performing methods from previous years are used as references or baselines to compare against the new solutions.

5.3.1. RAWSR Challenge Baselines

  • RBSFormer [33]: (Enhanced Transformer Network for Raw Image Super-Resolution) A top-performing method from the NTIRE 2024 RAWSR Challenge, with 3.3 million parameters, achieving PSNR of 43.649 and SSIM of 0.987. It serves as a strong Transformer-based baseline.
  • BSRAW [18]: (Improving Blind Raw Image Super-Resolution) Another strong baseline from NTIRE 2024, with 1.5 million parameters, achieving PSNR of 42.853 and SSIM of 0.986. This method also provides the degradation pipeline used in the 2025 challenge.
  • Bicubic [19]: A standard interpolation method used as a basic reference point, representing a non-deep learning approach. It achieves PSNR of 36.038 and SSIM of 0.952, highlighting the significant improvements offered by deep learning models.

5.3.2. RAWIR Challenge Baselines

  • PMRID [65]: (Practical Deep Raw Image Denoising on Mobile Devices) A main baseline model, known for efficient image denoising on mobile devices. It achieves PSNR values of 42.41, 38.43, and 35.97 for Test Levels 1, 2, and 3 respectively, with 1.032 million parameters and 1.21 GMACs.

  • MOFA [9]: (A Model Simplification Roadmap for Image Restoration on Mobile Devices) Another main baseline model, focusing on model simplification for image restoration on mobile devices. It achieves PSNR values of 42.54, 38.71, and 36.33 for Test Levels 1, 2, and 3 respectively, with 0.971 million parameters and 1.14 GMACs.

  • NAFNet [7]: (Simple Baselines for Image Restoration) Included as a representative U-Net-like model and a popular efficient image denoising method. It achieves PSNR values of 43.50, 39.70, and 37.49 for Test Levels 1, 2, and 3 respectively, with 1.130 million parameters and 3.99 GMACs.

  • RawIR [20]: (Toward Efficient Deep Blind Raw Image Restoration) Another baseline, achieving PSNR values of 44.20, 40.30, and 38.30 for Test Levels 1, 2, and 3 respectively, with 1.5 million parameters and 12.3 GMACs.

    These baselines provide a context for evaluating the performance and efficiency of the new methods submitted to the NTIRE 2025 challenge, demonstrating how the top-performing solutions improve upon previous state-of-the-art.

6. Results & Analysis

6.1. Core Results Analysis

The NTIRE 2025 Challenge presented two tracks: RAW Image Super-Resolution (RAWSR) and RAW Image Restoration (RAWIR). Both tracks included an Efficient category with tight parameter constraints and a General category without such limitations. The results demonstrate significant advancements in RAW image processing, with various deep learning architectures achieving high fidelity and, in some cases, remarkable efficiency.

6.1.1. RAWSR Challenge Results

The results for the RAWSR Challenge are presented in Table 1, showing PSNR/SSIM on the complete testing set (200 images), parameters, and a comparison with NTIRE 2024 baselines.

The following are the results from Table 1 of the original paper:

| Method | Track | PSNR | SSIM | # Par. (M) |
| --- | --- | --- | --- | --- |
| SMFFRaw-S 4.3 | Efficient | 42.12 | 0.9433 | 0.18 |
| RawRTSR 4.1 | Efficient | 41.74 | 0.9417 | 0.19 |
| NAFBN 4.5 | Efficient | 40.67 | 0.9347 | 0.19 |
| MambaIRv2 4.6 | Efficient | 40.32 | 0.9396 | 0.19 |
| RepRAW-SR-Tiny 4.7 | Efficient | 40.01 | 0.9297 | 0.02 |
| RepRAW-SR-Large 4.7 | Efficient | 40.56 | 0.9339 | 0.09 |
| ECAN 4.8 | Efficient | 39.13 | 0.9057 | 0.09 |
| USTC 4.2 | General | 42.70 | 0.9479 | 1.94 |
| SMFFRaw-S 4.3 | General | 42.60 | 0.9467 | 1.99 |
| RawRTSR-L 4.1 | General | 42.58 | 0.9475 | 0.26 |
| ERBSFormer 4.4 | General | 42.45 | 0.9448 | 3.30 |
| ER-NAFNet 5.5 | General | 41.17 | 0.9348 | – |
| RBSFormer [33] | 2024 | 43.649 | 0.987 | 3.3 |
| BSRAW [18] | 2024 | 42.853 | 0.986 | 1.5 |
| Bicubic [19] | 2024 | 36.038 | 0.952 | – |

Analysis:

  • Efficient Track:
    • SMFFRaw-S (Team XJTU) achieved the highest PSNR (42.12 dB) and SSIM (0.9433) with 0.18M parameters, demonstrating excellent performance under strict efficiency constraints.
    • RawRTSR (Team Samsung AI) also performed very well, close behind SMFFRaw-S, with 41.74 dB PSNR and 0.19M parameters.
    • All efficient methods are well below the 0.2M parameter limit, showcasing the success in developing lightweight RAWSR models.
  • General Track:
    • USTC (Team USTC-VIDAR) led this track with 42.70 dB PSNR and 0.9479 SSIM using 1.94M parameters. This indicates that their streamlined Transformer network (RBSFormer variant) is highly effective.
    • SMFFRaw-S and RawRTSR-L also showed strong performance in the general track, maintaining their competitive edge even with slightly higher parameter counts. Notably, RawRTSR-L achieved 42.58 dB PSNR with a relatively low 0.26M parameters, suggesting strong efficiency even in the general track.
  • Comparison to NTIRE 2024 Baselines:
    • The 2024 baselines (RBSFormer [33] and BSRAW [18]) achieved higher PSNR (43.649 dB and 42.853 dB respectively) and SSIM (0.987 and 0.986). This suggests that while the 2025 challenge solutions are highly competitive, especially in efficiency, the absolute fidelity of the top 2024 methods still sets a high bar. It's important to note that evaluation datasets or exact degradation models might vary slightly between years, influencing direct comparison.

    • All deep learning methods drastically outperform the Bicubic baseline (36.038 dB PSNR), highlighting the substantial gains from neural network approaches.

      The report concludes that the methods greatly improve RAW image quality and resolution, even for 12MP outputs, without detectable color artifacts. It notes that (synthetic) RAW image super-resolution can be solved similarly to RAW denoising, but more realistic downsampling remains an open challenge.

6.1.2. RAWIR Challenge Results

The results for the RAWIR Challenge are presented in Table 2, showing SSIM/PSNR results for three degradation levels, along with parameters and MACs.

The following are the results from Table 2 of the original paper:

| Method | Type | Test Level 1 (SSIM / PSNR) | Test Level 2 (SSIM / PSNR) | Test Level 3 (SSIM / PSNR) | # Params. (M) | # MACs (G) |
| --- | --- | --- | --- | --- | --- | --- |
| Test Images | – | 0.953 / 39.56 | 0.931 / 35.30 | 0.907 / 33.03 | – | – |
| PMRID [65] | Baseline | 0.982 / 42.41 | 0.965 / 38.43 | 0.951 / 35.97 | 1.032 | 1.21 |
| NAFNet [7] | Baseline | 0.983 / 43.50 | 0.972 / 39.70 | 0.962 / 37.49 | 1.130 | 3.99 |
| MOFA [9] | Baseline | 0.982 / 42.54 | 0.966 / 38.71 | 0.974 / 36.33 | 0.971 | 1.14 |
| RawIR [20] | Baseline | 0.984 / 44.20 | 0.978 / 40.30 | 0.974 / 38.30 | 1.5 | 12.3 |
| Samsung AI 5.1 | Efficient | 0.991 / 45.10 | 0.980 / 40.82 | 0.971 / 38.46 | 0.19 | 10.98 |
| LMPR-Net 5.4 | Efficient | 0.989 / 42.57 | 0.973 / 39.17 | 0.961 / 37.26 | 0.19 | 2.63 |
| Samsung AI 5.1 | General | 0.993 / 46.04 | 0.985 / 42.25 | 0.978 / 40.10 | 4.97 | 23.79 |
| Miers 5.2 | General | 0.993 / 45.72 | 0.983 / 41.73 | 0.974 / 39.50 | 4.76 | N/A |
| Multi-PromptIR 5.3 | General | 0.986 / 44.80 | 0.978 / 41.38 | 0.968 / 38.96 | 39.92 | 158.24 |
| ER-NAFNet 5.5 | General | 0.992 / 45.10 | 0.972 / 39.32 | 0.953 / 36.13 | 4.57 | N/A |

Analysis:

  • Test Levels:

    • Level 1 (Noise only): Easiest degradation. Most solutions perform well, with top entries achieving PSNR over 45 dB. Samsung AI (General) leads with 46.04 dB PSNR.
    • Level 2 (Noise and/or Blur): Intermediate degradation. Samsung AI (General) again leads with 42.25 dB PSNR. Miers follows closely with 41.73 dB PSNR.
    • Level 3 (Realistic Blur and Noise): Most challenging degradation. Samsung AI (General) achieves 40.10 dB PSNR. This level shows the greatest performance gap between methods, as combining denoising and deblurring is complex.
  • Efficient Track:

    • Samsung AI (Efficient) wins this track convincingly, with 45.10 dB, 40.82 dB, and 38.46 dB PSNR across the three levels, all with only 0.19M parameters. This highlights their successful distillation strategy and NAFNet-based architectural modifications (ERIRNet-T).
    • LMPR-Net (Team WIRTeam) is the second best in the efficient track, with 0.19M parameters and 2.63 GMACs, showing good PSNR values (42.57, 39.17, 37.26).
  • General Track:

    • Samsung AI (General) is the overall winner for RAWIR, achieving the best PSNR and SSIM scores across all degradation levels. Their ERIRNet-S model with 4.97M parameters showcases state-of-the-art performance.
    • Miers (Team Xiaomi Inc.) with their Modified SwinFIR-Tiny follows closely, especially in the more challenging levels (41.73 dB for Level 2, 39.50 dB for Level 3).
    • Multi-PromptIR (Team WIRTeam) has a significantly higher parameter count (39.92M) and MACs (158.24 G) but achieves slightly lower PSNR than Samsung AI and Miers. This indicates that larger models do not always guarantee the best performance, or that their architectural choices might be less optimized for this specific challenge's degradations despite the increased complexity.
  • Comparison to Baselines: All top challenge solutions notably improve upon the provided baselines (PMRID, NAFNet, MOFA, RawIR), demonstrating progress in RAW image restoration. For instance, Samsung AI (General) achieves 46.04 dB PSNR at Level 1, significantly higher than RawIR's 44.20 dB.

    Qualitative results (Figure 21) show that most solutions perform excellently for Level 1 (noise removal). For Levels 2 and 3 (noise + blur), Samsung AI and Miers show the best results, while others like WIRTeam and ChickentRun (implied ER-NAFNet) show less effective blur removal. The report notes that models struggle to tackle denoising and deblurring simultaneously, especially blur, as these might require opposite operations.

6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

| Method | Track | PSNR | SSIM | # Par. (M) |
| --- | --- | --- | --- | --- |
| SMFFRaw-S 4.3 | Efficient | 42.12 | 0.9433 | 0.18 |
| RawRTSR 4.1 | Efficient | 41.74 | 0.9417 | 0.19 |
| NAFBN 4.5 | Efficient | 40.67 | 0.9347 | 0.19 |
| MambaIRv2 4.6 | Efficient | 40.32 | 0.9396 | 0.19 |
| RepRAW-SR-Tiny 4.7 | Efficient | 40.01 | 0.9297 | 0.02 |
| RepRAW-SR-Large 4.7 | Efficient | 40.56 | 0.9339 | 0.09 |
| ECAN 4.8 | Efficient | 39.13 | 0.9057 | 0.09 |
| USTC 4.2 | General | 42.70 | 0.9479 | 1.94 |
| SMFFRaw-S 4.3 | General | 42.60 | 0.9467 | 1.99 |
| RawRTSR-L 4.1 | General | 42.58 | 0.9475 | 0.26 |
| ERBSFormer 4.4 | General | 42.45 | 0.9448 | 3.30 |
| ER-NAFNet 5.5 | General | 41.17 | 0.9348 | – |
| RBSFormer [33] | 2024 | 43.649 | 0.987 | 3.3 |
| BSRAW [18] | 2024 | 42.853 | 0.986 | 1.5 |
| Bicubic [19] | 2024 | 36.038 | 0.952 | – |

The following are the results from Table 2 of the original paper:

| Method | Type | Test Level 1 (SSIM / PSNR) | Test Level 2 (SSIM / PSNR) | Test Level 3 (SSIM / PSNR) | # Params. (M) | # MACs (G) |
| --- | --- | --- | --- | --- | --- | --- |
| Test Images | – | 0.953 / 39.56 | 0.931 / 35.30 | 0.907 / 33.03 | – | – |
| PMRID [65] | Baseline | 0.982 / 42.41 | 0.965 / 38.43 | 0.951 / 35.97 | 1.032 | 1.21 |
| NAFNet [7] | Baseline | 0.983 / 43.50 | 0.972 / 39.70 | 0.962 / 37.49 | 1.130 | 3.99 |
| MOFA [9] | Baseline | 0.982 / 42.54 | 0.966 / 38.71 | 0.974 / 36.33 | 0.971 | 1.14 |
| RawIR [20] | Baseline | 0.984 / 44.20 | 0.978 / 40.30 | 0.974 / 38.30 | 1.5 | 12.3 |
| Samsung AI 5.1 | Efficient | 0.991 / 45.10 | 0.980 / 40.82 | 0.971 / 38.46 | 0.19 | 10.98 |
| LMPR-Net 5.4 | Efficient | 0.989 / 42.57 | 0.973 / 39.17 | 0.961 / 37.26 | 0.19 | 2.63 |
| Samsung AI 5.1 | General | 0.993 / 46.04 | 0.985 / 42.25 | 0.978 / 40.10 | 4.97 | 23.79 |
| Miers 5.2 | General | 0.993 / 45.72 | 0.983 / 41.73 | 0.974 / 39.50 | 4.76 | N/A |
| Multi-PromptIR 5.3 | General | 0.986 / 44.80 | 0.978 / 41.38 | 0.968 / 38.96 | 39.92 | 158.24 |
| ER-NAFNet 5.5 | General | 0.992 / 45.10 | 0.972 / 39.32 | 0.953 / 36.13 | 4.57 | N/A |

6.3. Ablation Studies / Parameter Analysis

Some teams provided details on their architectural choices and efficiency optimizations, which can be seen as implicit ablation or parameter analyses.

  • Team EiffLowCVer (RepRawSR): They explicitly conducted an ablation study (Table 7 in the paper) to evaluate their model variants (RepTiny-21k, RepLarge-97k) against a NAFnet-1.9M baseline.

    • NAFnet-1.9M (baseline): 9.68G FLOPs, 40.21 dB PSNR (self-val), 41.70 dB PSNR (val).
    • RepLarge-97k: 24.42G FLOPs, 39.22 dB PSNR (self-val), 40.80 dB PSNR (val).
    • RepTiny-21k: 5.65G FLOPs, 39.00 dB PSNR (self-val).

    This study shows the trade-off between FLOPs/parameters and PSNR. While RepLarge-97k and RepTiny-21k are significantly more parameter-efficient, they achieve slightly lower PSNR than the NAFnet baseline, particularly on the validation set. RepTiny-21k offers extremely low FLOPs and parameters but with a further PSNR drop. Their analysis also noted that increasing the number of feature extraction modules with skip connections for RepTiny-21k was key to balancing performance and speed.
  • Team Samsung AI (RawRTSR & ERIRNet): Their work implicitly details a parameter analysis by offering Efficient (e.g., RawRTSR, ERIRNet-T) and General (e.g., RawRTSR-L, ERIRNet-S) variants.

    • For RawRTSR, increasing feature channels from 48 to 64 and adding channel attention (in RawRTSR-L) led to higher performance (from 41.74 dB PSNR to 42.58 dB PSNR) with a small increase in parameters (0.19M to 0.26M).
    • For ERIRNet, the Efficient version (ERIRNet-T, 0.19M params) still achieved very strong performance (e.g., 45.10 dB PSNR at Level 1), while the General version (ERIRNet-S, 4.97M params) pushed the PSNR even higher (46.04 dB at Level 1). This clearly demonstrates how architectural choices (channel width, block count, ConvTranspose vs. PixelUnshuffle) directly impact the performance-efficiency trade-off.
    • Their knowledge distillation strategy, using X-Restormer as a teacher, is a key factor in how the smaller student models (like ERIRNet-T) achieve high performance despite their limited parameters (a generic sketch of such a distillation loss is given at the end of this subsection).
  • Team Miers (Modified SwinFIR-Tiny): Their multi-stage training process, gradually adding modules like Feature Fusion, Channel Attention, ConvRep5, and CAB via zero convolution, effectively acts as an ablation study for their specific architectural enhancements. Adjusting noise intensity and patch sizes at different stages also highlights parameter tuning strategies.

    These details, while not always presented as formal ablation tables, provide crucial insights into how different teams designed, optimized, and fine-tuned their models for specific performance and efficiency targets within the challenge.
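
As a rough illustration of the knowledge-distillation strategy credited above, an output-level distillation loss combines the usual fidelity term against the ground truth with a term pulling the student's output toward the frozen teacher's output. The weighting and the exact formulation used by Samsung AI are not specified in this report; the sketch below is a generic example.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_out, teacher_out, gt, alpha: float = 0.5):
    """Generic output-level distillation: fidelity to GT plus agreement with the teacher."""
    fidelity = F.l1_loss(student_out, gt)
    distill = F.l1_loss(student_out, teacher_out.detach())   # teacher is frozen
    return fidelity + alpha * distill

# Usage with placeholder tensors
gt = torch.rand(1, 4, 64, 64)
loss = distillation_loss(torch.rand(1, 4, 64, 64, requires_grad=True), torch.rand(1, 4, 64, 64), gt)
```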

7. Conclusion & Reflections

7.1. Conclusion Summary

The NTIRE 2025 Challenge on RAW Image Restoration and Super-Resolution successfully pushed the boundaries of RAW image processing in two critical areas: restoration (denoising and deblurring) and super-resolution. The challenge highlighted the growing importance of processing RAW data directly within modern Image Signal Processing (ISP) pipelines, mitigating the irreversible information loss associated with sRGB processing.

Key contributions include the establishment of a robust challenge framework, comprehensive datasets based on real-world RAW images and sophisticated degradation pipelines, and a thorough benchmarking of diverse state-of-the-art solutions. A total of 45 teams submitted results, showcasing innovative approaches primarily leveraging Convolutional Neural Networks (CNNs) and Transformers.

The challenge demonstrated that deep learning methods can significantly improve RAW image quality and resolution, outperforming traditional baselines by a large margin. Notably, Team Samsung AI Camera emerged as a strong performer, winning both the general and efficient tracks in the RAWIR challenge and demonstrating leading performance in RAWSR. Their success was often attributed to efficient NAFNet-based architectures and effective knowledge distillation strategies. The efficient tracks, with their strict parameter budgets (e.g., 200K parameters), successfully spurred the development of lightweight yet high-performing models suitable for mobile devices.

7.2. Limitations & Future Work

The paper and challenge results implicitly and explicitly point to several limitations and suggest future research directions:

  • Realistic Downsampling: The report explicitly states that more realistic downsampling remains an open challenge for RAWSR. While synthetic degradation pipelines are used, creating low-resolution RAW images that perfectly mimic real-world acquisition physics for super-resolution is complex and crucial for real-world applicability.
  • Joint Denoising and Deblurring: For RAWIR, the paper observes that models struggle to tackle denoising and deblurring simultaneously, especially blur, as these might require opposite operations. This indicates a need for more robust and specialized architectures or training strategies that can disentangle and address these coupled degradations more effectively.
  • Generalization to Diverse Real-World Conditions: While the datasets include diverse camera sensors and scenes, real-world conditions present an even wider array of unknown noise profiles, blur types, and lighting scenarios. Further research is needed to ensure models generalize robustly to entirely unseen degradation patterns.
  • Computational Efficiency vs. Absolute Performance: Although efficient models made significant strides, a gap in absolute performance often remains when compared to larger, less constrained models (e.g., comparing 2025 efficient RAWSR methods to 2024 general track baselines). Future work could focus on techniques that close this gap further without sacrificing efficiency.
  • Specific ISP Integration: While operating on RAW is beneficial, the ultimate goal is to integrate these improved RAW outputs back into a full ISP pipeline. Research on how these restored RAW images interact with subsequent ISP stages (like demosaicing, tone mapping) and how to optimize the entire pipeline (or replace parts of it) could be a valuable direction.

7.3. Personal Insights & Critique

This paper and the NTIRE 2025 Challenge represent a vital step forward in computational photography. My personal insights and critiques are:

  • Value of RAW Domain Processing: The challenge strongly reinforces the critical advantage of processing RAW images over sRGB. By operating on linear sensor data, models have access to richer information, leading to better restoration outcomes. This paradigm shift is essential for pushing image quality limits, especially in mobile devices.

  • Importance of Efficiency: The efficient track is a brilliant addition. In real-world scenarios, particularly for smartphones, computational budgets are extremely tight. Forcing participants to innovate within these constraints drives practical solutions and fosters research into lightweight architectures, knowledge distillation, and reparameterization, which are directly transferable to commercial products.

  • Synthetic Degradation Gap: While the degradation pipelines are sophisticated, a fundamental challenge with synthetic degradations is the "reality gap." Models trained solely on synthetic data may not perform optimally on real-world degraded images due to unmodeled complexities. The encouragement for participants to develop their own degradation pipelines is a good step, but fully learning from real-world degraded-clean RAW pairs (which are hard to acquire) remains the ultimate goal. The observation that more realistic downsampling is an open challenge specifically highlights this issue for SR.

  • Complexity of Joint Restoration: The difficulty in effectively tackling both denoising and deblurring simultaneously is a recurring theme in image restoration. These two tasks can sometimes have conflicting objectives (e.g., aggressive denoising might smooth out fine details, which are then hard to deblur). This suggests that multi-task learning for these specific degradations might benefit from more sophisticated disentanglement strategies or adaptive modules that can prioritize based on estimated degradation levels.

  • Impact on ISP Pipelines: The methods and findings from this challenge have direct implications for the future of smartphone ISP pipelines. Instead of fixed, hand-tuned ISP algorithms, we might see more dynamic, AI-driven modules operating on RAW data to deliver real-time, high-quality image enhancement. The "replace mobile camera ISP with a single deep learning model" vision mentioned in the introduction is becoming increasingly tangible.

  • Future Transferability: Many of the architectural principles and optimization techniques (e.g., NafBlocks, Transformer block variants, reparameterization, knowledge distillation) developed for this RAW domain challenge are highly transferable. They could be adapted for other low-level vision tasks or even RAW processing in other imaging modalities (e.g., medical imaging, scientific cameras).

    In conclusion, the NTIRE 2025 Challenge successfully catalyzes innovation in RAW image processing, providing a robust benchmark and highlighting crucial areas for future research at the intersection of image quality, computational efficiency, and real-world applicability.
