
DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images

Published: 03/24/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DiffRAW is a novel method that uses diffusion models to transform smartphone RAW images into sRGB images with DSLR-comparable perceptual quality, enhancing detail while maintaining structural integrity and color alignment, and achieving state-of-the-art performance across perceptual evaluation metrics.

Abstract

Deriving DSLR-quality sRGB images from smartphone RAW images has become a compelling challenge due to discernible detail disparity, color mapping instability, and spatial misalignment in RAW-sRGB data pairs. We present DiffRAW, a novel method that incorporates the diffusion model for the first time in learning…

In-depth Reading

1. Bibliographic Information

1.1. Title

DiffRAW: Leveraging Diffusion Model to Generate DSLR-Comparable Perceptual Quality sRGB from Smartphone RAW Images

1.2. Authors

  • Mingxin Yi (Tsinghua Shenzhen International Graduate School, Tsinghua University, China)
  • Kai Zhang (Tsinghua Shenzhen International Graduate School, Tsinghua University, China; Research Institute of Tsinghua, Pearl River Delta)
  • Pei Liu (Media Technology Lab, Huawei, China)
  • Tanli Zuo (Media Technology Lab, Huawei, China)
  • Jingduo Tian (Media Technology Lab, Huawei, China)

1.3. Journal/Conference

The paper was published at the Thirty-Eighth AAAI Conference on Artificial Intelligence (AAAI-24), as indicated in the context of Table 1. AAAI is a highly reputable and influential conference in the field of artificial intelligence, covering a broad range of AI topics including computer vision, machine learning, and natural language processing. Its proceedings are a significant venue for disseminating cutting-edge research.

1.4. Publication Year

2024 (published 2024-03-24)

1.5. Abstract

The paper addresses the significant challenge of converting smartphone RAW images into sRGB images with a perceptual quality comparable to those captured by professional DSLR cameras. This task is complicated by inherent issues such as detail disparity, unstable color mapping, and spatial misalignment between RAW-sRGB data pairs. The authors introduce DiffRAW, a novel method that integrates a diffusion model to learn RAW-to-sRGB mappings for the first time. DiffRAW leverages the diffusion model to learn high-quality detail distributions from DSLR images, enhancing output image details, while using the RAW image as a diffusion condition to preserve structural information like contours and textures. To counteract color and spatial misalignment in training data, DiffRAW incorporates a color-position preserving condition. Furthermore, it introduces an efficient Domain Transform Diffusion Method (DTDM) to accelerate the inference process of diffusion models and improve generated image quality. Experimental evaluations on the ZRR dataset demonstrate that DiffRAW achieves state-of-the-art performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ) and comparable results in PSNR and SSIM.

Original PDF: /files/papers/692655157b21625c663f25cf/paper.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is bridging the quality gap between images captured by smartphones and those by professional Digital Single-Lens Reflex (DSLR) cameras. While smartphones have become ubiquitous for photography due to their portability, their inherent hardware limitations (e.g., smaller apertures, sensors) result in images with less detail and overall lower quality compared to DSLRs.

Traditionally, converting RAW sensor images (raw unprocessed data directly from the camera sensor) to sRGB images (the standard color space for most displays and web content) involves a complex Image Signal Processing (ISP) pipeline. This pipeline includes various low-level vision operations like demosaicking (converting mosaic pattern from sensor to full color), white balance (adjusting colors to appear natural), color correction (mapping sensor colors to a standard color space), denoising (removing image noise), and gamma correction (adjusting brightness).

Prior research has explored end-to-end ISP algorithms trained on smartphone RAW to DSLR sRGB data pairs. However, these efforts face three critical challenges:

  1. Detail Disparity: Smartphone RAW images inherently lack the fine details present in DSLR sRGB counterparts due to hardware limitations, making the task of reconstructing DSLR sRGB imagery an ill-posed problem (meaning there isn't enough information to uniquely determine a perfect solution).

  2. Spatial Misalignment: Collecting smartphone RAW images and DSLR sRGB images from different devices inevitably leads to non-precise alignment in the data pairs. This means pixels might not correspond perfectly between the input and target images.

  3. Unstable Color Mapping: Data pairs collected under varying environmental conditions and camera parameters exhibit not only color disparities but also an unstable color mapping relationship, making it difficult for models to learn a consistent color transformation.

    The paper's innovative idea or entry point is to leverage the powerful generative capabilities of diffusion models to address these challenges, particularly the detail disparity and the complexities of learning RAW-to-sRGB mappings.

2.2. Main Contributions / Findings

The primary contributions of the DiffRAW paper are:

  1. First-time Integration of Diffusion Models for RAW-to-sRGB Mapping: DiffRAW is the first method to incorporate diffusion models for learning RAW-to-sRGB mappings, achieving state-of-the-art results in perceptual quality metrics.

  2. Effective Detail Enhancement via Diffusion Models: The approach successfully leverages diffusion models to learn high-quality detail distributions from DSLR images, thereby enriching the details of the generated output. It uses the RAW image as a diffusion condition to preserve structural information (like contours and textures) without relying on it for fine details.

  3. Novel Color-Position Preserving Condition: The paper introduces a specially designed color-position preserving condition ($c$). This condition helps mitigate training interference caused by color and spatial misalignment in data pairs, ensuring that the generated images avoid color biases and pixel shifts. It also offers a color-pluggable feature, allowing flexible adjustment of the output image's color style by injecting different color representations.

  4. Efficient Domain Transform Diffusion Method (DTDM): DiffRAW proposes a novel and efficient Domain Transform Diffusion Method (DTDM), including its forward and reverse processes. DTDM significantly reduces the inference steps required by diffusion models for image restoration/enhancement tasks while simultaneously enhancing the quality of the generated images. This method is presented as a universal acceleration approach transferable to other diffusion-based algorithms.

  5. DSLR-Comparable Perceptual Quality: Through comprehensive evaluations on the ZRR dataset, DiffRAW consistently demonstrates superior performance across all perceptual quality metrics (e.g., LPIPS, FID, MUSIQ, CLIPIQA+), while maintaining comparable PSNR and SSIM results. Notably, it achieves DSLR-comparable quality on no-reference Image Quality Assessment (IQA) metrics for the first time.

    The key findings demonstrate that diffusion models, when appropriately conditioned and optimized for inference, can effectively overcome the inherent limitations of smartphone photography to produce images with perceptual quality rivaling professional cameras.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand DiffRAW, a grasp of several fundamental concepts is essential, particularly regarding image types and the core principles of diffusion models.

3.1.1. RAW Images and sRGB Images

  • RAW Images (Smartphone RAW): RAW images are unprocessed data captured directly from a camera's image sensor. They contain the maximum amount of image information (e.g., full dynamic range, color depth) before any in-camera processing.

    • Nature: They are often referred to as "digital negatives" because they are not directly viewable without processing. They typically have a Bayer pattern (a mosaic of color filters) and are higher bit-depth (e.g., 10-bit, 12-bit, or 14-bit per color channel) than standard images.
    • Purpose: They provide maximum flexibility for post-processing, allowing photographers to make significant adjustments to exposure, white balance, color, and detail without losing quality.
    • Smartphone Constraints: Despite capturing RAW data, smartphone cameras have smaller sensors and lenses compared to DSLR cameras. This often leads to more noise, less light gathering capability, and inherent loss of fine detail, even in their RAW output, compared to a professional DSLR.
  • sRGB Images: sRGB (standard Red Green Blue) is a standard color space created in 1996 by HP and Microsoft.

    • Nature: It defines a specific range of colors that can be displayed by most monitors, printers, and web browsers. sRGB images are typically 8-bit per color channel (24-bit total for RGB), processed, compressed, and ready for display or sharing.
    • Purpose: It ensures consistency in color representation across different devices and platforms.
    • Image Signal Processing (ISP) Pipeline: The conversion from RAW to sRGB involves an Image Signal Processing (ISP) pipeline. This is a sequence of algorithmic steps that transforms the raw sensor data into a viewable image. Key steps often include:
      • Demosaicking: Reconstructing full-color images from the Bayer pattern data.
      • White Balance: Correcting color casts so that white objects appear white under various lighting conditions.
      • Color Correction/Grading: Mapping the camera's native color space to a standard like sRGB, and applying artistic color adjustments.
      • Denoising: Reducing noise artifacts introduced during image capture.
      • Tone Mapping/Gamma Correction: Adjusting brightness and contrast to fit the dynamic range of the display and ensure proper visual perception.
      • Sharpening: Enhancing edge details.
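
As a rough illustration of the stages above, a toy software ISP might look like the sketch below. This is a minimal sketch under simplifying assumptions: the fixed white-balance gains, the identity color matrix, and the Gaussian-blur "denoiser" are placeholders for illustration, not anything described in the paper.

```python
import numpy as np
import cv2

def toy_isp(bayer_raw_10bit: np.ndarray) -> np.ndarray:
    """Toy ISP sketch: single-channel 10-bit Bayer RAW (assumed RGGB) -> 8-bit RGB image."""
    # 1. Normalize sensor data to [0, 1].
    raw = bayer_raw_10bit.astype(np.float32) / 1023.0

    # 2. Demosaicking: reconstruct 3 channels from the Bayer mosaic.
    rgb = cv2.cvtColor((raw * 65535).astype(np.uint16), cv2.COLOR_BayerRG2RGB)
    rgb = rgb.astype(np.float32) / 65535.0

    # 3. White balance: per-channel gains (hypothetical fixed values).
    gains = np.array([1.8, 1.0, 1.6], dtype=np.float32)
    rgb = np.clip(rgb * gains, 0.0, 1.0)

    # 4. Color correction: 3x3 matrix toward sRGB primaries (identity used as a placeholder).
    ccm = np.eye(3, dtype=np.float32)
    rgb = np.clip(rgb @ ccm.T, 0.0, 1.0)

    # 5. Denoising (a simple Gaussian blur stands in for a real denoiser).
    rgb = cv2.GaussianBlur(rgb, (3, 3), 0)

    # 6. Gamma correction / tone mapping (sRGB-like gamma of 1/2.2).
    rgb = np.power(rgb, 1.0 / 2.2)

    return (rgb * 255.0).astype(np.uint8)
```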

3.1.2. DSLR vs. Smartphone Cameras

  • DSLR Cameras: Digital Single-Lens Reflex cameras are professional-grade photographic devices known for their large sensors, interchangeable lenses, and advanced image processing capabilities.
    • Advantages: Larger sensors (often APS-C or full-frame) capture more light and detail, produce less noise, and offer shallower depth of field (pleasing background blur). High-quality optics provide superior sharpness. Powerful ISPs handle complex image processing to produce excellent sRGB output.
  • Smartphone Cameras: Cameras integrated into smartphones.
    • Advantages: Portability, convenience, and increasingly sophisticated computational photography algorithms.
    • Disadvantages: Much smaller sensors and fixed or limited lenses compared to DSLRs. This physical constraint leads to inherent limitations in light gathering, dynamic range, and resolution of fine details, which even advanced software processing struggles to fully overcome.

3.1.3. Diffusion Models

Diffusion models are a class of generative models that have recently achieved state-of-the-art results in image generation and editing tasks. They operate on the principle of incrementally adding noise to data and then learning to reverse this process to generate new data.

  • Forward Diffusion Process: This is a fixed, predefined process where Gaussian noise is progressively added to an original data sample (e.g., an image $y_0$) over a series of $T$ time steps.

    • At each step $t$, a small amount of Gaussian noise is added to the image $y_{t-1}$ to produce $y_t$.

    • As $t$ increases, the image $y_t$ gradually loses its original information and approaches pure Gaussian noise at $y_T$.

    • A key property is that $y_t$ can be sampled directly from $y_0$ at any step $t$ using a closed-form formula, which simplifies training.

    • The paper defines this as: $ q(y_t | y_{t-1}) = \mathcal{N}(y_t; \sqrt{1 - \beta_t}\, y_{t-1}, \beta_t I) $ Here, $q(y_t | y_{t-1})$ is the probability distribution of $y_t$ given $y_{t-1}$: a normal distribution with mean $\sqrt{1 - \beta_t}\, y_{t-1}$ and variance $\beta_t I$.

      • $\mathcal{N}$: The normal (Gaussian) distribution.
      • $y_t$: The noisy image at time step $t$.
      • $y_{t-1}$: The noisy image at time step $t-1$.
      • $\beta_t$: A small, predefined variance hyperparameter for the Gaussian noise added at step $t$; the schedule lies in the range $(0, 1)$.
      • $I$: The identity matrix, representing independent noise added to each pixel.
    • This can be re-expressed for direct sampling from $y_0$: $ q(y_t | y_0) = \mathcal{N}(y_t; \sqrt{\overline{\alpha}_t}\, y_0, (1 - \overline{\alpha}_t) I) $ Here, $\alpha_t = 1 - \beta_t$ and $\overline{\alpha}_t = \prod_{i=1}^t \alpha_i$.

      • $\overline{\alpha}_t$: The cumulative product of $1 - \beta_i$ up to time $t$; it dictates the signal-to-noise ratio at step $t$. As $t$ increases, $\overline{\alpha}_t$ decreases, meaning more noise has been added.
      • This equation shows that $y_t$ can be obtained directly from $y_0$ by adding a specific amount of noise: $y_t = \sqrt{\overline{\alpha}_t}\, y_0 + \sqrt{1 - \overline{\alpha}_t}\, \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$ is a random noise sample.
  • Reverse Denoising Process: This is the learned process where a neural network (typically a U-Net) is trained to reverse the forward process, gradually removing noise to reconstruct the original data $y_0$ from $y_T$.

    • Training: The U-Net, denoted $f_\theta(y_t, t)$, is trained to predict the noise $\epsilon$ that was added to $y_0$ to obtain $y_t$. The loss function minimizes the difference between the predicted and actual noise: $ L(\theta) = \mathbb{E}_{y_0, t, \epsilon} \| f_\theta(y_t, t) - \epsilon \|^2 $
      • $\theta$: Learnable parameters of the U-Net.
      • $\mathbb{E}$: Expectation (average) over $y_0$, $t$, and $\epsilon$.
      • $\|\cdot\|^2$: Squared L2 norm, measuring the squared difference between the predicted and actual noise.
    • Inference: Starting from pure Gaussian noise $y_T \sim \mathcal{N}(0, I)$, the trained U-Net iteratively denoises $y_t$ to infer $y_{t-1}$, until $y_0$ is generated. The mean of the reverse step is estimated as: $ \mu_\theta(y_t, t) = \frac{1}{\sqrt{\alpha_t}} \left( y_t - \frac{1 - \alpha_t}{\sqrt{1 - \overline{\alpha}_t}} f_\theta(y_t, t) \right) $
      • $\mu_\theta(y_t, t)$: The mean of the Gaussian distribution for $y_{t-1}$ predicted by the model.
      • This formula effectively estimates $y_0$ from $y_t$ and the predicted noise, then uses that estimate to guide the step to $y_{t-1}$.
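
To make the forward sampling, the training loss, and the reverse step concrete, here is a minimal DDPM-style sketch in PyTorch. The linear β schedule, the σ_t = sqrt(β_t) choice, and the `model(y_t, t)` interface are assumptions for illustration, not the paper's implementation.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # β_t schedule (assumed)
alphas = 1.0 - betas                           # α_t
alpha_bar = torch.cumprod(alphas, dim=0)       # ᾱ_t = ∏ α_i

def q_sample(y0, t, eps):
    """Forward process: y_t = sqrt(ᾱ_t) y_0 + sqrt(1 - ᾱ_t) ε."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * y0 + (1 - a).sqrt() * eps

def training_loss(model, y0):
    """L(θ) = E || f_θ(y_t, t) - ε ||²."""
    t = torch.randint(0, T, (y0.shape[0],))
    eps = torch.randn_like(y0)
    yt = q_sample(y0, t, eps)
    return ((model(yt, t) - eps) ** 2).mean()

@torch.no_grad()
def p_sample_step(model, yt, t):
    """One reverse step: μ_θ(y_t, t) = (y_t - (1-α_t)/sqrt(1-ᾱ_t) f_θ(y_t, t)) / sqrt(α_t)."""
    eps_hat = model(yt, torch.full((yt.shape[0],), t))
    mean = (yt - (1 - alphas[t]) / (1 - alpha_bar[t]).sqrt() * eps_hat) / alphas[t].sqrt()
    if t > 0:
        mean = mean + betas[t].sqrt() * torch.randn_like(yt)   # add σ_t z for t > 0
    return mean
```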

3.1.4. U-Net

A U-Net is a type of convolutional neural network (CNN) architecture particularly popular for image segmentation and image-to-image translation tasks.

  • Architecture: It consists of a contracting path (encoder) that captures context by downsampling and applying convolutional layers, and an expansive path (decoder) that enables precise localization by upsampling and concatenating features from the contracting path. This U-shaped design allows it to learn both high-level semantic features and low-level fine-grained details, which is ideal for tasks requiring output images of the same resolution as the input.
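
For readers unfamiliar with the architecture, the sketch below is a deliberately tiny PyTorch U-Net showing the encoder-decoder structure with skip connections. It is an illustrative assumption only; real diffusion U-Nets add time-step embeddings, attention blocks, and many more channels.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net: two downsampling stages, a bottleneck, and two upsampling stages with skips."""
    def __init__(self, in_ch=3, out_ch=3, base=32):
        super().__init__()
        def block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(),
                                 nn.Conv2d(co, co, 3, padding=1), nn.ReLU())
        self.enc1 = block(in_ch, base)
        self.enc2 = block(base, base * 2)
        self.mid = block(base * 2, base * 4)
        self.up2 = nn.ConvTranspose2d(base * 4, base * 2, 2, stride=2)
        self.dec2 = block(base * 4, base * 2)       # input = upsampled features + skip
        self.up1 = nn.ConvTranspose2d(base * 2, base, 2, stride=2)
        self.dec1 = block(base * 2, base)
        self.out = nn.Conv2d(base, out_ch, 1)
        self.pool = nn.MaxPool2d(2)

    def forward(self, x):
        e1 = self.enc1(x)                           # full resolution
        e2 = self.enc2(self.pool(e1))               # 1/2 resolution
        m = self.mid(self.pool(e2))                 # 1/4 resolution (bottleneck)
        d2 = self.dec2(torch.cat([self.up2(m), e2], dim=1))
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        return self.out(d1)                         # same spatial size as the input
```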

3.1.5. Conditional Diffusion Models

For image restoration or enhancement tasks (like RAW-to-sRGB), diffusion models are often conditioned on a low-quality (LQ) image (e.g., the smartphone RAW image, denoted as $x$).

  • During training, information about the LQ image $x$ is fed into the U-Net (e.g., $f_\theta(y_t, x, t)$) to guide the denoising process towards a high-quality (HQ) image $y$ that is consistent with $x$.
  • The mean for the reverse process in a conditional setting becomes: $ \mu_\theta(y_t, x, t) = \frac{1}{\sqrt{\alpha_t}} \left( y_t - \frac{1 - \alpha_t}{\sqrt{1 - \overline{\alpha}_t}} f_\theta(y_t, x, t) \right) $
    • The U-Net $f_\theta$ now takes $x$ as an additional input, learning the conditional distribution $p(y|x)$.
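
One common way to realize $f_\theta(y_t, x, t)$ in practice is to concatenate the condition with the noisy image along the channel axis; this is an assumption for illustration, since this background section does not fix the exact injection mechanism.

```python
import torch

def conditional_eps_prediction(model, y_t, x, t):
    """Predict noise with an LQ condition x by channel-wise concatenation: f_θ(y_t, x, t)."""
    net_in = torch.cat([y_t, x], dim=1)   # shape (B, C_y + C_x, H, W)
    return model(net_in, t)               # model's first conv must accept C_y + C_x channels
```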

3.2. Previous Works

The paper discusses two main areas of related work: Deep Learning-based ISP Networks and Diffusion Models.

3.2.1. Deep Learning-based ISP Networks

Traditional ISP pipelines are hand-crafted. Recent research has focused on using deep learning to learn end-to-end ISP mappings to overcome smartphone hardware limitations.

  • Ignatov et al. (PyNet, 2020): This work is a direct precursor, proposing an end-to-end ISP network (PyNet) to replace conventional smartphone ISPs, trained on a dataset of Huawei P20 smartphone RAW and Canon 5D Mark IV DSLR sRGB pairs. This established a benchmark for the RAW-to-sRGB task.
  • AWNet (Dai et al., 2020): Incorporated global context blocks to manage image misalignment, a common issue in RAW-sRGB datasets.
  • CoBi Loss (Zhang et al., 2019): Introduced a contextual bilateral loss to find the best matching patch for supervision, partially addressing data misalignment. However, it did not fully resolve spatial displacement from depth variations.
  • MW-ISPNet (Ignatov et al., 2020): Another iteration in learning ISP pipelines.
  • LiteISPNet (Zhang et al., 2021): Developed a color-shift-resistant GCM module to handle color inconsistencies and pixel position shifts. It also used a lightflow alignment module to synchronize DSLR sRGB images with the mobile coordinate system, reducing blurring and shifting artifacts from training data misalignment.
  • Color Prediction Networks (Tripathi et al., 2022): Utilized a color prediction network based on the Perceiver architecture to tackle pronounced color disparity between mobile RAW and DSLR images.

Key Limitations of Previous ISP Networks:

  • Detail Reconstruction: These methods often struggle with ill-posed problems like reconstructing fine details that are entirely absent in the smartphone RAW input. They might rely heavily on the input RAW for details, which can propagate artifacts.
  • Misalignment and Color Inconsistency: While some methods tried to mitigate misalignment and color shifts, these issues remain challenging and can lead to blurring or color biases in the output.
  • Generative Capacity: Most are deterministic networks, which may struggle to generate perceptually rich and diverse details when the input is highly degraded.

3.2.2. Diffusion Models

The paper situates its work within the broader context of diffusion models, highlighting their superior detail generation capabilities compared to Generative Adversarial Networks (GANs).

  • Sohl-Dickstein et al. (2015): Pioneers who first proposed the diffusion model concept, drawing inspiration from non-equilibrium statistical physics.
  • Ho et al. (2020): Established a crucial link between diffusion models and denoising score matching, leading to the widely adopted Denoising Diffusion Probabilistic Models (DDPMs) that form the basis of many modern diffusion applications.
  • Song et al. (2020): Advanced a unified framework for diffusion models using stochastic differential equations (SDEs).
  • Concurrent Works on Image Restoration:
    • Inversion by Direct Iteration (InDI) (Delbracio and Milanfar, 2023): Modeled image restoration as an iterative inversion process.
    • SDE-based Approaches (Luo et al., 2023; Liu et al., 2023): Expressed image restoration tasks within the SDE framework of diffusion models.

3.3. Technological Evolution

The evolution in this field has moved from traditional, hand-tuned ISP pipelines to end-to-end deep learning models. Early deep learning approaches focused on directly learning the mapping but faced challenges with misalignment, color inconsistency, and the ill-posed nature of detail reconstruction. The recent advent of generative models, particularly diffusion models, offers a new paradigm due to their ability to synthesize highly realistic and detailed images by learning underlying data distributions. This allows for a more robust approach to hallucinating details absent in low-quality inputs.

3.4. Differentiation Analysis

DiffRAW distinguishes itself from previous works in several key ways:

  • Novel Application of Diffusion Models: While diffusion models have been used for general image restoration, DiffRAW is the first to specifically apply them to the smartphone RAW-to-DSLR sRGB mapping task. This is a crucial distinction as this task has unique challenges related to RAW data characteristics and severe detail disparity.
  • Leveraging Generative Power for Detail Reconstruction: Unlike previous deterministic ISP networks that might struggle to invent details not present in the RAW input, DiffRAW's diffusion model explicitly learns the high-quality detail distribution of DSLR images. This allows it to "hallucinate" plausible, realistic details that are DSLR-comparable.
  • Targeted Conditioning for Robustness: DiffRAW introduces two specific conditioning mechanisms:
    1. RAW Condition ($w$): Used solely for structural preservation (contours, textures), preventing the model from relying on the detail-deficient RAW for fine details. This decouples detail generation from structural guidance.
    2. Color-Position Preserving Condition ($c$): This is a direct response to the persistent misalignment and color inconsistency issues that plagued prior ISP networks. By deriving $c$ from the target sRGB during training and from a color extraction network during testing, it effectively regularizes the training, preventing color biases and pixel shifts. This is a more robust solution than alignment modules or specialized loss functions alone.
  • Efficient Inference with DTDM: Diffusion models are notorious for their slow inference speeds due to many iterative steps. DiffRAW's Domain Transform Diffusion Method (DTDM) is a significant innovation that simultaneously accelerates inference (fewer steps) and enhances perceptual quality by performing denoising and domain transformation from LQ to HQ within each step. This makes the approach more practical.
  • SOTA Perceptual Quality: By combining these innovations, DiffRAW demonstrably surpasses existing ISP networks in perceptual metrics, indicating a higher visual realism and fidelity, which is paramount for image enhancement tasks.

4. Methodology

4.1. Principles

The core idea behind DiffRAW is to leverage the powerful generative capabilities of diffusion models to overcome the limitations of smartphone RAW images and produce DSLR-comparable sRGB output. The method is built upon three main principles:

  1. Detail Generation via Diffusion: Recognizing that smartphone RAW images inherently lack fine details present in DSLR images, directly trying to recover these details from the RAW is an ill-posed problem. Instead, DiffRAW uses a diffusion model to learn the distribution of high-quality details from DSLR images. This allows the model to synthesize or "hallucinate" realistic, high-frequency details that were missing in the input.
  2. Structural Preservation through RAW Conditioning: To ensure that the generated images maintain the original scene's layout, contours, and textures, the smartphone RAW image ($w$) is explicitly used as a diffusion condition. This guides the diffusion model to keep the overall structure intact without relying on the RAW for fine details, thereby preventing the detail loss in the RAW from degrading the final output.
  3. Robustness to Data Imperfections via Color-Position Preserving Condition: To address the color disparities, unstable color mappings, and spatial misalignments common in RAW-sRGB training data pairs, DiffRAW introduces a color-position preserving condition ($c$). This condition acts as a regularizer, guiding the model to produce outputs that are color-consistent and spatially aligned with a stable reference, effectively bypassing the inconsistencies in the raw data pairs.
  4. Efficient and Enhanced Inference with Domain Transform Diffusion Method (DTDM): Diffusion models typically require many inference steps, which can be computationally expensive. DTDM is designed to accelerate this process while simultaneously improving the quality of the generated images. It achieves this by modifying the diffusion process such that each step in the reverse process not only denoises but also performs a domain transformation from the low-quality (input-like) state to the high-quality (target-like) state.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. RAW Condition

The smartphone RAW image (denoted as $w$) serves a critical role as a diffusion condition in DiffRAW. Its primary function is to provide structural guidance to the generative process.

  • Purpose: The model uses $w$ to preserve fundamental structural information such as contours and textures in the output image.
  • Distinction: Crucially, the model is not dependent on $w$ for intricate details. This distinction is key because smartphone RAW images inherently contain detail loss due to hardware limitations. By separating structural guidance from detail generation, DiffRAW prevents the low-detail RAW input from hindering the creation of high-quality details.
  • Benefit: This strategy allows DiffRAW to maintain the overall image structure of the smartphone RAW image while simultaneously injecting DSLR-comparable details learned from the diffusion model's understanding of high-quality DSLR distributions.

4.2.2. Color-Position Preserving Condition

To address the significant challenges of unstable color mapping relationships and spatial misalignment between smartphone RAW ($w$) and DSLR sRGB ($y$) data pairs, DiffRAW introduces a novel color-position preserving condition (denoted as $c$).

  • Problem Statement: Directly learning $p(y|w)$ (the conditional distribution of DSLR sRGB given smartphone RAW) would lead to color biases, image blurring, and pixel shifting in the output due to the inconsistencies in the training data.
  • Solution: The condition $c$ is designed to provide a stable color and spatial reference.
    • During Training ($c^{train}$): $c^{train}$ is generated by degrading the target DSLR sRGB image $y$ using a high-order degradation model $\mathcal{D}^2$. This ensures strict color consistency and spatial alignment between $c^{train}$ and $y$: $ c^{train} = \mathcal{D}^2(y) $
      • $\mathcal{D}^2$: A high-order degradation model that simulates various forms of degradation (e.g., blurring, noise, downsampling, compression) but is specifically tuned here to maintain color consistency between its input $y$ and output $c^{train}$.
      • By training with $c^{train}$, which is perfectly color-consistent and spatially aligned with $y$, DiffRAW learns to maintain this consistency in its output.
    • During Testing ($c^{test}$): $c^{test}$ is derived from the input smartphone RAW image $w$ using a pre-trained color extraction network $\mathcal{G}$: $ c^{test} = \mathcal{G}(w; \Theta_{\mathcal{G}}) $
      • $\mathcal{G}$: A color extraction network (e.g., a pre-trained lightweight ISP network such as LiteISPNet, PyNet, or MW-ISPNet). This network processes the RAW image $w$ to produce a naturally colored sRGB image, providing a color reference for the diffusion model.
      • $\Theta_{\mathcal{G}}$: The parameters of the color extraction network.
  • Benefits:
    • Mitigates Misalignment and Color Bias: By learning the relationship between $c^{train}$ and $y$ (which are consistent), the model, when given $c^{test}$, ensures that its generated results inherit the color consistency of $c^{test}$ and avoid pixel shifts or blurring.
    • Flexible Color Style Transfer: The condition $c$ also offers a "color pluggable" feature: by infusing different color representations (i.e., different $c^{test}$ inputs) into the model, users can flexibly adjust the color style of the generated images.
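
The exact form of $\mathcal{D}^2$ is not spelled out in this summary, so the sketch below only approximates a second-order degradation in the spirit described above (blur → resize → noise → JPEG, applied twice to the DSLR sRGB image). Every kernel size, scale range, noise level, and JPEG quality here is a hypothetical choice for illustration.

```python
import numpy as np
import cv2

def degrade_once(img: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """One degradation pass on a uint8 HxWx3 image: blur -> down/up resize -> Gaussian noise -> JPEG."""
    h, w = img.shape[:2]
    out = cv2.GaussianBlur(img, (5, 5), sigmaX=rng.uniform(0.5, 2.0))
    scale = rng.uniform(0.5, 1.0)
    out = cv2.resize(out, (int(w * scale), int(h * scale)), interpolation=cv2.INTER_AREA)
    out = cv2.resize(out, (w, h), interpolation=cv2.INTER_LINEAR)
    noise = rng.normal(0.0, rng.uniform(1.0, 8.0), out.shape)
    out = np.clip(out.astype(np.float32) + noise, 0, 255).astype(np.uint8)
    quality = int(rng.uniform(60, 95))
    ok, enc = cv2.imencode(".jpg", out, [cv2.IMWRITE_JPEG_QUALITY, quality])
    return cv2.imdecode(enc, cv2.IMREAD_COLOR)

def c_train_from_dslr(y: np.ndarray, seed: int = 0) -> np.ndarray:
    """Second-order degradation: apply the pass twice, c_train ≈ D²(y). No color shift is introduced."""
    rng = np.random.default_rng(seed)
    return degrade_once(degrade_once(y, rng), rng)
```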

4.2.3. Domain Transform Diffusion Method (DTDM)

The Domain Transform Diffusion Method (DTDM) is a novel and efficient diffusion process designed to accelerate inference while enhancing image quality. It cleverly integrates a domain transformation (from low-quality to high-quality) into each denoising step.

For clarity, the paper redefines $x$ as an LQ image for the purpose of describing DTDM:

  • During Training: $x^{train}$ is the DSLR-degraded image (i.e., $c^{train}$).
  • During Testing: $x^{test}$ is the output of the color extraction network $\mathcal{G}(w; \Theta_{\mathcal{G}})$ (i.e., $c^{test}$): $ x^{train} = \mathcal{D}^2(y), \quad x^{test} = \mathcal{G}(w; \Theta_{\mathcal{G}}) $
    • $x$: The low-quality image input for the DTDM.

    • $y$: The high-quality target image.

    • $w$: The smartphone RAW image.

    • $\mathcal{D}^2$: The high-order degradation model.

    • $\mathcal{G}$: The color extraction network.

      The primary motivation for DTDM is that, in standard conditional diffusion models, if the LQ image $x$ is used as the starting point for inference instead of pure noise $y_T$, a domain gap between $x_s$ and $y_s$ (where $x_s$ and $y_s$ are noisy versions of $x$ and $y$ at step $s$) can lead to inconsistency and reduced detail enhancement when the number of iteration steps $s$ is small. DTDM explicitly addresses this by constructing a new diffusion sequence $m_t$ that bridges this domain gap.

4.2.3.1. Forward Process of DTDM

In the DTDM forward process, a new image diffusion sequence $\{m_t\}_{t=0}^s$ is constructed, starting from the high-quality target ($m_0 = y$) and ending at a noisy version of the low-quality input ($m_s = x_s$). Each diffusion step from $m_{t-1}$ to $m_t$ involves two actions: a minor degradation from $y$ towards $x$, followed by a slight noise addition.

For notational convenience, denote $m_{t-1}^{t-1}$ as $m_{t-1}$ and $m_t^t$ as $m_t$; the intermediate image after the first minor degradation step from $m_{t-1}$ is $m_{t-1}^t$.

  1. Minor Degradation (Domain Transformation) Step: This step describes how the image $m_{t-1}$ is slightly degraded towards the characteristics of $x$: $ m_{t-1}^t = m_{t-1}^{t-1} + \sqrt{\overline{\alpha}_{t-1}} \left( m_0^t - m_0^{t-1} \right) $

    • $t \in \{1, 2, \ldots, s\}$: The time step index.
    • $m_{t-1}^t$: The intermediate image after the degradation step from $m_{t-1}$.
    • $m_{t-1}^{t-1}$: The image at the previous step $t-1$.
    • $\overline{\alpha}_{t-1}$: The cumulative product of $1 - \beta_i$ up to step $t-1$, which scales the degradation term.
    • $(m_0^t - m_0^{t-1})$: The degradation increment from one step to the next, derived from the image sequence $\{m_0^t\}_{t=0}^s$.
  2. Slight Noise Addition Step: After the degradation, Gaussian noise is added to the intermediate image: $ m_t^t = \sqrt{\alpha_t}\, m_{t-1}^t + \sqrt{1 - \alpha_t}\, \epsilon $

    • $m_t^t$: The image at step $t$ after both degradation and noise addition.

    • $\alpha_t = 1 - \beta_t$: The signal-retention parameter of the noise schedule at step $t$.

    • $\epsilon \sim \mathcal{N}(0, I)$: Random Gaussian noise.

      The image sequence $\{m_0^t\}_{t=0}^s$ and the constant $\gamma_s$ are defined as: $ m_0^t = y + \frac{\sqrt{1 - \overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}} \left[ \gamma_s (x - y) \right], \quad \gamma_s = \frac{\sqrt{\overline{\alpha}_s}}{\sqrt{1 - \overline{\alpha}_s}} $

  • $m_0^t$: A modified version of the target image $y$ that incorporates the difference between $x$ and $y$, scaled by $\gamma_s$ and by the noise schedule through $\overline{\alpha}_t$. This term essentially defines the domain transformation.

  • $\gamma_s$: A scaling parameter that depends on the total number of inference steps $s$. It balances the contribution of the $x - y$ difference.

    Combining these steps, the overall diffusion process of the sequence $\{m_t\}_{t=0}^s$ is: $ q(m_t | m_{t-1}, x, y) = \mathcal{N}(m_t; \mu_t^{diff}, (1 - \alpha_t) I) $ $ \mu_t^{diff} = \sqrt{\alpha_t}\, m_{t-1} + \sqrt{\overline{\alpha}_t} \left( m_0^t - m_0^{t-1} \right) $

  • $\mu_t^{diff}$: The mean of the Gaussian distribution for $m_t$ given $m_{t-1}, x, y$. This mean term encapsulates both the signal from the previous step and the degradation increment.

    Recursively applying these equations, the distribution of $m_t$ can be computed directly from $x$ and $y$: $ q(m_t | x, y) = \mathcal{N}(m_t; \sqrt{\overline{\alpha}_t}\, m_0^t, (1 - \overline{\alpha}_t) I) $ This implies that applying noise $t$ times to $m_0^t$ yields $m_t$. Substituting the definition of $m_0^t$ into this equation gives: $ m_t = \sqrt{\overline{\alpha}_t}\, y + \sqrt{1 - \overline{\alpha}_t} \left[ \gamma_s (x - y) + \epsilon \right] $

  • This final equation for $m_t$ shows that the sequence begins with $m_0 = y$ (at $t = 0$, $\overline{\alpha}_0 = 1$, so $m_0 = y$) and, after $s$ diffusion steps, ends at $m_s = x_s$: substituting $\gamma_s$ gives $m_s = \sqrt{\overline{\alpha}_s}\, y + \sqrt{1 - \overline{\alpha}_s} \left[ \frac{\sqrt{\overline{\alpha}_s}}{\sqrt{1 - \overline{\alpha}_s}} (x - y) + \epsilon \right] = \sqrt{\overline{\alpha}_s}\, y + \sqrt{\overline{\alpha}_s} (x - y) + \sqrt{1 - \overline{\alpha}_s}\, \epsilon = \sqrt{\overline{\alpha}_s}\, x + \sqrt{1 - \overline{\alpha}_s}\, \epsilon = x_s$. This confirms that the sequence successfully transforms from $y$ to $x_s$ over $s$ steps.
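
A small NumPy sketch makes the closed-form expression for $m_t$ and the endpoint check above concrete. The linear β schedule is an assumption for illustration, since the paper's exact schedule is not reproduced here.

```python
import numpy as np

T, s = 2000, 100
betas = np.linspace(1e-4, 0.02, T)            # assumed β schedule
alpha_bar = np.cumprod(1.0 - betas)           # ᾱ_t
gamma_s = np.sqrt(alpha_bar[s - 1]) / np.sqrt(1.0 - alpha_bar[s - 1])   # γ_s = sqrt(ᾱ_s)/sqrt(1-ᾱ_s)

def m_t(y, x, t, eps):
    """Closed-form DTDM forward sample: m_t = sqrt(ᾱ_t) y + sqrt(1-ᾱ_t) [γ_s (x - y) + ε]."""
    a = alpha_bar[t - 1]
    return np.sqrt(a) * y + np.sqrt(1.0 - a) * (gamma_s * (x - y) + eps)

# Endpoint check (with ε = 0): at t = s the sample equals sqrt(ᾱ_s) x, i.e. the mean of x_s.
y = np.random.rand(8, 8, 3)
x = np.random.rand(8, 8, 3)
ms_mean = m_t(y, x, s, np.zeros_like(y))
assert np.allclose(ms_mean, np.sqrt(alpha_bar[s - 1]) * x)
```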

4.2.3.2. Training Process of DTDM

The training objective for the U-Net $f_\theta(m_t, w, c, t)$ is to predict the combined term $\gamma_s(x - y) + \epsilon$.

  • Learning Target: $ \frac{m_t - \sqrt{\overline{\alpha}_t}\, y}{\sqrt{1 - \overline{\alpha}_t}} = \gamma_s (x - y) + \epsilon $
    • $\gamma_s(x - y)$: Captures the high-frequency details, i.e., the domain-transformation information between $x$ and $y$.
    • $\epsilon$: The random noise component added to $m_t$.
  • Loss Function: The network is trained to minimize the squared difference between its prediction and this target: $ L(\theta) = \mathbb{E}_{x, y, t, \epsilon} \| f_\theta(m_t, w, c, t) - [\gamma_s (x - y) + \epsilon] \|^2 $
    • $f_\theta(m_t, w, c, t)$: The U-Net model, conditioned on $m_t$ (noisy image), $w$ (RAW image for structure), $c$ (color-position preserving condition), and $t$ (time step).
    • The U-Net learns to disentangle the noise $\epsilon$ and the detail difference $\gamma_s(x - y)$ from the noisy input $m_t$.
  • Estimation of the Target Image: After training, for any step $t$ and current image $m_t$, the estimate of the target image $y$ is: $ \hat{y}(m_t, x, t) = \frac{m_t - \sqrt{1 - \overline{\alpha}_t}\, f_\theta(m_t, w, c, t)}{\sqrt{\overline{\alpha}_t}} $
    • $\hat{y}$: The estimated high-quality image. This formula effectively reverses the forward process by subtracting the predicted noise-plus-detail-difference term.

      The training procedure is summarized in Algorithm 1:

Algorithm 1: DiffRAW Training
1: repeat
2:   $(w, y) \sim q(w, y)$  // Sample a RAW-sRGB pair
3:   $x = \mathcal{D}^2(y)$  // Create the low-quality image $x$ (same as $c^{train}$)
4:   $c = x$  // Set the color-position preserving condition $c$ to $x$
5:   $t \sim \text{Uniform}(\{1, 2, 3, \ldots, s\})$  // Randomly select a time step $t$
6:   $\epsilon \sim \mathcal{N}(0, I)$  // Sample random Gaussian noise
7:   $m_t = \sqrt{\overline{\alpha}_t}\, y + \sqrt{1 - \overline{\alpha}_t}\, [\gamma_s (x - y) + \epsilon]$  // Construct $m_t$ via the DTDM forward process
8:   Take a gradient descent step on $ \nabla_\theta \| f_\theta(m_t, w, c, t) - [\gamma_s (x - y) + \epsilon] \|^2 $  // Update model parameters $\theta$
9: until converged
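
A PyTorch-style sketch of one training iteration following Algorithm 1 above; `model`, `degrade2`, the optimizer, and the schedule tensors are placeholders with assumed signatures, not the paper's released code.

```python
import torch

def dtdm_train_step(model, optimizer, w, y, alpha_bar, gamma_s, degrade2, s):
    """One DTDM training step: predict γ_s(x - y) + ε from m_t, conditioned on (w, c, t)."""
    x = degrade2(y)                                    # x^train = D²(y)
    c = x                                              # color-position preserving condition
    t = torch.randint(1, s + 1, (y.shape[0],))         # t ~ Uniform({1, ..., s})
    eps = torch.randn_like(y)
    a = alpha_bar[t - 1].view(-1, 1, 1, 1)             # ᾱ_t per sample
    target = gamma_s * (x - y) + eps                   # learning target
    m_t = a.sqrt() * y + (1 - a).sqrt() * target       # DTDM forward sample
    loss = ((model(m_t, w, c, t) - target) ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```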

4.2.3.3. Reverse Process (Inference) of DTDM

The reverse process starts from $m_s = x_s$ and iteratively infers $m_{s-1}, m_{s-2}, \ldots$, eventually reaching $m_0 = y$. Each step performs denoising together with a domain transform from $x$ towards $y$. This is achieved by applying Bayes' theorem to derive the conditional probability $q(m_{t-1} | m_t, x, y)$.

The reverse step is defined by: $ p_\theta(m_{t-1} | m_t, x) = \mathcal{N}(m_{t-1}; \hat{\mu}_\theta^{bayes}(m_t, x), \sigma_t^2 I) $ $ \hat{\mu}_\theta^{bayes}(m_t, x) = \left[ \frac{\sqrt{1 - \overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}} \lambda_t - \frac{1 - \alpha_t}{\sqrt{\alpha_t}\sqrt{1 - \overline{\alpha}_t}} \right] f_\theta(m_t, w, c, t) + \left[ \frac{1}{\sqrt{\alpha_t}} - \frac{\lambda_t}{\sqrt{\overline{\alpha}_t}} \right] m_t + \lambda_t x $

  • $\hat{\mu}_\theta^{bayes}(m_t, x)$: The predicted mean for $m_{t-1}$ given $m_t$ and $x$. It combines the U-Net prediction $f_\theta$, the current noisy image $m_t$, and the low-quality image $x$, weighted by the noise schedule parameters ($\alpha_t$, $\overline{\alpha}_t$) and $\lambda_t$.

  • $\sigma_t$: The standard deviation of the Gaussian noise added in the reverse step, usually a predefined constant.

    The parameter $\lambda_t$ is defined as: $ \lambda_t = \left[ \sqrt{1 - \overline{\alpha}_{t-1}} \left( 1 - \sqrt{\alpha_t} \frac{\sqrt{1 - \overline{\alpha}_{t-1}}}{\sqrt{1 - \overline{\alpha}_t}} \right) \right] \gamma_s $

  • $\lambda_t$: A weighting factor that depends on the noise schedule parameters and $\gamma_s$. It controls the influence of the low-quality image $x$ and of the U-Net's prediction in guiding the reverse step.

    The inference procedure is summarized in Algorithm 2:

Algorithm 2: DiffRAW Inference
1: $x = \mathcal{G}(w; \Theta_{\mathcal{G}})$  // Extract the color-position preserving condition from the RAW image $w$
2: $c = x$  // Set the color-position preserving condition $c$ to $x$
3: $m_s \sim \mathcal{N}(m_s; \sqrt{\overline{\alpha}_s}\, x, (1 - \overline{\alpha}_s) I)$  // Initialize the starting point $m_s$ by adding noise to $x$ (this is $x_s$)
4: for $t = s, \ldots, 1$ do
5:   $z \sim \mathcal{N}(0, I)$ if $t > 1$, else $z = 0$  // Sample random noise for stochasticity, except at the last step
6:   $m_{t-1} = \left[ \frac{\sqrt{1 - \overline{\alpha}_t}}{\sqrt{\overline{\alpha}_t}} \lambda_t - \frac{1 - \alpha_t}{\sqrt{\alpha_t}\sqrt{1 - \overline{\alpha}_t}} \right] f_\theta(m_t, w, c, t) + \left[ \frac{1}{\sqrt{\alpha_t}} - \frac{\lambda_t}{\sqrt{\overline{\alpha}_t}} \right] m_t + \lambda_t x + \sigma_t z$  // One reverse step using the derived mean, plus noise
7: end for
8: return $m_0$  // The final generated high-quality sRGB image

Comparison with standard diffusion (DDPM): In previous diffusion-based image restoration algorithms (such as DDPM-based ones), if $x_s$ (the noisy version of the low-quality input) were used as a starting point, the model would primarily denoise it. DTDM, however, integrates a domain transfer from $x$ to $y$ into each step: every iteration not only denoises $m_t$ but also progressively transforms its characteristics towards the high-quality target $y$. This dual action allows DTDM to transform $x_s$ into $y$ with fewer iterations while simultaneously enhancing the quality of the generated images.
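
The iteration in Algorithm 2 can be sketched as follows; `model` and `color_net` are placeholders with assumed signatures, and the schedule tensors (`alphas`, `alpha_bar`, `sigmas`) and scalar `gamma_s` are assumed to be precomputed as in the training sketch.

```python
import torch

@torch.no_grad()
def dtdm_inference(model, color_net, w, alphas, alpha_bar, sigmas, gamma_s, s):
    """DTDM inference: start from a noisy version of x and iterate down to m_0 ≈ ŷ."""
    x = color_net(w)                                   # x = G(w; Θ_G), also used as the condition c
    c = x
    m = alpha_bar[s - 1].sqrt() * x + (1 - alpha_bar[s - 1]).sqrt() * torch.randn_like(x)  # m_s
    for t in range(s, 0, -1):
        a_t, ab_t = alphas[t - 1], alpha_bar[t - 1]
        ab_prev = alpha_bar[t - 2] if t > 1 else torch.tensor(1.0)
        # λ_t = [sqrt(1-ᾱ_{t-1}) - sqrt(α_t)(1-ᾱ_{t-1})/sqrt(1-ᾱ_t)] γ_s
        lam = ((1 - ab_prev).sqrt() - a_t.sqrt() * (1 - ab_prev) / (1 - ab_t).sqrt()) * gamma_s
        f = model(m, w, c, torch.full((m.shape[0],), t))
        coef_f = (1 - ab_t).sqrt() / ab_t.sqrt() * lam - (1 - a_t) / (a_t.sqrt() * (1 - ab_t).sqrt())
        coef_m = 1.0 / a_t.sqrt() - lam / ab_t.sqrt()
        z = torch.randn_like(m) if t > 1 else torch.zeros_like(m)
        m = coef_f * f + coef_m * m + lam * x + sigmas[t - 1] * z
    return m                                           # the generated sRGB image
```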

The image below illustrates the overall framework of DiffRAW, showing the RAW condition and the Domain Transform Diffusion Method with its forward and inverse processes.

Figure description: a schematic of the DiffRAW diffusion process and its reverse process. The upper half shows the stages from the original smartphone RAW image through processing, including blurring, resizing, noise addition, and JPEG compression; the lower half shows the reverse process, in which a U-Net generates the high-quality sRGB image. The formulas involved are labeled as Equations (15) and (23).

Caption: Overall Framework. The scheme involves a forward process and an inverse process. In the forward process, we degrade $y$ to $x$ stochastically and construct a sequence $m_t$ with a starting point of $y$ and an endpoint of $x_s$. In the inverse process, we first extract $x$ from $w$, add $s$ steps of noise to $x$ to attain the starting point of the inverse process $x_s$, and then use Equation 23 for step-by-step iterative inference until $\hat{y}$ is generated.

5. Experimental Setup

5.1. Datasets

The experiments for DiffRAW were conducted on the Zurich RAW to RGB (ZRR) dataset (Ignatov, Van Gool, and Timofte 2020).

  • Source and Characteristics: This dataset is specifically designed for smartphone RAW-to-DSLR sRGB mapping research. It comprises images captured by a Huawei P20 smartphone (for RAW) and a Canon 5D Mark IV DSLR (for sRGB ground truth).
  • Alignment: The dataset addresses a critical challenge in such paired data by attempting alignment. Images were roughly aligned using SIFT keypoints (Scale-Invariant Feature Transform, a feature detection algorithm) and the RANSAC algorithm (Random Sample Consensus, an iterative method to estimate parameters of a mathematical model from a set of observed data containing outliers). To ensure quality, cropped patches with cross-correlation < 0.9 (indicating poor alignment) were discarded.
  • Scale: The full dataset contains 20 thousand image pairs. After the alignment and cropping process, it resulted in 48,043 RAW-sRGB pairs of size 448 × 448 pixels.
  • Division: The official division of the dataset was followed:
    • Training Set: 46.8k pairs were used to train the DiffRAW model.
    • Testing Set: The remaining 1.2k pairs were used for quantitative evaluation.
  • Purpose: The ZRR dataset is highly effective for validating the proposed method's performance because it directly provides the challenging smartphone RAW inputs and the desired DSLR sRGB targets, along with efforts to mitigate misalignment, making it a standard benchmark for this task.
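
As a generic illustration of the SIFT + RANSAC alignment mentioned above (not the ZRR authors' actual preprocessing code), one could estimate a homography between a pair of grayscale images and warp one onto the other with OpenCV:

```python
import cv2
import numpy as np

def align_with_sift_ransac(src_gray: np.ndarray, dst_gray: np.ndarray) -> np.ndarray:
    """Estimate a homography mapping src onto dst using SIFT keypoints and RANSAC, then warp src."""
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(src_gray, None)
    kp2, des2 = sift.detectAndCompute(dst_gray, None)
    matches = cv2.BFMatcher(cv2.NORM_L2, crossCheck=True).match(des1, des2)
    src_pts = np.float32([kp1[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, ransacReprojThreshold=3.0)
    h, w = dst_gray.shape[:2]
    return cv2.warpPerspective(src_gray, H, (w, h))
```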

5.2. Evaluation Metrics

The paper employs a comprehensive set of evaluation metrics, categorizing them into perceptual quality metrics (also known as no-reference or full-reference perceptual metrics depending on their design) and traditional pixel-wise metrics.

5.2.1. Perceptual Quality Metrics

These metrics aim to quantify how visually pleasing or realistic an image is, often correlating better with human perception than simple pixel-wise differences.

  • LPIPS (Learned Perceptual Image Patch Similarity)

    • Conceptual Definition: LPIPS measures the perceptual similarity between two images by comparing their feature activations in a pre-trained deep neural network (e.g., VGG, AlexNet, SqueezeNet). It quantifies how visually similar two images are, even if their pixel values differ significantly. A lower LPIPS score indicates higher perceptual similarity. It is a full-reference metric, meaning it requires a ground truth image for comparison.
    • Mathematical Formula: $ \text{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} |w_l \odot (f_l(x) - f_l(x_0))|_2^2 $
    • Symbol Explanation:
      • $x$: The generated image.
      • $x_0$: The reference (ground truth) image.
      • $l$: Index over different layers of the pre-trained neural network.
      • $H_l, W_l$: Height and width of the feature maps at layer $l$.
      • $f_l(\cdot)$: Feature extractor (activations) from layer $l$ of the pre-trained network.
      • $w_l$: A learned scalar weight for each channel in layer $l$.
      • $\odot$: Element-wise product.
      • $\|\cdot\|_2^2$: Squared L2 norm, summing the squared differences.
  • FID (Fréchet Inception Distance)

    • Conceptual Definition: FID measures the similarity between two sets of images: a set of generated images and a set of real images. It calculates the Fréchet distance between the Gaussian distributions of feature representations (extracted from an Inception-v3 network) for the two image sets. A lower FID score indicates that the generated images are more similar to the real images in terms of statistical properties, perceived quality, and diversity. It is a no-reference metric in the sense that it doesn't compare individual generated images to individual ground truth images, but rather the distribution of generated images to the distribution of real images.
    • Mathematical Formula: $ \text{FID}(X, G) = |\mu_X - \mu_G|_2^2 + \text{Tr}(\Sigma_X + \Sigma_G - 2(\Sigma_X \Sigma_G)^{1/2}) $
    • Symbol Explanation:
      • $X$: The set of real images.
      • $G$: The set of generated images.
      • $\mu_X, \mu_G$: The mean feature vectors of the real and generated images, respectively, extracted from a pre-trained Inception-v3 network's activation layer.
      • $\Sigma_X, \Sigma_G$: The covariance matrices of the feature vectors for the real and generated images.
      • $\|\cdot\|_2^2$: Squared L2 norm.
      • $\text{Tr}(\cdot)$: The trace of a matrix (sum of its diagonal elements).
  • MUSIQ (Multi-scale Image Quality Transformer)

    • Conceptual Definition: MUSIQ is a learned no-reference image quality assessment (IQA) metric. It uses a transformer-based architecture to aggregate quality information from multiple scales of an image, aiming to predict human perceptual quality scores. It is trained on large-scale human-rated datasets. Higher scores indicate better perceptual quality.
    • Symbol Explanation (as per paper context):
      • MUSIQ-K: Refers to musiq-koniq, likely a variant or specific model of MUSIQ trained on the KonIQ-10k dataset (a large-scale dataset of images with human quality ratings).
      • MUSIQ-S: Refers to musiq-spaq, likely a variant trained on the SPAQ dataset (another dataset for image quality assessment).
  • CLIPIQA+

    • Conceptual Definition: CLIPIQA+ is a no-reference IQA metric that leverages the CLIP (Contrastive Language-Image Pre-training) model's ability to understand image content and quality through text-image alignment. It assesses image quality by evaluating how well an image aligns with quality-related textual descriptions, or by comparing image features to a learned quality manifold. Higher scores indicate better perceptual quality.
    • Symbol Explanation (as per paper context):
      • CLIPIQA+ RN50: Refers to a specific implementation of CLIPIQA+ that uses ResNet-50 as the image encoder backbone within the CLIP model.
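
As a concrete instance of the FID formula above, the following sketch computes the Fréchet distance between two sets of pre-extracted Inception-v3 features; the feature extraction itself is assumed to happen elsewhere.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    """FID = ||μ_X - μ_G||² + Tr(Σ_X + Σ_G - 2 (Σ_X Σ_G)^{1/2}); inputs are (N, D) feature arrays."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):          # numerical noise can introduce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_g) ** 2) + np.trace(cov_r + cov_g - 2.0 * covmean))
```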

5.2.2. Traditional Pixel-wise Metrics

These metrics measure numerical differences between images, often focusing on fidelity rather than perceptual quality.

  • PSNR (Peak Signal-to-Noise Ratio)

    • Conceptual Definition: PSNR is a straightforward metric that quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise. In image processing, it's typically used to measure the quality of reconstruction of lossy compression codecs or image enhancement techniques. A higher PSNR generally indicates a higher quality image, meaning less distortion. It's calculated in decibels (dB). It is a full-reference metric.
    • Mathematical Formula: $ \text{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\text{MSE}} \right) $ Where: $ \text{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
    • Symbol Explanation:
      • $MAX_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
      • $\text{MSE}$: Mean Squared Error between the original and processed images.
      • $I(i,j)$: The pixel value at row $i$, column $j$ of the original image.
      • $K(i,j)$: The pixel value at row $i$, column $j$ of the processed image.
      • $m, n$: The height and width of the image.
  • SSIM (Structural Similarity Index Measure)

    • Conceptual Definition: SSIM is designed to measure the structural similarity between two images, considering three key components: luminance, contrast, and structure. Unlike PSNR which is pixel-wise, SSIM attempts to model the human visual system's sensitivity to structural information. A value closer to 1 indicates higher similarity. It is a full-reference metric.
    • Mathematical Formula: $ \text{SSIM}(x, y) = [l(x,y)]^{\alpha} \cdot [c(x,y)]^{\beta} \cdot [s(x,y)]^{\gamma} $ Where: $ l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2 + \mu_y^2 + C_1} $ $ c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2 + \sigma_y^2 + C_2} $ $ s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y + C_3} $
    • Symbol Explanation:
      • $x, y$: The two image patches being compared.
      • $\mu_x, \mu_y$: The average (mean) pixel values of patches $x$ and $y$.
      • $\sigma_x, \sigma_y$: The standard deviations of pixel values in patches $x$ and $y$.
      • $\sigma_{xy}$: The covariance between pixel values in patches $x$ and $y$.
      • $C_1, C_2, C_3$: Small constants to prevent division by zero (e.g., $C_1 = (K_1 L)^2$, $C_2 = (K_2 L)^2$, $C_3 = C_2/2$, where $L$ is the dynamic range of pixel values and $K_1, K_2$ are small constants).
      • $\alpha, \beta, \gamma$: Weights for the luminance, contrast, and structure components, usually set to 1.
  • NIQE (Natural Image Quality Evaluator)

    • Conceptual Definition: NIQE is a no-reference IQA metric that measures image quality without requiring a clean reference image. It is based on a statistical model of natural images. It extracts a set of features (e.g., from mean subtracted contrast normalized (MSCN) coefficients) from the image, and then measures the distance between the statistical characteristics of the test image and those of a pre-learned model of natural undistorted images. A lower NIQE score indicates better quality (closer to natural).
    • Mathematical Formula: The paper does not provide an explicit formula for NIQE. Fundamentally, NIQE is derived from a multivariate Gaussian model fitted to mean subtracted contrast normalized (MSCN) coefficients of natural images. The quality score is given by the Mahalanobis distance between the feature vector of the test image and the natural image model: $ \text{NIQE} = \sqrt{(\mathbf{v}_1 - \mathbf{v}_2)^T (\Sigma_1 + \Sigma_2)^{-1} (\mathbf{v}_1 - \mathbf{v}_2)} $
    • Symbol Explanation:
      • $\mathbf{v}_1$: Mean vector of the MSCN feature coefficients of the natural-image model.
      • $\mathbf{v}_2$: Mean vector of the MSCN feature coefficients of the test image.
      • $\Sigma_1$: Covariance matrix of the MSCN feature coefficients of the natural-image model.
      • $\Sigma_2$: Covariance matrix of the MSCN feature coefficients of the test image.
      • $(\cdot)^T$: Transpose of a vector/matrix.
      • $(\cdot)^{-1}$: Inverse of a matrix.
  • ILNIQE (Integrated Local Natural Image Quality Evaluator)

    • Conceptual Definition: ILNIQE is an improvement over NIQE, designed to be more robust and accurate, particularly for images with spatially varying distortions. It processes images in smaller patches and integrates local quality assessments to provide a global score. Like NIQE, a lower score indicates better quality. It is also a no-reference IQA metric.
    • Mathematical Formula: The paper does not provide an explicit formula for ILNIQE. It extends NIQE by computing local quality scores within an image and then aggregating them, often through averaging or a more sophisticated pooling strategy. The local scores are still based on the Mahalanobis distance from a natural image model.
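
The two full-reference pixel-wise metrics above can be computed directly; the sketch below uses scikit-image, whose implementations follow the PSNR and SSIM definitions given here (the `channel_axis` argument assumes scikit-image ≥ 0.19).

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(ref: np.ndarray, out: np.ndarray) -> dict:
    """PSNR (dB) and SSIM between a ground-truth image and a generated image, both uint8 HxWx3."""
    psnr = peak_signal_noise_ratio(ref, out, data_range=255)
    ssim = structural_similarity(ref, out, channel_axis=-1, data_range=255)
    return {"PSNR": psnr, "SSIM": ssim}
```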

5.3. Baselines

The DiffRAW model was compared against three state-of-the-art deep learning-based ISP methods:

  • PyNet (Ignatov, Van Gool, and Timofte 2020): An early and influential end-to-end ISP network that directly learns the mapping from smartphone RAW to DSLR sRGB.

  • MW-ISPNet (Ignatov et al. 2020): Another advanced ISP network developed as part of the AIM 2020 Challenge on Learned Image Signal Processing Pipeline.

  • LiteISPNet (Zhang et al. 2021): A more recent ISP network known for addressing color-shift and pixel alignment issues with specialized modules, achieving improved performance.

    These baselines are representative because they are leading methods in the specific task of smartphone RAW-to-DSLR sRGB transformation using deep learning, making them direct competitors for evaluating DiffRAW's performance.

5.4. Training Details

  • Training Steps: The DiffRAW model was trained for 1 Million (1M) training steps.
  • Batch Size: A batch size of 32 was used.
  • Optimizer: The Adam optimizer was employed, which is a popular optimization algorithm for training deep learning models, known for its efficiency and good performance.
  • Learning Rate Schedule: A linear warmup schedule was applied for the first 10,000 (10k) training steps. This means the learning rate gradually increased from a small value to its full value during this phase, helping stabilize early training. After the warmup, a fixed learning rate of 1e-4 (0.0001) was maintained.
  • Hyperparameters:
    • T (Total Noise Steps): Set to 2000. This is the total number of steps in the forward diffusion process.
    • s (Inference Steps for DTDM): Set to 100. This is the number of steps used for the Domain Transform Diffusion Method's inference, a significantly reduced number compared to $T$.
  • Hyperparameter Tuning: The authors note that they did not conduct extensive engineering attempts on the $T$ and $s$ hyperparameters, primarily setting them to verify the effects of inference acceleration and improved image quality by DTDM. They suggest that further tuning could potentially yield even better experimental metric results.
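
With $T = 2000$ and $s = 100$ as reported, the DTDM constant $\gamma_s$ is fully determined once a β schedule is chosen. The small sketch below uses a linear β schedule as an assumption (the paper's exact schedule is not given in this summary) to show how these hyperparameters fix $\overline{\alpha}_s$ and $\gamma_s$.

```python
import numpy as np

T, s = 2000, 100                          # total noise steps and DTDM inference steps from the paper
betas = np.linspace(1e-4, 0.02, T)        # assumed linear β schedule (illustrative only)
alpha_bar = np.cumprod(1.0 - betas)       # ᾱ_t

abar_s = alpha_bar[s - 1]                 # ᾱ_s: signal fraction remaining after s forward steps
gamma_s = np.sqrt(abar_s) / np.sqrt(1.0 - abar_s)   # γ_s = sqrt(ᾱ_s) / sqrt(1 - ᾱ_s)
print(f"alpha_bar_s = {abar_s:.4f}, gamma_s = {gamma_s:.4f}")
```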

5.5. Testing Details

  • Inference Steps: The number of denoising steps and iteration steps during the inference process was set to 93.
  • Metric Balancing: This specific choice of 93 steps was made to balance the performance across two types of metrics:
    • No-reference metrics: These metrics (like NIQE, ILNIQE, MUSIQ, CLIPIQA+) often show better performance with fewer inference steps because overly aggressive denoising or detail generation can sometimes introduce artifacts that these metrics penalize.
    • Full-reference metrics: These metrics (like PSNR, SSIM, LPIPS, FID) might benefit from more inference steps as they directly compare against a ground truth.
  • Observation on Step Count: The paper states that if the number of denoising steps and iteration steps were set to $s = 100$ (the value used for training DTDM), the performance on no-reference metrics would be even better, aligning with human visual perception of image details and overall quality. This implies that reducing steps might slightly compromise fidelity to the ground truth in some cases, but can enhance perceived naturalness.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that DiffRAW significantly outperforms existing state-of-the-art methods, particularly in perceptual quality metrics, while maintaining competitive pixel-wise fidelity. This confirms the effectiveness of leveraging diffusion models with specialized conditioning and an efficient inference process for the challenging task of smartphone RAW-to-DSLR sRGB conversion.

The results are presented in two tables: one for no-reference metrics (Table 1) and another for full-reference metrics (Table 2).

The following are the results from Table 1 of the original paper:

| Method | MUSIQ-K↑ | MUSIQ-S↑ | CLIPIQA+↑ | CLIPIQA+RN50↑ | NIQE↓ | ILNIQE↓ |
| --- | --- | --- | --- | --- | --- | --- |
| PyNet | 43.56 | 46.4990 | 0.5353 | 0.3196 | 7.6856 | 50.55 |
| MW-ISPNet | 43.34 | 45.5973 | 0.5230 | 0.3097 | 7.9001 | 55.19 |
| LiteISPNet | 48.52 | 50.4763 | 0.5377 | 0.3063 | 7.4839 | 53.50 |
| DiffRAW (ours) | 56.67 | 57.3660 | 0.5596 | 0.3739 | 7.0072 | 42.65 |
| DSLR (Reference) | 56.62 | 57.4589 | 0.5622 | 0.3895 | 7.0181 | 44.13 |

Analysis of Table 1 (No-Reference Metrics):

  • Superior Perceptual Quality: DiffRAW achieves the highest scores across all perceptual quality metrics (MUSIQ-K, MUSIQ-S, CLIPIQA+, CLIPIQA+RN50) and the lowest scores (indicating better quality) for the no-reference distortion metrics (NIQE, ILNIQE).

  • DSLR-Comparable Performance: Notably, DiffRAW's MUSIQ-K (56.67) and MUSIQ-S (57.3660) scores are virtually identical to, or even slightly surpass, the DSLR (Reference) scores (56.62 and 57.4589, respectively). This is a strong indicator that DiffRAW's generated images are perceptually indistinguishable from actual DSLR images in many aspects.

  • Significant Improvement over Baselines: Compared to the best baseline, LiteISPNet, DiffRAW shows substantial gains:

    • MUSIQ-K: 56.67 vs. 48.52 (a large jump, indicating much better perceived quality).
    • NIQE: 7.0072 vs. 7.4839 (lower is better, indicating more naturalness).
    • ILNIQE: 42.65 vs. 53.50 (lower is better, indicating more naturalness).
  • First-time Achievement: The paper explicitly states that DiffRAW marks "the first achievement in reaching a level comparable to DSLR images on no-reference IQA metrics," which is a significant milestone for this field.

    The following are the results from Table 2 of the original paper:

| Method | LPIPS↓ (Original GT) | FID↓ (Original GT) | PSNR↑ (Original GT) | SSIM↑ (Original GT) | LPIPS↓ (Align GT with result) | FID↓ (Align GT with result) | PSNR↑ (Align GT with result) | SSIM↑ (Align GT with result) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PyNet | 0.193 | 18.69 | 21.19 | 0.7471 | 0.152 | 17.11 | 22.96 | 0.8510 |
| MW-ISPNet | 0.213 | 20.41 | 21.42 | 0.7544 | 0.164 | 18.48 | 23.31 | 0.8578 |
| LiteISPNet | 0.187 | 17.04 | 21.55 | 0.7487 | 0.133 | 15.30 | 23.87 | 0.8737 |
| DiffRAW (ours) | 0.145 | 15.10 | 21.31 | 0.7433 | 0.118 | 14.61 | 23.54 | 0.8682 |

Analysis of Table 2 (Full-Reference Metrics):

  • Superior Perceptual Fidelity: DiffRAW demonstrates the best performance in LPIPS (0.145 for Original GT, 0.118 for Align GT) and FID (15.10 for Original GT, 14.61 for Align GT). Lower scores are better for both, indicating that DiffRAW's outputs are perceptually closest to the ground truth and have a distribution most similar to real DSLR images.

  • Comparable Pixel-wise Fidelity: While not always the absolute best, DiffRAW achieves comparable results in PSNR and SSIM. For "Original GT" comparison, LiteISPNet has slightly higher PSNR and SSIM. However, for "Align GT with result" (where the ground truth is aligned to the generated output, potentially mitigating some initial misalignment issues), DiffRAW's PSNR (23.54) and SSIM (0.8682) are very competitive, only slightly behind LiteISPNet. This suggests that DiffRAW prioritizes perceptual realism (as seen in Table 1) while maintaining a reasonable level of pixel-wise accuracy.

  • Impact of Alignment Strategy: The table shows results with two different ground truth (GT) alignment strategies: "Original GT" (using the dataset's provided GT) and "Align GT with result" (where the GT is aligned to the generated image). The latter typically yields better scores across all methods as it reduces the impact of initial misalignment. DiffRAW's superior LPIPS and FID scores across both alignment strategies reinforce its perceptual quality.

    Overall, the experimental results robustly support the claim that DiffRAW generates images with richer detail and higher clarity, closely rivaling the visual quality of DSLR images.
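
For readers who want to compute full-reference numbers like those in Table 2 on their own data, the hedged sketch below evaluates PSNR and SSIM with scikit-image and LPIPS with the lpips package on dummy arrays; FID is omitted because it requires feature statistics over a whole set of images, and none of this is the paper's evaluation code.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Dummy 256x256 RGB images standing in for a generated result and its ground truth.
rng = np.random.default_rng(0)
pred = rng.random((256, 256, 3)).astype(np.float32)
gt = np.clip(pred + 0.05 * rng.standard_normal(pred.shape).astype(np.float32), 0, 1)

psnr = peak_signal_noise_ratio(gt, pred, data_range=1.0)
ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=1.0)

# LPIPS expects NCHW tensors scaled to [-1, 1].
loss_fn = lpips.LPIPS(net='alex')
to_tensor = lambda a: torch.from_numpy(a).permute(2, 0, 1).unsqueeze(0) * 2 - 1
lpips_val = loss_fn(to_tensor(pred), to_tensor(gt)).item()

print(f"PSNR={psnr:.2f} dB, SSIM={ssim:.4f}, LPIPS={lpips_val:.4f}")
```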

The image below provides a visual comparison, highlighting the richer detail and higher clarity generated by DiffRAW.

Figure 1: Comparison of results on the ZRR dataset. Panels: (a) visualized RAW image, (b) LiteISPNet result, (c) DSLR sRGB image, (d) DiffRAW (ours). The images generated by our method exhibit richer detail and higher clarity, rivaling the visual quality of DSLR images.

6.2. Ablation Study

6.2.1. Diffusion Condition

The ablation study on diffusion conditions (w and c) demonstrates their individual contributions to DiffRAW's robust performance.

  • Role of w (RAW Condition): The smartphone RAW image (w) serves as a condition to ensure the generated results preserve image structural information such as contours and textures.

  • Role of c (Color-Position Preserving Condition): The color-position preserving condition (c) is crucial for controlling the color of the generated images and, more importantly, for preventing the pixel shifts and blurring that can arise from misalignment and color inconsistencies in the training data.

    The image below visually illustrates the impact of these conditions:

Figure 3: Effect of the diffusion conditions. Fig3(a) shows the visualized RAW image; Fig3(b) represents the result generated without any condition; Fig3(c) represents the generated result using condition w; Fig3(d) represents the result using both w and c as conditions; Fig3(e) illustrates the image x utilized in these experiments; Fig3(f) represents the DSLR sRGB image.

Analysis of Figure 3:

  • Fig3(b) (without condition): This image represents a baseline generation without any conditioning on the RAW image or the color-position preserving condition; it would typically show less coherent structure or incorrect colors.

  • Fig3(c) (with w condition): Incorporating w (the RAW image) visibly preserves the contours and textures of the image, confirming that the RAW condition effectively guides the structural coherence of the generated output.

  • Fig3(d) (with w and c conditions): After introducing c (the color-position preserving condition), the image no longer exhibits color biases or blurry shifts; it achieves better color accuracy and spatial alignment, bringing it closer to the DSLR sRGB target.

  • Fig3(e) (image x): This is the low-quality input (c^test derived from w) used as a starting point in the experiments; it serves as a visual reference for the domain transformation.

  • Fig3(f) (DSLR sRGB image): This is the ground-truth DSLR sRGB image, providing the target quality and appearance for comparison.

    This ablation visually confirms that w is effective for structural preservation and c is vital for color and spatial accuracy, working together to produce high-quality, aligned outputs. The paper also notes that c enables flexible color style transfer, explored further in the supplementary material.
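
    As an aside, a common way to inject image-shaped conditions such as w and c into a diffusion denoiser is channel-wise concatenation with the noisy input. The toy sketch below illustrates only that generic pattern under assumed channel counts; it is not DiffRAW's actual architecture or conditioning mechanism.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy denoiser that conditions on w (RAW-derived) and c (color-position)
    by concatenating them with the noisy image along the channel axis.
    This is an illustrative assumption, not DiffRAW's actual network."""
    def __init__(self, img_ch=3, w_ch=4, c_ch=3, hidden=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(img_ch + w_ch + c_ch, hidden, 3, padding=1),
            nn.ReLU(),
            nn.Conv2d(hidden, img_ch, 3, padding=1),   # predicts the noise residual
        )

    def forward(self, noisy, w, c):
        # In practice a packed 4-channel RAW is half resolution and would be
        # unpacked or upsampled to match the sRGB target before concatenation.
        return self.net(torch.cat([noisy, w, c], dim=1))

model = ConditionedDenoiser()
noisy = torch.randn(1, 3, 64, 64)   # noisy sRGB image at some diffusion step
w = torch.randn(1, 4, 64, 64)       # RAW condition (already resized here)
c = torch.randn(1, 3, 64, 64)       # color-position preserving condition
print(model(noisy, w, c).shape)     # torch.Size([1, 3, 64, 64])
```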

6.2.2. Diffusion Process and Inference Process

The ablation study on diffusion process and inference process compares the efficiency and quality benefits of the proposed Domain Transform Diffusion Method (DTDM) against a standard Denoising Diffusion Probabilistic Model (DDPM).

  • Objective: To show that DTDM can achieve superior detail enhancement with significantly fewer inference steps compared to DDPM.
  • Experiment Setup:
    • The generation process starts by subjecting x (the low-quality input, i.e., the color-position preserving condition) to an eight-fold downsampling degradation.
    • Then, noise is added over 1500 steps to create a significantly noisy starting point from which to evaluate denoising and enhancement capabilities (a minimal sketch of this standard forward-noising step appears at the end of this subsection).
  • Comparison:
    • DDPM: The existing method, whose diffusion and reverse processes are described by Equations 1 and 4 (the standard diffusion model).

    • DTDM (ours): The improved method, whose diffusion and reverse processes follow Equations 15 and 23.

      The image below displays the comparison:

Figure 4: Comparison of generated results using DDPM with 1500, 500, and 100 steps, and DTDM with 100 steps, highlighting the quality differences in the sRGB images produced from the same smartphone RAW input.

Analysis of Figure 4:

  • Trend of Detail Enhancement: The figure demonstrates that, for DDPM, an increase in the number of denoising steps (e.g., from 100 to 500 to 1500) generally leads to a corresponding enhancement in the detail of the generated results. This is expected as more steps allow for finer-grained denoising and reconstruction.

  • DTDM's Efficiency and Quality: The most striking observation is that DTDM (100 steps) achieves detail enhancement that surpasses that of DDPM (1500 iterative steps). This is a crucial validation of DTDM's core claim: it can deliver higher quality results with significantly fewer inference steps.

  • Reason for DTDM's Superiority: As explained in the methodology, DTDM's reverse process does more than just denoise. At each step, it also performs a domain transfer from x (the low-quality input) towards y (the high-quality target). This combined action allows for a more efficient and effective transformation, leading to faster and better detail recovery.

    This ablation study confirms that DTDM is not just an acceleration technique but also a quality enhancement mechanism, making DiffRAW both performant and practical.
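
    To make the experiment setup above concrete, the noisy starting point can be produced with the standard DDPM forward process, x_t = sqrt(ᾱ_t) · x_0 + sqrt(1 − ᾱ_t) · ε, applied after an 8× downsample-and-upsample degradation. The sketch below uses a generic linear beta schedule as an assumption and does not reproduce DTDM's own forward and reverse equations (Equations 15 and 23).

```python
import torch
import torch.nn.functional as F

T = 2000
betas = torch.linspace(1e-4, 0.02, T)            # generic linear beta schedule (assumption)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # cumulative product ᾱ_t

def degrade_8x(x):
    """Eight-fold downsampling degradation, upsampled back to the original size."""
    h, w = x.shape[-2:]
    small = F.interpolate(x, scale_factor=1 / 8, mode='bilinear', align_corners=False)
    return F.interpolate(small, size=(h, w), mode='bilinear', align_corners=False)

def ddpm_forward(x0, t):
    """Closed-form DDPM noising: x_t = sqrt(abar_t) * x_0 + sqrt(1 - abar_t) * eps."""
    eps = torch.randn_like(x0)
    abar = alphas_bar[t]
    return abar.sqrt() * x0 + (1 - abar).sqrt() * eps

x = torch.rand(1, 3, 256, 256)                   # stand-in for the low-quality image x
start = ddpm_forward(degrade_8x(x), t=1499)      # noised over 1500 steps (index 1499)
print(start.shape)
```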

6.3. Visual Comparison

The images in Figure 1, the abstract, and the ablation studies visually confirm the quantitative results. The images generated by DiffRAW exhibit finer textures, clearer edges, and more vibrant, accurate colors compared to the baselines. They closely resemble the DSLR (Reference) images, validating the "DSLR-comparable perceptual quality" claim. This visual evidence, combined with the state-of-the-art scores on perceptual metrics, underscores the practical impact of DiffRAW in elevating smartphone photography quality.

7. Conclusion & Reflections

7.1. Conclusion Summary

In this work, the authors introduced DiffRAW, a pioneering method that addresses the long-standing challenge of generating DSLR-comparable sRGB images from smartphone RAW inputs. The core innovation lies in the first-time application of diffusion models to the RAW-to-sRGB mapping task, specifically designed to overcome inherent limitations like detail disparity, color instability, and spatial misalignment.

DiffRAW strategically employs RAW images as a diffusion condition to preserve structural details (contours, textures) without relying on their limited intrinsic detail. It further introduces a novel color-position preserving condition to effectively manage color biases and pixel shifts stemming from imperfect training data alignment. A key technical contribution is the Domain Transform Diffusion Method (DTDM), an efficient diffusion process that significantly reduces inference steps while simultaneously enhancing the perceptual quality of the generated images.

Evaluated on the ZRR dataset, DiffRAW achieved state-of-the-art performance across all perceptual quality metrics (LPIPS, FID, MUSIQ, CLIPIQA+), closely matching or even surpassing DSLR reference images on no-reference IQA metrics. This marks a significant achievement in delivering DSLR-comparable perceptual quality for smartphone photography. While excelling in perceptual quality, it also maintained comparable results for traditional PSNR and SSIM metrics.

7.2. Limitations & Future Work

The paper explicitly states one area for potential improvement:

  • Hyperparameter Tuning: The authors mention that they "did not conduct more engineering attempts on the training hyperparameters T and s" (total noise steps and DTDM inference steps) and that "If more training hyperparameter trials are conducted on s and T, better experimental metric results might be achieved." This suggests that the reported performance, while already state-of-the-art, could potentially be pushed further through more extensive optimization of these key diffusion model parameters.

    Beyond this self-acknowledged point, some potential limitations and avenues for future work could be inferred:

  • Computational Cost: While DTDM significantly accelerates inference compared to standard diffusion models, diffusion models are still generally more computationally intensive than deterministic ISP networks. Further optimization for real-time or near real-time processing on mobile devices could be a future direction.

  • Generalization to Diverse Data: The ZRR dataset is specific to Huawei P20 and Canon 5D Mark IV. Testing the model's generalization capabilities across a wider range of smartphone RAW formats, DSLR models, and diverse shooting conditions (e.g., extreme low light, highly complex scenes) would be valuable.

  • Robustness to Extreme Misalignment: While the color-position preserving condition addresses misalignment effectively, real-world data might present more extreme or complex forms of misalignment not fully captured by SIFT/RANSAC processed datasets. Investigating robustness under such conditions or integrating more advanced dynamic alignment mechanisms could be beneficial.

  • User Control and Style Transfer: While the c condition allows for flexible color style adjustment, expanding this control to other aesthetic aspects (e.g., contrast, artistic effects) could enhance user experience.

  • Memory Footprint: Diffusion models and their U-Net backbones can have a substantial memory footprint. Optimizing the model size for deployment on resource-constrained edge devices (like smartphones) would be a practical next step.

7.3. Personal Insights & Critique

DiffRAW presents a compelling leap forward in computational photography, demonstrating the immense potential of diffusion models beyond their traditional applications in generative art. The paper's core strength lies in its holistic approach to the RAW-to-sRGB problem, tackling not just detail recovery but also the practical challenges of misalignment and color inconsistency through carefully designed conditioning.

The Domain Transform Diffusion Method (DTDM) is a particularly clever innovation. Diffusion models have been criticized for their slow inference speeds, making them less suitable for practical applications like image enhancement where rapid processing is often desired. DTDM's ability to combine denoising with domain transformation in fewer steps offers a valuable contribution that could be broadly applicable to other diffusion-based image restoration tasks, potentially making these powerful models more viable for real-world scenarios. The clear visual evidence in the ablation studies, showing DTDM with 100 steps outperforming DDPM with 1500 steps, is very impactful.

The achievement of DSLR-comparable perceptual quality on no-reference IQA metrics is a significant milestone. This indicates that DiffRAW is not just numerically superior but genuinely produces images that look as good as those from professional cameras to the human eye, which is the ultimate goal in image enhancement.

A minor critique could be the relatively high PSNR and SSIM of LiteISPNet in some "Original GT" scenarios compared to DiffRAW. This highlights the inherent trade-off between perceptual quality (where DiffRAW excels) and pixel-wise fidelity. For a generative model, slight deviations from ground truth pixels can yield perceptually superior results if those deviations add plausible details or correct perceived artifacts. DiffRAW seems to strike a good balance, prioritizing what humans see as "better."

The methods and conclusions of DiffRAW could potentially be transferred or applied to various other domains requiring image-to-image translation or enhancement from low-quality inputs to high-quality outputs, especially where hallucinating realistic details is crucial. Examples include:

  • Medical Imaging: Enhancing low-resolution or noisy medical scans.

  • Satellite Imagery: Improving details in satellite or aerial photographs.

  • Historical Photo Restoration: Generating high-quality versions from degraded historical images.

  • Video Enhancement: Applying similar principles to individual frames for video super-resolution or denoising.

    Overall, DiffRAW provides a robust, effective, and efficient solution to a critical problem in computational photography, pushing the boundaries of what smartphone cameras can achieve with advanced AI.
