
DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior

Published: 08/29/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

DiffBIR is a unified image restoration pipeline for blind image tasks, involving degradation removal and information regeneration. It utilizes IRControlNet for realistic detail generation and introduces region-adaptive guidance for a user-tunable balance between realness and fidelity.

Abstract

We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework. DiffBIR decouples blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. Each stage is developed independently but they work seamlessly in a cascaded manner. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details. Specifically, IRControlNet is trained based on specially produced condition images without distracting noisy content for stable generation performance. Moreover, we design a region-adaptive restoration guidance that can modify the denoising process during inference without model re-training, allowing users to balance realness and fidelity through a tunable guidance scale. Extensive experiments have demonstrated DiffBIR's superiority over state-of-the-art approaches for blind image super-resolution, blind face restoration and blind image denoising tasks on both synthetic and real-world datasets. The code is available at https://github.com/XPixelGroup/DiffBIR.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is "DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior". It focuses on developing a unified framework for blind image restoration tasks, leveraging the generative capabilities of diffusion models.

1.2. Authors

The authors are: Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Their affiliations include Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shanghai AI Laboratory, and The Chinese University of Hong Kong. This suggests a collaborative effort from research institutions with strong backgrounds in computer vision, artificial intelligence, and image processing.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server; as of its submission on August 29, 2023, it remains a preprint. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in fields like computer science, often preceding formal peer-reviewed publication.

1.4. Publication Year

The paper was published in 2023.

1.5. Abstract

DiffBIR proposes a novel, unified pipeline for various blind image restoration (BIR) tasks. It decouples the BIR problem into two distinct stages: 1) degradation removal, which targets image-independent content, and 2) information regeneration, which focuses on generating lost image content. Each stage is developed independently but functions seamlessly in a cascaded manner. For the first stage, existing restoration modules are employed to remove degradations and produce high-fidelity intermediate results. For the second stage, the paper introduces IRControlNet, which harnesses the generative power of latent diffusion models to create realistic details. IRControlNet is specifically trained using carefully prepared condition images, devoid of distracting noisy content, to ensure stable generation. Furthermore, a training-free region-adaptive restoration guidance mechanism is designed to adjust the denoising process during inference. This guidance allows users to fine-tune the balance between realness (quality) and fidelity through a tunable guidance scale. Extensive experiments demonstrate DiffBIR's superior performance over state-of-the-art methods in blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks across both synthetic and real-world datasets.

The original source link is: https://arxiv.org/abs/2308.15070 PDF Link: https://arxiv.org/pdf/2308.15070v3.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by DiffBIR is blind image restoration (BIR). Image restoration aims to reconstruct a high-quality image from its low-quality observation. Traditionally, this involves image denoising, deblurring, and super-resolution under controlled settings where the degradation process is simple and known (e.g., bicubic downsampling). However, these traditional methods exhibit limited generalization ability when faced with real-world degraded images, which often contain unknown and complex degradations. BIR aims to achieve realistic image reconstruction on general images with general, unknown degradations. This is a significant challenge due to the entanglement of degradation and content information in low-quality images.

Prior research in BIR tasks like blind image super-resolution (BSR) and blind image denoising (BID) often formulates the problem as supervised large-scale degradation overfitting, primarily relying on Generative Adversarial Networks (GANs). While robust in degradation removal, these methods frequently struggle to generate truly realistic details due to the inherent limitations in their generative capabilities. Blind face restoration (BFR) methods have shown remarkable success by incorporating powerful generative facial priors (e.g., StyleGAN, VQGAN), but they are restricted to face images and fixed input sizes, lacking applicability to general images. More recently, denoising diffusion probabilistic models (DDPMs) have demonstrated outstanding performance in image generation. However, existing zero-shot image restoration (ZIR) methods using diffusion models, while generating realistic results for specific degradations, cannot generalize well to unknown or complex real-world degradations, meaning they can handle general images but not general degradations.

The paper's entry point and innovative idea stem from the observation that directly using low-quality (LQ) images as conditions for conditional image generation (e.g., with diffusion models) leads to instability and artifacts. This is because degradation and content information are entangled, and the generation process is disturbed by unreliable condition information. The paper motivates a decoupled approach to address this, aiming to separate degradation removal from content generation to achieve a stable and unified solution for diverse BIR tasks.

2.2. Main Contributions / Findings

The primary contributions and key findings of the paper are:

  • Decoupled Two-Stage Pipeline for Unified BIR: DiffBIR introduces a novel two-stage pipeline that effectively decouples the BIR problem into degradation removal and information regeneration. This design allows for the first-time achievement of state-of-the-art performance across blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks within a single, unified framework. This decoupling strategy enhances stability and flexibility by ensuring the generation module only operates on image content, undisturbed by degradations.

  • Novel IRControlNet for Realistic Regeneration: The paper proposes IRControlNet, a generative module that leverages the power of pre-trained text-to-image latent diffusion models (specifically, Stable Diffusion) for realistic image reconstruction. Through comprehensive exploration of critical components (condition encoding, condition network, feature modulation), IRControlNet is identified as a robust backbone for the generation stage. Notably, it utilizes the pre-trained VAE encoder for condition encoding and a copied, auxiliary encoder with zero-initialization for efficient and stable control, addressing issues like color shifts seen in baseline ControlNet adaptations.

  • Training-Free Region-Adaptive Restoration Guidance: DiffBIR introduces a training-free region-adaptive restoration guidance module that operates during the sampling process. This module allows users to flexibly balance quality (realness) and fidelity based on their preferences. By minimizing a specially designed region-adaptive Mean Squared Error (MSE) loss between the generated result and a high-fidelity guidance image, it influences low-frequency regions more heavily towards fidelity while preserving generative ability in high-frequency regions, offering fine-grained control without requiring model re-training.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand DiffBIR, a foundational grasp of several computer vision and deep learning concepts is essential:

  • Image Restoration (IR): The general task of recovering a high-quality (HQ) image from a degraded low-quality (LQ) observation. Degradations can include noise, blur, low resolution, etc.

    • Image Denoising: Removing unwanted noise from an image.
    • Image Deblurring: Reversing the effect of blur in an image.
    • Image Super-Resolution (SR): Enhancing the resolution of an image, typically generating a high-resolution image from a low-resolution input.
    • Blind Image Restoration (BIR): A more challenging variant where the degradation process is unknown or complex, mimicking real-world scenarios. This is the primary focus of DiffBIR.
  • Deep Learning Models:

    • Convolutional Neural Networks (CNNs): A class of deep neural networks commonly applied to visual imagery. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
    • Transformers: Initially developed for natural language processing, Transformers use self-attention mechanisms to weigh the importance of different parts of the input. They have been adapted for vision tasks, such as in Swin Transformer.
    • U-Net: A convolutional network architecture known for its U-shaped structure, featuring an encoder (downsampling path) to capture context and a decoder (upsampling path) to enable precise localization. It's widely used in image-to-image tasks.
  • Generative Models: Models capable of generating new data samples that resemble the training data.

    • Generative Adversarial Networks (GANs): Comprise two networks: a Generator that creates synthetic data, and a Discriminator that tries to distinguish real data from generated data. They are trained in an adversarial manner. GANs have been popular for image generation and restoration tasks due to their ability to produce realistic-looking images.
    • Denoising Diffusion Probabilistic Models (DDPMs): A class of generative models that learn to reverse a gradual diffusion (noise-adding) process.
      • Forward Diffusion Process: Gradually adds Gaussian noise to an image over several time steps, eventually transforming it into pure noise.
      • Reverse Denoising Process: A neural network (often a U-Net) is trained to predict and remove the noise at each step, starting from pure noise and gradually reconstructing a clean image.
      • Latent Diffusion Models (LDMs) / Stable Diffusion: An advancement of DDPMs that performs the diffusion process in a compressed latent space rather than the pixel space. This makes them significantly more efficient in terms of computation and memory. They use an autoencoder to map images to and from the latent space. Stable Diffusion is a popular large-scale text-to-image LDM.
  • Autoencoders and VAEs:

    • Autoencoder: A neural network that learns to encode data into a lower-dimensional latent representation (encoder) and then decode it back to the original data (decoder). The goal is to learn an efficient data representation.
    • Variational Autoencoder (VAE): A type of autoencoder that learns a probabilistic mapping from the input data to a continuous latent space, allowing for sampling new data points. In LDMs, the VAE encoder $\mathcal{E}$ converts an image to its latent representation, and the VAE decoder $\mathcal{D}$ reconstructs the image from the latent.
  • ControlNet: An architecture that allows for adding conditional control to pre-trained text-to-image diffusion models like Stable Diffusion. It works by taking a copy of the diffusion model's U-Net encoder, making it trainable as a "condition network," while keeping the original U-Net frozen. It uses zero convolutions to connect the trainable and frozen parts, ensuring minimal interference with the pre-trained weights during initial training.

  • Loss Functions:

    • Mean Squared Error (MSE) Loss / L2 Loss: Measures the average of the squared errors between predicted and actual values. Often used for fidelity, minimizing pixel-wise differences: $\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$.
    • L1 Loss: Measures the average of the absolute differences between predicted and actual values. Often produces sharper images than MSE and is less sensitive to outliers: $\mathcal{L}_{L1} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|$. (A short code sketch of these pixel-wise losses appears after this list.)
    • Perceptual Loss (LPIPS): Measures the perceptual similarity between two images by comparing their high-level feature representations extracted from a pre-trained deep neural network (e.g., VGG). This aligns better with human perception than pixel-wise losses.
    • Adversarial Loss: Used in GANs to train the generator to produce realistic images that can fool the discriminator.
  • Image Quality Assessment (IQA) Metrics:

    • Reference-based Metrics: Compare a restored image to a ground-truth (reference) image.
      • PSNR (Peak Signal-to-Noise Ratio): Quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise. Higher PSNR indicates better quality.
      • SSIM (Structural Similarity Index Measure): Measures the similarity between two images based on luminance, contrast, and structure. Closer to 1 indicates better quality.
    • No-Reference Metrics: Assess image quality without a ground-truth reference, often relying on learned models of human perception. Examples include MANIQA, MUSIQ, CLIP-IQA.
    • FID (Fréchet Inception Distance): A metric used to assess the quality of images generated by generative models. It measures the distance between the feature distributions of real and generated images using activations from an Inception-v3 network. Lower FID indicates higher quality and diversity.
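As referenced in the loss-function entry above, the following is a minimal PyTorch sketch (not from the paper) showing how the pixel-wise MSE and L1 losses are computed between a restored image and its ground truth; perceptual and adversarial losses require additional pre-trained networks and are omitted here.

```python
import torch
import torch.nn.functional as F

# Hypothetical restored and ground-truth images: (batch, channels, height, width) in [0, 1].
restored = torch.rand(1, 3, 256, 256)
ground_truth = torch.rand(1, 3, 256, 256)

mse_loss = F.mse_loss(restored, ground_truth)  # L2: mean of squared pixel differences
l1_loss = F.l1_loss(restored, ground_truth)    # L1: mean of absolute pixel differences
print(f"MSE: {mse_loss.item():.6f}  L1: {l1_loss.item():.6f}")
```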

3.2. Previous Works

The paper contextualizes DiffBIR by reviewing existing approaches across various Blind Image Restoration (BIR) tasks:

  • Blind Image Super-Resolution (BSR):

    • Initial Approaches: Focused on formulating BSR as a supervised large-scale degradation overfitting problem.
    • BSRGAN [73]: Proposed a random shuffling strategy for synthesizing more practical degradations.
    • Real-ESRGAN [56]: Exploited "high-order" degradation modeling to simulate complex real-world degradations, often using GANs with adversarial and reconstruction losses. These methods are robust in degradation removal but often lack in generating realistic details.
    • SwinIR-GAN [29]: Utilized Swin Transformer as a backbone for improved performance.
    • FeMaSR [5]: Formulated SR as a feature-matching problem based on pre-trained VQ-GAN.
    • Diffusion-based BSR: Recent works like StableSR [52] and PASD [66] leverage Stable Diffusion. StableSR designs a time-aware encoder, while PASD proposes a PACA module to inject pixel-level conditions effectively.
    • Limitation: Most of these methods require re-training for different image restoration tasks, lacking a unified framework.
  • Blind Face Restoration (BFR):

    • GAN-prior-based methods [4, 18, 55, 65]: Incorporated powerful generative priors (e.g., StyleGAN) to reconstruct faces with high quality and fidelity. Examples include GFPGAN [55].
    • VQ-dictionary learning methods [16, 59, 77]: Introduced high-quality codebooks (e.g., VQGAN) to generate surprisingly realistic facial details. CodeFormer [77] is a prominent example.
    • Diffusion-prior-based methods [61, 63, 67]: Latest advances like DifFace [67], DR2 [61], PGDiff [63] leverage diffusion models for robust face restoration.
    • Limitation: These methods are typically restricted to face images with fixed input sizes, making them unsuitable for general image restoration.
  • Blind Image Denoising (BID):

    • Deep CNNs: DnCNN [71] is an end-to-end deep CNN for Gaussian denoising.
    • GAN-based methods: GCBD [7] uses GANs for noise modeling, and CBDNet [17] uses a more realistic noise model and real-world noisy-clean pairs.
    • Variational methods: VDNet [68] proposes simultaneous noise estimation and denoising.
    • SwinConv-UNet: SCUNet [74] is a state-of-the-art method that designs a practical noise degradation model and uses L1 or adversarial loss.
    • Limitation: While effective at noise removal, these methods often produce overly smooth results, lacking realistic texture details.
  • Zero-shot Image Restoration (ZIR):

    • GAN-based ZIR [2, 9, 36, 39]: Focused on searching latent codes within a pre-trained GAN's latent space.
    • Diffusion-based ZIR [23, 57, 14]: Employ DDPMs as priors. DDRM [23] uses an SVD-based approach for linear tasks. DDNM [57] analyzes range-null space decomposition. GDP [14] introduces classifier guidance.
    • Limitation: ZIR methods, while leveraging powerful generative priors, are generally limited to clearly defined degradations and cannot generalize well to the unknown, complex degradations found in real-world low-quality images.

3.3. Technological Evolution

The field of image restoration has evolved significantly:

  1. Classical Methods: Early methods relied on signal processing techniques, statistical models, and hand-crafted priors (e.g., non-local means for denoising). These were often limited by their reliance on explicit mathematical models of degradation.
  2. Deep Learning (Constrained IR): The advent of CNNs revolutionized IR. Methods like SRCNN [12] for super-resolution or DnCNN [71] for denoising, trained on large datasets with known degradations, achieved impressive results, significantly outperforming classical methods.
  3. Deep Learning (Blind IR - GAN-based): Recognizing the limitations of known degradations, research shifted to BIR. GANs became prominent, especially in BSR (e.g., ESRGAN [54], Real-ESRGAN [56]) and BFR (e.g., GFPGAN [55]). These leveraged adversarial training to generate more perceptually realistic textures, often by training on synthesized data designed to mimic real-world degradations. However, GANs can suffer from training instability, mode collapse, and sometimes generate artifacts.
  4. Deep Learning (Blind IR - Diffusion-based): Most recently, denoising diffusion models have emerged as powerful generative models, surpassing GANs in image quality and diversity. Their application to IR initially focused on zero-shot restoration for known degradations. This paper positions itself at the forefront of diffusion-based BIR, aiming to combine the robustness of GAN-based BIR (in handling unknown degradations) with the superior generative capabilities of diffusion models.

3.4. Differentiation Analysis

DiffBIR differentiates itself from previous works in several key ways:

  • Unified Framework for Diverse BIR Tasks: Unlike most prior works that specialize in one task (e.g., BSR, BFR, or BID) and require re-training for others, DiffBIR provides a single, unified pipeline applicable to all three. This significantly improves versatility and reduces development overhead.
  • Decoupled Two-Stage Approach:
    • Problem: Previous GAN-based BIR and diffusion-based ZIR methods often struggle with entangled degradation and content information, leading to artifacts or limited generalization to unknown degradations. Directly feeding LQ images as conditions to generative models (ControlNet) can lead to instability and artifacts.
    • DiffBIR's Solution: It explicitly decouples the problem into degradation removal (Stage I) and information regeneration (Stage II). This ensures that the generation module receives clean, content-focused conditions, leading to more stable and higher-quality outputs. This is a crucial distinction from direct end-to-end approaches or those that don't explicitly handle the "blind" aspect in a decoupled manner for generation.
  • Robust IRControlNet for General Images:
    • Problem: BFR methods achieve great results by leveraging strong face priors but are limited to faces. Generic BSR methods often lack the generative power for truly realistic details. Prior ControlNet adaptations for IR (e.g., StableSR, PASD) might still face issues with color consistency or optimal condition encoding.
    • DiffBIR's Solution: IRControlNet is specifically designed for general image content regeneration by leveraging Stable Diffusion. Its innovative use of the pre-trained VAE encoder for condition encoding and efficient ControlNet-like structure (copied UNet encoder with zero-initialization, modulating skipped features) effectively addresses issues like color shifts and ensures stable, high-quality generation across diverse image types.
  • Training-Free Controllable Trade-off:
    • Problem: Many BIR methods struggle to offer a flexible balance between fidelity (being faithful to the input degraded image) and quality/realness (generating perceptually pleasing details). Users often have different preferences for this trade-off.
    • DiffBIR's Solution: The region-adaptive restoration guidance is a novel, training-free module applied during inference. It allows users to dynamically adjust a guidance scale to emphasize fidelity or quality, and intelligently applies this guidance more strongly to low-frequency regions (where fidelity is paramount) while allowing high-frequency regions to benefit more from the generative prior. This offers unprecedented user control.

4. Methodology

4.1. Principles

DiffBIR operates on the core principle that blind image restoration (BIR) can be effectively solved by decoupling the complex problem into two distinct, manageable stages: degradation removal and information regeneration. The intuition behind this is that real-world degraded images contain both content information and image-independent degradation artifacts that are deeply entangled. Directly feeding such noisy images to a generative model as conditions leads to instability and artifacts, as the model struggles to distinguish between content to preserve and degradation to remove.

The proposed two-stage pipeline addresses this:

  1. Degradation Removal (Stage I): This stage focuses on producing a high-fidelity intermediate image by removing the various degradations (e.g., noise, blur, compression artifacts) present in the low-quality input. The key is to obtain a "cleaner" representation of the image content, even if it's still somewhat smooth or lacking fine details. This module acts as a "condition preprocessor" for the next stage.

  2. Information Regeneration (Stage II): With a relatively clean (high-fidelity) condition image from Stage I, this stage leverages a powerful generative latent diffusion model to synthesize lost or missing realistic details and textures. By conditioning on a degradation-free representation, the generative model can focus purely on adding plausible, high-quality content without being distracted or misled by artifacts.

    Additionally, a training-free controllable module is integrated to allow users to achieve a flexible trade-off between fidelity (closeness to the original content of the input) and quality/realness (perceptual realism of generated details). This is achieved through region-adaptive restoration guidance, which modifies the denoising process during inference.

The entire pipeline is illustrated in Figure 3 from the original paper.

Figure 3 (schematic of the DiffBIR restoration pipeline): low-quality (LQ) images are first processed by the Restoration Module to produce high-fidelity images, which the Generation Module then uses for information regeneration; region-adaptive restoration guidance is applied during this process.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Restoration Module

In the first stage, the goal is to remove distracting degradations from low-quality images without generating new content. Since different BIR tasks (e.g., BID, BFR, BSR) have distinct degradation characteristics and dataset specifics, DiffBIR uses separate, task-specific restoration modules during inference to leverage their specialized expertise. These are typically off-the-shelf BIR models trained with MSE loss. For example:

  • For BSR, BSRNet [73] is used.

  • For BFR, SwinIR [29] (as used in DifFace [67]) is adopted.

  • For BID, SCUNet-PSNR [74] is utilized.

    However, for training the subsequent generation module, a stable and diversified set of condition images is required. To this end, DiffBIR trains an additional dedicated Restoration Module (RM). This RM is trained using a classic degradation model with a wide degradation range and MSE loss. This wide range helps generate sufficiently diversified condition images, which in turn improves the overall generative capacity of the generation module. This RM (used for condition generation during training) is discarded during inference.

The MSE loss for training this specific RM is defined as: $\mathcal{L}_{RM} = \|I_{RM} - I_{hq}\|_2^2$. Here:

  • $I_{hq}$ denotes the high-quality ground-truth image.

  • $I_{lq}$ denotes the synthesized low-quality counterpart of $I_{hq}$.

  • $I_{RM} = \mathrm{RM}(I_{lq})$ denotes the restored image produced by the Restoration Module from $I_{lq}$.

  • $\|\cdot\|_2^2$ represents the squared Euclidean (L2) norm, i.e., the sum of squared differences between corresponding pixels of the two images.

    This RM (used only to produce condition images during training) acts as a preprocessing step, ensuring that the condition images fed to the generation module are reliable and not contaminated by complex degradations, which, as illustrated in Figure 2, can cause unpleasant artifacts. (A minimal training sketch follows Figure 2 below.)

    Figure 2. The effects of condition information on generated results. The 2nd row shows that directly using LQ images as conditions causes unpleasant artifacts induced by different degradations (Gaussian, speckle, Poisson, and JPEG compression noise), while DiffBIR's two-stage pipeline is more stable (3rd row).
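To make the Stage-I training objective concrete, here is a minimal, hypothetical PyTorch sketch of training a restoration module with $\mathcal{L}_{RM}$ on synthesized LQ/HQ pairs. The model, data iterator, and degradation function are placeholders, not the paper's actual code.

```python
import torch
import torch.nn.functional as F

def train_restoration_module(rm, hq_loader, degrade, num_steps, device="cuda"):
    """Train a restoration module with the plain MSE loss L_RM = ||RM(I_lq) - I_hq||^2.

    rm:        any image-to-image network (placeholder)
    hq_loader: iterator yielding batches of high-quality images I_hq
    degrade:   function synthesizing I_lq from I_hq with a wide degradation range
    """
    optimizer = torch.optim.Adam(rm.parameters(), lr=1e-4)
    rm.train().to(device)
    for step, hq in zip(range(num_steps), hq_loader):
        hq = hq.to(device)
        lq = degrade(hq)                  # synthesize the low-quality counterpart
        restored = rm(lq)                 # I_RM = RM(I_lq)
        loss = F.mse_loss(restored, hq)   # L_RM
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return rm
```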

4.2.2. Generation Module

The generation module leverages the power of pre-trained large-scale text-to-image latent diffusion models, specifically Stable Diffusion 2.1-base. Given the reliable condition image $I_{RM}$ (from Stage I), this module aims to generate realistic details. The core of this module is IRControlNet, which involves three main aspects: condition encoding, condition network, and feature modulation.

Preliminary: Stable Diffusion

Stable Diffusion is built upon an autoencoder [26] that efficiently converts an image $x$ into a latent representation $z$ using an encoder $\mathcal{E}$, and reconstructs it using a decoder $\mathcal{D}$. Both the diffusion (noise-adding) and denoising (noise-removing) processes occur in this latent space.

In the diffusion process, Gaussian noise $\epsilon \sim \mathcal{N}(0, \mathbf{I})$ with variance $\beta_t \in (0, 1)$ is incrementally added to the encoded latent $z = \mathcal{E}(x)$ at time step $t$ to produce a noisy latent $z_t$: $z_t = \sqrt{\bar{\alpha}_t}\, z + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. Here:

  • $z_t$ is the noisy latent at time $t$.

  • $z$ is the clean latent (from $\mathcal{E}(x)$).

  • $\epsilon$ is a random sample from a standard Gaussian distribution (zero mean, identity covariance).

  • $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$. These terms control the amount of noise added. As $t$ increases, $z_t$ becomes progressively noisier, approaching a standard Gaussian distribution when $t$ is large enough.

    A neural network, denoted $\epsilon_\theta$, is trained to predict the noise $\epsilon$ that was added to $z_t$. This prediction is conditioned on context $c$ (e.g., text prompts) and the current time step $t$. The optimization objective for the latent diffusion model is to minimize the squared difference between the actual and predicted noise: $\mathcal{L}_{ldm} = \mathbb{E}_{z, c, t, \epsilon}[\|\epsilon - \epsilon_\theta(\sqrt{\bar{\alpha}_t} z + \sqrt{1 - \bar{\alpha}_t}\,\epsilon, c, t)\|_2^2]$. Here:

  • $\mathcal{L}_{ldm}$ is the latent diffusion model's training loss.

  • $\mathbb{E}$ denotes the expectation over randomly sampled $z$, $c$, $t$, and $\epsilon$.

  • $z = \mathcal{E}(x)$, where $x$ is an image from the dataset.

  • $c$ represents the conditioning information (e.g., text prompts; in DiffBIR, it is typically empty or a negative prompt).

  • $\epsilon_\theta$ is the noise prediction network (typically a U-Net). A minimal code sketch of the forward process and this objective follows below.
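As referenced above, here is a minimal PyTorch sketch (an illustration, not Stable Diffusion's actual implementation) of the forward noising step and the noise-prediction objective; `eps_theta` stands in for any denoiser network, and the linear schedule is an assumption.

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t (assumed linear here)
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # \bar{alpha}_t = prod_{s<=t} (1 - beta_s)

def q_sample(z, t, eps):
    """Forward diffusion: z_t = sqrt(abar_t) * z + sqrt(1 - abar_t) * eps."""
    abar = alphas_bar[t].view(-1, 1, 1, 1)
    return abar.sqrt() * z + (1.0 - abar).sqrt() * eps

def ldm_loss(eps_theta, z, c, t):
    """L_ldm: MSE between the true noise and the noise predicted from the noisy latent."""
    eps = torch.randn_like(z)
    z_t = q_sample(z, t, eps)
    return F.mse_loss(eps_theta(z_t, c, t), eps)
```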

IRControlNet Architecture and Training

The architecture of IRControlNet and its variants is depicted in Figure 4.

Figure 4. Architectures of our IRControlNet and four model variants. The left side shows the main IRControlNet structure, including its fixed and trainable modules; the right side shows the designs of the four variants, highlighting different implementations of condition encoding and feature modulation.

  1. Condition encoding: IRControlNet uses the pre-trained and fixed VAE encoder $\mathcal{E}$ to encode the high-fidelity condition image $I_{RM}$ (from Stage I) into the latent space, producing the condition latent $c_{RM} = \mathcal{E}(I_{RM})$. This is a crucial design choice: the VAE is pre-trained on large-scale datasets and effectively preserves sufficient image information in the latent space, which is where the diffusion process operates.

  2. Condition network: Similar to ControlNet [75], DiffBIR creates a trainable copy of the pre-trained UNet encoder and middle block from Stable Diffusion. This copied network, denoted $\mathbf{F}_{cond.}$, receives the condition information and outputs control signals; copying the weights provides a good initialization. The input to $\mathbf{F}_{cond.}$ is the concatenation of the condition latent $c_{RM}$ and the noisy latent $z_t$ at time $t$: $z_t' = \mathrm{cat}(z_t, c_{RM})$. Since concatenation increases the channel count, a few parameters are added to the first layer of $\mathbf{F}_{cond.}$ and initialized to zero. This zero initialization functions like the zero convolution in ControlNet, preventing random noise from acting as gradients early in training and thus stabilizing convergence.

  3. Feature modulation: The multi-scale features output by the condition network $\mathbf{F}_{cond.}$ are used to modulate the intermediate features of the frozen UNet denoiser of Stable Diffusion. Following ControlNet, this modulation is performed by adding the control features to the middle-block features and the skipped features. Zero convolutions again connect $\mathbf{F}_{cond.}$ to the frozen UNet denoiser for training stability.

    During training of the generation module, only the parameters of the condition network and the feature modulation layers are updated; the rest of the pre-trained Stable Diffusion model (VAE and UNet denoiser) remains frozen. A simplified code sketch of this wiring appears after the symbol list below. The objective function for training the generation module is: $\mathcal{L}_{GM} = \mathbb{E}_{z_t, c, t, \epsilon, c_{RM}}[\|\epsilon - \epsilon_\theta(z_t, c, t, c_{RM})\|_2^2]$. Here:

  • $\mathcal{L}_{GM}$ is the generation module's training loss.
  • $z_t$ is the noisy latent.
  • $c$ is the text prompt (can be empty or a negative prompt such as "low quality", "blurry").
  • $t$ is the time step.
  • $\epsilon$ is the ground-truth noise.
  • $c_{RM}$ is the condition latent derived from $I_{RM}$.
  • $\epsilon_\theta(z_t, c, t, c_{RM})$ is the noise predicted by the UNet denoiser of Stable Diffusion, modulated by IRControlNet's control signals and conditioned on $z_t$, $c$, $t$, and $c_{RM}$.
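The sketch below is a heavily simplified, hypothetical illustration of the ControlNet-style wiring described above: a trainable copy of a (toy) UNet encoder whose first convolution gains zero-initialized input channels so it can take $\mathrm{cat}(z_t, c_{RM})$, plus zero-initialized 1×1 convolutions producing control features that would be added to the frozen denoiser's skip features. It is not DiffBIR's actual architecture; all module names and sizes are illustrative.

```python
import copy
import torch
import torch.nn as nn

def zero_module(m: nn.Module) -> nn.Module:
    """Zero-initialize a layer so it contributes nothing at the start of training."""
    for p in m.parameters():
        nn.init.zeros_(p)
    return m

class ToyConditionNetwork(nn.Module):
    """ControlNet-style condition network built from a trainable copy of a toy encoder."""
    def __init__(self, frozen_encoder: nn.Sequential, latent_ch: int = 4):
        super().__init__()
        self.blocks = copy.deepcopy(frozen_encoder)   # trainable copy (good initialization)
        first = self.blocks[0]
        widened = nn.Conv2d(first.in_channels + latent_ch, first.out_channels,
                            first.kernel_size, first.stride, first.padding)
        with torch.no_grad():
            widened.weight.zero_()                               # extra input channels start at zero
            widened.weight[:, :first.in_channels] = first.weight  # keep the copied weights
            widened.bias.copy_(first.bias)
        self.blocks[0] = widened
        self.zero_convs = nn.ModuleList(
            [zero_module(nn.Conv2d(b.out_channels, b.out_channels, 1)) for b in self.blocks]
        )

    def forward(self, z_t, c_rm):
        h = torch.cat([z_t, c_rm], dim=1)   # z_t' = cat(z_t, c_RM)
        controls = []
        for block, zconv in zip(self.blocks, self.zero_convs):
            h = block(h)
            controls.append(zconv(h))       # control features to add to the frozen UNet's skip features
        return controls

# Toy stand-in for a UNet encoder: a stack of convs over 4-channel latents.
frozen_encoder = nn.Sequential(
    nn.Conv2d(4, 64, 3, 1, 1), nn.Conv2d(64, 128, 3, 2, 1), nn.Conv2d(128, 256, 3, 2, 1)
)
cond_net = ToyConditionNetwork(frozen_encoder)
z_t = torch.randn(1, 4, 64, 64)     # noisy latent
c_rm = torch.randn(1, 4, 64, 64)    # condition latent E(I_RM)
controls = cond_net(z_t, c_rm)      # multi-scale control features (all zeros before training)
```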

Discussion on IRControlNet Variants

The paper explores several IRControlNet variants (shown in Figure 4 and Figure 11 in the appendix) to validate its design choices:

  • Variant 1 (ControlNet-like condition encoding): Replaces IRControlNet's VAE encoder $\mathcal{E}$ for condition encoding with a tiny, trained-from-scratch network. This variant shows significantly worse results (e.g., roughly a 3 dB drop in PSNR, plus the color shift problems seen in Figure 10, right). This demonstrates the critical role of the pre-trained VAE encoder in projecting the condition image into the correct latent space for effective control.

  • Variant 2 (w/o $z_t$ in condition network): Removes the noisy latent $z_t$ from the condition network input, using only $c_{RM}$. While achieving good fidelity metrics, it tends to produce smoother results without sufficient texture details (Figure 12), and its training losses are consistently higher (Figure 5). This indicates that $z_t$ makes the condition network aware of the randomness at each timestep, boosting convergence and high-quality generation.

  • Variant 3 (w/o copied initialization): Trains the condition network from random initialization instead of copying UNet weights. This variant struggles with convergence and achieves the worst performance across all metrics (Figure 5, Table 1), highlighting the importance of good weight initialization.

  • Variant 4 (w/ control decoder features): Modulates decoder features instead of skipped features. It shows comparable performance to IRControlNet in convergence and quantitative results (Figure 5, Table 1). However, skipped features are preferred as controlling decoder features would introduce more parameters and computation due to larger channel numbers.

  • Variant 5 (w/ control concat features, from Appendix): Simultaneously controls middle block, decoder, and skipped features. Achieves better PSNR and SSIM but worse MANIQA scores (Table 9). This indicates that applying more control enhances fidelity but can damage generation quality by restricting the generative prior.

  • Variant 6 (w/ SFT modulation, from Appendix): Uses SFT (Scale and Shift Transformation) layers [53] for modulation. SFT is defined as $\mathrm{SFT}(\mathbf{F} \mid \gamma, \beta) = \mathbf{F} \odot (1 + \gamma) + \beta$, where $\mathbf{F}$ are feature maps and $\gamma, \beta$ are element-wise scale and shift parameters produced by zero convolutions (a minimal sketch of this modulation appears after this list). This variant also improves fidelity but at the cost of MANIQA scores (Table 9), again trading quality for fidelity.

    In summary, IRControlNet's design choices are empirically validated to be crucial for either model convergence or performance, making it a robust backbone for the generative module in BIR tasks.
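For concreteness, here is a minimal, hypothetical PyTorch sketch (not the paper's code) contrasting the additive modulation used by IRControlNet with the SFT-style modulation of Variant 6, following the formula above.

```python
import torch
import torch.nn as nn

class SFTModulation(nn.Module):
    """SFT(F | gamma, beta) = F * (1 + gamma) + beta, with gamma/beta from zero-initialized convs."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_gamma = nn.Conv2d(channels, channels, 1)
        self.to_beta = nn.Conv2d(channels, channels, 1)
        for conv in (self.to_gamma, self.to_beta):   # zero-init: no effect at the start of training
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, feat, control):
        gamma, beta = self.to_gamma(control), self.to_beta(control)
        return feat * (1.0 + gamma) + beta

feat = torch.randn(1, 64, 32, 32)       # a frozen-UNet skip feature
control = torch.randn(1, 64, 32, 32)    # a control feature from the condition network
additive = feat + control               # IRControlNet-style additive modulation (after a zero conv)
sft = SFTModulation(64)(feat, control)  # Variant 6: scale-and-shift modulation
```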

4.2.3. Restoration Guidance

This section describes a training-free controllable module designed to achieve a trade-off between quality (realness, generated details) and fidelity (faithfulness to the input). Users typically desire more generated details in high-frequency regions (e.g., textures, edges) and less generated content (i.e., more fidelity to the smoother $I_{RM}$ image) in flat, low-frequency regions (e.g., sky, walls). This is achieved through region-adaptive restoration guidance applied at each sampling step during inference.

The process is as follows. At a given time step $t$, the UNet denoiser first predicts the noise $\epsilon_t$ from the noisy latent $z_t$; this predicted noise is then removed from $z_t$ to obtain a clean latent $\tilde{z}_0$: $\epsilon_t = \epsilon_\theta(z_t, c, t, c_{RM})$ and $\tilde{z}_0 = \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t}{\sqrt{\bar{\alpha}_t}}$. Here:

  • $\epsilon_t$ is the noise predicted by the UNet denoiser (conditioned by IRControlNet).

  • $\tilde{z}_0$ is the predicted clean latent representation of the image at time 0, i.e., before any noise was added.

  • The terms $\sqrt{1 - \bar{\alpha}_t}$ and $\sqrt{\bar{\alpha}_t}$ are scaling factors derived from the diffusion schedule.

    The goal is to guide the decoded version of this clean latent, $\mathcal{D}(\tilde{z}_0)$, towards the high-fidelity condition image $I_{RM}$ (the output of Stage I). This is done by applying a region-adaptive MSE loss in pixel space and updating $\tilde{z}_0$ with a gradient descent step.

The region-adaptive MSE loss function is defined as: $\mathcal{L}(\tilde{z}_0) = \frac{1}{HWC} \|\mathcal{W} \odot (\mathcal{D}(\tilde{z}_0) - I_{RM})\|_2^2$. Here:

  • $\mathcal{L}(\tilde{z}_0)$ is the region-adaptive MSE loss.

  • H, W, C denote the height, width, and channel count of $I_{RM}$.

  • $\mathcal{D}(\tilde{z}_0)$ is the image reconstructed from the clean latent $\tilde{z}_0$ by the VAE decoder.

  • $I_{RM}$ is the high-fidelity guidance image from Stage I.

  • $\mathcal{W}$ is a weight map that assigns different importance to different image regions.

  • $\odot$ denotes the element-wise product.

  • $\|\cdot\|_2^2$ is the squared L2 norm.

    The weight map $\mathcal{W}$ is calculated from the gradient magnitude of $I_{RM}$. The idea is to assign larger weights to low-frequency (flat) regions and smaller weights to high-frequency (textured, edge) regions. Low-frequency areas thus incur a higher loss when they deviate from $I_{RM}$ and are pulled more strongly towards it, maintaining fidelity, while high-frequency regions are less constrained, allowing more generative freedom.

To obtain $\mathcal{W}$, the gradient magnitude $M(I_{RM})$ at each pixel (i, j) of $I_{RM}$ is first calculated using the Sobel operators $G_x$ and $G_y$ (which detect gradients in the horizontal and vertical directions): $M(I_{RM}) = \sqrt{G_x(I_{RM})^2 + G_y(I_{RM})^2}$. Since strong gradient signals are sparse, patch-level gradient signals are used for a better estimate: $I_{RM}$ is divided into equal-sized, non-overlapping patches $I_{RM}^{(k)}$, and for each patch the sum of the gradient magnitudes of all its pixels is mapped into the range $[0, 1)$ with the tanh function: $S(I_{RM}^{(k)}) = \tanh\big(\sum_{i,j} M_{i,j}(I_{RM})\big), \ (i, j) \in I_{RM}^{(k)}$, where (i, j) denotes a pixel in patch $I_{RM}^{(k)}$. A higher $S(I_{RM}^{(k)})$ indicates a stronger gradient signal in that patch. The final gradient magnitude map $\mathcal{G}(I_{RM})$ is then formulated per pixel as $\mathcal{G}_{i,j}(I_{RM}) = \sum_k \mathbb{I}\big[(i, j) \in I_{RM}^{(k)}\big] S(I_{RM}^{(k)})$, where $\mathbb{I}[(i, j) \in I_{RM}^{(k)}]$ is an indicator function that equals 1 if pixel (i, j) lies in patch $I_{RM}^{(k)}$ and 0 otherwise; this assigns the patch-level gradient intensity to every pixel within that patch. Finally, the weight map is computed as $\mathcal{W} = 1 - \mathcal{G}(I_{RM})$: regions with strong gradients (high-frequency) receive small weights (closer to 0), and regions with weak gradients (low-frequency) receive large weights (closer to 1), as illustrated in Figure 6.

Figure 6. Region-adaptive restoration guidance. Given the high-fidelity guidance image $I_{RM}$, the region-adaptive MSE loss between the clean latent $\tilde{z}_0$ and $I_{RM}$ is minimized at each sampling step through a gradient-descent algorithm; the figure also illustrates the computation of the weight and gradient maps.
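The following is a minimal PyTorch sketch (an illustration under assumed hyper-parameters, not the paper's code) of the weight-map computation just described: Sobel gradient magnitudes, per-patch sums squashed by tanh, and $\mathcal{W} = 1 - \mathcal{G}(I_{RM})$. The patch size of 16 is an assumption.

```python
import torch
import torch.nn.functional as F

def weight_map(i_rm: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Region-adaptive weight map W = 1 - G(I_RM) for a (1, 3, H, W) image in [0, 1].
    H and W are assumed divisible by `patch` (an illustrative hyper-parameter)."""
    gray = i_rm.mean(dim=1, keepdim=True)                       # compute gradients on a grayscale copy
    kx = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
    ky = kx.transpose(2, 3)
    gx = F.conv2d(gray, kx, padding=1)                          # Sobel horizontal gradient
    gy = F.conv2d(gray, ky, padding=1)                          # Sobel vertical gradient
    mag = torch.sqrt(gx ** 2 + gy ** 2)                         # M(I_RM)
    patch_sum = F.avg_pool2d(mag, patch) * patch * patch        # per-patch sum of gradient magnitudes
    s = torch.tanh(patch_sum)                                   # S(I_RM^(k)) in [0, 1)
    g = F.interpolate(s, scale_factor=patch, mode="nearest")    # broadcast each patch value to its pixels
    return 1.0 - g                                              # W: large in flat regions, small at edges

w = weight_map(torch.rand(1, 3, 512, 512))    # weight map for a random 512x512 "guidance image"
```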

The clean latent $\tilde{z}_0$ is then updated at each sampling step $t$ using gradient descent: $\tilde{z}_0' = \tilde{z}_0 - s \nabla_{\tilde{z}_0} \mathcal{L}(\tilde{z}_0)$. Here:

  • $\tilde{z}_0'$ is the updated clean latent.

  • $s$ is the guidance scale, a tunable hyper-parameter controlled by the user. A larger $s$ pushes $\mathcal{D}(\tilde{z}_0)$ closer to $I_{RM}$, resulting in higher fidelity.

  • $\nabla_{\tilde{z}_0} \mathcal{L}(\tilde{z}_0)$ is the gradient of the region-adaptive MSE loss with respect to the clean latent $\tilde{z}_0$.

    The detailed algorithm for restoration guidance is provided in Algorithm 1 in the Appendix:

    Algorithm 1: Restoration guidance, given a diffusion model $\theta$ and the VAE's encoder $\mathcal{E}$ and decoder $\mathcal{D}$.
    Input: guidance image $I_{RM}$, text description $c$ (set to empty), diffusion steps $T$, gradient scale $s$
    Output: output image $\mathcal{D}(z_0)$
    Sample $z_T$ from $\mathcal{N}(0, I)$
    for $t$ from $T$ to 1 do
        $\epsilon_t \leftarrow \epsilon_\theta(z_t, c, t, \mathcal{E}(I_{RM}))$
        $\tilde{z}_0 \leftarrow \frac{z_t - \sqrt{1 - \bar{\alpha}_t}\,\epsilon_t}{\sqrt{\bar{\alpha}_t}}$
        Calculate $\mathcal{W}$ from $I_{RM}$ as $1 - \mathcal{G}(I_{RM})$
        $\mathcal{L}(\tilde{z}_0) = \frac{1}{HWC} \|\mathcal{W} \odot (\mathcal{D}(\tilde{z}_0) - I_{RM})\|_2^2$
        $\tilde{z}_0 \leftarrow \tilde{z}_0 - s \nabla_{\tilde{z}_0} \mathcal{L}(\tilde{z}_0)$
        Sample $z_{t-1}$ from $q(z_{t-1} \mid z_t, \tilde{z}_0)$
    end for
    return $\mathcal{D}(z_0)$

This mechanism provides flexible control without re-training, allowing users to fine-tune the output image's characteristics based on their preferences.
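As a rough illustration (not the paper's implementation), a single guidance step at time $t$ from Algorithm 1 could look as follows in PyTorch: the clean latent is predicted, decoded, compared with $I_{RM}$ under the weight map, and nudged by the gradient of the weighted loss scaled by the user-chosen guidance scale $s$. Here `eps_theta`, `vae_decode`, and `alphas_bar` are placeholders for the controlled denoiser, the frozen VAE decoder, and the noise schedule.

```python
import torch

def guidance_step(z_t, t, c, c_rm, i_rm, w_map, s, eps_theta, vae_decode, alphas_bar):
    """One restoration-guidance step: predict z0, then pull D(z0) towards I_RM in weighted MSE."""
    abar = alphas_bar[t]
    eps_t = eps_theta(z_t, c, t, c_rm)                       # controlled noise prediction
    z0 = (z_t - (1.0 - abar).sqrt() * eps_t) / abar.sqrt()   # predicted clean latent
    z0 = z0.detach().requires_grad_(True)
    decoded = vae_decode(z0)                                 # D(z0) in pixel space
    loss = (w_map * (decoded - i_rm)).pow(2).mean()          # region-adaptive MSE loss
    grad = torch.autograd.grad(loss, z0)[0]
    return z0 - s * grad                                     # guided clean latent z0'
```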

5. Experimental Setup

5.1. Datasets

DiffBIR's models are trained and evaluated on a variety of synthetic and real-world datasets tailored for Blind Image Super-Resolution (BSR), Blind Face Restoration (BFR), and Blind Image Denoising (BID) tasks.

  • Training Dataset (for DiffBIR's Generation Module):

    • Filtered laion2b-en [46]: A large-scale dataset containing approximately 15 million high-quality images. All images are randomly cropped to 512 × 512 resolution during training. This dataset is chosen for its vast size and diversity, which is crucial for training robust generative diffusion models like Stable Diffusion.
  • Evaluation Datasets for BSR Task:

    • Synthetic Datasets: These datasets contain images with artificially induced degradations, allowing for quantitative comparison against ground truth.
      • DIV2K-Val [1]: A widely used high-quality image dataset for super-resolution, often used for validation.
      • DRealSR [62]: A dataset designed for real-world super-resolution challenges, offering diverse degradations.
      • RealSR [3]: Another benchmark dataset focusing on real-world single image super-resolution.
    • Real-world Datasets: These datasets consist of images with authentic, complex degradations, posing a greater challenge.
      • RealSRSet [73]: A dataset specifically collected for blind image super-resolution from real-world scenarios.
      • Real47: A dataset collected by the authors for evaluating BSR in real-world conditions.
  • Evaluation Datasets for BFR Task:

    • These datasets contain real-world face images with varying degrees of degradation.
      • LFW-Test [55] (based on LFW [21]): A subset derived from the Labeled Faces in the Wild dataset, commonly used for face restoration.
      • WIDER-Test [77]: A subset used for face restoration, often containing faces with various poses, expressions, and occlusions under diverse conditions.
  • Evaluation Datasets for BID Task:

    • This task evaluates performance on images degraded by various real-world noise types.
      • Mixed Real-world Dataset: Composed of images from real3 [74], real9 [74], and RNI15 [72]. These datasets are known for containing realistic camera sensor noises (e.g., dark current noise, shot noise, thermal noise) that are challenging to remove.

5.2. Evaluation Metrics

For every evaluation metric, a comprehensive explanation is provided below:

  1. PSNR (Peak Signal-to-Noise Ratio)

    • Conceptual Definition: PSNR quantifies the reconstruction quality of an image compared to an original reference image. It is most commonly used to measure the quality of reconstruction of lossy compression codecs or restoration algorithms. A higher PSNR value generally indicates a better quality image, meaning the reconstructed image is closer to the original. It is usually expressed in decibels (dB).
    • Mathematical Formula: $\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$, $\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) = 20 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}}\right)$. (A minimal computation sketch appears after this metric list.)
    • Symbol Explanation:
      • $\mathrm{MSE}$: Mean Squared Error between the two images.
      • I(i,j): The pixel value at position (i,j) in the original (ground-truth) image.
      • K(i,j): The pixel value at position (i,j) in the reconstructed (restored) image.
      • m, n: Dimensions of the image (rows and columns).
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For 8-bit grayscale images this is 255; for images with pixel values in [0, 1] it is 1.
  2. SSIM (Structural Similarity Index Measure)

    • Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images, often used to assess the quality of lossy compression or restored images. Unlike PSNR which relies on pixel-wise error, SSIM attempts to model the human visual system by considering three key features: luminance, contrast, and structure. The SSIM index ranges from -1 to 1, where 1 indicates perfect structural similarity.
    • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • $x$: A window from the first image (e.g., ground truth).
      • $y$: A window from the second image (e.g., restored image).
      • $\mu_x$: Average of $x$.
      • $\mu_y$: Average of $y$.
      • $\sigma_x^2$: Variance of $x$.
      • $\sigma_y^2$: Variance of $y$.
      • $\sigma_{xy}$: Covariance of $x$ and $y$.
      • $c_1 = (K_1 L)^2$, $c_2 = (K_2 L)^2$: Two constants to stabilize the division with a weak denominator.
      • $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit grayscale images).
      • $K_1 = 0.01$, $K_2 = 0.03$: Default constant values.
  3. LPIPS (Learned Perceptual Image Patch Similarity) [76]

    • Conceptual Definition: LPIPS is a metric that aims to correlate better with human perception of image quality than traditional metrics like PSNR or SSIM. It works by extracting deep features from pre-trained neural networks (like AlexNet or VGG), then calculating the Euclidean distance between these feature representations of two images. A lower LPIPS score indicates higher perceptual similarity.
    • Mathematical Formula: The paper does not provide an explicit formula, as LPIPS is defined by the specific network architecture and distance calculation. Conceptually, it can be represented as: $ \mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} ||w_l \odot (\phi_l(x) - \phi_l(x_0))||_2^2 $
    • Symbol Explanation:
      • $x$: Original image (ground truth).
      • $x_0$: Restored image.
      • $\phi_l$: Feature stack from layer $l$ of a pre-trained network (e.g., AlexNet, VGG).
      • $w_l$: Weights for layer $l$.
      • $H_l, W_l$: Height and width of the feature maps at layer $l$.
      • $\odot$: Element-wise multiplication.
      • $\|\cdot\|_2^2$: Squared L2 norm.
  4. MUSIQ (Multi-scale Image Quality Transformer) [24]

    • Conceptual Definition: MUSIQ is a no-reference image quality assessment (IQA) metric designed to predict subjective image quality without needing a reference image. It uses a Transformer-based architecture to extract features at multiple scales and aggregates them to provide an overall quality score. A higher MUSIQ score indicates better perceived image quality.
  5. MANIQA (Multi-dimension Attention Network for No-Reference Image Quality Assessment) [64]

    • Conceptual Definition: MANIQA is another no-reference IQA metric that employs a multi-dimension attention network to assess image quality. It focuses on learning discriminative quality-aware features by attending to different aspects (e.g., spatial, channel) of an image. A higher MANIQA score typically suggests better image quality.
  6. CLIP-IQA [51]

    • Conceptual Definition: CLIP-IQA is a no-reference IQA metric that leverages the capabilities of CLIP (Contrastive Language-Image Pre-training) to assess image quality. It aligns image quality assessment with human aesthetic and semantic understanding by measuring the semantic similarity between an image and quality-related text prompts. A higher CLIP-IQA score indicates better image quality, often correlating with semantic integrity and visual appeal.
  7. FID (Fréchet Inception Distance) [19]

    • Conceptual Definition: FID is a metric used to evaluate the quality of images generated by generative models, particularly GANs and diffusion models. It measures the "distance" between the feature distributions of real images and generated images. It uses a pre-trained Inception-v3 network to extract features from both sets of images and then computes the Fréchet distance between these two multivariate Gaussian distributions. A lower FID score indicates that the generated images are more similar to real images in terms of quality and diversity.
    • Mathematical Formula: $ \mathrm{FID} = ||\mu_x - \mu_g||_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x\Sigma_g)^{1/2}) $
    • Symbol Explanation:
      • $\mu_x$: Mean of feature vectors for real images.
      • $\mu_g$: Mean of feature vectors for generated images.
      • $\Sigma_x$: Covariance matrix of feature vectors for real images.
      • $\Sigma_g$: Covariance matrix of feature vectors for generated images.
      • $\|\cdot\|_2^2$: Squared L2 norm (Euclidean distance).
      • $\mathrm{Tr}$: Trace of a matrix.
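As referenced in the PSNR entry above, the following is a minimal PyTorch sketch (not the paper's evaluation code) of the PSNR computation for image tensors with values in [0, 1].

```python
import torch

def psnr(restored: torch.Tensor, reference: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """PSNR = 10 * log10(MAX_I^2 / MSE) for image tensors with values in [0, max_val]."""
    mse = torch.mean((restored - reference) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

print(psnr(torch.rand(3, 256, 256), torch.rand(3, 256, 256)))
```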

5.3. Baselines

DiffBIR is compared against a comprehensive set of state-of-the-art methods for each Blind Image Restoration task:

  • For Blind Image Super-Resolution (BSR):

    • FeMaSR [5]
    • DASR [30]
    • Real-ESRGAN+ [56]
    • BSRGAN [73]
    • SwinIR-GAN [29]
    • StableSR [52] (a diffusion-based method)
    • PASD [66] (a diffusion-based method) These baselines represent both GAN-based and recent diffusion-based approaches that aim to handle real-world super-resolution with unknown degradations.
  • For Blind Face Restoration (BFR):

    • CodeFormer [77]
    • DifFace [67] (a diffusion-based method)
    • DMDNet [28]
    • DR2 [61] (a diffusion-based method)
    • GCFSR [18]
    • GFP-GAN [55]
    • GPEN [65]
    • RestoreFormer++ [60]
    • VQFR [16]
    • PGDiff [63] (a diffusion-based method) These methods are state-of-the-art in face-specific restoration, utilizing various GAN or diffusion priors tailored for facial structures.
  • For Blind Image Denoising (BID):

    • CBDNet [17]
    • DeamNet [41]
    • Restormer [69]
    • SwinIR [29]
    • SCUNet-GAN [74] These baselines cover a range of CNN-based, Transformer-based, and GAN-based approaches designed for blind noise removal.

6. Results & Analysis

6.1. Core Results Analysis

DiffBIR demonstrates superior performance across all evaluated Blind Image Restoration tasks (BSR, BFR, BID), both quantitatively and qualitatively.

BSR on Synthetic Datasets

The following are the results from Table 2 of the original paper, showing quantitative comparisons on the DIV2K-Val dataset with Real-ESRGAN degradation.

| Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) | DiffBIR (s=0.5) | DiffBIR (s=1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR↑ | 20.1303 | 21.2141 | 21.0348 | 21.4531 | 20.7488 | 21.2392 | 20.7838 | 20.5824 | 21.5808 | 21.9154 |
| SSIM↑ | 0.4451 | 0.4773 | 0.4899 | 0.4814 | 0.4844 | 0.4790 | 0.4727 | 0.4277 | 0.4794 | 0.4986 |
| LPIPS↓ | 0.3971 | 0.4479 | 0.3921 | 0.4095 | 0.3907 | 0.3993 | 0.4353 | 0.3939 | 0.3935 | 0.4263 |
| MUSIQ↑ | 62.7855 | 58.1591 | 64.6389 | 62.9271 | 65.4945 | 57.8069 | 63.8094 | 73.1019 | 68.6657 | 61.1476 |
| MANIQA↑ | 0.1443 | 0.1531 | 0.2238 | 0.1833 | 0.2061 | 0.1648 | 0.2354 | 0.3836 | 0.3146 | 0.2466 |
| CLIP-IQA↑ | 0.5674 | 0.5571 | 0.5905 | 0.5195 | 0.5779 | 0.5541 | 0.6125 | 0.7656 | 0.7158 | 0.6347 |

DiffBIR consistently outperforms state-of-the-art methods on synthetic BSR datasets, especially on no-reference image quality assessment (IQA) metrics such as MUSIQ, MANIQA, and CLIP-IQA. When the restoration guidance scale $s$ is set to 0 (prioritizing quality/realness), DiffBIR achieves the highest scores across all three IQA metrics by a significant margin, highlighting its superior ability to generate perceptually pleasing and realistic details. When $s = 1$ (prioritizing fidelity), DiffBIR achieves the best PSNR and SSIM scores, which measure pixel-wise and structural similarity to the ground truth; even at $s = 1$, its IQA scores remain competitive (top-3). This demonstrates the effectiveness of the region-adaptive restoration guidance in letting users balance realness and fidelity according to their needs, with $s = 0.5$ offering a good compromise.

BSR on Real-world Datasets

The following are the results from Table 3 of the original paper, showing quantitative comparisons on real-world datasets for BSR.

| Datasets | Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RealSRSet [73] | MUSIQ↑ | 64.6735 | 59.2695 | 63.2675 | 67.6705 | 64.2512 | 64.8372 | 67.4052 | 69.4208 |
| RealSRSet [73] | MANIQA↑ | 0.2142 | 0.1595 | 0.1963 | 0.2240 | 0.2054 | 0.2083 | 0.2370 | 0.3211 |
| RealSRSet [73] | CLIP-IQA↑ | 0.6879 | 0.5236 | 0.5772 | 0.6456 | 0.6008 | 0.6418 | 0.6761 | 0.7637 |
| Real47 | MUSIQ↑ | 68.9384 | 62.2026 | 68.1098 | 69.4741 | 68.8467 | 68.3422 | 70.9712 | 73.1397 |
| Real47 | MANIQA↑ | 0.2347 | 0.1454 | 0.2055 | 0.2063 | 0.2217 | 0.2264 | 0.2607 | 0.3682 |
| Real47 | CLIP-IQA↑ | 0.6911 | 0.5445 | 0.6382 | 0.6111 | 0.6246 | 0.6574 | 0.6913 | 0.7781 |

On the challenging real-world BSR datasets (RealSRSet and Real47), DiffBIR ($s = 0$) achieves the best scores across all MUSIQ, MANIQA, and CLIP-IQA metrics, confirming its superiority in handling complex, authentic degradations and generating highly realistic outputs. Qualitative comparisons (Figure 7) further show that DiffBIR produces sharper and more realistic results than GAN-based methods, which tend to be over-smoothed; compared with other diffusion-based methods, DiffBIR's outputs are more realistic, recovering intricate details such as whiskers, lips, flower pistils, and clear text.

Figure 7. Visual comparison of BSR methods on real-world datasets, showing the outputs of several blind super-resolution methods side by side and highlighting DiffBIR's advantage in recovering image content.

BFR on Real-world Datasets

The following are the results from Table 4 of the original paper, showing quantitative comparisons on real-world datasets for BFR.

| Datasets | Metrics | CodeFormer [77] | DifFace [67] | DMDNet [28] | DR2 [61] | GCFSR [18] | GFP-GAN [55] | GPEN [65] | RestoreFormer++ [60] | VQFR [16] | PGDiff [63] | DiffBIR (s=0) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LFW-Test [21] | MUSIQ↑ | 75.4830 | 70.4957 | 73.4027 | 67.5357 | 71.3789 | 76.3779 | 76.6210 | 72.2492 | 74.3847 | 72.2175 | 76.4206 |
| LFW-Test [21] | MANIQA↑ | 0.3188 | 0.2692 | 0.2973 | 0.2830 | 0.2790 | 0.3688 | 0.3616 | 0.3179 | 0.3280 | 0.2927 | 0.4499 |
| LFW-Test [21] | CLIP-IQA↑ | 0.6890 | 0.5945 | 0.6467 | 0.5728 | 0.6143 | 0.7196 | 0.7181 | 0.7025 | 0.7099 | 0.6133 | 0.7948 |
| LFW-Test [21] | FID (ref. FFHQ)↓ | 52.8765 | 44.9201 | 43.5403 | 45.9420 | 52.6972 | 47.4717 | 51.9862 | 50.7309 | 50.1300 | 41.5814 | 4.9065 |
| Wider-Test [77] | MUSIQ↑ | 73.4081 | 65.2397 | 69.4709 | 67.3163 | 69.9634 | 74.8308 | 75.6160 | 71.5155 | 71.4163 | 66.0014 | 75.3213 |
| Wider-Test [77] | MANIQA↑ | 0.2971 | 0.2403 | 0.2630 | 0.2795 | 0.2803 | 0.3508 | 0.3472 | 0.29055 | 0.3060 | 0.2406 | 0.4443 |
| Wider-Test [77] | CLIP-IQA↑ | 0.6984 | 0.5639 | 0.6335 | 0.5821 | 0.6266 | 0.7147 | 0.7039 | 0.7171 | 0.7069 | 0.5685 | 0.8085 |
| Wider-Test [77] | FID (ref. FFHQ)↓ | 39.2517 | 37.8440 | 38.9580 | 40.1202 | 41.1986 | 41.3247 | 46.4419 | 45.4686 | 38.1675 | 40.2700 | 35.0940 |

DiffBIR ($s = 0$) achieves the lowest FID score on both the LFW-Test and WIDER-Test datasets, by a significant margin on LFW-Test (4.9065 vs. 41.5814 for the next-best method). This is a strong indicator of its ability to generate highly realistic and diverse faces that are visually indistinguishable from real ones. It also obtains the highest MANIQA and CLIP-IQA scores, with MUSIQ scores very close to the top. Notably, IRControlNet was not specifically finetuned on face datasets such as FFHQ, yet it outperforms specialized BFR methods, highlighting the excellent generalization ability of DiffBIR's restoration pipeline to general images, not just faces. Visual comparisons (Figure 8) show DiffBIR's superiority in handling non-facial elements (e.g., correctly restoring a hand alongside a face) and complex facial orientations (e.g., side profiles, intricate details like teeth and nose), where other methods may distort or fail due to their strong facial priors.

Figure 8. Visual comparison of BFR methods on real-world datasets, showing restorations of LQ inputs by GPEN, GFP-GAN, VQFR, CodeFormer, DiffBIR, and other methods.

BID on Real-world Datasets

The following are the results from Table 5 of the original paper, showing quantitative comparisons on real-world datasets for BID.

| Methods | MUSIQ↑ | MANIQA↑ | CLIP-IQA↑ |
| --- | --- | --- | --- |
| CBDNet [17] | 48.1149 | 0.1103 | 0.4709 |
| DeamNet [41] | 45.9942 | 0.0949 | 0.4391 |
| Restormer [69] | 47.4605 | 0.0927 | 0.3857 |
| SwinIR [29] | 55.0493 | 0.1595 | 0.4130 |
| SCUNet-GAN [74] | 58.2170 | 0.1822 | 0.5045 |
| DiffBIR (s=0) | 69.7278 | 0.3404 | 0.7420 |

For BID, DiffBIR ($s=0$) significantly outperforms all baseline methods across MUSIQ, MANIQA, and CLIP-IQA metrics. This substantial difference is attributed to the powerful generative diffusion prior, enabling effective high-quality image restoration beyond mere noise removal. Visual comparisons (Figure 9) reveal that DiffBIR not only removes noise effectively but also generates realistic textures, which other methods often fail to do. Methods like SwinIR and SCUNet-GAN, while successfully denoising, tend to produce smoothed results lacking vivid texture details.

Figure 9. Visual comparisons for BID on real-world datasets, contrasting the low-quality (LQ) input with restorations by Restormer, SwinIR, SCUNet-GAN, and DiffBIR.

6.2. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies to validate the design choices and components of DiffBIR.

The Importance of Restoration Module

The following are the results from Table 6 of the original paper, showing an ablation study on the Restoration Module (RM).

| Datasets | Metrics | w/o RM | w/ RM |
| --- | --- | --- | --- |
| RealSRSet [73] | MANIQA↑ | 0.2386 | 0.2477 |
| | MUSIQ↑ | 62.5683 | 64.7319 |
| | CLIP-IQA↑ | 0.6818 | 0.7075 |
| ImageNet-Val-1k [10] | PSNR↑ | 22.8481 | 23.0078 |
| | SSIM↑ | 0.5039 | 0.5198 |
| | LPIPS↓ | 0.4076 | 0.4026 |

Removing the Restoration Module (RM), meaning directly finetuning the diffusion model with synthesized training pairs (one-stage model), leads to a noticeable performance drop across all IQA and reference-based metrics on both real-world (RealSRSet) and synthetic (ImageNet-Val-1k) datasets. Qualitatively (Figure 10, left), the one-stage model exhibits severe distortions, such as incorrect facial generation or misinterpreting degradations as semantic information (e.g., producing a colorful background or unusual eye shapes). The two-stage model (with RM) correctly generates facial content and more realistic results. This confirms that the Restoration Module is critical for providing clean, reliable conditions to the generation module, preventing it from being disturbed by degradation and enabling it to focus purely on content regeneration.
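
The cascade this ablation defends can be summarized in a few lines. The sketch below is an illustrative wiring of the two stages, assuming generic callables for the restoration module, VAE, and conditioned sampler rather than the authors' released interfaces.

```python
import torch

@torch.no_grad()
def diffbir_style_restore(lq_image, restoration_module, vae_encode, sampler, vae_decode):
    """Illustrative two-stage cascade (not the authors' released code):
    Stage I removes degradations; Stage II regenerates realistic detail with a
    latent-diffusion sampler conditioned on the cleaned intermediate image."""
    clean_intermediate = restoration_module(lq_image)   # Stage I: degradation removal
    cond_latent = vae_encode(clean_intermediate)        # encode the clean condition
    restored_latent = sampler(condition=cond_latent)    # Stage II: information regeneration
    return vae_decode(restored_latent)
```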

Figure 10. Visual comparison of ablation studies. (Left) DiffBIR w/o RM treats degradations as image content and performs poorly in maintaining fidelity; (Right) ControlNet [75] exhibits a color-shift problem that is addressed by our IRControlNet.

The Effectiveness of IRControlNet

The following are the results from Table 7 of the original paper, showing a comparison of ControlNet and ours in PSNR.

| Method (PSNR↑) | Set14 [70] | BSD100 [34] | manga109 [35] | ImageNet-Val-1k [10] |
| --- | --- | --- | --- | --- |
| w/ ControlNet | 20.9435 | 22.4923 | 20.2692 | 22.2874 |
| w/ IRControlNet | 23.5193 | 23.8778 | 23.2439 | 24.2534 |

Comparing IRControlNet with a standard ControlNet for BSR tasks reveals that ControlNet often produces results with color shifts (Figure 10, right). This issue is attributed to the lack of explicit regularization for color consistency during training and potentially less effective condition encoding. IRControlNet, by leveraging the pre-trained VAE encoder for condition encoding, effectively addresses this problem and achieves significantly higher PSNR scores across various datasets (Set14, BSD100, manga109, ImageNet-Val-1k). This empirical evidence validates IRControlNet's design as a more suitable backbone for BIR tasks within the diffusion framework.
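
A rough sketch of this conditioning path is given below: the frozen pre-trained VAE encodes the clean condition image, a trainable copy of the UNet encoder processes it together with the noisy latent, and zero-initialized projections inject the resulting features into the frozen UNet. All module names, channel sizes, and the feature-list interface here are illustrative assumptions, not the released implementation.

```python
import copy
import torch
import torch.nn as nn

class IRControlNetSketch(nn.Module):
    """Hedged sketch of an IRControlNet-style conditioning branch."""

    def __init__(self, frozen_vae_encoder, frozen_unet_encoder,
                 feat_channels=(320, 640, 1280), latent_channels=4):
        super().__init__()
        self.vae_encoder = frozen_vae_encoder.eval().requires_grad_(False)
        self.control_branch = copy.deepcopy(frozen_unet_encoder)   # trainable copy
        # Fuse the noisy latent z_t with the VAE-encoded condition latent.
        self.fuse = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=1)
        # Zero-initialized 1x1 projections so control starts as a no-op.
        self.zero_convs = nn.ModuleList(nn.Conv2d(c, c, kernel_size=1) for c in feat_channels)
        for conv in self.zero_convs:
            nn.init.zeros_(conv.weight)
            nn.init.zeros_(conv.bias)

    def forward(self, z_t, cond_image, t):
        with torch.no_grad():
            cond_latent = self.vae_encoder(cond_image)      # frozen VAE condition encoding
        h = self.fuse(torch.cat([z_t, cond_latent], dim=1))
        control_feats = self.control_branch(h, t)           # assumed to return multi-scale features
        # These are added to the frozen UNet's skipped features by the caller.
        return [zc(f) for zc, f in zip(self.zero_convs, control_feats)]
```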

The Effectiveness of Wide Degradation Range

The following are the results from Table 8 of the original paper, showing an ablation study on the degradation model evaluated on RealSRSet [73].

| Degradation | MANIQA↑ | MUSIQ↑ | CLIP-IQA↑ |
| --- | --- | --- | --- |
| Real-ESRGAN [56] | 0.2351 | 64.1718 | 0.6936 |
| Ours | 0.2504 | 64.7319 | 0.7075 |

The paper investigates the impact of the degradation model used to synthesize training conditions for the generation module. A classic first-order degradation model with a wide degradation range (ours) is compared against the complex degradation model from Real-ESRGAN [56] (which uses smaller ranges). The results in Table 8 demonstrate that using the proposed classic degradation model with a wide range leads to better utilization of the generative capabilities of the diffusion model, resulting in enhanced quality of the restored images as indicated by higher MANIQA, MUSIQ, and CLIP-IQA scores on RealSRSet. This suggests that diverse yet conceptually simpler degradations are more effective for training a robust generative prior for the second stage.
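
For intuition, a classic first-order degradation with deliberately wide parameter ranges can be synthesized as below; the specific blur, scale, noise, and JPEG ranges here are placeholders, not the exact values used in the paper.

```python
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def synthesize_condition(hr_image: Image.Image) -> Image.Image:
    """Sketch of a classic first-order degradation pipeline with wide ranges:
    blur -> down/up-sample -> Gaussian noise -> JPEG compression."""
    # 1) Gaussian blur with a wide sigma range.
    img = hr_image.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.1, 6.0)))

    # 2) Random-factor downsampling, then upsampling back to the original size.
    w, h = img.size
    scale = random.uniform(1.0, 8.0)
    img = img.resize((max(1, int(w / scale)), max(1, int(h / scale))), Image.BICUBIC)
    img = img.resize((w, h), Image.BICUBIC)

    # 3) Additive Gaussian noise with a wide level range.
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, random.uniform(1, 30), arr.shape)
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

    # 4) JPEG compression at a wide quality range.
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(30, 95))
    buf.seek(0)
    return Image.open(buf).convert("RGB")
```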

More Variants for IRControlNet (from Appendix)

The paper further explores two additional variants of IRControlNet (Figure 11):

Figure 11. Architectures of IRControlNet and two model variants: Variant 5 controls concatenated features, while Variant 6 applies SFT modulation, realizing different feature-processing mechanisms for information regeneration.

The following are the results from Table 9 of the original paper, showing quantitative comparisons of IRControlNet, Variant 5 and 6 on ImageNet1k-Val with Real-ESRGAN degradation.

| Variants | PSNR↑ | SSIM↑ | MANIQA↑ |
| --- | --- | --- | --- |
| IRControlNet | 22.9865 | 0.5200 | 0.2689 |
| Variant 5: w/ control concat features | 23.0449 | 0.5261 | 0.2567 |
| Variant 6: w/ SFT modulation | 22.9974 | 0.5292 | 0.2622 |
  • Variant 5 (w/ control concat features): This variant simultaneously controls middle block, decoder, and skipped features. While it achieves slightly higher PSNR and SSIM, its MANIQA score is worse than IRControlNet. This suggests that applying more control to the pre-trained model can enhance fidelity but might restrict the generative model's artistic freedom, leading to a slight drop in perceived generation quality.

  • Variant 6 (w/ SFT modulation): This variant uses Spatial Feature Transform (SFT) layers, which modulate features with learned per-pixel scale and shift parameters, offering more precise control. Similar to Variant 5, it improves SSIM and PSNR (fidelity) but results in a slightly lower MANIQA score, implying a trade-off.

    In conclusion from these variants, IRControlNet's choice to apply add-on control primarily to skipped features with additive modulation strikes a good balance, preserving most of the generative capability while allowing for effective conditioning. The region-adaptive restoration guidance then provides the flexible trade-off between quality and fidelity during inference.
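
The two injection styles compared above can be contrasted in a toy sketch; both modules below are illustrative stand-ins rather than the paper's implementation.

```python
import torch.nn as nn

class AdditiveControl(nn.Module):
    """IRControlNet-style injection (sketch): simply add the control feature
    to the frozen UNet's skipped feature."""
    def forward(self, skipped_feat, control_feat):
        return skipped_feat + control_feat

class SFTControl(nn.Module):
    """Variant-6-style injection (sketch): predict per-pixel scale and shift
    from the control feature (spatial feature transform)."""
    def __init__(self, channels):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, skipped_feat, control_feat):
        return skipped_feat * (1 + self.to_scale(control_feat)) + self.to_shift(control_feat)
```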

Qualitative comparisons for Variant 2 (without $z_t$ in the condition network) are shown in Figure 12:

Figure 12. Visual comparisons of Variant 2 and IRControlNet, with highlighted crops of the LQ input and the two restorations for close inspection of detail recovery.

IRControlNet generates more vivid textures, while Variant 2 tends to produce over-smoothed results, reinforcing the importance of including $z_t$ in the condition network for high-quality generation.

6.3. Quantitative Comparisons for Efficiency

The following are the results from Table 11 of the original paper, showing quantitative comparisons of inference efficiency and model complexity.

| Metrics | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | FeMaSR [5] | DASR [30] | StableSR [52] | PASD [66] | DiffBIR |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Inference Time (ms) | 46.19 | 46.42 | 126.44 | 89.01 | 12.69 | 19278.46 | 16951.08 | 10906.51 |
| Model Size (M) | 16.69 | 16.69 | 11.71 | 34.05 | 8.06 | 1409.11 | 1675.76 | 1716.7 |

This comparison evaluates inference speed and model complexity for super-resolution (input $128 \times 128$, scale factor 4). DiffBIR is the most efficient among the diffusion-model (DM)-based baselines, being approximately 1.8x faster than StableSR and 1.6x faster than PASD. This indicates an optimization in its diffusion sampling process. However, GAN-based methods (Real-ESRGAN+, BSRGAN, DASR, FeMaSR) are significantly more efficient in terms of inference time, often completing inference in tens of milliseconds, whereas DM-based methods, including DiffBIR, require thousands of milliseconds (seconds). This is an inherent trade-off, as DM-based methods typically involve multiple sequential sampling steps (DiffBIR uses 50 steps) which are computationally intensive. The model size of DiffBIR (1716.7M) is comparable to other DM-based methods (StableSR: 1409.11M, PASD: 1675.76M) but much larger than GAN-based methods (e.g., DASR: 8.06M). The authors acknowledge the computational expense of DM-based methods but point to rapid advancements in the field, with new works ([33, 44]) achieving satisfactory generation with as few as 1-4 steps, suggesting that the time-consuming problem will likely be alleviated in the future.
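
For reference, per-image inference time on a $128 \times 128$ input can be measured with a simple wall-clock benchmark like the one below (an illustrative protocol only; hardware, batch size, and warm-up choices all affect the reported milliseconds).

```python
import time
import torch

@torch.no_grad()
def time_restoration(restorer, size=128, warmup=3, runs=10, device="cuda"):
    """Average wall-clock time (ms) per image for any callable restorer,
    roughly matching the 128x128-input protocol behind Table 11."""
    x = torch.randn(1, 3, size, size, device=device)
    for _ in range(warmup):
        restorer(x)                      # warm-up runs (kernel launch, caches)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(runs):
        restorer(x)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / runs * 1000.0
```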

6.4. More Quantitative and Qualitative Comparisons for BSR on Synthetic Datasets (from Appendix)

The following are the results from Table 10 of the original paper, showing quantitative comparisons on DRealSR [62] and RealSR [3].

| Datasets | Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) | DiffBIR (s=0.5) | DiffBIR (s=1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DRealSR [62] | PSNR↑ | 23.1977 | 26.3844 | 24.6878 | 25.6903 | 25.3898 | 23.8669 | 24.2037 | 24.8735 | 24.9891 | 25.6238 |
| | SSIM↑ | 0.6239 | 0.7271 | 0.6705 | 0.6765 | 0.6962 | 0.6400 | 0.6529 | 0.5874 | 0.6246 | 0.6544 |
| | LPIPS↓ | 0.2190 | 0.1793 | 0.2229 | 0.2308 | 0.2057 | 0.2355 | 0.2016 | 0.2448 | 0.2328 | 0.2350 |
| | MUSIQ↑ | 68.7458 | 66.0651 | 67.4608 | 68.9388 | 68.1393 | 69.2621 | 70.7670 | 72.3514 | 71.5339 | 69.8821 |
| | MANIQA↑ | 0.3073 | 0.2048 | 0.2315 | 0.2309 | 0.2350 | 0.2565 | 0.2889 | 0.3915 | 0.3847 | 0.3530 |
| | CLIP-IQA↑ | 0.6327 | 0.5086 | 0.5022 | 0.5280 | 0.5244 | 0.5988 | 0.6151 | 0.6878 | 0.6761 | 0.6440 |
| RealSR [3] | PSNR↑ | 23.1627 | 25.5503 | 24.2400 | 24.9717 | 24.6244 | 23.5627 | 24.5385 | 23.5237 | 24.2216 | 24.7531 |
| | SSIM↑ | 0.6653 | 0.7183 | 0.6793 | 0.6839 | 0.7051 | 0.6655 | 0.6694 | 0.5599 | 0.6346 | 0.6615 |
| | LPIPS↓ | 0.2520 | 0.2397 | 0.2556 | 0.2545 | 0.2340 | 0.2429 | 0.2317 | 0.2646 | 0.2544 | 0.2565 |
| | MUSIQ↑ | 66.1208 | 59.5565 | 66.7333 | 68.0673 | 67.0964 | 68.4594 | 70.0043 | 72.3909 | 71.3969 | 69.5167 |
| | MANIQA↑ | 0.2652 | 0.1713 | 0.2243 | 0.2329 | 0.2281 | 0.2407 | 0.2746 | 0.3820 | 0.3792 | 0.3504 |
| | CLIP-IQA↑ | 0.5925 | 0.4300 | 0.4787 | 0.5233 | 0.4920 | 0.5852 | 0.5822 | 0.6868 | 0.6817 | 0.6648 |

The trends observed on DRealSR and RealSR datasets are consistent with those on DIV2K-Val. DiffBIR ($s=0$) achieves the highest IQA scores across all metrics. For PSNR, DiffBIR ($s=1$) performs comparably to GAN-based methods and better than other diffusion-based methods, demonstrating a good balance between quality and fidelity. Visually (Figure 13), DiffBIR is shown to correctly recover semantic information and intricate details (e.g., eyes behind a helmet, firework lines, penguin wings) where GAN-based methods produce overly smoothed results, and other diffusion-based methods fail to generate correct semantics due to severe degradation.

Figure 13. Visual comparisons of BSR methods on the synthetic dataset DIV2K-Val [1], contrasting the LQ input with GT, DASR, Real-ESRGAN+, BSRGAN, SwinIR-GAN, StableSR, PASD, and DiffBIR, and highlighting differences in detail recovery.

6.5. More Real-world Visual Comparisons (from Appendix)

The appendix provides additional visual comparisons for BSR, BID, and BFR tasks on real-world datasets, further solidifying DiffBIR's qualitative superiority.

  • BSR (Figure 14): Shows more examples where DiffBIR produces sharper, more detailed, and semantically correct reconstructions compared to baselines.

    Figure 14. More visual comparisons for BSR on real-world datasets, covering FeMaSR, DiffBIR, and other baselines on additional low-quality inputs.

  • BID (Figure 15): Illustrates DiffBIR's ability to effectively remove noise while restoring realistic textures, which other denoisers often smooth out.

    Figure 15. More visual comparisons for BID on real-world datasets, showing LQ inputs alongside restorations by CBDNet, DeamNet, Restormer, SwinIR, SCUNet-GAN, and DiffBIR.

  • BFR (Figure 16): Presents additional cases where DiffBIR accurately restores faces, even in challenging conditions like extreme angles or occlusions, demonstrating its robust generalization.

    Figure 16. More visual comparisons for BFR on real-world datasets, highlighting DiffBIR's handling of complex details relative to other BFR methods.

6.6. Interpretation of Results and Advantages

The comprehensive experiments and ablation studies highlight DiffBIR's key advantages:

  • Unified and Generalizable: It successfully tackles three distinct BIR tasks within one framework, overcoming the specialization limitations of previous methods.

  • Superior Generative Quality: Leveraging the latent diffusion prior with IRControlNet allows DiffBIR to generate significantly more realistic and visually pleasing details, as evidenced by its leading IQA and FID scores.

  • Robust Degradation Handling: The two-stage decoupling ensures stability. By first removing degradations, the generation module receives clean conditions, preventing artifacts that plague direct approaches.

  • User-Centric Control: The training-free region-adaptive restoration guidance is a powerful feature, allowing users to explicitly control the fidelity-quality trade-off, catering to diverse preferences.

  • Efficient Diffusion Integration: IRControlNet optimizes the control mechanism for Stable Diffusion, demonstrating better performance than direct ControlNet adaptations and improved efficiency among DM-based baselines.

    The main disadvantage, as acknowledged by the authors, is the computational expense of DM-based inference (50 sampling steps). However, this is a general limitation of current diffusion models, which is actively being addressed in the research community.

7. Conclusion & Reflections

7.1. Conclusion Summary

DiffBIR introduces a novel and highly effective unified framework for blind image restoration (BIR), successfully addressing blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks. The core innovation lies in its two-stage decoupled pipeline, which first performs degradation removal to obtain a high-fidelity intermediate image and then information regeneration using a powerful generative latent diffusion model. The proposed IRControlNet effectively leverages the Stable Diffusion prior, ensuring stable and realistic detail generation by using a VAE encoder for robust condition encoding. Furthermore, the training-free region-adaptive restoration guidance allows users to flexibly balance image quality (realness) and fidelity (faithfulness to input) during inference. Extensive experiments demonstrate DiffBIR's state-of-the-art performance across both synthetic and real-world datasets, consistently achieving superior IQA metrics and FID scores compared to existing methods.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation:

  • Computational Expense: DiffBIR requires 50 sampling steps for image restoration, which is computationally expensive and makes real-time applications challenging.

    They suggest future research directions:

  • Exploration of Other BIR Tasks: The two-stage restoration pipeline is a general concept, suggesting its applicability and potential for further exploration in other BIR tasks beyond BSR, BFR, and BID.

  • Faster Diffusion Models: The time-consuming nature of diffusion models is an active research area, and advancements in faster sampling techniques (e.g., [33, 44] cited in the paper) could significantly alleviate this limitation in future iterations of DiffBIR.

7.3. Personal Insights & Critique

DiffBIR represents a significant step forward in blind image restoration by effectively harnessing the superior generative capabilities of diffusion models. The two-stage decoupling is particularly insightful. It directly addresses the fundamental challenge of BIR where degradation and content are entangled, providing a clean separation of concerns that boosts stability and performance. This makes the generative model's task much clearer: just add realistic details, don't worry about removing noise. This clarity allows IRControlNet to shine.

The choice to adapt ControlNet for IR tasks is clever, and the detailed ablation studies on IRControlNet's components are valuable. The finding that the pre-trained VAE encoder is crucial for stable and color-consistent condition encoding, and that including the noisy latent $z_t$ in the condition network aids convergence and detail generation, offers practical guidance for future diffusion-based IR methods. The region-adaptive restoration guidance is another strong point, providing a much-needed user control mechanism without additional training. This level of flexibility is often missing in IR models, allowing users to tailor outputs to specific aesthetic or fidelity requirements.

Potential Issues/Areas for Improvement:

  1. Dependence on Stage I: While decoupling is an advantage, the quality of Stage II is inherently dependent on the fidelity and degradation removal capabilities of Stage I. If the chosen Restoration Module (RM) for a specific task performs poorly, it would directly impact the final output, regardless of IRControlNet's generative power. The paper assumes "off-the-shelf" RMs are good, but real-world complexity might still challenge them.
  2. Generalizability of Stage I: The paper states using "separate restoration modules instead of a general one for different BIR tasks" in Stage I to maintain expertise. While practical, this slightly undercuts the "unified framework" ideal. Future work could explore a single, more robust and generalizable Restoration Module that can handle a broader range of degradations for Stage I, further simplifying the pipeline.
  3. Black-Box Nature of IQA Metrics: While crucial for evaluating perceptual quality, MUSIQ, MANIQA, and CLIP-IQA are complex deep-learning-based metrics. Their exact mechanisms and why one performs better than another can sometimes be difficult to interpret, leading to less actionable insights for model improvements compared to traditional metrics.
  4. Sampling Steps vs. Quality: The trade-off between the number of sampling steps and output quality/inference time is critical. While the authors are optimistic about future faster diffusion models, this remains a practical barrier for real-time applications. A more explicit analysis of DiffBIR's performance with fewer sampling steps (e.g., 4-8 steps like emerging fast diffusion models) would be beneficial.

Transferability and Applications: The decoupled two-stage architecture has high transferability. This principle could be applied to other conditional image generation tasks where the input condition is noisy or contains irrelevant information that could disturb the generative model. For example, in text-to-image generation, if the text prompt is ambiguous or contains conflicting instructions, a similar pre-processing (or "clarification") stage could potentially improve generation stability. In image editing tasks, removing unwanted elements first before generating new content could lead to cleaner results. The region-adaptive guidance mechanism is also broadly applicable to any generative model that can be guided via gradients, offering fine-grained control for creative applications or specific industrial requirements.
