DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior
TL;DR Summary
DiffBIR is a unified image restoration pipeline for blind image tasks, involving degradation removal and information regeneration. It utilizes IRControlNet for realistic detail generation and introduces region-adaptive guidance for a user-tunable balance between realness and fidelity.
Abstract
We present DiffBIR, a general restoration pipeline that could handle different blind image restoration tasks in a unified framework. DiffBIR decouples blind image restoration problem into two stages: 1) degradation removal: removing image-independent content; 2) information regeneration: generating the lost image content. Each stage is developed independently but they work seamlessly in a cascaded manner. In the first stage, we use restoration modules to remove degradations and obtain high-fidelity restored results. For the second stage, we propose IRControlNet that leverages the generative ability of latent diffusion models to generate realistic details. Specifically, IRControlNet is trained based on specially produced condition images without distracting noisy content for stable generation performance. Moreover, we design a region-adaptive restoration guidance that can modify the denoising process during inference without model re-training, allowing users to balance realness and fidelity through a tunable guidance scale. Extensive experiments have demonstrated DiffBIR's superiority over state-of-the-art approaches for blind image super-resolution, blind face restoration and blind image denoising tasks on both synthetic and real-world datasets. The code is available at https://github.com/XPixelGroup/DiffBIR.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is "DiffBIR: Towards Blind Image Restoration with Generative Diffusion Prior". It focuses on developing a unified framework for blind image restoration tasks, leveraging the generative capabilities of diffusion models.
1.2. Authors
The authors are: Xinqi Lin, Jingwen He, Ziyan Chen, Zhaoyang Lyu, Bo Dai, Fanghua Yu, Wanli Ouyang, Yu Qiao, and Chao Dong. Their affiliations include Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shanghai AI Laboratory, and The Chinese University of Hong Kong. This suggests a collaborative effort from research institutions with strong backgrounds in computer vision, artificial intelligence, and image processing.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server. Its publication status is a preprint as of 2023-08-29T07:11:52.000Z. While arXiv itself is not a peer-reviewed journal or conference, it is a widely recognized platform for disseminating cutting-edge research in fields like computer science, often preceding formal peer-reviewed publication.
1.4. Publication Year
The paper was published in 2023.
1.5. Abstract
DiffBIR proposes a novel, unified pipeline for various blind image restoration (BIR) tasks. It decouples the BIR problem into two distinct stages: 1) degradation removal, which targets image-independent content, and 2) information regeneration, which focuses on generating lost image content. Each stage is developed independently but functions seamlessly in a cascaded manner. For the first stage, existing restoration modules are employed to remove degradations and produce high-fidelity intermediate results. For the second stage, the paper introduces IRControlNet, which harnesses the generative power of latent diffusion models to create realistic details. IRControlNet is specifically trained using carefully prepared condition images, devoid of distracting noisy content, to ensure stable generation. Furthermore, a training-free region-adaptive restoration guidance mechanism is designed to adjust the denoising process during inference. This guidance allows users to fine-tune the balance between realness (quality) and fidelity through a tunable guidance scale. Extensive experiments demonstrate DiffBIR's superior performance over state-of-the-art methods in blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks across both synthetic and real-world datasets.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2308.15070 PDF Link: https://arxiv.org/pdf/2308.15070v3.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by DiffBIR is blind image restoration (BIR). Image restoration aims to reconstruct a high-quality image from its low-quality observation. Traditionally, this involves image denoising, deblurring, and super-resolution under controlled settings where the degradation process is simple and known (e.g., bicubic downsampling). However, these traditional methods exhibit limited generalization ability when faced with real-world degraded images, which often contain unknown and complex degradations. BIR aims to achieve realistic image reconstruction on general images with general, unknown degradations. This is a significant challenge due to the entanglement of degradation and content information in low-quality images.
Prior research in BIR tasks like blind image super-resolution (BSR) and blind image denoising (BID) often formulates the problem as supervised large-scale degradation overfitting, primarily relying on Generative Adversarial Networks (GANs). While robust in degradation removal, these methods frequently struggle to generate truly realistic details due to the inherent limitations in their generative capabilities. Blind face restoration (BFR) methods have shown remarkable success by incorporating powerful generative facial priors (e.g., StyleGAN, VQGAN), but they are restricted to face images and fixed input sizes, lacking applicability to general images. More recently, denoising diffusion probabilistic models (DDPMs) have demonstrated outstanding performance in image generation. However, existing zero-shot image restoration (ZIR) methods using diffusion models, while generating realistic results for specific degradations, cannot generalize well to unknown or complex real-world degradations, meaning they can handle general images but not general degradations.
The paper's entry point and innovative idea stem from the observation that directly using low-quality (LQ) images as conditions for conditional image generation (e.g., with diffusion models) leads to instability and artifacts. This is because degradation and content information are entangled, and the generation process is disturbed by unreliable condition information. The paper motivates a decoupled approach to address this, aiming to separate degradation removal from content generation to achieve a stable and unified solution for diverse BIR tasks.
2.2. Main Contributions / Findings
The primary contributions and key findings of the paper are:
- Decoupled Two-Stage Pipeline for Unified BIR: DiffBIR introduces a novel two-stage pipeline that effectively decouples the BIR problem into degradation removal and information regeneration. This design allows for the first-time achievement of state-of-the-art performance across blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks within a single, unified framework. This decoupling strategy enhances stability and flexibility by ensuring the generation module only operates on image content, undisturbed by degradations.
- Novel IRControlNet for Realistic Regeneration: The paper proposes IRControlNet, a generative module that leverages the power of pre-trained text-to-image latent diffusion models (specifically, Stable Diffusion) for realistic image reconstruction. Through comprehensive exploration of critical components (condition encoding, condition network, feature modulation), IRControlNet is identified as a robust backbone for the generation stage. Notably, it utilizes the pre-trained VAE encoder for condition encoding and a copied, auxiliary encoder with zero-initialization for efficient and stable control, addressing issues like color shifts seen in baseline ControlNet adaptations.
- Training-Free Region-Adaptive Restoration Guidance: DiffBIR introduces a training-free region-adaptive restoration guidance module that operates during the sampling process. This module allows users to flexibly balance quality (realness) and fidelity based on their preferences. By minimizing a specially designed region-adaptive Mean Squared Error (MSE) loss between the generated result and a high-fidelity guidance image, it influences low-frequency regions more heavily towards fidelity while preserving generative ability in high-frequency regions, offering fine-grained control without requiring model re-training.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand DiffBIR, a foundational grasp of several computer vision and deep learning concepts is essential:
- Image Restoration (IR): The general task of recovering a high-quality (HQ) image from a degraded low-quality (LQ) observation. Degradations can include noise, blur, low resolution, etc.
  - Image Denoising: Removing unwanted noise from an image.
  - Image Deblurring: Reversing the effect of blur in an image.
  - Image Super-Resolution (SR): Enhancing the resolution of an image, typically generating a high-resolution image from a low-resolution input.
  - Blind Image Restoration (BIR): A more challenging variant where the degradation process is unknown or complex, mimicking real-world scenarios. This is the primary focus of DiffBIR.
- Deep Learning Models:
  - Convolutional Neural Networks (CNNs): A class of deep neural networks commonly applied to visual imagery. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
  - Transformers: Initially developed for natural language processing, Transformers use self-attention mechanisms to weigh the importance of different parts of the input. They have been adapted for vision tasks, such as in Swin Transformer.
  - U-Net: A convolutional network architecture known for its U-shaped structure, featuring an encoder (downsampling path) to capture context and a decoder (upsampling path) to enable precise localization. It is widely used in image-to-image tasks.
- Generative Models: Models capable of generating new data samples that resemble the training data.
  - Generative Adversarial Networks (GANs): Comprise two networks: a Generator that creates synthetic data, and a Discriminator that tries to distinguish real data from generated data. They are trained in an adversarial manner. GANs have been popular for image generation and restoration tasks due to their ability to produce realistic-looking images.
  - Denoising Diffusion Probabilistic Models (DDPMs): A class of generative models that learn to reverse a gradual diffusion (noise-adding) process.
    - Forward Diffusion Process: Gradually adds Gaussian noise to an image over several time steps, eventually transforming it into pure noise.
    - Reverse Denoising Process: A neural network (often a U-Net) is trained to predict and remove the noise at each step, starting from pure noise and gradually reconstructing a clean image.
  - Latent Diffusion Models (LDMs) / Stable Diffusion: An advancement of DDPMs that performs the diffusion process in a compressed latent space rather than the pixel space. This makes them significantly more efficient in terms of computation and memory. They use an autoencoder to map images to and from the latent space. Stable Diffusion is a popular large-scale text-to-image LDM.
- Autoencoders and VAEs:
  - Autoencoder: A neural network that learns to encode data into a lower-dimensional latent representation (encoder) and then decode it back to the original data (decoder). The goal is to learn an efficient data representation.
  - Variational Autoencoder (VAE): A type of autoencoder that learns a probabilistic mapping from the input data to a continuous latent space, allowing for sampling new data points. In LDMs, the VAE encoder $\mathcal{E}$ converts an image to its latent representation, and the VAE decoder $\mathcal{D}$ reconstructs the image from the latent.
- ControlNet: An architecture that allows for adding conditional control to pre-trained text-to-image diffusion models like Stable Diffusion. It works by taking a copy of the diffusion model's U-Net encoder, making it trainable as a "condition network," while keeping the original U-Net frozen. It uses zero convolutions to connect the trainable and frozen parts, ensuring minimal interference with the pre-trained weights during initial training.
- Loss Functions:
  - Mean Squared Error (MSE) Loss / L2 Loss: Measures the average of the squares of the errors (the difference between predicted and actual values). Often used for fidelity, minimizing pixel-wise differences: $\mathcal{L}_{MSE} = \frac{1}{N} \sum_{i=1}^N (y_i - \hat{y}_i)^2$.
  - L1 Loss: Measures the absolute differences between predicted and actual values: $\mathcal{L}_{L1} = \frac{1}{N} \sum_{i=1}^N |y_i - \hat{y}_i|$. Often produces sharper images than MSE but can be more sensitive to outliers.
  - Perceptual Loss (LPIPS): Measures the perceptual similarity between two images by comparing their high-level feature representations extracted from a pre-trained deep neural network (e.g., VGG). This aligns better with human perception than pixel-wise losses.
  - Adversarial Loss: Used in GANs to train the generator to produce realistic images that can fool the discriminator.
- Image Quality Assessment (IQA) Metrics:
  - Reference-based Metrics: Compare a restored image to a ground-truth (reference) image.
    - PSNR (Peak Signal-to-Noise Ratio): Quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise. Higher PSNR indicates better quality.
    - SSIM (Structural Similarity Index Measure): Measures the similarity between two images based on luminance, contrast, and structure. Closer to 1 indicates better quality.
  - No-Reference Metrics: Assess image quality without a ground-truth reference, often relying on learned models of human perception. Examples include MANIQA, MUSIQ, CLIP-IQA.
  - FID (Fréchet Inception Distance): A metric used to assess the quality of images generated by generative models. It measures the distance between the feature distributions of real and generated images using activations from an Inception-v3 network. Lower FID indicates higher quality and diversity.
3.2. Previous Works
The paper contextualizes DiffBIR by reviewing existing approaches across various Blind Image Restoration (BIR) tasks:
- Blind Image Super-Resolution (BSR):
  - Initial Approaches: Focused on formulating BSR as a supervised large-scale degradation overfitting problem.
  - BSRGAN [73]: Proposed a random shuffling strategy for synthesizing more practical degradations.
  - Real-ESRGAN [56]: Exploited "high-order" degradation modeling to simulate complex real-world degradations, often using GANs with adversarial and reconstruction losses. These methods are robust in degradation removal but often lack in generating realistic details.
  - SwinIR-GAN [29]: Utilized Swin Transformer as a backbone for improved performance.
  - FeMaSR [5]: Formulated SR as a feature-matching problem based on pre-trained VQGAN.
  - Diffusion-based BSR: Recent works like StableSR [52] and PASD [66] leverage Stable Diffusion. StableSR designs a time-aware encoder, while PASD proposes a PACA module to inject pixel-level conditions effectively.
  - Limitation: Most of these methods require re-training for different image restoration tasks, lacking a unified framework.
- Blind Face Restoration (BFR):
  - GAN-prior-based methods [4, 18, 55, 65]: Incorporated powerful generative priors (e.g., StyleGAN) to reconstruct faces with high quality and fidelity. Examples include GFPGAN [55].
  - VQ-dictionary learning methods [16, 59, 77]: Introduced high-quality codebooks (e.g., VQGAN) to generate surprisingly realistic facial details. CodeFormer [77] is a prominent example.
  - Diffusion-prior-based methods [61, 63, 67]: Latest advances like DifFace [67], DR2 [61], and PGDiff [63] leverage diffusion models for robust face restoration.
  - Limitation: These methods are typically restricted to face images with fixed input sizes, making them unsuitable for general image restoration.
- Blind Image Denoising (BID):
  - Deep CNNs: DnCNN [71] is an end-to-end deep CNN for Gaussian denoising.
  - GAN-based methods: GCBD [7] uses GANs for noise modeling, and CBDNet [17] uses a more realistic noise model and real-world noisy-clean pairs.
  - Variational methods: VDNet [68] proposes simultaneous noise estimation and denoising.
  - SwinConv-UNet: SCUNet [74] is a state-of-the-art method that designs a practical noise degradation model and uses L1 or adversarial loss.
  - Limitation: While effective at noise removal, these methods often produce overly smooth results, lacking realistic texture details.
- Zero-shot Image Restoration (ZIR):
  - GAN-based ZIR [2, 9, 36, 39]: Focused on searching latent codes within a pre-trained GAN's latent space.
  - Diffusion-based ZIR [23, 57, 14]: Employ DDPMs as priors. DDRM [23] uses an SVD-based approach for linear tasks. DDNM [57] analyzes range-null space decomposition. GDP [14] introduces classifier guidance.
  - Limitation: ZIR methods, while leveraging powerful generative priors, are generally limited to clearly defined degradations and cannot generalize well to the unknown, complex degradations found in real-world low-quality images.
3.3. Technological Evolution
The field of image restoration has evolved significantly:
- Classical Methods: Early methods relied on signal processing techniques, statistical models, and hand-crafted priors (e.g., non-local means for denoising). These were often limited by their reliance on explicit mathematical models of degradation.
- Deep Learning (Constrained IR): The advent of CNNs revolutionized IR. Methods like SRCNN [12] for super-resolution or DnCNN [71] for denoising, trained on large datasets with known degradations, achieved impressive results, significantly outperforming classical methods.
- Deep Learning (Blind IR - GAN-based): Recognizing the limitations of known degradations, research shifted to BIR. GANs became prominent, especially in BSR (e.g., ESRGAN [54], Real-ESRGAN [56]) and BFR (e.g., GFPGAN [55]). These leveraged adversarial training to generate more perceptually realistic textures, often by training on synthesized data designed to mimic real-world degradations. However, GANs can suffer from training instability, mode collapse, and sometimes generate artifacts.
- Deep Learning (Blind IR - Diffusion-based): Most recently, denoising diffusion models have emerged as powerful generative models, surpassing GANs in image quality and diversity. Their application to IR initially focused on zero-shot restoration for known degradations. This paper positions itself at the forefront of diffusion-based BIR, aiming to combine the robustness of GAN-based BIR (in handling unknown degradations) with the superior generative capabilities of diffusion models.
3.4. Differentiation Analysis
DiffBIR differentiates itself from previous works in several key ways:
- Unified Framework for Diverse BIR Tasks: Unlike most prior works that specialize in one task (e.g., BSR, BFR, or BID) and require re-training for others, DiffBIR provides a single, unified pipeline applicable to all three. This significantly improves versatility and reduces development overhead.
- Decoupled Two-Stage Approach:
  - Problem: Previous GAN-based BIR and diffusion-based ZIR methods often struggle with entangled degradation and content information, leading to artifacts or limited generalization to unknown degradations. Directly feeding LQ images as conditions to generative models (ControlNet) can lead to instability and artifacts.
  - DiffBIR's Solution: It explicitly decouples the problem into degradation removal (Stage I) and information regeneration (Stage II). This ensures that the generation module receives clean, content-focused conditions, leading to more stable and higher-quality outputs. This is a crucial distinction from direct end-to-end approaches or those that do not explicitly handle the "blind" aspect in a decoupled manner for generation.
- Robust IRControlNet for General Images:
  - Problem: BFR methods achieve great results by leveraging strong face priors but are limited to faces. Generic BSR methods often lack the generative power for truly realistic details. Prior ControlNet adaptations for IR (e.g., StableSR, PASD) might still face issues with color consistency or optimal condition encoding.
  - DiffBIR's Solution: IRControlNet is specifically designed for general image content regeneration by leveraging Stable Diffusion. Its innovative use of the pre-trained VAE encoder for condition encoding and an efficient ControlNet-like structure (copied UNet encoder with zero-initialization, modulating skipped features) effectively addresses issues like color shifts and ensures stable, high-quality generation across diverse image types.
- Training-Free Controllable Trade-off:
  - Problem: Many BIR methods struggle to offer a flexible balance between fidelity (being faithful to the input degraded image) and quality/realness (generating perceptually pleasing details). Users often have different preferences for this trade-off.
  - DiffBIR's Solution: The region-adaptive restoration guidance is a novel, training-free module applied during inference. It allows users to dynamically adjust a guidance scale to emphasize fidelity or quality, and intelligently applies this guidance more strongly to low-frequency regions (where fidelity is paramount) while allowing high-frequency regions to benefit more from the generative prior. This offers unprecedented user control.
4. Methodology
4.1. Principles
DiffBIR operates on the core principle that blind image restoration (BIR) can be effectively solved by decoupling the complex problem into two distinct, manageable stages: degradation removal and information regeneration. The intuition behind this is that real-world degraded images contain both content information and image-independent degradation artifacts that are deeply entangled. Directly feeding such noisy images to a generative model as conditions leads to instability and artifacts, as the model struggles to distinguish between content to preserve and degradation to remove.
The proposed two-stage pipeline addresses this:
- Degradation Removal (Stage I): This stage focuses on producing a high-fidelity intermediate image by removing the various degradations (e.g., noise, blur, compression artifacts) present in the low-quality input. The key is to obtain a "cleaner" representation of the image content, even if it is still somewhat smooth or lacking fine details. This module acts as a "condition preprocessor" for the next stage.
- Information Regeneration (Stage II): With a relatively clean (high-fidelity) condition image from Stage I, this stage leverages a powerful generative latent diffusion model to synthesize lost or missing realistic details and textures. By conditioning on a degradation-free representation, the generative model can focus purely on adding plausible, high-quality content without being distracted or misled by artifacts.

Additionally, a training-free controllable module is integrated to allow users to achieve a flexible trade-off between fidelity (closeness to the original content of the input) and quality/realness (perceptual realism of generated details). This is achieved through region-adaptive restoration guidance, which modifies the denoising process during inference.
The entire pipeline is illustrated in Figure 3 from the original paper.
Figure 3: Schematic of the DiffBIR restoration pipeline: low-quality (LQ) images are processed by the Restoration Module to produce high-fidelity images, which are then passed to the Generation Module for information regeneration; the figure also shows the role of region-adaptive restoration guidance during the restoration process.
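To make the two-stage flow described above concrete, here is a hedged, high-level sketch of the inference path; it is not the official DiffBIR API, and `restoration_module`, `vae`, and `ircontrolnet_sampler` are placeholder callables standing in for the Stage I model, the Stable Diffusion autoencoder, and the IRControlNet-guided sampler.

```python
# Hedged sketch of the two-stage DiffBIR inference flow (placeholder module names).
def diffbir_restore(lq_image, restoration_module, vae, ircontrolnet_sampler, guidance_scale=0.0):
    i_rm = restoration_module(lq_image)          # Stage I: degradation removal
    c_rm = vae.encode(i_rm)                      # condition latent c_RM = E(I_RM)
    z0 = ircontrolnet_sampler(                   # Stage II: diffusion sampling with
        cond_latent=c_rm,                        # region-adaptive restoration guidance
        guidance_image=i_rm,                     # (scale s trades fidelity vs. realness)
        guidance_scale=guidance_scale,
    )
    return vae.decode(z0)                        # decode the clean latent back to pixels
```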
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Restoration Module
In the first stage, the goal is to remove distracting degradations from low-quality images without generating new content. Since different BIR tasks (e.g., BID, BFR, BSR) have distinct degradation characteristics and dataset specifics, DiffBIR uses separate, task-specific restoration modules during inference to leverage their specialized expertise. These are typically off-the-shelf BIR models trained with MSE loss. For example:
- For BSR, BSRNet [73] is used.
- For BFR, SwinIR [29] (as used in DifFace [67]) is adopted.
- For BID, SCUNet-PSNR [74] is utilized.

However, for training the subsequent generation module, a stable and diversified set of condition images is required. To this end, DiffBIR trains an additional dedicated Restoration Module (RM). This RM is trained using a classic degradation model with a wide degradation range and MSE loss. This wide range helps generate sufficiently diversified condition images, which in turn improves the overall generative capacity of the generation module. This RM (used for condition generation during training) is discarded during inference.
The MSE loss for training this specific RM is defined as:
$
\mathcal { L } _ { R M } = | | I _ { R M } - I _ { h q } | | _ { 2 } ^ { 2 }
$
Here:
- $I_{hq}$ denotes the high-quality ground-truth image.
- $I_{lq}$ denotes the synthesized low-quality counterpart of $I_{hq}$.
- $I_{RM}$ denotes the restored image produced by the Restoration Module from $I_{lq}$.
- $||\cdot||_2^2$ represents the squared Euclidean (L2) norm, which calculates the sum of squared differences between corresponding pixels of the two images.

This RM (for training condition images) acts as a preprocessing step, ensuring that the condition images for the generation module are reliable and not contaminated by complex degradations, which, as illustrated in Figure 2, can cause unpleasant artifacts.
Figure 2: Comparison of generation results under different conditioning. The first row shows low-quality (LQ) images; the second row uses the LQ images directly as conditions, exhibiting visual artifacts caused by different degradations; the last row shows DiffBIR's results, which are noticeably more stable.
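For illustration, a minimal sketch of one RM training step under the loss $\mathcal{L}_{RM}$ described above; `restoration_module`, `synthesize_wide_degradation`, and `optimizer` are assumed placeholders, and the degradation synthesis is only loosely hinted at.

```python
import torch

def rm_training_step(restoration_module, optimizer, i_hq, synthesize_wide_degradation):
    i_lq = synthesize_wide_degradation(i_hq)   # e.g. blur / resize / noise with wide ranges (assumed)
    i_rm = restoration_module(i_lq)            # degradation-removed condition image
    loss = torch.mean((i_rm - i_hq) ** 2)      # L_RM = || I_RM - I_hq ||_2^2 (MSE)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```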
4.2.2. Generation Module
The generation module leverages the power of pre-trained large-scale text-to-image latent diffusion models, specifically Stable Diffusion 2.1-base. Given the reliable condition image (from Stage I), this module aims to generate realistic details. The core of this module is IRControlNet, which involves three main aspects: condition encoding, condition network, and feature modulation.
Preliminary: Stable Diffusion
Stable Diffusion is built upon an autoencoder [26] that efficiently converts an image $x$ into a latent representation $z = \mathcal{E}(x)$ using an encoder $\mathcal{E}$, and reconstructs it using a decoder $\mathcal{D}$. Both the diffusion (noise-adding) and denoising (noise-removing) processes occur in this latent space.
In the diffusion process, Gaussian noise with variance $\beta_t$ is incrementally added to the encoded latent $z$ at time step $t$ to produce a noisy latent $z_t$:
$
z _ { t } = \sqrt { \bar { \alpha } _ { t } } z + \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon
$
Here:
- $z_t$ is the noisy latent at time $t$.
- $z$ is the clean latent (from $z = \mathcal{E}(x)$).
- $\epsilon \sim \mathcal{N}(0, I)$ is a random sample from a standard Gaussian distribution (mean 0, variance 1).
- $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$. These terms control the amount of noise added. As $t$ increases, $z_t$ progressively becomes more noisy, approaching a standard Gaussian distribution when $t$ is large enough.
A neural network, denoted as $\epsilon_\theta$, is trained to predict the noise $\epsilon$ that was added to $z$. This prediction is conditioned on context $c$ (e.g., text prompts) and the current time step $t$. The optimization objective for the latent diffusion model is to minimize the squared difference between the actual noise and the predicted noise:
$
\mathcal { L } _ { l d m } = \mathbb { E } _ { z , c , t , \epsilon } [ | | \epsilon - \epsilon _ { \theta } ( \sqrt { \bar { \alpha } _ { t } } z + \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon , c , t ) | | _ { 2 } ^ { 2 } ]
$
Here:
- $\mathcal{L}_{ldm}$ is the latent diffusion model's training loss.
- $\mathbb{E}_{z, c, t, \epsilon}$ denotes the expectation over randomly sampled $z$, $c$, $t$, and $\epsilon$.
- $z = \mathcal{E}(x)$, where $x$ is an image from the dataset.
- $c$ represents the conditioning information (e.g., text prompts, although in DiffBIR, it is often empty or negative prompts).
- $\epsilon_\theta$ is the noise prediction network (typically a U-Net).
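The two formulas above translate almost directly into code. Below is a minimal, hedged PyTorch-style sketch (not the official implementation) of the forward diffusion step and the $\mathcal{L}_{ldm}$ objective; the linear beta schedule and `eps_net` are assumptions.

```python
import torch

def make_alpha_bar(T: int = 1000, beta_start: float = 1e-4, beta_end: float = 2e-2):
    # Linear beta schedule (an assumption); alpha_bar_t = prod_{i<=t} (1 - beta_i).
    betas = torch.linspace(beta_start, beta_end, T)
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(z: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    # Forward diffusion: z_t = sqrt(alpha_bar_t) * z + sqrt(1 - alpha_bar_t) * eps.
    eps = torch.randn_like(z)
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * z + (1.0 - a).sqrt() * eps, eps

def ldm_loss(eps_net, z, c, t, alpha_bar):
    # L_ldm = E[ || eps - eps_theta(z_t, c, t) ||_2^2 ], estimated on one mini-batch.
    z_t, eps = q_sample(z, t, alpha_bar)
    return torch.mean((eps - eps_net(z_t, c, t)) ** 2)
```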
IRControlNet Architecture and Training
The architecture of IRControlNet and its variants is depicted in Figure 4.
Figure 4: Architecture of IRControlNet and its four model variants. The left side shows IRControlNet's main structure, including the frozen and trainable modules; the right side shows the designs of the four variants, highlighting different implementations of condition encoding and feature modulation.
- Condition encoding: IRControlNet uses the pre-trained and fixed VAE encoder $\mathcal{E}$ to encode the high-fidelity condition image $I_{RM}$ (from Stage I) into the latent space, producing the condition latent $c_{RM} = \mathcal{E}(I_{RM})$. This is a crucial design choice, as the VAE is pre-trained on large-scale datasets and can effectively preserve sufficient image information in the latent space, which is where the diffusion process operates.
- Condition network: Similar to ControlNet [75], DiffBIR creates a trainable copy of the pre-trained UNet encoder and middle block from Stable Diffusion. This copied network receives the condition information and outputs control signals. The advantage of copying weights is a good initialization. The input to the condition network is a concatenation of the condition latent $c_{RM}$ and the noisy latent $z_t$ at time $t$. Since concatenation increases channel numbers, a few parameters are added to the first layer of the condition network and initialized to zero. This zero initialization functions like zero convolution in ControlNet, preventing random noise from acting as gradients early in training and thus stabilizing convergence.
- Feature modulation: The multi-scale features output by the condition network are used to modulate the intermediate features of the frozen UNet denoiser of Stable Diffusion. Following ControlNet, this modulation is performed via an addition operation on the middle block features and the skipped features. Zero convolutions are again used to connect to the frozen UNet denoiser for training stability.

During training of the generation module, only the parameters of the condition network and feature modulation layers are updated; the rest of the pre-trained Stable Diffusion model (VAE and UNet denoiser) remains frozen. The objective function for training the generation module is:
$
\mathcal { L } _ { G M } = \mathbb { E } _ { z _ { t } , c , t , \epsilon , c _ { R M } } [ | | \epsilon - \epsilon _ { \theta } ( z _ { t } , c , t , c _ { R M } ) | | _ { 2 } ^ { 2 } ]
$
Here:
- $\mathcal{L}_{GM}$ is the generation module's training loss.
- $z_t$ is the noisy latent.
- $c$ is the text prompt (can be empty or negative prompts like "low quality", "blurry").
- $t$ is the time step.
- $\epsilon$ is the ground-truth noise.
- $c_{RM} = \mathcal{E}(I_{RM})$ is the condition latent derived from $I_{RM}$.
- $\epsilon_\theta(z_t, c, t, c_{RM})$ is the noise predicted by the UNet denoiser of Stable Diffusion, controlled by IRControlNet's outputs, conditioned on $z_t$, $c$, $t$, and $c_{RM}$.
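The conditioning mechanism can be summarized in a few lines of code. The following is a hedged, simplified PyTorch sketch (not the official IRControlNet code): `cond_net` stands in for the copied UNet encoder/middle block, `feature_channels` and `latent_channels` are placeholder hyper-parameters, and the zero-initialized input projection approximates the zero-initialized extra channels that the paper adds to the first layer.

```python
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    # 1x1 convolution initialized to zero so the control branch is a no-op at the start.
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class IRControlNetSketch(nn.Module):
    """Trainable control branch: a copy of the SD UNet encoder + middle block (sketch)."""

    def __init__(self, cond_net: nn.Module, feature_channels, latent_channels: int = 4):
        super().__init__()
        self.cond_net = cond_net  # trainable copy of the frozen UNet encoder/middle block
        # Zero-initialized projection standing in for the extra input channels for c_RM.
        self.input_proj = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=1)
        nn.init.zeros_(self.input_proj.weight)
        nn.init.zeros_(self.input_proj.bias)
        self.zero_convs = nn.ModuleList(zero_conv(ch) for ch in feature_channels)

    def forward(self, z_t: torch.Tensor, c_rm: torch.Tensor):
        x = self.input_proj(torch.cat([z_t, c_rm], dim=1))  # concat noisy + condition latents
        feats = self.cond_net(x)                             # multi-scale control features
        # These controls are added to the frozen UNet's middle-block and skipped features.
        return [zc(f) for zc, f in zip(self.zero_convs, feats)]
```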
Discussion on IRControlNet Variants
The paper explores several IRControlNet variants (shown in Figure 4 and Figure 11 in the appendix) to validate its design choices:
- Variant 1 (ControlNet-like condition encoding): Replaces IRControlNet's VAE encoder $\mathcal{E}$ for condition encoding with a tiny, trained-from-scratch network. This variant shows significantly worse results (e.g., lower PSNR and color shift problems, as seen in Figure 10 (right)). This demonstrates the critical role of the pre-trained VAE encoder in projecting the condition image into the correct latent space for effective control.
- Variant 2 (w/o $z_t$ in condition network): Removes the noisy latent $z_t$ from the condition network input, using only $c_{RM}$. While achieving good fidelity metrics, it tends to produce smoother results without sufficient texture details (Figure 12). Its training losses are consistently higher (Figure 5). This indicates $z_t$ helps the condition network be aware of randomness at each timestep, boosting convergence and high-quality generation.
- Variant 3 (w/o copied initialization): Trains the condition network from random initialization instead of copying UNet weights. This variant struggles with convergence and achieves the worst performance across all metrics (Figure 5, Table 1), highlighting the importance of good weight initialization.
- Variant 4 (w/ control decoder features): Modulates decoder features instead of skipped features. It shows comparable performance to IRControlNet in convergence and quantitative results (Figure 5, Table 1). However, skipped features are preferred as controlling decoder features would introduce more parameters and computation due to larger channel numbers.
- Variant 5 (w/ control concat features, from Appendix): Simultaneously controls middle block, decoder, and skipped features. Achieves better PSNR and SSIM but worse MANIQA scores (Table 9). This indicates that applying more control enhances fidelity but can damage generation quality by restricting the generative prior.
- Variant 6 (w/ SFT modulation, from Appendix): Uses SFT (Scale and Shift Transformation) layers [53] for modulation. SFT is defined as $\mathrm{SFT}(F \mid \gamma, \beta) = \gamma \odot F + \beta$, where $F$ are feature maps, and $\gamma$, $\beta$ are element-wise scale and shift parameters produced by zero-conv. This variant also improves fidelity but at the cost of MANIQA scores (Table 9), again trading quality for fidelity.

In summary, IRControlNet's design choices are empirically validated to be crucial for either model convergence or performance, making it a robust backbone for the generative module in BIR tasks.
4.2.3. Restoration Guidance
This section describes a training-free controllable module designed to achieve a trade-off between quality (realness, generated details) and fidelity (faithfulness to the input). Users typically desire more generated details in high-frequency regions (e.g., textures, edges) and less generated content (i.e., more fidelity to the smoother $I_{RM}$ image) in flat, low-frequency regions (e.g., sky, walls). This is achieved through region-adaptive restoration guidance applied at each sampling step during inference.
The process is as follows:
At a given time step $t$, the UNet denoiser first predicts the noise $\epsilon_t$ from the noisy latent $z_t$. Then, this predicted noise is removed from $z_t$ to obtain a clean latent $\tilde{z}_0$:
$
\epsilon _ { t } = \epsilon _ { \theta } ( z _ { t } , c , t , c _ { R M } )
$
$
\tilde { z } _ { 0 } = \frac { z _ { t } - \sqrt { 1 - \bar { \alpha } _ { t } } \epsilon _ { t } } { \sqrt { \bar { \alpha } _ { t } } }
$
Here:
- $\epsilon_t$ is the noise predicted by the UNet denoiser (conditioned by IRControlNet).
- $\tilde{z}_0$ is the predicted clean latent representation of the image at time 0, before any noise was added.
- The terms $\sqrt{\bar{\alpha}_t}$ and $\sqrt{1 - \bar{\alpha}_t}$ are scaling factors derived from the diffusion schedule.

The goal is to guide the decoded version of this clean latent, $\mathcal{D}(\tilde{z}_0)$, towards the high-fidelity condition image $I_{RM}$ (output of Stage I). This is done by applying a region-adaptive MSE loss in the pixel space and updating $\tilde{z}_0$ using a gradient descent algorithm.
The region-adaptive MSE loss function is defined as:
$
\mathcal { L } ( \tilde { z } _ { 0 } ) = \frac { 1 } { H W C } | | \mathcal { W } \odot ( \mathcal { D } ( \tilde { z } _ { 0 } ) - I _ { R M } ) | | _ { 2 } ^ { 2 }
$
Here:
- $\mathcal{L}(\tilde{z}_0)$ is the region-adaptive MSE loss.
- H, W, C denote the height, width, and channel count of $\mathcal{D}(\tilde{z}_0)$.
- $\mathcal{D}(\tilde{z}_0)$ is the image reconstructed from the clean latent by the VAE decoder.
- $I_{RM}$ is the high-fidelity guidance image from Stage I.
- $\mathcal{W}$ is a weight map that assigns different importance to different image regions.
- $\odot$ denotes the element-wise product.
- $||\cdot||_2^2$ is the squared L2 norm.

The weight map $\mathcal{W}$ is calculated based on the gradient magnitude of $I_{RM}$. The idea is to assign larger weights to low-frequency (flat) regions and smaller weights to high-frequency (textured, edge) regions. This means low-frequency areas will have a higher loss if they deviate from $I_{RM}$, thus being influenced more by $I_{RM}$ to maintain fidelity. Conversely, high-frequency regions are less constrained, allowing more generative freedom.
To obtain $\mathcal{W}$, first, the gradient magnitude for each pixel $(i, j)$ in $I_{RM}$ is calculated using Sobel operators $G_x$ and $G_y$ (which detect gradients in the horizontal and vertical directions):
$
M ( I _ { R M } ) = \sqrt { G _ { x } ( I _ { R M } ) ^ { 2 } + G _ { y } ( I _ { R M } ) ^ { 2 } }
$
Since strong gradient signals are sparse, patch-level gradient signals are used for better estimation. $I_{RM}$ is divided into multiple equal-sized, non-overlapping patches $I_{RM}^{(k)}$. For each patch, the sum of gradient magnitudes of all its pixels is calculated and then mapped into the range [0, 1) using the tanh function:
$
S ( I _ { R M } ^ { ( k ) } ) = \operatorname { t a n h } \left( \sum _ { i , j } M _ { i , j } ( I _ { R M } ) \right) , ( i , j ) \in I _ { R M } ^ { ( k ) }
$
Where $(i, j)$ denotes a pixel in patch $I_{RM}^{(k)}$. A higher $S(I_{RM}^{(k)})$ indicates a stronger gradient signal in that patch.
The final gradient magnitude map $\mathcal{G}(I_{RM})$ at each pixel $(i, j)$ is then formulated as:
$
\mathcal { G } _ { i , j } ( I _ { R M } ) = \sum _ { k } \mathbb { I } \left[ ( i , j ) \in I _ { R M } ^ { ( k ) } \right] S ( I _ { R M } ^ { ( k ) } )
$
Where $\mathbb{I}[\cdot]$ is an indicator function, which is 1 if pixel $(i, j)$ is in patch $I_{RM}^{(k)}$, and 0 otherwise. This effectively assigns the patch-level gradient intensity to all pixels within that patch.
Finally, the weight map is calculated as $\mathcal{W} = 1 - \mathcal{G}(I_{RM})$. This means regions with strong gradients (high-frequency) will have small weights (closer to 0), and regions with weak gradients (low-frequency) will have large weights (closer to 1), as illustrated in Figure 6.
Figure 6: Illustration of the region-adaptive restoration guidance mechanism, showing how regions with different weights are treated: larger weights are applied to low-frequency regions and smaller weights to high-frequency regions. The region-adaptive MSE loss between the clean latent $\tilde{z}_0$ and the high-fidelity guidance image $I_{RM}$ is minimized via gradient descent, and the figure illustrates the weight map and gradient map computation.
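The weight-map computation can be implemented compactly. Below is a hedged PyTorch-style sketch (not the official code): the patch size is a placeholder hyper-parameter, gradients are taken on a grayscale average of the channels, and `W = 1 - G` follows the behavior described above.

```python
import torch
import torch.nn.functional as F

def region_adaptive_weight_map(i_rm: torch.Tensor, patch_size: int = 16) -> torch.Tensor:
    # i_rm: (B, C, H, W) guidance image; compute Sobel gradients on a grayscale version.
    gray = i_rm.mean(dim=1, keepdim=True)
    sobel_x = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]],
                           device=i_rm.device).view(1, 1, 3, 3)
    sobel_y = sobel_x.transpose(-1, -2)
    gx = F.conv2d(gray, sobel_x, padding=1)
    gy = F.conv2d(gray, sobel_y, padding=1)
    magnitude = torch.sqrt(gx ** 2 + gy ** 2)                     # M(I_RM)
    # Patch-level gradient strength S = tanh(sum over each non-overlapping patch).
    patch_sum = F.avg_pool2d(magnitude, patch_size) * (patch_size ** 2)
    strength = torch.tanh(patch_sum)                              # S in [0, 1)
    grad_map = F.interpolate(strength, size=gray.shape[-2:], mode="nearest")  # G(I_RM)
    return 1.0 - grad_map                                         # W: large in flat regions
```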
The clean latent $\tilde{z}_0$ is then updated at each sampling step using the gradient descent algorithm:
$
\tilde { z } _ { 0 } ^ { \prime } = \tilde { z } _ { 0 } - s \nabla _ { \tilde { z } _ { 0 } } \mathcal { L } ( \tilde { z } _ { 0 } )
$
Here:
- $\tilde{z}_0^{\prime}$ is the updated clean latent.
- $s$ is the guidance scale, a tunable hyper-parameter controlled by the user. A larger $s$ pushes $\mathcal{D}(\tilde{z}_0)$ closer to $I_{RM}$, resulting in higher fidelity.
- $\nabla_{\tilde{z}_0} \mathcal{L}(\tilde{z}_0)$ is the gradient of the region-adaptive MSE loss with respect to the clean latent $\tilde{z}_0$.
The detailed algorithm for restoration guidance is provided in Algorithm 1 in the Appendix:
Algorithm 1: Restoration guidance, given a diffusion model $\epsilon_\theta$ and the VAE's encoder $\mathcal{E}$ and decoder $\mathcal{D}$.
Input: guidance image $I_{RM}$, text description $c$ (set to empty), diffusion steps $T$, gradient scale $s$.
Output: output image.
Sample $z_T$ from $\mathcal{N}(0, I)$
for $t$ from $T$ to 1 do
  Calculate $\tilde{z}_0$ from $z_t$ as $\tilde{z}_0 = \left( z_t - \sqrt{1 - \bar{\alpha}_t}\, \epsilon_\theta(z_t, c, t, c_{RM}) \right) / \sqrt{\bar{\alpha}_t}$
  $\tilde{z}_0^{\prime} = \tilde{z}_0 - s \nabla_{\tilde{z}_0} \mathcal{L}(\tilde{z}_0)$
  Sample $z_{t-1}$ from $p(z_{t-1} \mid z_t, \tilde{z}_0^{\prime})$
end for
return $\mathcal{D}(\tilde{z}_0^{\prime})$
This mechanism provides flexible control without re-training, allowing users to fine-tune the output image's characteristics based on their preferences.
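A single guidance step could look roughly as follows. This is a hedged PyTorch-style sketch (not the official implementation): `eps_net`, `vae_decode`, and `weight_map` are placeholders for the frozen denoiser $\epsilon_\theta$, the VAE decoder $\mathcal{D}$, and the weight map $\mathcal{W}$ computed from $I_{RM}$ as above.

```python
import torch

def guided_clean_latent(z_t, t, alpha_bar_t, eps_net, vae_decode, i_rm, weight_map, s):
    # Predict the noise and the clean latent estimate z0~ at this step.
    ab = torch.as_tensor(alpha_bar_t, dtype=z_t.dtype, device=z_t.device)
    eps_t = eps_net(z_t, t)                                   # eps_theta(z_t, c, t, c_RM)
    z0 = (z_t - (1.0 - ab).sqrt() * eps_t) / ab.sqrt()
    # Region-adaptive MSE between the decoded latent and the guidance image I_RM.
    z0 = z0.detach().requires_grad_(True)
    loss = (weight_map * (vae_decode(z0) - i_rm) ** 2).mean()
    grad = torch.autograd.grad(loss, z0)[0]
    return z0 - s * grad                                      # z0~' = z0~ - s * grad L(z0~)
```

With `s = 0` the update is a no-op (pure generative sampling), while larger `s` pulls the decoded result towards $I_{RM}$, matching the fidelity/realness trade-off discussed above.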
5. Experimental Setup
5.1. Datasets
DiffBIR's models are trained and evaluated on a variety of synthetic and real-world datasets tailored for Blind Image Super-Resolution (BSR), Blind Face Restoration (BFR), and Blind Image Denoising (BID) tasks.
- Training Dataset (for DiffBIR's Generation Module):
  - Filtered laion2b-en [46]: A large-scale dataset containing approximately 15 million high-quality images. All images are randomly cropped to 512×512 resolution during training. This dataset is chosen due to its vast size and diversity, which is crucial for training robust generative diffusion models like Stable Diffusion.
- Evaluation Datasets for BSR Task:
  - Synthetic Datasets: These datasets contain images with artificially induced degradations, allowing for quantitative comparison against ground truth.
    - DIV2K-Val [1]: A widely used high-quality image dataset for super-resolution, often used for validation.
    - DRealSR [62]: A dataset designed for real-world super-resolution challenges, offering diverse degradations.
    - RealSR [3]: Another benchmark dataset focusing on real-world single image super-resolution.
  - Real-world Datasets: These datasets consist of images with authentic, complex degradations, posing a greater challenge.
    - RealSRSet [73]: A dataset specifically collected for blind image super-resolution from real-world scenarios.
    - Real47: A dataset collected by the authors for evaluating BSR in real-world conditions.
- Evaluation Datasets for BFR Task: These datasets contain real-world face images with varying degrees of degradation.
  - LFW-Test [55] (based on LFW [21]): A subset derived from the Labeled Faces in the Wild dataset, commonly used for face restoration.
  - WIDER-Test [77]: A subset used for face restoration, often containing faces with various poses, expressions, and occlusions under diverse conditions.
- Evaluation Datasets for BID Task: This task evaluates performance on images degraded by various real-world noise types.
  - Mixed Real-world Dataset: Composed of images from real3 [74], real9 [74], and RNI15 [72]. These datasets are known for containing realistic camera sensor noises (e.g., dark current noise, shot noise, thermal noise) that are challenging to remove.
5.2. Evaluation Metrics
For every evaluation metric, a comprehensive explanation is provided below:
- PSNR (Peak Signal-to-Noise Ratio)
  - Conceptual Definition: PSNR quantifies the reconstruction quality of an image compared to an original reference image. It is most commonly used to measure the quality of reconstruction of lossy compression codecs or restoration algorithms. A higher PSNR value generally indicates a better quality image, meaning the reconstructed image is closer to the original. It is usually expressed in decibels (dB).
  - Mathematical Formula:
$
\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1}\sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2
$
$
\mathrm{PSNR} = 10 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right) = 20 \cdot \log_{10}\left(\frac{\mathrm{MAX}_I}{\sqrt{\mathrm{MSE}}}\right)
$
  - Symbol Explanation:
    - $\mathrm{MSE}$: Mean Squared Error between the two images.
    - $I(i,j)$: The pixel value at position $(i,j)$ in the original (ground truth) image.
    - $K(i,j)$: The pixel value at position $(i,j)$ in the reconstructed (restored) image.
    - $m, n$: Dimensions of the image (rows and columns).
    - $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255. For images with pixel values in $[0, 1]$, this is 1.
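As a small worked example of the PSNR formula above, here is a minimal sketch assuming float images normalized to [0, 1]:

```python
import numpy as np

def psnr(reference: np.ndarray, restored: np.ndarray, max_val: float = 1.0) -> float:
    # MSE over all pixels, then 10 * log10(MAX_I^2 / MSE).
    mse = np.mean((reference.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```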
- SSIM (Structural Similarity Index Measure)
  - Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images, often used to assess the quality of lossy compression or restored images. Unlike PSNR, which relies on pixel-wise error, SSIM attempts to model the human visual system by considering three key features: luminance, contrast, and structure. The SSIM index ranges from -1 to 1, where 1 indicates perfect structural similarity.
  - Mathematical Formula:
$
\mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
$
  - Symbol Explanation:
    - $x$: A window from the first image (e.g., ground truth).
    - $y$: A window from the second image (e.g., restored image).
    - $\mu_x$: Average of $x$.
    - $\mu_y$: Average of $y$.
    - $\sigma_x^2$: Variance of $x$.
    - $\sigma_y^2$: Variance of $y$.
    - $\sigma_{xy}$: Covariance of $x$ and $y$.
    - $c_1 = (k_1 L)^2$, $c_2 = (k_2 L)^2$: Two constants to stabilize the division with a weak denominator.
    - $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit grayscale images).
    - $k_1 = 0.01$, $k_2 = 0.03$: Default constant values.
- LPIPS (Learned Perceptual Image Patch Similarity) [76]
  - Conceptual Definition: LPIPS is a metric that aims to correlate better with human perception of image quality than traditional metrics like PSNR or SSIM. It works by extracting deep features from pre-trained neural networks (like AlexNet or VGG), then calculating the Euclidean distance between these feature representations of two images. A lower LPIPS score indicates higher perceptual similarity.
  - Mathematical Formula: The paper does not provide an explicit formula, as LPIPS is defined by the specific network architecture and distance calculation. Conceptually, it can be represented as:
$
\mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} ||w_l \odot (\phi_l(x) - \phi_l(x_0))||_2^2
$
  - Symbol Explanation:
    - $x$: Original image (ground truth).
    - $x_0$: Restored image.
    - $\phi_l(\cdot)$: Feature stack from layer $l$ of a pre-trained network (e.g., AlexNet, VGG).
    - $w_l$: Weights for each layer $l$.
    - $H_l, W_l$: Height and width of the feature maps at layer $l$.
    - $\odot$: Element-wise multiplication.
    - $||\cdot||_2^2$: Squared L2 norm.
- MUSIQ (Multi-scale Image Quality Transformer) [24]
  - Conceptual Definition: MUSIQ is a no-reference image quality assessment (IQA) metric designed to predict subjective image quality without needing a reference image. It uses a Transformer-based architecture to extract features at multiple scales and aggregates them to provide an overall quality score. A higher MUSIQ score indicates better perceived image quality.
- MANIQA (Multi-dimension Attention Network for No-Reference Image Quality Assessment) [64]
  - Conceptual Definition: MANIQA is another no-reference IQA metric that employs a multi-dimension attention network to assess image quality. It focuses on learning discriminative quality-aware features by attending to different aspects (e.g., spatial, channel) of an image. A higher MANIQA score typically suggests better image quality.
- CLIP-IQA [51]
  - Conceptual Definition: CLIP-IQA is a no-reference IQA metric that leverages the capabilities of CLIP (Contrastive Language-Image Pre-training) to assess image quality. It aligns image quality assessment with human aesthetic and semantic understanding by measuring the semantic similarity between an image and quality-related text prompts. A higher CLIP-IQA score indicates better image quality, often correlating with semantic integrity and visual appeal.
- FID (Fréchet Inception Distance) [19]
  - Conceptual Definition: FID is a metric used to evaluate the quality of images generated by generative models, particularly GANs and diffusion models. It measures the "distance" between the feature distributions of real images and generated images. It uses a pre-trained Inception-v3 network to extract features from both sets of images and then computes the Fréchet distance between these two multivariate Gaussian distributions. A lower FID score indicates that the generated images are more similar to real images in terms of quality and diversity.
  - Mathematical Formula:
$
\mathrm{FID} = ||\mu_x - \mu_g||_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x\Sigma_g)^{1/2})
$
  - Symbol Explanation:
    - $\mu_x$: Mean of feature vectors for real images.
    - $\mu_g$: Mean of feature vectors for generated images.
    - $\Sigma_x$: Covariance matrix of feature vectors for real images.
    - $\Sigma_g$: Covariance matrix of feature vectors for generated images.
    - $||\cdot||_2^2$: Squared L2 norm (Euclidean distance).
    - $\mathrm{Tr}(\cdot)$: Trace of a matrix.
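Given the Inception feature statistics of the two image sets, the Fréchet distance above can be computed as in this hedged sketch (the Inception feature extraction itself is omitted):

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu_x, sigma_x, mu_g, sigma_g):
    # ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})
    diff = mu_x - mu_g
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```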
5.3. Baselines
DiffBIR is compared against a comprehensive set of state-of-the-art methods for each Blind Image Restoration task:
- For Blind Image Super-Resolution (BSR): FeMaSR [5], DASR [30], Real-ESRGAN+ [56], BSRGAN [73], SwinIR-GAN [29], StableSR [52] (a diffusion-based method), and PASD [66] (a diffusion-based method). These baselines represent both GAN-based and recent diffusion-based approaches that aim to handle real-world super-resolution with unknown degradations.
- For Blind Face Restoration (BFR): CodeFormer [77], DifFace [67] (a diffusion-based method), DMDNet [28], DR2 [61] (a diffusion-based method), GCFSR [18], GFP-GAN [55], GPEN [65], RestoreFormer++ [60], VQFR [16], and PGDiff [63] (a diffusion-based method). These methods are state-of-the-art in face-specific restoration, utilizing various GAN or diffusion priors tailored for facial structures.
- For Blind Image Denoising (BID): CBDNet [17], DeamNet [41], Restormer [69], SwinIR [29], and SCUNet-GAN [74]. These baselines cover a range of CNN-based, Transformer-based, and GAN-based approaches designed for blind noise removal.
6. Results & Analysis
6.1. Core Results Analysis
DiffBIR demonstrates superior performance across all evaluated Blind Image Restoration tasks (BSR, BFR, BID), both quantitatively and qualitatively.
BSR on Synthetic Datasets
The following are the results from Table 2 of the original paper, showing quantitative comparisons on the DIV2K-Val dataset with Real-ESRGAN degradation.
| Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) | DiffBIR (s=0.5) | DiffBIR (s=1) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| PSNR↑ | 20.1303 | 21.2141 | 21.0348 | 21.4531 | 20.7488 | 21.2392 | 20.7838 | 20.5824 | 21.5808 | 21.9154 |
| SSIM↑ | 0.4451 | 0.4773 | 0.4899 | 0.4814 | 0.4844 | 0.4790 | 0.4727 | 0.4277 | 0.4794 | 0.4986 |
| LPIPS↓ | 0.3971 | 0.4479 | 0.3921 | 0.4095 | 0.3907 | 0.3993 | 0.4353 | 0.3939 | 0.3935 | 0.4263 |
| MUSIQ↑ | 62.7855 | 58.1591 | 64.6389 | 62.9271 | 65.4945 | 57.8069 | 63.8094 | 73.1019 | 68.6657 | 61.1476 |
| MANIQA↑ | 0.1443 | 0.1531 | 0.2238 | 0.1833 | 0.2061 | 0.1648 | 0.2354 | 0.3836 | 0.3146 | 0.2466 |
| CLIP-IQA↑ | 0.5674 | 0.5571 | 0.5905 | 0.5195 | 0.5779 | 0.5541 | 0.6125 | 0.7656 | 0.7158 | 0.6347 |
DiffBIR consistently outperforms state-of-the-art methods on synthetic BSR datasets, especially in no-reference image quality assessment (IQA) metrics like MUSIQ, MANIQA, and CLIP-IQA. When the restoration guidance scale is set to 0 (prioritizing quality/realness), DiffBIR achieves the highest scores across all three IQA metrics by a significant margin. This highlights its superior ability to generate perceptually pleasing and realistic details.
When s = 1 (prioritizing fidelity), DiffBIR achieves the best PSNR and SSIM scores, which measure pixel-wise and structural similarity to the ground truth. Even at s = 1, its IQA scores remain competitive (top-3). This demonstrates the effectiveness of the region-adaptive restoration guidance in allowing users to balance realness and fidelity according to their needs. A setting of s = 0.5 offers a good compromise.
BSR on Real-world Datasets
The following are the results from Table 3 of the original paper, showing quantitative comparisons on real-world datasets for BSR.
| Datasets | Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| RealSRSet [73] | MUSIQ↑ | 64.6735 | 59.2695 | 63.2675 | 67.6705 | 64.2512 | 64.8372 | 67.4052 | 69.4208 |
| | MANIQA↑ | 0.2142 | 0.1595 | 0.1963 | 0.2240 | 0.2054 | 0.2083 | 0.2370 | 0.3211 |
| | CLIP-IQA↑ | 0.6879 | 0.5236 | 0.5772 | 0.6456 | 0.6008 | 0.6418 | 0.6761 | 0.7637 |
| Real47 | MUSIQ↑ | 68.9384 | 62.2026 | 68.1098 | 69.4741 | 68.8467 | 68.3422 | 70.9712 | 73.1397 |
| | MANIQA↑ | 0.2347 | 0.1454 | 0.2055 | 0.2063 | 0.2217 | 0.2264 | 0.2607 | 0.3682 |
| | CLIP-IQA↑ | 0.6911 | 0.5445 | 0.6382 | 0.6111 | 0.6246 | 0.6574 | 0.6913 | 0.7781 |
On challenging real-world BSR datasets (RealSRSet and Real47), DiffBIR (s = 0) achieves the best scores across all MUSIQ, MANIQA, and CLIP-IQA metrics. This confirms its superiority in handling complex, authentic degradations and generating highly realistic outputs.
Qualitative comparisons (Figure 7) further illustrate that DiffBIR produces sharper and more realistic results compared to GAN-based methods which tend to be over-smoothed. Compared to other diffusion-based methods, DiffBIR's outputs are more realistic, recovering intricate details like whiskers, lips, flower pistils, and clear text.
Figure 7: Visual comparison of different blind image super-resolution (BSR) methods on real-world datasets, contrasting the outputs of several methods and showing DiffBIR's superiority in restoring image content.
BFR on Real-world Datasets
The following are the results from Table 4 of the original paper, showing quantitative comparisons on real-world datasets for BFR.
| Datasets | Metrics | CodeFormer [77] | DifFace [67] | DMDNet [28] | DR2 [61] | GCFSR [18] | GFP-GAN [55] | GPEN [65] | RestoreFormer++ [60] | VQFR [16] | PGDiff [63] | DiffBIR (s=0) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| LFW-Test [21] | MUSIQ↑ | 75.4830 | 70.4957 | 73.4027 | 67.5357 | 71.3789 | 76.3779 | 76.6210 | 72.2492 | 74.3847 | 72.2175 | 76.4206 |
| | MANIQA↑ | 0.3188 | 0.2692 | 0.2973 | 0.2830 | 0.2790 | 0.3688 | 0.3616 | 0.3179 | 0.3280 | 0.2927 | 0.4499 |
| | CLIP-IQA↑ | 0.6890 | 0.5945 | 0.6467 | 0.5728 | 0.6143 | 0.7196 | 0.7181 | 0.7025 | 0.7099 | 0.6133 | 0.7948 |
| | FID (ref. FFHQ)↓ | 52.8765 | 44.9201 | 43.5403 | 45.9420 | 52.6972 | 47.4717 | 51.9862 | 50.7309 | 50.1300 | 41.5814 | 4.9065 |
| Wider-Test [77] | MUSIQ↑ | 73.4081 | 65.2397 | 69.4709 | 67.3163 | 69.9634 | 74.8308 | 75.6160 | 71.5155 | 71.4163 | 66.0014 | 75.3213 |
| | MANIQA↑ | 0.2971 | 0.2403 | 0.2630 | 0.2795 | 0.2803 | 0.3508 | 0.3472 | 0.29055 | 0.3060 | 0.2406 | 0.4443 |
| | CLIP-IQA↑ | 0.6984 | 0.5639 | 0.6335 | 0.5821 | 0.6266 | 0.7147 | 0.7039 | 0.7171 | 0.7069 | 0.5685 | 0.8085 |
| | FID (ref. FFHQ)↓ | 39.2517 | 37.8440 | 38.9580 | 40.1202 | 41.1986 | 41.3247 | 46.4419 | 45.4686 | 38.1675 | 40.2700 | 35.0940 |
DiffBIR (s = 0) achieves the lowest FID score on both LFW-Test and WIDER-Test datasets, by a significant margin (e.g., 4.9065 vs. 41.5814 on LFW-Test). This is a strong indicator of its ability to generate highly realistic and diverse faces that are visually indistinguishable from real ones. It also obtains the highest scores in MANIQA and CLIP-IQA, with MUSIQ scores being very close to the top.
Notably, IRControlNet was not specifically finetuned on face datasets like FFHQ, yet it outperforms specialized BFR methods. This highlights the excellent generalization ability of DiffBIR's proposed restoration pipeline to general images, not just faces.
Visual comparisons (Figure 8) show DiffBIR's superiority in handling non-facial elements (e.g., correctly restoring a hand alongside a face) and complex facial orientations (e.g., side profiles, intricate details like teeth and nose), where other methods might distort or fail due to strong facial priors.
Figure 8: Visual comparison of different blind face restoration (BFR) methods on real-world datasets, showing restoration results from the LQ input, GPEN, GFPGAN, VQFR, CodeFormer, and DiffBIR, arranged for clear side-by-side comparison.
BID on Real-world Datasets
The following are the results from Table 5 of the original paper, showing quantitative comparisons on real-world datasets for BID.
| Methods | MUSIQ↑ | MANIQA↑ | CLIP-IQA↑ |
| --- | --- | --- | --- |
| CBDNet [17] | 48.1149 | 0.1103 | 0.4709 |
| DeamNet [41] | 45.9942 | 0.0949 | 0.4391 |
| Restormer [69] | 47.4605 | 0.0927 | 0.3857 |
| SwinIR [29] | 55.0493 | 0.1595 | 0.4130 |
| SCUNet-GAN [74] | 58.2170 | 0.1822 | 0.5045 |
| DiffBIR (s=0) | 69.7278 | 0.3404 | 0.7420 |
For BID, DiffBIR (s = 0) significantly outperforms all baseline methods across MUSIQ, MANIQA, and CLIP-IQA metrics. This substantial difference is attributed to the powerful generative diffusion prior, enabling effective high-quality image restoration beyond mere noise removal.
Visual comparisons (Figure 9) reveal that DiffBIR not only removes noise effectively but also generates realistic textures, which other methods often fail to do. Methods like SwinIR and SCUNet-GAN, while successfully denoising, tend to produce smoothed results lacking vivid texture details.
Figure 9: Visual comparison of different blind image denoising methods on real-world datasets, including the low-quality input (LQ), Restormer, SwinIR, SCUNet-GAN, and our DiffBIR.
6.2. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to validate the design choices and components of DiffBIR.
The Importance of Restoration Module
The following are the results from Table 6 of the original paper, showing an ablation study on the Restoration Module (RM).
| Datasets | Metrics | w/o RM | w/ RM |
| --- | --- | --- | --- |
| RealSRSet [73] | MANIQA↑ | 0.2386 | 0.2477 |
| | MUSIQ↑ | 62.5683 | 64.7319 |
| | CLIP-IQA↑ | 0.6818 | 0.7075 |
| ImageNet-Val-1k [10] | PSNR↑ | 22.8481 | 23.0078 |
| | SSIM↑ | 0.5039 | 0.5198 |
| | LPIPS↓ | 0.4076 | 0.4026 |
Removing the Restoration Module (RM), meaning directly finetuning the diffusion model with synthesized training pairs (one-stage model), leads to a noticeable performance drop across all IQA and reference-based metrics on both real-world (RealSRSet) and synthetic (ImageNet-Val-1k) datasets.
Qualitatively (Figure 10, left), the one-stage model exhibits severe distortions, such as incorrect facial generation or misinterpreting degradations as semantic information (e.g., producing a colorful background or unusual eye shapes). The two-stage model (with RM) correctly generates facial content and more realistic results. This confirms that the Restoration Module is critical for providing clean, reliable conditions to the generation module, preventing it from being disturbed by degradation and enabling it to focus purely on content regeneration.
Figure 10: Visual comparison for the ablation studies. Left: results without the Restoration Module (w/o RM) vs. with it (w/ RM), showing the module's importance for preserving details. Right: comparison of ControlNet (w/ ControlNet) and the improved IRControlNet (w/ IRControlNet), demonstrating the latter's advantage in detail generation.
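To make the cascade concrete, here is a minimal sketch of the two-stage idea, assuming generic `restoration_module` and `generation_module` callables (illustrative names, not DiffBIR's actual API):

```python
# A minimal sketch of the two-stage cascade (illustrative, not DiffBIR's API).
import torch

@torch.no_grad()
def restore(lq_image: torch.Tensor, restoration_module, generation_module, s: float = 0.0):
    # Stage I: degradation removal -> a clean but typically over-smoothed condition image.
    condition = restoration_module(lq_image)

    # Stage II: the diffusion-based generation module regenerates realistic details
    # conditioned on the cleaned image; the guidance scale s trades realness (s=0)
    # against fidelity to the Stage-I output (larger s).
    return generation_module(condition, guidance_scale=s)
```

The point of the sketch is the separation of concerns: the generation module only ever sees a cleaned condition image, never the raw degradations.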
The Effectiveness of IRControlNet
The following are the results from Table 7 of the original paper, comparing a standard ControlNet with IRControlNet in terms of PSNR.
| Methods | Set14 [70] | BSD100 [34] | Manga109 [35] | ImageNet-Val-1k [10] |
| w/ ControlNet | 20.9435 | 22.4923 | 20.2692 | 22.2874 |
| w/ IRControlNet | 23.5193 | 23.8778 | 23.2439 | 24.2534 |
Comparing IRControlNet with a standard ControlNet for BSR tasks reveals that ControlNet often produces results with color shifts (Figure 10, right). This issue is attributed to the lack of explicit regularization for color consistency during training and potentially less effective condition encoding. IRControlNet, by leveraging the pre-trained VAE encoder for condition encoding, effectively addresses this problem and achieves significantly higher PSNR scores across various datasets (Set14, BSD100, manga109, ImageNet-Val-1k). This empirical evidence validates IRControlNet's design as a more suitable backbone for BIR tasks within the diffusion framework.
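The condition-encoding difference can be sketched as follows, assuming a Stable-Diffusion-style VAE object that exposes an `encode` method; class and method names are illustrative rather than DiffBIR's implementation:

```python
import torch
import torch.nn as nn

class VAEConditionEncoder(nn.Module):
    """Sketch of IRControlNet-style condition encoding: reuse the frozen,
    pre-trained VAE encoder to map the Stage-I output into latent space,
    instead of training a condition encoder from scratch as a vanilla
    ControlNet would."""

    def __init__(self, vae: nn.Module):
        super().__init__()
        self.vae = vae.eval()
        for p in self.vae.parameters():      # keep the VAE frozen
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, condition_image: torch.Tensor) -> torch.Tensor:
        # Assumes the VAE exposes an `encode` method returning a latent tensor;
        # this latent is then fed to the trainable control branch of the UNet.
        return self.vae.encode(condition_image)
```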
The Effectiveness of Wide Degradation Range
The following are the results from Table 8 of the original paper, showing an ablation study on the degradation model evaluated on RealSRSet [73].
| Degradation | MANIQA↑ | MUSIQ↑ | CLIP-IQA↑ |
| RealESRGAN [56] | 0.2351 | 64.1718 | 0.6936 |
| Ours | 0.2504 | 64.7319 | 0.7075 |
The paper investigates the impact of the degradation model used to synthesize training conditions for the generation module. A classic first-order degradation model with a wide degradation range (ours) is compared against the complex degradation model from Real-ESRGAN [56] (which uses smaller ranges).
The results in Table 8 demonstrate that using the proposed classic degradation model with a wide range leads to better utilization of the generative capabilities of the diffusion model, resulting in enhanced quality of the restored images as indicated by higher MANIQA, MUSIQ, and CLIP-IQA scores on RealSRSet. This suggests that diverse yet conceptually simpler degradations are more effective for training a robust generative prior for the second stage.
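A hedged sketch of what such a first-order, wide-range degradation pipeline could look like is given below; the parameter ranges and OpenCV-based steps are illustrative assumptions, not the paper's exact settings:

```python
import random

import cv2
import numpy as np

def first_order_degrade(hq: np.ndarray) -> np.ndarray:
    """Classic blur -> resize -> noise -> JPEG pipeline with widely sampled
    parameters. `hq` is assumed to be a float32 image in [0, 1], shape (H, W, 3)."""
    # 1) Gaussian blur with a widely varying sigma.
    sigma = random.uniform(0.2, 8.0)
    lq = cv2.GaussianBlur(hq, (21, 21), sigma)

    # 2) Downsampling by a random factor.
    scale = random.uniform(1.0, 8.0)
    h, w = lq.shape[:2]
    lq = cv2.resize(lq, (max(1, int(w / scale)), max(1, int(h / scale))),
                    interpolation=cv2.INTER_LINEAR)

    # 3) Additive Gaussian noise with a wide sigma range.
    noise_sigma = random.uniform(0.0, 30.0) / 255.0
    lq = np.clip(lq + np.random.randn(*lq.shape) * noise_sigma, 0.0, 1.0)

    # 4) JPEG compression at a random quality factor.
    quality = random.randint(30, 95)
    _, buf = cv2.imencode(".jpg", (lq * 255).astype(np.uint8),
                          [int(cv2.IMWRITE_JPEG_QUALITY), quality])
    return cv2.imdecode(buf, cv2.IMREAD_COLOR).astype(np.float32) / 255.0
```

The wide ranges (rather than a second-order cascade with narrow ranges) are what the ablation credits for better exploiting the diffusion prior.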
More Variants for IRControlNet (from Appendix)
The paper further explores two additional variants of IRControlNet (Figure 11):
Figure 11: Schematic of two additional IRControlNet variants, Variant 5 and Variant 6. Variant 5 controls concatenated features, while Variant 6 uses SFT modulation; the two realize different feature-processing and information-generation mechanisms for blind image restoration.
The following are the results from Table 9 of the original paper, showing quantitative comparisons of IRControlNet, Variant 5 and 6 on ImageNet1k-Val with Real-ESRGAN degradation.
| Variants | PSNR↑ | SSIM↑ | MANIQA↑ |
| IRControlNet | 22.9865 | 0.5200 | 0.2689 |
| Variant 5: w/ control concat features | 23.0449 | 0.5261 | 0.2567 |
| Variant 6: w/ SFT modulation | 22.9974 | 0.5292 | 0.2622 |
- Variant 5 (w/ control concat features): This variant simultaneously controls the middle block, decoder, and skipped features. While it achieves slightly higher PSNR and SSIM, its MANIQA score is worse than IRControlNet's. This suggests that applying more control to the pre-trained model can enhance fidelity but might restrict the generative model's artistic freedom, leading to a slight drop in perceived generation quality.
- Variant 6 (w/ SFT modulation): This variant uses SFT (Scale and Shift Transformation) layers to modulate features, which offers more precise control. Similar to Variant 5, it improves SSIM and PSNR (fidelity) but results in a slightly lower MANIQA score, implying the same trade-off (see the sketch after this list).

In conclusion, IRControlNet's choice to apply add-on control primarily to skipped features with additive modulation strikes a good balance, preserving most of the generative capability while allowing for effective conditioning. The region-adaptive restoration guidance then provides the flexible trade-off between quality and fidelity during inference.
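The sketch below contrasts the two control styles discussed above: simple additive injection into skipped features (IRControlNet's choice) versus SFT-style scale-and-shift modulation (Variant 6). Shapes and module names are illustrative:

```python
import torch.nn as nn

class AdditiveControl(nn.Module):
    """IRControlNet-style injection (sketch): control features are simply
    added to the frozen UNet's skipped features."""
    def forward(self, skip_feat, control_feat):
        return skip_feat + control_feat

class SFTControl(nn.Module):
    """Variant 6 (sketch): SFT predicts a per-pixel scale and shift from the
    control features and uses them to modulate the skipped features."""
    def __init__(self, channels: int):
        super().__init__()
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.to_shift = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, skip_feat, control_feat):
        scale = self.to_scale(control_feat)
        shift = self.to_shift(control_feat)
        return skip_feat * (1 + scale) + shift
```

The stronger modulation buys a bit of fidelity (PSNR/SSIM) at the cost of perceptual quality (MANIQA), which matches the numbers in Table 9.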
Qualitative comparisons for Variant 2 (without the noisy latent in the condition network) are shown in Figure 12:
Figure 12: Visual comparison of the low-quality image (LQ), the IRControlNet result, and the Variant 2 result. Yellow and red boxes mark key regions for close inspection; each row shows the same content under the different methods, making the differences in detail recovery directly visible.
IRControlNet generates more vivid textures, while Variant 2 tends to produce over-smoothed results, reinforcing the importance of including the noisy latent in the condition network for high-quality generation.
6.3. Quantitative Comparisons for Efficiency
The following are the results from Table 11 of the original paper, showing quantitative comparisons of inference efficiency and model complexity.
| Metrics | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | FeMaSR [5] | DASR [30] | StableSR [52] | PASD [66] | DiffBIR |
| Inference Time (ms) | 46.19 | 46.42 | 126.44 | 89.01 | 12.69 | 19278.46 | 16951.08 | 10906.51 |
| Model Size (M) | 16.69 | 16.69 | 11.71 | 34.05 | 8.06 | 1409.11 | 1675.76 | 1716.7 |
This comparison evaluates inference speed and model complexity for super-resolution with a fixed-size input and a scale factor of 4.
DiffBIR is shown to be the most efficient among the diffusion-model (DM)-based baselines, being approximately 1.8x faster than StableSR and 1.6x faster than PASD. This indicates an optimization in its diffusion sampling process.
However, GAN-based methods (Real-ESRGAN+, BSRGAN, DASR, FeMaSR) are significantly more efficient in terms of inference time, often completing inference in tens of milliseconds, whereas DM-based methods, including DiffBIR, require thousands of milliseconds (seconds). This is an inherent trade-off, as DM-based methods typically involve multiple sequential sampling steps (DiffBIR uses 50 steps) which are computationally intensive.
The model size of DiffBIR (1716.7M) is comparable to other DM-based methods (StableSR: 1409.11M, PASD: 1675.76M) but much larger than GAN-based methods (e.g., DASR: 8.06M).
The authors acknowledge the computational expense of DM-based methods but point to rapid advancements in the field, with new works ([33, 44]) achieving satisfactory generation with as few as 1-4 steps, suggesting that the time-consuming problem will likely be alleviated in the future.
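The speed-up figures follow directly from Table 11; a quick check, assuming DiffBIR's reported time covers all 50 sampling steps:

```python
# Quick check of the Table 11 times (milliseconds, x4 super-resolution).
times_ms = {
    "Real-ESRGAN+": 46.19,
    "StableSR": 19278.46,
    "PASD": 16951.08,
    "DiffBIR": 10906.51,
}

print(times_ms["StableSR"] / times_ms["DiffBIR"])      # ~1.77x faster than StableSR
print(times_ms["PASD"] / times_ms["DiffBIR"])          # ~1.55x faster than PASD
print(times_ms["DiffBIR"] / 50)                        # ~218 ms per sampling step
print(times_ms["DiffBIR"] / times_ms["Real-ESRGAN+"])  # ~236x slower than a GAN baseline
```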
6.4. More Quantitative and Qualitative Comparisons for BSR (from Appendix)
The following are the results from Table 10 of the original paper, showing quantitative comparisons on DRealSR [62] and RealSR [3].
| Datasets | Metrics | FeMaSR [5] | DASR [30] | Real-ESRGAN+ [56] | BSRGAN [73] | SwinIR-GAN [29] | StableSR [52] | PASD [66] | DiffBIR (s=0) | DiffBIR (s=0.5) | DiffBIR (s=1) |
| DRealSR [62] | PSNR↑ | 23.1977 | 26.3844 | 24.6878 | 25.6903 | 25.3898 | 23.8669 | 24.2037 | 24.8735 | 24.9891 | 25.6238 |
| DRealSR [62] | SSIM↑ | 0.6239 | 0.7271 | 0.6705 | 0.6765 | 0.6962 | 0.6400 | 0.6529 | 0.5874 | 0.6246 | 0.6544 |
| DRealSR [62] | LPIPS↓ | 0.2190 | 0.1793 | 0.2229 | 0.2308 | 0.2057 | 0.2355 | 0.2016 | 0.2448 | 0.2328 | 0.2350 |
| DRealSR [62] | MUSIQ↑ | 68.7458 | 66.0651 | 67.4608 | 68.9388 | 68.1393 | 69.2621 | 70.7670 | 72.3514 | 71.5339 | 69.8821 |
| DRealSR [62] | MANIQA↑ | 0.3073 | 0.2048 | 0.2315 | 0.2309 | 0.2350 | 0.2565 | 0.2889 | 0.3915 | 0.3847 | 0.3530 |
| DRealSR [62] | CLIP-IQA↑ | 0.6327 | 0.5086 | 0.5022 | 0.5280 | 0.5244 | 0.5988 | 0.6151 | 0.6878 | 0.6761 | 0.6440 |
| RealSR [3] | PSNR↑ | 23.1627 | 25.5503 | 24.2400 | 24.9717 | 24.6244 | 23.5627 | 24.5385 | 23.5237 | 24.2216 | 24.7531 |
| RealSR [3] | SSIM↑ | 0.66534 | 0.7183 | 0.6793 | 0.6839 | 0.7051 | 0.66549 | 0.6694 | 0.55989 | 0.6346 | 0.6615 |
| RealSR [3] | LPIPS↓ | 0.2520 | 0.2397 | 0.2556 | 0.2545 | 0.2340 | 0.2429 | 0.2317 | 0.2646 | 0.2544 | 0.2565 |
| RealSR [3] | MUSIQ↑ | 66.1208 | 59.5565 | 66.7333 | 68.0673 | 67.0964 | 68.4594 | 70.0043 | 72.3909 | 71.3969 | 69.5167 |
| RealSR [3] | MANIQA↑ | 0.2652 | 0.1713 | 0.2243 | 0.2329 | 0.2281 | 0.2407 | 0.2746 | 0.3820 | 0.3792 | 0.3504 |
| RealSR [3] | CLIP-IQA↑ | 0.5925 | 0.4300 | 0.4787 | 0.5233 | 0.4920 | 0.5852 | 0.5822 | 0.6868 | 0.6817 | 0.66478 |
The trends observed on the DRealSR and RealSR datasets are consistent with those on DIV2K-Val. DiffBIR (s=0) achieves the highest IQA scores across all metrics. For PSNR, DiffBIR (s=1) performs comparably to GAN-based methods and better than the other diffusion-based methods, demonstrating a good balance between quality and fidelity.
Visually (Figure 13), DiffBIR is shown to correctly recover semantic information and intricate details (e.g., eyes behind a helmet, firework lines, penguin wings) where GAN-based methods produce overly smoothed results, and other diffusion-based methods fail to generate correct semantics due to severe degradation.
Figure 13: Visual comparison of blind image super-resolution (BSR) methods on the synthetic DIV2K-Val dataset. The low-quality (LQ) image is shown on the left, followed by the ground truth (GT) and the outputs of DASR, Real-ESRGAN+, BSRGAN, SwinIR-GAN, StableSR, PASD, and DiffBIR, highlighting differences in detail recovery.
6.5. More Real-world Visual Comparisons (from Appendix)
The appendix provides additional visual comparisons for BSR, BID, and BFR tasks on real-world datasets, further solidifying DiffBIR's qualitative superiority.
- BSR (Figure 14): Shows more examples where DiffBIR produces sharper, more detailed, and semantically correct reconstructions compared to baselines.
  (Figure 14: visual comparison of blind image restoration methods on real-world datasets; the top row shows restorations from methods including FeMaSR and DiffBIR, while the bottom rows compare another set of low-quality images, highlighting differences in each method's behavior.)
- BID (Figure 15): Illustrates DiffBIR's ability to effectively remove noise while restoring realistic textures, which other denoisers often smooth out.
  (Figure 15: visual comparison on real-world datasets, showing low-quality inputs (LQ) and restorations produced by CBDNet, DeamNet, Restormer, SwinIR, SCUNet-GAN, and DiffBIR.)
- BFR (Figure 16): Presents additional cases where DiffBIR accurately restores faces, even in challenging conditions like extreme angles or occlusions, demonstrating its robust generalization.
  (Figure 16: visual comparison of blind face restoration methods, including DiffBIR and other algorithms, on real-world datasets, emphasizing improvements in handling complex details.)
6.6. Interpretation of Results and Advantages
The comprehensive experiments and ablation studies highlight DiffBIR's key advantages:
- Unified and Generalizable: It successfully tackles three distinct BIR tasks within one framework, overcoming the specialization limitations of previous methods.
- Superior Generative Quality: Leveraging the latent diffusion prior with IRControlNet allows DiffBIR to generate significantly more realistic and visually pleasing details, as evidenced by its leading IQA and FID scores.
- Robust Degradation Handling: The two-stage decoupling ensures stability. By first removing degradations, the generation module receives clean conditions, preventing the artifacts that plague direct approaches.
- User-Centric Control: The training-free region-adaptive restoration guidance is a powerful feature, allowing users to explicitly control the fidelity-quality trade-off, catering to diverse preferences (a conceptual sketch follows this list).
- Efficient Diffusion Integration: IRControlNet optimizes the control mechanism for Stable Diffusion, demonstrating better performance than direct ControlNet adaptations and improved efficiency among DM-based baselines.

The main disadvantage, as acknowledged by the authors, is the computational expense of DM-based inference (50 sampling steps). However, this is a general limitation of current diffusion models, which is actively being addressed in the research community.
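As a rough illustration of the guidance mechanism referenced in the list above, the conceptual sketch below pulls the predicted clean latent toward a Stage-I reference latent, weighted per region and by the user-chosen scale s; it follows the paper's description only at a high level and is not DiffBIR's exact formula:

```python
import torch

def guided_step(x0_pred: torch.Tensor, ref_latent: torch.Tensor,
                region_weight: torch.Tensor, s: float) -> torch.Tensor:
    """One conceptual guidance update (sketch, not DiffBIR's exact formula):
    nudge the predicted clean latent toward the Stage-I reference latent,
    more strongly where `region_weight` is high and scaled by `s`."""
    diff = x0_pred - ref_latent
    grad = region_weight * diff      # gradient of the region-weighted L2 distance
    return x0_pred - s * grad        # s = 0 favours realness; larger s favours fidelity
```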
7. Conclusion & Reflections
7.1. Conclusion Summary
DiffBIR introduces a novel and highly effective unified framework for blind image restoration (BIR), successfully addressing blind image super-resolution (BSR), blind face restoration (BFR), and blind image denoising (BID) tasks. The core innovation lies in its two-stage decoupled pipeline, which first performs degradation removal to obtain a high-fidelity intermediate image and then information regeneration using a powerful generative latent diffusion model. The proposed IRControlNet effectively leverages the Stable Diffusion prior, ensuring stable and realistic detail generation by using a VAE encoder for robust condition encoding. Furthermore, the training-free region-adaptive restoration guidance allows users to flexibly balance image quality (realness) and fidelity (faithfulness to input) during inference. Extensive experiments demonstrate DiffBIR's state-of-the-art performance across both synthetic and real-world datasets, consistently achieving superior IQA metrics and FID scores compared to existing methods.
7.2. Limitations & Future Work
The authors acknowledge a primary limitation:
- Computational Expense: DiffBIR requires 50 sampling steps for image restoration, which is computationally expensive and makes real-time applications challenging.

They suggest future research directions:

- Exploration of Other BIR Tasks: The two-stage restoration pipeline is a general concept, suggesting its applicability and potential for further exploration in other BIR tasks beyond BSR, BFR, and BID.
- Faster Diffusion Models: The time-consuming nature of diffusion models is an active research area, and advancements in faster sampling techniques (e.g., [33, 44] cited in the paper) could significantly alleviate this limitation in future iterations of DiffBIR.
7.3. Personal Insights & Critique
DiffBIR represents a significant step forward in blind image restoration by effectively harnessing the superior generative capabilities of diffusion models. The two-stage decoupling is particularly insightful. It directly addresses the fundamental challenge of BIR where degradation and content are entangled, providing a clean separation of concerns that boosts stability and performance. This makes the generative model's task much clearer: just add realistic details, don't worry about removing noise. This clarity allows IRControlNet to shine.
The choice to adapt ControlNet for IR tasks is clever, and the detailed ablation studies on IRControlNet's components are valuable. The finding that the pre-trained VAE encoder is crucial for stable and color-consistent condition encoding, and that including the noisy latent in the condition network aids convergence and detail generation, offers practical guidance for future diffusion-based IR methods. The region-adaptive restoration guidance is another strong point, providing a much-needed user control mechanism without additional training. This level of flexibility is often missing in IR models, allowing users to tailor outputs to specific aesthetic or fidelity requirements.
Potential Issues/Areas for Improvement:
- Dependence on Stage I: While decoupling is an advantage, the quality of Stage II is inherently dependent on the fidelity and degradation-removal capabilities of Stage I. If the chosen Restoration Module (RM) for a specific task performs poorly, it would directly impact the final output, regardless of IRControlNet's generative power. The paper assumes "off-the-shelf" RMs are good, but real-world complexity might still challenge them.
- Generalizability of Stage I: The paper states using "separate restoration modules instead of a general one for different BIR tasks" in Stage I to maintain expertise. While practical, this slightly undercuts the "unified framework" ideal. Future work could explore a single, more robust and generalizable Restoration Module that can handle a broader range of degradations for Stage I, further simplifying the pipeline.
- Black-Box Nature of IQA Metrics: While crucial for evaluating perceptual quality, MUSIQ, MANIQA, and CLIP-IQA are complex deep-learning-based metrics. Their exact mechanisms, and why one performs better than another, can sometimes be difficult to interpret, leading to less actionable insights for model improvements compared to traditional metrics.
- Sampling Steps vs. Quality: The trade-off between the number of sampling steps and output quality/inference time is critical. While the authors are optimistic about future faster diffusion models, this remains a practical barrier for real-time applications. A more explicit analysis of DiffBIR's performance with fewer sampling steps (e.g., 4-8 steps, like emerging fast diffusion models) would be beneficial.
Transferability and Applications:
The decoupled two-stage architecture has high transferability. This principle could be applied to other conditional image generation tasks where the input condition is noisy or contains irrelevant information that could disturb the generative model. For example, in text-to-image generation, if the text prompt is ambiguous or contains conflicting instructions, a similar pre-processing (or "clarification") stage could potentially improve generation stability. In image editing tasks, removing unwanted elements first before generating new content could lead to cleaner results. The region-adaptive guidance mechanism is also broadly applicable to any generative model that can be guided via gradients, offering fine-grained control for creative applications or specific industrial requirements.