Paper status: completed

DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation

Published: 10/24/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This work introduces GenIR for privacy-safe, large-scale dataset generation and DreamClear, a DiT model leveraging generative priors and multimodal LLMs for adaptive, high-quality real-world image restoration.

Abstract

Image restoration (IR) in real-world scenarios presents significant challenges due to the lack of high-capacity models and comprehensive datasets. To tackle these issues, we present a dual strategy: GenIR, an innovative data curation pipeline, and DreamClear, a cutting-edge Diffusion Transformer (DiT)-based image restoration model. GenIR, our pioneering contribution, is a dual-prompt learning pipeline that overcomes the limitations of existing datasets, which typically comprise only a few thousand images and thus offer limited generalizability for larger models. GenIR streamlines the process into three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation & filtering. This approach circumvents the laborious data crawling process, ensuring copyright compliance and providing a cost-effective, privacy-safe solution for IR dataset construction. The result is a large-scale dataset of one million high-quality images. Our second contribution, DreamClear, is a DiT-based image restoration model. It utilizes the generative priors of text-to-image (T2I) diffusion models and the robust perceptual capabilities of multi-modal large language models (MLLMs) to achieve photorealistic restoration. To boost the model's adaptability to diverse real-world degradations, we introduce the Mixture of Adaptive Modulator (MoAM). It employs token-wise degradation priors to dynamically integrate various restoration experts, thereby expanding the range of degradations the model can address. Our exhaustive experiments confirm DreamClear's superior performance, underlining the efficacy of our dual strategy for real-world image restoration. Code and pre-trained models are available at: https://github.com/shallowdream204/DreamClear.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation. This title clearly indicates the central topic, which is addressing challenges in real-world image restoration through two main innovations: a high-capacity model and a novel, privacy-safe method for dataset creation.

1.2. Authors

The authors are:

  • Yuang Ai

  • Xiaoqiang Zhou

  • Huaibo Huang

  • Xiaotian Han

  • Zhengyu Chen

  • Quanzeng You

  • Hongxia Yang

    Affiliations include:

  • MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences

  • School of Artificial Intelligence, University of Chinese Academy of Sciences

  • ByteDance, Inc

  • University of Science and Technology of China

    The asterisk (*) next to some authors typically denotes equal contribution, and the heart symbol (\heartsuit) often indicates a corresponding author, suggesting a collaborative effort from various research institutions and industry partners.

1.3. Journal/Conference

The paper was posted at 2024-10-24T11:57:20 (UTC). The provided information does not explicitly state a journal or conference, but the arXiv link indicates it is a preprint, commonly submitted to prominent computer vision conferences or journals. Given the topic and scope, it is likely targeting venues like CVPR, ICCV, NeurIPS, or ICLR.

1.4. Publication Year

The publication year is 2024, based on the provided UTC timestamp.

1.5. Abstract

The paper addresses significant challenges in real-world image restoration (IR): the lack of high-capacity models and comprehensive datasets. It proposes a dual strategy:

  1. GenIR: An innovative data curation pipeline that uses dual-prompt learning to overcome the limitations of small, existing datasets. GenIR consists of three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation & filtering. This method is privacy-safe, cost-effective, and copyright-compliant, resulting in a large-scale dataset of one million high-quality images without relying on laborious data crawling.

  2. DreamClear: A cutting-edge Diffusion Transformer (DiT)-based image restoration model. It leverages generative priors from text-to-image (T2I) diffusion models and perceptual capabilities from multi-modal large language models (MLLMs) for photorealistic restoration. To enhance adaptability to diverse real-world degradations, DreamClear introduces the Mixture of Adaptive Modulator (MoAM), which uses token-wise degradation priors to dynamically integrate various restoration experts.

    The paper claims that exhaustive experiments confirm DreamClear's superior performance, validating the effectiveness of their dual strategy for real-world image restoration.

The original source link is: https://arxiv.org/abs/2410.18666. This indicates the paper is a preprint available on arXiv. The PDF link is: https://arxiv.org/pdf/2410.18666v2.pdf.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the significant challenge of Image Restoration (IR) in real-world scenarios. While IR has seen advancements in specific tasks under controlled conditions (e.g., super-resolution, denoising), its application to diverse and complex real-world degradations remains formidable.

This problem is crucial because real-world images often suffer from various forms of degradation, limiting their utility in many applications, from consumer photography to medical imaging and autonomous driving. The specific challenges and gaps in prior research include:

  1. Limited Datasets: Existing IR datasets are typically small (a few thousand images), offering limited generalizability for larger models and failing to adequately capture the wide spectrum of real-world degradations. This disconnect between training data and real-world scenarios hinders progress.

  2. Data Curation Issues: Traditional methods of obtaining high-quality (HQ) images for datasets, primarily web scraping, are labor-intensive, raise copyright concerns, and pose privacy risks (especially with identifiable human faces).

  3. Model Capacity and Generalization: Despite breakthroughs in Natural Language Processing (NLP) and AI-Generated Content (AIGC) driven by large-scale models and extensive data, IR hasn't seen comparable progress. Current IR models often struggle to generalize across diverse, blind degradations, and many generative prior-based methods neglect degradation priors in low-quality inputs.

    The paper's entry point or innovative idea is a dual strategy:

  • To address the dataset limitations, they propose GenIR, a novel, privacy-safe, and cost-effective data curation pipeline that generates synthetic, high-quality images using Text-to-Image (T2I) models and Multimodal Large Language Models (MLLMs).
  • To address model capacity and generalization, they introduce DreamClear, a Diffusion Transformer (DiT)-based model that explicitly incorporates degradation priors and dynamically integrates restoration experts to handle diverse real-world degradations.

2.2. Main Contributions / Findings

The paper makes two primary contributions:

  1. GenIR: A Privacy-Safe Automated Data Curation Pipeline:

    • Contribution: Introduction of GenIR, a pioneering pipeline for creating large-scale, high-quality IR datasets. It leverages generative priors from pretrained T2I models and MLLMs to synthesize images, circumventing copyright and privacy issues associated with web scraping.
    • Specific Problems Solved: It addresses the scarcity of comprehensive IR datasets and the ethical/logistical challenges of traditional data collection. It provides a cost-effective and privacy-safe solution.
    • Key Findings: GenIR successfully generates a dataset of one million high-quality images, proving effective in training robust real-world IR models. Ablation studies demonstrate that models trained with GenIR-generated data achieve significantly better performance than those trained on smaller, traditional datasets or data generated by un-finetuned T2I models.
  2. DreamClear: A Robust, High-Capacity Image Restoration Model:

    • Contribution: Development of DreamClear, a Diffusion Transformer (DiT)-based model that integrates degradation priors into diffusion-based frameworks and employs a novel Mixture of Adaptive Modulator (MoAM).
    • Specific Problems Solved: It enhances control over content generation in T2I models for IR, improves adaptability to diverse real-world degradations, and ensures robust generalization across various scenarios. The MoAM addresses the varied nature of real-world degradations by dynamically combining specialized restoration experts.
    • Key Findings:
      • DreamClear achieves state-of-the-art performance across low-level (synthetic & real-world) and high-level (detection & segmentation) benchmarks.
      • It excels in perceptual metrics on synthetic datasets and no-reference metrics on real-world benchmarks, demonstrating photorealistic restoration.
      • User studies confirm DreamClear's superior visual quality, naturalness, and detail accuracy, with users highly preferring its outputs.
      • Ablation studies confirm the effectiveness of each component, including the dual-branch framework, ControlFormer, MoAM, and MLLM-generated detailed text prompts.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a reader should be familiar with several core concepts in deep learning and computer vision:

  • Image Restoration (IR): A field in computer vision that aims to recover a high-quality (HQ) image from its degraded low-quality (LQ) counterpart. Degradations can include noise, blur, rain, compression artifacts, low resolution, etc. The goal is to reverse these degradations and produce an image that is visually pleasing and structurally accurate.

  • Generative Models: A class of artificial intelligence models that can generate new data instances that resemble the training data.

    • Generative Adversarial Networks (GANs): Consist of a generator that creates synthetic data and a discriminator that tries to distinguish between real and synthetic data. They are trained in an adversarial manner.
    • Diffusion Models: A newer class of generative models that learn to reverse a gradual diffusion process. They start with random noise and progressively denoise it to generate an image. They have shown impressive capabilities in generating high-fidelity and diverse images.
  • Text-to-Image (T2I) Diffusion Models: A specialized type of diffusion model that generates images based on textual descriptions (prompts). Examples include Stable Diffusion and DALL-E. They learn to map text embeddings to image features, allowing precise control over image generation content and style.

  • Diffusion Transformer (DiT): A variant of Diffusion Models where the denoising U-Net architecture is replaced or augmented with a Transformer architecture. Transformers, originally developed for Natural Language Processing (NLP), are highly effective at modeling long-range dependencies and have proven powerful in visual tasks, especially for large-scale models like Sora and Stable Diffusion 3.

  • Multimodal Large Language Models (MLLMs): Large language models that can process and understand information from multiple modalities, typically text and images. They can perform tasks like image captioning, visual question answering, and generating detailed descriptions of images. Gemini-1.5-Pro and LLaVA are examples of MLLMs.

  • Prompt Learning/Engineering: The process of designing effective text inputs (prompts) to guide generative models (especially T2I and LLMs) to produce desired outputs. Dual-prompt learning extends this by simultaneously learning both positive (desired) and negative (undesired) prompts to fine-tune the model's generation capabilities.

  • Classifier-Free Guidance (CFG): A technique used in diffusion models to improve the quality and adherence of generated samples to a given condition (e.g., a text prompt). It works by performing two denoising steps: one conditioned on the prompt and one unconditioned, then extrapolating the conditioned output away from the unconditioned output. This amplifies the effect of the condition.

    The CFG prediction is formulated as
    $$\epsilon_\theta(z_t, t, \mathrm{pos}, \mathrm{neg}) = \omega \cdot \epsilon_\theta(z_t, t, \mathrm{pos}) + (1 - \omega) \cdot \epsilon_\theta(z_t, t, \mathrm{neg})$$
    Where:

    • $z_t$: The noisy latent representation at time step $t$.
    • $t$: The current time step in the diffusion process.
    • $\mathrm{pos}$: The positive (desired) conditioning prompt.
    • $\mathrm{neg}$: The negative (undesired) conditioning prompt.
    • $\epsilon_\theta(z_t, t, \mathrm{pos})$: The noise predicted by the denoising model $\epsilon_\theta$ when conditioned on the positive prompt.
    • $\epsilon_\theta(z_t, t, \mathrm{neg})$: The noise predicted by the denoising model $\epsilon_\theta$ when conditioned on the negative prompt.
    • $\omega$: The CFG guidance scale, a hyperparameter that controls how strongly the model adheres to the guidance. Higher values lead to stronger adherence but can sometimes reduce diversity or realism.

    A minimal code sketch of this weighted combination appears after this list.
  • Mixture-of-Experts (MoE): A neural network architecture where multiple "expert" sub-networks are trained, and a "router" or "gating network" learns to selectively activate and combine the outputs of these experts for each input. This allows a model to specialize in different aspects of the data or task, improving efficiency and capacity.

  • Adaptive Layer Normalization (AdaLN): A conditional normalization technique often used in generative models. It modulates the scale and shift parameters of layer normalization based on an external conditioning input (e.g., a text embedding or an image feature). This allows the model to adapt its internal representations to specific conditions. AdaLN learns dimension-wise scale $\gamma$ and shift $\beta$ parameters.
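To make the classifier-free guidance combination above concrete, here is a minimal PyTorch-style sketch of the weighted blend of the two noise predictions. The `denoiser` callable, its signature, and the prompt embeddings are hypothetical placeholders for illustration, not the interfaces used in the paper.

```python
import torch

def cfg_noise_prediction(denoiser, z_t, t, pos_emb, neg_emb, omega: float = 4.5):
    """Blend positive- and negative-conditioned noise predictions.

    Implements eps = omega * eps(pos) + (1 - omega) * eps(neg), the formulation
    quoted in the paper. `denoiser` is assumed to take (latents, timestep,
    text_embedding) and return predicted noise of the same shape as `z_t`.
    The default omega of 4.5 is the value reported for DreamClear inference.
    """
    eps_pos = denoiser(z_t, t, pos_emb)   # noise predicted under the positive prompt
    eps_neg = denoiser(z_t, t, neg_emb)   # noise predicted under the negative prompt
    return omega * eps_pos + (1.0 - omega) * eps_neg
```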

3.2. Previous Works

The paper contextualizes its contributions by discussing several categories of prior research:

  • Image Restoration (IR) Datasets:

    • Traditional IR datasets are often created by acquiring HQ images and then simulating degradations.
    • Examples cited: DIV2K [44] and Flickr2K [2], which contain only a few thousand images.
    • Larger collections: SUPIR [80] (20 million images), highlighting the labor-intensive nature of large-scale curation.
    • The paper notes that current datasets are insufficient for covering a broad spectrum of real-world scenarios and that web scraping for data collection involves copyright and privacy concerns.
  • Generative Priors in IR:

    • Many recent state-of-the-art (SOTA) approaches leverage generative priors from pretrained generative models like GANs or diffusion models for realistic IR [43, 8, 38, 20].
    • Stable Diffusion [59] is particularly highlighted for its rich generative prior [77, 70, 80].
    • Techniques for integrating generative priors:
      • For GANs (e.g., StyleGAN [36]), an additional encoder maps the input image to the latent space [76, 48].
      • For diffusion models, the input image is integrated as a conditional input in the latent feature space during the denoising process [15, 50].
    • The paper points out that these methods often neglect the degradation priors in input low-quality images, which is a critical element in blind IR [72].
  • Synthetic Datasets:

    • The importance of large-scale high-quality datasets for training large models is acknowledged [83, 22, 21, 24, 26].
    • Existing large datasets are often manually collected (ImageNet [16], LSDIR [39]).
    • Concerns with web-crawled data: privacy information leakage [58, 24].
    • Synthesized datasets are presented as a solution to reduce laborious human efforts and avoid privacy issues [28, 25, 6]. The paper claims to be the first to explore dataset synthesis specifically for image restoration.

3.3. Technological Evolution

Image Restoration has evolved from traditional signal processing methods to deep learning approaches. Early deep learning methods focused on specific degradation types (e.g., super-resolution [95, 31], denoising [87, 84]), often relying on Convolutional Neural Networks (CNNs). The challenge then shifted to real-world IR, where degradations are complex and diverse. This led to:

  1. Improved Degradation Simulation: Works like BSRGAN [85], Real-ESRGAN [64], and AnimeSR [71] developed more realistic degradation pipelines to generate synthetic LQ-HQ pairs.

  2. Generative Models Integration: The rise of powerful generative models like GANs and diffusion models significantly improved the photorealism of restored images. These models could leverage their learned image priors to hallucinate missing details.

  3. Large-Scale Models and Data: Following trends in NLP (GPT-4 [1]) and AIGC (Stable Diffusion [59]), the IR field started exploring the benefits of high-capacity models and extensive data. However, the lack of suitable IR datasets became a bottleneck.

    This paper's work fits into the current technological timeline by addressing this bottleneck through privacy-safe synthetic data generation (GenIR) and then leveraging cutting-edge transformer-based generative models (Diffusion Transformer) with an explicit mechanism for degradation awareness (MoAM) to build a high-capacity IR solution (DreamClear) that can generalize to diverse real-world degradations.

3.4. Differentiation Analysis

Compared to the main methods in related work, DreamClear and GenIR offer several core differences and innovations:

  • Data Curation (GenIR vs. Traditional/Web Scraping):

    • Traditional (e.g., DIV2K, Flickr2K): Small scale, limited diversity, often generated by simple degradation models.
    • Web Scraping (e.g., SUPIR): Large scale but labor-intensive, copyright-prone, and privacy-risky.
    • GenIR Innovation: It's the first to explore dataset synthesis for IR using T2I diffusion models and MLLMs to create non-existent, high-quality images. This approach is inherently privacy-safe, copyright-compliant, and cost-effective, circumventing the major drawbacks of previous methods. The dual-prompt learning and MLLM-based filtering ensure data quality and diversity, which is crucial for training high-capacity models.
  • Model Architecture (DreamClear vs. Other IR Models):

    • GAN-based/Early Diffusion-based: While leveraging generative priors for photorealism, many prior methods (e.g., Real-ESRGAN, StableSR, DiffBIR) often use U-Net architectures or adapt Stable Diffusion's U-Net with ControlNet. They sometimes neglect explicit degradation priors.
    • DreamClear Innovation:
      • DiT-based: It builds on Diffusion Transformer (DiT), a more recent and powerful architecture for diffusion models (e.g., PixArt-alpha [13]), which is more scalable and robust than U-Nets.

      • ControlFormer: Adapts ControlNet principles for DiT, enabling effective spatial control over the DiT-based T2I model using both LQ and reference images.

      • Mixture of Adaptive Modulator (MoAM): This is a key innovation for blind IR. Unlike methods that might apply a single restoration process, MoAM explicitly extracts token-wise degradation priors and dynamically integrates multiple restoration experts (a Mixture-of-Experts approach) based on these priors. This allows DreamClear to adapt to diverse and complex real-world degradations more effectively, which was a recognized gap in previous diffusion-based IR methods.

      • MLLM Guidance: Uses MLLMs (LLaVA) to generate detailed textual captions for images, providing richer semantic guidance to the T2I diffusion model for more photorealistic restoration.

        In essence, DreamClear and GenIR represent a comprehensive strategy: generate high-quality, diverse, and ethical data, then train a powerful, degradation-aware, and generalizable model on that data.

4. Methodology

4.1. Principles

The core idea of the method relies on a dual strategy:

  1. Leveraging Generative Priors for Data Curation (GenIR): The principle is that highly capable Text-to-Image (T2I) diffusion models already possess a strong generative prior of high-quality, realistic images. By effectively guiding these models with Multimodal Large Language Models (MLLMs) and specialized dual-prompt learning, one can synthesize vast amounts of diverse, high-quality images that are free from copyright and privacy issues. These synthetic images can then serve as High-Quality (HQ) ground truth for Image Restoration (IR) tasks, overcoming the limitations of existing datasets.
  2. Adaptive and High-Capacity Image Restoration (DreamClear): The principle here is to build a powerful IR model based on Diffusion Transformers (DiT) that not only utilizes the generative prior for photorealism but also explicitly accounts for the diverse and complex nature of real-world degradations. This is achieved by dynamically adapting the restoration process based on degradation priors derived from the low-quality (LQ) input. The model integrates multiple specialized experts and leverages MLLMs for semantic guidance, ensuring robust generalization and detailed, accurate restoration.

4.2. Core Methodology In-depth (Layer by Layer)

The overall methodology comprises two main components: GenIR for dataset curation and DreamClear for image restoration.

4.2.1. GenIR: Privacy-Safe Dataset Curation Pipeline

GenIR is designed to efficiently create high-quality, non-existent images for IR datasets, addressing copyright and privacy issues. It operates in three stages:

4.2.1.1. Image-Text Pairs Construction

This stage prepares the data required to fine-tune the Text-to-Image (T2I) model for optimal image generation suitable for IR.

  • Positive Samples: Existing IR datasets (e.g., DIV2K [44], Flickr2K [2], LSDIR [39], DIV8K [22]) provide high-resolution, texture-rich images. Since these images lack corresponding text prompts, the sophisticated MLLM, Gemini-1.5-Pro [62], is employed to generate detailed text prompts for each image. This creates image-text pairs for positive samples.

  • Negative Samples: To ensure the T2I model learns to avoid undesirable attributes (e.g., cartoonish, over-smooth, blurry), negative samples are generated. As depicted in Figure 2 (a), this is done using an image-to-image pipeline [50] (e.g., SDEdit) with the T2I model and manually designed negative prompts such as "cartoon, painting, over-smooth, dirty". This process creates degraded versions corresponding to specific negative attributes.

    The following figure (Figure 2 from the original paper) illustrates the three-stage GenIR pipeline:


Figure 2: An overview of the three-stage GenIR pipeline, which includes (a) Image-Text Pairs Construction, (b) Dual-Prompt Based Fine-Tuning, and (c) Data Generation & Filtering.

4.2.1.2. Dual-Prompt Based Fine-Tuning

This stage adapts a pretrained T2I model to generate images that are specifically useful for IR tasks, by learning both desired and undesired image characteristics.

  • Prompt Learning: Instead of relying on complex, manually crafted prompts, GenIR proposes a dual-prompt based fine-tuning approach. This involves learning embeddings for both positive and negative tokens.
    • $M$ positive tokens, denoted as $\{\langle p_1 \rangle, \cdots, \langle p_M \rangle\}$, represent desired attributes (e.g., "4k, highly detailed, professional").
    • $N$ negative tokens, denoted as $\{n_1, \cdots, n_N\}$, represent undesired attributes (e.g., "deformation, low quality, over-smooth").
  • Initialization: These new positive and negative tokens are initialized using frequently used descriptive terms.
  • Model Refinement: Since the text condition is integrated into the diffusion model via cross-attention, the attention block of the T2I model is refined during fine-tuning to better comprehend and respond to these new tokens. This allows the fine-tuned T2I model to generate data efficiently using the learned prompts.
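The paper does not include code for this stage; purely as an illustration, the sketch below registers learnable positive and negative tokens in a CLIP text encoder, in the spirit of textual inversion. The model name, token strings, and initialization words are assumptions, and the actual GenIR recipe (which also refines the cross-attention blocks of the diffusion backbone) may differ.

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Hypothetical setup: any CLIP text encoder used by the T2I backbone would work similarly.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

pos_tokens = [f"<p{i}>" for i in range(1, 4)]  # learnable positive tokens <p1>..<pM>
neg_tokens = [f"<n{i}>" for i in range(1, 4)]  # learnable negative tokens <n1>..<nN>
tokenizer.add_tokens(pos_tokens + neg_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

def first_token_id(word: str) -> int:
    # id of the first sub-word of a frequently used descriptive term
    return tokenizer(word, add_special_tokens=False).input_ids[0]

with torch.no_grad():
    emb = text_encoder.get_input_embeddings().weight
    new_ids = tokenizer.convert_tokens_to_ids(pos_tokens + neg_tokens)
    init_ids = [first_token_id("detailed")] * len(pos_tokens) + \
               [first_token_id("blurry")] * len(neg_tokens)
    emb[new_ids] = emb[init_ids].clone()   # initialize new rows from descriptive words

# Freeze everything except the token embedding table; in GenIR the attention
# blocks of the diffusion model are also refined so it responds to the new tokens.
for p in text_encoder.parameters():
    p.requires_grad_(False)
text_encoder.get_input_embeddings().weight.requires_grad_(True)
```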

4.2.1.3. Data Generation & Filtering

In this final stage, the fine-tuned T2I model is used to generate a large-scale, diverse, and high-quality dataset, with strict privacy and quality controls.

  • Diverse Prompt Generation: To ensure scene diversity in the IR dataset, Gemini (an LLM) is leveraged to generate one million text prompts. These prompts describe various scenes under carefully curated language instructions that explicitly proscribe the inclusion of personal or sensitive information, thereby ensuring privacy.
  • Image Generation: The fine-tuned T2I model, along with the learned positive and negative prompts, is then used to synthesize HQ images.
  • Classifier-Free Guidance (CFG): During the sampling phase (image generation), CFG [30] is employed to effectively utilize negative prompts and mitigate the generation of undesired content. The denoising model $\epsilon_\theta$ considers two outcomes: one associated with the positive prompt ($\mathrm{pos}$) and another with the negative prompt ($\mathrm{neg}$). The final CFG prediction is formulated as
    $$\epsilon_\theta(z_t, t, \mathrm{pos}, \mathrm{neg}) = \omega \cdot \epsilon_\theta(z_t, t, \mathrm{pos}) + (1 - \omega) \cdot \epsilon_\theta(z_t, t, \mathrm{neg})$$
    Where:
    • $z_t$: The noisy latent representation at time step $t$.
    • $t$: The current time step.
    • $\mathrm{pos}$: The positive (desired) text prompt.
    • $\mathrm{neg}$: The negative (undesired) text prompt.
    • $\epsilon_\theta(z_t, t, \mathrm{pos})$: The noise predicted by the model conditioned on the positive prompt.
    • $\epsilon_\theta(z_t, t, \mathrm{neg})$: The noise predicted by the model conditioned on the negative prompt.
    • $\omega$: The CFG guidance scale, which balances the influence of positive and negative prompts.
  • Quality Filtering:
    • Quality Classifier: After sampling, generated images are evaluated by a binary quality classifier. This classifier, trained on positive and negative samples, decides whether to retain images based on predicted probabilities, ensuring basic quality standards.
    • MLLM Semantic Check: Gemini is subsequently used to perform a more advanced semantic check, ascertaining whether images exhibit blatant semantic errors or inappropriate content. This step ensures privacy and content appropriateness (e.g., no identifiable individuals).
  • Result: This process culminates in a dataset of one million high-resolution ($2040 \times 1356$) images, which are privacy-safe, copyright-free, and of superior quality.
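To illustrate stage (c), the sketch below samples candidate images with a diffusers SDXL pipeline using positive and negative prompts and applies a placeholder quality filter. The prompt strings, `quality_classifier`, and the omitted MLLM semantic check are stand-ins, not GenIR's released components.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

def quality_classifier(image) -> float:
    """Placeholder for the binary quality classifier trained on positive/negative samples."""
    return 1.0  # every image passes in this sketch

scene_prompts = ["a stone bridge over a forest stream at dawn"]   # GenIR uses 1M LLM-written prompts
positive = "<p1> <p2> <p3>, 4k, highly detailed, professional"    # learned positive tokens
negative = "<n1> <n2> <n3>, deformation, low quality, over-smooth"  # learned negative tokens

kept = []
for scene in scene_prompts:
    image = pipe(prompt=f"{scene}, {positive}",
                 negative_prompt=negative,
                 guidance_scale=7.0,            # CFG scale; the value here is illustrative
                 num_inference_steps=30).images[0]
    if quality_classifier(image) > 0.5:         # an MLLM semantic/privacy check would follow
        kept.append(image)
```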

4.2.2. DreamClear: High-Capacity Image Restoration Model

DreamClear is built upon PixArt-alpha [13], a pretrained T2I diffusion model based on the Diffusion Transformer (DiT) [53] architecture. Its design focuses on dynamically integrating various restoration experts guided by prior degradation information.

The following figure (Figure 3 from the original paper) illustrates the architecture of the proposed DreamClear:


Figure 3: Architecture of the proposed DreamClear. DreamClear adopts a dual-branch structure, using the Mixture of Adaptive Modulator to merge LQ features and reference features. An MLLM is utilized to generate a detailed text prompt as guidance for the T2I model.

4.2.2.1. Architecture Overview

DreamClear employs a dual-branch architecture:

  • LQ Branch: Processes the low-quality (LQ) input image $I_{lq}$. The image is first passed through a lightweight network, specifically the SwinIR model [41] (used in DiffBIR [46]), to produce an initial degradation-removed image $I_{ref}$. This SwinIR model acts as a preliminary degradation remover.
  • Reference Branch: Provides additional context to guide the diffusion model. DreamClear considers both $I_{lq}$ and the preliminary reference image $I_{ref}$ to direct the diffusion model, which is especially useful because $I_{ref}$ may have lost some details.
  • Textual Guidance: To achieve photorealistic restoration, DreamClear utilizes an open-source MLLM, LLaVA [47], to generate detailed captions for training images. The prompt used is "Describe this image and its style in a very detailed manner". This detailed text prompt serves as textual guidance for the T2I diffusion model.

4.2.2.2. ControlFormer

Traditional ControlNet [88] is designed for U-Net architectures in Stable Diffusion, making it unsuitable for DiT-based models due to architectural differences. ControlFormer is proposed as an adaptation for DiT.

  • Core Feature: ControlFormer inherits ControlNet's core features: trainable copy blocks and zero-initialized layers.
  • DiT Adaptation: It duplicates all DiT Blocks from the base PixArt-alpha model.
  • Feature Combination: It employs the Mixture of Adaptive Modulator (MoAM) block (described below) to combine LQ features ($x_{lq}$) and reference features ($x_{ref}$). This integration allows ControlFormer to maintain effective spatial control within the DiT framework.
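The exact ControlFormer code is not given in this analysis; the fragment below only illustrates the two ControlNet ideas it inherits: a trainable copy of a (frozen) DiT block and a zero-initialized projection so the control branch starts as a no-op. The class name and the single-argument block interface are simplifying assumptions.

```python
import copy
import torch.nn as nn

def zero_module(module: nn.Module) -> nn.Module:
    """Zero-initialize all parameters so the branch contributes nothing at the start of training."""
    for p in module.parameters():
        nn.init.zeros_(p)
    return module

class ControlledDiTBlock(nn.Module):
    def __init__(self, frozen_dit_block: nn.Module, hidden_size: int):
        super().__init__()
        self.frozen = frozen_dit_block                # pretrained DiT block, kept frozen
        self.copy = copy.deepcopy(frozen_dit_block)   # trainable copy that receives the control signal
        self.zero_proj = zero_module(nn.Linear(hidden_size, hidden_size))

    def forward(self, x, control):
        # Control features (e.g., fused LQ/reference tokens from MoAM) pass through the
        # trainable copy and are injected via the zero-initialized projection.
        return self.frozen(x) + self.zero_proj(self.copy(x + control))
```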

4.2.2.3. Mixture-of-Adaptive-Modulator (MoAM)

MoAM is designed to enhance the model's robustness to real-world degradations by effectively fusing LQ and reference features in a degradation-aware manner. Each MoAM block comprises adaptive modulators (AM), a cross-attention layer, and a router block.

MoAM operates in three main steps:

  1. Cross-Attention and Degradation Map Generation:
    • For the DiT features $x_{in}$, cross-attention is calculated between LQ features $x_{lq} \in \mathbb{R}^{N \times C}$ and reference features $x_{ref} \in \mathbb{R}^{N \times C}$. Here, $N$ denotes the number of visual tokens and $C$ represents the hidden size. The output of this cross-attention is $x_{attn} \in \mathbb{R}^{N \times C}$.
    • $x_{in}$ is then modulated using $x_{attn}$, followed by a zero linear layer.
    • A token-wise degradation map $\tilde{D} \in \mathbb{R}^{N \times C}$ is generated through a linear mapping of $x_{attn}$. This map encodes degradation priors for each token.
  2. Adaptive Modulation with Reference Features:
    • The features are further modulated using Adaptive Modulators (AM).
    • The reference features $x_{ref}$ serve as the AM condition to extract clean features. AM uses AdaLN [54] to learn dimension-wise scale $\gamma$ and shift $\beta$ parameters, embedding conditional information into the input features.
  3. Dynamic Mixture of Restoration Experts:
    • To adapt to varying degradations in real-world images, MoAM employs a mixture of degradation-aware experts.
    • Each MoAM block contains $K$ restoration experts $\{E_1, \cdots, E_K\}$, each specialized for particular degradation scenarios. Each expert $E_k$ provides its own scale $Net_k^{\gamma}$ and shift $Net_k^{\beta}$ values.
    • A routing network $R(\cdot)$ dynamically merges the guidance from these experts for each token. This network is a two-layer MLP followed by a softmax activation, which outputs token-wise expert weights $w = R(\tilde{D}) \in \mathbb{R}^{\tilde{N} \times K}$. Here $\tilde{N}$ refers to the number of tokens (possibly $N$ or a downsampled version).
    • The dynamic mixture of restoration experts computes the final scale $\gamma(i)$ and shift $\beta(i)$ for each token $i$:
      $$\gamma(i) = \sum_{k=1}^{K} w(i, k) \times Net_k^{\gamma}[x_{lq}(i)]$$
      $$\beta(i) = \sum_{k=1}^{K} w(i, k) \times Net_k^{\beta}[x_{lq}(i)]$$
      The final output $x_{out}$ is then calculated as
      $$x_{out} = (1 + \gamma) \otimes x_{in} + \beta$$
      Where:
      • $i$: Index for a specific token.
      • $k$: Index for a specific expert.
      • $K$: The total number of restoration experts (set to 3 in experiments).
      • $w(i, k)$: The weight assigned by the routing network to expert $k$ for token $i$.
      • $Net_k^{\gamma}[x_{lq}(i)]$: The scale parameter produced by expert $k$ for token $i$ from LQ features $x_{lq}(i)$.
      • $Net_k^{\beta}[x_{lq}(i)]$: The shift parameter produced by expert $k$ for token $i$ from LQ features $x_{lq}(i)$.
      • $x_{in}$: The input features to the MoAM block.
      • $\otimes$: Element-wise multiplication.

        This dynamic fusion of expert knowledge, guided by degradation priors, allows MoAM to effectively address complex and varied degradations.
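The equations above translate directly into a small routing module. The sketch below is a minimal PyTorch rendition of the token-wise expert mixture, with the cross-attention and degradation-map construction reduced to a placeholder input; the module names and layer choices are assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class MixtureOfAdaptiveModulators(nn.Module):
    """Token-wise mixture of K restoration experts producing AdaLN-style scale/shift."""

    def __init__(self, hidden_size: int, num_experts: int = 3):
        super().__init__()
        # Each expert k predicts its own scale Net_k^gamma and shift Net_k^beta from LQ tokens.
        self.scale_experts = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_experts))
        self.shift_experts = nn.ModuleList(nn.Linear(hidden_size, hidden_size) for _ in range(num_experts))
        # Router: two-layer MLP + softmax over experts, applied per token to the degradation map.
        self.router = nn.Sequential(
            nn.Linear(hidden_size, hidden_size), nn.GELU(),
            nn.Linear(hidden_size, num_experts), nn.Softmax(dim=-1),
        )

    def forward(self, x_in, x_lq, degradation_map):
        # degradation_map: (B, N, C) token-wise degradation prior derived from cross-attention
        w = self.router(degradation_map)                                      # (B, N, K) expert weights
        gammas = torch.stack([e(x_lq) for e in self.scale_experts], dim=-1)   # (B, N, C, K)
        betas = torch.stack([e(x_lq) for e in self.shift_experts], dim=-1)    # (B, N, C, K)
        gamma = (gammas * w.unsqueeze(2)).sum(dim=-1)                         # weighted sum over experts
        beta = (betas * w.unsqueeze(2)).sum(dim=-1)
        return (1 + gamma) * x_in + beta                                      # x_out = (1 + gamma) ⊗ x_in + beta

# Tiny shape check with random tensors (B=2, N=16 tokens, C=64 channels).
moam = MixtureOfAdaptiveModulators(hidden_size=64, num_experts=3)
out = moam(torch.randn(2, 16, 64), x_lq=torch.randn(2, 16, 64), degradation_map=torch.randn(2, 16, 64))
assert out.shape == (2, 16, 64)
```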

5. Experimental Setup

5.1. Datasets

The experiments use a combination of existing and newly generated datasets for training and evaluate on both synthetic and real-world benchmarks.

5.1.1. Training Datasets

DreamClear is trained using a large combined dataset:

  • DIV2K [44]: A high-quality image dataset commonly used for super-resolution and other IR tasks.

  • Flickr2K [2]: A collection of 2,650 high-resolution images gathered from Flickr, commonly combined with DIV2K (as DF2K) for IR training.

  • LSDIR [39]: The Large Scale Dataset for Image Restoration, a recent large-scale, high-quality dataset for IR.

  • DIV8K [22]: A dataset of diverse images at up to 8K resolution.

  • GenIR Generated Dataset: The paper's own large-scale dataset of one million high-quality images generated using the GenIR pipeline.

    For training, the Real-ESRGAN [64] degradation pipeline is used to generate LQ images from HQ images, with the same degradation settings as SeeSR [70] for fair comparison. All experiments are conducted with a scaling factor of $\times 4$.
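For orientation only, the toy sketch below turns an HQ image into an LQ counterpart with blur, ×4 bicubic downsampling, Gaussian noise, and JPEG compression. It is a drastically simplified stand-in for the second-order Real-ESRGAN pipeline and does not reproduce the actual SeeSR degradation settings.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def toy_degrade(hq: Image.Image, scale: int = 4, jpeg_quality: int = 40) -> Image.Image:
    img = hq.filter(ImageFilter.GaussianBlur(radius=1.5))                         # blur
    img = img.resize((hq.width // scale, hq.height // scale), Image.BICUBIC)      # x4 downsample
    arr = np.asarray(img).astype(np.float32)
    arr = np.clip(arr + np.random.normal(0.0, 5.0, arr.shape), 0, 255)            # additive Gaussian noise
    img = Image.fromarray(arr.astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_quality)                            # compression artifacts
    buf.seek(0)
    return Image.open(buf)

# Example: lq = toy_degrade(Image.open("hq.png").convert("RGB"))
```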

### 5.1.2. Testing Datasets
*   **Synthetic Benchmarks**: Used to evaluate performance under controlled, simulated degradations.
    *   `DIV2K-Val`: 3,000 randomly cropped patches from the validation sets of `DIV2K`, degraded using the same settings as training.
    *   `LSDIR-Val`: 3,000 randomly cropped patches from `LSDIR`, degraded using the same settings as training.
*   **Real-World Benchmarks**: Used to evaluate performance on genuinely degraded images.
    *   `RealSR` [8]: A commonly used dataset for real-world super-resolution.
    *   `DRealSR` [68]: Another dataset for real-world super-resolution.
    *   `RealLQ250`: A newly established benchmark in this work, comprising 250 `LQ` images ($256 \times 256$) sourced from previous works [70, 42, 64, 86, 80] or the Internet, without corresponding `Ground Truth (GT)` images.
*   **High-Level Vision Tasks Benchmarks**: Used to assess the benefit of `IR` for downstream tasks.
    *   `COCO val2017` [45]: For object detection and instance segmentation.
    *   `ADE20K` [94]: For semantic segmentation.

        For all testing datasets with `GT` images, the resolution of `HQ-LQ` image pairs is $1024 \times 1024$ and $256 \times 256$, respectively, implying a scaling factor of 4.

## 5.2. Evaluation Metrics
The paper employs a comprehensive set of metrics, categorized into reference-based distortion, reference-based perceptual, no-reference, and generation quality metrics, along with metrics for high-level vision tasks.

### 5.2.1. Reference-Based Distortion Metrics
These metrics compare the restored image directly with a known `Ground Truth (GT)` image.
*   **Peak Signal-to-Noise Ratio (PSNR)**:
    *   **Conceptual Definition**: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher `PSNR` generally indicates a higher quality restoration. It is typically expressed in decibels (dB).
    *   **Mathematical Formula**:
    $$\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)$$
Where `MSE` is the `Mean Squared Error` between the two images, and $\mathrm{MAX}_I$ is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
    $$\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2$$
*   **Symbol Explanation**:
        *   $I$: The original (ground truth) image.
        *   $K$: The restored image.
        *   `m, n`: The dimensions (rows and columns) of the images.
        *   `I(i,j)`: The pixel value at position `(i,j)` in the original image.
        *   `K(i,j)`: The pixel value at position `(i,j)` in the restored image.
        *   $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.

*   **Structural Similarity Index Measure (SSIM)**:
    *   **Conceptual Definition**: Measures the perceived quality degradation due to processing. It is a full-reference metric that assesses three key features: `luminance`, `contrast`, and `structure`. Its value ranges from -1 to 1, with 1 indicating perfect similarity.
    *   **Mathematical Formula**:
    $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
*   **Symbol Explanation**:
        *   $x$: A patch from the original image.
        *   $y$: A patch from the restored image.
        *   $\mu_x$: The average of $x$.
        *   $\mu_y$: The average of $y$.
        *   $\sigma_x^2$: The variance of $x$.
        *   $\sigma_y^2$: The variance of $y$.
        *   $\sigma_{xy}$: The covariance of $x$ and $y$.
        *   $c_1 = (k_1 L)^2$ and $c_2 = (k_2 L)^2$: Two variables to stabilize the division with a weak denominator. $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $k_1 = 0.01$, $k_2 = 0.03$ by default.

            These metrics are typically calculated on the `Y channel` of the transformed `YCbCr space`.
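As a concrete reference, the snippet below computes PSNR on the luma channel following the formula above; in practice SSIM is usually taken from a library such as `skimage.metrics.structural_similarity` rather than re-implemented.

```python
import numpy as np

def psnr(gt: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between two images of identical shape (uint8 or float arrays)."""
    mse = np.mean((gt.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

def rgb_to_y(img: np.ndarray) -> np.ndarray:
    """Approximate BT.601 luma; full YCbCr conversion additionally applies an offset."""
    r, g, b = img[..., 0], img[..., 1], img[..., 2]
    return 0.299 * r + 0.587 * g + 0.114 * b

# Example: psnr(rgb_to_y(gt_rgb), rgb_to_y(restored_rgb))
```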

### 5.2.2. Reference-Based Perceptual Metrics
These metrics aim to quantify human perception of image quality, often correlating better with human judgment than `PSNR`/`SSIM`.
*   **Learned Perceptual Image Patch Similarity (LPIPS)** [90]:
    *   **Conceptual Definition**: Uses features extracted from a pre-trained deep neural network (e.g., `AlexNet`, `VGG`, `SqueezeNet`) to compare image patches. It measures the distance between these feature representations, aiming to correlate with human perceptual judgments. A lower `LPIPS` score indicates higher perceptual similarity.
    *   **Mathematical Formula**:
    $$\mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \|w_l \odot (\phi_l(x)_{h,w} - \phi_l(x_0)_{h,w})\|_2^2$$
*   **Symbol Explanation**:
        *   $x$: The reference image.
        *   $x_0$: The distorted image.
        *   $\phi_l$: Feature maps from layer $l$ of a pre-trained `CNN` (e.g., `AlexNet`).
        *   $w_l$: Learned per-channel weights at layer $l$.
        *   $\odot$: Element-wise multiplication.
        *   $H_l, W_l$: Height and width of the feature map at layer $l$.

*   **Deep Image Structure and Texture Similarity (DISTS)** [17]:
    *   **Conceptual Definition**: A full-reference image quality metric that combines structural and textural similarity. It aims to unify `SSIM` and `LPIPS` by considering both structure (global patterns) and texture (local statistical properties) in a deep learning framework. A lower `DISTS` score indicates higher similarity.
    *   **Mathematical Formula**:
    $$\mathrm{DISTS}(X, Y) = \sum_{l=1}^L \left( \alpha_l \cdot \mathrm{D_S}(X_l, Y_l) + \beta_l \cdot \mathrm{D_T}(X_l, Y_l) \right)$$
Where $\mathrm{D_S}$ measures structural distortion and $\mathrm{D_T}$ measures textural distortion, typically calculated from `VGG` features.
    *   **Symbol Explanation**:
        *   $X$: Reference image.
        *   $Y$: Distorted image.
        *   $X_l, Y_l$: Features extracted from layer $l$ of a pre-trained network (e.g., `VGG`).
        *   $\mathrm{D_S}$: Structural similarity component.
        *   $\mathrm{D_T}$: Textural similarity component.
        *   $\alpha_l, \beta_l$: Learned weights for structural and textural components at layer $l$.
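In practice these perceptual distances are computed with the authors' released packages; the snippet below shows typical usage of the `lpips` package (inputs are RGB tensors scaled to [-1, 1]), and DISTS has a similar official implementation.

```python
import torch
import lpips  # pip install lpips

loss_fn = lpips.LPIPS(net='alex')          # AlexNet backbone, as in the original LPIPS paper
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # reference image, values in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1  # restored image
with torch.no_grad():
    distance = loss_fn(img0, img1)         # lower = perceptually more similar
print(distance.item())
```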

### 5.2.3. No-Reference Metrics
These metrics evaluate image quality without access to a `GT` image.
*   **Naturalness Image Quality Evaluator (NIQE)** [89]:
    *   **Conceptual Definition**: A no-reference `image quality assessment (IQA)` metric that quantifies image quality by comparing statistical features of the image to those of natural, undistorted images. A lower `NIQE` score indicates a more natural-looking and higher-quality image.
    *   **Mathematical Formula**:
        `NIQE` does not have a simple closed-form mathematical formula like `PSNR` or `SSIM`. It is based on fitting a `multivariate Gaussian model` to local `luminance coefficients` (e.g., from `mean subtracted contrast normalized (MSCN)` images) of an image, typically derived from a diverse dataset of natural images. The `quality score` is then the `distance` between the `Gaussian model` of the test image and the `Gaussian model` of the natural image dataset.

*   **Multidimension Attention Network for No-Reference Image Quality Assessment (MANIQA)** [75]:
    *   **Conceptual Definition**: A no-reference `IQA` metric that uses a `multidimensional attention network` to assess image quality. It aims to capture complex human perceptual factors across different scales and aspects of an image. A higher `MANIQA` score indicates better quality.
    *   **Mathematical Formula**: `MANIQA` is a deep learning model, so its score is the output of the trained network. There isn't a single mathematical formula to represent its assessment process in the same way as `PSNR`.

*   **Multi-scale Image Quality Transformer (MUSIQ)** [75]:
    *   **Conceptual Definition**: A no-reference `IQA` model that utilizes a `transformer-based architecture` and operates on image patches extracted at `multiple scales`. This allows it to capture quality distortions that manifest differently at various resolutions. A higher `MUSIQ` score indicates better quality.
    *   **Mathematical Formula**: Similar to `MANIQA`, `MUSIQ` is a deep learning model, and its score is the output of its `transformer-based network`.

*   **CLIPIQA** [75]:
    *   **Conceptual Definition**: A no-reference `IQA` metric that leverages the capabilities of `CLIP` (Contrastive Language-Image Pre-training) models. It aims to align image quality assessment with semantic understanding derived from multi-modal pre-training. A higher `CLIPIQA` score indicates better quality.
    *   **Mathematical Formula**: `CLIPIQA` also relies on a neural network (specifically, features from `CLIP`) and does not have a simple mathematical formula.

### 5.2.4. Generation Quality Metrics
*   **Fréchet Inception Distance (FID)** [29]:
    *   **Conceptual Definition**: Measures the similarity between two sets of images. It calculates the `Fréchet distance` between the `Gaussian distributions` of feature representations (extracted from an `Inception-v3` network) of real and generated images. A lower `FID` score indicates that the generated images are more similar to real images in terms of perceptual quality and diversity.
    *   **Mathematical Formula**:
    $$\mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2}\right)$$
*   **Symbol Explanation**:
        *   $\mu_1$: The mean feature vector of the real images.
        *   $\mu_2$: The mean feature vector of the generated images.
        *   $\Sigma_1$: The covariance matrix of the real images' features.
        *   $\Sigma_2$: The covariance matrix of the generated images' features.
        *   $\mathrm{Tr}$: The trace of the matrix.
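Given the Inception-v3 feature statistics of the two image sets, the Fréchet distance above can be computed in a few lines; the feature extraction itself (typically done with tools such as pytorch-fid) is omitted in this sketch.

```python
import numpy as np
from scipy import linalg

def frechet_distance(mu1, sigma1, mu2, sigma2) -> float:
    """FID given means and covariances of Inception features for real and generated sets."""
    diff = mu1 - mu2
    covmean, _ = linalg.sqrtm(sigma1 @ sigma2, disp=False)   # matrix square root of Sigma1*Sigma2
    if np.iscomplexobj(covmean):
        covmean = covmean.real                               # drop tiny numerical imaginary parts
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```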

### 5.2.5. High-Level Vision Task Metrics
These metrics are used to evaluate how well the restored images benefit downstream tasks.
*   **Average Precision (AP)**:
    *   **Conceptual Definition**: A common metric for `object detection` and `instance segmentation`. It summarizes the `precision-recall curve` by taking the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold as the weight. Higher `AP` indicates better detection/segmentation performance.
    *   **Types**:
        *   $AP^b$: Average Precision for `object detection` (bounding boxes).
        *   $AP^m$: Average Precision for `instance segmentation` (masks).
    *   **Mathematical Formula**:
    $$\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n$$
Where $P_n$ and $R_n$ are the precision and recall at the $n$-th threshold, and $R_n > R_{n-1}$. This is a simplified representation; `COCO` `AP` typically averages over multiple `Intersection over Union (IoU)` thresholds.
    *   **Symbol Explanation**:
        *   $P_n$: Precision at the $n$-th recall level.
        *   $R_n$: Recall at the $n$-th recall level.

*   **Mean Intersection over Union (mIoU)**:
    *   **Conceptual Definition**: A standard metric for `semantic segmentation`. It calculates the `IoU` for each class (the ratio of the intersection area to the union area of the predicted and ground truth segmentation masks), and then averages these `IoU` values over all classes. A higher `mIoU` indicates better segmentation performance.
    *   **Mathematical Formula**:
    $$\mathrm{IoU}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}$$
    $$\mathrm{mIoU} = \frac{1}{C} \sum_{k=1}^C \mathrm{IoU}_k$$
*   **Symbol Explanation**:
    *   $\mathrm{TP}_k$: True Positives for class $k$.
    *   $\mathrm{FP}_k$: False Positives for class $k$.
    *   $\mathrm{FN}_k$: False Negatives for class $k$.
    *   $C$: Total number of classes.
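As a sanity check on the two definitions above, the sketch below computes AP from a precision-recall curve and mIoU from a confusion matrix. Real COCO/ADE20K evaluation relies on pycocotools-style tooling with IoU-threshold averaging, which this simplified version omits.

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP = sum_n (R_n - R_{n-1}) * P_n for a curve sorted by increasing recall."""
    order = np.argsort(recall)
    p, r = precision[order], recall[order]
    return float(np.sum((r - np.concatenate(([0.0], r[:-1]))) * p))

def mean_iou(confusion: np.ndarray) -> float:
    """mIoU from a CxC confusion matrix (rows = ground truth, cols = prediction)."""
    tp = np.diag(confusion).astype(np.float64)
    fp = confusion.sum(axis=0) - tp        # predicted as class k but actually another class
    fn = confusion.sum(axis=1) - tp        # actually class k but predicted as another class
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return float(iou.mean())

# Example: mean_iou(np.array([[50, 2], [3, 45]]))
```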

5.3. Baselines

The paper compares DreamClear against a range of state-of-the-art Image Restoration (IR) methods, including both GAN-based and diffusion-based approaches:

  • GAN-based Methods:

    • BSRGAN [85]
    • Real-ESRGAN [64]
    • SwinIR-GAN [41]
    • DASR [43]
  • Diffusion-based Methods:

    • StableSR [63]

    • DiffBIR [46]

    • ResShift [82]

    • SinSR [65]

    • SeeSR [70]

    • SUPIR [80]

      These baselines are representative of the current state-of-the-art in real-world IR, encompassing different architectural choices and strategies for handling image degradation.

5.4. Implementation Details

  • GenIR Training:
    • Uses the original latent diffusion loss [59].
    • Built on SDXL [55].
    • Trained over 5 days using 16 NVIDIA A100 GPUs.
    • Training conducted on $1024 \times 1024$ resolution images with a batch size of 256.
    • For data generation, 256 NVIDIA V100 GPUs were used for 5 days to generate the one million high-quality images.
    • SDEdit strength for negative sample generation: 0.6.
    • Gemini prompts for filtering explicitly forbid inappropriate content (privacy violations, NSFW).
  • DreamClear Training:
    • Built upon PixArt-alpha [13] and LLaVA [47].
    • SwinIR model from DiffBIR [46] used as the lightweight degradation remover in the LQ branch.
    • AdamW optimizer with a learning rate of $5 \times 10^{-5}$.
    • Training conducted on $1024 \times 1024$ resolution images.
    • Ran for 7 days on 32 NVIDIA A100 GPUs with a batch size of 128.
    • The number of experts $K$ in MoAM is set to 3.
  • DreamClear Inference:
    • iDDPM [51] is adopted with 50 sampling steps.
    • CFG guidance scale $\omega = 4.5$.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate DreamClear's superior performance across various benchmarks, validating the effectiveness of the proposed dual strategy (GenIR for data and DreamClear for modeling).

6.1.1. Quantitative Comparisons

The following are the results from Table 1 of the original paper:

Datasets Metrics BSRGAN [85] Real-ESRGAN [64] SwinIR-GAN [41] DASR [43] StableSR [63] DiffBIR [46] ResShift [82] SinSR [65] SeeSR [70] SUPIR [80] DreamClear
DIV2K-Val PSNR ↑ 19.88 19.92 19.66 19.73 19.73 19.98 19.80 19.37 19.59 18.68 18.69
SSIM ↑ 0.5137 0.5334 0.5253 0.5122 0.5039 0.4987 0.4985 0.4613 0.5045 0.4664 0.4766
LPIPS ↓ 0.4303 0.3981 0.3992 0.4350 0.4145 0.3866 0.4450 0.4383 0.3662 0.3976 0.3657
DISTS ↓ 0.2484 0.2304 0.2253 0.2606 0.2162 0.2396 0.2383 0.2175 0.1886 0.1818 0.1637
FID ↓ 54.42 48.44 49.17 59.62 29.64 37.00 46.12 37.84 24.98 28.11 20.61
NIQE ↓ 3.9322 3.8762 3.7468 3.9725 4.4255 4.5659 5.9852 5.7320 4.1320 3.4014 3.2126
MANIQA ↑ 0.3514 0.3854 0.3654 0.3110 0.2942 0.4268 0.3782 0.4206 0.5251 0.4291 0.4320
MUSIQ ↑ 63.93 64.50 64.54 59.66 58.60 64.77 62.67 65.27 72.04 69.34 68.44
CLIPIQA ↑ 0.5589 0.5804 0.5682 0.5565 0.5190 0.6527 0.6498 0.6961 0.7181 0.6035 0.6963
LSDIR-Val PSNR ↑ 18.27 18.13 17.98 18.15 18.11 18.42 18.24 17.94 18.03 16.95 17.01
SSIM ↑ 0.4673 0.4867 0.4783 0.4679 0.4508 0.4618 0.4579 0.4302 0.4564 0.4080 0.4236
LPIPS ↓ 0.4378 0.3986 0.4020 0.4503 0.4152 0.4049 0.4524 0.4523 0.3759 0.4119 0.3836
DISTS ↓ 0.2539 0.2278 0.2253 0.2615 0.2159 0.2439 0.2436 0.2265 0.1966 0.1838 0.1656
FID ↓ 53.25 46.46 45.31 60.60 31.26 35.91 43.25 36.01 25.91 30.03 22.06
NIQE ↓ 3.6885 3.4078 3.3715 3.6432 4.0218 4.3750 5.5635 5.4240 4.0590 2.9820 3.0707
MANIQA ↑ 0.3829 0.4381 0.3991 0.3315 0.3098 0.4551 0.3995 0.4309 0.5700 0.4683 0.4811
MUSIQ ↑ 65.98 68.25 67.10 60.96 59.37 65.94 63.25 65.35 73.00 70.98 70.40
CLIPIQA ↑ 0.5648 0.6218 0.5983 0.5681 0.5190 0.6592 0.6501 0.6900 0.7261 0.6174 0.6914
RealSR PSNR ↑ 25.01 24.22 24.89 25.51 24.60 24.77 24.94 24.47 24.66 22.67 22.56
SSIM ↑ 0.7422 0.7401 0.7543 0.7526 0.7387 0.6902 0.7178 0.6710 0.7209 0.6567 0.6548
LPIPS ↓ 0.2853 0.2901 0.2680 0.3201 0.2736 0.3436 0.3864 0.4208 0.2997 0.3545 0.3684
DISTS ↓ 0.1967 0.1892 0.1734 0.2056 0.1761 0.2195 0.2467 0.2432 0.2029 0.2185 0.2122
FID ↓ 84.49 90.10 80.07 91.16 88.89 69.94 88.91 70.83 71.92 71.63 65.37
NIQE ↓ 4.9261 5.0069 4.9475 5.9659 5.6124 6.1294 6.6044 6.4662 4.9102 4.5368 4.4381
MANIQA ↑ 0.3660 0.3656 0.3432 0.2819 0.3465 0.4182 0.3781 0.4009 0.5189 0.4296 0.4337
MUSIQ ↑ 64.67 62.06 60.97 50.94 61.07 61.74 60.28 60.36 69.38 66.09 65.33
CLIPIQA ↑ 0.5329 0.4872 0.4548 0.3819 0.5139 0.6202 0.5778 0.6587 0.6839 0.5371 0.6895
DrealSR PSNR ↑ 27.09 26.95 27.00 28.19 27.39 27.31 27.16 26.15 27.10 24.41 24.48
SSIM ↑ 0.7759 0.7812 0.7815 0.8051 0.7830 0.7140 0.7388 0.6564 0.7596 0.6696 0.6508
LPIPS ↓ 0.2950 0.2876 0.2789 0.3165 0.2710 0.3920 0.4101 0.4690 0.3117 0.3844 0.3972
DISTS ↓ 0.1956 0.1857 0.1787 0.2072 0.1737 0.2443 0.2553 0.2103 0.2264 0.2145 0.2145
FID ↓ 84.26 83.79 84.22 94.96 80.23 89.47 91.45 86.73 82.45 73.66 75.02
NIQE ↓ 4.7570 4.5828 4.5828 5.3941 5.0970 5.8821 6.2570 6.0142 4.7500 4.3217 4.2497
MANIQA ↑ 0.3670 0.3644 0.3540 0.2831 0.3512 0.4168 0.3800 0.4042 0.5186 0.4289 0.4326
MUSIQ ↑ 64.44 62.58 61.76 51.35 61.64 62.46 60.67 60.75 69.45 66.19 65.42
CLIPIQA ↑ 0.5367 0.4907 0.4632 0.3843 0.5192 0.6225 0.5790 0.6601 0.6860 0.5401 0.6918

Analysis of Quantitative Results:

  • Synthetic Datasets (DIV2K-Val, LSDIR-Val):

    • DreamClear consistently demonstrates superior performance in perceptual metrics like LPIPS, DISTS, and FID. For DIV2K-Val, DreamClear achieves the best LPIPS (0.3657), DISTS (0.1637), and FID (20.61). Similar trends are observed for LSDIR-Val in DISTS (0.1656) and FID (22.06), and it is the second best in LPIPS (0.3836). This signifies that DreamClear generates images that are perceptually closer to Ground Truth and exhibit higher fidelity and realism, which is a primary goal of diffusion-based methods.
    • For no-reference metrics (NIQE, MANIQA, MUSIQ, CLIPIQA), DreamClear often secures the best or second-best position, indicating high perceived quality even without a reference. For example, on DIV2K-Val, it has the best NIQE (3.2126) and second best CLIPIQA (0.6963).
    • It is notable that DreamClear sometimes shows slightly lower PSNR/SSIM scores compared to some baselines (e.g., DiffBIR on LSDIR-Val PSNR). The paper acknowledges this, stating that these distortion-based metrics may not fully represent visual quality for photorealistic restoration, and emphasizes the importance of perceptual metrics for diffusion-based models.
  • Real-World Benchmarks (RealSR, DrealSR):

    • On these challenging datasets, DreamClear generally maintains strong performance in no-reference metrics, securing the best NIQE score on both RealSR (4.4381) and DrealSR (4.2497). It also performs very competitively in MANIQA, MUSIQ, and CLIPIQA, often achieving the best or second-best scores. This suggests that DreamClear produces images that are highly natural and visually appealing in real-world scenarios.
    • Similar to synthetic datasets, PSNR/SSIM scores might not always be the highest, reinforcing the paper's argument about the limitations of these metrics for photorealistic restoration. However, its superior FID on RealSR (65.37) indicates better overall distribution similarity to real images.

6.1.2. Evaluation on Downstream Benchmarks

The following are the results from Table 2 of the original paper:

Metrics GT Zoomed LQ BSRGAN Real-ESRGAN SwinIR-GAN DASR StableSR DiffBIR ResShift SinSR SeeSR SUPIR DreamClear
Object Detection ($AP^b$) 49.0 7.4 11.0 12.8 11.8 10.5 16.9 18.7 15.6 13.8 18.2 16.6 19.3
Object Detection ($AP^b_{50}$) 70.6 12.0 17.6 20.7 18.9 17.0 26.7 29.9 25.0 22.3 29.1 27.2 30.8
Object Detection ($AP^b_{75}$) 53.8 7.5 11.4 13.1 12.1 10.7 17.6 19.4 15.9 14.2 18.9 17.0 19.8
Instance Segmentation ($AP^m$) 43.9 6.4 9.6 11.3 10.2 9.3 14.6 16.2 13.6 12.0 15.9 14.1 16.7
Instance Segmentation ($AP^m_{50}$) 67.7 11.2 16.4 19.3 17.5 15.9 24.6 27.5 23.3 20.6 26.6 24.5 28.2
Instance Segmentation ($AP^m_{75}$) 47.3 6.3 9.7 11.5 10.2 9.4 14.9 16.6 13.7 12.1 16.1 14.0 16.8
Semantic Segmentation (mIoU) 50.4 11.5 18.6 17.3 14.3 30.4 19.6 23.6 29.7 19.6 26.9 27.7 31.9

Analysis of Downstream Task Results:

  • DreamClear consistently achieves the best performance across all object detection ($AP^b$), instance segmentation ($AP^m$), and semantic segmentation (mIoU) metrics on COCO val2017 and ADE20K.
  • For example, DreamClear scores 19.3 $AP^b$ for object detection, 16.7 $AP^m$ for instance segmentation, and 31.9 mIoU for semantic segmentation, significantly outperforming all baselines.
  • This demonstrates that DreamClear not only restores visual quality but also preserves and enhances the semantic information critical for high-level computer vision tasks. This is a strong indicator of the model's ability to produce semantically coherent and structurally accurate restorations, which is crucial for real-world applications where IR often serves as a pre-processing step.

6.1.3. Qualitative Comparisons

The following figure (Figure 4 from the original paper) shows qualitative comparisons on synthetic or real-world images.

Figure 4: Qualitative comparisons on synthetic and real-world images, contrasting DreamClear with mainstream restoration methods (e.g., on building and bird photos) and highlighting its gains in detail and sharpness. Please zoom in for a better view.

  • The qualitative comparisons (e.g., Figure 4) visually confirm DreamClear's superiority.
  • When faced with severe degradations (first row, e.g., heavily blurred/noisy text), DreamClear can reason the correct structure and generate clear details, while other methods might produce deformed structures or blurry results.
  • For real-world images (third row), DreamClear generates results that are rich in detail and more natural than competitors. This aligns with its strong performance in perceptual and no-reference metrics.

6.1.4. User Study

The following figure (Figure 5 from the original paper) shows the user study results.


Figure 5: User study. Vote percentage denotes average user preference per model. The Top-K ratio indicates the proportion of images preferred by top K users. Our model is highly preferred, with most images being rated as top quality by the majority.

  • A user study involving 256 evaluators rated DreamClear against five other methods on 100 low-quality images.
  • DreamClear received over 45% of total votes, indicating a strong user preference for its restorations.
  • It was the top choice for 80% of the images and ranked first or second for 98% of the images, as measured by the Top-K ratio.
  • This objective evaluation by human perception strongly corroborates DreamClear's ability to produce consistently high-quality, natural, and accurate image restorations.
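For reference, the two user-study statistics can be aggregated from a simple vote matrix. The sketch below uses randomly generated hypothetical votes (the study's raw data is not available), and the Top-K computation reflects one plausible reading of the paper's definition.

```python
# Aggregating user-study votes into the two statistics shown in Figure 5.
# The vote matrix is randomly generated here (the study's raw data is not public),
# and the Top-K computation reflects one plausible reading of the definition.
import numpy as np

num_users, num_images, num_methods = 256, 100, 6
rng = np.random.default_rng(0)
# votes[u, i] = index of the method that user u preferred for image i
votes = rng.integers(0, num_methods, size=(num_users, num_images))

# Vote percentage: each method's share of all cast votes.
vote_pct = np.bincount(votes.ravel(), minlength=num_methods) / votes.size

def top_k_ratio(votes: np.ndarray, method: int, k: int) -> float:
    """Fraction of images for which at least k users preferred `method`."""
    per_image = (votes == method).sum(axis=0)
    return float((per_image >= k).mean())

print("vote %:", np.round(vote_pct, 3))
print("Top-128 ratio for method 0:", top_k_ratio(votes, method=0, k=num_users // 2))
```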

6.2. Ablation Studies / Parameter Analysis

6.2.1. Analysis of Generated Dataset for Real-World IR

The following figure (Figure 6 from the original paper) shows the impact of increasing synthetic training data.


Figure 6: Impact of synthetic training data. As data size increases, performance improves on LSDIR-Val.

  • To investigate the impact of the GenIR-generated dataset, SwinIR-GAN was trained on varying quantities of generated data.

  • As the data size increases, every metric improves: LPIPS, DISTS, and FID decrease, while MANIQA, MUSIQ, and CLIPIQA increase. This demonstrates that larger synthetic datasets enhance model generalizability and restoration performance; a sketch of how these metrics can be computed appears at the end of this subsection.

  • Notably, a model trained with 100,000 generated images already outperforms a counterpart trained on the DF2K dataset, highlighting the benefits of large-scale synthetic datasets.

    The following are the results from Table 4 of the original paper:

| Training Data | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-trained T2I Model (3,450 images) | 0.4819 | 0.2790 | 60.12 | 0.3271 | 61.94 | 0.5423 |
| Ours GenIR (3,450 images) | 0.4578 | 0.2435 | 51.29 | 0.3691 | 63.12 | 0.567 |
  • When comparing data generated by the pre-trained T2I model versus GenIR (both 3450 images), the model trained on GenIR data shows significant improvements across all metrics. This quantitatively demonstrates the effectiveness of the dual-prompt learning and filtering in GenIR for producing IR-suitable images.

    The following figure (Figure 10 from the original paper) provides a visual comparison of images generated using the pre-trained T2I model and GenIR.


Figure 10: Visual comparison of images generated using the pre-trained T2I model and GenIR. Our proposed GenIR produces images with enhanced texture and more realistic details, exhibiting less blurring and distortion. This makes it better suited for training real-world IR models.

  • Visually, images generated by GenIR exhibit enhanced texture and more realistic details, with less blurring and distortion compared to those from an un-finetuned T2I model. This qualitative difference supports the quantitative findings that GenIR data is more effective for IR training.
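The full-reference and distributional metrics quoted in this subsection (LPIPS, FID) can be approximated with standard packages. The sketch below assumes the lpips and torchmetrics libraries and uses random placeholder tensors; DISTS and the no-reference metrics are omitted for brevity, and the result is illustrative rather than a reproduction of Table 4.

```python
# Approximate computation of LPIPS and FID with standard packages; random tensors
# stand in for restored / ground-truth images, so the printed values are meaningless
# and serve only to show the API. Assumes the `lpips` and `torchmetrics` libraries.
import torch
import lpips
from torchmetrics.image.fid import FrechetInceptionDistance

restored = torch.rand(8, 3, 128, 128)      # hypothetical restored batch, NCHW in [0, 1]
ground_truth = torch.rand(8, 3, 128, 128)  # hypothetical ground-truth batch

# LPIPS expects inputs scaled to [-1, 1].
lpips_fn = lpips.LPIPS(net="alex")
lpips_score = lpips_fn(restored * 2 - 1, ground_truth * 2 - 1).mean()

# FID compares feature distributions of the two sets; reliable estimates need
# thousands of images, far more than this toy batch.
fid = FrechetInceptionDistance(feature=2048, normalize=True)  # normalize=True -> float inputs in [0, 1]
fid.update(ground_truth, real=True)
fid.update(restored, real=False)

print(f"LPIPS: {lpips_score.item():.4f}  FID: {fid.compute().item():.2f}")
```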

6.2.2. Ablations of DreamClear

The following are the results from Table 3 of the original paper:

| Variant | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ | APb | APm | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mixture of AM | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| AM | 0.3981 | 0.1843 | 25.75 | 0.4067 | 66.18 | 0.6646 | 18.0 | 15.6 | 28.6 |
| Cross-Attention | 0.4177 | 0.2016 | 29.74 | 0.3785 | 63.21 | 0.6497 | 17.2 | 15.1 | 26.3 |
| Zero-Linear | 0.4082 | 0.1976 | 29.89 | 0.4122 | 66.11 | 0.6673 | 17.6 | 15.3 | 27.2 |
| Dual-Branch | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| w/o Reference Branch | 0.4207 | 0.2033 | 30.91 | 0.3985 | 64.04 | 0.6582 | 15.9 | 14.0 | 24.7 |
| Detailed Text Prompt | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| Null Prompt | 0.3521 | 0.1607 | 20.47 | 0.4230 | 67.26 | 0.6812 | 18.8 | 16.2 | 29.8 |

(The Mixture of AM, Dual-Branch, and Detailed Text Prompt rows all correspond to the full DreamClear configuration within their respective ablation groups, which is why their scores are identical.)

Analysis of DreamClear Ablations:

  • Mixture of Adaptive Modulator (MoAM):
    • The full MoAM (using Mixture of AM) achieves the best scores across almost all metrics (LPIPS, DISTS, FID, APb, APm, mIoU).
    • Replacing MoAM with a simpler AM design leads to a substantial decrease in performance (e.g., LPIPS rises from 0.3657 to 0.3981, FID from 20.61 to 25.75).
    • Further simplifying to Cross-Attention or Zero-Linear results in even worse performance on most metrics. This strongly underscores the importance of the degradation-prior guidance and dynamic expert-fusion mechanism within MoAM for effective restoration (a generic sketch of token-wise expert mixing is given at the end of this subsection).
  • Dual-Branch Framework:
    • The complete Dual-Branch structure (DreamClear) significantly outperforms the model trained without the Reference Branch across all metrics (e.g., LPIPS from 0.3657 to 0.4207, FID from 20.61 to 30.91, mIoU from 31.9 to 24.7). This confirms that providing a reference image and processing both LQ and reference features is crucial for enhancing image details and achieving superior results.
  • Text Prompt Guidance:
    • Using Detailed Text Prompt (generated by MLLMs) leads to better performance on no-reference metrics (MANIQA, MUSIQ, CLIPIQA) and high-level vision metrics (APb, APm, mIoU) compared to using a Null Prompt.

    • While Null Prompt shows slightly better perceptual metrics (LPIPS, DISTS, FID), the superior no-reference and high-level metrics for Detailed Text Prompt suggest that MLLM-provided textual guidance better preserves semantic information, leading to more semantically coherent and useful restorations, which is often preferred in real-world applications.

      The following figure (Figure 7 from the original paper) shows visual comparisons for the ablation study on DreamClear.


Figure 7: Visual comparisons for ablation study on DreamClear.

  • Visual ablations in Figure 7 further support these findings. For instance, null prompts can lead to semantic errors (e.g., in the bear's eyes), which are corrected with MLLM-generated text prompts. Removing MoAM leads to overly smooth or semantically incorrect results. The full DreamClear model consistently yields the best visual fidelity and perception.
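To make the expert-mixing idea concrete, the sketch below implements a generic token-wise mixture-of-experts modulator: a router turns per-token degradation embeddings into weights over K experts, and the weighted mixture produces a FiLM-style scale and shift applied to the restoration features. The dimensions, K=3, and the modulation form are illustrative assumptions and do not reproduce the paper's exact MoAM.

```python
# Generic token-wise mixture-of-experts modulation in the spirit of MoAM: a router
# maps per-token degradation embeddings to weights over K experts, and the mixed
# expert output yields a FiLM-style scale/shift applied to the restoration tokens.
# Dimensions, K=3, and the modulation form are illustrative assumptions only.
import torch
import torch.nn as nn

class TokenwiseMoEModulator(nn.Module):
    def __init__(self, dim: int = 1152, deg_dim: int = 256, num_experts: int = 3):
        super().__init__()
        self.router = nn.Linear(deg_dim, num_experts)  # token-wise gating weights
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(deg_dim, dim), nn.GELU(), nn.Linear(dim, 2 * dim))
            for _ in range(num_experts)
        )

    def forward(self, feats: torch.Tensor, deg_prior: torch.Tensor) -> torch.Tensor:
        # feats:     (B, N, dim)      restoration-branch tokens
        # deg_prior: (B, N, deg_dim)  token-wise degradation representation
        weights = self.router(deg_prior).softmax(dim=-1)                        # (B, N, K)
        expert_out = torch.stack([e(deg_prior) for e in self.experts], dim=-2)  # (B, N, K, 2*dim)
        mixed = (weights.unsqueeze(-1) * expert_out).sum(dim=-2)                # (B, N, 2*dim)
        scale, shift = mixed.chunk(2, dim=-1)
        return feats * (1 + scale) + shift                                      # FiLM-style modulation

mod = TokenwiseMoEModulator()
out = mod(torch.randn(2, 1024, 1152), torch.randn(2, 1024, 256))
print(out.shape)  # torch.Size([2, 1024, 1152])
```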

6.2.3. Ablations for GenIR

The following figure (Figure 8 from the original paper) shows visual comparisons for the ablation study on GenIR.


Figure 8: Visual comparisons for ablation study on GenIR.

  • The effectiveness of dual-prompt learning in GenIR is visually demonstrated. Images generated with dual-prompt learning (e.g., the tiger and wheat field images) exhibit enhanced texture details and are more suitable for image restoration training.

    The following figure (Figure 9 from the original paper) shows visual comparisons for the ablation study on training datasets.


Figure 9: Visual comparisons for ablation study on training datasets.

  • Figure 9 further illustrates the effectiveness of the generated data from GenIR in enhancing the visual effect of IR models. Models trained with GenIR data show improved details and overall quality.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper effectively addresses the dual challenges of real-world image restoration (IR): the scarcity of comprehensive, privacy-safe training data and the need for high-capacity, adaptable restoration models.

  1. GenIR Pipeline: The paper introduces GenIR, a pioneering, privacy-safe, and cost-effective automated data curation pipeline. By leveraging Text-to-Image (T2I) diffusion models and Multimodal Large Language Models (MLLMs) with a dual-prompt learning strategy, GenIR successfully generates a large-scale dataset of one million high-quality, synthetic images. This dataset overcomes the limitations of traditional data collection methods (copyright, privacy, labor-intensiveness) and provides a robust resource for training IR models.

  2. DreamClear Model: The paper presents DreamClear, a potent IR model built upon the Diffusion Transformer (DiT) architecture. DreamClear integrates degradation priors and utilizes a novel Mixture of Adaptive Modulator (MoAM) that dynamically combines restoration experts based on token-wise degradation representations. This design enhances the model's adaptability to diverse real-world degradations and its ability to achieve photorealistic restoration by leveraging generative priors and MLLM-based semantic guidance.

    Comprehensive experiments, including quantitative evaluations on synthetic and real-world benchmarks, assessments on downstream high-level vision tasks, and user studies, confirm DreamClear's superior performance over state-of-the-art methods. The ablation studies further validate the critical contributions of each proposed component, both for data generation (GenIR) and model architecture (DreamClear).

7.2. Limitations & Future Work

The authors acknowledge two primary limitations of their work and suggest future research directions:

  1. Synthesized Texture Details vs. Ground Truth: When dealing with severe image degradation, DreamClear can produce reasonable and realistic results. However, the generated texture details, while plausible, may not perfectly match the ground-truth image. This is an inherent characteristic of generative models that hallucinate missing information.

    • Suggested Future Work: The authors propose that using a high-quality reference image or explicit human instruction could potentially mitigate this limitation, providing more accurate guidance for restoration.
  2. Inference Speed: As a diffusion-based model, DreamClear requires multiple inference steps to restore an image. While it achieves high-quality results, its inference speed cannot meet the real-time requirements of many practical applications.

    • Suggested Future Work: To address this, the authors suggest exploring techniques like model distillation (transferring knowledge from a large model to a smaller, faster one) and model quantization (reducing the precision of model weights) to improve inference speed.
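As a concrete, if greatly simplified, illustration of these directions, the sketch below applies output-matching distillation and post-training dynamic quantization to a toy stand-in network. It is not an acceleration recipe for DreamClear itself; speeding up a diffusion model in practice typically involves step-distillation techniques and more elaborate quantization schemes.

```python
# Toy illustration of the two acceleration routes mentioned above, applied to a
# small stand-in network rather than DreamClear: (1) output-matching distillation
# from a frozen teacher, (2) post-training dynamic quantization of Linear layers.
import torch
import torch.nn as nn
import torch.nn.functional as F

teacher = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64)).eval()  # stands in for the large model
student = nn.Sequential(nn.Linear(64, 64), nn.GELU(), nn.Linear(64, 64))           # smaller, faster stand-in

optimizer = torch.optim.AdamW(student.parameters(), lr=1e-4)
x = torch.randn(32, 64)  # hypothetical feature batch

# (1) One distillation step: train the student to match the frozen teacher's outputs.
with torch.no_grad():
    target = teacher(x)
loss = F.mse_loss(student(x), target)
loss.backward()
optimizer.step()

# (2) Dynamic quantization: convert Linear weights to int8 for faster CPU inference.
quantized_student = torch.quantization.quantize_dynamic(student.eval(), {nn.Linear}, dtype=torch.qint8)
print(quantized_student(x).shape)  # torch.Size([32, 64])
```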

7.3. Personal Insights & Critique

This paper presents a highly comprehensive and innovative approach to real-world image restoration. The dual strategy of addressing both data scarcity and model limitations is particularly insightful.

Inspirations and Cross-Domain Applications:

  • Synthetic Data Generation for Niche Fields: The GenIR pipeline offers a powerful blueprint for creating high-quality, privacy-safe datasets in other domains where data collection is challenging, expensive, or ethically sensitive (e.g., medical imaging, rare object detection, historical document restoration). The MLLM-driven prompt generation and filtering for specific content constraints could be invaluable.

  • Adaptive Expert Systems: The MoAM concept, where token-wise degradation priors dynamically activate specialized experts, could be generalized to other complex multi-degradation or multi-domain tasks. For example, in medical image analysis, different degradation types (noise, artifacts from different scanners) could activate specific diagnostic or enhancement pathways.

  • Semantic Guidance in Low-Level Vision: The effective use of MLLMs to provide detailed text prompts for semantic guidance in IR highlights the increasing convergence of high-level (language/semantics) and low-level (pixel-wise) vision tasks. This could inspire similar approaches in tasks like image enhancement, inpainting, or style transfer where semantic understanding can greatly improve output quality.

    Potential Issues, Unverified Assumptions, or Areas for Improvement:

  • Generalizability of GenIR-generated Data to All Real-World Degradations: While GenIR creates diverse HQ images, the paper uses the Real-ESRGAN degradation pipeline to generate LQ training pairs. The assumption is that this simulated degradation sufficiently covers the vast complexity of real-world degradations. If real-world degradation types are fundamentally different or more diverse than what Real-ESRGAN can simulate, DreamClear might still face challenges. Future work could explore more advanced, truly blind degradation simulation or unsupervised domain adaptation for the LQ images.

  • Computational Cost: Both GenIR and DreamClear are computationally intensive. GenIR requires 5 days on 256 V100 GPUs for data generation, and DreamClear requires 7 days on 32 A100 GPUs for training. While the value proposition is high, this limits accessibility for smaller research groups or practical deployment. The acknowledged limitation regarding inference speed is crucial.

  • Interpretability of MoAM: While MoAM is effective, the specific roles or specializations of the K=3 restoration experts are not explicitly detailed. Understanding what kind of degradation each expert handles could provide valuable insights for further architectural refinements or for diagnosing model failures.

  • Dependency on MLLMs: The pipeline relies heavily on MLLMs (Gemini-1.5-Pro, LLaVA) for prompt generation and filtering. The quality and bias of these MLLMs could implicitly affect the diversity and characteristics of the generated dataset and the semantic guidance during restoration. A thorough analysis of how different MLLMs impact the final IR performance would be beneficial.

  • Ethical Implications of Synthetic Data: While GenIR is lauded for being privacy-safe by avoiding identifiable individuals, the synthetic nature itself can raise questions. If used to generate data for sensitive applications, ensuring the synthetic data does not inadvertently create new biases or perpetuate existing ones (even if "privacy-safe") is important.

    Overall, DreamClear and GenIR represent a significant step forward in making high-quality real-world image restoration more feasible and robust, bridging crucial gaps in both data availability and model capability.
