DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation
TL;DR Summary
This work introduces GenIR for privacy-safe, large-scale dataset generation and DreamClear, a DiT model leveraging generative priors and multimodal LLMs for adaptive, high-quality real-world image restoration.
Abstract
Image restoration (IR) in real-world scenarios presents significant challenges due to the lack of high-capacity models and comprehensive datasets. To tackle these issues, we present a dual strategy: GenIR, an innovative data curation pipeline, and DreamClear, a cutting-edge Diffusion Transformer (DiT)-based image restoration model. GenIR, our pioneering contribution, is a dual-prompt learning pipeline that overcomes the limitations of existing datasets, which typically comprise only a few thousand images and thus offer limited generalizability for larger models. GenIR streamlines the process into three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation & filtering. This approach circumvents the laborious data crawling process, ensuring copyright compliance and providing a cost-effective, privacy-safe solution for IR dataset construction. The result is a large-scale dataset of one million high-quality images. Our second contribution, DreamClear, is a DiT-based image restoration model. It utilizes the generative priors of text-to-image (T2I) diffusion models and the robust perceptual capabilities of multi-modal large language models (MLLMs) to achieve photorealistic restoration. To boost the model's adaptability to diverse real-world degradations, we introduce the Mixture of Adaptive Modulator (MoAM). It employs token-wise degradation priors to dynamically integrate various restoration experts, thereby expanding the range of degradations the model can address. Our exhaustive experiments confirm DreamClear's superior performance, underlining the efficacy of our dual strategy for real-world image restoration. Code and pre-trained models are available at: https://github.com/shallowdream204/DreamClear.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is DreamClear: High-Capacity Real-World Image Restoration with Privacy-Safe Dataset Curation. This title clearly indicates the central topic, which is addressing challenges in real-world image restoration through two main innovations: a high-capacity model and a novel, privacy-safe method for dataset creation.
1.2. Authors
The authors are:
- Yuang Ai
- Xiaoqiang Zhou
- Huaibo Huang
- Xiaotian Han
- Zhengyu Chen
- Quanzeng You
- Hongxia Yang
Affiliations include:
- MAIS & NLPR, Institute of Automation, Chinese Academy of Sciences
- School of Artificial Intelligence, University of Chinese Academy of Sciences
- ByteDance, Inc.
- University of Science and Technology of China

The asterisk (*) next to some authors typically denotes equal contribution, and the heart symbol often indicates the corresponding author, suggesting a collaborative effort from various research institutions and industry partners.
1.3. Journal/Conference
The paper was posted on 2024-10-24 (UTC). The abstract does not explicitly state a journal or conference, and the arXiv link indicates it is a preprint. Given the topic and scope, it is likely targeting venues such as CVPR, ICCV, NeurIPS, or ICLR.
1.4. Publication Year
The publication year is 2024, based on the provided UTC timestamp.
1.5. Abstract
The paper addresses significant challenges in real-world image restoration (IR): the lack of high-capacity models and comprehensive datasets. It proposes a dual strategy:
- GenIR: An innovative data curation pipeline that uses dual-prompt learning to overcome the limitations of small existing datasets. GenIR consists of three stages: image-text pair construction, dual-prompt based fine-tuning, and data generation & filtering. This method is privacy-safe, cost-effective, and copyright-compliant, resulting in a large-scale dataset of one million high-quality images without relying on laborious data crawling.
- DreamClear: A cutting-edge Diffusion Transformer (DiT)-based image restoration model. It leverages generative priors from text-to-image (T2I) diffusion models and perceptual capabilities from multi-modal large language models (MLLMs) for photorealistic restoration. To enhance adaptability to diverse real-world degradations, DreamClear introduces the Mixture of Adaptive Modulator (MoAM), which uses token-wise degradation priors to dynamically integrate various restoration experts.

The paper claims that exhaustive experiments confirm DreamClear's superior performance, validating the effectiveness of the dual strategy for real-world image restoration.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2410.18666. This indicates the paper is a preprint available on arXiv.
The PDF link is: https://arxiv.org/pdf/2410.18666v2.pdf.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant challenge of Image Restoration (IR) in real-world scenarios. While IR has seen advancements in specific tasks under controlled conditions (e.g., super-resolution, denoising), its application to diverse and complex real-world degradations remains formidable.
This problem is crucial because real-world images often suffer from various forms of degradation, limiting their utility in many applications, from consumer photography to medical imaging and autonomous driving. The specific challenges and gaps in prior research include:
- Limited Datasets: Existing IR datasets are typically small (a few thousand images), offering limited generalizability for larger models and failing to adequately capture the wide spectrum of real-world degradations. This disconnect between training data and real-world scenarios hinders progress.
- Data Curation Issues: Traditional methods of obtaining high-quality (HQ) images for datasets, primarily web scraping, are labor-intensive, raise copyright concerns, and pose privacy risks (especially with identifiable human faces).
- Model Capacity and Generalization: Despite breakthroughs in Natural Language Processing (NLP) and AI-Generated Content (AIGC) driven by large-scale models and extensive data, IR has not seen comparable progress. Current IR models often struggle to generalize across diverse, blind degradations, and many generative prior-based methods neglect degradation priors in low-quality inputs.

The paper's entry point is a dual strategy:
- To address the dataset limitations, the authors propose GenIR, a novel, privacy-safe, and cost-effective data curation pipeline that generates synthetic, high-quality images using Text-to-Image (T2I) models and Multimodal Large Language Models (MLLMs).
- To address model capacity and generalization, they introduce DreamClear, a Diffusion Transformer (DiT)-based model that explicitly incorporates degradation priors and dynamically integrates restoration experts to handle diverse real-world degradations.
2.2. Main Contributions / Findings
The paper makes two primary contributions:
- GenIR: A Privacy-Safe Automated Data Curation Pipeline:
  - Contribution: Introduction of GenIR, a pioneering pipeline for creating large-scale, high-quality IR datasets. It leverages generative priors from pretrained T2I models and MLLMs to synthesize images, circumventing copyright and privacy issues associated with web scraping.
  - Specific Problems Solved: It addresses the scarcity of comprehensive IR datasets and the ethical/logistical challenges of traditional data collection, providing a cost-effective and privacy-safe solution.
  - Key Findings: GenIR successfully generates a dataset of one million high-quality images, proving effective for training robust real-world IR models. Ablation studies demonstrate that models trained with GenIR-generated data achieve significantly better performance than those trained on smaller, traditional datasets or on data generated by un-finetuned T2I models.
- DreamClear: A Robust, High-Capacity Image Restoration Model:
  - Contribution: Development of DreamClear, a Diffusion Transformer (DiT)-based model that integrates degradation priors into diffusion-based frameworks and employs a novel Mixture of Adaptive Modulator (MoAM).
  - Specific Problems Solved: It enhances control over content generation in T2I models for IR, improves adaptability to diverse real-world degradations, and ensures robust generalization across various scenarios. The MoAM addresses the varied nature of real-world degradations by dynamically combining specialized restoration experts.
  - Key Findings:
    - DreamClear achieves state-of-the-art performance across low-level (synthetic & real-world) and high-level (detection & segmentation) benchmarks.
    - It excels in perceptual metrics on synthetic datasets and no-reference metrics on real-world benchmarks, demonstrating photorealistic restoration.
    - User studies confirm DreamClear's superior visual quality, naturalness, and detail accuracy, with users highly preferring its outputs.
    - Ablation studies confirm the effectiveness of each component, including the dual-branch framework, ControlFormer, MoAM, and MLLM-generated detailed text prompts.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a reader should be familiar with several core concepts in deep learning and computer vision:
- Image Restoration (IR): A field in computer vision that aims to recover a high-quality (HQ) image from its degraded low-quality (LQ) counterpart. Degradations can include noise, blur, rain, compression artifacts, low resolution, etc. The goal is to reverse these degradations and produce an image that is visually pleasing and structurally accurate.
- Generative Models: A class of artificial intelligence models that can generate new data instances that resemble the training data.
  - Generative Adversarial Networks (GANs): Consist of a generator that creates synthetic data and a discriminator that tries to distinguish between real and synthetic data. They are trained in an adversarial manner.
  - Diffusion Models: A newer class of generative models that learn to reverse a gradual diffusion process. They start with random noise and progressively denoise it to generate an image. They have shown impressive capabilities in generating high-fidelity and diverse images.
- Text-to-Image (T2I) Diffusion Models: A specialized type of diffusion model that generates images based on textual descriptions (prompts). Examples include Stable Diffusion and DALL-E. They learn to map text embeddings to image features, allowing precise control over image generation content and style.
- Diffusion Transformer (DiT): A variant of diffusion models where the denoising U-Net architecture is replaced or augmented with a Transformer architecture. Transformers, originally developed for Natural Language Processing (NLP), are highly effective at modeling long-range dependencies and have proven powerful in visual tasks, especially for large-scale models like Sora and Stable Diffusion 3.
- Multimodal Large Language Models (MLLMs): Large language models that can process and understand information from multiple modalities, typically text and images. They can perform tasks like image captioning, visual question answering, and generating detailed descriptions of images. Gemini-1.5-Pro and LLaVA are examples of MLLMs.
- Prompt Learning/Engineering: The process of designing effective text inputs (prompts) to guide generative models (especially T2I models and LLMs) to produce desired outputs. Dual-prompt learning extends this by simultaneously learning both positive (desired) and negative (undesired) prompts to fine-tune the model's generation capabilities.
- Classifier-Free Guidance (CFG): A technique used in diffusion models to improve the quality and adherence of generated samples to a given condition (e.g., a text prompt). It works by performing two denoising steps, one conditioned on the prompt and one unconditioned (or conditioned on a negative prompt), then extrapolating the conditioned output away from the unconditioned one. This amplifies the effect of the condition. The CFG prediction is formulated as:
  $$\hat{\epsilon}_{\theta}(z_t, t, c^{+}, c^{-}) = \epsilon_{\theta}(z_t, t, c^{-}) + s \cdot \left( \epsilon_{\theta}(z_t, t, c^{+}) - \epsilon_{\theta}(z_t, t, c^{-}) \right)$$
  Where:
  - $z_t$: The noisy latent representation at time step $t$.
  - $t$: The current time step in the diffusion process.
  - $c^{+}$: The positive (desired) conditioning prompt.
  - $c^{-}$: The negative (undesired) conditioning prompt.
  - $\epsilon_{\theta}(z_t, t, c^{+})$: The predicted noise when conditioned on the positive prompt.
  - $\epsilon_{\theta}(z_t, t, c^{-})$: The predicted noise when conditioned on the negative prompt.
  - $s$: The CFG guidance scale, a hyperparameter that controls how strongly the model adheres to the guidance. Higher values lead to stronger adherence but can sometimes reduce diversity or realism.
  A minimal code sketch of this extrapolation step is given after this list.
- Mixture-of-Experts (MoE): A neural network architecture where multiple "expert" sub-networks are trained, and a "router" or "gating network" learns to selectively activate and combine the outputs of these experts for each input. This allows a model to specialize in different aspects of the data or task, improving efficiency and capacity.
- Adaptive Layer Normalization (AdaLN): A conditional normalization technique often used in generative models. It modulates the scale and shift parameters of layer normalization based on an external conditioning input (e.g., a text embedding or an image feature), allowing the model to adapt its internal representations to specific conditions. AdaLN learns dimension-wise scale and shift parameters.
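To make the CFG step concrete, here is a minimal PyTorch sketch of the extrapolation described above. The tensor shapes and the guidance value are illustrative assumptions, not values taken from the paper.

```python
import torch

def classifier_free_guidance(eps_pos: torch.Tensor,
                             eps_neg: torch.Tensor,
                             guidance_scale: float) -> torch.Tensor:
    """Combine two noise predictions as in classifier-free guidance.

    eps_pos: noise predicted with the positive (desired) prompt.
    eps_neg: noise predicted with the negative (or empty) prompt.
    guidance_scale: `s`; larger values push the sample toward the positive prompt.
    """
    return eps_neg + guidance_scale * (eps_pos - eps_neg)

# Toy usage with random tensors standing in for model outputs.
eps_pos = torch.randn(1, 4, 64, 64)
eps_neg = torch.randn(1, 4, 64, 64)
eps = classifier_free_guidance(eps_pos, eps_neg, guidance_scale=4.5)
print(eps.shape)  # torch.Size([1, 4, 64, 64])
```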
3.2. Previous Works
The paper contextualizes its contributions by discussing several categories of prior research:
- Image Restoration (IR) Datasets:
  - Traditional IR datasets are often created by acquiring HQ images and then simulating degradations.
  - Examples cited: DIV2K [44] and Flickr2K [2], which contain only a few thousand images.
  - Larger collections: SUPIR [80] (20 million images), highlighting the labor-intensive nature of large-scale curation.
  - The paper notes that current datasets are insufficient for covering a broad spectrum of real-world scenarios and that web scraping for data collection involves copyright and privacy concerns.
- Generative Priors in IR:
  - Many recent state-of-the-art (SOTA) approaches leverage generative priors from pretrained generative models like GANs or diffusion models for realistic IR [43, 8, 38, 20]. Stable Diffusion [59] is particularly highlighted for its rich generative prior [77, 70, 80].
  - Techniques for integrating generative priors:
    - For GANs (e.g., StyleGAN [36]), an additional encoder maps the input image to the latent space [76, 48].
    - For diffusion models, the input image is integrated as a conditional input in the latent feature space during the denoising process [15, 50].
  - The paper points out that these methods often neglect the degradation priors in input low-quality images, which is a critical element in blind IR [72].
- Synthetic Datasets:
  - The importance of large-scale, high-quality datasets for training large models is acknowledged [83, 22, 21, 24, 26].
  - Existing large datasets are often manually collected (ImageNet [16], LSDIR [39]).
  - Concerns with web-crawled data: privacy information leakage [58, 24].
  - Synthesized datasets are presented as a solution that reduces laborious human effort and avoids privacy issues [28, 25, 6]. The paper claims to be the first to explore dataset synthesis specifically for image restoration.
3.3. Technological Evolution
Image Restoration has evolved from traditional signal processing methods to deep learning approaches. Early deep learning methods focused on specific degradation types (e.g., super-resolution [95, 31], denoising [87, 84]), often relying on Convolutional Neural Networks (CNNs). The challenge then shifted to real-world IR, where degradations are complex and diverse. This led to:
- Improved Degradation Simulation: Works like BSRGAN [85], Real-ESRGAN [64], and AnimeSR [71] developed more realistic degradation pipelines to generate synthetic LQ-HQ pairs.
- Generative Models Integration: The rise of powerful generative models like GANs and diffusion models significantly improved the photorealism of restored images. These models could leverage their learned image priors to hallucinate missing details.
- Large-Scale Models and Data: Following trends in NLP (GPT-4 [1]) and AIGC (Stable Diffusion [59]), the IR field started exploring the benefits of high-capacity models and extensive data. However, the lack of suitable IR datasets became a bottleneck.

This paper's work fits into the current technological timeline by addressing this bottleneck through privacy-safe synthetic data generation (GenIR) and then leveraging a cutting-edge transformer-based generative model (Diffusion Transformer) with an explicit mechanism for degradation awareness (MoAM) to build a high-capacity IR solution (DreamClear) that can generalize to diverse real-world degradations.
3.4. Differentiation Analysis
Compared to the main methods in related work, DreamClear and GenIR offer several core differences and innovations:
- Data Curation (GenIR vs. Traditional/Web Scraping):
  - Traditional (e.g., DIV2K, Flickr2K): Small scale, limited diversity, often paired with simple degradation models.
  - Web Scraping (e.g., SUPIR): Large scale but labor-intensive, copyright-prone, and privacy-risky.
  - GenIR Innovation: It is the first to explore dataset synthesis for IR, using T2I diffusion models and MLLMs to create non-existent, high-quality images. This approach is inherently privacy-safe, copyright-compliant, and cost-effective, circumventing the major drawbacks of previous methods. The dual-prompt learning and MLLM-based filtering ensure data quality and diversity, which is crucial for training high-capacity models.
- Model Architecture (DreamClear vs. Other IR Models):
  - GAN-based/Early Diffusion-based: While leveraging generative priors for photorealism, many prior methods (e.g., Real-ESRGAN, StableSR, DiffBIR) use U-Net architectures or adapt Stable Diffusion's U-Net with ControlNet. They sometimes neglect explicit degradation priors.
  - DreamClear Innovation:
    - DiT-based: It builds on the Diffusion Transformer (DiT), a more recent and powerful architecture for diffusion models (e.g., PixArt-alpha [13]), which is more scalable and robust than U-Nets.
    - ControlFormer: Adapts ControlNet principles for DiT, enabling effective spatial control over the DiT-based T2I model using both LQ and reference images.
    - Mixture of Adaptive Modulator (MoAM): A key innovation for blind IR. Unlike methods that apply a single restoration process, MoAM explicitly extracts token-wise degradation priors and dynamically integrates multiple restoration experts (a Mixture-of-Experts approach) based on these priors. This allows DreamClear to adapt to diverse and complex real-world degradations more effectively, which was a recognized gap in previous diffusion-based IR methods.
    - MLLM Guidance: Uses MLLMs (LLaVA) to generate detailed textual captions for images, providing richer semantic guidance to the T2I diffusion model for more photorealistic restoration.

In essence, DreamClear and GenIR represent a comprehensive strategy: generate high-quality, diverse, and ethical data, then train a powerful, degradation-aware, and generalizable model on that data.
4. Methodology
4.1. Principles
The core idea of the method relies on a dual strategy:
- Leveraging Generative Priors for Data Curation (GenIR): The principle is that highly capable Text-to-Image (T2I) diffusion models already possess a strong generative prior of high-quality, realistic images. By effectively guiding these models with Multimodal Large Language Models (MLLMs) and specialized dual-prompt learning, one can synthesize vast amounts of diverse, high-quality images that are free from copyright and privacy issues. These synthetic images can then serve as High-Quality (HQ) ground truth for Image Restoration (IR) tasks, overcoming the limitations of existing datasets.
- Adaptive and High-Capacity Image Restoration (DreamClear): The principle here is to build a powerful IR model based on Diffusion Transformers (DiT) that not only utilizes the generative prior for photorealism but also explicitly accounts for the diverse and complex nature of real-world degradations. This is achieved by dynamically adapting the restoration process based on degradation priors derived from the low-quality (LQ) input. The model integrates multiple specialized experts and leverages MLLMs for semantic guidance, ensuring robust generalization and detailed, accurate restoration.
4.2. Core Methodology In-depth (Layer by Layer)
The overall methodology comprises two main components: GenIR for dataset curation and DreamClear for image restoration.
4.2.1. GenIR: Privacy-Safe Dataset Curation Pipeline
GenIR is designed to efficiently create high-quality, non-existent images for IR datasets, addressing copyright and privacy issues. It operates in three stages:
4.2.1.1. Image-Text Pairs Construction
This stage prepares the data required to fine-tune the Text-to-Image (T2I) model for optimal image generation suitable for IR.
- Positive Samples: Existing IR datasets (e.g., DIV2K [44], Flickr2K [2], LSDIR [39], DIV8K [22]) provide high-resolution, texture-rich images. Since these images lack corresponding text prompts, the sophisticated MLLM Gemini-1.5-Pro [62] is employed to generate detailed text prompts for each image. This creates image-text pairs for the positive samples.
- Negative Samples: To ensure the T2I model learns to avoid undesirable attributes (e.g., cartoonish, over-smooth, blurry), negative samples are generated. As depicted in Figure 2 (a), this is done using an image-to-image pipeline [50] (e.g., SDEdit) with the T2I model and manually designed negative prompts such as "cartoon, painting, over-smooth, dirty". This process creates degraded versions corresponding to specific negative attributes.

The following figure (Figure 2 from the original paper) illustrates the three-stage GenIR pipeline:
Figure 2: An overview of the three-stage GenIR pipeline, which includes (a) Image-Text Pairs Construction, (b) Dual-Prompt Based Fine-Tuning, and (c) Data Generation & Filtering.
4.2.1.2. Dual-Prompt Based Fine-Tuning
This stage adapts a pretrained T2I model to generate images that are specifically useful for IR tasks, by learning both desired and undesired image characteristics.
- Prompt Learning: Instead of relying on complex, manually crafted prompts, GenIR proposes a dual-prompt based fine-tuning approach. This involves learning embeddings for both positive and negative tokens.
  - Positive tokens represent desired attributes (e.g., "4k, highly detailed, professional").
  - Negative tokens represent undesired attributes (e.g., "deformation, low quality, over-smooth").
- Initialization: These new positive and negative tokens are initialized using frequently used descriptive terms.
- Model Refinement: Since the text condition is integrated into the diffusion model via cross-attention, the attention block of the T2I model is refined during fine-tuning to better comprehend and respond to these new tokens. This allows the fine-tuned T2I model to generate data efficiently using the learned prompts. A minimal sketch of adding such learnable tokens follows this list.
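The following is a hypothetical sketch, in the spirit of dual-prompt token learning, of how new "positive" and "negative" tokens could be added to a CLIP text encoder and initialized from common descriptive words. The model name, token strings, and initializer words are assumptions for illustration and not the paper's exact configuration.

```python
from transformers import CLIPTokenizer, CLIPTextModel
import torch

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")

pos_tokens = ["<pos_0>", "<pos_1>"]          # desired attributes
neg_tokens = ["<neg_0>", "<neg_1>"]          # undesired attributes
init_words = ["detailed", "professional", "blurry", "deformed"]

tokenizer.add_tokens(pos_tokens + neg_tokens)
text_encoder.resize_token_embeddings(len(tokenizer))

# Initialize the new embedding rows from frequently used descriptive terms.
emb = text_encoder.get_input_embeddings().weight
with torch.no_grad():
    for new_tok, word in zip(pos_tokens + neg_tokens, init_words):
        new_id = tokenizer.convert_tokens_to_ids(new_tok)
        src_id = tokenizer(word, add_special_tokens=False).input_ids[0]
        emb[new_id] = emb[src_id].clone()

# During fine-tuning, only these new embedding rows (and, per the paper,
# the cross-attention blocks of the T2I model) would be optimized.
```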
4.2.1.3. Data Generation & Filtering
In this final stage, the fine-tuned T2I model is used to generate a large-scale, diverse, and high-quality dataset, with strict privacy and quality controls.
- Diverse Prompt Generation: To ensure scene diversity in the IR dataset, Gemini (an LLM) is leveraged to generate one million text prompts. These prompts describe various scenes under carefully curated language instructions that explicitly proscribe the inclusion of personal or sensitive information, thereby ensuring privacy.
- Image Generation: The fine-tuned T2I model, along with the learned positive and negative prompts, is then used to synthesize HQ images.
- Classifier-Free Guidance (CFG): During the sampling phase, CFG [30] is employed to effectively utilize negative prompts and mitigate the generation of undesired content. The denoising model considers two outcomes: one associated with the positive prompt $c^{+}$ and another with the negative prompt $c^{-}$. The final CFG prediction is formulated as:
  $$\hat{\epsilon}_{\theta}(z_t, t, c^{+}, c^{-}) = \epsilon_{\theta}(z_t, t, c^{-}) + s \cdot \left( \epsilon_{\theta}(z_t, t, c^{+}) - \epsilon_{\theta}(z_t, t, c^{-}) \right)$$
  Where:
  - $z_t$: The noisy latent representation at time step $t$.
  - $t$: The current time step.
  - $c^{+}$: The positive (desired) text prompt.
  - $c^{-}$: The negative (undesired) text prompt.
  - $\epsilon_{\theta}(z_t, t, c^{+})$: The noise predicted by the model conditioned on the positive prompt.
  - $\epsilon_{\theta}(z_t, t, c^{-})$: The noise predicted by the model conditioned on the negative prompt.
  - $s$: The CFG guidance scale, which balances the influence of positive and negative prompts.
- Quality Filtering:
  - Quality Classifier: After sampling, generated images are evaluated by a binary quality classifier. This classifier, trained on positive and negative samples, decides whether to retain images based on predicted probabilities, ensuring basic quality standards.
  - MLLM Semantic Check: Gemini is subsequently used to perform a more advanced semantic check, ascertaining whether images exhibit blatant semantic errors or inappropriate content. This step ensures privacy and content appropriateness (e.g., no identifiable individuals).
- Result: This process culminates in a dataset of one million high-resolution images, which are privacy-safe, copyright-free, and of superior quality. A minimal generation-and-filtering sketch is given below.
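Below is a minimal, hypothetical sketch of this stage: an SDXL pipeline is sampled with positive/negative prompts and a binary quality classifier decides which images to keep. The model id, prompt strings, threshold, and `quality_classifier` are assumptions for illustration; the paper's learned prompt tokens and Gemini-based semantic check are not reproduced here.

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

positive_suffix = "4k, highly detailed, professional"
negative_prompt = "cartoon, painting, over-smooth, deformation, low quality"

def quality_classifier(image) -> float:
    """Placeholder for the binary quality classifier; returns P(keep)."""
    return 1.0  # assumption: always keep in this sketch

kept = []
for scene_prompt in ["a stone bridge over a mountain river at dawn"]:
    image = pipe(
        prompt=f"{scene_prompt}, {positive_suffix}",
        negative_prompt=negative_prompt,
        guidance_scale=7.0,
        num_inference_steps=30,
    ).images[0]
    if quality_classifier(image) > 0.5:
        kept.append(image)  # an MLLM semantic/privacy check would follow
```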
4.2.2. DreamClear: High-Capacity Image Restoration Model
DreamClear is built upon PixArt-alpha [13], a pretrained T2I diffusion model based on the Diffusion Transformer (DiT) [53] architecture. Its design focuses on dynamically integrating various restoration experts guided by prior degradation information.
The following figure (Figure 3 from the original paper) illustrates the architecture of the proposed DreamClear:

Figure 3: Architecture of the proposed DreamClear. DreamClear adopts a dual-branch structure, using Mixture of Adaptive Modulator to merge LQ features and Reference features. We utilize MLLM to generate detailed text prompt as the guidance for T2I model.
4.2.2.1. Architecture Overview
DreamClear employs a dual-branch architecture:
- LQ Branch: Processes the low-quality (LQ) input image $I_{lq}$. The image is first passed through a lightweight network, specifically the SwinIR model [41] (as used in DiffBIR [46]), to produce an initial degradation-removed image $I_{ref}$. This SwinIR model acts as a preliminary degradation remover.
- Reference Branch: Provides additional context to guide the diffusion model. DreamClear considers both $I_{lq}$ and the preliminary reference image $I_{ref}$ to direct the diffusion model, which is especially useful since $I_{ref}$ may have lost some details.
- Textual Guidance: To achieve photorealistic restoration, DreamClear utilizes an open-source MLLM, LLaVA [47], to generate detailed captions for training images. The prompt used is "Describe this image and its style in a very detailed manner". This detailed text prompt serves as textual guidance for the T2I diffusion model. A schematic sketch of this dual-branch data flow is given below.
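The following is a schematic sketch (not the paper's implementation) of the dual-branch data flow: the LQ image is pre-cleaned by a lightweight remover (SwinIR in the paper), an MLLM produces a detailed caption, and the DiT-based diffusion model is conditioned on the LQ image, the reference image, and the caption. All modules below are stand-in stubs.

```python
import torch
import torch.nn as nn

class DegradationRemover(nn.Module):      # stands in for SwinIR
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(3, 3, 3, padding=1)
    def forward(self, x):
        return self.net(x)

def mllm_caption(image: torch.Tensor) -> str:
    # Stands in for LLaVA prompted with
    # "Describe this image and its style in a very detailed manner".
    return "a detailed photo of ..."

class DiffusionRestorer(nn.Module):       # stands in for the DiT + ControlFormer
    def forward(self, lq, ref, caption):
        return ref                        # placeholder output

restorer, remover = DiffusionRestorer(), DegradationRemover()
i_lq = torch.rand(1, 3, 256, 256)
i_ref = remover(i_lq)                     # preliminary degradation removal
i_hq = restorer(i_lq, i_ref, mllm_caption(i_lq))
```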
4.2.2.2. ControlFormer
Traditional ControlNet [88] is designed for U-Net architectures in Stable Diffusion, making it unsuitable for DiT-based models due to architectural differences. ControlFormer is proposed as an adaptation for DiT.
- Core Feature: ControlFormer inherits ControlNet's core features: trainable copy blocks and zero-initialized layers.
- DiT Adaptation: It duplicates all DiT Blocks from the base PixArt-alpha model.
- Feature Combination: It employs the Mixture of Adaptive Modulator (MoAM) block (described below) to combine LQ features $x_{lq}$ and reference features $x_{ref}$. This integration allows ControlFormer to maintain effective spatial control within the DiT framework. A zero-initialization sketch is shown after this list.
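Here is a brief sketch of the ControlNet-style zero initialization that ControlFormer reuses for DiT blocks: the control branch starts as an output-zero no-op so that fine-tuning cannot initially disturb the pretrained base model. This is a simplified illustration, not the paper's exact module.

```python
import copy
import torch.nn as nn

def zero_linear(dim_in: int, dim_out: int) -> nn.Linear:
    layer = nn.Linear(dim_in, dim_out)
    nn.init.zeros_(layer.weight)   # zero weights ...
    nn.init.zeros_(layer.bias)     # ... and zero bias -> initial output is 0
    return layer

def make_control_copy(dit_block: nn.Module, hidden: int):
    trainable_copy = copy.deepcopy(dit_block)   # trainable copy of a DiT block
    out_proj = zero_linear(hidden, hidden)      # zero-initialized projection
    return trainable_copy, out_proj
```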
4.2.2.3. Mixture-of-Adaptive-Modulator (MoAM)
MoAM is designed to enhance the model's robustness to real-world degradations by effectively fusing LQ and reference features in a degradation-aware manner. Each MoAM block comprises adaptive modulators (AM), a cross-attention layer, and a router block.
MoAM operates in three main steps:
- Cross-Attention and Degradation Map Generation:
  - For the DiT features $x_{in}$, cross-attention is calculated between the LQ features $x_{lq} \in \mathbb{R}^{N \times C}$ and the reference features $x_{ref} \in \mathbb{R}^{N \times C}$, where $N$ denotes the number of visual tokens and $C$ represents the hidden size. The output of this cross-attention is $x_{attn}$.
  - $x_{in}$ is then modulated using $x_{attn}$, followed by a zero linear layer.
  - A token-wise degradation map is generated through a linear mapping of $x_{attn}$. This map encodes degradation priors for each token.
- Adaptive Modulation with Reference Features:
  - The features are further modulated using Adaptive Modulators (AM).
  - The reference features $x_{ref}$ serve as the AM condition to extract clean features. AM uses AdaLN [54] to learn dimension-wise scale and shift parameters, embedding conditional information into the input features.
- Dynamic Mixture of Restoration Experts:
  - To adapt to varying degradations in real-world images, MoAM employs a mixture of degradation-aware experts.
  - Each MoAM block contains $K$ restoration experts, each specialized for particular degradation scenarios. Each expert provides its own scale and shift values.
  - A routing network dynamically merges the guidance from these experts for each token. This network is a two-layer MLP followed by a softmax activation, which outputs token-wise expert weights $w(i, k)$ (the token dimension here may be $N$ or a downsampled version of it).
  - The dynamic mixture of restoration experts computes the final scale and shift for each token $i$ as:
    $$\mathrm{scale}(i) = \sum_{k=1}^{K} w(i, k) \cdot \mathrm{scale}_{k}(i), \qquad \mathrm{shift}(i) = \sum_{k=1}^{K} w(i, k) \cdot \mathrm{shift}_{k}(i)$$
    And the final output $x_{out}$ is then calculated as:
    $$x_{out}(i) = \mathrm{scale}(i) \odot x_{in}(i) + \mathrm{shift}(i)$$
    Where:
    - $i$: Index for a specific token.
    - $k$: Index for a specific expert.
    - $K$: The total number of restoration experts (set to 3 in experiments).
    - $w(i, k)$: The weight assigned by the routing network to expert $k$ for token $i$.
    - $\mathrm{scale}_{k}(i)$: The scale parameter produced by expert $k$ for token $i$ from the LQ features $x_{lq}(i)$.
    - $\mathrm{shift}_{k}(i)$: The shift parameter produced by expert $k$ for token $i$ from the LQ features $x_{lq}(i)$.
    - $x_{in}$: The input features to the MoAM block.
    - $\odot$: Element-wise multiplication.

This dynamic fusion of expert knowledge, guided by degradation priors, allows MoAM to effectively address complex and varied degradations. A minimal sketch of the routing step is given below.
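The following is a minimal PyTorch sketch of the token-wise expert mixing described above. The expert and router layouts are simplified assumptions; only the routing logic (softmax weights over per-expert scale/shift, followed by modulation) follows the formulas in this section.

```python
import torch
import torch.nn as nn

class MoAMRouting(nn.Module):
    def __init__(self, dim: int, num_experts: int = 3):
        super().__init__()
        # Each expert predicts per-token scale and shift from the LQ features.
        self.experts = nn.ModuleList(
            [nn.Linear(dim, 2 * dim) for _ in range(num_experts)]
        )
        # Two-layer MLP router followed by softmax -> token-wise expert weights.
        self.router = nn.Sequential(
            nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, num_experts)
        )

    def forward(self, x_in, x_lq, degradation_map):
        # degradation_map: token-wise degradation prior, shape (B, N, dim)
        w = torch.softmax(self.router(degradation_map), dim=-1)   # (B, N, K)
        scales, shifts = [], []
        for expert in self.experts:
            s, b = expert(x_lq).chunk(2, dim=-1)                  # (B, N, dim) each
            scales.append(s)
            shifts.append(b)
        scale = torch.einsum("bnk,kbnd->bnd", w, torch.stack(scales))
        shift = torch.einsum("bnk,kbnd->bnd", w, torch.stack(shifts))
        return scale * x_in + shift

x = torch.randn(2, 16, 64)                 # (batch, tokens, hidden)
block = MoAMRouting(dim=64)
out = block(x, x, x)                       # toy call with shared tensors
print(out.shape)                           # torch.Size([2, 16, 64])
```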
5. Experimental Setup
5.1. Datasets
The experiments use a combination of existing and newly generated datasets for training and evaluate on both synthetic and real-world benchmarks.
5.1.1. Training Datasets
DreamClear is trained using a large combined dataset:
- DIV2K [44]: A high-quality image dataset commonly used for super-resolution and other IR tasks.
- Flickr2K [2]: A dataset of Flickr images with diverse content.
- LSDIR [39]: LSDIR (Large Scale Dataset for Image Restoration), a large-scale dataset for IR.
- DIV8K [22]: DIV8K (Diverse 8K), a dataset of diverse, high-resolution images.
- GenIR Generated Dataset: The paper's own large-scale dataset of one million high-quality images generated using the GenIR pipeline.

For training, the Real-ESRGAN [64] degradation pipeline is used to generate LQ images from HQ images, with the same degradation settings as SeeSR [70] for fair comparison. All experiments are conducted with a scaling factor of ×4. A simplified degradation sketch is given below.
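As a rough illustration of how LQ images are synthesized from HQ images, here is a simplified, single-pass stand-in (blur, downsample, noise, JPEG). The actual Real-ESRGAN pipeline is a more elaborate second-order process with randomized parameters; the values below are assumptions.

```python
import io
import numpy as np
from PIL import Image, ImageFilter

def make_lq(hq: Image.Image, scale: int = 4, sigma: float = 2.0,
            noise_std: float = 5.0, jpeg_q: int = 40) -> Image.Image:
    img = hq.filter(ImageFilter.GaussianBlur(radius=sigma))          # blur
    img = img.resize((hq.width // scale, hq.height // scale),
                     Image.BICUBIC)                                   # downsample x4
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, noise_std, arr.shape)                  # Gaussian noise
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=jpeg_q)                      # JPEG artifacts
    buf.seek(0)
    return Image.open(buf)

lq = make_lq(Image.new("RGB", (512, 512), "gray"))
print(lq.size)  # (128, 128)
```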
### 5.1.2. Testing Datasets
* **Synthetic Benchmarks**: Used to evaluate performance under controlled, simulated degradations.
* `DIV2K-Val`: 3,000 randomly cropped patches from the validation sets of `DIV2K`, degraded using the same settings as training.
* `LSDIR-Val`: 3,000 randomly cropped patches from `LSDIR`, degraded using the same settings as training.
* **Real-World Benchmarks**: Used to evaluate performance on genuinely degraded images.
* `RealSR` [8]: A commonly used dataset for real-world super-resolution.
* `DRealSR` [68]: Another dataset for real-world super-resolution.
* `RealLQ250`: A newly established benchmark in this work, comprising 250 `LQ` images sourced from previous works [70, 42, 64, 86, 80] or the Internet, without corresponding `Ground Truth (GT)` images.
* **High-Level Vision Tasks Benchmarks**: Used to assess the benefit of `IR` for downstream tasks.
* `COCO val2017` [45]: For object detection and instance segmentation.
* `ADE20K` [94]: For semantic segmentation.
For all testing datasets with `GT` images, the resolutions of the `HQ-LQ` image pairs correspond to a scaling factor of 4.
## 5.2. Evaluation Metrics
The paper employs a comprehensive set of metrics, categorized into reference-based distortion, reference-based perceptual, no-reference, and generation quality metrics, along with metrics for high-level vision tasks.
### 5.2.1. Reference-Based Distortion Metrics
These metrics compare the restored image directly with a known `Ground Truth (GT)` image.
* **Peak Signal-to-Noise Ratio (PSNR)**:
* **Conceptual Definition**: Measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher `PSNR` generally indicates a higher quality restoration. It is typically expressed in decibels (dB).
* **Mathematical Formula**:
\mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right)
Where `MSE` is the `Mean Squared Error` between the two images, and `MAX_I` is the maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
\mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2
* **Symbol Explanation**:
* `I`: The original (ground truth) image.
* `K`: The restored image.
* `m, n`: The dimensions (rows and columns) of the images.
* `I(i,j)`: The pixel value at position `(i,j)` in the original image.
* `K(i,j)`: The pixel value at position `(i,j)` in the restored image.
* `MAX_I`: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
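For reference, a direct NumPy translation of the PSNR/MSE formulas above, assuming 8-bit images:

```python
import numpy as np

def psnr(ref: np.ndarray, restored: np.ndarray, max_val: float = 255.0) -> float:
    mse = np.mean((ref.astype(np.float64) - restored.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                      # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

a = np.full((64, 64), 128, dtype=np.uint8)
b = a.copy(); b[0, 0] = 120
print(round(psnr(a, b), 2))
```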
* **Structural Similarity Index Measure (SSIM)**:
* **Conceptual Definition**: Measures the perceived quality degradation due to processing. It is a full-reference metric that assesses three key features: `luminance`, `contrast`, and `structure`. Its value ranges from -1 to 1, with 1 indicating perfect similarity.
* **Mathematical Formula**:
\mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}
* **Symbol Explanation**:
* `x`: A patch from the original image.
* `y`: A patch from the restored image.
* `μ_x`: The average of `x`.
* `μ_y`: The average of `y`.
* `σ_x^2`: The variance of `x`.
* `σ_y^2`: The variance of `y`.
* `σ_xy`: The covariance of `x` and `y`.
* `c_1 = (k_1 L)^2` and `c_2 = (k_2 L)^2`: Two variables that stabilize the division with a weak denominator. `L` is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and `k_1 = 0.01`, `k_2 = 0.03` by default.
These metrics are typically calculated on the `Y channel` of the transformed `YCbCr space`.
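In practice SSIM is rarely re-implemented by hand; a common choice is scikit-image, where `data_range` must match the pixel value range. A minimal usage sketch, assuming uint8 grayscale inputs:

```python
import numpy as np
from skimage.metrics import structural_similarity

ref = np.random.randint(0, 256, (128, 128), dtype=np.uint8)
noisy = np.clip(ref + np.random.normal(0, 10, ref.shape), 0, 255).astype(np.uint8)
score = structural_similarity(ref, noisy, data_range=255)
print(score)  # 1.0 would mean identical images
```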
### 5.2.2. Reference-Based Perceptual Metrics
These metrics aim to quantify human perception of image quality, often correlating better with human judgment than `PSNR`/`SSIM`.
* **Learned Perceptual Image Patch Similarity (LPIPS)** [90]:
* **Conceptual Definition**: Uses features extracted from a pre-trained deep neural network (e.g., `AlexNet`, `VGG`, `SqueezeNet`) to compare image patches. It measures the distance between these feature representations, aiming to correlate with human perceptual judgments. A lower `LPIPS` score indicates higher perceptual similarity.
* **Mathematical Formula**:
\mathrm{LPIPS}(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \|w_l \odot (\phi_l(x)_{h,w} - \phi_l(x_0)_{h,w})\|_2^2
* **Symbol Explanation**:
* `x`: The reference image.
* `x_0`: The distorted image.
* `φ_l`: Feature maps from layer `l` of a pre-trained `CNN` (e.g., `AlexNet`).
* `w_l`: A learned per-channel weight vector at layer `l`.
* `⊙`: Element-wise multiplication.
* `H_l, W_l`: Height and width of the feature map at layer `l`.
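Typical use of the reference implementation from the `lpips` package; inputs are RGB tensors scaled to [-1, 1]. The AlexNet backbone is the most common configuration, though the paper's exact settings are not stated here.

```python
import torch
import lpips

loss_fn = lpips.LPIPS(net="alex")            # also: 'vgg', 'squeeze'
x  = torch.rand(1, 3, 256, 256) * 2 - 1      # reference in [-1, 1]
x0 = torch.rand(1, 3, 256, 256) * 2 - 1      # distorted image
print(loss_fn(x, x0).item())                 # lower = more perceptually similar
```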
* **Deep Image Structure and Texture Similarity (DISTS)** [17]:
* **Conceptual Definition**: A full-reference image quality metric that combines structural and textural similarity. It aims to unify `SSIM` and `LPIPS` by considering both structure (global patterns) and texture (local statistical properties) in a deep learning framework. A lower `DISTS` score indicates higher similarity.
* **Mathematical Formula**:
\mathrm{DISTS}(X, Y) = \sum_{l=1}^L \left( \alpha_l \cdot \mathrm{D_{S}}(X_l, Y_l) + \beta_l \cdot \mathrm{D_{T}}(X_l, Y_l) \right)
Where `D_S` measures structural distortion and `D_T` measures textural distortion, typically calculated from `VGG` features.
* **Symbol Explanation**:
* `X`: Reference image.
* `Y`: Distorted image.
* `X_l, Y_l`: Features extracted from layer `l` of a pre-trained network (e.g., `VGG`).
* `D_S`: Structural similarity component.
* `D_T`: Textural similarity component.
* `α_l, β_l`: Learned weights for the structural and textural components at layer `l`.
### 5.2.3. No-Reference Metrics
These metrics evaluate image quality without access to a `GT` image.
* **Naturalness Image Quality Evaluator (NIQE)** [89]:
* **Conceptual Definition**: A no-reference `image quality assessment (IQA)` metric that quantifies image quality by comparing statistical features of the image to those of natural, undistorted images. A lower `NIQE` score indicates a more natural-looking and higher-quality image.
* **Mathematical Formula**:
`NIQE` does not have a simple closed-form mathematical formula like `PSNR` or `SSIM`. It is based on fitting a `multivariate Gaussian model` to local `luminance coefficients` (e.g., from `mean subtracted contrast normalized (MSCN)` images) of an image, typically derived from a diverse dataset of natural images. The `quality score` is then the `distance` between the `Gaussian model` of the test image and the `Gaussian model` of the natural image dataset.
* **Multidimension Attention Network for No-Reference Image Quality Assessment (MANIQA)** [75]:
* **Conceptual Definition**: A no-reference `IQA` metric that uses a `multidimensional attention network` to assess image quality. It aims to capture complex human perceptual factors across different scales and aspects of an image. A higher `MANIQA` score indicates better quality.
* **Mathematical Formula**: `MANIQA` is a deep learning model, so its score is the output of the trained network. There isn't a single mathematical formula to represent its assessment process in the same way as `PSNR`.
* **Multi-scale Image Quality Transformer (MUSIQ)** [75]:
* **Conceptual Definition**: A no-reference `IQA` model that utilizes a `transformer-based architecture` and operates on image patches extracted at `multiple scales`. This allows it to capture quality distortions that manifest differently at various resolutions. A higher `MUSIQ` score indicates better quality.
* **Mathematical Formula**: Similar to `MANIQA`, `MUSIQ` is a deep learning model, and its score is the output of its `transformer-based network`.
* **CLIPIQA** [75]:
* **Conceptual Definition**: A no-reference `IQA` metric that leverages the capabilities of `CLIP` (Contrastive Language-Image Pre-training) models. It aims to align image quality assessment with semantic understanding derived from multi-modal pre-training. A higher `CLIPIQA` score indicates better quality.
* **Mathematical Formula**: `CLIPIQA` also relies on a neural network (specifically, features from `CLIP`) and does not have a simple mathematical formula.
### 5.2.4. Generation Quality Metrics
* **Fréchet Inception Distance (FID)** [29]:
* **Conceptual Definition**: Measures the similarity between two sets of images. It calculates the `Fréchet distance` between the `Gaussian distributions` of feature representations (extracted from an `Inception-v3` network) of real and generated images. A lower `FID` score indicates that the generated images are more similar to real images in terms of perceptual quality and diversity.
* **Mathematical Formula**:
\mathrm{FID} = \|\mu_1 - \mu_2\|_2^2 + \mathrm{Tr}(\Sigma_1 + \Sigma_2 - 2(\Sigma_1\Sigma_2)^{1/2})
* **Symbol Explanation**:
* `μ_1`: The mean feature vector of the real images.
* `μ_2`: The mean feature vector of the generated images.
* `Σ_1`: The covariance matrix of the real images' features.
* `Σ_2`: The covariance matrix of the generated images' features.
* `Tr(·)`: The trace of a matrix.
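A direct implementation of the Fréchet distance formula above, given Inception feature matrices for real and generated images (rows = images). In practice a dedicated library such as `pytorch-fid` or `torchmetrics` is typically used instead; this is only a sketch.

```python
import numpy as np
from scipy import linalg

def frechet_distance(feat_real: np.ndarray, feat_gen: np.ndarray) -> float:
    mu1, mu2 = feat_real.mean(0), feat_gen.mean(0)
    s1 = np.cov(feat_real, rowvar=False)
    s2 = np.cov(feat_gen, rowvar=False)
    covmean = linalg.sqrtm(s1 @ s2)
    if np.iscomplexobj(covmean):
        covmean = covmean.real                 # drop tiny imaginary parts
    return float(np.sum((mu1 - mu2) ** 2) + np.trace(s1 + s2 - 2 * covmean))

print(frechet_distance(np.random.randn(500, 64), np.random.randn(500, 64)))
```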
### 5.2.5. High-Level Vision Task Metrics
These metrics are used to evaluate how well the restored images benefit downstream tasks.
* **Average Precision (AP)**:
* **Conceptual Definition**: A common metric for `object detection` and `instance segmentation`. It summarizes the `precision-recall curve` by taking the weighted mean of precisions achieved at each threshold, with the increase in recall from the previous threshold as the weight. Higher `AP` indicates better detection/segmentation performance.
* **Types**:
* `AP^b`: Average Precision for `object detection` (bounding boxes).
* `AP^m`: Average Precision for `instance segmentation` (masks).
* **Mathematical Formula**:
\mathrm{AP} = \sum_n (R_n - R_{n-1}) P_n
Where `P_n` and `R_n` are the precision and recall at the `n`-th threshold, and `R_0 = 0`. This is a simplified representation; `COCO` `AP` typically averages over multiple `Intersection over Union (IoU)` thresholds.
* **Symbol Explanation**:
* `P_n`: Precision at the `n`-th threshold.
* `R_n`: Recall at the `n`-th threshold.
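The summation above, applied to a (recall, precision) curve sorted by increasing recall; real COCO evaluation additionally averages over IoU thresholds and categories. The example arrays are illustrative.

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    r = np.concatenate(([0.0], recall))        # R_0 = 0
    return float(np.sum((r[1:] - r[:-1]) * precision))

rec  = np.array([0.2, 0.5, 0.8, 1.0])
prec = np.array([1.0, 0.9, 0.7, 0.5])
print(average_precision(rec, prec))            # 0.78
```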
* **Mean Intersection over Union (mIoU)**:
* **Conceptual Definition**: A standard metric for `semantic segmentation`. It calculates the `IoU` for each class (the ratio of the intersection area to the union area of the predicted and ground truth segmentation masks), and then averages these `IoU` values over all classes. A higher `mIoU` indicates better segmentation performance.
* **Mathematical Formula**:
\mathrm{IoU}_k = \frac{\mathrm{TP}_k}{\mathrm{TP}_k + \mathrm{FP}_k + \mathrm{FN}_k}
\mathrm{mIoU} = \frac{1}{C} \sum_{k=1}^C \mathrm{IoU}_k
* **Symbol Explanation**:
* `TP_k`: True Positives for class `k`.
* `FP_k`: False Positives for class `k`.
* `FN_k`: False Negatives for class `k`.
* `C`: Total number of classes.
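mIoU can be computed from a per-class confusion matrix (rows = ground truth, columns = prediction), matching the `TP / (TP + FP + FN)` definition above. The confusion matrix below is a toy example.

```python
import numpy as np

def mean_iou(conf: np.ndarray) -> float:
    tp = np.diag(conf).astype(np.float64)
    fp = conf.sum(axis=0) - tp
    fn = conf.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return float(iou.mean())

conf = np.array([[50, 5, 0],
                 [3, 40, 2],
                 [0, 4, 46]])
print(round(mean_iou(conf), 4))
```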
5.3. Baselines
The paper compares DreamClear against a range of state-of-the-art Image Restoration (IR) methods, including both GAN-based and diffusion-based approaches:
- GAN-based Methods:
  - BSRGAN [85]
  - Real-ESRGAN [64]
  - SwinIR-GAN [41]
  - DASR [43]
- Diffusion-based Methods:
  - StableSR [63]
  - DiffBIR [46]
  - ResShift [82]
  - SinSR [65]
  - SeeSR [70]
  - SUPIR [80]

These baselines are representative of the current state-of-the-art in real-world IR, encompassing different architectural choices and strategies for handling image degradation.
5.4. Implementation Details
- GenIR Training:
  - Uses the original latent diffusion loss [59].
  - Built on SDXL [55].
  - Trained over 5 days using 16 NVIDIA A100 GPUs, with a batch size of 256.
  - For data generation, 256 NVIDIA V100 GPUs were used for 5 days to generate the one million high-quality images.
  - SDEdit strength for negative sample generation: 0.6.
  - Gemini prompts for filtering explicitly forbid inappropriate content (privacy violations, NSFW).
- DreamClear Training:
  - Built upon PixArt-alpha [13] and LLaVA [47].
  - The SwinIR model from DiffBIR [46] is used as the lightweight degradation remover in the LQ branch.
  - AdamW optimizer.
  - Ran for 7 days on 32 NVIDIA A100 GPUs with a batch size of 128.
  - Number of experts in MoAM is set to 3.
- DreamClear Inference:
  - iDDPM [51] is adopted with 50 sampling steps, and CFG guidance is applied during sampling.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate DreamClear's superior performance across various benchmarks, validating the effectiveness of the proposed dual strategy (GenIR for data and DreamClear for modeling).
6.1.1. Quantitative Comparisons
The following are the results from Table 1 of the original paper:
| Datasets | Metrics | BSRGAN [85] | Real-ESRGAN [64] | SwinIR-GAN [41] | DASR [43] | StableSR [63] | DiffBIR [46] | ResShift [82] | SinSR [65] | SeeSR [70] | SUPIR [80] | DreamClear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DIV2K-Val | PSNR ↑ | 19.88 | 19.92 | 19.66 | 19.73 | 19.73 | 19.98 | 19.80 | 19.37 | 19.59 | 18.68 | 18.69 |
| | SSIM ↑ | 0.5137 | 0.5334 | 0.5253 | 0.5122 | 0.5039 | 0.4987 | 0.4985 | 0.4613 | 0.5045 | 0.4664 | 0.4766 |
| | LPIPS ↓ | 0.4303 | 0.3981 | 0.3992 | 0.4350 | 0.4145 | 0.3866 | 0.4450 | 0.4383 | 0.3662 | 0.3976 | 0.3657 |
| | DISTS ↓ | 0.2484 | 0.2304 | 0.2253 | 0.2606 | 0.2162 | 0.2396 | 0.2383 | 0.2175 | 0.1886 | 0.1818 | 0.1637 |
| | FID ↓ | 54.42 | 48.44 | 49.17 | 59.62 | 29.64 | 37.00 | 46.12 | 37.84 | 24.98 | 28.11 | 20.61 |
| | NIQE ↓ | 3.9322 | 3.8762 | 3.7468 | 3.9725 | 4.4255 | 4.5659 | 5.9852 | 5.7320 | 4.1320 | 3.4014 | 3.2126 |
| | MANIQA ↑ | 0.3514 | 0.3854 | 0.3654 | 0.3110 | 0.2942 | 0.4268 | 0.3782 | 0.4206 | 0.5251 | 0.4291 | 0.4320 |
| | MUSIQ ↑ | 63.93 | 64.50 | 64.54 | 59.66 | 58.60 | 64.77 | 62.67 | 65.27 | 72.04 | 69.34 | 68.44 |
| | CLIPIQA ↑ | 0.5589 | 0.5804 | 0.5682 | 0.5565 | 0.5190 | 0.6527 | 0.6498 | 0.6961 | 0.7181 | 0.6035 | 0.6963 |
| LSDIR-Val | PSNR ↑ | 18.27 | 18.13 | 17.98 | 18.15 | 18.11 | 18.42 | 18.24 | 17.94 | 18.03 | 16.95 | 17.01 |
| | SSIM ↑ | 0.4673 | 0.4867 | 0.4783 | 0.4679 | 0.4508 | 0.4618 | 0.4579 | 0.4302 | 0.4564 | 0.4080 | 0.4236 |
| | LPIPS ↓ | 0.4378 | 0.3986 | 0.4020 | 0.4503 | 0.4152 | 0.4049 | 0.4524 | 0.4523 | 0.3759 | 0.4119 | 0.3836 |
| | DISTS ↓ | 0.2539 | 0.2278 | 0.2253 | 0.2615 | 0.2159 | 0.2439 | 0.2436 | 0.2265 | 0.1966 | 0.1838 | 0.1656 |
| | FID ↓ | 53.25 | 46.46 | 45.31 | 60.60 | 31.26 | 35.91 | 43.25 | 36.01 | 25.91 | 30.03 | 22.06 |
| | NIQE ↓ | 3.6885 | 3.4078 | 3.3715 | 3.6432 | 4.0218 | 4.3750 | 5.5635 | 5.4240 | 4.0590 | 2.9820 | 3.0707 |
| | MANIQA ↑ | 0.3829 | 0.4381 | 0.3991 | 0.3315 | 0.3098 | 0.4551 | 0.3995 | 0.4309 | 0.5700 | 0.4683 | 0.4811 |
| | MUSIQ ↑ | 65.98 | 68.25 | 67.10 | 60.96 | 59.37 | 65.94 | 63.25 | 65.35 | 73.00 | 70.98 | 70.40 |
| | CLIPIQA ↑ | 0.5648 | 0.6218 | 0.5983 | 0.5681 | 0.5190 | 0.6592 | 0.6501 | 0.6900 | 0.7261 | 0.6174 | 0.6914 |
| RealSR | PSNR ↑ | 25.01 | 24.22 | 24.89 | 25.51 | 24.60 | 24.77 | 24.94 | 24.47 | 24.66 | 22.67 | 22.56 |
| | SSIM ↑ | 0.7422 | 0.7401 | 0.7543 | 0.7526 | 0.7387 | 0.6902 | 0.7178 | 0.6710 | 0.7209 | 0.6567 | 0.6548 |
| | LPIPS ↓ | 0.2853 | 0.2901 | 0.2680 | 0.3201 | 0.2736 | 0.3436 | 0.3864 | 0.4208 | 0.2997 | 0.3545 | 0.3684 |
| | DISTS ↓ | 0.1967 | 0.1892 | 0.1734 | 0.2056 | 0.1761 | 0.2195 | 0.2467 | 0.2432 | 0.2029 | 0.2185 | 0.2122 |
| | FID ↓ | 84.49 | 90.10 | 80.07 | 91.16 | 88.89 | 69.94 | 88.91 | 70.83 | 71.92 | 71.63 | 65.37 |
| | NIQE ↓ | 4.9261 | 5.0069 | 4.9475 | 5.9659 | 5.6124 | 6.1294 | 6.6044 | 6.4662 | 4.9102 | 4.5368 | 4.4381 |
| | MANIQA ↑ | 0.3660 | 0.3656 | 0.3432 | 0.2819 | 0.3465 | 0.4182 | 0.3781 | 0.4009 | 0.5189 | 0.4296 | 0.4337 |
| | MUSIQ ↑ | 64.67 | 62.06 | 60.97 | 50.94 | 61.07 | 61.74 | 60.28 | 60.36 | 69.38 | 66.09 | 65.33 |
| | CLIPIQA ↑ | 0.5329 | 0.4872 | 0.4548 | 0.3819 | 0.5139 | 0.6202 | 0.5778 | 0.6587 | 0.6839 | 0.5371 | 0.6895 |
| DRealSR | PSNR ↑ | 27.09 | 26.95 | 27.00 | 28.19 | 27.39 | 27.31 | 27.16 | 26.15 | 27.10 | 24.41 | 24.48 |
| | SSIM ↑ | 0.7759 | 0.7812 | 0.7815 | 0.8051 | 0.7830 | 0.7140 | 0.7388 | 0.6564 | 0.7596 | 0.6696 | 0.6508 |
| | LPIPS ↓ | 0.2950 | 0.2876 | 0.2789 | 0.3165 | 0.2710 | 0.3920 | 0.4101 | 0.4690 | 0.3117 | 0.3844 | 0.3972 |
| | DISTS ↓ | 0.1956 | 0.1857 | 0.1787 | 0.2072 | 0.1737 | 0.2443 | 0.2553 | 0.2103 | 0.2264 | 0.2145 | 0.2145 |
| | FID ↓ | 84.26 | 83.79 | 84.22 | 94.96 | 80.23 | 89.47 | 91.45 | 86.73 | 82.45 | 73.66 | 75.02 |
| | NIQE ↓ | 4.7570 | 4.5828 | 4.5828 | 5.3941 | 5.0970 | 5.8821 | 6.2570 | 6.0142 | 4.7500 | 4.3217 | 4.2497 |
| | MANIQA ↑ | 0.3670 | 0.3644 | 0.3540 | 0.2831 | 0.3512 | 0.4168 | 0.3800 | 0.4042 | 0.5186 | 0.4289 | 0.4326 |
| | MUSIQ ↑ | 64.44 | 62.58 | 61.76 | 51.35 | 61.64 | 62.46 | 60.67 | 60.75 | 69.45 | 66.19 | 65.42 |
| | CLIPIQA ↑ | 0.5367 | 0.4907 | 0.4632 | 0.3843 | 0.5192 | 0.6225 | 0.5790 | 0.6601 | 0.6860 | 0.5401 | 0.6918 |
Analysis of Quantitative Results:
- Synthetic Datasets (DIV2K-Val, LSDIR-Val):
  - DreamClear consistently demonstrates superior performance in perceptual metrics like LPIPS, DISTS, and FID. For DIV2K-Val, DreamClear achieves the best LPIPS (0.3657), DISTS (0.1637), and FID (20.61). Similar trends are observed for LSDIR-Val in DISTS (0.1656) and FID (22.06), and it is the second best in LPIPS (0.3836). This signifies that DreamClear generates images that are perceptually closer to the Ground Truth and exhibit higher fidelity and realism, which is a primary goal of diffusion-based methods.
  - For no-reference metrics (NIQE, MANIQA, MUSIQ, CLIPIQA), DreamClear often secures the best or second-best position, indicating high perceived quality even without a reference. For example, on DIV2K-Val it has the best NIQE (3.2126) and the second-best CLIPIQA (0.6963).
  - It is notable that DreamClear sometimes shows slightly lower PSNR/SSIM scores compared to some baselines (e.g., DiffBIR on LSDIR-Val PSNR). The paper acknowledges this, stating that these distortion-based metrics may not fully represent visual quality for photorealistic restoration, and emphasizes the importance of perceptual metrics for diffusion-based models.
- Real-World Benchmarks (RealSR, DRealSR):
  - On these challenging datasets, DreamClear generally maintains strong performance in no-reference metrics, securing the best NIQE score on both RealSR (4.4381) and DRealSR (4.2497). It also performs very competitively in MANIQA, MUSIQ, and CLIPIQA, often achieving the best or second-best scores. This suggests that DreamClear produces images that are highly natural and visually appealing in real-world scenarios.
  - Similar to the synthetic datasets, PSNR/SSIM scores are not always the highest, reinforcing the paper's argument about the limitations of these metrics for photorealistic restoration. However, its superior FID on RealSR (65.37) indicates better overall distribution similarity to real images.
6.1.2. Evaluation on Downstream Benchmarks
The following are the results from Table 2 of the original paper:
| Metrics | GT | Zoomed LQ | BSRGAN | Real-ESRGAN | SwinIR-GAN | DASR | StableSR | DiffBIR | ResShift | SinSR | SeeSR | SUPIR | DreamClear |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Object Detection (AP^b) | 49.0 | 7.4 | 11.0 | 12.8 | 11.8 | 10.5 | 16.9 | 18.7 | 15.6 | 13.8 | 18.2 | 16.6 | 19.3 |
| Object Detection (AP^b_50) | 70.6 | 12.0 | 17.6 | 20.7 | 18.9 | 17.0 | 26.7 | 29.9 | 25.0 | 22.3 | 29.1 | 27.2 | 30.8 |
| Object Detection (AP^b_75) | 53.8 | 7.5 | 11.4 | 13.1 | 12.1 | 10.7 | 17.6 | 19.4 | 15.9 | 14.2 | 18.9 | 17.0 | 19.8 |
| Instance Segmentation (AP^m) | 43.9 | 6.4 | 9.6 | 11.3 | 10.2 | 9.3 | 14.6 | 16.2 | 13.6 | 12.0 | 15.9 | 14.1 | 16.7 |
| Instance Segmentation (AP^m_50) | 67.7 | 11.2 | 16.4 | 19.3 | 17.5 | 15.9 | 24.6 | 27.5 | 23.3 | 20.6 | 26.6 | 24.5 | 28.2 |
| Instance Segmentation (AP^m_75) | 47.3 | 6.3 | 9.7 | 11.5 | 10.2 | 9.4 | 14.9 | 16.6 | 13.7 | 12.1 | 16.1 | 14.0 | 16.8 |
| Semantic Segmentation (mIoU) | 50.4 | 11.5 | 18.6 | 17.3 | 14.3 | 30.4 | 19.6 | 23.6 | 29.7 | 19.6 | 26.9 | 27.7 | 31.9 |
Analysis of Downstream Task Results:
- DreamClear consistently achieves the best performance across all object detection (AP^b), instance segmentation (AP^m), and semantic segmentation (mIoU) metrics on COCO val2017 and ADE20K.
- For example, DreamClear scores 19.3 AP^b for object detection, 16.7 AP^m for instance segmentation, and 31.9 mIoU for semantic segmentation, significantly outperforming all baselines.
- This demonstrates that DreamClear not only restores visual quality but also preserves and enhances the semantic information critical for high-level computer vision tasks. This is a strong indicator of the model's ability to produce semantically coherent and structurally accurate restorations, which is crucial for real-world applications where IR often serves as a pre-processing step.
6.1.3. Qualitative Comparisons
The following figure (Figure 4 from the original paper) shows qualitative comparisons on synthetic or real-world images.

Figure 4: Qualitative comparisons of DreamClear against mainstream restoration methods on synthetic and real-world images. Please zoom in for a better view.
- The qualitative comparisons (e.g., Figure 4) visually confirm DreamClear's superiority.
- When faced with severe degradations (first row, e.g., heavily blurred/noisy text), DreamClear can reason out the correct structure and generate clear details, while other methods may produce deformed structures or blurry results.
- For real-world images (third row), DreamClear generates results that are rich in detail and more natural than competitors. This aligns with its strong performance in perceptual and no-reference metrics.
6.1.4. User Study
The following figure (Figure 5 from the original paper) shows the user study results.

Figure 5: User study. Vote percentage denotes average user preference per model. The Top-K ratio indicates the proportion of images preferred by top K users. Our model is highly preferred, with most images being rated as top quality by the majority.
- A user study involving 256 evaluators rated DreamClear against five other methods on 100 low-quality images.
- DreamClear received over 45% of total votes, indicating a strong user preference for its restorations.
- It was the top choice for 80% of the images and ranked first or second for 98% of the images, as measured by the Top-K ratio.
- This human evaluation strongly corroborates DreamClear's ability to produce consistently high-quality, natural, and accurate image restorations.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Analysis of Generated Dataset for Real-World IR
The following figure (Figure 6 from the original paper) shows the impact of increasing synthetic training data.

Figure 6: Impact of synthetic training data. As data size increases, performance improves on LSDIR-Val.
- To investigate the impact of the GenIR-generated dataset, SwinIR-GAN was trained on varying quantities of generated data.
- As the data size increases, all metrics improve (LPIPS, DISTS, and FID decrease, while MANIQA, MUSIQ, and CLIPIQA increase), demonstrating that larger datasets enhance model generalizability and restoration performance.
- Notably, a model trained with 100,000 generated images outperforms the model trained on DF2K, highlighting the benefits of large-scale synthetic datasets.

The following are the results from Table 4 of the original paper:

| Training Data | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ |
| --- | --- | --- | --- | --- | --- | --- |
| Pre-trained T2I Model (3,450 images) | 0.4819 | 0.2790 | 60.12 | 0.3271 | 61.94 | 0.5423 |
| Ours GenIR (3,450 images) | 0.4578 | 0.2435 | 51.29 | 0.3691 | 63.12 | 0.567 |

- When comparing data generated by the pre-trained T2I model versus GenIR (both 3,450 images), the model trained on GenIR data shows significant improvements across all metrics. This quantitatively demonstrates the effectiveness of the dual-prompt learning and filtering in GenIR for producing IR-suitable images.

The following figure (Figure 10 from the original paper) provides a visual comparison of images generated using the pre-trained T2I model and GenIR.
Figure 10: Visual comparison of images generated using the pre-trained T2I model and GenIR. Our proposed GenIR produces images with enhanced texture and more realistic details, exhibiting less blurring and distortion. This makes it better suited for training real-world IR models.
- Visually, images generated by GenIR exhibit enhanced texture and more realistic details, with less blurring and distortion compared to those from an un-finetuned T2I model. This qualitative difference supports the quantitative finding that GenIR data is more effective for IR training.
6.2.2. Ablations of DreamClear
The following are the results from Table 3 of the original paper:
| Configuration | LPIPS ↓ | DISTS ↓ | FID ↓ | MANIQA ↑ | MUSIQ ↑ | CLIPIQA ↑ | AP^b | AP^m | mIoU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Mixture of AM | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| AM | 0.3981 | 0.1843 | 25.75 | 0.4067 | 66.18 | 0.6646 | 18.0 | 15.6 | 28.6 |
| Cross-Attention | 0.4177 | 0.2016 | 29.74 | 0.3785 | 63.21 | 0.6497 | 17.2 | 15.1 | 26.3 |
| Zero-Linear | 0.4082 | 0.1976 | 29.89 | 0.4122 | 66.11 | 0.6673 | 17.6 | 15.3 | 27.2 |
| Dual-Branch | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| w/o Reference Branch | 0.4207 | 0.2033 | 30.91 | 0.3985 | 64.04 | 0.6582 | 15.9 | 14.0 | 24.7 |
| Detailed Text Prompt | 0.3657 | 0.1637 | 20.61 | 0.4320 | 68.44 | 0.6963 | 19.3 | 16.7 | 31.9 |
| Null Prompt | 0.3521 | 0.1607 | 20.47 | 0.4230 | 67.26 | 0.6812 | 18.8 | 16.2 | 29.8 |
Analysis of DreamClear Ablations:
- Mixture of Adaptive Modulator (MoAM):
  - The full MoAM (Mixture of AM) achieves the best scores across almost all metrics (LPIPS, DISTS, FID, APb, APm, mIoU).
  - Replacing MoAM with a simpler AM design leads to a substantial decrease in performance (e.g., LPIPS rises from 0.3657 to 0.3981, and FID from 20.61 to 25.75).
  - Further simplifying to Cross-Attention or Zero-Linear results in even worse performance for most metrics. This strongly underscores the importance of the degradation prior guidance and dynamic expert fusion mechanism within MoAM for effective restoration (see the illustrative sketch at the end of this subsection).
- Dual-Branch Framework:
  - The complete Dual-Branch structure (DreamClear) significantly outperforms the model trained without the Reference Branch across all metrics (e.g., LPIPS worsens from 0.3657 to 0.4207, FID from 20.61 to 30.91, and mIoU drops from 31.9 to 24.7). This confirms that providing a reference image and processing both LQ and reference features is crucial for enhancing image details and achieving superior results.
- Text Prompt Guidance:
  - Using a Detailed Text Prompt (generated by MLLMs) leads to better performance on no-reference metrics (MANIQA, MUSIQ, CLIPIQA) and high-level vision metrics (APb, APm, mIoU) compared to using a Null Prompt.
  - While the Null Prompt shows slightly better perceptual metrics (LPIPS, DISTS, FID), the superior no-reference and high-level metrics for the Detailed Text Prompt suggest that MLLM-provided textual guidance better preserves semantic information, leading to more semantically coherent and useful restorations, which is often preferred in real-world applications.

The following figure (Figure 7 from the original paper) shows visual comparisons for the ablation study on DreamClear.
Figure 7: Visual comparisons for ablation study on DreamClear.
- Visual ablations in Figure 7 further support these findings. For instance, null prompts can lead to semantic errors (e.g., in the bear's eyes), which are corrected with MLLM-generated text prompts. Removing MoAM leads to overly smooth or semantically incorrect results. The full DreamClear model consistently yields the best visual fidelity and perceptual quality.
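To make the expert-mixing idea discussed above more concrete, here is a heavily simplified, hypothetical PyTorch sketch of a token-wise router that turns a degradation prior into soft weights over several "expert" modulators. The module name, tensor shapes, and FiLM-style scale/shift modulation are illustrative assumptions, not the paper's MoAM implementation.

```python
# Hypothetical sketch of token-wise expert mixing driven by a degradation prior.
import torch
import torch.nn as nn

class TokenwiseExpertMixer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4):
        super().__init__()
        # Each "expert" predicts a scale/shift modulation from the degradation prior.
        self.experts = nn.ModuleList(nn.Linear(dim, 2 * dim) for _ in range(num_experts))
        # Router: per-token degradation prior -> soft weights over experts.
        self.router = nn.Linear(dim, num_experts)

    def forward(self, feat: torch.Tensor, deg_prior: torch.Tensor) -> torch.Tensor:
        # feat, deg_prior: (batch, tokens, dim)
        weights = torch.softmax(self.router(deg_prior), dim=-1)        # (B, N, E)
        mods = torch.stack([e(deg_prior) for e in self.experts], -2)   # (B, N, E, 2*dim)
        mod = (weights.unsqueeze(-1) * mods).sum(dim=-2)               # (B, N, 2*dim)
        scale, shift = mod.chunk(2, dim=-1)
        return feat * (1 + scale) + shift                              # FiLM-style modulation

# Toy usage with assumed shapes for DiT tokens and degradation representations.
x = torch.randn(1, 16, 64)
prior = torch.randn(1, 16, 64)
print(TokenwiseExpertMixer(64)(x, prior).shape)  # torch.Size([1, 16, 64])
```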
6.2.3. Ablations for GenIR
The following figure (Figure 8 from the original paper) shows visual comparisons for the ablation study on GenIR.

Figure 8: Visual comparisons for ablation study on GenIR.
- The effectiveness of dual-prompt learning in GenIR is visually demonstrated. Images generated with dual-prompt learning (e.g., the tiger and wheat field images) exhibit enhanced texture details and are more suitable for image restoration training.

The following figure (Figure 9 from the original paper) shows visual comparisons for the ablation study on training datasets.
Figure 9: Visual comparisons for ablation study on training datasets.
- Figure 9 further illustrates the effectiveness of the data generated by GenIR in improving the visual results of IR models. Models trained with GenIR data show improved details and overall quality.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper effectively addresses the dual challenges of real-world image restoration (IR): the scarcity of comprehensive, privacy-safe training data and the need for high-capacity, adaptable restoration models.
- GenIR Pipeline: The paper introduces GenIR, a pioneering, privacy-safe, and cost-effective automated data curation pipeline. By leveraging Text-to-Image (T2I) diffusion models and Multimodal Large Language Models (MLLMs) with a dual-prompt learning strategy, GenIR successfully generates a large-scale dataset of one million high-quality synthetic images. This dataset overcomes the limitations of traditional data collection methods (copyright, privacy, labor-intensiveness) and provides a robust resource for training IR models.
- DreamClear Model: The paper presents DreamClear, a potent IR model built upon the Diffusion Transformer (DiT) architecture. DreamClear integrates degradation priors and utilizes a novel Mixture of Adaptive Modulator (MoAM) that dynamically combines restoration experts based on token-wise degradation representations. This design enhances the model's adaptability to diverse real-world degradations and its ability to achieve photorealistic restoration by leveraging generative priors and MLLM-based semantic guidance.

Comprehensive experiments, including quantitative evaluations on synthetic and real-world benchmarks, assessments on downstream high-level vision tasks, and user studies, confirm DreamClear's superior performance over state-of-the-art methods. The ablation studies further validate the critical contributions of each proposed component, both for data generation (GenIR) and model architecture (DreamClear).
7.2. Limitations & Future Work
The authors acknowledge two primary limitations of their work and suggest future research directions:
- Synthesized Texture Details vs. Ground Truth: When dealing with severe image degradation, DreamClear can produce reasonable and realistic results. However, the generated texture details, while plausible, may not perfectly match the ground-truth image. This is an inherent characteristic of generative models, which hallucinate missing information.
  - Suggested Future Work: The authors propose that using a high-quality reference image or explicit human instruction could potentially mitigate this limitation, providing more accurate guidance for restoration.
- Inference Speed: As a diffusion-based model, DreamClear requires multiple inference steps to restore an image. While it achieves high-quality results, its inference speed cannot meet the real-time requirements of many practical applications.
  - Suggested Future Work: To address this, the authors suggest exploring techniques like model distillation (transferring knowledge from a large model to a smaller, faster one) and model quantization (reducing the precision of model weights) to improve inference speed; a generic quantization sketch follows this list.
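As a generic illustration of the quantization route mentioned above, the sketch below applies standard PyTorch post-training dynamic quantization to a toy MLP. It is not DreamClear's code, and distillation or diffusion-step reduction would require separate machinery; the layer sizes are arbitrary.

```python
# Minimal sketch of post-training dynamic quantization of linear layers in PyTorch.
import torch
import torch.nn as nn

# Toy stand-in for a large network with many Linear layers (not the DreamClear model).
model = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024)).eval()

# Replace Linear weights with int8 kernels; activations are quantized on the fly (CPU only).
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

x = torch.randn(1, 1024)
with torch.no_grad():
    diff = (model(x) - quantized(x)).abs().max()
print(f"max |fp32 - int8| output difference: {diff:.4f}")
```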
7.3. Personal Insights & Critique
This paper presents a highly comprehensive and innovative approach to real-world image restoration. The dual strategy of addressing both data scarcity and model limitations is particularly insightful.
Inspirations and Cross-Domain Applications:
- Synthetic Data Generation for Niche Fields: The GenIR pipeline offers a powerful blueprint for creating high-quality, privacy-safe datasets in other domains where data collection is challenging, expensive, or ethically sensitive (e.g., medical imaging, rare object detection, historical document restoration). The MLLM-driven prompt generation and filtering for specific content constraints could be invaluable.
- Adaptive Expert Systems: The MoAM concept, where token-wise degradation priors dynamically activate specialized experts, could be generalized to other complex multi-degradation or multi-domain tasks. For example, in medical image analysis, different degradation types (noise, artifacts from different scanners) could activate specific diagnostic or enhancement pathways.
- Semantic Guidance in Low-Level Vision: The effective use of MLLMs to provide detailed text prompts for semantic guidance in IR highlights the increasing convergence of high-level (language/semantics) and low-level (pixel-wise) vision tasks. This could inspire similar approaches in tasks like image enhancement, inpainting, or style transfer, where semantic understanding can greatly improve output quality.

Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Generalizability of GenIR-generated Data to All Real-World Degradations: While GenIR creates diverse HQ images, the paper uses the Real-ESRGAN degradation pipeline to generate LQ training pairs (a simplified sketch of this style of degradation appears at the end of this section). The assumption is that this simulated degradation sufficiently covers the vast complexity of real-world degradations. If real-world degradation types are fundamentally different or more diverse than what Real-ESRGAN can simulate, DreamClear might still face challenges. Future work could explore more advanced, truly blind degradation simulation or unsupervised domain adaptation for the LQ images.
- Computational Cost: Both GenIR and DreamClear are computationally intensive. GenIR requires 5 days on 256 V100 GPUs for data generation, and DreamClear requires 7 days on 32 A100 GPUs for training. While the value proposition is high, this limits accessibility for smaller research groups or practical deployment. The acknowledged limitation regarding inference speed is crucial.
- Interpretability of MoAM: While MoAM is effective, the specific roles or specializations of the restoration experts are not explicitly detailed. Understanding what kind of degradation each expert handles could provide valuable insights for further architectural refinements or for diagnosing model failures.
- Dependency on MLLMs: The pipeline relies heavily on MLLMs (Gemini-1.5-Pro, LLaVA) for prompt generation and filtering. The quality and bias of these MLLMs could implicitly affect the diversity and characteristics of the generated dataset and the semantic guidance during restoration. A thorough analysis of how different MLLMs impact the final IR performance would be beneficial.
- Ethical Implications of Synthetic Data: While GenIR is lauded for being privacy-safe by avoiding identifiable individuals, the synthetic nature itself can raise questions. If used to generate data for sensitive applications, ensuring the synthetic data does not inadvertently create new biases or perpetuate existing ones (even if "privacy-safe") is important.

Overall, DreamClear and GenIR represent a significant step forward in making high-quality real-world image restoration more feasible and robust, bridging crucial gaps in both data availability and model capability.
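To ground the degradation-coverage concern raised in the first critique point above, here is a simplified, hypothetical sketch of the blur, downsample, noise, and JPEG style of synthetic degradation that Real-ESRGAN-like pipelines apply to HQ images. The parameter ranges and single-pass structure are illustrative only; the actual Real-ESRGAN pipeline uses a more elaborate second-order process with additional degradation types.

```python
# Simplified, illustrative HQ -> LQ degradation (not the paper's or Real-ESRGAN's exact settings).
import io
import random
import numpy as np
from PIL import Image, ImageFilter

def degrade(hq: Image.Image, scale: int = 4) -> Image.Image:
    img = hq.filter(ImageFilter.GaussianBlur(radius=random.uniform(0.5, 3.0)))  # blur
    img = img.resize((hq.width // scale, hq.height // scale), Image.BICUBIC)    # downsample
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0, random.uniform(1, 15), arr.shape)                # Gaussian noise
    img = Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=random.randint(30, 95))                # JPEG artifacts
    return Image.open(buf)

# Example usage with a hypothetical file path:
# lq = degrade(Image.open("hq.png").convert("RGB"))
```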