
Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward

Published: 11/03/2025

TL;DR Summary

This study presents a difficulty-adaptive reinforcement learning method that enhances diffusion-based image restoration models by integrating with an Image Quality Assessment (IQA) model, demonstrating improved fidelity and restoration performance on challenging samples.

Abstract

Reinforcement Learning (RL) has recently been incorporated into diffusion models, e.g., tasks such as text-to-image. However, directly applying existing RL methods to diffusion-based image restoration models is suboptimal, as the objective of restoration fundamentally differs from that of pure generation: it places greater emphasis on fidelity. In this paper, we investigate how to effectively integrate RL into diffusion-based restoration models. First, through extensive experiments with various reward functions, we find that an effective reward can be derived from an Image Quality Assessment (IQA) model, instead of intuitive ground-truth-based supervision, which has already been optimized during the Supervised Fine-Tuning (SFT) stage prior to RL. Moreover, our strategy focuses on using RL for challenging samples that are significantly distant from the ground truth, and our RL approach is innovatively implemented using MLLM-based IQA models to align distributions with high-quality images initially. As the samples approach the ground truth's distribution, RL is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples. Our strategy is plug-and-play that can be seamlessly applied to diffusion-based restoration models, boosting its performance across various restoration tasks. Extensive experiments across multiple benchmarks demonstrate the effectiveness of our proposed RL framework.


1. Bibliographic Information

1.1. Title

Enhancing Diffusion-based Restoration Models via Difficulty-Adaptive Reinforcement Learning with IQA Reward

1.2. Authors

The paper is authored by a collaborative team from various academic institutions and research labs:

  • Xiaogang Xu (The Chinese University of Hong Kong)

  • Ruihang Chu (Tsinghua University)

  • Jian Wang (Snap Research)

  • Kun Zhou (Shenzhen University)

  • Wenjie Shu (UT)

  • Harry Yang (UT)

  • Ser-Nam Lim (University of Central Florida)

  • Hao Chen (UC Davis)

  • Liang Lin (Sun Yat-Sen University)

    The authors represent a strong interdisciplinary background, combining expertise from leading universities and industry research, indicating a blend of theoretical rigor and practical application.

1.3. Journal/Conference

This paper is published as a preprint on arXiv, dated November 3, 2025. As an arXiv preprint, it has not yet been formally peer-reviewed and may be under submission to a conference or journal. While arXiv itself is not a peer-reviewed venue, it is a highly influential platform for the rapid dissemination of research across computer science and many other fields, and papers posted there often precede formal publication in top-tier conferences (such as CVPR, ICCV, or NeurIPS) or journals, making it a crucial venue for staying current with cutting-edge research.

1.4. Publication Year

2025

1.5. Abstract

The abstract introduces the challenge of applying Reinforcement Learning (RL) to diffusion-based image restoration models, highlighting that restoration prioritizes fidelity more than pure generation tasks like text-to-image. The paper proposes an effective method to integrate RL into these models. First, it identifies that an Image Quality Assessment (IQA) model provides a more effective reward function than traditional ground-truth-based supervision, which is already optimized during the Supervised Fine-Tuning (SFT) stage. Second, the strategy focuses RL on challenging samples far from the ground truth, initially using MLLM-based IQA models for broad distribution alignment with high-quality images. As samples improve, RL is adaptively combined with SFT for fine-grained alignment, managed by an automatic weighting strategy based on sample difficulty. This plug-and-play approach is demonstrated to boost performance across various restoration tasks through extensive experiments on multiple benchmarks.

https://arxiv.org/abs/2511.01645 (Preprint)

https://arxiv.org/pdf/2511.01645v1.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The paper addresses a significant challenge in the field of image restoration using diffusion models. Diffusion models, particularly those augmenting large pre-trained text-to-image models (e.g., Stable Diffusion) with control branches, have shown strong capabilities in synthesizing photo-realistic content from degraded images. However, when applied to restoration tasks, these models, typically trained with Supervised Fine-Tuning (SFT), often struggle with fidelity. They can produce hallucinations (content that doesn't exist in the original low-quality image) or unnatural textures and colors. This occurs because SFT primarily focuses on reference-based alignment (minimizing distance to ground truth), which can be suboptimal for ill-posed restoration problems where a single ground truth might not perfectly represent all desirable qualities.

Reinforcement Learning (RL) has recently proven effective in aligning the behavior of large generative models, especially in text-to-image generation. However, directly applying existing RL methods to image restoration is suboptimal because the objective of restoration fundamentally differs from pure generation: it places a much greater emphasis on fidelity (how accurately the output matches the original content) while also aiming for realism. The core problem the paper aims to solve is how to effectively integrate RL into diffusion-based image restoration models to overcome the limitations of SFT, specifically addressing fidelity loss and artifact generation, to produce outputs that are both photo-realistic and faithful to the degraded input.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of diffusion-based image restoration:

  1. Novel RL Pipeline for Restoration: The authors propose the first effective Reinforcement Learning (RL) training pipeline specifically tailored for restoration-targeted diffusion models. This pipeline highlights key implementation techniques that differentiate it from RL applications in pure generative tasks.
  2. IQA-based Reward Function: A significant finding is that an effective reward function for RL in restoration can be derived from an Image Quality Assessment (IQA) model, particularly Multi-modal Large Language Model (MLLM)-based IQA models. This is a departure from intuitive ground-truth-based supervision, which often proves ineffective post-SFT. The IQA reward guides the model towards distribution-level alignment with high-quality images by leveraging the IQA model's ability to distinguish real, high-quality images from those with artifacts.
  3. Difficulty-Adaptive Training Strategy: The paper introduces an innovative difficulty-adaptive training strategy. This approach applies IQA-based RL primarily to hard samples (those significantly distant from the ground truth), where traditional SFT struggles. As these samples approach the ground truth's distribution, the RL objective is adaptively combined with SFT for more fine-grained alignment. This dynamic process is facilitated through an automatic weighting strategy that adjusts based on the relative difficulty of the training samples, allowing the model to explore alternative solutions for hard cases and exploit known directions for easier ones.
  4. Key Implementation Techniques: The paper identifies and integrates additional techniques for enhancing RL in restoration:
    • Policy Modeling with a Better Direction: Policy modeling benefits from using a better denoised latent (a more refined estimate of the clean image) as the target for the action, leading to more meaningful updates.
    • Step-wise RL Supervision: Unlike prior methods that apply rewards only to the final output, this work applies RL supervision at each intermediate step in the diffusion process, mitigating error accumulation and providing consistent guidance.
  5. Plug-and-Play Framework: The proposed strategy is plug-and-play, meaning it can be seamlessly applied to existing diffusion-based restoration models, boosting their performance across various restoration tasks without requiring significant architectural changes.
  6. Extensive Experimental Validation: Extensive experiments across multiple benchmarks and various restoration tasks (low-light image enhancement, deraining, motion deblurring, defocus deblurring) demonstrate the effectiveness and generality of the proposed RL framework, showing nontrivial improvements over state-of-the-art methods.

3. Prerequisite Knowledge & Related Work

This section provides foundational knowledge and context from related research to fully understand the paper's innovations.

3.1. Foundational Concepts

3.1.1. Diffusion Models

Diffusion models (specifically, Denoising Diffusion Probabilistic Models (DDPMs)) are a class of generative models that learn to reverse a gradual diffusion process that adds noise to data. They have gained significant popularity for their ability to generate high-quality, diverse samples.

  • Forward Diffusion Process: This process gradually adds Gaussian noise to an input data point (e.g., an image) $\mathbf{x}_0$ over $T$ time steps, turning it into pure Gaussian noise. The process is defined by a variance schedule $\beta_t$. At each step $t$, a small amount of noise is added, transforming $\mathbf{x}_{t-1}$ into $\mathbf{x}_t$. The paper defines this as: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$. This describes the transition probability from $\mathbf{x}_{t-1}$ to $\mathbf{x}_t$. Furthermore, the distribution of $\mathbf{x}_t$ conditioned on the original data $\mathbf{x}_0$ can be sampled directly: $q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$. Here, the terms are defined as:

    • $\mathbf{x}_0$: The original, clean data (e.g., an image).
    • $\mathbf{x}_t$: The noisy version of the data at time step $t$.
    • $T$: The total number of diffusion steps.
    • $\beta_t$: The variance schedule, a sequence of small positive constants that control the amount of noise added at each step $t$. Typically, $\beta_1 < \beta_2 < \dots < \beta_T$.
    • $\mathbf{I}$: The identity matrix, representing independent noise dimensions.
    • $\mathcal{N}(\mu, \Sigma)$: A Gaussian (normal) distribution with mean $\mu$ and covariance $\Sigma$.
    • $\alpha_t = 1 - \beta_t$: A helper variable.
    • $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$: The cumulative product of $\alpha_s$ up to time $t$. This term allows direct sampling of $\mathbf{x}_t$ from $\mathbf{x}_0$.
  • Reverse Denoising Process: The goal is to learn the reverse process, which starts from pure noise $\mathbf{x}_T$ and gradually removes noise to reconstruct the original data $\mathbf{x}_0$. This is done by training a neural network $p_{\theta}$ to predict the noise added at each step, or to predict the clean image $\mathbf{x}_0$ directly. The model $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$ approximates the true posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. Training typically minimizes the Kullback-Leibler (KL) divergence between the learned reverse distribution and the true posterior.
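To make the forward process concrete, here is a minimal PyTorch-style sketch of sampling $\mathbf{x}_t$ directly from $\mathbf{x}_0$ using the closed-form expression above. The linear schedule and all variable names are illustrative choices, not taken from the paper.

```python
import torch

# Illustrative linear variance schedule (the paper does not prescribe one here).
T = 1000
betas = torch.linspace(1e-4, 0.02, T)          # beta_t for t = 1..T
alphas = 1.0 - betas                            # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)       # bar{alpha}_t = prod_{s<=t} alpha_s

def q_sample(x0: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(abar_t) * x_0, (1 - abar_t) * I)."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)      # broadcast over (B, C, H, W)
    noise = torch.randn_like(x0)
    return abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise

# Example: noise a batch of 4 RGB images at random timesteps.
x0 = torch.randn(4, 3, 64, 64)
t = torch.randint(0, T, (4,))
xt = q_sample(x0, t)
```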

3.1.2. Reinforcement Learning (RL)

Reinforcement Learning (RL) is an area of machine learning concerned with how intelligent agents should take actions in an environment to maximize the cumulative reward.

  • Markov Decision Process (MDP): RL problems are often formalized as an MDP, which consists of:
    • States (S): The current situation or configuration of the environment.
    • Actions (A): The choices an agent can make from a given state.
    • Reward Function (R): A scalar value received by the agent after taking an action in a state, indicating the desirability of that action. The agent's goal is to maximize the total reward.
    • State Transition Distribution (P): Describes the probability of moving to a new state given the current state and action.
  • Policy (π\pi): A policy defines the agent's behavior, mapping states to actions or probabilities of taking actions.
  • Value Function: Estimates the expected cumulative reward an agent can expect to receive from a given state, following a certain policy.
  • Advantage Function ($\hat{A}$): Measures how much better a particular action is compared to the average action in a given state. It is often defined as $Q(s,a) - V(s)$, where $Q(s,a)$ is the action-value function (expected reward from state $s$ after taking action $a$) and $V(s)$ is the state-value function (expected reward from state $s$ when following the policy).
  • Policy Gradients: A class of RL algorithms that directly optimize the policy by estimating the gradient of the expected reward with respect to the policy parameters.
  • Importance Sampling: A technique used in policy gradient methods (like Proximal Policy Optimization (PPO)) to estimate expected values under a target policy using samples from a different behavior policy. This allows for off-policy learning, where samples collected under an old policy can be reused to train a new policy, improving sample efficiency.

3.1.3. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is a common training paradigm where a pre-trained model (e.g., a large language model or a diffusion model) is further trained on a specific, labeled dataset for a downstream task. In image restoration, SFT typically involves training the diffusion model to minimize a loss function (e.g., L1 or L2 distance) between its generated output and the high-quality ground truth (GT) image. While effective for learning basic mappings, SFT can struggle with ill-posed problems where multiple outputs might be plausible, potentially leading to hallucinations or fidelity loss if the loss function doesn't perfectly capture perceptual quality.
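For reference, the SFT objective for a diffusion restoration model typically reduces to a regression loss on the network's noise prediction, conditioned on the degraded input. The sketch below is a generic illustration under that assumption; `eps_model` is a hypothetical noise-prediction network, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sft_loss(eps_model, x0, lq_cond, t, alpha_bars):
    """Standard denoising (SFT) loss: MSE between the true and predicted noise,
    with the network conditioned on the low-quality input `lq_cond`."""
    abar = alpha_bars[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = abar.sqrt() * x0 + (1.0 - abar).sqrt() * noise   # forward diffusion
    eps_hat = eps_model(xt, t, lq_cond)                   # hypothetical model signature
    return F.mse_loss(eps_hat, noise)
```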

3.1.4. Image Quality Assessment (IQA)

Image Quality Assessment (IQA) refers to methods and metrics used to objectively or subjectively evaluate the quality of an image. IQA models aim to quantify how good an image looks to humans.

  • Traditional IQA Models: Often rely on hand-crafted features or simple statistical comparisons (e.g., PSNR, SSIM).
  • MLLM-based IQA Models (Multi-modal Large Language Models): These are a more recent and powerful class of IQA models. They leverage the broad knowledge and inherent adaptability of large language models (LLMs) by training them on multimodal data (images and text). This allows them to understand complex visual semantics and human perceptual preferences, enabling them to regress more accurate quality scores compared to traditional methods. They can detect subtle artifacts or unrealistic textures that simpler metrics might miss. The paper specifically mentions DeQA-Score [54] as an example.

3.1.5. Ground Truth (GT)

In image restoration, the ground truth (GT) refers to the ideal, pristine, high-quality version of an image that the restoration model aims to reconstruct from a degraded input. It serves as the reference for supervised learning objectives.

3.2. Previous Works

3.2.1. Restoration Models using Pre-trained Diffusion

The paper notes that diffusion-based generative restoration has gained attention, often by augmenting large pre-trained text-to-image models (like Stable Diffusion [28] and Flux [17]) with a control branch [59]. These methods typically employ Supervised Fine-Tuning (SFT).

  • Early Attempts: StableSR [37] and PASD [49] were initial efforts to apply this strategy, utilizing architectures like ControlNet.
  • Fidelity Enhancements: DiffBIR [18] introduced region-adaptive restoration guidance to enhance fidelity during the denoising process.
  • Common Challenge: Despite these advancements, a significant challenge remains: achieving perfect alignment with ground truths, often leading to hallucination or unnatural textures/colors.

3.2.2. Reinforcement Learning for Diffusion Models

The use of RL to modify the biases of diffusion models is a relatively new but growing area, primarily applied to generative tasks like text-to-image.

  • Denoising Diffusion Policy Optimization (DDPO) [3]: This is a key RL-based approach that reframes the diffusion process as a multi-step Markov Decision Process (MDP) to optimize a given reward function. This paper builds upon the DDPO framework for its RL integration.
  • Reward Functions in Text-to-Image: Prior works have explored various reward functions for text-to-image generation, including diversity-based rewards [22, 61], alignment rewards [3], and visual rewards like aesthetic quality [3].
  • Gap in Restoration: The paper points out that the applicability of these RL techniques in image restoration (which has different fidelity requirements) remains underexplored. Flow-GRPO [19] is also cited as a method related to RL for diffusion, likely focusing on policy optimization for flow-matching models.

3.2.3. MLLM-based IQA Models

MLLM-based IQA methods leverage the foundational knowledge of Multi-modal Large Language Models to achieve superior performance in assessing image quality.

  • Leveraging LLMs for Scoring:
    • Q-Bench [41, 63] proposes a binary softmax strategy for MLLMs to generate quality scores.
    • Compare2Score [65] trains MLLMs to compare pairs of images for quality scoring.
    • Q-Align [42] discretizes scores into five levels using one-hot labels to train MLLMs for more accurate score regression.
    • DogIQA [20] uses a one-hot label for training-free IQA, incorporating specific standards and local semantic objects.
    • DeQA-Score [54], which this paper primarily uses, employs a distribution-based approach that discretizes the score distribution into soft labels, consistently outperforming other methods in score regression.
  • Advantage: These models are "inherently biased toward content that aligns with physical plausibility and human perception," making them excellent candidates for reward functions in restoration tasks where perceptual realism and fidelity are crucial.

3.3. Technological Evolution

The evolution of image restoration technologies can be traced as follows:

  1. Traditional Methods: Early methods relied on signal processing, statistical models, or simple neural networks, often struggling with complex degradations.
  2. Deep Learning for Restoration: The advent of deep learning brought significant improvements, with Convolutional Neural Networks (CNNs) and later Transformers becoming dominant. Methods like Restormer [56] (used as a backbone in this paper) improved state-of-the-art. These are typically trained with supervised learning.
  3. Generative Models for Restoration: The rise of generative models, especially Generative Adversarial Networks (GANs) and more recently Diffusion Models, revolutionized restoration by enabling the synthesis of highly photo-realistic details, even from severe degradations. These models often leverage large pre-trained text-to-image models and use Supervised Fine-Tuning (SFT).
  4. Addressing Fidelity in Generative Restoration: While generative models excel in realism, SFT-trained diffusion models can suffer from hallucinations and fidelity issues in restoration tasks because their objective function (e.g., L2 loss) might not perfectly capture human perception or handle ill-posed problems.
  5. This Paper's Position: This paper introduces Reinforcement Learning (RL) as the next evolutionary step to refine diffusion-based restoration. By integrating RL, particularly with IQA-based rewards and difficulty-adaptive training, it aims to specifically address the fidelity and realism challenges that SFT-based methods face, pushing the boundaries of what diffusion models can achieve in restoration.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  • RL for Restoration (Novel Application): While RL has been applied to diffusion models for text-to-image generation (e.g., DDPO), this paper is the first to propose an effective RL training pipeline specifically for image restoration tasks using large pre-trained diffusion models. Restoration tasks have a stricter emphasis on fidelity to the ground truth, which differs significantly from the pure generation objectives of text-to-image.

  • IQA-based Reward (Instead of Reconstruction Error):

    • Related Work: Existing SFT methods (e.g., DiffBIR, StableSR) for restoration already optimize ground-truth-based objectives (like reconstruction error).
    • Innovation: This paper critically finds that using MLLM-based IQA models as a reward function is more effective than reconstruction error. Reconstruction error rewards offer little additional benefit over SFT and can be unstable. IQA models, trained on real images and biased towards physical plausibility, provide a distribution-level alignment reward, guiding the model to generate perceptually higher-quality images that are closer to the "distribution of real images" rather than merely matching pixels.
  • Difficulty-Adaptive Training Strategy (Dynamic RL-SFT Combination):

    • Related Work: Most RL applications for diffusion models operate in isolation or with fixed-weight combinations.
    • Innovation: This paper introduces a dynamic weighting strategy that combines RL and SFT based on the difficulty of individual samples. For hard samples (far from GT), RL with IQA reward drives exploration for distribution alignment. As samples improve, SFT is progressively reintroduced for fine-grained reference-based alignment. This intelligent switching mechanism is novel for balancing the exploratory nature of RL with the precise alignment capabilities of SFT.
  • Step-wise RL Supervision (Error Accumulation Mitigation):

    • Related Work: Prior RL methods for diffusion often apply rewards only to the final output (e.g., $\mathbf{x}_0$).
    • Innovation: This paper applies RL supervision at each intermediate step of the diffusion process. This helps to mitigate error accumulation and provides more consistent guidance throughout the denoising trajectory, which is crucial for maintaining fidelity in restoration tasks.
  • Policy Modeling with Refined Latent: The paper improves policy modeling by using a better denoised latent (an enhanced estimate of the clean image) as the target for the action, leading to more meaningful updates.

    In essence, the paper innovatively adapts RL to the unique demands of image restoration by rethinking the reward function, introducing dynamic training strategies, and refining the RL application within the diffusion process.

4. Methodology

The core methodology of this paper focuses on integrating Reinforcement Learning (RL) into diffusion-based image restoration models in a novel, difficulty-adaptive manner, primarily utilizing an Image Quality Assessment (IQA) model as the reward function. The aim is to overcome the limitations of traditional Supervised Fine-Tuning (SFT) in achieving both high fidelity and realism for restoration tasks.

4.1. Principles

The core idea is to leverage the strengths of RL to address specific shortcomings of SFT in diffusion-based image restoration. While SFT excels at minimizing direct pixel-wise differences to a ground truth, it struggles with the ill-posed nature of many restoration problems, often leading to hallucinations or unnatural artifacts. The theoretical basis or intuition behind the proposed method is two-fold:

  1. Distribution Alignment through Perceptual Reward: Instead of relying solely on reconstruction loss (which is already optimized during SFT and can lead to local optima), the paper posits that an Image Quality Assessment (IQA) model, especially an MLLM-based one, can provide a more perceptually meaningful reward. This IQA reward guides the diffusion model to generate images that are not just pixel-wise close to the ground truth, but also distributionally aligned with high-quality, realistic images. This encourages the model to explore alternative generation paths that produce visually superior results, effectively helping it escape local optimization points where SFT might get stuck.

  2. Adaptive Learning for Efficiency and Precision: The integration of RL is not absolute but adaptive. The intuition is that RL is most beneficial for hard samples (those with large discrepancies from the ground truth) where the model needs to "explore" new solutions. For samples that are already close to the ground truth distribution, the precise reference-based alignment offered by SFT is more efficient for fine-grained exploitation. This dynamic balance is achieved through an automatic weighting strategy that adjusts the influence of RL and SFT based on the difficulty of each training sample.

    The overall process can be viewed as an iterative refinement: first, RL with IQA drives a broader distribution alignment (exploration) to get the output into a perceptually high-quality space. Then, as the outputs become more aligned, SFT takes over for fine-grained alignment (exploitation) to precisely match the ground truth.

4.2. Core Methodology In-depth

The methodology builds upon the existing framework of diffusion models and adapts the Reinforcement Learning (RL) paradigm.

4.2.1. Diffusion Model Background

The paper first reiterates the fundamental equations of diffusion models. The forward process adds Gaussian noise to an image $\mathbf{x}_0$ over $T$ steps, characterized by a variance schedule $\beta_t$: $$q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I})$$ This formula describes the conditional probability of transitioning from the image $\mathbf{x}_{t-1}$ at time step $t-1$ to a noisier image $\mathbf{x}_t$ at time step $t$. It indicates that $\mathbf{x}_t$ is sampled from a Gaussian distribution $\mathcal{N}$ whose mean is $\sqrt{1-\beta_t}$ times $\mathbf{x}_{t-1}$ (a scaled version of the previous image) and whose covariance is $\beta_t$ times the identity matrix $\mathbf{I}$ (isotropic Gaussian noise).

The distribution of $\mathbf{x}_t$ conditioned on the original clean image $\mathbf{x}_0$ can also be sampled directly: $$q(\mathbf{x}_t \mid \mathbf{x}_0) = \mathcal{N}(\sqrt{\bar{\alpha}_t}\,\mathbf{x}_0, (1-\bar{\alpha}_t)\mathbf{I})$$ This equation shows that $\mathbf{x}_t$ can be obtained directly from $\mathbf{x}_0$ by adding a specific amount of noise. The mean is $\sqrt{\bar{\alpha}_t}$ times $\mathbf{x}_0$, and the covariance is $(1 - \bar{\alpha}_t)$ times $\mathbf{I}$. Here, $\alpha_t$ and $\bar{\alpha}_t$ are defined as $\alpha_t = 1 - \beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^t \alpha_s$.

  • $\mathbf{x}_0$: The initial, clean image (latent representation).

  • $\mathbf{x}_t$: The latent representation of the image at diffusion time step $t$, which is progressively noisier than $\mathbf{x}_0$.

  • $T$: The total number of diffusion time steps.

  • $\beta_t$: The variance schedule, a sequence of values that dictate the amount of noise added at each step.

  • $\mathbf{I}$: The identity matrix, implying independent noise in each dimension.

  • $\mathcal{N}(\mu, \Sigma)$: A Gaussian (normal) distribution with mean $\mu$ and covariance $\Sigma$.

  • $\alpha_t$: A scalar term related to the noise schedule.

  • $\bar{\alpha}_t$: The cumulative product of $\alpha_s$ values up to time $t$, essential for directly sampling $\mathbf{x}_t$ from $\mathbf{x}_0$.

    The reverse diffusion process, which aims to remove noise, is performed by a trained model $p_{\theta}(\mathbf{x}_{t-1} \mid \mathbf{x}_t)$, parameterized by $\theta$, which approximates the true posterior $q(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \mathbf{x}_0)$. This model is typically trained via variational inference to minimize the KL divergence between $p$ and $q$.
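The clean-image estimate and the posterior mean used in the reverse step follow directly from these definitions. The helper functions below are a standard DDPM-style sketch (our own naming, not the paper's implementation).

```python
import torch

def predict_x0(xt, eps_hat, abar_t):
    """Clean-image estimate implied by a noise prediction:
    x0_hat = (x_t - sqrt(1 - abar_t) * eps_hat) / sqrt(abar_t)."""
    abar_t = abar_t.view(-1, 1, 1, 1)
    return (xt - (1.0 - abar_t).sqrt() * eps_hat) / abar_t.sqrt()

def ddpm_posterior_mean(xt, x0_hat, abar_t, abar_prev, beta_t):
    """Mean of the true posterior q(x_{t-1} | x_t, x_0) from standard DDPM algebra."""
    abar_t = abar_t.view(-1, 1, 1, 1)
    abar_prev = abar_prev.view(-1, 1, 1, 1)
    beta_t = beta_t.view(-1, 1, 1, 1)
    alpha_t = 1.0 - beta_t
    coef_x0 = abar_prev.sqrt() * beta_t / (1.0 - abar_t)
    coef_xt = alpha_t.sqrt() * (1.0 - abar_prev) / (1.0 - abar_t)
    return coef_x0 * x0_hat + coef_xt * xt
```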

4.2.2. Diffusion with Reinforcement Learning (RL)

The paper adapts the diffusion model for RL by framing the denoising process as a multi-step Markov Decision Process (MDP). The goal is to optimize the model's parameters $\theta$ to maximize an expected reward: $$J(\theta) = \mathbb{E}_{c \sim p(c),\, \mathbf{x} \sim p_{\theta}(\mathbf{x} \mid c)}\big[r(\mathbf{x}, c)\big]$$

  • $J(\theta)$: The objective function to be maximized, representing the expected reward.

  • $\mathbb{E}$: The expectation operator.

  • $c$: The input condition, which in restoration is the low-quality (LQ) image.

  • $p(c)$: The distribution of input conditions.

  • $\mathbf{x}$: The generated output image (the denoised $\mathbf{x}_0$).

  • $p_{\theta}(\mathbf{x} \mid c)$: The sample distribution of the generated image $\mathbf{x}$ given the condition $c$, parameterized by $\theta$.

  • $r(\mathbf{x}, c)$: The reward function that assigns a score to the generated image $\mathbf{x}$ conditioned on $c$.

    The MDP components are defined as follows:

  • State ($s_t$): The state at time $t$ is a combination of the current noisy latent, the input condition, and the time step: $s_t \triangleq \{\mathbf{x}_t, c, t\}$.

  • Action ($a_t$): The action is the result of denoising, specifically the estimate of the image at the previous time step: $a_t \triangleq \hat{\mathbf{x}}_{t-1}$.

  • Policy ($\pi$): The policy is the denoising model itself, which predicts the action (denoised image) given the current state: $\pi(a_t \mid s_t) \triangleq p_{\theta}(\hat{\mathbf{x}}_{t-1} \mid \mathbf{x}_t, c)$.

  • State Transition Distribution ($P$): This describes how the state changes after an action. Here it is deterministic: taking action $a_t$ (i.e., $\hat{\mathbf{x}}_{t-1}$) transitions to a state whose latent is $\hat{\mathbf{x}}_{t-1}$, with the condition $c$ unchanged and the time step decremented. The Dirac function $\delta(\cdot)$ expresses this determinism: $P(s_{t+1} \mid s_t, a_t) \triangleq \{\delta(\hat{\mathbf{x}}_{t-1}), \delta(c), \delta(t-1)\}$.

  • Reward Function ($R$): The reward is associated with the action taken, specifically the quality of the denoised image $\hat{\mathbf{x}}_{t-1}$ given the condition $c$: $R(s_t, a_t) \triangleq r(\hat{\mathbf{x}}_{t-1}, c)$.

Policy Gradient and Advantage Estimation: For fine-tuning, the paper uses an importance sampling estimator to calculate policy gradients, similar to Proximal Policy Optimization (PPO). This involves two models: the current policy $\theta$ and an older policy $\theta_{\mathrm{old}}$. The objective function for policy optimization is: $$J(\theta) = \mathbb{E}\left[\sum_{t=0}^{T} \min\Big[w(\theta, \theta_{\mathrm{old}})\,\hat{A}(\hat{\mathbf{x}}_{t-1}, c),\; g\big(\epsilon, \hat{A}(\hat{\mathbf{x}}_{t-1}, c)\big)\Big]\right]$$ This is a clipped surrogate objective, common in PPO, that keeps the new policy from straying too far from the old one and thereby promotes training stability.

  • $w(\theta, \theta_{\mathrm{old}}) = \dfrac{p_{\theta}(\hat{\mathbf{x}}_{t-1} \mid \mathbf{x}_t, c)}{p_{\theta_{\mathrm{old}}}(\hat{\mathbf{x}}_{t-1} \mid \mathbf{x}_t, c)}$ This is the importance weight, the ratio of the likelihood of action $\hat{\mathbf{x}}_{t-1}$ under the current policy $\theta$ versus the old policy $\theta_{\mathrm{old}}$.
  • $g(\epsilon, \hat{A}) = \begin{cases} (1+\epsilon)\hat{A} & \text{if } \hat{A} \geq 0 \\ (1-\epsilon)\hat{A} & \text{if } \hat{A} < 0 \end{cases}$ The paper prints only the $\hat{A} \geq 0$ branch; the full expression is the standard PPO clipping term, equivalent to $\mathrm{clip}\big(w(\theta, \theta_{\mathrm{old}}), 1-\epsilon, 1+\epsilon\big)\hat{A}$. It caps the contribution of the importance weight, preventing excessively large policy updates that could destabilize training.
  • $\hat{A}(\hat{\mathbf{x}}_{t-1}, c) = \big(r(\hat{\mathbf{x}}_{t-1}, c) - \mu_r\big) / \sqrt{\sigma_r^2 + \epsilon}$ This is the estimated advantage. It normalizes the raw reward $r(\hat{\mathbf{x}}_{t-1}, c)$ by subtracting the mean reward $\mu_r$ and dividing by the standard deviation (with a small $\epsilon$ for numerical stability). This normalization is crucial for stabilizing RL training, especially when rewards have varying scales.
    • $\mu_r$: Mean of the rewards.

    • $\sigma_r^2$: Variance of the rewards.

    • $\epsilon$: A small constant for numerical stability (e.g., to prevent division by zero).

      The paper notes a distinction from previous diffusion-based RL methods regarding reward normalization. Instead of normalizing rewards on a per-context basis by tracking a running mean and standard deviation for each prompt (as in [3]), the authors compute the mean and variance from both the tracked reward history of each input image and the reward values in the current batch, yielding a more robust and responsive normalization.
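A minimal sketch of this normalization and of the clipped surrogate objective is shown below. It assumes per-image reward histories are kept in a buffer and that per-step log-probabilities of the sampled action are available from the current and old policies; all names are ours, not the paper's.

```python
import torch

def normalized_advantage(rewards, history, eps=1e-6):
    """A_hat = (r - mu_r) / sqrt(sigma_r^2 + eps), with mu_r and sigma_r computed
    from both the per-image reward history and the current batch."""
    pool = torch.cat([history.flatten(), rewards.flatten()])
    mu_r, var_r = pool.mean(), pool.var()
    return (rewards - mu_r) / torch.sqrt(var_r + eps)

def clipped_surrogate(logp_new, logp_old, adv, clip_eps=0.2):
    """PPO-style clipped objective for one denoising step (to be maximized)."""
    ratio = torch.exp(logp_new - logp_old)          # importance weight w(theta, theta_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv
    return torch.min(unclipped, clipped).mean()
```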

Policy Modeling with a Better Direction: The paper introduces an enhancement to policy modeling. In standard diffusion RL, the policy generates $\hat{\mathbf{x}}_{t-1}$ from $\mathbf{x}_t$. The innovation here is to improve this $\hat{\mathbf{x}}_{t-1}$ by applying an additional denoising step to $\mathbf{x}_t$, producing an enhanced estimate $\hat{\mathbf{x}}^{\text{enhanced}}_{t-1}$ that is closer to the clean image $\mathbf{x}_0$. This refined denoised result serves as a better target for the action, thereby improving the policy by providing a more reliable direction for exploration. In the proposed method, the action $a_t \triangleq \hat{\mathbf{x}}_{t-1}$ likely refers to this enhanced estimate.

Reward at Different Time Steps of Diffusion Generation: Contrary to prior diffusion-based RL methods that apply the reward function only to the final output $\mathbf{x}_0$, this paper argues that step-wise RL supervision is crucial for restoration tasks. Since intermediate results contain important fidelity signals, applying the reward at each intermediate time step of the diffusion process helps prevent artifact accumulation and fidelity loss throughout the denoising trajectory, leading to more consistent guidance.
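The loop below sketches how such step-wise supervision could be wired together with the refined action target described above. `denoise_step` (returning the refined estimate plus its log-probabilities under the new and old policies) and `reward_model` are hypothetical placeholders, not the paper's API.

```python
import torch

def stepwise_rl_objective(denoise_step, reward_model, x_T, cond, timesteps, clip_eps=0.2):
    """Accumulate the clipped RL objective over every denoising step,
    instead of rewarding only the final x_0 (a sketch, not the paper's code)."""
    x_t, total = x_T, 0.0
    for t in timesteps:                                          # e.g. reversed(range(T))
        x_prev, logp_new, logp_old = denoise_step(x_t, t, cond)  # action: refined x_{t-1}
        reward = reward_model(x_prev)                            # IQA score of the intermediate estimate
        adv = reward - reward.mean()                             # simplistic per-batch baseline
        ratio = torch.exp(logp_new - logp_old)
        step_obj = torch.min(ratio * adv,
                             torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv).mean()
        total = total + step_obj
        x_t = x_prev
    return total
```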

4.2.3. Reward for Diffusion-based RL in Restoration

The choice of reward function is central to the paper's contribution.

Reward with GT supervision (and its limitations): A straightforward approach for restoration would be to use a ground-truth-based reward, measuring the distance between the model's output and the high-quality ground truth $\mathbf{g}$: $$r(\hat{\mathbf{x}}_{t-1}, c) = \|\hat{\mathbf{x}}_{t-1} - \mathbf{g}\|$$

  • $\mathbf{g}$: The ground truth (high-quality) image.

  • $\|\cdot\|$: A distance metric, typically the L1 or L2 norm.

    However, extensive experiments showed this reward to be ineffective. The reasons identified are:

  1. The SFT stage already minimizes this type of distance, so RL provides limited additional benefit and tends to follow the same search direction.
  2. Optimizing towards a single point in latent space (which is what minimizing distance to GT implies) is highly ill-posed, leading to unstable optimization and sometimes worse results (illustrated in Figure 2 and Figure 5).

Reward with IQA score: overview and distribution alignment: The paper proposes using an Image Quality Assessment (IQA) model as the reward function. The key insight is that MLLM-based IQA models can effectively detect artifacts and assign high scores only to real, high-quality images. By guiding the generation process towards outputs with higher IQA scores, the model naturally steers its output distribution toward the ground truth distribution. This is a form of distribution-level alignment rather than direct reference-based alignment. This helps the model escape local optimization points and discover alternative solutions.

This concept is related to optimal transport theory. When distributions are aligned (i.e., output $y_i$ and ground truth $g_i$ both belong to distribution $A$), the optimal transport value (a measure of distance between distributions) is smaller than when they belong to different distributions ($A$ and $B$): $$\mathcal{O}(\{y_i\}, \{g_i\} \mid y_i \in A, g_i \in A) \leq \mathcal{O}(\{y_i\}, \{g_i\} \mid y_i \in A, g_i \in B)$$

  • $\mathcal{O}$: The optimal transport value, a measure of the minimum "cost" to transform one probability distribution into another.
  • $\{y_i\}$: A set of generated image samples.
  • $\{g_i\}$: A set of ground truth image samples.
  • $A$: A distribution representing high-quality, realistic images.
  • $B$: A distribution representing low-quality or artifact-ridden images. This inequality implies that it is "easier" (lower optimal transport cost) to align outputs with ground truths when both belong to the same high-quality distribution.

Reward with IQA score: implementation: The reward function is formulated using an MLLM-based IQA model, denoted as $M$: $$r(\hat{\mathbf{x}}_{t-1}, c) = M(\hat{\mathbf{x}}_{t-1})$$

  • $M(\cdot)$: The MLLM-based IQA model that outputs a quality score for an image. This reward encourages the model to produce images that score highly on perceptual quality, thereby implicitly aligning with the distribution of real, high-quality images.
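In practice the reward is just the scalar score returned by the IQA model. The wrapper below is a hypothetical sketch: `iqa_model` stands in for whichever MLLM-based scorer is used (e.g., DeQA-Score), whose real interface may differ.

```python
import torch

class IQAReward:
    """Wrap an MLLM-based IQA scorer as the RL reward r(x_hat, c) = M(x_hat)."""

    def __init__(self, iqa_model):
        self.iqa_model = iqa_model      # assumed: maps a batch of images to scalar scores

    @torch.no_grad()
    def __call__(self, x_hat: torch.Tensor) -> torch.Tensor:
        # Higher score = closer to the distribution of real, high-quality images.
        return self.iqa_model(x_hat)
```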

4.2.4. Training Diffusion-based RL with Adaptive Manner for Fine-grained Alignment

The paper introduces a difficulty-adaptive training strategy to combine the strengths of both RL (for distribution alignment and exploration) and SFT (for fine-grained reference-based alignment and exploitation).

The main idea is that RL with IQA is crucial for hard samples (those far from the ground truth) to explore new generation paths. Once a sample's output distribution is sufficiently aligned, SFT becomes more effective for precise alignment.

Implementation (Automatic Weighting Strategy): The transition between RL and SFT is managed by dynamically assigning loss weights to individual samples based on their difficulty. Difficulty is measured by the reconstruction difference (e.g., the L2 norm) between the model's output and the ground truth. For a batch of $m$ samples $b_1, \dots, b_m$, with inference results $y_1, \dots, y_m$ and ground truths $g_1, \dots, g_m$, the loss weight $w_i$ for sample $i$ is computed as (a short implementation sketch follows the list below): $$w_i = \|y_i - g_i\| / \max_{j \in [1, m]} \|y_j - g_j\|$$

  • $w_i$: The adaptive loss weight for sample $i$.
  • $\|y_i - g_i\|$: The reconstruction difference (e.g., L1 or L2 distance) between the model's output $y_i$ and the ground truth $g_i$ for sample $i$.
  • $\max_{j \in [1, m]} \|y_j - g_j\|$: The maximum reconstruction difference within the current training batch. This weighting scheme ensures that more difficult samples (larger reconstruction error) receive higher weights $w_i$, giving more emphasis to the RL objective for these samples. Conversely, samples closer to the ground truth (smaller reconstruction error) receive smaller $w_i$, allowing SFT to dominate.
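A direct implementation of this weighting (using the L2 norm as the reconstruction difference) could look like the following sketch.

```python
import torch

def difficulty_weights(outputs: torch.Tensor, gts: torch.Tensor) -> torch.Tensor:
    """Eq. 9: w_i = ||y_i - g_i|| / max_j ||y_j - g_j||, computed within a batch."""
    diffs = (outputs - gts).flatten(1).norm(dim=1)   # per-sample L2 reconstruction error
    return diffs / diffs.max().clamp_min(1e-8)       # hard samples get weights close to 1
```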

Final Loss Function: The final loss combines the standard diffusion loss ($L_{\mathrm{diff},\theta}$), the RL loss ($J(\theta)$), and an optional KL divergence term ($L_{\mathrm{KL},\theta}$) to stabilize training. For a batch of data $b_1, \dots, b_m$, the total loss is: $$L_{\theta} = \sum_{i=1}^{m} (1 - w_i)\, L_{\mathrm{diff},\theta}(b_i, g_i) + \sum_{i=1}^{m} w_i \times J(\theta)(b_i) + L_{\mathrm{KL},\theta}$$

  • $L_{\theta}$: The total loss to be minimized for training the model.

  • $(1 - w_i)$: The weight for the SFT diffusion loss. This term is higher for "easy" samples (small $w_i$).

  • $L_{\mathrm{diff},\theta}(b_i, g_i)$: The standard diffusion loss (e.g., L1/L2 loss between predicted and actual noise, or between the predicted $\mathbf{x}_0$ and the ground truth) for sample $b_i$ with ground truth $g_i$.

  • $w_i$: The weight for the RL loss. This term is higher for "hard" samples (large $w_i$).

  • $J(\theta)(b_i)$: The RL objective (from Eq. 4) for sample $b_i$, which is maximized (equivalently, its negative is minimized within the loss).

  • $L_{\mathrm{KL},\theta}$: An optional KL divergence term, commonly used in PPO-like algorithms (e.g., DDPO [3]), that penalizes divergence of the current policy from the initial supervised policy, ensuring the model does not drift too far from the SFT-trained reference. This term is crucial for stability.

    The overall pipeline can be visualized in the paper's figures. Figure 3 illustrates the schematic of the RL algorithm, with stars representing changes in each sample's latent. Figure 4 outlines the inference process including difficulty weighting. Figure 5 provides a detailed overview of the RL recovery process, showing how rewards, advantages, the RL loss, the SFT diffusion loss, and the KL loss are computed and combined.
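The per-sample blending of the two objectives can be sketched as below. Since $J(\theta)$ is a quantity to be maximized, it enters the loss with a negative sign in this sketch; the exact bookkeeping in the paper's implementation may differ.

```python
def combined_loss(sft_losses, rl_objectives, weights, kl_loss=0.0):
    """Eq. 10 (sketch): difficulty-weighted mix of SFT diffusion loss and RL objective."""
    sft_term = ((1.0 - weights) * sft_losses).sum()   # easy samples lean on SFT
    rl_term = -(weights * rl_objectives).sum()        # hard samples lean on RL (maximize J)
    return sft_term + rl_term + kl_loss
```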

The following figure (Figure 3 from the original paper) illustrates the schematic diagram of the RL algorithm:

Figure 3 (from the original paper). The schematic diagram of the RL algorithm; the stars of various colors in (a) and (b) represent the change of each sample's latent. Part (a) depicts the relationship between the original SFT process and the ground truth, and part (c) shows final results under metrics such as PSNR and SSIM to evaluate restoration performance.

The following figure (Figure 4 from the original paper, provided here as Image 3) illustrates the inference process:

该图像是示意图,展示了在Diffusion模型中进行推断的过程,包括数据批次的输入、去噪处理以及损失计算等步骤。图中强调了依据困难度计算权重的过程,公式为 `Eq. 9`。 该图像是示意图,展示了在Diffusion模型中进行推断的过程,包括数据批次的输入、去噪处理以及损失计算等步骤。图中强调了依据困难度计算权重的过程,公式为 Eq. 9

Figure 4 from the original paper (Image 3 in this analysis). The inference process in diffusion models, including the input of data batches, denoising operations, and loss computation steps. The figure emphasizes the process of computing weights based on difficulty, represented by the formula Eq. 9.

The following figure (Figure 5 from the original paper) illustrates the detailed pipeline of the RL recovery process:

Figure 5 (from the original paper). The recovery process of a diffusion model that combines reinforcement learning with an IQA model. Rewards are computed first; advantages for each input are then retrieved via the reward history, and the RL loss and SFT diffusion loss are computed. Finally, a KL loss is applied for regularization. Each step corresponds to the formulas in the paper (Eq. 3, Eq. 4, Eq. 5, Eq. 8, Eq. 10).

5. Experimental Setup

This section details the datasets, evaluation metrics, and baseline models used to assess the proposed RL framework.

5.1. Datasets

The experiments cover various image restoration tasks, demonstrating the broad applicability of the proposed method. The paper does not include concrete data samples in the text, so the datasets are described by their characteristics below.

  • Low-Light Image Enhancement:

    • LOL-real [51]: A real-world low-light image enhancement benchmark.
    • LOL-synthetic [51]: A synthetic dataset for low-light image enhancement.
    • SID [5]: A dataset for learning to see in the dark, often used for very low-light conditions.
    • SMID [6]: A dataset for seeing motion in the dark, combining low-light and motion blur challenges.
    • Domain: Experiments are conducted in the sRGB domain.
  • Deraining:

    • Rain13K [56]: Used for training deraining models.
    • Evaluation Benchmarks:
      • Rain100H [50]
      • Rain100L [50]
      • Test100 [58]
      • Test1200 [57]
      • Test2800 [11]
  • Single-Image Motion Deblurring:

    • GoPro [24]: Used for training motion deblurring models.
    • Evaluation Benchmarks:
      • Synthetic: GoPro [24], HIDE [29]
      • Real-world: RealBlur-R [27], RealBlur-J [27]
  • Defocus Deblurring:

    • DPDD [1]: Used for training defocus deblurring models.
    • Evaluation Benchmarks:
      • EBDB [15]

      • JNB [30]

        These datasets are widely recognized and representative within their respective restoration communities. Their selection ensures a comprehensive evaluation of the model's performance across different degradation types and real-world scenarios.

5.2. Evaluation Metrics

The paper uses several standard metrics to quantitatively evaluate the performance of restoration models.

5.2.1. PSNR (Peak Signal-to-Noise Ratio)

  • Conceptual Definition: PSNR is a quality metric that quantifies the difference between an original (ground truth) image and a reconstructed (restored) image. It is defined as the ratio of the maximum possible power of a signal to the power of corrupting noise that affects the fidelity of its representation. A higher PSNR value generally indicates a higher quality reconstruction, implying less difference from the ground truth. It is typically expressed in decibels (dB).
  • Mathematical Formula: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $ where MSE (Mean Squared Error) is calculated as: $ MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $
  • Symbol Explanation:
    • $MAX_I$: The maximum possible pixel value of the image. For an 8-bit grayscale image, this is 255.
    • $MSE$: Mean Squared Error, the average of the squared differences between the original and reconstructed images.
    • $I(i,j)$: The pixel value at row $i$ and column $j$ of the original (ground truth) image.
    • $K(i,j)$: The pixel value at row $i$ and column $j$ of the reconstructed (restored) image.
    • $M \times N$: The dimensions of the images (height $\times$ width).
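A reference implementation of PSNR from this definition is straightforward; the snippet below assumes 8-bit images unless `max_val` is overridden.

```python
import numpy as np

def psnr(img: np.ndarray, ref: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between a restored image and its ground truth."""
    mse = np.mean((img.astype(np.float64) - ref.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                          # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)
```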

5.2.2. SSIM (Structural Similarity Index Measure)

  • Conceptual Definition: SSIM is a perceptual metric that measures the similarity between two images. Unlike PSNR, which is an error sensitivity metric, SSIM attempts to model the perceived change in structural information between images, which is often a more accurate representation of human perception of quality. It takes into account three key factors: luminance, contrast, and structure. The SSIM value ranges from -1 to 1, with 1 indicating perfect similarity. Higher values indicate better quality.
  • Mathematical Formula: $ SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
  • Symbol Explanation:
    • $x$: The original image (or a patch from it).
    • $y$: The reconstructed image (or a corresponding patch from it).
    • $\mu_x$: The average of $x$.
    • $\mu_y$: The average of $y$.
    • $\sigma_x^2$: The variance of $x$.
    • $\sigma_y^2$: The variance of $y$.
    • $\sigma_{xy}$: The covariance of $x$ and $y$.
    • $C_1 = (K_1 L)^2$ and $C_2 = (K_2 L)^2$: Two small constants included to avoid instability when the denominators are close to zero. $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit grayscale images), and $K_1, K_2$ are small constants (e.g., 0.01 and 0.03).

5.2.3. LPIPS (Learned Perceptual Image Patch Similarity)

  • Conceptual Definition: LPIPS is a perceptual distance metric designed to align better with human judgments of image similarity than traditional metrics like PSNR or SSIM. It calculates the distance between two image patches by feeding them through a pre-trained deep convolutional neural network (e.g., VGG or AlexNet), extracting features from various layers, and then computing a weighted $L_2$ distance between these feature stacks. Lower LPIPS values indicate higher perceptual similarity (i.e., better quality).
  • Mathematical Formula: $$LPIPS(x, y) = \sum_l \frac{1}{H_l W_l} \|w_l \odot (\phi_l(x) - \phi_l(y))\|_2^2$$
  • Symbol Explanation:
    • $x$: The original image.
    • $y$: The reconstructed image.
    • $\phi_l$: The feature stack from layer $l$ of a pre-trained deep neural network (e.g., VGG, AlexNet).
    • $w_l$: A vector of learned weights that scale the activations for each channel at layer $l$; these weights are learned to best match human perceptual judgments.
    • $H_l, W_l$: Height and width of the feature maps at layer $l$.
    • $\odot$: Element-wise product.
    • $\|\cdot\|_2^2$: Squared $L_2$ norm (Euclidean distance).
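In practice LPIPS is usually computed with the authors' `lpips` package rather than re-implemented. The usage sketch below assumes that package is installed and that its interface matches the official release (inputs in [-1, 1], NCHW tensors).

```python
import torch
import lpips  # assumed: official LPIPS package (pip install lpips)

loss_fn = lpips.LPIPS(net='alex')           # AlexNet backbone with learned linear weights
img0 = torch.rand(1, 3, 256, 256) * 2 - 1   # images are expected in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
distance = loss_fn(img0, img1)              # lower = perceptually more similar
print(distance.item())
```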

5.2.4. FID (Frechet Inception Distance)

  • Conceptual Definition: FID is a metric used to assess the quality of images generated by generative models, often reflecting both the fidelity and diversity of the generated samples. It calculates the Fréchet distance between the distribution of real images and the distribution of generated images in a feature space. This feature space is typically obtained from an Inception-v3 network pre-trained on ImageNet. A lower FID score indicates better quality and diversity of the generated images, meaning their distribution is closer to that of real images.
  • Mathematical Formula: $$FID = \|\mu_x - \mu_g\|_2^2 + \mathrm{Tr}\big(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}\big)$$
  • Symbol Explanation:
    • $\mu_x$: The mean feature vector of real images extracted from the Inception-v3 network.
    • $\mu_g$: The mean feature vector of generated images extracted from the Inception-v3 network.
    • $\|\cdot\|_2^2$: Squared $L_2$ norm (Euclidean distance), measuring the distance between the means.
    • $\Sigma_x$: The covariance matrix of feature vectors for real images.
    • $\Sigma_g$: The covariance matrix of feature vectors for generated images.
    • $\mathrm{Tr}$: The trace of a matrix (sum of diagonal elements).
    • $(\Sigma_x \Sigma_g)^{1/2}$: The matrix square root of the product of the covariance matrices.
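Given Inception-feature statistics for the two image sets, the FID formula above can be evaluated directly; a minimal sketch (our own helper, assuming the means and covariances have already been computed) is shown below.

```python
import numpy as np
from scipy import linalg

def fid_from_stats(mu_x, sigma_x, mu_g, sigma_g):
    """FID = ||mu_x - mu_g||^2 + Tr(Sigma_x + Sigma_g - 2 (Sigma_x Sigma_g)^{1/2})."""
    diff = mu_x - mu_g
    covmean, _ = linalg.sqrtm(sigma_x @ sigma_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real               # drop tiny imaginary parts from numerical error
    return float(diff @ diff + np.trace(sigma_x + sigma_g - 2.0 * covmean))
```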

5.3. Baselines

The proposed method is compared against several representative state-of-the-art diffusion-based restoration methods.

  • DiffBIR [18]: A diffusion-based model for blind image restoration using generative diffusion priors.
  • StableSR [37]: Exploits diffusion priors for real-world image super-resolution.
  • PASD [49]: Pixel-aware stable diffusion for realistic image super-resolution and personalized stylization.
  • XPSR [26]: Cross-modal priors for diffusion-based image super-resolution.
  • TSD-SR [9]: One-step diffusion with target score distillation for real-world image super-resolution.
  • RAP [38]: Restoration prior enhancement in diffusion models for realistic image super-resolution.
  • FaithDiff [8]: Unleashes diffusion priors for faithful image super-resolution.
  • Pixel [33]: Pixel-level and semantic-level adjustable super-resolution using a dual-LoRA approach.

Backbones:

  • For low-light image enhancement, the SNR-aware network [46] is adopted as the pre-trained restoration model to extract clean conditional inputs from degraded images.

  • For other tasks (image deraining and deblurring), Restormer [56] is employed as the restoration backbone.

  • All these baselines typically use a text-to-image Stable Diffusion model as their pre-trained diffusion backbone.

    The experiments compare these baseline methods:

  1. Original Baseline: The performance of the SOTA methods as published.
  2. +Diff.SFT: These methods are further fine-tuned for the same limited number of iterations as the RL approach, but only using the difficulty-aware SFT strategy (i.e., the weighting scheme defined in Eq. 9 applied to SFT loss alone), focusing on hard samples. This serves as an ablation to show the benefit of adaptive SFT without RL.
  3. +Our RL: These methods integrate the full proposed RL strategy, including the IQA reward and difficulty-adaptive training. This demonstrates the full benefit of the paper's contribution.

6. Results & Analysis

The experimental results validate the effectiveness of the proposed difficulty-adaptive Reinforcement Learning (RL) framework with IQA reward across various image restoration tasks. The comparisons highlight the advantages of the proposed method over existing diffusion-based restoration techniques and demonstrate the impact of individual components through ablation studies.

6.1. Core Results Analysis

The core results consistently show that applying the proposed RL strategy (+Our RL) significantly boosts the performance of existing diffusion-based restoration models. This improvement is observed across multiple quantitative metrics (PSNR, SSIM, LPIPS, FID) and various tasks, confirming the method's generality and effectiveness.

The paper first establishes a meaningful baseline by showing that even applying difficulty-aware Supervised Fine-Tuning (+Diff.SFT), i.e., fine-tuning specifically on hard samples using the proposed weighting scheme, improves performance over the original SOTA methods. This confirms the importance of targeted alignment for challenging inputs. Crucially, +Our RL consistently outperforms +Diff.SFT, demonstrating that the dynamic RL approach, particularly with the IQA-based reward for distribution alignment, provides additional, substantial benefits beyond what targeted SFT alone can achieve.

6.1.1. Low-Light Image Enhancement (LOL-real and LOL-synthetic)

The following are the results from Table 1 of the original paper:

| Methods | LOL-real PSNR↑ | LOL-real SSIM↑ | LOL-real LPIPS↓ | LOL-real FID↓ | LOL-synthetic PSNR↑ | LOL-synthetic SSIM↑ | LOL-synthetic LPIPS↓ | LOL-synthetic FID↓ |
|---|---|---|---|---|---|---|---|---|
| DiffBIR | 16.89 | 0.717 | 0.1139 | 88.61 | 20.25 | 0.752 | 0.1004 | 40.17 |
| +Diff.SFT | 17.35 | 0.720 | 0.0988 | 80.66 | 20.56 | 0.750 | 0.0953 | 38.29 |
| +Our RL | 22.11 | 0.744 | 0.0846 | 64.58 | 21.27 | 0.755 | 0.0711 | 35.47 |
| StableSR | 20.39 | 0.735 | 0.1227 | 76.71 | 23.42 | 0.784 | 0.1173 | 42.66 |
| +Diff.SFT | 21.68 | 0.741 | 0.1072 | 74.79 | 24.06 | 0.801 | 0.0918 | 40.71 |
| +Our RL | 22.47 | 0.754 | 0.0918 | 71.22 | 24.87 | 0.812 | 0.0901 | 38.63 |
| PASD | 20.58 | 0.729 | 0.1095 | 78.89 | 22.86 | 0.780 | 0.0935 | 38.76 |
| +Diff.SFT | 21.43 | 0.732 | 0.1041 | 76.48 | 23.17 | 0.791 | 0.0827 | 37.05 |
| +Our RL | 22.46 | 0.753 | 0.0967 | 73.90 | 24.02 | 0.803 | 0.0806 | 35.44 |
| XPSR [26] | 21.15 | 0.730 | 0.1003 | 75.47 | 23.04 | 0.786 | 0.0918 | 36.28 |
| +Diff.SFT | 21.46 | 0.735 | 0.0980 | 74.01 | 23.28 | 0.792 | 0.0893 | 34.57 |
| +Our RL | 22.03 | 0.746 | 0.0939 | 71.12 | 23.90 | 0.798 | 0.0841 | 33.05 |
| TSD-SR [9] | 21.24 | 0.737 | 0.1026 | 77.83 | 23.15 | 0.769 | 0.0954 | 38.42 |
| +Diff.SFT | 21.82 | 0.741 | 0.1004 | 76.09 | 23.46 | 0.770 | 0.0928 | 36.73 |
| +Our RL | 22.46 | 0.748 | 0.0971 | 73.60 | 24.07 | 0.775 | 0.0890 | 34.84 |
| RAP [38] | 21.79 | 0.741 | 0.1042 | 79.55 | 23.48 | 0.753 | 0.0972 | 39.50 |
| +Diff.SFT | 21.94 | 0.746 | 0.1015 | 77.93 | 23.87 | 0.759 | 0.0941 | 37.82 |
| +Our RL | 22.50 | 0.752 | 0.0968 | 75.14 | 24.53 | 0.764 | 0.0886 | 35.17 |
| FaithDiff [8] | 22.05 | 0.749 | 0.0934 | 74.07 | 23.92 | 0.771 | 0.0883 | 35.61 |
| +Diff.SFT | 22.37 | 0.755 | 0.0902 | 73.19 | 24.16 | 0.778 | 0.0861 | 34.23 |
| +Our RL | 23.12 | 0.763 | 0.0871 | 71.68 | 24.83 | 0.785 | 0.0829 | 32.08 |
| Pixel [33] | 21.08 | 0.724 | 0.0987 | 78.46 | 23.36 | 0.750 | 0.0975 | 40.24 |
| +Diff.SFT | 21.49 | 0.728 | 0.0956 | 77.04 | 23.67 | 0.755 | 0.0950 | 38.43 |
| +Our RL | 22.17 | 0.740 | 0.0905 | 72.19 | 24.21 | 0.762 | 0.0893 | 36.59 |

Table 1 clearly shows that +Our RL consistently achieves higher PSNR and SSIM, and lower LPIPS and FID values across all tested base models (DiffBIR, StableSR, PASD, XPSR, TSD-SR, RAP, FaithDiff, Pixel) on both LOL-real and LOL-synthetic datasets. For instance, DiffBIR with +Our RL sees a remarkable jump in PSNR from 16.89 to 22.11 on LOL-real and a significant drop in FID from 88.61 to 64.58. This indicates that the RL strategy effectively enhances both pixel-level accuracy and perceptual quality for low-light image enhancement. The improvement of +Diff.SFT over the baseline demonstrates the benefit of focusing on hard samples, while +Our RL further boosts this performance, proving the efficacy of the RL framework.

6.1.2. Low-Light Image Enhancement (SID and SMID)

The following are the results from Table 2 of the original paper:

| Methods | SID PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | SMID PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR | 17.85 | 0.604 | 0.2178 | 90.62 | 22.47 | 0.763 | 0.1836 | 88.21 |
| +Diff.SFT | 20.75 | 0.619 | 0.1882 | 87.68 | 23.81 | 0.778 | 0.1798 | 83.09 |
| +Our RL | 22.48 | 0.634 | 0.1733 | 84.64 | 24.56 | 0.781 | 0.1672 | 80.47 |
| StableSR | 20.27 | 0.620 | 0.2074 | 87.39 | 24.08 | 0.773 | 0.1798 | 85.75 |
| +Diff.SFT | 21.38 | 0.633 | 0.1895 | 85.76 | 24.63 | 0.780 | 0.1727 | 83.82 |
| +Our RL | 22.04 | 0.662 | 0.1758 | 83.87 | 25.37 | 0.790 | 0.1601 | 80.94 |
| PASD | 20.62 | 0.674 | 0.1958 | 81.83 | 24.78 | 0.780 | 0.1856 | 83.14 |
| +Diff.SFT | 20.90 | 0.686 | 0.1829 | 78.96 | 24.97 | 0.786 | 0.1809 | 81.30 |
| +Our RL | 21.91 | 0.703 | 0.1800 | 76.75 | 25.37 | 0.792 | 0.1778 | 78.95 |

Similar trends are observed on the SID and SMID datasets. For example, DiffBIR with +Our RL yields a PSNR of 22.48 on SID, a significant improvement over its baseline (17.85) and +Diff.SFT (20.75). The LPIPS and FID values also consistently decrease, indicating improved perceptual quality and distributional alignment. These results further reinforce the benefits of the proposed RL framework for low-light enhancement, even in more challenging scenarios (such as SMID, which includes motion).

6.1.3. Image Deraining

The following are the results from Table 3 of the original paper:

| Method | Test100 PSNR↑ | SSIM↑ | Rain100H PSNR↑ | SSIM↑ | Rain100L PSNR↑ | SSIM↑ | Test2800 PSNR↑ | SSIM↑ | Test1200 PSNR↑ | SSIM↑ | Mean PSNR↑ | SSIM↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR | 24.50 | 0.707 | 24.84 | 0.702 | 30.83 | 0.765 | 27.75 | 0.734 | 27.26 | 0.721 | 27.04 | 0.726 |
| DiffBIR+Ours | 25.78 | 0.739 | 25.62 | 0.724 | 31.82 | 0.781 | 28.99 | 0.750 | 28.54 | 0.738 | 28.15 | 0.746 |
| PASD | 25.30 | 0.718 | 25.59 | 0.723 | 31.17 | 0.776 | 28.79 | 0.748 | 28.54 | 0.737 | 27.88 | 0.740 |
| PASD+Ours | 26.87 | 0.740 | 26.46 | 0.745 | 32.04 | 0.792 | 29.37 | 0.763 | 29.25 | 0.750 | 28.80 | 0.758 |

For deraining, both DiffBIR and PASD significantly benefit from +Our RL. For example, PASD+Ours achieves a mean PSNR of 28.80 and a mean SSIM of 0.758, outperforming the baseline PASD (27.88 PSNR, 0.740 SSIM). This demonstrates that the RL framework effectively helps remove rain streaks while preserving image details, leading to cleaner and more perceptually pleasing results.

6.1.4. Single-Image Motion Deblurring

The following are the results from Table 4 of the original paper:

| Method | GoPro PSNR↑ | SSIM↑ | HIDE PSNR↑ | SSIM↑ | RealBlur-R PSNR↑ | SSIM↑ | RealBlur-J PSNR↑ | SSIM↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR | 27.99 | 0.862 | 26.88 | 0.844 | 31.87 | 0.853 | 24.53 | 0.772 |
| +Ours | 29.76 | 0.887 | 28.43 | 0.862 | 33.09 | 0.872 | 25.67 | 0.790 |
| PASD | 28.67 | 0.881 | 27.41 | 0.858 | 32.64 | 0.869 | 25.36 | 0.793 |
| +Ours | 30.02 | 0.893 | 28.49 | 0.870 | 33.87 | 0.876 | 26.80 | 0.802 |

For motion deblurring, DiffBIR+Ours and PASD+Ours show substantial improvements in PSNR and SSIM across all datasets (GoPro, HIDE, RealBlur-R, RealBlur-J). For example, PASD+Ours achieves 30.02 PSNR on GoPro, compared to 28.67 for the baseline. This indicates that the RL approach effectively reduces motion blur, yielding sharper and more detailed images.

6.1.5. Defocus Deblurring

The following are the results from Table 5 of the original paper:

| Method | Indoor (S) PSNR↑ | SSIM↑ | Outdoor (S) PSNR↑ | SSIM↑ | Indoor (D) PSNR↑ | SSIM↑ | Outdoor (D) PSNR↑ | SSIM↑ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DiffBIR | 25.34 | 0.808 | 20.07 | 0.663 | 25.91 | 0.835 | 20.67 | 0.684 |
| +Ours | 26.28 | 0.829 | 21.62 | 0.679 | 27.20 | 0.844 | 21.05 | 0.697 |
| PASD | 27.36 | 0.844 | 20.75 | 0.702 | 27.02 | 0.863 | 21.90 | 0.708 |
| +Ours | 27.97 | 0.866 | 21.73 | 0.721 | 27.74 | 0.872 | 22.43 | 0.720 |

Here, Indoor/Outdoor denote indoor and outdoor scenes, (S) denotes single-image defocus deblurring, and (D) denotes dual-pixel defocus deblurring.

For defocus deblurring, both DiffBIR+Ours and PASD+Ours demonstrate improved PSNR and SSIM values across indoor and outdoor scenes, for both single-image (S) and dual-pixel (D) deblurring. PASD+Ours notably achieves 27.97 PSNR for indoor single-image deblurring, compared to 27.36 for the baseline. These results confirm the method's ability to effectively remove defocus blur and restore image sharpness.

The overall consistency of improvements across different tasks and metrics underscores the robustness and generalizability of the proposed RL framework. Visual comparisons in Figure 6 (provided as Image 6 in this analysis) further illustrate these enhancements in terms of visual quality and perceptual fidelity.

The following figure (Figure 6 from the original paper, provided here as Image 6) provides visual comparisons:


Figure 6 from the original paper (Image 6 in this analysis). Comparison chart of various image restoration results, showcasing the performance of different methods on multiple datasets, including LOL-real, SID, and Derain. Each row displays different input images and restored results, including our algorithm and the corresponding ground truth (GT).

6.1.6. IQA Scores for Ground Truth vs. SFT vs. RL

The following figure (Figure 1 from the original paper, provided here as Image 1) illustrates the IQA scores:

Figure 1 from the original paper (Image 1 in this analysis). IQA scores computed using the MLLM-based IQA model DeQA-Score [54] on the widely-used real-world low-light image enhancement benchmark LOL-real [51]. The original IQA score of the diffusion-based method (DiffBIR [18]) via SFT shows a significant gap compared to the IQA score of the ground truth; thus, IQA can serve to distinguish the distribution between output images and ground truths. After the proposed RL training, the output aligns more closely with the ground-truth distribution.

Figure 1 provides a critical insight into the initial motivation. It shows that the IQA score of the SFT-trained DiffBIR output has a noticeable gap compared to the ground truth's IQA score. After applying the proposed RL with IQA reward, the output's IQA score aligns much more closely with the ground truth, validating the core hypothesis that IQA can guide distributional alignment.

6.1.7. Visual Comparisons using Reconstruction Error vs. IQA Reward

The following figure (Figure 2 from the original paper, provided here as Image 1 in this analysis) illustrates visual comparisons using different reward functions:

Figure 2 from the original paper (Image 1 in this analysis). Visual comparisons using reconstruction error and IQA as the reward function. The figure compares the input image with the baseline output, the RL output using the reconstruction reward, the RL output using the IQA reward, and the ground truth. The reward guided by IQA leads to better visual performance, producing results that are closer to the ground truth, whereas the reconstruction-error-based reward fails.

Figure 2 visually reinforces the inadequacy of reconstruction error as an RL reward for restoration. The IQA-guided reward leads to better visual performance and outputs closer to the ground truth, whereas the reconstruction-error-based reward fails to produce satisfactory results. This directly supports the paper's argument for using IQA.
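
For clarity, here is a minimal sketch contrasting the two reward formulations discussed above. `iqa_model` is a stand-in for any MLLM-based quality scorer (e.g., DeQA-Score or Q-Align); its call signature here is an assumption, not the actual API of those models.

```python
import torch

def reconstruction_reward(pred, gt):
    """Reference-based reward: negative per-sample L2 error against the
    ground truth. Because it mirrors the SFT objective, the paper finds
    it provides little additional signal and unstable RL training."""
    return -((pred - gt) ** 2).mean(dim=(1, 2, 3))   # (B,)

def iqa_reward(pred, iqa_model):
    """Reference-free reward: a perceptual quality score from an IQA model.
    `iqa_model` is a placeholder callable mapping a batch of images to
    per-image scores; real scorers expose their own interfaces."""
    with torch.no_grad():
        return iqa_model(pred)                        # (B,) quality scores
```

In practice the reward would still be normalized before use as an advantage signal, which connects to the normalization ablation in Sec. 6.2.5.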

6.2. Ablation Studies

Ablation studies are conducted to analyze the impact of each proposed strategy, primarily using low-light image enhancement as an example.

The following are the results from Table 6 of the original paper:

| Methods | LOL-real PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | LOL-synthetic PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Base | 16.89 | 0.717 | 0.1139 | 88.61 | 20.25 | 0.752 | 0.1004 | 40.17 |
| Base +Diff.SFT | 17.35 | 0.720 | 0.0988 | 80.66 | 20.56 | 0.750 | 0.0953 | 38.29 |
| with Rec. | 18.54 | 0.728 | 0.1186 | 81.15 | 20.89 | 0.727 | 0.0944 | 70.73 |
| w/o $w_i$ | 20.89 | 0.727 | 0.0944 | 70.73 | 20.97 | 0.744 | 0.0923 | 38.70 |
| with $x_{t-1}$ | 21.46 | 0.739 | 0.0910 | 68.20 | 21.08 | 0.751 | 0.0816 | 36.95 |
| Reward $x_0$ | 21.73 | 0.724 | 0.0906 | 69.22 | 20.95 | 0.746 | 0.0900 | 37.01 |
| Norm. from track | 21.82 | 0.740 | 0.0895 | 69.71 | 22.04 | 0.732 | 0.0858 | 66.87 |
| with Q-align | 22.04 | 0.732 | 0.0858 | 66.87 | 21.51 | 0.725 | 0.0923 | 70.76 |
| with CLIP-IQA | 21.51 | 0.725 | 0.0923 | 70.76 | 23.37 | 0.761 | 0.0810 | 61.94 |
| with Iter. RL | 23.37 | 0.761 | 0.0810 | 61.94 | 22.04 | 0.762 | 0.0695 | 33.86 |
| Original | 22.11 | 0.744 | 0.0846 | 64.58 | 21.27 | 0.755 | 0.0711 | 35.47 |

6.2.1. Comparison between Reconstruction Error and IQA as Reward Model

  • Observation: The variant with Rec. (using reconstruction error as reward) performs significantly worse than Original (using IQA reward) across all metrics, especially on LOL-real (PSNR 18.54 vs. 22.11, FID 81.15 vs. 64.58).
  • Analysis: This strongly supports the paper's central argument that IQA provides a superior reward for restoration. Reconstruction error, being limited by the SFT objective, is less effective for RL, often leading to unstable training and suboptimal visual results (as shown in Figure 2 and Figure 5). IQA, by focusing on perceptual quality and distributional alignment, guides the model more effectively.

6.2.2. Effects of Conducting RL with Adaptive Weights

  • Observation: The variant w/o w_i (without adaptive weighting) shows lower performance than Original. For example, on LOL-real, w/o w_i has a PSNR of 20.89 and FID of 70.73, compared to Original's 22.11 PSNR and 64.58 FID.
  • Analysis: This confirms the efficacy of the difficulty-adaptive weighting strategy. By dynamically adjusting the balance between RL (for hard samples) and SFT (for fine-grained alignment), the model can optimize more efficiently. Without adaptive weights, RL might over-explore for easy samples or SFT might struggle with truly hard samples.

6.2.3. Effects of Conducting RL with Denoised Direction

  • Observation: The variant with $x_{t-1}$ (using the less refined estimate for policy modeling) performs worse than Original. For example, on LOL-real, with $x_{t-1}$ has a PSNR of 21.46 and FID of 68.20, while Original has 22.11 PSNR and 64.58 FID.
  • Analysis: This validates the importance of using a better denoised latent ($\hat{x}_{t-1}^{\text{enhanced}}$) as the target for the action in policy modeling. A more refined target provides a clearer and more reliable direction for the policy, leading to more meaningful updates and ultimately better restoration performance (an illustrative sketch follows below).
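
As a purely illustrative sketch of the "better denoised latent" idea, one could form the action target by averaging several stochastic one-step denoising draws rather than taking a single sample. Here `denoise_step` is a hypothetical callable standing in for the model's sampler, and the averaging is only one possible construction, not the paper's exact $\hat{x}_{t-1}^{\text{enhanced}}$.

```python
import torch

def enhanced_target(x_t, t, denoise_step, num_samples=4):
    """Average several stochastic one-step denoising samples to obtain a
    less noisy target latent, used here as a stand-in for a refined
    x_{t-1}. `denoise_step(x_t, t) -> x_{t-1}` is a placeholder callable,
    and this construction is an assumption for illustration only."""
    with torch.no_grad():
        samples = [denoise_step(x_t, t) for _ in range(num_samples)]
    return torch.stack(samples).mean(dim=0)
```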

6.2.4. Effects of Conducting RL with Rewards from Multiple Time Steps

  • Observation: The Reward $x_0$ variant (reward only from the final output) yields inferior results compared to Original (rewards from multiple time steps). On LOL-real, Reward $x_0$ achieves 21.73 PSNR and 69.22 FID, lower than Original.
  • Analysis: This confirms the benefit of step-wise RL supervision. Applying rewards at intermediate steps prevents error accumulation and provides continuous guidance throughout the denoising trajectory, which is crucial for maintaining fidelity in diffusion-based restoration (see the sketch below).
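
The step-wise supervision can be pictured with the following sketch: the predicted clean latent at several intermediate steps is decoded and scored, instead of scoring only the final output. `decode` and `iqa_model` are placeholder callables (e.g., for a VAE decoder and an IQA scorer), and collecting per-step $x_0$ predictions from the sampler is an assumption about the surrounding pipeline.

```python
import torch

def stepwise_rewards(x0_preds_per_step, decode, iqa_model):
    """Score the predicted clean image at several intermediate denoising
    steps rather than only at the end. `decode` (latent -> image) and
    `iqa_model` (image batch -> per-image score) are placeholder callables."""
    rewards = []
    with torch.no_grad():
        for x0_hat in x0_preds_per_step:              # predicted x_0 at each chosen step
            rewards.append(iqa_model(decode(x0_hat))) # (B,) scores per step
    return torch.stack(rewards)                        # (num_steps, B)
```

Scoring only the final decoded output would correspond to the weaker Reward $x_0$ variant in Table 6.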

6.2.5. Effects of Conducting RL with New Normalization Parameters

  • Observation: The Norm. from track variant (normalizing rewards only from historical track) is outperformed by Original (normalizing from both historical track and current batch). For example, Norm. from track has 21.82 PSNR and 69.71 FID on LOL-real, while Original achieves 22.11 PSNR and 64.58 FID.
  • Analysis: This shows that incorporating rewards from the current batch into the mean and variance calculation for normalization offers a clearer advantage. This more dynamic, batch-aware normalization likely contributes to greater stability and effectiveness in RL training (a sketch of one such normalizer follows below).
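
A minimal sketch of such a normalizer is given below, assuming an exponential-moving-average history blended with the current batch's statistics; the momentum value and the exact update rule are assumptions, not the paper's formula.

```python
import torch

class RewardNormalizer:
    """Standardize rewards using statistics from both a running history and
    the current batch, mirroring the ablation above. The EMA update and
    momentum value are illustrative assumptions."""
    def __init__(self, momentum=0.9):
        self.momentum = momentum
        self.mean = None
        self.var = None

    def __call__(self, rewards):                      # rewards: (B,)
        batch_mean = rewards.mean()
        batch_var = rewards.var(unbiased=False)
        if self.mean is None:                         # first batch: history = batch
            self.mean, self.var = batch_mean, batch_var
        else:                                         # blend history with current batch
            self.mean = self.momentum * self.mean + (1 - self.momentum) * batch_mean
            self.var = self.momentum * self.var + (1 - self.momentum) * batch_var
        return (rewards - self.mean) / (self.var.sqrt() + 1e-8)
```

Using the running history alone, without the current batch's contribution, would correspond to the weaker "Norm. from track" variant in Table 6.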

6.2.6. Effects of Conducting RL with Different IQA Rewards

  • Observation: with Q-align and with CLIP-IQA (using alternative IQA models) both outperform the Base and Base +Diff.SFT baselines, but generally perform slightly worse than Original which uses DeQA-Score. For instance, with Q-align achieves 22.04 PSNR on LOL-real, close but still slightly below Original's 22.11 PSNR.
  • Analysis: This suggests that while MLLM-based IQA models are generally effective as reward functions, the choice of the specific IQA model matters. DeQA-Score appears to be a stronger, more generalized IQA model for this task, leading to better overall performance. This also implies that future improvements in MLLM-based IQA models could further enhance the proposed RL framework.

6.2.7. Using Iterative RL by Updating the Reward Model

  • Observation: The with Iter. RL variant achieves the highest performance across all metrics, even surpassing Original. On LOL-real, it reaches 23.37 PSNR and 61.94 FID, which are the best recorded values in the table.
  • Analysis: This indicates that iterative application of the method (where the IQA model itself is fine-tuned based on feedback from the diffusion model's outputs, potentially with human-in-the-loop annotations) can lead to further, significant improvements. This points towards a promising future direction for refining both the restoration model and the reward function dynamically.

6.3. IQA Score Comparisons for Evaluation Set on LOL-real

Figure 1 illustrates the IQA scores for the evaluation set on LOL-real using DeQA-Score. The original diffusion-based method (DiffBIR) via SFT shows a significant gap in IQA score compared to the ground truth. After the proposed RL training, the output's IQA score aligns more closely with the ground-truth distribution. This visual evidence further supports the quantitative results, demonstrating that the IQA reward effectively guides the model to produce perceptually higher-quality images.

6.4. Reward Curve Comparisons

The following figure (Figure 5 from the original paper, provided here as Image 5) illustrates reward curve comparisons:

Figure 5 from the original paper (Image 5 in this analysis). Reward curve comparisons using reconstruction error and IQA as the reward function: (a) the reference-based (reconstruction-error) reward curve fluctuates considerably and is difficult to optimize, whereas (b) the distribution-based (IQA) reward curve shows a clear upward trend and optimizes smoothly.

Figure 5 shows the reward curves for training with reconstruction error versus IQA as the reward function. The IQA-guided reward (b) increases steadily, indicating stable and effective learning towards higher-quality outputs. In contrast, the reconstruction-error-based reward (a) exhibits noticeable fluctuations, suggesting an unstable optimization process and difficulty in finding consistent improvements. This directly justifies the choice of IQA over reconstruction error as the reward function.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a pioneering and effective Reinforcement Learning (RL) strategy specifically designed for enhancing diffusion-based image restoration frameworks. The core insight is that standard Supervised Fine-Tuning (SFT) for restoration models, which relies on reconstruction-based objectives, is often suboptimal due to the ill-posed nature of restoration and its emphasis on perceptual fidelity. The authors demonstrate that Image Quality Assessment (IQA) metrics, particularly those derived from Multi-modal Large Language Models (MLLMs), serve as highly effective reward functions. This IQA-based reward provides an alternative optimization direction, facilitating distributional alignment with high-quality images (ground truths), allowing the model to explore beyond local SFT optima.

To further refine this process, the paper introduces a novel difficulty-adaptive weighting strategy. This mechanism dynamically adjusts the influence of RL and SFT based on the difficulty of individual training samples. RL is primarily leveraged for hard samples (those significantly distant from the ground truth) to foster initial distribution-level alignment. As predictions become closer to the target distribution, the optimization objective gradually shifts to incorporate SFT for fine-grained, reference-based alignment. Key implementation techniques, such as using a better denoised latent for policy modeling and applying step-wise RL supervision throughout the diffusion process, also contribute to the framework's robustness. The proposed methodology is plug-and-play, seamlessly integrating with existing pre-trained diffusion models and demonstrating a consistent boost in performance across a wide array of restoration tasks through extensive experiments on multiple benchmarks.

7.2. Limitations & Future Work

The paper implicitly identifies some limitations and suggests avenues for future work:

  • Reliance on Large Pre-trained Models: The authors explicitly state that their method is "primarily designed for diffusion models built upon large pre-trained base models (e.g., text-to-image models such as Stable Diffusion and Flux)." This suggests that the approach might be less effective or require significant adaptation for smaller models or those not leveraging large pre-trained backbones. The computational cost and resource requirements associated with these large models could be a practical limitation.

  • Computational Cost of MLLM-based IQA: While MLLM-based IQA models are powerful, they can be computationally expensive to query, especially when used as a reward function at each intermediate step of the diffusion process for many samples. The practical overhead of this reward calculation is not explicitly discussed as a limitation, but it is an inherent characteristic of using such complex models.

  • Iterative RL with Human in the Loop: The ablation study on "with Iter. RL" shows significant further improvements, implying that iteratively updating the IQA reward model, potentially with human-in-the-loop feedback, is a powerful but resource-intensive approach. Scaling this human annotation effort could be a challenge for broader application.

  • Theoretical Elaboration of "Better Denoised Latent": While the paper mentions using "a better denoised latent" as the target for action, a more detailed theoretical explanation or exploration of why and how this specific refinement impacts the policy gradient and optimization stability could provide deeper insights.

  • Generalizability of IQA Models: While MLLM-based IQA models show strong generalization, their biases might still be a factor. If the IQA model itself has biases (e.g., towards certain aesthetics or artifacts not explicitly covered in its training), these biases could be propagated into the restoration model.

    Future research directions suggested or implied by the paper include:

  • Automating Iterative RL: Developing more efficient or automated ways to refine the IQA reward model, moving beyond manual human-in-the-loop processes, could make iterative RL more scalable.

  • Exploring Alternative IQA Models: Continued research into more powerful and efficient MLLM-based IQA models could further boost the performance of this RL framework.

  • Theoretical Analysis of Adaptive Weighting: A deeper theoretical analysis of the difficulty-adaptive weighting strategy, perhaps connecting it to multi-objective optimization or curriculum learning, could provide further insights and potential for refinement.

  • Application to Broader Tasks: While already demonstrated across several restoration tasks, exploring its application to other ill-posed generative or inverse problems in computer vision could be fruitful.

7.3. Personal Insights & Critique

This paper presents a highly insightful and practical approach to a critical problem in diffusion-based image restoration. The core idea of using MLLM-based IQA as a reward function for RL is particularly innovative. It intelligently shifts the optimization paradigm from rigid pixel-wise fidelity to more perceptually aligned distributional fidelity, which is crucial for ill-posed problems where human perception of quality is paramount. This directly addresses the hallucination and artifact generation issues often seen in SFT-trained generative restoration models.

The difficulty-adaptive weighting strategy is another significant strength. It provides a clever mechanism to balance the exploratory nature of RL with the precise refinement of SFT, ensuring that the model leverages the right learning signal for the right type of sample. This dynamic approach is more sophisticated than static loss weighting and likely contributes significantly to the observed performance gains and training stability. The intuitive argument for using step-wise rewards and refined latent targets also makes strong practical sense for diffusion models.

Potential areas for further exploration or critique:

  • Computational Overhead: While effective, the computational cost of querying an MLLM-based IQA model for every sample at every relevant time step during RL training could be substantial. The paper doesn't delve into the efficiency implications. Future work could explore lighter-weight, yet equally effective, IQA proxies or more efficient reward computation strategies.

  • Hyperparameter Sensitivity: The RL framework introduces several hyperparameters (e.g., $\epsilon$ for clipping, details of reward normalization, and the interaction between RL and SFT weights). While ablation studies show the benefits, a more in-depth analysis of their sensitivity and optimal tuning could be valuable.

  • Interpretability of IQA Reward: While MLLM-based IQA models are powerful, their internal workings can be complex. Understanding precisely why certain images receive high/low IQA scores could help in designing even more targeted reward functions or loss terms.

  • Generalization to Out-of-Distribution Degradations: The paper demonstrates effectiveness across various degradations. However, a deeper dive into how well the IQA model (and thus the RL) generalizes to entirely novel or extreme degradation types not seen during training could be interesting.

  • Theoretical Guarantees: While the empirical results are strong, a more rigorous theoretical analysis of the convergence properties or optimality guarantees for this specific RL-SFT hybrid setup could strengthen the foundational understanding.

    Overall, this paper provides a valuable contribution by bridging the gap between general generative RL for diffusion and the specific requirements of image restoration. Its methods and conclusions could be highly transferable to other tasks involving generative models where perceptual quality and fidelity to underlying structure are critical, such as medical image reconstruction, scientific data enhancement, or even certain aspects of content creation where realism is paramount.
