
Restora-Flow: Mask-Guided Image Restoration with Flow Matching

Published: 11/25/2025



This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Restora-Flow is a novel training-free image restoration method that utilizes flow matching guided by a degradation mask, incorporating trajectory correction. It shows superior perceptual quality and processing time compared to existing diffusion and flow matching methods across v

Abstract

Flow matching has emerged as a promising generative approach that addresses the lengthy sampling times associated with state-of-the-art diffusion models and enables a more flexible trajectory design, while maintaining high-quality image generation. This capability makes it suitable as a generative prior for image restoration tasks. Although current methods leveraging flow models have shown promising results in restoration, some still suffer from long processing times or produce over-smoothed results. To address these challenges, we introduce Restora-Flow, a training-free method that guides flow matching sampling by a degradation mask and incorporates a trajectory correction mechanism to enforce consistency with degraded inputs. We evaluate our approach on both natural and medical datasets across several image restoration tasks involving a mask-based degradation, i.e., inpainting, super-resolution and denoising. We show superior perceptual quality and processing time compared to diffusion and flow matching-based reference methods.


1. Bibliographic Information

1.1. Title

The central topic of this paper is Restora-Flow, a novel method for mask-guided image restoration using Flow Matching.

1.2. Authors

The authors of the paper are:

Arnela Hadzic
Franz Thaler
Lea Bogensperger
Simon Johannes Joham
Martin Urschler

Their affiliations indicate a strong presence in medical informatics and related fields:
Institute for Medical Informatics, Statistics and Documentation, Medical University of Graz, Graz, Austria (Arnela Hadzic, Franz Thaler, Simon Johannes Joham, Martin Urschler)
Division of Medical Physics and Biophysics, Medical University of Graz, Graz, Austria (Franz Thaler)
Institute of Visual Computing, Graz University of Technology, Graz, Austria (Franz Thaler)
Department of Quantitative Biomedicine, University of Zurich, Zurich, Switzerland (Lea Bogensperger)

1.3. Journal/Conference

The paper was posted on 2025-11-25T10:22:26.000Z (UTC). The provided text does not name a journal or conference, and the arXiv link indicates that it is currently a preprint. Preprints of this kind are often submitted for peer review to major machine learning, computer vision, or medical imaging venues (e.g., NeurIPS, ICLR, CVPR, ECCV, MICCAI), but no target venue is stated. Given the late-November 2025 posting date, any such submission would most likely fall in a 2026 conference cycle.

1.4. Publication Year

The publication year is 2025.

1.5. Abstract

The paper introduces Restora-Flow, a training-free method for mask-guided image restoration using Flow Matching. It leverages Flow Matching as a generative prior, which is a promising approach for high-quality image generation with faster sampling times and more flexible trajectory design compared to diffusion models. The core idea of Restora-Flow is to guide Flow Matching sampling using a degradation mask and incorporate a trajectory correction mechanism to ensure consistency with degraded inputs. The method is evaluated on both natural and medical datasets across various mask-based image restoration tasks, including inpainting, super-resolution, and denoising. The results demonstrate superior perceptual quality and processing time compared to existing diffusion and Flow Matching-based reference methods.

1.6. Original Source Link

The original source link is: https://arxiv.org/abs/2511.20152. The PDF link is: https://arxiv.org/pdf/2511.20152v2.pdf. This paper is currently a preprint on arXiv, indicating it is publicly available but may not have undergone formal peer review or official publication yet.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is image restoration, which involves recovering an original image from its degraded observation. Many critical restoration tasks, such as denoising, super-resolution, and inpainting, can be framed as inverse problems involving mask-based degradation. The objective is to produce a restored image that not only possesses high visual quality but also maintains fidelity to the degraded input.

Prior research has widely utilized diffusion models as powerful generative priors for these tasks, demonstrating success in guiding restoration. However, a significant limitation of diffusion models is their long sampling times due to highly curved sampling trajectories and challenges posed by intermediate noisy steps. More recently, Flow Matching (FM) has emerged as an alternative generative modeling approach, known for its significantly straighter trajectories and faster training and sampling times while maintaining high-quality image generation. Although existing flow-based methods show promise in restoration, they still face challenges such as relatively long processing times, over-smoothed results, or the introduction of artifacts.

The paper's entry point and innovative idea is to leverage the advantages of Flow Matching to overcome these limitations in image restoration. Specifically, it aims to develop a training-free method that combines mask-guided sampling with a novel trajectory correction mechanism to enforce consistency with the degraded input, thereby achieving both high restoration quality and fast processing times.

2.2. Main Contributions / Findings

The paper's primary contributions are:

Introduction of Restora-Flow: A training-free algorithm specifically designed for mask-based inverse problems. It utilizes unconditional flow prior models and mask-guided fusion during Flow Matching sampling.
Novel Trajectory Correction Mechanism: To enhance the fidelity of the restoration process, a new correction mechanism is introduced that guides the flow trajectory towards better alignment with observed data. This mechanism refines samples by extrapolating towards the clean image and reintroducing noise, improving consistency between restored and known regions.
Comprehensive Evaluation and State-of-the-Art Performance: The method is rigorously evaluated on both computer vision (natural) and medical datasets across various settings for inpainting, super-resolution, and denoising tasks. The findings demonstrate that Restora-Flow achieves superior perceptual quality (lower LPIPS) and faster processing times compared to related diffusion and Flow Matching-based approaches. Notably, it often achieves state-of-the-art results when jointly considering reconstruction quality and processing time.
Simplicity in Hyperparameter Tuning: Restora-Flow requires only one hyperparameter to optimize (the number of ODE steps), as the number of corrections ( $C=1$ ) is fixed across all experiments and datasets, simplifying its application.

The key conclusions are that Restora-Flow effectively addresses the trade-off between restoration quality and processing time in mask-based image restoration. By intelligently guiding Flow Matching with a mask and a trajectory correction, it overcomes the limitations of previous generative prior-based methods, offering a robust, fast, and high-quality solution.

3.1. Foundational Concepts

3.1.1. Image Restoration

Image restoration is the process of estimating an original, clean image $x$ from a degraded observation $z$ . The degradation can be due to various factors like noise, blur, missing pixels, or low resolution. This process is often formulated as an inverse problem, where the observed degraded image $z$ is related to the original image $x$ by a degradation operator $H$ and additive noise $\xi$ : $z = Hx + \xi$ . The goal is to "invert" this process to recover $x$ .
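As a minimal sketch of this forward model, the snippet below applies $z = Hx + \xi$ with $H$ chosen as an element-wise binary mask (an inpainting-style operator). The function name `degrade`, the toy 4x4 image, and the noise level are illustrative assumptions and do not come from the paper.

```python
import numpy as np

def degrade(x, mask, noise_std=0.0, rng=None):
    """Apply z = H x + xi, where H multiplies element-wise by a binary mask."""
    rng = np.random.default_rng() if rng is None else rng
    xi = noise_std * rng.standard_normal(x.shape)  # additive noise term xi
    return mask * x + xi

x = np.arange(16, dtype=float).reshape(4, 4)  # toy "clean" image
mask = np.ones_like(x)
mask[1:3, 1:3] = 0.0                           # hide a 2x2 block in the centre
z = degrade(x, mask)                           # noise-free degraded observation
```

With `noise_std=0.0` the observation equals the clean image wherever the mask is 1 and is zero elsewhere; recovering the hidden block is the inverse problem.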

3.1.2. Generative Models

Generative models are a class of artificial intelligence models that learn the underlying distribution of a dataset and can then generate new samples that resemble the training data. For image restoration, they are used as "priors" to guide the restoration process, ensuring that the recovered images are realistic and consistent with the learned data distribution. Two prominent types discussed in this paper are Diffusion Models and Flow Matching.

3.1.3. Diffusion Models

Diffusion Models, with Denoising Diffusion Probabilistic Models (DDPMs) as a prominent example, are a class of generative models that work by gradually adding noise to data (forward diffusion process) and then learning to reverse this process (reverse denoising process) to generate new data. They define a sequence of latent variables $x_0, x_1, \ldots, x_T$ where $x_0$ is the original data and $x_T$ is pure noise. The model learns to predict the noise added at each step to reverse the diffusion process and generate a clean image from noise. While powerful, a significant limitation of diffusion models is their long sampling time: their sampling trajectories are highly curved, so they typically require many sequential steps (e.g., hundreds or thousands) to denoise from pure noise back to a clear image.

3.1.4. Flow Matching (FM)

Flow Matching (FM) is a more recent generative modeling approach that addresses some limitations of diffusion models. Instead of defining discrete noise steps, Flow Matching learns a continuous-time vector field (or velocity field) that transports samples from a simple base distribution (e.g., Gaussian noise) at time $t=0$ to the complex target data distribution at time $t=1$ . This transformation is governed by an Ordinary Differential Equation (ODE). The key advantage of Flow Matching is that it learns significantly straighter trajectories between the noise and data distributions, enabling faster and more efficient sampling. This makes it a promising alternative to diffusion models for tasks requiring high-quality generation with reduced computational cost.

3.1.5. Ordinary Differential Equation (ODE)

An Ordinary Differential Equation (ODE) is a mathematical equation that relates a function with its derivatives. In the context of Flow Matching, the ODE describes how a sample $x_t$ evolves over continuous time $t$ under the influence of the learned velocity field $v_{\theta,t}(x_t)$ . Integrating this ODE allows samples to be generated from noise to data.

3.1.6. Mask-Based Degradation

Mask-based degradation refers to image degradation scenarios where a binary mask $m$ explicitly indicates known (unmasked) and unknown (masked) regions of an image.

Inpainting: Filling in missing or corrupted parts of an image (unknown regions indicated by the mask).
Super-resolution: Enhancing the resolution of a low-resolution image. This can be viewed as a mask-based problem if the degradation operator $H$ involves downsampling, effectively "masking" the high-frequency information.
Denoising: Removing noise from an image. While not always explicitly mask-based, the paper proposes a time-dependent mask for denoising to control the influence of the noisy input.
Occlusion Removal: A specific type of inpainting where objects occlude (cover) parts of an image, and the task is to remove these occlusions and fill in the background.
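The two most common mask layouts can be sketched as follows. The helper names `box_mask` and `grid_mask` are hypothetical, and treating super-resolution as "every `factor`-th pixel of the high-resolution grid is known" is an assumption made here for illustration; the paper's exact mask construction may differ.

```python
import numpy as np

def box_mask(height, width, top, left, size):
    """Inpainting-style mask: 1 = known pixel, 0 = pixel inside the missing box."""
    m = np.ones((height, width))
    m[top:top + size, left:left + size] = 0.0
    return m

def grid_mask(height, width, factor):
    """Super-resolution-style mask: only every `factor`-th pixel is known."""
    m = np.zeros((height, width))
    m[::factor, ::factor] = 1.0
    return m

m_inpaint = box_mask(8, 8, top=2, left=2, size=4)  # 4x4 hole in an 8x8 image
m_sr = grid_mask(8, 8, factor=2)                   # 2x upsampling layout
```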

3.1.7. Maximum A Posteriori (MAP) Estimation

Maximum A Posteriori (MAP) estimation is a statistical method for estimating an unknown quantity (e.g., the original image $x$) based on observed data (e.g., the degraded image $z$) and prior knowledge about the quantity. It seeks the value of $x$ that maximizes the posterior probability $P(x|z)$, which can be expressed as:

$ \hat{x} = \arg \max_x P(x|z) = \arg \max_x P(z|x) P(x) $

In image restoration, this is often formulated as minimizing a cost function:

$ \hat{x} = \arg \min_x \mathcal{D}(Hx, z) + \mathcal{R}_\theta(x) $

Here, $\mathcal{D}(Hx, z)$ is the data fidelity term, which measures how well the restored image, after applying the degradation operator $H$ (i.e., $Hx$), matches the observed degraded image $z$. $\mathcal{R}_\theta(x)$ is the prior term (or regularization term), which encodes learned knowledge about the properties of natural (or medical) images, parameterized by $\theta$.
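The step from the posterior to the cost function can be made explicit by taking the negative logarithm. The short derivation below assumes Gaussian measurement noise $\xi \sim \mathcal{N}(0, \sigma^2 I)$, which is an assumption added here for illustration (the text above does not fix a noise model), and drops constants that do not depend on $x$.

```latex
\begin{aligned}
\hat{x} &= \arg\max_x P(z \mid x)\, P(x) \\
        &= \arg\min_x \big[ -\log P(z \mid x) - \log P(x) \big] \\
        &= \arg\min_x \underbrace{\tfrac{1}{2\sigma^2} \lVert Hx - z \rVert^2}_{\mathcal{D}(Hx,\, z)}
           + \underbrace{\big( -\log P(x) \big)}_{\mathcal{R}_\theta(x)}
\end{aligned}
```

Under this assumption the data fidelity term is a scaled squared error, and the prior term is the negative log-density that the generative model approximates.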

3.1.8. Perception-Distortion Trade-off

The perception-distortion trade-off [3] refers to a fundamental challenge in image quality assessment and generative modeling. It states that it is generally impossible to simultaneously optimize for both distortion metrics (which measure pixel-wise accuracy, like PSNR and SSIM) and perceptual metrics (which measure visual realism and human-like perception, like LPIPS). Models that produce highly realistic, perceptually pleasing images might have higher pixel-wise differences from the ground truth, while models that minimize pixel-wise errors might appear overly smooth or lack realistic textures. The paper acknowledges this trade-off by using both types of metrics.
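To make the "distortion" side concrete, here is a small sketch of PSNR, computed from the mean squared error. The function name `psnr` and the `data_range` argument are illustrative choices; perceptual metrics such as LPIPS require a pretrained network and are not reproduced here.

```python
import numpy as np

def psnr(reference, estimate, data_range=1.0):
    """Peak signal-to-noise ratio in dB; higher means lower pixel-wise distortion."""
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(data_range ** 2 / mse)

ref = np.zeros((4, 4))
est = np.full((4, 4), 0.1)   # constant error of 0.1 gives MSE = 0.01
value = psnr(ref, est)       # approximately 20 dB
```

A blurry estimate can score well on this metric while still looking unrealistic, which is why it is reported alongside perceptual metrics.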

3.2. Previous Works

3.2.1. Direct Mapping Methods

Traditional deep learning approaches for image restoration often learn a direct mapping from degraded images to clean images by minimizing a reconstruction loss. Examples include SRCNN for super-resolution [9] or DnCNN for denoising [38]. These methods typically require large datasets of paired degraded and clean images and need to be retrained for each new task (e.g., a different degradation level or type).

3.2.2. Diffusion-Based Prior Methods

To overcome the limitations of paired datasets and task-specific retraining, deep generative priors, particularly diffusion models, have been widely adopted for inverse problems. These methods guide the generative process to ensure consistency with degraded observations.

DDRM (Denoising Diffusion Restoration Models) [18]: Tackles linear inverse problems by employing singular value decomposition of the degradation operator $H$ .
DDNM+ (Denoising Diffusion Null-Space Model) [34]: Addresses inverse problems in a zero-shot manner by utilizing range-null space decomposition as a guidance function, meaning it can solve problems without task-specific training.
RePaint [23]: Focuses on inpainting tasks by using unmasked regions to guide the diffusion process. It iteratively resamples unknown regions while keeping known regions fixed or slightly perturbed.
ΠGDM (Pseudoinverse-Guided Diffusion Models) [29]: Incorporates a vector-Jacobian product as additional guidance to ensure consistency between denoising results and degraded measurements.
RED-Diff [24]: Formulates image restoration as an optimization problem, minimizing a measurement consistency loss while applying score-matching regularization from the diffusion model.

A common limitation for many diffusion-based methods is the long sampling time required to generate high-quality images.

3.2.3. Flow Matching-Based Prior Methods

More recently, Flow Matching has gained attention as a prior for image restoration due to its potential for faster sampling.

OT-ODE (Optimal Transport ODE) [26]: Incorporates a gradient correction term (similar to ΠGDM) to guide the flow-based generation process. It demonstrated superior perceptual quality compared to diffusion paths.
D-Flow [2]: Formulates the restoration as a source point optimization problem, minimizing a cost function associated with the initial point in the Flow Matching framework. However, this requires backpropagation through an ODE solver, leading to relatively long sampling times (reported as 5 to 15 minutes per sample).
Flow-Priors [40]: Decomposes the flow's trajectory into several local objectives and uses Tweedie's formula [11] to sequentially optimize these objectives through gradient steps. This results in reduced sampling time compared to D-Flow.
PnP-Flow (Plug-and-Play Flow) [25]: Combines Plug-and-Play (PnP) methods [33] with Flow Matching without requiring backpropagation. PnP methods iteratively alternate between a data fidelity step (e.g., enforcing consistency with observations) and a denoising step (using a pre-trained denoiser, which in this case is the flow model). While faster, PnP-Flow often tends to produce over-smoothed results.

3.3. Technological Evolution

The field of image restoration has evolved from traditional signal processing techniques (e.g., Wiener filtering) to optimization-based methods with handcrafted priors, and then to data-driven deep learning methods. Initially, deep learning focused on direct mapping using Convolutional Neural Networks (CNNs). The next major leap involved leveraging deep generative priors, first with Generative Adversarial Networks (GANs) and then more prominently with Diffusion Models. Diffusion Models set a new standard for image generation quality but were computationally expensive during sampling. Flow Matching emerged as a successor, aiming to achieve comparable or superior generation quality with significantly faster sampling by learning straighter trajectories.

Restora-Flow fits into this timeline by building upon the latest advancements in Flow Matching. It further refines Flow Matching for inverse problems by introducing a training-free mask-guided sampling approach combined with a trajectory correction mechanism, specifically targeting the limitations of existing flow-based methods (speed and over-smoothing) while maintaining high perceptual quality.

3.4. Differentiation Analysis

Compared to the main methods in related work, Restora-Flow differentiates itself primarily through:

Training-Free Nature: Similar to some diffusion and flow-based methods, Restora-Flow operates training-free by using an unconditional flow prior model. This means the generative model does not need to be fine-tuned for specific restoration tasks, making it versatile.
Mask-Guided Fusion: It explicitly incorporates mask-guidance (inspired by RePaint for diffusion models) into the Flow Matching sampling process. This ensures that known regions from the degraded input are preserved.
Novel Trajectory Correction Mechanism: This is a key innovation. Unlike OT-ODE which uses gradient correction, or D-Flow and Flow-Priors which rely on optimization and Tweedie's formula, Restora-Flow uses a simple yet effective forward extrapolation and noise reintroduction step. This trajectory correction mechanism explicitly projects the fused sample closer to the data manifold and re-aligns it with the generative trajectory, preventing misalignments and artifacts. This mechanism aims to counter the divergence from the learned data distribution that can occur with simple mask fusion, a problem observed in a naive mask-guided Flow Matching application.
Speed and Perceptual Quality: Restora-Flow achieves superior perceptual quality (lower LPIPS) while also demonstrating significantly faster processing times compared to diffusion models (RePaint, DDNM+) and most existing flow-based methods (D-Flow, Flow-Priors). Even PnP-Flow, which is fast, tends to produce over-smoothed results, a problem Restora-Flow aims to avoid. OT-ODE can be fast in some cases but often yields artifacts.
Simplicity: It introduces only one hyperparameter (number of ODE steps), with the number of correction steps fixed at $C=1$ , making it simpler to deploy than methods requiring extensive hyperparameter tuning for different tasks and datasets.

4. Methodology

4.1. Principles

The core idea behind Restora-Flow is to leverage the speed and high-quality generation capabilities of Flow Matching models for mask-based image restoration tasks. The method operates training-free, meaning it uses an already trained unconditional Flow Matching model as a prior. The two main principles are:

Mask-Guided Sampling: During the continuous generation process of Flow Matching (which normally generates an image from pure noise), the algorithm guides the sample at each time step using the known, unmasked regions of the degraded input. This ensures fidelity to the observed data.
Trajectory Correction: A novel mechanism is introduced to address the issue of misalignment that can occur when fusing known data with the evolving sample. This correction actively refines the generative path by projecting the sample towards the clean image manifold and then reintroducing appropriate noise, thereby maintaining consistency with the learned flow trajectory and preventing artifacts or deviations from realism. The goal is to achieve both high reconstruction quality and fast processing times.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Flow Matching (FM) Background

Flow Matching learns a continuous velocity field that defines a deterministic transformation from a simple base distribution to a complex target data distribution. The method aims to learn a velocity field $v_{\theta,t}$ of the probability flow $\Psi_t$ . This field governs how a sample evolves from a simple base distribution (e.g., standard normal distribution $x_0 \sim \mathcal{N}(0, I)$ ) at $t=0$ to the target data distribution $\mathrm{p}(x)$ at $t=1$ . The training objective for this velocity field $v_{\theta,t}$ is given by the conditional Flow Matching loss:

$ \min_{\theta} \mathbb{E}_{t, x_1, x_0} \Big[ \frac{1}{2} \Big\lVert v_{\theta,t} \big( \Psi_t(x_0) \big) - \big( x_1 - x_0 \big) \Big\rVert^2 \Big] $

Where:

$\theta$ : The parameters of the neural network approximating the velocity field.
$\mathbb{E}$ : Expectation over the random variables.
$t$ : A continuous time variable, sampled uniformly from $t \sim \mathcal{U}[0, 1]$ .
$x_1$ : A sample from the target data distribution $\mathrm{p}(x)$ .
$x_0$ : A sample from a simple base distribution, typically standard normal noise $x_0 \sim \mathcal{N}(0, I)$ .
$v_{\theta,t}(\Psi_t(x_0))$ : The learned velocity field predicted by the neural network at time $t$ for a sample $\Psi_t(x_0)$ .
$x_1 - x_0$ : The true velocity vector that maps $x_0$ to $x_1$ along a straight line path.
$\Psi_t(x_0) = (1-t)x_0 + tx_1$ : The conditional flow (or linear interpolation) between $x_0$ and $x_1$ at time $t$ . This is a straight line path.

The loss function encourages the learned velocity field $v_{\theta,t}$ to predict the direction of change ( $x_1 - x_0$ ) required to move from the interpolated point $\Psi_t(x_0)$ towards the target data point $x_1$ . This allows for simulation-free training, meaning the ODE does not need to be solved during training.
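The loss can be estimated by Monte Carlo sampling, as in the sketch below. The function `cfm_loss` and the two toy velocity functions are hypothetical stand-ins for a neural network. The toy data distribution is a single point `c`, chosen because its exact conditional velocity $(c - x_t)/(1 - t)$ is known in closed form and should drive the loss to zero.

```python
import numpy as np

def cfm_loss(velocity_fn, x1, rng):
    """Monte Carlo estimate of the conditional Flow Matching loss for a batch x1 of shape (B, D)."""
    x0 = rng.standard_normal(x1.shape)                 # noise samples
    t = rng.uniform(0.0, 1.0, size=(x1.shape[0], 1))   # one time per sample, in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                      # straight-line interpolation Psi_t(x0)
    target = x1 - x0                                   # constant velocity along that line
    residual = velocity_fn(x_t, t) - target
    return 0.5 * np.mean(np.sum(residual ** 2, axis=1))

rng = np.random.default_rng(0)
c = np.array([1.0, -2.0, 0.5])
x1 = np.tile(c, (64, 1))                               # toy dataset: every sample equals c

def point_velocity(x_t, t):
    return (c - x_t) / (1.0 - t)                       # exact velocity for this toy dataset

def zero_velocity(x_t, t):
    return np.zeros_like(x_t)                          # deliberately wrong baseline

loss_exact = cfm_loss(point_velocity, x1, rng)
loss_zero = cfm_loss(zero_velocity, x1, rng)
```

No ODE is solved anywhere in this computation, which is what "simulation-free" refers to.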

Once the velocity field $v_{\theta,t}$ is learned, new images can be generated by integrating the corresponding Ordinary Differential Equation (ODE):

$ \frac{\mathrm{d}}{\mathrm{d}t} \Psi_t(x) = v_{\theta,t} ( \Psi_t(x) ) $

This equation describes how the sample $\Psi_t(x)$ changes over time. To obtain samples, numerical integration methods are used. For example, using the explicit Euler integration scheme with a small time step $\Delta_t$ , the estimate of the sample $x_t$ is updated iteratively:

$ x_{t + \Delta_t} = x_t + \Delta_t v_{\theta,t} ( x_t ) $

Here, $x_t$ represents the current estimate of the image at time $t$ , and $x_{t+\Delta_t}$ is the updated estimate at the next time step. The term $\Delta_t v_{\theta,t}(x_t)$ represents the small change applied to $x_t$ based on the predicted velocity at that point. The sampling process starts with $x_0 \sim \mathcal{N}(0, I)$ and iteratively updates $x_t$ until $t=1$ , yielding a generated image.
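The Euler update can be sketched in a few lines. The function `euler_sample` is an illustrative name, and the closed-form velocity field below (which transports any starting point to a fixed target `c`) is a toy assumption standing in for the learned network $v_{\theta,t}$.

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with explicit Euler steps."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt                       # current time, from 0 up to 1 - dt
        x = x + dt * velocity_fn(x, t)   # x_{t+dt} = x_t + dt * v(x_t, t)
    return x

c = np.array([1.0, -2.0, 0.5])

def velocity(x, t):
    return (c - x) / (1.0 - t)           # toy field pointing straight at c

rng = np.random.default_rng(0)
sample = euler_sample(velocity, rng.standard_normal(3), num_steps=10)
```

Because this toy trajectory is a straight line, even a handful of Euler steps lands on the target, which illustrates why straighter trajectories permit fewer sampling steps.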

4.2.2. Restora-Flow Algorithm for Mask-Based Restoration

The goal of image restoration is to find the optimal image $\hat{x}$ that is consistent with the degraded observation $z$ and also adheres to a learned image prior. This is formulated as a Maximum A Posteriori (MAP) estimation problem:

$ \hat{x} = \arg \min_x \mathcal{D}(Hx, z) + \mathcal{R}_\theta(x) $

Where:

$\hat{x}$ : The estimated original image.
$H$ : The degradation operator (e.g., downsampling for super-resolution, zeroing out pixels for inpainting).
$\mathcal{D}(Hx, z)$ : The data fidelity term, typically a mean squared error (MSE) or similar measure, quantifying how well the restored image matches the observed degraded input in the known regions.
$\mathcal{R}_\theta(x)$ : The prior term (or regularization term), which leverages the learned Flow Matching model (parameterized by $\theta$ ) to ensure the restored image is realistic and belongs to the data manifold.

Simply running the Flow Matching generation (Eq. (3)) from noise will produce a generic image, but it won't necessarily be consistent with the specific degraded observation $z$ . Therefore, the sampling process must be guided towards the MAP solution. Restora-Flow opts for mask-guidance due to its applicability to various mask-based degradations.

4.2.2.1. Incorporating Mask-Guidance

Restora-Flow incorporates mask-guidance, drawing inspiration from RePaint [23]. This involves fusing the time-dependent sample $x_t$ with the unmasked portion of the original degraded image $z$ .

First, the degraded observation $z$ is adapted to match the noise level present in the current flow estimate $x_t$ . This is done through a convex combination with Gaussian noise $\epsilon$ :

$ z' = t z + (1-t) \epsilon, \quad \epsilon \sim \mathcal{N}(0, I) $

Where:

$z'$ : The noise-adapted version of the degraded observation.
$t$ : Current time step in the ODE integration (from 0 to 1).
$z$ : The original degraded observation.
$\epsilon$ : A sample from a standard normal distribution, $\mathcal{N}(0, I)$ .
$(1-t)\epsilon$ : As $t$ approaches 0 (more noise in $x_t$ ), more noise is added to $z'$ . As $t$ approaches 1 (less noise in $x_t$ ), $z'$ becomes closer to $z$ .

Next, this noise-adapted $z'$ is fused with the current flow estimate $x_t$ using a binary mask $m$ :

$ x_t' = m \odot z' + (1-m) \odot x_t $

Where:

$x_t'$ : The mask-guided (fused) sample at time $t$ .
$m$ : A binary mask where $m_{ij}=1$ for known (unmasked) regions and $m_{ij}=0$ for unknown (masked) regions.
$\odot$ : Element-wise multiplication.
$m \odot z'$ : Preserves the known regions from the noise-adapted degraded observation $z'$ .
$(1-m) \odot x_t$ : Retains the unknown (to be restored) regions from the current flow estimate $x_t$ .
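The noise adaptation and fusion steps translate directly into array operations. The function names `noise_adapt` and `fuse` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def noise_adapt(z, t, rng):
    """z' = t * z + (1 - t) * eps: match the observation to the noise level at time t."""
    eps = rng.standard_normal(z.shape)
    return t * z + (1.0 - t) * eps

def fuse(x_t, z_prime, mask):
    """x_t' = m * z' + (1 - m) * x_t: known pixels from z', unknown pixels from x_t."""
    return mask * z_prime + (1.0 - mask) * x_t

rng = np.random.default_rng(0)
z = np.ones((4, 4))                       # toy degraded observation
mask = np.zeros((4, 4))
mask[:, :2] = 1.0                         # left half known, right half unknown
x_t = rng.standard_normal((4, 4))         # current flow estimate

z_prime = noise_adapt(z, t=0.5, rng=rng)
x_fused = fuse(x_t, z_prime, mask)
```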

A naive approach would be to simply use this fused sample $x_t'$ in the Flow Matching update equation:

$ x_{t + \Delta_t} = x_t' + \Delta_t v_{\theta,t} ( x_t' ) $

However, the authors point out that this naive approach leads to visible misalignments at mask boundaries because $x_t'$ might diverge from the distribution of samples that the Flow Matching model $v_{\theta,t}$ was trained on. In essence, $x_t'$ might not lie on a trajectory consistent with the learned generative process, causing the model to produce low-quality results or artifacts.

4.2.2.2. Trajectory Correction

To address the issues of misalignment and improve sample quality, Restora-Flow introduces a trajectory correction mechanism. This mechanism is applied after the initial mask-guided update (Eq. (7)) and within each ODE step.

The correction involves two main steps:

Forward Extrapolation: The updated sample $x_{t + \Delta_t}$ (from Eq. (7)) is projected forward along the velocity field towards the endpoint $t=1$ (the clean image manifold). This acts as a learned denoiser, helping to correct misalignments and push the sample closer to the data manifold:

$ \widetilde{x}_1 = x_{t + \Delta_t} + \left( 1 - (t + \Delta_t) \right) v_{\theta, t + \Delta_t} ( x_{t + \Delta_t} ) $

Where:
- $\widetilde{x}_1$ : The extrapolated estimate of the clean image (at $t=1$ ).
- $x_{t + \Delta_t}$ : The sample after the mask-guided ODE update.
- $v_{\theta, t + \Delta_t} ( x_{t + \Delta_t} )$ : The velocity field predicted for the updated sample at time $t + \Delta_t$ .
- $(1 - (t + \Delta_t))$ : A scaling factor that accounts for the remaining time to $t=1$ .
Reintroduction of Noise and Rescaling: To place the sample back at the correct location along the generative trajectory and allow for stochasticity and diversity in generation, the extrapolated clean image $\widetilde{x}_1$ is scaled back to the current time $t$ and combined with new Gaussian noise $\eta$ :

$ x_t \leftarrow t \widetilde{x}_1 + (1 - t) \eta, \quad \eta \sim \mathcal{N}(0, I) $

Here, $x_t$ (the variable used for the next iteration) is updated. This step scales the denoised estimate $\widetilde{x}_1$ based on $t$ (more weight on $\widetilde{x}_1$ as $t \to 1$ ) and adds noise $\eta$ (more weight on noise as $t \to 0$ ). This ensures that the sample remains on a valid Flow Matching trajectory while being consistent with the denoised estimate.
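The two correction steps can be sketched as a single function. The name `trajectory_correction` and its signature are hypothetical, and the closed-form velocity field (pointing at a fixed target `c`) is a toy stand-in for the learned network.

```python
import numpy as np

def trajectory_correction(x_next, t, dt, velocity_fn, rng):
    """Extrapolate x_{t+dt} to a clean estimate, then re-noise it back to time t."""
    t_next = t + dt
    # Forward extrapolation: follow the velocity for the remaining time 1 - (t + dt).
    x1_tilde = x_next + (1.0 - t_next) * velocity_fn(x_next, t_next)
    # Reintroduce noise so the sample sits at time t on a straight noise-to-data path.
    eta = rng.standard_normal(x_next.shape)
    x_t = t * x1_tilde + (1.0 - t) * eta
    return x_t, x1_tilde

c = np.array([1.0, -2.0, 0.5])

def velocity(x, t):
    return (c - x) / (1.0 - t)            # toy field whose clean endpoint is c

rng = np.random.default_rng(1)
x_next = np.array([0.3, 0.1, -0.4])       # pretend this came from the fused Euler step
x_t, x1_tilde = trajectory_correction(x_next, t=0.3, dt=0.1, velocity_fn=velocity, rng=rng)
```

For this toy field the extrapolated estimate `x1_tilde` equals the target `c`, so the corrected `x_t` is a fresh noisy version of the clean point at time `t`.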

The authors empirically find that even a single correction step (i.e., $C=1$ ) per ODE iteration significantly improves alignment and reconstruction quality, as shown in Figure 2.

The overall Restora-Flow sampling process for mask-based image restoration (like inpainting and super-resolution) is detailed in Algorithm 1.

4.2.3. Algorithm 1: Mask-Guided Restora-Flow Sampling

Algorithm 1 outlines the steps for mask-based image restoration using Restora-Flow.

Algorithm 1 Mask-Guided Restora-Flow Sampling

1: Input: learned flow network $v_{\theta}$, degraded observation $z \in \mathbb{R}^{d}$, number of ODE steps $N$ (with $\Delta_t \gets \frac{1}{N}$), number of corrections $C > 0$, mask $m$
2: Sample $x \sim \mathcal{N}(0, I)$
3: for $t = 0, \Delta_t, \ldots, 1 - \Delta_t$ do
4:     for $c = 0, \ldots, C$ do
5:         Sample $\epsilon \sim \mathcal{N}(0, I)$
6:         $z' \gets t z + (1 - t) \epsilon$ if $t > 0$, else $z' = 0$
7:         $x' \gets m \odot z' + (1 - m) \odot x$
8:         $x \gets x' + \Delta_t v_{\theta,t}(x', t)$
9:         if $c > 0$ and $t < 1 - \Delta_t$ then
10:            Sample $\eta \sim \mathcal{N}(0, I)$
11:            $\widetilde{x}_1 \gets x + \left( 1 - (t + \Delta_t) \right) v_{\theta, t + \Delta_t}(x, t + \Delta_t)$
12:            $x \gets t \widetilde{x}_1 + (1 - t) \eta$
13:        else
14:            $t \gets t + \Delta_t$

Step-by-step breakdown of Algorithm 1:

Line 1 (Input):
- $v_{\theta}$ : The pre-trained Flow Matching velocity field network.
- $z$ : The degraded input image.
- $N$ : Total number of ODE integration steps. $\Delta_t = 1/N$ is the size of each time step.
- $C$ : Number of trajectory correction steps within each ODE iteration. The paper empirically sets $C=1$ .
- $m$ : The binary mask distinguishing known and unknown regions.
Line 2 (Initialization):
- An initial sample $x$ is drawn from a standard normal distribution, $x \sim \mathcal{N}(0, I)$ . This represents the starting point (pure noise) at $t=0$ .
Line 3 (Outer Loop - Time Integration):
- The algorithm iterates through time steps from $t=0$ up to $1-\Delta_t$ . Each iteration performs an ODE update.
Line 4 (Inner Loop - Correction Steps):
- For each time step $t$ , an inner loop runs for $C+1$ iterations (from $c=0$ to $C$ ). The $c=0$ iteration performs the initial mask-guided update, and subsequent iterations ( $c > 0$ ) apply the trajectory correction.
Line 5 (Sample Noise):
- A random noise vector $\epsilon$ is sampled from a standard normal distribution, $\mathcal{N}(0, I)$ .
Line 6 (Noise-Adapt Degraded Observation):
- The degraded observation $z$ is adapted to the current noise level. If $t=0$ , $z'$ is set to 0 (effectively meaning the input contributes no information initially, as the sample $x$ is pure noise). Otherwise, it's a convex combination of $z$ and $\epsilon$ , as per Eq. (5).
Line 7 (Mask-Guided Fusion):
- The current sample $x$ is fused with the noise-adapted degraded observation $z'$ using the mask $m$ , as per Eq. (6). Known regions come from $z'$ , unknown regions from $x$ . The result is $x'$ .
Line 8 (Flow Matching Update):
- The sample $x$ is updated using the explicit Euler step, incorporating the velocity field $v_{\theta,t}$ applied to the fused sample $x'$ , as per Eq. (7). Note: The original paper has a prime symbol after $v_{\theta,t}(x',t)$ , which is likely a minor formatting artifact. The standard Euler step is $x \gets x' + \Delta_t v_{\theta,t}(x', t)$ .
Line 9 (Correction Condition):
- This condition checks if it's a correction step ( $c > 0$ ) and not the very last ODE time step ( $t < 1 - \Delta_t$ ). The correction is skipped at the last step, presumably because the sample has then reached $t = 1$ and re-noising it would leave noise in the output.
Line 10 (Sample Noise for Correction):
- If it's a correction step, another noise vector $\eta$ is sampled from $\mathcal{N}(0, I)$ .
Line 11 (Forward Extrapolation):
- The current sample $x$ is projected forward to estimate the clean image $\widetilde{x}_1$ at $t=1$ , using the velocity field at the current time $t+\Delta_t$ , as per Eq. (8).
Line 12 (Reintroduction of Noise and Rescaling):
- The sample $x$ is then rescaled using the extrapolated clean image $\widetilde{x}_1$ and the newly sampled noise $\eta$ , effectively placing it back on a valid Flow Matching trajectory at time $t$ , as per Eq. (9). This updated $x$ is then used for the next iteration of the inner loop (if $c < C$ ) or the outer loop (if $c=C$ ).
Line 13-14 (Time Increment):
- If no correction is performed (i.e., $c=0$ and $C=0$ , or it's the last correction step for the current ODE iteration), the time step $t$ for the outer loop is advanced. Note: The indentation in the original Algorithm 1 is a bit ambiguous for line 13-14, but based on the overall logic, $t$ should be incremented after all $C$ corrections for the current ODE interval.
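
The per-iteration operations of Algorithm 1 (lines 6, 7, 8, 11 and 12) can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, `v` stands in for the pretrained network $v_\theta$, and the toy velocity field assumes a single known target image so that the result can be checked by hand. Loop control is deliberately left out because the placement of lines 13-14 in the listing is ambiguous.

```python
import numpy as np

def noise_adapt(z, t, eps):
    # Line 6: z' = t*z + (1-t)*eps for t > 0, and z' = 0 at t = 0.
    return t * z + (1.0 - t) * eps if t > 0 else np.zeros_like(z)

def fuse(m, z_prime, x):
    # Line 7: m == 1 marks known pixels (taken from z'), m == 0 unknown ones (kept from x).
    return m * z_prime + (1.0 - m) * x

def euler_step(v, x_prime, t, dt):
    # Line 8: explicit Euler update of the fused sample.
    return x_prime + dt * v(x_prime, t)

def correct(v, x, t, dt, eta):
    # Line 11: extrapolate from time t + dt to an estimate of the clean image at t = 1.
    x1_hat = x + (1.0 - (t + dt)) * v(x, t + dt)
    # Line 12: re-noise the estimate back to time t.
    return t * x1_hat + (1.0 - t) * eta

# Toy stand-in for v_theta (assumption): straight-line flow towards one fixed target.
target = np.array([1.0, 2.0, 3.0, 4.0])

def toy_v(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
m = np.array([1.0, 1.0, 0.0, 0.0])   # first two pixels are known
z = m * target                        # degraded observation: unknown pixels zeroed
t, dt = 0.5, 0.25
x = rng.standard_normal(4)
x_fused = fuse(m, noise_adapt(z, t, rng.standard_normal(4)), x)
x_next = euler_step(toy_v, x_fused, t, dt)
x_corr = correct(toy_v, x_next, t, dt, rng.standard_normal(4))
```

With this toy field the extrapolation in `correct` recovers `target` regardless of `x_next`, which makes the re-noising step easy to inspect; a real network would only give an estimate.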

4.2.4. Algorithm 2: Restora-Flow Sampling for Denoising

For image denoising, Restora-Flow uses a slightly different strategy, described in Algorithm 2, which can be viewed as a special case of mask-guidance where the mask is time-dependent and global.

Algorithm 2 Restora-Flow Sampling for Denoising

1: Input: degraded observation $z \in \mathbb{R}^{d}$ with noise level $\sigma$, number of ODE steps $N$ (with $\Delta_t \gets \frac{1}{N}$)
2: Sample $x_0 \sim \mathcal{N}(0, I)$
3: for $t = 0, \Delta_t, \ldots, 1 - \Delta_t$ do
4: Sample $\epsilon \sim \mathcal{N}(0, I)$
5: $z' \gets (1 - \sigma) z$
6: $x'_{t+\Delta_t} \gets x_t + \Delta_t v_{\theta,t}(x_t)$
7: $x_{t+\Delta_t} \gets \mathbf{1}_{\{t < 1 - \sigma\}} z' + \mathbf{1}_{\{t \geq 1 - \sigma\}} x'_{t+\Delta_t}$

Step-by-step breakdown of Algorithm 2:

Line 1 (Input):
- $z$ : The noisy degraded observation.
- $\sigma$ : The estimated noise level of the observation.
- $N$ : Number of ODE steps.
Line 2 (Initialization):
- An initial sample $x_0$ is drawn from a standard normal distribution. This is the starting point for Flow Matching.
Line 3 (Time Integration Loop):
- The algorithm iterates through time steps from $t=0$ up to $1-\Delta_t$ .
Line 4 (Sample Noise):
- A random noise vector $\epsilon$ is sampled. Note: $\epsilon$ is sampled but not explicitly used in the final update in this algorithm. This might be a leftover from a more general formulation or an oversight in simplification.
Line 5 (Scaled Observation):
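- The degraded observation $z$ is scaled by $(1-\sigma)$ to get $z'$ . The factor $(1-\sigma)$ equals the weight that the clean image carries on the interpolation path at time $t = 1-\sigma$ (compare line 6 of Algorithm 1), so $z'$ can plausibly be read as an approximate intermediate sample at that time.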
Line 6 (Flow Matching Update):
- The sample is updated using the standard Euler step of Flow Matching. Here, $x_t$ is the input to the velocity field, and $x_{t+\Delta_t}'$ is the result of applying the generative prior.
Line 7 (Time-Dependent Masking/Fusion):
- This is the core difference for denoising. A time-dependent global mask is implicitly used.
- $\mathbf{1}_{\{t < 1 - \sigma\}}$ : An indicator function that is 1 if the condition $t < 1 - \sigma$ is true, and 0 otherwise.
- $\mathbf{1}_{\{t \geq 1 - \sigma\}}$ : An indicator function that is 1 if $t \geq 1 - \sigma$ , and 0 otherwise.
- Interpretation:
  - For early time steps where $t < 1 - \sigma$ , the updated sample $x_{t+\Delta_t}$ is directly set to $z'$ . This means the noisy observation $z$ (scaled by $1-\sigma$ ) is used as the dominant input or initialization for the sampling process up to a certain point determined by the noise level $\sigma$ .
  - For later time steps where $t \geq 1 - \sigma$ , the ODE evolves the solution without further direct influence from $z$ . The Flow Matching model takes over to generate the clean image, relying purely on its learned prior knowledge.
- In essence, the noisy observation $z$ is used as an initialization that influences the early stages of the Flow Matching generation. Once the process has evolved sufficiently (beyond $1-\sigma$ ), the Flow Matching model completes the denoising based on its generative capabilities.
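
The loop of Algorithm 2 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions: `denoise_sketch` and `toy_v` are hypothetical names, `toy_v` is a hand-made velocity field pointing at one fixed target (not a trained network), and the unused sampling of $\epsilon$ in line 4 is omitted.

```python
import numpy as np

def denoise_sketch(v, z, sigma, n_steps, rng):
    dt = 1.0 / n_steps
    x = rng.standard_normal(z.shape)      # line 2: start from pure noise
    z_prime = (1.0 - sigma) * z           # line 5: scaled observation (does not depend on t)
    for i in range(n_steps):              # line 3: t = 0, dt, ..., 1 - dt
        t = i / n_steps
        x_gen = x + dt * v(x, t)          # line 6: explicit Euler step
        # Line 7: keep z' while t < 1 - sigma, afterwards keep the generated sample.
        x = z_prime if t < 1.0 - sigma else x_gen
    return x

# Toy stand-in for v_theta (assumption): straight-line flow towards one fixed target.
target = np.array([0.2, 0.5, 0.8])

def toy_v(x, t):
    return (target - x) / (1.0 - t)

rng = np.random.default_rng(0)
z = target + 0.2 * rng.standard_normal(3)   # noisy observation with sigma = 0.2
restored = denoise_sketch(toy_v, z, sigma=0.2, n_steps=10, rng=rng)
```

In this sketch, if $\sigma < \Delta_t$ the condition $t \geq 1 - \sigma$ is never met and the loop simply returns $z'$, so $N$ has to be at least $1/\sigma$ for the generative steps to run at all.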
  
Figure 2 visually demonstrates the impact of the trajectory correction mechanism, showing that $C=1$ offers the best trade-off between quality and speed. The figure compares samples for $C = 0$ , $C = 1$ , $C = 2$ , and $C = 4$ .

Figure 2. Restora-Flow samples with and without correction steps. Empirically, one correction step ( $C = 1$ ) offers the best trade-off between high reconstruction quality and fast processing.

5. Experimental Setup

5.1. Datasets

The authors utilized four diverse datasets to thoroughly assess the performance of Restora-Flow:

CelebA [22]:
- Characteristics: Features $162\mathrm{k}$ training images of celebrity faces.
- Size: Resized to $128 \times 128$ pixels.
- Domain: Natural images, specifically human faces.
- Test Set: 100 test images.
- Purpose: Common benchmark for generative models and face-related tasks.
AFHQ-Cat [5]:
- Characteristics: Includes $5\mathrm{k}$ training images of cat faces.
- Size: Resized to $256 \times 256$ pixels.
- Domain: Natural images, specifically animal faces.
- Test Set: 100 test images.
- Purpose: Evaluates performance on a different natural image domain with potentially more complex textures than human faces.
COCO [19]:
- Characteristics: Contains $118\mathrm{k}$ training images of various object types and complex scenes.
- Size: Resized to $128 \times 128$ pixels.
- Domain: Natural images, highly diverse and complex scenes.
- Test Set: 100 validation images.
- Purpose: Assesses versatility on a broad range of natural images with high variability.
X-ray Hand [13, 36]:
- Characteristics: Comprises 895 hand radiographs.
- Size: Resized to $256 \times 256$ pixels.
- Domain: Medical images (radiographs).
- Test Set: 298 test images.
- Purpose: Crucial for demonstrating the method's applicability and robustness in a specialized, sensitive domain like medical imaging, which often has different image statistics than natural images.
  
  These datasets were chosen to represent a range of image complexities, sizes, and domains (natural vs. medical), providing a comprehensive evaluation of Restora-Flow's versatility and robustness across different scenarios.

5.2. Evaluation Metrics

The paper uses a combination of distortion metrics and perceptual metrics to evaluate the quality of the restored images, acknowledging the perception-distortion trade-off [3].

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

Conceptual Definition: PSNR is a common image quality metric that quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. It is most easily defined via the Mean Squared Error (MSE). A higher PSNR value indicates better image quality, meaning less distortion or noise. It primarily measures pixel-wise accuracy.
Mathematical Formula: First, the Mean Squared Error (MSE) between two images, $I$ and $K$ , of size $m \times n$ is calculated as: $ MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $ Then, PSNR (in decibels) is calculated as: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
Symbol Explanation:
- $I$ : The original (ground truth) image.
- $K$ : The restored image being evaluated.
- $m$ , $n$ : The dimensions (height and width) of the images.
- $I(i,j)$ , $K(i,j)$ : The pixel values at coordinates $(i,j)$ in images $I$ and $K$ , respectively.
- $MAX_I$ : The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255. For images normalized to [0,1], this is 1.
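
The two formulas above translate directly into a few lines of NumPy. The sketch below is illustrative only: the function name `psnr` and the default `max_val=1.0` (images normalized to [0,1]) are assumptions, not details taken from the paper.

```python
import numpy as np

def psnr(reference, restored, max_val=1.0):
    # Mean squared error over all pixels.
    diff = reference.astype(np.float64) - restored.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical images: PSNR is unbounded
    return 10.0 * np.log10(max_val ** 2 / mse)

ref = np.zeros((4, 4))
rest = np.full((4, 4), 0.1)
value = psnr(ref, rest)   # MSE is 0.01, so the result is 20 dB up to rounding
```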

5.2.2. Structural Similarity Index (SSIM)

Conceptual Definition: SSIM is a perceptual metric that evaluates the similarity between two images. Unlike PSNR which focuses on absolute errors, SSIM attempts to model how the human visual system perceives image quality, considering three key factors: luminance, contrast, and structure. A value closer to 1 indicates higher similarity.
Mathematical Formula: For two image patches $x$ and $y$ : $ SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
Symbol Explanation:
- $\mu_x$ , $\mu_y$ : The mean (luminance) of image patches $x$ and $y$ , respectively.
- $\sigma_x$ , $\sigma_y$ : The standard deviation (contrast) of image patches $x$ and $y$ , respectively.
- $\sigma_{xy}$ : The covariance between image patches $x$ and $y$ (structure correlation).
- $c_1 = (K_1 L)^2$ , $c_2 = (K_2 L)^2$ : Stability constants to avoid division by a small denominator.
- $L$ : The dynamic range of the pixel values (e.g., 255 for 8-bit images).
- $K_1$ , $K_2$ : Small constant values (e.g., $K_1=0.01$ , $K_2=0.03$ ).
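
As an illustration, the formula can be evaluated on a single pair of patches with NumPy. This is a simplified sketch: it computes the statistics over the whole array once, whereas common implementations average the index over many local (often Gaussian-weighted) windows. The function name `ssim_global` and the default $L = 1$ are assumptions.

```python
import numpy as np

def ssim_global(x, y, L=1.0, K1=0.01, K2=0.03):
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (K1 * L) ** 2, (K2 * L) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()               # sigma_x^2 and sigma_y^2
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()     # sigma_xy
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
a = rng.random((8, 8))
same = ssim_global(a, a)            # identical patches give 1 up to rounding
inverted = ssim_global(a, 1.0 - a)  # negatively correlated patches score lower
```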

5.2.3. Learned Perceptual Image Patch Similarity (LPIPS)

Conceptual Definition: LPIPS [39] is a perceptual similarity metric designed to correlate well with human judgment of image quality. Instead of comparing pixels directly, LPIPS computes the distance between deep features extracted from pre-trained convolutional neural networks (e.g., AlexNet, VGG, SqueezeNet). It measures how perceptually similar two images are; a lower LPIPS score indicates higher perceptual similarity. It is particularly useful for evaluating the realism and perceptual quality of generative models.
Mathematical Formula: Unlike PSNR or SSIM, LPIPS depends on a pre-trained deep neural network (often an image classification network like AlexNet or VGG). It is calculated as the weighted squared $L_2$ distance between feature stacks extracted at various layers, averaged over spatial positions and summed over layers: $ LPIPS(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\phi_l(x)_{h,w} - \phi_l(x_0)_{h,w}) \|_2^2 $
Symbol Explanation:
- $x$ : The reference image.
- $x_0$ : The generated/restored image.
- $l$ : Index of a convolutional layer in the chosen pre-trained network.
- $\phi_l$ : The feature map extracted from layer $l$ .
- $H_l, W_l$ : Height and width of the feature map at layer $l$ .
- $w_l$ : A learnable weight vector that scales the feature differences at each layer. These weights are learned by training the LPIPS model on a dataset of human perceptual similarity judgments.
- $\odot$ : Element-wise multiplication.
- $\|\cdot\|_2^2$ : Squared $L_2$ norm (squared Euclidean distance).
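
The aggregation in the formula can be sketched with NumPy on precomputed feature maps. This only illustrates the sum over layers and positions: the feature arrays and weights below are placeholders, whereas the actual metric uses features from a pre-trained network and weights learned from human judgments, neither of which is reproduced here. The function name is hypothetical.

```python
import numpy as np

def lpips_from_features(feats_x, feats_x0, weights):
    # feats_x, feats_x0: one array of shape (H_l, W_l, C_l) per layer l.
    # weights: one array of shape (C_l,) per layer l.
    total = 0.0
    for f_x, f_x0, w_l in zip(feats_x, feats_x0, weights):
        height, width, _ = f_x.shape
        diff = w_l * (f_x - f_x0)                 # channel weights broadcast over H x W
        total += np.sum(diff ** 2) / (height * width)
    return total

# Placeholder features for a single 1 x 1 layer with two channels.
f_a = [np.zeros((1, 1, 2))]
f_b = [np.array([[[1.0, 2.0]]])]
d_unit = lpips_from_features(f_a, f_b, [np.ones(2)])                  # 1^2 + 2^2 = 5
d_weighted = lpips_from_features(f_a, f_b, [np.array([2.0, 0.0])])    # (2*1)^2 = 4
```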

5.3. Baselines

Restora-Flow was compared against a comprehensive set of baseline methods, including both diffusion-based and flow-based approaches.

5.3.1. Flow-Based Baselines

OT-ODE [26]: A flow-based method that uses a gradient correction term to guide the generation process.
Flow-Priors [40]: Decomposes the flow's trajectory into local objectives and optimizes them iteratively.
D-Flow [2]: Formulates restoration as a source point optimization problem, requiring backpropagation through ODE solvers.
PnP-Flow [25]: Combines Plug-and-Play methods with Flow Matching without backpropagation.

For experiments on CelebA and AFHQ-Cat, hyperparameters for these flow-based baselines were adopted from [25], where a grid search was performed. For COCO, CelebA's hyperparameters were used, and for X-ray Hand, AFHQ-Cat's hyperparameters were used.

5.3.2. Diffusion-Based Baselines

RePaint [23]: A diffusion-based method that uses unmasked regions to guide the diffusion process for inpainting tasks.
DDNM+ [34]: A diffusion-based method for zero-shot image restoration using range-null space decomposition.

For RePaint, standard hyperparameters (jump length of 10, 10 resampling steps) were used. For DDNM+, optimal performance was achieved with time-travel trick parameters $s=1$ and $l=5$ .

5.3.3. Implementation Details

Flow-Based Models: Pretrained CelebA and AFHQ-Cat models from [25] were used. For COCO and X-ray Hand, new Flow Matching models were trained from scratch. All flow-based reference methods were implemented using the framework from [25].
Diffusion-Based Models: For a fair comparison, DDPMs were trained from scratch using the same U-Net architecture and training parameters as the flow-based models (e.g., learning rate 1e-4, batch size 128/64, 200/400 epochs, 250 diffusion time steps). For X-ray Hand, pretrained models from [14] and the MED-DDPM framework [10, 14] were used.
Restora-Flow Hyperparameters: A fixed number of correction steps ( $C = 1$ ) was used across all experiments and datasets. The only optimized hyperparameter was the number of ODE steps:
- Natural image datasets (CelebA, AFHQ-Cat, COCO): 64 ODE steps for denoising and box inpainting; 128 ODE steps for $2\times$ super-resolution and random inpainting; 256 ODE steps for $4\times$ super-resolution.
- Medical X-ray Hand dataset: 64 ODE steps for denoising and $2\times$ super-resolution; 32 ODE steps for box inpainting and occlusion removal.
Hardware: NVIDIA A100 GPUs for $256 \times 256$ resolution experiments, NVIDIA GeForce RTX 3090 for $128 \times 128$ resolution experiments.

5.4. Tasks

The evaluation covered a range of mask-based image restoration tasks:

Denoising: Removing Gaussian measurement noise from images.
- $\sigma = 0.2$ for CelebA, COCO, AFHQ-Cat.
- $\sigma = 0.08$ for X-ray Hand.
Box Inpainting: Filling in a square region in the center of the image.
- $40 \times 40$ centered mask for CelebA, COCO.
- $80 \times 80$ centered mask for AFHQ-Cat.
- $128 \times 128$ centered mask for X-ray Hand.
Super-resolution: Increasing the resolution of an image.
- $2\times$ super-resolution for CelebA, COCO, X-ray Hand.
- $4\times$ super-resolution for AFHQ-Cat.
Random Inpainting: Filling in randomly masked pixels.
- Mask covering $70\%$ of pixels for CelebA, COCO, AFHQ-Cat.
Occlusion Removal: A clinically motivated task on X-ray Hand dataset, where synthetically added occlusions (as described in [14]) are removed.

All experiments, unless noted, included a Gaussian measurement noise level of $\sigma = 0.01$ .
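
The inpainting masks described above can be built with a few lines of NumPy. This is a sketch under assumptions: the helper names are hypothetical, the mask convention (1 for known pixels, 0 for missing ones) follows line 7 of Algorithm 1, and the random mask hides an exact fraction of pixels, which may differ from how the authors sample it.

```python
import numpy as np

def center_box_mask(size, box):
    # 1 = known pixel, 0 = pixel inside the centered square hole.
    m = np.ones((size, size))
    start = (size - box) // 2
    m[start:start + box, start:start + box] = 0.0
    return m

def random_mask(size, masked_fraction, rng):
    # Hide exactly round(masked_fraction * size^2) pixels chosen uniformly at random.
    n = size * size
    n_masked = int(round(masked_fraction * n))
    flat = np.ones(n)
    flat[rng.permutation(n)[:n_masked]] = 0.0
    return flat.reshape(size, size)

rng = np.random.default_rng(0)
box = center_box_mask(128, 40)      # CelebA / COCO box inpainting setting
rand = random_mask(128, 0.7, rng)   # 70% of pixels hidden
image = rng.random((128, 128))      # placeholder image, not real data
observed = box * image              # degraded observation with the box zeroed out
```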

6. Results & Analysis

6.1. Core Results Analysis

The experimental results consistently demonstrate the superior performance of Restora-Flow across various tasks and datasets, particularly in perceptual quality (lower LPIPS) and processing time.

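The following are the results from Table 1 of the original paper. The task settings in the header row ( $\sigma = 0.2$ , $40 \times 40$ , $2\times$ , $70\%$ ) are those used for CelebA; the settings for the other datasets are listed in Section 5.4. For X-ray Hand, Section 5.4 lists occlusion removal rather than random inpainting, so the last column group presumably reports that task.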

Model	Denoising σ = 0.2				Box inpainting 40 × 40				Super-resolution 2×				Random inpainting 70%
Model	LPIPS ↓	SSIM ↑	PSNR ↑	Time in s ↓	LPIPS ↓	SSIM ↑	PSNR ↑	Time in s ↓	LPIPS ↓	SSIM ↑	PSNR ↑	Time in s ↓	LPIPS ↓	SSIM ↑	PSNR ↑	Time in s ↓
CelebA
RePaint [23]	N/A	N/A	N/A	N/A	0.016	0.967	30.81	32.89	0.014	0.946	32.59	32.89	0.014	0.945	32.37	32.89
DDNM+ [34]	0.076	0.885	30.70	11.57	0.019	0.969	31.05	11.57	0.046	0.905	30.02	11.57	0.031	0.920	30.83	11.57
OT-ODE [26]	0.033	0.858	30.36	2.95	0.022	0.954	29.85	3.68	0.055	0.870	28.65	3.76	0.051	0.871	28.41	3.76
Flow-Priors [40]	0.132	0.767	29.27	26.22	0.020	0.969	31.17	26.22	0.110	0.722	28.52	26.22	0.019	0.944	32.34	26.22
D-Flow [2]	0.099	0.695	24.64	22.73	0.041	0.907	29.77	65.81	0.031	0.894	31.30	71.43	0.021	0.931	32.48	131.78
PnP-Flow [25]	0.056	0.910	32.12	4.60	0.045	0.941	30.48	4.60	0.058	0.908	31.37	4.60	0.022	0.954	33.55	4.60
Restora-Flow	0.019	0.922	33.09	0.58	0.018	0.964	30.91	2.06	0.014	0.952	33.59	3.63	0.015	0.947	32.71	3.63
AFHQ-Cat
RePaint [23]	N/A	N/A	N/A	N/A	0.043	0.939	26.26	86.23	0.139	0.701	24.71	86.23	0.034	0.897	30.93	86.23
DDNM+ [34]	0.170	0.818	29.06	13.74	0.048	0.942	25.16	13.74	0.462	0.534	19.69	13.74	0.065	0.876	30.12	13.74
OT-ODE [26]	0.078	0.814	29.73	5.54	0.048	0.924	24.36	6.94	0.285	0.565	21.85	7.28	0.094	0.839	28.87	7.28
Flow-Priors [40]	0.153	0.771	29.43	67.10	0.054	0.942	26.04	67.05	0.271	0.565	23.50	67.30	0.046	0.909	31.82	67.69
D-Flow [2]	0.184	0.648	24.98	44.45	0.112	0.839	26.17	126.09	0.123	0.707	25.34	261.84	0.056	0.878	30.97	266.18
PnP-Flow [25]	0.165	0.864	31.10	9.86	0.124	0.904	26.18	9.86	0.180	0.790	26.95	46.26	0.042	0.930	33.07	19.15
Restora-Flow	0.051	0.899	32.35	0.72	0.047	0.939	25.96	3.96	0.158	0.761	26.33	14.48	0.034	0.914	31.99	7.48
COCO
RePaint [23]	N/A	N/A	N/A	N/A	0.093	0.922	21.20	32.89	0.046	0.856	25.84	32.89	0.038	0.876	26.82	32.89
DDNM+ [34]	0.162	0.805	27.04	11.57	0.112	0.925	21.71	11.57	0.257	0.682	19.05	11.57	0.069	0.845	25.80	11.57
OT-ODE [26]	0.066	0.810	27.52	2.95	0.073	0.914	23.40	3.68	0.146	0.745	23.83	3.76	0.130	0.763	23.98	3.76
Flow-Priors [40]	0.116	0.751	27.08	26.22	0.084	0.927	23.58	26.22	0.112	0.698	24.93	26.22	0.055	0.855	25.97	26.22
D-Flow [2]	0.252	0.552	21.19	22.73	0.115	0.825	23.46	65.81	0.083	0.778	24.80	71.43	0.053	0.840	26.29	131.78
PnP-Flow [25]	0.128	0.855	28.97	4.60	0.121	0.892	24.56	4.60	0.118	0.827	26.73	4.60	0.053	0.896	28.13	4.60
Restora-Flow	0.026	0.905	30.57	0.58	0.084	0.929	24.80	2.06	0.044	0.877	27.44	3.63	0.040	0.881	27.37	3.63
X-ray Hand
RePaint [23]	N/A	N/A	N/A	N/A	0.046	0.821	23.90	17.02	0.074	0.767	20.04	17.02	0.032	0.898	29.66	17.02
DDNM+ [34]	0.057	0.819	23.78	13.35	0.059	0.801	22.76	13.35	0.143	0.635	14.10	13.35	0.047	0.884	26.57	13.35
OT-ODE [26]	0.026	0.853	27.83	8.73	0.038	0.801	23.58	11.17	0.076	0.684	22.01	11.17	0.029	0.845	26.55	11.17
Flow-Priors [40]	0.033	0.885	28.58	68.54	0.035	0.882	25.74	68.59	0.162	0.460	20.38	68.62	0.023	0.933	27.07	68.59
D-Flow [2]	0.077	0.630	24.09	101.66	0.145	0.588	13.61	285.55	0.127	0.639	15.26	361.22	0.110	0.587	22.23	361.22
PnP-Flow [25]	0.052	0.843	25.17	20.48	0.054	0.822	23.67	20.35	0.029	0.884	25.88	102.29	0.045	0.889	26.83	20.35
Restora-Flow	0.021	0.912	31.34	0.50	0.035	0.846	24.67	4.03	0.037	0.857	24.66	7.95	0.017	0.935	33.51	4.03

CelebA Dataset:
- Perceptual Quality (LPIPS): Restora-Flow achieves the best LPIPS among the flow-based methods on all four tasks (denoising, box inpainting, super-resolution, random inpainting) and is best or within 0.002 of the best method overall. In denoising, it scores 0.019 against 0.033 for the next best method, OT-ODE. In super-resolution, it ties RePaint at 0.014. RePaint is marginally ahead for box inpainting (0.016 vs. 0.018) and random inpainting (0.014 vs. 0.015), but needs roughly $9\times$ to $16\times$ the processing time.
- Distortion Metrics (SSIM, PSNR): Restora-Flow achieves the best SSIM and PSNR for denoising and super-resolution. For random inpainting, it is second to PnP-Flow on both metrics (0.947 vs. 0.954 SSIM, 32.71 vs. 33.55 dB PSNR). For box inpainting, it trails the best methods, Flow-Priors and DDNM+, by at most 0.005 SSIM and 0.26 dB PSNR.
- Processing Time: A standout advantage is Restora-Flow's processing time: it is the fastest method on every CelebA task. In denoising, it takes only 0.58 s, compared with 11.57 s for DDNM+, 2.95 s for OT-ODE, and 4.60 s for PnP-Flow (RePaint is not evaluated for denoising but needs 32.89 s for inpainting and super-resolution). Across tasks, this corresponds to speed-ups of roughly $9\times$ to $16\times$ over RePaint and $3\times$ to $20\times$ over DDNM+, at comparable or better quality.
- Overall: The plots in Figure 4 further emphasize this, showing Restora-Flow in or near the favorable low-time corner of each plot (bottom-left for LPIPS, top-left for SSIM and PSNR) on CelebA.
AFHQ-Cat and COCO Datasets:
- Restora-Flow maintains its strong performance on these diverse natural image datasets. It achieves the best LPIPS among flow-based methods for most tasks and is second otherwise (e.g., behind D-Flow for super-resolution on AFHQ-Cat, 0.158 vs. 0.123, and behind OT-ODE for box inpainting on COCO, 0.084 vs. 0.073).
- For SSIM and PSNR, it achieves the best results for denoising on both datasets and for box inpainting and super-resolution on COCO. Elsewhere it stays within about 0.03 SSIM and 1.1 dB PSNR of the top performer, which is usually PnP-Flow.
- Crucially, these results are achieved with the fastest processing times in most settings. OT-ODE is occasionally faster (e.g., super-resolution on AFHQ-Cat at 7.28 s vs. 14.48 s for Restora-Flow), but it delivers inferior reconstruction quality in those instances (LPIPS 0.285 vs. 0.158).
- Diffusion-based baselines (RePaint, DDNM+) often yield competitive quality but at a significantly higher computational cost. Figure 1 (left) visually confirms Restora-Flow's superior speed-quality trade-off for AFHQ-Cat.
X-ray Hand Dataset (Medical Domain):
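- The results on the medical dataset further confirm Restora-Flow's versatility and robustness. It has the lowest processing time in every column of Table 1 and the best LPIPS, SSIM, and PSNR for denoising and in the last task column. For box inpainting it ties Flow-Priors for the best LPIPS (0.035), and for $2\times$ super-resolution it is second to PnP-Flow, which needs 102.29 s versus 7.95 s.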
- In the clinically motivated occlusion removal task, Restora-Flow again shows improved perceptual quality at lower processing times, which is significant for downstream medical tasks.
  
  Visual results (Figure 3 for CelebA, Figure 6 for AFHQ-Cat, and supplemental figures) corroborate these quantitative findings, showing Restora-Flow generating artifact-free and realistic images that maintain texture. In contrast, some baselines (OT-ODE, Flow-Priors, $DDNM+$ ) produce artifacts, D-Flow is slow and struggles with some object reconstruction, and PnP-Flow often yields over-smoothed results. For medical images, while overall variability is lower, Restora-Flow shows clear advantages in maintaining anatomical detail and cleanly removing occlusions compared to baselines that might alter known regions or leave residual artifacts.
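
The processing-time gaps on CelebA can be recomputed directly from Table 1. The short script below is only a convenience sketch: the dictionary copies the per-image times (in seconds) from the CelebA rows for three tasks, and the helper name `speedups` is hypothetical.

```python
# Seconds per image on CelebA, copied from Table 1 (RePaint has no denoising entry).
times = {
    "denoising": {"DDNM+": 11.57, "OT-ODE": 2.95, "PnP-Flow": 4.60, "Restora-Flow": 0.58},
    "box inpainting": {"RePaint": 32.89, "DDNM+": 11.57, "OT-ODE": 3.68,
                       "PnP-Flow": 4.60, "Restora-Flow": 2.06},
    "super-resolution": {"RePaint": 32.89, "DDNM+": 11.57, "OT-ODE": 3.76,
                         "PnP-Flow": 4.60, "Restora-Flow": 3.63},
}

def speedups(task):
    # Ratio of each baseline's time to Restora-Flow's time for one task.
    ours = times[task]["Restora-Flow"]
    return {name: t / ours for name, t in times[task].items() if name != "Restora-Flow"}

for task in times:
    print(task, {name: round(ratio, 1) for name, ratio in speedups(task).items()})
```

The ratios vary considerably by task, so a single speed-up figure is best read as a rough summary.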
  
Figure 4. Visual representation of quantitative results on CelebA. Restora-Flow ( $\bigcirc$ ) is compared to related work methods (other shapes) on four different tasks (colors). The plots show LPIPS $\downarrow$ (left), SSIM $\uparrow$ (center) and PSNR $\uparrow$ (right) on the y-axis, and processing time $\downarrow$ (all plots) on the x-axis. For better visualization and comparison, each plot is separated into two parts with different scales on the x-axis.

Figure 1. The image is a chart and schematic showing the comparison of different image restoration methods in terms of score and processing time for denoising, super-resolution, random inpainting, and box inpainting tasks. The left chart displays the performance of various methods at different time intervals, while the right side illustrates the results for each task.

Figure 3. The image is an illustration comparing different image restoration methods, showcasing results for denoising, box inpainting, super-resolution, and random inpainting tasks. Each row displays the recovery effects of different methods, including the original image and the result produced using Restora-Flow.

Figure 6. The image is a comparative illustration of the effects of different image restoration methods. It presents the results of denoising, box inpainting, super-resolution, and random inpainting, comparing them with the original image and other methods such as RePaint, DDNM, and Flow-Priors, and ultimately showing the results of Restora-Flow.

6.2. Ablation Studies / Parameter Analysis

The authors conducted an ablation study to investigate the influence of two key parameters for Restora-Flow: the number of ODE steps and the number of correction steps (C). This study focused on $2\times$ super-resolution on the CelebA dataset, evaluating LPIPS, SSIM, and PSNR against processing time.

The following figure (Figure 5 from the original paper) shows the ablation study results:

Figure 5. Ablation of ODE steps (indicated by markers) and correction steps $C$ for $2 \times$ super-resolution on CelebA comparing LPIPS $\downarrow$ (top), SSIM $\uparrow$ (middle) and PSNR $\uparrow$ (bottom) to processing time $\downarrow$ . ODE steps increase from left to right and represent 4, 8, 16, 32, 64, 128 and 256, respectively. For better visualization, ODE steps 4 and 8 when using $C = 0$ are omitted. The circle indicates the selected hyperparameters. Time is per image and displayed on a logarithmic scale.

Analysis of Figure 5:

Impact of Correction Steps (C):
- As expected, increasing the number of correction steps (C) (e.g., from $C=0$ to $C=1, 2, 4$ ) leads to longer evaluation times for the same number of ODE steps. Each correction adds computational overhead.
- The most significant improvement in quality (lower LPIPS, higher SSIM/PSNR) occurs when moving from $C=0$ (no correction) to $C=1$ . For $C=0$ , the quality metrics are substantially worse, especially LPIPS and PSNR, indicating that the trajectory correction is critical for high-quality restoration.
- Increasing $C$ beyond 1 (e.g., to $C=2$ or $C=4$ ) offers diminishing returns in terms of quality improvement, while further increasing processing time. For LPIPS, $C=1, 2, 4$ show very similar performance, often with $C=1$ being slightly better or on par for comparable ODE steps.
- This empirically validates the authors' choice to fix $C=1$ for all experiments, as it offers the best trade-off between high reconstruction quality and fast processing, requiring minimal additional computational cost beyond the ODE integration itself.
Impact of ODE Steps:
- For a fixed $C$ , increasing the number of ODE steps generally leads to better quality metrics (lower LPIPS, higher SSIM/PSNR) up to a certain point where the metrics saturate. More ODE steps mean finer integration of the velocity field, leading to more accurate trajectories.
- However, increasing ODE steps also directly increases processing time.
- The plots show a clear trade-off between fast sampling and better scores. Lower correction steps (specifically $C=1$ ) are advantageous because they allow for achieving good quality with a manageable number of ODE steps without excessive processing time.
- For $C=1$ , the quality metrics (LPIPS, SSIM, PSNR) show good performance starting from 32 or 64 ODE steps, with further increases providing only marginal gains. The chosen hyperparameters (e.g., 64 or 128 ODE steps depending on the task) reflect this saturation point for the best balance.
  
  The ablation study confirms that the trajectory correction mechanism is a vital component of Restora-Flow, significantly improving reconstruction quality, and that $C=1$ is a well-justified choice. It also highlights the typical trade-off between computational cost and quality when selecting the number of ODE steps, which is common in ODE-based generative models.
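
The link between step count and integration accuracy can be seen on a toy problem. The sketch below applies explicit Euler to the scalar ODE $dx/dt = x$ on $[0, 1]$, whose exact solution at $t = 1$ is $e$; it is not a flow model and only illustrates why more ODE steps give more accurate trajectories at a proportionally higher cost. The function name is hypothetical.

```python
import math

def euler_integrate(f, x0, n_steps):
    # Explicit Euler on [0, 1] with n_steps equal steps (one call to f per step).
    dt = 1.0 / n_steps
    x = x0
    for i in range(n_steps):
        x = x + dt * f(x, i * dt)
    return x

step_counts = (4, 8, 16, 32, 64, 128, 256)   # same grid as the ablation in Figure 5
errors = {n: abs(euler_integrate(lambda x, t: x, 1.0, n) - math.e) for n in step_counts}
for n in step_counts:
    print(n, errors[n])
```

For this toy ODE the error shrinks steadily as the number of steps grows, while the number of function evaluations grows linearly with it.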

6.3. Qualitative Results

The qualitative results, as illustrated in Figure 3 (CelebA), Figure 6 (AFHQ-Cat), and supplemental figures, provide strong visual evidence supporting the quantitative superiority of Restora-Flow.

OT-ODE: Often produces artifacts, especially noticeable in box inpainting tasks on CelebA and AFHQ-Cat, and in random inpainting on COCO. This suggests that its gradient correction might not fully resolve inconsistencies or maintain perceptual quality across diverse tasks.
Flow-Priors: Tends to generate noisy reconstructions in super-resolution on CelebA and introduces artifacts in both super-resolution and random inpainting on AFHQ-Cat and COCO. This indicates issues with maintaining smoothness and realism.
D-Flow: While generally yielding realistic results, it suffers from slow processing times. More critically, it sometimes struggles to accurately reconstruct certain objects (e.g., for denoising on COCO) and can substantially alter known input regions (e.g., box inpainting on X-ray Hand), suggesting sensitivity to hyperparameters or inherent limitations in its optimization approach.
PnP-Flow: Despite achieving good SSIM and PSNR scores, PnP-Flow frequently produces over-smoothed results across all datasets. This is a common characteristic of Plug-and-Play methods that prioritize distortion metrics, often at the expense of perceptual realism. The lack of fine textures and details makes images appear less natural.
Diffusion-based Baselines (DDNM+, RePaint):
- DDNM+ often introduces artifacts in super-resolution tasks.
- RePaint generally produces visually realistic results but is significantly slower than Restora-Flow.
  
  In stark contrast, Restora-Flow consistently generates artifact-free and realistic images that maintain texture while ensuring fast processing across all conducted experiments. The restored images preserve fine details and natural appearances without the common pitfalls of other methods (artifacts, over-smoothing, or altered known regions). This comprehensive qualitative analysis highlights Restora-Flow's ability to balance perceptual quality with fidelity and speed.

For the medical X-ray Hand dataset, while overall visual variability among methods is lower due to the inherent uniformity of X-ray images, Restora-Flow still demonstrates clear advantages. For instance, in occlusion removal, other baselines often leave partially visible occlusions, whereas Restora-Flow effectively removes them and produces clean, high-quality reconstructions that are crucial for clinical interpretation. The consistency and quality, combined with faster processing, make Restora-Flow particularly suitable for sensitive applications like medical imaging.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Restora-Flow, a novel training-free algorithm for mask-guided image restoration based on Flow Matching. By intelligently combining mask-guided sampling with an effective trajectory correction mechanism, Restora-Flow overcomes critical limitations of existing diffusion and flow-based methods. It consistently achieves superior perceptual quality (lower LPIPS) and significantly reduced processing times across various tasks (denoising, inpainting, super-resolution, occlusion removal) and diverse datasets (natural and medical). A key advantage is its simplicity, requiring only the number of ODE steps as a hyperparameter, as the number of corrections is fixed at $C=1$ .

7.2. Limitations & Future Work

The primary limitation acknowledged by the authors is that Restora-Flow is currently designed for mask-based degradation operators. This means it might not be directly applicable to image restoration tasks involving more complex, non-mask-based degradations (e.g., motion blur, specific camera artifacts, or complex non-linear degradation models) without modification.
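The distinction can be illustrated with a small NumPy sketch. It assumes the usual reading of a mask-based operator as an element-wise product with a binary mask, and uses a 1D box blur as a hypothetical example of a non-mask degradation; the function names are made up for this example. The point is that a mask leaves every observed pixel equal to the original, so known pixels can be copied back directly, whereas a blur mixes neighbouring pixels and offers no such pixel-wise correspondence.

```python
import numpy as np


def mask_degrade(x, mask):
    # Mask-based degradation: each pixel is either kept (mask = 1) or dropped.
    return mask * x


def box_blur_1d(x, width=3):
    # Non-mask degradation (illustrative): each output averages its neighbours.
    kernel = np.ones(width) / width
    return np.convolve(x, kernel, mode="same")


x = np.array([0.0, 1.0, 4.0, 9.0, 16.0])
mask = np.array([1.0, 0.0, 1.0, 0.0, 1.0])

masked = mask_degrade(x, mask)   # observed pixels equal the originals
blurred = box_blur_1d(x)         # no pixel equals its original value here
```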

As future work, the authors plan to extend the algorithm to tackle image restoration tasks that involve these non-mask-based degradation operators. This would significantly broaden the applicability of Restora-Flow to an even wider range of real-world scenarios.

7.3. Personal Insights & Critique

This paper makes a valuable contribution to image restoration, in particular by demonstrating practical advantages of Flow Matching over diffusion models for this application. Its training-free design and its emphasis on processing speed without sacrificing perceptual quality are desirable properties for real-world deployment.

Inspirations and Applications to Other Domains:

- Efficiency for Real-time Applications: The significant reduction in processing time makes Restora-Flow a promising candidate for real-time image enhancement systems, such as autonomous driving (e.g., removing sensor noise or inpainting occluded regions in camera images), video conferencing (real-time denoising or super-resolution), or medical imaging during live procedures.
- Medical Imaging Potential: Its strong performance on the X-ray Hand dataset, especially for occlusion removal, suggests potential for other medical imaging modalities. Consistent, artifact-free generation could matter for diagnostic accuracy and for downstream AI tasks (such as segmentation or classification) where noisy or degraded inputs are common.
- Simplicity of Use: Minimal hyperparameter tuning (only the number of ODE steps) is a significant advantage. This ease of use could encourage adoption by practitioners who might be deterred by the complex tuning requirements of other generative models.
- Foundation for Other Inverse Problems: The core idea of mask-guided generation with trajectory correction could serve as a framework for other inverse problems beyond image restoration, such as computational photography tasks or scientific image analysis where parts of the data are known and others are missing.

Potential Issues, Unverified Assumptions, or Areas for Improvement:

- Generalization to Non-Mask Degradations: The current restriction to mask-based degradations is a critical point. Extending Restora-Flow to non-mask-based problems will likely require a more general data fidelity term $\mathcal{D}(Hx, z)$ and potentially a different correction mechanism that does not rely on a binary mask. This is a non-trivial extension.
- Fidelity Term in Denoising (Algorithm 2): In Algorithm 2 for denoising, the noise $\epsilon$ is sampled but does not appear in the update equation as expected. This might be a typo in the algorithm or a simplified representation. In addition, the observation $z$ is only incorporated for $t < 1-\sigma$. While this works, it would be interesting to explore whether a more continuous or adaptive data fidelity term throughout the ODE steps could further improve denoising performance, especially for noise distributions beyond simple Gaussian.
- Impact of $C=1$ Universality: While $C=1$ is shown empirically to be optimal for the tested scenarios, it is worth considering whether specific, highly complex degradation types or very diverse datasets might benefit from a dynamically chosen or slightly higher $C$. However, the authors' empirical evidence suggests the choice is robust.
- Computational Cost of Velocity Field Evaluation: While the ODE integration is faster than diffusion sampling, the underlying velocity field $v_{\theta,t}$ is still a deep neural network (U-Net). The speed-up comes primarily from fewer total evaluations due to straighter trajectories. Further optimization of the velocity field inference itself (e.g., knowledge distillation, smaller models) could improve real-time performance further.
- Memory Footprint: Flow Matching models, like diffusion models, can be memory-intensive due to their U-Net architecture. Although the paper does not discuss it explicitly, the memory footprint could be a consideration for very high-resolution images or embedded systems.
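The denoising condition $t < 1-\sigma$ mentioned above can be made tangible with a short sketch that lists which ODE steps would use the observation $z$. It assumes a uniform time grid $t_k = k/N$ over $N$ steps, which is an assumption of this example rather than something confirmed by the paper; `observation_steps` is a hypothetical helper name.

```python
def observation_steps(n_steps, sigma):
    """Indices k whose time t_k = k / n_steps satisfies t_k < 1 - sigma.

    ASSUMPTION: uniform time grid with t = 0 as noise and t = 1 as data.
    """
    return [k for k in range(n_steps) if k / n_steps < 1.0 - sigma]


# With 8 steps and sigma = 0.25, the observation guides steps 0 through 5;
# the remaining steps rely on the learned prior alone.
guided = observation_steps(8, 0.25)
```

Under this reading, a larger noise level shrinks the guided portion of the trajectory, which is why an adaptive fidelity term over the remaining steps is a natural thing to explore.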

Overall, Restora-Flow represents a significant step forward, showcasing the practical utility and efficiency of Flow Matching in a crucial application domain. Its elegant design and empirical success make it a strong candidate for future generative prior-based image restoration methods.
