
Unsupervised Degradation Representation Learning for Unpaired Restoration of Images and Point Clouds

Published: 10/30/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper presents an unsupervised degradation representation learning scheme to address challenges in unpaired restoration of images and point clouds, utilizing degradation-aware convolutions for flexible adaptation to diverse degradations, ultimately establishing a generic framework whose instantiations (UnIRnet and UnPRnet) demonstrate state-of-the-art performance on unpaired image and point cloud restoration tasks.

Abstract

Restoration tasks in low-level vision aim to restore high-quality (HQ) data from their low-quality (LQ) observations. To circumnavigate the difficulty of acquiring paired data in real scenarios, unpaired approaches that aim to restore HQ data solely on unpaired data are drawing increasing interest. Since restoration tasks are tightly coupled with the degradation model, unknown and highly diverse degradations in real scenarios make learning from unpaired data quite challenging. In this paper, we propose a degradation representation learning scheme to address this challenge. By learning to distinguish various degradations in the representation space, our degradation representations can extract implicit degradation information in an unsupervised manner. Moreover, to handle diverse degradations, we develop degradation-aware (DA) convolutions with flexible adaption to various degradations to fully exploit the degradation information in the learned representations. Based on our degradation representations and DA convolutions, we introduce a generic framework for unpaired restoration tasks. Based on our framework, we propose UnIRnet and UnPRnet for unpaired image and point cloud restoration tasks, respectively. It is demonstrated that our degradation representation learning scheme can extract discriminative representations to obtain accurate degradation information. Experiments on unpaired image and point cloud restoration tasks show that our UnIRnet and UnPRnet achieve state-of-the-art performance.


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of this paper is "Unsupervised Degradation Representation Learning for Unpaired Restoration of Images and Point Clouds."

1.2. Authors

The authors of this paper are:

  • Longguang Wang

  • Yulan Guo (Senior Member, IEEE)

  • Yingqian Wang

  • Xiaoyu Dong

  • Qingyu Xu

  • Jungang Yang

  • Wei An

    Their affiliations indicate a background in electrical engineering, information and communication engineering, complexity science and engineering, and advanced intelligence project research, with a strong focus on low-level vision, 3D vision, image processing, and neural networks. Longguang Wang, Yulan Guo, Yingqian Wang, Qingyu Xu, Jungang Yang, and Wei An are primarily affiliated with the National University of Defense Technology (NUDT), China, while Xiaoyu Dong is associated with The University of Tokyo, Japan, and RIKEN Center for Advanced Intelligence Project (AIP).

1.3. Journal/Conference

The paper was published in IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI). This journal is highly reputable and influential in the fields of computer vision, pattern recognition, and machine intelligence, and frequently publishes state-of-the-art research.

1.4. Publication Year

The paper was published on October 30, 2024 (UTC).

1.5. Abstract

The paper addresses the challenge of image and point cloud restoration, particularly in scenarios where paired high-quality (HQ) and low-quality (LQ) data are difficult to obtain. Current methods often rely on synthetic paired data, which may not accurately reflect real-world degradations. To overcome this, the authors propose an unsupervised degradation representation learning scheme. This scheme learns to distinguish various degradations in a representation space, extracting implicit degradation information without explicit supervision. To handle diverse degradations, they introduce degradation-aware (DA) convolutions that adapt flexibly to different degradation types, effectively exploiting the learned degradation information. Based on these concepts, they develop a generic framework for unpaired restoration tasks, leading to specific network implementations: UnIRnet for unpaired image restoration and UnPRnet for unpaired point cloud restoration. The results demonstrate that their degradation representation learning scheme extracts discriminative representations for accurate degradation information, and both UnIRnet and UnPRnet achieve state-of-the-art performance on their respective unpaired restoration tasks.

The original source link is /files/papers/6932aa82574a23595ada7188/paper.pdf. It is presented as a PDF link, indicating it is an officially published paper or a final preprint version.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the restoration of high-quality (HQ) images and point clouds from low-quality (LQ) observations in real-world scenarios, especially when paired HQ-LQ data is unavailable.

This problem is crucial in low-level vision because acquired images and point clouds are often degraded by factors like blurs, noises, downsampling, quantization, and compression due to limitations in optical and electrical systems. These degradations severely impact perceptual quality and hinder the performance of downstream tasks (e.g., object recognition, 3D reconstruction).

Current state-of-the-art learning-based methods typically rely on paired HQ-LQ data for training. However, acquiring such paired data in real-world scenarios is extremely difficult, if not impossible. Researchers often resort to synthesizing LQ data from HQ data using pre-defined degradation models. The significant challenge here is that real-world degradations are highly diverse, complex, and often unknown, and these synthetic degradation models, despite leveraging expert knowledge, cannot fully cover the vast spectrum of real degradations. This "domain gap" between synthetic and real degradations severely limits the performance of existing methods on real-world data.

The paper identifies two major challenges for unpaired restoration:

  1. Real degradations are unknown: Restoration tasks are intrinsically linked to the degradation model. Accurate degradation information can significantly improve restoration performance. However, without ground truth degradations in real scenarios, existing methods trained on synthetic data cannot extract this crucial information.

  2. Real degradations are highly diverse: Existing unpaired methods (often GAN-based) assume LQ data follows a specific distribution and attempt to synthesize pseudo LQ data. However, due to the high diversity of real degradations, these GANs often suffer from mode collapse (failure to generate diverse outputs) and cannot produce LQ data as varied as real observations. Furthermore, directly concatenating degradation representations with image features for processing by standard convolutions can introduce interference due to the domain gap between these feature types.

    The paper's innovative entry point is to tackle these challenges by proposing an unsupervised degradation representation learning scheme. Instead of explicitly estimating specific degradation parameters (which are unknown), it learns to implicitly represent and distinguish different degradations. This learned representation then guides the synthesis of diverse pseudo LQ data and enables degradation-aware restoration.

2.2. Main Contributions / Findings

The paper makes several significant contributions to address the challenges of unpaired image and point cloud restoration:

  1. Unsupervised Degradation Representation Learning: They introduce a novel scheme that extracts implicit degradation information from LQ data in an unsupervised manner. This is achieved by learning to distinguish various degradations in a representation space using a contrastive learning framework. This is a pioneering technique that does not rely on ground-truth degradation supervision, making it highly practical for real-world scenarios.

  2. High-Diversity LQ Data Synthesis: The proposed method synthesizes pseudo LQ data from HQ data conditioned on the degradation representation of an unpaired LQ sample. By encouraging the synthetic data to have degradation representations similar to those of the unpaired LQ data, their approach generates pseudo LQ data with significantly higher degradation diversity, better covering real-world degradation variations.

  3. Flexible Degradation Adaptation with Degradation-Aware (DA) Convolutions: They develop DA convolutions that dynamically predict convolutional kernels and channel-wise modulation coefficients based on the learned degradation representations. This allows the restoration network to flexibly adapt to different degradation types, overcoming the limitations of directly concatenating degradation information with image features.

  4. Generic Unpaired Restoration Framework: They introduce a generic framework that integrates the unsupervised degradation representation learning and DA convolutions. This framework is applicable to different data types and restoration tasks.

  5. State-of-the-Art Performance: Based on their framework, they propose UnIRnet for unpaired image restoration and UnPRnet for unpaired point cloud restoration. Extensive experiments demonstrate that both networks achieve state-of-the-art performance on their respective tasks, outperforming existing methods on both synthetic and real-world datasets.

  6. Extension and Generalization: This paper extends their previous conference version by developing a more generic degradation representation learning scheme applicable to different data types (images and point clouds), extending it from paired to unpaired restoration, and providing more comprehensive experiments and analyses.

    In summary, the paper's key conclusion is that by implicitly learning and representing degradation information in an unsupervised manner, it is possible to effectively perform unpaired restoration of images and point clouds, generating diverse pseudo-paired data and enabling adaptive restoration networks, leading to superior performance in real-world scenarios.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a beginner should be familiar with the following foundational concepts:

  • Low-Level Vision: A subfield of computer vision that deals with processing images at a low level, often involving tasks like image enhancement, restoration, denoising, deblurring, super-resolution, and point cloud processing. These tasks typically operate directly on pixel values or point coordinates to improve data quality.
  • High-Quality (HQ) vs. Low-Quality (LQ) Data:
    • HQ Data: Ideal, pristine data (e.g., clear images, dense and accurate point clouds) without degradation. Often referred to as ground truth.
    • LQ Data: Degraded observations of HQ data (e.g., blurry images, noisy point clouds, downsampled images). The goal of restoration is to recover HQ data from LQ observations.
  • Image Restoration Tasks: A collection of inverse problems in image processing aiming to recover an original, pristine image from a degraded version. Examples include:
    • Image Denoising: Removing unwanted noise from an image.
    • Image Deblurring: Reversing the effect of blurring.
    • Image Super-Resolution (SR): Reconstructing a high-resolution image from a low-resolution input.
    • Compression Artifacts Reduction: Removing visual distortions caused by data compression (e.g., JPEG compression).
  • Point Cloud Restoration Tasks: Similar to image restoration, but applied to 3D point cloud data. Examples include:
    • Point Cloud Denoising: Removing noise from 3D point coordinates or attributes (e.g., color).
    • Point Cloud Upsampling: Increasing the density of points in a point cloud.
    • Point Cloud Completion: Filling in missing parts of a partial point cloud.
  • Ill-Posed Problem: In mathematics, an inverse problem is ill-posed if a unique solution does not exist or if the solution does not depend continuously on the initial data. Image and point cloud restoration are ill-posed because multiple HQ inputs could result in the same LQ output, making the recovery of the original HQ data ambiguous without additional information or constraints.
  • Paired vs. Unpaired Data:
    • Paired Data: Datasets where each LQ observation has a corresponding HQ ground truth. This is ideal for supervised learning. For example, a blurry image and its perfectly sharp version.
    • Unpaired Data: Datasets where LQ observations and HQ examples exist, but there's no direct one-to-one correspondence between them. For example, a collection of blurry real-world images and a separate collection of sharp real-world images, but no clear image is known for any specific blurry image.
  • Deep Learning / Neural Networks: A subset of machine learning that uses multi-layered neural networks (often called deep neural networks) to learn complex patterns from data.
    • Convolutional Neural Networks (CNNs): A type of deep neural network particularly effective for processing grid-like data such as images. They use convolutional layers that apply learnable filters to input data to extract features.
    • Multilayer Perceptron (MLP): A basic type of neural network consisting of at least three layers of nodes: an input layer, a hidden layer, and an output layer. Each node (except for the input nodes) is a neuron that uses a nonlinear activation function.
    • Encoder-Decoder Architecture: A common neural network design where an encoder compresses input data into a lower-dimensional latent space (feature representation), and a decoder reconstructs the desired output from this latent representation. Often used in generative models and autoencoders.
    • Residual Block / Residual Connection: A technique introduced in ResNet to train very deep neural networks. It involves adding the input of a layer directly to its output, allowing the network to learn residual functions (changes from the identity mapping) rather than entirely new mappings. This helps mitigate the vanishing gradient problem.
    • Batch Normalization (BN): A technique to standardize the inputs to layers in a neural network, which helps stabilize and speed up the training process.
    • Leaky ReLU: An activation function similar to ReLU but allowing a small, non-zero gradient when the input is negative. This helps prevent the dying ReLU problem, where neurons become inactive and stop learning.
  • Generative Adversarial Networks (GANs): A class of neural networks where two networks, a generator and a discriminator, compete against each other. The generator tries to create realistic data (e.g., fake images), while the discriminator tries to distinguish between real and fake data. This adversarial process drives the generator to produce increasingly realistic outputs.
    • Mode Collapse: A common problem in GAN training where the generator produces a limited variety of outputs, failing to capture the full diversity of the real data distribution.
  • Contrastive Learning: A self-supervised learning paradigm where a model learns representations by minimizing the distance between positive pairs (different views of the same data sample) and maximizing the distance between negative pairs (different data samples). This helps the model learn to distinguish between different instances.
    • InfoNCE Loss: A specific form of contrastive loss commonly used, derived from Noise-Contrastive Estimation. It encourages the model to classify a query's positive sample among a set of negative samples.
    • Momentum Encoder (MoCo): A technique used in contrastive learning to maintain a large and consistent set of negative samples. It uses a slowly updated (momentum) copy of the encoder to encode negative keys, which helps stabilize training and allows for larger effective batch sizes without requiring very large physical batches.
  • Dynamic Convolutions / Dynamic Networks: Neural networks where the parameters (e.g., convolutional kernels, activation functions) are not fixed but are dynamically generated or adapted based on the input data. This allows for greater flexibility and adaptability to varying inputs.
    • Hypernetworks: A neural network that generates the weights of another neural network.
  • Evaluation Metrics:
    • Peak Signal-to-Noise Ratio (PSNR): A common metric to quantify image quality, especially for reconstruction. Higher PSNR indicates better quality.
    • Structural Similarity Index (SSIM): A perceptual metric that quantifies the similarity between two images, considering luminance, contrast, and structure. Closer to 1 indicates higher similarity.
    • Learned Perceptual Image Patch Similarity (LPIPS): A metric that uses a deep neural network to measure the perceptual similarity between two images, often correlating better with human judgment than PSNR or SSIM. Lower LPIPS indicates higher perceptual similarity.
    • Chamfer Distance (CD): A metric used to measure the similarity between two point clouds. It sums (or averages) the squared minimum distances from points in one set to the nearest points in the other set, and vice versa. Lower CD indicates higher similarity (a common formulation is sketched after this list).
    • Point-to-Mesh Distance (P2M): A metric used to evaluate point cloud reconstruction quality, measuring the distance from reconstructed points to a reference ground-truth mesh. Lower P2M indicates higher accuracy.
    • Naturalness Image Quality Evaluator (NIQE): A no-reference image quality metric that measures the naturalness of an image. Lower NIQE scores indicate better perceptual quality.
    • Convolutional Neural Network-based Image Quality Assessment (CNNIQA): A no-reference image quality metric that uses a CNN to assess image quality.
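
For concreteness, one common formulation of the Chamfer Distance mentioned above is given below. Variants differ in whether the distances are squared and whether each sum is averaged over the point count, so this should be read as a representative convention rather than the exact definition used in the paper.

```latex
% One common (averaged, squared) form of the Chamfer Distance between point sets P and Q
d_{\mathrm{CD}}(P, Q) =
  \frac{1}{|P|} \sum_{p \in P} \min_{q \in Q} \lVert p - q \rVert_2^2
+ \frac{1}{|Q|} \sum_{q \in Q} \min_{p \in P} \lVert q - p \rVert_2^2
```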

3.2. Previous Works

The paper reviews image restoration and point cloud restoration methods, focusing on deep learning approaches, and discusses dynamic convolutions and contrastive learning.

3.2.1. Image Restoration

  • Paired Image Restoration: These methods rely on paired LQ-HQ images, typically using synthetic degradations.

    • Single Degradation: Early works focused on specific tasks:
      • Image Denoising: DnCNN [24] used CNNs for learning noisy-to-clean mappings. Guo et al. [25] focused on blind denoising of real photographs.
      • Image Deblurring: Sun et al. [33] used CNNs to predict motion blur. Nah et al. [27] and Tao et al. [26] developed networks for deblurring.
      • Image Super-Resolution (SR): Dong et al. [2, 29] pioneered CNNs for SR and compression artifacts. EDSR [34] and RCAN [9] introduced very deep residual networks for SR.
    • Versatile Networks for Multiple Degradations:
      • RDN [35]: Combines residual learning and dense connections.
      • PAN [36]: Uses pyramid attention for multi-scale features.
      • Zamir et al. [37]: Multi-stage architecture for progressive restoration.
      • SwinIR [11]: Adapts Swin Transformer for image restoration.
    • Handling Complicated Degradations (with supervision):
      • Zero-shot methods (ZSSR [38], Soh et al. [39]): Adapt to complex degradations without prior training on them, often by training an internal network during test time.
      • Degradation-aware methods (Zhang et al. [19], IKC [20]): Use degradation information (e.g., blur kernel) as an additional input to adapt the network.
      • Model-based frameworks (Zhang et al. [40]): Integrate CNN denoisers into optimization algorithms.
      • Practical degradation models (Wang et al. [14], Zhang et al. [13]): Propose more realistic synthetic degradation models to improve performance on real images.
  • Unpaired Image Restoration: These methods train directly on unpaired images, aiming to bridge the domain gap.

    • Domain-Specific Deblurring: Lu et al. [42] used CycleGAN [41] and disentangled representations.
    • Unsupervised Denoising (single noisy image): Alexander et al. [43] and Tao et al. [44] learn denoising from single images, assuming spatially uncorrelated noise. This limits their applicability to complex degradations like blur.
    • GAN-based Degradation Modeling:
      • Bulat et al. [15], Lugmayr et al. [16]: Train a degradation network to synthesize pseudo LQ images, then use these pairs for SR.
      • Yuan et al. [45] (CinCGAN), Maeda et al. [17]: Unified frameworks to learn both degradation and SR networks simultaneously.
      • Liu et al. [46], Yang et al. [47]: Incorporate physical properties as regularizers.
      • Limitation of GANs: These often learn deterministic mappings, ignoring the stochasticity of degradations, leading to mode collapse and limited diversity.
      • DeFlow [48]: Uses conditional flows to model stochastic degradations, but with high computational cost.

3.2.2. Point Cloud Restoration

  • Paired Point Cloud Restoration:

    • PointProNet [51]: Denoises point patches by projecting them to learned local frames.
    • PU-Net [52]: Reconstructs high-resolution point clouds from low-resolution ones.
    • EC-Net [53]: Edge-aware point cloud consolidation.
  • Unpaired Point Cloud Restoration:

    • Hermosilla et al. [54] (Total Denoising): Extends unpaired image denoising methods to point clouds, limited to denoising and spatially uncorrelated noise.
    • Wen et al. [55] (Cycle4Completion): Unpaired point cloud completion using cycle transformation.

3.2.3. Dynamic Convolutions

Networks with dynamic convolutions parameterize filters conditioned on the input.

  • Hypernetworks [57], [58]: Generate convolutional filters using another network.
  • CondConv [59], WeightNet [60]: Combine multiple expert kernels or dynamically assemble basic kernels.
  • Image Restoration:
    • CResMD [61]: Uses controllable residual connections for interactive restoration.
    • ArbSR [62]: Customizes dynamic convolutions for scale-arbitrary SR.
  • Point Cloud Processing:
    • PointConv [63]: Uses MLPs to dynamically synthesize filters for each point based on relative coordinates.
    • PAConv [64]: Position adaptive convolution with dynamic kernel assembling.
    • Chen et al. [65]: Rotation-invariant convolution with pose-adapted filters.

3.2.4. Contrastive Learning

Effective for unsupervised representation learning by maximizing mutual information.

  • Previous methods: Doersch et al. [66], Zhang et al. [67], Noroozi et al. [68], Gidaris et al. [69] focused on predicting context or learning counts.
  • Modern approaches: Maximize mutual information between different views of the same data.
    • Wu et al. [70]: Non-parametric instance discrimination.
    • SimCLR [71]: A simple framework for contrastive learning with large batch sizes.
    • MoCo [72], MoCo v2 [77]: Use a momentum encoder to maintain a large dictionary of negative samples, enabling contrastive learning with smaller batch sizes.
    • Tian et al. [73]: Contrastive multiview coding.
    • van den Oord et al. [74], Hénaff et al. [75]: Contrastive predictive coding.
    • Radford et al. [76] (CLIP): Learning visual models from natural language supervision.
    • Park et al. [78]: Contrastive learning for unpaired image-to-image translation.

3.3. Technological Evolution

The field of image and point cloud restoration has evolved from traditional signal processing methods using a priori information (e.g., smoothness, sparsity, low rankness) to data-driven deep learning approaches. Initially, deep learning methods focused on paired data and specific degradations, then moved towards versatile networks for multiple degradations. The major shift, and where this paper fits, is towards unpaired restoration due to the difficulty of acquiring real-world paired data. This transition is marked by attempts to model real degradations using GANs, but these often struggle with mode collapse and limited diversity.

This paper represents a crucial step in this evolution by moving beyond explicit degradation estimation (which often requires supervision or is computationally expensive) and deterministic GAN-based synthesis. It leverages contrastive learning to implicitly and unsupervisedly learn degradation representations, which provides a more robust and generalizable way to understand and mimic diverse real-world degradations. The integration of dynamic convolutions further allows the restoration network to truly "understand" and adapt to these diverse degradations, rather than just passively receiving degradation parameters. This positions the paper at the forefront of unsupervised and adaptive restoration techniques for real-world scenarios.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

  1. Unsupervised Degradation Information Extraction:

    • Differentiation: Unlike paired methods that rely on ground-truth degradation (e.g., blur kernels in IKC [20], DAN [87]) or zero-shot methods (ZSSR [38]) that estimate kernels at test time (which is slow), this paper's degradation representation learning extracts implicit degradation information in an entirely unsupervised manner. It learns to distinguish degradations rather than explicitly estimate them, making it practical for unknown real-world degradations and much more efficient.
    • Innovation: This is a novel way to get degradation information without explicit supervision, addressing a major bottleneck for real-world applications.
  2. High-Diversity Unpaired Data Synthesis:

    • Differentiation: Previous unpaired GAN-based methods (CinCGAN [45], Lugmayr et al. [16], DeFlow [48]) often struggle with mode collapse and produce pseudo LQ data with limited diversity because they try to model the complex p(y) distribution directly.
    • Innovation: This paper explicitly decouples the generation process by modeling $p(y \mid x; d)\,p(d)$. By conditioning the synthesis on learned degradation representations ($R^{LQ}$) from real LQ data and using a degradation consistency loss, it generates pseudo LQ data that precisely mimic the diverse degradations present in the real unpaired LQ dataset. This leads to significantly higher diversity in synthetic training data.
  3. Flexible Adaptation with Degradation-Aware (DA) Convolutions:

    • Differentiation: Existing multi-degradation restoration networks (Zhang et al. [19], Xu et al. [81]) often concatenate degradation representations directly with image features. This can cause interference due to the domain gap. Dynamic convolution methods exist (CResMD [61], ArbSR [62]) but are not specifically designed for unsupervised degradation learning or generic unpaired restoration as comprehensively.
    • Innovation: The DA convolution innovatively uses the learned degradation representations to dynamically predict convolutional kernels and channel-wise modulation coefficients. This allows for a much more flexible and adaptive response to various degradations without the domain gap issue, leading to better performance.
  4. Generic Framework for Multiple Data Types:

    • Differentiation: Many unpaired restoration methods are specific to image super-resolution or denoising. Point cloud restoration, especially unpaired, is relatively underexplored.

    • Innovation: The proposed framework is generic and successfully applied to both unpaired image restoration (UnIRnet) and unpaired point cloud restoration (UnPRnet), demonstrating its broad applicability and effectiveness across different data modalities. This is the first work to attempt unpaired point cloud restoration under complicated degradations in this manner.

      In essence, the paper's core innovation lies in its unique, unsupervised approach to understanding and utilizing degradation information, which then drives a more effective pseudo-data synthesis strategy and a truly adaptive restoration network architecture.

4. Methodology

The proposed methodology addresses the challenges of unpaired image and point cloud restoration through an unsupervised degradation representation learning scheme and a generic framework built upon it. The framework consists of an encoder, a degrader, and a generator, and operates in two stages: LQ data synthesis and HQ data restoration.

4.1. Principles

The core idea is to bypass the need for paired HQ-LQ data and explicit degradation knowledge by:

  1. Implicitly learning degradation characteristics: Instead of trying to estimate specific degradation parameters (like blur kernel sizes or noise levels), the method learns a compact representation (a degradation representation) that can distinguish different types of degradations. This is achieved in an unsupervised manner using contrastive learning. The assumption is that patches within the same LQ image share the same degradation, while patches from different LQ images may have different degradations.
  2. Synthesizing diverse pseudo LQ data: The learned degradation representations from real unpaired LQ data are then used to guide a degrader network. This degrader takes an HQ image and the degradation representation of a real LQ image as input, and synthesizes a new LQ image that mimics the degradation of the guiding real LQ image. This generates diverse pseudo-paired data for training the main restoration network.
  3. Degradation-aware restoration: A generator network performs the actual restoration. It incorporates the learned degradation representation directly into its convolutional layers (degradation-aware (DA) convolutions), allowing it to dynamically adapt its processing to the specific degradation present in the input LQ data.

4.2. Core Methodology In-depth (Layer by Layer)

The framework has two main stages: LQ data synthesis and HQ data restoration. The overall workflow can be summarized as follows: During training, unpaired HQ data ($x \sim p(x)$) and LQ data ($y \sim p(y)$) are used.

  1. The encoder learns degradation representations ($R^{LQ}$) from real LQ data ($y$) in an unsupervised manner.

  2. The degrader takes HQ data ($x$) and a degradation representation ($R^{LQ}$) from a real LQ sample, and synthesizes pseudo LQ data ($y_{pse}$). This $y_{pse}$ is designed to have a degradation similar to the $y$ that provided $R^{LQ}$.

  3. The generator then learns to restore the HQ data ($x$) from this pseudo LQ data ($y_{pse}$), guided by its degradation representation ($R_{pse}^{LQ}$).

    During testing, real LQ data ($y$) is fed to the encoder to extract $R^{LQ}$. This $R^{LQ}$ then guides the generator to restore the final HQ data ($x^{out}$).

The following are the detailed components:

4.2.1. Degradation Representation Learning (Encoder)

The encoder is responsible for extracting a discriminative degradation representation from LQ data in an unsupervised manner. This is crucial because real degradations are unknown. The core idea is that degradation is consistent within a single image/point cloud but varies across different ones.

The method employs a contrastive learning framework, similar to MoCo [72].

Formulation:

  1. Query, Positive, and Negative Samples:

    • For an LQ image (or point cloud), a randomly cropped patch serves as the query patch.
    • Another patch extracted from the same LQ image is considered a positive sample (they share the same degradation).
    • Patches from other LQ images (which inherently have different degradations) are considered negative samples.
  2. Encoding: The query, positive, and negative patches are fed into an encoder network to produce initial representations.

  3. Projection Head: These representations are then passed through a two-layer Multilayer Perceptron (MLP) projection head to obtain final normalized representations: $z$ (for the query), $z^{+}$ (for the positive), and $z^{-}$ (for the negatives).

  4. Similarity Maximization/Minimization: The goal is to make $z$ similar to $z^{+}$ and dissimilar to $z^{-}$. This is achieved using the InfoNCE loss.

    The InfoNCE loss for a single query $z$ is defined as: $ \mathcal{L}_z = - \log \frac{ \exp\left( \frac{z^T \cdot z^{+}}{\tau} \right) }{ \exp\left( \frac{z^T \cdot z^{+}}{\tau} \right) + \sum_{n=1}^{N} \exp\left( \frac{z^T \cdot z_n^{-}}{\tau} \right) } $ Where:

  • $z$: The degradation representation of the query patch.

  • $z^{+}$: The degradation representation of the positive sample (from the same LQ image as $z$).

  • $z_n^{-}$: The degradation representation of the $n$-th negative sample (from a different LQ image).

  • $\cdot$: The dot product, which measures the similarity between two vectors.

  • $\tau$: A temperature hyper-parameter that scales the logits before the softmax function, influencing the sharpness of the distribution. A smaller $\tau$ makes the distribution sharper, enforcing stronger separation between positive and negative pairs.

  • $N$: The total number of negative samples in the batch or queue.

  • $\exp(\cdot)$: The exponential function. The term $\exp(A/\tau)$ can be interpreted as a similarity score, where larger values indicate higher similarity.

  • The numerator $\exp\left( \frac{z^T \cdot z^{+}}{\tau} \right)$ represents the similarity between the query and its positive pair.

  • The denominator $\exp\left( \frac{z^T \cdot z^{+}}{\tau} \right) + \sum_{n=1}^{N} \exp\left( \frac{z^T \cdot z_n^{-}}{\tau} \right)$ sums the similarity of the query with its positive pair and all negative pairs.

  • The ratio inside the logarithm is essentially a softmax probability, representing the probability that the positive sample is correctly identified among all samples. Minimizing the negative logarithm of this probability maximizes it.

    To ensure content-invariant degradation representations (meaning the representation should capture degradation type, not image content), a queue is maintained, storing representations of samples with diverse contents and degradations. During training, $B$ LQ images (representing $B$ different degradations) are randomly selected and two patches are cropped from each image. For the $i$-th image, $R_{i,1}^{LQ}$ and $R_{i,2}^{LQ}$ are the degradation representations of its two patches, serving as query and positive sample, respectively.

The overall degradation contrastive loss ($\mathcal{L}_{deg}$) is computed over a batch of $B$ images: $ \mathcal{L}_{deg} = \sum_{i=1}^{B} - \log \frac{ \exp\left( \frac{R_{i,1}^{LQ} \cdot R_{i,2}^{LQ}}{\tau} \right) }{ \exp\left( \frac{R_{i,1}^{LQ} \cdot R_{i,2}^{LQ}}{\tau} \right) + \sum_{j=1}^{N_{queue}} \exp\left( \frac{R_{i,1}^{LQ} \cdot R_j^{queue}}{\tau} \right) } $ Where:

  • $R_{i,1}^{LQ}$: Degradation representation of the first patch (query) from the $i$-th LQ image.
  • $R_{i,2}^{LQ}$: Degradation representation of the second patch (positive) from the $i$-th LQ image.
  • $N_{queue}$: The number of samples currently stored in the queue (these serve as negative samples for the current batch).
  • $R_j^{queue}$: The $j$-th negative sample from the queue.
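
The following is a minimal PyTorch-style sketch of this queue-based contrastive loss. It assumes L2-normalized representations of shape (B, D) and a negative queue of shape (N_queue, D); names such as `degradation_contrastive_loss` are illustrative rather than taken from the paper's code.

```python
import torch
import torch.nn.functional as F

def degradation_contrastive_loss(r_query, r_positive, queue, tau=0.07):
    """Queue-based InfoNCE loss over a batch of degradation representations.

    r_query:    (B, D) representations of the first patch of each LQ image.
    r_positive: (B, D) representations of the second patch of the same images.
    queue:      (N_queue, D) representations of patches from other LQ images
                (negatives), e.g. maintained with a MoCo-style momentum encoder.
    """
    r_query = F.normalize(r_query, dim=1)
    r_positive = F.normalize(r_positive, dim=1)
    queue = F.normalize(queue, dim=1)

    # Positive logits: similarity between the two patches of the same image.
    l_pos = (r_query * r_positive).sum(dim=1, keepdim=True)   # (B, 1)
    # Negative logits: similarity against every representation in the queue.
    l_neg = r_query @ queue.t()                                # (B, N_queue)

    logits = torch.cat([l_pos, l_neg], dim=1) / tau            # (B, 1 + N_queue)
    # The positive sample is always at index 0, so cross-entropy with
    # zero labels reproduces the -log softmax term of the InfoNCE loss.
    labels = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, labels)
```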

Encoder Architecture (Image): As illustrated in Fig. 3(a), the image encoder consists of eight $3 \times 3$ convolutional layers across four different resolution levels. Each convolutional layer is followed by a Batch Normalization (BN) layer and a Leaky ReLU activation function. An average pooling layer is applied after the final convolutional layer to obtain the degradation representation $R^{LQ}$.
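
A minimal sketch of such an encoder is given below. The channel widths, strides, and representation dimension are assumptions for illustration (the paper only specifies eight 3×3 convolutions over four resolution levels, each followed by BN and Leaky ReLU, plus a final average pooling); the two-layer MLP projection head used during contrastive training is included for completeness.

```python
import torch.nn as nn

class DegradationEncoder(nn.Module):
    """Illustrative image encoder: eight 3x3 convs over four resolution levels,
    each followed by BN + LeakyReLU, then global average pooling."""

    def __init__(self, rep_dim=256):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 3, stride=stride, padding=1),
                nn.BatchNorm2d(c_out),
                nn.LeakyReLU(0.1, inplace=True),
            )
        # Two convolutions per resolution level; strided convs halve the resolution.
        self.body = nn.Sequential(
            block(3, 64, 1),    block(64, 64, 1),        # level 1 (full resolution)
            block(64, 128, 2),  block(128, 128, 1),      # level 2
            block(128, 256, 2), block(256, 256, 1),      # level 3
            block(256, rep_dim, 2), block(rep_dim, rep_dim, 1),  # level 4
        )
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Two-layer MLP projection head used only for the contrastive loss.
        self.proj = nn.Sequential(nn.Linear(rep_dim, rep_dim),
                                  nn.LeakyReLU(0.1, inplace=True),
                                  nn.Linear(rep_dim, rep_dim))

    def forward(self, lq_patch):
        feat = self.pool(self.body(lq_patch)).flatten(1)  # degradation representation R^LQ
        return feat, self.proj(feat)                      # (R^LQ, projected z for the loss)
```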

Encoder Architecture (Point Cloud): As illustrated in Fig. 10(a), the point cloud encoder first uses an FC (Fully Connected) layer for initial feature extraction. These features then pass through a four-stage structure. Each stage comprises a point convolution (specifically, geometry-aware point convolution [96]) and an FC layer, followed by a BN layer and a Leaky ReLU activation. An average pooling layer after the last convolutional layer yields the degradation representation $R^{LQ}$.

4.2.2. Degradation-Aware LQ Data Synthesis (Degrader)

This stage aims to synthesize pseudo LQ data ($y_{pse}$) from HQ data ($x$) by mimicking the degradation of unpaired real LQ data ($y$). The key principle is to model the factorization $p(y \mid x; d)\,p(d)$ rather than $p(y)$ directly, where $d$ represents the degradation. The encoder learns $p_{\theta_E}(d \mid y)$ to approximate $p(d)$, and the degrader models $p(y \mid x; d)$.

Degrader Architecture (Image): As illustrated in Fig. 3(b), the image degrader uses an encoder-decoder architecture:

  1. HQ Input Processing: The input HQ image $I^{HQ} \in \mathbb{R}^{H \times W \times 3}$ is fed into five $3 \times 3$ convolutional layers with stride 2, resulting in a latent feature $F_d \in \mathbb{R}^{\frac{H}{32} \times \frac{W}{32} \times 192}$.
  2. Degradation Representation Input: The degradation representation $R^{LQ}$ (extracted from an unpaired real LQ image by the encoder) is compressed by an FC layer to produce $R_d$, which guides the degradation process.
  3. Contamination Injection (see the sketch after this list):
    • $F_d$ is first processed by a Degradation-Aware (DA) convolution (detailed below) conditioned on $R_d$ to introduce initial contamination.
    • Noise injection modules are then used. Within these modules, $R_d$ is passed through two FC layers to generate per-channel factors that rescale Gaussian noise. This injects stochasticity into the degradation.
    • The features are then progressively upsampled and passed through subsequent layers, continuing to inject contamination at different resolution levels until the final pseudo LQ image $I_{pse}^{LQ}$ is synthesized.
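
Below is a minimal PyTorch-style sketch of one such noise injection module, following the description above: the compressed degradation code is mapped through two FC layers to per-channel factors that rescale Gaussian noise before it is added to the feature map. Module and argument names are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn as nn

class NoiseInjection(nn.Module):
    """Rescales Gaussian noise with per-channel factors predicted from the
    compressed degradation representation R_d, then adds it to the features."""

    def __init__(self, rd_dim, channels):
        super().__init__()
        self.to_scale = nn.Sequential(
            nn.Linear(rd_dim, channels),
            nn.LeakyReLU(0.1, inplace=True),
            nn.Linear(channels, channels),
        )

    def forward(self, feat, r_d):
        # feat: (B, C, H, W) degrader features; r_d: (B, rd_dim) degradation code.
        scale = self.to_scale(r_d).unsqueeze(-1).unsqueeze(-1)   # (B, C, 1, 1)
        noise = torch.randn_like(feat)                           # stochastic component
        return feat + scale * noise
```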

Degrader Architecture (Point Cloud): As illustrated in Fig. 10(b), the point cloud degrader uses an encoder-decoder architecture with skip connections:

  1. HQ Input Processing: The input HQ point cloud $P^{HQ}$ (coordinates $\mathbb{R}^{N \times 3}$, or coordinates plus RGB $\mathbb{R}^{N \times 6}$) is fed to an FC layer for initial feature extraction.
  2. Degradation Representation Input: The degradation representation $R^{LQ}$ from an unpaired real LQ point cloud is passed to another FC layer for compression, resulting in $R_d$.
  3. Feature Extraction: Four point convolutions extract deep features $F_d \in \mathbb{R}^{\frac{N}{64} \times 128}$. After each point convolution, an average pooling layer downsamples the point cloud by a factor of four (random sampling followed by feature averaging over K-nearest neighbors).
  4. Contamination Injection:
    • $F_d$ is upsampled and fed to a Degradation-Aware (DA) point convolution (detailed below) conditioned on $R_d$ to introduce contamination.
    • Noise injection modules are used to inject noise.
    • The features are then progressively upsampled and passed through subsequent layers to perform contamination injection at different resolutions, resulting in the final pseudo LQ point cloud $P_{pse}^{LQ}$.

Loss Functions for Degrader (Image & Point Cloud): The overall loss for the degrader ($\mathcal{L}_D$) is defined as: $ \mathcal{L}_D = \lambda_{con} \mathcal{L}_{con} + \lambda_{adv} \mathcal{L}_{adv}^{D} + \lambda_{consist} \mathcal{L}_{consist} $ Where:

  • $\mathcal{L}_{con}$: Content loss.
  • $\mathcal{L}_{adv}^{D}$: Adversarial loss for the degrader.
  • $\mathcal{L}_{consist}$: Degradation consistency loss.
  • $\lambda_{con}$, $\lambda_{adv}$, $\lambda_{consist}$: Weighting hyper-parameters (empirically set to 1, 0.01, and 0.005, respectively, in the image experiments).

Content Loss ($\mathcal{L}_{con}$): An L1 loss is used to maintain content consistency between the synthesized LQ data and its original HQ data. For images, a Gaussian filter is applied to both to smooth out high-frequency details, focusing the loss on structural similarity (a minimal sketch follows the symbol list below). $ \mathcal{L}_{con} = \big|\big| g(I_{pse}^{LQ}) - g(I^{HQ}\downarrow) \big|\big|_1 $ Where:

  • $I_{pse}^{LQ}$: The synthetic pseudo LQ image.
  • $I^{HQ}$: The original HQ image.
  • $g(\cdot)$: A $3 \times 3$ Gaussian filter (for images).
  • $\downarrow$: Bicubic downsampling (for images), indicating that the HQ image is also downscaled to match the resolution of the LQ image for comparison.
  • $||\cdot||_1$: The L1 norm (Manhattan distance), which measures the absolute difference between pixel values.
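
A minimal PyTorch-style sketch of this content loss is shown below. The Gaussian kernel weights and the scale factor are assumptions for illustration; the paper only specifies a 3×3 Gaussian filter, bicubic downsampling of the HQ image, and an L1 distance.

```python
import torch
import torch.nn.functional as F

# Fixed 3x3 Gaussian kernel (sigma assumed); applied per channel via a grouped conv.
_GAUSS = torch.tensor([[1., 2., 1.],
                       [2., 4., 2.],
                       [1., 2., 1.]]) / 16.0

def gaussian_blur(img):
    c = img.size(1)
    kernel = _GAUSS.to(img).expand(c, 1, 3, 3).contiguous()
    return F.conv2d(img, kernel, padding=1, groups=c)

def content_loss(lq_pse, hq, scale=4):
    """L1 distance between the blurred pseudo-LQ image and the blurred,
    bicubically downsampled HQ image (assumed SR setting with factor `scale`)."""
    hq_down = F.interpolate(hq, scale_factor=1.0 / scale,
                            mode='bicubic', align_corners=False)
    return F.l1_loss(gaussian_blur(lq_pse), gaussian_blur(hq_down))
```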

Adversarial Loss ($\mathcal{L}_{adv}^{Dis}$ and $\mathcal{L}_{adv}^{D}$): A discriminator network is trained to distinguish between real LQ data and synthetic pseudo LQ data, while the degrader tries to fool this discriminator. For the discriminator: $ \mathcal{L}_{adv}^{Dis} = \mathbb{E}_{I^{LQ}}[\log(1 - \mathrm{Net}_{Dis}(I^{LQ}))] + \mathbb{E}_{I_{pse}^{LQ}}[\log(\mathrm{Net}_{Dis}(I_{pse}^{LQ}))] $ Where:

  • $\mathrm{Net}_{Dis}(\cdot)$: The output of the discriminator network (a probability score, usually between 0 and 1, indicating how "real" the input is).

  • $\mathbb{E}[\cdot]$: Expected value.

  • $I^{LQ}$: A real LQ image (or point cloud).

  • $I_{pse}^{LQ}$: A synthetic pseudo LQ image (or point cloud).

  • The discriminator is trained to minimize this loss: driving $\mathrm{Net}_{Dis}(I^{LQ})$ toward 1 for real data and $\mathrm{Net}_{Dis}(I_{pse}^{LQ})$ toward 0 for synthetic data makes both logarithm terms strongly negative, so the discriminator still learns to assign high scores to real samples and low scores to synthetic ones. Compared with the standard GAN discriminator objective $\mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_z}[\log(1 - D(G(z)))]$, which is maximized, the two log terms are swapped and the objective is minimized, but both formulations drive the discriminator toward the same behavior.

    For the degrader (the generator of LQ data): $ \mathcal{L}_{adv}^{D} = \mathbb{E}_{I_{pse}^{LQ}}[\log(1 - \mathrm{Net}_{Dis}(I_{pse}^{LQ}))] $ Where:

  • The degrader tries to minimize this loss. It wants $\mathrm{Net}_{Dis}(I_{pse}^{LQ})$ to be close to 1 (i.e., to make its synthetic data look real), which makes $\log(1 - \mathrm{Net}_{Dis}(I_{pse}^{LQ}))$ a large negative number and minimizes the loss. This is the standard minimax form of the generator's adversarial loss, with the degrader playing the role of the generator; a code sketch of both adversarial terms follows the discriminator architecture notes below.

    Discriminator Architecture (Image): A network consisting of six convolutional layers, a flattening layer, and a two-layer MLP head.

    Discriminator Architecture (Point Cloud): A network comprising four FC layers, three point convolutional layers, an average pooling layer, and a two-layer MLP head.
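
The sketch below implements the two adversarial terms exactly as written in the equations above, using PyTorch. `net_dis` is assumed to output probabilities in (0, 1) (e.g., via a final sigmoid), and a small epsilon is added inside the logarithms for numerical stability.

```python
import torch

EPS = 1e-8

def discriminator_loss(net_dis, lq_real, lq_pse):
    """L_adv^Dis = E[log(1 - D(real))] + E[log(D(fake))], minimized by the discriminator."""
    d_real = net_dis(lq_real)
    d_fake = net_dis(lq_pse.detach())   # do not backpropagate into the degrader here
    return torch.log(1 - d_real + EPS).mean() + torch.log(d_fake + EPS).mean()

def degrader_adv_loss(net_dis, lq_pse):
    """L_adv^D = E[log(1 - D(fake))], minimized by the degrader to fool the discriminator."""
    return torch.log(1 - net_dis(lq_pse) + EPS).mean()
```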

Degradation Consistency Loss ($\mathcal{L}_{consist}$): This loss ensures that the synthesized pseudo LQ image $I_{pse}^{LQ}$ has degradations similar to those of the unpaired real LQ image $I^{LQ}$ that provided the guiding degradation representation. It uses a contrastive loss of the same form as Equation (2). $ \mathcal{L}_{consist} = \sum_{i=1}^{B} - \log \frac{ \exp\left( \frac{R_{pse}^{LQ} \cdot R^{LQ}}{\tau} \right) }{ \exp\left( \frac{R_{pse}^{LQ} \cdot R^{LQ}}{\tau} \right) + \sum_{j=1}^{N_{queue}} \exp\left( \frac{R_{pse}^{LQ} \cdot R_j^{queue}}{\tau} \right) } $ Where:

  • $R_{pse}^{LQ}$: The degradation representation of the synthetic pseudo LQ image (or point cloud), obtained by feeding $I_{pse}^{LQ}$ through the encoder.
  • $R^{LQ}$: The degradation representation of the corresponding real LQ image (or point cloud) that served as guidance for the synthesis.
  • $R_j^{queue}$: The $j$-th negative sample from the queue (other degradation representations).
  • $\tau$: The temperature hyper-parameter. This loss encourages $R_{pse}^{LQ}$ to be close to $R^{LQ}$ (positive pair) and far from $R_j^{queue}$ (negative pairs), ensuring the synthetic degradation matches the desired one.

4.2.3. Degradation-Aware HQ Data Restoration (Generator)

This stage aims to restore HQ data from LQ data. During training, the generator learns to restore HQ data ($I^{HQ}$ or $P^{HQ}$) from the synthetic pseudo LQ data ($I_{pse}^{LQ}$ or $P_{pse}^{LQ}$), conditioned on its degradation representation ($R_{pse}^{LQ}$). During inference, real LQ data ($I^{LQ}$ or $P^{LQ}$) is fed to the encoder to obtain $R^{LQ}$, which then guides the generator.

Generator Architecture (Image - UnIRnet): As illustrated in Fig. 3(c), the image generator uses Degradation-Aware (DA) blocks as its building blocks and adopts a high-level structure similar to RCAN [9].

  1. Initial Feature Extraction: The input LQ image ($I^{LQ}$ or $I_{pse}^{LQ}$) is fed to a $3 \times 3$ convolution.
  2. Degradation Representation Compression: The input degradation representation ($R^{LQ}$ or $R_{pse}^{LQ}$) is passed to an FC layer for compression, resulting in $R_g$.
  3. Deep Feature Extraction: The initial features are passed through five residual groups, each containing five DA blocks. These blocks extract deep features, conditioned on $R_g$.
  4. Reconstruction: Finally, a reconstructor (e.g., a convolutional layer) produces the HQ image output $I^{out}$.

Generator Architecture (Point Cloud - UnPRnet): As illustrated in Fig. 10(c), the point cloud generator employs an encoder-decoder structure with skip connections and DA blocks.

  1. Initial Feature Extraction: The input LQ point cloud ($P^{LQ}$ or $P_{pse}^{LQ}$) is fed to an FC layer.
  2. Degradation Representation Compression: The degradation representation is passed to an FC layer for compression, resulting in $R_g$.
  3. Encoder Path: The initial features are passed to three DA blocks to extract deep features $F_g \in \mathbb{R}^{\frac{N}{64} \times 128}$. After each DA block, an average pooling layer downsamples the point cloud by a factor of four.
  4. Decoder Path: Three upsampling layers, three DA point convolutions, and an FC layer decode $F_g$ into the HQ point cloud output $P^{out}$.

Degradation-Aware (DA) Convolution (for Images - Fig. 3(d)): This novel convolution adapts to degradations by predicting its kernel and channel-wise modulation coefficients.

  1. Kernel Prediction Branch:
    • The degradation representation $R$ is fed to two FC layers and a reshape layer.
    • This generates a convolutional kernel $w \in \mathbb{R}^{C \times 1 \times 3 \times 3}$ (where $C$ is the number of channels). This $w$ is used for a depth-wise convolution.
    • The input feature $F$ is processed by a $3 \times 3$ depth-wise convolution (using $w$) and a $1 \times 1$ convolution to produce $F_1$.
  2. Modulation Coefficient Prediction Branch:
    • $R$ is passed to another two FC layers and a sigmoid activation layer.
    • This generates channel-wise modulation coefficients $v$.
    • $v$ is used to rescale the channel components of the input feature $F$, resulting in $F_2$.
  3. Output: Finally, $F_1$ and $F_2$ are summed: $F_{out} = F_1 + F_2$.
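
A minimal PyTorch-style sketch of this DA convolution is given below. It handles a batch of per-sample predicted kernels with the usual grouped-convolution trick; the hidden sizes of the FC layers and other details are assumptions for illustration.

```python
import torch.nn as nn
import torch.nn.functional as F

class DAConv(nn.Module):
    """Degradation-aware convolution: a depth-wise kernel and channel-wise
    modulation coefficients are both predicted from the degradation code R."""

    def __init__(self, channels, rep_dim):
        super().__init__()
        # Kernel prediction branch: R -> depth-wise 3x3 kernel (C x 1 x 3 x 3).
        self.kernel_fc = nn.Sequential(
            nn.Linear(rep_dim, rep_dim), nn.LeakyReLU(0.1, inplace=True),
            nn.Linear(rep_dim, channels * 3 * 3),
        )
        self.conv1x1 = nn.Conv2d(channels, channels, 1)
        # Modulation branch: R -> channel-wise coefficients in (0, 1).
        self.mod_fc = nn.Sequential(
            nn.Linear(rep_dim, rep_dim), nn.LeakyReLU(0.1, inplace=True),
            nn.Linear(rep_dim, channels), nn.Sigmoid(),
        )

    def forward(self, feat, rep):
        b, c, h, w = feat.shape
        # Per-sample depth-wise kernels, applied via groups = b * c on a reshaped batch.
        kernel = self.kernel_fc(rep).view(b * c, 1, 3, 3)
        f1 = F.conv2d(feat.view(1, b * c, h, w), kernel, padding=1, groups=b * c)
        f1 = self.conv1x1(f1.view(b, c, h, w))
        # Channel-wise modulation of the input feature.
        f2 = feat * self.mod_fc(rep).view(b, c, 1, 1)
        return f1 + f2
```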

Degradation-Aware (DA) Point Convolution (for Point Clouds): Similar to image DA convolution, it predicts kernel and modulation coefficients.

  1. Kernel Prediction Branch:
    • The degradation representation $R$ is fed to two FC layers and a reshape layer.
    • This produces a kernel $w \in \mathbb{R}^{C \times 3 \times 3 \times 3}$ that serves as the convolutional kernel (look-up table) for a point convolution [96].
  2. Modulation Coefficient Prediction Branch:
    • $R$ is passed to another two FC layers and a sigmoid activation layer.
    • This generates channel-wise modulation coefficients $v$.
  3. Output: $v$ is used to rescale the channel components of the point convolution's output feature (denoted $F_1$), resulting in $F^{out}$.

Loss Function for Generator: A simple L1 loss is used as the restoration loss to train the generator. $ \mathcal{L}_{res} = \left|\left| I^{out} - I^{HQ} \right|\right|_1 $ Where:

  • $I^{out}$: The restored HQ image (or point cloud).
  • $I^{HQ}$: The ground truth HQ image (or point cloud).

4.2.4. Training Strategy

A progressive training strategy is adopted, consisting of three stages:

  • Stage 1: Encoder Training:

    • Objective: Train the encoder to learn discriminative degradation representations.
    • Loss: Only the degradation contrastive loss ($\mathcal{L}_{deg}$, Equation 2) is used.
    • Components trained: Encoder.
  • Stage 2: Degrader Training:

    • Objective: Train the degrader to synthesize pseudo LQ data that mimics diverse and complicated real-world degradations.
    • Loss: The overall degrader loss ($\mathcal{L}_D$, Equation 5) is used, which includes the content loss, the adversarial loss for the degrader, and the degradation consistency loss. A discriminator is simultaneously optimized using its own adversarial loss ($\mathcal{L}_{adv}^{Dis}$, Equation 7).
    • Components trained: Degrader, Discriminator.
    • Frozen components: Encoder (parameters are fixed from Stage 1).
  • Stage 3: Generator Training:

    • Objective: Train the generator to restore HQ data from the pseudo LQ data synthesized by the degrader.
    • Loss: Only the restoration loss ($\mathcal{L}_{res}$, Equation 10) is used.
    • Components trained: Generator.
    • Frozen components: Encoder and Degrader (parameters are fixed from previous stages).
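
This schedule can be summarized with the following Python-style pseudocode sketch. Optimizer details, iteration counts, and the helper functions (`sample_unpaired_batch`, `update`, the individual loss functions, etc.) are assumptions standing in for the concrete training code rather than the paper's implementation.

```python
# Stage 1: train the encoder with the degradation contrastive loss only.
for step in range(stage1_steps):
    lq_patches_a, lq_patches_b = sample_unpaired_batch("LQ")   # two patches per LQ sample
    loss = degradation_contrastive_loss(encoder(lq_patches_a), encoder(lq_patches_b), queue)
    update(encoder, loss)

# Stage 2: freeze the encoder; train the degrader (and discriminator) to synthesize pseudo LQ data.
for step in range(stage2_steps):
    hq, lq = sample_unpaired_batch("HQ"), sample_unpaired_batch("LQ")
    r_lq = encoder(lq).detach()                                 # encoder is frozen
    lq_pse = degrader(hq, r_lq)
    d_loss = discriminator_loss(net_dis, lq, lq_pse)
    g_loss = (lam_con * content_loss(lq_pse, hq)
              + lam_adv * degrader_adv_loss(net_dis, lq_pse)
              + lam_consist * consistency_loss(encoder(lq_pse), r_lq, queue))
    update(net_dis, d_loss)
    update(degrader, g_loss)

# Stage 3: freeze encoder and degrader; train the generator on the pseudo pairs.
for step in range(stage3_steps):
    hq, lq = sample_unpaired_batch("HQ"), sample_unpaired_batch("LQ")
    lq_pse = degrader(hq, encoder(lq)).detach()
    out = generator(lq_pse, encoder(lq_pse))
    update(generator, l1_loss(out, hq))
```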

5. Experimental Setup

The experiments are conducted on both unpaired image restoration (focusing on real-world image super-resolution, AIM-RWSR challenge) and unpaired point cloud restoration tasks.

5.1. Datasets

5.1.1. Unpaired Image Restoration

  • Model Analysis (Synthetic Data for Image SR):

    • HQ Images: 800 training images from DIV2K [83] and 2650 training images from Flickr2K [84].
    • LQ Images: Synthesized online from HQ images.
    • Degradations: Anisotropic Gaussian blur, bicubic downsampling, noise, and JPEG compression.
      • Anisotropic Gaussian kernels: Characterized by $\mathcal{N}(0, \Sigma)$ (zero mean, varying covariance $\Sigma$). $\Sigma$ is determined by two random eigenvalues $\lambda_1, \lambda_2 \sim U(0.2, 4)$ and a random rotation angle $\theta \sim U(0, \pi)$. The kernel size is fixed to $21 \times 21$.
      • Noise level: [0, 30].
      • JPEG compression quality factor (q): [30, 95].
    • Benchmark Dataset for Evaluation: Set14 [85].
      • To test diverse degradations: 5 typical anisotropic Gaussian kernels, 2 noise levels (15 and 25), and 2 JPEG compression quality factors (75 and 90) were combined to create 20 representative degradations.
  • Evaluation on Benchmarks (Synthetic AIM-RWSR Data):

    • Training Set: 2650 noisy and compressed LQ images with unknown degradations from Flickr2K [84] and 800 HQ images from DIV2K [83]. This represents an unpaired setting.
    • Validation Set: 100 LQ images with the same type of degradations as the training set. Paired HQ images are provided for quantitative evaluation.
  • Evaluation on Real Data (PASCAL VOC Dataset):

    • LQ Images: 17125 images containing diverse real-world degradations from the PASCAL VOC dataset [93].
    • HQ Images: 800 images from the DIV2K dataset. This is an unpaired setting.
    • Evaluation Set: 100 real LQ images from the VOC dataset. Ground truth HQ images are unavailable for this set.

5.1.2. Unpaired Point Cloud Restoration

  • Evaluation on XYZ Point Clouds (Geometry only):

    • Training Dataset: PU [52] dataset.
    • Evaluation Datasets: PU [52] and PC [101] datasets.
    • Degradations: Only Gaussian coordinate noise.
      • Coordinate noise ($\sigma_{coord}$): $[0.5\%, 2\%]$.
  • Evaluation on XYZ-RGB Point Clouds (Geometry and Color):

    • Training Dataset: Areas 1-4 and area 6 of the S3DIS dataset.
    • Evaluation Dataset: Area 5 of the S3DIS dataset.
    • Degradations: Gaussian coordinate noise, Gaussian color noise, and GPCC geometry compression.
      • Coordinate noise ($\sigma_{coord}$): $[0, 2.5\,\mathrm{cm}]$.
      • Color noise ($\sigma_{color}$): [0, 20].
      • Geometry compression quality factor (q): [7, 12].

5.2. Evaluation Metrics

5.2.1. For Image Restoration

  • Peak Signal-to-Noise Ratio (PSNR)

    1. Conceptual Definition: PSNR is a ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Because many signals have a very wide dynamic range, PSNR is usually expressed in terms of the logarithmic decibel scale. It is most commonly used to measure the quality of reconstruction of lossy compression codecs or image restoration methods. A higher PSNR value generally indicates a better quality image.
    2. Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
    3. Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255. For color images where each color component is 8 bits, this is also 255.
      • $\mathrm{MSE}$: Mean Squared Error between the original (ground truth) image and the restored (compressed/processed) image.
      • $\mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2$
        • $I(i,j)$: The pixel value at row $i$ and column $j$ of the original image.
        • $K(i,j)$: The pixel value at row $i$ and column $j$ of the restored image.
        • $M, N$: The dimensions (height and width) of the image.
  • Structural Similarity Index (SSIM)

    1. Conceptual Definition: SSIM is a perceptual metric that quantifies the similarity between two images. Unlike PSNR, which primarily measures pixel-wise differences, SSIM aims to mimic human visual perception by considering changes in structural information, luminance, and contrast. Values range from -1 to 1, where 1 indicates perfect similarity.
    2. Mathematical Formula: $ \mathrm{SSIM}(x,y) = [l(x,y)]^{\alpha} \cdot [c(x,y)]^{\beta} \cdot [s(x,y)]^{\gamma} $ Typically, $\alpha = \beta = \gamma = 1$ and $C_3 = C_2 / 2$, simplifying to: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2+\mu_y^2+C_1)(\sigma_x^2+\sigma_y^2+C_2)} $
    3. Symbol Explanation:
      • $x$, $y$: Two image patches (e.g., from the original and restored images).
      • $\mu_x$: The average (mean) of $x$.
      • $\mu_y$: The average (mean) of $y$.
      • $\sigma_x^2$: The variance of $x$.
      • $\sigma_y^2$: The variance of $y$.
      • $\sigma_{xy}$: The covariance of $x$ and $y$.
      • $C_1 = (K_1 L)^2$: A small constant to prevent division by zero, where $L$ is the dynamic range of pixel values (e.g., 255 for 8-bit images) and $K_1$ is a small constant (e.g., 0.01).
      • $C_2 = (K_2 L)^2$: A small constant to prevent division by zero, where $K_2$ is a small constant (e.g., 0.03).
      • $l(x,y) = \frac{2\mu_x\mu_y + C_1}{\mu_x^2+\mu_y^2+C_1}$: Luminance comparison function.
      • $c(x,y) = \frac{2\sigma_x\sigma_y + C_2}{\sigma_x^2+\sigma_y^2+C_2}$: Contrast comparison function.
      • $s(x,y) = \frac{\sigma_{xy} + C_3}{\sigma_x\sigma_y+C_3}$: Structure comparison function.
  • Learned Perceptual Image Patch Similarity (LPIPS)

    1. Conceptual Definition: LPIPS (often called "perceptual distance") uses features from a pre-trained deep convolutional neural network (e.g., VGG, AlexNet) to measure the distance between two images. Instead of comparing raw pixels, it compares their high-level feature representations, which often correlates better with human judgment of image similarity. A lower LPIPS score indicates higher perceptual similarity.
    2. Mathematical Formula: The LPIPS distance between two images $x$ and $x_0$ is given by: $ d(x, x_0) = \sum_l w_l \cdot \|\phi_l(x) - \phi_l(x_0)\|_2 / (H_l W_l) $
    3. Symbol Explanation:
      • $\phi_l$: Feature stack (output) of the $l$-th layer of a pre-trained CNN (e.g., VGG).
      • $w_l$: A scalar weight for each layer $l$, learned from human perceptual similarity judgments.
      • $\|\cdot\|_2$: The L2 norm (Euclidean distance), typically computed per-channel and then averaged.
      • $H_l, W_l$: Height and width of the feature map at layer $l$.
  • Naturalness Image Quality Evaluator (NIQE)

    1. Conceptual Definition: NIQE is a "no-reference" (blind) image quality assessment metric, meaning it does not require a reference (ground truth) image. It is based on the assumption that features extracted from natural, high-quality images follow a multivariate Gaussian distribution. NIQE measures the distance between the multivariate Gaussian model of a distorted image and a model trained on a collection of natural, pristine images. A lower NIQE score indicates better perceptual quality (closer to naturalness).
    2. Mathematical Formula: Let $v_1$ and $\Sigma_1$ be the mean vector and covariance matrix of the multivariate Gaussian (MVG) model fitted to natural pristine image patches, and $v_2$ and $\Sigma_2$ be the mean vector and covariance matrix of the MVG model fitted to the test image patches. $ \mathrm{NIQE} = \sqrt{(v_1 - v_2)^T \left(\frac{\Sigma_1 + \Sigma_2}{2}\right)^{-1} (v_1 - v_2)} $
    3. Symbol Explanation:
      • $v_1, \Sigma_1$: Mean vector and covariance matrix of statistical features extracted from a natural image database.
      • $v_2, \Sigma_2$: Mean vector and covariance matrix of statistical features extracted from the distorted (test) image.
      • $(\cdot)^T$: Transpose of a vector/matrix.
      • $(\cdot)^{-1}$: Inverse of a matrix.
      • \sqrt{\cdot}: Square root. This formula calculates the Mahalanobis distance between the natural pristine model and the test image model.
  • Convolutional Neural Network-based Image Quality Assessment (CNNIQA)

    1. Conceptual Definition: CNNIQA is another no-reference image quality assessment metric that leverages a convolutional neural network to predict image quality. It learns to map image features directly to quality scores, often trained on large datasets of images with human-assigned quality ratings. The specific output meaning (higher/lower is better) depends on how the network was trained (e.g., predicting MOS - Mean Opinion Score).
    2. Mathematical Formula: (Not explicitly provided in the paper, but conceptual understanding is key) The core of CNNIQA is a neural network architecture that takes an image as input and outputs a quality score. The "formula" is the network's function $f_{CNN}$: $ \mathrm{QualityScore} = f_{CNN}(\mathrm{Image}) $
    3. Symbol Explanation:
      • $\mathrm{Image}$: The input image being assessed.
      • $f_{CNN}$: The trained convolutional neural network model.
      • $\mathrm{QualityScore}$: The predicted quality score. The interpretation of this score (e.g., higher is better, lower is better) depends on the specific training objective of the CNNIQA model.
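
As a concrete reference for the full-reference metrics above, here is a minimal NumPy sketch of PSNR and a simplified single-window SSIM that follows the formulas given. Standard SSIM averages the same statistic over local windows, and LPIPS/NIQE/CNNIQA depend on pretrained models, so in practice those are computed with their released implementations.

```python
import numpy as np

def psnr(ref, test, max_val=255.0):
    """PSNR in dB between a reference image and a restored image."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_val ** 2 / mse)

def global_ssim(x, y, max_val=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM over the whole image, following the formula above.

    The standard metric averages this statistic over local (e.g. 11x11
    Gaussian-weighted) windows; this global variant is only illustrative.
    """
    x = x.astype(np.float64)
    y = y.astype(np.float64)
    c1, c2 = (k1 * max_val) ** 2, (k2 * max_val) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

# Example with random 8-bit images standing in for ground truth and restoration.
gt = np.random.randint(0, 256, (64, 64, 3)).astype(np.uint8)
restored = np.clip(gt + np.random.normal(0, 5, gt.shape), 0, 255).astype(np.uint8)
print(psnr(gt, restored), global_ssim(gt, restored))
```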

5.2.2. For Point Cloud Restoration

  • Chamfer Distance (CD)

    1. Conceptual Definition: Chamfer Distance is a popular metric for measuring the dissimilarity between two point clouds. It calculates the sum of the squared minimum distances from each point in one set to its nearest neighbor in the other set, and vice versa. It effectively penalizes both missing points and spurious points. A lower CD value indicates greater similarity between the two point clouds.
    2. Mathematical Formula: Given two point clouds $P_1 = \{p_i\}_{i=1}^{N_1}$ and $P_2 = \{q_j\}_{j=1}^{N_2}$: $ \mathrm{CD}(P_1, P_2) = \sum_{p \in P_1} \min_{q \in P_2} \|p - q\|_2^2 + \sum_{q \in P_2} \min_{p \in P_1} \|q - p\|_2^2 $
    3. Symbol Explanation:
      • $P_1, P_2$: The two point clouds being compared.
      • $p$: A point in point cloud $P_1$.
      • $q$: A point in point cloud $P_2$.
      • $\|\cdot\|_2^2$: The squared Euclidean distance between two points.
      • $\min$: Finds the minimum distance. The first term finds the closest point in $P_2$ for each point in $P_1$, and the second term does the reverse.
  • Point-to-Mesh Distance (P2M)

    1. Conceptual Definition: Point-to-Mesh distance measures the quality of a reconstructed point cloud by comparing it against a ground-truth mesh model. For each point in the point cloud, it calculates the shortest distance to the surface of the mesh. This metric is particularly useful when the ground truth is a continuous surface rather than another discrete point cloud. A lower P2M value indicates that the reconstructed point cloud is closer to the true underlying geometry.
    2. Mathematical Formula: Given a point cloud $P = \{p_i\}_{i=1}^{N}$ and a ground-truth mesh $M$: $ \mathrm{P2M}(P, M) = \frac{1}{N} \sum_{p \in P} \min_{q \in M} \|p - q\|_2 $ (Note: The original paper does not specify whether a squared or non-squared distance is used; the common convention of averaging the non-squared distance to the closest point on the mesh surface is assumed here.)
    3. Symbol Explanation:
      • $P$: The reconstructed point cloud.
      • $M$: The ground-truth mesh.
      • $p$: A point in the reconstructed point cloud $P$.
      • $q$: A point on the surface of the mesh $M$.
      • $\|\cdot\|_2$: The Euclidean distance between point $p$ and point $q$.
      • $\min_{q \in M}$: Finds the closest point $q$ on the mesh surface to point $p$.
      • $N$: The number of points in point cloud $P$.
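
For reference, here is a brute-force NumPy sketch of the Chamfer Distance above, together with a point-sampled approximation of the Point-to-Mesh distance. The exact P2M computation projects each point onto the nearest mesh triangle; densely sampling the mesh surface into a point set is a common approximation and is assumed here.

```python
import numpy as np

def chamfer_distance(p1, p2):
    """Symmetric Chamfer Distance between point sets p1 (N1, 3) and p2 (N2, 3).

    Brute-force O(N1*N2) memory version of the formula above; large point
    clouds typically use a KD-tree (e.g. scipy.spatial.cKDTree) instead.
    """
    d2 = np.sum((p1[:, None, :] - p2[None, :, :]) ** 2, axis=-1)  # pairwise squared distances
    return d2.min(axis=1).sum() + d2.min(axis=0).sum()

def p2m_approx(points, mesh_surface_samples):
    """Approximate Point-to-Mesh distance against a densely sampled mesh surface."""
    d2 = np.sum((points[:, None, :] - mesh_surface_samples[None, :, :]) ** 2, axis=-1)
    return np.sqrt(d2.min(axis=1)).mean()

# Example with two random 1K-point clouds.
a, b = np.random.rand(1000, 3), np.random.rand(1000, 3)
print(chamfer_distance(a, b), p2m_approx(a, b))
```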

5.3. Baselines

5.3.1. Unpaired Image Restoration

  • Zero-shot SR methods:

    • ZSSR [38]: Performs training during inference to adapt to the test image.
  • Paired SR methods (trained on synthetic data with predefined degradations):

    • RCAN [9]: Trained with only bicubic degradations.
    • IKC [20]: Blind SR with iterative kernel correction. Trained with combinations of Gaussian blur and noise.
    • DAN [87]: Uses degradation-aware network for blind SR. Trained with combinations of Gaussian blur and noise.
    • BSRNet [13]: Practical degradation model for deep blind image super-resolution.
    • BSRGAN [13]: GAN-based version of BSRNet.
    • Real-ESRNet [14]: Training real-world blind super-resolution with pure synthetic data using second-order degradations.
    • Real-ESRGAN [14]: GAN-based version of Real-ESRNet.
  • Unpaired SR methods (trained directly on unpaired data or using GANs for pseudo-pairing):

    • CinCGAN [45]: Unsupervised SR using cycle-in-cycle GANs.
    • Lugmayr et al. [16]: Unsupervised learning for real-world SR.
    • FSSR [91]: Frequency-separation-based real-world super-resolution (unpaired).
    • DASR [18]: Unsupervised real-world image super-resolution via domain-distance aware learning.
    • DeFlow [48]: Learning complex image degradations from unpaired data with conditional flows.

5.3.2. Unpaired Point Cloud Restoration

  • Traditional Methods:

    • Bilateral [99]: Bilateral filter for mesh denoising.
    • GLR [100]: 3D point cloud denoising using graph Laplacian regularization.
  • Learning-based Methods (Supervised):

    • PCNet [101]: PointCleanNet, learning to denoise and remove outliers from dense point clouds.
    • DMR [102]: Differentiable manifold reconstruction for point cloud denoising.
    • SBPCD [97]: Score-based point cloud denoising.
    • RePCD-Net [98]: Feature-aware recurrent point cloud denoising network.
  • Learning-based Methods (Unsupervised/Unpaired):

    • Total Denoising (TD) [54]: Unsupervised learning of 3D point cloud cleaning.
    • DMR-un [102]: Unsupervised version of DMR.

Note on Baselines for XYZ-RGB Point Clouds: For XYZ-RGB Point Clouds, previous methods like Total Denoising [54], PointCleanNet [101], and PointFilter [103] are not included for comparison because they primarily handle coordinate noises, not color noises, and are limited to denoising single objects rather than real 3D scenes. Therefore, only Gaussian filter and Bilateral filter [99] are used as traditional baselines.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Model Analyses (UnIRnet for Synthetic Image SR)

The paper first conducted ablation studies and detailed analyses on synthetic data to investigate the effectiveness of its network designs. The evaluation used Set14 with 20 representative degradations combining Gaussian blur, noise, and JPEG compression. PSNR and SSIM were used as metrics.

The following are the results from Table I of the original paper:

(✓ = component enabled, × = component removed; the five right-most columns report mean PSNR in dB for the listed degradation settings.)

| Model | Encoder: Contrastive Loss (Eq. 2) | Degrader: Noise Injection | Degrader: DA Conv | Degrader: Consistency Loss (Eq. 9) | Generator: DA Conv Kernel | Generator: DA Conv Modulation | Blur 1 (σ=15, q=75) | Blur 2 (σ=15, q=90) | Blur 3 (σ=25, q=75) | Blur 4 (σ=25, q=90) | Blur 5 (σ=25, q=95) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| E1 | × | ✓ | ✓ | × | ✓ | ✓ | 22.92 | 22.71 | 22.47 | 22.26 | 22.16 |
| D1 | ✓ | × | × | × | ✓ | ✓ | 18.07 | 17.59 | 17.44 | 17.37 | 17.25 |
| D2 | ✓ | ✓ | × | × | ✓ | ✓ | 22.86 | 22.67 | 22.44 | 22.26 | 22.13 |
| D3 | ✓ | ✓ | ✓ | × | ✓ | ✓ | 23.07 | 22.91 | 22.62 | 22.41 | 22.29 |
| G1 | ✓ | ✓ | ✓ | ✓ | × | × | 21.55 | 21.45 | 21.33 | 21.26 | 21.20 |
| G2 | ✓ | ✓ | ✓ | ✓ | ✓ | × | 23.03 | 22.81 | 22.57 | 22.40 | 22.27 |
| Baseline (Ours) | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ | 23.16 | 23.01 | 22.75 | 22.57 | 22.43 |

6.1.1.1. Encoder: Degradation Representation Learning

  • Effectiveness of Degradation Representation Learning:

    • Model E1 (without degradation representation learning, i.e., no contrastive loss and no degradation consistency loss) shows significantly lower PSNR compared to the Baseline. For instance, E1 achieves 22.92 dB for Blur 1, while the Baseline achieves 23.16 dB.
    • Analysis: This demonstrates that learning discriminative degradation information is crucial. Without it, the degrader cannot synthesize diverse pseudo LQ images effectively (lacking $\mathcal{L}_{consist}$), and the generator struggles to adapt to various degradations. The Baseline benefits from accurate implicit degradation information, leading to better SR performance.
  • Visualization of Degradation Representations: The following figure (Fig. 4 from the original paper) illustrates the visualization of degradation representations.


    VLM Description: The image is a schematic representation. The upper part shows the image restoration effects under different noise intensities ($\sigma_{noise}$) and quality factors ($q$), illustrating the transition from noise-free to high noise and low quality; the lower part contains three sub-images (a), (b), and (c), depicting the clustering distribution of feature points under varying noise and quality conditions. These clustering results demonstrate the separability and differences among features under their respective conditions.

    • T-SNE visualizations [86] (Fig. 4) show that the degradation encoder can roughly distinguish different blur kernels (Fig. 4(a)) and clearly cluster degradations by noise levels (Fig. 4(b)) and JPEG compression factors (Fig. 4(c)).
    • Analysis: This confirms that the learned degradation representations are indeed discriminative and effectively capture implicit degradation information, allowing the model to differentiate between various degradation types.
  • Content-Invariance of Degradation Representations: The following figure (Fig. 5 from the original paper) shows PSNR results achieved using degradation representations learned from different image contents.

    Fig. 5. PSNR results achieved using degradation representations learned from different image contents.

    VLM Description: The image is a chart showing the PSNR results for the restoration tasks of 10 images under different noise levels and compression qualities. The data points are represented by different shaped markers, indicating various noise standard deviations and compression qualities. It can be observed that there are significant variations in PSNR performance under different conditions as the image content changes.

    • Experiments (Fig. 5) show relatively stable performance when using degradation representations learned from different image contents, even when the image content varies, as long as the degradation model is the same.
    • Analysis: This supports the claim that the degradation representations are robust to image content variations, focusing on the degradation itself rather than the image content, which is crucial for a generalizable restoration method.
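
The discriminative behavior analyzed above comes from a contrastive objective over degradation embeddings (Eq. 2 in the paper). The sketch below shows a generic InfoNCE-style formulation under the assumption that two patches cropped from the same LQ image form a positive pair, while embeddings of patches from other images (kept in a queue) act as negatives; the function name and tensor shapes are illustrative, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def degradation_contrastive_loss(query, key, queue, tau=0.07):
    """InfoNCE-style loss over degradation embeddings.

    query: (B, C) embeddings of one patch per LQ image.
    key:   (B, C) embeddings of a second patch from the same LQ images (positives).
    queue: (K, C) embeddings of patches from other LQ images (negatives).
    """
    query = F.normalize(query, dim=1)
    key = F.normalize(key, dim=1)
    queue = F.normalize(queue, dim=1)
    l_pos = torch.sum(query * key, dim=1, keepdim=True)   # (B, 1) similarity to the positive
    l_neg = query @ queue.t()                              # (B, K) similarities to negatives
    logits = torch.cat([l_pos, l_neg], dim=1) / tau
    labels = torch.zeros(query.size(0), dtype=torch.long)  # the positive is always index 0
    return F.cross_entropy(logits, labels)

# Example with random embeddings standing in for degradation-encoder outputs.
loss = degradation_contrastive_loss(torch.randn(8, 128), torch.randn(8, 128), torch.randn(4096, 128))
```

In a MoCo-style setup, a momentum encoder and a queue update (the temperature $\tau$ and queue size are touched on in Sec. 7.3) would wrap around this loss during training.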

6.1.1.2. Degrader: LQ Data Synthesis

  • Noise Injection:

    • D1 (no noise injection, no DA convolutions, no consistency loss in degrader, resembling traditional GAN-based synthesis) shows very low PSNR (e.g., 18.07 dB for Blur 1).
    • D2 (adds noise injection to D1) significantly improves PSNR (e.g., 22.86 dB for Blur 1).
    • Analysis: Noise injection enables the degrader to synthesize pseudo LQ images with stochastic degradations, increasing diversity. This helps the generator train on a wider range of degradations, improving its ability to handle complex, unseen real degradations.
  • Degradation Consistency Loss:

    • D3 (removes $\mathcal{L}_{consist}$ from the Baseline) shows a notable performance drop compared to the Baseline (e.g., 23.07 dB for Blur 1 vs. 23.16 dB for Baseline).
    • Analysis: $\mathcal{L}_{consist}$ is crucial for guiding the degrader to mimic the specific degradations of unpaired real LQ images. Without it, the degrader can suffer from mode collapse, producing less diverse synthetic LQ images, which limits the restoration performance.
  • Diversity of Synthetic Pseudo LQ Images: The following figure (Fig. 6 from the original paper) displays the effects of different denoising methods, labeled as Guidance, Baseline, D1, D2, and D3.


    VLM Description: The image is a diagram displaying the effects of different denoising methods, labeled as Guidance, Baseline, D1, D2, and D3. It demonstrates the comparison of restoration results on the same image using various techniques for image denoising tasks.

    • Visual comparison (Fig. 6) shows D1 generates deterministic LQ images. D2 adds stochasticity but with limited diversity. D3 (with DA convolutions but without $\mathcal{L}_{consist}$) synthesizes more diverse LQ images, but their degradation distribution might not match the guidance.

    • The Baseline (with degradation representation learning and consistency loss) can synthesize diverse LQ images that closely mimic the degradations in the guidance images (e.g., strong noises, JPEG blocking artifacts).

    • Analysis: This visually confirms that noise injection and degradation consistency loss are vital for generating diverse and accurate pseudo LQ images, allowing the synthetic data to effectively cover the real degradation space. The following figure (Fig. 7 from the original paper) illustrates degradation representations for pseudo LQ images generated using different guidance images.

      Fig. 7. Visualization of degradation representations for pseudo LQ images generated using different guidance images.

    VLM Description: The image is a schematic that illustrates degradation representations for pseudo low-quality (LQ) images generated by different guidance images. The colored triangles represent four different guidance identifiers, with black, red, green, and blue areas corresponding to distinct degradation representations.

    • Visualization of degradation representations for pseudo LQ images (Fig. 7) shows that synthetic LQ images are clustered into discriminative groups corresponding to different guidance images, and are close to their corresponding guidance images.
    • Analysis: This further validates the effectiveness of degradation-aware LQ data synthesis in producing controlled and diverse degradations.
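
To illustrate the data flow analyzed in this subsection (noise injection plus conditioning on a degradation representation, with a consistency term pulling the pseudo LQ data toward the guidance degradation), here is a toy PyTorch sketch. The module sizes and the L1 form of the consistency term are assumptions for illustration, not the paper's exact architecture or loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyEncoder(nn.Module):
    """Toy degradation encoder: image -> fixed-length representation."""
    def __init__(self, rep_dim=64):
        super().__init__()
        self.conv = nn.Conv2d(3, rep_dim, 3, padding=1)

    def forward(self, x):
        return self.conv(x).mean(dim=(2, 3))  # global average pooling

class ToyDegrader(nn.Module):
    """Toy degrader conditioned on an injected noise map and a degradation vector."""
    def __init__(self, rep_dim=64):
        super().__init__()
        self.fc = nn.Linear(rep_dim, 16)  # project the degradation representation
        self.body = nn.Sequential(
            nn.Conv2d(3 + 1 + 16, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1),
        )

    def forward(self, hq, noise, d_rep):
        b, _, h, w = hq.shape
        cond = self.fc(d_rep).view(b, -1, 1, 1).expand(b, 16, h, w)
        return self.body(torch.cat([hq, noise, cond], dim=1))

encoder, degrader = ToyEncoder(), ToyDegrader()
hq = torch.rand(2, 3, 32, 32)
guidance_lq = torch.rand(2, 3, 32, 32)
d_rep = encoder(guidance_lq)                                   # implicit degradation info
pseudo_lq = degrader(hq, torch.randn(2, 1, 32, 32), d_rep)     # noise injection -> stochastic LQ
consistency = F.l1_loss(encoder(pseudo_lq), d_rep.detach())    # mimic the guidance degradation
```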

6.1.1.3. Generator: Degradation-Aware Convolutions

  • Effectiveness of DA Convolutions:
    • G1 (replaces DA convolutions with vanilla ones, i.e., no degradation information) has significantly lower PSNR than the Baseline (e.g., 21.55 dB for Blur 1 vs. 23.16 dB).
    • G2 (includes dynamic convolutional kernels but removes the channel-wise modulation branch) shows much better performance than G1 (e.g., 23.03 dB for Blur 1).
    • The Baseline (adds channel-wise modulation coefficients on top of G2) achieves the best results.
    • Analysis: This ablation study clearly demonstrates the effectiveness of DA convolutions. Dynamically predicting convolutional kernels based on degradation representations allows the network to adapt to different degradations, leading to substantial gains. Further, channel-wise modulation coefficients provide additional flexibility, contributing to the Baseline's superior performance.
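
The ablation above can be grounded with a small sketch of what a degradation-aware convolution might look like: the degradation representation predicts a per-sample depthwise kernel and channel-wise modulation coefficients, which are then applied to the features. The layer sizes and the depthwise-plus-pointwise decomposition are assumptions for illustration rather than the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DAConv(nn.Module):
    """Sketch of a degradation-aware convolution.

    From a degradation representation d of shape (B, rep_dim) it predicts
    (i) a per-sample depthwise 3x3 kernel and (ii) channel-wise modulation
    coefficients, and applies both to the input features.
    """
    def __init__(self, channels=64, rep_dim=64, ksize=3):
        super().__init__()
        self.channels, self.ksize = channels, ksize
        self.kernel_fc = nn.Linear(rep_dim, channels * ksize * ksize)
        self.mod_fc = nn.Sequential(nn.Linear(rep_dim, channels), nn.Sigmoid())
        self.pointwise = nn.Conv2d(channels, channels, 1)

    def forward(self, x, d):
        b, c, h, w = x.shape
        # Per-sample depthwise kernels predicted from the degradation representation.
        kernels = self.kernel_fc(d).view(b * c, 1, self.ksize, self.ksize)
        out = F.conv2d(x.view(1, b * c, h, w), kernels,
                       padding=self.ksize // 2, groups=b * c).view(b, c, h, w)
        out = self.pointwise(out)
        # Channel-wise modulation.
        return out * self.mod_fc(d).view(b, c, 1, 1)

x = torch.rand(2, 64, 16, 16)
d = torch.randn(2, 64)
y = DAConv()(x, d)  # (2, 64, 16, 16)
```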

6.1.2. Evaluation on Benchmarks (UnIRnet for Image SR)

6.1.2.1. Evaluation on Synthetic Data (AIM-RWSR)

The evaluation was conducted on the AIM Real-World SR (AIM-RWSR) challenge dataset for $\times 4$ SR.

The following are the results from Table II of the original paper:

| Group | Method | Training Data | Training Degradation | #Params. | Time | PSNR (↑) | SSIM (↑) | LPIPS (↓) |
|---|---|---|---|---|---|---|---|---|
| Zero-Shot | ZSSR [38] | - | - | 0.2M | 230s | 22.351 | 0.6173 | 0.537 |
| Zero-Shot | ZSSR++ [104] | - | - | 0.2M | 230s | 22.327 | 0.6022 | 0.630 |
| Paired | RCAN [9] | DIV2K | Bicubic | 16M | 0.26s | 22.322 | 0.6042 | 0.472 |
| Paired | IKC [20] | DIV2K+Flickr2K | Blur+Noise | 5.2M | 0.52s | 22.245 | 0.6001 | 0.479 |
| Paired | DAN [87] | DIV2K+Flickr2K | Blur+Noise | 4.2M | 0.35s | 22.405 | 0.6094 | 0.471 |
| Paired | BSRNet [13] | DIV2K+Flickr2K+WED [88]+FFHQ [89] | Randomly Shuffled | 16M | 0.26s | 23.180 | 0.6676 | 0.334 |
| Paired | BSRGAN [13] | DIV2K+Flickr2K+WED [88]+FFHQ [89] | Randomly Shuffled | 16M | 0.26s | 22.468 | 0.6223 | 0.236 |
| Paired | Real-ESRNet [14] | DIV2K+Flickr2K+OST [90] | Second-Order | 16M | 0.26s | 23.169 | 0.6707 | 0.333 |
| Paired | Real-ESRGAN [14] | DIV2K+Flickr2K+OST [90] | Second-Order | 16M | 0.26s | 22.078 | 0.6217 | 0.238 |
| Unpaired | CinCGAN [45] | AIM-RWSR | Unknown | 43M | - | 21.602 | 0.6129 | 0.461 |
| Unpaired | FSSR [91] | AIM-RWSR | Unknown | 16M | 0.26s | 21.590 | - | - |
| Unpaired | Lugmayr et al. [16] | AIM-RWSR | Unknown | - | - | - | 0.5500 | 0.472 |
| Unpaired | DASR [18] | AIM-RWSR | Unknown | 16M | 0.26s | 20.820 | 0.5103 | 0.390 |
| Unpaired | DeFlow [48] | AIM-RWSR | Unknown | 16M | 0.26s | 21.600 | 0.5640 | 0.336 |
| Unpaired | UnIRnet (Ours) | AIM-RWSR | Unknown | 16M | 0.26s | 22.673 | 0.6449 | 0.374 |
| Unpaired | UnIRGAN (Ours) | AIM-RWSR | Unknown | 5.1M+4.5M | 0.09s | 22.462 | 0.6273 | 0.301 |

Note: Some entries in the UnIRnet (Ours) and UnIRGAN (Ours) rows appear inconsistent with the surrounding text (e.g., the parameter counts and the PSNR/SSIM figures quoted in the discussion), suggesting a possible typo or row mislabeling in the source table. For this analysis, the values are reported exactly as they appear in the original table.

  • Comparison with Zero-Shot and Paired Methods:

    • ZSSR-style methods are time-consuming and suffer from limited accuracy because the true degradations are unknown at test time.
    • RCAN performs poorly on real degradations as it's trained only on bicubic degradations.
    • IKC and DAN perform conditional SR after degradation estimation, but their performance is limited to Gaussian blur and noise combinations, and they are inefficient due to iterative estimation.
    • BSRNet, BSRGAN, Real-ESRNet, Real-ESRGAN achieve promising results using complex degradation models and large datasets, but at a relatively high computational cost.
  • Comparison with Unpaired Methods:

    • Our UnIRnet achieves higher PSNR and SSIM scores than previous unpaired SR methods. For example, UnIRnet (22.673 dB PSNR, 0.6449 SSIM) clearly outperforms DeFlow (21.600 dB PSNR, 0.5640 SSIM); the accompanying text also credits it with fewer parameters (under 60% of DeFlow's 16M), although the table lists 16M for UnIRnet and 5.1M+4.5M for UnIRGAN.

    • Our UnIRGAN (the perception-oriented version, obtained by finetuning UnIRnet with a GAN loss) achieves the best LPIPS score (0.301), indicating superior perceptual quality, and also competitive PSNR and SSIM. It also achieves comparable or better accuracy with higher efficiency (0.09s inference time) compared to BSRGAN and Real-ESRGAN, which use more complex degradation settings and larger models.

      The following figure (Fig. 8 from the original paper) shows the visual comparison of restored images on the AIM-RWSR dataset.

      Fig. 8. Visual comparison of restored images on the AIM-RWSR dataset.

VLM Description: The image is a chart that illustrates the visual comparison of restored images on the AIM-RWSR dataset. From left to right, the outputs include the low-quality (LR) image, Bicubic interpolation, FSSR, DASR, DeFlow, UnIRnet (our model), and UnIRGAN (our model). The results show the differences in visual quality among various restoration methods.

  • Visual Comparison (Fig. 8): Previous unpaired methods (FSSR, DASR, DeFlow) suffer from noticeable artifacts (e.g., in the shorts). Our UnIRnet and UnIRGAN produce cleaner results with finer details and higher perceptual quality.

6.1.2.2. Evaluation on Real Data (PASCAL VOC)

Evaluation was conducted on real LQ images from PASCAL VOC without ground-truth HQ images, using no-reference metrics (NIQE, CNNIQA).

The following figure (Fig. 9 from the original paper) shows the visual comparison of restored images on the VOC dataset.

Fig. 9. Visual comparison of restored images on the VOC dataset.

VLM Description: The image is a chart showing a visual comparison of restored images on the VOC dataset. The left side displays the low-quality image (LR), while the right side presents high-quality images (HQ) restored using various methods, including Bicubic, RCAN, FSSR, DASR, UnIRnet (ours), and UnIRGAN (ours).

  • Visual Comparison (Fig. 9):
    • FSSR and DASR produce unpleasant artifacts and low perceptual quality.
    • Our UnIRnet yields SR results with fewer artifacts and higher quality.
    • UnIRGAN (perception-oriented finetuned version) restores finer and more realistic details (e.g., stripes in the second scenario), demonstrating superior perceptual quality on real-world images.

6.1.3. Evaluation on Unpaired Point Cloud Restoration (UnPRnet)

6.1.3.1. Evaluation on XYZ Point Clouds

Evaluation was conducted on PU and PC datasets with Gaussian coordinate noise. Chamfer Distance (CD) and Point-to-Mesh Distance (P2M) were used.

The following are the results from Table III of the original paper:

Each cell reports CD/P2M (lower is better). In the original table, the first seven method columns fall under the Paired group and the last three under Unsupervised/Unpaired.

| Dataset | #Points | Noise Level | Bilateral* [99] | GLR* [100] | PCNet* [101] | DMR* [102] | SBPCD* [97] | RePCD-Net† [98] | PRnet (Ours) | TD† [54] | DMR-un [102] | UnPRnet (Ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| PU | 10K | 1% | 3.646/1.342 | 2.959/1.052 | 3.515/1.148 | 4.482/1.722 | 2.521/0.463 | 5.140/- | 2.267/0.415 | 8.350/- | 8.255/4.790 | 2.922/0.700 |
| PU | 10K | 2% | 5.007/2.018 | 3.773/1.306 | 7.467/3.965 | 4.982/2.115 | 3.686/1.074 | - | 3.304/1.036 | - | 9.729/5.991 | 4.538/1.665 |
| PU | 10K | 3% | 6.998/3.557 | 4.909/2.114 | 13.067/8.737 | 5.892/2.846 | 4.708/1.942 | - | 4.539/1.911 | - | 11.516/7.477 | 6.547/3.345 |
| PU | 50K | 1% | 0.877/0.234 | 0.696/0.161 | 1.049/0.346 | 1.162/0.469 | 0.716/0.150 | - | 0.618/0.135 | - | 2.241/1.301 | 1.108/0.387 |
| PU | 50K | 2% | 2.376/1.389 | 1.587/0.830 | 1.447/0.608 | 1.566/0.800 | 1.288/0.566 | - | 1.113/0.523 | - | 3.389/2.247 | 2.012/1.005 |
| PU | 50K | 3% | 6.304/4.730 | 3.839/2.707 | 2.289/1.285 | 2.432/1.528 | 1.928/1.041 | - | 1.805/0.922 | - | 5.794/4.415 | 3.927/2.677 |
| PC | 10K | 1% | 4.320/1.351 | 3.399/0.956 | 3.847/1.221 | 6.602/2.152 | 3.369/0.830 | 3.132/0.755 | 3.027/0.730 | 13.266/6.959 | 14.399/7.610 | 5.189/1.305 |
| PC | 10K | 2% | 6.171/1.646 | 5.274/1.146 | 8.752/3.043 | 7.145/2.237 | 5.132/1.195 | 5.027/1.103 | 4.897/1.085 | 15.834/8.449 | 11.516/7.477 | 7.299/2.668 |
| PC | 10K | 3% | 8.295/2.392 | 7.249/1.674 | 14.525/5.873 | 8.087/2.487 | 6.776/1.941 | 6.662/1.891 | 6.551/1.859 | 17.472/9.308 | 15.834/8.449 | 10.453/4.601 |
| PC | 50K | 1% | 1.172/0.198 | 0.964/0.134 | 1.293/0.289 | 1.566/0.350 | 1.066/0.177 | 0.922/0.155 | 0.803/0.125 | 3.182/1.423 | 4.245/1.986 | 1.561/0.430 |
| PC | 50K | 2% | 2.478/0.634 | 2.015/0.417 | 1.913/0.505 | 2.009/0.485 | 1.659/0.354 | 1.508/0.301 | 1.428/0.284 | 4.910/2.443 | 4.245/1.986 | 2.377/0.801 |
| PC | 50K | 3% | 6.077/2.189 | 4.488/1.306 | 3.249/1.076 | 2.993/0.859 | 2.494/0.657 | 2.313/0.606 | 2.261/0.592 | 6.462/3.181 | 4.910/2.443 | 3.914/1.553 |
  • Comparison with Unsupervised/Unpaired Approaches:
    • Our UnPRnet consistently achieves significantly better performance (lower CD and P2M) than other unsupervised/unpaired methods (TD, DMR-un). For example, on PU Dataset (10K) at 1% noise, UnPRnet achieves 2.922/0.700 (CD/P2M) compared to DMR-un's 8.255/4.790.
  • Comparison with Paired Approaches:
    • Our UnPRnet also produces competitive results compared to supervised paired methods. For 10K points, UnPRnet surpasses DMR at 1% noise (CD of 2.922 vs. 4.482 on PU, and 5.189 vs. 6.602 on PC). For 50K points, UnPRnet shows comparable performance at most noise levels. PRnet (Ours), the supervised counterpart of our network, achieves the best performance among all paired methods shown.

    • Analysis: This highlights the strength of the unsupervised degradation representation learning, enabling competitive performance even without paired ground-truth degradations, demonstrating its robustness and adaptability.

      The following figure (Fig. 11 from the original paper) shows the comparison of different algorithms in the low-quality (LQ) and high-quality (GT) point cloud restoration tasks.


VLM Description: The image is an illustration that shows the comparison of different algorithms in the low-quality (LQ) and high-quality (GT) point cloud restoration tasks. The top row presents the restoration results for the chair, with the left side showing the low-quality data and the right side showing the results from our proposed UnPRnet method; the bottom row displays the cat figure, with the left side representing the low-quality data and the right side showing the algorithm's output. The compared algorithms include DMR, SBPCD, and DMR-un.

  • Visual Comparison (Fig. 11): UnPRnet produces cleaner and finer results with lower point-to-mesh distances compared to DMR-un. It also closes the performance gap to SBPCD, a strong paired baseline.

6.1.3.2. Visualization of Degradation Representations (Point Clouds)

The following figure (Fig. 12 from the original paper) visualizes the degradation representations for degradations with different coordinate noise levels.

Fig. 12. Visualization of representations for degradations with different coordinate noise levels.

VLM Description: The image is a schematic diagram that illustrates degradation representations under different coordinate noise levels, with the left side representing 10K points and the right side representing 50K points. The colors correspond to different noise standard deviations: blue indicates $\sigma_{coord} = 0.1\%$, red indicates $\sigma_{coord} = 0.5\%$, green indicates $\sigma_{coord} = 1\%$, and black indicates $\sigma_{coord} = 2\%$.

  • T-SNE visualizations (Fig. 12) show that the degradation encoder can distinguish point clouds with different coordinate noise levels, especially for noise levels larger than 0.5%.
  • Analysis: This confirms that the learned degradation representations are discriminative for 3D geometry degradations as well, providing implicit degradation information for point clouds.
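
Figures 4 and 12 are produced with t-SNE. A minimal sketch of this kind of visualization, using random placeholder embeddings in place of the degradation-encoder outputs, might look like the following (scikit-learn and matplotlib assumed available):

```python
import numpy as np
from sklearn.manifold import TSNE
import matplotlib.pyplot as plt

# reps: (N, C) degradation representations from the encoder; labels: noise-level IDs.
# Random placeholders are used here in place of actual encoder outputs.
reps = np.random.randn(400, 64)
labels = np.repeat(np.arange(4), 100)

emb = TSNE(n_components=2, perplexity=30, init="pca", random_state=0).fit_transform(reps)
for lvl in np.unique(labels):
    mask = labels == lvl
    plt.scatter(emb[mask, 0], emb[mask, 1], s=5, label=f"noise level {lvl}")
plt.legend()
plt.title("t-SNE of degradation representations (illustrative)")
plt.show()
```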

6.1.3.3. Visualization of Synthetic Pseudo LQ Point Clouds

The following figure (Fig. 13 from the original paper) visualizes the synthetic LQ point clouds.

Fig. 13. Visualization of synthetic LQ point clouds.

VLM Description: The image is a diagram that illustrates the visualization of synthetic low-quality (LQ) point clouds generated at different $\sigma_{coord}$ values. The guidance and synthetic clouds are displayed in the top and bottom sections, respectively, with $\sigma_{coord}$ varying from $0.5\%$ to $3.0\%$, highlighting different qualities and distribution characteristics of the point clouds.

  • Visualizations (Fig. 13) demonstrate that the degrader can synthesize diverse LQ point clouds that effectively mimic the different noise levels in the guidance point clouds.
  • Analysis: This reinforces the effectiveness of the degradation-aware LQ data synthesis in covering diverse degradations in $P^{LQ}$ with synthetic pseudo LQ point clouds.

6.1.3.4. Evaluation on XYZ-RGB Point Clouds

Evaluation was performed on the S3DIS dataset (Area 5) for point clouds with 3D coordinates and RGB values, including Gaussian coordinate noise, Gaussian color noise, and GPCC geometry compression. CD and PSNR (for color) were used.

The following are the results from Table IV of the original paper:

| Method | CD ($\times 10^{-4}$) (↓) | PSNR (↑) |
|---|---|---|
| LQ Data | 4.244 | 71.35 |
| Gaussian | 3.926 | 65.64 |
| Bilateral [99] | 3.956 | 75.33 |
| UnPRnet (Ours) | 3.681 | 78.23 |
| UnPRnet+ (Ours) | 3.630 | 78.37 |
  • Performance Comparison:

    • Our UnPRnet significantly outperforms the bilateral filter in terms of PSNR (78.23 vs. 75.33), showing better color restoration.
    • With a self-ensemble strategy (UnPRnet+), performance is further improved, achieving the lowest CD (3.630) and highest PSNR (78.37).
  • Analysis: This demonstrates the framework's capability to restore both geometry (CD) and appearance (PSNR) information under complex, multi-modal degradations in real 3D scenes.

    The following figure (Fig. 14 from the original paper) shows the visual comparison of restored point clouds.

    Fig. 14. Visual comparison of restored point clouds.

VLM Description: The image is a visual comparison of restored point clouds, with the left showing the low-quality (LQ) point cloud, the middle displaying the restoration result from our method UnPRnet, and the right representing the ground truth (GT) point cloud. This illustrates the performance of different methods in the point cloud restoration task.

  • Visual Results (Fig. 14): UnPRnet greatly improves the perceptual quality of input LQ point clouds, producing much cleaner point clouds with finer details (e.g., walls, roofs).

6.2. Ablation Studies / Parameter Analysis

The ablation studies for image restoration (Table I) provide clear evidence for the contribution of each proposed component:

  • Degradation Representation Learning: Removing the contrastive loss (model E1) results in a significant performance drop, confirming the necessity of learning discriminative degradation information.
  • Noise Injection in Degrader: Removing noise injection (model D1 vs. D2) severely limits the diversity of synthetic LQ images and hence the restoration performance.
  • Degradation Consistency Loss in Degrader: Removing $\mathcal{L}_{consist}$ (model D3 vs. Baseline) leads to mode collapse and reduced diversity in synthesized data, confirming its role in matching degradation distributions.
  • DA Convolutions in Generator:
    • Replacing DA convolutions with vanilla ones (model G1) shows the largest performance degradation, emphasizing the critical role of degradation-aware adaptation.

    • Removing only the channel-wise modulation coefficient branch (model G2) still yields good performance but slightly worse than the full DA convolution (Baseline), indicating that both kernel prediction and channel modulation contribute to optimal adaptation.

      These ablation studies rigorously validate that each proposed technical design—unsupervised degradation representation learning, noise injection, degradation consistency loss, and both components of DA convolutions—contributes positively and significantly to the overall state-of-the-art performance of the framework.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a novel and effective framework for unpaired restoration of images and point clouds. Its core innovation lies in an unsupervised degradation representation learning scheme, which implicitly extracts discriminative degradation information from low-quality (LQ) data without relying on ground-truth degradation labels. This scheme forms the basis for two key components:

  1. Degradation-aware LQ data synthesis: A degrader network, guided by these learned representations and incorporating noise injection and a degradation consistency loss, synthesizes diverse pseudo LQ data from high-quality (HQ) inputs that accurately mimic real-world degradations. This overcomes the mode collapse issue often seen in conventional GAN-based unpaired methods and provides rich training data for the restorer.

  2. Degradation-aware HQ data restoration: A generator network utilizes degradation-aware (DA) convolutions, which dynamically predict convolutional kernels and channel-wise modulation coefficients based on the degradation representations. This allows the network to flexibly adapt its processing to various input degradations, leading to highly accurate restorations.

    The framework is demonstrated through UnIRnet for unpaired image restoration and UnPRnet for unpaired point cloud restoration. Extensive experiments show that both achieve state-of-the-art performance on various benchmark datasets, for both synthetic and real-world degradations, surpassing previous paired and unpaired methods, often with fewer parameters and higher efficiency.

7.2. Limitations & Future Work

The authors implicitly acknowledge the difficulty of dealing with unknown and highly diverse real degradations as the core challenge. While they successfully address this with their unsupervised learning scheme, they don't explicitly list specific limitations of their own method or suggest future work in the conclusion. However, based on the discussion, potential limitations and future directions could be inferred:

  • Generalizability to Extreme Degradations: While DA convolutions offer flexibility, there might be a limit to how well the model can generalize to entirely new, extremely severe, or out-of-distribution degradation types not represented in the training set of unpaired LQ data.
  • Computational Cost of Degradation Representations: While the encoder is efficient at inference, the training process for contrastive learning and the overall three-stage training might still be computationally intensive.
  • Interpretability of Degradation Representations: The learned degradation representations are "implicit." While effective, understanding what specific degradation features are encoded in these representations could lead to further improvements or applications.
  • Real-time Applications: Although UnIRGAN shows promising inference time (0.09s), further optimization might be needed for very high-throughput or real-time applications, especially for large image/point cloud resolutions.
  • Broader Degradation Spectrum: The paper focuses on common degradations like blur, noise, downsampling, and compression. Exploring more complex or domain-specific degradations (e.g., rain, haze, motion artifacts in point clouds) could be a future direction.
  • Integration with Downstream Tasks: While restoring HQ data benefits downstream tasks, directly incorporating feedback from downstream tasks during restoration training could lead to end-to-end optimized solutions.

7.3. Personal Insights & Critique

This paper offers several profound insights and advancements:

  • Elegance of Unsupervised Degradation Learning: The idea of learning to distinguish degradations rather than explicitly estimating them is quite elegant. It cleverly sidesteps the impossible task of knowing ground-truth degradation parameters in real-world scenarios. This contrastive learning approach is a powerful paradigm shift for unpaired restoration. The visual separation of different degradation types in the T-SNE plots is compelling evidence of this.
  • Effective Handling of Diversity: The decoupled approach to LQ data synthesis ($p(y \mid x; d)\,p(d)$) is a smart way to overcome mode collapse in GANs. By actively conditioning on and mimicking the degradation representations of diverse real LQ samples, the model ensures that the synthetic data truly covers the "real degradation space," which is a critical advantage for training robust restorers.
  • Adaptive Architecture with DA Convolutions: The DA convolutions are a highly effective mechanism for integrating degradation information into the network. Instead of simply concatenating features (which can lead to domain gap issues), dynamically predicting kernels and modulation coefficients allows for fine-grained, adaptive processing. This is a generalizable technique that could be applied to other conditional image generation or processing tasks.
  • Cross-Modality Applicability: The successful application of the same core framework to both image and point cloud restoration demonstrates its generality and robustness. This suggests that the fundamental principles of unsupervised degradation representation and adaptive processing are broadly applicable across different data modalities.

Potential Issues or Areas for Improvement:

  • Computational Overhead of Training: While the inference time is competitive, the three-stage training process, especially with contrastive learning and adversarial training, could be quite resource-intensive. Further work on optimizing the training efficiency (e.g., one-stage training, more efficient contrastive learning setups) might be beneficial.

  • Hyperparameter Sensitivity: Contrastive learning often involves sensitive hyperparameters such as the temperature $\tau$ and the queue size $N_{queue}$. The paper mentions empirically setting these; a more detailed sensitivity analysis or adaptive parameter tuning might be valuable.

  • Strict Adherence to Unpaired Data: While the method excels in unpaired settings, could there be a hybrid approach that leverages small amounts of paired data if available, or incorporates other forms of weak supervision to further boost performance?

  • Long-Term Degradation Drift: Real-world degradations can change over time (e.g., sensor degradation). How would the model adapt to a degradation distribution that slowly shifts? Continuous or online learning might be needed.

  • Perceptual Quality vs. Fidelity Trade-off: The paper offers both UnIRnet (PSNR-oriented) and UnIRGAN (perception-oriented). This highlights the inherent trade-off. While UnIRGAN achieves excellent LPIPS, further research could explore how to strike an optimal balance or allow users to control this trade-off more explicitly.

    Overall, this paper makes significant strides in addressing the fundamental challenges of real-world unpaired data restoration. Its unsupervised approach to degradation understanding and adaptive network design provides a powerful and flexible paradigm that is likely to influence future research in low-level vision and beyond.
