
LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models

Published:07/12/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

LightenDiffusion introduces an unsupervised framework for low-light image enhancement that integrates Retinex theory with diffusion models. It uses a content-transfer decomposition network to perform Retinex decomposition in latent space, significantly improving restoration performance.

Abstract

In this paper, we propose a diffusion-based unsupervised framework that incorporates physically explainable Retinex theory with diffusion models for low-light image enhancement, named LightenDiffusion. Specifically, we present a content-transfer decomposition network that performs Retinex decomposition within the latent space instead of image space as in previous approaches, enabling the encoded features of unpaired low-light and normal-light images to be decomposed into content-rich reflectance maps and content-free illumination maps. Subsequently, the reflectance map of the low-light image and the illumination map of the normal-light image are taken as input to the diffusion model for unsupervised restoration with the guidance of the low-light feature, where a self-constrained consistency loss is further proposed to eliminate the interference of normal-light content on the restored results to improve overall visual quality. Extensive experiments on publicly available real-world benchmarks show that the proposed LightenDiffusion outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Our code is available at https://github.com/JianghaiSCU/LightenDiffusion.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models."

1.2. Authors

The authors are:

  • Hai Jiang (1, 5)

  • Ao Luo (2, 5)

  • Xiaohong Liu (4)

  • Songchen Han (1)

  • Shuaicheng Liu (3, 5, †)

    Their affiliations are:

  1. Sichuan University

  2. Southwest Jiaotong University

  3. University of Electronic Science and Technology of China

  4. Shanghai Jiao Tong University

  5. Megvii Technology

    Shuaicheng Liu is indicated as the corresponding author (†).

1.3. Journal/Conference

The paper is published at arXiv, a preprint server. This indicates it is currently a preprint and has not yet undergone formal peer review and publication in a specific journal or conference proceeding. arXiv is widely used in academic fields, especially in computer science and physics, for rapid dissemination of research.

1.4. Publication Year

The paper was posted to arXiv at 2024-07-12T02:54:43 (UTC), i.e., in 2024.

1.5. Abstract

The paper proposes LightenDiffusion, an unsupervised framework for low-light image enhancement (LLIE). This framework integrates the physically explainable Retinex theory with diffusion models. A novel content-transfer decomposition network (CTDN) performs Retinex decomposition in the latent space (instead of the typical image space). This enables the decomposition of encoded features from unpaired low-light and normal-light images into content-rich reflectance maps and content-free illumination maps. Subsequently, the reflectance map of the low-light image and the illumination map of the normal-light image are fed into a diffusion model for unsupervised restoration, guided by the low-light feature. To further improve visual quality, a self-constrained consistency loss is introduced to prevent interference from normal-light content in the restored results. Experimental evaluations on real-world benchmarks demonstrate that LightenDiffusion surpasses state-of-the-art unsupervised methods and achieves performance comparable to supervised methods, while exhibiting superior generalizability across diverse scenes. The code for LightenDiffusion is publicly available.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is low-light image enhancement (LLIE). Images captured under poor lighting conditions suffer from significant degradations, including poor visibility, reduced contrast, and amplified noise. This problem is crucial because such degraded images negatively impact the performance of various downstream computer vision tasks, such as object detection, segmentation, and surveillance.

The challenges and gaps in prior research are multifaceted:

  1. Ill-posed problem: LLIE is inherently ill-posed, meaning multiple high-quality images could correspond to a single low-light input. Traditional methods, relying on hand-crafted priors like histogram equalization (HE) or Retinex theory, struggle to adapt to diverse illumination conditions and often produce artifacts or limited improvements.

  2. Overfitting and poor generalization in learning-based methods: While deep learning-based approaches have shown promise, many supervised methods require large-scale paired datasets (low-light and corresponding normal-light images) for training. Such paired data are difficult and costly to collect in real-world scenarios. This reliance often leads to models that overfit to specific training conditions and perform poorly when applied to unseen, real-world low-light scenes, exhibiting issues like incorrect exposure, color distortion, blurred details, or noise amplification.

  3. Limitations of zero-shot diffusion models: Recent generative models, particularly diffusion models, have gained attention for their ability to generate high-quality images. Some zero-shot approaches leverage pre-trained diffusion models for image restoration without training from scratch. However, these methods are limited by the known degradation modes (types of image corruption) embedded in the pre-trained models and tend to perform unsatisfactorily in real-world scenarios where degradations are diverse and unknown.

    The paper's entry point and innovative idea revolve around developing an unsupervised, learning-based framework that overcomes the paired data dependency and generalization issues. It proposes to incorporate the physically explainable Retinex theory with the powerful generative capabilities of diffusion models to learn degradation modes directly from extensive unpaired real-world data. A key innovation is performing Retinex decomposition within the latent space rather than the traditional image space, allowing for more robust separation of content and illumination information.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  • Novel Diffusion-Based Unsupervised Framework (LightenDiffusion): Proposing a new framework that synergistically combines Retinex theory and diffusion models for unsupervised low-light image enhancement. This addresses the critical issue of paired data scarcity and improves generalization.

  • Content-Transfer Decomposition Network (CTDN): Introducing a specialized network that performs Retinex decomposition in the latent space. This allows for the generation of content-rich reflectance maps (containing intrinsic image details) and content-free illumination maps (representing only lighting conditions) from unpaired low-light and normal-light images. This latent-space decomposition is shown to be more effective than traditional image-space methods in separating these components.

  • Self-Constrained Consistency Loss (Lscc\mathcal{L}_{scc}): Proposing a novel loss function to improve visual quality by eliminating interference from normal-light content. This loss ensures that the restored feature shares the same intrinsic content information as the input low-light image, mitigating potential artifacts arising from imperfect illumination map estimation.

  • Extensive Experimental Validation: Demonstrating through comprehensive experiments on publicly available real-world benchmarks that LightenDiffusion significantly outperforms state-of-the-art unsupervised competitors. Furthermore, it achieves comparable performance to supervised methods while exhibiting superior generalization abilities to various unseen scenes.

  • Practical Value in Downstream Tasks: Showing that LightenDiffusion can effectively serve as a pre-processing step for downstream vision tasks, such as low-light face detection, significantly improving the precision of detectors like RetinaFace in challenging conditions.

    The key conclusions and findings are that LightenDiffusion effectively resolves the trade-off between enhancement quality and generalization ability in LLIE. By learning from unpaired data and leveraging the strengths of latent-space Retinex decomposition and diffusion models, it produces visually pleasing and artifact-free enhanced images that are robust across diverse real-world low-light conditions. This approach provides a practical solution for real-world applications where paired data is unavailable.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the LightenDiffusion paper, a beginner should be familiar with several core concepts in computer vision and deep learning:

3.1.1. Low-Light Image Enhancement (LLIE)

Low-Light Image Enhancement (LLIE) is the task of improving the visual quality of images captured under insufficient lighting conditions. These images typically suffer from low brightness, poor contrast, color cast, and amplified noise. The goal of LLIE is to transform these degraded images into visually pleasant, high-quality images that resemble those taken under normal lighting, making details more discernible and improving their utility for human perception and computer vision systems.

3.1.2. Retinex Theory

The Retinex theory is a perceptual model of color vision developed by Edwin Land, which explains how humans perceive color consistently despite varying illumination. In the context of image processing, it assumes that an observed image $I$ can be decomposed into two components:

  • Reflectance map ($\mathbf{R}$): This represents the intrinsic color and texture properties of objects in the scene, which should be invariant to changes in illumination. It is the "true color" or "content" of the image.
  • Illumination map ($\mathbf{L}$): This represents the lighting conditions of the scene, describing the amount of light falling on objects. It typically varies smoothly across the image. The mathematical formulation of Retinex theory is a multiplicative relationship: $I = \mathbf{R} \odot \mathbf{L}$, where $\odot$ denotes the Hadamard (element-wise) product. The goal of Retinex-based LLIE is to estimate these two components from a low-light image, enhance the illumination map (e.g., by increasing its dynamic range), and then recombine it with the original reflectance map to produce an enhanced image.
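
To make the multiplicative model concrete, here is a minimal NumPy sketch (not from the paper; the max-over-channels illumination estimate and the gamma correction are illustrative choices that echo the initialization and pseudo-label construction used later in the paper):

```python
# Toy Retinex decomposition and recombination; all values and names are illustrative.
import numpy as np

def retinex_decompose(img, eps=1e-4):
    """Split an H x W x 3 image in [0, 1] into reflectance R and illumination L."""
    L = img.max(axis=2, keepdims=True)      # rough illumination: per-pixel channel max
    R = img / (L + eps)                     # reflectance: I / (L + eps)
    return R, L

def retinex_enhance(img, gamma=0.2, eps=1e-4):
    """Brighten by gamma-correcting the illumination map and recombining."""
    R, L = retinex_decompose(img, eps)
    return np.clip(R * np.power(L, gamma), 0.0, 1.0)   # I_enh = R ⊙ L^gamma

low = np.random.rand(64, 64, 3) * 0.2       # synthetic dark image
enhanced = retinex_enhance(low)
print(enhanced.mean() > low.mean())         # brighter on average -> True
```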

3.1.3. Diffusion Models (DMs)

Diffusion Models (DMs) are a class of generative models that have shown remarkable success in generating high-quality and diverse images. They operate through two main processes:

  • Forward Diffusion (Noising Process): This process gradually adds Gaussian noise to an image over several time steps, progressively transforming a clean image ($\mathbf{x}_0$) into pure Gaussian noise ($\mathbf{x}_T$). This process is fixed and can be described mathematically.
  • Reverse Denoising (Generation Process): This is the learned part of the model. It starts with random noise ($\mathbf{x}_T$) and attempts to gradually reverse the noising process, step by step, to reconstruct a clean image ($\mathbf{x}_0$). A neural network (often a U-Net) is trained to predict the noise added at each step, allowing the model to iteratively remove noise and generate realistic images. Conditional Diffusion Models extend this by incorporating additional information (e.g., text descriptions, class labels, or, in this paper, low-light features) to guide the generation process, allowing for controlled image synthesis or restoration.

3.1.4. Unsupervised Learning

Unsupervised learning is a type of machine learning where the model learns patterns and structures from input data without any explicit labels or paired outputs. In the context of LLIE, this means the model can learn to enhance low-light images using datasets that contain only low-light images, or a collection of unpaired low-light and normal-light images, without requiring perfectly matched pairs. This is highly advantageous for real-world applications where obtaining paired data is difficult or impossible.

3.1.5. Latent Space

In deep learning, latent space (also known as feature space or embedding space) refers to a lower-dimensional representation of data that captures its essential characteristics. An encoder network maps high-dimensional input data (like an image) to a more compact latent representation, which is a vector or a feature map. This latent space often disentangles various attributes of the data, making it easier for subsequent processes (like a decoder or diffusion model) to manipulate or generate new data. Operations in latent space can be more robust and semantically meaningful than operations directly in the pixel space.

3.1.6. Convolutional Neural Networks (CNNs) and Encoder-Decoder Architecture

Convolutional Neural Networks (CNNs) are a class of deep neural networks primarily used for analyzing visual imagery. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features. An encoder-decoder architecture is a common structure in CNNs for image-to-image translation tasks (like LLIE).

  • Encoder: Consists of several convolutional and pooling layers that progressively downsample the input image, extracting increasingly abstract and compressed features into a latent space.
  • Decoder: Consists of up-sampling and convolutional layers that take the latent representation from the encoder and gradually reconstruct the output image at the original resolution.
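
As a rough illustration (an assumption, not the paper's actual architecture), the PyTorch sketch below builds a tiny encoder-decoder whose encoder halves the spatial resolution at each of $k$ stages, the same downsampling behaviour later attributed to the paper's encoder $\mathcal{E}(\cdot)$:

```python
import torch
import torch.nn as nn

class TinyEncoderDecoder(nn.Module):
    """Illustrative encoder-decoder: k max-pool stages down, k upsample stages back."""
    def __init__(self, channels=32, k=2):
        super().__init__()
        enc, dec, c_in = [], [], 3
        for _ in range(k):   # each stage halves H and W
            enc += [nn.Conv2d(c_in, channels, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2)]
            c_in = channels
        for _ in range(k):   # each stage doubles H and W
            dec += [nn.Upsample(scale_factor=2), nn.Conv2d(channels, channels, 3, padding=1), nn.ReLU()]
        self.encoder = nn.Sequential(*enc)
        self.decoder = nn.Sequential(*dec, nn.Conv2d(channels, 3, 3, padding=1))

    def forward(self, x):
        z = self.encoder(x)            # latent feature of shape (B, C, H/2^k, W/2^k)
        return self.decoder(z), z

recon, latent = TinyEncoderDecoder()(torch.rand(1, 3, 64, 64))
print(recon.shape, latent.shape)       # (1, 3, 64, 64) and (1, 32, 16, 16)
```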

3.1.7. Attention Mechanisms

Attention mechanisms in neural networks allow the model to focus on specific parts of the input data that are most relevant for a given task, rather than processing all parts equally.

  • Self-Attention (SA): Enables a model to weigh the importance of different parts of a single input sequence or image feature. For image processing, it helps capture long-range dependencies between different spatial locations.
  • Cross-Attention (CA): Allows a model to attend to information from a different input sequence or feature map. In LightenDiffusion, cross-attention might be used to allow the reflectance map features to attend to relevant information in the illumination map features, or vice-versa, to refine their decomposition.
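
The sketch below (illustrative only; the paper's CA and SA modules follow [21] and [50]) implements plain scaled dot-product cross-attention over flattened spatial features. Self-attention is the special case where queries, keys, and values all come from the same feature map:

```python
import torch

def cross_attention(q_feat, kv_feat):
    """q_feat, kv_feat: (B, C, H, W) tensors; returns attended features shaped like q_feat."""
    B, C, H, W = q_feat.shape
    q = q_feat.flatten(2).transpose(1, 2)               # (B, HW, C) queries
    k = kv_feat.flatten(2).transpose(1, 2)              # (B, HW, C) keys
    v = k                                               # values share the key features
    attn = torch.softmax(q @ k.transpose(1, 2) / C ** 0.5, dim=-1)   # (B, HW, HW) weights
    return (attn @ v).transpose(1, 2).reshape(B, C, H, W)

r = torch.rand(1, 32, 16, 16)   # e.g. reflectance features as queries
l = torch.rand(1, 32, 16, 16)   # e.g. illumination features as keys/values
print(cross_attention(r, l).shape, cross_attention(r, r).shape)   # CA and SA
```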

3.2. Previous Works

The paper discusses various categories of prior work in Low-Light Image Enhancement (LLIE):

3.2.1. Traditional Methods

These methods rely on pre-defined mathematical models or hand-crafted priors.

  • Histogram Equalization (HE)-based methods [2, 42, 44]: Aim to improve contrast by re-distributing pixel intensities to span the full dynamic range. While simple, they can sometimes lead to over-enhancement or noise amplification.
  • Retinex-based methods [9, 14, 4]: Decompose an image into reflectance and illumination components. Enhancement is achieved by manipulating the illumination map (e.g., increasing its brightness or contrast) and then recombining it with the reflectance map. Examples include LIME [14] and BrainRetinex [4].
  • Limitations: Difficult to generalize to diverse, real-world low-light conditions due to the inherent ill-posed nature of the problem and the reliance on fixed priors.

3.2.2. Learning-Based Methods

With the advent of deep learning, LLIE methods shifted towards learning complex mappings from low-light to normal-light images.

  • Supervised Methods [13, 23, 34, 54, 60, 63, 64, 72]: These models are trained on large-scale paired datasets (low-light input and corresponding normal-light ground truth). They leverage powerful network architectures (e.g., CNNs) to directly learn the enhancement function. Examples include SMG [64].
    • Retinex-based Deep Networks [5, 15, 58, 59, 74]: Combine the principles of Retinex theory with deep learning. They often use neural networks to learn the decomposition and adjustment steps. Examples include RetinexNet [58] and URetinexNet [59].
    • Limitations: Heavy reliance on paired datasets, which are hard to collect and often lead to models with poor generalization when applied to real-world images outside the training distribution.
  • Unsupervised Methods [10, 12, 24, 32, 40, 67]: Address the data scarcity issue by learning from unpaired data or without explicit labels. They often employ techniques like adversarial learning (EnlightenGAN [24]), curve estimation (Zero-DCE [12]), or neural architecture search (RUAS [32]). PairLIE [10] and NeRCo [67] are also unsupervised methods.
    • Limitations: While better in generalization, they can sometimes struggle with visual fidelity or introduce artifacts compared to supervised methods.
  • Semi-supervised Methods [29, 68]: Attempt to combine the benefits of both supervised and unsupervised learning, using a mix of paired and unpaired data to achieve stable training and better generalization. DRBN [68] and BLL [39] are examples.

3.2.3. Diffusion-Based Image Restoration

Diffusion Models (DMs) have recently been applied to various image restoration tasks.

  • Conditional DMs [6, 20, 22, 43, 47, 48, 71, 76]: Most DM-based methods train a model from scratch using paired data, where the degraded image (e.g., low-light image) serves as a condition or guidance during the denoising process. Examples include PyDiff [76] and GSAD [20] for LLIE.
    • Limitations: Still suffer from the paired data requirement, limiting their real-world applicability and generalization.
  • Zero-Shot DMs [8, 25, 35, 55, 78]: Utilize pre-trained diffusion models (often trained on large, diverse datasets) to restore degraded images without specific training for the degradation. They leverage the general priors learned by the pre-trained model. GDP [8] is an example for LLIE.
    • Limitations: Performance is constrained by the known degradation modes the pre-trained model implicitly learned. They may struggle with diverse, unknown degradations present in real-world LLIE scenarios, often leading to under-enhancement or unsatisfactory visual quality.

3.3. Technological Evolution

The field of LLIE has evolved significantly:

  1. Early Traditional Methods (1970s-2010s): Focused on hand-crafted mathematical models like Histogram Equalization (early 2000s) and Retinex theory (1970s, applied to images in 1980s-2000s). These were often simple and fast but lacked adaptability.
  2. Deep Learning Era (2015-present):
    • Supervised CNNs (2015-2018): Initial deep learning approaches directly mapped low-light to normal-light images using CNNs, requiring large paired datasets. RetinexNet [58] was a key development applying deep learning to Retinex theory.

    • Unsupervised and Semi-Supervised Methods (2018-present): To overcome paired data limitations, techniques like GANs (EnlightenGAN [24]) and curve estimation (Zero-DCE [12]) emerged, focusing on unpaired data or intrinsic image properties.

    • Generative Models (2020-present): The rise of Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and more recently, Diffusion Models (DMs), brought new paradigms for generating high-quality images. DMs (2020 onwards) have gained prominence due to their impressive generative power and training stability, being applied to various image restoration tasks including LLIE (PyDiff [76], GSAD [20]).

      LightenDiffusion fits into this timeline by building upon the success of diffusion models and Retinex theory, but innovatively addressing the unsupervised learning challenge and the generalization gap that previous DM-based methods still faced due to their reliance on paired data or limited zero-shot capabilities.

3.4. Differentiation Analysis

Compared to the main methods in related work, LightenDiffusion introduces several core differences and innovations:

  • Unsupervised Learning with Diffusion Models: Unlike most diffusion-based LLIE methods (PyDiff, GSAD) that rely on paired data and supervised training, LightenDiffusion operates in an unsupervised manner. This is a critical advantage, making it highly applicable to real-world scenarios where paired low-light and normal-light images are scarce.
  • Latent-Space Retinex Decomposition: Previous Retinex-based deep learning methods (RetinexNet, URetinexNet, KinD++KinD++, PairLIE) typically perform decomposition in the image space. LightenDiffusion innovates by performing this decomposition in the latent space through its Content-Transfer Decomposition Network (CTDN). This allows for a more effective disentanglement of content-rich reflectance maps and content-free illumination maps, reducing information leakage between components and producing cleaner results (as illustrated in Fig. 3).
  • Integration of Retinex with Diffusion Models: While both Retinex theory and diffusion models have been used for LLIE independently, LightenDiffusion provides a principled integration where the decomposed reflectance and illumination components (specifically, R_low and L_high) are explicitly used as input to guide the diffusion model's restoration process. This combination leverages the physical interpretability of Retinex with the powerful generative capabilities of DMs.
  • Self-Constrained Consistency Loss (Lscc\mathcal{L}_{scc}): This novel loss function specifically targets a weakness of Retinex-based approaches – the potential for residual content information in the estimated illumination map to introduce artifacts. By adding this consistency loss, LightenDiffusion explicitly guides the diffusion model to reconstruct results with intrinsic content information consistent with the low-light input, thus improving visual fidelity and robustness without requiring ground-truth supervision.
  • Improved Generalization: By training on extensive unpaired real-world data and employing the latent-space decomposition, LightenDiffusion demonstrates superior generalization ability compared to both supervised methods (which overfit to training distributions) and zero-shot diffusion methods (which are limited by known degradation modes), as shown by its performance on unseen real-world datasets.

4. Methodology

4.1. Principles

The core idea behind LightenDiffusion is to combine the strengths of Retinex theory and diffusion models within an unsupervised learning framework for low-light image enhancement (LLIE). The fundamental principle is that an image can be intrinsically separated into its reflectance (content) and illumination (lighting) components. By performing this decomposition in a more robust latent space and then using a diffusion model to transfer desirable illumination from a normal-light image to the low-light image's content, the method aims to enhance low-light images without requiring paired training data. The diffusion model then implicitly learns to compensate for any information loss during decomposition and further refines the enhancement. A self-constrained consistency loss ensures that the content of the enhanced image remains faithful to the original low-light input.

4.2. Core Methodology In-depth (Layer by Layer)

The overall pipeline of LightenDiffusion is illustrated in Figure 2.

4.2.1. Overall Pipeline

The process begins with an unpaired low-light image ($I_{low}$) and a normal-light image ($I_{high}$).

  1. Latent Feature Extraction: An encoder $\mathcal{E}(\cdot)$ first transforms both $I_{low}$ and $I_{high}$ from the image space into a lower-dimensional latent space. This yields encoded features $\mathcal{F}_{low}$ and $\mathcal{F}_{high}$, respectively. The encoder consists of $k$ cascaded residual blocks, each downsampling its input by a factor of 2 via max-pooling. So, if $I_{low} \in \mathbb{R}^{H \times W \times 3}$, then $\mathcal{F}_{low} \in \mathbb{R}^{\frac{H}{2^k} \times \frac{W}{2^k} \times C}$.

    graph TD
        A[Low-Light Image I_low] --> E1;
        B[Normal-Light Image I_high] --> E2;
        E1[Encoder ε(·)] --> F_low[Latent Feature F_low];
        E2[Encoder ε(·)] --> F_high[Latent Feature F_high];
    
    • Ilow,IhighI_{low}, I_{high}: Input images (low-light and normal-light, respectively).
    • E()\mathcal{E}(\cdot): Encoder network that transforms image space input to latent space features.
    • Flow,Fhigh\mathcal{F}_{low}, \mathcal{F}_{high}: Encoded latent features for low-light and normal-light images.
    • H, W: Height and width of the input images.
    • CC: Number of channels in the latent features.
    • kk: Number of downsampling steps in the encoder.
  2. Latent-Space Retinex Decomposition: The encoded features Flow\mathcal{F}_{low} and Fhigh\mathcal{F}_{high} are then fed into the proposed Content-Transfer Decomposition Network (CTDN). The CTDN performs Retinex decomposition within the latent space to separate each feature into two components:

    • Content-rich reflectance map (Rlow\mathbf{R}_{low}, Rhigh\mathbf{R}_{high}): Captures the intrinsic content and texture details.
    • Content-free illumination map (Llow\mathbf{L}_{low}, Lhigh\mathbf{L}_{high}): Represents only the lighting conditions.
    graph TD
        F_low --> CTDN;
        F_high --> CTDN;
        CTDN --> R_low[Reflectance Map R_low];
        CTDN --> L_low[Illumination Map L_low];
        CTDN --> R_high[Reflectance Map R_high];
        CTDN --> L_high[Illumination Map L_high];
    
    • Rlow,Rhigh\mathbf{R}_{low}, \mathbf{R}_{high}: Reflectance maps extracted from Flow\mathcal{F}_{low} and Fhigh\mathcal{F}_{high}.
    • Llow,Lhigh\mathbf{L}_{low}, \mathbf{L}_{high}: Illumination maps extracted from Flow\mathcal{F}_{low} and Fhigh\mathcal{F}_{high}.
  3. Diffusion-Based Restoration: The reflectance map of the low-light image (Rlow\mathbf{R}_{low}) and the illumination map of the normal-light image (Lhigh\mathbf{L}_{high}) are combined to form an initial input for the diffusion model. This combined feature, x0=RlowLhigh\mathbf{x}_0 = \mathbf{R}_{low} \odot \mathbf{L}_{high}, conceptually represents the low-light content with normal-light illumination. This x0\mathbf{x}_0 then undergoes a forward diffusion process. Subsequently, a reverse denoising process is performed, guided by the original low-light feature Flow\mathcal{F}_{low} (denoted as x~\tilde{\mathbf{x}}), to gradually transform random Gaussian noise into a restored feature F^low\hat{\mathcal{F}}_{low}.

    graph TD
        R_low --> H[Hadamard Product];
        L_high --> H;
        H --> X0[x0 = R_low ⊙ L_high];
        X0 --> DM[Diffusion Model];
        F_low --> DM;
        DM --> F_hat_low[Restored Feature F_hat_low];
    
    • F^low\hat{\mathcal{F}}_{low}: The restored latent feature for the low-light image.
  4. Final Image Reconstruction: Finally, the restored feature F^low\hat{\mathcal{F}}_{low} is passed through a decoder D()\mathcal{D}(\cdot) to reconstruct the final enhanced low-light image I^low\hat{I}_{low} in the image space.

    graph TD
        F_hat_low --> D[Decoder D(·)];
        D --> I_hat_low[Restored Image I_hat_low];
    
    • D()\mathcal{D}(\cdot): Decoder network that transforms latent space features back to image space.

    • I^low\hat{I}_{low}: The final enhanced low-light image.

      The overall pipeline is depicted in Figure 2:

Fig. 2: The overall pipeline of our proposed framework. We first employ an encoder $\mathcal{E}(\cdot)$ to convert the unpaired low-light image $I_{low}$ and normal-light image $I_{high}$ into latent space, denoted as $\mathcal{F}_{low}$ and $\mathcal{F}_{high}$. The encoded features are sent to the proposed content-transfer decomposition network (CTDN) to generate content-rich reflectance maps, denoted as $\mathbf{R}_{low}$ and $\mathbf{R}_{high}$, and content-free illumination maps, denoted as $\mathbf{L}_{low}$ and $\mathbf{L}_{high}$. Then, the reflectance map of the low-light image $\mathbf{R}_{low}$ and the illumination map of the normal-light image $\mathbf{L}_{high}$ are taken as the input of the diffusion model to perform the forward diffusion process. Finally, we perform the reverse denoising process to gradually transform the randomly sampled Gaussian noise $\hat{\mathbf{x}}_T$ into the restored feature $\hat{\mathcal{F}}_{low}$ with the guidance of the low-light feature $\mathcal{F}_{low}$, denoted as $\tilde{\mathbf{x}}$, and subsequently send it to a decoder $\mathcal{D}(\cdot)$ to produce the final result $\hat{I}_{low}$.
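
The data flow above can be summarized in a few lines of pseudocode. The module names (`encoder`, `ctdn`, `diffusion`, `decoder`) and their call signatures are placeholders for this sketch, not the released implementation:

```python
import torch

@torch.no_grad()
def lighten_pipeline(i_low, i_high, encoder, ctdn, diffusion, decoder):
    f_low, f_high = encoder(i_low), encoder(i_high)   # latent features F_low, F_high
    r_low, l_low = ctdn(f_low)                        # latent-space Retinex decomposition
    r_high, l_high = ctdn(f_high)
    x0 = r_low * l_high                               # x_0 = R_low ⊙ L_high (forward-diffusion input)
    # reverse denoising: start from Gaussian noise of the same shape as x_0,
    # guided by the low-light feature F_low, to obtain the restored feature F̂_low
    f_hat_low = diffusion.sample(shape=x0.shape, cond=f_low)
    return decoder(f_hat_low)                         # enhanced image Î_low
```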

4.2.2. Content-Transfer Decomposition Network (CTDN)

The Retinex theory fundamentally assumes that an image $I$ can be decomposed into a reflectance map $\mathbf{R}$ and an illumination map $\mathbf{L}$: $I = \mathbf{R} \odot \mathbf{L}$ (1)

  • II: The input image (or latent feature in this case).

  • R\mathbf{R}: The reflectance map, representing the inherent content and texture of the scene. It should be consistent across different lighting conditions.

  • L\mathbf{L}: The illumination map, representing the lighting conditions (brightness and contrast). It should be locally smooth and free of content details.

  • \odot: The Hadamard (element-wise) product operation.

    Previous methods typically perform this decomposition in the image space. However, this often results in incomplete separation, where content information might still be partially retained in the illumination map (as shown in Figure 3a). This leakage can lead to artifacts in enhanced images.

To address this, LightenDiffusion introduces the Content-Transfer Decomposition Network (CTDN) which performs decomposition within the latent space. The CTDN aims to ensure that the reflectance maps are content-rich and the illumination maps are truly content-free. The detailed architecture of the CTDN is shown in Figure 4.

Fig. 4: The detailed architecture of our proposed CTDN.

The process within CTDN is as follows:

  1. Initial Estimation: For an input latent feature $\mathcal{F}$ (which can be $\mathcal{F}_{low}$ or $\mathcal{F}_{high}$), initial illumination $\tilde{\mathbf{L}}$ and reflectance $\tilde{\mathbf{R}}$ maps are estimated following a method similar to [14]: $\tilde{\mathbf{L}}(x) = \max_{c \in [0, C)} \mathcal{F}^c(x), \quad \tilde{\mathbf{R}}(x) = \mathcal{F}(x) / (\tilde{\mathbf{L}}(x) + \tau)$ (2)

    • xx: Represents a pixel location in the latent feature map.
    • Fc(x)\mathcal{F}^c(x): The value of the latent feature F\mathcal{F} at pixel xx for channel cc.
    • CC: The total number of channels in the latent feature.
    • maxc[0,C)Fc(x)\operatorname*{max}_{c \in [0, C)} \mathcal{F}^c(x): Takes the maximum value across all channels at a given pixel xx to estimate the initial illumination intensity.
    • L~(x)\tilde{\mathbf{L}}(x): The initially estimated illumination value at pixel xx.
    • R~(x)\tilde{\mathbf{R}}(x): The initially estimated reflectance value at pixel xx, calculated by dividing the feature F(x)\mathcal{F}(x) by the illumination (similar to the Retinex multiplicative model).
    • τ\tau: A small constant added to the denominator to prevent division by zero or very small values, ensuring numerical stability.
  2. Feature Embedding: The initially estimated maps L~\tilde{\mathbf{L}} and R~\tilde{\mathbf{R}} are then refined. First, several convolutional blocks (Convs) are applied to obtain embedded features L\mathbf{L}' and R\mathbf{R}':

    • L=Convs(L~)\mathbf{L}' = \mathrm{Convs}(\tilde{\mathbf{L}})
    • R=Convs(R~)\mathbf{R}' = \mathrm{Convs}(\tilde{\mathbf{R}})
  3. Cross-Attention (CA) for Reflectance Reinforcement: A cross-attention (CA) module [21] is used to leverage the illumination map features (L\mathbf{L}') to reinforce the content information in the reflectance map features (R\mathbf{R}'). This helps to ensure that R\mathbf{R}' captures all relevant content details:

    • R=CA(R,L)\mathbf{R}'' = \mathrm{CA}(\mathbf{R}', \mathbf{L}')
    • The cross-attention mechanism allows R\mathbf{R}' to query L\mathbf{L}' for relevant contextual information, helping to refine R\mathbf{R}' by incorporating details that might be implicitly tied to illumination variations but are intrinsically part of the content.
  4. Self-Attention (SA) for Illumination Content Extraction: A self-attention (SA) module [50] is applied to the illumination map features (L\mathbf{L}') to further extract any remaining content information that might still be present within L\mathbf{L}'. This extracted content is denoted as L\mathbf{L}'':

    • L=SA(L)\mathbf{L}'' = \mathrm{SA}(\mathbf{L}')
    • The self-attention helps the network identify and isolate content patterns within L\mathbf{L}' itself, which should ideally be content-free.
  5. Final Map Generation: The final reflectance map R\mathbf{R} and illumination map L\mathbf{L} are then derived. The extracted content information L\mathbf{L}'' is added to the refined reflectance map R\mathbf{R}'' (to ensure it is truly content-rich) and subtracted from the illumination map L\mathbf{L}' (to ensure it is truly content-free). These intermediate results are passed through additional convolutional blocks (Convs):

    • R=Convs(R+L)\mathbf{R} = \mathrm{Convs}(\mathbf{R}'' + \mathbf{L}'')

    • L=Convs(LL)\mathbf{L} = \mathrm{Convs}(\mathbf{L}' - \mathbf{L}'')

    • This transfer mechanism ("content-transfer") explicitly moves any residual content from the illumination branch to the reflectance branch.

      As a result of this sophisticated CTDN design, the method is able to generate content-rich reflectance maps that fully represent the intrinsic information of the image, and content-free illumination maps that only reveal the lighting conditions, as demonstrated in Figure 3b, contrasting with previous methods in Figure 3a.

Fig. 3: Illustration of the decomposition results obtained by different methods. (a) shows the results of previous methods, i.e., RetinexNet [58], KinD++ [74], URetinexNet [59], and PairLIE [10], that perform decomposition in image space. (b) presents the results of our CTDN, which performs decomposition in latent space. Our method can generate content-rich reflectance maps and content-free illumination maps.
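
A minimal sketch of the CTDN steps listed above, with the convolutional and attention sub-modules left as placeholder callables (only the data flow follows the text; shapes and channel counts are assumptions):

```python
import torch

def ctdn_forward(feat, convs_r, convs_l, cross_attn, self_attn, convs_r_out, convs_l_out, tau=1e-4):
    # Eq. (2): initial illumination (channel-wise max) and reflectance estimates
    l_init = feat.max(dim=1, keepdim=True).values        # (B, 1, H, W)
    r_init = feat / (l_init + tau)
    # feature embedding
    r_emb, l_emb = convs_r(r_init), convs_l(l_init)       # R', L'
    # content transfer: reinforce R via CA, pull residual content out of L via SA
    r_att = cross_attn(r_emb, l_emb)                      # R'' = CA(R', L')
    l_content = self_attn(l_emb)                          # L'' = SA(L')
    reflectance = convs_r_out(r_att + l_content)          # R = Convs(R'' + L'')
    illumination = convs_l_out(l_emb - l_content)         # L = Convs(L' - L'')
    return reflectance, illumination
```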

4.2.3. Latent-Retinex Diffusion Models (LRDM)

Even with the robust CTDN, two challenges remain:

  1. Information Loss: Any decomposition process, including Retinex, inevitably involves some information loss.

  2. Stubborn Content in Illumination: Despite efforts, there might be challenging cases where the estimated illumination map still retains subtle content information, potentially introducing artifacts.

    To address these, the paper proposes a Latent-Retinex Diffusion Model (LRDM). This model leverages the generative capabilities of diffusion models to compensate for lost content and eliminate potential artifacts. It follows the standard forward diffusion and reverse denoising processes.

4.2.3.1. Forward Diffusion

The forward diffusion process progressively adds Gaussian noise to a data point over TT steps.

  1. Input x0\mathbf{x}_0: The input to the diffusion model at time step t=0t=0 is formed by combining the reflectance map of the low-light image (Rlow\mathbf{R}_{low}) and the illumination map of the normal-light image (Lhigh\mathbf{L}_{high}). This reflectance-illumination product is denoted as x0\mathbf{x}_0:

    • x0=RlowLhigh\mathbf{x}_0 = \mathbf{R}_{low} \odot \mathbf{L}_{high}
    • This x0\mathbf{x}_0 represents the desired enhanced feature, combining the content of the low-light image with the lighting of a normal-light image.
  2. Noising Process: A pre-defined variance schedule $\{\beta_1, \beta_2, \dots, \beta_T\}$ is used to gradually transform $\mathbf{x}_0$ into pure Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ over $T$ steps. Each step adds a small amount of Gaussian noise: $q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1 - \beta_t}\, \mathbf{x}_{t-1}, \beta_t \mathbf{I})$ (3)

    • q(xtxt1)q(\mathbf{x}_t \mid \mathbf{x}_{t-1}): The probability distribution of xt\mathbf{x}_t given xt1\mathbf{x}_{t-1}.
    • xt\mathbf{x}_t: The noisy data at time step tt, where t[0,T]t \in [0, T].
    • N(;μ,Σ)\mathcal{N}(\cdot; \mu, \Sigma): Denotes a Gaussian (Normal) distribution with mean μ\mu and covariance matrix Σ\Sigma.
    • 1βtxt1\sqrt{1 - \beta_t} \mathbf{x}_{t-1}: The mean of the Gaussian distribution, indicating that a portion of the previous noisy data xt1\mathbf{x}_{t-1} is retained.
    • βtI\beta_t \mathbf{I}: The variance of the Gaussian distribution. βt\beta_t is a small, positive scalar from the variance schedule, and I\mathbf{I} is the identity matrix, meaning noise is added independently to each dimension.
  3. Closed-Form Expression: For direct sampling, $\mathbf{x}_t$ can be obtained from $\mathbf{x}_0$ in a single step using parameter re-normalization:

    • $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\, \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t}\, \boldsymbol{\epsilon}_t$
    • $\alpha_t = 1 - \beta_t$: A re-parameterization of the variance.
    • $\bar{\alpha}_t = \prod_{i=0}^t \alpha_i$: The cumulative product of the $\alpha_i$ values up to time $t$.
    • $\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: A randomly sampled Gaussian noise vector at time $t$.
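
A minimal PyTorch sketch of this closed-form forward step, using an assumed linear variance schedule (the schedule values here are illustrative, not taken from the paper):

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)          # variance schedule {beta_t} (assumed values)
alphas = 1.0 - betas                           # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)      # alpha_bar_t = prod_i alpha_i

def q_sample(x0, t, noise):
    """Sample x_t ~ q(x_t | x_0) in a single step."""
    a_bar = alpha_bars[t]
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise

x0 = torch.rand(1, 32, 16, 16)                 # e.g. R_low ⊙ L_high in latent space
t = torch.randint(0, T, (1,)).item()
xt = q_sample(x0, t, torch.randn_like(x0))
print(xt.shape)                                # torch.Size([1, 32, 16, 16])
```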

4.2.3.2. Reverse Denoising

The reverse denoising process aims to gradually transform randomly sampled Gaussian noise back into the desired enhanced feature.

  1. Conditional Generation: A randomly sampled Gaussian noise $\hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is progressively denoised into a sharp, restored feature $\hat{\mathbf{x}}_0$. This process is guided by the encoded low-light feature $\mathcal{F}_{low}$, denoted as $\tilde{\mathbf{x}}$, which ensures that the restored result maintains fidelity to the content of the original low-light image: $p_\theta(\hat{\mathbf{x}}_{t-1} \mid \hat{\mathbf{x}}_t, \tilde{\mathbf{x}}) = \mathcal{N}(\hat{\mathbf{x}}_{t-1}; \boldsymbol{\mu}_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t), \sigma_t^2 \mathbf{I})$ (4)

    • pθ(x^t1x^t,x~)p_\theta(\hat{\mathbf{x}}_{t-1} \mid \hat{\mathbf{x}}_t, \tilde{\mathbf{x}}): The probability distribution of the denoised data x^t1\hat{\mathbf{x}}_{t-1} given the noisy data x^t\hat{\mathbf{x}}_t and the condition x~\tilde{\mathbf{x}}. The subscript θ\theta indicates that this distribution is learned by the neural network.
    • x^t\hat{\mathbf{x}}_t: Noisy data at time step tt during the reverse process.
    • x~\tilde{\mathbf{x}}: The conditional input, which is the encoded low-light feature Flow\mathcal{F}_{low}.
    • μθ(x^t,x~,t)\boldsymbol{\mu}_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t): The mean of the Gaussian distribution for denoising, predicted by the neural network with parameters θ\theta, taking x^t\hat{\mathbf{x}}_t, x~\tilde{\mathbf{x}}, and tt as input.
    • σt2I\sigma_t^2 \mathbf{I}: The variance of the Gaussian distribution for denoising.
    • σt2=1αˉt11αˉtβt\sigma_t^2 = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t} \beta_t: The variance, derived from the forward process schedule.
  2. Mean Value Calculation: The mean value μθ\boldsymbol{\mu}_\theta is typically re-parameterized to predict the noise ϵθ\boldsymbol{\epsilon}_\theta:

    • $\boldsymbol{\mu}_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t) = \frac{1}{\sqrt{\alpha_t}} \left( \hat{\mathbf{x}}_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t) \right)$
    • ϵθ(x^t,x~,t)\epsilon_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t): A neural network (often a U-Net) that predicts the noise component ϵt\boldsymbol{\epsilon}_t at time step tt, given the noisy input x^t\hat{\mathbf{x}}_t, the conditional guidance x~\tilde{\mathbf{x}}, and the current time step tt. This network is the core of the diffusion model.
  3. Diffusion Loss ($\mathcal{L}_{diff}$): During training, the objective is to optimize the parameters $\theta$ of the noise estimator network $\epsilon_\theta$ so that its prediction is as close as possible to the actual noise $\boldsymbol{\epsilon}_t$ added during the forward process. This is the standard objective for diffusion models [19]: $\mathcal{L}_{diff} = \| \boldsymbol{\epsilon}_t - \epsilon_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, t) \|_2$ (5)

    • Ldiff\mathcal{L}_{diff}: The diffusion loss, calculated as the L2 norm (squared Euclidean distance) between the true noise and the predicted noise.
    • ϵt\boldsymbol{\epsilon}_t: The ground-truth noise sampled from N(0,I)\mathcal{N}(\mathbf{0}, \mathbf{I}) and used to create xt\mathbf{x}_t from x0\mathbf{x}_0.
    • ϵθ(xt,x~,t)\epsilon_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, t): The noise predicted by the neural network.
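
A minimal sketch of the noise-prediction objective in Eq. (5). Here `eps_model` stands in for the conditional U-Net $\epsilon_\theta$, and the L2 objective is realized with the usual mean-squared error:

```python
import torch
import torch.nn.functional as F

def diffusion_loss(eps_model, x0, cond, t, alpha_bars):
    noise = torch.randn_like(x0)                              # ground-truth eps_t
    a_bar = alpha_bars[t]
    xt = a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise     # closed-form forward diffusion
    pred = eps_model(xt, cond, t)                             # eps_theta(x_t, x~, t)
    return F.mse_loss(pred, noise)                            # || eps_t - eps_theta ||^2
```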

4.2.3.3. Self-Constrained Consistency Loss (Lscc\mathcal{L}_{scc})

The initial input x0=RlowLhigh\mathbf{x}_0 = \mathbf{R}_{low} \odot \mathbf{L}_{high} could still contain artifacts if Lhigh\mathbf{L}_{high} is not perfectly content-free. This might disrupt the learned distribution of the diffusion model and affect the quality of the restored feature F^low\hat{\mathcal{F}}_{low}. To prevent this, a self-constrained consistency loss Lscc\mathcal{L}_{scc} is proposed. Its purpose is to ensure that the restored feature F^low\hat{\mathcal{F}}_{low} retains the same intrinsic content information as the input low-light image.

  1. Pseudo Label Construction: In the training phase, a pseudo label F¨low\ddot{\mathcal{F}}_{low} is constructed from the decomposition results of the low-light image itself. This serves as a reference for the desired content:

    • $\ddot{\mathcal{F}}_{low} = \mathbf{R}_{low} \odot \mathbf{L}_{low}^\gamma$
    • Rlow\mathbf{R}_{low}: The reflectance map of the low-light image.
    • Llow\mathbf{L}_{low}: The illumination map of the low-light image.
    • γ\gamma: An illumination correction factor (set to 0.2 in experiments) applied to Llow\mathbf{L}_{low}. This factor adjusts the brightness of the low-light illumination map to a more "normal" level without changing its content, creating a pseudo ground truth that preserves the original content while having an adjusted illumination.
  2. Consistency Loss: The $\mathcal{L}_{scc}$ constrains the similarity between this pseudo label $\ddot{\mathcal{F}}_{low}$ and the restored feature $\hat{\mathcal{F}}_{low}$ produced by the diffusion model: $\mathcal{L}_{scc} = \| \ddot{\mathcal{F}}_{low} - \hat{\mathcal{F}}_{low} \|_1$ (6)

    • Lscc\mathcal{L}_{scc}: The self-constrained consistency loss, calculated as the L1 norm (Manhattan distance) between the pseudo label and the restored feature. The L1 norm encourages sparsity and preserves edges better than L2.
    • F^low\hat{\mathcal{F}}_{low}: The restored feature generated by the diffusion model.
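
A minimal sketch of the pseudo-label construction and $\mathcal{L}_{scc}$ from Eq. (6). The value $\gamma = 0.2$ follows the paper; the clamp is only a numerical safeguard added for this sketch:

```python
import torch
import torch.nn.functional as F

def scc_loss(r_low, l_low, f_hat_low, gamma=0.2):
    pseudo = r_low * l_low.clamp(min=1e-6).pow(gamma)   # F̈_low = R_low ⊙ L_low^gamma
    return F.l1_loss(f_hat_low, pseudo)                 # L1 distance to the restored feature
```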

4.2.3.4. Overall Objective and Training Algorithm

The overall objective function for optimizing the Latent-Retinex Diffusion Model (LRDM) is a combination of the diffusion loss and the self-constrained consistency loss:

  • $\mathcal{L} = \mathcal{L}_{diff} + \lambda_1 \mathcal{L}_{scc}$

  • λ1\lambda_1: A hyperparameter that balances the contribution of the self-constrained consistency loss.

    The training strategy for LRDM is summarized in Algorithm 1.

    Input: the decomposition results R_low and L_high, low-light feature F_low (condition x~), time steps T, and sampling steps S.
    x_0 = R_low ⊙ L_high, x~ = F_low
    while not converged do
        ε_t ∼ N(0, I), t ∼ Uniform{1, ..., T}
        Take a gradient descent step on ∇_θ ∥ ε_t − ε_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ε_t, x~, t) ∥_2
        x̂_T ∼ N(0, I)
        for i = S : 1 do
            t = (i − 1) · T/S + 1
            t_next = (i − 2) · T/S + 1 if i > 1, else 0
            x̂_{t_next} ← √ᾱ_{t_next} · (x̂_t − √(1 − ᾱ_t) · ε_θ(x̂_t, x~, t)) / √ᾱ_t + √(1 − ᾱ_{t_next}) · ε_θ(x̂_t, x~, t)
        end
        Take a gradient descent step on ∇_θ ∥ R_low ⊙ L_low^γ − x̂_0 ∥_2
    end

Algorithm 1: LRDM training (Simplified and explained)

Input:

  • Rlow\mathbf{R}_{low}: Reflectance map of the low-light image (from CTDN).
  • Lhigh\mathbf{L}_{high}: Illumination map of the normal-light image (from CTDN).
  • Flow\mathcal{F}_{low}: Low-light feature (from Encoder E()\mathcal{E}(\cdot)), serves as condition x~\tilde{\mathbf{x}}.
  • TT: Total time steps for forward diffusion.
  • SS: Sampling steps for reverse denoising.

Initialization:

  • x0=RlowLhigh\mathbf{x}_0 = \mathbf{R}_{low} \odot \mathbf{L}_{high} (Target for diffusion to learn).
  • x~=Flow\tilde{\mathbf{x}} = \mathcal{F}_{low} (Guidance for diffusion).

Training Loop (while not converged):

  1. Sample Noise and Time:
    • Sample random noise ϵtN(0,I)\boldsymbol{\epsilon}_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I}).
    • Sample a random time step tUniform{1,,T}t \sim \text{Uniform}\{1, \dots, T\}.
  2. Generate Noisy Input: Create xt\mathbf{x}_t from x0\mathbf{x}_0 and ϵt\boldsymbol{\epsilon}_t using the closed-form forward diffusion equation: xt=αˉtx0+1αˉtϵt\mathbf{x}_t = \sqrt{\bar{\alpha}_t} \mathbf{x}_0 + \sqrt{1 - \bar{\alpha}_t} \boldsymbol{\epsilon}_t.
  3. Optimize Diffusion Model: Perform a gradient descent step to update the parameters θ\theta of the noise prediction network ϵθ\epsilon_\theta. The objective is to minimize the difference between the true noise ϵt\boldsymbol{\epsilon}_t and the predicted noise ϵθ(xt,x~,t)\epsilon_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, t):
    • Minimize ϵtϵθ(xt,x~,t)22\|\boldsymbol{\epsilon}_t - \epsilon_\theta(\mathbf{x}_t, \tilde{\mathbf{x}}, t)\|_2^2 (This is Ldiff\mathcal{L}_{diff}).
  4. Perform Reverse Denoising (for Lscc\mathcal{L}_{scc}):
    • Initialize x^TN(0,I)\hat{\mathbf{x}}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I}) (random Gaussian noise).
    • For i=Si = S down to 1 (iterative denoising steps):
      • Calculate current time t=(i1)T/S+1t = (i-1) \cdot T/S + 1.
      • Calculate next time tnext=(i2)T/S+1t_{next} = (i-2) \cdot T/S + 1 (if i>1i > 1, else 0).
      • Estimate the noise ϵθ(x^t,x~,t)\epsilon_\theta(\hat{\mathbf{x}}_t, \tilde{\mathbf{x}}, t) using the current model.
      • Update x^t\hat{\mathbf{x}}_t to x^t1\hat{\mathbf{x}}_{t-1} using the estimated noise and the DDIM (Denoising Diffusion Implicit Models) [49] sampling formula (implicitly represented by the ZZ and xt update lines in the algorithm block). This part explicitly generates F^low\hat{\mathcal{F}}_{low} (which is x^0\hat{\mathbf{x}}_0 after SS steps).
    • The restored feature F^low\hat{\mathcal{F}}_{low} is the final x^0\hat{\mathbf{x}}_0 obtained after this sampling loop.
  5. Optimize Self-Constrained Consistency Loss: Perform a gradient descent step to update θ\theta to minimize the Lscc\mathcal{L}_{scc} between the pseudo label F¨low=RlowLlowγ\ddot{\mathcal{F}}_{low} = \mathbf{R}_{low} \odot \mathbf{L}_{low}^\gamma and the restored feature F^low\hat{\mathcal{F}}_{low} (which is the x^0\hat{\mathbf{x}}_0 generated in the previous step):
    • Minimize RlowLlowγF^low1\|\mathbf{R}_{low} \odot \mathbf{L}_{low}^\gamma - \hat{\mathcal{F}}_{low}\|_1 (This is Lscc\mathcal{L}_{scc}).

    • Note: The algorithm's line "Perform gradient descent steps on θRlowLlowγx^022\theta|R_{low} \odot L_{low}^\gamma - \hat{\mathbf{x}}_0\|_2^2" suggests an L2 loss, but the text states L1 norm in Equation 6. This discrepancy might be a minor detail in implementation or presentation. Following the text, it's L1.

      The U-Net [46] architecture is adopted as the noise estimator network ϵθ\epsilon_\theta. The number of time steps TT is 1000, and the sampling steps SS for reverse denoising is 20.
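
For reference, a minimal sketch of the S-step deterministic (DDIM-style) sampling loop outlined in Algorithm 1. The indexing and schedule handling are simplified assumptions, and `eps_model` is a placeholder for the trained noise estimator:

```python
import torch

@torch.no_grad()
def ddim_sample(eps_model, cond, shape, alpha_bars, T=1000, S=20):
    x = torch.randn(shape)                                   # x̂_T ~ N(0, I)
    for i in range(S, 0, -1):
        t = (i - 1) * T // S + 1
        t_next = (i - 2) * T // S + 1 if i > 1 else 0
        a_t = alpha_bars[t - 1]
        a_next = alpha_bars[max(t_next - 1, 0)]
        eps = eps_model(x, cond, t)                          # predicted noise
        x0_pred = (x - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()        # current estimate of x̂_0
        x = a_next.sqrt() * x0_pred + (1.0 - a_next).sqrt() * eps    # deterministic step to t_next
    return x                                                 # restored feature F̂_low
```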

4.2.4. Network Training

The training process for LightenDiffusion is divided into two stages:

  1. Stage 1: Training Encoder, CTDN, and Decoder:

    • Objective: To effectively train the encoder E()\mathcal{E}(\cdot), the Content-Transfer Decomposition Network (CTDN), and the decoder D()\mathcal{D}(\cdot) to accurately perform feature extraction, latent-space decomposition, and reconstruction. During this stage, the parameters of the diffusion model are frozen.
    • Dataset: This stage uses low-light image pairs (e.g., Ilow1I_{low}^1 and Ilow2I_{low}^2) from the SICE dataset [3]. These are likely different exposures of the same scene, used to enforce consistency.
    • Loss Functions:
      • Content Loss ($\mathcal{L}_{con}$): Optimizes the encoder and decoder to ensure faithful reconstruction of the input low-light images: $\mathcal{L}_{con} = \sum_{i=1}^{2} \| I_{low}^i - \mathcal{D}(\mathcal{E}(I_{low}^i)) \|_2$ (7)
        • Lcon\mathcal{L}_{con}: The content loss, measured as the L2 norm between the input low-light image and its reconstructed version after passing through the encoder and decoder.
        • Ilowi\boldsymbol{I}_{low}^i: The ii-th input low-light image.
        • D(E(Ilowi))\mathcal{D}(\mathcal{E}(\boldsymbol{I}_{low}^i)): The reconstructed image.
      • Decomposition Loss (Ldec\mathcal{L}_{dec}): Optimizes the CTDN to ensure proper Retinex decomposition. It consists of three components [58, 71, 74]:
        • Reconstruction Loss ($\mathcal{L}_{rec}$): Guarantees that the decomposed reflectance and illumination components can reconstruct the original input latent features: $\mathcal{L}_{rec} = \sum_{i=1}^{2} \sum_{j=1}^{2} \| \mathcal{F}_{low}^j - \mathbf{R}_{low}^i \odot \mathbf{L}_{low}^j \|_1$ (8)
          • Lrec\mathcal{L}_{rec}: The reconstruction loss, calculated as the L1 norm.
          • Flowj\mathcal{F}_{low}^j: The jj-th low-light latent feature.
          • Rlowi\mathbf{R}_{low}^i: The ii-th low-light reflectance map.
          • Llowj\mathbf{L}_{low}^j: The jj-th low-light illumination map. This loss ensures that combining any reflectance from one low-light image with illumination from another (or the same) low-light image can reconstruct the respective original low-light feature. This helps disentangle the components.
        • Reflectance Consistency Loss ($\mathcal{L}_{ref}$): Enforces that the reflectance maps derived from different low-light images of the same scene (e.g., different exposures) should be consistent, since reflectance is invariant to illumination changes: $\mathcal{L}_{ref} = \| \mathbf{R}_{low}^1 - \mathbf{R}_{low}^2 \|_1$ (9)
          • Lref\mathcal{L}_{ref}: The reflectance consistency loss, calculated as the L1 norm.
          • Rlow1,Rlow2\mathbf{R}_{low}^1, \mathbf{R}_{low}^2: Reflectance maps from two different low-light inputs (Ilow1,Ilow2I_{low}^1, I_{low}^2) of the same scene.
        • Illumination Smoothing Loss ($\mathcal{L}_{ill}$): Encourages the illumination map to be locally smooth, suppressing high-frequency details (which should belong to the reflectance map). It is weighted by the reflectance gradient to preserve edges: $\mathcal{L}_{ill} = \sum_{i=1}^{2} \| \nabla \mathbf{L}_{low}^i \cdot \exp(-\lambda_g \nabla \mathbf{R}_{low}^i) \|_2$ (10)
          • Lill\mathcal{L}_{ill}: The illumination smoothing loss, calculated as the L2 norm.
          • \nabla: Denotes the horizontal and vertical gradient operator.
          • Llowi\nabla \mathbf{L}_{low}^i: Gradient of the illumination map, which should ideally be small for smooth regions.
          • Rlowi\nabla \mathbf{R}_{low}^i: Gradient of the reflectance map, which indicates edges and content details.
          • exp(λgRlowi)\exp(-\lambda_g \nabla \mathbf{R}_{low}^i): An edge-aware weight. Where reflectance gradients are large (indicating an edge), this term becomes small, effectively reducing the penalty on illumination gradients near strong edges. This allows the illumination map to change more abruptly across object boundaries while remaining smooth within regions.
          • λg\lambda_g: A coefficient to balance the perceived strength of the structure.
        • Overall Decomposition Loss: Ldec=Lrec+λ2Lref+λ3Lill\mathcal{L}_{dec} = \mathcal{L}_{rec} + \lambda_2 \mathcal{L}_{ref} + \lambda_3 \mathcal{L}_{ill}.
          • λ2,λ3\lambda_2, \lambda_3: Hyperparameters balancing the components of Ldec\mathcal{L}_{dec}.
  2. Stage 2: Training the Diffusion Model:

    • Objective: To train the Latent-Retinex Diffusion Model (LRDM) using the combined loss L=Ldiff+λ1Lscc\mathcal{L} = \mathcal{L}_{diff} + \lambda_1 \mathcal{L}_{scc}. During this stage, the encoder, CTDN, and decoder (which were trained in Stage 1) are frozen.

    • Dataset: Approximately 180k unpaired low-light/normal-light image pairs are collected for this stage, providing a diverse set of real-world scenarios.

      This two-stage training strategy allows for specialized optimization of the decomposition and reconstruction components first, followed by the generative diffusion process, leading to a more stable and effective overall framework.
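
For concreteness, here is a sketch of how the Stage-1 decomposition losses (Eqs. (8)-(10)) might be implemented. The gradient operator is approximated with finite differences, and the weights $\lambda_2$, $\lambda_3$, and $\lambda_g$ are placeholders rather than the paper's values:

```python
import torch
import torch.nn.functional as F

def spatial_grad(x):
    """Absolute horizontal + vertical finite differences, padded to keep the input shape."""
    dx = F.pad((x[..., :, 1:] - x[..., :, :-1]).abs(), (0, 1))
    dy = F.pad((x[..., 1:, :] - x[..., :-1, :]).abs(), (0, 0, 0, 1))
    return dx + dy

def decomposition_loss(f_list, r_list, l_list, lam_ref=0.1, lam_ill=0.1, lam_g=10.0):
    # L_rec: any reflectance combined with any illumination should rebuild that feature
    l_rec = sum(F.l1_loss(r_list[i] * l_list[j], f_list[j]) for i in range(2) for j in range(2))
    # L_ref: reflectance maps of the two exposures should agree
    l_ref = F.l1_loss(r_list[0], r_list[1])
    # L_ill: edge-aware smoothness of the illumination maps
    l_ill = sum((spatial_grad(l) * torch.exp(-lam_g * spatial_grad(r))).norm()
                for r, l in zip(r_list, l_list))
    return l_rec + lam_ref * l_ref + lam_ill * l_ill
```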

5. Experimental Setup

5.1. Datasets

The experiments evaluate the performance of LightenDiffusion on a variety of publicly available datasets, categorized as paired and unpaired.

5.1.1. Paired Datasets

These datasets contain low-light images and their corresponding normal-light ground-truth images, allowing for precise evaluation with full-reference metrics.

  • LOL [58]: (Low-light Image Enhancement Dataset)
    • Characteristics: A widely used dataset for LLIE, providing paired low/normal-light images. It often serves as a benchmark for training and evaluating supervised methods.
    • Usage: Used for quantitative and qualitative comparisons, particularly for evaluating against supervised methods that are typically trained on this dataset.
  • LSRW [16]: (Large-Scale Real-World paired dataset)
    • Characteristics: A large-scale paired dataset captured in real-world scenes, offering more diverse scenarios than LOL and often used to probe cross-dataset generalization.
    • Usage: Used for quantitative and qualitative comparisons.

5.1.2. Unpaired Datasets

These datasets consist only of low-light images (without corresponding normal-light ground truth) and are used to evaluate the generalization ability of models to real-world, unseen scenarios.

  • DICM [28]:
    • Characteristics: A collection of real-world low-light images captured with commercial digital cameras, widely used as a no-reference benchmark for LLIE.
    • Usage: Used for quantitative and qualitative comparisons, specifically to assess unsupervised methods' performance and generalization.
  • NPE [53]: (Naturalness Preserved Enhancement)
    • Characteristics: Contains real-world low-light images.
    • Usage: Used for quantitative and qualitative comparisons.
  • VV [51]: (Vonikakis-Kouskouridas-Gasteratos)
    • Characteristics: Another dataset of real-world low-light images, often used for evaluating illumination compensation algorithms.
    • Usage: Used for quantitative and qualitative comparisons.

5.1.3. Face Detection Dataset

  • DARK FACE [69]:
    • Characteristics: Consists of 6,000 images captured under weakly illuminated conditions with annotated labels for face detection. This dataset is specifically designed to test the impact of LLIE methods as a pre-processing step for improving low-light face detection.
    • Usage: Used to investigate the practical value of LLIE methods in improving downstream vision tasks.

5.2. Evaluation Metrics

The choice of evaluation metrics depends on whether paired ground truth images are available.

5.2.1. For Paired Datasets (LOL, LSRW)

For these datasets, where ground-truth normal-light images (IGTI_{GT}) are available, full-reference metrics are used:

  • PSNR (Peak Signal-to-Noise Ratio)

    • Conceptual Definition: PSNR quantifies the quality of reconstruction of an image compared to an original image. It is most commonly used to measure the quality of reconstruction of lossy compression codecs or image processing methods like enhancement. A higher PSNR generally indicates a better reconstruction, implying that the enhanced image is closer to the ground truth. It is inversely related to the mean squared error (MSE).
    • Mathematical Formula: $ \text{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I_{GT}(i,j) - I_{ENH}(i,j)]^2 $ $ \text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right) $
    • Symbol Explanation:
      • IGTI_{GT}: The ground-truth normal-light image.
      • IENHI_{ENH}: The enhanced image produced by the method.
      • M, N: The dimensions (height and width) of the image.
      • IGT(i,j),IENH(i,j)I_{GT}(i,j), I_{ENH}(i,j): The pixel values at coordinates (i,j) in the ground-truth and enhanced images, respectively.
      • MAXI\text{MAX}_I: The maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images).
      • MSE\text{MSE}: Mean Squared Error.
  • SSIM (Structural Similarity Index Measure)

    • Conceptual Definition: SSIM is a perceptual metric that evaluates the similarity between two images. Unlike PSNR which focuses on absolute errors, SSIM considers image degradation as a perceived change in structural information, also incorporating luminance and contrast changes. It attempts to model how the human visual system perceives quality. A SSIM value closer to 1 indicates higher similarity and better perceived quality.
    • Mathematical Formula: $ \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
    • Symbol Explanation:
      • x, y: Two image patches (or the entire images) being compared, representing IENHI_{ENH} and IGTI_{GT}.
      • μx,μy\mu_x, \mu_y: The average (mean) of xx and yy, respectively.
      • σx2,σy2\sigma_x^2, \sigma_y^2: The variance of xx and yy, respectively.
      • σxy\sigma_{xy}: The covariance of xx and yy.
      • C1=(K1L)2,C2=(K2L)2C_1 = (K_1 L)^2, C_2 = (K_2 L)^2: Small constants to prevent division by zero or near-zero values. LL is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and K1,K2K_1, K_2 are typically small constants (e.g., 0.01 and 0.03).
  • LPIPS (Learned Perceptual Image Patch Similarity)

    • Conceptual Definition: LPIPS is a perceptual metric that uses features from a pre-trained deep convolutional neural network (e.g., VGG, AlexNet) to measure the "perceptual distance" between two images. It aims to better align with human judgment of image similarity than traditional metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity and better quality.
    • Mathematical Formula: LPIPS is a learned metric, meaning its calculation involves forward passes through a deep neural network and is not represented by a single, simple mathematical formula in the same way as PSNR or SSIM. Therefore, a specific formula for LPIPS is not provided here, as per typical practice for learned metrics. A combined code sketch for PSNR, SSIM, and LPIPS is given after this list.
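For reference, the snippet below shows one common way to compute these three full-reference metrics in Python; it is a sketch rather than the paper's evaluation script. It assumes 8-bit RGB inputs, a recent scikit-image version for SSIM, and the lpips package (the reference implementation by Zhang et al.) for LPIPS.

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import structural_similarity


def psnr(gt: np.ndarray, enh: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between ground-truth and enhanced images of the same shape; higher is better."""
    mse = np.mean((gt.astype(np.float64) - enh.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)


# Stand-ins for I_GT and I_ENH (H x W x 3, uint8).
gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
enh = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)

print("PSNR:", psnr(gt, enh))
print("SSIM:", structural_similarity(gt, enh, channel_axis=-1, data_range=255))

# LPIPS expects NCHW tensors scaled to [-1, 1]; lower is better.
to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0
lpips_fn = lpips.LPIPS(net='alex')             # AlexNet backbone, the variant most papers report
print("LPIPS:", lpips_fn(to_tensor(enh), to_tensor(gt)).item())
```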

5.2.2. For Unpaired Datasets (DICM, NPE, VV)

For these datasets, where no ground-truth normal-light images are available, non-reference (or "no-reference") perceptual metrics are used:

  • NIQE (Naturalness Image Quality Evaluator)

    • Conceptual Definition: NIQE is a no-reference image quality assessment (NR-IQA) metric. It works by building a natural scene statistic (NSS) model from a database of pristine natural images and then measures the distance between the NSS features of the test image and those of the pristine model. A lower NIQE score indicates better image quality and greater naturalness.
    • Mathematical Formula: NIQE is based on a complex statistical model derived from natural images. It involves fitting a Generalized Gaussian Distribution (GGD) to local image features (e.g., mean subtracted contrast normalized (MSCN) coefficients) and then calculating a distance measure between the multivariate Gaussian model parameters of the enhanced image and those of a reference natural image database. A single simple formula is not typically provided for NIQE as it relies on a trained statistical model.
      • The core idea involves a distance computation (e.g., Mahalanobis distance) between the feature vectors of the test image and the natural image model; a short sketch of this distance step is given after this list.
  • PI (Perceptual Index)

    • Conceptual Definition: PI is a no-reference perceptual metric typically used in image enhancement and super-resolution tasks. It is designed to correlate well with human perception of image quality. Like LPIPS, it often incorporates elements related to naturalness and distortion, aiming for a single score that reflects overall visual appeal. A lower PI score indicates better perceptual quality.
    • Mathematical Formula: PI is a composite no-reference metric. As defined in the PIRM 2018 challenge, it is commonly computed as PI=12((10Ma)+NIQE)\mathrm{PI} = \frac{1}{2}\big((10 - \mathrm{Ma}) + \mathrm{NIQE}\big), where Ma denotes the learned no-reference quality score of Ma et al.; other combinations of no-reference metrics are also used in practice.
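Since the final NIQE score described above reduces to a distance between two multivariate Gaussian (MVG) fits, its core step can be sketched in a few lines; the preceding feature extraction (MSCN coefficients and GGD fitting) is omitted, and all variable names are illustrative.

```python
import numpy as np


def niqe_score(mu_ref, cov_ref, mu_test, cov_test) -> float:
    """Distance between the pristine-image MVG model (mu_ref, cov_ref) and the MVG fitted
    to the test image's NSS features (mu_test, cov_test). Lower means more natural."""
    diff = np.asarray(mu_ref, dtype=float) - np.asarray(mu_test, dtype=float)
    pooled_cov = (np.asarray(cov_ref, dtype=float) + np.asarray(cov_test, dtype=float)) / 2.0
    d_squared = diff @ np.linalg.pinv(pooled_cov) @ diff   # Mahalanobis-style distance
    return float(np.sqrt(d_squared))
```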

5.2.3. For Low-Light Face Detection (DARK FACE)

  • AP (Average Precision)
    • Conceptual Definition: Average Precision (AP) is a common metric used to evaluate the performance of object detection models. It summarizes the precision-recall curve into a single value. A precision-recall curve plots the precision (the proportion of true positive detections among all positive detections) against recall (the proportion of true positive detections among all actual positive instances). AP is calculated as the area under this curve. A higher AP indicates better detection performance, meaning the model can detect more relevant objects (higher recall) while maintaining high accuracy (higher precision).
    • Mathematical Formula: $ \text{AP} = \sum_{n} (R_n - R_{n-1})P_n $ This is a common way to approximate the area under the precision-recall curve by summing the areas of rectangles.
    • Symbol Explanation:
      • PnP_n: The precision at the nn-th threshold (or recall level).
      • RnR_n: The recall at the nn-th threshold (or recall level).
      • RnRn1R_n - R_{n-1}: The change in recall between consecutive thresholds. The metric is typically calculated at a specific Intersection over Union (IoU) threshold (e.g., IoU=0.3IoU=0.3), which defines what constitutes a correct detection.
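As an illustration of the rectangle-sum formula above, the helper below computes AP from precision/recall values already obtained at a fixed IoU threshold (e.g., 0.3); it is a generic sketch, not the DARK FACE evaluation protocol.

```python
import numpy as np


def average_precision(precision, recall) -> float:
    """AP = sum_n (R_n - R_{n-1}) * P_n, a rectangle-sum approximation of the area under
    the precision-recall curve (recall assumed sorted in increasing order)."""
    p = np.asarray(precision, dtype=float)
    r = np.concatenate(([0.0], np.asarray(recall, dtype=float)))
    return float(np.sum((r[1:] - r[:-1]) * p))


# Toy usage with three operating points of a detector:
print(average_precision(precision=[1.0, 0.8, 0.6], recall=[0.2, 0.5, 0.7]))  # 0.56
```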

5.3. Baselines

The proposed method LightenDiffusion is compared against a comprehensive set of existing LLIE methods, categorized by their approach:

5.3.1. Traditional Methods

  • LIME [14]
  • SDDLLE [17]
  • CDEF [30]
  • BrainRetinex [4]

5.3.2. Supervised Methods

  • RetinexNet [58]
  • KinD++ [74]
  • LCDPNet [52]
  • URetinexNet [59]
  • SMG [64]
  • PyDiff [76] (a diffusion-based supervised method)
  • GSAD [20] (a diffusion-based supervised method)

5.3.3. Semi-supervised Methods

  • DRBN [68]
  • BLL [39]

5.3.4. Unsupervised Methods

  • Zero-DCE [12]

  • EnlightenGAN [24]

  • RUAS [32]

  • SCI [40]

  • GDP [8] (a zero-shot diffusion-based method)

  • PairLIE [10]

  • NeRCo [67]

    These baselines represent a wide range of approaches, from classical techniques to cutting-edge deep learning methods, including both supervised and unsupervised paradigms, as well as other diffusion-based methods. This diverse comparison set helps to rigorously position LightenDiffusion within the broader LLIE landscape.

5.4. Implementation Details

  • Framework: Implemented with PyTorch.
  • Hardware: Four NVIDIA RTX 2080Ti GPUs.
  • Batch Size: 12.
  • Patch Size: 512×512512 \times 512.
  • Training Iterations:
    • Stage 1 (Encoder, CTDN, Decoder): 1×1051 \times 10^5 iterations.
    • Stage 2 (Diffusion Model): 4×1054 \times 10^5 iterations.
  • Optimizer: Adam optimizer [26].
  • Learning Rate:
    • Stage 1: Initial 1×1041 \times 10^{-4}.
    • Stage 2: Reinitialized to 2×1052 \times 10^{-5} and decayed by a factor of 0.8.
  • Hyperparameters:
    • Encoder downsampling scale kk: 3.
    • Illumination correction factor γ\gamma: 0.2.
    • Loss weights: λ1=0.01\lambda_1 = 0.01, λ2=0.1\lambda_2 = 0.1, λ3=0.01\lambda_3 = 0.01, λg=10\lambda_g = 10.
  • LRDM (Diffusion Model) specifics:
    • Noise estimator network: U-Net [46] architecture.
    • Time steps TT: 1000 (for forward diffusion).
    • Sampling steps SS: 20 (for reverse denoising; see the schedule sketch after this list).
  • Evaluation: For GDP and LightenDiffusion, the reported performance is the mean over five evaluations due to the stochastic nature of generative models.
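To illustrate how the T = 1000 forward diffusion steps relate to the S = 20 sampling steps, the snippet below builds a DDIM-style strided timestep schedule; the paper does not specify the exact spacing, so this is only an assumption for illustration.

```python
import numpy as np

T, S = 1000, 20                                    # forward diffusion steps vs. inference sampling steps
timesteps = np.linspace(0, T - 1, S, dtype=int)[::-1]
print(timesteps)                                   # e.g. [999, 946, ..., 52, 0]: only 20 denoising passes
```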

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Quantitative Comparison

The following are the results from Table 1 of the original paper:

(Type legend: T = traditional, SL = supervised, SSL = semi-supervised, UL = unsupervised learning.)

| Type | Method | LOL [58] PSNR ↑ | LOL SSIM ↑ | LOL LPIPS ↓ | LSRW [16] PSNR ↑ | LSRW SSIM ↑ | LSRW LPIPS ↓ | DICM [28] NIQE ↓ | DICM PI ↓ | NPE [53] NIQE ↓ | NPE PI ↓ | VV [51] NIQE ↓ | VV PI ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T | LIME [14] | 17.546 | 0.531 | 0.290 | 17.342 | 0.520 | 0.416 | 4.476 | 4.216 | 4.170 | 3.789 | 3.713 | 3.335 |
| T | SDDLLE [17] | 13.342 | 0.634 | 0.261 | 14.708 | 0.486 | 0.382 | 4.581 | 3.828 | 4.179 | 3.315 | 4.274 | 3.382 |
| T | CDEF [30] | 16.335 | 0.585 | 0.351 | 16.758 | 0.465 | 0.314 | 4.142 | 4.242 | 3.862 | 2.910 | 5.051 | 3.272 |
| T | BrainRetinex [4] | 11.063 | 0.475 | 0.327 | 12.506 | 0.390 | 0.374 | 4.350 | 3.555 | 3.707 | 3.044 | 4.031 | 3.114 |
| SL | RetinexNet [58] | 16.774 | 0.462 | 0.390 | 15.609 | 0.414 | 0.393 | 4.487 | 3.242 | 4.732 | 3.219 | 5.881 | 3.727 |
| SL | KinD++ [74] | 17.752 | 0.758 | 0.198 | 16.085 | 0.394 | 0.366 | 4.027 | 3.399 | 4.005 | 3.144 | 3.586 | 2.773 |
| SL | LCDPNet [52] | 14.506 | 0.575 | 0.312 | 15.689 | 0.474 | 0.344 | 4.110 | 3.250 | 4.106 | 3.127 | 5.039 | 3.347 |
| SL | URetinexNet [59] | 19.842 | 0.824 | 0.128 | 18.271 | 0.518 | 0.295 | 4.774 | 3.565 | 4.028 | 3.153 | 3.851 | 2.891 |
| SL | SMG [64] | 23.814 | 0.809 | 0.144 | 17.579 | 0.538 | 0.456 | 6.224 | 4.228 | 5.300 | 3.627 | 5.752 | 3.757 |
| SL | PyDiff [76] | 23.275 | 0.859 | 0.108 | 17.264 | 0.510 | 0.335 | 4.499 | 3.792 | 4.082 | 3.268 | 4.360 | 3.678 |
| SL | GSAD [20] | 22.021 | 0.848 | 0.137 | 17.414 | 0.507 | 0.294 | 4.496 | 3.593 | 4.489 | 3.361 | 5.252 | 3.657 |
| SSL | DRBN [68] | 16.677 | 0.730 | 0.252 | 16.734 | 0.507 | 0.376 | 4.369 | 3.800 | 3.921 | 3.267 | 3.671 | 3.117 |
| SSL | BLL [39] | 10.305 | 0.401 | 0.382 | 12.444 | 0.333 | 0.384 | 5.046 | 4.055 | 4.885 | 3.870 | 5.740 | 4.030 |
| UL | Zero-DCE [12] | 14.861 | 0.562 | 0.330 | 15.867 | 0.443 | 0.315 | 3.951 | 3.149 | 3.826 | 2.918 | 5.080 | 3.307 |
| UL | EnlightenGAN [24] | 17.606 | 0.653 | 0.319 | 17.106 | 0.463 | 0.322 | 3.832 | 3.256 | 3.775 | 2.953 | 3.689 | 2.749 |
| UL | RUAS [32] | 16.405 | 0.503 | 0.257 | 14.271 | 0.461 | 0.455 | 7.306 | 5.700 | 7.198 | 5.651 | 4.987 | 4.329 |
| UL | SCI [40] | 14.784 | 0.525 | 0.333 | 15.242 | 0.419 | 0.321 | 4.519 | 3.700 | 4.124 | 3.534 | 5.312 | 3.648 |
| UL | GDP [8] | 15.896 | 0.542 | 0.337 | 12.887 | 0.362 | 0.386 | 4.358 | 3.552 | 4.032 | 3.097 | 4.683 | 3.431 |
| UL | PairLIE [10] | 19.514 | 0.731 | 0.254 | 17.602 | 0.501 | 0.323 | 4.282 | 3.469 | 4.661 | 3.543 | 3.373 | 2.734 |
| UL | NeRCo [67] | 19.738 | 0.740 | 0.239 | 17.844 | 0.535 | 0.371 | 4.107 | 3.345 | 3.902 | 3.037 | 3.765 | 3.094 |
| UL | Ours | 20.453 | 0.803 | 0.192 | 18.555 | 0.539 | 0.311 | 3.724 | 3.144 | 3.618 | 2.879 | 2.941 | 2.558 |

Analysis of Quantitative Results:

  • Paired Datasets (LOL, LSRW):

    • LOL Dataset: LightenDiffusion (Ours) achieves the highest PSNR (20.453) and SSIM (0.803) among all unsupervised learning (UL) methods, and also has a very competitive LPIPS (0.192). While some supervised learning (SL) methods like SMG (PSNR 23.814, SSIM 0.809, LPIPS 0.144) and PyDiff (PSNR 23.275, SSIM 0.859, LPIPS 0.108) outperform LightenDiffusion on LOL, this is expected given they are specifically trained on this paired dataset. However, LightenDiffusion still surpasses several SL methods (e.g., RetinexNet, KinD++, LCDPNet) on LOL, showcasing its strong performance even without paired supervision.
    • LSRW Dataset: Here, LightenDiffusion truly shines. It achieves the best PSNR (18.555) and SSIM (0.539) across all categories of methods, including supervised ones. Its LPIPS (0.311) is also highly competitive, being clearly better than SMG, though slightly higher than GSAD and URetinexNet. This result strongly supports the paper's claim that LightenDiffusion generalizes better than supervised methods, since it performs best on a paired dataset that those methods were not trained on.
  • Unpaired Datasets (DICM, NPE, VV):

    • For these real-world benchmarks where no ground truth is available, NIQE and PI (lower is better for both) are used.

    • LightenDiffusion consistently outperforms all other unsupervised competitors on all three datasets: DICM (NIQE 3.724, PI 3.144), NPE (NIQE 3.618, PI 2.879), and VV (NIQE 2.941, PI 2.558).

    • Importantly, unsupervised methods (including LightenDiffusion) generally exhibit much better generalization (lower NIQE/PI) on these unseen datasets compared to supervised methods. For instance, SMG (a strong supervised method) performs poorly on these unpaired datasets (e.g., DICM NIQE 6.224, PI 4.228), highlighting the generalization gap of supervised approaches. LightenDiffusion's superior performance in these real-world settings is a critical validation of its design.

      In summary, the quantitative results demonstrate that LightenDiffusion sets a new state-of-the-art for unsupervised LLIE, showing strong performance comparable to or even surpassing supervised methods on unseen datasets, while maintaining excellent visual quality.

6.1.2. Qualitative Comparison

The visual comparisons further support the quantitative findings.

  • Paired Datasets (LOL, LSRW - Figure 5):

    Fig.5: Qualitative comparison of our method and competitive methods on the LOL [58] and LSRW [16] test sets. Best viewed by zooming in.

    • Figure 5 (top row, LOL): Other methods often show underexposure (GDP), color distortion (URetinexNet), or noise amplification (SMG). LightenDiffusion achieves proper brightness, vibrant colors, and sharp details without noticeable noise.
    • Figure 5 (bottom row, LSRW): Similar trends are observed. Some methods (SMG) overexpose certain areas or introduce color casts, while LightenDiffusion provides a balanced enhancement.
  • Unpaired Datasets (DICM, NPE, VV - Figure 6):

    Fig.6: Qualitative comparison of our method and competitive methods on the DICM [28], NPE [53], and VV [51] datasets. Best viewed by zooming in.

    • Figure 6 (row 1, DICM): Competing methods like NeRCo can still leave images slightly dark, or PairLIE might alter colors slightly. LightenDiffusion enhances visibility while preserving natural colors.

    • Figure 6 (row 2, NPE): This row highlights a common failure mode for other methods: artifacts around light sources (PairLIE, NeRCo) or overexposed regions. LightenDiffusion effectively handles challenging lighting conditions, presenting correct exposure and avoiding artifacts.

    • Figure 6 (row 3, VV): LightenDiffusion maintains vivid colors and proper contrast, outperforming methods that might produce duller or less natural-looking results.

      These visual results confirm that LightenDiffusion not only achieves superior quantitative scores but also produces visually pleasing, natural-looking enhanced images, with improved global and local contrast, sharper details, and effective noise suppression, particularly excelling in generalization to diverse real-world low-light scenes.

6.1.3. Low-Light Face Detection

The paper investigates the practical utility of LLIE methods by using them as a pre-processing step for low-light face detection on the DARK FACE dataset [69]. The RetinaFace [7] detector is used for evaluation with an IoU threshold of 0.3, and Average Precision (AP) is calculated.

The following figure (Figure 7 from the original paper) shows the comparison of low-light face detection results:

Fig. 7: Comparison of low-light face detection results on the DARK FACE dataset [69].

Analysis of Face Detection Results:

  • The precision-recall (P-R) curves in Figure 7 clearly show that enhancing low-light images before feeding them to a face detector significantly improves performance.
  • The RetinaFace detector on raw, unenhanced images achieves a low AP of 20.2%.
  • After enhancement by LightenDiffusion, the AP of RetinaFace improves significantly to 36.4%. This represents a substantial boost in detection performance.
  • Comparing LightenDiffusion to other LLIE methods, it demonstrates superior performance in the high recall area, indicating its ability to detect a larger proportion of faces while maintaining high precision.
  • This result underscores the potential practical values of LightenDiffusion as a robust pre-processing solution for various downstream vision tasks operating in challenging low-light environments.

6.2. Ablation Studies / Parameter Analysis

The following are the results from Table 2 of the original paper:

| Method | LOL [58] PSNR ↑ | LOL SSIM ↑ | LOL LPIPS ↓ | DICM [28] NIQE ↓ | DICM PI ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|
| 1) k = 0 (Image Space) | 17.054 | 0.715 | 0.372 | 4.519 | 4.377 | 4.733 |
| 2) k = 1 (Latent Space) | 19.228 | 0.728 | 0.355 | 4.101 | 3.457 | 0.872 |
| 3) k = 2 (Latent Space) | 20.097 | 0.798 | 0.210 | 4.021 | 3.402 | 0.411 |
| 4) k = 3 (Latent Space, Default) | 20.453 | 0.803 | 0.192 | 3.724 | 3.144 | 0.314 |
| 5) k = 4 (Latent Space) | 20.104 | 0.785 | 0.195 | 3.906 | 3.332 | 0.256 |
| 6) RetinexNet [58] | 16.616 | 0.563 | 0.579 | 5.859 | 6.056 | 0.296 |
| 7) URetinexNet [59] | 17.916 | 0.703 | 0.391 | 4.371 | 4.561 | 0.293 |
| 8) PairLIE [10] | 17.089 | 0.605 | 0.568 | 6.017 | 6.349 | 0.295 |
| 9) w/o Lscc (S = 20) | 19.184 | 0.785 | 0.213 | 4.045 | 3.408 | 0.314 |
| 10) w/o Lscc (S = 50) | 19.473 | 0.791 | 0.209 | 3.998 | 3.392 | 0.687 |
| 11) w/o Lscc (S = 100) | 20.255 | 0.801 | 0.209 | 3.831 | 3.228 | 1.208 |
| 12) Default (with Lscc, S = 20) | 20.453 | 0.803 | 0.192 | 3.724 | 3.144 | 0.314 |

6.2.1. Latent Space vs. Image Space Decomposition

This ablation study investigates the impact of performing Retinex decomposition at different scales of the latent space (controlled by the downsampling factor kk) versus the traditional image space (k=0k=0).

  • Quantitative Results (Table 2, rows 1-5):

    • Image Space (k=0k=0): Shows the worst performance across all metrics (e.g., LOL PSNR 17.054, SSIM 0.715, LPIPS 0.372; DICM NIQE 4.519, PI 4.377). This confirms the difficulty of achieving satisfactory decomposition in image space, often leading to content information leaking into the illumination map.
    • Latent Space (k=1,2,3k=1, 2, 3): As kk increases from 0 to 3, the performance consistently improves.
      • k=1k=1: Significantly better than k=0k=0 (LOL PSNR 19.228, SSIM 0.728).
      • k=2k=2: Further improvement (LOL PSNR 20.097, SSIM 0.798).
      • k=3k=3 (Default): Achieves the best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144), along with a good inference speed of 0.314s.
    • Latent Space (k=4k=4): Shows a slight degradation in performance compared to k=3k=3 (LOL PSNR 20.104 vs 20.453; SSIM 0.785 vs 0.803). This is attributed to the substantial reduction in feature information richness at very deep latent spaces (k=4k=4 implies heavy downsampling), which can adversely affect the generative ability of the diffusion model.
    • Inference Speed: Deeper latent spaces (kk increases) generally lead to faster inference times (e.g., 4.733s for k=0k=0 down to 0.256s for k=4k=4), as processing smaller feature maps is quicker.
  • Visual Results (Figure 8):

    Fig. 8: Visual results of the ablation study about our employed latent-Retinex decomposition strategy and the proposed content-transfer decomposition network. The first row shows the restored results with different settings, and the second row presents estimated illumination maps of low/normal-light images.

    • Figure 8a (Image Space, k=0k=0): The illumination map clearly retains content information (e.g., outlines of objects), leading to artifacts in the restored image.
    • Figure 8b-d (Latent Space, k=1,2,3k=1, 2, 3): As kk increases, the illumination maps become progressively smoother and more content-free, allowing the diffusion model to generate cleaner, more visually faithful restored images.
  • Conclusion: Performing decomposition in the latent space is crucial for effective separation of reflectance and illumination. A moderate downsampling factor (k=3k=3) strikes the best balance between feature richness and efficiency, leading to optimal performance.

6.2.2. Retinex Decomposition Network (CTDN)

This study validates the effectiveness of the proposed Content-Transfer Decomposition Network (CTDN) by comparing it against alternative decomposition networks from prior Retinex-based methods.

  • Quantitative Results (Table 2, rows 4, 6-8):
    • CTDN (Default k=3k=3, row 4): Achieves the best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144).
    • Replacing with RetinexNet [58] (row 6): Significant drop in performance (LOL PSNR 16.616, SSIM 0.563, LPIPS 0.579).
    • Replacing with URetinexNet [59] (row 7): Improved over RetinexNet, but still substantially worse than CTDN (LOL PSNR 17.916, SSIM 0.703, LPIPS 0.391).
    • Replacing with PairLIE [10] (row 8): Similar to RetinexNet (LOL PSNR 17.089, SSIM 0.605, LPIPS 0.568).
  • Visual Results (Figure 8e-g):
    • Decomposition networks from RetinexNet, URetinexNet, and PairLIE fail to produce truly content-free illumination maps. Their illumination maps still show discernible object outlines and textures.
    • This content leakage directly translates to blurry details and artifacts in the restored images.
  • Conclusion: The CTDN's specialized design, which includes initial estimation, convolutional embeddings, cross-attention, and self-attention for explicit content transfer, is crucial for generating clean reflectance and illumination maps in the latent space. This robust decomposition is a key factor behind LightenDiffusion's superior performance.

6.2.3. Loss Function (Lscc\mathcal{L}_{scc})

This study evaluates the contribution of the self-constrained consistency loss (\mathcal{L}_{scc}) to the overall performance.

  • Quantitative Results (Table 2, rows 9-12):

    • Default (with Lscc\mathcal{L}_{scc}, S=20S=20, row 12): Achieves the best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144) with an inference time of 0.314s.
    • Without Lscc\mathcal{L}_{scc} (S=20S=20, row 9): Performance drops significantly (LOL PSNR 19.184, SSIM 0.785, LPIPS 0.213). This confirms that Lscc\mathcal{L}_{scc} is essential for improving overall visual quality.
    • Without Lscc\mathcal{L}_{scc} with increased sampling steps (S=50,100S=50, 100, rows 10-11):
      • Increasing SS to 50 improves performance (PSNR 19.473, SSIM 0.791) but increases inference time to 0.687s.
      • Increasing SS to 100 brings performance closer to the default model (PSNR 20.255, SSIM 0.801, LPIPS 0.209) but drastically increases inference time to 1.208s (almost 4 times slower than default).
  • Visual Results (Figure 9):

    Fig. 9: Visual results of the ablation study about our proposed Lscc\mathcal { L } _ { s c c }

    • Figure 9a (w/o Lscc\mathcal{L}_{scc}, S=20S=20): Shows slightly less vibrant colors and possibly some subtle artifacts compared to the default.
    • Figure 9b-c (w/o Lscc\mathcal{L}_{scc}, S=50,100S=50, 100): As SS increases, the visual quality improves, becoming more comparable to the default.
    • Figure 9d (Default, with Lscc\mathcal{L}_{scc}, S=20S=20): Produces high-quality, artifact-free results.
  • Conclusion: The self-constrained consistency loss is highly effective. It allows the diffusion model to achieve efficient and robust restoration with fewer sampling steps (S=20S=20), significantly reducing inference time compared to models without Lscc\mathcal{L}_{scc} that require many more steps to reach comparable quality. This highlights Lscc\mathcal{L}_{scc}'s role in guiding the diffusion model towards accurate content reconstruction and improving overall efficiency.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully presents LightenDiffusion, an innovative unsupervised low-light image enhancement (LLIE) framework that cleverly integrates the physically intuitive Retinex theory with the powerful generative capabilities of diffusion models. The core innovations include:

  1. A Content-Transfer Decomposition Network (CTDN) that performs Retinex decomposition within the latent space, effectively disentangling content-rich reflectance maps and content-free illumination maps from unpaired low-light and normal-light images.

  2. A Latent-Retinex Diffusion Model (LRDM) that utilizes these decomposed components (low-light reflectance and normal-light illumination) as input for a diffusion model, guided by low-light features, to perform robust enhancement.

  3. A novel self-constrained consistency loss that further refines the restoration by ensuring the enhanced output maintains faithful intrinsic content to the original low-light input, effectively mitigating artifacts and improving visual fidelity.

    Extensive experiments on various paired and unpaired real-world benchmarks convincingly demonstrate that LightenDiffusion not only outperforms all state-of-the-art unsupervised competitors but also achieves performance comparable to, and in some cases even superior to, supervised methods, especially in terms of generalization to diverse and unseen scenes. Its practical value is further highlighted by significant improvements in low-light face detection as a pre-processing step.

7.2. Limitations & Future Work

The paper does not explicitly detail a "Limitations" section, but some can be inferred:

  • Computational Cost: While LightenDiffusion is efficient at inference with S=20S=20 due to the Lscc\mathcal{L}_{scc}, diffusion models generally have higher computational costs during training and potentially longer inference times compared to single-pass feed-forward networks (even with optimized sampling). The training process involves two stages and a large unpaired dataset, which can be resource-intensive.

  • Hyperparameter Sensitivity: The framework relies on several hyperparameters (λ1,λ2,λ3,λg,γ,k\lambda_1, \lambda_2, \lambda_3, \lambda_g, \gamma, k). Optimal performance might be sensitive to these values, and finding them empirically can be time-consuming.

  • Unsupervised Nature: While a strength for generalization, the lack of explicit paired ground truth in the diffusion training stage might still leave some subtle artifacts or color shifts that would be caught by supervised signals, although Lscc\mathcal{L}_{scc} helps to mitigate this. For instance, on LSRW, while PSNR and SSIM are best, LPIPS (perceptual quality) is slightly inferior to some supervised methods like GSAD and URetinexNet, suggesting there might be room for perceptual improvement.

    Based on these observations, potential future research directions could include:

  • Real-time Applications: Further optimizing the diffusion model's sampling process or integrating with faster denoising diffusion implicit models (DDIMs) to achieve near real-time performance for video LLIE.

  • Adaptive Hyperparameter Tuning: Developing methods for dynamically adjusting hyperparameters based on input image characteristics or scene context, rather than relying on fixed empirical values.

  • Integration with Other Perceptual Losses: Exploring additional perceptual losses or adversarial training specifically tailored for diffusion models to further bridge the gap with supervised methods on perceptual metrics, perhaps in a semi-unsupervised setting.

  • Extension to Other Degradations: Adapting the latent-Retinex decomposition and diffusion framework to handle other complex image degradations beyond low-light, such as haze, rain, or blur, in an unsupervised manner.

  • Quantifying "Content-Free" Illumination: Developing more rigorous quantitative metrics to assess how truly content-free the illumination maps are, which could further refine the CTDN.

7.3. Personal Insights & Critique

LightenDiffusion presents several compelling insights and innovations:

  • Innovation of Latent-Space Decomposition: The shift from image-space Retinex decomposition to latent-space decomposition is a powerful idea. In latent space, features are often more semantically meaningful and disentangled, making it easier for the network to separate intrinsic content from lighting conditions. This is a significant improvement over traditional Retinex implementations and deep learning models that often struggle with content leakage. The visual evidence in Figure 3 and 8 strongly supports this.
  • Synergy of Retinex and Diffusion: The combination of Retinex theory (physical interpretability) and diffusion models (powerful generative ability) is highly effective. Retinex provides a structured way to think about LLIE by separating components, while diffusion models offer the robustness to reconstruct high-quality images even from imperfect inputs and compensate for information loss. This hybrid approach leverages the best of both worlds.
  • Effectiveness of Self-Constrained Consistency Loss: The Lscc\mathcal{L}_{scc} is a brilliant practical addition. It addresses the common pitfall of Retinex-based methods where imperfect illumination maps can introduce unwanted content or artifacts. By providing a self-generated pseudo label to guide consistency, the model learns to maintain fidelity to the original image's content without needing actual ground truth, which is fundamental for unsupervised learning. The ablation study clearly shows its value in improving quality and speeding up inference.
  • Strong Generalization for Real-World Problems: The paper effectively addresses the generalization challenge, which is a major hurdle for many LLIE methods. By training on unpaired data and leveraging the proposed architectural and loss designs, LightenDiffusion proves its robustness on diverse real-world datasets, making it highly applicable to practical scenarios where paired data is simply not available. This is crucial for deployment in areas like surveillance, autonomous driving, or mobile photography.

Critique:

  • Unclear "Content-Free" Definition: While the paper aims for "content-free" illumination maps, the precise mathematical or perceptual definition of "content" in latent space is implicitly learned. It would be valuable to explore more explicit constraints or metrics to quantify this disentanglement.

  • Complexity of Multi-Stage Training: While effective, the two-stage training process adds complexity. Future work might explore end-to-end training strategies for LightenDiffusion to simplify the pipeline, though this could introduce new training stability challenges.

  • Reference for γ\gamma: The illumination correction factor γ\gamma for the pseudo label F¨low=RlowLlowγ\ddot{\mathcal{F}}_{low} = \mathbf{R}_{low} \odot \mathbf{L}_{low}^\gamma is empirically set to 0.2. While effective, the paper could elaborate on the sensitivity to this parameter or discuss methods for adaptively determining it.

    Overall, LightenDiffusion represents a significant step forward in unsupervised LLIE, demonstrating how principled model design combined with advanced generative techniques can yield highly performant and generalizable solutions for challenging real-world problems. Its methodology could potentially inspire similar hybrid approaches in other image restoration tasks.
