LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models
TL;DR Summary
LightenDiffusion introduces an unsupervised framework for low-light image enhancement that integrates Retinex theory with diffusion models. It uses a content-transfer decomposition network to perform Retinex decomposition in the latent space instead of the image space, significantly improving restoration performance.
Abstract
In this paper, we propose a diffusion-based unsupervised framework that incorporates physically explainable Retinex theory with diffusion models for low-light image enhancement, named LightenDiffusion. Specifically, we present a content-transfer decomposition network that performs Retinex decomposition within the latent space instead of image space as in previous approaches, enabling the encoded features of unpaired low-light and normal-light images to be decomposed into content-rich reflectance maps and content-free illumination maps. Subsequently, the reflectance map of the low-light image and the illumination map of the normal-light image are taken as input to the diffusion model for unsupervised restoration with the guidance of the low-light feature, where a self-constrained consistency loss is further proposed to eliminate the interference of normal-light content on the restored results to improve overall visual quality. Extensive experiments on publicly available real-world benchmarks show that the proposed LightenDiffusion outperforms state-of-the-art unsupervised competitors and is comparable to supervised methods while being more generalizable to various scenes. Our code is available at https://github.com/JianghaiSCU/LightenDiffusion.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "LightenDiffusion: Unsupervised Low-Light Image Enhancement with Latent-Retinex Diffusion Models."
1.2. Authors
The authors are:
- Hai Jiang
- Ao Luo
- Xiaohong Liu
- Songchen Han
- Shuaicheng Liu
Their affiliations are:
- Sichuan University
- Southwest Jiaotong University
- University of Electronic Science and Technology of China
- Shanghai Jiao Tong University
- Megvii Technology
Shuaicheng Liu is indicated as the corresponding author.
1.3. Journal/Conference
The paper is published at arXiv, a preprint server. This indicates it is currently a preprint and has not yet undergone formal peer review and publication in a specific journal or conference proceeding. arXiv is widely used in academic fields, especially in computer science and physics, for rapid dissemination of research.
1.4. Publication Year
The paper was posted to arXiv on 2024-07-12 (UTC), i.e., in 2024.
1.5. Abstract
The paper proposes LightenDiffusion, an unsupervised framework for low-light image enhancement (LLIE). This framework integrates the physically explainable Retinex theory with diffusion models. A novel content-transfer decomposition network (CTDN) performs Retinex decomposition in the latent space (instead of the typical image space). This enables the decomposition of encoded features from unpaired low-light and normal-light images into content-rich reflectance maps and content-free illumination maps. Subsequently, the reflectance map of the low-light image and the illumination map of the normal-light image are fed into a diffusion model for unsupervised restoration, guided by the low-light feature. To further improve visual quality, a self-constrained consistency loss is introduced to prevent interference from normal-light content in the restored results. Experimental evaluations on real-world benchmarks demonstrate that LightenDiffusion surpasses state-of-the-art unsupervised methods and achieves performance comparable to supervised methods, while exhibiting superior generalizability across diverse scenes. The code for LightenDiffusion is publicly available.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2407.08939v1
- PDF Link: https://arxiv.org/pdf/2407.08939v1.pdf
- Publication Status: The paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is low-light image enhancement (LLIE). Images captured under poor lighting conditions suffer from significant degradations, including poor visibility, reduced contrast, and amplified noise. This problem is crucial because such degraded images negatively impact the performance of various downstream computer vision tasks, such as object detection, segmentation, and surveillance.
The challenges and gaps in prior research are multifaceted:
- Ill-posed problem: LLIE is inherently ill-posed, meaning multiple high-quality images could correspond to a single low-light input. Traditional methods, relying on hand-crafted priors like histogram equalization (HE) or Retinex theory, struggle to adapt to diverse illumination conditions and often produce artifacts or only limited improvements.
- Overfitting and poor generalization in learning-based methods: While deep learning-based approaches have shown promise, many supervised methods require large-scale paired datasets (low-light and corresponding normal-light images) for training. Such paired data are difficult and costly to collect in real-world scenarios. This reliance often leads to models that overfit to specific training conditions and perform poorly when applied to unseen, real-world low-light scenes, exhibiting issues like incorrect exposure, color distortion, blurred details, or noise amplification.
- Limitations of zero-shot diffusion models: Recent generative models, particularly diffusion models, have gained attention for their ability to generate high-quality images. Some zero-shot approaches leverage pre-trained diffusion models for image restoration without training from scratch. However, these methods are limited by the known degradation modes (types of image corruption) embedded in the pre-trained models and tend to perform unsatisfactorily in real-world scenarios where degradations are diverse and unknown.

The paper's entry point and innovative idea revolve around developing an unsupervised, learning-based framework that overcomes the paired-data dependency and generalization issues. It proposes to incorporate the physically explainable Retinex theory with the powerful generative capabilities of diffusion models to learn degradation modes directly from extensive unpaired real-world data. A key innovation is performing Retinex decomposition within the latent space rather than the traditional image space, allowing for more robust separation of content and illumination information.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Novel Diffusion-Based Unsupervised Framework (LightenDiffusion): Proposing a new framework that synergistically combines Retinex theory and diffusion models for unsupervised low-light image enhancement. This addresses the critical issue of paired-data scarcity and improves generalization.
- Content-Transfer Decomposition Network (CTDN): Introducing a specialized network that performs Retinex decomposition in the latent space. This allows for the generation of content-rich reflectance maps (containing intrinsic image details) and content-free illumination maps (representing only lighting conditions) from unpaired low-light and normal-light images. This latent-space decomposition is shown to be more effective than traditional image-space methods in separating these components.
- Self-Constrained Consistency Loss ($\mathcal{L}_{scc}$): Proposing a novel loss function to improve visual quality by eliminating interference from normal-light content. This loss ensures that the restored feature shares the same intrinsic content information as the input low-light image, mitigating potential artifacts arising from imperfect illumination map estimation.
- Extensive Experimental Validation: Demonstrating through comprehensive experiments on publicly available real-world benchmarks that LightenDiffusion significantly outperforms state-of-the-art unsupervised competitors. Furthermore, it achieves comparable performance to supervised methods while exhibiting superior generalization abilities to various unseen scenes.
- Practical Value in Downstream Tasks: Showing that LightenDiffusion can effectively serve as a pre-processing step for downstream vision tasks, such as low-light face detection, significantly improving the precision of detectors like RetinaFace in challenging conditions.

The key conclusions and findings are that LightenDiffusion effectively resolves the trade-off between enhancement quality and generalization ability in LLIE. By learning from unpaired data and leveraging the strengths of latent-space Retinex decomposition and diffusion models, it produces visually pleasing and artifact-free enhanced images that are robust across diverse real-world low-light conditions. This approach provides a practical solution for real-world applications where paired data is unavailable.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the LightenDiffusion paper, a beginner should be familiar with several core concepts in computer vision and deep learning:
3.1.1. Low-Light Image Enhancement (LLIE)
Low-Light Image Enhancement (LLIE) is the task of improving the visual quality of images captured under insufficient lighting conditions. These images typically suffer from low brightness, poor contrast, color cast, and amplified noise. The goal of LLIE is to transform these degraded images into visually pleasant, high-quality images that resemble those taken under normal lighting, making details more discernible and improving their utility for human perception and computer vision systems.
3.1.2. Retinex Theory
The Retinex theory is a perceptual model of color vision developed by Edwin Land, which explains how humans perceive color consistently despite varying illumination. In the context of image processing, it assumes that an observed image ($I$) can be decomposed into two components:
- Reflectance map ($R$): This represents the intrinsic color and texture properties of objects in the scene, which should be invariant to changes in illumination. It's the "true color" or "content" of the image.
- Illumination map ($L$): This represents the lighting conditions of the scene, describing the amount of light falling on objects. It typically varies smoothly across the image.
The mathematical formulation of Retinex theory is often given as a multiplicative relationship: $ I = R \odot L $, where $\odot$ denotes the Hadamard (element-wise) product. The goal of Retinex-based LLIE is to estimate these two components from a low-light image, enhance the illumination map (e.g., by increasing its dynamic range), and then recombine it with the original reflectance map to produce an enhanced image.
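To make this concrete, here is a minimal NumPy sketch of the classical Retinex-style manipulation just described; the max-over-channels illumination estimate and the gamma value are illustrative choices, not the paper's trained procedure.

```python
import numpy as np

def retinex_enhance(img, eps=1e-4, gamma=0.4):
    """Toy Retinex-style enhancement: I = R * L, brighten L, recombine.

    img: float array in [0, 1] with shape (H, W, 3).
    The illumination estimate and gamma value are illustrative only.
    """
    # Rough illumination estimate: per-pixel maximum over color channels.
    L = img.max(axis=2, keepdims=True)
    # Reflectance via the multiplicative Retinex model I = R ⊙ L.
    R = img / (L + eps)
    # Brighten the illumination map (gamma < 1 lifts dark regions).
    L_enhanced = np.power(L, gamma)
    # Recombine and clip back to the valid range.
    return np.clip(R * L_enhanced, 0.0, 1.0)

low_light = np.random.rand(64, 64, 3) * 0.2   # stand-in for a dark image
enhanced = retinex_enhance(low_light)
```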
3.1.3. Diffusion Models (DMs)
Diffusion Models (DMs) are a class of generative models that have shown remarkable success in generating high-quality and diverse images. They operate through two main processes:
- Forward Diffusion (Noising Process): This process gradually adds Gaussian noise to an image over several time steps, progressively transforming a clean image ($x_0$) into pure Gaussian noise ($x_T$). This process is fixed and can be described mathematically.
- Reverse Denoising (Generation Process): This is the learned part of the model. It starts with random noise ($x_T$) and attempts to gradually reverse the noising process, step by step, to reconstruct a clean image ($x_0$). A neural network (often a U-Net) is trained to predict the noise added at each step, allowing the model to iteratively remove noise and generate realistic images. Conditional Diffusion Models extend this by incorporating additional information (e.g., text descriptions, class labels, or in this paper, low-light features) to guide the generation process, allowing for controlled image synthesis or restoration.
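As a small illustration of the fixed forward process, the following PyTorch sketch implements the closed-form noising step; the linear beta schedule is an illustrative assumption, and the learned reverse process (the trained noise-prediction network) is not shown.

```python
import torch

T = 1000
betas = torch.linspace(1e-4, 2e-2, T)           # variance schedule β_t (illustrative)
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)        # ᾱ_t = Π α_i

def q_sample(x0, t, noise=None):
    """Forward diffusion in closed form: x_t = √ᾱ_t · x_0 + √(1-ᾱ_t) · ε."""
    noise = torch.randn_like(x0) if noise is None else noise
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    return a * x0 + b * noise

x0 = torch.rand(1, 3, 64, 64)                   # a toy "clean" image
x_late = q_sample(x0, torch.tensor([900]))      # late step: almost pure noise
x_early = q_sample(x0, torch.tensor([10]))      # early step: mildly noisy
```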
3.1.4. Unsupervised Learning
Unsupervised learning is a type of machine learning where the model learns patterns and structures from input data without any explicit labels or paired outputs. In the context of LLIE, this means the model can learn to enhance low-light images using datasets that contain only low-light images, or a collection of unpaired low-light and normal-light images, without requiring perfectly matched pairs. This is highly advantageous for real-world applications where obtaining paired data is difficult or impossible.
3.1.5. Latent Space
In deep learning, latent space (also known as feature space or embedding space) refers to a lower-dimensional representation of data that captures its essential characteristics. An encoder network maps high-dimensional input data (like an image) to a more compact latent representation, which is a vector or a feature map. This latent space often disentangles various attributes of the data, making it easier for subsequent processes (like a decoder or diffusion model) to manipulate or generate new data. Operations in latent space can be more robust and semantically meaningful than operations directly in the pixel space.
3.1.6. Convolutional Neural Networks (CNNs) and Encoder-Decoder Architecture
Convolutional Neural Networks (CNNs) are a class of deep neural networks primarily used for analyzing visual imagery. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features. An encoder-decoder architecture is a common structure in CNNs for image-to-image translation tasks (like LLIE).
- Encoder: Consists of several convolutional and pooling layers that progressively downsample the input image, extracting increasingly abstract and compressed features into a latent space.
- Decoder: Consists of up-sampling and convolutional layers that take the latent representation from the encoder and gradually reconstruct the output image at the original resolution.
3.1.7. Attention Mechanisms
Attention mechanisms in neural networks allow the model to focus on specific parts of the input data that are most relevant for a given task, rather than processing all parts equally.
- Self-Attention (SA): Enables a model to weigh the importance of different parts of a single input sequence or image feature. For image processing, it helps capture long-range dependencies between different spatial locations.
- Cross-Attention (CA): Allows a model to attend to information from a different input sequence or feature map. In LightenDiffusion, cross-attention is used to let the illumination-map features reinforce the content information in the reflectance-map features and thereby refine the decomposition (see Sec. 4.2.2).
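The sketch below shows, in PyTorch, how self- and cross-attention differ only in where the queries and the keys/values come from; the projection layers and feature shapes are illustrative stand-ins, not the paper's exact CA/SA modules.

```python
import torch
import torch.nn as nn

class SimpleAttention(nn.Module):
    """Scaled dot-product attention over flattened spatial features.

    Self-attention: call with the same tensor for `query` and `context`.
    Cross-attention: queries from one branch, keys/values from the other.
    """
    def __init__(self, channels, dim=64):
        super().__init__()
        self.q = nn.Linear(channels, dim)
        self.k = nn.Linear(channels, dim)
        self.v = nn.Linear(channels, dim)
        self.scale = dim ** -0.5

    def forward(self, query, context):
        q, k, v = self.q(query), self.k(context), self.v(context)
        attn = torch.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v

attn = SimpleAttention(channels=32)
feats_r = torch.randn(1, 16 * 16, 32)   # e.g., flattened reflectance features
feats_l = torch.randn(1, 16 * 16, 32)   # e.g., flattened illumination features
sa_out = attn(feats_l, feats_l)         # self-attention within one branch
ca_out = attn(feats_r, feats_l)         # cross-attention: R queries L
```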
3.2. Previous Works
The paper discusses various categories of prior work in Low-Light Image Enhancement (LLIE):
3.2.1. Traditional Methods
These methods rely on pre-defined mathematical models or hand-crafted priors.
- Histogram Equalization (HE)-based methods [2, 42, 44]: Aim to improve contrast by re-distributing pixel intensities to span the full dynamic range. While simple, they can sometimes lead to over-enhancement or noise amplification.
- Retinex-based methods [9, 14, 4]: Decompose an image into reflectance and illumination components. Enhancement is achieved by manipulating the illumination map (e.g., increasing its brightness or contrast) and then recombining it with the reflectance map. Examples include LIME [14] and BrainRetinex [4].
- Limitations: Difficult to generalize to diverse, real-world low-light conditions due to the inherent ill-posed nature of the problem and the reliance on fixed priors.
3.2.2. Learning-Based Methods
With the advent of deep learning, LLIE methods shifted towards learning complex mappings from low-light to normal-light images.
- Supervised Methods [13, 23, 34, 54, 60, 63, 64, 72]: These models are trained on large-scale paired datasets (low-light input and corresponding normal-light ground truth). They leverage powerful network architectures (e.g., CNNs) to directly learn the enhancement function. Examples include SMG [64].
  - Retinex-based Deep Networks [5, 15, 58, 59, 74]: Combine the principles of Retinex theory with deep learning. They often use neural networks to learn the decomposition and adjustment steps. Examples include RetinexNet [58] and URetinexNet [59].
  - Limitations: Heavy reliance on paired datasets, which are hard to collect and often lead to models with poor generalization when applied to real-world images outside the training distribution.
- Unsupervised Methods [10, 12, 24, 32, 40, 67]: Address the data-scarcity issue by learning from unpaired data or without explicit labels. They often employ techniques like adversarial learning (EnlightenGAN [24]), curve estimation (Zero-DCE [12]), or neural architecture search (RUAS [32]). PairLIE [10] and NeRCo [67] are also unsupervised methods.
  - Limitations: While better in generalization, they can sometimes struggle with visual fidelity or introduce artifacts compared to supervised methods.
- Semi-supervised Methods [29, 68]: Attempt to combine the benefits of both supervised and unsupervised learning, using a mix of paired and unpaired data to achieve stable training and better generalization. DRBN [68] and BLL [39] are examples.
3.2.3. Diffusion-Based Image Restoration
Diffusion Models (DMs) have recently been applied to various image restoration tasks.
- Conditional DMs [6, 20, 22, 43, 47, 48, 71, 76]: Most DM-based methods train a model from scratch using paired data, where the degraded image (e.g., a low-light image) serves as a condition or guidance during the denoising process. Examples include PyDiff [76] and GSAD [20] for LLIE.
  - Limitations: Still suffer from the paired-data requirement, limiting their real-world applicability and generalization.
- Zero-Shot DMs [8, 25, 35, 55, 78]: Utilize pre-trained diffusion models (often trained on large, diverse datasets) to restore degraded images without specific training for the degradation. They leverage the general priors learned by the pre-trained model. GDP [8] is an example for LLIE.
  - Limitations: Performance is constrained by the known degradation modes the pre-trained model implicitly learned. They may struggle with diverse, unknown degradations present in real-world LLIE scenarios, often leading to under-enhancement or unsatisfactory visual quality.
3.3. Technological Evolution
The field of LLIE has evolved significantly:
- Early Traditional Methods (1970s-2010s): Focused on hand-crafted mathematical models like Histogram Equalization (early 2000s) and Retinex theory (developed in the 1970s, applied to images in the 1980s-2000s). These were often simple and fast but lacked adaptability.
- Deep Learning Era (2015-present):
  - Supervised CNNs (2015-2018): Initial deep learning approaches directly mapped low-light to normal-light images using CNNs, requiring large paired datasets. RetinexNet [58] was a key development applying deep learning to Retinex theory.
  - Unsupervised and Semi-Supervised Methods (2018-present): To overcome paired-data limitations, techniques like GANs (EnlightenGAN [24]) and curve estimation (Zero-DCE [12]) emerged, focusing on unpaired data or intrinsic image properties.
  - Generative Models (2020-present): The rise of Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and more recently, Diffusion Models (DMs), brought new paradigms for generating high-quality images. DMs (2020 onwards) have gained prominence due to their impressive generative power and training stability, and have been applied to various image restoration tasks including LLIE (PyDiff [76], GSAD [20]).

LightenDiffusion fits into this timeline by building upon the success of diffusion models and Retinex theory, while innovatively addressing the unsupervised learning challenge and the generalization gap that previous DM-based methods still faced due to their reliance on paired data or limited zero-shot capabilities.
3.4. Differentiation Analysis
Compared to the main methods in related work, LightenDiffusion introduces several core differences and innovations:
- Unsupervised Learning with Diffusion Models: Unlike most diffusion-based LLIE methods (PyDiff, GSAD) that rely on paired data and supervised training, LightenDiffusion operates in an unsupervised manner. This is a critical advantage, making it highly applicable to real-world scenarios where paired low-light and normal-light images are scarce.
- Latent-Space Retinex Decomposition: Previous Retinex-based deep learning methods (RetinexNet, KinD++, URetinexNet, PairLIE) typically perform decomposition in the image space. LightenDiffusion innovates by performing this decomposition in the latent space through its Content-Transfer Decomposition Network (CTDN). This allows for a more effective disentanglement of content-rich reflectance maps and content-free illumination maps, reducing information leakage between components and producing cleaner results (as illustrated in Fig. 3).
- Integration of Retinex with Diffusion Models: While both Retinex theory and diffusion models have been used for LLIE independently, LightenDiffusion provides a principled integration where the decomposed reflectance and illumination components (specifically, $R_{low}$ and $L_{high}$) are explicitly used as input to guide the diffusion model's restoration process. This combination leverages the physical interpretability of Retinex with the powerful generative capabilities of DMs.
- Self-Constrained Consistency Loss ($\mathcal{L}_{scc}$): This novel loss function specifically targets a weakness of Retinex-based approaches: the potential for residual content information in the estimated illumination map to introduce artifacts. By adding this consistency loss, LightenDiffusion explicitly guides the diffusion model to reconstruct results with intrinsic content information consistent with the low-light input, thus improving visual fidelity and robustness without requiring ground-truth supervision.
- Improved Generalization: By training on extensive unpaired real-world data and employing the latent-space decomposition, LightenDiffusion demonstrates superior generalization ability compared to both supervised methods (which overfit to training distributions) and zero-shot diffusion methods (which are limited by known degradation modes), as shown by its performance on unseen real-world datasets.
4. Methodology
4.1. Principles
The core idea behind LightenDiffusion is to combine the strengths of Retinex theory and diffusion models within an unsupervised learning framework for low-light image enhancement (LLIE). The fundamental principle is that an image can be intrinsically separated into its reflectance (content) and illumination (lighting) components. By performing this decomposition in a more robust latent space and then using a diffusion model to transfer desirable illumination from a normal-light image to the low-light image's content, the method aims to enhance low-light images without requiring paired training data. The diffusion model then implicitly learns to compensate for any information loss during decomposition and further refines the enhancement. A self-constrained consistency loss ensures that the content of the enhanced image remains faithful to the original low-light input.
4.2. Core Methodology In-depth (Layer by Layer)
The overall pipeline of LightenDiffusion is illustrated in Figure 2.
4.2.1. Overall Pipeline
The process begins with an unpaired low-light image ($I_{low}$) and a normal-light image ($I_{high}$).
-
Latent Feature Extraction: An encoder $\mathcal{E}(\cdot)$ first transforms both $I_{low}$ and $I_{high}$ from the image space into a lower-dimensional latent space. This yields encoded features $F_{low}$ and $F_{high}$, respectively. The encoder consists of $s$ cascaded residual blocks, with each block downsampling the input by a scale of 2 using a max-pooling layer. So, if $I \in \mathbb{R}^{H \times W \times 3}$, then $F \in \mathbb{R}^{\frac{H}{2^s} \times \frac{W}{2^s} \times C}$.

graph TD A[Low-Light Image I_low] --> E1; B[Normal-Light Image I_high] --> E2; E1[Encoder ε(·)] --> F_low[Latent Feature F_low]; E2[Encoder ε(·)] --> F_high[Latent Feature F_high];

- $I_{low}, I_{high}$: Input images (low-light and normal-light, respectively).
- $\mathcal{E}(\cdot)$: Encoder network that transforms image-space input into latent-space features.
- $F_{low}, F_{high}$: Encoded latent features for the low-light and normal-light images.
- H, W: Height and width of the input images.
- $C$: Number of channels in the latent features.
- $s$: Number of downsampling steps in the encoder.
-
Latent-Space Retinex Decomposition: The encoded features $F_{low}$ and $F_{high}$ are then fed into the proposed Content-Transfer Decomposition Network (CTDN). The CTDN performs Retinex decomposition within the latent space to separate each feature into two components:
- Content-rich reflectance map ($R_{low}$, $R_{high}$): Captures the intrinsic content and texture details.
- Content-free illumination map ($L_{low}$, $L_{high}$): Represents only the lighting conditions.

graph TD F_low --> CTDN; F_high --> CTDN; CTDN --> R_low[Reflectance Map R_low]; CTDN --> L_low[Illumination Map L_low]; CTDN --> R_high[Reflectance Map R_high]; CTDN --> L_high[Illumination Map L_high];

- $R_{low}, R_{high}$: Reflectance maps extracted from $F_{low}$ and $F_{high}$.
- $L_{low}, L_{high}$: Illumination maps extracted from $F_{low}$ and $F_{high}$.
-
Diffusion-Based Restoration: The reflectance map of the low-light image ($R_{low}$) and the illumination map of the normal-light image ($L_{high}$) are combined to form an initial input for the diffusion model. This combined feature, $\mathbf{x}_0 = R_{low} \odot L_{high}$, conceptually represents the low-light content with normal-light illumination. This then undergoes a forward diffusion process. Subsequently, a reverse denoising process is performed, guided by the original low-light feature $F_{low}$ (denoted as $\tilde{x}$), to gradually transform random Gaussian noise into a restored feature.

graph TD R_low --> H[Hadamard Product]; L_high --> H; H --> X0[x0 = R_low ⊙ L_high]; X0 --> DM[Diffusion Model]; F_low --> DM; DM --> F_hat_low[Restored Feature F_hat_low];

- $\hat{F}_{low}$: The restored latent feature for the low-light image.
-
Final Image Reconstruction: Finally, the restored feature $\hat{F}_{low}$ is passed through a decoder to reconstruct the final enhanced low-light image in the image space.

graph TD F_hat_low --> D[Decoder D(·)]; D --> I_hat_low[Restored Image I_hat_low];

- $\mathcal{D}(\cdot)$: Decoder network that transforms latent-space features back to the image space.
- $\hat{I}_{low}$: The final enhanced low-light image.
The overall pipeline is depicted in Figure 2 (a minimal code sketch of this flow follows the figure caption):
Fig. 2: The overall pipeline of our proposed framework. We first employ an encoder to convert the unpaired low-light image $I_{low}$ and normal-light image $I_{high}$ into the latent space, denoted as $F_{low}$ and $F_{high}$. The encoded features are sent to the proposed content-transfer decomposition network (CTDN) to generate content-rich reflectance maps denoted as $R_{low}$ and $R_{high}$ and content-free illumination maps as $L_{low}$ and $L_{high}$. Then, the reflectance map of the low-light image $R_{low}$ and the illumination map of the normal-light image $L_{high}$ are taken as the input of the diffusion model to perform the forward diffusion process. Finally, we perform the reverse denoising process to gradually transform the randomly sampled Gaussian noise into the restored feature $\hat{F}_{low}$ with the guidance of the low-light feature denoted as $\tilde{x} = F_{low}$, and subsequently send it to a decoder to produce the final result $\hat{I}_{low}$.
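As a reading aid, here is a hypothetical Python sketch of the inference flow in Fig. 2; `encoder`, `ctdn`, `diffusion_restore`, and `decoder` are placeholder callables standing in for the paper's trained networks, not the authors' implementation.

```python
def lighten_diffusion_inference(i_low, i_high, encoder, ctdn, diffusion_restore, decoder):
    """Sketch of the inference flow in Fig. 2; all callables are placeholders.

    encoder:           image -> latent feature           (ε(·) in the paper)
    ctdn:              latent feature -> (R, L)          (content-transfer decomposition)
    diffusion_restore: (x0, guidance) -> latent feature  (reverse denoising in latent space)
    decoder:           latent feature -> image           (D(·) in the paper)
    """
    f_low, f_high = encoder(i_low), encoder(i_high)      # latent features
    r_low, l_low = ctdn(f_low)                           # low-light decomposition
    r_high, l_high = ctdn(f_high)                        # normal-light decomposition
    x0 = r_low * l_high                                  # content of I_low + lighting of I_high
    f_restored = diffusion_restore(x0, guidance=f_low)   # LRDM reverse process
    return decoder(f_restored)                           # enhanced image Î_low
```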
4.2.2. Content-Transfer Decomposition Network (CTDN)
The Retinex theory fundamentally assumes that an input $F$ can be decomposed into a reflectance map $R$ and an illumination map $L$: $ F = R \odot L $, where
- $F$: The input image (or latent feature in this case).
- $R$: The reflectance map, representing the inherent content and texture of the scene. It should be consistent across different lighting conditions.
- $L$: The illumination map, representing the lighting conditions (brightness and contrast). It should be locally smooth and free of content details.
- $\odot$: The Hadamard (element-wise) product operation.
Previous methods typically perform this decomposition in the image space. However, this often results in incomplete separation, where content information might still be partially retained in the illumination map (as shown in Figure 3a). This leakage can lead to artifacts in enhanced images.
To address this, LightenDiffusion introduces the Content-Transfer Decomposition Network (CTDN) which performs decomposition within the latent space. The CTDN aims to ensure that the reflectance maps are content-rich and the illumination maps are truly content-free. The detailed architecture of the CTDN is shown in Figure 4.
The figure illustrates the CTDN architecture proposed in the paper, showing the Retinex decomposition process, including the feature extraction and combination of the low-light reflectance and the normal-light illumination map.
Fig. 4: The detailed architecture of our proposed CTDN.
The process within CTDN is as follows:
-
Initial Estimation: For an input latent feature $F$ (which can be $F_{low}$ or $F_{high}$), initial illumination and reflectance maps are estimated following a method similar to [14]: $ \hat{L}(x) = \max_{c \in \{1, \dots, C\}} F^{c}(x) $ and $ \hat{R}(x) = \frac{F(x)}{\hat{L}(x) + \varepsilon} $
- $x$: Represents a pixel location in the latent feature map.
- $F^{c}(x)$: The value of the latent feature at pixel $x$ for channel $c$.
- $C$: The total number of channels in the latent feature.
- $\max_{c}$: Takes the maximum value across all channels at a given pixel to estimate the initial illumination intensity.
- $\hat{L}(x)$: The initially estimated illumination value at pixel $x$.
- $\hat{R}(x)$: The initially estimated reflectance value at pixel $x$, calculated by dividing the feature by the illumination (similar to the Retinex multiplicative model).
- $\varepsilon$: A small constant added to the denominator to prevent division by zero or very small values, ensuring numerical stability.
-
Feature Embedding: The initially estimated maps $\hat{L}$ and $\hat{R}$ are then refined. First, several convolutional blocks (Convs) are applied to obtain embedded illumination and reflectance features.
-
Cross-Attention (CA) for Reflectance Reinforcement: A cross-attention (CA) module [21] is used to leverage the illumination-map features to reinforce the content information in the reflectance-map features. This helps to ensure that the reflectance branch captures all relevant content details:
- The cross-attention mechanism allows the reflectance features to query the illumination features for relevant contextual information, helping to refine the reflectance by incorporating details that might be implicitly tied to illumination variations but are intrinsically part of the content.
-
Self-Attention (SA) for Illumination Content Extraction: A self-attention (SA) module [50] is applied to the illumination-map features to further extract any remaining content information that might still be present within them:
- The self-attention helps the network identify and isolate content patterns within the illumination branch, which should ideally be content-free.
-
Final Map Generation: The final reflectance map $R$ and illumination map $L$ are then derived. The extracted content information is added to the refined reflectance features (to ensure the reflectance map is truly content-rich) and subtracted from the illumination features (to ensure the illumination map is truly content-free), and these intermediate results are passed through additional convolutional blocks (Convs). This transfer mechanism ("content-transfer") explicitly moves any residual content from the illumination branch to the reflectance branch.

As a result of this sophisticated CTDN design, the method is able to generate content-rich reflectance maps that fully represent the intrinsic information of the image, and content-free illumination maps that only reveal the lighting conditions, as demonstrated in Figure 3b, contrasting with previous methods in Figure 3a. A compact code sketch of this content-transfer idea follows Fig. 3 below.
The figure compares decomposition results in the image space and the latent space: (a) shows image-space decompositions by RetinexNet, KinD++, URetinexNet, and PairLIE, while (b) shows the latent-space decomposition by our CTDN, which yields content-rich reflectance maps and content-free illumination maps.
-
Fig. 3: Illustration of the decomposition results obtained by different methods. (a) shows the results of previous methods, i.e., RetinexNet [58], KinD++ [74], URetinexNet [59], and PairLIE [10], that perform decomposition in image space. (b) presents the results of our CTDN that performs decomposition in latent space. Our method can generate content-rich reflectance maps and content-free illumination maps.
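The following PyTorch sketch captures the content-transfer idea under stated assumptions: a LIME-style initial estimation, cross-attention that lets the reflectance branch query the illumination branch, self-attention that isolates residual content in the illumination branch, and the final add/subtract transfer. The convolution and attention blocks are stand-ins, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class CTDNSketch(nn.Module):
    """Content-transfer decomposition sketch; all blocks are illustrative stand-ins."""
    def __init__(self, channels):
        super().__init__()
        self.embed_r = nn.Conv2d(channels, channels, 3, padding=1)
        self.embed_l = nn.Conv2d(channels, channels, 3, padding=1)
        self.cross_attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)
        self.self_attn = nn.MultiheadAttention(channels, num_heads=1, batch_first=True)
        self.out_r = nn.Conv2d(channels, channels, 3, padding=1)
        self.out_l = nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, feat, eps=1e-4):
        # Initial estimation (cf. LIME): illumination = max over channels.
        l_init = feat.max(dim=1, keepdim=True).values
        r_init = feat / (l_init + eps)
        f_r = self.embed_r(r_init)
        f_l = self.embed_l(l_init.expand_as(feat))
        b, c, h, w = f_r.shape
        seq_r = f_r.flatten(2).transpose(1, 2)   # (B, HW, C)
        seq_l = f_l.flatten(2).transpose(1, 2)
        # CA: reflectance queries illumination to reinforce content details.
        r_refined, _ = self.cross_attn(seq_r, seq_l, seq_l)
        # SA: isolate residual content still hiding in the illumination branch.
        content, _ = self.self_attn(seq_l, seq_l, seq_l)
        # Content transfer: add to reflectance, subtract from illumination.
        r_map = self.out_r((r_refined + content).transpose(1, 2).reshape(b, c, h, w))
        l_map = self.out_l((seq_l - content).transpose(1, 2).reshape(b, c, h, w))
        return r_map, l_map
```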
4.2.3. Latent-Retinex Diffusion Models (LRDM)
Even with the robust CTDN, two challenges remain:
-
Information Loss: Any decomposition process, including Retinex, inevitably involves some information loss.
-
Stubborn Content in Illumination: Despite efforts, there might be challenging cases where the estimated illumination map still retains subtle content information, potentially introducing artifacts.

To address these, the paper proposes a Latent-Retinex Diffusion Model (LRDM). This model leverages the generative capabilities of diffusion models to compensate for lost content and eliminate potential artifacts. It follows the standard forward diffusion and reverse denoising processes.
4.2.3.1. Forward Diffusion
The forward diffusion process progressively adds Gaussian noise to a data point over $T$ steps.
-
Input $\mathbf{x}_0$: The input to the diffusion model at time step $t = 0$ is formed by combining the reflectance map of the low-light image ($R_{low}$) and the illumination map of the normal-light image ($L_{high}$). This reflectance-illumination product is denoted as $ \mathbf{x}_0 = R_{low} \odot L_{high} $.
- This represents the desired enhanced feature, combining the content of the low-light image with the lighting of a normal-light image.
-
Noising Process: A pre-defined variance schedule $\{\beta_1, \dots, \beta_T\}$ is used to gradually transform $\mathbf{x}_0$ into pure Gaussian noise over $T$ steps. Each step adds a small amount of Gaussian noise: $ q(\mathbf{x}_t \mid \mathbf{x}_{t-1}) = \mathcal{N}(\mathbf{x}_t; \sqrt{1-\beta_t}\,\mathbf{x}_{t-1}, \beta_t \mathbf{I}) $
- $q(\mathbf{x}_t \mid \mathbf{x}_{t-1})$: The probability distribution of $\mathbf{x}_t$ given $\mathbf{x}_{t-1}$.
- $\mathbf{x}_t$: The noisy data at time step $t$, where $t \in \{1, \dots, T\}$.
- $\mathcal{N}(\cdot; \mu, \Sigma)$: Denotes a Gaussian (Normal) distribution with mean $\mu$ and covariance matrix $\Sigma$.
- $\sqrt{1-\beta_t}\,\mathbf{x}_{t-1}$: The mean of the Gaussian distribution, indicating that a portion of the previous noisy data is retained.
- $\beta_t \mathbf{I}$: The variance of the Gaussian distribution. $\beta_t$ is a small, positive scalar from the variance schedule, and $\mathbf{I}$ is the identity matrix, meaning noise is added independently to each dimension.
-
Closed-Form Expression: For direct sampling, $\mathbf{x}_t$ can be obtained from $\mathbf{x}_0$ in a single step using parameter re-normalization: $ \mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t $
- $\alpha_t = 1 - \beta_t$: A re-parameterization of the variance.
- $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$: The cumulative product of $\alpha_i$ values up to time $t$.
- $\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: A randomly sampled Gaussian noise vector at time $t$.
4.2.3.2. Reverse Denoising
The reverse denoising process aims to gradually transform randomly sampled Gaussian noise back into the desired enhanced feature.
-
Conditional Generation: A randomly sampled Gaussian noise $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ is progressively denoised into a sharp, restored feature $\hat{F}_{low}$. This process is guided by the encoded low-light feature $F_{low}$, which is denoted as $\tilde{x}$. The guidance ensures that the restored result maintains fidelity to the original low-light image's content: $ p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \tilde{x}) = \mathcal{N}(\mathbf{x}_{t-1}; \mu_\theta(\mathbf{x}_t, \tilde{x}, t), \sigma_t^2 \mathbf{I}) $
- $p_\theta(\mathbf{x}_{t-1} \mid \mathbf{x}_t, \tilde{x})$: The probability distribution of the denoised data $\mathbf{x}_{t-1}$ given the noisy data $\mathbf{x}_t$ and the condition $\tilde{x}$. The subscript $\theta$ indicates that this distribution is learned by the neural network.
- $\mathbf{x}_t$: Noisy data at time step $t$ during the reverse process.
- $\tilde{x}$: The conditional input, which is the encoded low-light feature $F_{low}$.
- $\mu_\theta(\mathbf{x}_t, \tilde{x}, t)$: The mean of the Gaussian distribution for denoising, predicted by the neural network with parameters $\theta$, taking $\mathbf{x}_t$, $\tilde{x}$, and $t$ as input.
- $\sigma_t^2 \mathbf{I}$: The variance of the Gaussian distribution for denoising.
- $\sigma_t^2 = \frac{1-\bar{\alpha}_{t-1}}{1-\bar{\alpha}_t}\,\beta_t$: The variance, derived from the forward process schedule.
-
Mean Value Calculation: The mean value is typically re-parameterized to predict the noise $\epsilon_\theta$: $ \mu_\theta(\mathbf{x}_t, \tilde{x}, t) = \frac{1}{\sqrt{\alpha_t}}\left(\mathbf{x}_t - \frac{1-\alpha_t}{\sqrt{1-\bar{\alpha}_t}}\,\epsilon_\theta(\mathbf{x}_t, \tilde{x}, t)\right) $
- $\epsilon_\theta(\mathbf{x}_t, \tilde{x}, t)$: A neural network (often a U-Net) that predicts the noise component at time step $t$, given the noisy input $\mathbf{x}_t$, the conditional guidance $\tilde{x}$, and the current time step $t$. This network is the core of the diffusion model.
-
Diffusion Loss ($\mathcal{L}_{diff}$): During training, the objective is to optimize the parameters $\theta$ of the noise estimator network so that its prediction for the noise vector is as close as possible to the actual noise that was added during the forward process. This is a standard objective for diffusion models [19]: $ \mathcal{L}_{diff} = \left\| \epsilon_t - \epsilon_\theta(\mathbf{x}_t, \tilde{x}, t) \right\|^2 $
- $\mathcal{L}_{diff}$: The diffusion loss, calculated as the L2 norm (squared Euclidean distance) between the true noise and the predicted noise.
- $\epsilon_t$: The ground-truth noise sampled from $\mathcal{N}(\mathbf{0}, \mathbf{I})$ and used to create $\mathbf{x}_t$ from $\mathbf{x}_0$.
- $\epsilon_\theta(\mathbf{x}_t, \tilde{x}, t)$: The noise predicted by the neural network.
4.2.3.3. Self-Constrained Consistency Loss ($\mathcal{L}_{scc}$)
The initial input $\mathbf{x}_0 = R_{low} \odot L_{high}$ could still contain artifacts if $L_{high}$ is not perfectly content-free. This might disrupt the learned distribution of the diffusion model and affect the quality of the restored feature $\hat{F}_{low}$. To prevent this, a self-constrained consistency loss $\mathcal{L}_{scc}$ is proposed. Its purpose is to ensure that the restored feature retains the same intrinsic content information as the input low-light image.
-
Pseudo Label Construction: In the training phase, a pseudo label is constructed from the decomposition results of the low-light image itself. This serves as a reference for the desired content: $ \tilde{F}_{low} = R_{low} \odot L_{low}^{\gamma} $
- $R_{low}$: The reflectance map of the low-light image.
- $L_{low}$: The illumination map of the low-light image.
- $\gamma$: An illumination correction factor (set to 0.2 in experiments) applied to $L_{low}$. This factor adjusts the brightness of the low-light illumination map to a more "normal" level without changing its content, creating a pseudo ground truth that preserves the original content while having an adjusted illumination.
-
Consistency Loss: The $\mathcal{L}_{scc}$ aims to constrain the similarity between this pseudo label and the restored feature produced by the diffusion model: $ \mathcal{L}_{scc} = \left\| \tilde{F}_{low} - \hat{F}_{low} \right\|_1 $
- $\mathcal{L}_{scc}$: The self-constrained consistency loss, calculated as the L1 norm (Manhattan distance) between the pseudo label and the restored feature. The L1 norm encourages sparsity and preserves edges better than L2.
- $\hat{F}_{low}$: The restored feature generated by the diffusion model.
4.2.3.4. Overall Objective and Training Algorithm
The overall objective function for optimizing the Latent-Retinex Diffusion Model (LRDM) is a combination of the diffusion loss and the self-constrained consistency loss:
- $ \mathcal{L}_{total} = \mathcal{L}_{diff} + \lambda\,\mathcal{L}_{scc} $
- $\lambda$: A hyperparameter that balances the contribution of the self-constrained consistency loss.

The training strategy for LRDM is summarized in Algorithm 1.

    input: the decomposition results R_low and L_high, low-light feature F_low, time step T, and sampling step S
    x_0 = R_low ⊙ L_high, x̃ = F_low
    while not converged do
        ε_t ∼ N(0, I), t ∼ Uniform{1, ..., T}
        perform gradient descent step on ∇_θ ‖ε_t − ε_θ(√ᾱ_t x_0 + √(1 − ᾱ_t) ε_t, x̃, t)‖²
        x_T ∼ N(0, I)
        for i = S : 1 do
            t = (i − 1) · T/S + 1
            t_next = (i − 2) · T/S + 1 if i > 1, else 0
            x_t ← √ᾱ_{t_next} · (x_t − √(1 − ᾱ_t) · ε_θ(x_t, x̃, t)) / √ᾱ_t + √(1 − ᾱ_{t_next}) · ε_θ(x_t, x̃, t)
        end
        perform gradient descent step on ∇_θ ‖R_low ⊙ L_low^γ − x̂_0‖²
    end
Algorithm 1: LRDM training (Simplified and explained)
Input:
- $R_{low}$: Reflectance map of the low-light image (from the CTDN).
- $L_{high}$: Illumination map of the normal-light image (from the CTDN).
- $F_{low}$: Low-light feature (from the encoder $\mathcal{E}(\cdot)$), which serves as the condition $\tilde{x}$.
- $T$: Total time steps for forward diffusion.
- $S$: Sampling steps for reverse denoising.
Initialization:
- $\mathbf{x}_0 = R_{low} \odot L_{high}$ (target for the diffusion model to learn).
- $\tilde{x} = F_{low}$ (guidance for the diffusion model).
Training Loop (while not converged):
- Sample Noise and Time:
  - Sample random noise $\epsilon_t \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$.
  - Sample a random time step $t \sim \mathrm{Uniform}\{1, \dots, T\}$.
- Generate Noisy Input: Create $\mathbf{x}_t$ from $\mathbf{x}_0$ and $\epsilon_t$ using the closed-form forward diffusion equation $\mathbf{x}_t = \sqrt{\bar{\alpha}_t}\,\mathbf{x}_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon_t$.
- Optimize Diffusion Model: Perform a gradient descent step to update the parameters $\theta$ of the noise prediction network $\epsilon_\theta$. The objective is to minimize the difference between the true noise and the predicted noise:
  - Minimize $\|\epsilon_t - \epsilon_\theta(\mathbf{x}_t, \tilde{x}, t)\|^2$ (this is $\mathcal{L}_{diff}$).
- Perform Reverse Denoising (for $S$ steps):
  - Initialize $\mathbf{x}_T \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (random Gaussian noise).
  - For $i = S$ down to 1 (iterative denoising steps):
    - Calculate the current time $t = (i-1) \cdot T/S + 1$.
    - Calculate the next time $t_{next} = (i-2) \cdot T/S + 1$ (if $i > 1$, else 0).
    - Estimate the noise $\epsilon_\theta(\mathbf{x}_t, \tilde{x}, t)$ using the current model.
    - Update $\mathbf{x}_t$ to $\mathbf{x}_{t_{next}}$ using the estimated noise and the DDIM (Denoising Diffusion Implicit Models) [49] sampling formula (the $\mathbf{x}_t$ update line in the algorithm block). This explicitly generates $\hat{\mathbf{x}}_0$, which is $\hat{F}_{low}$ after $S$ steps.
  - The restored feature $\hat{F}_{low}$ is the final $\hat{\mathbf{x}}_0$ obtained after this sampling loop.
- Optimize Self-Constrained Consistency Loss: Perform a gradient descent step to update $\theta$ to minimize the $\mathcal{L}_{scc}$ between the pseudo label and the restored feature (the $\hat{F}_{low}$ generated in the previous step):
  - Minimize $\|\tilde{F}_{low} - \hat{F}_{low}\|_1$ (this is $\mathcal{L}_{scc}$).
  - Note: The algorithm's last line ("perform gradient descent step on ...") suggests an L2 loss, but the text states an L1 norm in Equation 6. This discrepancy might be a minor detail in implementation or presentation; following the text, it is L1.

The U-Net [46] architecture is adopted as the noise estimator network. The number of time steps $T$ is 1000, and the number of sampling steps $S$ for reverse denoising is 20.
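The sketch below mirrors one iteration of the training loop explained above: the noise-prediction loss on $\mathbf{x}_0 = R_{low} \odot L_{high}$, an $S$-step DDIM-style sampling pass guided by $F_{low}$, and the self-constrained consistency loss against the pseudo label. `eps_model` is a placeholder network; the beta schedule, the weight `lambda_scc`, and the sampler indexing are assumptions, while $T=1000$, $S=20$, and $\gamma=0.2$ follow the paper.

```python
import torch
import torch.nn.functional as F

T, S = 1000, 20
betas = torch.linspace(1e-4, 2e-2, T)          # assumed linear beta schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)  # cumulative ᾱ_t

def lrdm_training_step(eps_model, r_low, l_high, l_low, f_low,
                       gamma=0.2, lambda_scc=1.0):
    """One LRDM training iteration (sketch of Algorithm 1, not the authors' code)."""
    x0 = r_low * l_high                        # x_0 = R_low ⊙ L_high
    cond = f_low                               # guidance x̃ = F_low

    # Diffusion (noise-prediction) loss.
    t = torch.randint(0, T, (x0.shape[0],))
    noise = torch.randn_like(x0)
    a = alpha_bar[t].sqrt().view(-1, 1, 1, 1)
    b = (1.0 - alpha_bar[t]).sqrt().view(-1, 1, 1, 1)
    x_t = a * x0 + b * noise                   # closed-form forward diffusion
    l_diff = F.mse_loss(eps_model(x_t, cond, t), noise)

    # S-step DDIM-style reverse sampling to obtain the restored feature.
    x = torch.randn_like(x0)                   # x_T ~ N(0, I)
    for i in range(S, 0, -1):
        ti = (i - 1) * T // S + 1
        tn = (i - 2) * T // S + 1 if i > 1 else 0
        abar_t = alpha_bar[ti - 1]
        abar_n = alpha_bar[tn - 1] if tn > 0 else torch.tensor(1.0)
        tt = torch.full((x0.shape[0],), ti)
        eps = eps_model(x, cond, tt)
        x0_pred = (x - (1.0 - abar_t).sqrt() * eps) / abar_t.sqrt()
        x = abar_n.sqrt() * x0_pred + (1.0 - abar_n).sqrt() * eps
    f_restored = x                             # F̂_low after S steps

    # Self-constrained consistency loss against the pseudo label R_low ⊙ L_low^γ.
    pseudo = r_low * torch.pow(l_low.clamp(min=1e-6), gamma)
    l_scc = F.l1_loss(f_restored, pseudo)
    return l_diff + lambda_scc * l_scc
```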
4.2.4. Network Training
The training process for LightenDiffusion is divided into two stages:
-
Stage 1: Training the Encoder, CTDN, and Decoder:
- Objective: To effectively train the encoder, the Content-Transfer Decomposition Network (CTDN), and the decoder to accurately perform feature extraction, latent-space decomposition, and reconstruction. During this stage, the parameters of the diffusion model are frozen.
- Dataset: This stage uses low-light image pairs (e.g., $I_{low}^{1}$ and $I_{low}^{2}$) from the SICE dataset [3]. These are different exposures of the same scene, used to enforce consistency.
- Loss Functions:
  - Content Loss ($\mathcal{L}_{con}$): Optimizes the encoder and decoder to ensure faithful reconstruction of the input low-light images: $ \mathcal{L}_{con} = \sum_{i} \| I_{low}^{i} - \mathcal{D}(\mathcal{E}(I_{low}^{i})) \|_2 $
    - $\mathcal{L}_{con}$: The content loss, measured as the L2 norm between the input low-light image and its reconstructed version after passing through the encoder and decoder.
    - $I_{low}^{i}$: The $i$-th input low-light image.
    - $\mathcal{D}(\mathcal{E}(I_{low}^{i}))$: The reconstructed image.
  - Decomposition Loss ($\mathcal{L}_{dec}$): Optimizes the CTDN to ensure proper Retinex decomposition. It consists of three components [58, 71, 74]:
    - Reconstruction Loss ($\mathcal{L}_{rec}$): Guarantees that the decomposed reflectance and illumination components can reconstruct the original input latent features: $ \mathcal{L}_{rec} = \sum_{i,j \in \{1,2\}} \| R_{low}^{i} \odot L_{low}^{j} - F_{low}^{j} \|_1 $
      - $\mathcal{L}_{rec}$: The reconstruction loss, calculated as the L1 norm.
      - $F_{low}^{j}$: The $j$-th low-light latent feature.
      - $R_{low}^{i}$: The $i$-th low-light reflectance map.
      - $L_{low}^{j}$: The $j$-th low-light illumination map. This loss ensures that combining any reflectance from one low-light image with illumination from another (or the same) low-light image can reconstruct the respective original low-light feature. This helps disentangle the components.
    - Reflectance Consistency Loss ($\mathcal{L}_{rc}$): Enforces that the reflectance maps derived from different low-light images of the same scene (e.g., different exposures) should be consistent, as reflectance should be invariant to illumination changes: $ \mathcal{L}_{rc} = \| R_{low}^{1} - R_{low}^{2} \|_1 $
      - $\mathcal{L}_{rc}$: The reflectance consistency loss, calculated as the L1 norm.
      - $R_{low}^{1}, R_{low}^{2}$: Reflectance maps from two different low-light inputs of the same scene.
    - Illumination Smoothing Loss ($\mathcal{L}_{is}$): Encourages the illumination map to be locally smooth, suppressing high-frequency details (which should belong to the reflectance map). It is weighted by the reflectance gradient to preserve edges: $ \mathcal{L}_{is} = \sum_{i} \| \nabla L_{low}^{i} \odot \exp(-\lambda_{g} \nabla R_{low}^{i}) \|_2 $
      - $\mathcal{L}_{is}$: The illumination smoothing loss, calculated as the L2 norm.
      - $\nabla$: Denotes the horizontal and vertical gradient operator.
      - $\nabla L_{low}^{i}$: Gradient of the illumination map, which should ideally be small in smooth regions.
      - $\nabla R_{low}^{i}$: Gradient of the reflectance map, which indicates edges and content details.
      - $\exp(-\lambda_{g} \nabla R_{low}^{i})$: An edge-aware weight. Where reflectance gradients are large (indicating an edge), this term becomes small, effectively reducing the penalty on illumination gradients near strong edges. This allows the illumination map to change more abruptly across object boundaries while remaining smooth within regions.
      - $\lambda_{g}$: A coefficient to balance the perceived strength of the structure.
    - Overall Decomposition Loss: $ \mathcal{L}_{dec} = \mathcal{L}_{rec} + \omega_{rc}\,\mathcal{L}_{rc} + \omega_{is}\,\mathcal{L}_{is} $
      - $\omega_{rc}, \omega_{is}$: Hyperparameters balancing the components of $\mathcal{L}_{dec}$.
-
Stage 2: Training the Diffusion Model:
- Objective: To train the Latent-Retinex Diffusion Model (LRDM) using the combined loss $\mathcal{L}_{total} = \mathcal{L}_{diff} + \lambda\,\mathcal{L}_{scc}$. During this stage, the encoder, CTDN, and decoder (which were trained in Stage 1) are frozen.
- Dataset: Approximately 180k unpaired low-light/normal-light image pairs are collected for this stage, providing a diverse set of real-world scenarios.

This two-stage training strategy allows for specialized optimization of the decomposition and reconstruction components first, followed by the generative diffusion process, leading to a more stable and effective overall framework.
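For concreteness, here is a hedged PyTorch sketch of the Stage-1 decomposition objectives described above (cross reconstruction, reflectance consistency, and edge-aware illumination smoothness); the exact weighting, the edge-aware form, and `lam_g` are assumptions rather than the paper's verified implementation.

```python
import torch

def gradient(x):
    """Finite-difference horizontal and vertical gradients (absolute values)."""
    dh = (x[..., :, 1:] - x[..., :, :-1]).abs()
    dv = (x[..., 1:, :] - x[..., :-1, :]).abs()
    return dh, dv

def decomposition_losses(f1, f2, r1, l1, r2, l2, lam_g=10.0):
    """Stage-1 decomposition losses (sketch); lam_g is an assumed coefficient."""
    # Cross reconstruction: any reflectance combined with either illumination
    # should reproduce the corresponding low-light latent feature.
    l_rec = sum(torch.mean(torch.abs(r * l - f))
                for r in (r1, r2) for l, f in ((l1, f1), (l2, f2)))
    # Reflectance consistency: reflectance is invariant to exposure changes.
    l_rc = torch.mean(torch.abs(r1 - r2))
    # Edge-aware illumination smoothness: penalize illumination gradients,
    # except where the reflectance itself has strong edges.
    l_is = 0.0
    for r, l in ((r1, l1), (r2, l2)):
        rh, rv = gradient(r)
        lh, lv = gradient(l)
        l_is = l_is + torch.mean(lh * torch.exp(-lam_g * rh)) \
                    + torch.mean(lv * torch.exp(-lam_g * rv))
    return l_rec, l_rc, l_is
```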
5. Experimental Setup
5.1. Datasets
The experiments evaluate the performance of LightenDiffusion on a variety of publicly available datasets, categorized as paired and unpaired.
5.1.1. Paired Datasets
These datasets contain low-light images and their corresponding normal-light ground-truth images, allowing for precise evaluation with full-reference metrics.
- LOL [58]: (Low-light Image Enhancement Dataset)
- Characteristics: A widely used dataset for
LLIE, providing paired low/normal-light images. It often serves as a benchmark for training and evaluating supervised methods. - Usage: Used for quantitative and qualitative comparisons, particularly for evaluating against supervised methods that are typically trained on this dataset.
- Characteristics: A widely used dataset for
- LSRW [16]: (Large-Scale Real-World low-light dataset)
- Characteristics: Another paired dataset for
LLIE, potentially with more diverse scenes or specific characteristics (e.g., related to saliency). - Usage: Used for quantitative and qualitative comparisons.
- Characteristics: Another paired dataset for
5.1.2. Unpaired Datasets
These datasets consist only of low-light images (without corresponding normal-light ground truth) and are used to evaluate the generalization ability of models to real-world, unseen scenarios.
- DICM [28]:
- Characteristics: A collection of real-world low-light images.
- Usage: Used for quantitative and qualitative comparisons, specifically to assess unsupervised methods' performance and generalization.
- NPE [53]: (Naturalness Preserved Enhancement)
- Characteristics: Contains real-world low-light images.
- Usage: Used for quantitative and qualitative comparisons.
- VV [51]: (Vonikakis-Kouskouridas-Gasteratos)
- Characteristics: Another dataset of real-world low-light images, often used for evaluating illumination compensation algorithms.
- Usage: Used for quantitative and qualitative comparisons.
5.1.3. Face Detection Dataset
- DARK FACE [69]:
- Characteristics: Consists of 6,000 images captured under weakly illuminated conditions with annotated labels for face detection. This dataset is specifically designed to test the impact of LLIE methods as a pre-processing step for improving low-light face detection.
- Usage: Used to investigate the practical value of LLIE methods in improving downstream vision tasks.
- Characteristics: Consists of 6,000 images captured under weakly illuminated conditions with annotated labels for face detection. This dataset is specifically designed to test the impact of
5.2. Evaluation Metrics
The choice of evaluation metrics depends on whether paired ground truth images are available.
5.2.1. For Paired Datasets (LOL, LSRW)
For these datasets, where ground-truth normal-light images ($I_{GT}$) are available, full-reference metrics are used:
-
PSNR (Peak Signal-to-Noise Ratio)
- Conceptual Definition: PSNR quantifies the quality of reconstruction of an image compared to an original image. It is most commonly used to measure the quality of reconstruction of lossy compression codecs or image processing methods like enhancement. A higher PSNR generally indicates a better reconstruction, implying that the enhanced image is closer to the ground truth. It is inversely related to the mean squared error (MSE).
- Mathematical Formula: $ \text{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I_{GT}(i,j) - I_{ENH}(i,j)]^2 $ $ \text{PSNR} = 10 \cdot \log_{10} \left( \frac{\text{MAX}_I^2}{\text{MSE}} \right) $
- Symbol Explanation:
  - $I_{GT}$: The ground-truth normal-light image.
  - $I_{ENH}$: The enhanced image produced by the method.
  - M, N: The dimensions (height and width) of the image.
  - $I_{GT}(i,j), I_{ENH}(i,j)$: The pixel values at coordinates (i,j) in the ground-truth and enhanced images, respectively.
  - $\text{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images).
  - $\text{MSE}$: Mean Squared Error.
-
SSIM (Structural Similarity Index Measure)
- Conceptual Definition: SSIM is a perceptual metric that evaluates the similarity between two images. Unlike PSNR, which focuses on absolute errors, SSIM considers image degradation as a perceived change in structural information, also incorporating luminance and contrast changes. It attempts to model how the human visual system perceives quality. An SSIM value closer to 1 indicates higher similarity and better perceived quality.
- Mathematical Formula: $ \text{SSIM}(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
- Symbol Explanation:
  - x, y: Two image patches (or the entire images) being compared, representing $I_{GT}$ and $I_{ENH}$.
  - $\mu_x, \mu_y$: The average (mean) of $x$ and $y$, respectively.
  - $\sigma_x^2, \sigma_y^2$: The variance of $x$ and $y$, respectively.
  - $\sigma_{xy}$: The covariance of $x$ and $y$.
  - $C_1, C_2$: Small constants to prevent division by zero or near-zero values; they depend on the dynamic range of the pixel values (e.g., 255 for 8-bit images) and small constants that are typically set to 0.01 and 0.03.
-
LPIPS (Learned Perceptual Image Patch Similarity)
- Conceptual Definition: LPIPS is a perceptual metric that uses features from a pre-trained deep convolutional neural network (e.g., VGG, AlexNet) to measure the "perceptual distance" between two images. It aims to better align with human judgment of image similarity than traditional metrics like PSNR or SSIM. A lower LPIPS score indicates higher perceptual similarity and better quality.
- Mathematical Formula: LPIPS is a learned metric, meaning its calculation involves forward passes through a deep neural network and is not represented by a single, simple mathematical formula in the same way as PSNR or SSIM. Therefore, a specific formula for LPIPS is not provided here, as per typical practice for learned metrics.
5.2.2. For Unpaired Datasets (DICM, NPE, VV)
For these datasets, where no ground-truth normal-light images are available, non-reference (or "no-reference") perceptual metrics are used:
-
NIQE (Naturalness Image Quality Evaluator)
- Conceptual Definition: NIQE is a no-reference image quality assessment (NR-IQA) metric. It works by building a natural scene statistic (NSS) model from a database of pristine natural images and then measures the distance between the NSS features of the test image and those of the pristine model. A lower NIQE score indicates better image quality and greater naturalness.
- Mathematical Formula: NIQE is based on a statistical model derived from natural images. It involves fitting a Generalized Gaussian Distribution (GGD) to local image features (e.g., mean subtracted contrast normalized (MSCN) coefficients) and then calculating a distance measure between the multivariate Gaussian model parameters of the enhanced image and those of a reference natural-image database. A single simple formula is not typically provided for NIQE, as it relies on a trained statistical model.
  - The core idea involves a distance computation (e.g., Mahalanobis distance) between the feature vectors of the test image and the natural-image model.
-
PI (Perceptual Index)
- Conceptual Definition: PI is a no-reference perceptual metric typically used in image enhancement and super-resolution tasks. It is designed to correlate well with human perception of image quality. Like LPIPS, it often incorporates elements related to naturalness and distortion, aiming for a single score that reflects overall visual appeal. A lower PI score indicates better perceptual quality.
- Mathematical Formula: PI is usually a composite of other no-reference metrics (in the PIRM challenge it is defined as $ \text{PI} = \frac{1}{2}\big((10 - \text{Ma}) + \text{NIQE}\big) $, where Ma is the no-reference score of Ma et al.), so a universally standard, single formula is not always given.
5.2.3. For Low-Light Face Detection (DARK FACE)
- AP (Average Precision)
  - Conceptual Definition: Average Precision (AP) is a common metric used to evaluate the performance of object detection models. It summarizes the precision-recall curve into a single value. A precision-recall curve plots the precision (the proportion of true positive detections among all positive detections) against recall (the proportion of true positive detections among all actual positive instances). AP is calculated as the area under this curve. A higher AP indicates better detection performance, meaning the model can detect more relevant objects (higher recall) while maintaining high accuracy (higher precision).
  - Mathematical Formula: $ \text{AP} = \sum_{n} (R_n - R_{n-1})P_n $ This is a common way to approximate the area under the precision-recall curve by summing the areas of rectangles.
  - Symbol Explanation:
    - $P_n$: The precision at the $n$-th threshold (or recall level).
    - $R_n$: The recall at the $n$-th threshold (or recall level).
    - $(R_n - R_{n-1})$: The change in recall between consecutive thresholds.
  - The metric is typically calculated at a specific Intersection over Union (IoU) threshold (e.g., 0.5), which defines what constitutes a correct detection.
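The toy sketch below implements the summation formula above for recall-sorted precision/recall points; real detection benchmarks such as the DARK FACE evaluation additionally perform IoU-based matching and precision interpolation, which are omitted here.

```python
def average_precision(precisions, recalls):
    """AP ≈ Σ_n (R_n - R_{n-1}) · P_n for recall-sorted (P, R) points (sketch)."""
    ap, prev_recall = 0.0, 0.0
    for p, r in sorted(zip(precisions, recalls), key=lambda pr: pr[1]):
        ap += (r - prev_recall) * p
        prev_recall = r
    return ap

# Toy precision/recall points from a hypothetical detector sweep.
print(average_precision([0.9, 0.8, 0.6], [0.2, 0.5, 0.8]))
```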
5.3. Baselines
The proposed method LightenDiffusion is compared against a comprehensive set of existing LLIE methods, categorized by their approach:
5.3.1. Traditional Methods
- LIME [14]
- SDDLLE [17]
- CDEF [30]
- BrainRetinex [4]
5.3.2. Supervised Methods
- RetinexNet [58]
- KinD++ [74]
- LCDPNet [52]
- URetinexNet [59]
- SMG [64]
- PyDiff [76] (a diffusion-based supervised method)
- GSAD [20] (a diffusion-based supervised method)
5.3.3. Semi-supervised Methods
- DRBN [68]
- BLL [39]
5.3.4. Unsupervised Methods
-
Zero-DCE [12] -
EnlightenGAN [24] -
RUAS [32] -
SCI [40] -
GDP [8](a zero-shot diffusion-based method) -
PairLIE [10] -
NeRCo [67]These baselines represent a wide range of approaches, from classical techniques to cutting-edge deep learning methods, including both supervised and unsupervised paradigms, as well as other
diffusion-basedmethods. This diverse comparison set helps to rigorously positionLightenDiffusionwithin the broaderLLIElandscape.
5.4. Implementation Details
- Framework: Implemented with PyTorch.
- Hardware: Four NVIDIA RTX 2080Ti GPUs.
- Batch Size: 12.
- Patch Size: .
- Training Iterations:
  - Stage 1 (Encoder, CTDN, Decoder): iterations.
  - Stage 2 (Diffusion Model): iterations.
- Optimizer: Adam optimizer [26].
- Learning Rate:
  - Stage 1: Initial .
  - Stage 2: Reinitialized to a fixed value, then decays by a factor of 0.8.
- Hyperparameters:
  - Encoder downsampling scale $s$: 3.
  - Illumination correction factor $\gamma$: 0.2.
  - Loss weights: , , , .
- LRDM (Diffusion Model) specifics:
  - Noise estimator network: U-Net [46] architecture.
  - Time steps $T$: 1000 (for forward diffusion).
  - Sampling steps $S$: 20 (for reverse denoising).
- Evaluation: For GDP and LightenDiffusion, the reported performance is the mean over five evaluations due to the stochastic nature of generative models.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Quantitative Comparison
The following are the results from Table 1 of the original paper:
| Type | Method | LOL [58] PSNR ↑ | SSIM ↑ | LPIPS ↓ | LSRW [16] PSNR ↑ | SSIM ↑ | LPIPS ↓ | DICM [28] NIQE ↓ | PI ↓ | NPE [53] NIQE ↓ | PI ↓ | VV [51] NIQE ↓ | PI ↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| T | LIME [14] | 17.546 | 0.531 | 0.290 | 17.342 | 0.520 | 0.416 | 4.476 | 4.216 | 4.170 | 3.789 | 3.713 | 3.335 |
| | SDDLLE [17] | 13.342 | 0.634 | 0.261 | 14.708 | 0.486 | 0.382 | 4.581 | 3.828 | 4.179 | 3.315 | 4.274 | 3.382 |
| | CDEF [30] | 16.335 | 0.585 | 0.351 | 16.758 | 0.465 | 0.314 | 4.142 | 4.242 | 3.862 | 2.910 | 5.051 | 3.272 |
| | BrainRetinex [4] | 11.063 | 0.475 | 0.327 | 12.506 | 0.390 | 0.374 | 4.350 | 3.555 | 3.707 | 3.044 | 4.031 | 3.114 |
| SL | RetinexNet [58] | 16.774 | 0.462 | 0.390 | 15.609 | 0.414 | 0.393 | 4.487 | 3.242 | 4.732 | 3.219 | 5.881 | 3.727 |
| | KinD++ [74] | 17.752 | 0.758 | 0.198 | 16.085 | 0.394 | 0.366 | 4.027 | 3.399 | 4.005 | 3.144 | 3.586 | 2.773 |
| | LCDPNet [52] | 14.506 | 0.575 | 0.312 | 15.689 | 0.474 | 0.344 | 4.110 | 3.250 | 4.106 | 3.127 | 5.039 | 3.347 |
| | URetinexNet [59] | 19.842 | 0.824 | 0.128 | 18.271 | 0.518 | 0.295 | 4.774 | 3.565 | 4.028 | 3.153 | 3.851 | 2.891 |
| | SMG [64] | 23.814 | 0.809 | 0.144 | 17.579 | 0.538 | 0.456 | 6.224 | 4.228 | 5.300 | 3.627 | 5.752 | 3.757 |
| | PyDiff [76] | 23.275 | 0.859 | 0.108 | 17.264 | 0.510 | 0.335 | 4.499 | 3.792 | 4.082 | 3.268 | 4.360 | 3.678 |
| | GSAD [20] | 22.021 | 0.848 | 0.137 | 17.414 | 0.507 | 0.294 | 4.496 | 3.593 | 4.489 | 3.361 | 5.252 | 3.657 |
| SSL | DRBN [68] | 16.677 | 0.730 | 0.252 | 16.734 | 0.507 | 0.376 | 4.369 | 3.800 | 3.921 | 3.267 | 3.671 | 3.117 |
| | BLL [39] | 10.305 | 0.401 | 0.382 | 12.444 | 0.333 | 0.384 | 5.046 | 4.055 | 4.885 | 3.870 | 5.740 | 4.030 |
| UL | Zero-DCE [12] | 14.861 | 0.562 | 0.330 | 15.867 | 0.443 | 0.315 | 3.951 | 3.149 | 3.826 | 2.918 | 5.080 | 3.307 |
| | EnlightenGAN [24] | 17.606 | 0.653 | 0.319 | 17.106 | 0.463 | 0.322 | 3.832 | 3.256 | 3.775 | 2.953 | 3.689 | 2.749 |
| | RUAS [32] | 16.405 | 0.503 | 0.257 | 14.271 | 0.461 | 0.455 | 7.306 | 5.700 | 7.198 | 5.651 | 4.987 | 4.329 |
| | SCI [40] | 14.784 | 0.525 | 0.333 | 15.242 | 0.419 | 0.321 | 4.519 | 3.700 | 4.124 | 3.534 | 5.312 | 3.648 |
| | GDP [8] | 15.896 | 0.542 | 0.337 | 12.887 | 0.362 | 0.386 | 4.358 | 3.552 | 4.032 | 3.097 | 4.683 | 3.431 |
| | PairLIE [10] | 19.514 | 0.731 | 0.254 | 17.602 | 0.501 | 0.323 | 4.282 | 3.469 | 4.661 | 3.543 | 3.373 | 2.734 |
| | NeRCo [67] | 19.738 | 0.740 | 0.239 | 17.844 | 0.535 | 0.371 | 4.107 | 3.345 | 3.902 | 3.037 | 3.765 | 3.094 |
| | Ours | 20.453 | 0.803 | 0.192 | 18.555 | 0.539 | 0.311 | 3.724 | 3.144 | 3.618 | 2.879 | 2.941 | 2.558 |
Analysis of Quantitative Results:

- Paired datasets (LOL, LSRW):
  - LOL dataset: LightenDiffusion (Ours) achieves the highest PSNR (20.453) and SSIM (0.803) among all unsupervised learning (UL) methods, along with a very competitive LPIPS (0.192). Some supervised learning (SL) methods such as SMG (PSNR 23.814, SSIM 0.809, LPIPS 0.144) and PyDiff (PSNR 23.275, SSIM 0.859, LPIPS 0.108) outperform LightenDiffusion on LOL, which is expected since they are trained directly on this paired dataset. Nevertheless, LightenDiffusion still surpasses several SL methods (e.g., RetinexNet, KinD++, LCDPNet) on LOL, showing strong performance even without paired supervision.
  - LSRW dataset: Here LightenDiffusion truly shines. It achieves the best PSNR (18.555) and SSIM (0.539) across all categories of methods, including supervised ones. Its LPIPS (0.311) is also highly competitive, better than SMG and PyDiff, though slightly higher than GSAD and URetinexNet. This result strongly supports the paper's claim that LightenDiffusion generalizes better than supervised methods, as it performs best on a paired dataset different from the one those methods were tuned on.
- Unpaired datasets (DICM, NPE, VV):
  - For these real-world benchmarks, where no ground truth is available, the no-reference metrics NIQE and PI (lower is better for both) are used.
  - LightenDiffusion consistently outperforms all other unsupervised competitors on all three datasets: DICM (NIQE 3.724, PI 3.144), NPE (NIQE 3.618, PI 2.879), and VV (NIQE 2.941, PI 2.558).
  - Importantly, unsupervised methods (including LightenDiffusion) generally exhibit much better generalization (lower NIQE/PI) on these unseen datasets than supervised methods. For instance, SMG, a strong supervised method, performs poorly here (e.g., DICM NIQE 6.224, PI 4.228), highlighting the generalization gap of supervised approaches. LightenDiffusion's superior performance in these real-world settings is a critical validation of its design.

In summary, the quantitative results demonstrate that LightenDiffusion sets a new state of the art for unsupervised LLIE, showing performance comparable to, and on unseen datasets sometimes surpassing, supervised methods while maintaining excellent visual quality. (A short sketch of how the full-reference metrics can be computed follows.)
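The full-reference scores above (PSNR, SSIM, LPIPS) can be reproduced with standard libraries; a minimal sketch is shown below. The preprocessing details (color space, resizing, data range) are assumptions here and should follow the authors' evaluation code; NIQE and PI are no-reference metrics typically computed with separate toolboxes and are omitted.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_metrics(pred: np.ndarray, gt: np.ndarray, lpips_net) -> dict:
    """pred, gt: HxWx3 uint8 RGB images. Returns PSNR (higher better), SSIM (higher better), LPIPS (lower better)."""
    psnr = peak_signal_noise_ratio(gt, pred, data_range=255)
    ssim = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
    # LPIPS expects NCHW tensors scaled to [-1, 1].
    to_tensor = lambda img: torch.from_numpy(img).permute(2, 0, 1)[None].float() / 127.5 - 1.0
    lp = lpips_net(to_tensor(pred), to_tensor(gt)).item()
    return {"PSNR": psnr, "SSIM": ssim, "LPIPS": lp}

if __name__ == "__main__":
    net = lpips.LPIPS(net="alex")  # AlexNet backbone is the common default choice
    pred = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    gt = np.random.randint(0, 256, (256, 256, 3), dtype=np.uint8)
    print(full_reference_metrics(pred, gt, net))
```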
6.1.2. Qualitative Comparison
The visual comparisons further support the quantitative findings.

- Paired datasets (LOL, LSRW; Figure 5):

  Fig. 5: Qualitative comparison of our method and competitive methods on the LOL [58] and LSRW [16] test sets. Best viewed by zooming in.

  - Figure 5 (top row, LOL): Other methods often show underexposure (GDP), color distortion (URetinexNet), or noise amplification (SMG). LightenDiffusion achieves proper brightness, vibrant colors, and sharp details without noticeable noise.
  - Figure 5 (bottom row, LSRW): Similar trends are observed. Some methods (e.g., SMG) overexpose certain areas or introduce color casts, while LightenDiffusion provides a balanced enhancement.

- Unpaired datasets (DICM, NPE, VV; Figure 6):

  Fig. 6: Qualitative comparison of our method and competitive methods on the DICM [28], NPE [53], and VV [51] datasets. Best viewed by zooming in.

  - Figure 6 (row 1, DICM): Competing methods such as NeRCo can still leave images slightly dark, while PairLIE may shift colors slightly. LightenDiffusion enhances visibility while preserving natural colors.
  - Figure 6 (row 2, NPE): This row highlights a common failure mode of other methods: artifacts around light sources (PairLIE, NeRCo) or overexposed regions. LightenDiffusion handles these challenging lighting conditions effectively, producing correct exposure without artifacts.
  - Figure 6 (row 3, VV): LightenDiffusion maintains vivid colors and proper contrast, outperforming methods that produce duller or less natural-looking results.

  These visual results confirm that LightenDiffusion not only achieves superior quantitative scores but also produces visually pleasing, natural-looking enhanced images, with improved global and local contrast, sharper details, and effective noise suppression, and it particularly excels in generalizing to diverse real-world low-light scenes.
6.1.3. Low-Light Face Detection
The paper investigates the practical utility of LLIE methods by using them as a pre-processing step for low-light face detection on the DARK FACE dataset [69]. The RetinaFace [7] detector is used for evaluation with an IoU threshold of 0.3, and Average Precision (AP) is calculated.
The following figure (Figure 7 from the original paper) shows the comparison of low-light face detection results:
Fig. 7: Comparison of low-light face detection results on the DARK FACE dataset [69]. The left panel shows precision-recall curves for the compared methods; the right panel shows an input image and the enhanced results of three methods, including ours.
Analysis of Face Detection Results:
- The precision-recall (P-R) curves in Figure 7 clearly show that enhancing low-light images before feeding them to a face detector significantly improves performance.
- On raw, unenhanced images, the RetinaFace detector achieves a low AP of 20.2%.
- After enhancement by LightenDiffusion, the AP of RetinaFace improves significantly to 36.4%, a substantial boost in detection performance.
- Compared with other LLIE methods, LightenDiffusion performs best in the high-recall region, indicating that it lets the detector find a larger proportion of faces while maintaining high precision.
- This result underscores the practical value of LightenDiffusion as a robust pre-processing solution for downstream vision tasks operating in challenging low-light environments. (A simplified sketch of the IoU/AP evaluation protocol follows.)
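The detection protocol itself is standard: predicted boxes are matched to ground-truth faces at an IoU threshold of 0.3, and AP is the area under the resulting precision-recall curve. The single-image sketch below illustrates that matching/AP logic; it is not the official DARK FACE evaluation code.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in [x1, y1, x2, y2] format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def average_precision(detections, gt_boxes, iou_thr=0.3):
    """detections: list of (box, confidence); gt_boxes: list of boxes.
    Greedy matching at the given IoU threshold, then AP as the area under the P-R curve."""
    detections = sorted(detections, key=lambda d: -d[1])
    matched, tp, fp = set(), [], []
    for box, _conf in detections:
        ious = [iou(box, g) for g in gt_boxes]
        best = int(np.argmax(ious)) if ious else -1
        if best >= 0 and ious[best] >= iou_thr and best not in matched:
            matched.add(best); tp.append(1); fp.append(0)
        else:
            tp.append(0); fp.append(1)
    tp, fp = np.cumsum(tp), np.cumsum(fp)
    recall = np.concatenate([[0.0], tp / max(len(gt_boxes), 1)])
    precision = np.concatenate([[1.0], tp / np.maximum(tp + fp, 1e-9)])
    return float(np.sum((recall[1:] - recall[:-1]) * precision[1:]))

# Toy example: one ground-truth face, one good and one spurious detection.
gt = [np.array([10.0, 10.0, 50.0, 50.0])]
dets = [(np.array([12.0, 12.0, 48.0, 52.0]), 0.9), (np.array([100.0, 100.0, 140.0, 140.0]), 0.4)]
print(average_precision(dets, gt, iou_thr=0.3))  # one TP followed by one FP -> AP = 1.0
```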
6.2. Ablation Studies / Parameter Analysis
The following are the results from Table 2 of the original paper:
| Method | LOL [58] PSNR ↑ | LOL SSIM ↑ | LOL LPIPS ↓ | DICM [28] NIQE ↓ | DICM PI ↓ | Time (s) ↓ |
|---|---|---|---|---|---|---|
| 1) k = 0 (Image Space) | 17.054 | 0.715 | 0.372 | 4.519 | 4.377 | 4.733 |
| 2) k = 1 (Latent Space) | 19.228 | 0.728 | 0.355 | 4.101 | 3.457 | 0.872 |
| 3) k = 2 (Latent Space) | 20.097 | 0.798 | 0.210 | 4.021 | 3.402 | 0.411 |
| 4) k = 3 (Latent Space) Default | 20.453 | 0.803 | 0.192 | 3.724 | 3.144 | 0.314 |
| 5) k = 4 (Latent Space) | 20.104 | 0.785 | 0.195 | 3.906 | 3.332 | 0.256 |
| 6) RetinexNet [58] | 16.616 | 0.563 | 0.579 | 5.859 | 6.056 | 0.296 |
| 7) URetinexNet [59] | 17.916 | 0.703 | 0.391 | 4.371 | 4.561 | 0.293 |
| 8) PairLIE [10] | 17.089 | 0.605 | 0.568 | 6.017 | 6.349 | 0.295 |
| 9) w/o Lscc (S = 20) | 19.184 | 0.785 | 0.213 | 4.045 | 3.408 | 0.314 |
| 10) w/o Lscc (S = 50) | 19.473 | 0.791 | 0.209 | 3.998 | 3.392 | 0.687 |
| 11) w/o Lscc (S = 100) | 20.255 | 0.801 | 0.209 | 3.831 | 3.228 | 1.208 |
| 12) Default (with Lscc, S = 20) | 20.453 | 0.803 | 0.192 | 3.724 | 3.144 | 0.314 |
6.2.1. Latent Space vs. Image Space Decomposition
This ablation study investigates the impact of performing Retinex decomposition at different scales of the latent space (controlled by the downsampling factor k) versus the traditional image space (k = 0).

- Quantitative results (Table 2, rows 1-5):
  - Image space (k = 0): Shows the worst performance across all metrics (e.g., LOL PSNR 17.054, SSIM 0.715, LPIPS 0.372; DICM NIQE 4.519, PI 4.377). This confirms the difficulty of achieving satisfactory decomposition in image space, which often lets content information leak into the illumination map.
  - Latent space (k ≥ 1): As k increases from 0 to 3, performance consistently improves.
    - k = 1: Significantly better than k = 0 (LOL PSNR 19.228, SSIM 0.728).
    - k = 2: Further improvement (LOL PSNR 20.097, SSIM 0.798).
    - k = 3 (default): Best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144), along with a good inference speed of 0.314s.
  - k = 4: Slight degradation compared with k = 3 (LOL PSNR 20.104 vs. 20.453; SSIM 0.785 vs. 0.803). This is attributed to the substantial reduction in feature richness at very deep latent spaces (k = 4 implies heavy downsampling), which can adversely affect the generative ability of the diffusion model.
  - Inference speed: Deeper latent spaces (larger k) generally lead to faster inference (from 4.733s at k = 0 down to 0.256s at k = 4), since smaller feature maps are cheaper to process.

- Visual results (Figure 8):

  Fig. 8: Visual results of the ablation study about our employed latent-Retinex decomposition strategy and the proposed content-transfer decomposition network. The first row shows the restored results with different settings, and the second row presents estimated illumination maps of low/normal-light images.

  - Figure 8a (image space, k = 0): The illumination map clearly retains content information (e.g., object outlines), leading to artifacts in the restored image.
  - Figure 8b-d (latent space, increasing k): As k increases, the illumination maps become progressively smoother and more content-free, allowing the diffusion model to generate cleaner, more visually faithful restored images.

- Conclusion: Performing the decomposition in the latent space is crucial for effectively separating reflectance and illumination. A moderate downsampling factor (k = 3) strikes the best balance between feature richness and efficiency, yielding optimal performance. (A small sketch of how k sets the latent resolution follows.)
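To make the role of k concrete: each stride-2 stage halves the spatial resolution, so an H×W input is encoded to a latent of roughly H/2^k × W/2^k (e.g., 256×256 becomes 32×32 at the default k = 3). The toy encoder below only illustrates this scaling and is not the paper's actual encoder architecture.

```python
import torch
import torch.nn as nn

def toy_encoder(k: int, in_ch: int = 3, width: int = 16) -> nn.Sequential:
    """Stack k stride-2 convolutions; each stage halves the spatial resolution (illustration only)."""
    layers, ch = [], in_ch
    for _ in range(k):
        layers += [nn.Conv2d(ch, width, kernel_size=3, stride=2, padding=1), nn.ReLU()]
        ch = width
    return nn.Sequential(*layers)  # an empty Sequential (k = 0) simply passes the input through

x = torch.randn(1, 3, 256, 256)
for k in range(5):                       # k = 0 corresponds to staying in image space
    z = toy_encoder(k)(x)
    print(k, tuple(z.shape[-2:]))        # (256, 256), (128, 128), (64, 64), (32, 32), (16, 16)
```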
6.2.2. Retinex Decomposition Network (CTDN)
This study validates the effectiveness of the proposed Content-Transfer Decomposition Network (CTDN) by comparing it against alternative decomposition networks from prior Retinex-based methods.
- Quantitative results (Table 2, rows 4 and 6-8):
  - CTDN (default, k = 3, row 4): Best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144).
  - Replacing it with the decomposition network of RetinexNet [58] (row 6): Significant drop in performance (LOL PSNR 16.616, SSIM 0.563, LPIPS 0.579).
  - Replacing it with URetinexNet [59] (row 7): Better than RetinexNet, but still substantially worse than the CTDN (LOL PSNR 17.916, SSIM 0.703, LPIPS 0.391).
  - Replacing it with PairLIE [10] (row 8): Similar to RetinexNet (LOL PSNR 17.089, SSIM 0.605, LPIPS 0.568).
- Visual results (Figure 8e-g):
  - The decomposition networks from RetinexNet, URetinexNet, and PairLIE fail to produce truly content-free illumination maps; their illumination maps still show discernible object outlines and textures.
  - This content leakage translates directly into blurry details and artifacts in the restored images.
- Conclusion: The CTDN's specialized design, which combines an initial estimation, convolutional embeddings, cross-attention, and self-attention for explicit content transfer, is crucial for producing clean reflectance and illumination maps in the latent space. This robust decomposition is a key factor behind LightenDiffusion's superior performance. (A toy sketch of the underlying latent Retinex relation follows.)
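As an intuition for what the CTDN is asked to produce, recall the Retinex relation F ≈ R ⊙ L, applied here to encoded features F rather than pixels. The naive decomposition below (channel-mean illumination, blurred, followed by element-wise division) is only a conceptual stand-in: the actual CTDN replaces it with learned estimation plus cross-/self-attention-based content transfer.

```python
import torch
import torch.nn.functional as F

def naive_latent_retinex(feat: torch.Tensor, eps: float = 1e-4):
    """feat: (B, C, H, W) non-negative latent feature.
    Returns (reflectance, illumination) such that reflectance * illumination ≈ feat.
    The illumination is a smooth single-channel map; a crude stand-in for the CTDN output."""
    illum = feat.mean(dim=1, keepdim=True)                           # channel-mean brightness estimate
    illum = F.avg_pool2d(illum, kernel_size=7, stride=1, padding=3)  # blur to suppress content detail
    reflectance = feat / (illum + eps)                               # content-rich component
    return reflectance, illum

feat_low = torch.rand(1, 8, 32, 32)    # stands in for an encoded low-light feature
feat_norm = torch.rand(1, 8, 32, 32)   # stands in for an encoded normal-light feature
R_low, L_low = naive_latent_retinex(feat_low)
R_norm, L_norm = naive_latent_retinex(feat_norm)
# The LRDM is then fed R_low together with L_norm (low-light content, normal-light illumination).
print(R_low.shape, L_norm.shape)
```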
6.2.3. Loss Function ($\mathcal{L}_{scc}$)
This study evaluates the contribution of the self-constrained consistency loss ($\mathcal{L}_{scc}$) to the overall performance.

- Quantitative results (Table 2, rows 9-12):
  - Default (with $\mathcal{L}_{scc}$, S = 20, row 12): Best performance (LOL PSNR 20.453, SSIM 0.803, LPIPS 0.192; DICM NIQE 3.724, PI 3.144) with an inference time of 0.314s.
  - Without $\mathcal{L}_{scc}$ (S = 20, row 9): Performance drops significantly (LOL PSNR 19.184, SSIM 0.785, LPIPS 0.213), confirming that $\mathcal{L}_{scc}$ is essential for improving overall visual quality.
  - Without $\mathcal{L}_{scc}$ but with more sampling steps (rows 10-11):
    - Increasing S to 50 improves performance (PSNR 19.473, SSIM 0.791) but increases inference time to 0.687s.
    - Increasing S to 100 brings performance closer to the default model (PSNR 20.255, SSIM 0.801, LPIPS 0.209) but drastically increases inference time to 1.208s, almost four times slower than the default.

- Visual results (Figure 9):

  Fig. 9: Visual results of the ablation study on the proposed self-constrained consistency loss: (a) w/o $\mathcal{L}_{scc}$ (S = 20), (b) w/o $\mathcal{L}_{scc}$ (S = 50), (c) w/o $\mathcal{L}_{scc}$ (S = 100), and (d) with $\mathcal{L}_{scc}$ (default).

  - Figure 9a (w/o $\mathcal{L}_{scc}$, S = 20): Slightly less vibrant colors and some subtle artifacts compared with the default.
  - Figure 9b-c (w/o $\mathcal{L}_{scc}$, S = 50 and 100): As S increases, visual quality improves and becomes more comparable to the default.
  - Figure 9d (default, with $\mathcal{L}_{scc}$, S = 20): High-quality, artifact-free results.

- Conclusion: The self-constrained consistency loss is highly effective. It allows the diffusion model to achieve efficient and robust restoration with only S = 20 sampling steps, significantly reducing inference time compared with models trained without $\mathcal{L}_{scc}$, which need many more steps to reach comparable quality. This highlights the role of $\mathcal{L}_{scc}$ in guiding the diffusion model toward accurate content reconstruction while improving overall efficiency. (A hedged sketch of one possible consistency-loss formulation follows.)
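The exact formulation of the pseudo label and of $\mathcal{L}_{scc}$ is not reproduced in this analysis; the sketch below is one plausible reading under explicit assumptions: the pseudo label is built by applying a gamma-style correction (using the reported 0.2 factor as the exponent) to the low-light illumination map and recombining it with the low-light reflectance, and the consistency term is an L1 distance to that pseudo label.

```python
import torch

def pseudo_label(reflectance: torch.Tensor, illum_low: torch.Tensor, gamma: float = 0.2) -> torch.Tensor:
    """One plausible construction (assumption): brighten the low-light illumination with a
    gamma-style correction and recombine it with the reflectance to form a self-generated target."""
    corrected = illum_low.clamp(min=1e-4) ** gamma   # values in (0, 1] move toward 1, i.e. brighter
    return reflectance * corrected

def scc_loss(restored: torch.Tensor, reflectance_low: torch.Tensor, illum_low: torch.Tensor) -> torch.Tensor:
    """Self-constrained consistency as an L1 distance to the pseudo label (illustrative sketch only)."""
    target = pseudo_label(reflectance_low, illum_low).detach()   # no gradient through the pseudo label
    return (restored - target).abs().mean()

# Toy usage with random stand-ins for latent-space quantities.
R_low = torch.rand(1, 8, 32, 32)
L_low = torch.rand(1, 1, 32, 32) * 0.3          # dim illumination map
restored = torch.rand(1, 8, 32, 32, requires_grad=True)
loss = scc_loss(restored, R_low, L_low)
loss.backward()
print(float(loss))
```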
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully presents LightenDiffusion, an innovative unsupervised low-light image enhancement (LLIE) framework that cleverly integrates the physically intuitive Retinex theory with the powerful generative capabilities of diffusion models. The core innovations include:

- A Content-Transfer Decomposition Network (CTDN) that performs Retinex decomposition within the latent space, effectively disentangling content-rich reflectance maps and content-free illumination maps from unpaired low-light and normal-light images.
- A Latent-Retinex Diffusion Model (LRDM) that takes these decomposed components (low-light reflectance and normal-light illumination) as input to a diffusion model, guided by low-light features, to perform robust enhancement.
- A novel self-constrained consistency loss that further refines the restoration by keeping the enhanced output faithful to the intrinsic content of the original low-light input, effectively mitigating artifacts and improving visual fidelity.

Extensive experiments on various paired and unpaired real-world benchmarks convincingly demonstrate that LightenDiffusion not only outperforms all state-of-the-art unsupervised competitors but also achieves performance comparable to, and in some cases even superior to, supervised methods, especially in terms of generalization to diverse and unseen scenes. Its practical value is further highlighted by significant improvements in low-light face detection when used as a pre-processing step.
7.2. Limitations & Future Work
The paper does not explicitly detail a "Limitations" section, but some can be inferred:

- Computational Cost: While LightenDiffusion is efficient at inference, needing only 20 sampling steps thanks to the self-constrained consistency loss, diffusion models generally have higher computational costs during training and potentially longer inference times than single-pass feed-forward networks (even with optimized sampling). The training process involves two stages and a large unpaired dataset, which can be resource-intensive.
- Hyperparameter Sensitivity: The framework relies on several hyperparameters (e.g., the downsampling scale k, the illumination correction factor, and the loss weights). Optimal performance may be sensitive to these values, and finding them empirically can be time-consuming.
- Unsupervised Nature: While a strength for generalization, the lack of explicit paired ground truth in the diffusion training stage might still leave subtle artifacts or color shifts that supervised signals would catch, although the self-constrained consistency loss helps to mitigate this. For instance, on LSRW, while PSNR and SSIM are best, LPIPS (perceptual quality) is slightly inferior to some supervised methods such as GSAD and URetinexNet, suggesting there is room for perceptual improvement.

Based on these observations, potential future research directions could include:

- Real-time Applications: Further optimizing the diffusion model's sampling process or integrating faster denoising diffusion implicit models (DDIMs) to achieve near real-time performance for video LLIE.
- Adaptive Hyperparameter Tuning: Developing methods for dynamically adjusting hyperparameters based on input image characteristics or scene context, rather than relying on fixed empirical values.
- Integration with Other Perceptual Losses: Exploring additional perceptual losses or adversarial training specifically tailored for diffusion models to further bridge the gap with supervised methods on perceptual metrics, perhaps in a semi-supervised setting.
- Extension to Other Degradations: Adapting the latent-Retinex decomposition and diffusion framework to handle other complex image degradations beyond low light, such as haze, rain, or blur, in an unsupervised manner.
- Quantifying "Content-Free" Illumination: Developing more rigorous quantitative metrics to assess how truly content-free the illumination maps are, which could further refine the CTDN.
7.3. Personal Insights & Critique
LightenDiffusion presents several compelling insights and innovations:

- Innovation of Latent-Space Decomposition: The shift from image-space Retinex decomposition to latent-space decomposition is a powerful idea. In latent space, features are often more semantically meaningful and disentangled, making it easier for the network to separate intrinsic content from lighting conditions. This is a significant improvement over traditional Retinex implementations and deep learning models that often struggle with content leakage. The visual evidence in Figures 3 and 8 strongly supports this.
- Synergy of Retinex and Diffusion: The combination of Retinex theory (physical interpretability) and diffusion models (powerful generative ability) is highly effective. Retinex provides a structured way to think about LLIE by separating components, while diffusion models offer the robustness to reconstruct high-quality images even from imperfect inputs and to compensate for information loss. This hybrid approach leverages the best of both worlds.
- Effectiveness of the Self-Constrained Consistency Loss: The $\mathcal{L}_{scc}$ loss is a brilliant practical addition. It addresses a common pitfall of Retinex-based methods, where imperfect illumination maps can introduce unwanted content or artifacts. By providing a self-generated pseudo label to guide consistency, the model learns to maintain fidelity to the original image's content without needing actual ground truth, which is fundamental for unsupervised learning. The ablation study clearly shows its value in improving quality and speeding up inference.
- Strong Generalization for Real-World Problems: The paper effectively addresses the generalization challenge, which is a major hurdle for many LLIE methods. By training on unpaired data and leveraging the proposed architectural and loss designs, LightenDiffusion proves its robustness on diverse real-world datasets, making it highly applicable to practical scenarios where paired data is simply not available. This is crucial for deployment in areas such as surveillance, autonomous driving, and mobile photography.

Critique:

- Unclear "Content-Free" Definition: While the paper aims for "content-free" illumination maps, the precise mathematical or perceptual definition of "content" in latent space is implicitly learned. It would be valuable to explore more explicit constraints or metrics to quantify this disentanglement.
- Complexity of Multi-Stage Training: While effective, the two-stage training process adds complexity. Future work might explore end-to-end training strategies for LightenDiffusion to simplify the pipeline, though this could introduce new training stability challenges.
- Choice of the Illumination Correction Factor: The illumination correction factor for the pseudo label is empirically set to 0.2. While effective, the paper could elaborate on the sensitivity to this parameter or discuss methods for adaptively determining it.

Overall, LightenDiffusion represents a significant step forward in unsupervised LLIE, demonstrating how principled model design combined with advanced generative techniques can yield highly performant and generalizable solutions for challenging real-world problems. Its methodology could potentially inspire similar hybrid approaches in other image restoration tasks.