
Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement

Published: 01/19/2020

This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study introduces Zero-DCE, a method using a lightweight network for image-specific curve estimation to enhance low-light images without requiring paired data. It employs non-reference loss functions to effectively improve image quality, demonstrating good generalization across diverse lighting conditions.

Abstract

The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed. Code and model will be available at https://github.com/Li-Chongyi/Zero-DCE.

In-depth Reading

1. Bibliographic Information

1.1. Title

The title of the paper is Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement. This title directly indicates the paper's central topic: enhancing images taken in low-light conditions using a deep learning approach that relies on curve estimation and does not require reference images during training.

1.2. Authors

The paper lists multiple authors:

  • Chunle Guo
  • Chongyi Li
  • Jichang Guo
  • Chen Change Loy
  • Junhui Hou
  • Sam Kwong
  • Runmin Cong

    The authors are affiliated with institutions such as Tianjin University, City University of Hong Kong, Nanyang Technological University, and Beijing Jiaotong University, indicating a collaborative research effort from multiple academic institutions, primarily based in China and Singapore. Their research backgrounds generally involve computer vision, image processing, and deep learning.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, on 19 January 2020. While arXiv itself is not a peer-reviewed journal or conference, it is a widely used platform for quickly disseminating research findings in physics, mathematics, computer science, and related fields. Papers often appear on arXiv before or concurrently with submission to prestigious conferences (e.g., CVPR, ICCV, ECCV) or journals. The subsequent citations in major venues suggest its influence in the image enhancement community.

1.4. Publication Year

The paper was published in 2020.

1.5. Abstract

The paper introduces Zero-Reference Deep Curve Estimation (Zero-DCE), a novel method for low-light image enhancement. It re-frames enhancement as an image-specific curve estimation problem solved by a lightweight deep neural network called DCE-Net. DCE-Net learns to estimate pixel-wise, high-order curves that dynamically adjust the range of pixel values in a given image. The design of these curves carefully considers properties like pixel value range, monotonicity, and differentiability. A significant innovation of Zero-DCE is its ability to train without any paired or unpaired reference images. This "zero-reference" training is achieved through a suite of specifically designed non-reference loss functions that implicitly gauge enhancement quality and guide network learning. The method is efficient due to its simple nonlinear curve mapping approach and demonstrates strong generalization across diverse lighting conditions. Extensive experiments show qualitative and quantitative advantages over state-of-the-art methods, and the paper also highlights its practical benefits for high-level vision tasks like face detection in low light.

The original source link is: https://arxiv.org/abs/2001.06826. This is an arXiv preprint.

The PDF link is: https://arxiv.org/pdf/2001.06826v2.pdf

2. Executive Summary

2.1. Background & Motivation

The core problem addressed by this paper is low-light image enhancement. Many images are captured under suboptimal lighting conditions due to environmental factors (e.g., night, indoor with poor lighting) or technical constraints (e.g., incorrect exposure settings). These low-light images suffer from two major issues:

  1. Compromised Aesthetic Quality: They appear dark, dull, and lack visual appeal, negatively impacting the viewer's experience.

  2. Unsatisfactory Information Transmission: Crucial details can be obscured, leading to inaccurate interpretation, especially for automated computer vision tasks like object recognition or face detection. For instance, a face detector might fail on a dark image.

    Existing research in image enhancement faces several challenges:

  • Generalization: Many deep learning-based methods rely on paired data (low-light image and its corresponding well-lit version) or unpaired data (collections of low-light and well-lit images). Collecting such data is costly, time-consuming, and often involves synthetic degradation or expert retouching, which can introduce factitious (artificial) and unrealistic data. This reliance often leads to models that overfit to their training data and generalize poorly to real-world, diverse low-light conditions, producing artifacts or color casts.

  • Computational Burden: Some methods are computationally intensive, limiting their use in real-time applications or on devices with limited resources (e.g., mobile phones).

  • Lack of Robustness: Traditional methods often struggle with nonuniform illumination, where different parts of an image have vastly different lighting levels.

    The paper's entry point and innovative idea is to reformulate low-light image enhancement as an image-specific curve estimation problem. Instead of directly mapping a low-light image to an enhanced one, it proposes to learn a set of pixel-wise (meaning each pixel can have a unique adjustment) and high-order curves (meaning the adjustment can be complex and iterative) that can dynamically adjust the dynamic range of the input image. Crucially, this is achieved with zero-reference training, meaning no target enhanced images (either paired or unpaired) are needed during the learning process.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  1. Zero-Reference Learning Framework: It proposes the first low-light enhancement network that operates independent of paired and unpaired training data. This completely avoids the high cost and potential pitfalls (e.g., overfitting, factitious data) associated with collecting reference images, leading to better generalization across various lighting conditions.

  2. Image-Specific Light-Enhancement Curve (LE-curve): The authors design a novel image-specific curve that can approximate pixel-wise and higher-order adjustments through iterative application. This curve is carefully formulated to maintain the pixel value range (0-1), preserve monotonicity (to prevent contrast reversal), and be differentiable (to allow gradient-based optimization). This design allows for effective dynamic range mapping.

  3. Task-Specific Non-Reference Loss Functions: A set of differentiable non-reference loss functions are introduced, including spatial consistency loss, exposure control loss, color constancy loss, and illumination smoothness loss. These losses implicitly evaluate the quality of the enhanced images without explicit ground-truth references, guiding the network's learning effectively.

    The key conclusions and findings are:

  • Zero-DCE achieves state-of-the-art performance both qualitatively (visually pleasing results with natural brightness, color, and contrast) and quantitatively (higher PSNR, SSIM, lower MAE, better User Study scores, lower Perceptual Index) compared to existing methods, even those requiring reference data.
  • The method is highly efficient, capable of processing images in real-time (around 500 frames per second on GPU for 640x480 images) and requiring minimal training time (30 minutes).
  • Zero-DCE significantly improves the performance of high-level visual tasks like face detection in low-light conditions, demonstrating its practical utility beyond mere aesthetic enhancement.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the Zero-DCE paper, a novice reader should be familiar with several fundamental concepts in image processing and deep learning:

  • Pixel Values and RGB Channels:

    • An image is composed of a grid of tiny picture elements called pixels.
    • In color images, each pixel's color is typically represented by three primary color channels: Red (R), Green (G), and Blue (B).
    • Each channel usually has an intensity value ranging from 0 to 255 (for 8-bit images), where 0 means no intensity (black) and 255 means full intensity (brightest for that color). Low-light images typically have most pixel values clustered at the lower end of this range.
    • For deep learning models, these pixel values are often normalized to a range like [0, 1] by dividing by 255, which helps with model stability and convergence.
  • Dynamic Range and Contrast:

    • Dynamic range refers to the ratio between the maximum and minimum light intensity an imaging system can capture or display. In an image, it's the spread of pixel intensity values.
    • Contrast is the difference in brightness or color that makes an object distinguishable. A low-light image often has a narrow dynamic range and low contrast, making it look flat and dull.
    • Image enhancement aims to expand this dynamic range and improve contrast to reveal more details and make the image visually appealing.
  • Monotonicity:

    • In the context of a curve or a mapping function, monotonicity means that the function either always increases or always decreases. If a curve is monotonically increasing, as the input value increases, the output value also increases (or stays the same).
    • For image enhancement, monotonicity is crucial because it ensures that the relative order of pixel intensities is preserved. For example, if pixel A is brighter than pixel B in the original image, it should remain brighter (or equally bright) after enhancement. Violating monotonicity would reverse contrast, creating unnatural artifacts.
  • Differentiability:

    • A function is differentiable if its derivative exists at every point in its domain. In simpler terms, it means the function's slope can be calculated at any point.
    • In deep learning, differentiability is essential because neural networks are trained using gradient-based optimization algorithms (like stochastic gradient descent or ADAM). These algorithms calculate gradients (slopes) of the loss function with respect to the network's trainable parameters (weights and biases) to update them and minimize the loss. If the operations within the network (like the LE-curve in this paper) are not differentiable, backpropagation (the process of computing gradients) cannot occur, and the network cannot be trained.
  • Convolutional Neural Networks (CNNs):

    • CNNs are a class of deep neural networks specifically designed for processing structured grid-like data, such as images.
    • They consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
    • A convolutional layer applies a set of learnable filters (also called kernels) across the input image. Each filter detects specific features (e.g., edges, textures).
    • Activation functions (like ReLU or Tanh) introduce nonlinearity into the network, allowing it to learn complex patterns.
    • ReLU (Rectified Linear Unit): An activation function defined as $f(x) = \max(0, x)$. It outputs the input directly if it's positive, otherwise it outputs zero. It's computationally efficient.
    • Tanh (Hyperbolic Tangent): An activation function defined as $f(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$. It squashes output values to the range $[-1, 1]$, making it suitable for producing normalized parameters or outputs.
    • Batch Normalization is a technique used to standardize the inputs to layers in a deep neural network, stabilizing and speeding up training. The authors explicitly discard it here to preserve neighboring pixel relations.
  • Loss Functions and Optimizers:

    • A loss function quantifies the difference between the network's output and the desired target. The goal of training is to minimize this loss.
    • Non-reference loss functions are special types of loss functions that do not require a ground-truth (reference) image for comparison. Instead, they implicitly evaluate the quality of the output based on certain desirable properties (e.g., smoothness, exposure level, color consistency).
    • An optimizer is an algorithm (like ADAM) used to adjust the network's trainable parameters (weights and biases) in the direction that minimizes the loss function based on the computed gradients.

3.2. Previous Works

The paper categorizes previous low-light image enhancement methods into Conventional Methods and Data-Driven Methods.

3.2.1. Conventional Methods

These methods often rely on handcrafted algorithms, mathematical models, or statistical properties of images.

  • Histogram Equalization (HE)-based methods:

    • Concept: These methods enhance contrast by adjusting the histogram (distribution of pixel intensities) of an image, spreading out the most frequent intensity values.
    • Global HE: Adjusts the histogram for the entire image (e.g., [7, 10]). Can sometimes lead to over-enhancement in already bright regions or amplification of noise.
    • Local HE: Adjusts histograms for smaller, local regions of the image (e.g., [15, 27]), providing better local contrast but can be computationally intensive and sometimes introduce blocky artifacts.
    • Differentiation: Zero-DCE avoids directly manipulating histograms, opting for curve mapping which offers more fine-grained, image-specific control without the risk of abrupt changes in pixel distribution.
  • Retinex Theory-based methods:

    • Concept: The Retinex theory [13] postulates that an image can be decomposed into two components: reflectance (the intrinsic property of an object's surface, independent of lighting) and illumination (the lighting conditions). The goal of enhancement is to estimate and adjust the illumination component while preserving the reflectance.
    • Examples:
      • SRIE [8] (Xueyang Fu et al., 2016) proposed a weighted variational model to simultaneously estimate reflectance and illumination.
      • LIME [9] (Xiaojie Guo et al., 2017) estimated a coarse illumination map by searching the maximum intensity in RGB channels, then refined it using a structure prior.
      • Li et al. [19] (Mading Li et al., 2018) introduced a Retinex model that considers noise, estimating the illumination map by solving an optimization problem.
    • Differentiation: Conventional Retinex methods rely on potentially inaccurate physical models and often involve complex optimization. Zero-DCE uses a purely data-driven, learned curve mapping approach, which is more flexible and less dependent on explicit physical assumptions.
  • Automatic Exposure Correction (e.g., Yuan and Sun [36], 2012): This method estimated a global S-shaped curve for an image using a global optimization algorithm.

    • Differentiation: Zero-DCE differs by being a purely data-driven method that learns pixel-wise curves and incorporates multiple light enhancement factors through its non-reference loss functions, leading to broader dynamic range adjustment and lower computational burden.

3.2.2. Data-Driven Methods

These methods leverage large datasets and deep neural networks for learning enhancement mappings.

  • CNN-based (Supervised) Methods:

    • Concept: These methods train Convolutional Neural Networks (CNNs) on paired data, where each low-light image has a corresponding well-exposed ground-truth (reference) image. The network learns a mapping from low-light to well-lit.
    • Data Collection Challenge: Collecting such paired data is resource-intensive.
      • LLNet [20] (Lore et al., 2017) was trained on data simulated using random Gamma correction (a simple non-linear adjustment).
      • The LOL dataset [32] (Chen Wei et al., BMVC 2018, introduced together with RetinexNet) collected paired low/normal-light images by altering camera exposure time and ISO.
      • The MIT-Adobe FiveK dataset [3] (Bychkovsky et al., 2011) comprises 5,000 raw images, each with five expert-retouched versions.
    • Examples:
      • RetinexNet [32] (Wei et al., 2018) employed a deep Retinex decomposition for low-light enhancement, trained on paired data.
      • Wang et al. [28] (Ruixing Wang et al., 2019) proposed an underexposed photo enhancement network by estimating an illumination map, also trained on paired expert-retouched data.
    • Limitations: High cost of data collection, potential for factitious and unrealistic data in training, and poor generalization capability to diverse real-world lighting conditions, often producing artifacts and color casts.
    • Differentiation: Zero-DCE completely eliminates the need for paired data, addressing the generalization issue and reducing data collection costs.
  • GAN-based (Unsupervised) Methods:

    • Concept: Generative Adversarial Networks (GANs) learn to generate realistic images without requiring perfectly paired input-output examples; CycleGAN [38] (Zhu et al., 2017) is a commonly cited example for unpaired image-to-image translation. They use a generator network (which creates enhanced images) and a discriminator network (which tries to distinguish between real well-lit images and generated enhanced images).
    • Examples:
      • EnlightenGAN [12] (Jiang et al., 2019) is a pioneering unsupervised GAN-based method for low-light image enhancement, learning from unpaired low/normal light data. It uses carefully designed discriminators and loss functions.
    • Limitations: While eliminating paired data, GANs still require careful selection of unpaired training data and can be challenging to train, sometimes leading to unstable results or mode collapse.
    • Differentiation: Zero-DCE goes a step further than EnlightenGAN by requiring zero reference data at all – neither paired nor unpaired. This simplifies the training process and reduces reliance on specific data distributions.

3.3. Technological Evolution

The field of low-light image enhancement has evolved from traditional signal processing techniques to sophisticated deep learning models. Initially, methods focused on global histogram adjustments or physical image models like Retinex theory. While effective to some extent, these often struggled with local variations, noise, or computational complexity. The advent of deep learning brought about a paradigm shift, allowing models to learn complex, non-linear mappings directly from data.

The first wave of deep learning methods (CNN-based) largely relied on supervised learning with paired datasets. This approach demonstrated impressive results in controlled settings but faced significant hurdles in data acquisition and generalization to diverse real-world scenarios. The subsequent development of unsupervised learning with GANs (GAN-based methods) addressed the paired data dependency by enabling training with unpaired datasets. However, unpaired data still needs to be carefully curated, and GANs can be notoriously difficult to train.

This paper's work, Zero-DCE, represents a further evolution by moving towards zero-reference learning. It acknowledges the limitations of both paired and unpaired data paradigms and proposes a novel way to train a deep model using only the properties of desirable enhanced images encoded into non-reference loss functions, effectively making the model self-supervised in a unique way. This positions Zero-DCE at the forefront of data-independent deep learning for image enhancement.

3.4. Differentiation Analysis

Compared to the main methods discussed in related work, Zero-DCE offers core differences and innovations:

  • Zero-Reference Training: This is the most significant differentiator. Unlike CNN-based methods that require paired low/normal light images or GAN-based methods that need unpaired low/normal light datasets, Zero-DCE is trained without any reference images at all. This completely liberates the model from expensive and potentially problematic data collection, making it highly practical and robust against overfitting to specific datasets.
  • Image-Specific Curve Estimation vs. Direct Image-to-Image Mapping: Instead of learning a direct end-to-end mapping from a low-light image to an enhanced one (as many CNNs and GANs do), Zero-DCE learns to estimate pixel-wise, higher-order curves. These curves then perform the actual enhancement. This indirect approach provides a more interpretable and controllable mechanism for dynamic range adjustment, similar to traditional curve adjustments in photo editing, but learned adaptively.
  • Novel Curve Design: The LE-curve itself is specifically designed to meet critical criteria: maintaining pixel value range ([0,1]), preserving monotonicity (for contrast), and ensuring differentiability (for training). The iterative and pixel-wise application of this curve provides a powerful and flexible adjustment capability.
  • Differentiable Non-Reference Loss Functions: The paper's ability to train without reference images stems from its meticulously crafted non-reference loss functions. These losses (spatial consistency, exposure control, color constancy, illumination smoothness) implicitly define what a "good" enhanced image looks like, guiding the network without explicit examples. This is a key innovation over methods that rely on pixel-wise L1/L2 losses against ground truth or adversarial losses against discriminators.
  • Efficiency and Generalization: The lightweight DCE-Net architecture combined with the simple curve mapping leads to high computational efficiency, enabling real-time processing. The zero-reference training, by avoiding dataset-specific biases, inherently promotes better generalization to diverse, unseen low-light conditions.

4. Methodology

The Zero-Reference Deep Curve Estimation (Zero-DCE) method formulates low-light image enhancement as an image-specific curve estimation problem. It uses a lightweight deep neural network, DCE-Net, to predict parameters for these curves, which are then iteratively applied to the input image to achieve enhancement. The entire process is trained without any reference images, relying solely on specially designed non-reference loss functions.

4.1. Principles

The core idea of Zero-DCE is to learn a set of adjustment curves that can map the pixel values of a low-light image to their enhanced counterparts. This approach is inspired by curve adjustment tools found in photo editing software. The theoretical basis or intuition is that low-light images primarily suffer from a narrow dynamic range and low brightness. By applying appropriate non-linear curves, the intensity values can be stretched and shifted, increasing brightness and contrast.

The key principles guiding the design are:

  1. Image-Specific Adjustment: The curves should be self-adaptive, meaning their parameters are determined solely by the input image, allowing for flexible adjustment to different lighting conditions.
  2. Constraint-Aware Curve Design: The curves must:
    • Keep enhanced pixel values within a valid range (e.g., [0, 1]) to prevent information loss from overflow truncation.
    • Be monotonous to preserve the relative order of pixel intensities and thus maintain the contrast between neighboring pixels.
    • Be differentiable to enable gradient-based optimization during neural network training.
  3. Zero-Reference Learning: The model should be trainable without any paired or unpaired ground-truth reference images. This is achieved through a set of carefully crafted non-reference loss functions that indirectly quantify enhancement quality based on desirable image properties.

4.2. Core Methodology In-depth (Layer by Layer)

The framework of Zero-DCE consists of three main components: the Light-Enhancement Curve (LE-curve), the Deep Curve Estimation Network (DCE-Net), and the Non-Reference Loss Functions.

The overall framework is illustrated in Figure 2.

The framework figure shows an input image $I$ passing through the Deep Curve Estimation Network (DCE-Net) to produce curve parameter maps, which are applied iteratively as $LE_n = LE(LE_{n-1}; \mathcal{A}_n^{R,G,B})$ to yield the enhanced image; example LE-curves for different values of $\alpha$ are also shown.

Figure 2: The framework of Zero-Reference Deep Curve Estimation (Zero-DCE).

4.2.1. Light-Enhancement Curve (LE-curve)

The LE-curve is the fundamental building block for pixel value adjustment. It's designed to be simple, effective, and compliant with the three objectives mentioned above (range, monotonicity, differentiability).

4.2.1.1. Base LE-curve

The base LE-curve is defined as a quadratic curve: $LE(I(\mathbf{x}); \alpha) = I(\mathbf{x}) + \alpha I(\mathbf{x})(1 - I(\mathbf{x}))$ (1)

Here, we explain the symbols in the formula:

  • $\mathbf{x}$: Denotes the pixel coordinates in the image.

  • $I(\mathbf{x})$: Represents the intensity value of the input pixel at coordinates $\mathbf{x}$. It is assumed to be normalized to the range [0, 1].

  • $LE(I(\mathbf{x}); \alpha)$: Represents the enhanced intensity value of the pixel at coordinates $\mathbf{x}$ after applying the LE-curve.

  • $\alpha$: Is a trainable curve parameter that lies within the range $[-1, 1]$. This parameter controls the magnitude and direction of the curve's adjustment, effectively governing the exposure level of the image. A positive $\alpha$ brightens the image, while a negative $\alpha$ darkens it.

    The operations in Equation (1) are applied pixel-wise. The paper states that the LE-curve is applied separately to each of the three RGB channels instead of just an illumination channel (like in some Retinex models). This three-channel adjustment is intended to better preserve the inherent colors of the image and minimize the risk of over-saturation.

The quadratic form $I(1-I)$ ensures that the output $LE(I(\mathbf{x}); \alpha)$ remains within [0, 1] whenever $I \in [0, 1]$ and $\alpha \in [-1, 1]$: with $\alpha = 1$, $LE(I) = 2I - I^2 = 1 - (1-I)^2$, which stays in [0, 1]; with $\alpha = -1$, $LE(I) = I^2$, which also stays in [0, 1]; intermediate values of $\alpha$ yield outputs between these two extremes. For example, a pixel with $I = 0.5$ is mapped to $0.5 + 0.25\alpha$, i.e., at most 0.75 when $\alpha = 1$. Being a simple polynomial, the curve is also differentiable, which makes gradient-based training possible. The effect of different $\alpha$ values is illustrated in Figure 2(b).
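To make the mapping concrete, below is a minimal sketch (in PyTorch, not the authors' released code) of the base LE-curve applied to a normalized image tensor; the function name `le_curve` and the hand-picked scalar `alpha` are illustrative, whereas in Zero-DCE the parameter is predicted by DCE-Net.

```python
import torch

def le_curve(image: torch.Tensor, alpha: float) -> torch.Tensor:
    """Base LE-curve of Eq. (1): LE(I; alpha) = I + alpha * I * (1 - I).

    `image` is assumed to be normalized to [0, 1]; `alpha` in [-1, 1]
    brightens (positive) or darkens (negative) the image.
    """
    return image + alpha * image * (1.0 - image)

# Example: brighten a synthetic "low-light" image with a global alpha of 0.8.
low_light = torch.rand(3, 256, 256) * 0.3            # values clustered near 0
enhanced = le_curve(low_light, alpha=0.8)
print(enhanced.min().item(), enhanced.max().item())   # stays within [0, 1]
```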

4.2.1.2. Higher-Order Curve

To enable more flexible and robust adjustments for challenging low-light conditions, the LE-curve can be applied iteratively. This creates a higher-order curve with greater adjustment capability (i.e., higher curvature).

The iterative application is defined as: $LE_n(\mathbf{x}) = LE_{n-1}(\mathbf{x}) + \alpha_n LE_{n-1}(\mathbf{x})(1 - LE_{n-1}(\mathbf{x}))$ (2)

Here, we explain the symbols in the formula:

  • $n$: Represents the number of iterations or applications of the LE-curve. It controls the overall curvature of the combined mapping.

  • $LE_n(\mathbf{x})$: Is the enhanced pixel value after $n$ iterations.

  • $LE_{n-1}(\mathbf{x})$: Is the enhanced pixel value from the previous iteration ($n-1$). For the first iteration, $LE_0(\mathbf{x})$ is simply the input image $I(\mathbf{x})$.

  • $\alpha_n$: Is the curve parameter for the $n$-th iteration.

    In this paper, the number of iterations $N$ is set to 8, which is found to be satisfactory for most cases. When $n=1$, Equation (2) simplifies back to the base LE-curve in Equation (1). Figure 2(c) visually demonstrates how higher-order curves (through iterative application) provide a more powerful adjustment capability than a single application.

4.2.1.3. Pixel-Wise Curve

While higher-order curves offer broader dynamic range adjustment, if a single $\alpha$ value is used for the entire image (global adjustment), it can lead to over-enhancement in already brighter regions or under-enhancement in extremely dark areas. To address this, the parameter $\alpha$ is made pixel-wise, meaning each pixel can have its own best-fitting adjustment parameter.

The pixel-wise higher-order curve is reformulated as: $LE_n(\mathbf{x}) = LE_{n-1}(\mathbf{x}) + \mathcal{A}_n(\mathbf{x}) LE_{n-1}(\mathbf{x})(1 - LE_{n-1}(\mathbf{x}))$ (3)

Here, we explain the symbols in the formula:

  • $\mathcal{A}_n(\mathbf{x})$: Is the parameter map for the $n$-th iteration. This map has the same spatial dimensions as the input image, providing a unique $\alpha$ value for each pixel $\mathbf{x}$.

    The assumption here is that pixels within a local region often share similar intensity characteristics and thus should undergo similar adjustments. This pixel-wise approach allows for adaptive local enhancement while still preserving the monotonic relations between neighboring pixels (as the curve itself is monotonic). This ensures the three objectives (range, monotonicity, differentiability) are still met. Figure 3 illustrates how pixel-wise curve parameter maps accurately reflect the brightness variations across an image and how they contribute to a well-enhanced result.

Figure 3: An example of the estimated curve parameter maps of the three channels for 8 iterations ($n=8$), with the values normalized to the range [0, 1]. Panels: (a) input image; (b)-(d) $\mathcal{A}_n^R$, $\mathcal{A}_n^G$, and $\mathcal{A}_n^B$, the averaged best-fitting curve parameter maps for the R, G, and B channels, shown as heatmaps; (e) the enhanced result.
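As an illustration of Equation (3), the sketch below applies the pixel-wise LE-curve iteratively. The parameter maps here are random stand-ins for the DCE-Net output (in the real method they are predicted from the input image), and the helper name `iterative_le_curve` is illustrative.

```python
import torch

def iterative_le_curve(image: torch.Tensor, curve_maps: torch.Tensor) -> torch.Tensor:
    """Pixel-wise, higher-order LE-curve of Eq. (3).

    image:      (C, H, W) tensor with values in [0, 1]
    curve_maps: (N, C, H, W) tensor with values in [-1, 1],
                one parameter map per iteration and channel
    """
    enhanced = image
    for n in range(curve_maps.shape[0]):
        A_n = curve_maps[n]
        enhanced = enhanced + A_n * enhanced * (1.0 - enhanced)
    return enhanced

image = torch.rand(3, 256, 256) * 0.3                 # synthetic low-light input
curve_maps = torch.rand(8, 3, 256, 256) * 2.0 - 1.0   # stand-in for DCE-Net output
result = iterative_le_curve(image, curve_maps)         # output stays within [0, 1]
```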

4.2.2. DCE-Net

The Deep Curve Estimation Network (DCE-Net) is designed to learn the mapping from an input low-light image to its corresponding pixel-wise curve parameter maps $\mathcal{A}_n(\mathbf{x})$.

  • Input: A low-light image.
  • Output: A set of pixel-wise curve parameter maps. For 8 iterations ($n=8$) and 3 RGB channels, the network outputs $8 \times 3 = 24$ parameter maps. Each map corresponds to $\mathcal{A}_n^c(\mathbf{x})$, where $n$ is the iteration index and $c$ is the channel (R, G, or B).
  • Architecture: DCE-Net is a plain CNN (Convolutional Neural Network) comprising seven convolutional layers with symmetrical concatenation (a minimal sketch follows this list).
    • Each convolutional layer uses 32 convolutional kernels (filters) of size $3 \times 3$ and stride 1.
    • Following each convolutional layer (except the last), a ReLU activation function is applied.
    • Crucially, down-sampling layers (like pooling) and batch normalization layers are explicitly discarded. This design choice is made to preserve the intrinsic relations of neighboring pixels and avoid losing spatial information, which is important for pixel-wise adjustments.
    • The last convolutional layer is followed by a Tanh activation function. The Tanh function scales the output of the final layer to the range $[-1, 1]$, which aligns perfectly with the defined range for the curve parameters $\alpha$ (or $\mathcal{A}_n(\mathbf{x})$).
  • Efficiency: The DCE-Net is designed to be lightweight, containing only 79,416 trainable parameters and requiring 5.21 Giga floating-point operations (GFlops) for an input image of size $256 \times 256 \times 3$. This makes it suitable for computationally limited devices, such as mobile platforms.
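The sketch below (class name `DCENetSketch` is illustrative) follows the stated design: seven 3×3 convolutional layers with 32 channels, ReLU activations, symmetrical skip concatenations, no pooling or batch normalization, and a final Tanh producing 24 parameter maps. The exact wiring of the skip connections is an assumption and may differ from the official implementation, although this particular layout does reproduce the reported 79,416 trainable parameters.

```python
import torch
import torch.nn as nn

class DCENetSketch(nn.Module):
    """Plain CNN mapping a low-light image to 24 curve parameter maps
    (8 iterations x 3 RGB channels). No pooling, no batch normalization."""

    def __init__(self, channels: int = 32, iterations: int = 8):
        super().__init__()
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(3, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv3 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv4 = nn.Conv2d(channels, channels, 3, padding=1)
        # Symmetrical concatenation: later layers also see earlier features.
        self.conv5 = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.conv6 = nn.Conv2d(channels * 2, channels, 3, padding=1)
        self.conv7 = nn.Conv2d(channels * 2, 3 * iterations, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.relu(self.conv1(x))
        f2 = self.relu(self.conv2(f1))
        f3 = self.relu(self.conv3(f2))
        f4 = self.relu(self.conv4(f3))
        f5 = self.relu(self.conv5(torch.cat([f3, f4], dim=1)))
        f6 = self.relu(self.conv6(torch.cat([f2, f5], dim=1)))
        # Tanh keeps every curve parameter in [-1, 1].
        return torch.tanh(self.conv7(torch.cat([f1, f6], dim=1)))

maps = DCENetSketch()(torch.rand(1, 3, 256, 256))   # -> (1, 24, 256, 256)
```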

4.2.3. Non-Reference Loss Functions

To enable zero-reference learning (training without ground-truth images), a set of differentiable non-reference loss functions are formulated. These losses implicitly evaluate the quality of the enhanced images by enforcing desirable properties.

4.2.3.1. Spatial Consistency Loss

The spatial consistency loss ($L_{spa}$) encourages the enhanced image to maintain local structural coherence. It penalizes changes in the differences between neighboring regions from the input image to its enhanced version, thereby preserving local contrast and details.

The loss $L_{spa}$ is expressed as: $L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} (|Y_i - Y_j| - |I_i - I_j|)^2$ (4)

Here, we explain the symbols in the formula:

  • $K$: Represents the total number of local regions considered in the image.

  • $i$: Denotes a specific local region.

  • $j \in \Omega(i)$: Indicates the four neighboring regions (top, down, left, right) relative to region $i$.

  • $Y_i$: Is the average intensity value of the local region $i$ in the enhanced image.

  • $Y_j$: Is the average intensity value of the neighboring region $j$ in the enhanced image.

  • $I_i$: Is the average intensity value of the local region $i$ in the input image.

  • $I_j$: Is the average intensity value of the neighboring region $j$ in the input image.

    The term $(|Y_i - Y_j| - |I_i - I_j|)^2$ measures the squared difference between the absolute intensity differences of neighboring regions in the enhanced image versus the input image. By minimizing this, the loss ensures that local changes in brightness or contrast are consistent with the original image's structure. The size of the local region is empirically set to $4 \times 4$.
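A minimal sketch of this loss (the function name `spatial_consistency_loss` is illustrative): 4×4 average pooling yields the region means, and each unordered pair of adjacent regions is compared once, which captures Eq. (4) up to a constant factor relative to summing over all four neighbors.

```python
import torch
import torch.nn.functional as F

def spatial_consistency_loss(enhanced: torch.Tensor, original: torch.Tensor) -> torch.Tensor:
    """Eq. (4): keep differences between neighboring 4x4 regions consistent
    before and after enhancement. Inputs are (B, 3, H, W) tensors in [0, 1]."""
    # Channel-averaged intensities, then 4x4 region means (Y_i and I_i).
    Y = F.avg_pool2d(enhanced.mean(dim=1, keepdim=True), 4)
    I = F.avg_pool2d(original.mean(dim=1, keepdim=True), 4)

    def neighbor_diffs(x):
        # Differences with the right and bottom neighbors (each pair counted once).
        return x[..., :, 1:] - x[..., :, :-1], x[..., 1:, :] - x[..., :-1, :]

    (yh, yv), (ih, iv) = neighbor_diffs(Y), neighbor_diffs(I)
    return ((yh.abs() - ih.abs()) ** 2).mean() + ((yv.abs() - iv.abs()) ** 2).mean()
```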

4.2.3.2. Exposure Control Loss

The exposure control loss ($L_{exp}$) is designed to prevent under-exposed (too dark) or over-exposed (too bright) regions in the enhanced image by pushing local average intensities towards a desired well-exposedness level.

The loss $L_{exp}$ is expressed as: $L_{exp} = \frac{1}{M} \sum_{k=1}^{M} |Y_k - E|$ (5)

Here, we explain the symbols in the formula:

  • $M$: Represents the number of non-overlapping local regions of size $16 \times 16$ across the image.

  • $k$: Denotes a specific local region.

  • $Y_k$: Is the average intensity value of the local region $k$ in the enhanced image.

  • $E$: Represents the target well-exposedness level. Following existing practices [23, 24], $E$ is typically set as a gray level in the RGB color space. In the experiments, $E$ is set to 0.6 (normalized from 0 to 1), though values between 0.4 and 0.7 yield similar performance.

    This loss measures the absolute difference between the average intensity of each local region and the target exposure level, encouraging uniform and appropriate brightness across the image.
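A minimal sketch, assuming 16×16 average pooling of the channel-averaged enhanced image gives the region means $Y_k$ (the helper name `exposure_control_loss` is illustrative):

```python
import torch
import torch.nn.functional as F

def exposure_control_loss(enhanced: torch.Tensor, E: float = 0.6) -> torch.Tensor:
    """Eq. (5): pull the mean intensity of every 16x16 region toward the
    well-exposedness level E. `enhanced` is a (B, 3, H, W) tensor in [0, 1]."""
    Y = F.avg_pool2d(enhanced.mean(dim=1, keepdim=True), 16)  # region means Y_k
    return (Y - E).abs().mean()
```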

4.2.3.3. Color Constancy Loss

The color constancy loss ($L_{col}$) aims to correct potential color deviations that might arise during the enhancement process and to establish proper color balance among the three RGB channels. It is based on the Gray-World color constancy hypothesis [2], which states that, over an entire image, the average reflectance of surfaces is achromatic (gray).

The loss $L_{col}$ is expressed as: $L_{col} = \sum_{\forall (p,q) \in \varepsilon} (J^p - J^q)^2, \quad \varepsilon = \{(R,G), (R,B), (G,B)\}$ (6)

Here, we explain the symbols in the formula:

  • $J^p$: Denotes the average intensity value of channel $p$ (e.g., Red) across the entire enhanced image.

  • $J^q$: Denotes the average intensity value of channel $q$ (e.g., Green) across the entire enhanced image.

  • $(p, q)$: Represents a pair of channels.

  • $\varepsilon$: Is the set of all possible pairs of distinct channels in RGB color space: (Red, Green), (Red, Blue), and (Green, Blue).

    By minimizing the squared difference between the average intensities of all channel pairs, this loss encourages the overall color balance of the enhanced image to be achromatic, thereby correcting color casts.
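A minimal sketch of Equation (6) (the helper name `color_constancy_loss` is illustrative): compute the per-image channel means and penalize their pairwise squared differences.

```python
import torch

def color_constancy_loss(enhanced: torch.Tensor) -> torch.Tensor:
    """Eq. (6): squared differences between the mean R, G, and B intensities
    of the enhanced image (Gray-World assumption). Input is (B, 3, H, W)."""
    mean_rgb = enhanced.mean(dim=(2, 3))                 # (B, 3) channel means J^p
    r, g, b = mean_rgb[:, 0], mean_rgb[:, 1], mean_rgb[:, 2]
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```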

4.2.3.4. Illumination Smoothness Loss

The illumination smoothness loss ($L_{tv_{\mathcal{A}}}$) is applied to the predicted curve parameter maps $\mathcal{A}$. Its purpose is to encourage smooth transitions in the adjustment parameters across the image, which in turn helps to preserve the monotonicity relations between neighboring pixels in the final enhanced image and prevent artifacts (like harsh edges or abrupt changes) in the enhanced output. It is a variant of Total Variation (TV) loss.

The loss $L_{tv_{\mathcal{A}}}$ is defined as: $L_{tv_{\mathcal{A}}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \xi} (|\nabla_x \mathcal{A}_n^c| + |\nabla_y \mathcal{A}_n^c|)^2, \quad \xi = \{R, G, B\}$ (7)

Here, we explain the symbols in the formula:

  • $N$: Represents the total number of iterations (set to 8 in this paper).

  • $n$: Denotes the iteration index.

  • $c$: Represents a specific color channel (Red, Green, or Blue).

  • $\xi$: Is the set of all color channels $\{R, G, B\}$.

  • $\mathcal{A}_n^c$: Refers to the curve parameter map for the $n$-th iteration and channel $c$.

  • $\nabla_x$: Represents the horizontal gradient operation. It measures the change in the parameter map values along the x-direction.

  • $\nabla_y$: Represents the vertical gradient operation. It measures the change in the parameter map values along the y-direction.

    This loss minimizes the sum of squared magnitudes of the horizontal and vertical gradients of each parameter map across all iterations and channels. By doing so, it forces the parameter maps to be spatially smooth, ensuring that adjacent pixels receive similar adjustments unless there's a significant image feature boundary.
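A minimal sketch of the TV-style penalty on the parameter maps (the helper name `illumination_smoothness_loss` is illustrative); it averages squared horizontal and vertical differences, which follows the spirit of Equation (7) even though the exact normalization may differ from the official code.

```python
import torch

def illumination_smoothness_loss(curve_maps: torch.Tensor) -> torch.Tensor:
    """Eq. (7): total-variation-style smoothness on the curve parameter maps.
    `curve_maps` is a (B, 24, H, W) tensor (8 iterations x 3 channels)."""
    dx = curve_maps[..., :, 1:] - curve_maps[..., :, :-1]   # horizontal gradients
    dy = curve_maps[..., 1:, :] - curve_maps[..., :-1, :]   # vertical gradients
    return (dx ** 2).mean() + (dy ** 2).mean()
```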

4.2.3.5. Total Loss

The total loss ($L_{total}$) combines all the individual non-reference loss functions to guide the training of the DCE-Net.

The total loss is expressed as: $L_{total} = L_{spa} + L_{exp} + W_{col} L_{col} + W_{tv_{\mathcal{A}}} L_{tv_{\mathcal{A}}}$ (8)

Here, we explain the symbols in the formula:

  • $L_{spa}$: Is the spatial consistency loss.

  • $L_{exp}$: Is the exposure control loss.

  • $L_{col}$: Is the color constancy loss.

  • $L_{tv_{\mathcal{A}}}$: Is the illumination smoothness loss.

  • $W_{col}$: Is a weight coefficient for the color constancy loss. It balances the contribution of this loss to the total loss.

  • $W_{tv_{\mathcal{A}}}$: Is a weight coefficient for the illumination smoothness loss. It balances the contribution of this loss to the total loss.

    The weights $W_{col}$ and $W_{tv_{\mathcal{A}}}$ are empirically set to 0.5 and 20, respectively, to balance the scales of the different loss components. This weighted sum allows the model to optimize for multiple desirable image enhancement properties simultaneously without requiring any reference images.
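Reusing the illustrative loss functions sketched above (so this block is not standalone), Equation (8) becomes a simple weighted sum:

```python
def total_loss(enhanced, original, curve_maps, w_col: float = 0.5, w_tva: float = 20.0):
    """Eq. (8): weighted sum of the four non-reference losses, using the
    reported weights W_col = 0.5 and W_tvA = 20."""
    return (spatial_consistency_loss(enhanced, original)
            + exposure_control_loss(enhanced)
            + w_col * color_constancy_loss(enhanced)
            + w_tva * illumination_smoothness_loss(curve_maps))
```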

5. Experimental Setup

5.1. Datasets

The authors employed a variety of datasets for both training and evaluation to demonstrate the robustness and generalization capabilities of Zero-DCE.

5.1.1. Training Data

  • SICE dataset Part1 [4]: The primary training data consists of 360 multi-exposure sequences from Part1 of the SICE dataset. This dataset is advantageous because it includes both low-light and over-exposed images, which helps the model learn a wide dynamic range adjustment capability. From this, 3,022 images of different exposure levels were randomly split into:
    • Training Set: 2,422 images.
    • Validation Set: The remaining images.
  • Image Preprocessing: Training images were resized to $512 \times 512$ pixels.

5.1.2. Testing/Evaluation Data

The paper used several standard image sets from previous works for qualitative (visual) and perceptual comparisons, as well as a specific dataset for full-reference quantitative evaluation and another for a face detection task.

  • For Qualitative and Perceptual Comparisons (No Reference Ground Truth):

    • NPE [29]: 84 images.
    • LIME [9]: 10 images.
    • MEF [22]: 17 images.
    • DICM [14]: 64 images.
    • VV (not explicitly referenced, likely an internal dataset or common benchmark): 24 images. These datasets are used for subjective User Study and Perceptual Index (PI) evaluations.
  • For Full-Reference Quantitative Comparisons:

    • SICE dataset Part2 [4]: This subset contains 229 multi-exposure sequences, each with a corresponding reference (normal-light) image. For a fair comparison, only the low-light images from Part2 were selected for testing. Specifically, the first three (if seven images in sequence) or four (if nine images in sequence) low-light images were chosen.
    • Image Preprocessing: All images were resized to $1200 \times 900 \times 3$ pixels.
    • Result: This process yielded 767 paired low/normal light images for quantitative evaluation.
    • Exclusion: The low/normal light image dataset from [37] (DARK FACE) was excluded from this comparison because it was used in the training of some baseline methods (RetinexNet [32] and EnlightenGAN [12]), which would lead to an unfair comparison. The MIT-Adobe FiveK dataset [3] was also not used as it is not primarily designed for underexposed photo enhancement.
  • For Face Detection in the Dark:

    • DARK FACE dataset [37]: This dataset contains 10,000 images taken in dark conditions, specifically for evaluating object detection (faces) in low light.
    • Evaluation Subset: Since bounding boxes for the test set were not publicly available at the time, evaluation was performed on the training and validation sets, comprising 6,000 images.

5.2. Evaluation Metrics

The paper uses a comprehensive set of metrics to evaluate the performance of Zero-DCE, including no-reference perceptual metrics, full-reference objective metrics, and task-specific metrics.

5.2.1. User Study (US)

  • Conceptual Definition: A User Study is a subjective evaluation where human subjects assess the visual quality of enhanced images. It aims to capture human perception and preference, which objective metrics sometimes fail to fully reflect. The subjects are typically trained to look for specific artifacts or desirable qualities.
  • Measurement: Scores typically range from 1 (worst quality) to 5 (best quality), and average scores are reported. A higher US score indicates better perceptual quality.
  • Context in Paper: 15 human subjects independently scored enhanced images (from NPE, LIME, MEF, DICM, VV sets) based on criteria like under-/over-exposed artifacts, color deviation, and unnatural texture/noise.

5.2.2. Perceptual Index (PI)

  • Conceptual Definition: Perceptual Index (PI) is a no-reference perceptual quality metric used to quantify the visual quality of images without requiring a reference image. It is often derived from a combination of other no-reference image quality assessment (NR-IQA) metrics. For instance, the original PI metric [1] combines Naturalness Image Quality Evaluator (NIQE) and Blind/Referenceless Image Spatial Quality Evaluator (BRISQUE) scores. It aims to measure how "natural" and visually pleasing an image appears.
  • Mathematical Formula (Example for NIQE, a component of PI): $NIQE = \sqrt{(\mathbf{v} - \mathbf{v}_{prior})^T \mathbf{\Sigma}^{-1} (\mathbf{v} - \mathbf{v}_{prior})}$ Here, we explain the symbols in the formula:
    • $\mathbf{v}$: Is a vector of natural scene statistics (NSS) features extracted from the test image.
    • $\mathbf{v}_{prior}$: Is a vector of mean NSS features learned from a database of pristine (high-quality, natural) images.
    • $\mathbf{\Sigma}$: Is the covariance matrix of NSS features learned from the pristine image database.
    • $(\cdot)^T$: Denotes the transpose of a vector.
    • $\mathbf{\Sigma}^{-1}$: Denotes the inverse of the covariance matrix.
    • The formula essentially measures the Mahalanobis distance between the NSS features of the test image and a model of natural image features.
  • Interpretation: A lower PI value indicates better perceptual quality (closer to naturalness).
  • Context in Paper: The paper cites [1, 21, 25] for PI, indicating it uses a standard no-reference metric to assess quality.

5.2.3. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a common full-reference metric used to quantify the quality of reconstruction of an image compared to an original ground-truth (reference) image. It is most easily defined via the Mean Squared Error (MSE). It is typically expressed in decibels (dB). A higher PSNR value indicates a higher quality (less noise/distortion) reconstruction.
  • Mathematical Formula: $ MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $ Here, we explain the symbols in the formula:
    • $I(i,j)$: Represents the pixel value at row $i$ and column $j$ of the original (reference) image.
    • $K(i,j)$: Represents the pixel value at row $i$ and column $j$ of the enhanced image.
    • $M, N$: Represent the dimensions (rows and columns) of the image.
    • MSE: Mean Squared Error, the average of the squared differences between the corresponding pixels of the two images.
    • $MAX_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
  • Interpretation: Higher PSNR is better.

5.2.4. Structural Similarity (SSIM)

  • Conceptual Definition: SSIM [31] is another full-reference metric that assesses the perceived quality of an image by considering three key aspects: luminance (brightness), contrast, and structure. Unlike PSNR, which primarily focuses on pixel-wise error, SSIM aims to capture human visual system characteristics, particularly the sensitivity to structural information. The SSIM index can range from -1 to 1, with 1 indicating perfect similarity.
  • Mathematical Formula: $ SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $ Here, we explain the symbols in the formula:
    • x, y: Represent two image patches being compared (from the reference and enhanced images, respectively).
    • $\mu_x, \mu_y$: The average (mean) of patch $x$ and patch $y$, respectively.
    • $\sigma_x^2, \sigma_y^2$: The variance of patch $x$ and patch $y$, respectively.
    • $\sigma_{xy}$: The covariance of patch $x$ and patch $y$.
    • $C_1 = (K_1 L)^2, C_2 = (K_2 L)^2$: Small constants to stabilize the division with a weak denominator. $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images). $K_1 = 0.01, K_2 = 0.03$ are common default values.
  • Interpretation: Higher SSIM is better.

5.2.5. Mean Absolute Error (MAE)

  • Conceptual Definition: MAE is a full-reference metric that measures the average magnitude of the errors between corresponding pixels in the enhanced image and the reference image, without considering their direction. It is a straightforward measure of accuracy.
  • Mathematical Formula: $ MAE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} |I(i,j) - K(i,j)| $ Here, we explain the symbols in the formula:
    • $I(i,j)$: Represents the pixel value at row $i$ and column $j$ of the original (reference) image.
    • $K(i,j)$: Represents the pixel value at row $i$ and column $j$ of the enhanced image.
    • M, N: Represent the dimensions (rows and columns) of the image.
  • Interpretation: Lower MAE is better.
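For reference, the two simplest full-reference metrics can be computed directly from their definitions. This is a minimal sketch assuming both images are float tensors of identical shape with pixel values in [0, 255]; the function names are illustrative.

```python
import torch

def psnr(reference: torch.Tensor, enhanced: torch.Tensor, max_val: float = 255.0) -> torch.Tensor:
    """Peak Signal-to-Noise Ratio in dB between two images of the same shape."""
    mse = ((reference - enhanced) ** 2).mean()
    return 10.0 * torch.log10(max_val ** 2 / mse)

def mae(reference: torch.Tensor, enhanced: torch.Tensor) -> torch.Tensor:
    """Mean Absolute Error between two images of the same shape."""
    return (reference - enhanced).abs().mean()
```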

5.2.6. Average Precision (AP)

  • Conceptual Definition: Average Precision (AP) is a common metric used to evaluate the performance of object detection models. It summarizes the precision-recall curve into a single value. Precision is the ratio of correctly detected objects to all detected objects, while Recall is the ratio of correctly detected objects to all actual objects in the image. The P-R curve plots precision against recall at various confidence thresholds. AP is typically calculated as the area under this P-R curve. A higher AP indicates better overall detection performance.
  • Context in Paper: Used to evaluate how well different enhancement methods improve the performance of a face detector (DSFD [18]) in low-light conditions.

5.3. Baselines

The Zero-DCE method was compared against a selection of state-of-the-art methods spanning different categories:

  • Conventional Methods:

    • SRIE [8] (Xueyang Fu et al., CVPR 2016): A weighted variational model for simultaneous reflectance and illumination estimation. Representative of Retinex-based approaches.
    • LIME [9] (Xiaojie Guo et al., TIP 2017): Low-light image enhancement via illumination map estimation. Another well-known Retinex-based method.
    • Li et al. [19] (Mading Li et al., TIP 2018): Structure-revealing low-light image enhancement via robust Retinex model. A more recent Retinex-based method that considers noise.
  • CNN-based Methods (Supervised Deep Learning):

    • RetinexNet [32] (Chen Wei et al., BMVC 2018): Deep Retinex decomposition for low-light enhancement, trained on paired data.
    • Wang et al. [28] (Ruixing Wang et al., CVPR 2019): Underexposed photo enhancement using deep illumination estimation, also trained on paired data.
  • GAN-based Methods (Unsupervised Deep Learning):

    • EnlightenGAN [12] (Yifan Jiang et al., 2019): Deep light enhancement without paired supervision, trained using unpaired low/normal light data.

      These baselines are representative because they cover the major categories of low-light image enhancement techniques (conventional, supervised deep learning, unsupervised deep learning) and include recent, high-performing methods from prestigious conferences/journals. The authors ensured a fair comparison by reproducing results using publicly available source codes and recommended parameters.

5.4. Implementation Details

  • Framework: Implemented using PyTorch.
  • Hardware: An NVIDIA 2080Ti GPU was used for training.
  • Batch Size: A batch size of 8 was applied during training.
  • Parameter Initialization: The filter weights (parameters) of each layer in DCE-Net were initialized using a standard Gaussian function with zero mean and 0.02 standard deviation. Bias terms were initialized as a constant.
  • Optimizer: The ADAM optimizer was used for network optimization, with its default parameters.
  • Learning Rate: A fixed learning rate of $1 \times 10^{-4}$ was used.
  • Loss Weights: The weights for the color constancy loss ($W_{col}$) and the illumination smoothness loss ($W_{tv_{\mathcal{A}}}$) were empirically set to 0.5 and 20, respectively. These values were chosen to balance the contributions of different loss components to the total loss.
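The reported settings can be combined into a schematic training loop. This is only a sketch: `DCENetSketch` and `total_loss` refer to the illustrative snippets earlier in this analysis, and `data_loader` is a hypothetical loader yielding batches of eight normalized 512×512 low-light images.

```python
import torch

def init_weights(m):
    # Gaussian init (mean 0, std 0.02) for conv weights, constant for biases,
    # following the reported implementation details.
    if isinstance(m, torch.nn.Conv2d):
        torch.nn.init.normal_(m.weight, mean=0.0, std=0.02)
        torch.nn.init.constant_(m.bias, 0.0)

net = DCENetSketch()
net.apply(init_weights)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-4)

for low_light in data_loader:                      # hypothetical (8, 3, 512, 512) batches
    maps = net(low_light)                          # (8, 24, H, W) curve parameter maps
    curve_maps = maps.view(maps.shape[0], 8, 3, *maps.shape[-2:])
    enhanced = low_light
    for n in range(8):                             # apply the pixel-wise LE-curve iteratively
        A_n = curve_maps[:, n]
        enhanced = enhanced + A_n * enhanced * (1.0 - enhanced)
    loss = total_loss(enhanced, low_light, maps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```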

6. Results & Analysis

The experimental results demonstrate the superiority of Zero-DCE across various qualitative, quantitative, and perceptual evaluations, and also highlight its efficiency and practical benefits for high-level tasks.

6.1. Core Results Analysis

6.1.1. Visual and Perceptual Comparisons

The following are the visual comparisons on typical low-light images from Figure 7 of the original paper:

Figure 7: Visual comparisons on typical low-light images. Red boxes indicate the obvious differences.

  • Analysis of Figure 7:
    • First Example (Top Row): A challenging back-lit scenario with a face in shadow.
      • Zero-DCE produces a natural exposure and reveals clear details on the face, effectively combating the extreme backlight.
      • SRIE, LIME, Wang et al., and EnlightenGAN fail to clearly recover the face, leaving it under-enhanced.
      • RetinexNet suffers from over-exposed artifacts in other regions, indicating poor dynamic range control.
    • Second Example (Bottom Row): An indoor scene.
      • Zero-DCE successfully enhances the dark regions while preserving the original colors of the input image. The result is visually pleasing with no obvious noise or color casts.

      • Li et al. oversmoothes details, suggesting a loss of texture.

      • Other baseline methods amplify noise or introduce color deviation (e.g., the wall's color).

        These visual comparisons strongly validate Zero-DCE's ability to produce high-quality, natural-looking enhanced images, especially in complex lighting conditions and for specific regions like faces.

The following are the results from Table 1 of the original paper:

| Method | NPE | LIME | MEF | DICM | VV | Average |
|---|---|---|---|---|---|---|
| SRIE [8] | 3.65/2.79 | 3.50/2.76 | 3.22/2.61 | 3.42/3.17 | 2.80/3.37 | 3.32/2.94 |
| LIME [9] | 3.78/3.05 | 3.95/3.00 | 3.71/2.78 | 3.31/3.35 | 3.21/3.03 | 3.59/3.04 |
| Li et al. [19] | 3.80/3.09 | 3.78/3.02 | 2.93/3.61 | 3.47/3.43 | 2.87/3.37 | 3.37/3.72 |
| RetinexNet [32] | 3.30/3.18 | 2.32/3.08 | 2.80/2.86 | 2.88/3.24 | 1.96/2.95 | 2.58/3.06 |
| Wang et al. [28] | 3.83/2.83 | 3.82/2.90 | 3.13/2.72 | 3.44/3.20 | 2.95/3.42 | 3.43/3.01 |
| EnlightenGAN [12] | 3.90/2.96 | 3.84/2.83 | 3.75/2.45 | 3.50/3.13 | 3.17/4.71 | 3.63/3.22 |
| Zero-DCE | 3.81/2.84 | 3.80/2.76 | 4.13/2. | 3.78/2. | 3.54/2. | 3.81/2. |

Table 1: User study (US) / Perceptual index (PI) scores on the image sets (NPE, LIME, MEF, DICM, VV). A higher US score is better; a lower PI is better. The best result is in red whereas the second best one is in blue under each case. (Note: the PI values for Zero-DCE are incomplete in the table above and are reproduced as given.)

  • Analysis of Table 1 (User Study and Perceptual Index):
    • Zero-DCE achieves the highest average User Study (US) score (3.81) across all 202 testing images, indicating it is most favored by human subjects in terms of visual quality.
    • For specific datasets like MEF, DICM, and VV, Zero-DCE clearly outperforms others with the highest US scores.
    • In terms of Perceptual Index (PI), where lower values are better, Zero-DCE also shows competitive performance, often achieving the lowest or near-lowest PI values (though the provided table has truncated PI values for Zero-DCE, the trend suggests superiority). This confirms that Zero-DCE produces images that are perceived as more natural and of higher quality, aligning with human judgment.

6.1.2. Quantitative Comparisons

The following are the results from Table 2 of the original paper:

| Method | PSNR ↑ | SSIM ↑ | MAE ↓ |
|---|---|---|---|
| SRIE [8] | 14.41 | 0.54 | 127.08 |
| LIME [9] | 16.17 | 0.57 | 108.12 |
| Li et al. [19] | 15.19 | 0.54 | 114.21 |
| RetinexNet [32] | 15.99 | 0.53 | 104.81 |
| Wang et al. [28] | 13.52 | 0.49 | 142.01 |
| EnlightenGAN [12] | 16.21 | 0.59 | 102.78 |
| Zero-DCE | 16.57 | 0.59 | 98.78 |

Table 2: Quantitative comparisons in terms of full-reference image quality assessment metrics. The best result is in red whereas the second best one is in blue under each case.

  • Analysis of Table 2 (Full-Reference Metrics):
    • Despite not using any paired or unpaired training data (zero-reference), Zero-DCE achieves the best values for all three full-reference image quality assessment metrics on the SICE Part2 subset:
      • PSNR (Peak Signal-to-Noise Ratio): 16.57 dB (highest), indicating highest pixel-level fidelity.
      • SSIM (Structural Similarity): 0.59 (tied for highest with EnlightenGAN), suggesting excellent preservation of structural information.
      • MAE (Mean Absolute Error): 98.78 (lowest), indicating the smallest average pixel-wise difference from the ground truth.
    • This quantitative superiority, especially in PSNR and MAE, is remarkable given the zero-reference training paradigm, demonstrating that the implicitly learned non-reference loss functions are highly effective in driving the network towards high-quality enhancements that correlate well with objective measures.

6.1.3. Runtime Comparisons

The following are the results from Table 3 of the original paper:

| Method | Runtime (s) | Platform |
| --- | --- | --- |
| SRIE [8] | 12.1865 | MATLAB (CPU) |
| LIME [9] | 0.4914 | MATLAB (CPU) |
| Li et al. [19] | 90.7859 | MATLAB (CPU) |
| RetinexNet [32] | 0.1200 | TensorFlow (GPU) |
| Wang et al. [28] | 0.0210 | TensorFlow (GPU) |
| EnlightenGAN [12] | 0.0078 | PyTorch (GPU) |
| Zero-DCE | 0.0025 | PyTorch (GPU) |

Table 3: Runtime comparisons in seconds. The best result is in red whereas the second best one is in blue.

  • Analysis of Table 3 (Runtime):
    • Zero-DCE is the most computationally efficient method, with a runtime of 0.0025 seconds per image on a GPU. This is significantly faster than all other methods, including other deep learning-based approaches.
    • This high efficiency (about 500 FPS for 640×480 images) is attributed to its simple curve mapping form and lightweight DCE-Net architecture, making Zero-DCE well suited for real-time applications and deployment on resource-constrained devices. A minimal per-image timing sketch follows this list.
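
As a rough illustration of how a per-image GPU runtime like the one in Table 3 can be measured, the following PyTorch sketch times forward passes of an arbitrary enhancement network. `model` is a placeholder for any `torch.nn.Module` (not the authors' released code); warm-up iterations and explicit CUDA synchronization are needed for the number to be meaningful.

```python
# Hedged timing sketch: average forward-pass time of an enhancement network
# on a dummy 640x480 RGB input, assuming a CUDA-capable GPU is available.
import time
import torch

@torch.no_grad()
def average_runtime(model, iters=100, warmup=10, device="cuda"):
    model = model.to(device).eval()
    x = torch.rand(1, 3, 480, 640, device=device)  # dummy 640x480 image
    for _ in range(warmup):                        # exclude CUDA init / autotuning cost
        model(x)
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        model(x)
    torch.cuda.synchronize()                       # wait for queued kernels to finish
    return (time.time() - start) / iters           # seconds per image
```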

6.1.4. Face Detection in the Dark

The following are the results from Figure 8 of the original paper:

Figure 8: The performance of face detection in the dark. Top: precision-recall (PR) curves and AP values for DSFD on raw inputs and on images enhanced by different methods; bottom: two examples of face detection before and after enhancement by Zero-DCE.

  • Analysis of Figure 8 (Face Detection):
    • The precision-recall (P-R) curves clearly show that image enhancement significantly increases the precision of the Dual Shot Face Detector (DSFD) [18] compared to using raw, unenhanced images. This demonstrates the practical benefit of low-light enhancement for high-level computer vision tasks.
    • Among the enhancement methods, RetinexNet [32] and Zero-DCE perform the best, showing comparable performance. However, Zero-DCE exhibits better performance in the high recall area, meaning it can detect a larger proportion of actual faces while maintaining good precision.
    • The Average Precision (AP) values (indicated in the graph) further confirm the improvement; Zero-DCE's AP is higher than that of most other methods. (A minimal AP-computation sketch is given after this list.)
    • The visual examples at the bottom of Figure 8 show that Zero-DCE effectively brightens faces in extremely dark regions while preserving well-lit areas, directly contributing to the improved face detection performance.
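
For reference, the average precision underlying these PR curves can be computed from scored detections once each detection has been matched to ground truth (e.g., by IoU thresholding, which is assumed to have already happened). The sketch below uses the standard all-points interpolation and is not tied to the DSFD evaluation code.

```python
# Hedged sketch: AP from detector confidences and true/false-positive flags.
import numpy as np

def average_precision(scores, is_true_positive, num_gt):
    order = np.argsort(-np.asarray(scores, dtype=float))   # highest confidence first
    tp = np.asarray(is_true_positive, dtype=float)[order]
    tp_cum = np.cumsum(tp)
    fp_cum = np.cumsum(1.0 - tp)
    recall = tp_cum / max(num_gt, 1)
    precision = tp_cum / (tp_cum + fp_cum)
    # Precision envelope: p_interp(r) = max precision over all recalls >= r.
    precision = np.maximum.accumulate(precision[::-1])[::-1]
    # Integrate the step function precision(recall) over recall.
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recall, precision):
        ap += (r - prev_r) * p
        prev_r = r
    return ap
```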

6.2. Ablation Studies / Parameter Analysis

The authors conducted several ablation studies to analyze the contribution of each component of Zero-DCE and the effect of its parameter settings.

6.2.1. Contribution of Each Loss

The following are the results from Figure 4 of the original paper:

Figure 4: Ablation study of the contribution of each loss (spatial consistency loss $L_{spa}$, exposure control loss $L_{exp}$, color constancy loss $L_{col}$, illumination smoothness loss $L_{tv_{\mathcal{A}}}$). Panels: (a) input; (b) full Zero-DCE result; (c)–(f) results with $L_{spa}$, $L_{exp}$, $L_{col}$, and $L_{tv_{\mathcal{A}}}$ removed in turn.

  • Analysis of Figure 4: This study demonstrates the importance of each non-reference loss function:
    • Without Spatial Consistency Loss ($L_{spa}$): The enhanced image (Figure 4c) shows relatively lower contrast (e.g., in the cloud regions) than the full result (Figure 4b), confirming that $L_{spa}$ preserves the differences between neighboring regions and thus maintains local contrast.

    • Without Exposure Control Loss ($L_{exp}$): The variant (Figure 4d) fails to recover the low-light region adequately and remains visibly dark, highlighting the role of $L_{exp}$ in controlling the overall exposure level and preventing under-exposure.

    • Without Color Constancy Loss ($L_{col}$): The result (Figure 4e) exhibits severe color casts, indicating that $L_{col}$ is essential for correcting color deviations and balancing the RGB channels, preventing an unnatural hue.

    • Without Illumination Smoothness Loss ($L_{tv_{\mathcal{A}}}$): The enhanced image (Figure 4f) shows obvious artifacts and a lack of smoothness, confirming that $L_{tv_{\mathcal{A}}}$ smooths the parameter maps and preserves monotonicity relations, preventing abrupt changes and unnatural textures in the output.

      This ablation study validates the necessity and effectiveness of each designed non-reference loss function for achieving high-quality, natural-looking image enhancement. Minimal sketches of two of these losses are given after this list.
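
As a concrete illustration, here are minimal PyTorch sketches of two of the losses ablated above, the exposure control loss and the color constancy loss, written from the paper's description (16×16 local patches and a well-exposedness level around E = 0.6). These are illustrative re-implementations rather than the released code, and the exact distance (absolute vs. squared) may differ from the official implementation.

```python
# Hedged re-implementations of two non-reference losses for images in [0, 1].
import torch
import torch.nn.functional as F

def exposure_control_loss(enhanced: torch.Tensor, patch_size: int = 16, E: float = 0.6):
    """Penalize local patch means that deviate from the well-exposedness level E."""
    gray = enhanced.mean(dim=1, keepdim=True)    # (B, 1, H, W): average over RGB
    patch_mean = F.avg_pool2d(gray, patch_size)  # mean intensity per non-overlapping patch
    return torch.mean((patch_mean - E) ** 2)

def color_constancy_loss(enhanced: torch.Tensor):
    """Gray-World prior: the mean values of the three channels should be close."""
    mean_rgb = enhanced.mean(dim=(2, 3))         # (B, 3) per-image channel means
    r, g, b = mean_rgb[:, 0], mean_rgb[:, 1], mean_rgb[:, 2]
    return torch.mean((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2)
```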

6.2.2. Effect of Parameter Settings

The following are the results from Figure 5 of the original paper:

Figure 5: Ablation study of the effect of parameter settings. $l$-$f$-$n$ denotes Zero-DCE with $l$ convolutional layers, $f$ feature maps per layer (except the last layer), and $n$ iterations. Panels: (a) input; (b)–(f) results under different settings such as 3-32-8 and 7-32-16.

  • Analysis of Figure 5: This study explores the impact of DCE-Net's depth ($l$), width ($f$), and the number of iterations ($n$) used when applying the LE-curve.
    • Zero-DCE 3-32-8 (3 layers, 32 feature maps, 8 iterations; Figure 5b): Even with a minimal depth of 3 convolutional layers, the model produces satisfactory results, suggesting the inherent effectiveness of the zero-reference learning approach and the LE-curve design itself.

    • Zero-DCE 7-32-8 (7 layers, 32 feature maps, 8 iterations; Figure 5c): This configuration yields the most visually pleasing results, with natural exposure and proper contrast. It is the chosen final model, representing a good balance.

    • Zero-DCE 7-32-16 (7 layers, 32 feature maps, 16 iterations; Figure 5e): Doubling the number of iterations from 8 to 16 does not yield a noticeably better result, suggesting that 8 iterations are sufficient for most cases.

    • Zero-DCE 7-32-1 (7 layers, 32 feature maps, 1 iteration; Figure 5d): Performance drops significantly; the image remains quite dark and under-enhanced. This confirms the necessity of higher-order curves (i.e., multiple iterations) for powerful dynamic range adjustment, as a single iteration offers limited adjustment capability.

    • Zero-DCE 7-64-8 (7 layers, 64 feature maps, 8 iterations; Figure 5f): Widening the network to 64 feature maps does not bring a noticeable visual improvement over Zero-DCE 7-32-8.

      These findings justify the choice of Zero-DCE 7-32-8 as the final model, as it offers the best trade-off between efficiency and restoration performance. A minimal sketch of the iterative LE-curve application follows this list.
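
To make the role of the iteration count $n$ concrete, the following sketch applies $n$ pixel-wise LE-curves in sequence. In Zero-DCE the per-iteration parameter maps $\mathcal{A}_n$ are predicted by DCE-Net; here they are simply passed in as tensors, so this is an illustrative sketch of the curve mapping rather than the released implementation.

```python
# Hedged sketch: iterative application of the quadratic LE-curve
#   LE(x; A) = x + A * x * (1 - x),  with A in [-1, 1],
# which keeps values in [0, 1]; applying it n times gives a higher-order curve.
import torch

def apply_le_curves(image: torch.Tensor, curve_maps: list[torch.Tensor]) -> torch.Tensor:
    """image: (B, 3, H, W) in [0, 1]; curve_maps: n maps, each (B, 3, H, W) in [-1, 1]."""
    x = image
    for A in curve_maps:            # each iteration re-enhances the previous result
        x = x + A * x * (1.0 - x)
    return x
```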

6.2.3. Impact of Training Data

The following are the results from Figure 6 of the original paper:

Figure 6: Ablation study on the impact of training data. Panels: (a) input; (b) Zero-DCE; (c) Zero-DCE_Low; (d) Zero-DCE_LargeL; (e) Zero-DCE_LargeLH.

  • Analysis of Figure 6: This study investigates how the composition of the training data affects Zero-DCE's performance.
    • Input Image (Figure 6a): Original low-light image.

    • Zero-DCE (trained on 2,422 multi-exposure images, Figure 6b): The baseline Zero-DCE result, showing good enhancement.

    • Zero-DCE_Low (trained on only the 900 low-light images from the original set, Figure 6c): When trained with only low-light images (and fewer of them), the model still brightens dark regions but tends to over-enhance well-lit regions (e.g., the face).

    • Zero-DCE_LargeL (trained on 9,000 unlabeled low-light images from DARK FACE dataset, Figure 6d): Even with a larger quantity of only low-light images, the issue of over-enhancement in bright regions persists.

    • Zero-DCE_LargeLH (trained on 4,800 multi-exposure images from augmented SICE Part1+Part2, Figure 6e): When trained with a larger amount of multi-exposure data (including both low-light and over-exposed samples), the model achieves even better recovery of dark regions and balanced enhancement across the image.

      This ablation demonstrates the necessity of including multi-exposure training data (both low-light and over-exposed samples). Training on only low-light images, even in large quantity, leads to over-enhancement of regions that are already sufficiently lit, whereas diverse exposure levels in the training data help Zero-DCE learn to adaptively adjust the dynamic range across a wider spectrum of brightness. The authors also note that, for a fair comparison with other deep learning methods, they used a comparable amount of training data, although more data could further improve performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduced Zero-Reference Deep Curve Estimation (Zero-DCE), a novel and highly effective method for low-light image enhancement. The core innovation lies in reformulating the enhancement task as an image-specific curve estimation problem, tackled by a lightweight DCE-Net. The LE-curve design ensures output within a valid pixel range, monotonicity for contrast preservation, and differentiability for end-to-end training. A paradigm-shifting aspect is its zero-reference training, which eliminates the need for any paired or unpaired ground-truth images. This is achieved through a meticulously designed set of non-reference loss functions (spatial consistency, exposure control, color constancy, illumination smoothness) that implicitly guide the network towards desirable enhancement qualities. Zero-DCE demonstrated state-of-the-art performance both qualitatively and quantitatively against existing methods, proved to be exceptionally efficient (real-time processing), and showed significant practical benefits for downstream high-level vision tasks such as face detection in the dark.

7.2. Limitations & Future Work

The authors acknowledge potential areas for future improvement:

  • Semantic Information for Hard Cases: The current approach primarily focuses on pixel-level and local region properties. Introducing semantic information (e.g., understanding of objects, scene context) could help solve more hard cases where generic illumination adjustment might be insufficient or ambiguous.
  • Effects of Noise: The paper mentions considering the effects of noise as a future direction. Low-light images often suffer from amplified noise during brightening. Developing strategies to simultaneously enhance and denoise would be a valuable extension.

7.3. Personal Insights & Critique

The Zero-DCE paper offers several profound insights and inspires critical reflection:

  • Novelty of Zero-Reference Learning: The zero-reference training paradigm is a significant advancement. It addresses the most critical bottleneck in supervised image enhancement—the reliance on expensive and often artificial reference data. This makes Zero-DCE highly practical for real-world applications where ground truth is scarce or impossible to obtain. The success of this approach highlights the power of carefully designed non-reference loss functions that encode human perceptual priors or desirable image characteristics. It opens up avenues for other image-to-image translation tasks where reference data is a major limitation.
  • Elegance of Curve Estimation: Reframing image enhancement as image-specific curve estimation is elegant. It offers a more interpretable and controllable adjustment mechanism compared to opaque end-to-end image-to-image mappings. The iterative, pixel-wise application of the LE-curve provides a flexible yet constrained transformation space, ensuring properties like monotonicity and valid pixel range which are critical for natural results. This approach could potentially be transferred to other image adjustment tasks, like contrast enhancement or tone mapping, where explicit curve control is beneficial.
  • Efficiency for Real-time Applications: The lightweight DCE-Net and simple curve mapping result in exceptional computational efficiency. This makes Zero-DCE highly suitable for deployment on resource-constrained devices like smartphones or for real-time video enhancement, broadening its applicability significantly.
  • Impact on High-Level Vision Tasks: Demonstrating the improvement in face detection in low-light conditions is a strong validation of Zero-DCE's practical value beyond aesthetic improvement. This underscores that effective low-level image processing can have a profound impact on the performance of high-level computer vision systems.

Potential Issues or Areas for Improvement:

  • Fragility of Loss Functions: While the non-reference loss functions are a strength, their effectiveness heavily relies on their precise formulation and the chosen weights. If the assumptions encoded in these losses (e.g., Gray-World hypothesis for color constancy, specific exposure target) do not hold universally for all possible low-light scenarios, the model's generalization might still be limited in those edge cases. Further research could explore adaptive weighting of losses or more sophisticated perceptual losses that better mimic human perception without explicit reference.

  • Handling Extreme Noise: As acknowledged by the authors, noise amplification is a common issue in low-light enhancement. While illumination smoothness loss helps prevent artifacts in the parameter maps, it doesn't explicitly address image noise. Integrating a denoising component or noise-aware curve estimation could further improve robustness.

  • Interpretability of $\mathcal{A}$ maps: While the LE-curve itself is interpretable, the pixel-wise parameter maps $\mathcal{A}_n(\mathbf{x})$ are network outputs. Further analysis into why the network produces specific $\mathcal{A}$ values for different regions could yield deeper insights into the learning process and potentially lead to even more refined curve designs.

    Overall, Zero-DCE is a highly impactful paper that presents an elegant, efficient, and data-independent solution to a pervasive problem, pushing the boundaries of deep learning in image enhancement.
