Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement
TL;DR Summary
This study introduces Zero-DCE, a method using a lightweight network for image-specific curve estimation to enhance low-light images without requiring paired data. It employs non-reference loss functions to effectively improve image quality, demonstrating good generalization across diverse lighting conditions.
Abstract
The paper presents a novel method, Zero-Reference Deep Curve Estimation (Zero-DCE), which formulates light enhancement as a task of image-specific curve estimation with a deep network. Our method trains a lightweight deep network, DCE-Net, to estimate pixel-wise and high-order curves for dynamic range adjustment of a given image. The curve estimation is specially designed, considering pixel value range, monotonicity, and differentiability. Zero-DCE is appealing in its relaxed assumption on reference images, i.e., it does not require any paired or unpaired data during training. This is achieved through a set of carefully formulated non-reference loss functions, which implicitly measure the enhancement quality and drive the learning of the network. Our method is efficient as image enhancement can be achieved by an intuitive and simple nonlinear curve mapping. Despite its simplicity, we show that it generalizes well to diverse lighting conditions. Extensive experiments on various benchmarks demonstrate the advantages of our method over state-of-the-art methods qualitatively and quantitatively. Furthermore, the potential benefits of our Zero-DCE to face detection in the dark are discussed. Code and model will be available at https://github.com/Li-Chongyi/Zero-DCE.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title of the paper is Zero-Reference Deep Curve Estimation for Low-Light Image Enhancement. This title directly indicates the paper's central topic: enhancing images taken in low-light conditions using a deep learning approach that relies on curve estimation and does not require reference images during training.
1.2. Authors
The paper lists multiple authors:
- Chunle Guo
- Chongyi Li
- Jichang Guo
- Chen Change Loy
- Junhui Hou
- Sam Kwong
- Runmin Cong

The authors are affiliated with institutions such as Tianjin University, City University of Hong Kong, Nanyang Technological University, and Beijing Jiaotong University, indicating a collaborative research effort from multiple academic institutions, primarily based in China and Singapore. Their research backgrounds generally involve computer vision, image processing, and deep learning.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, on January 19, 2020. While arXiv itself is not a peer-reviewed journal or conference, it is a widely used platform for quickly disseminating research findings in physics, mathematics, computer science, and related fields. Papers often appear on arXiv before or concurrently with submission to prestigious conferences (e.g., CVPR, ICCV, ECCV) or journals. The subsequent citations in major venues suggest its influence in the image enhancement community.
1.4. Publication Year
The paper was published in 2020.
1.5. Abstract
The paper introduces Zero-Reference Deep Curve Estimation (Zero-DCE), a novel method for low-light image enhancement. It re-frames enhancement as an image-specific curve estimation problem solved by a lightweight deep neural network called DCE-Net. DCE-Net learns to estimate pixel-wise, high-order curves that dynamically adjust the range of pixel values in a given image. The design of these curves carefully considers properties like pixel value range, monotonicity, and differentiability. A significant innovation of Zero-DCE is its ability to train without any paired or unpaired reference images. This "zero-reference" training is achieved through a suite of specifically designed non-reference loss functions that implicitly gauge enhancement quality and guide network learning. The method is efficient due to its simple nonlinear curve mapping approach and demonstrates strong generalization across diverse lighting conditions. Extensive experiments show qualitative and quantitative advantages over state-of-the-art methods, and the paper also highlights its practical benefits for high-level vision tasks like face detection in low light.
1.6. Original Source Link
The original source link is: https://arxiv.org/abs/2001.06826
This is an arXiv preprint.
1.7. PDF Link
The PDF link is: https://arxiv.org/pdf/2001.06826v2.pdf
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is low-light image enhancement. Many images are captured under suboptimal lighting conditions due to environmental factors (e.g., night scenes, poorly lit interiors) or technical constraints (e.g., incorrect exposure settings). These low-light images suffer from two major issues:
- Compromised Aesthetic Quality: They appear dark, dull, and lack visual appeal, negatively impacting the viewer's experience.
- Unsatisfactory Information Transmission: Crucial details can be obscured, leading to inaccurate interpretation, especially for automated computer vision tasks like object recognition or face detection. For instance, a face detector might fail on a dark image.

Existing research in image enhancement faces several challenges:
- Generalization: Many deep learning-based methods rely on paired data (a low-light image and its corresponding well-lit version) or unpaired data (separate collections of low-light and well-lit images). Collecting such data is costly, time-consuming, and often involves synthetic degradation or expert retouching, which can introduce factitious (artificial) and unrealistic data. This reliance often leads to models that overfit to their training data and generalize poorly to real-world, diverse low-light conditions, producing artifacts or color casts.
- Computational Burden: Some methods are computationally intensive, limiting their use in real-time applications or on devices with limited resources (e.g., mobile phones).
- Lack of Robustness: Traditional methods often struggle with nonuniform illumination, where different parts of an image have vastly different lighting levels.

The paper's entry point and innovative idea is to reformulate low-light image enhancement as an image-specific curve estimation problem. Instead of directly mapping a low-light image to an enhanced one, it proposes to learn a set of pixel-wise (each pixel can have a unique adjustment) and high-order curves (the adjustment can be complex and iterative) that dynamically adjust the dynamic range of the input image. Crucially, this is achieved with zero-reference training, meaning no target enhanced images (either paired or unpaired) are needed during the learning process.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Zero-Reference Learning Framework: It proposes the first low-light enhancement network that operates independent of paired and unpaired training data. This completely avoids the high cost and potential pitfalls (e.g., overfitting, factitious data) associated with collecting reference images, leading to better generalization across various lighting conditions.
- Image-Specific Light-Enhancement Curve (LE-curve): The authors design a novel image-specific curve that can approximate pixel-wise and higher-order adjustments through iterative application. This curve is carefully formulated to maintain the pixel value range (0-1), preserve monotonicity (to prevent contrast reversal), and be differentiable (to allow gradient-based optimization). This design allows for effective dynamic range mapping.
- Task-Specific Non-Reference Loss Functions: A set of differentiable non-reference loss functions is introduced, including a spatial consistency loss, exposure control loss, color constancy loss, and illumination smoothness loss. These losses implicitly evaluate the quality of the enhanced images without explicit ground-truth references, guiding the network's learning effectively.

The key conclusions and findings are:
- Zero-DCE achieves state-of-the-art performance both qualitatively (visually pleasing results with natural brightness, color, and contrast) and quantitatively (higher PSNR, SSIM, lower MAE, better User Study scores, lower Perceptual Index) compared to existing methods, even those requiring reference data.
- The method is highly efficient, capable of processing images in real time (around 500 frames per second on a GPU for 640x480 images) and requiring minimal training time (30 minutes).
- Zero-DCE significantly improves the performance of high-level visual tasks like face detection in low-light conditions, demonstrating its practical utility beyond mere aesthetic enhancement.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Zero-DCE paper, a novice reader should be familiar with several fundamental concepts in image processing and deep learning:
- Pixel Values and RGB Channels:
  - An image is composed of a grid of tiny picture elements called pixels.
  - In color images, each pixel's color is typically represented by three primary color channels: Red (R), Green (G), and Blue (B).
  - Each channel usually has an intensity value ranging from 0 to 255 (for 8-bit images), where 0 means no intensity (black) and 255 means full intensity (brightest for that color). Low-light images typically have most pixel values clustered at the lower end of this range.
  - For deep learning models, these pixel values are often normalized to the range [0, 1] by dividing by 255, which helps with model stability and convergence.
- Dynamic Range and Contrast:
  - Dynamic range refers to the ratio between the maximum and minimum light intensity an imaging system can capture or display. In an image, it is the spread of pixel intensity values.
  - Contrast is the difference in brightness or color that makes an object distinguishable. A low-light image often has a narrow dynamic range and low contrast, making it look flat and dull.
  - Image enhancement aims to expand this dynamic range and improve contrast to reveal more details and make the image visually appealing.
- Monotonicity:
  - In the context of a curve or mapping function, monotonicity means that the function either always increases or always decreases. If a curve is monotonically increasing, then as the input value increases, the output value also increases (or stays the same).
  - For image enhancement, monotonicity is crucial because it ensures that the relative order of pixel intensities is preserved. For example, if pixel A is brighter than pixel B in the original image, it should remain brighter (or equally bright) after enhancement. Violating monotonicity would reverse contrast, creating unnatural artifacts.
- Differentiability:
  - A function is differentiable if its derivative exists at every point in its domain. In simpler terms, the function's slope can be calculated at any point.
  - In deep learning, differentiability is essential because neural networks are trained using gradient-based optimization algorithms (like stochastic gradient descent or ADAM). These algorithms calculate gradients (slopes) of the loss function with respect to the network's trainable parameters (weights and biases) to update them and minimize the loss. If the operations within the network (like the LE-curve in this paper) are not differentiable, backpropagation (the process of computing gradients) cannot occur, and the network cannot be trained.
- Convolutional Neural Networks (CNNs):
  - CNNs are a class of deep neural networks specifically designed for processing structured grid-like data, such as images.
  - They consist of multiple layers, including convolutional layers, pooling layers, and fully connected layers.
  - A convolutional layer applies a set of learnable filters (also called kernels) across the input image. Each filter detects specific features (e.g., edges, textures).
  - Activation functions (like ReLU or Tanh) introduce nonlinearity into the network, allowing it to learn complex patterns.
    - ReLU (Rectified Linear Unit): An activation function defined as $f(x) = \max(0, x)$. It outputs the input directly if it is positive, otherwise it outputs zero. It is computationally efficient.
    - Tanh (Hyperbolic Tangent): An activation function defined as $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$. It squashes output values to the range $(-1, 1)$, making it suitable for producing normalized parameters or outputs.
  - Batch Normalization is a technique used to standardize the inputs to layers in a deep neural network, stabilizing and speeding up training. The authors explicitly discard it here to preserve neighboring pixel relations.
- Loss Functions and Optimizers:
  - A loss function quantifies the difference between the network's output and the desired target. The goal of training is to minimize this loss.
  - Non-reference loss functions are special types of loss functions that do not require a ground-truth (reference) image for comparison. Instead, they implicitly evaluate the quality of the output based on certain desirable properties (e.g., smoothness, exposure level, color consistency).
  - An optimizer is an algorithm (like ADAM) used to adjust the network's trainable parameters (weights and biases) in the direction that minimizes the loss function based on the computed gradients.
3.2. Previous Works
The paper categorizes previous low-light image enhancement methods into Conventional Methods and Data-Driven Methods.
3.2.1. Conventional Methods
These methods often rely on handcrafted algorithms, mathematical models, or statistical properties of images.
- Histogram Equalization (HE)-based methods:
  - Concept: These methods enhance contrast by adjusting the histogram (distribution of pixel intensities) of an image, spreading out the most frequent intensity values.
  - Global HE: Adjusts the histogram for the entire image (e.g., [7, 10]). Can sometimes lead to over-enhancement in already bright regions or amplification of noise.
  - Local HE: Adjusts histograms for smaller, local regions of the image (e.g., [15, 27]), providing better local contrast, but can be computationally intensive and sometimes introduces blocky artifacts.
  - Differentiation: Zero-DCE avoids directly manipulating histograms, opting for curve mapping, which offers more fine-grained, image-specific control without the risk of abrupt changes in pixel distribution.
- Retinex Theory-based methods:
  - Concept: The Retinex theory [13] postulates that an image can be decomposed into two components: reflectance (the intrinsic property of an object's surface, independent of lighting) and illumination (the lighting conditions). The goal of enhancement is to estimate and adjust the illumination component while preserving the reflectance.
  - Examples:
    - SRIE [8] (Xueyang Fu et al., 2016) proposed a weighted variational model to simultaneously estimate reflectance and illumination.
    - LIME [9] (Xiaojie Guo et al., 2017) estimated a coarse illumination map by searching for the maximum intensity in the RGB channels, then refined it using a structure prior.
    - Li et al. [19] (Mading Li et al., 2018) introduced a Retinex model that considers noise, estimating the illumination map by solving an optimization problem.
  - Differentiation: Conventional Retinex methods rely on potentially inaccurate physical models and often involve complex optimization. Zero-DCE uses a purely data-driven, learned curve mapping approach, which is more flexible and less dependent on explicit physical assumptions.
- Automatic Exposure Correction (e.g., Yuan and Sun [36], 2012): This method estimated a global S-shaped curve for an image using a global optimization algorithm.
  - Differentiation: Zero-DCE differs by being a purely data-driven method that learns pixel-wise curves and incorporates multiple light-enhancement factors through its non-reference loss functions, leading to broader dynamic range adjustment and lower computational burden.
3.2.2. Data-Driven Methods
These methods leverage large datasets and deep neural networks for learning enhancement mappings.
- CNN-based (Supervised) Methods:
  - Concept: These methods train Convolutional Neural Networks (CNNs) on paired data, where each low-light image has a corresponding well-exposed ground-truth (reference) image. The network learns a mapping from low-light to well-lit.
  - Data Collection Challenge: Collecting such paired data is resource-intensive.
    - LLNet [20] (Lore et al., 2017) was trained on data simulated using random Gamma correction (a simple nonlinear adjustment).
    - The LOL dataset [32] (Wei et al., 2018) collected paired low/normal-light images by altering camera exposure time and ISO.
    - The MIT-Adobe FiveK dataset [3] (Bychkovsky et al., 2011) comprises 5,000 raw images, each with five expert-retouched versions.
  - Examples:
    - RetinexNet [32] (Chen Wei et al., 2018) employed a deep Retinex decomposition for low-light enhancement, trained on paired data.
    - Wang et al. [28] (Ruixing Wang et al., 2019) proposed an underexposed photo enhancement network that estimates an illumination map, also trained on paired expert-retouched data.
  - Limitations: High cost of data collection, potential for factitious and unrealistic training data, and poor generalization to diverse real-world lighting conditions, often producing artifacts and color casts.
  - Differentiation: Zero-DCE completely eliminates the need for paired data, addressing the generalization issue and reducing data collection costs.
- GAN-based (Unsupervised) Methods:
  - Concept: Generative Adversarial Networks (GANs) [38] (Zhu et al., 2017, often cited for unpaired image-to-image translation via CycleGAN) learn to generate realistic images without requiring perfectly paired input-output examples. They use a generator network (which creates enhanced images) and a discriminator network (which tries to distinguish real well-lit images from generated enhanced images).
  - Examples: EnlightenGAN [12] (Jiang et al., 2019) is a pioneering unsupervised GAN-based method for low-light image enhancement, learning from unpaired low/normal-light data. It uses carefully designed discriminators and loss functions.
  - Limitations: While eliminating paired data, GANs still require careful selection of unpaired training data and can be challenging to train, sometimes leading to unstable results or mode collapse.
  - Differentiation: Zero-DCE goes a step further than EnlightenGAN by requiring zero reference data at all, neither paired nor unpaired. This simplifies the training process and reduces reliance on specific data distributions.
3.3. Technological Evolution
The field of low-light image enhancement has evolved from traditional signal processing techniques to sophisticated deep learning models. Initially, methods focused on global histogram adjustments or physical image models like Retinex theory. While effective to some extent, these often struggled with local variations, noise, or computational complexity. The advent of deep learning brought about a paradigm shift, allowing models to learn complex, non-linear mappings directly from data.
The first wave of deep learning methods (CNN-based) largely relied on supervised learning with paired datasets. This approach demonstrated impressive results in controlled settings but faced significant hurdles in data acquisition and generalization to diverse real-world scenarios. The subsequent development of unsupervised learning with GANs (GAN-based methods) addressed the paired data dependency by enabling training with unpaired datasets. However, unpaired data still needs to be carefully curated, and GANs can be notoriously difficult to train.
This paper's work, Zero-DCE, represents a further evolution by moving towards zero-reference learning. It acknowledges the limitations of both paired and unpaired data paradigms and proposes a novel way to train a deep model using only the properties of desirable enhanced images encoded into non-reference loss functions, effectively making the model self-supervised in a unique way. This positions Zero-DCE at the forefront of data-independent deep learning for image enhancement.
3.4. Differentiation Analysis
Compared to the main methods discussed in related work, Zero-DCE offers core differences and innovations:
- Zero-Reference Training: This is the most significant differentiator. Unlike CNN-based methods that require paired low/normal-light images or GAN-based methods that need unpaired low/normal-light datasets, Zero-DCE is trained without any reference images at all. This completely liberates the model from expensive and potentially problematic data collection, making it highly practical and robust against overfitting to specific datasets.
- Image-Specific Curve Estimation vs. Direct Image-to-Image Mapping: Instead of learning a direct end-to-end mapping from a low-light image to an enhanced one (as many CNNs and GANs do), Zero-DCE learns to estimate pixel-wise, higher-order curves. These curves then perform the actual enhancement. This indirect approach provides a more interpretable and controllable mechanism for dynamic range adjustment, similar to traditional curve adjustments in photo editing, but learned adaptively.
- Novel Curve Design: The LE-curve itself is specifically designed to meet critical criteria: maintaining the pixel value range ([0, 1]), preserving monotonicity (for contrast), and ensuring differentiability (for training). The iterative and pixel-wise application of this curve provides a powerful and flexible adjustment capability.
- Differentiable Non-Reference Loss Functions: The paper's ability to train without reference images stems from its meticulously crafted non-reference loss functions. These losses (spatial consistency, exposure control, color constancy, illumination smoothness) implicitly define what a "good" enhanced image looks like, guiding the network without explicit examples. This is a key innovation over methods that rely on pixel-wise L1/L2 losses against ground truth or adversarial losses against discriminators.
- Efficiency and Generalization: The lightweight DCE-Net architecture combined with the simple curve mapping leads to high computational efficiency, enabling real-time processing. The zero-reference training, by avoiding dataset-specific biases, inherently promotes better generalization to diverse, unseen low-light conditions.
4. Methodology
The Zero-Reference Deep Curve Estimation (Zero-DCE) method formulates low-light image enhancement as an image-specific curve estimation problem. It uses a lightweight deep neural network, DCE-Net, to predict parameters for these curves, which are then iteratively applied to the input image to achieve enhancement. The entire process is trained without any reference images, relying solely on specially designed non-reference loss functions.
4.1. Principles
The core idea of Zero-DCE is to learn a set of adjustment curves that can map the pixel values of a low-light image to their enhanced counterparts. This approach is inspired by curve adjustment tools found in photo editing software. The theoretical basis or intuition is that low-light images primarily suffer from a narrow dynamic range and low brightness. By applying appropriate non-linear curves, the intensity values can be stretched and shifted, increasing brightness and contrast.
The key principles guiding the design are:
- Image-Specific Adjustment: The curves should be self-adaptive, meaning their parameters are determined solely by the input image, allowing flexible adjustment to different lighting conditions.
- Constraint-Aware Curve Design: The curves must:
  - Keep enhanced pixel values within a valid range (e.g., [0, 1]) to prevent information loss from overflow truncation.
  - Be monotonic to preserve the relative order of pixel intensities and thus maintain the contrast between neighboring pixels.
  - Be differentiable to enable gradient-based optimization during neural network training.
- Zero-Reference Learning: The model should be trainable without any paired or unpaired ground-truth reference images. This is achieved through a set of carefully crafted non-reference loss functions that indirectly quantify enhancement quality based on desirable image properties.
4.2. Core Methodology In-depth (Layer by Layer)
The framework of Zero-DCE consists of three main components: the Light-Enhancement Curve (LE-curve), the Deep Curve Estimation Network (DCE-Net), and the Non-Reference Loss Functions.
The overall framework is illustrated in Figure 2.

This figure is a schematic showing the framework of Zero-Reference Deep Curve Estimation (Zero-DCE). It depicts the input image I passing through the Deep Curve Estimation Network (DCE-Net) to produce the enhanced image, together with the curve parameter maps and example curves under different values of the parameter. The accompanying formula describes the recursive curve-estimation process.
Figure 2: The framework of Zero-Reference Deep Curve Estimation (Zero-DCE).
4.2.1. Light-Enhancement Curve (LE-curve)
The LE-curve is the fundamental building block for pixel value adjustment. It's designed to be simple, effective, and compliant with the three objectives mentioned above (range, monotonicity, differentiability).
4.2.1.1. Base LE-curve
The base LE-curve is defined as a quadratic curve:
$
LE(I(\mathbf{x}); \alpha) = I(\mathbf{x}) + \alpha I(\mathbf{x})(1 - I(\mathbf{x}))
$
(1)
Here, we explain the symbols in the formula:
- $\mathbf{x}$: Denotes the pixel coordinates in the image.
- $I(\mathbf{x})$: Represents the intensity value of the input pixel at coordinates $\mathbf{x}$. It is assumed to be normalized to the range [0, 1].
- $LE(I(\mathbf{x}); \alpha)$: Represents the enhanced intensity value of the pixel at coordinates $\mathbf{x}$ after applying the LE-curve.
- $\alpha$: Is a trainable curve parameter that lies within the range $[-1, 1]$. This parameter controls the magnitude and direction of the curve's adjustment, effectively governing the exposure level of the image. A positive $\alpha$ brightens the image, while a negative $\alpha$ darkens it.

The operations in Equation (1) are applied pixel-wise. The paper states that the LE-curve is applied separately to each of the three RGB channels instead of just an illumination channel (as in some Retinex models). This three-channel adjustment is intended to better preserve the inherent colors of the image and minimize the risk of over-saturation.
The quadratic form $I(1-I)$ ensures that the output remains within [0, 1] when $I \in [0, 1]$ and $\alpha \in [-1, 1]$. For $I \in [0, 1]$, the term $I(1-I)$ is non-negative and reaches its maximum of 0.25 at $I = 0.5$. With $\alpha = 1$, the output becomes $I + I(1-I) = 1 - (1-I)^2 \le 1$; with $\alpha = -1$, the output becomes $I - I(1-I) = I^2 \ge 0$. Hence the enhanced value never leaves [0, 1]. The term $I(1-I)$ is also smooth, which guarantees differentiability. The effect of different $\alpha$ values is illustrated in Figure 2(b).
4.2.1.2. Higher-Order Curve
To enable more flexible and robust adjustments for challenging low-light conditions, the LE-curve can be applied iteratively. This creates a higher-order curve with greater adjustment capability (i.e., higher curvature).
The iterative application is defined as: $LE_{n}(\mathbf{x}) = LE_{n-1}(\mathbf{x}) + \alpha_{n} LE_{n-1}(\mathbf{x})(1 - LE_{n-1}(\mathbf{x}))$ (2)
Here, we explain the symbols in the formula:
- $n$: Represents the number of iterations, i.e., applications of the LE-curve. It controls the overall curvature of the combined mapping.
- $LE_{n}(\mathbf{x})$: Is the enhanced pixel value after $n$ iterations.
- $LE_{n-1}(\mathbf{x})$: Is the enhanced pixel value from the previous iteration $(n-1)$. For the first iteration, $LE_{0}(\mathbf{x})$ is simply the input image $I(\mathbf{x})$.
- $\alpha_{n}$: Is the curve parameter for the $n$-th iteration.

In this paper, the number of iterations $n$ is set to 8, which is found to be satisfactory for most cases. When $n = 1$, Equation (2) reduces to the base LE-curve in Equation (1). Figure 2(c) visually demonstrates how higher-order curves (through iterative application) provide a more powerful adjustment capability than a single application.
4.2.1.3. Pixel-Wise Curve
While higher-order curves offer broader dynamic range adjustment, using a single $\alpha$ value for the entire image (a global adjustment) can lead to over-enhancement in already bright regions or under-enhancement in extremely dark areas. To address this, the parameter $\alpha$ is made pixel-wise, meaning each pixel can have its own best-fitting adjustment parameter.
The pixel-wise higher-order curve is reformulated as:
$
LE_{n}(\mathbf{x}) = LE_{n-1}(\mathbf{x}) + A_{n}(\mathbf{x}) LE_{n-1}(\mathbf{x})(1 - LE_{n-1}(\mathbf{x}))
$
(3)
Here, we explain the symbols in the formula:
- $A_{n}(\mathbf{x})$: Is the parameter map for the $n$-th iteration. This map has the same spatial dimensions as the input image, providing a unique curve parameter for each pixel $\mathbf{x}$.

The assumption here is that pixels within a local region often share similar intensity characteristics and thus should undergo similar adjustments. This pixel-wise approach allows for adaptive local enhancement while still preserving the monotonic relations between neighboring pixels (as the curve itself is monotonic), so the three objectives (range, monotonicity, differentiability) are still met. Figure 3 illustrates how pixel-wise curve parameter maps accurately reflect the brightness variations across an image and how they contribute to a well-enhanced result (a code sketch of this mapping is given after Figure 3).
This figure illustrates the low-light enhancement process, including the original input image and the estimated curve parameter maps: panel (a) is the input image, panels (b)-(d) show the averaged parameter maps for the three channels, and panel (e) shows the enhanced result, reflecting how Zero-DCE behaves under different illumination conditions.
Figure 3: An example of the estimated curve parameter maps of the three channels over 8 iterations, with values normalized to the range [0, 1]. The three maps represent the averaged best-fitting curve parameter maps for the R, G, and B channels, respectively, and are visualized as heatmaps.
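The pixel-wise curve mapping of Equations (2) and (3) is simple to implement. Below is a minimal PyTorch sketch (not the authors' released code); the function name `apply_le_curves` and the tensor layout, a (B, 3, H, W) image in [0, 1] and a (B, 24, H, W) stack of parameter maps, are illustrative assumptions.

```python
import torch

def apply_le_curves(image: torch.Tensor, curve_maps: torch.Tensor,
                    n_iterations: int = 8) -> torch.Tensor:
    """Iteratively apply the pixel-wise LE-curve of Eq. (3)."""
    # Split the 24 parameter channels into 8 per-iteration maps of 3 channels each.
    maps = torch.split(curve_maps, 3, dim=1)
    enhanced = image
    for n in range(n_iterations):
        A_n = maps[n]
        # LE_n = LE_{n-1} + A_n * LE_{n-1} * (1 - LE_{n-1})
        enhanced = enhanced + A_n * enhanced * (1.0 - enhanced)
    return enhanced
```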
4.2.2. DCE-Net
The Deep Curve Estimation Network (DCE-Net) is designed to learn the mapping from an input low-light image to its corresponding pixel-wise curve parameter maps $A$.
- Input: A low-light image.
- Output: A set of pixel-wise curve parameter maps. For 8 iterations ($n = 8$) and 3 RGB channels, the network outputs $8 \times 3 = 24$ parameter maps. Each map corresponds to $A_{n}^{c}$, where $n$ is the iteration index and $c$ is the channel (R, G, or B).
- Architecture: DCE-Net is a plain CNN comprising seven convolutional layers with symmetrical concatenation (skip connections).
  - Each convolutional layer uses 32 convolutional kernels (filters) of size 3x3 with stride 1.
  - Each convolutional layer except the last is followed by a ReLU activation function.
  - Crucially, down-sampling layers (like pooling) and batch normalization layers are explicitly discarded. This design choice preserves the intrinsic relations of neighboring pixels and avoids losing spatial information, which is important for pixel-wise adjustments.
  - The last convolutional layer is followed by a Tanh activation function. Tanh scales the output of the final layer to the range $(-1, 1)$, which matches the defined range of the curve parameters.
- Efficiency: DCE-Net is designed to be lightweight, containing only 79,416 trainable parameters and requiring 5.21 Giga floating-point operations (GFLOPs) for an input image of size 256x256x3. This makes it suitable for computationally limited devices, such as mobile platforms.
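For concreteness, the following is a minimal PyTorch sketch of a DCE-Net-style estimator matching the description above (seven 3x3 convolutions with 32 kernels, ReLU activations, no down-sampling or batch normalization, and a Tanh output of 24 parameter maps). The exact skip-connection wiring of the symmetrical concatenation is an assumption, not the authors' released implementation.

```python
import torch
import torch.nn as nn

class DCENetSketch(nn.Module):
    """A sketch of a 7-layer curve-parameter estimator in the spirit of DCE-Net."""

    def __init__(self, channels: int = 32, n_iterations: int = 8):
        super().__init__()
        self.relu = nn.ReLU(inplace=True)
        self.conv1 = nn.Conv2d(3, channels, 3, 1, 1)
        self.conv2 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.conv3 = nn.Conv2d(channels, channels, 3, 1, 1)
        self.conv4 = nn.Conv2d(channels, channels, 3, 1, 1)
        # Symmetrical concatenation: later layers also see earlier feature maps.
        self.conv5 = nn.Conv2d(channels * 2, channels, 3, 1, 1)
        self.conv6 = nn.Conv2d(channels * 2, channels, 3, 1, 1)
        self.conv7 = nn.Conv2d(channels * 2, 3 * n_iterations, 3, 1, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.relu(self.conv1(x))
        x2 = self.relu(self.conv2(x1))
        x3 = self.relu(self.conv3(x2))
        x4 = self.relu(self.conv4(x3))
        x5 = self.relu(self.conv5(torch.cat([x3, x4], dim=1)))
        x6 = self.relu(self.conv6(torch.cat([x2, x5], dim=1)))
        # Tanh keeps every curve parameter within (-1, 1).
        return torch.tanh(self.conv7(torch.cat([x1, x6], dim=1)))
```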
4.2.3. Non-Reference Loss Functions
To enable zero-reference learning (training without ground-truth images), a set of differentiable non-reference loss functions are formulated. These losses implicitly evaluate the quality of the enhanced images by enforcing desirable properties.
4.2.3.1. Spatial Consistency Loss
The spatial consistency loss ($L_{spa}$) encourages the enhanced image to maintain local structural coherence. It penalizes changes in the differences between neighboring regions from the input image to its enhanced version, thereby preserving local contrast and details.
The loss is expressed as: $L_{spa} = \frac{1}{K} \sum_{i=1}^{K} \sum_{j \in \Omega(i)} \left( |Y_{i} - Y_{j}| - |I_{i} - I_{j}| \right)^{2}$ (4)
Here, we explain the symbols in the formula:
- $K$: Represents the total number of local regions considered in the image.
- $i$: Denotes a specific local region.
- $\Omega(i)$: Indicates the four neighboring regions (top, down, left, right) relative to region $i$.
- $Y_{i}$, $Y_{j}$: Are the average intensity values of region $i$ and its neighboring region $j$ in the enhanced image.
- $I_{i}$, $I_{j}$: Are the average intensity values of region $i$ and its neighboring region $j$ in the input image.

The term $\left( |Y_{i} - Y_{j}| - |I_{i} - I_{j}| \right)^{2}$ measures the squared discrepancy between the absolute intensity differences of neighboring regions in the enhanced image and in the input image. By minimizing this, the loss ensures that local changes in brightness or contrast remain consistent with the original image's structure. The size of the local region is empirically set to 4x4.
4.2.3.2. Exposure Control Loss
The exposure control loss ($L_{exp}$) is designed to prevent under-exposed (too dark) or over-exposed (too bright) regions in the enhanced image by pushing local average intensities toward a desired well-exposedness level.
The loss is expressed as: $L_{exp} = \frac{1}{M} \sum_{k=1}^{M} |Y_{k} - E|$ (5)
Here, we explain the symbols in the formula:
- $M$: Represents the number of non-overlapping local regions of size 16x16 across the image.
- $k$: Denotes a specific local region.
- $Y_{k}$: Is the average intensity value of the local region $k$ in the enhanced image.
- $E$: Represents the target well-exposedness level. Following existing practices [23, 24], $E$ is set as a gray level in the RGB color space. In the experiments, $E$ is set to 0.6 (on a 0-1 scale), though values between 0.4 and 0.7 yield similar performance.

This loss measures the absolute difference between the average intensity of each local region and the target exposure level, encouraging uniform and appropriate brightness across the image.
4.2.3.3. Color Constancy Loss
The color constancy loss ($L_{col}$) aims to correct potential color deviations that might arise during enhancement and to establish a proper color balance among the three RGB channels. It is based on the Gray-World color constancy hypothesis [2], which states that, averaged over an entire image, the color of each sensor channel tends toward gray.
The loss is expressed as: $L_{col} = \sum_{\forall (p, q) \in \varepsilon} (J^{p} - J^{q})^{2}, \quad \varepsilon = \{(R, G), (R, B), (G, B)\}$ (6)
Here, we explain the symbols in the formula:
- $J^{p}$: Denotes the average intensity value of channel $p$ (e.g., Red) over the entire enhanced image.
- $J^{q}$: Denotes the average intensity value of channel $q$ (e.g., Green) over the entire enhanced image.
- $(p, q)$: Represents a pair of channels.
- $\varepsilon$: Is the set of all pairs of distinct channels in the RGB color space: (Red, Green), (Red, Blue), and (Green, Blue).

By minimizing the squared difference between the average intensities of all channel pairs, this loss encourages the overall color balance of the enhanced image to be achromatic, thereby correcting color casts.
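A minimal sketch of this Gray-World-style term, assuming a (B, 3, H, W) enhanced image:

```python
import torch

def color_constancy_loss(enhanced: torch.Tensor) -> torch.Tensor:
    # Image-wide mean intensity of each RGB channel: shape (B, 3).
    means = enhanced.mean(dim=(2, 3))
    r, g, b = means[:, 0], means[:, 1], means[:, 2]
    # Squared differences between every pair of channel means.
    return ((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2).mean()
```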
4.2.3.4. Illumination Smoothness Loss
The illumination smoothness loss ($L_{tv_{A}}$) is applied to the predicted curve parameter maps $A_{n}$. Its purpose is to encourage smooth transitions in the adjustment parameters across the image, which in turn helps preserve the monotonicity relations between neighboring pixels in the enhanced image and prevents artifacts (such as harsh edges or abrupt changes) in the output. It is a variant of the Total Variation (TV) loss.
The loss is defined as: $L_{tv_{A}} = \frac{1}{N} \sum_{n=1}^{N} \sum_{c \in \xi} \left( |\nabla_{x} A_{n}^{c}| + |\nabla_{y} A_{n}^{c}| \right)^{2}, \quad \xi = \{R, G, B\}$ (7)
Here, we explain the symbols in the formula:
- $N$: Represents the total number of iterations (set to 8 in this paper).
- $n$: Denotes the iteration index.
- $c$: Represents a specific color channel (Red, Green, or Blue).
- $\xi$: Is the set of all color channels $\{R, G, B\}$.
- $A_{n}^{c}$: Refers to the curve parameter map for the $n$-th iteration and channel $c$.
- $\nabla_{x}$: Represents the horizontal gradient operation. It measures the change in the parameter map values along the x-direction.
- $\nabla_{y}$: Represents the vertical gradient operation. It measures the change in the parameter map values along the y-direction.

This loss minimizes the summed squared magnitudes of the horizontal and vertical gradients of each parameter map across all iterations and channels. By doing so, it forces the parameter maps to be spatially smooth, ensuring that adjacent pixels receive similar adjustments unless there is a significant image feature boundary.
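A minimal sketch of a TV-style smoothness term over the stacked parameter maps (shape (B, 3N, H, W)); note it uses the common sum-of-squared-gradients variant rather than the exact squared-sum-of-absolute-gradients form of Eq. (7):

```python
import torch

def illumination_smoothness_loss(curve_maps: torch.Tensor) -> torch.Tensor:
    # Horizontal and vertical finite differences of every parameter map.
    dx = curve_maps[:, :, :, 1:] - curve_maps[:, :, :, :-1]
    dy = curve_maps[:, :, 1:, :] - curve_maps[:, :, :-1, :]
    # Penalize large gradients so neighboring pixels get similar curve parameters.
    return (dx ** 2).mean() + (dy ** 2).mean()
```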
4.2.3.5. Total Loss
The total loss ($L_{total}$) combines all the individual non-reference loss functions to guide the training of the DCE-Net.
The total loss is expressed as: $L_{total} = L_{spa} + L_{exp} + W_{col} L_{col} + W_{tv_{A}} L_{tv_{A}}$ (8)
Here, we explain the symbols in the formula:
- $L_{spa}$: Is the spatial consistency loss.
- $L_{exp}$: Is the exposure control loss.
- $L_{col}$: Is the color constancy loss.
- $L_{tv_{A}}$: Is the illumination smoothness loss.
- $W_{col}$: Is a weight coefficient for the color constancy loss, balancing its contribution to the total loss.
- $W_{tv_{A}}$: Is a weight coefficient for the illumination smoothness loss, balancing its contribution to the total loss.

The weights $W_{col}$ and $W_{tv_{A}}$ are empirically set to 0.5 and 20, respectively, to balance the scales of the different loss components. This weighted sum allows the model to optimize for multiple desirable image-enhancement properties simultaneously without requiring any reference images.
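Assuming the four loss sketches given earlier in this analysis are in scope, the total objective can be combined as follows, using the paper's weights W_col = 0.5 and W_tvA = 20:

```python
def total_loss(enhanced, original, curve_maps):
    # Weighted sum of the four non-reference losses (Eq. 8).
    return (spatial_consistency_loss(enhanced, original)
            + exposure_control_loss(enhanced)
            + 0.5 * color_constancy_loss(enhanced)
            + 20.0 * illumination_smoothness_loss(curve_maps))
```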
5. Experimental Setup
5.1. Datasets
The authors employed a variety of datasets for both training and evaluation to demonstrate the robustness and generalization capabilities of Zero-DCE.
5.1.1. Training Data
- SICE dataset Part1 [4]: The primary training data consists of 360 multi-exposure sequences from Part1 of the SICE dataset. This dataset is advantageous because it includes both low-light and over-exposed images, which helps the model learn a wide dynamic range adjustment capability. From this, 3,022 images of different exposure levels were randomly split into:
  - Training Set: 2,422 images.
  - Validation Set: The remaining images.
- Image Preprocessing: Training images were resized to 512x512 pixels.
5.1.2. Testing/Evaluation Data
The paper used several standard image sets from previous works for qualitative (visual) and perceptual comparisons, as well as a specific dataset for full-reference quantitative evaluation and another for a face detection task.
- For Qualitative and Perceptual Comparisons (No Reference Ground Truth):
  - NPE [29]: 84 images.
  - LIME [9]: 10 images.
  - MEF [22]: 17 images.
  - DICM [14]: 64 images.
  - VV: 24 images (a common benchmark, not explicitly referenced in the paper).
  These sets are used for the subjective User Study and Perceptual Index (PI) evaluations.
- For Full-Reference Quantitative Comparisons:
  - SICE dataset Part2 [4]: This subset contains 229 multi-exposure sequences, each with a corresponding reference (normal-light) image. For a fair comparison, only the low-light images from Part2 were selected for testing; specifically, the first three (for sequences of seven images) or four (for sequences of nine images) low-light images were chosen.
  - Image Preprocessing: All images were resized to a common resolution before evaluation.
  - Result: This process yielded 767 paired low/normal-light images for quantitative evaluation.
  - Exclusion: The low/normal-light image dataset from [37] (DARK FACE) was excluded from this comparison because it was used in the training of some baseline methods (RetinexNet [32] and EnlightenGAN [12]), which would lead to an unfair comparison. The MIT-Adobe FiveK dataset [3] was also not used, as it is not primarily designed for underexposed photo enhancement.
- For Face Detection in the Dark:
  - DARK FACE dataset [37]: This dataset contains 10,000 images taken in dark conditions, specifically for evaluating face detection in low light.
  - Evaluation Subset: Since bounding boxes for the test set were not publicly available at the time, evaluation was performed on the training and validation sets, comprising 6,000 images.
5.2. Evaluation Metrics
The paper uses a comprehensive set of metrics to evaluate the performance of Zero-DCE, including no-reference perceptual metrics, full-reference objective metrics, and task-specific metrics.
5.2.1. User Study (US)
- Conceptual Definition: A User Study is a subjective evaluation in which human subjects assess the visual quality of enhanced images. It aims to capture human perception and preference, which objective metrics sometimes fail to fully reflect. The subjects are typically trained to look for specific artifacts or desirable qualities.
- Measurement: Scores typically range from 1 (worst quality) to 5 (best quality), and average scores are reported. A higher US score indicates better perceptual quality.
- Context in Paper: 15 human subjects independently scored enhanced images (from the NPE, LIME, MEF, DICM, and VV sets) based on criteria such as under-/over-exposed artifacts, color deviation, and unnatural texture/noise.
5.2.2. Perceptual Index (PI)
- Conceptual Definition: The Perceptual Index (PI) is a no-reference perceptual quality metric used to quantify the visual quality of images without requiring a reference image. It is derived from a combination of no-reference image quality assessment (NR-IQA) measures; for instance, the PI of [1] combines the Naturalness Image Quality Evaluator (NIQE) with the learned no-reference score of Ma et al. It aims to measure how "natural" and visually pleasing an image appears.
- Mathematical Formula (Example for NIQE, a component of PI):
$
NIQE = \sqrt{(\mathbf{v} - \mathbf{v}_{prior})^{T} \mathbf{\Sigma}^{-1} (\mathbf{v} - \mathbf{v}_{prior})}
$
Here, we explain the symbols in the formula:
  - $\mathbf{v}$: Is a vector of natural scene statistics (NSS) features extracted from the test image.
  - $\mathbf{v}_{prior}$: Is a vector of mean NSS features learned from a database of pristine (high-quality, natural) images.
  - $\mathbf{\Sigma}$: Is the covariance matrix of NSS features learned from the pristine image database.
  - $(\cdot)^{T}$: Denotes the transpose of a vector.
  - $\mathbf{\Sigma}^{-1}$: Denotes the inverse of the covariance matrix.
  - The formula essentially measures the Mahalanobis distance between the NSS features of the test image and a model of natural image features.
- Interpretation: A lower PI value indicates better perceptual quality (closer to naturalness).
- Context in Paper: The paper cites [1, 21, 25] for PI, indicating it uses a standard no-reference metric to assess quality.
5.2.3. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a common full-reference metric used to quantify the quality of a reconstructed image compared to an original ground-truth (reference) image. It is most easily defined via the Mean Squared Error (MSE) and is typically expressed in decibels (dB). A higher PSNR value indicates a higher-quality (less noisy/distorted) reconstruction.
- Mathematical Formula:
$
MSE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2
$
$
PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right)
$
Here, we explain the symbols in the formula:
- I(i,j): Represents the pixel value at row $i$ and column $j$ of the original (reference) image.
- K(i,j): Represents the pixel value at row $i$ and column $j$ of the enhanced image.
- M, N: Represent the dimensions (rows and columns) of the image.
- MSE: Mean Squared Error, the average of the squared differences between corresponding pixels of the two images.
- $MAX_I$: The maximum possible pixel value of the image. For an 8-bit image, this is 255.
- Interpretation: Higher PSNR is better.
5.2.4. Structural Similarity (SSIM)
- Conceptual Definition: SSIM [31] is another full-reference metric that assesses the perceived quality of an image by considering three key aspects: luminance (brightness), contrast, and structure. Unlike PSNR, which primarily focuses on pixel-wise error, SSIM aims to capture characteristics of the human visual system, particularly its sensitivity to structural information. The SSIM index ranges from -1 to 1, with 1 indicating perfect similarity.
- Mathematical Formula:
$
SSIM(x, y) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)}
$
Here, we explain the symbols in the formula:
- x, y: Represent the two image patches being compared (from the reference and enhanced images, respectively).
- $\mu_x$, $\mu_y$: The averages (means) of patch $x$ and patch $y$, respectively.
- $\sigma_x^2$, $\sigma_y^2$: The variances of patch $x$ and patch $y$, respectively.
- $\sigma_{xy}$: The covariance of patch $x$ and patch $y$.
- $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$: Small constants to stabilize the division with a weak denominator, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1 = 0.01$, $k_2 = 0.03$ are common default values.
- Interpretation: Higher SSIM is better.
5.2.5. Mean Absolute Error (MAE)
- Conceptual Definition: MAE is a full-reference metric that measures the average magnitude of the errors between corresponding pixels in the enhanced image and the reference image, without considering their direction. It is a straightforward measure of accuracy.
- Mathematical Formula:
$
MAE = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} |I(i,j) - K(i,j)|
$
Here, we explain the symbols in the formula:
- I(i,j): Represents the pixel value at row $i$ and column $j$ of the original (reference) image.
- K(i,j): Represents the pixel value at row $i$ and column $j$ of the enhanced image.
- M, N: Represent the dimensions (rows and columns) of the image.
- Interpretation: Lower MAE is better.
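For reference, both PSNR and MAE can be computed in a few lines; this small sketch assumes 8-bit reference and enhanced images of identical shape:

```python
import numpy as np

def psnr_and_mae(reference: np.ndarray, enhanced: np.ndarray, max_val: float = 255.0):
    ref = reference.astype(np.float64)
    est = enhanced.astype(np.float64)
    mse = np.mean((ref - est) ** 2)
    mae = np.mean(np.abs(ref - est))
    # PSNR is undefined (infinite) for identical images, hence the guard.
    psnr = float("inf") if mse == 0 else 10.0 * np.log10((max_val ** 2) / mse)
    return psnr, mae
```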
5.2.6. Average Precision (AP)
- Conceptual Definition: Average Precision (AP) is a common metric used to evaluate the performance of object detection models. It summarizes the precision-recall curve into a single value. Precision is the ratio of correctly detected objects to all detected objects, while Recall is the ratio of correctly detected objects to all actual objects in the image. The P-R curve plots precision against recall at various confidence thresholds, and AP is typically calculated as the area under this curve. A higher AP indicates better overall detection performance.
- Context in Paper: Used to evaluate how well different enhancement methods improve the performance of a face detector (DSFD [18]) in low-light conditions.
5.3. Baselines
The Zero-DCE method was compared against a selection of state-of-the-art methods spanning different categories:
- Conventional Methods:
  - SRIE [8] (Xueyang Fu et al., CVPR 2016): A weighted variational model for simultaneous reflectance and illumination estimation; representative of Retinex-based approaches.
  - LIME [9] (Xiaojie Guo et al., TIP 2017): Low-light image enhancement via illumination map estimation; another well-known Retinex-based method.
  - Li et al. [19] (Mading Li et al., TIP 2018): Structure-revealing low-light image enhancement via a robust Retinex model; a more recent Retinex-based method that considers noise.
- CNN-based Methods (Supervised Deep Learning):
  - RetinexNet [32] (Chen Wei et al., BMVC 2018): Deep Retinex decomposition for low-light enhancement, trained on paired data.
  - Wang et al. [28] (Ruixing Wang et al., CVPR 2019): Underexposed photo enhancement using deep illumination estimation, also trained on paired data.
- GAN-based Methods (Unsupervised Deep Learning):
  - EnlightenGAN [12] (Yifan Jiang et al., 2019): Deep light enhancement without paired supervision, trained using unpaired low/normal-light data.

These baselines are representative because they cover the major categories of low-light image enhancement techniques (conventional, supervised deep learning, unsupervised deep learning) and include recent, high-performing methods from prestigious conferences and journals. The authors ensured a fair comparison by reproducing results using publicly available source code and recommended parameters.
5.4. Implementation Details
- Framework: Implemented using PyTorch.
- Hardware: An NVIDIA 2080Ti GPU was used for training.
- Batch Size: A batch size of 8 was applied during training.
- Parameter Initialization: The filter weights of each layer in DCE-Net were initialized from a zero-mean Gaussian with a standard deviation of 0.02. Bias terms were initialized as a constant.
- Optimizer: The ADAM optimizer was used for network optimization with its default parameters.
- Learning Rate: A fixed learning rate of 1e-4 was used.
- Loss Weights: The weights for the color constancy loss ($W_{col}$) and the illumination smoothness loss ($W_{tv_{A}}$) were empirically set to 0.5 and 20, respectively, to balance the contributions of the different loss components to the total loss.
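Assuming the DCENetSketch module defined earlier, the training setup described above can be reproduced roughly as follows (the bias constant of 0 is an assumption, since the paper only states that bias terms are initialized to a constant):

```python
import torch

def gaussian_init(module: torch.nn.Module) -> None:
    # Zero-mean Gaussian (std 0.02) for conv weights; constant (assumed 0) for biases.
    if isinstance(module, torch.nn.Conv2d):
        torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
        if module.bias is not None:
            torch.nn.init.constant_(module.bias, 0.0)

model = DCENetSketch()
model.apply(gaussian_init)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)  # fixed learning rate of 1e-4
```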
6. Results & Analysis
The experimental results demonstrate the superiority of Zero-DCE across various qualitative, quantitative, and perceptual evaluations, and also highlight its efficiency and practical benefits for high-level tasks.
6.1. Core Results Analysis
6.1.1. Visual and Perceptual Comparisons
The following are the visual comparisons on typical low-light images from Figure 7 of the original paper:

This figure compares different low-light image enhancement methods. The input image is shown at the top, followed by the results of several methods, including SRIE, LIME, and Li et al., with notable regions marked by red boxes; the result of Zero-DCE is shown last.
Figure 7: Visual comparisons on typical low-light images. Red boxes indicate the obvious differences.
- Analysis of Figure 7:
  - First Example (Top Row): A challenging back-lit scenario with a face in shadow.
    - Zero-DCE produces a natural exposure and reveals clear details on the face, effectively combating the extreme backlight.
    - SRIE, LIME, Wang et al., and EnlightenGAN fail to clearly recover the face, leaving it under-enhanced.
    - RetinexNet suffers from over-exposed artifacts in other regions, indicating poor dynamic range control.
  - Second Example (Bottom Row): An indoor scene.
    - Zero-DCE successfully enhances the dark regions while preserving the original colors of the input image. The result is visually pleasing with no obvious noise or color casts.
    - Li et al. oversmoothes details, suggesting a loss of texture.
    - Other baseline methods amplify noise or introduce color deviation (e.g., in the wall's color).

These visual comparisons strongly validate Zero-DCE's ability to produce high-quality, natural-looking enhanced images, especially in complex lighting conditions and for specific regions like faces.
The following are the results from Table 1 of the original paper:
| Method | NPE | LIME | MEF | DICM | VV | Average |
|---|---|---|---|---|---|---|
| SRIE [8] | 3.65/2.79 | 3.50/2.76 | 3.22/2.61 | 3.42/3.17 | 2.80/3.37 | 3.32/2.94 |
| LIME [9] | 3.78/3.05 | 3.95/3.00 | 3.71/2.78 | 3.31/3.35 | 3.21/3.03 | 3.59/3.04 |
| Li et al. [19] | 3.80/3.09 | 3.78/3.02 | 2.93/3.61 | 3.47/3.43 | 2.87/3.37 | 3.37/3.72 |
| RetinexNet [32] | 3.30/3.18 | 2.32/3.08 | 2.80/2.86 | 2.88/3.24 | 1.96/2.95 | 2.58/3.06 |
| Wang et al. [28] | 3.83/2.83 | 3.82/2.90 | 3.13/2.72 | 3.44/3.20 | 2.95/3.42 | 3.43/3.01 |
| EnlightenGAN [12] | 3.90/2.96 | 3.84/2.83 | 3.75/2.45 | 3.50/3.13 | 3.17/4.71 | 3.63/3.22 |
| Zero-DCE | 3.81/2.84 | 3.80/2.76 | 4.13/2. | 3.78/2. | 3.54/2. | 3.81/2. |
Table 1: User study (US)/Perceptual index (PI) scores on the image sets (NPE, LIME, MEF, DICM, VV). Higher US score is better, lower PI is better. The best result is in red whereas the second best one is in blue under each case. (Note: The original table has incomplete PI values for Zero-DCE, represented here as given in the paper).
- Analysis of Table 1 (User Study and Perceptual Index):
  - Zero-DCE achieves the highest average User Study (US) score (3.81) across all 202 testing images, indicating it is the most favored by human subjects in terms of visual quality.
  - For the MEF, DICM, and VV sets, Zero-DCE clearly outperforms the other methods with the highest US scores.
  - In terms of Perceptual Index (PI), where lower values are better, Zero-DCE also shows competitive performance, often achieving the lowest or near-lowest PI values (the provided table has truncated PI values for Zero-DCE, but the trend suggests superiority). This confirms that Zero-DCE produces images that are perceived as more natural and of higher quality, aligning with human judgment.
6.1.2. Quantitative Comparisons
The following are the results from Table 2 of the original paper:
| Method | PSNR↑ | SSIM↑ | MAE↓ |
|---|---|---|---|
| SRIE [8] | 14.41 | 0.54 | 127.08 |
| LIME [9] | 16.17 | 0.57 | 108.12 |
| Li et al. [19] | 15.19 | 0.54 | 114.21 |
| RetinexNet [32] | 15.99 | 0.53 | 104.81 |
| Wang et al. [28] | 13.52 | 0.49 | 142.01 |
| EnlightenGAN [12] | 16.21 | 0.59 | 102.78 |
| Zero-DCE | 16.57 | 0.59 | 98.78 |
Table 2: Quantitative comparisons in terms of full-reference image quality assessment metrics. The best result is in red whereas the second best one is in blue under each case.
- Analysis of Table 2 (Full-Reference Metrics):
  - Despite not using any paired or unpaired training data (zero-reference), Zero-DCE achieves the best values for all three full-reference image quality assessment metrics on the SICE Part2 subset:
    - PSNR (Peak Signal-to-Noise Ratio): 16.57 dB (highest), indicating the highest pixel-level fidelity.
    - SSIM (Structural Similarity): 0.59 (tied for highest with EnlightenGAN), suggesting excellent preservation of structural information.
    - MAE (Mean Absolute Error): 98.78 (lowest), indicating the smallest average pixel-wise difference from the ground truth.
  - This quantitative superiority, especially in PSNR and MAE, is remarkable given the zero-reference training paradigm, demonstrating that the non-reference loss functions are highly effective in driving the network toward high-quality enhancements that correlate well with objective measures.
6.1.3. Runtime Comparisons
The following are the results from Table 3 of the original paper:
| Method | RT | Platform |
|---|---|---|
| SRIE [8] | 12.1865 | MATLAB (CPU) |
| LIME [9] | 0.4914 | MATLAB (CPU) |
| Li et al. [19] | 90.7859 | MATLAB (CPU) |
| RetinexNet [32] | 0.1200 | TensorFlow (GPU) |
| Wang et al. [28] | 0.0210 | TensorFlow (GPU) |
| EnlightenGAN [12] | 0.0078 | PyTorch (GPU) |
| Zero-DCE | 0.0025 | PyTorch (GPU) |
Table 3: Runtime (RT) comparisons (in second). The best result is in red whereas the second best one is in blue.
- Analysis of Table 3 (Runtime):
  - Zero-DCE is the most computationally efficient method, with a runtime of 0.0025 seconds per image on a GPU. This is significantly faster than all other methods, including other deep learning-based approaches.
  - This high efficiency (about 500 FPS) is attributed to its simple curve mapping form and lightweight DCE-Net architecture, making Zero-DCE highly suitable for real-time applications and deployment on resource-constrained devices.
6.1.4. Face Detection in the Dark
The following are the results from Figure 8 of the original paper:

This figure shows face detection performance in the dark before and after enhancement with Zero-DCE. The upper part shows P-R curves relating precision and recall for the different methods; the lower part shows detection results on the original and enhanced images, highlighting the benefit of enhancement.
Figure 8: The performance of face detection in the dark. PR curves, the AP, and two examples of face detection before and after enhanced by our Zero-DCE.
- Analysis of Figure 8 (Face Detection):
  - The precision-recall (P-R) curves clearly show that image enhancement significantly increases the precision of the Dual Shot Face Detector (DSFD) [18] compared to using raw, unenhanced images, demonstrating the practical benefit of low-light enhancement for high-level computer vision tasks.
  - Among the enhancement methods, RetinexNet [32] and Zero-DCE perform the best, showing comparable performance. However, Zero-DCE exhibits better performance in the high-recall region, meaning it can detect a larger proportion of actual faces while maintaining good precision.
  - The Average Precision (AP) values (indicated in the graph) further confirm the improvement; Zero-DCE's AP is higher than that of most other methods.
  - The visual examples at the bottom of Figure 8 show that Zero-DCE effectively brightens faces in extremely dark regions while preserving well-lit areas, directly contributing to the improved face detection performance.
6.2. Ablation Studies / Parameter Analysis
The authors conducted several ablation studies to analyze the contribution of each component of Zero-DCE and the effect of its parameter settings.
6.2.1. Contribution of Each Loss
The following are the results from Figure 4 of the original paper:

This figure illustrates the ablation results for low-light enhancement: (a) the input image, (b) the full Zero-DCE result, and (c)-(f) the results obtained after removing, respectively, the spatial consistency loss, the exposure control loss, the color constancy loss, and the illumination smoothness loss, demonstrating the importance of each loss to the final result.
Figure 4: Ablation study of the contribution of each loss (spatial consistency loss $L_{spa}$, exposure control loss $L_{exp}$, color constancy loss $L_{col}$, illumination smoothness loss $L_{tv_{A}}$).
- Analysis of Figure 4: This study demonstrates the importance of each non-reference loss function (a minimal implementation sketch of two of these losses follows this list):
  - Without Spatial Consistency Loss ($L_{spa}$): The enhanced image (Figure 4c) shows relatively lower contrast (e.g., in the cloud regions) compared to the full result (Figure 4b). This confirms that $L_{spa}$ is crucial for preserving the differences of neighboring regions and thus maintaining local contrast.
  - Without Exposure Control Loss ($L_{exp}$): The variant (Figure 4d) fails to recover the low-light region adequately, remaining visibly dark. This highlights the role of $L_{exp}$ in controlling the overall exposure level and preventing under-exposure.
  - Without Color Constancy Loss ($L_{col}$): The result (Figure 4e) exhibits severe color casts. This indicates that $L_{col}$ is essential for correcting color deviations and establishing color balance among the RGB channels, preventing an unnatural hue.
  - Without Illumination Smoothness Loss ($L_{tv_{\mathcal{A}}}$): The enhanced image (Figure 4f) shows obvious artifacts and a lack of smoothness. This confirms that $L_{tv_{\mathcal{A}}}$ is vital for preserving monotonicity relations between neighboring pixels and smoothing the parameter maps, preventing abrupt changes and unnatural textures in the output.

  This ablation study rigorously validates the necessity and effectiveness of each designed non-reference loss function for achieving high-quality, natural-looking image enhancement.
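As referenced above, here is a minimal PyTorch sketch of two of these losses, the exposure control loss and the color constancy loss, written from the formulations summarized in this analysis. The 16×16 patch size and the well-exposedness target E = 0.6 follow the paper's reported settings, while the function names and tensor layout are illustrative assumptions rather than the authors' code.

```python
import torch
import torch.nn.functional as F

def exposure_control_loss(enhanced: torch.Tensor, patch_size: int = 16, well_exposed: float = 0.6) -> torch.Tensor:
    """Penalize local average intensities that deviate from the well-exposedness level E."""
    gray = enhanced.mean(dim=1, keepdim=True)              # (B, 1, H, W) average intensity
    local_mean = F.avg_pool2d(gray, patch_size)            # mean over non-overlapping patches
    return torch.mean(torch.abs(local_mean - well_exposed))

def color_constancy_loss(enhanced: torch.Tensor) -> torch.Tensor:
    """Gray-World prior: channel-wise means of the enhanced image should stay close to one another."""
    mean_rgb = enhanced.mean(dim=(2, 3))                   # (B, 3)
    r, g, b = mean_rgb[:, 0], mean_rgb[:, 1], mean_rgb[:, 2]
    return torch.mean((r - g) ** 2 + (r - b) ** 2 + (g - b) ** 2)
```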
6.2.2. Effect of Parameter Settings
The following are the results from Figure 5 of the original paper:

The figure shows enhancement results for the same low-light image under different Zero-DCE parameter settings: the input image (a) and the results (b)-(f) generated with different configurations, where the numbers denote the number of convolutional layers, the number of feature maps per layer, and the number of iterations, e.g., 3-32-8 and 7-32-16.
Figure 5: Ablation study of the effect of parameter settings. $l$-$f$-$n$ represents the proposed Zero-DCE with $l$ convolutional layers, $f$ feature maps of each layer (except the last layer), and $n$ iterations.
- Analysis of Figure 5: This study explores the impact of DCE-Net's depth ($l$), width ($f$), and the number of iterations ($n$) of the LE-curve application (a sketch of the chosen 7-32-8 configuration appears after this list):
  - 3-32-8 (3 layers, 32 feature maps, 8 iterations, Figure 5b): Even with a minimal network depth of 3 convolutional layers, the model produces satisfactory results. This suggests the inherent effectiveness of the zero-reference learning approach and the LE-curve design itself.
  - 7-32-8 (7 layers, 32 feature maps, 8 iterations, Figure 5c): This configuration yields the most visually pleasing results, with natural exposure and proper contrast. It is the chosen final model, representing a good balance.
  - 7-32-16 (7 layers, 32 feature maps, 16 iterations, Figure 5e): Increasing the number of iterations from 8 to 16 does not bring a significant improvement over 7-32-8, suggesting that 8 iterations are sufficient for most cases.
  - 7-32-1 (7 layers, 32 feature maps, 1 iteration, Figure 5d): A significant decrease in performance is observed; the image remains quite dark and under-enhanced. This confirms the necessity of higher-order curves (i.e., multiple iterations) for powerful dynamic range adjustment, as a single iteration offers limited adjustment capability.
  - 7-64-8 (7 layers, 64 feature maps, 8 iterations, Figure 5f): Increasing the network width to 64 feature maps instead of 32 does not lead to a noticeable visual improvement over 7-32-8.

  The findings justify the choice of 7-32-8 as the final model due to its optimal trade-off between efficiency and restoration performance.
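For concreteness, below is a PyTorch sketch of what the 7-32-8 configuration looks like: seven 3×3 convolutional layers with 32 feature maps, symmetric skip concatenations, and a final layer producing 3 × 8 = 24 curve-parameter channels squashed to [-1, 1] with tanh. The class name and the exact skip wiring are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class DCENetSketch(nn.Module):
    """Sketch of a 7-32-8 DCE-Net-like network: 7 conv layers, 32 feature maps,
    outputting 3 * 8 pixel-wise curve-parameter channels in [-1, 1]."""

    def __init__(self, features: int = 32, iterations: int = 8):
        super().__init__()
        def conv(c_in, c_out):
            return nn.Conv2d(c_in, c_out, kernel_size=3, padding=1)
        self.c1 = conv(3, features)
        self.c2 = conv(features, features)
        self.c3 = conv(features, features)
        self.c4 = conv(features, features)
        self.c5 = conv(features * 2, features)          # takes concat of c3 and c4 outputs
        self.c6 = conv(features * 2, features)          # takes concat of c2 and c5 outputs
        self.c7 = conv(features * 2, 3 * iterations)    # takes concat of c1 and c6 outputs
        self.act = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x1 = self.act(self.c1(x))
        x2 = self.act(self.c2(x1))
        x3 = self.act(self.c3(x2))
        x4 = self.act(self.c4(x3))
        x5 = self.act(self.c5(torch.cat([x3, x4], dim=1)))
        x6 = self.act(self.c6(torch.cat([x2, x5], dim=1)))
        return torch.tanh(self.c7(torch.cat([x1, x6], dim=1)))   # per-pixel curve maps
```

The maps produced here would then feed an LE-curve application step such as the `apply_le_curves` sketch shown earlier.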
6.2.3. Impact of Training Data
The following are the results from Figure 6 of the original paper:

The figure compares low-light enhancement under different training-data settings: the input image (a), the result of Zero-DCE (b), and the results of Zero-DCE_Low (c), Zero-DCE_LargeL (d), and Zero-DCE_LargeLH (e), all applied to the same low-light scene to compare how each variant improves brightness and detail.
Figure 6: Ablation study on the impact of training data.
- Analysis of Figure 6: This study investigates how the composition of the training data affects Zero-DCE's performance.
  - Input Image (Figure 6a): Original low-light image.
  - Zero-DCE (trained on 2,422 multi-exposure images, Figure 6b): The baseline Zero-DCE result, showing good enhancement.
  - Zero-DCE_Low (trained on only the 900 low-light images from the original set, Figure 6c): When trained with only low-light images (and fewer of them), the model tends to over-enhance the well-lit regions (e.g., the face), while dark regions are still enhanced.
  - Zero-DCE_LargeL (trained on 9,000 unlabeled low-light images from the DARK FACE dataset, Figure 6d): Even with a larger quantity of only low-light images, the over-enhancement of bright regions persists.
  - Zero-DCE_LargeLH (trained on 4,800 multi-exposure images from the augmented SICE Part1+Part2 set, Figure 6e): When trained with a larger amount of multi-exposure data (including both low-light and over-exposed samples), the model achieves even better recovery of dark regions and balanced enhancement across the image.

  This ablation clearly demonstrates the rationality and necessity of including multi-exposure training data (both low-light and over-exposed) in the training process. Training on only low-light images, even in large quantity, can lead to over-enhancement of regions that are already sufficiently lit. The diverse exposure levels in the training data help Zero-DCE learn to adaptively adjust dynamic ranges across a wider spectrum of brightness. The authors also note that, for fair comparison with other deep learning methods, they used a comparable amount of training data, although more data could further improve performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduced Zero-Reference Deep Curve Estimation (Zero-DCE), a novel and highly effective method for low-light image enhancement. The core innovation lies in reformulating the enhancement task as an image-specific curve estimation problem, tackled by a lightweight DCE-Net. The LE-curve design ensures output within a valid pixel range, monotonicity for contrast preservation, and differentiability for end-to-end training. A paradigm-shifting aspect is its zero-reference training, which eliminates the need for any paired or unpaired ground-truth images. This is achieved through a meticulously designed set of non-reference loss functions (spatial consistency, exposure control, color constancy, illumination smoothness) that implicitly guide the network towards desirable enhancement qualities. Zero-DCE demonstrated state-of-the-art performance both qualitatively and quantitatively against existing methods, proved to be exceptionally efficient (real-time processing), and showed significant practical benefits for downstream high-level vision tasks such as face detection in the dark.
7.2. Limitations & Future Work
The authors acknowledge potential areas for future improvement:
- Semantic Information for Hard Cases: The current approach primarily focuses on pixel-level and local-region properties. Introducing semantic information (e.g., understanding of objects and scene context) could help solve more hard cases where generic illumination adjustment is insufficient or ambiguous.
- Effects of Noise: The paper mentions considering the effects of noise as a future direction. Low-light images often suffer from amplified noise during brightening, so strategies that simultaneously enhance and denoise would be a valuable extension.
7.3. Personal Insights & Critique
The Zero-DCE paper offers several profound insights and inspires critical reflection:
- Novelty of Zero-Reference Learning: The zero-reference training paradigm is a significant advancement. It addresses the most critical bottleneck in supervised image enhancement, namely the reliance on expensive and often artificial reference data, which makes Zero-DCE highly practical for real-world applications where ground truth is scarce or impossible to obtain. The success of this approach highlights the power of carefully designed non-reference loss functions that encode human perceptual priors or desirable image characteristics, and it opens up avenues for other image-to-image translation tasks where reference data is a major limitation.
- Elegance of Curve Estimation: Reframing image enhancement as image-specific curve estimation is elegant. It offers a more interpretable and controllable adjustment mechanism than opaque end-to-end image-to-image mappings. The iterative, pixel-wise application of the LE-curve provides a flexible yet constrained transformation space, ensuring properties like monotonicity and a valid pixel range that are critical for natural results (a short check of these two properties follows this list). This approach could potentially be transferred to other image adjustment tasks, such as contrast enhancement or tone mapping, where explicit curve control is beneficial.
- Efficiency for Real-time Applications: The lightweight DCE-Net and simple curve mapping result in exceptional computational efficiency. This makes Zero-DCE highly suitable for deployment on resource-constrained devices such as smartphones and for real-time video enhancement, broadening its applicability significantly.
- Impact on High-Level Vision Tasks: Demonstrating the improvement in face detection in low-light conditions is a strong validation of Zero-DCE's practical value beyond aesthetic improvement. It underscores that effective low-level image processing can have a profound impact on the performance of high-level computer vision systems.
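As a short check of the monotonicity and range-preservation properties mentioned above, they follow directly from the first-order LE-curve as defined in the paper; the derivation below is elementary but makes clear why iterating the curve can never push pixel values out of $[0, 1]$.

$$
LE(x;\alpha) = x + \alpha\, x\,(1-x), \qquad x \in [0,1],\ \alpha \in [-1,1],
$$

$$
\frac{\partial LE}{\partial x} = 1 + \alpha\,(1-2x) \;\ge\; 1 - |1-2x| \;\ge\; 0,
$$

so each curve is monotonically non-decreasing in $x$. Together with the endpoint values $LE(0;\alpha)=0$ and $LE(1;\alpha)=1$, this gives $LE(x;\alpha)\in[0,1]$ for every $x\in[0,1]$, and the same argument applies to each higher-order iteration, since its input again lies in $[0,1]$.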
Potential Issues or Areas for Improvement:
- Fragility of Loss Functions: While the non-reference loss functions are a strength, their effectiveness relies heavily on their precise formulation and the chosen weights. If the assumptions encoded in these losses (e.g., the Gray-World hypothesis behind the color constancy loss, or the fixed exposure target) do not hold for a particular low-light scenario, the model's generalization may still be limited in those edge cases. Further research could explore adaptive weighting of losses or more sophisticated perceptual losses that better mimic human perception without explicit reference.
- Handling Extreme Noise: As acknowledged by the authors, noise amplification is a common issue in low-light enhancement. While the illumination smoothness loss helps prevent artifacts in the parameter maps, it does not explicitly address image noise. Integrating a denoising component or noise-aware curve estimation could further improve robustness.
- Interpretability of the Parameter Maps: While the LE-curve itself is interpretable, the pixel-wise parameter maps are network outputs. Further analysis of why the network produces specific values for different regions could yield deeper insights into the learning process and potentially lead to even more refined curve designs.

Overall, Zero-DCE is a highly impactful paper that presents an elegant, efficient, and data-independent solution to a pervasive problem, pushing the boundaries of deep learning in image enhancement.