Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement
TL;DR Summary
This paper introduces a novel one-stage Retinex framework (ORF) for low-light image enhancement. By estimating illumination and restoring corruptions, combined with an Illumination-Guided Transformer (IGT), Retinexformer significantly outperforms state-of-the-art methods across thirteen benchmarks.
Abstract
When enhancing low-light images, many deep learning algorithms are based on the Retinex theory. However, the Retinex model does not consider the corruptions hidden in the dark or introduced by the light-up process. Besides, these methods usually require a tedious multi-stage training pipeline and rely on convolutional neural networks, showing limitations in capturing long-range dependencies. In this paper, we formulate a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates the illumination information to light up the low-light image and then restores the corruption to produce the enhanced image. We design an Illumination-Guided Transformer (IGT) that utilizes illumination representations to direct the modeling of non-local interactions of regions with different lighting conditions. By plugging IGT into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative and qualitative experiments demonstrate that our Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks. The user study and application on low-light object detection also reveal the latent practical values of our method. Code, models, and results are available at https://github.com/caiyuanhao1998/Retinexformer
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement
1.2. Authors
Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, Yulun Zhang. The authors are affiliated with Tsinghua University, University of Würzburg, and ETH Zürich.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server. While not a peer-reviewed journal or conference proceeding itself, arXiv is a widely used platform for disseminating research in fields like computer science, physics, and mathematics, often preceding formal publication. The publication date suggests it was likely submitted to a major computer vision conference or journal in 2023.
1.4. Publication Year
2023
1.5. Abstract
The paper addresses the limitations of many deep learning algorithms for low-light image enhancement that are based on the Retinex theory. These limitations include not accounting for corruptions hidden in dark areas or introduced during the brightening process, requiring multi-stage training pipelines, and relying on convolutional neural networks (CNNs) which struggle with long-range dependencies. To overcome these, the authors propose a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates illumination to brighten the image and then restores corruptions. A key component is the Illumination-Guided Transformer (IGT), which uses illumination representations to manage non-local interactions across regions with varying lighting. Integrating IGT into ORF yields Retinexformer. Extensive quantitative and qualitative experiments on thirteen benchmarks demonstrate that Retinexformer significantly surpasses state-of-the-art methods. Furthermore, a user study and application in low-light object detection highlight its practical utility.
1.6. Original Source Link
https://arxiv.org/abs/2303.06705v3 (Preprint)
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the effective enhancement of low-light images. This task is crucial for improving human visual perception and the performance of downstream computer vision tasks, such as nighttime object detection, which often suffer significantly in poor lighting conditions.
Existing approaches face several challenges:
-
Corruption Handling: Many deep learning algorithms based on the Retinex theory assume images are corruption-free, which is inconsistent with real low-light scenes that inevitably contain noise, artifacts, and color distortions (e.g., from high ISO or long exposure). Additionally, the enhancement process itself can introduce or amplify these corruptions (e.g., under-/over-exposure, color distortion).
-
Multi-stage Training: Traditional Retinex-based deep learning methods often employ a tedious multi-stage training pipeline, where different CNNs are trained separately for decomposition, denoising, and illumination adjustment before being finetuned together. This process is time-consuming and complex.
-
Long-range Dependencies: Most existing deep learning methods heavily rely on convolutional neural networks (CNNs). While effective for local feature extraction, CNNs have inherent limitations in capturing long-range dependencies and non-local self-similarity across an image, which are critical for holistic image restoration and enhancement. The high computational cost of directly applying global Vision Transformers restricts their full potential in this domain.
The paper's entry point and innovative idea revolve around re-evaluating the Retinex theory to explicitly account for corruptions and integrating the power of Transformers in a computationally efficient manner within a streamlined, one-stage framework.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of low-light image enhancement:
- First Transformer-based Algorithm for Low-light Image Enhancement: The paper proposes Retinexformer, pioneering the full exploration of Transformer models for this task and overcoming the computational limitations that previously restricted their application.
- One-stage Retinex-based Framework (ORF): The authors formulate ORF, a simple yet principled framework that revises the traditional Retinex model by introducing perturbation terms to explicitly model corruptions (noise, artifacts, under-/over-exposure, color distortion) inherent in low-light images and introduced during enhancement. This framework allows for an easy one-stage, end-to-end training process, addressing the "tedious multi-stage training pipeline" issue of prior Retinex-based deep learning methods.
- Illumination-Guided Multi-head Self-Attention (IG-MSA): A novel self-attention mechanism, IG-MSA, is designed. It leverages illumination information (represented by light-up features) as a crucial guide for modeling long-range dependencies. This mechanism enables effective interaction between regions with different lighting conditions and significantly reduces computational complexity compared to global self-attention, making Transformers viable for dense image tasks.
- State-of-the-Art Performance: Retinexformer, by plugging the IGT into ORF, demonstrates significantly superior quantitative and qualitative performance over state-of-the-art methods across an extensive suite of thirteen diverse low-light image enhancement datasets. Notably, it achieves improvements of over 6 dB PSNR on severely corrupted datasets such as SID and SDSD.
- Demonstrated Practical Value: Beyond objective metrics, the paper validates the practical utility of Retinexformer through a comprehensive user study, showing high subjective visual quality. Furthermore, its application as a pre-processing step for low-light object detection significantly boosts detection accuracy, revealing its value for high-level computer vision applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand the Retinexformer paper, a reader should be familiar with several fundamental concepts in image processing and deep learning:
-
Low-light Image Enhancement (LLIE): This is the overarching task addressed. It aims to transform an underexposed, dark, or noisy image into a well-exposed, clear, and visually appealing image, often referred to as a "normal-light" image. The goal is to improve visibility, contrast, and color fidelity.
-
Retinex Theory:
- Conceptual Definition: Proposed by Edwin Land, the Retinex (a portmanteau of "retina" and "cortex") theory posits that the perceived color of an object is determined by its reflectance properties rather than by the absolute amount of light reaching the eye. In image processing, this translates to decomposing an image into two components:
- Reflectance (R): This represents the intrinsic properties of the scene objects (e.g., color, texture) and is independent of illumination. It is often considered the "true" enhanced image.
- Illumination (L): This represents the amount and distribution of light incident on the scene.
- Mathematical Representation: The original Retinex model expresses a given image $\mathbf{I}$ as the element-wise product of its reflectance $\mathbf{R}$ and illumination $\mathbf{L}$: $\mathbf{I} = \mathbf{R} \odot \mathbf{L}$, where $\odot$ denotes element-wise multiplication. The goal of Retinex-based enhancement is to estimate $\mathbf{L}$ from $\mathbf{I}$ and then derive $\mathbf{R}$ (the enhanced image) by dividing $\mathbf{I}$ by $\mathbf{L}$ (or multiplying by the inverse of $\mathbf{L}$). A short numerical sketch of this decomposition appears after this list.
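To make the decomposition concrete, here is a minimal NumPy sketch (illustrative values, not from the paper) that forms a low-light image as $\mathbf{I} = \mathbf{R} \odot \mathbf{L}$ and recovers the reflectance by multiplying with the inverse illumination:

```python
import numpy as np

# Toy reflectance (scene content) and a dark, spatially varying illumination map.
rng = np.random.default_rng(0)
R = rng.uniform(0.2, 1.0, size=(4, 4, 3))            # reflectance, per RGB channel
L = np.linspace(0.05, 0.6, 16).reshape(4, 4, 1)       # single-channel illumination, broadcast over channels

# Retinex image formation: I = R ⊙ L (element-wise product).
I = R * L

# Enhancement via the inverse illumination ("light-up map") L_bar = 1 / L.
L_bar = 1.0 / L
R_recovered = I * L_bar

print(np.allclose(R, R_recovered))  # True: multiplying by L_bar undoes the darkening
```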
-
Convolutional Neural Networks (CNNs):
- Conceptual Definition: CNNs are a class of deep neural networks primarily used for analyzing visual imagery. They are characterized by convolutional layers that apply learnable filters (kernels) to input data, pooling layers that reduce spatial dimensions, and fully connected layers for classification or regression.
- Strengths: CNNs excel at capturing local patterns and hierarchical features due to their local receptive fields and weight sharing. They have been highly successful in many image processing tasks.
- Limitations (relevant to this paper): Their local nature means they inherently struggle to capture long-range dependencies (i.e., relationships between pixels or regions that are far apart in an image) and non-local self-similarity without large receptive fields or complex architectural designs.
-
Transformer:
- Conceptual Definition: Originally introduced for natural language processing (NLP), the Transformer is a neural network architecture that relies primarily on a mechanism called self-attention to weigh the importance of different parts of the input data. Unlike recurrent neural networks (RNNs), Transformers process input sequences in parallel, making them highly efficient and capable of modeling very long-range dependencies.
- Vision Transformer (ViT): Transformers were adapted for computer vision by treating image patches as sequences of tokens, similar to words in NLP.
- Computational Cost Issue: A significant challenge with vanilla Transformers in vision tasks, especially for high-resolution images, is that the computational complexity of global self-attention is quadratic with respect to the input spatial size (HW), making it computationally expensive and memory-intensive.
-
Self-Attention Mechanism (Core of Transformer):
- Conceptual Definition: Self-attention allows a model to weigh the importance of different parts of the input sequence when processing a specific element. For each element, it computes a score for every other element, indicating how much "attention" should be paid to it.
- Mathematical Formula (Standard Multi-Head Attention): The core of the self-attention mechanism involves three learnable linear projections: Query ($\mathbf{Q}$), Key ($\mathbf{K}$), and Value ($\mathbf{V}$). Given an input sequence of $n$ tokens (e.g., tokens from image patches), these projections are generated. The attention output is then computed as:
$
\mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\mathrm{T}}{\sqrt{d_k}}\right)\mathbf{V}
$
- $\mathbf{Q}$: Query matrix, derived from the input, representing what we are looking for.
- $\mathbf{K}$: Key matrix, derived from the input, representing what is available.
- $\mathbf{V}$: Value matrix, derived from the input, containing the information to be aggregated.
- $n$: Number of tokens (e.g., patches in an image).
- $d_k$: Dimension of keys and queries.
- $\mathbf{K}^\mathrm{T}$: Transpose of the key matrix.
- $\mathrm{softmax}$: Normalization function that converts scores into probability distributions.
- $\sqrt{d_k}$: Scaling factor to prevent large dot products from pushing the softmax into regions with tiny gradients.
- Multi-head Self-Attention (MSA): This extends self-attention by performing the attention mechanism multiple times in parallel with different linear projections (heads). The outputs from these heads are then concatenated and linearly transformed to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions.
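As a reference for the standard formulation above, the following is a minimal PyTorch sketch of scaled dot-product attention with a naive multi-head wrapper; the tensor sizes and variable names are illustrative assumptions, not code from the paper:

```python
import torch
import torch.nn.functional as F

def attention(Q, K, V):
    # Q, K, V: (batch, heads, n_tokens, d_k)
    d_k = Q.shape[-1]
    scores = Q @ K.transpose(-2, -1) / d_k ** 0.5   # (batch, heads, n, n) similarity scores
    weights = F.softmax(scores, dim=-1)             # attention distribution over tokens
    return weights @ V                              # weighted aggregation of values

# Toy multi-head usage: 2 heads, 16 tokens, model dim 32 (d_k = 16 per head).
x = torch.randn(1, 16, 32)
W_q = torch.nn.Linear(32, 32, bias=False)
W_k = torch.nn.Linear(32, 32, bias=False)
W_v = torch.nn.Linear(32, 32, bias=False)

def split_heads(t, heads=2):
    b, n, c = t.shape
    return t.reshape(b, n, heads, c // heads).transpose(1, 2)

out = attention(split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x)))
out = out.transpose(1, 2).reshape(1, 16, 32)        # concatenate heads back together
print(out.shape)  # torch.Size([1, 16, 32])
```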
-
U-Net Architecture:
- Conceptual Definition: U-Net is a convolutional network architecture primarily designed for image segmentation, but widely adopted for various image-to-image translation tasks. It features an encoder-decoder structure with symmetric downsampling and upsampling paths, connected by skip connections.
- Structure:
- Encoder (Downsampling Path): Consists of repeated application of convolutional layers, followed by pooling layers (e.g., max-pooling or strided convolutions) to reduce spatial dimensions while increasing feature channels. This captures contextual information.
- Decoder (Upsampling Path): Consists of upsampling operations (e.g., transposed convolutions), followed by convolutional layers. This path gradually recovers the spatial resolution.
- Skip Connections: Direct connections from the encoder path to the corresponding (same spatial resolution) decoder path. These are crucial for propagating fine-grained details lost during downsampling, enabling precise localization.
3.2. Previous Works
The paper discusses various categories of previous low-light image enhancement methods:
-
Plain Methods:
- Description: These are simple, global image processing techniques that directly manipulate pixel intensities or histograms to improve visibility. Examples include histogram equalization [1, 8, 12, 40, 41] and Gamma Correction (GC) [19, 42, 53].
- Drawbacks: They often produce undesirable artifacts (e.g., over-enhancement, loss of detail, unnatural appearance) because they do not consider the underlying illumination factors or local image content. They treat the image uniformly, leading to perceptually inconsistent results.
-
Traditional Cognition Methods (Retinex-based):
- Description: These methods leverage the Retinex theory to decompose an image into reflectance and illumination components, focusing on estimating the illumination map. The reflectance component is then considered the enhanced image. They use hand-crafted priors to constrain the decomposition.
- Examples: Guo et al. [18] (LIME) proposed refining the initial estimated illumination map by imposing a structure prior. Jobson et al. [23, 24] developed Multi-scale Retinex. Fu et al. [15] used a weighted variational model.
- Drawbacks: They assume low-light images are corruption-free, which is unrealistic. This leads to severe noise amplification and color distortion in the enhanced images. Their reliance on hand-crafted priors also limits their generalization ability and often requires careful parameter tuning.
-
Deep Learning Methods: These represent the current state-of-the-art and are primarily CNN-based.
- Category 1: Direct Mapping CNNs:
- Description: These methods train a CNN to directly learn an end-to-end mapping from a low-light image to its normal-light counterpart, without explicit reliance on image formation models like Retinex.
- Examples: LLNet [33], MBLLEN [35], EnGAN [22] (which uses GANs for unpaired training).
- Drawbacks: They often lack interpretability and may struggle to generalize to unseen lighting conditions or image corruptions, as they learn a "brute-force" mapping without explicit physical principles.
- Category 2: Retinex-based CNNs:
- Description: These methods integrate the Retinex theory into deep learning frameworks, using CNNs to estimate or refine reflectance and illumination.
- Examples: RetinexNet [54], KinD [66], DeepUPE [49], RUAS [30].
- Drawbacks:
- Multi-stage Training Pipeline: A major limitation is the tedious multi-stage training pipeline. For instance, RetinexNet uses separate CNNs for decomposition, denoising, and illumination adjustment, which are initially trained independently and then finetuned. This is time-consuming and complex.
- Corruption Oversight: Methods like DeepUPE [49] focus on predicting the illumination map but do not explicitly consider the corruption factors (noise, artifacts) hidden in the dark, leading to amplified noise and color distortion when brightening.
- CNN Limitations: Like other CNN-based methods, they are limited in capturing long-range dependencies and non-local self-similarity, which are crucial for consistent image restoration across the entire image.
- Category 1: Direct Mapping CNNs:
-
Vision Transformers for Image Restoration:
- Description: Recent advancements have seen Transformers applied to low-level vision tasks like image restoration, showing promise for capturing global dependencies.
- Examples: IPT [11], Uformer [52], Restormer [60].
- Drawbacks (relevant to this paper): Directly applying vanilla Vision Transformers to high-resolution image tasks often incurs unaffordable computational cost, as global self-attention is quadratic in the input spatial size (HW). This led to hybrid approaches like SNR-Net [57], which only employs a single global Transformer layer at the lowest spatial resolution of a U-shaped CNN, thus not fully exploring the Transformer's potential.
3.3. Technological Evolution
The evolution of low-light image enhancement can be traced as follows:
-
Early 20th Century - Theory: The conceptual foundation laid by Retinex theory (Land, 1960s-70s), positing image decomposition into reflectance and illumination.
-
Late 20th Century - Traditional Algorithms: Development of traditional Retinex-based algorithms (e.g., Multi-scale Retinex by Jobson et al., 1990s) and simple pixel manipulation methods (e.g., histogram equalization, gamma correction). These relied on hand-crafted priors and often had limitations in handling noise and generalization.
-
2010s - Deep Learning Era (CNNs): With the rise of deep learning, CNNs began to dominate.
- Initially, direct end-to-end mapping CNNs emerged, learning arbitrary functions from low-light to normal-light images (e.g., LLNet, MBLLEN).
- Soon after, CNNs were integrated with the Retinex theory (e.g., RetinexNet, KinD, DeepUPE), aiming to leverage the interpretable decomposition. However, these often suffered from multi-stage training.
-
Early 2020s - Transformer Integration: The success of Transformers in NLP led to their adoption in computer vision (Vision Transformers). Initial applications to image restoration demonstrated their power in capturing long-range dependencies, but their high computational cost for high-resolution images necessitated hybrid models (e.g., SNR-Net) that used Transformers sparingly.
Retinexformer fits into this timeline by pushing the boundaries of Transformer integration. It addresses the key shortcomings of previous Retinex-based deep learning methods (multi-stage training, corruption handling) and CNN-based methods (long-range dependencies) by proposing a one-stage, Retinex-inspired framework that fully leverages a computationally efficient, illumination-guided Transformer.
3.4. Differentiation Analysis
Compared to the main methods in related work, Retinexformer introduces several core differences and innovations:
-
Explicit Corruption Modeling within Retinex Framework:
- Previous: Traditional and many deep Retinex-based methods (e.g., RetinexNet, DeepUPE) largely ignore or simplify the
corruptions (noise, artifacts, exposure issues) present in low-light images or introduced during the brightening process. They primarily focus on estimating "clean" reflectance and illumination. - Retinexformer Innovation: The paper explicitly reformulates the Retinex model by introducing
perturbation terms ($\hat{\mathbf{R}}$, $\hat{\mathbf{L}}$) for both reflectance and illumination. This allows the model to account for hidden corruptions and those caused by "light-up," enabling a dedicated corruption restorer ($\mathcal{R}$) to handle them effectively.
-
One-Stage End-to-End Training:
- Previous: Most Retinex-based deep learning methods (e.g., RetinexNet, KinD) require
tedious multi-stage training pipelines, involving separate training phases for decomposition, denoising, and adjustment, followed by finetuning. - Retinexformer Innovation:
ORF is designed as a one-stage framework, allowing the entire model to be trained end-to-end. This significantly simplifies the training process, making it more efficient and less prone to compounding errors from independent stages.
-
Full Integration of Transformers with Illumination Guidance:
- Previous: CNN-based methods inherently struggle with
long-range dependencies. While some hybrid methods like SNR-Net introduced Transformers, they used them sparingly (e.g., a single global Transformer layer at the lowest resolution) due to the quadratic computational complexity of vanilla global self-attention. - Retinexformer Innovation: The paper designs
the Illumination-Guided Transformer (IGT) and specifically Illumination-Guided Multi-head Self-Attention (IG-MSA). This mechanism: - Models Long-Range Dependencies: It fully integrates Transformer capabilities throughout the network (within
IGAB units at multiple scales), exploring their potential more thoroughly than hybrid approaches. - Addresses Computational Cost:
IG-MSA reduces the computational complexity from quadratic to linear with respect to spatial size by treating single-channel feature maps as tokens. - Leverages Illumination Information: Crucially,
IG-MSA uses illumination representations (the light-up feature $\mathbf{F}_{lu}$) to explicitly direct the computation of self-attention. This guides the model to understand and interact between regions with different lighting conditions, making the attention mechanism highly relevant to the low-light enhancement task. This contextual guidance is a key differentiator from generic Transformers.
-
Robust Illumination Map Estimation:
-
Previous: Some methods estimate the illumination map $\mathbf{L}$, which then requires element-wise division of $\mathbf{I}$ by $\mathbf{L}$ to obtain the reflectance. This operation is numerically unstable and prone to
data overflow or amplification of small errors when values of $\mathbf{L}$ are close to zero.
Retinexformer Innovation:
ORFestimates thelight-up map(such that ), enabling element-wise multiplication () for brightening. This is a morerobustand numerically stable operation, avoiding the pitfalls of division.In essence,
Retinexformerinnovatively combines the interpretability of a revised Retinex model with the global modeling power of a computationally efficient and illumination-aware Transformer, all within a streamlined one-stage training paradigm.
-
4. Methodology
4.1. Principles
The core idea of Retinexformer is to build a simple yet principled One-stage Retinex-based Framework (ORF) that explicitly models corruptions in low-light images and leverages Transformer's ability to capture long-range dependencies, guided by illumination information.
The theoretical basis and intuition are as follows:
- Revised Retinex Model: The traditional Retinex theory assumes images are perfect. However, real low-light images are inherently corrupted by noise, artifacts, and color distortion due to imaging conditions (e.g., high ISO, long exposure). Furthermore, the process of brightening a dark image can amplify existing corruptions or introduce new ones like under-/over-exposure. The paper's principle is to
revise the Retinex model by adding perturbation terms to both reflectance and illumination, explicitly acknowledging and modeling these corruptions. This allows a dedicated corruption restorer to clean the image after initial brightening. - One-Stage Training: Instead of the tedious multi-stage pipelines common in previous Retinex-based deep learning methods, the paper aims for an
end-to-end, one-stage training process. This is achieved by formulating the entire enhancement as a single trainable network, where illumination estimation and corruption restoration are integrated. - Illumination-Guided Non-local Interactions: CNNs struggle with
long-range dependencies, which are crucial for consistent enhancement across an entire image (e.g., maintaining color consistency or denoising patches that are far apart). Transformers are excellent at this but are computationally expensive for high-resolution images. The principle here is to design a Transformer that is computationally efficient and, crucially, guided by illumination information. The intuition is that the illumination map provides valuable contextual cues about different lighting conditions across the image, which can direct the Transformer's attention mechanism to effectively model interactions between well-lit and dark regions.
4.2. Core Methodology In-depth (Layer by Layer)
The Retinexformer algorithm is built upon the One-stage Retinex-based Framework (ORF) and employs an Illumination-Guided Transformer (IGT) as its corruption restorer.
The overall architecture is illustrated in Figure 2 (a) from the original paper.
The figure is a schematic of the Retinexformer architecture, showing the illumination estimator and the corruption restorer, together with the input image, illumination map, and enhanced image, as well as the components of the illumination-guided attention block and the multi-head self-attention mechanism.
4.2.1. One-stage Retinex-based Framework (ORF)
The ORF is formulated by first revising the standard Retinex theory to account for corruptions, and then defining an illumination estimator ($\mathcal{E}$) and a corruption restorer ($\mathcal{R}$).
1. Revised Retinex Model:
The classical Retinex theory states that a low-light image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ (where H, W are height and width, and 3 is for RGB channels) can be decomposed into a reflectance image $\mathbf{R} \in \mathbb{R}^{H \times W \times 3}$ and a single-channel illumination map $\mathbf{L} \in \mathbb{R}^{H \times W}$ as:
$
\mathbf{I} = \mathbf{R} \odot \mathbf{L}
$
Here, $\odot$ denotes element-wise multiplication.
The authors point out that this model assumes $\mathbf{I}$ is corruption-free, which is often not true in real low-light scenarios. Corruptions arise from two main sources:
-
Hidden Corruptions: High-ISO and long-exposure settings, common in dark scenes, inevitably introduce noise and artifacts (modeled by the perturbation $\hat{\mathbf{R}}$).
-
Introduced Corruptions: The light-up process itself can amplify existing noise/artifacts and cause new issues like under-/over-exposure and color distortion (modeled by the perturbation $\hat{\mathbf{L}}$).
To model these corruptions, the authors reformulate the equation by introducing
perturbation terms $\hat{\mathbf{R}}$ for reflectance and $\hat{\mathbf{L}}$ for illumination: $ \mathbf{I} = (\mathbf{R} + \hat{\mathbf{R}}) \odot (\mathbf{L} + \hat{\mathbf{L}}) $ Expanding this, we get: $ \mathbf{I} = \mathbf{R} \odot \mathbf{L} + \mathbf{R} \odot \hat{\mathbf{L}} + \hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}}) $ The paper regards $\mathbf{R}$ as the desired well-exposed image. To "light up" the low-light image $\mathbf{I}$, it is element-wise multiplied by a light-up map $\bar{\mathbf{L}}$ such that $\bar{\mathbf{L}} \odot \mathbf{L} = 1$ (meaning $\bar{\mathbf{L}}$ acts as an inverse of $\mathbf{L}$). Multiplying both sides of the reformulated equation by $\bar{\mathbf{L}}$: $ \mathbf{I} \odot \bar{\mathbf{L}} = \mathbf{R} + \mathbf{R} \odot (\hat{\mathbf{L}} \odot \bar{\mathbf{L}}) + (\hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}})) \odot \bar{\mathbf{L}} $ Let's break down the terms on the right-hand side: -
$\mathbf{R}$: The desired clean reflectance.
-
$\mathbf{R} \odot (\hat{\mathbf{L}} \odot \bar{\mathbf{L}})$: This term represents corruptions like under-/over-exposure and color distortion caused by potential errors or perturbations in the illumination estimation/light-up process ($\hat{\mathbf{L}}$ is the perturbation in illumination, and $\bar{\mathbf{L}}$ is the light-up map).
-
$(\hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}})) \odot \bar{\mathbf{L}}$: This term represents the noise and artifacts initially hidden in the dark scene ($\hat{\mathbf{R}}$) which are amplified by the brightening process (multiplication by $\bar{\mathbf{L}}$).
This expression is then simplified into a more compact form: $ \mathbf{I}_{lu} = \mathbf{I} \odot \bar{\mathbf{L}} = \mathbf{R} + \mathbf{C} $ Here:
-
: The
lit-up image $\mathbf{I}_{lu}$, which is the low-light image after initial brightening. This image still contains corruptions.
: The
overall corruption term, encompassing all the noise, artifacts, under-/over-exposure, and color distortion.
2. ORF Architecture:
The ORF then defines the overall enhancement process as:
$
(\mathbf{I}_{lu}, \mathbf{F}_{lu}) = \mathcal{E}(\mathbf{I}, \mathbf{L}_p)
$
$
\mathbf{I}_{en} = \mathcal{R}(\mathbf{I}_{lu}, \mathbf{F}_{lu})
$
Where:
- : The
illumination estimator. It takes the low-light input image and anillumination prior mapas inputs.- , where calculates the mean pixel value along the channel dimension (i.e., grayscale version of the input image), serving as an initial estimate of illumination.
- outputs two components: the
lit-up image(calculated as ) and alight-up feature(where is the feature channel dimension).
- : The
corruption restorer. It takes thelit-up imageand thelight-up featureas inputs. - : The final
enhanced image, produced by after restoring the corruptions.
3. Illumination Estimator ($\mathcal{E}$) Details: The architecture of $\mathcal{E}$ is shown in Figure 2 (a) (i).
- It first concatenates the input low-light image $\mathbf{I}$ and its illumination prior map $\mathbf{L}_p$.
- A
conv1x1(convolution with kernel size 1x1) layer is used to fuse these concatenated inputs. - A
depth-wise separable conv5x5layer is then applied. This layer is chosen to model interactions between regions with different lighting conditions, generating thelight-up feature. The paper notes that well-exposed regions can provide semantic context for under-exposed regions. - Finally, another
conv1x1layer aggregates to produce thelight-up map. Importantly, is designed as athree-channel RGB tensor(instead of a single-channel one as in some prior works) to enhance its capacity to simulate non-linearity across RGB channels for better color enhancement. - This is then used to compute .
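The following is a minimal PyTorch sketch of an illumination estimator following this description (conv1x1 fusion, a 5x5 depth-wise convolution standing in for the depth-wise separable layer, and a conv1x1 that outputs the three-channel light-up map); the feature dimension and layer names are illustrative assumptions, not the authors' released code:

```python
import torch
import torch.nn as nn

class IlluminationEstimator(nn.Module):
    """Sketch of E: takes the low-light image, forms the prior L_p, and outputs (I_lu, F_lu)."""
    def __init__(self, feat_dim=40):
        super().__init__()
        self.fuse = nn.Conv2d(3 + 1, feat_dim, kernel_size=1)           # conv1x1 fusing [I, L_p]
        self.dw = nn.Conv2d(feat_dim, feat_dim, kernel_size=5,
                            padding=2, groups=feat_dim)                  # 5x5 depth-wise conv -> F_lu
        self.to_map = nn.Conv2d(feat_dim, 3, kernel_size=1)              # conv1x1 -> 3-channel light-up map

    def forward(self, img):
        prior = img.mean(dim=1, keepdim=True)                            # L_p: channel-wise mean of the input
        f_lu = self.dw(self.fuse(torch.cat([img, prior], dim=1)))        # light-up feature F_lu
        l_bar = self.to_map(f_lu)                                        # light-up map (RGB)
        i_lu = img * l_bar                                               # lit-up image: I_lu = I ⊙ L_bar
        return i_lu, f_lu

est = IlluminationEstimator()
i_lu, f_lu = est(torch.rand(1, 3, 64, 64))
print(i_lu.shape, f_lu.shape)  # torch.Size([1, 3, 64, 64]) torch.Size([1, 40, 64, 64])
```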
Discussion on ORF:
- Estimation of $\bar{\mathbf{L}}$ vs. $\mathbf{L}$: The paper highlights a crucial design choice: estimating $\bar{\mathbf{L}}$ (the inverse illumination) instead of $\mathbf{L}$ (the illumination map itself). If $\mathbf{L}$ were estimated, the lit-up image would be obtained by element-wise division of $\mathbf{I}$ by $\mathbf{L}$. This operation is numerically unstable because pixel values in $\mathbf{L}$ can be very small or even zero, leading to
data overflow or amplification of small computational errors. By modeling $\bar{\mathbf{L}}$ and using element-wise multiplication ($\mathbf{I} \odot \bar{\mathbf{L}}$), the process becomes much more robust.
ORFexplicitly considers both illumination estimation errors () and hidden noise (), using to restore all of them.
4.2.2. Illumination-Guided Transformer (IGT)
The Illumination-Guided Transformer (IGT) serves as the corruption restorer in ORF. It is designed to overcome the limitations of CNNs in capturing long-range dependencies while mitigating the high computational cost of vanilla Transformers.
1. Network Structure:
As shown in Figure 2 (a) (ii), IGT adopts a three-scale U-shaped architecture [44].
- Input: The
lit-up image $\mathbf{I}_{lu}$ (output of the illumination estimator $\mathcal{E}$) is the input to IGT.
- first passes through a
conv3x3layer. - This is followed by an
IGAB(Illumination-Guided Attention Block). - Then, a
strided conv4x4layer is used to downscale features. - This pattern (two
IGABs followed by astrided conv4x4) repeats to generate hierarchical features , where represents the scale. - The deepest level () passes through two more
IGABs.
- first passes through a
- Upsampling Branch:
- A symmetrical structure is used, employing
deconv2x2(transposed convolution with stride=2) to upscale features. Skip connectionsare utilized from the downsampling branch to alleviate information loss and transfer fine-grained details.
- A symmetrical structure is used, employing
- Output: The upsampling branch produces a
residual image. - Final Enhanced Image: The
enhanced image is obtained by summing the lit-up image and the residual image: $ \mathbf{I}_{en} = \mathbf{I}_{lu} + \mathbf{I}_{re} $
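A compact PyTorch skeleton of the three-scale U-shaped layout described above (strided conv4x4 downsampling, deconv2x2 upsampling, skip connections, residual output); the IGAB here is a plain convolutional placeholder and the channel widths are assumptions, not the published architecture:

```python
import torch
import torch.nn as nn

class IGAB(nn.Module):
    """Placeholder for the Illumination-Guided Attention Block (IG-MSA + feed-forward in the paper)."""
    def __init__(self, dim):
        super().__init__()
        self.body = nn.Sequential(nn.Conv2d(dim, dim, 3, padding=1), nn.GELU(),
                                  nn.Conv2d(dim, dim, 3, padding=1))
    def forward(self, x):
        return x + self.body(x)

class IGTSkeleton(nn.Module):
    """Three-scale U-shaped restorer: lit-up image -> residual image -> enhanced image."""
    def __init__(self, dim=40):
        super().__init__()
        self.embed = nn.Conv2d(3, dim, 3, padding=1)
        self.enc1, self.down1 = IGAB(dim), nn.Conv2d(dim, dim * 2, 4, stride=2, padding=1)
        self.enc2, self.down2 = IGAB(dim * 2), nn.Conv2d(dim * 2, dim * 4, 4, stride=2, padding=1)
        self.bottleneck = IGAB(dim * 4)
        self.up2, self.dec2 = nn.ConvTranspose2d(dim * 4, dim * 2, 2, stride=2), IGAB(dim * 2)
        self.up1, self.dec1 = nn.ConvTranspose2d(dim * 2, dim, 2, stride=2), IGAB(dim)
        self.out = nn.Conv2d(dim, 3, 3, padding=1)

    def forward(self, i_lu):
        x1 = self.enc1(self.embed(i_lu))
        x2 = self.enc2(self.down1(x1))
        x3 = self.bottleneck(self.down2(x2))
        y2 = self.dec2(self.up2(x3) + x2)   # skip connection from the encoder
        y1 = self.dec1(self.up1(y2) + x1)   # skip connection from the encoder
        return i_lu + self.out(y1)          # I_en = I_lu + I_re (residual output)

net = IGTSkeleton()
print(net(torch.rand(1, 3, 64, 64)).shape)  # torch.Size([1, 3, 64, 64])
```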
2. Illumination-Guided Multi-head Self-Attention (IG-MSA):
The IG-MSA module is the key component within each IGAB and the core innovation for enabling efficient, illumination-aware Transformer operations. Its details are shown in Figure 2 (c).
- Input: The
input feature(from theIGAB) is reshaped into tokens . - Head Splitting: is then split into heads: $ \mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_k] $ Where and . Each corresponds to the input for .
- Query, Key, Value Projections: For each , three
fully connected (fc)layers (without bias) linearly project into query (), key (), and value () elements: $ \mathbf{Q}_i = \mathbf{X}i \mathbf{W}{\mathbf{Q}_i}^\mathrm{T} $ $ \mathbf{K}_i = \mathbf{X}i \mathbf{W}{\mathbf{K}_i}^\mathrm{T} $ $ \mathbf{V}_i = \mathbf{X}i \mathbf{W}{\mathbf{V}_i}^\mathrm{T} $ Where . are the learnable parameters of thefclayers, and denotes matrix transpose. - Illumination Guidance: The
light-up feature(estimated by ) is crucial here. It encodes illumination information and interactions of regions with different lighting conditions. This feature is also reshaped into and split into heads: $ \mathbf{Y} = [\mathbf{Y}_1, \mathbf{Y}_2, \dots, \mathbf{Y}_k] $ Where . - Illumination-Guided Self-Attention Calculation: For each , the self-attention is formulated as:
$
\operatorname{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i, \mathbf{Y}_i) = (\mathbf{Y}_i \odot \mathbf{V}_i) \operatorname{softmax}\left(\frac{\mathbf{K}_i^\mathrm{T} \mathbf{Q}_i}{\alpha_i}\right)
$
Here:
- : Element-wise multiplication. This is where the illumination guidance happens: the
value elementsare modulated by theillumination features. This means that the information aggregated by the attention mechanism is weighted by the illumination context, allowing regions with different lighting conditions to interact more effectively. - : A learnable parameter that adaptively scales the matrix multiplication . This helps control the sharpness of the attention distribution.
- : Element-wise multiplication. This is where the illumination guidance happens: the
- Output: After computing attention for all heads, their outputs are concatenated. This concatenated result then passes through another
fclayer. Finally, apositional encoding(learnable parameters) is added to produce the output tokens . These tokens are then reshaped back to the output feature .
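The following is a hedged PyTorch sketch of the IG-MSA computation described above: attention is computed across the channel dimension (so the cost is linear in HW) and the value tokens are modulated element-wise by the light-up feature. The shapes, the softmax axis, and the learnable scale are illustrative assumptions rather than the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IGMSA(nn.Module):
    """Sketch of IG-MSA: d_k x d_k channel attention, values modulated by the light-up feature Y."""
    def __init__(self, dim=40, heads=4):
        super().__init__()
        self.heads, self.dk = heads, dim // heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.alpha = nn.Parameter(torch.ones(heads, 1, 1))   # learnable scale per head
        self.proj = nn.Linear(dim, dim)

    def _heads(self, t, b, n):
        # (B, HW, C) -> (B, heads, HW, d_k)
        return t.reshape(b, n, self.heads, self.dk).permute(0, 2, 1, 3)

    def forward(self, x, f_lu):
        # x, f_lu: (B, C, H, W) feature maps of the same size
        b, c, h, w = x.shape
        n = h * w
        xt = x.flatten(2).transpose(1, 2)                    # tokens X: (B, HW, C)
        yt = f_lu.flatten(2).transpose(1, 2)                 # light-up feature tokens Y: (B, HW, C)
        q, k, v = (self._heads(f(xt), b, n) for f in (self.to_q, self.to_k, self.to_v))
        y = self._heads(yt, b, n)
        attn = F.softmax(k.transpose(-2, -1) @ q / self.alpha, dim=-2)  # (B, heads, d_k, d_k)
        out = (y * v) @ attn                                  # (Y ⊙ V) softmax(KᵀQ/α): (B, heads, HW, d_k)
        out = self.proj(out.permute(0, 2, 1, 3).reshape(b, n, c))
        return out.transpose(1, 2).reshape(b, c, h, w)

msa = IGMSA()
print(msa(torch.rand(1, 40, 32, 32), torch.rand(1, 40, 32, 32)).shape)  # torch.Size([1, 40, 32, 32])
```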
3. Complexity Analysis:
This section directly compares the computational efficiency of IG-MSA with Global Multi-head Self-Attention (G-MSA).
-
Complexity of IG-MSA: The main computational burden comes from the two matrix multiplications in Eq. (9) for each of the heads: and .
- The term involves matrices of size and , resulting in operations.
- The term is element-wise and then multiplied by the softmax output, involving and operations, resulting in operations.
- Summing these for heads: $ \mathcal{O}(\text{IG-MSA}) = k \cdot [d_k \cdot (d_k \cdot HW) + HW \cdot (d_k \cdot d_k)] $ $ = 2HW k d_k^2 $
- Since (where is the total channel dimension), substituting : $ \mathcal{O}(\text{IG-MSA}) = 2HW k \left(\frac{C}{k}\right)^2 = \frac{2HW C^2}{k} $
- Key Insight: This shows that the complexity of
IG-MSA is linear with respect to the spatial size (HW).
-
Complexity of Global MSA (G-MSA):
- The computational complexity of standard global MSA (like in vanilla Vision Transformers) is: $ \mathcal{O}(\text{G-MSA}) = 2(HW)^2 C $
- Key Insight: This complexity is
quadratic with respect to the spatial size (HW).
-
Comparison and Advantage: The linear complexity of
IG-MSA (vs. quadratic for G-MSA) is a major advantage. It allows IG-MSA to be plugged into each basic unit IGAB throughout the network, even at higher resolutions, without incurring prohibitive computational costs. This enables a full exploration of the Transformer's potential for low-light image enhancement, unlike previous hybrid methods that could only afford a single global Transformer layer at the lowest resolution.
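A quick back-of-the-envelope check of the two complexity formulas for an example feature map (assumed sizes H = W = 256, C = 40, k = 4; illustrative numbers only):

```python
# Cost of IG-MSA (linear in HW) vs. global MSA (quadratic in HW) for one attention layer.
H, W, C, k = 256, 256, 40, 4
hw = H * W

ig_msa = 2 * hw * C ** 2 / k      # O(IG-MSA) = 2 HW C^2 / k
g_msa = 2 * hw ** 2 * C           # O(G-MSA)  = 2 (HW)^2 C

print(f"IG-MSA: {ig_msa:.2e} ops, G-MSA: {g_msa:.2e} ops, ratio: {g_msa / ig_msa:.0f}x")
```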
5. Experimental Setup
5.1. Datasets
The authors evaluated Retinexformer on a total of thirteen datasets, demonstrating comprehensive testing across various scenarios.
Datasets with Ground Truth (Paired Low-light/Normal-light Images):
-
LOL (Low-Light) Dataset: This dataset contains paired low-light and normal-light images.
- LOL-v1 [54]: Split into 485 training pairs and 15 testing pairs.
- LOL-v2 [59]: Divided into two subsets:
- LOL-v2-real: 689 training pairs, 100 testing pairs. These are real-world low-light images.
- LOL-v2-synthetic: 900 training pairs, 100 testing pairs. These are synthetically generated low-light images.
- Characteristics: These datasets are widely used benchmarks for low-light enhancement, covering various indoor and outdoor scenes.
-
SID (See-in-the-Dark) Dataset [9]:
- Source/Characteristics: The subset captured by the Sony α7S II camera. It consists of short-/long-exposure RAW image pairs. The low-light (short exposure) and normal-light (long exposure) RGB images are derived by applying the same in-camera signal processing pipeline to the RAW data. This dataset is known for having significant noise in the low-light images.
- Scale: 2697 RAW image pairs in total.
- Splits: 2099 pairs for training and 598 pairs for testing.
-
SMID (Seeing Motion in the Dark) Dataset [10]:
- Source/Characteristics: Collects 20809 short-/long-exposure RAW image pairs. Similar to SID, RAW data is converted to low-/normal-light RGB image pairs. This dataset offers a larger scale for training.
- Splits: 15763 pairs for training and the remaining for testing.
-
SDSD (Seeing Dynamic Scenes in the Dark) Dataset [48]:
- Source/Characteristics: This dataset is designed for dynamic scenes and is captured by a Canon EOS 6D Mark II camera with an ND filter. It contains both indoor and outdoor subsets, focusing on video sequences. The paper uses the static version.
- SDSD-indoor: 62 low-/normal-light video pairs for training, 6 for testing.
- SDSD-outdoor: 116 low-/normal-light video pairs for training, 10 for testing.
-
FiveK (MIT-Adobe FiveK) Dataset [5]:
- Source/Characteristics: A dataset of 5000 images, each adjusted manually by five expert photographers (labeled A-E). It's used for photo enhancement and aesthetic adjustment.
- Splits: 4500 low-/normal-light image pairs for training, 500 for testing.
- Reference: The authors use expert C's adjusted images as reference and adopt the sRGB output mode.
- Example: A low-light image might be an underexposed JPEG, and the normal-light counterpart is the expertly retouched version.
Datasets without Ground Truth (Used for Qualitative Evaluation/User Study):
- LIME [18]
- NPE [50]
- MEF [36]
- DICM [28]
- VV [47]
-
Characteristics: These datasets typically consist of real-world low-light images without corresponding normal-light ground truth, making them suitable for subjective visual quality assessment and generalization testing.
These datasets are well-chosen to validate the method's performance across various low-light conditions, image types (real, synthetic, RAW-derived), and levels of corruption (noise, exposure issues). The inclusion of datasets without ground truth helps assess subjective visual quality and generalization to real-world scenarios where paired data is unavailable.
-
5.2. Evaluation Metrics
The paper uses standard evaluation metrics for image quality assessment and object detection performance.
5.2.1. Peak Signal-to-Noise Ratio (PSNR)
- Conceptual Definition: PSNR is a widely used objective metric to quantify the quality of reconstruction of an image compared to an original (ground truth) image. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. A higher PSNR value generally indicates a higher quality (less noisy) reconstructed image. It is often expressed in decibels (dB).
- Mathematical Formula: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
- Symbol Explanation:
- $MAX_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255; the same holds for color images represented by 8 bits per channel.
- $MSE$: Mean Squared Error, which is the cumulative squared error between the enhanced image and the ground truth image, divided by the total number of pixels. $ MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
- $m, n$: The number of rows and columns in the image, respectively.
- $I(i,j)$: The pixel value at row $i$ and column $j$ in the ground truth image.
- $K(i,j)$: The pixel value at row $i$ and column $j$ in the enhanced image.
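A straightforward NumPy implementation of PSNR as defined above, assuming 8-bit images ($MAX_I = 255$):

```python
import numpy as np

def psnr(ground_truth: np.ndarray, enhanced: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR in dB between a ground-truth image I and an enhanced image K."""
    mse = np.mean((ground_truth.astype(np.float64) - enhanced.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                 # identical images
    return 10.0 * np.log10(max_val ** 2 / mse)

gt = np.full((4, 4), 200, dtype=np.uint8)
noisy = gt.copy(); noisy[0, 0] = 190        # one pixel off by 10 -> MSE = 100/16 = 6.25
print(round(psnr(gt, noisy), 2))            # 40.17 dB
```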
5.2.2. Structural Similarity Index Measure (SSIM)
- Conceptual Definition: SSIM is an objective metric that evaluates the perceived quality of an image by quantifying the similarity between two images. Unlike PSNR, which focuses on pixel-wise differences (error visibility), SSIM attempts to model the human visual system's perception of structural information, luminance, and contrast. The SSIM index ranges from -1 to 1, where 1 indicates perfect structural similarity.
- Mathematical Formula: $ SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
- Symbol Explanation:
- $x, y$: Two image patches (or the entire images) being compared.
- $\mu_x, \mu_y$: The mean pixel values of $x$ and $y$, respectively.
- $\sigma_x, \sigma_y$: The standard deviations of pixel values of $x$ and $y$, respectively.
- $\sigma_{xy}$: The covariance of $x$ and $y$.
- $c_1 = (k_1 L)^2, c_2 = (k_2 L)^2$: Small constants included to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $k_1, k_2$ are small constants (e.g., 0.01 and 0.03 by default).
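A minimal single-window SSIM computation following the formula above (no sliding window or Gaussian weighting, which standard implementations add), assuming 8-bit inputs:

```python
import numpy as np

def global_ssim(x: np.ndarray, y: np.ndarray, data_range: float = 255.0,
                k1: float = 0.01, k2: float = 0.03) -> float:
    """SSIM computed over the whole image as one window (illustrative simplification)."""
    x, y = x.astype(np.float64), y.astype(np.float64)
    c1, c2 = (k1 * data_range) ** 2, (k2 * data_range) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov_xy = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))

img = np.random.default_rng(0).integers(0, 256, size=(64, 64))
print(round(global_ssim(img, img), 4))   # 1.0 for identical images
```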
5.2.3. Average Precision (AP)
- Conceptual Definition: AP is a common metric used to evaluate the performance of object detection models. It summarizes the
precision-recall curvefor a given class. A higher AP value indicates better detection performance, meaning the model can accurately identify objects (high precision) while also finding most of them (high recall). The mean Average Precision (mAP) is the average AP across all object classes. - Mathematical Formula: $ AP = \sum_{k=1}^{N} P(k) \Delta R(k) $ Or, more commonly, as the area under the precision-recall curve: $ AP = \int_0^1 P(R) dR $
- Symbol Explanation:
- $P(k)$: Precision at the $k$-th recall value.
- $\Delta R(k)$: The change in recall from $k-1$ to $k$.
- $P(R)$: Precision as a function of recall.
- Precision: The fraction of correctly detected objects (True Positives) among all detected objects (True Positives + False Positives): $\text{Precision} = TP / (TP + FP)$.
- Recall: The fraction of correctly detected objects (True Positives) among all actual objects in the image (True Positives + False Negatives): $\text{Recall} = TP / (TP + FN)$.
- Intersection over Union (IoU): A threshold (e.g., 0.5) is typically applied to determine if a detected bounding box correctly matches a ground truth bounding box. If IoU > threshold, it's a True Positive.
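A small sketch of AP as the area under an interpolated precision-recall curve, given example precision/recall points; the IoU-based matching step is assumed to have produced these points already:

```python
import numpy as np

def average_precision(recall: np.ndarray, precision: np.ndarray) -> float:
    """AP = sum_k P(k) * ΔR(k), with precision made monotonically non-increasing."""
    r = np.concatenate(([0.0], recall, [1.0]))
    p = np.concatenate(([1.0], precision, [0.0]))
    p = np.maximum.accumulate(p[::-1])[::-1]   # interpolate precision from right to left
    return float(np.sum(np.diff(r) * p[1:]))

# Example PR points from a hypothetical detector (sorted by increasing recall).
recall = np.array([0.2, 0.4, 0.6, 0.8])
precision = np.array([1.0, 0.9, 0.7, 0.5])
print(round(average_precision(recall, precision), 3))  # 0.62
```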
5.3. Baselines
The paper compares Retinexformer against a comprehensive set of state-of-the-art (SOTA) low-light image enhancement algorithms, categorized as follows:
-
Retinex-based Deep Learning Methods:
DeepUPE[49]RetinexNet[54]RUAS(RAS in the tables) [30]KinD[66]
-
General Deep Learning Image Enhancement/Restoration Methods (CNN-based):
SID[9] (A method designed for seeing in the dark, often used as a baseline)3DLUT[63]RF[26]DeepLPF[38]Sparse[59]EnGAN[22]RAS(RUAS in the paper text) [30]FIDE[56]DRBN[58]MIRNet[61]
-
Transformer-based or CNN-Transformer Hybrid Methods:
IPT[11] (Pre-trained image processing transformer)UFormer[52] (U-shaped Transformer)Restormer[60] (Efficient Transformer for high-resolution image restoration)SNR-Net[57] (SNR-aware CNN-Transformer hybrid network)
-
Methods for Low-light Object Detection Comparison:
-
ZeroDCE[117] -
SCI[37] (Self-calibrated illumination) -
Others from the above list (e.g., MIRNet, RetinexNet, RUAS, Restormer, KinD, SNR-Net) when applied as pre-processing.
These baselines are representative of the current landscape of low-light image enhancement, covering various architectural designs (CNNs, Transformers), foundational principles (Retinex-based, direct mapping), and training paradigms (supervised, unsupervised). This broad comparison validates
Retinexformer's efficacy across different approaches.
-
5.4. Implementation Details
- Framework: Implemented using PyTorch [39].
- Optimizer: Adam [25] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
- Training Iterations: $2.5 \times 10^5$ iterations.
- Learning Rate Schedule: Initially set to $2 \times 10^{-4}$ and then steadily decreased to $10^{-6}$ using the cosine annealing scheme [34] during training.
- Training Samples: Patches of size $128 \times 128$ are randomly cropped from low-/normal-light image pairs.
- Batch Size: 8.
- Data Augmentation: Random rotation and flipping.
- Loss Function: Mean Absolute Error (MAE) between the enhanced image and the ground truth. The objective is to minimize this MAE.
$
MAE = \frac{1}{mnc} \sum_{k=0}^{c-1} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} |I(i,j,k) - K(i,j,k)|
$
- $m, n$: Height and width of the image.
- $c$: Number of channels (e.g., 3 for RGB).
- $I(i,j,k)$: Pixel value at row $i$, column $j$, channel $k$ in the ground truth image.
- $K(i,j,k)$: Pixel value at row $i$, column $j$, channel $k$ in the enhanced image.
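A condensed PyTorch training-loop sketch matching these settings (Adam with the stated betas, cosine annealing, 128x128 crops, MAE/L1 loss); the model and data loader are placeholders, not the authors' code:

```python
import torch
import torch.nn as nn

# Placeholders standing in for Retinexformer and the paired low-/normal-light dataset.
model = nn.Conv2d(3, 3, 3, padding=1)
loader = [(torch.rand(8, 3, 128, 128), torch.rand(8, 3, 128, 128))]  # (low, gt) batches of 128x128 crops

optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
total_iters = 10  # the paper trains for 2.5e5 iterations
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=total_iters, eta_min=1e-6)
criterion = nn.L1Loss()   # mean absolute error between enhanced image and ground truth

for it in range(total_iters):
    low, gt = loader[it % len(loader)]
    loss = criterion(model(low), gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    scheduler.step()
print(f"final lr: {scheduler.get_last_lr()[0]:.2e}")
```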
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Retinexformer consistently achieves state-of-the-art performance across a wide array of benchmarks, both quantitatively and qualitatively, while maintaining competitive computational efficiency.
The following are the results from Table 1 of the original paper:
| Methods | Complexity | LOL-v1 | LOL-v2-real | LOL-v2-syn | SID | SMID | SDSD-in | SDSD-out | ||||||||
| FLOPS (G) | Params (M) | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | PSNR | SSIM | |
| SID [9] | 13.73 | 7.76 | 14.35 | 0.436 | 13.24 | 0.442 | 15.04 | 0.610 | 16.97 | 0.591 | 24.78 | 0.718 | 23.29 | 0.703 | 24.90 | 0.693 |
| 3DLUT [63] | 0.075 | 0.59 | 14.35 | 0.445 | 17.59 | 0.721 | 18.04 | 0.800 | 20.11 | 0.592 | 23.86 | 0.678 | 21.66 | 0.655 | 21.89 | 0.649 |
| DeepUPE [49] | 21.10 | 1.02 | 14.38 | 0.446 | 13.27 | 0.452 | 15.08 | 0.623 | 17.01 | 0.604 | 23.91 | 0.690 | 21.70 | 0.662 | 21.94 | 0.698 |
| RF [26] | 46.23 | 21.54 | 15.23 | 0.452 | 14.05 | 0.458 | 15.97 | 0.632 | 16.44 | 0.596 | 23.11 | 0.681 | 20.97 | 0.655 | 21.21 | 0.689 |
| DeepLPF [38] | 5.86 | 1.77 | 15.28 | 0.473 | 14.10 | 0.480 | 16.02 | 0.587 | 18.07 | 0.600 | 24.36 | 0.688 | 22.21 | 0.664 | 22.76 | 0.658 |
| IPT [11] | 6887 | 115.31 | 16.27 | 0.504 | 19.80 | 0.813 | 18.30 | 0.811 | 20.53 | 0.561 | 27.03 | 0.783 | 26.11 | 0.831 | 27.55 | 0.850 |
| UFormer [52] | 12.00 | 5.29 | 16.36 | 0.771 | 18.82 | 0.771 | 19.66 | 0.871 | 18.54 | 0.577 | 27.20 | 0.792 | 23.17 | 0.859 | 23.85 | 0.748 |
| RetinexNet [54] | 587.47 | 0.84 | 16.77 | 0.560 | 15.47 | 0.567 | 17.13 | 0.798 | 16.48 | 0.578 | 22.83 | 0.684 | 20.84 | 0.617 | 20.96 | 0.629 |
| Sparse [59] | 53.26 | 2.33 | 17.20 | 0.640 | 20.06 | 0.816 | 22.05 | 0.905 | 18.68 | 0.606 | 25.48 | 0.766 | 23.25 | 0.863 | 25.28 | 0.804 |
| EnGAN [22] | 61.01 | 114.35 | 17.48 | 0.650 | 18.23 | 0.617 | 16.57 | 0.734 | 17.23 | 0.543 | 22.62 | 0.674 | 20.02 | 0.604 | 20.10 | 0.616 |
| RAS [30] | 0.83 | 0.003 | 18.23 | 0.720 | 18.37 | 0.723 | 16.55 | 0.652 | 18.44 | 0.581 | 25.88 | 0.744 | 23.17 | 0.696 | 23.84 | 0.743 |
| FIDE [56] | 28.51 | 8.62 | 18.27 | 0.665 | 16.85 | 0.678 | 15.20 | 0.612 | 18.34 | 0.578 | 24.42 | 0.692 | 22.41 | 0.659 | 22.20 | 0.629 |
| DRBN [58] | 48.61 | 5.27 | 20.13 | 0.830 | 20.29 | 0.831 | 23.22 | 0.927 | 19.02 | 0.577 | 26.60 | 0.781 | 24.08 | 0.868 | 25.77 | 0.841 |
| KinD [66] | 34.99 | 8.02 | 20.86 | 0.790 | 14.74 | 0.641 | 13.29 | 0.578 | 18.02 | 0.583 | 22.18 | 0.634 | 21.95 | 0.672 | 21.97 | 0.654 |
| Restormer [60] | 144.25 | 26.13 | 22.43 | 0.823 | 19.94 | 0.827 | 21.41 | 0.830 | 22.27 | 0.649 | 26.97 | 0.758 | 25.67 | 0.827 | 24.79 | 0.802 |
| MIRNet [61] | 785 | 31.76 | 24.14 | 0.830 | 20.02 | 0.820 | 21.94 | 0.876 | 20.84 | 0.605 | 25.66 | 0.762 | 24.38 | 0.864 | 27.13 | 0.837 |
| SNR-Net [57] | 26.35 | 4.01 | 24.61 | 0.842 | 21.48 | 0.849 | 24.14 | 0.928 | 22.87 | 0.625 | 28.49 | 0.805 | 29.44 | 0.894 | 28.66 | 0.866 |
| Retinexformer | 15.57 | 1.61 | 25.16 | 0.845 | 22.80 | 0.840 | 25.67 | 0.930 | 24.44 | 0.680 | 29.15 | 0.815 | 29.77 | 0.896 | 29.84 | 0.877 |
Quantitative Results Analysis (Table 1):
- Overall Dominance:
Retinexformer achieves the highest PSNR on all seven datasets in Table 1 (LOL-v1, LOL-v2-real, LOL-v2-synthetic, SID, SMID, SDSD-indoor, and SDSD-outdoor) and the highest SSIM on six of them. This robust performance across diverse benchmarks strongly validates its effectiveness. - Comparison with SOTA (SNR-Net):
Retinexformer consistently outperforms SNR-Net, which was the previous best method.
- Crucially,
Retinexformer achieves this while being more efficient: it costs only 40% of the parameters (1.61 M vs. 4.01 M) and 59% of the FLOPS (15.57 G vs. 26.35 G) compared to SNR-Net. This highlights its efficiency advantage.
- Comparison with Retinex-based Deep Learning Methods: When compared to
DeepUPE, RetinexNet, RUAS, and KinD, Retinexformer shows dramatic improvements, particularly on datasets with severe corruptions.
corruption modelingwithin theORF.
- Improvements are over 6 dB on SID (e.g., 24.44 dB vs. RetinexNet's 16.48 dB) and SDSD datasets, indicating its superior ability to handle noise and artifacts. This is a direct validation of its
- Comparison with Transformer-based Image Restoration Algorithms: Even against other Transformer-based methods like
IPT, Uformer, and Restormer, Retinexformer shows significant gains (e.g., up to 4.26 dB).
Furthermore,
Retinexformeris vastly more efficient thanIPTandRestormer. For example, it uses only 1.4% ofIPT's parameters and 0.2% of its FLOPS, and 6.2% ofRestormer's parameters and 10.9% of its FLOPS. This underscores the success of theIG-MSAin achieving high performance with reduced computational complexity.The following are the results from Table 2 of the original paper:
| Methods | DeepUPE [49] | MIRNet [61] | SNR-Net [57] | Restormer [60] | Ours |
| PSNR (dB) | 23.04 | 23.73 | 23.81 | 24.13 | 24.94 |
| FLOPS (G) | 21.10 | 785.0 | 26.35 | 144.3 | 15.57 |
-
Quantitative Results Analysis (Table 2 - FiveK Dataset):
On the FiveK dataset (sRGB output mode), Retinexformer again demonstrates superior performance with a PSNR of 24.94 dB, surpassing Restormer (24.13 dB) and SNR-Net (23.81 dB). Its FLOPS (15.57 G) remain significantly lower than MIRNet (785.0 G) and Restormer (144.3 G), and even lower than SNR-Net (26.35 G), further solidifying its efficiency.
These results collectively confirm Retinexformer's outstanding effectiveness and efficiency across various low-light image enhancement tasks.
6.2. Qualitative Results
The visual comparisons provided in the paper further reinforce Retinexformer's quantitative superiority, showcasing its ability to produce perceptually pleasing and natural-looking enhanced images.
The following figure (Figure 3 from the original paper) shows visual comparison results:
Figure 3. Visual comparison of enhancement results with state-of-the-art methods (figure from the original paper).
The following figure (Figure 4 from the original paper) shows visual comparison results:
Figure 4. Visual comparison of enhancement results with state-of-the-art methods (figure from the original paper).
The following figure (Figure 5 from the original paper) shows visual comparison results:
Figure 5. Visual comparison of enhancement results with state-of-the-art methods (figure from the original paper).
The following figure (Figure 7 from the original paper) shows visual comparison results on datasets without ground truth:
Figure 7. Visual comparison of enhancement results on datasets without ground truth (figure from the original paper).
Qualitative Results Analysis (Figures 3, 4, 5, 7):
-
Color Fidelity and Distortion: Previous methods often suffer from
color distortion. For instance, RUAS in Figure 3 shows noticeable color shifts. Retinexformer robustly preserves original colors, producing natural and accurate hues.
Exposure Control: Many methods struggle with
over-/under-exposed regions. RetinexNet and DeepUPE in Figure 4 are cited as failing to suppress noise and exhibiting poor exposure control. Retinexformer effectively balances brightness, enhancing visibility in dark areas without introducing overexposure in brighter regions.
Noise and Artifact Suppression: Low-light images are typically noisy. Methods like
RetinexNet and DeepUPE amplify noise. DRBN and SNR-Net in Figure 5 are mentioned for introducing black spots or unnatural artifacts. Restormer and SNR-Net in Figure 4 can produce blurry images. In contrast, Retinexformer excels at reliably removing noise and artifacts, yielding cleaner images without introducing new corruptions, spots, or blurriness. This is a direct outcome of its corruption modeling capability.
Sharpness and Detail Preservation:
Retinexformer maintains good image sharpness and preserves fine details, which is crucial for perceptual quality. Some other methods (e.g., Restormer, SNR-Net) tend to generate blurrier results as a trade-off for noise reduction.
Generalization to No Ground Truth Data: Figure 7 showcases results on datasets like LIME, NPE, MEF, DICM, and VV, which lack ground truth. In these scenarios,
Retinexformer still demonstrates better visual quality than other supervised and unsupervised algorithms, validating its strong generalization ability to real-world diverse low-light scenes. The visual evidence strongly supports the quantitative findings, highlighting
Retinexformer's ability to produce high-quality, perceptually superior enhanced images across various challenging conditions.
6.3. User Study Score
To quantify human subjective visual perception, a user study was conducted.
The following are the results from Table 3a of the original paper:
| Methods | L-v1 | L-v2-R | L-v2-S | SID | SMID | SD-in | SD-out | Mean |
| EnGAN [22] | 2.43 | 1.39 | 2.13 | 1.04 | 2.78 | 1.83 | 1.87 | 1.92 |
| RetinexNet [54] | 2.17 | 1.91 | 1.13 | 1.09 | 2.35 | 3.96 | 3.74 | 2.34 |
| DRBN [58] | 2.70 | 2.26 | 3.65 | 1.96 | 2.22 | 2.78 | 2.91 | 2.64 |
| FIDE [56] | 2.87 | 2.52 | 3.48 | 2.22 | 2.57 | 3.04 | 2.96 | 2.81 |
| KinD [66] | 2.65 | 2.48 | 3.17 | 1.87 | 3.04 | 3.43 | 3.39 | 2.86 |
| MIRNet [61] | 2.96 | 3.57 | 3.61 | 2.35 | 2.09 | 2.91 | 3.09 | 2.94 |
| Restormer [60] | 3.04 | 3.48 | 3.39 | 2.43 | 3.17 | 2.48 | 2.70 | 2.96 |
| RUAS [30] | 3.83 | 3.22 | 2.74 | 2.26 | 3.48 | 3.39 | 3.04 | 3.14 |
| SNR-Net [57] | 3.13 | 3.83 | 3.57 | 3.04 | 3.30 | 2.74 | 3.17 | 3.25 |
| Retinexformer | 3.61 | 4.17 | 3.78 | 3.39 | 3.87 | 3.65 | 3.91 | 3.77 |
User Study Analysis (Table 3a):
- Setup: 23 human subjects were invited to score the visual quality of enhanced images from seven datasets, ranging from 1 (worst) to 5 (best). They were instructed to evaluate based on: (i) presence of under-/over-exposed regions, (ii) presence of color distortion, and (iii) presence of noise or artifacts. Images were presented without method names to avoid bias. 156 testing images were used in total.
- Results:
Retinexformer achieved the highest average score of 3.77, significantly outperforming all other methods. It was also the most favored method on the LOL-v2-real, LOL-v2-synthetic, SID, SMID, and SDSD-outdoor datasets, and the second most favored on LOL-v1 and SDSD-indoor.
Retinexformerproduces results that are not only objectively superior (higher PSNR/SSIM) but also subjectively more appealing and natural-looking to human observers, indicating its success in addressing perceptual quality aspects of low-light enhancement.
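The per-method numbers in Table 3a behave like mean opinion scores: each subject's 1-5 rating is averaged over a dataset's test images and over the 23 subjects, and the "Mean" column averages across the seven datasets. A minimal sketch of that aggregation is below; the `mean_opinion_score` helper and its input layout are assumptions, while the per-dataset values are copied from Retinexformer's row of Table 3a.

```python
import numpy as np

def mean_opinion_score(ratings: np.ndarray) -> float:
    """Average a (subjects x images) matrix of 1-5 ratings into one score per dataset."""
    return float(ratings.mean())

# Retinexformer's per-dataset scores from Table 3a; the 'Mean' column is their average.
per_dataset = {
    "LOL-v1": 3.61, "LOL-v2-real": 4.17, "LOL-v2-syn": 3.78,
    "SID": 3.39, "SMID": 3.87, "SDSD-in": 3.65, "SDSD-out": 3.91,
}
print(round(sum(per_dataset.values()) / len(per_dataset), 2))  # 3.77, matching Table 3a
```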
6.4. Low-light Object Detection
To demonstrate the practical value of Retinexformer for high-level computer vision tasks, experiments were conducted on low-light object detection.
The following are the results from Table 3b of the original paper:
| Methods | Bicycle | Boat | Bottle | Bus | Car | Cat | Chair | Cup | Dog | Motor | People | Table | Mean |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| MIRNet [61] | 71.8 | 63.8 | 62.9 | 81.4 | 71.1 | 58.8 | 58.9 | 61.3 | 63.1 | 52.0 | 68.8 | 45.5 | 63.6 |
| RetinexNet [54] | 73.8 | 62.8 | 64.8 | 84.9 | 80.8 | 53.4 | 57.2 | 68.3 | 61.5 | 51.3 | 65.9 | 43.1 | 64.0 |
| RUAS [30] | 72.0 | 62.2 | 65.2 | 72.9 | 78.1 | 57.3 | 62.4 | 61.8 | 60.2 | 61.5 | 69.4 | 46.8 | 64.2 |
| Restormer [60] | 76.2 | 65.1 | 64.2 | 84.0 | 76.3 | 59.2 | 53.0 | 58.7 | 66.1 | 62.9 | 68.6 | 45.0 | 64.9 |
| KinD [66] | 72.2 | 66.5 | 58.9 | 83.7 | 74.5 | 55.4 | 61.7 | 61.3 | 63.8 | 63.0 | 70.5 | 47.8 | 65.0 |
| ZeroDCE [117] | 75.8 | 66.5 | 65.6 | 84.9 | 77.2 | 56.3 | 53.8 | 59.0 | 63.5 | 64.0 | 68.3 | 46.3 | 65.1 |
| SNR-Net [57] | 75.3 | 64.4 | 63.6 | 85.3 | 77.5 | 59.1 | 54.1 | 59.6 | 66.3 | 65.2 | 69.1 | 44.6 | 65.3 |
| SCI [37] | 74.6 | 65.3 | 65.8 | 85.4 | 76.3 | 59.4 | 57.1 | 60.5 | 65.6 | 63.9 | 69.1 | 45.9 | 65.6 |
| Retinexformer | 76.3 | 66.7 | 65.9 | 84.7 | 77.6 | 61.2 | 53.5 | 60.7 | 67.5 | 63.4 | 69.5 | 46.0 | 66.1 |
Experiment Settings:
- Dataset: ExDark [32] dataset, which contains 7363 underexposed images with bounding box annotations for 12 object categories.
- Splits: 5890 images for training, 1473 for testing.
- Detector: YOLO-v3 [43], trained from scratch.
- Enhancement: Different low-light enhancement methods (including Retinexformer and the baselines) serve as preprocessing modules with fixed parameters; a minimal sketch of this enhance-then-detect setup is shown below.
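In this setup the enhancer acts as a frozen preprocessing module in front of the detector. The sketch below shows that wiring with placeholder `nn.Module`s standing in for Retinexformer and YOLO-v3; the module definitions and function names are illustrative assumptions, not code from the authors' repository.

```python
import torch
import torch.nn as nn

# Stand-in modules for illustration only; the real enhancer and detector
# (Retinexformer and YOLO-v3) are far larger networks.
enhancer = nn.Sequential(nn.Conv2d(3, 3, 3, padding=1))   # placeholder "enhancer"
detector = nn.Sequential(nn.Conv2d(3, 12, 1))             # placeholder 12-class "detector" head

# The enhancer is frozen: it serves purely as a fixed preprocessing module,
# matching the Table 3b setup where each enhancement method has fixed parameters.
enhancer.eval()
for p in enhancer.parameters():
    p.requires_grad = False

def detect_in_low_light(low_light_batch: torch.Tensor) -> torch.Tensor:
    """Enhance a batch of low-light images, then run the detector on the result."""
    with torch.no_grad():
        enhanced = enhancer(low_light_batch)   # brightened, corruption-restored images
    return detector(enhanced)                  # detections on the enhanced images

# Usage with a random batch of shape (N, C, H, W):
out = detect_in_low_light(torch.rand(2, 3, 416, 416))
print(out.shape)
```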
Quantitative Results (Table 3b - Object Detection AP):
- Retinexformer achieves the highest mean Average Precision (AP) of 66.1, surpassing all other enhancement methods.
- This is 0.5 AP higher than SCI [37] (the recent best self-supervised method) and 0.8 AP higher than SNR-Net [57] (the recent best fully-supervised method).
- Retinexformer also yields the best results on five object categories: bicycle, boat, bottle, cat, and dog.
- Conclusion: These results demonstrate that Retinexformer is not only strong at perceptual enhancement but also significantly improves the input quality for downstream computer vision tasks like object detection, making objects more detectable and more reliably localized in low-light scenes.

The following figure (Figure 6 from the original paper) visually compares object detection results:

Figure 6. Visual comparison of object detection in low-light (left) and enhanced (right) scenes by our method on the ExDark dataset.

Qualitative Results (Figure 6 - Object Detection):
- In the low-light scene (left), the detector (YOLO-v3) misses some boats or predicts inaccurate locations with lower confidence scores.
- In the scene enhanced by Retinexformer (right), the detector reliably predicts well-placed bounding boxes with higher confidence scores, covering all the boats.
- Conclusion: This visual evidence further validates the effectiveness of Retinexformer in making low-light images more amenable to high-level vision tasks.
6.5. Ablation Study
An ablation study was conducted on the SDSD-outdoor dataset to analyze the contribution of each component of Retinexformer. This dataset was chosen because models converge well and perform stably on it.
The following are the results from Table 4a of the original paper:
| Baseline-1 | ORF | IG-MSA | PSNR | SSIM | Params (M) | FLOPS (G) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| ✓ | | | 26.47 | 0.843 | 1.01 | 9.18 |
| ✓ | ✓ | | 27.92 | 0.857 | 1.27 | 11.37 |
| ✓ | | ✓ | 28.86 | 0.868 | 1.34 | 13.38 |
| ✓ | ✓ | ✓ | 29.84 | 0.877 | 1.61 | 15.57 |
6.5.1. Break-down Ablation (Table 4a):
- Baseline-1: This model is derived by removing both ORF (the One-stage Retinex-based Framework) and IG-MSA (Illumination-Guided Multi-head Self-Attention) from Retinexformer. It achieves 26.47 dB PSNR.
- Adding ORF: When ORF is added to Baseline-1, PSNR improves to 27.92 dB (a gain of 1.45 dB). This shows the effectiveness of the proposed Retinex-based framework in handling low-light conditions and initial brightening.
- Adding IG-MSA: When IG-MSA is added to Baseline-1 (without ORF), PSNR improves to 28.86 dB (a gain of 2.39 dB). This indicates that the Illumination-Guided Transformer is highly effective at restoring corruptions and modeling long-range dependencies, even without the explicit ORF guidance.
- Full Retinexformer (ORF + IG-MSA): When both ORF and IG-MSA are jointly exploited, the model achieves the highest PSNR of 29.84 dB (a gain of 3.37 dB over Baseline-1). This demonstrates that ORF and IG-MSA are complementary and that their combination leads to the best performance.
- Conclusion: This breakdown confirms the significant individual contributions of both the ORF and IG-MSA components, and their synergistic effect in achieving superior enhancement (a short note on the PSNR metric used throughout these tables follows).
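For reference, the PSNR values reported across these tables are computed from the mean squared error between the enhanced output and the ground truth (here assuming images normalized to [0, 1]). A minimal NumPy sketch, not the authors' evaluation code, is shown below.

```python
import numpy as np

def psnr(enhanced: np.ndarray, ground_truth: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for images scaled to [0, max_val]."""
    mse = np.mean((enhanced.astype(np.float64) - ground_truth.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10((max_val ** 2) / mse)

# A 1.45 dB gain (26.47 -> 27.92 dB) corresponds to roughly a 28% reduction in MSE,
# since the MSE ratio is 10 ** (-gain / 10).
print(10 ** (-1.45 / 10))  # ≈ 0.716
```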
The following are the results from Table 4b of the original paper:

| Method | I_lu = I | I_lu = I ⊘ L | I_lu = I ⊙ L̄ | + F_lu |
| :--- | :--- | :--- | :--- | :--- |
| PSNR | 28.86 | 28.97 | 29.26 | 29.84 |
| SSIM | 0.868 | 0.868 | 0.870 | 0.877 |
| Params (M) | 1.34 | 1.61 | 1.61 | 1.61 |
| FLOPS (G) | 13.38 | 14.01 | 14.01 | 15.57 |
6.5.2. Ablation of the Proposed ORF (Table 4b):
This study examines different ways of forming the lit-up image I_lu and the role of the light-up feature F_lu.
- I_lu = I: When ORF is removed and the corruption restorer directly takes the raw low-light image I as input (effectively bypassing the illumination estimation), the PSNR is 28.86 dB. This configuration is equivalent to the third row of Table 4a (Baseline-1 + IG-MSA).
- I_lu = I ⊘ L: Here, ORF is used, but the illumination estimator predicts the illumination map L, and the lit-up image is computed via element-wise division I_lu = I ⊘ (L + ε), where a small constant ε prevents division by zero. This yields a PSNR of 28.97 dB, a marginal improvement of 0.11 dB over I_lu = I. The small gain reflects the numerical instability and vulnerability of the division operation, as discussed in the methodology.
- I_lu = I ⊙ L̄: This configuration represents the proposed robust approach, where ORF estimates the light-up map L̄ (instead of L) and the lit-up image is obtained via element-wise multiplication I_lu = I ⊙ L̄. This results in a PSNR of 29.26 dB, a notable improvement of 0.40 dB over the division variant, confirming the robustness and effectiveness of estimating L̄ and using multiplication.
- + F_lu (full Retinexformer): Finally, when the light-up feature F_lu (the other output of the illumination estimator) is also used to direct the corruption restorer (specifically, the IG-MSA modules within IGT), the PSNR further increases to 29.84 dB (a gain of 0.58 dB), and SSIM improves from 0.870 to 0.877.
- Conclusion: This ablation clearly demonstrates the importance of robustly estimating the light-up map L̄ via multiplication, and the crucial role of the light-up feature F_lu in guiding the Transformer-based corruption restorer (a small numerical sketch of the division-versus-multiplication point follows).
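The following toy example illustrates, under assumed values, why estimating a light-up map and multiplying is numerically friendlier than dividing by an estimated illumination map: near-zero illumination values blow up the division result, while the multiplicative form stays bounded. This is an illustrative sketch, not the authors' code.

```python
import numpy as np

# Toy low-light pixel values and an (imperfectly) estimated illumination map L.
I = np.array([0.02, 0.10, 0.30])       # dark pixels in [0, 1]
L = np.array([1e-4, 0.12, 0.35])       # estimated illumination; note the near-zero entry
eps = 1e-4                             # small constant to avoid division by zero

# Division-based light-up, I_lu = I ⊘ (L + eps): the first pixel explodes to ~100,
# amplifying any estimation error or noise at that location.
I_lu_div = I / (L + eps)
print(I_lu_div)        # ≈ [100.0, 0.83, 0.86]

# Multiplication-based light-up, I_lu = I ⊙ L̄, with an assumed light-up map L̄ >= 1:
# the output scales smoothly and stays bounded by the chosen gain.
L_bar = np.array([10.0, 6.0, 2.5])     # light-up map values chosen for illustration
I_lu_mul = np.clip(I * L_bar, 0.0, 1.0)
print(I_lu_mul)        # ≈ [0.20, 0.60, 0.75]
```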
The following are the results from Table 4c of the original paper:

| Method | Baseline-2 | G-MSA | W-MSA | IG-MSA |
| :--- | :--- | :--- | :--- | :--- |
| PSNR | 27.92 | 28.43 | 28.65 | 29.84 |
| SSIM | 0.857 | 0.841 | 0.845 | 0.877 |
| Params (M) | 1.27 | 1.61 | 1.61 | 1.61 |
| FLOPS (G) | 11.37 | 17.65 | 16.43 | 15.57 |
6.5.3. Ablation of Self-Attention Schemes (Table 4c):
This study compares the proposed IG-MSA with other self-attention variants.
- Baseline-2: This corresponds to the ORF configuration from the break-down ablation with a standard CNN-based corruption restorer, meaning IG-MSA is removed. It achieves 27.92 dB PSNR.
- G-MSA (Global Multi-head Self-Attention): A global MSA is plugged into each basic unit of the corruption restorer. To avoid out-of-memory errors caused by its quadratic complexity, the input feature maps are spatially downscaled. This achieves 28.43 dB PSNR, an improvement over Baseline-2, but it incurs the highest FLOPS (17.65 G) and a lower SSIM (0.841) than Baseline-2; the lower SSIM despite the higher PSNR suggests artifacts or reduced structural integrity.
- W-MSA (Window-based Multi-head Self-Attention): This is a local attention mechanism, similar to Swin Transformer [31], where self-attention is computed within non-overlapping windows. It achieves 28.65 dB PSNR, performing better than G-MSA at slightly lower FLOPS (16.43 G).
- IG-MSA (Illumination-Guided Multi-head Self-Attention): The proposed IG-MSA achieves the highest PSNR of 29.84 dB and SSIM of 0.877. Crucially, it does so with fewer FLOPS (15.57 G) than both G-MSA and W-MSA.
- Conclusion: IG-MSA surpasses G-MSA by 1.41 dB and W-MSA by 1.19 dB, while costing 2.08 G and 0.86 G fewer FLOPS, respectively. This demonstrates the cost-effectiveness and superior performance of the proposed IG-MSA in capturing relevant long-range dependencies for low-light image enhancement, benefiting from illumination guidance and an efficient design (a simplified sketch of the illumination-guided attention idea follows).
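The efficiency of IG-MSA comes from treating the channel dimension as the attention axis, so the attention matrix is C×C and the cost grows linearly with the number of pixels, while the light-up feature modulates the value branch. The single-head PyTorch sketch below illustrates only this idea; it is a simplification under my reading of the paper and omits the multi-head split, positional encoding, and other details of the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class IlluminationGuidedAttention(nn.Module):
    """Simplified single-head sketch: channel-wise attention guided by a light-up feature.

    Attention is computed over a C x C matrix (channels), so the cost is linear in the
    number of pixels H*W instead of quadratic, which is the key efficiency point.
    """
    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)
        self.alpha = nn.Parameter(torch.ones(1))   # learnable scaling before softmax
        self.proj = nn.Linear(channels, channels)

    def forward(self, x: torch.Tensor, light_up_feat: torch.Tensor) -> torch.Tensor:
        # x, light_up_feat: (B, H*W, C) token representations.
        q, k, v = self.to_q(x), self.to_k(x), self.to_v(x)
        v = v * light_up_feat                            # illumination guidance on the value branch
        attn = (k.transpose(-2, -1) @ q) * self.alpha    # (B, C, C): channel-wise attention
        attn = F.softmax(attn, dim=-2)
        out = v @ attn                                   # (B, H*W, C)
        return self.proj(out)

# Usage with random tensors (batch 1, 64x64 feature map, 40 channels):
tokens = torch.rand(1, 64 * 64, 40)
guide = torch.rand(1, 64 * 64, 40)
print(IlluminationGuidedAttention(40)(tokens, guide).shape)   # torch.Size([1, 4096, 40])
```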
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper introduces Retinexformer, a novel Transformer-based method that significantly advances low-light image enhancement. At its core is the One-stage Retinex-based Framework (ORF), which innovatively revises the traditional Retinex theory to explicitly model corruptions (noise, artifacts, exposure issues) inherent in low-light conditions and those introduced during the brightening process. This framework streamlines the enhancement process into a single, end-to-end trainable pipeline, overcoming the complexities of multi-stage training common in prior Retinex-based deep learning approaches.
A key technical contribution is the Illumination-Guided Transformer (IGT), which incorporates a novel Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism. IG-MSA leverages illumination representations to direct the modeling of long-range dependencies, enabling effective interactions between regions with varying lighting conditions. Crucially, its design reduces the computational complexity from quadratic to linear with respect to spatial size, making Transformers viable for high-resolution image enhancement without prohibitive costs.
Extensive experiments on thirteen diverse datasets demonstrate Retinexformer's state-of-the-art performance, achieving significant improvements (e.g., over 6 dB PSNR on SID and SDSD datasets) over existing methods, including other Transformer-based and CNN-based approaches, often with greater efficiency. Furthermore, a user study confirms its superior subjective visual quality, and its successful application in low-light object detection showcases its practical value for improving downstream high-level vision tasks.
7.2. Limitations & Future Work
The paper does not explicitly list limitations of Retinexformer itself or suggest future work specific to the method in its conclusion. Implicitly, however, the paper addresses limitations of previous methods, which can be read as the improvements Retinexformer brings.
- Addressing Corruptions: The paper highlights that previous Retinex models did not consider corruptions hidden in the dark or introduced by the light-up process; Retinexformer's ORF explicitly models and restores them. A potential direction for future work, though not stated, is to explore more granular corruption types (e.g., specific sensor noise patterns) or more adaptive corruption modeling.
- Multi-stage Training: The tedious multi-stage training pipeline of prior Retinex-based deep learning methods is a key limitation addressed by ORF's one-stage design.
- Long-range Dependencies and Computational Cost: The inability of CNNs to capture long-range dependencies and the quadratic computational complexity of vanilla Transformers on high-resolution images were significant limitations; Retinexformer addresses these with IGT and IG-MSA. Future work could explore further optimizations of Transformer architectures or more dynamic illumination guidance mechanisms.
7.3. Personal Insights & Critique
Retinexformer presents a highly compelling and well-executed solution to low-light image enhancement, offering several valuable insights:
- Principled Integration of Theory and Deep Learning: The work effectively bridges traditional image processing theory (Retinex) with modern deep learning (Transformers). By revising the Retinex model to explicitly account for corruptions, the authors give the deep learning framework a more robust and interpretable physical basis. This principled approach, rather than a purely data-driven black-box mapping, likely contributes to the method's strong generalization and its ability to handle diverse corruptions.
- Smart Solution to the Transformer's Achilles' Heel: The Illumination-Guided Multi-head Self-Attention (IG-MSA) is a particularly ingenious contribution. The quadratic complexity of global self-attention has been a major bottleneck for applying Transformers to dense prediction on high-resolution images. By designing an attention mechanism that is linear in spatial size and, more importantly, uses the illumination context to guide the attention, the paper harnesses the Transformer's long-range dependency modeling without prohibitive cost. This sets a precedent for making Transformers more efficient and domain-aware in low-level vision.
- Holistic Approach to Enhancement: The ORF's two-step process (light up, then restore corruption) is intuitive and effective. It recognizes that brightening is only part of the solution; the subsequent corruption restoration is equally critical. The user study and object detection results show that Retinexformer achieves not just high PSNR/SSIM but also superior perceptual quality and utility for downstream tasks, which is the ultimate goal of image enhancement.
Potential Issues or Areas for Improvement:
- Interpretability of Illumination Guidance: While the paper states that F_lu directs the modeling, a deeper dive into how the illumination representation specifically impacts the attention weights or value modulation in IG-MSA would be beneficial. Visualizations of attention maps conditioned on different illumination levels might offer more insight.
- Generalization to Extreme Conditions: Although tested on many datasets, the paper could further discuss performance under extremely challenging conditions (e.g., heavily saturated noise, complex mixed lighting, severe motion blur in low light) or specific failure modes.
- Computational Overhead of Illumination Estimation: While IG-MSA is efficient, the depth-wise separable conv5x5 and subsequent conv1x1 in the illumination estimator contribute to the overall complexity. A closer analysis of this initial estimation's impact on total runtime, and whether it could be further optimized, would be insightful.
- Hyperparameter Sensitivity: The paper mentions a learnable scaling parameter in IG-MSA. An ablation or sensitivity analysis of this parameter, or of other key hyperparameters, could provide more practical guidance for implementation and fine-tuning.
- Positional Encoding Details: The positional encoding is described only as "learnable parameters". More detail on its structure or how it is learned would be helpful, especially for readers trying to implement the model.

Overall, Retinexformer represents a significant step forward in low-light image enhancement, demonstrating that careful theoretical re-evaluation combined with innovative architectural design can yield highly effective and efficient deep learning models. Its methods, particularly IG-MSA and the principled ORF, could transfer to other image restoration or image-to-image translation tasks where contextual guidance and efficient long-range dependency modeling are crucial.