AiPaper

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement

Published: 03/13/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This paper introduces a novel one-stage Retinex framework (ORF) for low-light image enhancement. By estimating illumination and restoring corruptions, combined with an Illumination-Guided Transformer (IGT), Retinexformer significantly outperforms state-of-the-art methods across thirteen benchmarks.

Abstract

When enhancing low-light images, many deep learning algorithms are based on the Retinex theory. However, the Retinex model does not consider the corruptions hidden in the dark or introduced by the light-up process. Besides, these methods usually require a tedious multi-stage training pipeline and rely on convolutional neural networks, showing limitations in capturing long-range dependencies. In this paper, we formulate a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates the illumination information to light up the low-light image and then restores the corruption to produce the enhanced image. We design an Illumination-Guided Transformer (IGT) that utilizes illumination representations to direct the modeling of non-local interactions of regions with different lighting conditions. By plugging IGT into ORF, we obtain our algorithm, Retinexformer. Comprehensive quantitative and qualitative experiments demonstrate that our Retinexformer significantly outperforms state-of-the-art methods on thirteen benchmarks. The user study and application on low-light object detection also reveal the latent practical values of our method. Code, models, and results are available at https://github.com/caiyuanhao1998/Retinexformer

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Retinexformer: One-stage Retinex-based Transformer for Low-light Image Enhancement

1.2. Authors

Yuanhao Cai, Hao Bian, Jing Lin, Haoqian Wang, Radu Timofte, Yulun Zhang. The authors are affiliated with Tsinghua University, University of Würzburg, and ETH Zürich.

1.3. Journal/Conference

The paper is available on arXiv, a preprint server. While not a peer-reviewed venue itself, arXiv is a widely used platform for disseminating research in fields like computer science, physics, and mathematics, often preceding formal publication. The work was subsequently published at ICCV 2023.

1.4. Publication Year

2023

1.5. Abstract

The paper addresses the limitations of many deep learning algorithms for low-light image enhancement that are based on the Retinex theory. These limitations include not accounting for corruptions hidden in dark areas or introduced during the brightening process, requiring multi-stage training pipelines, and relying on convolutional neural networks (CNNs) which struggle with long-range dependencies. To overcome these, the authors propose a simple yet principled One-stage Retinex-based Framework (ORF). ORF first estimates illumination to brighten the image and then restores corruptions. A key component is the Illumination-Guided Transformer (IGT), which uses illumination representations to manage non-local interactions across regions with varying lighting. Integrating IGT into ORF yields Retinexformer. Extensive quantitative and qualitative experiments on thirteen benchmarks demonstrate that Retinexformer significantly surpasses state-of-the-art methods. Furthermore, a user study and application in low-light object detection highlight its practical utility.

https://arxiv.org/abs/2303.06705v3 (Preprint)

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the effective enhancement of low-light images. This task is crucial for improving human visual perception and the performance of downstream computer vision tasks, such as nighttime object detection, which often suffer significantly in poor lighting conditions.

Existing approaches face several challenges:

  1. Corruption Handling: Many deep learning algorithms based on the Retinex theory assume images are corruption-free, which is inconsistent with real low-light scenes that inevitably contain noise, artifacts, and color distortions (e.g., from high ISO or long exposure). Additionally, the enhancement process itself can introduce or amplify these corruptions (e.g., under-/over-exposure, color distortion).

  2. Multi-stage Training: Traditional Retinex-based deep learning methods often employ a tedious multi-stage training pipeline, where different CNNs are trained separately for decomposition, denoising, and illumination adjustment before being finetuned together. This process is time-consuming and complex.

  3. Long-range Dependencies: Most existing deep learning methods heavily rely on convolutional neural networks (CNNs). While effective for local feature extraction, CNNs have inherent limitations in capturing long-range dependencies and non-local self-similarity across an image, which are critical for holistic image restoration and enhancement. The high computational cost of directly applying global Vision Transformers restricts their full potential in this domain.

    The paper's entry point and innovative idea revolve around re-evaluating the Retinex theory to explicitly account for corruptions and integrating the power of Transformers in a computationally efficient manner within a streamlined, one-stage framework.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of low-light image enhancement:

  • First Transformer-based Algorithm for Low-light Image Enhancement: The paper proposes Retinexformer, pioneering the full exploration of Transformer models for this task, overcoming the computational limitations that previously restricted their application.
  • One-stage Retinex-based Framework (ORF): The authors formulate ORF, a simple yet principled framework that revises the traditional Retinex model by introducing perturbation terms to explicitly model corruptions (noise, artifacts, under-/over-exposure, color distortion) inherent in low-light images and introduced during enhancement. This framework allows for an easy one-stage, end-to-end training process, addressing the "tedious multi-stage training pipeline" issue of prior Retinex-based deep learning methods.
  • Illumination-Guided Multi-head Self-Attention (IG-MSA): A novel self-attention mechanism, IG-MSA, is designed. It leverages illumination information (represented by light-up features) as a crucial guide for modeling long-range dependencies. This mechanism enables effective interaction between regions with different lighting conditions and significantly reduces computational complexity compared to global self-attention, making Transformers viable for dense image tasks.
  • State-of-the-Art Performance: Retinexformer, obtained by plugging the IGT into ORF, demonstrates significantly superior quantitative and qualitative performance over state-of-the-art methods across an extensive suite of thirteen diverse low-light image enhancement datasets. Notably, it achieves improvements of more than 6 dB PSNR over previous Retinex-based deep learning methods on severely corrupted datasets such as SID and SDSD.
  • Demonstrated Practical Value: Beyond objective metrics, the paper validates the practical utility of Retinexformer through a comprehensive user study, showing high subjective visual quality. Furthermore, its application as a pre-processing step for low-light object detection tasks significantly boosts detection accuracy, revealing its value for high-level computer vision applications.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand the Retinexformer paper, a reader should be familiar with several fundamental concepts in image processing and deep learning:

  • Low-light Image Enhancement (LLIE): This is the overarching task addressed. It aims to transform an underexposed, dark, or noisy image into a well-exposed, clear, and visually appealing image, often referred to as a "normal-light" image. The goal is to improve visibility, contrast, and color fidelity.

  • Retinex Theory:

    • Conceptual Definition: Proposed by Edwin Land, the Retinex (a portmanteau of "retina" and "cortex") theory posits that the perceived color of an object is determined by its reflectance properties rather than by the absolute amount of light reaching the eye. In image processing, this translates to decomposing an image into two components:
      1. Reflectance (R): This represents the intrinsic properties of the scene objects (e.g., color, texture) and is independent of illumination. It is often considered the "true" enhanced image.
      2. Illumination (L): This represents the amount and distribution of light incident on the scene.
    • Mathematical Representation: The original Retinex model expresses a given image $\mathbf{I}$ as the element-wise product of its reflectance $\mathbf{R}$ and illumination $\mathbf{L}$: $ \mathbf{I} = \mathbf{R} \odot \mathbf{L} $, where $\odot$ denotes element-wise multiplication. The goal of Retinex-based enhancement is to estimate $\mathbf{L}$ from $\mathbf{I}$ and then derive $\mathbf{R}$ (the enhanced image) by dividing $\mathbf{I}$ by $\mathbf{L}$ (or multiplying by the inverse of $\mathbf{L}$).
  • Convolutional Neural Networks (CNNs):

    • Conceptual Definition: CNNs are a class of deep neural networks primarily used for analyzing visual imagery. They are characterized by convolutional layers that apply learnable filters (kernels) to input data, pooling layers that reduce spatial dimensions, and fully connected layers for classification or regression.
    • Strengths: CNNs excel at capturing local patterns and hierarchical features due to their local receptive fields and weight sharing. They have been highly successful in many image processing tasks.
    • Limitations (relevant to this paper): Their local nature means they inherently struggle to capture long-range dependencies (i.e., relationships between pixels or regions that are far apart in an image) and non-local self-similarity without large receptive fields or complex architectural designs.
  • Transformer:

    • Conceptual Definition: Originally introduced for natural language processing (NLP), the Transformer is a neural network architecture that relies primarily on a mechanism called self-attention to weigh the importance of different parts of the input data. Unlike recurrent neural networks (RNNs), Transformers process input sequences in parallel, making them highly efficient and capable of modeling very long-range dependencies.
    • Vision Transformer (ViT): Transformers were adapted for computer vision by treating image patches as sequences of tokens, similar to words in NLP.
    • Computational Cost Issue: A significant challenge with vanilla Transformers in vision tasks, especially for high-resolution images, is that the computational complexity of global self-attention is quadratic with respect to the input spatial size (HW), making it computationally expensive and memory-intensive.
  • Self-Attention Mechanism (Core of Transformer):

    • Conceptual Definition: Self-attention allows a model to weigh the importance of different parts of the input sequence when processing a specific element. For each element, it computes a score for every other element, indicating how much "attention" should be paid to it.
    • Mathematical Formula (Standard Multi-Head Attention): The core of the self-attention mechanism involves three learnable linear projections: Query ($\mathbf{Q}$), Key ($\mathbf{K}$), and Value ($\mathbf{V}$). Given an input sequence (or tokens from an image patch), these are generated and the attention output is computed as (a minimal code sketch follows this concepts list): $ \mathrm{Attention}(\mathbf{Q}, \mathbf{K}, \mathbf{V}) = \mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^\mathrm{T}}{\sqrt{d_k}}\right)\mathbf{V} $
      • $\mathbf{Q} \in \mathbb{R}^{N \times d_k}$: Query matrix, derived from the input, representing what we are looking for.
      • $\mathbf{K} \in \mathbb{R}^{N \times d_k}$: Key matrix, derived from the input, representing what is available.
      • $\mathbf{V} \in \mathbb{R}^{N \times d_v}$: Value matrix, derived from the input, containing the information to be aggregated.
      • $N$: Number of tokens (e.g., patches in an image).
      • $d_k$: Dimension of keys and queries.
      • $\mathbf{K}^\mathrm{T}$: Transpose of the key matrix.
      • $\mathrm{softmax}(\cdot)$: Normalization function that converts scores into probability distributions.
      • $\sqrt{d_k}$: Scaling factor that prevents large dot products from pushing the softmax into regions with tiny gradients.
    • Multi-head Self-Attention (MSA): This extends self-attention by performing the attention mechanism multiple times in parallel with different linear projections (heads). The outputs from these heads are then concatenated and linearly transformed to produce the final output. This allows the model to jointly attend to information from different representation subspaces at different positions.
  • U-Net Architecture:

    • Conceptual Definition: U-Net is a convolutional network architecture primarily designed for image segmentation, but widely adopted for various image-to-image translation tasks. It features an encoder-decoder structure with symmetric downsampling and upsampling paths, connected by skip connections.
    • Structure:
      • Encoder (Downsampling Path): Consists of repeated application of convolutional layers, followed by pooling layers (e.g., max-pooling or strided convolutions) to reduce spatial dimensions while increasing feature channels. This captures contextual information.
      • Decoder (Upsampling Path): Consists of upsampling operations (e.g., transposed convolutions), followed by convolutional layers. This path gradually recovers the spatial resolution.
      • Skip Connections: Direct connections from the encoder path to the corresponding (same spatial resolution) decoder path. These are crucial for propagating fine-grained details lost during downsampling, enabling precise localization.
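
To make the standard self-attention formula above concrete, here is a minimal PyTorch sketch of multi-head self-attention (referenced from the Self-Attention bullet earlier in this list). The module name `SimpleMSA` and the tensor shapes are illustrative assumptions, not code from the paper.

```python
import torch
import torch.nn as nn

class SimpleMSA(nn.Module):
    """Minimal multi-head self-attention: softmax(Q K^T / sqrt(d_k)) V per head."""
    def __init__(self, dim: int, heads: int):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.to_qkv = nn.Linear(dim, dim * 3, bias=False)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                                    # x: (B, N, C) tokens
        b, n, c = x.shape
        qkv = self.to_qkv(x).reshape(b, n, 3, self.heads, self.d_k)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)                 # each: (B, heads, N, d_k)
        attn = (q @ k.transpose(-2, -1)) / self.d_k ** 0.5   # (B, heads, N, N)
        out = attn.softmax(dim=-1) @ v                       # (B, heads, N, d_k)
        out = out.transpose(1, 2).reshape(b, n, c)           # concatenate heads
        return self.proj(out)

# Example: SimpleMSA(32, 4)(torch.randn(1, 64, 32)).shape -> (1, 64, 32)
```

Note that this global formulation is quadratic in the number of tokens $N$, which is exactly the cost issue IG-MSA (Section 4.2.2) is designed to avoid.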

3.2. Previous Works

The paper discusses various categories of previous low-light image enhancement methods:

  • Plain Methods:

    • Description: These are simple, global image processing techniques that directly manipulate pixel intensities or histograms to improve visibility. Examples include histogram equalization [1, 8, 12, 40, 41] and Gamma Correction (GC) [19, 42, 53].
    • Drawbacks: They often produce undesirable artifacts (e.g., over-enhancement, loss of detail, unnatural appearance) because they do not consider the underlying illumination factors or local image content. They treat the image uniformly, leading to perceptually inconsistent results.
  • Traditional Cognition Methods (Retinex-based):

    • Description: These methods leverage the Retinex theory to decompose an image into reflectance and illumination components, focusing on estimating the illumination map. The reflectance component is then considered the enhanced image. They use hand-crafted priors to constrain the decomposition.
    • Examples: Guo et al. [18] (LIME) proposed refining the initial estimated illumination map by imposing a structure prior. Jobson et al. [23, 24] developed Multi-scale Retinex. Fu et al. [15] used a weighted variational model.
    • Drawbacks: They assume low-light images are corruption-free, which is unrealistic. This leads to severe noise amplification and color distortion in the enhanced images. Their reliance on hand-crafted priors also limits their generalization ability and often requires careful parameter tuning.
  • Deep Learning Methods: These represent the current state-of-the-art and are primarily CNN-based.

    • Category 1: Direct Mapping CNNs:
      • Description: These methods train a CNN to directly learn an end-to-end mapping from a low-light image to its normal-light counterpart, without explicit reliance on image formation models like Retinex.
      • Examples: LLNet [33], MBLLEN [35], EnGAN [22] (which uses GANs for unpaired training).
      • Drawbacks: They often lack interpretability and may struggle to generalize to unseen lighting conditions or image corruptions, as they learn a "brute-force" mapping without explicit physical principles.
    • Category 2: Retinex-based CNNs:
      • Description: These methods integrate the Retinex theory into deep learning frameworks, using CNNs to estimate or refine reflectance and illumination.
      • Examples: RetinexNet [54], KinD [66], DeepUPE [49], RUAS [30].
      • Drawbacks:
        • Multi-stage Training Pipeline: A major limitation is the tedious multi-stage training pipeline. For instance, RetinexNet uses separate CNNs for decomposition, denoising, and illumination adjustment, which are initially trained independently and then finetuned. This is time-consuming and complex.
        • Corruption Oversight: Methods like DeepUPE [49] focus on predicting the illumination map but do not explicitly consider the corruption factors (noise, artifacts) hidden in the dark, leading to amplified noise and color distortion when brightening.
        • CNN Limitations: Like other CNN-based methods, they are limited in capturing long-range dependencies and non-local self-similarity, which are crucial for consistent image restoration across the entire image.
  • Vision Transformers for Image Restoration:

    • Description: Recent advancements have seen Transformers applied to low-level vision tasks like image restoration, showing promise for capturing global dependencies.
    • Examples: IPT [11], Uformer [52], Restormer [60].
    • Drawbacks (relevant to this paper): Directly applying vanilla Vision Transformers to high-resolution image tasks often encounters an issue of unaffordable computational cost, as self-attention is quadratic to input spatial size (HW). This led to hybrid approaches like SNR-Net [57], which only employs a single global Transformer layer at the lowest spatial resolution of a U-shaped CNN, thus not fully exploring the Transformer's potential.

3.3. Technological Evolution

The evolution of low-light image enhancement can be traced as follows:

  1. Early 20th Century - Theory: The conceptual foundation laid by Retinex theory (Land, 1960s-70s), positing image decomposition into reflectance and illumination.

  2. Late 20th Century - Traditional Algorithms: Development of traditional Retinex-based algorithms (e.g., Multi-scale Retinex by Jobson et al., 1990s) and simple pixel manipulation methods (e.g., histogram equalization, gamma correction). These relied on hand-crafted priors and often had limitations in handling noise and generalization.

  3. 2010s - Deep Learning Era (CNNs): With the rise of deep learning, CNNs began to dominate.

    • Initially, direct end-to-end mapping CNNs emerged, learning arbitrary functions from low-light to normal-light images (e.g., LLNet, MBLLEN).
    • Soon after, CNNs were integrated with the Retinex theory (e.g., RetinexNet, KinD, DeepUPE), aiming to leverage the interpretable decomposition. However, these often suffered from multi-stage training.
  4. Early 2020s - Transformer Integration: The success of Transformers in NLP led to their adoption in computer vision (Vision Transformers). Initial applications to image restoration demonstrated their power in capturing long-range dependencies, but their high computational cost for high-resolution images necessitated hybrid models (e.g., SNR-Net) that used Transformers sparingly.

    Retinexformer fits into this timeline by pushing the boundaries of Transformer integration. It addresses the key shortcomings of previous Retinex-based deep learning methods (multi-stage training, corruption handling) and CNN-based methods (long-range dependencies) by proposing a one-stage, Retinex-inspired framework that fully leverages a computationally efficient, illumination-guided Transformer.

3.4. Differentiation Analysis

Compared to the main methods in related work, Retinexformer introduces several core differences and innovations:

  • Explicit Corruption Modeling within Retinex Framework:

    • Previous: Traditional and many deep Retinex-based methods (e.g., RetinexNet, DeepUPE) largely ignore or simplify the corruptions (noise, artifacts, exposure issues) present in low-light images or introduced during the brightening process. They primarily focus on estimating "clean" reflectance and illumination.
    • Retinexformer Innovation: The paper explicitly reformulates the Retinex model by introducing perturbation terms ($\hat{\mathbf{R}}$, $\hat{\mathbf{L}}$) for both reflectance and illumination. This allows the model to account for hidden corruptions and those caused by "light-up," enabling a dedicated corruption restorer ($\mathcal{R}$) to handle them effectively.
  • One-Stage End-to-End Training:

    • Previous: Most Retinex-based deep learning methods (e.g., RetinexNet, KinD) require tedious multi-stage training pipelines, involving separate training phases for decomposition, denoising, and adjustment, followed by finetuning.
    • Retinexformer Innovation: ORF is designed as a one-stage framework, allowing the entire model to be trained end-to-end. This significantly simplifies the training process, making it more efficient and less prone to compounding errors from independent stages.
  • Full Integration of Transformers with Illumination Guidance:

    • Previous: CNN-based methods inherently struggle with long-range dependencies. While some hybrid methods like SNR-Net introduced Transformers, they used them sparingly (e.g., a single global Transformer layer at the lowest resolution) due to the quadratic computational complexity of vanilla global self-attention.
    • Retinexformer Innovation: The paper designs Illumination-Guided Transformer (IGT) and specifically Illumination-Guided Multi-head Self-Attention (IG-MSA). This mechanism:
      1. Models Long-Range Dependencies: It fully integrates Transformer capabilities throughout the network (within IGAB units at multiple scales), exploring their potential more thoroughly than hybrid approaches.
      2. Addresses Computational Cost: IG-MSA reduces the computational complexity from quadratic to linear with respect to spatial size by treating single-channel feature maps as tokens.
      3. Leverages Illumination Information: Crucially, IG-MSA uses illumination representations ($\mathbf{F}_{lu}$) to explicitly direct the computation of self-attention. This guides the model to understand and interact between regions with different lighting conditions, making the attention mechanism highly relevant to the low-light enhancement task. This contextual guidance is a key differentiator from generic Transformers.
  • Robust Illumination Map Estimation:

    • Previous: Some methods estimate the illumination map $\mathbf{L}$, which then requires element-wise division of $\mathbf{I}$ by $\mathbf{L}$ to obtain the reflectance. This operation is numerically unstable and prone to data overflow or amplification of small errors when values of $\mathbf{L}$ are close to zero.

    • Retinexformer Innovation: ORF instead estimates the light-up map $\bar{\mathbf{L}}$ (such that $\bar{\mathbf{L}} \odot \mathbf{L} = \mathbf{1}$), enabling element-wise multiplication ($\mathbf{I} \odot \bar{\mathbf{L}}$) for brightening. This is a more robust and numerically stable operation, avoiding the pitfalls of division.

      In essence, Retinexformer innovatively combines the interpretability of a revised Retinex model with the global modeling power of a computationally efficient and illumination-aware Transformer, all within a streamlined one-stage training paradigm.

4. Methodology

4.1. Principles

The core idea of Retinexformer is to build a simple yet principled One-stage Retinex-based Framework (ORF) that explicitly models corruptions in low-light images and leverages Transformer's ability to capture long-range dependencies, guided by illumination information.

The theoretical basis and intuition are as follows:

  1. Revised Retinex Model: The traditional Retinex theory assumes images are perfect. However, real low-light images are inherently corrupted by noise, artifacts, and color distortion due to imaging conditions (e.g., high ISO, long exposure). Furthermore, the process of brightening a dark image can amplify existing corruptions or introduce new ones like under-/over-exposure. The paper's principle is to revise the Retinex model by adding perturbation terms to both reflectance and illumination, explicitly acknowledging and modeling these corruptions. This allows for a dedicated corruption restorer to clean the image after initial brightening.
  2. One-Stage Training: Instead of the tedious multi-stage pipelines common in previous Retinex-based deep learning methods, the paper aims for an end-to-end, one-stage training process. This is achieved by formulating the entire enhancement as a single trainable network, where illumination estimation and corruption restoration are integrated.
  3. Illumination-Guided Non-local Interactions: CNNs struggle with long-range dependencies, which are crucial for consistent enhancement across an entire image (e.g., maintaining color consistency or denoising patches far apart). Transformers are excellent at this but are computationally expensive for high-resolution images. The principle here is to design a Transformer that is computationally efficient and, crucially, guided by illumination information. The intuition is that the illumination map provides valuable contextual cues about different lighting conditions across the image, which can direct the Transformer's attention mechanism to effectively model interactions between well-lit and dark regions.

4.2. Core Methodology In-depth (Layer by Layer)

The Retinexformer algorithm is built upon the One-stage Retinex-based Framework (ORF) and employs an Illumination-Guided Transformer (IGT) as its corruption restorer.

The overall architecture is illustrated in Figure 2 (a) from the original paper.

Figure 2 (schematic from the original paper): the overall structure of Retinexformer, including the illumination estimator and the corruption restorer. It depicts the input image, the light-up map, the enhanced image, and the components of the illumination-guided attention block and the illumination-guided multi-head self-attention mechanism.

4.2.1. One-stage Retinex-based Framework (ORF)

The ORF is formulated by first revising the standard Retinex theory to account for corruptions, and then defining an illumination estimator (E\mathcal{E}) and a corruption restorer (R\mathcal{R}).

1. Revised Retinex Model: The classical Retinex theory states that a low-light image $\mathbf{I} \in \mathbb{R}^{H \times W \times 3}$ (where $H$ and $W$ are the height and width, and 3 is the number of RGB channels) can be decomposed into a reflectance $\mathbf{R} \in \mathbb{R}^{H \times W \times 3}$ and a single-channel illumination map $\mathbf{L} \in \mathbb{R}^{H \times W}$ as: $ \mathbf{I} = \mathbf{R} \odot \mathbf{L} $, where $\odot$ denotes element-wise multiplication.

The authors point out that this model assumes I\mathbf{I} is corruption-free, which is often not true in real low-light scenarios. Corruptions arise from two main sources:

  • Hidden Corruptions: High-ISO and long-exposure settings, common in dark scenes, inevitably introduce noise and artifacts (modeled by $\hat{\mathbf{R}}$).

  • Introduced Corruptions: The light-up process itself can amplify existing noise/artifacts and cause new issues like under-/over-exposure and color distortion (modeled by $\hat{\mathbf{L}}$).

    To model these corruptions, the authors reformulate the equation by introducing perturbation terms $\hat{\mathbf{R}} \in \mathbb{R}^{H \times W \times 3}$ for reflectance and $\hat{\mathbf{L}} \in \mathbb{R}^{H \times W}$ for illumination: $ \mathbf{I} = (\mathbf{R} + \hat{\mathbf{R}}) \odot (\mathbf{L} + \hat{\mathbf{L}}) $. Expanding this gives: $ \mathbf{I} = \mathbf{R} \odot \mathbf{L} + \mathbf{R} \odot \hat{\mathbf{L}} + \hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}}) $. The paper regards $\mathbf{R}$ as the desired well-exposed image. To "light up" the low-light image $\mathbf{I}$, it is element-wise multiplied by a light-up map $\bar{\mathbf{L}}$ such that $\bar{\mathbf{L}} \odot \mathbf{L} = \mathbf{1}$ (i.e., $\bar{\mathbf{L}}$ acts as a pixel-wise inverse of $\mathbf{L}$). Multiplying both sides of the reformulated equation by $\bar{\mathbf{L}}$ yields: $ \mathbf{I} \odot \bar{\mathbf{L}} = \mathbf{R} + \mathbf{R} \odot (\hat{\mathbf{L}} \odot \bar{\mathbf{L}}) + (\hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}})) \odot \bar{\mathbf{L}} $. Breaking down the terms on the right-hand side:

  • $\mathbf{R}$: The desired clean reflectance.

  • $\mathbf{R} \odot (\hat{\mathbf{L}} \odot \bar{\mathbf{L}})$: Corruptions such as under-/over-exposure and color distortion caused by perturbations in the illumination estimation/light-up process ($\hat{\mathbf{L}}$ is the perturbation in illumination, and $\bar{\mathbf{L}}$ is the light-up map).

  • $(\hat{\mathbf{R}} \odot (\mathbf{L} + \hat{\mathbf{L}})) \odot \bar{\mathbf{L}}$: The noise and artifacts initially hidden in the dark scene ($\hat{\mathbf{R}}$), which are amplified by the brightening process (multiplication by $\bar{\mathbf{L}}$).

    This expression is then simplified into a more compact form: $ \mathbf{I}_{lu} = \mathbf{I} \odot \bar{\mathbf{L}} = \mathbf{R} + \mathbf{C} $, where:

  • $\mathbf{I}_{lu} \in \mathbb{R}^{H \times W \times 3}$: The lit-up image, i.e., the low-light image after initial brightening. This image still contains corruptions.

  • $\mathbf{C} \in \mathbb{R}^{H \times W \times 3}$: The overall corruption term, encompassing all the noise, artifacts, under-/over-exposure, and color distortion.

2. ORF Architecture: The ORF then defines the overall enhancement process as: $ (\mathbf{I}_{lu}, \mathbf{F}_{lu}) = \mathcal{E}(\mathbf{I}, \mathbf{L}_p) $ and $ \mathbf{I}_{en} = \mathcal{R}(\mathbf{I}_{lu}, \mathbf{F}_{lu}) $, where:

  • $\mathcal{E}$: The illumination estimator. It takes the low-light input image $\mathbf{I}$ and an illumination prior map $\mathbf{L}_p$ as inputs.
    • $\mathbf{L}_p = \mathrm{mean}_c(\mathbf{I})$, where $\mathrm{mean}_c$ computes the mean pixel value along the channel dimension (i.e., a grayscale version of the input image), serving as an initial estimate of illumination.
    • $\mathcal{E}$ outputs two components: the lit-up image $\mathbf{I}_{lu}$ (computed as $\mathbf{I} \odot \bar{\mathbf{L}}$) and a light-up feature $\mathbf{F}_{lu} \in \mathbb{R}^{H \times W \times C}$ (where $C$ is the feature channel dimension).
  • $\mathcal{R}$: The corruption restorer. It takes the lit-up image $\mathbf{I}_{lu}$ and the light-up feature $\mathbf{F}_{lu}$ as inputs.
  • $\mathbf{I}_{en} \in \mathbb{R}^{H \times W \times 3}$: The final enhanced image, produced by $\mathcal{R}$ after restoring the corruptions.

3. Illumination Estimator ($\mathcal{E}$) Details: The architecture of $\mathcal{E}$ is shown in Figure 2 (a) (i).

  • It first concatenates the input low-light image $\mathbf{I}$ and its illumination prior map $\mathbf{L}_p$.
  • A conv1x1 (convolution with kernel size 1x1) layer is used to fuse these concatenated inputs.
  • A depth-wise separable conv5x5 layer is then applied. This layer is chosen to model interactions between regions with different lighting conditions, generating the light-up feature $\mathbf{F}_{lu}$. The paper notes that well-exposed regions can provide semantic context for under-exposed regions.
  • Finally, another conv1x1 layer aggregates $\mathbf{F}_{lu}$ to produce the light-up map $\bar{\mathbf{L}} \in \mathbb{R}^{H \times W \times 3}$. Importantly, $\bar{\mathbf{L}}$ is designed as a three-channel RGB tensor (instead of a single-channel one as in some prior works) to enhance its capacity to simulate non-linearity across RGB channels for better color enhancement.
  • This $\bar{\mathbf{L}}$ is then used to compute $\mathbf{I}_{lu} = \mathbf{I} \odot \bar{\mathbf{L}}$.
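
To make the layer sequence above concrete, here is a minimal PyTorch sketch of the illumination estimator under the stated design (conv1x1 fusion of the image and its channel-mean prior, a depth-wise 5x5 convolution, and a conv1x1 producing a three-channel light-up map). The class name, the feature width `n_feat`, and the use of a plain depth-wise convolution are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class IlluminationEstimator(nn.Module):
    """Sketch of ORF's estimator E: low-light image I -> (lit-up image I_lu, light-up feature F_lu)."""
    def __init__(self, n_feat: int = 40):
        super().__init__()
        self.fuse = nn.Conv2d(3 + 1, n_feat, kernel_size=1)                    # fuse I (3ch) with prior L_p (1ch)
        self.dwconv = nn.Conv2d(n_feat, n_feat, 5, padding=2, groups=n_feat)   # depth-wise 5x5
        self.out = nn.Conv2d(n_feat, 3, kernel_size=1)                         # 3-channel light-up map

    def forward(self, img):                          # img: (B, 3, H, W), low-light input I
        prior = img.mean(dim=1, keepdim=True)        # L_p = mean over the RGB channels
        f_lu = self.dwconv(self.fuse(torch.cat([img, prior], dim=1)))          # light-up feature F_lu
        light_up_map = self.out(f_lu)                # \bar{L}, the RGB light-up map
        i_lu = img * light_up_map                    # I_lu = I ⊙ \bar{L}  (multiplication, not division)
        return i_lu, f_lu

# Example: IlluminationEstimator()(torch.rand(1, 3, 128, 128))  -> I_lu (1,3,128,128), F_lu (1,40,128,128)
```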

Discussion on ORF:

  • Estimation of $\bar{\mathbf{L}}$ vs. $\mathbf{L}$: The paper highlights a crucial design choice: estimating $\bar{\mathbf{L}}$ (the inverse illumination) instead of $\mathbf{L}$ (the illumination map itself). If $\mathbf{L}$ were estimated, the lit-up image would be obtained by element-wise division of $\mathbf{I}$ by $\mathbf{L}$. This operation is numerically unstable because pixel values in $\mathbf{L}$ can be very small or even zero, leading to data overflow or amplification of small computational errors. By modeling $\bar{\mathbf{L}}$ and using element-wise multiplication ($\mathbf{I} \odot \bar{\mathbf{L}}$), the process becomes much more robust.
  • Comprehensive Corruption Handling: Unlike previous Retinex-based deep learning methods that mainly focused on suppressing reflectance corruptions ($\hat{\mathbf{R}}$), ORF explicitly considers both illumination estimation errors ($\hat{\mathbf{L}}$) and hidden noise ($\hat{\mathbf{R}}$), using $\mathcal{R}$ to restore all of them.

4.2.2. Illumination-Guided Transformer (IGT)

The Illumination-Guided Transformer (IGT) serves as the corruption restorer $\mathcal{R}$ in ORF. It is designed to overcome the limitations of CNNs in capturing long-range dependencies while mitigating the high computational cost of vanilla Transformers.

1. Network Structure: As shown in Figure 2 (a) (ii), IGT adopts a three-scale U-shaped architecture [44].

  • Input: The lit-up image $\mathbf{I}_{lu}$ (output of $\mathcal{E}$) is the input to IGT.
  • Downsampling Branch:
    • $\mathbf{I}_{lu}$ first passes through a conv3x3 layer.
    • This is followed by an IGAB (Illumination-Guided Attention Block).
    • Then, a strided conv4x4 layer is used to downscale features.
    • This pattern (two IGABs followed by a strided conv4x4) repeats to generate hierarchical features $\mathbf{F}_i \in \mathbb{R}^{\frac{H}{2^i} \times \frac{W}{2^i} \times 2^i C}$, where $i = 0, 1, 2$ indexes the scale.
    • The deepest level ($\mathbf{F}_2$) passes through two more IGABs.
  • Upsampling Branch:
    • A symmetrical structure is used, employing deconv2x2 (transposed convolution with stride 2) to upscale features.
    • Skip connections are utilized from the downsampling branch to alleviate information loss and transfer fine-grained details.
  • Output: The upsampling branch produces a residual image $\mathbf{I}_{re} \in \mathbb{R}^{H \times W \times 3}$.
  • Final Enhanced Image: The enhanced image $\mathbf{I}_{en}$ is obtained by summing the lit-up image and the residual image: $ \mathbf{I}_{en} = \mathbf{I}_{lu} + \mathbf{I}_{re} $

2. Illumination-Guided Multi-head Self-Attention (IG-MSA): The IG-MSA module is the key component within each IGAB and the core innovation for enabling efficient, illumination-aware Transformer operations. Its details are shown in Figure 2 (c).

  • Input: The input feature $\bar{\mathbf{F}}_{in} \in \mathbb{R}^{H \times W \times C}$ (from the IGAB) is reshaped into tokens $\mathbf{X} \in \mathbb{R}^{HW \times C}$.
  • Head Splitting: $\mathbf{X}$ is then split into $k$ heads: $ \mathbf{X} = [\mathbf{X}_1, \mathbf{X}_2, \dots, \mathbf{X}_k] $, where $\mathbf{X}_i \in \mathbb{R}^{HW \times d_k}$ and $d_k = \frac{C}{k}$. Each $\mathbf{X}_i$ is the input of $head_i$.
  • Query, Key, Value Projections: For each $head_i$, three fully connected (fc) layers (without bias) linearly project $\mathbf{X}_i$ into query ($\mathbf{Q}_i$), key ($\mathbf{K}_i$), and value ($\mathbf{V}_i$) elements: $ \mathbf{Q}_i = \mathbf{X}_i \mathbf{W}_{\mathbf{Q}_i}^\mathrm{T}, \quad \mathbf{K}_i = \mathbf{X}_i \mathbf{W}_{\mathbf{K}_i}^\mathrm{T}, \quad \mathbf{V}_i = \mathbf{X}_i \mathbf{W}_{\mathbf{V}_i}^\mathrm{T} $, where $\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i \in \mathbb{R}^{HW \times d_k}$, the matrices $\mathbf{W}_{\mathbf{Q}_i}, \mathbf{W}_{\mathbf{K}_i}, \mathbf{W}_{\mathbf{V}_i} \in \mathbb{R}^{d_k \times d_k}$ are the learnable parameters of the fc layers, and $\mathrm{T}$ denotes matrix transpose.
  • Illumination Guidance: The light-up feature $\mathbf{F}_{lu} \in \mathbb{R}^{H \times W \times C}$ (estimated by $\mathcal{E}$) is crucial here. It encodes illumination information and interactions of regions with different lighting conditions. This feature is also reshaped into $\mathbf{Y} \in \mathbb{R}^{HW \times C}$ and split into $k$ heads: $ \mathbf{Y} = [\mathbf{Y}_1, \mathbf{Y}_2, \dots, \mathbf{Y}_k] $, where $\mathbf{Y}_i \in \mathbb{R}^{HW \times d_k}$.
  • Illumination-Guided Self-Attention Calculation: For each $head_i$, the self-attention is formulated as: $ \operatorname{Attention}(\mathbf{Q}_i, \mathbf{K}_i, \mathbf{V}_i, \mathbf{Y}_i) = (\mathbf{Y}_i \odot \mathbf{V}_i) \operatorname{softmax}\left(\frac{\mathbf{K}_i^\mathrm{T} \mathbf{Q}_i}{\alpha_i}\right) $. Here:
    • $\odot$: Element-wise multiplication. This is where the illumination guidance happens: the value elements $\mathbf{V}_i$ are modulated by the illumination features $\mathbf{Y}_i$. The information aggregated by the attention mechanism is therefore weighted by the illumination context, allowing regions with different lighting conditions to interact more effectively.
    • $\alpha_i \in \mathbb{R}^1$: A learnable parameter that adaptively scales the matrix product $\mathbf{K}_i^\mathrm{T} \mathbf{Q}_i$. This helps control the sharpness of the attention distribution.
  • Output: After computing attention for all $k$ heads, their outputs are concatenated. This concatenated result then passes through another fc layer. Finally, a positional encoding $\mathbf{P} \in \mathbb{R}^{HW \times C}$ (learnable parameters) is added to produce the output tokens $\mathbf{X}_{out} \in \mathbb{R}^{HW \times C}$. These tokens are then reshaped back to the output feature $\mathbf{F}_{out} \in \mathbb{R}^{H \times W \times C}$.
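
The following is a minimal PyTorch sketch of the IG-MSA computation described above: per-head bias-free projections, attention computed as $(\mathbf{Y}_i \odot \mathbf{V}_i)\,\mathrm{softmax}(\mathbf{K}_i^\mathrm{T}\mathbf{Q}_i / \alpha_i)$, and a final projection. The class name, the softmax axis, and the omission of the learnable positional encoding are simplifying assumptions rather than the authors' exact code.

```python
import torch
import torch.nn as nn

class IGMSA(nn.Module):
    """Sketch of Illumination-Guided MSA: values are modulated by the light-up feature F_lu."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        assert dim % heads == 0
        self.heads, self.d_k = heads, dim // heads
        self.to_q = nn.Linear(dim, dim, bias=False)
        self.to_k = nn.Linear(dim, dim, bias=False)
        self.to_v = nn.Linear(dim, dim, bias=False)
        self.alpha = nn.Parameter(torch.ones(heads, 1, 1))   # learnable per-head scaling alpha_i
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, y):
        # x: tokens (B, HW, C); y: light-up feature F_lu reshaped to (B, HW, C)
        b, n, c = x.shape
        split = lambda t: t.reshape(b, n, self.heads, self.d_k).transpose(1, 2)  # -> (B, heads, HW, d_k)
        q, k, v, y = split(self.to_q(x)), split(self.to_k(x)), split(self.to_v(x)), split(y)
        attn = (k.transpose(-2, -1) @ q) / self.alpha        # (B, heads, d_k, d_k): linear in HW
        out = (y * v) @ attn.softmax(dim=-1)                 # illumination-modulated values aggregate attention
        return self.proj(out.transpose(1, 2).reshape(b, n, c))

# Example: IGMSA(dim=32, heads=4)(torch.randn(1, 256, 32), torch.randn(1, 256, 32)).shape -> (1, 256, 32)
```

Note how the attention map is only $d_k \times d_k$ rather than $HW \times HW$, which is the source of the linear complexity discussed next.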

3. Complexity Analysis: This section directly compares the computational efficiency of IG-MSA with Global Multi-head Self-Attention (G-MSA).

  • Complexity of IG-MSA: The main computational burden comes from the two matrix multiplications in Eq. (9) for each of the $k$ heads: $\mathbf{K}_i^\mathrm{T} \mathbf{Q}_i$ and $(\mathbf{Y}_i \odot \mathbf{V}_i) \times \operatorname{softmax}(\cdot)$.

    • The term $\mathbf{K}_i^\mathrm{T} \mathbf{Q}_i$ multiplies matrices of size $\mathbb{R}^{d_k \times HW}$ and $\mathbb{R}^{HW \times d_k}$, costing $HW \cdot d_k \cdot d_k$ operations.
    • The term $(\mathbf{Y}_i \odot \mathbf{V}_i)$ is element-wise; its result is then multiplied by the $d_k \times d_k$ softmax output, i.e., a product of $\mathbb{R}^{HW \times d_k}$ and $\mathbb{R}^{d_k \times d_k}$ matrices, again costing $HW \cdot d_k \cdot d_k$ operations.
    • Summing these over $k$ heads: $ \mathcal{O}(\text{IG-MSA}) = k \cdot [d_k \cdot (d_k \cdot HW) + HW \cdot (d_k \cdot d_k)] = 2HW k d_k^2 $
    • Since $d_k = \frac{C}{k}$ (where $C$ is the total channel dimension), substituting $d_k$ gives: $ \mathcal{O}(\text{IG-MSA}) = 2HW k \left(\frac{C}{k}\right)^2 = \frac{2HW C^2}{k} $
    • Key Insight: This shows that the complexity of IG-MSA is linear with respect to the spatial size (HW).
  • Complexity of Global MSA (G-MSA):

    • The computational complexity of standard global MSA (like in vanilla Vision Transformers) is: $ \mathcal{O}(\text{G-MSA}) = 2(HW)^2 C $
    • Key Insight: This complexity is quadratic with respect to the spatial size (HW).
  • Comparison and Advantage: The linear complexity of IG-MSA (vs. quadratic for G-MSA) is a major advantage. It allows IG-MSA to be plugged into each basic unit IGAB throughout the network, even at higher resolutions, without incurring prohibitive computational costs. This enables a full exploration of Transformer's potential for low-light image enhancement, unlike previous hybrid methods that could only afford a single global Transformer layer at the lowest resolution.
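
A quick back-of-the-envelope check of the two complexity formulas above, using an illustrative feature-map size (the numbers are hypothetical, chosen only to show the scaling gap):

```python
# FLOP estimates for the attention matrix multiplications only (formulas from the text above).
H, W, C, k = 128, 128, 40, 1          # illustrative sizes, not the paper's exact configuration
HW = H * W

ig_msa = 2 * HW * C**2 / k            # O(IG-MSA) = 2*HW*C^2/k  -> linear in HW
g_msa = 2 * HW**2 * C                 # O(G-MSA)  = 2*(HW)^2*C  -> quadratic in HW

print(f"IG-MSA: {ig_msa / 1e6:.1f} MFLOPs vs. G-MSA: {g_msa / 1e9:.1f} GFLOPs")
# Doubling H and W multiplies ig_msa by 4 but g_msa by 16.
```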

5. Experimental Setup

5.1. Datasets

The authors evaluated Retinexformer on a total of thirteen datasets, demonstrating comprehensive testing across various scenarios.

Datasets with Ground Truth (Paired Low-light/Normal-light Images):

  • LOL (Low-Light) Dataset: This dataset contains paired low-light and normal-light images.

    • LOL-v1 [54]: Split into 485 training pairs and 15 testing pairs.
    • LOL-v2 [59]: Divided into two subsets:
      • LOL-v2-real: 689 training pairs, 100 testing pairs. These are real-world low-light images.
      • LOL-v2-synthetic: 900 training pairs, 100 testing pairs. These are synthetically generated low-light images.
    • Characteristics: These datasets are widely used benchmarks for low-light enhancement, covering various indoor and outdoor scenes.
  • SID (See-in-the-Dark) Dataset [9]:

    • Source/Characteristics: A subset captured by a Sony α7S II camera. It consists of short-/long-exposure RAW image pairs. The low-light (short-exposure) and normal-light (long-exposure) RGB images are derived by applying the same in-camera signal processing pipeline to the RAW data. This dataset is known for having significant noise in the low-light images.
    • Scale: 2697 RAW image pairs in total.
    • Splits: 2099 pairs for training and 598 pairs for testing.
  • SMID (Seeing Motion in the Dark) Dataset [10]:

    • Source/Characteristics: Collects 20809 short-/long-exposure RAW image pairs. Similar to SID, RAW data is converted to low-/normal-light RGB image pairs. This dataset offers a larger scale for training.
    • Splits: 15763 pairs for training and the remaining for testing.
  • SDSD (Seeing Dynamic Scenes in the Dark) Dataset [48]:

    • Source/Characteristics: This dataset is designed for dynamic scenes and is captured by a Canon EOS 6D Mark II camera with an ND filter. It contains both indoor and outdoor subsets, focusing on video sequences. The paper uses the static version.
    • SDSD-indoor: 62 low-/normal-light video pairs for training, 6 for testing.
    • SDSD-outdoor: 116 low-/normal-light video pairs for training, 10 for testing.
  • FiveK (MIT-Adobe FiveK) Dataset [5]:

    • Source/Characteristics: A dataset of 5000 images, each adjusted manually by five expert photographers (labeled A-E). It's used for photo enhancement and aesthetic adjustment.
    • Splits: 4500 low-/normal-light image pairs for training, 500 for testing.
    • Reference: The authors use expert C's adjusted images as reference and adopt the sRGB output mode.
    • Example: A low-light image might be an underexposed JPEG, and the normal-light counterpart is the expertly retouched version.

Datasets without Ground Truth (Used for Qualitative Evaluation/User Study):

  • LIME [18]
  • NPE [50]
  • MEF [36]
  • DICM [28]
  • VV [47]
    • Characteristics: These datasets typically consist of real-world low-light images without corresponding normal-light ground truth, making them suitable for subjective visual quality assessment and generalization testing.

      These datasets are well-chosen to validate the method's performance across various low-light conditions, image types (real, synthetic, RAW-derived), and levels of corruption (noise, exposure issues). The inclusion of datasets without ground truth helps assess subjective visual quality and generalization to real-world scenarios where paired data is unavailable.

5.2. Evaluation Metrics

The paper uses standard evaluation metrics for image quality assessment and object detection performance.

5.2.1. Peak Signal-to-Noise Ratio (PSNR)

  • Conceptual Definition: PSNR is a widely used objective metric to quantify the quality of reconstruction of an image compared to an original (ground truth) image. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. A higher PSNR value generally indicates a higher quality (less noisy) reconstructed image. It is often expressed in decibels (dB).
  • Mathematical Formula: $ PSNR = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{MSE} \right) $
  • Symbol Explanation:
    • $MAX_I$: The maximum possible pixel value of the image. For 8-bit grayscale images, this is 255. For color images represented by 8 bits per channel, $MAX_I = 255$.
    • MSE: Mean Squared Error, which is the cumulative squared error between the enhanced image and the ground truth image, divided by the total number of pixels. $ MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} [I(i,j) - K(i,j)]^2 $
      • m, n: The number of rows and columns in the image, respectively.
      • I(i,j): The pixel value at row $i$ and column $j$ in the ground truth image.
      • K(i,j): The pixel value at row $i$ and column $j$ in the enhanced image.
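
For reference, a minimal NumPy implementation of the PSNR definition above (assuming 8-bit images; the function name is illustrative):

```python
import numpy as np

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR = 10 * log10(MAX_I^2 / MSE), in dB."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10 * np.log10(max_val ** 2 / mse)
```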

5.2.2. Structural Similarity Index Measure (SSIM)

  • Conceptual Definition: SSIM is an objective metric that evaluates the perceived quality of an image by quantifying the similarity between two images. Unlike PSNR, which focuses on pixel-wise differences (error visibility), SSIM attempts to model the human visual system's perception of structural information, luminance, and contrast. The SSIM index ranges from -1 to 1, where 1 indicates perfect structural similarity.
  • Mathematical Formula: $ SSIM(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  • Symbol Explanation:
    • x, y: Two image patches (or the entire images) being compared.
    • $\mu_x, \mu_y$: The mean pixel values of $x$ and $y$, respectively.
    • $\sigma_x, \sigma_y$: The standard deviations of pixel values of $x$ and $y$, respectively.
    • $\sigma_{xy}$: The covariance of $x$ and $y$.
    • $c_1 = (K_1 L)^2, c_2 = (K_2 L)^2$: Small constants included to avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images), and $K_1, K_2$ are small constants (e.g., 0.01 and 0.03 by default).
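
In practice SSIM is usually computed with a library implementation rather than by hand; a minimal example using scikit-image's `structural_similarity` (recent versions take a `channel_axis` argument for color images; the placeholder arrays are illustrative):

```python
import numpy as np
from skimage.metrics import structural_similarity

# gt, pred: uint8 RGB images of the same shape (H, W, 3); random placeholders here
gt = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
pred = np.random.randint(0, 256, (64, 64, 3), dtype=np.uint8)
score = structural_similarity(gt, pred, channel_axis=-1, data_range=255)
print(f"SSIM: {score:.3f}")
```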

5.2.3. Average Precision (AP)

  • Conceptual Definition: AP is a common metric used to evaluate the performance of object detection models. It summarizes the precision-recall curve for a given class. A higher AP value indicates better detection performance, meaning the model can accurately identify objects (high precision) while also finding most of them (high recall). The mean Average Precision (mAP) is the average AP across all object classes.
  • Mathematical Formula: $ AP = \sum_{k=1}^{N} P(k) \Delta R(k) $ Or, more commonly, as the area under the precision-recall curve: $ AP = \int_0^1 P(R) dR $
  • Symbol Explanation:
    • P(k): Precision at the $k$-th recall value.
    • $\Delta R(k)$: The change in recall from $k-1$ to $k$.
    • P(R): Precision as a function of recall.
    • Precision: The fraction of correctly detected objects (True Positives) among all detected objects (True Positives + False Positives): $P = \frac{TP}{TP + FP}$.
    • Recall: The fraction of correctly detected objects (True Positives) among all actual objects in the image (True Positives + False Negatives): $R = \frac{TP}{TP + FN}$.
    • Intersection over Union (IoU): A threshold (e.g., 0.5) is typically applied to determine if a detected bounding box correctly matches a ground truth bounding box. If IoU > threshold, it's a True Positive.
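
As a simple illustration of the discrete AP sum above (not the full VOC/COCO evaluation protocol, which additionally involves IoU matching and precision interpolation), a hypothetical helper might look like:

```python
import numpy as np

def average_precision(precision: np.ndarray, recall: np.ndarray) -> float:
    """AP = sum_k P(k) * (R(k) - R(k-1)) over a precision-recall curve."""
    order = np.argsort(recall)                      # sort points by increasing recall
    p, r = precision[order], recall[order]
    return float(np.sum(p * np.diff(r, prepend=0.0)))
```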

5.3. Baselines

The paper compares Retinexformer against a comprehensive set of state-of-the-art (SOTA) low-light image enhancement algorithms, categorized as follows:

  • Retinex-based Deep Learning Methods:

    • DeepUPE [49]
    • RetinexNet [54]
    • RUAS [30]
    • KinD [66]
  • General Deep Learning Image Enhancement/Restoration Methods (CNN-based):

    • SID [9] (A method designed for seeing in the dark, often used as a baseline)
    • 3DLUT [63]
    • RF [26]
    • DeepLPF [38]
    • Sparse [59]
    • EnGAN [22]
    • FIDE [56]
    • DRBN [58]
    • MIRNet [61]
  • Transformer-based or CNN-Transformer Hybrid Methods:

    • IPT [11] (Pre-trained image processing transformer)
    • UFormer [52] (U-shaped Transformer)
    • Restormer [60] (Efficient Transformer for high-resolution image restoration)
    • SNR-Net [57] (SNR-aware CNN-Transformer hybrid network)
  • Methods for Low-light Object Detection Comparison:

    • ZeroDCE [117]

    • SCI [37] (Self-calibrated illumination)

    • Others from the above list (e.g., MIRNet, RetinexNet, RUAS, Restormer, KinD, SNR-Net) when applied as pre-processing.

      These baselines are representative of the current landscape of low-light image enhancement, covering various architectural designs (CNNs, Transformers), foundational principles (Retinex-based, direct mapping), and training paradigms (supervised, unsupervised). This broad comparison validates Retinexformer's efficacy across different approaches.

5.4. Implementation Details

  • Framework: Implemented using PyTorch [39].
  • Optimizer: Adam [25] optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$.
  • Training Iterations: $2.5 \times 10^5$ iterations.
  • Learning Rate Schedule: Initially set to $2 \times 10^{-4}$ and then steadily decayed to $1 \times 10^{-6}$ with the cosine annealing scheme [34] during training.
  • Training Samples: Patches of size $128 \times 128$ are randomly cropped from low-/normal-light image pairs.
  • Batch Size: 8.
  • Data Augmentation: Random rotation and flipping.
  • Loss Function: Mean Absolute Error (MAE) between the enhanced image and the ground truth. The objective is to minimize this MAE. $ MAE = \frac{1}{mnc} \sum_{k=0}^{c-1} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} |I(i,j,k) - K(i,j,k)| $
    • m, n: Height and width of the image.
    • $c$: Number of channels (e.g., 3 for RGB).
    • I(i,j,k): Pixel value at row $i$, column $j$, channel $k$ in the ground truth image.
    • K(i,j,k): Pixel value at row $i$, column $j$, channel $k$ in the enhanced image.
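
A condensed sketch of this training configuration (Adam, cosine annealing from 2e-4 to 1e-6 over 2.5e5 iterations, 128×128 crops, batch size 8, MAE/L1 loss). The `model` and the random tensors are placeholders standing in for the full Retinexformer network and the cropped, augmented training pairs:

```python
import torch
import torch.nn as nn

model = nn.Conv2d(3, 3, kernel_size=3, padding=1)     # placeholder for the full Retinexformer model
optimizer = torch.optim.Adam(model.parameters(), lr=2e-4, betas=(0.9, 0.999))
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=250_000, eta_min=1e-6)
criterion = nn.L1Loss()                                # MAE between enhanced image and ground truth

for step in range(250_000):
    # Stand-ins for randomly cropped, flipped/rotated 128x128 low-/normal-light pairs (batch size 8).
    low, gt = torch.rand(8, 3, 128, 128), torch.rand(8, 3, 128, 128)
    loss = criterion(model(low), gt)
    optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```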

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Retinexformer consistently achieves state-of-the-art performance across a wide array of benchmarks, both quantitatively and qualitatively, while maintaining competitive computational efficiency.

The following are the results from Table 1 of the original paper:

| Methods | FLOPS (G) | Params (M) | LOL-v1 PSNR/SSIM | LOL-v2-real PSNR/SSIM | LOL-v2-syn PSNR/SSIM | SID PSNR/SSIM | SMID PSNR/SSIM | SDSD-in PSNR/SSIM | SDSD-out PSNR/SSIM |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| SID [9] | 13.73 | 7.76 | 14.35/0.436 | 13.24/0.442 | 15.04/0.610 | 16.97/0.591 | 24.78/0.718 | 23.29/0.703 | 24.90/0.693 |
| 3DLUT [63] | 0.075 | 0.59 | 14.35/0.445 | 17.59/0.721 | 18.04/0.800 | 20.11/0.592 | 23.86/0.678 | 21.66/0.655 | 21.89/0.649 |
| DeepUPE [49] | 21.10 | 1.02 | 14.38/0.446 | 13.27/0.452 | 15.08/0.623 | 17.01/0.604 | 23.91/0.690 | 21.70/0.662 | 21.94/0.698 |
| RF [26] | 46.23 | 21.54 | 15.23/0.452 | 14.05/0.458 | 15.97/0.632 | 16.44/0.596 | 23.11/0.681 | 20.97/0.655 | 21.21/0.689 |
| DeepLPF [38] | 5.86 | 1.77 | 15.28/0.473 | 14.10/0.480 | 16.02/0.587 | 18.07/0.600 | 24.36/0.688 | 22.21/0.664 | 22.76/0.658 |
| IPT [11] | 6887 | 115.31 | 16.27/0.504 | 19.80/0.813 | 18.30/0.811 | 20.53/0.561 | 27.03/0.783 | 26.11/0.831 | 27.55/0.850 |
| Uformer [52] | 12.00 | 5.29 | 16.36/0.771 | 18.82/0.771 | 19.66/0.871 | 18.54/0.577 | 27.20/0.792 | 23.17/0.859 | 23.85/0.748 |
| RetinexNet [54] | 587.47 | 0.84 | 16.77/0.560 | 15.47/0.567 | 17.13/0.798 | 16.48/0.578 | 22.83/0.684 | 20.84/0.617 | 20.96/0.629 |
| Sparse [59] | 53.26 | 2.33 | 17.20/0.640 | 20.06/0.816 | 22.05/0.905 | 18.68/0.606 | 25.48/0.766 | 23.25/0.863 | 25.28/0.804 |
| EnGAN [22] | 61.01 | 114.35 | 17.48/0.650 | 18.23/0.617 | 16.57/0.734 | 17.23/0.543 | 22.62/0.674 | 20.02/0.604 | 20.10/0.616 |
| RUAS [30] | 0.83 | 0.003 | 18.23/0.720 | 18.37/0.723 | 16.55/0.652 | 18.44/0.581 | 25.88/0.744 | 23.17/0.696 | 23.84/0.743 |
| FIDE [56] | 28.51 | 8.62 | 18.27/0.665 | 16.85/0.678 | 15.20/0.612 | 18.34/0.578 | 24.42/0.692 | 22.41/0.659 | 22.20/0.629 |
| DRBN [58] | 48.61 | 5.27 | 20.13/0.830 | 20.29/0.831 | 23.22/0.927 | 19.02/0.577 | 26.60/0.781 | 24.08/0.868 | 25.77/0.841 |
| KinD [66] | 34.99 | 8.02 | 20.86/0.790 | 14.74/0.641 | 13.29/0.578 | 18.02/0.583 | 22.18/0.634 | 21.95/0.672 | 21.97/0.654 |
| Restormer [60] | 144.25 | 26.13 | 22.43/0.823 | 19.94/0.827 | 21.41/0.830 | 22.27/0.649 | 26.97/0.758 | 25.67/0.827 | 24.79/0.802 |
| MIRNet [61] | 785 | 31.76 | 24.14/0.830 | 20.02/0.820 | 21.94/0.876 | 20.84/0.605 | 25.66/0.762 | 24.38/0.864 | 27.13/0.837 |
| SNR-Net [57] | 26.35 | 4.01 | 24.61/0.842 | 21.48/0.849 | 24.14/0.928 | 22.87/0.625 | 28.49/0.805 | 29.44/0.894 | 28.66/0.866 |
| Retinexformer | 15.57 | 1.61 | 25.16/0.845 | 22.80/0.840 | 25.67/0.930 | 24.44/0.680 | 29.15/0.815 | 29.77/0.896 | 29.84/0.877 |

Quantitative Results Analysis (Table 1):

  • Overall Dominance: Retinexformer achieves the highest PSNR on all seven datasets in Table 1: LOL-v1, LOL-v2-real, LOL-v2-synthetic, SID, SMID, SDSD-indoor, and SDSD-outdoor, and the highest SSIM on all of them except LOL-v2-real. This robust performance across diverse benchmarks strongly validates its effectiveness.
  • Comparison with SOTA (SNR-Net): Retinexformer consistently outperforms SNR-Net, which was the previous best method.
    • It gains improvements of 0.55 dB (LOL-v1), 1.32 dB (LOL-v2-real), 1.53 dB (LOL-v2-synthetic), 1.57 dB (SID), 0.66 dB (SMID), 0.33 dB (SDSD-indoor), and 1.18 dB (SDSD-outdoor) in PSNR.
    • Crucially, Retinexformer achieves this while being more efficient: it costs only 40% of the parameters (1.61 M vs. 4.01 M) and 59% of the FLOPS (15.57 G vs. 26.35 G) compared to SNR-Net. This highlights its efficiency advantage.
  • Comparison with Retinex-based Deep Learning Methods: When compared to DeepUPE, RetinexNet, RUAS, and KinD, Retinexformer shows dramatic improvements, particularly on datasets with severe corruptions.
    • Improvements are over 6 dB on SID (e.g., 24.44 dB vs. RetinexNet's 16.48 dB) and SDSD datasets, indicating its superior ability to handle noise and artifacts. This is a direct validation of its corruption modeling within the ORF.
  • Comparison with Transformer-based Image Restoration Algorithms: Even against other Transformer-based methods like IPT, Uformer, and Restormer, Retinexformer shows significant gains (e.g., up to 4.26 dB).
    • Furthermore, Retinexformer is vastly more efficient than IPT and Restormer. For example, it uses only 1.4% of IPT's parameters and 0.2% of its FLOPS, and 6.2% of Restormer's parameters and 10.9% of its FLOPS. This underscores the success of the IG-MSA in achieving high performance with reduced computational complexity.

      The following are the results from Table 2 of the original paper:

      | Methods | DeepUPE [49] | MIRNet [61] | SNR-Net [57] | Restormer [60] | Ours |
      | --- | --- | --- | --- | --- | --- |
      | PSNR (dB) | 23.04 | 23.73 | 23.81 | 24.13 | 24.94 |
      | FLOPS (G) | 21.10 | 785.0 | 26.35 | 144.3 | 15.57 |

Quantitative Results Analysis (Table 2 - FiveK Dataset): On the FiveK dataset (sRGB output mode), Retinexformer again demonstrates superior performance with a PSNR of 24.94 dB, surpassing Restormer (24.13 dB) and SNR-Net (23.81 dB). Its FLOPS (15.57 G) remain significantly lower than MIRNet (785.0 G) and Restormer (144.3 G), and even lower than SNR-Net (26.35 G), further solidifying its efficiency.

These results collectively confirm Retinexformer's outstanding effectiveness and efficiency across various low-light image enhancement tasks.

6.2. Qualitative Results

The visual comparisons provided in the paper further reinforce Retinexformer's quantitative superiority, showcasing its ability to produce perceptually pleasing and natural-looking enhanced images.

The following figure (Figure 3 from the original paper) shows visual comparison results:

Figure 3 (from the original paper): visual comparison of enhancement results, showing the input image, the outputs of several algorithms (RUAS, KinD, Restormer, MIRNet, SNR-Net, Retinexformer), and the ground truth; each row shows the enhancement of a different image.

The following figure (Figure 4 from the original paper) shows visual comparison results:

Figure 4 (from the original paper): comparison of different low-light enhancement algorithms, showing the input image and the outputs of RetinexNet, DeepUPE, Restormer, SNR-Net, and Retinexformer with zoomed-in patches below; the ground-truth image is shown at the bottom right for reference.

The following figure (Figure 5 from the original paper) shows visual comparison results:

Figure 5 (from the original paper): comparison of low-light enhancement algorithms, showing the input image and the outputs of EnlightenGAN, DRBN, IPT, SNR-Net, and Retinexformer alongside the ground truth, highlighting the differences between methods.

The following figure (Figure 7 from the original paper) shows visual comparison results on datasets without ground truth:

Figure 7. Visual comparison on benchmarks without ground truth: enhancement results of several algorithms, including LIME, ZeroDCE, and Retinexformer, illustrating the visual differences among methods and the advantage of Retinexformer in low-light conditions.

Qualitative Results Analysis (Figures 3, 4, 5, 7):

  • Color Fidelity and Distortion: Previous methods often suffer from color distortion. For instance, RUAS in Figure 3 shows noticeable color shifts. Retinexformer robustly preserves original colors, producing natural and accurate hues.

  • Exposure Control: Many methods struggle with over-/under-exposed regions. RetinexNet and DeepUPE in Figure 4 are cited as failing to suppress noise and exhibiting poor exposure control. Retinexformer effectively balances brightness, enhancing visibility in dark areas without introducing overexposure in brighter regions.

  • Noise and Artifact Suppression: Low-light images are typically noisy. Methods like RetinexNet and DeepUPE amplify noise. DRBN and SNR-Net in Figure 5 are mentioned for introducing black spots or unnatural artifacts. Restormer and SNR-Net in Figure 4 can produce blurry images. In contrast, Retinexformer excels at reliably removing noise and artifacts, yielding cleaner images without introducing new corruptions, spots, or blurriness. This is a direct outcome of its corruption modeling capability.

  • Sharpness and Detail Preservation: Retinexformer maintains good image sharpness and preserves fine details, which is crucial for perceptual quality. Some other methods (e.g., Restormer, SNR-Net) tend to generate blurrier results as a trade-off for noise reduction.

  • Generalization to No Ground Truth Data: Figure 7 showcases results on datasets like LIME, NPE, MEF, DICM, and VV, which lack ground truth. In these scenarios, Retinexformer still demonstrates better visual quality than other supervised and unsupervised algorithms, validating its strong generalization ability to real-world diverse low-light scenes.

    The visual evidence strongly supports the quantitative findings, highlighting Retinexformer's ability to produce high-quality, perceptually superior enhanced images across various challenging conditions.

6.3. User Study Score

To quantify human subjective visual perception, a user study was conducted.

The following are the results from Table 3a of the original paper:

| Methods | L-v1 | L-v2-R | L-v2-S | SID | SMID | SD-in | SD-out | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EnGAN [22] | 2.43 | 1.39 | 2.13 | 1.04 | 2.78 | 1.83 | 1.87 | 1.92 |
| RetinexNet [54] | 2.17 | 1.91 | 1.13 | 1.09 | 2.35 | 3.96 | 3.74 | 2.34 |
| DRBN [58] | 2.70 | 2.26 | 3.65 | 1.96 | 2.22 | 2.78 | 2.91 | 2.64 |
| FIDE [56] | 2.87 | 2.52 | 3.48 | 2.22 | 2.57 | 3.04 | 2.96 | 2.81 |
| KinD [66] | 2.65 | 2.48 | 3.17 | 1.87 | 3.04 | 3.43 | 3.39 | 2.86 |
| MIRNet [61] | 2.96 | 3.57 | 3.61 | 2.35 | 2.09 | 2.91 | 3.09 | 2.94 |
| Restormer [60] | 3.04 | 3.48 | 3.39 | 2.43 | 3.17 | 2.48 | 2.70 | 2.96 |
| RUAS [30] | 3.83 | 3.22 | 2.74 | 2.26 | 3.48 | 3.39 | 3.04 | 3.14 |
| SNR-Net [57] | 3.13 | 3.83 | 3.57 | 3.04 | 3.30 | 2.74 | 3.17 | 3.25 |
| Retinexformer | 3.61 | 4.17 | 3.78 | 3.39 | 3.87 | 3.65 | 3.91 | 3.77 |

User Study Analysis (Table 3a):

  • Setup: 23 human subjects were invited to score the visual quality of enhanced images from seven datasets, ranging from 1 (worst) to 5 (best). They were instructed to evaluate based on: (i) presence of under-/over-exposed regions, (ii) presence of color distortion, and (iii) presence of noise or artifacts. Images were presented without method names to avoid bias. 156 testing images were used in total.
  • Results: Retinexformer achieved the highest average score of 3.77, significantly outperforming all other methods. It was also the most favored method on LOL-v2-real, LOL-v2-synthetic, SID, SMID, and SDSD-outdoor datasets, and the second most favored on LOL-v1 and SDSD-indoor.
  • Conclusion: The user study confirms that Retinexformer produces results that are not only objectively superior (higher PSNR/SSIM) but also subjectively more appealing and natural-looking to human observers, indicating its success in addressing perceptual quality aspects of low-light enhancement.
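
As a sanity check, the Mean column of Table 3a is simply the average of the seven per-dataset scores; the snippet below reproduces the 3.77 reported for Retinexformer.

```python
# Per-dataset user-study scores for Retinexformer, copied from Table 3a.
scores = [3.61, 4.17, 3.78, 3.39, 3.87, 3.65, 3.91]
mean_score = sum(scores) / len(scores)
print(f"Mean user-study score: {mean_score:.2f}")  # prints 3.77
```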

6.4. Low-light Object Detection

To demonstrate the practical value of Retinexformer for high-level computer vision tasks, experiments were conducted on low-light object detection.

The following are the results from Table 3b of the original paper:

| Methods | Bicycle | Boat | Bottle | Bus | Car | Cat | Chair | Cup | Dog | Motor | People | Table | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MIRNet [61] | 71.8 | 63.8 | 62.9 | 81.4 | 71.1 | 58.8 | 58.9 | 61.3 | 63.1 | 52.0 | 68.8 | 45.5 | 63.6 |
| RetinexNet [54] | 73.8 | 62.8 | 64.8 | 84.9 | 80.8 | 53.4 | 57.2 | 68.3 | 61.5 | 51.3 | 65.9 | 43.1 | 64.0 |
| RUAS [30] | 72.0 | 62.2 | 65.2 | 72.9 | 78.1 | 57.3 | 62.4 | 61.8 | 60.2 | 61.5 | 69.4 | 46.8 | 64.2 |
| Restormer [60] | 76.2 | 65.1 | 64.2 | 84.0 | 76.3 | 59.2 | 53.0 | 58.7 | 66.1 | 62.9 | 68.6 | 45.0 | 64.9 |
| KinD [66] | 72.2 | 66.5 | 58.9 | 83.7 | 74.5 | 55.4 | 61.7 | 61.3 | 63.8 | 63.0 | 70.5 | 47.8 | 65.0 |
| ZeroDCE [117] | 75.8 | 66.5 | 65.6 | 84.9 | 77.2 | 56.3 | 53.8 | 59.0 | 63.5 | 64.0 | 68.3 | 46.3 | 65.1 |
| SNR-Net [57] | 75.3 | 64.4 | 63.6 | 85.3 | 77.5 | 59.1 | 54.1 | 59.6 | 66.3 | 65.2 | 69.1 | 44.6 | 65.3 |
| SCI [37] | 74.6 | 65.3 | 65.8 | 85.4 | 76.3 | 59.4 | 57.1 | 60.5 | 65.6 | 63.9 | 69.1 | 45.9 | 65.6 |
| Retinexformer | 76.3 | 66.7 | 65.9 | 84.7 | 77.6 | 61.2 | 53.5 | 60.7 | 67.5 | 63.4 | 69.5 | 46.0 | 66.1 |

Experiment Settings:

  • Dataset: ExDark [32] dataset, which contains 7363 underexposed images with bounding box annotations for 12 object categories.
  • Splits: 5890 images for training, 1473 for testing.
  • Detector: YOLO-v3 [43], trained from scratch.
  • Enhancement: Different low-light enhancement methods (including Retinexformer and baselines) serve as preprocessing modules with fixed parameters.
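
A minimal sketch of this preprocessing setup is shown below; the `enhancer` and `detector` objects and their interfaces are illustrative stand-ins (any frozen enhancement network and any detector such as a YOLO-v3 wrapper could fill these roles), not the paper's actual code.

```python
import torch

def detect_after_enhancement(low_light_img, enhancer, detector):
    """Detect objects in a low-light image that is first brightened by a
    frozen enhancement model, mirroring the fixed-parameter preprocessing
    setup described above.

    low_light_img: tensor of shape (1, 3, H, W), values in [0, 1]
    enhancer:      a trained low-light enhancement module (parameters frozen)
    detector:      an object detector, e.g. a YOLO-v3 wrapper
    """
    enhancer.eval()                          # fixed parameters: no fine-tuning
    with torch.no_grad():
        enhanced = enhancer(low_light_img).clamp(0.0, 1.0)
    return detector(enhanced)                # boxes, classes, confidence scores
```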

Quantitative Results (Table 3b - Object Detection AP):

  • Retinexformer achieves the highest mean Average Precision (AP) of 66.1, surpassing all other enhancement methods.

  • This is 0.5 AP higher than SCI [37] (a recent best self-supervised method) and 0.8 AP higher than SNR-Net [57] (a recent best fully-supervised method).

  • Retinexformer also yields the best results on five specific object categories: bicycle, boat, bottle, cat, and dog.

  • Conclusion: These results clearly demonstrate that Retinexformer is not just good at perceptual enhancement but also significantly improves the input quality for downstream computer vision tasks like object detection, making objects more detectable and reliably localized in low-light scenes.

    The following figure (Figure 6 from the original paper) visually compares object detection results:

    Figure 6. Visual comparison of object detection in low-light (left) and enhanced (right) scenes by our method on the ExDark dataset.

    Qualitative Results (Figure 6 - Object Detection):

  • The left image shows object detection in a low-light scene, where the detector (YOLO-v3) misses some boats or predicts inaccurate locations and lower confidence scores.

  • The right image, enhanced by Retinexformer, shows that the detector can now reliably predict well-placed bounding boxes with higher confidence scores, covering all boats.

  • Conclusion: This visual evidence further validates the effectiveness of Retinexformer in making low-light images more amenable to high-level vision tasks.

6.5. Ablation Study

An ablation study was conducted on the SDSD-outdoor dataset to analyze the contribution of each component of Retinexformer. This dataset was chosen because training on it converges well and yields stable results.

The following are the results from Table 4a of the original paper:

| Baseline-1 | ORF | IG-MSA | PSNR | SSIM | Params (M) | FLOPS (G) |
| --- | --- | --- | --- | --- | --- | --- |
| ✓ |  |  | 26.47 | 0.843 | 1.01 | 9.18 |
| ✓ | ✓ |  | 27.92 | 0.857 | 1.27 | 11.37 |
| ✓ |  | ✓ | 28.86 | 0.868 | 1.34 | 13.38 |
| ✓ | ✓ | ✓ | 29.84 | 0.877 | 1.61 | 15.57 |

6.5.1. Break-down Ablation (Table 4a):

  • Baseline-1: This model is derived by removing both ORF (the One-stage Retinex-based Framework) and IG-MSA (Illumination-Guided Multi-head Self-Attention) from Retinexformer. It achieves 26.47 dB PSNR.

  • Adding ORF: When ORF is added to Baseline-1, PSNR improves to 27.92 dB (a gain of 1.45 dB). This shows the effectiveness of the proposed Retinex-based framework in handling low-light conditions and initial brightening.

  • Adding IG-MSA: When IG-MSA is added to Baseline-1 (without ORF), PSNR improves to 28.86 dB (a gain of 2.39 dB). This indicates that the Illumination-Guided Transformer is highly effective in restoring corruptions and modeling long-range dependencies, even without the explicit ORF guidance in the initial stages.

  • Full Retinexformer (ORF + IG-MSA): When both ORF and IG-MSA are jointly exploited, the model achieves the highest PSNR of 29.84 dB (a gain of 3.37 dB over Baseline-1). This demonstrates that ORF and IG-MSA are complementary and their combination leads to the best performance.

  • Conclusion: This breakdown confirms the significant individual contributions of both the ORF and IG-MSA components, and their synergistic effect in achieving superior enhancement.

    The following are the results from Table 4b of the original paper:

| Method | $\mathbf{I}_{lu} = \mathbf{I}$ | $\mathbf{I}_{lu} = \mathbf{I} ./ \mathbf{L}$ | $\mathbf{I}_{lu} = \mathbf{I} \odot \bar{\mathbf{L}}$ | $+\mathbf{F}_{lu}$ |
| --- | --- | --- | --- | --- |
| PSNR | 28.86 | 28.97 | 29.26 | 29.84 |
| SSIM | 0.868 | 0.868 | 0.870 | 0.877 |
| Params (M) | 1.34 | 1.61 | 1.61 | 1.61 |
| FLOPS (G) | 13.38 | 14.01 | 14.01 | 15.57 |

6.5.2. Ablation of the Proposed ORF (Table 4b): This study examines different ways of forming the lit-up image $\mathbf{I}_{lu}$ and the role of the light-up feature $\mathbf{F}_{lu}$.

  • $\mathbf{I}_{lu} = \mathbf{I}$: When ORF is removed and the corruption restorer $\mathcal{R}$ directly takes the raw low-light image $\mathbf{I}$ as input (effectively bypassing the illumination estimation), the PSNR is 28.86 dB. This configuration is equivalent to the third row of Table 4a (Baseline-1 + IG-MSA).

  • $\mathbf{I}_{lu} = \mathbf{I} ./ \mathbf{L}$: Here, ORF is used, but the illumination estimator $\mathcal{E}$ estimates the illumination map $\mathbf{L}$. The lit-up image is then computed via element-wise division $\mathbf{I} ./ \mathbf{L}$, with a small constant $\epsilon = 1 \times 10^{-4}$ added to $\mathbf{L}$ to prevent division by zero. This yields a PSNR of 28.97 dB, a marginal improvement of 0.11 dB over $\mathbf{I}_{lu} = \mathbf{I}$, reflecting the numerical instability and vulnerability of the division operation discussed in the methodology.

  • $\mathbf{I}_{lu} = \mathbf{I} \odot \bar{\mathbf{L}}$: This is the proposed robust formulation, in which ORF estimates the light-up map $\bar{\mathbf{L}}$ (instead of $\mathbf{L}$) and the lit-up image is obtained via element-wise multiplication $\mathbf{I} \odot \bar{\mathbf{L}}$. This results in a PSNR of 29.26 dB, an improvement of 0.29 dB over $\mathbf{I}_{lu} = \mathbf{I} ./ \mathbf{L}$ and 0.40 dB over $\mathbf{I}_{lu} = \mathbf{I}$, confirming the robustness and effectiveness of estimating $\bar{\mathbf{L}}$ and using multiplication.

  • $+\mathbf{F}_{lu}$ (Full Retinexformer): Finally, when the light-up feature $\mathbf{F}_{lu}$ (output of $\mathcal{E}$) is also used to direct the corruption restorer $\mathcal{R}$ (specifically, the IG-MSA modules within IGT), the PSNR further increases to 29.84 dB (a gain of 0.58 dB), and the SSIM improves from 0.870 to 0.877.

  • Conclusion: This ablation clearly demonstrates the importance of robustly estimating the light-up map $\bar{\mathbf{L}}$ via multiplication, and the crucial role of the light-up feature $\mathbf{F}_{lu}$ in guiding the Transformer-based corruption restorer for optimal performance.
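
The three lit-up formulations compared above can be summarized in a few lines. This is a minimal sketch under the document's notation; the maps `L_bar` and `L` are assumed to come from an illumination estimator, which is not implemented here.

```python
import torch

def light_up(I, L_bar=None, L=None, eps=1e-4):
    """Form the lit-up image I_lu according to the three variants in Table 4b.

    I:     low-light image, shape (B, 3, H, W), values in [0, 1]
    L_bar: estimated light-up map (proposed variant), broadcastable to I
    L:     estimated illumination map (classic Retinex variant), broadcastable to I
    """
    if L_bar is not None:
        return I * L_bar          # I_lu = I ⊙ L̄: stable element-wise multiplication
    if L is not None:
        return I / (L + eps)      # I_lu = I ./ L: division; eps avoids zero-division,
                                  # but noise is amplified wherever L is very small
    return I                      # I_lu = I: no ORF, raw image goes to the restorer
```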

    The following are the results from Table 4c of the original paper:

| Method | Baseline-2 | G-MSA | W-MSA | IG-MSA |
| --- | --- | --- | --- | --- |
| PSNR | 27.92 | 28.43 | 28.65 | 29.84 |
| SSIM | 0.857 | 0.841 | 0.845 | 0.877 |
| Params (M) | 1.27 | 1.61 | 1.61 | 1.61 |
| FLOPS (G) | 11.37 | 17.65 | 16.43 | 15.57 |

6.5.3. Ablation of Self-Attention Schemes (Table 4c): This study compares the proposed IG-MSA with other self-attention variants.

  • Baseline-2: This corresponds to the ORF component from the breakdown ablation, but with a standard CNN-based corruption restorer, meaning IG-MSA is removed. It achieves 27.92 dB PSNR.
  • G-MSA (Global Multi-head Self-Attention): A global MSA is plugged into each basic unit of the corruption restorer $\mathcal{R}$. To avoid out-of-memory errors due to its quadratic complexity, input feature maps are downscaled to 1/4 of their size. This achieves 28.43 dB PSNR, an improvement over Baseline-2. However, it incurs higher FLOPS (17.65 G) and a lower SSIM (0.841) than Baseline-2. The lower SSIM despite higher PSNR suggests potential artifacts or less structural integrity.
  • W-MSA (Window-based Multi-head Self-Attention): This is a local attention mechanism, similar to Swin Transformer [31], where self-attention is computed within non-overlapping windows. It achieves 28.65 dB PSNR, performing better than G-MSA and incurring slightly lower FLOPS (16.43 G).
  • IG-MSA (Illumination-Guided Multi-head Self-Attention): The proposed IG-MSA achieves the highest PSNR of 29.84 dB and SSIM of 0.877. Crucially, it does so with fewer FLOPS (15.57 G) than both G-MSA and W-MSA.
  • Conclusion: IG-MSA surpasses G-MSA by 1.41 dB and W-MSA by 1.19 dB, while costing 2.08 G and 0.86 G fewer FLOPS, respectively. This demonstrates the cost-effectiveness and superior performance of the proposed IG-MSA in capturing the long-range dependencies relevant to low-light image enhancement, benefiting from illumination guidance and an efficient design; a simplified sketch of this attention pattern follows.
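
To illustrate why the cost stays low, the sketch below implements a single-head, channel-wise self-attention whose values are modulated by an illumination feature, so the attention matrix is C×C rather than (HW)×(HW). This is only a simplified reading consistent with the description above; the multi-head splitting, normalization, and the exact way $\mathbf{F}_{lu}$ enters the attention are assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class IlluminationGuidedAttention(nn.Module):
    """Simplified single-head, channel-wise self-attention guided by an
    illumination feature. Each single-channel map acts as one token, so the
    attention matrix is C x C and the cost grows linearly with H*W."""

    def __init__(self, channels: int):
        super().__init__()
        self.to_q = nn.Linear(channels, channels, bias=False)
        self.to_k = nn.Linear(channels, channels, bias=False)
        self.to_v = nn.Linear(channels, channels, bias=False)
        self.scale = nn.Parameter(torch.ones(1))  # learnable temperature (assumption)

    def forward(self, x, f_lu):
        # x:    feature map,        shape (B, C, H, W)
        # f_lu: illumination guide, shape (B, C, H, W)
        b, c, h, w = x.shape
        tokens = x.flatten(2).transpose(1, 2)            # (B, H*W, C)
        guide = f_lu.flatten(2).transpose(1, 2)          # (B, H*W, C)
        q = self.to_q(tokens).transpose(1, 2)            # (B, C, H*W)
        k = self.to_k(tokens).transpose(1, 2)            # (B, C, H*W)
        v = (self.to_v(tokens) * guide).transpose(1, 2)  # illumination-modulated values
        attn = (q @ k.transpose(1, 2)) * self.scale      # (B, C, C) channel attention
        attn = attn.softmax(dim=-1)
        out = attn @ v                                   # (B, C, H*W)
        return out.view(b, c, h, w)
```

With this token layout the attention cost scales as O(C²·HW) instead of O((HW)²·C), which matches the linear-in-spatial-size behavior attributed to IG-MSA above.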

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces Retinexformer, a novel Transformer-based method that significantly advances low-light image enhancement. At its core is the One-stage Retinex-based Framework (ORF), which innovatively revises the traditional Retinex theory to explicitly model corruptions (noise, artifacts, exposure issues) inherent in low-light conditions and those introduced during the brightening process. This framework streamlines the enhancement process into a single, end-to-end trainable pipeline, overcoming the complexities of multi-stage training common in prior Retinex-based deep learning approaches.

A key technical contribution is the Illumination-Guided Transformer (IGT), which incorporates a novel Illumination-Guided Multi-head Self-Attention (IG-MSA) mechanism. IG-MSA leverages illumination representations to direct the modeling of long-range dependencies, enabling effective interactions between regions with varying lighting conditions. Crucially, its design reduces the computational complexity from quadratic to linear with respect to spatial size, making Transformers viable for high-resolution image enhancement without prohibitive costs.

Extensive experiments on thirteen diverse datasets demonstrate Retinexformer's state-of-the-art performance, achieving significant improvements (e.g., over 6 dB PSNR on SID and SDSD datasets) over existing methods, including other Transformer-based and CNN-based approaches, often with greater efficiency. Furthermore, a user study confirms its superior subjective visual quality, and its successful application in low-light object detection showcases its practical value for improving downstream high-level vision tasks.

7.2. Limitations & Future Work

The paper does not explicitly list limitations of Retinexformer or outline method-specific future work in its conclusion. Implicitly, however, the limitations of previous methods that it addresses indicate where its contributions lie.

  • Addressing Corruptions: The paper highlights that previous Retinex models didn't consider corruptions hidden in the dark or introduced by the light-up process. Retinexformer's ORF explicitly models and restores these. A potential area for future work, though not stated, might be to explore even more granular or specific types of corruptions (e.g., specific sensor noise patterns) or more adaptive corruption modeling.
  • Multi-stage Training: The tedious multi-stage training pipeline of prior Retinex-based deep learning methods is a key limitation addressed by ORF's one-stage design.
  • Long-range Dependencies and Computational Cost: The inability of CNNs to capture long-range dependencies and the quadratic computational complexity of vanilla Transformers for high-resolution images were significant limitations. Retinexformer addresses these with IGT and IG-MSA. Future work could involve exploring further optimizations for Transformer architectures or more dynamic illumination guidance mechanisms.

7.3. Personal Insights & Critique

Retinexformer presents a highly compelling and well-executed solution to low-light image enhancement, offering several valuable insights:

  1. Principled Integration of Theory and Deep Learning: The work effectively bridges traditional image processing theory (Retinex) with modern deep learning (Transformers). By revising the Retinex model to explicitly account for corruptions, the authors imbue the deep learning framework with a more robust and interpretable physical basis. This "principled" approach, rather than a purely data-driven black-box mapping, likely contributes to its strong generalization and ability to handle diverse corruptions.

  2. Smart Solution to Transformer's Achilles' Heel: The Illumination-Guided Multi-head Self-Attention (IG-MSA) is a particularly ingenious contribution. The quadratic complexity of global self-attention has been a major bottleneck for applying Transformers to dense prediction tasks on high-resolution images. By designing an attention mechanism that is linear in spatial size and, more importantly, uses the illumination context to guide the attention, the paper successfully harnesses the power of Transformers for long-range dependency modeling without incurring prohibitive costs. This sets a precedent for how Transformers can be made more efficient and domain-aware for low-level vision.

  3. Holistic Approach to Enhancement: The ORF's two-stage process (light-up then restore corruption) is intuitive and effective. It recognizes that brightening is only part of the solution; the subsequent corruption restoration is equally critical. The user study and object detection results highlight that Retinexformer achieves not just high PSNR/SSIM but also superior perceptual quality and utility for downstream tasks, which is the ultimate goal of image enhancement.

Potential Issues or Areas for Improvement:

  • Interpretability of Illumination Guidance: While the paper states that $\mathbf{F}_{lu}$ directs the modeling, a deeper dive into how the illumination representation specifically impacts the attention weights or value modulation in IG-MSA could be beneficial. Visualizations of attention maps conditioned on different illumination levels might offer more insights.

  • Generalization to Extreme Conditions: While tested on many datasets, the paper could further discuss its performance under extremely challenging conditions (e.g., highly saturated noise, complex mixed lighting scenarios, severe motion blur in low light) or specific failure modes.

  • Computational Overhead for Illumination Estimation: While IG-MSA is efficient, the depth-wise separable conv5x5 and subsequent conv1x1 in the illumination estimator contribute to the overall complexity. A deeper analysis of this initial estimation's impact on the total runtime and whether it could be further optimized might be insightful.

  • Hyperparameter Sensitivity: The paper mentions $\alpha_i$ as a learnable parameter. An ablation study or sensitivity analysis of this parameter, or other key hyperparameters, could provide more practical guidance for implementation and fine-tuning.

  • Positional Encoding Details: The positional encoding $\mathbf{P}$ is mentioned as "learnable parameters". More details on its structure or how it is learned could be helpful, especially for beginners trying to implement the model.
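
For readers wondering what a learnable positional encoding usually looks like in practice, a common generic pattern is a trainable tensor added to the token embeddings. The sketch below illustrates that pattern only; it is not a description of how the paper defines $\mathbf{P}$.

```python
import torch
import torch.nn as nn

class LearnablePositionalEncoding(nn.Module):
    """Generic learnable positional encoding: one trainable vector per token,
    added to the input embeddings and updated by backpropagation."""

    def __init__(self, num_tokens: int, dim: int):
        super().__init__()
        self.pos = nn.Parameter(torch.zeros(1, num_tokens, dim))
        nn.init.trunc_normal_(self.pos, std=0.02)  # small random init, a common choice

    def forward(self, tokens):
        # tokens: (B, num_tokens, dim)
        return tokens + self.pos
```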

    Overall, Retinexformer represents a significant step forward in low-light image enhancement, demonstrating that careful theoretical re-evaluation combined with innovative architectural design can yield highly effective and efficient deep learning models. Its methods, particularly the IG-MSA and the principled ORF, could be transferable to other image restoration or image-to-image translation tasks where contextual guidance and efficient long-range dependency modeling are crucial.
