UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
TL;DR Summary
UltraFusion is the first exposure fusion technique that merges images with a 9-stop exposure difference, modeling fusion as a guided inpainting problem. It uses the under-exposed image to fill in over-exposed highlights and produces natural tone mapping, outperforming existing methods.
Abstract
Capturing high dynamic range (HDR) scenes is one of the most important issues in camera design. The majority of cameras use exposure fusion, which fuses images captured at different exposure levels, to increase dynamic range. However, this approach can only handle images with a limited exposure difference, normally 3-4 stops. When applied to very high dynamic range scenes where a large exposure difference is required, this approach often fails due to incorrect alignment, inconsistent lighting between inputs, or tone mapping artifacts. In this work, we propose UltraFusion, the first exposure fusion technique that can merge inputs with 9-stop differences. The key idea is that we model exposure fusion as a guided inpainting problem, where the under-exposed image is used as guidance to fill in the missing information in the over-exposed regions. Using the under-exposed image as soft guidance, instead of a hard constraint, our model is robust to potential alignment issues or lighting variations. Moreover, by utilizing the image prior of the generative model, our model also generates natural tone mapping, even for very high dynamic range scenes. Our approach outperforms HDR-Transformer on the latest HDR benchmarks. Moreover, to test its performance in ultra high dynamic range scenes, we capture a new real-world exposure fusion benchmark, the UltraFusion dataset, with exposure differences of up to 9 stops, and experiments show that UltraFusion can generate beautiful and high-quality fusion results under various scenarios. Code and data will be available at https://openimaginglab.github.io/UltraFusion.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
UltraFusion: Ultra High Dynamic Imaging using Exposure Fusion
1.2. Authors
Zixuan Chen (Zhejiang University, Shanghai AI Laboratory), Yujin Wang (Shanghai AI Laboratory), Xin Cai (The Chinese University of Hong Kong), Zhiyuan You (The Chinese University of Hong Kong), Zheming Lu (Zhejiang University), Fan Zhang (Shanghai AI Laboratory), Shi Guo (Shanghai AI Laboratory), and Tianfan Xue (The Chinese University of Hong Kong, Shanghai AI Laboratory).
1.3. Journal/Conference
This paper was published on arXiv on January 20, 2025. Given the affiliations and the quality of the work, it is part of the research output from prominent institutions like the Shanghai AI Laboratory and CUHK, which are leading centers for computer vision research.
1.4. Publication Year
2025
1.5. Abstract
Capturing high dynamic range (HDR) scenes is a critical challenge in camera design. Most cameras use exposure fusion to merge images of different exposure levels. However, standard techniques typically only handle differences of 3-4 stops. In very high dynamic range scenes, these methods fail due to misalignment, lighting inconsistencies, or tone mapping artifacts. This paper introduces UltraFusion, the first technique capable of merging images with a 9-stop exposure difference. By modeling fusion as a guided inpainting problem, the model uses an under-exposed image as soft guidance to fill in highlights in an over-exposed image. This approach is robust to alignment issues and generates natural tone mapping using generative image priors. The authors also contribute the UltraFusion dataset, a real-world benchmark with up to 9-stop differences.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2501.11515
- PDF Link: https://arxiv.org/pdf/2501.11515v4.pdf
2. Executive Summary
2.1. Background & Motivation
The dynamic range of a scene—the ratio between the brightest highlights and the darkest shadows—often exceeds the capabilities of modern camera sensors. To capture such scenes, cameras take multiple photos at different exposure levels.
- The Problem: Traditional exposure fusion works well for small exposure differences (3-4 stops). However, for "Ultra" high dynamic range scenes (e.g., a dark room with a view of the sun through a window), a difference of 8-9 stops is needed.
- Challenges:
  - Alignment: Moving objects cause "ghosting" artifacts because they are in different positions in each frame.
  - Lighting Variation: Objects look different at different exposure levels, making simple merging look unnatural.
  - Tone Mapping: High dynamic range images must be compressed to be viewed on standard screens, often leading to loss of contrast or "halo" effects.
2.2. Main Contributions / Findings
- UltraFusion Model: A novel framework that treats exposure fusion as guided inpainting. It uses a diffusion model to "fill in" over-exposed areas using the information from a darkly exposed photo.
- Robustness to Motion: By using the under-exposed image as "soft guidance" rather than a strict requirement, the model can ignore misaligned pixels and generate realistic textures.
- Decompose-and-Fuse Branch: A specialized neural network structure that separates color and structural information from the under-exposed image to guide the generation process more effectively.
- Fidelity Control Branch: An additional module designed to ensure the final output remains faithful to the original scene's colors and textures, preventing the AI from "hallucinating" fake details.
- UltraFusion Dataset: A new benchmark of 100 real-world image pairs with extreme exposure differences (up to 9 stops) to test the limits of HDR algorithms.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand this paper, several core concepts in imaging and AI are required:
- Stops and Exposure Value (EV): In photography, a "stop" represents a doubling or halving of the amount of light. An 8-stop difference means one image is 2^8 = 256 times brighter than the other (see the sketch after this list).
- High Dynamic Range (HDR) Imaging: The process of capturing and displaying scenes with a high range of luminosity.
- Multi-Exposure Fusion (MEF): A technique that directly blends multiple Low Dynamic Range (LDR) images into a single high-quality LDR image, skipping the intermediate step of reconstructing a linear HDR radiance map and tone mapping it.
- Diffusion Models: A type of generative AI (like Stable Diffusion) that learns to create images by reversing a process of adding noise. They are excellent at creating realistic textures.
- Image Inpainting: The process of reconstructing lost or deteriorated parts of images. In this paper, over-exposed (white) pixels are treated as "deteriorated" areas to be filled.
- Optical Flow: A technique to track how pixels move between two frames. The paper uses RAFT (Recurrent All-Pairs Field Transforms) for this.
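To make the stop arithmetic concrete, here is a minimal sketch (the helper `stops_to_ratio` is ours, not from the paper) that converts an exposure-stop difference into a linear brightness ratio:

```python
def stops_to_ratio(stops: float) -> float:
    """Each stop doubles the amount of light, so the linear ratio is 2**stops."""
    return 2.0 ** stops

print(stops_to_ratio(3))  # 8.0   -> the regime classic exposure fusion handles well
print(stops_to_ratio(9))  # 512.0 -> the "ultra" regime UltraFusion targets
```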
3.2. Previous Works
- Traditional HDR (e.g., HDR+): Used by Google and others, these typically handle 3 stops. They rely on rigid alignment, which fails when objects move quickly or exposures differ too much.
- ControlNet: A popular framework for adding conditions (like edges or poses) to Stable Diffusion. The authors find that standard ControlNet fails here because it doesn't know which image to prioritize as the "reference," leading to artifacts.
- Tone Mapping Operators (TMOs): Algorithms that squash HDR data into LDR. Most TMOs are mathematical filters that can sometimes make images look "flat" or "cartoonish."
3.3. Technological Evolution
HDR research has moved from simple pixel-averaging to flow-based alignment, then to CNN (Convolutional Neural Network) based fusion, and finally to Transformers (like HDR-Transformer). UltraFusion represents the next step: using Generative AI (Diffusion) to solve the "impossible" cases where data is missing or pixels are completely misaligned.
4. Methodology
4.1. Principles
The core intuition is that the over-exposed image is perfect in the shadows but "broken" (pure white) in the highlights, while the under-exposed image is perfect in the highlights but "broken" (pure black) in the shadows. UltraFusion treats the saturated white areas of the over-exposed image as a hole to be filled, using the under-exposed image as a "hint," or guidance, for what should be inside that hole (a toy mask computation is sketched below).
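As an illustration of the "hole" idea (not the paper's exact masking rule, which is not detailed in this summary), a minimal sketch that marks near-saturated pixels of the over-exposed frame as the region to inpaint:

```python
import numpy as np

def overexposure_mask(img_oe: np.ndarray, thresh: float = 0.95) -> np.ndarray:
    """Mark pixels whose maximum channel is near saturation.

    img_oe: over-exposed image, float array in [0, 1], shape (H, W, 3).
    Returns a binary mask (H, W): 1 = "hole" to be filled with guidance
    from the under-exposed frame, 0 = trustworthy content.
    """
    return (img_oe.max(axis=-1) >= thresh).astype(np.float32)

# Example: a synthetic frame with a blown-out top half.
frame = np.concatenate([np.ones((4, 8, 3)), 0.3 * np.ones((4, 8, 3))], axis=0)
print(overexposure_mask(frame).mean())  # 0.5 -> half the pixels are "holes"
```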
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Stage 1: Pre-alignment
Before the AI can merge the images, it needs a rough idea of how they overlap.
- Intensity Mapping: Since one image is dark and the other is bright, standard motion tracking fails. The system first brings the two inputs to a comparable brightness using an intensity mapping function.
- Flow Estimation: It uses the RAFT network to estimate the bidirectional flows $F_{o \to u}$ (how pixels move from the bright image to the dark image) and $F_{u \to o}$ (dark to bright).
- Backward Warping: The under-exposed image is "warped" to match the position of the over-exposed image.
- Consistency Check: To avoid "ghosting" in areas where objects moved or were hidden (occluded), a mask is created by checking the consistency of the bidirectional flows. The final output of this stage is the pre-aligned image $\hat{I}_u = \mathcal{W}(I_u, F_{o \to u})$, together with the mask $M$, where:
  - $\mathcal{W}$: The warping function.
  - $M$: The occlusion mask (1 where pixels are missing/hidden, 0 otherwise).
  - $F_{o \to u}$: The motion path from the bright image to the dark image.

A minimal sketch of this stage is given after this list.
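The following sketch covers the warping and consistency-check steps, assuming the two flow fields come from an external estimator such as RAFT run on intensity-matched inputs; the helpers `backward_warp` and `occlusion_mask` are our illustration, not the authors' code:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def backward_warp(img_u: np.ndarray, flow_o2u: np.ndarray) -> np.ndarray:
    """Warp the under-exposed image I_u into the over-exposed frame's geometry.

    img_u:    (H, W, 3) under-exposed image.
    flow_o2u: (H, W, 2) flow from the over-exposed to the under-exposed frame,
              i.e. for each target pixel, where to sample in I_u (dx, dy).
    """
    H, W = flow_o2u.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    sample_x = xs + flow_o2u[..., 0]
    sample_y = ys + flow_o2u[..., 1]
    return np.stack(
        [map_coordinates(img_u[..., c], [sample_y, sample_x], order=1, mode="nearest")
         for c in range(img_u.shape[-1])], axis=-1)

def occlusion_mask(flow_o2u: np.ndarray, flow_u2o: np.ndarray, tol: float = 1.0) -> np.ndarray:
    """Forward-backward consistency check: 1 = occluded/unreliable, 0 = reliable."""
    H, W = flow_o2u.shape[:2]
    ys, xs = np.mgrid[0:H, 0:W].astype(np.float32)
    sx, sy = xs + flow_o2u[..., 0], ys + flow_o2u[..., 1]
    # Sample the reverse flow at the forward-mapped locations.
    back = np.stack(
        [map_coordinates(flow_u2o[..., c], [sy, sx], order=1, mode="nearest")
         for c in range(2)], axis=-1)
    # If the flows are consistent, forward + backward displacement cancels out.
    err = np.linalg.norm(flow_o2u + back, axis=-1)
    return (err > tol).astype(np.float32)
```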
The following figure (Figure 3 from the paper) illustrates the two-stage pipeline:
This figure is a schematic of the two-stage UltraFusion pipeline: the left part shows the pre-alignment stage with flow estimation between the over-exposed and under-exposed images, while the right part depicts the encoding/decoding process of the generative model and how the control branches are combined to reconstruct a high-quality image.
4.2.2. Stage 2: Guided Inpainting
This stage uses a modified Stable Diffusion model. It takes three inputs: the over-exposed image $I_o$, the aligned dark guidance $\hat{I}_u$, and the current noise latent $z_t$.
A. Decompose-and-Fuse Control Branch (DFCB)
The authors found that dark images are too dim for the AI to "see" clearly. To fix this, they decompose the dark image into Structure and Color.
- Structure Extraction ($S$): They normalize the luminance channel to remove brightness bias:
  $$S = \frac{Y - \mu_Y}{\sigma_Y}$$
  - $Y$: The brightness (luminance) channel.
  - $\mu_Y$: The average intensity.
  - $\sigma_Y$: The standard deviation (contrast).
- Color Extraction ($C$): They use the chroma (UV) channels.
- Multi-scale Cross Attention: These features are fused into the main network. Instead of just adding them, the model uses attention to decide which parts of the dark image are actually useful for the current "hole" it is filling. The cross-attention output is calculated as:
  $$F_{out} = F + \lambda \cdot \phi\!\left(\mathrm{Softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right)V\right)$$
  - $Q, K, V$: The Query (from the main-branch features $F$), Key, and Value (from the guidance features) matrices common in Transformer architectures.
  - $\lambda$: A learnable scaling factor to control the strength of the guidance.
  - $\phi$: A small neural layer to adjust dimensions.

A hedged sketch of this decomposition and gated cross-attention is given below.
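The following is a minimal PyTorch-style sketch of the decomposition and the gated cross-attention under the notation above. It is our paraphrase of the mechanism, not the released implementation; the layer sizes and YUV conversion weights are illustrative assumptions.

```python
import torch
import torch.nn as nn

def decompose(img_u: torch.Tensor, eps: float = 1e-6):
    """Split an RGB guidance image (B, 3, H, W) in [0, 1] into a
    brightness-normalized structure map S and chroma channels C."""
    r, g, b = img_u[:, 0:1], img_u[:, 1:2], img_u[:, 2:3]
    y = 0.299 * r + 0.587 * g + 0.114 * b          # luminance (BT.601 weights)
    u = -0.147 * r - 0.289 * g + 0.436 * b          # chroma U
    v = 0.615 * r - 0.515 * g - 0.100 * b           # chroma V
    s = (y - y.mean(dim=(2, 3), keepdim=True)) / (y.std(dim=(2, 3), keepdim=True) + eps)
    return s, torch.cat([u, v], dim=1)

class GatedCrossAttention(nn.Module):
    """F_out = F + lambda * phi(Attention(Q from F, K/V from guidance))."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.phi = nn.Linear(dim, dim)              # small layer to adjust dimensions
        self.scale = nn.Parameter(torch.zeros(1))   # learnable lambda

    def forward(self, feat: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # feat, guide: (B, N, dim) token sequences from the main branch and the extractor.
        out, _ = self.attn(query=feat, key=guide, value=guide)
        return feat + self.scale * self.phi(out)
```

Initializing the learnable scale at zero is a common trick so the guidance is blended in gradually during training rather than disrupting the pretrained diffusion features.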
The DFCB architecture is shown below:
This figure is a schematic of the detailed architecture of the proposed Decompose-and-Fuse Control Branch, showing how guidance is extracted from the differently exposed images and fused, including the main extractor and the cross-attention modules.
B. Fidelity Control Branch (FCB)
Diffusion models are known to "dream up" textures that weren't there. To keep the image realistic, the FCB acts as a shortcut. It takes the original image and sends high-frequency details directly to the final VAE Decoder (the part of the AI that turns math back into a visible image), ensuring the output isn't too "smooth" or "fake."
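One way to picture the kind of signal the FCB routes around the latent bottleneck is a simple high-frequency residual; this is only a conceptual sketch under our own assumptions, not the branch's actual architecture:

```python
import torch
import torch.nn.functional as F

def high_frequency_residual(img: torch.Tensor, kernel: int = 5) -> torch.Tensor:
    """Crude stand-in for the detail signal passed to the decoder:
    the difference between the image and a blurred copy of itself.

    img: (B, 3, H, W) in [0, 1].
    """
    pad = kernel // 2
    blurred = F.avg_pool2d(F.pad(img, (pad,) * 4, mode="reflect"), kernel, stride=1)
    return img - blurred
```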
4.3. Training Data Synthesis
Since there aren't many "Ultra HDR" videos with ground truth, the authors invented a way to synthesize data:
- Take a standard video (e.g., Vimeo-90K).
- Select two frames far apart to simulate "large motion."
- Take a high-quality static HDR pair from the SICE dataset.
- Apply the motion from the video to the HDR pair to create a "fake" dynamic HDR scene (see the sketch after this list).
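A rough sketch of this synthesis recipe, assuming the motion is transferred by computing optical flow between the two video frames and warping one image of the static exposure pair; the callables `estimate_flow` and `warp`, and the helper name, are ours:

```python
def synthesize_dynamic_pair(hdr_pair, video_frame_a, video_frame_b, estimate_flow, warp):
    """Turn a static exposure pair into a pseudo-dynamic training sample.

    hdr_pair:       (img_under, img_over) static, perfectly aligned exposures (e.g. SICE).
    video_frame_*:  two distant frames of an ordinary video (e.g. Vimeo-90K).
    estimate_flow:  callable(frame_a, frame_b) -> (H, W, 2) flow field.
    warp:           callable(image, flow) -> warped image (backward warping).
    """
    img_under, img_over = hdr_pair
    # Motion that really happened in the video ...
    flow = estimate_flow(video_frame_a, video_frame_b)
    # ... is transplanted onto one exposure, so the pair is now misaligned
    # the way a handheld burst with moving subjects would be.
    img_under_moved = warp(img_under, flow)
    return img_under_moved, img_over  # misaligned inputs; the static pair can serve as reference
```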
5. Experimental Setup
5.1. Datasets
- SICE: A static dataset used for training.
- MEFB: A benchmark for static exposure fusion.
- RealHDRV: A dynamic dataset containing 50 scenes with motion.
- UltraFusion Benchmark (New): 100 pairs captured by the authors using DSLR (Canon R8) and various phones (iPhone 13, Redmi, etc.). It features differences up to 9 stops.
5.2. Evaluation Metrics
The paper uses several metrics to measure quality without needing a "perfect" reference image:
- MUSIQ / DeQA-Score / PAQ2PIQ / HyperIQA: These are "no-reference" image quality assessors. They use deep learning to score how "good" or "natural" an image looks to a human (a small scoring sketch follows this list).
- MEF-SSIM (Multi-Exposure Fusion Structural Similarity): Quantifies how much of the structural information from the input images is preserved in the fused result.
- TMQI (Tone Mapped Quality Index): Specifically designed for HDR. It combines structural fidelity and statistical naturalness.
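For example, the no-reference scores could be computed with the open-source pyiqa (IQA-PyTorch) toolbox; whether the authors used this exact toolbox and these exact metric weights is an assumption on our part:

```python
import pyiqa
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# No-reference assessors appearing in the paper's tables (names follow pyiqa's model zoo).
metrics = {name: pyiqa.create_metric(name, device=device)
           for name in ("musiq", "paq2piq", "hyperiqa")}

fused = torch.rand(1, 3, 512, 512, device=device)  # stand-in for a fused result in [0, 1]
for name, metric in metrics.items():
    print(name, float(metric(fused)))  # higher is better for all three
```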
5.3. Baselines
The method is compared against:
- HDR Reconstruction: HDR-Transformer, SCTNet, SAFNet.
- Exposure Fusion: Deepfuse, MEF-GAN, U2Fusion, HSDS-MEF.
6. Results & Analysis
6.1. Core Results Analysis
UltraFusion consistently outperforms all competitors, especially in dynamic scenes and extreme exposure differences.
6.1.1. Performance on Static Data (MEFB)
The following are the results from Table 1 of the original paper; all metrics are computed on the MEFB [69] benchmark (higher is better for every column):

| Type | Method | MUSIQ↑ | DeQA-Score↑ | PAQ2PIQ↑ | HyperIQA↑ | MEF-SSIM↑ |
|---|---|---|---|---|---|---|
| HDR Rec. | HDR-Transformer | 63.10 | 2.983 | 71.36 | 0.5996 | 0.8626 |
| HDR Rec. | SCTNet | 63.13 | 3.021 | 71.48 | 0.6068 | 0.8777 |
| MEF | U2Fusion | 63.39 | 3.219 | 72.23 | 0.5159 | 0.9304 |
| MEF | MEFLUT | 65.71 | 3.277 | 71.21 | 0.5267 | 0.8608 |
| MEF | HSDS-MEF | 66.76 | 3.544 | 72.60 | 0.6026 | 0.9520 |
| MEF | TC-MoA | 64.60 | 3.355 | 71.85 | 0.5394 | 0.9636 |
| Ours | UltraFusion | 68.82 | 3.881 | 73.80 | 0.6482 | 0.9385 |
6.1.2. Performance on Dynamic and Ultra HDR Data
The following are the results from Table 2 of the original paper. The first five metric columns (TMQI, MUSIQ, DeQA, PAQ, Hyp) are measured on RealHDRV [42]; the last four (MUSIQ, DeQA, PAQ, Hyp) are measured on the UltraFusion Benchmark.

| Type | Method | TMQI↑ | MUSIQ↑ | DeQA↑ | PAQ↑ | Hyp↑ | MUSIQ↑ | DeQA↑ | PAQ↑ | Hyp↑ |
|---|---|---|---|---|---|---|---|---|---|---|
| HDR Rec. | HDR-Transformer | 0.8680 | 62.24 | 3.496 | 70.33 | 0.5225 | 63.66 | 2.909 | 72.83 | 0.5619 |
| HDR Rec. | SCTNet | 0.8715 | 62.69 | 3.532 | 70.74 | 0.5272 | 61.84 | 3.102 | 72.94 | 0.5888 |
| HDR Rec. | SAFNet | 0.8726 | 62.07 | 3.506 | 70.48 | 0.5156 | 61.50 | 2.179 | 73.15 | 0.5487 |
| MEF | Defusion | 0.8187 | 56.60 | 3.302 | 68.38 | 0.4856 | 60.31 | 3.352 | 71.87 | 0.5463 |
| MEF | MEFLUT | 0.8297 | 62.42 | 3.315 | 70.04 | 0.5020 | 63.62 | 3.343 | 71.73 | 0.5074 |
| MEF | HSDS-MEF | 0.8323 | 61.76 | 3.360 | 71.11 | 0.5054 | 64.54 | 3.627 | 73.42 | 0.5923 |
| Ours | UltraFusion | 0.8925 | 67.51 | 3.830 | 73.40 | 0.5833 | 68.41 | 3.957 | 75.18 | 0.6214 |
6.2. Ablation Studies
The authors proved that every part of the model is necessary:
- Without Alignment: The model produces blurry results in motion areas.
- Without DFCB (Structure/Color Branch): The model ignores the dark image hint and "invents" its own highlights, which may not match reality.
- Without FCB (Fidelity Branch): The image loses sharp textures and looks "AI-generated."
7. Conclusion & Reflections
7.1. Conclusion Summary
UltraFusion successfully breaks the 4-stop barrier of traditional exposure fusion, reaching up to 9 stops. By using a diffusion-based guided inpainting approach, it creates images that are not only high-contrast but also look natural and are free of motion artifacts. The introduction of the UltraFusion dataset provides a much-needed benchmark for extreme dynamic range imaging.
7.2. Limitations & Future Work
- Speed: It takes about 3.3 seconds to process an image on a high-end NVIDIA RTX 4090 GPU, which is too slow for real-time smartphone use.
- Single Image Fallback: If the occlusion (hidden area) is too large, the model relies entirely on AI "imagination" (diffusion prior), which can be unreliable for scientific or forensic purposes.
- Future Work: The authors suggest looking for faster implementations and more robust motion estimation algorithms.
7.3. Personal Insights & Critique
This paper is a brilliant application of Generative AI to a "hardware" problem. Instead of trying to build a better sensor, the authors are building a "smarter" brain for the camera.
- Innovation: The move from "merging" to "inpainting" is a paradigm shift. Inpainting acknowledges that some data is truly lost and needs to be reconstructed using intelligent priors.
- Potential Issue: The 3.3-second processing time is a significant hurdle. While the results are beautiful, the practical application in consumer electronics is currently limited to "post-processing" apps rather than the native camera shutter.
- Aesthetic vs. Accurate: Because it uses a diffusion model, there is always a risk that the sun or a light bulb in the fused image is a "perfect AI version" of a sun, not the actual sun that was there. For photography, this is great; for scientific data, it might be problematic.