WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering
TL;DR Summary
WeatherDiffusion is a novel framework for forward and inverse rendering in autonomous driving, addressing challenges under complex weather and lighting. It introduces an intrinsic map-aware attention mechanism for accurate estimation of material properties, scene geometry, and lighting, supports text-guided weather and illumination editing, and contributes two new datasets (WeatherSynthetic and WeatherReal) for AD scenes under diverse conditions.
Abstract
Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (\ie WeatherSynthetic) and a real-world dataset (\ie WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.
In-depth Reading
1. Bibliographic Information
1.1. Title
The central topic of the paper is "WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering." It focuses on developing a diffusion-based framework for rendering and reconstructing autonomous driving (AD) scenes under diverse weather and lighting conditions.
1.2. Authors
The authors are YIXIN ZHU (Nanjing University, China), ZUOLIANG ZHU (Nankai University, China), MILOŠ HAŠAN (Adobe Research / NVIDIA Research, USA), JIAN YANG (Nanjing University, China), JIN XIE (Nanjing University, China), and BEIBEI WANG (Nanjing University, China). Their affiliations suggest a strong background in computer vision, rendering, and potentially autonomous driving research, with contributions from both academia and industry.
1.3. Journal/Conference
The paper is published as a preprint on arXiv, indicated by the original source link https://arxiv.org/abs/2508.06982v1 and PDF link https://arxiv.org/pdf/2508.06982v1.pdf. The listed timestamp, "Published at (UTC): 2025-08-09T13:29:39.000Z", corresponds to the arXiv posting time. While arXiv is a highly influential platform for rapid dissemination of research, it is not a peer-reviewed journal or conference in itself. Papers on arXiv are often submitted to top-tier computer vision conferences (e.g., CVPR, ICCV, ECCV) or journals (e.g., TPAMI) for formal peer review and publication. Given the cutting-edge nature of diffusion models and rendering, such a paper would typically target a highly reputable venue in computer vision or graphics.
1.4. Publication Year
The paper's listed publication time (UTC) is 2025-08-09T13:29:39.000Z.
1.5. Abstract
The paper introduces WeatherDiffusion, a novel diffusion-based framework designed for forward rendering (generating images from scene properties) and inverse rendering (recovering scene properties from images) in autonomous driving (AD) environments. It specifically addresses the significant challenges posed by complex weather and illumination conditions, which traditional methods and existing diffusion models struggle with due to issues like control and robustness.
WeatherDiffusion is capable of authentically estimating material properties, scene geometry, and lighting. A key feature is its support for controllable weather and illumination editing, achieved by guiding predicted intrinsic maps with text descriptions. The authors propose an Intrinsic map-aware attention (MAA) mechanism, inspired by the observation that different intrinsic maps (e.g., metallicity, normal) correspond to different image regions, to enhance the quality of inverse rendering. Furthermore, the paper introduces two new datasets: WeatherSynthetic (synthetic data) and WeatherReal (real-world data), specifically curated for AD scenes with diverse weather and lighting conditions. Extensive experiments demonstrate that WeatherDiffusion surpasses state-of-the-art methods on various benchmarks. Its practical value is highlighted in downstream AD tasks, where it improves the robustness of object detection and image segmentation in challenging weather scenarios by providing clearer visual inputs.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2508.06982v1
PDF Link: https://arxiv.org/pdf/2508.06982v1.pdf
Publication Status: This paper is a preprint available on arXiv, indicating that it has not yet undergone formal peer review or official publication in a journal or conference proceedings.
2. Executive Summary
2.1. Background & Motivation
The paper addresses the crucial problems of forward rendering (FR) and inverse rendering (IR) in the context of autonomous driving (AD).
- Core Problem: Both FR and IR, while fundamental for scene understanding and reconstruction in AD, face significant hurdles when confronted with complex weather and illumination conditions.
- Importance:
- Forward Rendering (FR): Essential for generating photo-realistic images under varied conditions, which helps train and gain comprehensive knowledge for learning-based AD models.
- Inverse Rendering (IR): Aims to recover fundamental scene properties like geometry, material, and lighting from observed images. This enables critical AD applications such as material editing, relighting, and augmented reality, which demand controllable and flexible scene manipulation.
- Challenges/Gaps in Prior Research:
- Traditional Methods (FR): Relied on highly detailed inputs (geometry, material, lighting) that are notoriously difficult to acquire in dynamic real-world AD environments.
- Inverse Rendering (IR) as an Ill-Posed Problem: Without strong prior knowledge, multiple scene decompositions can explain the same observed image, making accurate and reasonable recovery of intrinsic properties extremely challenging.
- Limitations of Existing Diffusion Models: While large diffusion models have shown promise by learning from 2D priors, they often lack sufficient control and robustness, especially when faced with the complexities of real-world scenarios.
- AD-Specific Challenges:
- Complex Weather and Illumination: Rain, fog, and snow drastically alter lighting, obscure geometry, influence surface characteristics (specular reflections), reduce visibility, and make distant features imperceptible. Existing diffusion-based methods, often trained on indoor or object-level scenes, struggle with these dynamic conditions (as shown in Figure 2).
- Large Scale and Variance: AD scenes are vast and exhibit much larger variations in object scale and scene depth compared to indoor scenes. This challenges both dataset generalization and the model's ability to focus attention on relevant details without wasting capacity.
- Lack of High-Quality Datasets: A significant barrier is the absence of large-scale, high-quality datasets specifically designed for FR and IR in AD scenes with diverse weather and lighting conditions, complete with corresponding intrinsic maps.
- Paper's Entry Point/Innovative Idea: The paper proposes WeatherDiffusion, a specialized diffusion-based framework that leverages and finetunes powerful generative models (like Stable Diffusion 3.5) with novel mechanisms (Intrinsic map-aware attention) and custom datasets to specifically tackle the weather-induced complexities of FR and IR in AD environments. It introduces a weather-guided approach to improve robustness and control.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of forward and inverse rendering for autonomous driving:
- WeatherDiffusion Framework: They introduce WeatherDiffusion, a novel diffusion-based framework capable of inverse rendering (decomposing images into intrinsic maps under various weather conditions) and forward rendering (synthesizing images under different lighting or weather conditions based on text prompts and intrinsic maps). This framework provides authentic estimation of material properties, scene geometry, and lighting, and supports controllable weather and illumination editing.
- Intrinsic Map-Aware Attention (MAA): They devise MAA, a mechanism that provides customized visual detail guidance for generative models. This module allows the model to selectively focus on semantically important local regions of an image, which is crucial because different intrinsic maps naturally correspond to distinct visual features (e.g., metallicity for metallic objects, normal for surface orientation). This guidance helps the model decompose more reasonable and higher-quality intrinsic maps.
- Novel Datasets: They construct two new, large-scale datasets tailored for autonomous driving scenarios:
  - WeatherSynthetic: A synthetic dataset offering a wide range of scene types and weather conditions, complete with corresponding intrinsic maps, to address the lack of high-quality data.
  - WeatherReal: A real-world dataset for inverse rendering on AD scenes, created by applying weather augmentation to existing open-source datasets (like Waymo and KITTI) and generating pseudo ground truth.
- Superior Performance and Downstream Value: Extensive experiments demonstrate that WeatherDiffusion significantly outperforms state-of-the-art methods on several benchmarks for both synthetic and real-world data. Furthermore, the method proves valuable for downstream AD tasks, enhancing the robustness of object detection and image segmentation in challenging weather by providing clearer, weather-corrected visual inputs.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand WeatherDiffusion, a grasp of several foundational concepts in computer graphics, computer vision, and machine learning is essential:
- Forward Rendering (FR):
- Conceptual Definition: The process of generating a 2D image from a 3D scene description. It takes as input scene geometry (shapes, positions), material properties (how surfaces reflect light), lighting conditions (light sources, their intensity, color, direction), and camera parameters (position, orientation, focal length). It then simulates how light interacts with the scene and produces a pixel-by-pixel image.
- Traditional Methods: These often involve solving the rendering equation using techniques like rasterization (projecting 3D objects onto a 2D screen and filling pixels) or ray tracing (simulating light rays from the camera into the scene to determine pixel colors). These methods require precise knowledge of the 3D scene, which is hard to obtain in real-world scenarios.
- Inverse Rendering (IR):
  - Conceptual Definition: The inverse problem of FR. Given one or more 2D images of a scene, the goal is to recover the underlying 3D scene properties that generated those images. These properties typically include geometry (3D shape), material (e.g., albedo, roughness, metallicity), and lighting (illumination environment).
  - Ill-Posed Problem: IR is inherently ill-posed, meaning there isn't a unique solution: multiple combinations of geometry, material, and lighting could produce the same 2D image. For example, a dark surface under bright light might appear identical to a bright surface under dim light. This necessitates strong priors or additional information.
  - Intrinsic Image Decomposition: A common formulation of IR where an image is decomposed into its constituent physical components, known as intrinsic maps. These maps represent properties like:
    - Albedo map ($\mathbf{a}$): The base color of a surface, independent of lighting. It represents how much light a surface reflects diffusely.
    - Normal map ($\mathbf{n}$): A map storing the direction of the surface normal at each point, indicating the surface's orientation. This defines the local geometry.
    - Roughness map ($\mathbf{r}$): A map indicating the micro-surface detail that affects how light scatters. Rougher surfaces scatter light more diffusely, while smoother surfaces reflect it more specularly.
    - Metallicity map ($\mathbf{m}$): A map indicating how metallic a surface is. Metallic surfaces behave differently from non-metallic (dielectric) ones in terms of light reflection and absorption.
    - Irradiance map ($\mathbf{i}$): A representation of the lighting conditions, often an environment map or a map encoding the incident light at each point. This captures the illumination of the scene.
  - PBR (Physically Based Rendering): A rendering approach that aims to simulate light transport based on real-world physics. Albedo, roughness, and metallicity are key parameters in PBR materials, particularly in the Principled Bidirectional Reflectance Distribution Function (BRDF) [Burley and Studios 2012], which describes how light is reflected from an opaque surface.
- Diffusion Models:
  - Conceptual Definition: A class of generative models that learn to generate data by reversing a gradual noising process. They are inspired by non-equilibrium thermodynamics.
  - Forward Process (Diffusion): An input image is gradually corrupted by adding small amounts of Gaussian noise over several steps, eventually transforming the image into pure Gaussian noise. This process is defined mathematically as a Markov chain.
  - Reverse Process (Denoising): A neural network is trained to learn how to reverse this noising process, i.e., to predict and subtract the noise added at each step, starting from pure noise and gradually recovering a clean image. This effectively learns the data distribution.
  - DDPM (Denoising Diffusion Probabilistic Models) [Ho et al. 2020]: A seminal work that established the basic framework for diffusion models, demonstrating high-quality image generation.
  - Latent Diffusion Models (LDMs) [Rombach et al. 2022]: An advancement where the diffusion process operates in a lower-dimensional latent space rather than directly on pixel space. This makes training and inference much more efficient without significantly compromising quality. Stable Diffusion is a prominent example of an LDM: the input image is first encoded into a latent representation by an encoder, and the diffusion process then happens in this latent space.
  - Rectified Flow Matching [Lipman et al. 2022]: A technique used in diffusion models to directly learn the "velocity field" that transforms noise to data (or vice versa) along a straight path in latent space, simplifying the training objective compared to traditional DDPM. It directly estimates the difference between the noise $\epsilon$ and the original latent $z_0$ given a noisy latent $z_t$.
  - DiT (Diffusion Transformers) [Peebles and Xie 2023]: A Transformer-based architecture used within diffusion models, often as the denoiser network. Transformers are good at capturing long-range dependencies in data, making them suitable for generating high-quality images.
  - Conditioning: Diffusion models can be conditioned on various inputs (e.g., text, images, class labels) to guide the generation process. This allows for controlled image synthesis, such as generating an image "of a cat" from a text prompt.
- Attention Mechanism:
  - Conceptual Definition: A mechanism in neural networks that allows the model to dynamically weigh the importance of different parts of the input data when processing it. It helps the model focus on relevant information.
  - Self-Attention: A variant where the attention mechanism is applied to a single sequence to relate different positions of the sequence to each other.
  - Cross-Attention: A variant where attention is computed between two different sequences (e.g., query from one sequence, key/value from another). This is used in MAA to relate map embeddings to patch tokens.
  - Formula for Scaled Dot-Product Attention: The fundamental attention mechanism, from which cross-attention is derived, is often defined as:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    - $Q$: Query matrix. Represents what we are looking for.
    - $K$: Key matrix. Represents what is available to be looked up.
    - $V$: Value matrix. Contains the actual information to be retrieved.
    - $d_k$: The dimension of the keys, used for scaling to prevent very large dot products that push the softmax into regions with extremely small gradients.
    - $QK^T$: Dot product of queries and keys, measuring similarity.
    - $\mathrm{softmax}$: Normalizes the similarity scores into probability distributions.
    - The output is a weighted sum of the value vectors, where the weights are determined by the attention scores. A minimal sketch follows this list.
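To make the formula concrete, here is a minimal NumPy sketch of scaled dot-product attention; the toy shapes are arbitrary, and the function is an illustration rather than any particular library's implementation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal NumPy sketch of scaled dot-product attention.

    Q: (n_q, d_k) queries, K: (n_k, d_k) keys, V: (n_k, d_v) values.
    Returns an (n_q, d_v) matrix of attended values.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # similarity between queries and keys
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # softmax over the keys
    return weights @ V                                          # weighted sum of values

# Toy usage: one query attending over three key/value pairs.
Q = np.random.randn(1, 8)
K = np.random.randn(3, 8)
V = np.random.randn(3, 16)
out = scaled_dot_product_attention(Q, K, V)                     # shape (1, 16)
```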
3.2. Previous Works
The paper contextualizes its work by discussing advances in generative models and inverse rendering, highlighting the gap it aims to fill.
- Generative Models:
  - VAE (Variational Autoencoders) [Kingma et al. 2013]: Leverage variational inference for probabilistic modeling, enabling explicit likelihood estimation. They often produce blurry samples compared to GANs.
  - GAN (Generative Adversarial Networks) [Goodfellow et al. 2014]: Employ adversarial training (a generator and a discriminator network competing) to produce sharp, high-fidelity outputs. They suffer from potential mode collapse (the generator produces a limited variety of samples) and unstable optimization.
  - Diffusion Models [Esser et al. 2024; Ho et al. 2020; Ramesh et al. 2022; Rombach et al. 2022]: The most recent and powerful class of generative models. They have demonstrated superior capability in generating high-fidelity, diverse, and text-aligned images. Key examples mentioned: DDPM [Ho et al. 2020], Stable Diffusion [Esser et al. 2024; Rombach et al. 2022], and DALL·E [Ramesh et al. 2022]. These models use architectures like UNet [Ronneberger et al. 2015] and DiT [Peebles and Xie 2023] for the denoising process. The paper leverages Stable Diffusion 3.5 as its base model.
- Inverse Rendering (Intrinsic Image Decomposition):
  - Traditional Learning-based Methods [Janner et al. 2017; Yu and Smith 2019; Zhu et al. 2022]: These methods, often trained on large-scale datasets like OpenRooms [Li et al. 2020], Hypersim [Roberts et al. 2021], and InteriorVerse [Zhu et al. 2022], moved beyond simplified Lambertian reflectance assumptions. They can predict PBR (Physically Based Rendering) material properties, geometric structures, and illumination parameters.
- Forward and Inverse Rendering using Diffusion Models:
  - The emergence of latent diffusion models has enabled a new paradigm for learning the joint probability distribution between images and their intrinsic maps.
  - IID (Intrinsic Image Diffusion) [Kocsis et al. 2024]: One of the first works to train a latent diffusion model for estimating albedo, roughness, and metallicity for indoor scenes.
  - RGB→X [Zeng et al. 2024]: Proposes a framework using diffusion models for both FR and IR, also primarily for indoor scenes.
  - DiffusionRenderer [Liang et al. 2025]: Fine-tunes Stable Video Diffusion to achieve temporally consistent intrinsic map estimation.
  - GeoWizard [Fu et al. 2024]: Focuses on 3D geometry estimation from a single image, trained across indoor and outdoor scenes.
  - IDArb [Li et al. 2024]: Deals with intrinsic decomposition for arbitrary numbers of input views and illuminations, but is limited to the object level.
  - Common Limitation: The paper highlights that these prevailing diffusion-based methods are predominantly designed for indoor scenes or object-level tasks. They struggle with the dynamic and intricate illumination conditions and expanded scene dimensions characteristic of AD environments. Crucially, they suffer substantial performance degradation in the presence of diverse weather conditions (e.g., rain, snow, fog), as illustrated in Figure 2.
3.3. Technological Evolution
The evolution of rendering and scene understanding has progressed through several stages:
- Traditional Graphics (Pre-Deep Learning): Focused on physically accurate simulations (ray tracing, rasterization) requiring explicit 3D models, materials, and light sources. Inverse problems were largely analytical or optimization-based, often making simplifying assumptions.
- Learning-based Inverse Rendering (Early Deep Learning): Used Convolutional Neural Networks (CNNs) to learn mappings from images to intrinsic maps. These models learned from large synthetic datasets, overcoming some limitations of explicit modeling but still often relying on simplified physics or limited domain generalization.
- Generative Models for Image Synthesis: VAEs and GANs emerged, capable of generating photo-realistic images. While powerful for synthesis, their application to inverse problems was less direct, and they sometimes struggled with control and diversity.
- Diffusion Models Revolution (Current State): Diffusion models (DDPM, latent diffusion) offered unprecedented image quality, diversity, and controllable generation through conditioning. This led to their application in inverse problems, learning the joint distribution of images and their underlying properties.
- Diffusion for Rendering (Specific Application): Recent works (IID, RGB→X) adapted diffusion models for intrinsic image decomposition and neural rendering. However, these largely focused on constrained environments (indoor, object-level).
- WeatherDiffusion's Position: This paper represents a crucial step in extending diffusion-based rendering to challenging real-world autonomous driving scenarios with complex and diverse weather conditions, a domain previously underserved by high-quality, robust solutions. It builds upon latent diffusion models (specifically Stable Diffusion 3.5) and introduces AD-specific priors (datasets, MAA) to handle the unique scale, dynamics, and atmospheric effects of outdoor driving scenes.
3.4. Differentiation Analysis
Compared to the main methods in related work, WeatherDiffusion introduces several core differences and innovations:
- Target Domain and Robustness:
  - Related Work: IID, RGB→X, and DiffusionRenderer are primarily designed for indoor scenes or object-level tasks. GeoWizard extends to some outdoor scenes but doesn't explicitly focus on diverse weather conditions. IDArb is object-level. None specifically address the unique complexities of large-scale AD environments under dynamic weather.
  - WeatherDiffusion: Explicitly targets large-scale, multi-weather autonomous driving scenarios. It is built to be robust to rain, snow, fog, and varying illumination, which cause significant performance degradation in existing systems (as highlighted in Figure 2).
- Specialized Data:
  - Related Work: Rely on existing indoor datasets (OpenRooms, Hypersim, InteriorVerse) or general outdoor datasets (MatrixCity) that lack comprehensive weather diversity for rendering tasks.
  - WeatherDiffusion: Introduces WeatherSynthetic and WeatherReal, two purpose-built datasets that provide extensive weather and lighting variations for AD scenes, along with intrinsic maps, addressing a critical data gap.
- Architectural Enhancements for Control and Accuracy:
  - Related Work: While using diffusion, they lack specific mechanisms to guide the model's attention or conditioning for complex, weather-affected intrinsic decomposition in AD.
  - WeatherDiffusion: Proposes Intrinsic map-aware attention (MAA). This mechanism provides customized visual detail guidance by enabling the model to focus on specific local regions relevant to each intrinsic map (e.g., metallic objects for metallicity), which is crucial for accurate decomposition in complex scenes. It effectively replaces generic text guidance with semantic visual priors.
  - Weather-Guided Conditioning: Integrates a weather controller (one-hot encoded weather categories) into the diffusion process, allowing the model to explicitly learn and distinguish different weather conditions, which helps resolve ambiguities (e.g., low visibility from fog vs. shadows).
- Base Model Adaptation:
  - Related Work: Use various diffusion models, but not necessarily the latest, nor ones specifically adapted for the latent-space challenges of AD scenes.
  - WeatherDiffusion: Finetunes Stable Diffusion 3.5 medium, noting that its redesigned 16-channel latent space is beneficial for handling larger view ranges and complex scale variations in outdoor scenes, a key advantage for AD.

In summary, WeatherDiffusion differentiates itself by specifically tailoring a diffusion-based FR/IR framework to the autonomous driving domain, addressing its unique challenges through purpose-built datasets, weather-aware conditioning, and map-specific attention mechanisms, leading to superior performance and practical utility for AD downstream tasks.
4. Methodology
4.1. Principles
The core principle of WeatherDiffusion is to leverage the powerful generative capabilities of latent diffusion models, specifically Stable Diffusion 3.5, and adapt them for robust and controllable forward rendering (FR) and inverse rendering (IR) in autonomous driving (AD) scenes under various weather and lighting conditions. The method is built on two main ideas:
- Weather-Guided Diffusion: Enhancing the diffusion model's conditioning mechanism to explicitly account for and learn from diverse weather conditions, enabling accurate intrinsic map decomposition and realistic image synthesis. This involves categorizing weather and integrating it into the latent space modulation.
- Intrinsic Map-Aware Attention (MAA): Introducing a novel attention mechanism that guides the diffusion model to focus on semantically relevant regions of an image when predicting specific intrinsic maps. This addresses the challenge of large-scale AD scenes where different properties (e.g., metallicity, normal) require attention to distinct visual cues.

The framework assumes that while weather phenomena like raindrops or fog are complex, their primary effect can be captured within the irradiance component, leaving material properties (like albedo) largely unaffected. This simplification allows for effective decomposition.
4.2. Core Methodology In-depth (Layer by Layer)
The WeatherDiffusion framework involves finetuning two separate Stable Diffusion 3.5 models: one for Inverse Rendering (IR) and another for Forward Rendering (FR). Before detailing these, let's briefly revisit the general diffusion model process as described in the paper, which forms the foundation.
4.2.1. Basic Diffusion Model Process
The paper utilizes a latent diffusion model setup. Given an image $I$ (for FR) or an intrinsic map $\boldsymbol{y}$ (for IR, with a map-specific number of channels), a pre-trained encoder maps these from pixel space to a lower-dimensional latent space.
- Latent Space Encoding: The original image and intrinsic map are encoded into their respective latent representations:
  $ x_0 = \mathcal{E}(I), \quad z_0 = \mathcal{E}(\boldsymbol{y}). $
  - $x_0$: The latent representation of the input image $I$.
  - $z_0$: The latent representation of the intrinsic map $\boldsymbol{y}$.
  - $\mathcal{E}$: The pre-trained encoder network that maps high-dimensional pixel data to a lower-dimensional latent space. The paper mentions that SD 3.5 redesigns its latent space to have 16 channels, which is beneficial for outdoor scenes.
- Noising Process (Rectified Flow Matching): Following Rectified Flow Matching [Lipman et al. 2022], random Gaussian noise is added to the latent to create a noisy latent at timestep $t$:
  $ z_t = (1 - t) z_0 + t \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), $
  - $z_t$: The noisy latent variable at timestep $t$.
  - $z_0$: The original clean latent variable of the intrinsic map (in IR) or image (in FR).
  - $t$: A continuous denoising timestep, typically ranging from 0 to 1; $t = 0$ means no noise, $t = 1$ means pure noise.
  - $\epsilon$: Random noise sampled from a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{0}$ is the zero vector and $\mathbf{I}$ is the identity matrix; the noise has zero mean and unit variance.
- Velocity Field Estimation: A neural network, specifically a DiT (Diffusion Transformer) [Peebles and Xie 2023], is used to estimate the "velocity field" at a given timestep. This velocity field represents the direction and magnitude of change needed to transition from the noisy latent $z_t$ back to the original clean latent $z_0$. The estimation can be expressed as:
  $ v_\theta(z_t, c, t) = \frac{\mathrm{d}z_t}{\mathrm{d}t} = \epsilon - z_0, $
  - $v_\theta(z_t, c, t)$: The velocity field predicted by the DiT model, parameterized by $\theta$.
  - $z_t$: The noisy latent variable input to the DiT.
  - $c$: The vision or text condition provided to guide the DiT model.
  - $t$: The current denoising timestep.
  - The model learns to predict the difference between the random noise $\epsilon$ and the original latent $z_0$.
- Loss Function: The DiT model is trained by minimizing the following loss function, which makes the predicted velocity field match the true difference $\epsilon - z_0$ (a minimal training-step sketch follows this list):
  $ L_\theta = \mathbb{E}_{t \sim \mathcal{U}(0,1),\ \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \lVert v_\theta(z_t, c, t) - (\epsilon - z_0) \rVert_2^2 \right]. $
  - $\mathbb{E}$: Expectation operator.
  - $t \sim \mathcal{U}(0,1)$: The timestep is sampled uniformly between 0 and 1.
  - $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: Noise is sampled from a standard Gaussian distribution.
  - $\lVert \cdot \rVert_2^2$: The squared $\ell_2$ norm (Euclidean distance), measuring the difference between the predicted and target velocities.
  - Minimizing this loss allows the model to accurately estimate the velocity field. Once trained, the estimated velocity field is used in the reverse process to progressively denoise $z_t$ back to $z_0$.
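For readers who prefer code, here is a minimal PyTorch-style sketch of one rectified-flow training step under these definitions; `dit`, `encoder`, and `condition` are hypothetical placeholders rather than the paper's actual modules:

```python
import torch

def rectified_flow_step(dit, encoder, image, condition, optimizer):
    """One training step of rectified flow matching (illustrative sketch).

    `dit` is any network taking (z_t, condition, t); `encoder` maps pixels to
    latents. Both are hypothetical placeholders, not the paper's exact modules.
    """
    z0 = encoder(image)                                   # clean latent
    eps = torch.randn_like(z0)                            # Gaussian noise
    t = torch.rand(z0.shape[0], device=z0.device)         # t ~ U(0, 1)
    t_ = t.view(-1, *([1] * (z0.dim() - 1)))              # broadcast over latent dims
    z_t = (1 - t_) * z0 + t_ * eps                        # noisy latent
    v_pred = dit(z_t, condition, t)                       # predicted velocity field
    loss = ((v_pred - (eps - z0)) ** 2).mean()            # match the target eps - z0
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```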
4.2.2. Inverse Rendering (IR)
The IR diffusion model is finetuned from SD 3.5 to decompose an input image into a set of intrinsic maps: albedo ($\mathbf{a}$), normal ($\mathbf{n}$), roughness ($\mathbf{r}$), metallicity ($\mathbf{m}$), and irradiance ($\mathbf{i}$) (as defined in Section 3.1). The key challenge here is that SD 3.5 might struggle with low-visibility weather conditions, potentially confusing them with shadowed regions. To address this, the authors introduce a weather controller.
- Weather Controller:
  - Categorization: Weather conditions are grouped into nine distinct classes based on visual similarities in lighting and particle types (e.g., "sunny", "rainy/thunderstorm", "snow", "foggy").
  - Encoding: These categories are encoded as one-hot vectors. A one-hot vector is a binary vector where a single bit is "hot" (1) and all others are "cold" (0), representing a specific category. For instance, with 9 categories, a rainy day might be [0, 1, 0, 0, 0, 0, 0, 0, 0].
  - Positional Encoding: The one-hot weather controller ($r_i$) is first transformed using positional encoding. This technique, commonly used in transformers, adds information about the position of an element in a sequence; here it likely enriches the categorical weather information, perhaps allowing the model to represent the "degree" or "type" of weather more robustly.
  - Conditional Input Generation: The positional-encoded weather controller is then combined with the denoising timestep $t$ and a text projection $\tau(c)$, derived from a text prompt (e.g., indicating the intrinsic map to decompose, or general scene context). These three components are summed to form a comprehensive diffusion condition, which is fed through a Multi-Layer Perceptron (MLP) to predict scale ($\alpha$) and shift ($\beta$) parameters:
    $ \{\alpha, \beta\} = \mathrm{MLP}\big(f_{\mathrm{weather}}(r_i) + f_{\mathrm{time}}(t) + f_{\mathrm{text}}(\tau(c))\big), $
    - $\alpha, \beta$: Scale and shift parameters predicted by the MLP.
    - $\mathrm{MLP}$: A Multi-Layer Perceptron, a type of feedforward neural network.
    - $f_{\mathrm{weather}}(r_i)$: The embedding of the positional-encoded weather controller $r_i$.
    - $f_{\mathrm{time}}(t)$: The embedding of the denoising timestep $t$.
    - $f_{\mathrm{text}}(\tau(c))$: The projection of the text prompt $c$.
    - The sum of these three components forms the core conditional input.
- LayerNorm Modulation: The predicted $\alpha$ and $\beta$ parameters are then used to modulate the input features of the DiT model, specifically within its Layer Normalization (LN) layers. This technique, similar to adaptive instance normalization (AdaIN) or conditional normalization, allows the model to dynamically adjust its internal representations based on the weather, timestep, and text conditions (a minimal sketch of this conditioning path follows this list):
  $ h_{\mathrm{norm}} = \mathrm{LN}(h) \odot (1 + \alpha) + \beta, $
  - $h$: The input features to be normalized and modulated within the DiT network.
  - $\mathrm{LN}(h)$: Layer Normalization applied to $h$; it normalizes the activations across the features of each sample independently, making training more stable.
  - $\odot$: Element-wise (Hadamard) product.
  - $h_{\mathrm{norm}}$: The normalized and modulated features, which are then passed through the rest of the DiT network. This modulation ensures that the model's processing of features is aware of the specific weather conditions, helping it to distinguish subtle visual cues that might otherwise be ambiguous (e.g., separating rain textures from shadows).
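To illustrate how the weather controller, timestep, and text projection could feed such an adaLN-style modulation, here is a small hedged sketch; all layer sizes, the 512-dimensional text features, and the module structure are assumptions made for clarity, not the paper's reported configuration:

```python
import torch
import torch.nn as nn

class WeatherModulation(nn.Module):
    """Illustrative sketch of adaLN-style conditioning on weather, timestep, and text.

    Dimensions and layer choices are arbitrary assumptions, not the paper's values.
    """
    def __init__(self, num_weather=9, dim=256):
        super().__init__()
        self.weather_embed = nn.Linear(num_weather, dim)   # stands in for positional encoding
        self.time_embed = nn.Linear(1, dim)
        self.text_proj = nn.Linear(512, dim)               # assumes 512-d text features
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, h, weather_onehot, t, text_feat):
        cond = (self.weather_embed(weather_onehot)
                + self.time_embed(t.unsqueeze(-1))
                + self.text_proj(text_feat))
        alpha, beta = self.mlp(cond).chunk(2, dim=-1)      # predict scale and shift
        return self.norm(h) * (1 + alpha) + beta           # modulated LayerNorm output

# Toy usage: batch of 2, feature dim 256, rainy-day one-hot condition.
mod = WeatherModulation()
h = torch.randn(2, 256)
weather = torch.zeros(2, 9); weather[:, 1] = 1.0
out = mod(h, weather, torch.rand(2), torch.randn(2, 512))
```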
4.2.3. Forward Rendering (FR)
The FR diffusion model is also finetuned from SD 3.5. Its task is to synthesize an image given a set of decomposed intrinsic maps and a text prompt specifying the target weather condition.
- Robustness to Absent Intrinsic Maps: To improve the model's generalization and robustness, especially when some intrinsic maps might not be available or are unreliable, the authors employ a random dropping strategy during training: each intrinsic map latent is replaced by a zero matrix with probability $p$ (a short sketch follows this list).
  $ M = \{\hat{z}_a, \hat{z}_n, \hat{z}_r, \hat{z}_m, \hat{z}_i\}, \quad \hat{z}_j = \begin{cases} \mathbf{0}, & \text{w.p. } p \\ z_j, & \text{w.p. } 1 - p \end{cases} $
  - $M$: The set of intrinsic map latents provided as input to the FR model.
  - $\hat{z}_a, \hat{z}_n, \hat{z}_r, \hat{z}_m, \hat{z}_i$: The latent representations of the albedo, normal, roughness, metallicity, and irradiance maps, respectively, possibly modified by the dropping strategy.
  - $z_j$: The original latent representation of the $j$-th intrinsic map (computed via the encoder $\mathcal{E}$).
  - $\mathbf{0}$: A zero matrix, used to replace a dropped map.
  - "w.p. $p$": with probability $p$.
  - This strategy forces the FR model to learn to generate images even from incomplete sets of intrinsic maps, making it more flexible and robust in real-world applications where complete and perfect intrinsic maps might not always be available.
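A short sketch of how such random dropping could be implemented during FR training follows; the dictionary layout, latent shapes, and the value of `p` are illustrative assumptions:

```python
import torch

def drop_intrinsic_maps(latents, p=0.3):
    """Illustrative sketch of the random-dropping strategy for FR training.

    `latents` is a dict of intrinsic-map latents (albedo, normal, ...); each is
    independently zeroed with probability `p`. The value of `p` is an assumption.
    """
    dropped = {}
    for name, z in latents.items():
        if torch.rand(1).item() < p:
            dropped[name] = torch.zeros_like(z)   # map treated as unavailable for this sample
        else:
            dropped[name] = z
    return dropped

# Toy usage: five intrinsic-map latents with assumed 16-channel, 32x32 shapes.
latents = {k: torch.randn(1, 16, 32, 32) for k in
           ["albedo", "normal", "roughness", "metallicity", "irradiance"]}
inputs = drop_intrinsic_maps(latents, p=0.3)
```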
4.2.4. Intrinsic Map-Aware Attention (MAA)
The Intrinsic Map-Aware Attention (MAA) mechanism is designed to provide detailed visual guidance for the IR diffusion model, replacing or augmenting generic text guidance for improved decomposition quality. It is motivated by the observation that different intrinsic maps require attention to specific, distinct regions of an image (e.g., metallicity for cars, normal for road surfaces).
- Patch Token Extraction: First, the input image is processed by DINOv2 [Oquab et al. 2023], a powerful self-supervised vision transformer. DINOv2 extracts a set of patch tokens ($\mathbf{p}$) from the image. Each patch token is a feature vector representing a small spatial region (patch) of the image, encoding its visual characteristics. The paper notes that DINOv2 tokens exhibit strong intra-class consistency, meaning patches belonging to the same material or structure have similar representations, which is useful for semantic guidance.
- Map-Specific Learnable Embeddings: For each type of intrinsic map (albedo, normal, roughness, metallicity, irradiance), the model maintains a learnable embedding $\mathbf{d} \in \mathbb{R}^{D_{\mathrm{model}}}$. These embeddings conceptually capture the inherent characteristics or "essence" of what that specific intrinsic map represents.
- Gating Mechanism with Map-Aware Cross-Attention: A gating mechanism, implemented as a cross-attention layer, filters the DINOv2 patch tokens ($\mathbf{p}$) based on the learnable embedding ($\mathbf{d}$) of the target intrinsic map being predicted. This generates map-aware patch tokens ($\mathbf{p}'$) that emphasize regions relevant to that target map:
  $ p' = \mathrm{gating}(p, d) = \mathrm{Softmax}\Big(\frac{(d W_Q)(p W_K)^T}{\sqrt{d_k}}\Big)\, p W_V, $
  - $p'$: The map-aware patch tokens, i.e., the output of the cross-attention, selectively highlighting features relevant to the specific intrinsic map.
  - $p$: The original DINOv2 patch tokens extracted from the image.
  - $d$: The learnable embedding corresponding to the target intrinsic map; it acts as the Query in this cross-attention.
  - $W_Q, W_K, W_V$: Learnable linear projection matrices for Query, Key, and Value, respectively. They transform the inputs into suitable spaces for attention; here $d$ is projected by $W_Q$, and $p$ by $W_K$ and $W_V$.
  - $d_k$: The dimension of the Key vectors, used to scale the dot product inside the softmax.
  - $\mathrm{Softmax}$: Normalizes the attention scores.
  - The term $(d W_Q)(p W_K)^T$ computes the similarity between the map embedding (query) and each patch token (key). This similarity is then used to weight the Value projections of the patch tokens $p W_V$, effectively allowing the map embedding to "attend" to the most relevant image patches. Figure 4 visually demonstrates these attention heatmaps.
- Aggregated Visual Condition: Instead of directly using the map-aware patch tokens ($\mathbf{p}'$), a set of learnable semantic embeddings ($\mathbf{c}$), corresponding to broader semantic categories, is introduced.
  - Modulation: These semantic embeddings ($\mathbf{c}$) are first modulated by classification logits from DINOv2, which likely helps to align them with real-world object categories detected by DINOv2.
  - Fusion via Cross-Attention: The modulated semantic embeddings ($\mathbf{c}$) are then fused with the map-aware patch tokens ($\mathbf{p}'$) using another cross-attention mechanism, combining high-level semantic understanding with map-specific visual details:
    $ c' = \alpha \cdot \mathrm{Softmax}\Big(\frac{(c W_Q)(p' W_K)^T}{\sqrt{d_k}}\Big)\, p' W_V + c, $
    - $c'$: The final visual condition that is input to the DiT model for IR.
    - $\alpha$: A learnable coefficient that scales the attention output.
    - $c$: The learnable semantic embeddings (modulated by DINOv2 logits), acting as the Query in this second cross-attention.
    - $p'$: The map-aware patch tokens from the previous step, acting as Keys and Values.
    - $W_Q, W_K, W_V$: Again, learnable projection matrices.
    - $d_k$: Dimension of the keys, used for scaling.
    - The output of this cross-attention is added back to the original semantic embeddings ($\mathbf{c}$) to form the final, rich visual condition $c'$. This condition captures both general semantic information and the specific visual cues relevant to the intrinsic map being decomposed, guiding the DiT model to produce high-quality results. A minimal sketch of both attention steps is given at the end of this subsection.

This intricate interplay of weather conditioning, latent space operations, and map-aware visual guidance enables WeatherDiffusion to robustly perform forward and inverse rendering in complex AD scenes.
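Below is a minimal, single-head sketch of the two MAA cross-attention steps described above; the feature dimension, the number of map and semantic embedding tokens, and the single-head design are assumptions for clarity rather than the paper's exact configuration:

```python
import torch
import torch.nn as nn

def cross_attention(q, k, v):
    """Scaled dot-product cross-attention: Softmax(q k^T / sqrt(d_k)) v."""
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    return torch.softmax(scores, dim=-1) @ v

class MapAwareAttention(nn.Module):
    """Illustrative single-head sketch of the two MAA cross-attention steps."""
    def __init__(self, dim=768, num_map_tokens=4, num_semantic=16):
        super().__init__()
        self.d = nn.Parameter(torch.randn(num_map_tokens, dim))   # embedding of one intrinsic map
        self.c = nn.Parameter(torch.randn(num_semantic, dim))     # learnable semantic embeddings
        self.Wq1, self.Wk1, self.Wv1 = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.Wq2, self.Wk2, self.Wv2 = (nn.Linear(dim, dim, bias=False) for _ in range(3))
        self.alpha = nn.Parameter(torch.zeros(1))                  # learnable output scale

    def forward(self, patch_tokens):
        """patch_tokens: (N, dim) DINOv2 patch features for one image."""
        # Gating: map embedding attends over patch tokens -> map-aware tokens p'.
        p_prime = cross_attention(self.Wq1(self.d), self.Wk1(patch_tokens), self.Wv1(patch_tokens))
        # Fusion: semantic embeddings attend over p', with a residual back to c.
        c_prime = self.alpha * cross_attention(self.Wq2(self.c), self.Wk2(p_prime), self.Wv2(p_prime)) + self.c
        return c_prime                                             # visual condition fed to the DiT

maa = MapAwareAttention()
tokens = torch.randn(256, 768)    # e.g., 16x16 grid of DINOv2 patch tokens
cond = maa(tokens)                # shape (16, 768)
```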
5. Experimental Setup
5.1. Datasets
The paper highlights that existing datasets are insufficient for large-scale AD scenes under complex weather conditions. Therefore, they introduce two novel datasets: WeatherSynthetic and WeatherReal.
- WeatherSynthetic:
  - Source: Synthetic data.
  - Scale: 35K images.
  - Characteristics: Encompasses a wide range of scene types and weather conditions.
    - Weather Types: sunny, overcast, rainy, thunderstorm, snowy, foggy, sandstorm.
    - Times of Day: early morning, morning, noon, afternoon, night.
    - Environments: urban, suburban, highway, parking.
  - Generation: Rendered using Unreal Engine 5. Commercial 3D assets (licensed for use with generative models) were purchased from Fab. The rendering pipeline uses the movie render queue and multi-sample anti-aliasing for high quality, and the UltraDynamicSky and UltraDynamicWeather assets control weather and time of day. All images are in linear space without tone mapping.
  - Purpose: To provide a large-scale, diverse dataset with ground-truth intrinsic maps (albedo, normal, roughness, metallicity, irradiance) for training and evaluation.
  - Example Data Sample: Figure 5 shows example scenes from WeatherSynthetic, demonstrating diverse lighting and weather conditions across urban environments; each row displays the same street scene under, for instance, sunny, foggy, snowy, and night conditions.
- WeatherReal:
  - Source: Real-world data, derived from open-source AD datasets.
  - Scale: 10K images.
  - Characteristics: Designed for IR on AD scenes with various weather conditions.
  - Generation:
    - Started with datasets like Waymo [Sun et al. 2020] and KITTI [Geiger et al. 2013], which were originally collected under sunny conditions.
    - Used the model trained on WeatherSynthetic to generate pseudo ground-truth intrinsic maps for these real-world images.
    - Applied weather augmentation using a combination of generative models and image processing: InstructPix2Pix [Brooks et al. 2023] (a pre-trained image editing model) was used to alter surfaces (e.g., converting dry roads to wet, adding snow to objects); the depth maps provided by the original datasets were used to synthesize realistic fog effects (a small illustrative sketch of depth-based fog synthesis is given after Table 1); and randomized noise patterns were used to simulate falling snowflakes or raindrops.
  - Purpose: To bridge the domain gap between synthetic and real-world data and address the severe degradation caused by extreme real-world weather (e.g., heavy fog, torrential rain) that synthetic data alone might not fully capture.

The following table summarizes the datasets discussed in the paper, including the two newly introduced datasets. The following are the results from [Table 1] of the original paper:
| Dataset | Images | Scene | Source | Weather | Albedo | Normal | Roughness | Metallicity | Irradiance |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| InteriorVerse | 50K | Indoor | Synthetic | × | – | – | – | – | – |
| Hypersim | 70K | Indoor | Synthetic | × | – | – | – | – | – |
| OpenRooms | 118K | Indoor | Synthetic | × | – | – | – | – | – |
| MatrixCity | 316K | City | Synthetic | × | – | – | – | – | – |
| WeatherSynthetic | 35K | City | Synthetic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| WeatherReal | 10K | City | Real-world | ✓ (augmented) | pseudo-GT | pseudo-GT | pseudo-GT | pseudo-GT | pseudo-GT |

Note: Several per-dataset markings in the original table were garbled rendering artifacts (e.g., `xxx>`, `JSSS`, `~`); cells that could not be recovered are shown as "–". WeatherSynthetic appears to provide all five intrinsic maps together with diverse weather, while WeatherReal obtains weather via augmentation and uses pseudo ground truths for its intrinsic maps.
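As a concrete illustration of the depth-based fog augmentation mentioned for WeatherReal, here is a small sketch using the standard atmospheric scattering model $I = J\,t + A\,(1 - t)$ with transmission $t = e^{-\beta d}$; this is a common way to synthesize fog from a depth map and is offered as an assumption about how such augmentation is typically done, not the paper's exact pipeline (the attenuation and airlight values are arbitrary):

```python
import numpy as np

def add_fog(image, depth, beta=0.08, airlight=0.9):
    """Synthesize fog on a clean image using its depth map (illustrative sketch).

    image: HxWx3 float array in [0, 1]; depth: HxW array in meters.
    beta (attenuation) and airlight (atmospheric light) are illustrative values.
    """
    transmission = np.exp(-beta * depth)[..., None]            # t = exp(-beta * d), per pixel
    foggy = image * transmission + airlight * (1.0 - transmission)
    return np.clip(foggy, 0.0, 1.0)

# Toy usage: a flat gray image whose depth increases from left to right.
img = np.full((4, 6, 3), 0.5)
depth = np.tile(np.linspace(1.0, 50.0, 6), (4, 1))
foggy = add_fog(img, depth)   # distant (right-hand) pixels fade toward the airlight
```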
5.2. Evaluation Metrics
The paper uses several standard metrics to evaluate the quality of inverse rendering results. These metrics assess different aspects of image similarity and reconstruction accuracy.
- PSNR (Peak Signal-to-Noise Ratio):
  - Conceptual Definition: A widely used metric to quantify the quality of reconstruction in lossy compression and image processing. It measures the ratio between the maximum possible power of a signal and the power of the corrupting noise that affects the fidelity of its representation. Higher PSNR values generally indicate better quality (less noise/distortion).
  - Mathematical Formula:
    $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $
    where $\mathrm{MSE}$ is the Mean Squared Error:
    $ \mathrm{MSE} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} [I(i,j) - K(i,j)]^2 $
  - Symbol Explanation:
    - $MAX_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image, per color channel for an RGB image).
    - $\mathrm{MSE}$: Mean Squared Error.
    - $W$: Width of the image.
    - $H$: Height of the image.
    - $I(i,j)$: The pixel value of the ground-truth (original) image at row $i$ and column $j$.
    - $K(i,j)$: The pixel value of the predicted (reconstructed) image at row $i$ and column $j$.
- SSIM (Structural Similarity Index Measure):
  - Conceptual Definition: A metric designed to measure the perceived structural similarity between two images, considering luminance, contrast, and structure. It is often believed to correlate better with human visual perception than PSNR. Values range from -1 to 1, with 1 indicating perfect structural similarity. Higher values are better.
  - Mathematical Formula:
    $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  - Symbol Explanation:
    - $x$: Pixel values of the first image (or a local window of it).
    - $y$: Pixel values of the second image (or a local window of it).
    - $\mu_x$: The average of $x$.
    - $\mu_y$: The average of $y$.
    - $\sigma_x^2$: The variance of $x$.
    - $\sigma_y^2$: The variance of $y$.
    - $\sigma_{xy}$: The covariance of $x$ and $y$.
    - $c_1 = (k_1 L)^2$: A small constant that stabilizes the division when $\mu_x^2 + \mu_y^2$ is very close to zero.
    - $c_2 = (k_2 L)^2$: A small constant that stabilizes the division when $\sigma_x^2 + \sigma_y^2$ is very close to zero.
    - $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
    - $k_1, k_2$: Small constants (typically $k_1 = 0.01$ and $k_2 = 0.03$).
- MAE (Mean Angular Error):
  - Conceptual Definition: Primarily used for evaluating the accuracy of predicted normal maps. It measures the average angular difference (in degrees or radians) between the ground-truth normal vectors and the predicted normal vectors at each pixel. Lower MAE values indicate more accurate normal predictions.
  - Mathematical Formula:
    $ \mathrm{MAE} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \arccos\big(\mathbf{n}_{GT}(i,j) \cdot \mathbf{n}_{Pred}(i,j)\big) $
  - Symbol Explanation:
    - $W$: Width of the image.
    - $H$: Height of the image.
    - $\arccos$: The inverse cosine function, which returns the angle whose cosine is the argument.
    - $\mathbf{n}_{GT}(i,j)$: The 3D unit normal vector of the ground-truth normal map at pixel (i,j).
    - $\mathbf{n}_{Pred}(i,j)$: The 3D unit normal vector of the predicted normal map at pixel (i,j).
    - $\cdot$: The dot product operator; the dot product of two unit vectors equals the cosine of the angle between them.
- LPIPS (Learned Perceptual Image Patch Similarity):
  - Conceptual Definition: A perceptual metric that aims to quantify the difference between two images in a way that aligns better with human perception of similarity than traditional pixel-wise metrics (like PSNR or MSE). It uses features extracted from a pre-trained deep neural network (e.g., VGG, AlexNet) to compare images. Lower LPIPS values indicate higher perceptual similarity.
  - Computation: LPIPS does not have a simple closed-form formula. It is computed as follows (a minimal sketch of the simpler metrics follows this list):
    - Two image patches (or full images) are passed through a pre-trained deep convolutional neural network (a "feature extractor", often AlexNet, VGG, or SqueezeNet trained on ImageNet).
    - Feature maps are extracted from one or more intermediate layers of this network.
    - These feature maps are then spatially averaged and normalized across channels.
    - Finally, the $\ell_2$ (Euclidean) distance between the normalized feature vectors of the two images is computed and averaged across the chosen layers.
  - Symbol Explanation: While a concise formula is not available, the process involves:
    - $f^{(l)}(I_1), f^{(l)}(I_2)$: Feature representations of image 1 and image 2 at layer $l$, extracted by the pre-trained network.
    - $w_l$: Learned weights for each layer of the feature extractor.
    - $\lVert \cdot \rVert_2$: The $\ell_2$ norm (Euclidean distance).
    - The metric essentially calculates a weighted distance between deep features of the two images.
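For the simpler metrics, a minimal NumPy sketch of PSNR and mean angular error follows (SSIM and LPIPS are usually computed with existing library implementations); the toy data below are placeholders:

```python
import numpy as np

def psnr(gt, pred, max_val=1.0):
    """PSNR for float images in [0, max_val]."""
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mean_angular_error_deg(n_gt, n_pred):
    """Mean angular error (degrees) between two HxWx3 unit-normal maps."""
    cos = np.clip(np.sum(n_gt * n_pred, axis=-1), -1.0, 1.0)   # per-pixel dot product
    return np.degrees(np.arccos(cos)).mean()

# Toy usage with random data (stand-ins for ground-truth and predicted maps).
gt = np.random.rand(64, 64, 3)
pred = np.clip(gt + 0.05 * np.random.randn(64, 64, 3), 0, 1)
normals = np.random.randn(64, 64, 3)
normals /= np.linalg.norm(normals, axis=-1, keepdims=True)
print(psnr(gt, pred), mean_angular_error_deg(normals, normals))
```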
5.3. Baselines
The paper compares WeatherDiffusion against several state-of-the-art methods in inverse rendering, primarily those based on diffusion models, chosen for their relevance to the task and their prominence in the field:
- IID (Intrinsic Image Diffusion) [Kocsis et al. 2024]:
- Description: This method was one of the first to train a latent diffusion model specifically for estimating material properties (albedo, roughness, metallicity) from a single indoor image.
- Representativeness: It represents the foundational application of diffusion models to intrinsic image decomposition.
- RGB→X [Zeng et al. 2024]:
- Description: A framework that uses diffusion models for both forward rendering and inverse rendering. It aims to decompose images into material and lighting-aware components.
- Representativeness: It's a direct competitor as it also handles both FR and IR using diffusion. The paper explicitly mentions finetuning RGB→X (and IID) on their dataset for a fair comparison, as the original RGB→X was trained on indoor datasets.
- GeoWizard [Fu et al. 2024]:
- Description: This method focuses on 3D geometry estimation from a single image, leveraging diffusion priors. It is noted to be trained across both indoor and outdoor scenes.
- Representativeness: It offers a baseline for geometry-focused decomposition, and its training on outdoor scenes makes it more relevant than purely indoor methods.
- IDArb (Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations) [Li et al. 2024]:
- Description: This method addresses intrinsic decomposition from multiple views and illuminations but is explicitly stated to be limited to the object level.
- Representativeness: While not directly applicable to large-scale scenes, it represents advanced intrinsic decomposition techniques at a finer granularity.

The authors note that IID and RGB→X were originally trained on indoor datasets, GeoWizard works across indoor/outdoor scenes, and IDArb is object-level. This underscores WeatherDiffusion's unique focus on large-scale AD scenes and diverse weather, a domain where these baselines are expected to struggle due to their original training focus. The finetuning of IID and RGB→X on WeatherSynthetic aims to provide a more equitable comparison, showing that even with adaptation, they may not match WeatherDiffusion's specialized design.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate WeatherDiffusion's superior performance in both inverse rendering and forward rendering tasks, especially under challenging weather conditions in AD scenes.
6.1.1. Inverse Rendering on Synthetic Data
- Quantitative Evaluation (Table 2): WeatherDiffusion achieves state-of-the-art performance across all intrinsic maps and metrics (PSNR, SSIM, LPIPS for albedo, roughness, metallicity, irradiance; PSNR, SSIM, MAE for normal) on the WeatherSynthetic dataset.
  - For example, in albedo estimation, WeatherDiffusion achieves a PSNR of 18.02, significantly higher than RGB→X (w/ finetune) at 11.35 and IID (w/ finetune) at 11.55. Similar trends are observed for other maps and metrics.
  - The MAE for normal maps is remarkably low (4.24), indicating highly accurate geometry recovery compared to RGB→X (w/ finetune) (7.05) and GeoWizard (12.47).
  - The LPIPS values, which reflect perceptual quality, are consistently lower for WeatherDiffusion, especially for metallicity (0.14) and irradiance (0.29), suggesting better perceived quality.
- Qualitative Evaluation (Figure 8):
  - Albedo Estimation (Figure 8a): WeatherDiffusion effectively recovers fine details and completely separates illumination from material, which is crucial for intrinsic decomposition. Baselines like RGB→X show less accuracy, while IID and IDArb mistakenly interpret shadows as albedo, even after finetuning.
  - Normal Estimation (Figure 8b): The model successfully eliminates the impact of atmospheric particles (e.g., snowflakes) to restore clean and sharp normals, whereas other methods are severely affected by such occlusions.
  - Roughness and Metallicity Estimation (Figures 8c, 8d): These are particularly challenging under heavy rain and fog due to blurring. WeatherDiffusion successfully detects distant vehicles and accurately distinguishes metallic from non-metallic objects, outperforming baselines that often fail to do so precisely.
  - Irradiance Estimation (Figure 8e): WeatherDiffusion excels at capturing the presence of rain while preserving details of distant objects, demonstrating its ability to robustly model lighting conditions.
6.1.2. Inverse Rendering on Real Data
- Qualitative Evaluation (Figure 10): WeatherDiffusion demonstrates strong generalization from synthetic to real-world data.
  - For images from Waymo (initially sunny), it provides reasonable estimations for all intrinsic maps, suggesting good domain transfer.
  - For images with extreme weather (raindrops on the lens, heavy rain, dense fog) from TransWeather, WeatherDiffusion consistently provides reasonable estimations where other methods (e.g., finetuned RGB→X, IDArb) struggle with vehicles, distant buildings, or decoupling shadow from material.
6.1.3. Forward Rendering
- Synthetic Data (Figure 9): WeatherDiffusion recovers material and geometry better than RGB→X, which tends to generate abnormal textures.
  - It generates images that align well with text descriptions (e.g., "A sunny day in the city"), demonstrating controllable synthesis.
- Real Data (Figures 11, 12): WeatherDiffusion effectively leverages intrinsic maps obtained from inverse rendering to re-render images under different weather and lighting conditions specified by text prompts.
  - It can generate images that closely match the material and geometry of the original image (e.g., using albedo and normal as input).
  - It can simultaneously change materials and weather conditions according to prompts, showcasing its flexibility.
6.1.4. Applications to Downstream Tasks
- Object Detection and Image Segmentation (Figure 12): The re-rendered images from WeatherDiffusion significantly enhance downstream AD tasks (see the short sketch below).
  - In severe weather (e.g., snow), original images cause object detection and image segmentation models to struggle due to occlusions.
  - WeatherDiffusion's re-rendered output (e.g., removing snowflakes while strictly preserving object positions and shapes) provides a much cleaner input, enabling these models to generate accurate predictions and thus improving their robustness in challenging scenarios.
6.2. Data Presentation (Tables)
The following are the results from [Table 2] of the original paper:
| Method | Albedo PSNR↑ | Albedo SSIM↑ | Albedo LPIPS↓ | Normal PSNR↑ | Normal SSIM↑ | Normal MAE↓ | Roughness PSNR↑ | Roughness LPIPS↓ | Metallic PSNR↑ | Metallic LPIPS↓ | Irradiance PSNR↑ | Irradiance LPIPS↓ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| IID | 7.80 | 0.26 | 0.63 | – | – | – | 10.30 | 0.55 | 12.37 | 0.64 | – | – |
| IID (w/ finetune) | 11.55 | 0.53 | 0.40 | – | – | – | 12.34 | 0.43 | 12.22 | 0.55 | – | – |
| RGB→X | 9.66 | 0.44 | 0.47 | 11.90 | 0.41 | 15.51 | 13.62 | 0.55 | – | – | 16.24 | 0.58 |
| RGB→X (w/ finetune) | 11.35 | 0.59 | 0.37 | 16.14 | 0.49 | 7.05 | 13.65 | 0.57 | 11.96 | 0.66 | 16.38 | 0.69 |
| GeoWizard | – | – | – | 16.24 | 0.54 | 12.47 | – | – | – | – | – | – |
| IDArb | 6.40 | 0.48 | 0.65 | 10.77 | 0.43 | 22.42 | 10.70 | 0.62 | 14.66 | 0.62 | – | – |
| ours | 18.02 | 0.66 | 0.35 | 20.95 | 0.61 | 4.24 | 15.03 | 0.45 | 18.94 | 0.14 | 23.55 | 0.29 |
| ours (w/o MAA) | 17.35 | 0.66 | 0.45 | 18.82 | 0.49 | 5.56 | 13.96 | 0.51 | 18.04 | 0.28 | 23.41 | 0.43 |
Note: The original table contains some blank cells, likely indicating that the corresponding method either does not provide that specific intrinsic map estimation or results were not reported for that metric.
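For reference, the per-map metrics in Table 2 are standard image-reconstruction measures. Below is a minimal sketch of how PSNR and the mean angular error (MAE, in degrees) for normal maps could be computed with NumPy, assuming predictions and ground truth are float arrays in [0, 1] (for PSNR) or decoded 3D normal vectors (for MAE); SSIM and LPIPS would typically come from scikit-image and the lpips package, respectively. This is an illustrative sketch, not the authors' evaluation code.

```python
import numpy as np


def psnr(pred: np.ndarray, gt: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred - gt) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val**2 / mse)


def normal_mae_deg(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean angular error in degrees between predicted and GT normal maps of shape (H, W, 3)."""
    pred = pred / np.linalg.norm(pred, axis=-1, keepdims=True)
    gt = gt / np.linalg.norm(gt, axis=-1, keepdims=True)
    cos = np.clip(np.sum(pred * gt, axis=-1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())
```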
6.3. Ablation Studies / Parameter Analysis
The paper conducts an ablation study to understand the contribution of its proposed Intrinsic Map-Aware Attention (MAA).
- Effect of MAA:
  - Methodology: An IR diffusion model was trained without MAA, meaning the visual condition was replaced with the original text guidance that Stable Diffusion 3.5 uses.
  - Quantitative Results (Table 2, row "ours (w/o MAA)"): The model without MAA performs significantly worse across most metrics compared to the full WeatherDiffusion model. For example, Albedo PSNR drops from 18.02 to 17.35; Normal PSNR drops from 20.95 to 18.82, and its MAE increases from 4.24 to 5.56 (higher MAE is worse); Metallic PSNR drops from 18.94 to 18.04, and its LPIPS increases from 0.14 to 0.28 (higher LPIPS is worse).
  - Qualitative Results (Figure 6): The figure visually confirms the improvement. With MAA, the model produces more refined geometry and material predictions; it successfully identifies a metallic handrail and assigns it a reasonable level of metallicity. Without MAA, these details are less accurate or missed entirely. (Figure 6 caption: a comparison of results with and without MAA; the left side shows the input image, and the right side shows the renderings without and with MAA, highlighting MAA's advantage in accurately reconstructing the scene.) As shown in Figure 6, the MAA component significantly improves the quality of inverse rendering, particularly in capturing fine details and material properties such as metallicity.
  - Conclusion: This demonstrates that MAA is crucial for providing the necessary semantic guidance for IR, helping the model to focus on appropriate regions and decompose high-quality intrinsic maps (a minimal attention-masking sketch follows this list).
- Effect of WeatherSynthetic and WeatherReal:
  - The paper explores the effect of its two datasets separately, training the IR diffusion model with only indoor datasets and with only the synthetic dataset.
  - Results: While qualitative results are stated to be in the supplementary material, the main text implies that training solely on synthetic data leads to degradation on real-world extreme weather samples due to the domain gap; WeatherReal helps to mitigate this. This highlights the importance of comprehensive and diverse datasets for robustness.
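To illustrate the idea that the MAA ablation isolates (each intrinsic map attending only to the image regions relevant to it), here is a minimal PyTorch sketch of region-masked cross-attention. It is an illustrative reconstruction of the concept, not the authors' implementation; in WeatherDiffusion the mask would be derived from semantic features such as DINOv2 patches.

```python
import torch
import torch.nn.functional as F


def masked_cross_attention(q, k, v, region_mask):
    """Cross-attention where each query only attends to patches allowed by region_mask.

    q: (B, Nq, D) queries from the intrinsic-map branch
    k, v: (B, Nk, D) keys/values from image patch features (e.g., DINOv2 patches)
    region_mask: (B, Nq, Nk) boolean mask; True means the patch is relevant to this map.
    Each query is assumed to have at least one allowed patch, otherwise softmax yields NaN.
    """
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
    scores = scores.masked_fill(~region_mask, float("-inf"))  # block irrelevant patches
    attn = F.softmax(scores, dim=-1)
    return attn @ v
```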
6.4. Discussion and Limitations
Despite its strong performance, WeatherDiffusion has acknowledged limitations:
- Out-of-Distribution Objects: The model struggles to accurately estimate intrinsic maps for out-of-distribution (OOD) objects (e.g., cranes, heavy trucks) that were not present in its training data. This is a common limitation of data-driven models.
- Extreme Occlusion in Heavy Fog: In heavy fog where distant regions are completely occluded, the model has difficulty distinguishing between ambiguous elements such as the sky and buildings, which often leads to abnormal albedo and normal predictions in such severely degraded areas. Figure 7 illustrates a typical failure case.
(Figure 7 caption: a foggy scene showing the input image, followed by its intrinsic maps and the re-rendered image.) Figure 7 shows a typical failure case in a foggy scene where WeatherDiffusion fails to discriminate between the sky and a building obscured by fog, leading to abnormal intrinsic map predictions.
These limitations point to areas where future research could focus, such as improving generalization to novel objects or developing more robust perception under extreme visual ambiguity.
6.5. Applications
The paper highlights the significant value of WeatherDiffusion in improving the robustness of downstream tasks for autonomous driving (AD), specifically object detection and image segmentation.
- Problem: Existing object detection [Carion et al. 2020] and image segmentation [Xie et al. 2021] methods, while highly accurate in ideal conditions, suffer significant performance degradation in adverse weather due to occlusions and reduced visibility.
- Solution via WeatherDiffusion: By performing inverse rendering, WeatherDiffusion can decompose a scene into its physical attributes. Then, through forward rendering, it can re-render these scenes under new, clearer weather and lighting conditions. This process effectively corrects environmental distortions at the visual input level, providing cleaner inputs for downstream models (a minimal pipeline sketch follows this list).
- Demonstration (Figure 12): The paper provides a compelling example. An original image with severe occlusions from snowflakes makes it difficult for segmentation and detection models to produce reasonable predictions. When WeatherDiffusion re-renders this image by removing the snowflakes while strictly preserving the original objects' positions and shapes, the downstream models are then able to generate accurate predictions. (Figure 12 caption: input images under different weather conditions, ground truth (GT), our results, and RGB→X, showing forward and inverse rendering applied to urban scenes and highlighting the effectiveness of WeatherDiffusion.) Figure 12 clearly demonstrates how WeatherDiffusion enhances object detection and image segmentation in adverse weather by re-rendering images: the original image with snowflakes leads to poor detection and segmentation, while the re-rendered image (snowflakes removed, details preserved) enables accurate results.
- Impact: This capability significantly enhances the robustness of AD systems in challenging weather, which is critical for safety and reliability.
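As a concrete, hypothetical illustration of this evaluation setup, the snippet below runs an off-the-shelf torchvision detector on a re-rendered image. The `rerender_with_weather` helper is the illustrative placeholder sketched in Section 6.1.3, and none of this is the authors' evaluation code.

```python
import torch
from torchvision.models.detection import fasterrcnn_resnet50_fpn
from torchvision.transforms.functional import to_tensor


def detect(image, score_thresh: float = 0.5):
    """Run an off-the-shelf detector on a PIL image and keep confident predictions."""
    model = fasterrcnn_resnet50_fpn(weights="DEFAULT").eval()
    with torch.no_grad():
        out = model([to_tensor(image)])[0]
    keep = out["scores"] > score_thresh
    return out["boxes"][keep], out["labels"][keep]


# Assumed usage: compare detections on the snowy input and on the re-rendered image.
# clean = rerender_with_weather(pipeline, "snowy_street.png",
#                               "the same street on a clear day")
# boxes, labels = detect(clean)  # typically more complete than on the occluded input
```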
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces WeatherDiffusion, a novel and robust framework for both forward and inverse rendering in complex autonomous driving (AD) environments. Its key contributions include:
- A diffusion-based framework capable of decomposing images into intrinsic maps under diverse weather conditions and synthesizing new images based on these maps and text prompts.
- The innovative Intrinsic Map-Aware Attention (MAA) mechanism, which provides targeted visual guidance to the generative model, enabling higher-quality and more reasonable intrinsic map decomposition by focusing on semantically relevant regions.
- The creation of two crucial datasets, WeatherSynthetic and WeatherReal, addressing the critical lack of large-scale, high-quality rendering datasets specifically for AD scenes with varied weather and intrinsic ground truth.

Extensive experiments validate WeatherDiffusion's superior performance compared to state-of-the-art methods and demonstrate its significant practical value in improving the robustness of downstream AD tasks such as object detection and image segmentation in adverse weather conditions.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Dependence on High-Quality Training Data: While WeatherDiffusion introduces new datasets, its performance still inherently relies on the quality and diversity of this training data. The model struggles with out-of-distribution (OOD) objects not present in the training set (e.g., cranes, heavy trucks).
- Extreme Occlusion Handling: In scenarios with heavy fog and severe occlusions, the model has difficulty discriminating between visually ambiguous elements (e.g., sky vs. buildings), leading to inaccurate intrinsic map predictions.
- Future Directions:
  - Reducing Data Dependence: Research efforts could focus on methods to decrease the model's reliance on extensive high-quality training data, possibly through more advanced self-supervised learning or data augmentation techniques.
  - Reinforcement Learning and LLM Feedback: Combining diffusion models with reinforcement learning (RL), guided by human or Large Language Model (LLM) feedback, could further enhance the robustness and controllability of the rendering process. This might allow for more nuanced and context-aware scene manipulation.
  - Auto-regressive Generative Models: Exploring the capabilities of auto-regressive generative models presents opportunities to advance both rendering and inverse rendering tasks, potentially leading to even more coherent and detailed scene generation or decomposition.
7.3. Personal Insights & Critique
This paper presents a highly relevant and impactful contribution to the field of autonomous driving and computer graphics. The integration of diffusion models with specific adaptations for AD scenes and challenging weather is a critical step forward.
- Innovation: The Intrinsic Map-Aware Attention (MAA) is a particularly clever innovation. Instead of relying solely on textual prompts or global image features, it provides fine-grained, map-specific visual guidance by leveraging semantic patches from DINOv2. This bridges the gap between high-level semantic understanding and low-level pixel details, which is crucial for the ill-posed inverse rendering problem, especially in complex environments. The weather controller for conditioning is also a practical and effective way to tackle weather diversity.
- Applicability & Transferability: The core methodology of weather-guided diffusion and MAA could potentially be transferred to other domains requiring robust scene understanding or generation under varying environmental conditions, such as augmented reality, virtual production, or environmental monitoring. The idea of using intrinsic maps to clean up sensor data for downstream tasks is highly valuable and broadly applicable beyond object detection and segmentation (e.g., for depth estimation or visual odometry).
- Potential Issues/Areas for Improvement:
  - Simplifying Weather Assumption: The assumption that complex weather phenomena are primarily reflected in irradiance while material properties remain unaffected, while practical, might be an oversimplification in extreme cases (e.g., very wet surfaces change their roughness and specularity, not just illumination). Future work could explore more dynamic material models that interact with weather.
  - Generalization to Novel Objects: The acknowledged limitation regarding out-of-distribution objects is significant for AD. While re-rendering helps clean the scene, if the intrinsic maps for these novel objects are incorrectly estimated in the first place, the re-rendered output might propagate errors. Incorporating strong 3D priors or few-shot learning for object-level intrinsic decomposition could be a valuable extension.
  - Computational Cost: Diffusion models, especially large ones like SD 3.5, are computationally intensive. While not explicitly discussed, the inference speed for real-time AD applications might be a practical concern, although latent diffusion helps. Further optimization or distillation techniques could be explored.
  - Quantitative Metrics for Real-World Data: The paper primarily relies on qualitative results for real-world data. While understandable given the lack of ground truth, developing more robust quantitative metrics or benchmarks for inverse rendering on diverse real-world weather data would strengthen claims of generalization. WeatherReal, with its pseudo ground truth, is a good step but comes with its own potential biases.

Overall, WeatherDiffusion is a well-designed and highly relevant piece of research that pushes the boundaries of forward and inverse rendering towards practical applications in autonomous driving, effectively leveraging the power of diffusion models to address previously intractable challenges posed by complex weather.