
WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering

Published: 08/09/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

WeatherDiffusion is a novel framework for forward and inverse rendering in autonomous driving, addressing challenges under complex weather and lighting. It introduces an intrinsic map-aware attention mechanism for accurate estimation of material properties, scene geometry, and lighting, and supports controllable weather and illumination editing guided by text descriptions.

Abstract

Forward and inverse rendering have emerged as key techniques for enabling understanding and reconstruction in the context of autonomous driving (AD). However, complex weather and illumination pose great challenges to this task. The emergence of large diffusion models has shown promise in achieving reasonable results through learning from 2D priors, but these models are difficult to control and lack robustness. In this paper, we introduce WeatherDiffusion, a diffusion-based framework for forward and inverse rendering on AD scenes with various weather and lighting conditions. Our method enables authentic estimation of material properties, scene geometry, and lighting, and further supports controllable weather and illumination editing through the use of predicted intrinsic maps guided by text descriptions. We observe that different intrinsic maps should correspond to different regions of the original image. Based on this observation, we propose Intrinsic map-aware attention (MAA) to enable high-quality inverse rendering. Additionally, we introduce a synthetic dataset (i.e., WeatherSynthetic) and a real-world dataset (i.e., WeatherReal) for forward and inverse rendering on AD scenes with diverse weather and lighting. Extensive experiments show that our WeatherDiffusion outperforms state-of-the-art methods on several benchmarks. Moreover, our method demonstrates significant value in downstream tasks for AD, enhancing the robustness of object detection and image segmentation in challenging weather scenarios.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The central topic of the paper is "WeatherDiffusion: Weather-Guided Diffusion Model for Forward and Inverse Rendering." It focuses on developing a diffusion-based framework for rendering and reconstructing autonomous driving (AD) scenes under diverse weather and lighting conditions.

1.2. Authors

The authors are YIXIN ZHU (Nanjing University, China), ZUOLIANG ZHU (Nankai University, China), MILOŠ HAŠAN (Adobe Research, NVIDIA Research, USA), JIAN YANG (Nanjing University, China), JIN XIE (Nanjing University, China), and BEIBEI WANG (Nanjing University, China). Their affiliations suggest a strong background in computer vision, rendering, and potentially autonomous driving research, with contributions from both academia and industry.

1.3. Journal/Conference

The paper is published as a preprint on arXiv, indicated by the original source link https://arxiv.org/abs/2508.06982v1 and PDF link https://arxiv.org/pdf/2508.06982v1.pdf. The listed publication time is 2025-08-09T13:29:39.000Z (UTC). While arXiv is a highly influential platform for rapid dissemination of research, it is not a peer-reviewed journal or conference in itself. Papers on arXiv are often submitted to top-tier computer vision conferences (e.g., CVPR, ICCV, ECCV) or journals (e.g., TPAMI) for formal peer review and publication. Given the cutting-edge nature of diffusion models and rendering, such a paper would typically target a highly reputable venue in computer vision or graphics.

1.4. Publication Year

The paper's listed publication time (UTC) is 2025-08-09T13:29:39.000Z.

1.5. Abstract

The paper introduces WeatherDiffusion, a novel diffusion-based framework designed for forward rendering (generating images from scene properties) and inverse rendering (recovering scene properties from images) in autonomous driving (AD) environments. It specifically addresses the significant challenges posed by complex weather and illumination conditions, which traditional methods and existing diffusion models struggle with due to issues like control and robustness.

WeatherDiffusion is capable of authentically estimating material properties, scene geometry, and lighting. A key feature is its support for controllable weather and illumination editing, achieved by guiding predicted intrinsic maps with text descriptions. The authors propose an Intrinsic map-aware attention (MAA) mechanism, inspired by the observation that different intrinsic maps (e.g., metallicity, normal) correspond to different image regions, to enhance the quality of inverse rendering. Furthermore, the paper introduces two new datasets: WeatherSynthetic (synthetic data) and WeatherReal (real-world data), specifically curated for AD scenes with diverse weather and lighting conditions. Extensive experiments demonstrate that WeatherDiffusion surpasses state-of-the-art methods on various benchmarks. Its practical value is highlighted in downstream AD tasks, where it improves the robustness of object detection and image segmentation in challenging weather scenarios by providing clearer visual inputs.

Original Source Link: https://arxiv.org/abs/2508.06982v1 PDF Link: https://arxiv.org/pdf/2508.06982v1.pdf Publication Status: This paper is a preprint available on arXiv, indicating that it has not yet undergone formal peer review or official publication in a journal or conference proceedings.

2. Executive Summary

2.1. Background & Motivation

The paper addresses the crucial problems of forward rendering (FR) and inverse rendering (IR) in the context of autonomous driving (AD).

  • Core Problem: Both FR and IR, while fundamental for scene understanding and reconstruction in AD, face significant hurdles when confronted with complex weather and illumination conditions.
  • Importance:
    • Forward Rendering (FR): Essential for generating photo-realistic images under varied conditions, which helps train learning-based AD models and expose them to a comprehensive range of scenarios.
    • Inverse Rendering (IR): Aims to recover fundamental scene properties like geometry, material, and lighting from observed images. This enables critical AD applications such as material editing, relighting, and augmented reality, which demand controllable and flexible scene manipulation.
  • Challenges/Gaps in Prior Research:
    1. Traditional Methods (FR): Relied on highly detailed inputs (geometry, material, lighting) that are notoriously difficult to acquire in dynamic real-world AD environments.
    2. Inverse Rendering (IR) as an Ill-Posed Problem: Without strong prior knowledge, multiple scene decompositions can explain the same observed image, making accurate and reasonable recovery of intrinsic properties extremely challenging.
    3. Limitations of Existing Diffusion Models: While large diffusion models have shown promise by learning from 2D priors, they often lack sufficient control and robustness, especially when faced with the complexities of real-world scenarios.
    4. AD-Specific Challenges:
      • Complex Weather and Illumination: Rain, fog, and snow drastically alter lighting, obscure geometry, influence surface characteristics (specular reflections), reduce visibility, and make distant features imperceptible. Existing diffusion-based methods, often trained on indoor or object-level scenes, struggle with these dynamic conditions (as shown in Figure 2).
      • Large Scale and Variance: AD scenes are vast and exhibit much larger variations in object scale and scene depth compared to indoor scenes. This challenges both dataset generalization and the model's ability to focus attention on relevant details without wasting capacity.
      • Lack of High-Quality Datasets: A significant barrier is the absence of large-scale, high-quality datasets specifically designed for FR and IR in AD scenes with diverse weather and lighting conditions, complete with corresponding intrinsic maps.
  • Paper's Entry Point/Innovative Idea: The paper proposes WeatherDiffusion, a specialized diffusion-based framework that leverages and finetunes powerful generative models (like Stable Diffusion 3.5) with novel mechanisms (Intrinsic map-aware attention) and custom datasets to specifically tackle the weather-induced complexities of FR and IR in AD environments. It introduces a weather-guided approach to improve robustness and control.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of forward and inverse rendering for autonomous driving:

  1. WeatherDiffusion Framework: They introduce WeatherDiffusion, a novel diffusion-based framework capable of inverse rendering (decomposing images into intrinsic maps under various weather conditions) and forward rendering (synthesizing images under different lighting or weather conditions based on text prompts and intrinsic maps). This framework provides authentic estimation of material properties, scene geometry, and lighting, and supports controllable weather and illumination editing.
  2. Intrinsic Map-Aware Attention (MAA): They devise MAA, a mechanism that provides customized visual detail guidance for generative models. This module allows the model to selectively focus on semantically important local regions of an image, which is crucial because different intrinsic maps naturally correspond to distinct visual features (e.g., metallicity for metallic objects, normal for surface orientation). This guidance helps the model decompose more reasonable and high-quality intrinsic maps.
  3. Novel Datasets: They construct two new, large-scale datasets tailored for autonomous driving scenarios:
    • WeatherSynthetic: A synthetic dataset offering a wide range of scene types and weather conditions, complete with corresponding intrinsic maps, to address the lack of high-quality data.
    • WeatherReal: A real-world dataset for inverse rendering on AD scenes, created by applying weather augmentation to existing open-source datasets (like Waymo and Kitti) and generating pseudo ground truth.
  4. Superior Performance and Downstream Value: Extensive experiments demonstrate that WeatherDiffusion significantly outperforms state-of-the-art methods on several benchmarks for both synthetic and real-world data. Furthermore, the method proves valuable for downstream AD tasks, enhancing the robustness of object detection and image segmentation in challenging weather by providing clearer, weather-corrected visual inputs.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand WeatherDiffusion, a grasp of several foundational concepts in computer graphics, computer vision, and machine learning is essential:

  • Forward Rendering (FR):

    • Conceptual Definition: The process of generating a 2D image from a 3D scene description. It takes as input scene geometry (shapes, positions), material properties (how surfaces reflect light), lighting conditions (light sources, their intensity, color, direction), and camera parameters (position, orientation, focal length). It then simulates how light interacts with the scene and produces a pixel-by-pixel image.
    • Traditional Methods: These often involve solving the rendering equation using techniques like rasterization (projecting 3D objects onto a 2D screen and filling pixels) or ray tracing (simulating light rays from the camera into the scene to determine pixel colors). These methods require precise knowledge of the 3D scene, which is hard to obtain in real-world scenarios.
  • Inverse Rendering (IR):

    • Conceptual Definition: The inverse problem of FR. Given one or more 2D images of a scene, the goal is to recover the underlying 3D scene properties that generated those images. These properties typically include geometry (3D shape), material (e.g., albedo, roughness, metallicity), and lighting (illumination environment).
    • Ill-Posed Problem: IR is inherently ill-posed, meaning there isn't a unique solution. Multiple combinations of geometry, material, and lighting could produce the same 2D image. For example, a dark surface under bright light might appear identical to a bright surface under dim light. This necessitates strong priors or additional information.
    • Intrinsic Image Decomposition: A common formulation of IR where an image is decomposed into its constituent physical components, known as intrinsic maps. These maps represent properties like:
      • Albedo map ($\mathbf{a}$): The base color of a surface, independent of lighting. It represents how much light a surface reflects diffusely.
      • Normal map ($\mathbf{n}$): A map storing the direction of the surface normal at each point, indicating the surface's orientation. This defines the local geometry.
      • Roughness map ($\mathbf{r}$): A map indicating the micro-surface detail that affects how light scatters. Rougher surfaces scatter light more diffusely, while smoother surfaces reflect it more specularly.
      • Metallicity map ($\mathbf{m}$): A map indicating how metallic a surface is. Metallic surfaces behave differently from non-metallic (dielectric) ones in terms of light reflection and absorption.
      • Irradiance map ($\mathbf{i}$): A representation of the lighting conditions, often an environmental map or a map encoding the incident light at each point. This captures the illumination of the scene.
    • PBR (Physically Based Rendering): A rendering approach that aims to simulate light transport based on real-world physics. Albedo, roughness, and metallicity are key parameters in PBR materials, particularly in the Principled Bidirectional Reflectance Distribution Function (BRDF) [Burley and Studios 2012], which describes how light is reflected from an opaque surface.
  • Diffusion Models:

    • Conceptual Definition: A class of generative models that learn to generate data by reversing a gradual noising process. They are inspired by non-equilibrium thermodynamics.
    • Forward Process (Diffusion): An input image is gradually corrupted by adding small amounts of Gaussian noise over several steps, eventually transforming the image into pure Gaussian noise. This process is defined mathematically as a Markov chain.
    • Reverse Process (Denoising): A neural network is trained to learn how to reverse this noising process, i.e., to predict and subtract the noise added at each step, starting from pure noise and gradually recovering a clean image. This effectively learns the data distribution.
    • DDPM (Denoising Diffusion Probabilistic Models) [Ho et al. 2020]: A seminal work that established the basic framework for diffusion models, demonstrating high-quality image generation.
    • Latent Diffusion Models (LDMs) [Rombach et al. 2022]: An advancement where the diffusion process operates in a lower-dimensional latent space rather than directly on pixel space. This makes training and inference much more efficient without significantly compromising quality. Stable Diffusion is a prominent example of an LDM. The input image $I$ is first encoded into a latent representation $\mathbf{x}_0 = \mathcal{E}(I)$ by an encoder $\mathcal{E}$. The diffusion process then happens in this latent space.
    • Rectified Flow Matching [Lipman et al. 2022]: A technique used in diffusion models to directly learn the "velocity field" that transforms noise to data (or vice versa) along a straight path in latent space, simplifying the training objective compared to traditional DDPM. It directly estimates the difference between the noise $\epsilon$ and the original latent $z_0$ given a noisy latent $z_t$.
    • DiT (Diffusion Transformers) [Peebles and Xie 2023]: A type of neural network architecture, based on the Transformer architecture, used within diffusion models (often as the denoiser network). Transformers are good at capturing long-range dependencies in data, making them suitable for generating high-quality images.
    • Conditioning: Diffusion models can be conditioned on various inputs (e.g., text, images, class labels) to guide the generation process. This allows for controlled image synthesis (e.g., generating an image "of a cat" from a text prompt).
  • Attention Mechanism:

    • Conceptual Definition: A mechanism in neural networks that allows the model to dynamically weigh the importance of different parts of the input data when processing it. It helps the model focus on relevant information.
    • Self-Attention: A variant where the attention mechanism is applied to a single sequence to relate different positions of the sequence to each other.
    • Cross-Attention: A variant where attention is computed between two different sequences (e.g., query from one sequence, key/value from another). This is used in MAA to relate map embeddings to patch tokens.
    • Formula for Scaled Dot-Product Attention: The fundamental attention mechanism, from which cross-attention is derived, is often defined as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
      • $Q$: Query matrix. Represents what we are looking for.
      • $K$: Key matrix. Represents what is available to be looked up.
      • $V$: Value matrix. Contains the actual information to be retrieved.
      • $d_k$: The dimension of the keys, used for scaling to prevent very large dot products that push the softmax into regions with extremely small gradients.
      • $QK^T$: Dot product of Queries and Keys, measuring similarity.
      • $\mathrm{softmax}(\cdot)$: Normalizes the similarity scores into probability distributions.
      • The output is a weighted sum of the Value vectors, where weights are determined by the attention scores.
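
The attention formula above can be made concrete with a short sketch. This is a minimal NumPy implementation of scaled dot-product attention; the function name and toy shapes are illustrative and not taken from the paper.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q: (n_q, d_k) queries, K: (n_kv, d_k) keys, V: (n_kv, d_v) values.
    Returns an (n_q, d_v) matrix: each query gets a weighted sum of the values.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                       # query-key similarity
    scores -= scores.max(axis=-1, keepdims=True)          # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # softmax over keys
    return weights @ V

# Toy usage: 4 queries attending over 6 key/value tokens of dimension 8.
Q, K, V = np.random.randn(4, 8), np.random.randn(6, 8), np.random.randn(6, 8)
out = scaled_dot_product_attention(Q, K, V)               # shape (4, 8)
```

In MAA's cross-attention, the queries come from one source (a map embedding or semantic embeddings) while the keys and values come from another (patch tokens), but the computation is the same.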

3.2. Previous Works

The paper contextualizes its work by discussing advances in generative models and inverse rendering, highlighting the gap it aims to fill.

  • Generative Models:

    • VAE (Variational Autoencoders) [Kingma et al. 2013]: Leverage variational inference for probabilistic modeling, enabling explicit likelihood estimation. Often produce blurry samples compared to GANs.
    • GAN (Generative Adversarial Networks) [Goodfellow et al. 2014]: Employ adversarial training (a generator and a discriminator network competing) to produce sharp, high-fidelity outputs. Suffer from potential mode collapse (generator produces limited variety of samples) and unstable optimization.
    • Diffusion Models [Esser et al. 2024; Ho et al. 2020; Ramesh et al. 2022; Rombach et al. 2022]: The most recent and powerful class of generative models. They have demonstrated superior capability in generating high-fidelity, diverse, and text-aligned images. Key examples mentioned: DDPM [Ho et al. 2020], Stable Diffusion [Esser et al. 2024; Rombach et al. 2022], and DALL·E [Ramesh et al. 2022]. These models use architectures like UNet [Ronneberger et al. 2015] and DiT [Peebles and Xie 2023] for the denoising process. The paper leverages Stable Diffusion 3.5 as its base model.
  • Inverse Rendering (Intrinsic Image Decomposition):

    • Traditional Learning-based Methods [Janner et al. 2017; Yu and Smith 2019; Zhu et al. 2022]: These methods, often trained on large-scale datasets like OpenRooms [Li et al. 2020], Hypersim [Roberts et al. 2021], and InteriorVerse [Zhu et al. 2022], moved beyond simplified Lambertian reflectance assumptions. They can predict PBR (Physically Based Rendering) material properties, geometric structures, and illumination parameters.
  • Forward and Inverse Rendering using Diffusion Models:

    • The emergence of latent diffusion models has enabled a new paradigm for learning the joint probability distribution between images and their intrinsic maps.
    • IID (Intrinsic Image Diffusion) [Kocsis et al. 2024]: One of the first works to train a latent diffusion model for estimating albedo, roughness, and metallicity for indoor scenes.
    • RGB→X [Zeng et al. 2024]: Proposes a framework using diffusion models for both FR and IR, also primarily for indoor scenes.
    • DiffusionRenderer [Liang et al. 2025]: Fine-tunes Stable Video Diffusion to achieve temporally consistent intrinsic map estimation.
    • GeoWizard [Fu et al. 2024]: Focuses on 3D geometry estimation from a single image, trained across indoor and outdoor scenes.
    • IDArb [Li et al. 2024]: Deals with intrinsic decomposition for arbitrary numbers of input views and illuminations, but is limited to the object level.
    • Common Limitation: The paper highlights that these prevailing diffusion-based methods are predominantly designed for indoor scenes or object-level tasks. They struggle with the dynamic and intricate illumination conditions and expanded scene dimensions characteristic of AD environments. Crucially, they suffer substantial performance degradation in the presence of diverse weather conditions (e.g., rain, snow, fog), as illustrated in Figure 2.

3.3. Technological Evolution

The evolution of rendering and scene understanding has progressed through several stages:

  1. Traditional Graphics (Pre-Deep Learning): Focused on physically accurate simulations (ray tracing, rasterization) requiring explicit 3D models, materials, and light sources. Inverse problems were largely analytical or optimization-based, often making simplifying assumptions.
  2. Learning-based Inverse Rendering (Early Deep Learning): Used Convolutional Neural Networks (CNNs) to learn mappings from images to intrinsic maps. These models learned from large synthetic datasets, overcoming some limitations of explicit modeling but still often relied on simplified physics or limited domain generalization.
  3. Generative Models for Image Synthesis: VAEs and GANs emerged, capable of generating photo-realistic images. While powerful for synthesis, their application to inverse problems was less direct, and they sometimes struggled with control and diversity.
  4. Diffusion Models Revolution (Current State): Diffusion models (DDPM, Latent Diffusion) offered unprecedented image quality, diversity, and controllable generation through conditioning. This led to their application in inverse problems, learning the joint distribution of images and their underlying properties.
  5. Diffusion for Rendering (Specific Application): Recent works (IID, RGB→X) adapted diffusion models for intrinsic image decomposition and neural rendering. However, these largely focused on constrained environments (indoor, object-level).
  6. WeatherDiffusion's Position: This paper represents a crucial step in extending diffusion-based rendering to challenging real-world autonomous driving scenarios with complex and diverse weather conditions, a domain previously underserved by high-quality, robust solutions. It builds upon Latent Diffusion Models (specifically Stable Diffusion 3.5) and introduces AD-specific priors (datasets, MAA) to handle the unique scale, dynamics, and atmospheric effects of outdoor driving scenes.

3.4. Differentiation Analysis

Compared to the main methods in related work, WeatherDiffusion introduces several core differences and innovations:

  • Target Domain and Robustness:

    • Related Work: IID, RGB→X, and DiffusionRenderer are primarily designed for indoor scenes or object-level tasks. GeoWizard extends to some outdoor scenes but doesn't explicitly focus on diverse weather conditions. IDArb is object-level. None specifically address the unique complexities of large-scale AD environments under dynamic weather.
    • WeatherDiffusion: Explicitly targets large-scale, multi-weather autonomous driving scenarios. It is built to be robust to rain, snow, fog, and varying illumination, which cause significant performance degradation in existing systems (as highlighted in Figure 2).
  • Specialized Data:

    • Related Work: Rely on existing indoor datasets (OpenRooms, Hypersim, InteriorVerse) or general outdoor datasets (MatrixCity) that lack comprehensive weather diversity for rendering tasks.
    • WeatherDiffusion: Introduces WeatherSynthetic and WeatherReal, two purpose-built datasets that provide extensive weather and lighting variations for AD scenes, along with intrinsic maps, addressing a critical data gap.
  • Architectural Enhancements for Control and Accuracy:

    • Related Work: While using diffusion, they lack specific mechanisms to guide the model's attention or conditioning for complex, weather-affected intrinsic decomposition in AD.
    • WeatherDiffusion: Proposes Intrinsic map-aware attention (MAA). This mechanism provides customized visual detail guidance by enabling the model to focus on specific local regions relevant to each intrinsic map (e.g., metallic objects for metallicity), which is crucial for accurate decomposition in complex scenes. It effectively replaces generic text guidance with semantic visual priors.
    • Weather-Guided Conditioning: Integrates a weather controller (one-hot encoded weather categories) into the diffusion process, allowing the model to explicitly learn and distinguish different weather conditions, which helps resolve ambiguities (e.g., low visibility from fog vs. shadows).
  • Base Model Adaptation:

    • Related Work: Use various diffusion models, but not necessarily the latest or specifically adapted for the unique AD latent space challenges.

    • WeatherDiffusion: Finetunes Stable Diffusion 3.5 medium, noting its redesigned 16-channel latent space is beneficial for handling larger view ranges and complex scale variations in outdoor scenes, a key advantage for AD.

      In summary, WeatherDiffusion differentiates itself by specifically tailoring a diffusion-based FR/IR framework for the autonomous driving domain, addressing its unique challenges through purpose-built datasets, weather-aware conditioning, and map-specific attention mechanisms, leading to superior performance and practical utility for AD downstream tasks.

4. Methodology

4.1. Principles

The core principle of WeatherDiffusion is to leverage the powerful generative capabilities of latent diffusion models, specifically Stable Diffusion 3.5, and adapt them for robust and controllable forward rendering (FR) and inverse rendering (IR) in autonomous driving (AD) scenes under various weather and lighting conditions. The method is built on two main ideas:

  1. Weather-Guided Diffusion: Enhancing the diffusion model's conditioning mechanism to explicitly account for and learn from diverse weather conditions, enabling accurate intrinsic map decomposition and realistic image synthesis. This involves categorizing weather and integrating it into the latent space modulation.

  2. Intrinsic Map-Aware Attention (MAA): Introducing a novel attention mechanism that guides the diffusion model to focus on semantically relevant regions of an image when predicting specific intrinsic maps. This addresses the challenge of large-scale AD scenes where different properties (e.g., metallicity, normal) require attention to distinct visual cues.

    The framework assumes that while weather phenomena like raindrops or fog are complex, their primary effect can be captured within the irradiance component, leaving material properties (like albedo) largely unaffected. This simplification allows for effective decomposition.

4.2. Core Methodology In-depth (Layer by Layer)

The WeatherDiffusion framework involves finetuning two separate Stable Diffusion 3.5 models: one for Inverse Rendering (IR) and another for Forward Rendering (FR). Before detailing these, let's briefly revisit the general diffusion model process as described in the paper, which forms the foundation.

4.2.1. Basic Diffusion Model Process

The paper utilizes a latent diffusion model setup. Given an image $I \in \mathbb{R}^{H \times W \times 3}$ (for FR) or an intrinsic map $\mathbf{y} \in \mathbb{R}^{H \times W \times C}$ (for IR, where $C$ is the number of channels of the map), a pre-trained encoder maps these from pixel space to a lower-dimensional latent space.

  • Latent Space Encoding: The original image $I$ and intrinsic map $\mathbf{y}$ are encoded into their respective latent representations: $ x_0 = \mathcal{E}(I), \quad z_0 = \mathcal{E}(\mathbf{y}). $

    • $x_0$: The latent representation of the input image $I$.
    • $z_0$: The latent representation of the intrinsic map $\mathbf{y}$.
    • $\mathcal{E}(\cdot)$: The pre-trained encoder network that maps high-dimensional pixel data to a lower-dimensional latent space. The paper mentions SD 3.5 redesigns its latent space to have 16 channels, which is beneficial for outdoor scenes.
  • Noising Process (Rectified Flow Matching): Following Rectified Flow Matching [Lipman et al. 2022], random Gaussian noise is added to the latent component $z_0$ to create a noisy latent $z_t$ at timestep $t$: $ z_t = (1 - t) z_0 + t \epsilon, \quad \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), $

    • $z_t$: The noisy latent variable at timestep $t$.
    • $z_0$: The original clean latent variable of the intrinsic map (in IR) or image (in FR).
    • $t$: A continuous denoising timestep, typically ranging from 0 to 1. $t=0$ means no noise, $t=1$ means pure noise.
    • $\epsilon$: Random noise sampled from a standard Gaussian distribution $\mathcal{N}(\mathbf{0}, \mathbf{I})$, where $\mathbf{0}$ is the zero vector and $\mathbf{I}$ is the identity matrix (representing unit variance). This means the noise has a mean of zero and a variance of one.
  • Velocity Field Estimation: A neural network, specifically a DiT (Diffusion Transformer) [Peebles and Xie 2023], is used to estimate the "velocity field" at a given timestep. This velocity field represents the direction and magnitude of change needed to transition from the noisy latent $z_t$ back to the original clean latent $z_0$. The estimation can be expressed as: $ v_\theta(z_t, c, t) = \frac{\mathrm{d} z_t}{\mathrm{d} t} = \epsilon - z_0, $

    • $v_\theta(\cdot)$: The predicted velocity field by the DiT model, parameterized by $\theta$.
    • $z_t$: The noisy latent variable input to the DiT.
    • $c$: The vision or text condition provided to guide the DiT model.
    • $t$: The current denoising timestep.
    • The model learns to predict the difference between the random noise $\epsilon$ and the original latent $z_0$.
  • Loss Function: The DiT model is trained by minimizing the following loss function, which aims to make the predicted velocity field $v_\theta$ match the true difference $(\epsilon - z_0)$: $ L_\theta = \mathbb{E}_{t \sim \mathcal{U}(0,1), \epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})} \left[ \lVert v_\theta(z_t, c, t) - (\epsilon - z_0) \rVert_2^2 \right]. $

    • $\mathbb{E}[\cdot]$: Expectation operator.
    • $t \sim \mathcal{U}(0,1)$: Timestep $t$ is sampled uniformly between 0 and 1.
    • $\epsilon \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$: Noise $\epsilon$ is sampled from a standard Gaussian distribution.
    • $\lVert \cdot \rVert_2^2$: The squared $L_2$ norm (Euclidean distance), which measures the difference between the predicted velocity and the target velocity.
    • Minimizing this loss allows the model to accurately estimate the velocity field. Once trained, this estimated velocity field is used in the reverse process to progressively denoise $z_t$ back to $z_0$.
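
Putting the noising step and the loss together, a single rectified-flow training step can be sketched as follows. This is an illustrative PyTorch snippet, not the authors' code; `dit_model` stands in for any velocity-predicting network that accepts a noisy latent, a condition, and a timestep.

```python
import torch

def rectified_flow_training_step(dit_model, z0, cond):
    """One training step of rectified flow matching (see the loss above).

    z0:   clean latent of shape (B, C, H, W), e.g. the encoded intrinsic map.
    cond: whatever conditioning the denoiser expects (text/visual/weather).
    """
    b = z0.shape[0]
    t = torch.rand(b, device=z0.device)                 # t ~ U(0, 1)
    eps = torch.randn_like(z0)                          # eps ~ N(0, I)
    t_ = t.view(b, 1, 1, 1)
    z_t = (1.0 - t_) * z0 + t_ * eps                    # straight-line noising
    v_pred = dit_model(z_t, cond, t)                    # predicted velocity field
    loss = ((v_pred - (eps - z0)) ** 2).mean()          # match (eps - z0)
    return loss
```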

4.2.2. Inverse Rendering (IR)

The IR diffusion model is finetuned from SD 3.5 to decompose an input image $I$ into a set of intrinsic maps: albedo ($\mathbf{a}$), normal ($\mathbf{n}$), roughness ($\mathbf{r}$), metallicity ($\mathbf{m}$), and irradiance ($\mathbf{i}$) (as defined in Section 3.1). The key challenge here is that SD 3.5 might struggle with low-visibility weather conditions, potentially confusing them with shadowed regions. To address this, the authors introduce a weather controller.

  • Weather Controller:

    • Categorization: Weather conditions are grouped into nine distinct classes based on visual similarities in lighting and particle types (e.g., "sunny", "rainy/thunderstorm", "snow", "foggy").
    • Encoding: These categories are encoded as one-hot vectors. A one-hot vector is a binary vector where a single bit is "hot" (1) and all others are "cold" (0), representing a specific category. For instance, if there are 9 categories, a rainy day might be [0, 1, 0, 0, 0, 0, 0, 0, 0].
    • Positional Encoding: The one-hot weather controller ($r_i$) is first transformed using positional encoding. This technique, commonly used in transformers, adds information about the position of an element in a sequence. Here, it likely enriches the categorical weather information, perhaps allowing the model to understand the "degree" or "type" of weather more robustly.
    • Conditional Input Generation: The positional-encoded weather controller is then combined with the denoising timestep ($t$) and a text projection ($\tau(c)$). The text projection is derived from a text prompt $c$ (e.g., indicating the intrinsic map to decompose, or general scene context). These three components are summed together to form a comprehensive diffusion condition. This combined condition is then fed through a Multi-Layer Perceptron (MLP) to predict scale ($\alpha$) and shift ($\beta$) parameters: $ \{\alpha, \beta\} = \mathrm{MLP}(f_{\mathrm{weather}}(r_i) + f_{\mathrm{time}}(t) + f_{\mathrm{text}}(\tau(c))), $
      • $\alpha, \beta$: Scale and shift parameters predicted by the MLP.
      • $\mathrm{MLP}(\cdot)$: A Multi-Layer Perceptron, a type of feedforward neural network.
      • $f_{\mathrm{weather}}(r_i)$: Function representing the positional-encoded weather controller.
      • $f_{\mathrm{time}}(t)$: Function representing the denoising timestep.
      • $f_{\mathrm{text}}(\tau(c))$: Function representing the projection of the text prompt $c$.
      • The sum of these three components forms the core conditional input.
  • LayerNorm Modulation: The predicted $\alpha$ and $\beta$ parameters are then used to modulate the input features $h$ of the DiT model, specifically within its Layer Normalization (LN) layers. This technique, similar to adaptive instance normalization (AdaIN) or conditional normalization, allows the model to dynamically adjust its internal representations based on the weather, timestep, and text conditions: $ h_{\mathrm{norm}} = \mathrm{LN}(h) \odot (1 + \alpha) + \beta, $

    • $h$: The input features to be normalized and modulated within the DiT network.
    • $\mathrm{LN}(h)$: Layer Normalization applied to $h$. Layer Normalization normalizes the activations across the features for each sample independently, making training more stable.
    • $\odot$: Element-wise product (Hadamard product).
    • $h_{\mathrm{norm}}$: The normalized and modulated features, which are then passed through the rest of the DiT network. This modulation ensures that the model's processing of features is aware of the specific weather conditions, helping it to distinguish subtle visual cues that might otherwise be ambiguous (e.g., separating rain textures from shadows).
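
A compact sketch of this conditional modulation is given below. It is an assumption-laden PyTorch illustration: the module name, embedding layers, and their dimensions are placeholders, and the paper's actual positional encoding and text projection may differ.

```python
import torch
import torch.nn as nn

class WeatherAdaLN(nn.Module):
    """Weather/timestep/text-conditioned LayerNorm modulation (illustrative only)."""
    def __init__(self, dim, n_weather=9):
        super().__init__()
        self.weather_embed = nn.Linear(n_weather, dim)   # stand-in for positional encoding of r_i
        self.time_embed = nn.Linear(1, dim)              # stand-in for timestep embedding f_time(t)
        self.text_embed = nn.Linear(dim, dim)            # stand-in for the text projection tau(c)
        self.mlp = nn.Sequential(nn.SiLU(), nn.Linear(dim, 2 * dim))
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)

    def forward(self, h, weather_onehot, t, text_feat):
        # h: (B, N, dim) token features; weather_onehot: (B, n_weather); t: (B,); text_feat: (B, dim)
        cond = (self.weather_embed(weather_onehot)
                + self.time_embed(t.unsqueeze(-1))
                + self.text_embed(text_feat))
        alpha, beta = self.mlp(cond).chunk(2, dim=-1)    # predicted scale and shift
        return self.norm(h) * (1 + alpha.unsqueeze(1)) + beta.unsqueeze(1)
```

With $h$ of shape (batch, tokens, dim), the predicted scale and shift are broadcast over the token dimension, mirroring $h_{\mathrm{norm}} = \mathrm{LN}(h) \odot (1 + \alpha) + \beta$.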

4.2.3. Forward Rendering (FR)

The FR diffusion model is also finetuned from SD 3.5. Its task is to synthesize an image II given a set of decomposed intrinsic maps and a text prompt specifying the target weather condition.

  • Robustness to Absent Intrinsic Maps: To improve the model's generalization and robustness, especially when some intrinsic maps might not be available or are unreliable, the authors employ a random dropping strategy during training. A subset of the intrinsic maps is randomly replaced with zero matrices with a certain probability $p$: $ M = \{\hat{z}_a, \hat{z}_n, \hat{z}_r, \hat{z}_m, \hat{z}_i\}, \quad \hat{z}_i = \begin{cases} \mathbf{0}, & \text{w.p. } p \\ z_i, & \text{w.p. } 1-p \end{cases} $
    • $M$: The set of intrinsic maps provided as input to the FR model.
    • $\hat{z}_a, \hat{z}_n, \hat{z}_r, \hat{z}_m, \hat{z}_i$: The latent representations of the albedo, normal, roughness, metallicity, and irradiance maps, respectively, possibly modified by the dropping strategy.
    • $z_i$: The original latent representation of the $i$-th intrinsic map (calculated via encoder $\mathcal{E}$).
    • $\mathbf{0}$: A zero matrix, used to replace a dropped map.
    • $\text{w.p. } p$: "with probability $p$."
    • This strategy forces the FR model to learn to generate images even from incomplete sets of intrinsic maps, making it more flexible and robust in real-world applications where complete and perfect intrinsic maps might not always be available.
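
A minimal sketch of this dropping strategy is shown below; the drop probability is illustrative, as the paper's value of $p$ is not restated here.

```python
import torch

def randomly_drop_maps(latents, p=0.3):
    """Independently replace each intrinsic-map latent with zeros with probability p.

    latents: dict mapping map names to latent tensors, e.g.
             {"albedo": z_a, "normal": z_n, "roughness": z_r,
              "metallicity": z_m, "irradiance": z_i}
    """
    return {name: torch.zeros_like(z) if torch.rand(()).item() < p else z
            for name, z in latents.items()}
```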

4.2.4. Intrinsic Map-Aware Attention (MAA)

The Intrinsic Map-Aware Attention (MAA) mechanism is designed to provide detailed visual guidance for the IR diffusion model, replacing or augmenting generic text guidance for improved decomposition quality. It is motivated by the observation that different intrinsic maps require attention to specific, distinct regions of an image (e.g., metallicity for cars, normal for road surfaces).

  • Patch Token Extraction: First, the input image is processed by DINOv2 [Oquab et al. 2023], a powerful self-supervised vision transformer. DINOv2 extracts a set of patch tokens ($\mathbf{p}$) from the image. Each patch token is a feature vector representing a small spatial region (patch) of the image, encoding its visual characteristics. The paper notes that DINOv2 tokens exhibit strong intra-class consistency, meaning patches belonging to the same material or structure have similar representations, which is useful for semantic guidance.

  • Map-Specific Learnable Embeddings: For each type of intrinsic map (albedo, normal, roughness, metallicity, irradiance), the model maintains a learnable embedding ($\mathbf{d} \in \mathbb{R}^{D_{\mathrm{model}}}$). These embeddings conceptually capture the inherent characteristics or "essence" of what that specific intrinsic map represents.

  • Gating Mechanism with Map-Aware Cross-Attention: A gating mechanism, implemented as a cross-attention layer, is applied. This mechanism filters the DINOv2 patch tokens ($\mathbf{p}$) based on the specific learnable embedding ($\mathbf{d}$) of the target intrinsic map being predicted. This generates map-aware patch tokens ($\mathbf{p}'$) that emphasize regions relevant to that target map: $ p' = \mathrm{gating}(p, d) = \mathrm{Softmax}\left(\frac{(d W_Q)(p W_K)^T}{\sqrt{d_k}}\right) p W_V, $

    • $p'$: The map-aware patch tokens. These are the output of the cross-attention, selectively highlighting features relevant to the specific intrinsic map.
    • $p$: The original DINOv2 patch tokens extracted from the image.
    • $d$: The learnable embedding corresponding to the target intrinsic map. This acts as the Query in this cross-attention.
    • $W_Q, W_K, W_V$: Learnable linear projection matrices for Query, Key, and Value respectively. These transform the input embeddings into suitable spaces for attention calculation. In this context, $d$ is projected by $W_Q$, and $p$ by $W_K$ and $W_V$.
    • $d_k$: The dimension of the Key vectors, used for scaling the dot product in the softmax argument.
    • $\mathrm{Softmax}(\cdot)$: Normalizes the attention scores.
    • The term $(d W_Q)(p W_K)^T$ computes the similarity between the map embedding (query) and each patch token (key). This similarity is then used to weight the Value projections of the patch tokens $(p W_V)$, effectively allowing the map embedding to "attend" to the most relevant image patches. Figure 4 visually demonstrates these attention heatmaps.
  • Aggregated Visual Condition: Instead of directly using the map-aware patch tokens ($\mathbf{p}'$), a set of learnable semantic embeddings ($\mathbf{c}$) (corresponding to broader semantic categories) is introduced.

    • Modulation: These semantic embeddings ($\mathbf{c}$) are first modulated by classification logits from DINOv2. This likely helps to align these embeddings with real-world object categories detected by DINOv2.
    • Fusion via Cross-Attention: The modulated semantic embeddings ($\mathbf{c}$) are then fused with the map-aware patch tokens ($\mathbf{p}'$) using another cross-attention mechanism. This step combines the high-level semantic understanding with the map-specific visual details: $ c' = \alpha \cdot \mathrm{Softmax}\left(\frac{(c W_Q)(p' W_K)^T}{\sqrt{d_k}}\right) p' W_V + c, $
      • $c'$: The final visual condition that is input to the DiT model for IR.
      • $\alpha$: A learnable coefficient that scales the attention output.
      • $c$: The learnable semantic embeddings (modulated by DINOv2 logits). This acts as the Query in this second cross-attention.
      • $p'$: The map-aware patch tokens from the previous step. These act as Keys and Values.
      • $W_Q, W_K, W_V$: Again, learnable projection matrices.
      • $d_k$: Dimension of keys for scaling.
      • The output of this cross-attention is added back to the original semantic embeddings ($\mathbf{c}$) to form the final, rich visual condition $c'$. This condition effectively captures both general semantic information and specific visual cues relevant to the intrinsic map being decomposed, guiding the DiT model to produce high-quality results.

        This intricate interplay of weather conditioning, latent space operations, and map-aware visual guidance enables WeatherDiffusion to robustly perform forward and inverse rendering in complex AD scenes.
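
The two cross-attention steps of MAA can be sketched compactly as follows. This is an interpretation for illustration only: single-head attention, placeholder module names, an assumed number of semantic embeddings, and a per-token gating reading of the first equation (the attention weights over patches re-weight the value-projected patch tokens). The authors' implementation may differ.

```python
import torch
import torch.nn as nn

class MapAwareGating(nn.Module):
    """Gating step: a per-map learnable embedding d queries DINOv2 patch tokens p."""
    def __init__(self, dim):
        super().__init__()
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, p, d):
        # p: (B, N, dim) patch tokens; d: (B, 1, dim) embedding of the target intrinsic map.
        q, k, v = self.W_q(d), self.W_k(p), self.W_v(p)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        # Re-weight each value-projected patch token by its attention weight -> (B, N, dim).
        return attn.transpose(-2, -1) * v

class AggregatedVisualCondition(nn.Module):
    """Fusion step: learnable semantic embeddings c attend over the map-aware tokens p'."""
    def __init__(self, dim, n_semantic=16):          # n_semantic is an assumed value
        super().__init__()
        self.c = nn.Parameter(torch.randn(n_semantic, dim) * 0.02)
        self.alpha = nn.Parameter(torch.zeros(1))    # learnable scale on the attention output
        self.W_q = nn.Linear(dim, dim, bias=False)
        self.W_k = nn.Linear(dim, dim, bias=False)
        self.W_v = nn.Linear(dim, dim, bias=False)

    def forward(self, p_prime):
        B = p_prime.shape[0]
        c = self.c.unsqueeze(0).expand(B, -1, -1)    # (B, n_semantic, dim)
        q, k, v = self.W_q(c), self.W_k(p_prime), self.W_v(p_prime)
        attn = torch.softmax(q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5, dim=-1)
        return self.alpha * (attn @ v) + c           # residual: c' = alpha * Attn + c
```

The resulting condition $c'$ would then be supplied to the DiT as the visual condition in place of generic text guidance.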

5. Experimental Setup

5.1. Datasets

The paper highlights that existing datasets are insufficient for large-scale AD scenes under complex weather conditions. Therefore, they introduce two novel datasets: WeatherSynthetic and WeatherReal.

  • WeatherSynthetic:

    • Source: Synthetic data.

    • Scale: 35K images.

    • Characteristics: Encompasses a wide range of scene types and weather conditions.

      • Weather Types: sunny, overcast, rainy, thunderstorm, snowy, foggy, sandstorm.
      • Times of Day: early morning, morning, noon, afternoon, night.
      • Environments: urban, suburban, highway, parking.
    • Generation: Rendered using Unreal Engine 5. Commercial 3D assets whose licenses permit use with generative models were purchased from Fab. The rendering pipeline uses the Movie Render Queue and multi-sample anti-aliasing for high quality. UltraDynamicSky and UltraDynamicWeather assets were used to control weather and time of day. All images are in linear space without tone mapping.

    • Purpose: To provide a large-scale, diverse dataset with ground truth intrinsic maps (albedo, normal, roughness, metallicity, irradiance) for training and evaluation.

    • Example Data Sample: Figure 5 shows example scenes from WeatherSynthetic, demonstrating diverse lighting and weather conditions across urban environments. For instance, one row displays a street scene under sunny, foggy, snowy, and night conditions.

      Fig. 5. Example of our WeatherSynthetic. Each row shows a scene rendered under diverse lighting and weather conditions. (The figure shows scenes from the WeatherSynthetic dataset rendered under different weather conditions; each row depicts an urban street scene under sunny, foggy, snowy, and night conditions, exhibiting complex lighting and weather effects.)

      As shown in Figure 5, each row displays a scene rendered under diverse lighting and weather conditions, showcasing urban environments.

  • WeatherReal:

    • Source: Real-world data, derived from open-source AD datasets.

    • Scale: 10K images.

    • Characteristics: Designed for IR on AD scenes with various weather conditions.

    • Generation:

      1. Started with datasets like Waymo [Sun et al. 2020] and Kitti [Geiger et al. 2013], which were originally collected under sunny conditions.
      2. Used the model trained on WeatherSynthetic to generate pseudo ground truth intrinsic maps for these real-world images.
      3. Applied weather augmentation using a combination of generative models and image processing:
        • InstructPix2Pix [Brooks et al. 2023] (a pre-trained image editing model) was used to alter surfaces (e.g., converting dry roads to wet, adding snow to objects).
        • Provided depth maps (from original datasets) were used to synthesize realistic fog effects.
        • Randomized noise patterns were used to simulate falling snowflakes or raindrops.
    • Purpose: To bridge the domain gap between synthetic and real-world data and address the severe degradation caused by extreme real-world weather (e.g., heavy fog, torrential rain) that synthetic data alone might not fully capture.
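
The depth-based fog augmentation described above is commonly implemented with the standard atmospheric scattering model; the paper does not give its exact formulation, so the snippet below is only a plausible sketch with illustrative parameter values.

```python
import numpy as np

def add_fog(image, depth, beta=0.05, airlight=0.8):
    """Synthesize fog with the atmospheric scattering model:
    I_fog = I * t + A * (1 - t), with transmission t = exp(-beta * depth).

    image: (H, W, 3) array in [0, 1]; depth: (H, W) depth in meters.
    beta (fog density) and airlight A are illustrative, not values from the paper.
    """
    t = np.exp(-beta * depth)[..., None]     # per-pixel transmission from depth
    return image * t + airlight * (1.0 - t)
```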

      The following table summarizes the datasets discussed in the paper, including the two newly introduced ones ([Table 1] of the original paper):

| Dataset | Images | Scene | Source | Weather | Albedo | Normal | Roughness | Metallicity | Irradiance |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| InteriorVerse | 50K | Indoor | Synthetic | | | | | | |
| Hypersim | 70K | Indoor | Synthetic | | | | | | |
| OpenRooms | 118K | Indoor | Synthetic | | | | | | |
| MatrixCity | 316K | City | Synthetic | | | | | | |
| WeatherSynthetic | 35K | City | Synthetic | ✓ | ✓ | ✓ | ✓ | ✓ | ✓ |
| WeatherReal | 10K | City | Real-world | ✓ (augmented) | pseudo-GT | pseudo-GT | pseudo-GT | pseudo-GT | pseudo-GT |

Note: The per-dataset markers for weather and individual intrinsic maps in the original table did not extract cleanly for the earlier datasets and are left blank here. For WeatherSynthetic, all five intrinsic maps (albedo, normal, roughness, metallicity, irradiance) are provided and diverse weather is covered; for WeatherReal, weather is obtained through augmentation and the intrinsic maps are pseudo ground truths.

5.2. Evaluation Metrics

The paper uses several standard metrics to evaluate the quality of inverse rendering results. These metrics assess different aspects of image similarity and reconstruction accuracy.

  1. PSNR (Peak Signal-to-Noise Ratio):

    • Conceptual Definition: A widely used metric to quantify the quality of reconstruction of lossy compression codecs or image processing. It measures the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR values generally indicate better quality (less noise/distortion).
    • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{MAX_I^2}{\mathrm{MSE}} \right) $ where $\mathrm{MSE}$ is the Mean Squared Error: $ \mathrm{MSE} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} [I(i,j) - K(i,j)]^2 $
    • Symbol Explanation:
      • $MAX_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit grayscale image, or for each color channel in an RGB image).
      • $\mathrm{MSE}$: Mean Squared Error.
      • $W$: Width of the image.
      • $H$: Height of the image.
      • $I(i,j)$: The pixel value of the ground truth (original) image at row $i$ and column $j$.
      • $K(i,j)$: The pixel value of the predicted (reconstructed) image at row $i$ and column $j$.
  2. SSIM (Structural Similarity Index Measure):

    • Conceptual Definition: A metric designed to measure the perceived structural similarity between two images, considering luminance, contrast, and structure. It is often believed to correlate better with human visual perception than PSNR. Values range from -1 to 1, with 1 indicating perfect structural similarity. Higher values are better.
    • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • $x$: Pixel values of the first image (or a local window of it).
      • $y$: Pixel values of the second image (or a local window of it).
      • $\mu_x$: The average of $x$.
      • $\mu_y$: The average of $y$.
      • $\sigma_x^2$: The variance of $x$.
      • $\sigma_y^2$: The variance of $y$.
      • $\sigma_{xy}$: The covariance of $x$ and $y$.
      • $c_1 = (K_1 L)^2$: A small constant to avoid division by zero when $\mu_x^2 + \mu_y^2$ is very close to zero.
      • $c_2 = (K_2 L)^2$: A small constant to avoid division by zero when $\sigma_x^2 + \sigma_y^2$ is very close to zero.
      • $L$: The dynamic range of the pixel values (e.g., 255 for 8-bit images).
      • $K_1, K_2$: Small constants (typically $K_1 = 0.01$, $K_2 = 0.03$).
  3. MAE (Mean Angular Error):

    • Conceptual Definition: Primarily used for evaluating the accuracy of predicted normal maps. It measures the average angular difference (in degrees or radians) between the ground truth normal vectors and the predicted normal vectors at each pixel. Lower MAE values indicate more accurate normal predictions.
    • Mathematical Formula: $ \mathrm{MAE} = \frac{1}{WH} \sum_{i=1}^{W} \sum_{j=1}^{H} \arccos\big(\mathbf{n}_{GT}(i,j) \cdot \mathbf{n}_{Pred}(i,j)\big) $
    • Symbol Explanation:
      • $W$: Width of the image.
      • $H$: Height of the image.
      • $\arccos(\cdot)$: The inverse cosine function, which returns the angle whose cosine is the argument.
      • $\mathbf{n}_{GT}(i,j)$: The 3D unit normal vector of the ground truth normal map at pixel $(i,j)$.
      • $\mathbf{n}_{Pred}(i,j)$: The 3D unit normal vector of the predicted normal map at pixel $(i,j)$.
      • $\cdot$: The dot product operator for vectors. The dot product of two unit vectors equals the cosine of the angle between them.
  4. LPIPS (Learned Perceptual Image Patch Similarity):

    • Conceptual Definition: A perceptual metric that aims to quantify the difference between two images in a way that aligns better with human perception of similarity than traditional pixel-wise metrics (like PSNR or MSE). It uses features extracted from a pre-trained deep neural network (e.g., VGG, AlexNet) to compare images. Lower LPIPS values indicate higher perceptual similarity.
    • Mathematical Formula: LPIPS does not have a simple, closed-form mathematical formula. It is computed as follows:
      1. Two image patches (or full images) are passed through a pre-trained deep convolutional neural network (a "feature extractor," often an AlexNet, VGG, or SqueezeNet trained on ImageNet).
      2. Feature maps are extracted from one or more intermediate layers of this network.
      3. These feature maps are then spatially downsampled and normalized across channels.
      4. Finally, the $\ell_2$ distance (Euclidean distance) between the normalized feature vectors of the two images is computed and averaged across the chosen layers.
    • Symbol Explanation: While a concise formula is not available, the process involves:
      • $F(Img_1), F(Img_2)$: Feature representations of image 1 and image 2, extracted by the pre-trained neural network.
      • $w_l$: Learned weights for each layer $l$ of the feature extractor.
      • $\|\cdot\|_2$: $\ell_2$ norm (Euclidean distance).
      • The metric essentially calculates a weighted distance between deep features of the two images.
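
For reference, PSNR and the normal-map MAE can be computed with a few lines of NumPy (SSIM and LPIPS are usually taken from library implementations rather than re-derived). The snippet below is an illustrative sketch and makes the usual assumptions: images in [0, 1] and unit-length normal vectors.

```python
import numpy as np

def psnr(gt, pred, max_val=1.0):
    """PSNR in dB for images in [0, max_val]; gt and pred are same-shaped float arrays."""
    mse = np.mean((gt - pred) ** 2)
    return 10.0 * np.log10(max_val ** 2 / mse)

def mean_angular_error_deg(n_gt, n_pred):
    """Mean angular error (degrees) between two (H, W, 3) maps of unit normal vectors."""
    cos = np.clip(np.sum(n_gt * n_pred, axis=-1), -1.0, 1.0)   # per-pixel dot product
    return float(np.degrees(np.arccos(cos)).mean())
```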

5.3. Baselines

The paper compares WeatherDiffusion against several state-of-the-art methods in inverse rendering, primarily those based on diffusion models, chosen for their relevance to the task and their prominence in the field:

  • IID (Intrinsic Image Diffusion) [Kocsis et al. 2024]:
    • Description: This method was one of the first to train a latent diffusion model specifically for estimating material properties (albedo, roughness, metallicity) from a single indoor image.
    • Representativeness: It represents the foundational application of diffusion models to intrinsic image decomposition.
  • RGB→X [Zeng et al. 2024]:
    • Description: A framework that uses diffusion models for both forward rendering and inverse rendering. It aims to decompose images into material and lighting-aware components.
    • Representativeness: It's a direct competitor as it also handles both FR and IR using diffusion. The paper explicitly mentions finetuning RGB→X (and IID) on their dataset for a fair comparison, as the original RGB→X was trained on indoor datasets.
  • GeoWizard [Fu et al. 2024]:
    • Description: This method focuses on 3D geometry estimation from a single image, leveraging diffusion priors. It is noted to be trained across both indoor and outdoor scenes.
    • Representativeness: It offers a baseline for geometry-focused decomposition, and its training on outdoor scenes makes it more relevant than purely indoor methods.
  • IDArb (Intrinsic Decomposition for Arbitrary Number of Input Views and Illuminations) [Li et al. 2024]:
    • Description: This method addresses intrinsic decomposition from multiple views and illuminations but is explicitly stated to be limited to the object level.

    • Representativeness: While not directly for large-scale scenes, it represents advanced intrinsic decomposition techniques at a finer granularity.

      The authors note that IID and RGB→X were originally trained on indoor datasets, and GeoWizard works across indoor/outdoor, while IDArb is object-level. This underscores WeatherDiffusion's unique focus on large-scale AD scenes and diverse weather, a domain where these baselines are expected to struggle due to their original training focus. The finetuning of IID and RGB→X on WeatherSynthetic aims to provide a more equitable comparison, showing that even with adaptation, they may not match WeatherDiffusion's specialized design.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate WeatherDiffusion's superior performance in both inverse rendering and forward rendering tasks, especially under challenging weather conditions in AD scenes.

6.1.1. Inverse Rendering on Synthetic Data

  • Quantitative Evaluation (Table 2): WeatherDiffusion achieves state-of-the-art performance across all intrinsic maps and metrics (PSNR, SSIM, LPIPS for albedo, roughness, metallicity, irradiance; PSNR, SSIM, MAE for normal) on the WeatherSynthetic dataset.

    • For example, in Albedo estimation, WeatherDiffusion achieves a PSNR of 18.02, significantly higher than RGB→X (w/ finetune) at 11.35 and IID (w/ finetune) at 11.55. Similar trends are observed for other maps and metrics.
    • The MAE for Normal maps is remarkably low (4.24), indicating highly accurate geometry recovery compared to RGB→X (w/ finetune) (7.05) and GeoWizard (12.47).
    • The LPIPS values, which reflect perceptual quality, are consistently lower for WeatherDiffusion, especially for Metallicity (0.14) and Irradiance (0.29), suggesting better human-perceived quality.
  • Qualitative Evaluation (Figure 8):

    • Albedo Estimation (Figure 8a): WeatherDiffusion effectively recovers fine details and completely separates illumination from material, which is crucial for intrinsic decomposition. Baselines like RGB→X show less accuracy, while IID and IDArb mistakenly interpret shadows as albedo, even after finetuning.
    • Normal Estimation (Figure 8b): The model successfully eliminates the impact of atmospheric particles (e.g., snowflakes) to restore clean and sharp normals, whereas other methods are severely affected by such occlusions.
    • Roughness and Metallicity Estimation (Figure 8c, Figure 8d): These are particularly challenging under heavy rain and fog due to blurring. WeatherDiffusion successfully detects distant vehicles and accurately distinguishes metallic from non-metallic objects, outperforming baselines that often fail to do so precisely.
    • Irradiance Estimation (Figure 8e): WeatherDiffusion excels at capturing the presence of rain while preserving details of distant objects, demonstrating its ability to robustly model lighting conditions.

6.1.2. Inverse Rendering on Real Data

  • Qualitative Evaluation (Figure 10): WeatherDiffusion demonstrates strong generalization from synthetic to real-world data.
    • For images from Waymo (initially sunny), it provides reasonable estimations for all intrinsic maps, suggesting good domain transfer.
    • For images with extreme weather (raindrops on lens, heavy rain, dense fog) from TransWeather, WeatherDiffusion consistently provides reasonable estimations where other methods (e.g., finetuned RGB→X, IDArb) struggle with vehicles, distant buildings, or decoupling shadow/material.

6.1.3. Forward Rendering

  • Synthetic Data (Figure 9):
    • WeatherDiffusion recovers material and geometry better than RGB→X, which tends to generate abnormal textures.
    • It generates images that align well with text descriptions (e.g., "A sunny day in the city"), demonstrating controllable synthesis.
  • Real Data (Figure 11, Figure 12):
    • WeatherDiffusion effectively leverages intrinsic maps obtained from inverse rendering to re-render images under different weather and lighting conditions specified by text prompts (see the input-assembly sketch after this list).
    • It can generate images that closely match the material and geometry of the original image (e.g., using albedo and normal as input).
    • It can simultaneously change materials and weather conditions according to prompts, showcasing its flexibility.
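
As a rough illustration of the forward-rendering interface, the sketch below assembles a channel-concatenated conditioning tensor from the predicted intrinsic maps and pairs it with a weather prompt. The channel layout, map shapes, and the final `render` call are assumptions for illustration only, not the paper's actual conditioning format.

```python
import numpy as np

H, W = 512, 512
# Placeholder intrinsic maps; in practice these come from the inverse-rendering stage.
albedo     = np.random.rand(H, W, 3).astype(np.float32)
normal     = np.random.rand(H, W, 3).astype(np.float32)
roughness  = np.random.rand(H, W, 1).astype(np.float32)  # single channel
metallic   = np.random.rand(H, W, 1).astype(np.float32)  # single channel
irradiance = np.random.rand(H, W, 3).astype(np.float32)

# Channel-concatenated visual condition (H, W, 11) paired with a weather/lighting prompt.
condition = np.concatenate([albedo, normal, roughness, metallic, irradiance], axis=-1)
prompt = "A sunny day in the city"

# The re-rendering call itself is model-specific, e.g. a hypothetical
# image = forward_model.render(condition, prompt)
print(condition.shape, prompt)  # (512, 512, 11) A sunny day in the city
```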

6.1.4. Applications to Downstream Tasks

  • Object Detection and Image Segmentation (Figure 12): The re-rendered images from WeatherDiffusion significantly enhance downstream AD tasks.
    • In severe weather (e.g., snow), original images cause object detection and image segmentation models to struggle due to occlusions.
    • WeatherDiffusion's re-rendered output (e.g., removing snowflakes while strictly preserving object positions and shapes) provides a much cleaner input, enabling these models to generate accurate predictions and thus improving their robustness in challenging scenarios.

6.2. Data Presentation (Tables)

The following are the results from [Table 2] of the original paper:

| Method | Albedo PSNR↑ | Albedo SSIM↑ | Albedo LPIPS↓ | Normal PSNR↑ | Normal SSIM↑ | Normal MAE↓ | Roughness PSNR↑ | Roughness LPIPS↓ | Metallic PSNR↑ | Metallic LPIPS↓ | Irradiance PSNR↑ | Irradiance LPIPS↓ |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| IID | 7.80 | 0.26 | 0.63 | - | - | - | 10.30 | 0.55 | 12.37 | 0.64 | - | - |
| IID (w/ finetune) | 11.55 | 0.53 | 0.40 | - | - | - | 12.34 | 0.43 | 12.22 | 0.55 | - | - |
| RGB→X | 9.66 | 0.44 | 0.47 | 11.90 | 0.41 | 15.51 | 13.62 | 0.55 | - | - | 16.24 | 0.58 |
| RGB→X (w/ finetune) | 11.35 | 0.59 | 0.37 | 16.14 | 0.49 | 7.05 | 13.65 | 0.57 | 11.96 | 0.66 | 16.38 | 0.69 |
| GeoWizard | - | - | - | 16.24 | 0.54 | 12.47 | - | - | - | - | - | - |
| IDArb | 6.40 | 0.48 | 0.65 | 10.77 | 0.43 | 22.42 | 10.70 | 0.62 | 14.66 | 0.62 | - | - |
| ours | 18.02 | 0.66 | 0.35 | 20.95 | 0.61 | 4.24 | 15.03 | 0.45 | 18.94 | 0.14 | 23.55 | 0.29 |
| ours (w/o MAA) | 17.35 | 0.66 | 0.45 | 18.82 | 0.49 | 5.56 | 13.96 | 0.51 | 18.04 | 0.28 | 23.41 | 0.43 |

Note: A dash (-) indicates that the corresponding method either does not provide that specific intrinsic map estimation or that results were not reported for that metric in the original paper.

6.3. Ablation Studies / Parameter Analysis

The paper conducts an ablation study to understand the contribution of its proposed Intrinsic Map-Aware Attention (MAA).

  • Effect of MAA:

    • Methodology: An IR diffusion model was trained without MAA, i.e., the visual condition was replaced with the original text guidance used by Stable Diffusion 3.5 (a rough cross-attention sketch of such visual conditioning follows this list).

    • Quantitative Results (Table 2, row "ours (w/o MAA)"): The model without MAA performs significantly worse across most metrics compared to the full WeatherDiffusion model.

      • For example, Albedo PSNR drops from 18.02 to 17.35. Normal PSNR drops from 20.95 to 18.82, and its MAE increases from 4.24 to 5.56 (higher MAE is worse).
      • Metallicity PSNR drops from 18.94 to 18.04, and its LPIPS increases from 0.14 to 0.28 (higher LPIPS is worse).
    • Qualitative Results (Figure 6): The figure visually confirms the improvement. With MAA, the model produces more refined geometry and material predictions. It successfully identifies a metallic handrail and assigns it a reasonable level of metallicity. Without MAA, these details are less accurate or completely missed.

      Fig. 6. Ablation study on MAA. The figure compares results with and without MAA: the input image is shown on the left, and the outputs without and with MAA are shown on the right, highlighting MAA's advantage in accurately reconstructing the scene.

      As shown in Figure 6, the MAA component significantly improves the quality of inverse rendering, particularly in capturing fine details and material properties like metallicity.

    • Conclusion: This demonstrates that MAA is crucial for providing the necessary semantic guidance for IR, helping the model to focus on appropriate regions and decompose high-quality intrinsic maps.

  • Effect of WeatherSynthetic and WeatherReal:

    • The paper mentions exploring the effect of its two datasets separately, training the IR diffusion model with only indoor datasets, and only the synthetic dataset.
    • Results: While qualitative results are stated to be in the supplementary material, the main text implies that training solely on synthetic data leads to degradation on real-world extreme weather samples due to the domain gap. WeatherReal helps to mitigate this. This highlights the importance of comprehensive and diverse datasets for robustness.
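
The ablation above contrasts MAA's visual conditioning with plain text guidance. As a loose illustration of that kind of visual conditioning, the sketch below implements generic cross-attention from latent tokens onto image patch tokens (such as DINOv2-style patches), with an optional per-map mask restricting which patches a given intrinsic-map branch may attend to. This is a simplified stand-in under stated assumptions, not the paper's actual MAA implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MapAwareCrossAttention(nn.Module):
    """Sketch: latent tokens attend to encoder patch tokens, optionally masked per intrinsic map."""
    def __init__(self, dim_latent=64, dim_patch=384, dim_head=64):
        super().__init__()
        self.to_q = nn.Linear(dim_latent, dim_head)
        self.to_k = nn.Linear(dim_patch, dim_head)
        self.to_v = nn.Linear(dim_patch, dim_head)
        self.proj = nn.Linear(dim_head, dim_latent)

    def forward(self, latent_tokens, patch_tokens, patch_mask=None):
        # latent_tokens: (B, N, dim_latent); patch_tokens: (B, M, dim_patch)
        # patch_mask: (B, N, M) boolean, True where attention is allowed.
        q, k, v = self.to_q(latent_tokens), self.to_k(patch_tokens), self.to_v(patch_tokens)
        attn = q @ k.transpose(-2, -1) / k.shape[-1] ** 0.5
        if patch_mask is not None:
            attn = attn.masked_fill(~patch_mask, float("-inf"))
        out = F.softmax(attn, dim=-1) @ v
        return latent_tokens + self.proj(out)

# Toy usage: an 8x8 latent grid attending to a 16x16 patch grid for one intrinsic-map branch.
block = MapAwareCrossAttention()
lat = torch.randn(1, 64, 64)          # latent tokens
patches = torch.randn(1, 256, 384)    # image patch tokens (e.g., DINOv2-style features)
mask = torch.rand(1, 64, 256) > 0.5   # hypothetical per-map region mask
print(block(lat, patches, mask).shape)  # torch.Size([1, 64, 64])
```

Restricting the attention per intrinsic map is one way to encode the paper's observation that different intrinsic maps should correspond to different regions of the original image.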

6.4. Discussion and limitations

Despite its strong performance, WeatherDiffusion has acknowledged limitations:

  1. Out-of-Distribution Objects: The model struggles to accurately estimate intrinsic maps for out-of-distribution (OOD) objects (e.g., cranes, heavy trucks) that were not present in its training data. This is a common limitation for data-driven models.

  2. Extreme Occlusion in Heavy Fog: In conditions of heavy fog where distant regions are completely occluded, the model has difficulty distinguishing between ambiguous elements like the sky and buildings. This often leads to abnormal albedo and normal predictions in such severely degraded areas. Figure 7 illustrates a typical failure case.

    Fig. 7. A typical failure case on a foggy scene. WeatherDiffusion fails to discriminate between the sky and the building hidden behind fog. The figure shows the relationship between the input image, the intrinsic maps, and the re-rendered image in a foggy scene: the input image is on the left, followed by the intrinsic maps (including color, depth, and occlusion maps), and finally the re-rendered image, illustrating WeatherDiffusion's behavior under complex weather conditions.

    Figure 7 shows a typical failure case in a foggy scene where WeatherDiffusion fails to discriminate between the sky and a building obscured by fog, leading to abnormal intrinsic map predictions.

These limitations point to areas where future research could focus, such as improving generalization to novel objects or developing more robust perception under extreme visual ambiguity.

6.5. Applications

The paper highlights the significant value of WeatherDiffusion in improving the robustness of downstream tasks for autonomous driving (AD), specifically object detection and image segmentation.

  • Problem: Existing object detection [Carion et al. 2020] and image segmentation [Xie et al. 2021] methods, while highly accurate in ideal conditions, suffer significant performance degradation in adverse weather due to occlusions and reduced visibility.

  • Solution via WeatherDiffusion: By performing inverse rendering, WeatherDiffusion can decompose a scene into its physical attributes. Then, through forward rendering, it can re-render these scenes under new, clearer weather and lighting conditions. This process effectively corrects environmental distortions at the visual input level, providing cleaner inputs for downstream models (a minimal detection sketch on re-rendered frames follows this list).

  • Demonstration (Figure 12): The paper provides a compelling example. An original image with severe occlusions from snowflakes makes it difficult for segmentation and detection models to produce reasonable predictions. When WeatherDiffusion re-renders this image by removing the snowflakes while strictly preserving the original objects' positions and shapes, the downstream models are then able to generate accurate predictions.

    The figure compares input images, ground truth (GT), our results, and RGB→X under different weather conditions, showing forward and inverse rendering applied to urban scenes and highlighting the effect of the WeatherDiffusion method.

    Figure 12 clearly demonstrates how WeatherDiffusion enhances object detection and image segmentation in adverse weather by re-rendering images. The original image with snowflakes leads to poor detection and segmentation, while the re-rendered image (snowflakes removed, details preserved) enables accurate results.

  • Impact: This capability significantly enhances the robustness of AD systems in challenging weather, which is critical for safety and reliability.
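
As a hedged illustration of this downstream evaluation, the sketch below runs an off-the-shelf DETR detector (the model family of Carion et al. 2020, here the publicly available facebook/detr-resnet-50 checkpoint) on an original frame and on its re-rendered counterpart, then compares the number of confident detections. The image paths are placeholders, and the paper's exact evaluation setup may differ.

```python
# Requires: pip install torch transformers timm pillow
import torch
from PIL import Image
from transformers import DetrImageProcessor, DetrForObjectDetection

processor = DetrImageProcessor.from_pretrained("facebook/detr-resnet-50")
model = DetrForObjectDetection.from_pretrained("facebook/detr-resnet-50").eval()

def detect(path, threshold=0.7):
    """Run DETR on an image and return detections above the confidence threshold."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)
    target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
    return processor.post_process_object_detection(
        outputs, target_sizes=target_sizes, threshold=threshold)[0]

# Placeholder paths: a snowy frame and its WeatherDiffusion re-rendering.
for name in ["snowy_original.png", "re_rendered_clear.png"]:
    result = detect(name)
    print(name, "confident detections:", len(result["scores"]))
```

Comparing detection counts (or full mAP against annotations, when available) before and after re-rendering is one way to quantify the robustness gain described above.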

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces WeatherDiffusion, a novel and robust framework for both forward and inverse rendering in complex autonomous driving (AD) environments. Its key contributions include:

  1. A diffusion-based framework capable of decomposing images into intrinsic maps under diverse weather conditions and synthesizing new images based on these maps and text prompts.
  2. The innovative Intrinsic Map-Aware Attention (MAA) mechanism, which provides targeted visual guidance to the generative model, enabling higher quality and more reasonable intrinsic map decomposition by focusing on semantically relevant regions.
  3. The creation of two crucial datasets, WeatherSynthetic and WeatherReal, addressing the critical lack of large-scale, high-quality rendering datasets specifically for AD scenes with varied weather and intrinsic ground truth.

Extensive experiments validate WeatherDiffusion's superior performance compared to state-of-the-art methods and demonstrate its significant practical value in improving the robustness of downstream AD tasks, such as object detection and image segmentation, in adverse weather conditions.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Dependence on High-Quality Training Data: While WeatherDiffusion introduces new datasets, its performance still inherently relies on the quality and diversity of this training data. The model struggles with out-of-distribution (OOD) objects not present in the training set (e.g., cranes, heavy trucks).
  • Extreme Occlusion Handling: In scenarios with heavy fog and severe occlusions, the model has difficulty discriminating between visually ambiguous elements (e.g., sky vs. buildings), leading to inaccurate intrinsic map predictions.
  • Future Directions:
    • Reducing Data Dependence: Research efforts could focus on methods to decrease the model's reliance on extensive high-quality training data, possibly through more advanced self-supervised learning or data augmentation techniques.
    • Reinforcement Learning and LLM Feedback: Combining diffusion models with reinforcement learning (RL), guided by human or Large Language Model (LLM) feedback, could further enhance the robustness and controllability of the rendering process. This might allow for more nuanced and context-aware scene manipulation.
    • Auto-regressive Generative Models: Exploring the capabilities of auto-regressive generative models presents opportunities to advance both rendering and inverse rendering tasks, potentially leading to even more coherent and detailed scene generation or decomposition.

7.3. Personal Insights & Critique

This paper presents a highly relevant and impactful contribution to the field of autonomous driving and computer graphics. The integration of diffusion models with specific adaptations for AD scenes and challenging weather is a critical step forward.

  • Innovation: The Intrinsic Map-Aware Attention (MAA) is a particularly clever innovation. Instead of relying solely on textual prompts or global image features, it provides a fine-grained, map-specific visual guidance by leveraging semantic patches from DINOv2. This bridges the gap between high-level semantic understanding and low-level pixel details, which is crucial for the ill-posed inverse rendering problem, especially in complex environments. The weather controller for conditioning is also a practical and effective way to tackle weather diversity.
  • Applicability & Transferability: The core methodology of weather-guided diffusion and MAA could potentially be transferred to other domains requiring robust scene understanding or generation under varying environmental conditions, such as augmented reality, virtual production, or environmental monitoring. The idea of using intrinsic maps to clean up sensor data for downstream tasks is highly valuable and broadly applicable beyond object detection and segmentation (e.g., for depth estimation or visual odometry).
  • Potential Issues/Areas for Improvement:
    • Simplified Weather Assumption: The assumption that complex weather phenomena are primarily reflected in irradiance while material properties remain unaffected is practical, but it may be an oversimplification in extreme cases (e.g., very wet surfaces change their roughness and specularity, not just their illumination). Future work could explore more dynamic material models that interact with weather.

    • Generalization to Novel Objects: The acknowledged limitation regarding out-of-distribution objects is significant for AD. While re-rendering helps clean the scene, if the intrinsic maps for these novel objects are incorrectly estimated in the first place, the re-rendered output might propagate errors. Incorporating strong 3D priors or few-shot learning for object-level intrinsic decomposition could be a valuable extension.

    • Computational Cost: Diffusion models, especially large ones like SD 3.5, are computationally intensive. While not explicitly discussed, the inference speed for real-time AD applications might be a practical concern, although latent diffusion helps. Further optimization or distillation techniques could be explored.

    • Quantitative Metrics for Real-World Data: The paper primarily relies on qualitative results for real-world data. While understandable given the lack of ground truth, developing more robust quantitative metrics or benchmarks for inverse rendering on diverse real-world weather data would strengthen claims of generalization. WeatherReal with its pseudo ground truth is a good step but comes with its own potential biases.

Overall, WeatherDiffusion is a well-designed and highly relevant piece of research that pushes the boundaries of forward and inverse rendering towards practical applications in autonomous driving, effectively leveraging the power of diffusion models to address previously intractable challenges posed by complex weather.
