Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets
TL;DR Summary
Seed3D 1.0 is a foundation model that generates high-fidelity, simulation-ready 3D assets from single images, effectively balancing content diversity and physics accuracy for scalable training environments in embodied AI development.
Abstract
Developing embodied AI agents requires scalable training environments that balance content diversity with physics accuracy. World simulators provide such environments but face distinct limitations: video-based methods generate diverse content but lack real-time physics feedback for interactive learning, while physics-based engines provide accurate dynamics but face scalability limitations from costly manual asset creation. We present Seed3D 1.0, a foundation model that generates simulation-ready 3D assets from single images, addressing the scalability challenge while maintaining physics rigor. Unlike existing 3D generation models, our system produces assets with accurate geometry, well-aligned textures, and realistic physically-based materials. These assets can be directly integrated into physics engines with minimal configuration, enabling deployment in robotic manipulation and simulation training. Beyond individual objects, the system scales to complete scene generation through assembling objects into coherent environments. By enabling scalable simulation-ready content creation, Seed3D 1.0 provides a foundation for advancing physics-based world simulators. Seed3D 1.0 is now available on https://console.volcengine.com/ark/region:ark+cn-beijing/experience/vision?modelId=doubao-seed3d-1-0-250928&tab=Gen3D
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Seed3D 1.0: From Images to High-Fidelity Simulation-Ready 3D Assets
1.2. Authors
The authors are affiliated with ByteDance Seed. The core contributors are Jiashi Feng, Xiu Li, Jing Lin, Jiahang Liu, Gaohong Liu, Weiqiang Lou, Su Ma, Guang Shi, Qinlong Wang, Jun Wang, Zhongcong Xu, Xuanyu Yi, Zihao Yu, Jianfeng Zhang, and Yifan Zhu. Additional contributors and acknowledgments are also listed. Their collective background appears to be in AI, computer vision, 3D graphics, and large-scale model development, given the nature of the research.
1.3. Journal/Conference
The paper is published as a preprint on arXiv (arXiv:2510.19944v1). arXiv is a well-regarded open-access archive for scientific preprints in various fields, including computer science. While preprints have not undergone formal peer review, arXiv serves as a crucial platform for rapid dissemination of research findings and allows for community feedback before or alongside formal publication.
1.4. Publication Year
2025
1.5. Abstract
The paper introduces Seed3D 1.0, a foundation model designed to generate high-fidelity, simulation-ready 3D assets from single 2D images. This addresses a critical challenge in developing embodied AI agents: the need for scalable training environments that offer both diverse content and accurate physics. Traditional methods either provide diverse content (video-based, but lack physics) or accurate physics (physics-based engines, but suffer from costly manual asset creation). Seed3D 1.0 overcomes this by producing 3D assets with accurate geometry, well-aligned textures, and realistic physically-based materials, making them directly integrable into physics engines with minimal configuration. This capability supports applications in robotic manipulation and simulation training. The system is also capable of scaling to complete scene generation by assembling individual objects into coherent environments. By enabling scalable creation of simulation-ready 3D content, Seed3D 1.0 aims to advance physics-based world simulators.
1.6. Original Source Link
Original Source Link: https://arxiv.org/abs/2510.19944v1
PDF Link: https://arxiv.org/pdf/2510.19944v1.pdf
Publication Status: This paper is a preprint available on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is the scalability and fidelity mismatch in training environments for embodied AI agents. Embodied AI, which aims to create intelligent systems that can perceive, reason, and act in the physical world (like household robots), requires rich, interactive training environments. Current large multimodal models (LMMs) primarily rely on text and 2D data, which lack the crucial 3D object structure, spatial relationships, material properties, and physical dynamics necessary for physical interaction.
Existing "world simulators" face a fundamental trade-off:
- Video-based methods (e.g., Cosmos, Genie-3) can generate diverse content but lack 3D consistency and the real-time physics feedback essential for interactive learning. They are good for visual diversity but not for understanding physical consequences.
- Physics-based engines (e.g., IsaacGym) provide rigorous dynamics and explicit physics modeling, which is crucial for interpretability and safety. However, they suffer from severe scalability limitations because creating high-quality 3D assets for these environments is a costly, time-consuming, and labor-intensive manual process. This bottleneck severely constrains the variety and scale of training environments.
This problem is important because overcoming data scarcity for embodied AI requires high-fidelity simulation environments that can provide meaningful feedback for spatial reasoning and physical manipulation tasks. The paper's entry point is to bridge this gap by developing a foundation model that can automatically generate diverse, physically plausible, and simulation-ready 3D assets from simple inputs like a single 2D image, thereby alleviating the content creation bottleneck.
2.2. Main Contributions / Findings
The paper's primary contributions revolve around Seed3D 1.0, a foundation model that offers a novel solution for scalable, high-fidelity 3D asset generation for simulation environments:
- High-Fidelity Asset Generation: Seed3D 1.0 produces 3D assets with exceptionally detailed geometry, photorealistic textures (up to 4K resolution), and realistic Physically Based Rendering (PBR) materials. These assets are characterized by accurate geometry, well-aligned textures, and physically plausible material properties, ensuring realistic lighting interactions and suitability for both rendering and physical simulation. This addresses the problem of generative diversity without compromising quality.
- Physics Engine Compatibility: The generated assets can be seamlessly integrated into physics engines (e.g., NVIDIA Isaac Sim) with minimal configuration. They feature watertight, manifold geometry and optimized topology, which are critical for robust physics simulation. This enables practical applications such as generating diverse manipulation scenarios for training robotic models and creating interactive environments for reinforcement learning, addressing the physics rigor requirement.
- Scalable Scene Composition: Beyond individual objects, Seed3D 1.0 supports factorized scene generation. It leverages Vision-Language Models (VLMs) to understand and plan spatial layouts from prompt images, then generates and places the required assets to compose coherent and diverse scenes, from indoor settings to urban environments. This addresses the scalability challenge for creating entire training worlds.
- State-of-the-Art Performance: The model demonstrates state-of-the-art performance across geometry and texture generation benchmarks. The 1.5B-parameter geometry generation model outperforms larger baseline methods, and comprehensive user studies validate its superior visual clarity, geometric accuracy, and material realism.
In summary, Seed3D 1.0 solves the critical problem of content scalability for physics-based world simulators, providing a foundation for advancing embodied AI by enabling the creation of rich, diverse, and physically accurate training environments.
3. Prerequisite Knowledge & Related Work
This section provides an overview of the fundamental concepts and prior research necessary to understand the technical contributions of Seed3D 1.0.
3.1. Foundational Concepts
Embodied AI
Embodied AI refers to artificial intelligence systems that learn and act within a physical (real or simulated) environment, possessing a "body" that allows them to perceive and interact with the world. Unlike traditional AI that might process abstract data, embodied agents (like robots) need to understand 3D object structure, spatial relationships, material properties, and physical dynamics to perform tasks such as navigation, manipulation, or interaction with humans. The goal is to enable AI to operate in the complex, unstructured real world.
World Simulators
World simulators are virtual environments designed to mimic the physical world, used for training and testing AI agents. They provide a controlled, reproducible, and often accelerated setting for agents to learn complex tasks.
- Video-based methods: These simulators generate diverse visual content, often from real-world videos or generative models, but typically lack true 3D consistency or accurate physical interactions. They are good for visual realism but poor for physics.
- Physics-based engines: These simulators (e.g., Unity, Unreal Engine, NVIDIA Isaac Sim) explicitly model physical laws (gravity, friction, collision, rigid body dynamics) to provide accurate interactions. They are crucial for training robots where precise physical feedback is needed but traditionally suffer from the high cost of creating 3D assets.
Reinforcement Learning (RL)
Reinforcement learning is a paradigm where an agent learns to make decisions by performing actions in an environment and receiving feedback in the form of rewards or penalties. The agent's goal is to maximize cumulative reward over time. For embodied AI, RL agents can learn complex manipulation skills by interacting with objects in a simulated environment, with the physics engine providing the "feedback" on the consequences of their actions.
3D Asset Generation
3D asset generation involves creating digital 3D models, including their shapes (geometry), visual surfaces (textures), and material properties.
- Geometry: Refers to the shape and form of a 3D object, often represented as a mesh (a collection of vertices, edges, and faces) or implicit functions.
- Textures: 2D images applied to the surface of a 3D model to give it color, pattern, and fine surface details.
UV mapping is the process of flattening a 3D model's surface onto a 2D plane (the UV map) so that a 2D texture can be applied correctly without distortion.
- Physically Based Rendering (PBR) Materials: A shading and rendering approach that aims to simulate how light interacts with surfaces in a physically plausible way. PBR materials are typically defined by several maps:
- Albedo (or Base Color): The intrinsic color of an object, reflecting diffuse light.
- Metallic: A map indicating which parts of the surface are metallic (usually 0 or 1, non-metal or metal). Metallic surfaces reflect light differently than non-metallic ones.
- Roughness: A map indicating the microscopic surface irregularities, which scatter light and determine how sharp or blurry reflections appear (0 for perfectly smooth/glossy, 1 for rough/matte).
- Other maps can include Normal (for fine surface detail), Ambient Occlusion (for shadows), etc.
Foundation Models
Foundation models are large AI models trained on a vast amount of broad data (e.g., text, images, code, 3D data) at scale. They are designed to be general-purpose and can be adapted to a wide range of downstream tasks through fine-tuning or prompt engineering. Seed3D 1.0 is presented as a foundation model for 3D asset generation.
Variational Autoencoders (VAEs)
A Variational Autoencoder (VAE) is a type of generative neural network that learns a compressed, low-dimensional latent space representation of input data. It consists of an encoder that maps input data to a distribution in the latent space, and a decoder that reconstructs the data from samples drawn from this latent distribution. VAEs are used for efficient data compression and generating new, diverse samples similar to the training data. The "variational" aspect refers to its use of variational inference to approximate the posterior distribution of the latent variables.
Diffusion Models
Diffusion models are a class of generative models that learn to reverse a gradual "diffusion" process (adding noise) to generate data. They start with random noise and progressively denoise it over several steps to produce a coherent sample (e.g., an image, a 3D shape).
- Rectified Flow: A specific type of diffusion model that learns a "rectified flow" (a straight-line path) between a noisy data point and a clean data point in the latent space. This simplifies the denoising process, potentially leading to faster and more stable generation compared to traditional diffusion models.
- Diffusion Transformers (DiT): Diffusion models where the neural network backbone responsible for denoising is a Transformer. This allows them to leverage the Transformer's power in modeling long-range dependencies and complex data structures, making them suitable for high-resolution image generation and, in this paper, 3D latent space denoising.
Transformers and Attention Mechanisms
Transformers are neural network architectures that rely heavily on attention mechanisms. They have become dominant in natural language processing and increasingly in computer vision.
- Attention: A mechanism that allows the model to weigh the importance of different parts of the input data when processing a specific part. It dynamically focuses on relevant information.
- Self-Attention: A mechanism within Transformers that relates different positions of a single sequence to compute a representation of the sequence. For example, in a sentence, it helps understand how each word relates to other words.
- Cross-Attention: A mechanism that allows a sequence to attend to another, different sequence. For example, in multimodal models, it can allow visual features to attend to text features, or vice-versa, to integrate information from different modalities.
- Positional Encoding: Since Transformers process sequences without inherent order,
positional encoding is added to input embeddings to inject information about the relative or absolute position of items in the sequence. Rotary Positional Encoding (RoPE) is a specific type that applies a rotation to the query and key vectors based on their absolute position.
Truncated Signed Distance Fields (TSDF)
A Signed Distance Field (SDF) is a function that, for any point in 3D space, returns the shortest distance from that point to the surface of an object. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the object. A Truncated Signed Distance Field (TSDF) limits this distance to a small band around the surface, making it more computationally efficient and robust for representing surfaces, especially for reconstruction.
Dual Marching Cubes (DMC)
Dual Marching Cubes is an algorithm used to extract a surface mesh (a collection of triangles) from an implicit function, such as an SDF or TSDF. It's an improved version of the classic Marching Cubes algorithm, often producing meshes with better quality and topology.
Vision-Language Models (VLMs)
Vision-Language Models are AI models that can understand and process information from both visual inputs (images, videos) and text inputs. They can perform tasks like image captioning, visual question answering, or, as used here, extracting rich visual semantics and identifying objects from images. Examples include CLIP, DINOv2, and RADIO.
- CLIP (Contrastive Language-Image Pre-training): A VLM that learns to associate text descriptions with images by being trained on a vast dataset of image-text pairs. It can measure the similarity between text and image embeddings.
- DINOv2: A self-supervised vision transformer that learns robust visual features without requiring labeled data. It excels at tasks like semantic segmentation, depth estimation, and instance retrieval.
- RADIO: A model that complements DINOv2 by providing enhanced geometric understanding, distilled from multiple vision foundation models, which helps resolve depth ambiguity in single-view conditioning.
Data Parallelism and Activation Checkpointing
These are techniques used in large-scale deep learning training to manage computational resources and memory.
- Data Parallelism: A strategy where the same model is replicated across multiple processing units (e.g., GPUs), and each unit processes a different batch of data. Gradients are then aggregated to update the model.
- Fully Sharded Data Parallelism (FSDP): An advanced data parallelism technique that shards (partitions) the model's parameters, gradients, and optimizer states across different GPUs. This significantly reduces memory consumption per GPU, allowing for training larger models.
- Hybrid Sharded Data Parallelism (HSDP): A combination of standard data parallelism within a single node and FSDP across multiple nodes, aiming to balance communication overhead and memory efficiency.
- Activation Checkpointing (Gradient Checkpointing): A memory-saving technique that stores only a subset of activations during the forward pass and recomputes the necessary intermediate activations during the backward pass. This reduces GPU memory usage at the cost of increased computation time.
Multi-Level Activation Checkpointing (MLAC) selectively checkpoints activations based on their recomputation cost, offloading high-cost tensors to CPU memory and using asynchronous prefetching to optimize the trade-off.
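To make the memory/compute trade-off concrete, here is a minimal PyTorch sketch of standard activation checkpointing. It is not the paper's MLAC system (which additionally offloads high-cost tensors to CPU with asynchronous prefetching); block size and structure are illustrative.

```python
import torch
from torch.utils.checkpoint import checkpoint

class CheckpointedBlock(torch.nn.Module):
    """A feed-forward block whose intermediate activations are recomputed
    during the backward pass instead of being stored."""
    def __init__(self, dim: int = 1024):
        super().__init__()
        self.ffn = torch.nn.Sequential(
            torch.nn.Linear(dim, 4 * dim),
            torch.nn.GELU(),
            torch.nn.Linear(4 * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended non-reentrant checkpointing mode
        return checkpoint(self.ffn, x, use_reentrant=False)
```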
3.2. Previous Works
The paper builds upon and differentiates itself from several lines of prior research:
- Video-based World Models (e.g., Cosmos [1], Genie-3 [4]): These models aim to generate dynamic, diverse environments, often from video data. While they produce rich content, they generally lack 3D consistency and the explicit physics modeling required for interactive embodied AI training. Seed3D 1.0 aims to provide this missing physical rigor.
- Physics-based Simulators (e.g., IsaacGym [38]): These offer accurate dynamics but are constrained by the manual, costly creation of 3D assets. Seed3D 1.0 directly addresses this "content bottleneck" by automating asset generation.
- 3D Shape VAEs (e.g., 3DShape2VecSet [10, 65], Dora [10]): Seed3D-VAE is explicitly stated to follow the design of 3DShape2VecSet and Dora. These models encode 3D geometry into compact latent representations, often using point clouds and reconstructing continuous geometric representations like TSDFs. Seed3D 1.0 leverages this foundation for its geometry encoder/decoder.
- Core Idea of 3DShape2VecSet/Dora: Encode surface point clouds into a permutation-invariant latent vector set using self-attention and cross-attention, then decode a continuous geometric representation (e.g., SDF or TSDF) from this latent set. This allows for flexible representation of complex shapes and scaling beyond fixed-length representations.
- Diffusion Transformers (DiT) for 2D Generation (e.g., [14, 46]): The paper extends the concept of denoising in a compressed latent space, similar to 2D image generation using VAEs and rectified flow-based DiTs (like those used in FLUX [27] or related to [14]). Seed3D-DiT adapts this to 3D geometry generation.
- Core Idea of Latent Diffusion Models (LDMs) like FLUX: Instead of performing diffusion in pixel space, they use a VAE to encode images into a lower-dimensional latent space. Diffusion happens in this latent space, which is more computationally efficient. Then, a VAE decoder reconstructs the image from the denoised latent. This paper applies a similar principle but for 3D.
- Image Conditioning in DiTs (e.g., FLUX [27]): The dual-encoder design (DINOv2 + RADIO) for the Image Conditioning Module in Seed3D-DiT follows similar principles of using powerful visual encoders to provide rich conditioning signals, building on models like FLUX.
- Multi-view Generation Models (e.g., [54, 56], UniTex [34], MV-Adapter [21], MMDiT [14]): Seed3D-MV builds upon the Multi-Modal Diffusion Transformer (MMDiT) architecture [14] and concepts from UniTex [34] and Flux.1 Kontext [28] for in-context multi-modal conditioning. It aims to improve multi-view consistency and handle increased sequence lengths more effectively.
- Core Idea of Multi-view Diffusion: Generate multiple consistent images of an object from different viewpoints, often conditioned on a single reference image or 3D guidance. Challenges include maintaining consistency across views and handling varying camera parameters.
- PBR Synthesis Methods (e.g., [17, 26, 33]): The paper categorizes these into generation-based (synthesize PBR maps) and estimation-based (decompose multi-view images). Seed3D-PBR adopts the estimation paradigm, using a DiT-based architecture for material decomposition, differing from methods like Pandora3d [61] or Hunyuan3D 2.1 [22] which are also mentioned as baselines.
- Texturing and UV unwrapping (e.g., [6, 37], xatlas [64]): Seed3D-UV addresses the challenge of incomplete UV texture maps due to limited view coverage and self-occlusions, a common problem in 3D reconstruction and texturing pipelines.
3.3. Technological Evolution
The field has evolved from basic 3D model creation (often manual or procedural) to generative AI capable of creating diverse 2D content. The current frontier involves extending generative AI to 3D, aiming for realism, consistency, and physical plausibility.
- Early 3D: Manual modeling, CAD software.
- Implicit Representations: SDFs, Neural Radiance Fields (NeRFs) for 3D scene representation.
- 2D Generative AI: GANs, VAEs, and especially Diffusion Models (DMs) revolutionizing image generation.
- 3D Generative AI (Initial Steps): Extending 2D DMs to 3D (e.g., generating 3D from text/image), often facing challenges in geometry quality, texture consistency, and computational cost.
- Embodied AI: Growing demand for realistic, interactive simulation environments to train agents, pushing for physically accurate and scalable 3D content generation.
Seed3D 1.0 fits into this timeline by integrating state-of-the-art 2D generative techniques (Diffusion Transformers, VAEs) and multi-modal conditioning into a comprehensive pipeline specifically designed for 3D asset generation that is "simulation-ready." It represents a significant step towards automating content creation for the next generation of embodied AI research.
3.4. Differentiation Analysis
Compared to the main methods in related work, Seed3D 1.0 introduces several core differences and innovations:
- Integrated End-to-End Pipeline for Simulation-Readiness: Unlike methods that focus solely on geometry or texture, Seed3D 1.0 provides a comprehensive pipeline from a single image to a fully textured, PBR-materialized, watertight 3D asset directly compatible with physics engines. This distinguishes it from video-based methods (which lack physics) and purely geometry/texture generation models (which don't guarantee simulation-readiness or PBR quality).
- High-Fidelity Geometry via Latent Space DiT: The combination of Seed3D-VAE for compact 3D latent encoding (building on 3DShape2VecSet/Dora) and Seed3D-DiT (a rectified flow-based Diffusion Transformer) operating in this latent space for generation. This approach, conditioned by a dual image encoder (DINOv2 + RADIO), yields superior geometry compared to many baseline 3D generation models, as shown in quantitative results. The focus on watertight, manifold geometry is crucial for simulation.
- Advanced Multi-View and PBR Material Generation: Seed3D-MV uses an MMDiT architecture with in-context multi-modal conditioning and specialized positional encoding for robust multi-view consistency, addressing limitations of prior multi-view diffusion models. Seed3D-PBR introduces a parameter-efficient two-stream DiT design for material decomposition (albedo, metallic-roughness) from multi-view images. This "estimation" paradigm is chosen for its realism advantage over "generation" methods given limited PBR training data. The two-stream design allows for modality-specific processing while sharing parameters, improving accuracy and efficiency.
- Robust UV Texture Completion: Seed3D-UV is a novel coordinate-conditioned diffusion model specifically designed to address self-occlusion artifacts and fill in missing regions in UV maps, a common problem when baking textures from limited multi-view observations. This ensures complete and coherent textures, vital for high-quality assets.
- Scalable Data Infrastructure: The paper highlights a comprehensive data preprocessing pipeline and engineering infrastructure. This robust system for sourcing, cleaning, standardizing, and rendering 3D data at scale is a critical differentiator, enabling the training of such a high-performing foundation model, and is an often overlooked aspect of academic papers.
- Application Focus: The explicit emphasis on "simulation-ready" assets for robotic manipulation and embodied AI training distinguishes Seed3D 1.0 from models primarily focused on rendering or visual content creation.
In essence, Seed3D 1.0 innovates by tightly integrating multiple advanced generative AI techniques—from latent 3D geometry diffusion to multi-view PBR texture synthesis and UV completion—into a cohesive, scalable system specifically engineered to meet the demanding requirements of physics-based world simulators for embodied AI.
4. Methodology
The methodology of Seed3D 1.0 involves a multi-stage generative pipeline that transforms a single input image into a high-fidelity, simulation-ready 3D asset. This process is structured around generating geometry first, followed by a comprehensive texture and material generation process.
4.1. Model Design
Seed3D 1.0 is composed of four main sequential components:
- Seed3D-DiT: Generates 3D geometry.
- Seed3D-MV: Synthesizes multi-view consistent RGB images.
- Seed3D-PBR: Estimates Physically Based Rendering (PBR) materials.
- Seed3D-UV: Completes UV texture maps.
The overall inference pipeline is illustrated in Figure 7 of the original paper.
4.1.1. Geometry Generation (Seed3D-DiT and Seed3D-VAE)
Geometry generation aims to create watertight, manifold 3D shapes suitable for physics simulation while preserving structural details. It combines a Variational Autoencoder (VAE) for compact latent representation with a Diffusion Transformer (DiT) for shape generation.
The following figure (Figure 2 from the original paper) shows the system architecture for geometry generation:
Figure 2: The Seed3D 1.0 geometry generation pipeline, which produces a high-fidelity 3D model from an input image. The framework combines the Seed3D-VAE variational autoencoder for compact geometry encoding and TSDF decoding with the Seed3D-DiT diffusion transformer, whose double-stream and single-stream blocks generate the shape.
4.1.1.1. Seed3D-VAE
Seed3D-VAE is designed to learn compact latent representations of 3D geometry and reconstruct continuous geometric representations, specifically Truncated Signed Distance Fields (TSDFs), which are crucial for defining shapes with fine details. It follows the design principles of 3DShape2VecSet and Dora.
Architecture:
The Seed3D-VAE uses a dual cross-attention encoder and a self-attention decoder.
- Input Preparation: Given an input mesh (3D model), two sets of points are sampled from its surface:
  - $P_u$: uniformly sampled surface points.
  - $P_s$: salient edge points.
  These are combined into a set $P = P_u \cup P_s$. Each point $p \in P$ is then embedded using Fourier positional encoding, denoted $\gamma(p)$; this embedding helps the model capture the spatial relationships of points. The positional embeddings are concatenated with the points' surface normals $n_p$.
- Encoder: The encoder transforms the input points and normals into a set of latent vectors $Z \in \mathbb{R}^{M \times d}$, where $M$ is the number of latent tokens and $d$ is their dimension:
  - The first layer uses Cross-Attention to aggregate the positionally encoded points and their normals into the latent queries, $Z^{(0)} = \mathrm{CrossAttn}\big(Q,\ [\gamma(P),\ n_P]\big)$, where $\mathrm{CrossAttn}(\cdot,\cdot)$ denotes cross-attention with the initial latent queries $Q$ attending to the point features. This step captures relationships between different aspects of the input geometry.
  - Subsequent layers apply Self-Attention to refine these latent tokens, $Z^{(l)} = \mathrm{SelfAttn}\big(Z^{(l-1)}\big)$ for $l = 1, \dots, L$, letting the latent tokens interact with each other and further compress the geometric information. The final output of the encoder is the latent token set $Z$.
- Decoder: The decoder takes the latent token set $Z$ and a query point $x \in \mathbb{R}^3$ and predicts its signed distance value $\hat{s}(x)$, thus defining a continuous TSDF field:
  - The query point is first embedded with Fourier features, $\gamma(x)$.
  - The embedded query is then refined through self-attention layers.
  - The refined query representation $q_x$ attends to the latent descriptors via Cross-Attention.
  - Finally, an MLP (Multi-Layer Perceptron) head processes the cross-attention output to produce the predicted signed distance, $\hat{s}(x) = \mathrm{MLP}\big(\mathrm{CrossAttn}(q_x,\ Z)\big)$.
  This lets the decoder reconstruct the 3D shape by querying any point in space for its distance to the surface, conditioned on the learned latent representation.
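The following is a minimal, illustrative PyTorch sketch of the vector-set encoder/decoder pattern described above (cross-attention from latent queries to point features, self-attention refinement, and a query-point TSDF head). Dimensions and layer counts are assumptions; the variational branch, Fourier embeddings, and query self-attention refinement are omitted; this is not the authors' implementation.

```python
import torch
import torch.nn as nn

class ShapeVAESketch(nn.Module):
    """Cross-attention encoder over surface points + query-point TSDF decoder."""
    def __init__(self, dim=512, num_latents=256, heads=8, enc_layers=2):
        super().__init__()
        self.latent_queries = nn.Parameter(torch.randn(num_latents, dim))
        self.point_proj = nn.Linear(6, dim)   # raw xyz + normal stand in for gamma(p) || n_p
        self.enc_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.enc_self = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True) for _ in range(enc_layers)]
        )
        self.query_proj = nn.Linear(3, dim)   # query-point embedding (Fourier features omitted)
        self.dec_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sdf_head = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, 1))

    def encode(self, points_with_normals):           # (B, N, 6)
        b = points_with_normals.shape[0]
        z = self.latent_queries.unsqueeze(0).expand(b, -1, -1)
        x = self.point_proj(points_with_normals)
        z, _ = self.enc_cross(z, x, x)                # latents attend to the point cloud
        for attn in self.enc_self:
            z = z + attn(z, z, z)[0]                  # self-attention refinement
        return z                                      # (B, M, dim) latent token set

    def decode(self, z, query_xyz):                   # query_xyz: (B, Q, 3)
        q = self.query_proj(query_xyz)
        q, _ = self.dec_cross(q, z, z)                # queries attend to latent descriptors
        return self.sdf_head(q).squeeze(-1)           # predicted (truncated) signed distances
```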
VAE Training:
To ensure robustness and generalization across different computational budgets, a multi-scale training strategy is employed.
- During training, the number of latent tokens $M$ is randomly sampled from a range of lengths. This leverages the vector set architecture's length-agnostic property (latent tokens are position-encoding-free and permutation-invariant), allowing the decoder to scale to token lengths not explicitly seen during training.
- The overall training objective combines a TSDF reconstruction loss ($\mathcal{L}_{\text{recon}}$) and KL divergence regularization ($\mathcal{L}_{\text{KL}}$): $\mathcal{L} = \mathcal{L}_{\text{recon}} + \lambda_{\text{KL}}\,\mathcal{L}_{\text{KL}}$.
  - $\mathcal{L}_{\text{recon}}$: Measures how accurately the decoder reconstructs the TSDF of the input mesh.
  - $\mathcal{L}_{\text{KL}}$: The Kullback-Leibler divergence term, which regularizes the latent space by pushing the encoder's output distribution towards a prior (typically a standard normal distribution). This keeps the latent space well-behaved and continuous, facilitating generation.
  - $\lambda_{\text{KL}}$: A weighting factor for the KL divergence term. A warm-up schedule is used, where $\lambda_{\text{KL}}$ starts small and gradually increases to its target value, so that the KL term does not dominate early in training before the model has learned good reconstructions.
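A hedged sketch of the combined objective with KL warm-up is shown below; the reconstruction loss form (MSE here), the target KL weight, and the warm-up length are illustrative placeholders rather than the paper's values.

```python
import torch
import torch.nn.functional as F

def vae_loss(pred_tsdf, gt_tsdf, mu, logvar, step, warmup_steps=10_000, target_kl_weight=1e-3):
    """Combined objective: TSDF reconstruction + KL regularization with linear warm-up."""
    recon = F.mse_loss(pred_tsdf, gt_tsdf)                          # L_recon (MSE is an assumption)
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())   # KL(q(z|x) || N(0, I))
    lam = target_kl_weight * min(1.0, step / warmup_steps)          # lambda_KL warm-up schedule
    return recon + lam * kl
```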
4.1.1.2. Seed3D-DiT
Seed3D-DiT builds upon the geometry-aware latent space learned by Seed3D-VAE. It is a rectified flow-based diffusion framework that generates 3D shapes by modeling the transformation from noise to structured latent representations, conditioned on input images.
Image Conditioning Module:
To provide rich visual semantics for geometry generation, a dual-encoder design is adopted:
- DINOv2 [43]: Provides general robust visual features without supervision.
- RADIO [45]: Complements DINOv2 by offering enhanced geometric understanding through knowledge distillation from multiple vision foundation models. This helps resolve depth ambiguity inherent in single-view image conditioning and improves training stability. Input images are encoded by both DINOv2 and RADIO, and their feature representations are concatenated channel-wise to form comprehensive conditioning signals that capture both semantic (what the object is) and geometric (its 3D form from a 2D image) properties.
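As a sketch of the dual-encoder conditioning, assuming `dinov2` and `radio` are frozen encoders that return patch tokens of shape (B, N, C); the projection layers and token layout actually used inside Seed3D-DiT are not shown.

```python
import torch

def encode_image_condition(image, dinov2, radio):
    """Concatenate semantic (DINOv2) and geometry-aware (RADIO) patch features channel-wise."""
    with torch.no_grad():
        semantic_tokens = dinov2(image)     # e.g. (B, N, C1)
        geometric_tokens = radio(image)     # e.g. (B, N, C2)
    return torch.cat([semantic_tokens, geometric_tokens], dim=-1)   # (B, N, C1 + C2)
```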
Transformer Architecture:
A Transformer serves as the diffusion backbone to model cross-modal relationships between visual (image) and geometric (latent shape) representations. It follows the hybrid design of FLUX [27]:
- Double-stream processing blocks: These blocks process shape tokens (from the VAE latent space) and image tokens (from the image conditioning module) separately, using modality-specific parameters (distinct layer normalization, QKV projections, and MLPs), while enabling cross-modal interaction through attention applied to the concatenated tokens. This allows the model to learn how the image relates to the shape.
- Single-stream processing blocks: After double-stream processing, the refined shape tokens are processed through additional transformer layers in a single stream, further refining the shape representation.
- Final Decoding: The output of these transformer layers is fed into the Seed3D-VAE decoder to reconstruct the final 3D mesh.
This hybrid approach balances learning interactions between modalities with efficient modality-specific processing.
Diffusion Scheduling:
- Training employs a flow matching [35] framework with velocity field prediction: the model learns to predict the velocity vector at each point on the trajectory from noise to data, which defines a straight path in the latent space. Timesteps (which control the noise level) are sampled from a logit-normal distribution.
- A length-aware timestep shift [14] is applied, scaling the noise schedule dynamically based on the length of the latent sequence. Longer latent sequences (which carry more detail) require a higher noise level to effectively disrupt their structure.
- During inference, deterministic sampling through the learned velocity field generates 3D shapes conditioned on input images, ensuring consistent, high-quality generation.
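A minimal training-step sketch of velocity prediction under these choices follows. The logit-normal timestep sampling and the shift formula are written in a common form used in prior rectified-flow work; the paper's exact length-aware shift is an assumption here.

```python
import torch
import torch.nn.functional as F

def rectified_flow_step(model, x0, cond, shift=3.0):
    """One velocity-prediction training step. `x0`: clean VAE latents (B, N, D);
    `cond`: image-conditioning tokens. `shift` stands in for the length-aware shift."""
    b = x0.shape[0]
    t = torch.sigmoid(torch.randn(b, device=x0.device))      # logit-normal timesteps in (0, 1)
    t = shift * t / (1.0 + (shift - 1.0) * t)                 # shift toward higher noise for long sequences
    t_ = t.view(b, 1, 1)
    noise = torch.randn_like(x0)
    xt = (1.0 - t_) * x0 + t_ * noise                         # straight-line interpolation path
    velocity_target = noise - x0                              # d(xt)/dt along that path
    velocity_pred = model(xt, t, cond)                        # DiT predicts the velocity field
    return F.mse_loss(velocity_pred, velocity_target)
```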
4.1.2. Texture Generation
Beyond geometry, high-quality texture synthesis is crucial for realistic 3D assets. The texture generation pipeline produces Physically Based Materials (PBR) through three sequential components:
4.1.2.1. Seed3D-MV
Seed3D-MV is a multi-view diffusion model that generates consistent RGB images from multiple viewpoints, conditioned on a reference image and the generated 3D shape guidance. It addresses limitations of previous multi-view generation methods, which often required extra modules or produced suboptimal results for in-the-wild images.
The following figure (Figure 3 from the original paper) illustrates the Seed3D-MV architecture:

Objective: The model learns the conditional distribution $p\big(X \mid G, I_{\text{ref}}, T\big)$ for multi-view consistent image generation, where:
- $X$: the target multi-view RGB images to be generated.
- $G$: spatially aligned multi-view geometry images, derived from the input mesh (generated by Seed3D-DiT); these include normal maps (surface orientation) and canonical coordinate maps (spatial coordinates).
- $I_{\text{ref}}$: the single reference image provided as input to the entire Seed3D 1.0 pipeline.
- $T$: an optional text prompt, allowing additional semantic control.
In-Context Multi-Modal Conditioning: Following approaches like UniTex [34] and Flux.1 Kontext [28], multi-modal conditioning is achieved by concatenating various input tokens along the sequence dimension:
- Noisy Input Tokens: The latent representations of the multi-view images with added noise, which the diffusion model learns to denoise.
- Clean Condition Tokens: These come from different modalities:
  - Geometry ($G$): Encoded latent representations of multi-view normal maps and canonical coordinate maps.
  - Reference Image ($I_{\text{ref}}$): Encoded latent representation of the input reference image.
  - Text Prompt ($T$): Processed through a pretrained language model [2].
- Encoders: Geometry and reference images are encoded into latent representations using a frozen VAE (to maintain consistency and reduce training complexity).
- Classifier-Free Guidance: During training, conditional tokens are randomly dropped, which allows the influence of the conditioning signal to be scaled at inference for better quality control.
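The in-context conditioning can be pictured as plain sequence concatenation of the token groups, as in the sketch below; tensor names, a shared hidden size, and the condition-drop probability are illustrative assumptions.

```python
import torch

def build_mv_sequence(noisy_mv, geom, ref, text, drop_cond_prob=0.1, training=True):
    """Concatenate token groups along the sequence axis for in-context conditioning.
    Inputs are (B, N_i, D) tensors: noisy multi-view latents, frozen-VAE geometry
    latents, the reference-image latent, and text embeddings."""
    if training and torch.rand(()) < drop_cond_prob:
        # classifier-free guidance: occasionally zero out the conditioning tokens
        geom, ref, text = (torch.zeros_like(t) for t in (geom, ref, text))
    # order: multi-view noisy tokens, geometry tokens, reference tokens, text tokens
    return torch.cat([noisy_mv, geom, ref, text], dim=1)
```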
Positional Encoding:
- Cross-modal RoPE (Rotary Positional Encoding) [52] is used to facilitate interaction between the different types of multi-modal tokens.
- The standard RoPE scheme is modified to handle the distinct token types: spatially aligned geometry tokens and non-aligned reference image tokens.
- The token sequence is organized as: multi-view noisy tokens, geometry image tokens, reference image tokens, and text tokens. This ordering optimizes cross-modal attention while maintaining RoPE compatibility.
- Empirically, using separate spatial positions for noisy tokens and geometry tokens performs better than shared spatial positioning, suggesting the model benefits from distinct spatial understanding for these elements.
Timestep Sampling:
- Multi-view generation significantly increases the input sequence length, which can strain the model's learning capacity and degrade output quality.
- To maintain high-fidelity generation, resolution-aware timestep sampling [14] is adopted: a shift-SNR (signal-to-noise ratio) sampling distribution dynamically adapts to the noisy token sequence length during both training and inference, so that an appropriate noise level is applied even for long sequences.
4.1.2.2. Seed3D-PBR
Seed3D-PBR is a diffusion model that decomposes the multi-view RGB images generated by Seed3D-MV into albedo, metallic, and roughness maps. These Physically Based Rendering (PBR) components are fundamental for realistic 3D content. It adopts an "estimation" paradigm (decomposing multi-view images) over "generation" (synthesizing from scratch) due to better realism given limited PBR training data.
The following figure (Figure 4 from the original paper) illustrates the Seed3D-PBR model:

Model Architecture:
- Seed3D-PBR is built upon the MMDiT architecture, featuring an innovative two-stream design.
- It takes camera pose embeddings, multi-view images (from Seed3D-MV), and a reference image as input.
- It simultaneously generates multi-view albedo maps and multi-view metallic-roughness (MR) maps while ensuring cross-view consistency. The MR maps are often combined for efficiency, as metallic and roughness properties are highly correlated.
Conditioning Mechanism: A dual-level conditioning mechanism is used to leverage multi-view information effectively:
- Global Control: Global feature embeddings are extracted from the reference image using a pretrained CLIP vision encoder [44]. These embeddings replace the original text embeddings in the diffusion model, providing high-level appearance guidance throughout the generation process.
- Local Control: For pixel-level control, a strategy similar to ImageDream [56] is adopted:
  - The VAE-encoded latent of the reference image is concatenated with the noise latent along the channel dimension, serving as additional input to the DiT blocks.
  - To reduce computational overhead, the multi-view conditioning image latents are directly added to the initial noise latents and fed only into the first DiT block as initial guidance.
Two-Stream Network Structure:
Albedo and Metallic-Roughness (MR) have significantly different physical properties. Seed3D-PBR addresses this with a fine-grained, parameter-efficient separation mechanism within the DiT blocks, rather than fully separate networks or high-level architectural separation (like separate U-Net heads).
- Separate Projections: Within each DiT block, separate projection layers are instantiated for the Query (Q), Key (K), and Value (V) tensors of each modality (albedo and MR), letting the model learn modality-specific representations.
- Shared Attention: After computing the respective Q, K, V tensors for albedo and MR, their latent vectors are concatenated with the global image conditioning (from CLIP) and processed through a shared full-attention module, allowing both modalities to interact and maintain cross-view consistency.
- Shared Components: All other DiT components, such as feed-forward networks, remain shared between modalities.
- Modality Embeddings: Learnable modality embeddings are added to the positional embeddings to explicitly distinguish the albedo and MR streams.
- Decoder Heads: Finally, two separate decoder heads map the processed latents to the albedo and MR outputs, respectively.
This design captures modality-specific features while significantly reducing the total parameter count compared to fully separate networks for albedo and MR.
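A minimal PyTorch sketch of the two-stream idea is shown below: separate Q/K/V projections per modality, one shared attention pass over the concatenated tokens, and shared feed-forward weights. The CLIP global conditioning, modality embeddings, normalization layers, and all sizes are omitted or assumed; this is not the paper's exact block.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoStreamBlockSketch(nn.Module):
    """Modality-specific QKV projections + shared attention and shared FFN."""
    def __init__(self, dim=1024, heads=16):
        super().__init__()
        self.heads = heads
        self.qkv_albedo = nn.Linear(dim, 3 * dim)    # modality-specific projections
        self.qkv_mr = nn.Linear(dim, 3 * dim)
        self.out_proj = nn.Linear(dim, dim)          # shared output projection
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, albedo_tok, mr_tok):
        qa, ka, va = self.qkv_albedo(albedo_tok).chunk(3, dim=-1)
        qm, km, vm = self.qkv_mr(mr_tok).chunk(3, dim=-1)
        # shared full attention over the concatenated sequence keeps the streams consistent
        q = torch.cat([qa, qm], dim=1)
        k = torch.cat([ka, km], dim=1)
        v = torch.cat([va, vm], dim=1)
        b, n, d = q.shape
        h = self.heads
        q, k, v = (t.view(b, n, h, d // h).transpose(1, 2) for t in (q, k, v))
        attn = F.scaled_dot_product_attention(q, k, v)
        attn = self.out_proj(attn.transpose(1, 2).reshape(b, n, d))
        n_albedo = albedo_tok.shape[1]
        albedo = albedo_tok + attn[:, :n_albedo]
        mr = mr_tok + attn[:, n_albedo:]
        return albedo + self.ffn(albedo), mr + self.ffn(mr)   # shared FFN weights
```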
4.1.2.3. Seed3D-UV
While Seed3D-MV and Seed3D-PBR generate high-quality multi-view images and PBR maps, converting these into complete UV texture maps for a 3D model poses a challenge. Due to limited view coverage and self-occlusions, directly baking multi-view observations into UV space often results in incomplete textures with holes and seams. Seed3D-UV proposes a coordinate-conditioned diffusion model for UV texture completion.
Initial Texture Baking from Multi-view Images:
- Projection: The 3D mesh (from geometry generation) and the multi-view PBR material images (from Seed3D-PBR) are used. Each multi-view image is projected onto the mesh surface using its corresponding camera projection matrix.
- Blending: For each visible surface point on the mesh, contributing pixels are identified from multiple views based on visibility and surface normal alignment. Following established methods [6, 37], contributions are blended using weighted averaging, where views with better normal alignment receive higher weights.
- Baking: The aggregated surface colors are then baked into a 2D UV texture map using the mesh's predefined UV parameterization [15]. Each mesh triangle is mapped to UV space, and pixel-wise colors from overlapping views are accumulated and interpolated.
- Challenge: This baking process frequently results in incomplete regions (holes, seams) in the UV map, especially in areas that are self-occluded or partially observed from all available viewpoints.
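The normal-weighted blending for a single surface point can be sketched as follows; this is a simplification, since real bakers operate per texel over the whole UV map and include explicit visibility tests.

```python
import numpy as np

def blend_point_color(colors, view_dirs, normal, eps=1e-6):
    """Blend per-view colors for one visible surface point, weighting each view by
    how well its direction aligns with the surface normal.
    colors:    (V, 3) RGB samples from V views
    view_dirs: (V, 3) unit vectors from the point toward each camera
    normal:    (3,)   unit surface normal"""
    weights = np.clip(view_dirs @ normal, 0.0, None)   # back-facing views get zero weight
    total = weights.sum()
    if total < eps:
        return None                                    # unseen point: left as a hole for Seed3D-UV
    return (weights[:, None] * colors).sum(axis=0) / total
```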
Coordinate-Conditioned UV Diffusion Transformer:
To complete these partial UV textures, Seed3D-UV uses a coordinate-conditioned DiT for inpainting.
- Geometric Conditioning: Unlike standard image inpainting, the model leverages UV coordinate information to maintain geometric consistency. UV coordinate maps are encoded as positional tokens and integrated into the DiT's visual stream alongside the texture tokens (the incomplete albedo and MR maps). This conditioning guides the model to respect the mesh's UV parameterization, ensuring completions align with mesh boundaries and existing texture content.
- Inference: During inference, the diffusion process is conditioned on the partial UV texture obtained from multi-view baking. The model generates plausible texture in the occluded regions by understanding both the observed pixels and their spatial relationships encoded in the UV coordinates.
- Benefits: This coordinate-guided conditioning produces textures with sharper transitions at UV boundaries and better alignment with mesh geometry compared to naive inpainting methods.
Final Integration and Export:
The completed UV textures (albedo, metallic, roughness) from Seed3D-UV are integrated with the mesh. The final textured mesh, with complete PBR UV maps, is exported in standard 3D formats (e.g., OBJ, GLB) for downstream applications like rendering, animation, or scene creation.
4.2. Model Training
The training of Seed3D 1.0 models is structured in progressive stages for both geometry and texture components.
4.2.1. Geometry Training
The Seed3D-DiT training employs a three-stage progressive strategy:
- Pre-Training (PT):
  - The model is trained from scratch on low-resolution representations using 256 latent tokens.
  - This initial stage establishes foundational shape generation capabilities and learns the cross-modal alignment between image conditions and 3D shapes.
  - It uses the full training dataset, covering diverse object categories and viewing angles, to ensure robust generalization.
- Continued Training (CT):
  - Building on the pre-trained model, the latent sequence length is progressively increased to 4096 tokens, enabling the model to capture finer geometric details and surface structures.
  - Training continues on the full dataset, augmented with enhanced data augmentation techniques to maintain generalization at higher resolutions.
- Supervised Fine-Tuning (SFT):
  - After CT, the model is fine-tuned on a curated high-quality subset of the data.
  - This stage uses reduced learning rates to further improve generation quality, yielding 3D objects with enhanced geometric accuracy and intricate surface details.
4.2.2. Texture Training
All texture generation models (Seed3D-MV, Seed3D-PBR, Seed3D-UV) are trained from scratch using a two-stage approach:
- Stage 1 (Full Dataset Training):
  - The models are trained on the full dataset to learn comprehensive multi-view consistency (for Seed3D-MV) and material decomposition (for Seed3D-PBR and Seed3D-UV). This stage establishes a broad understanding of texture and material properties across diverse objects.
- Stage 2 (Fine-tuning on High-Quality Subset):
  - The models are then fine-tuned on a curated high-quality subset of the data, similar to SFT for geometry.
  - This fine-tuning stage uses reduced learning rates to further improve output quality, such as sharpness and realism of textures, while maintaining robust generalization across diverse textures and materials.
5. Experimental Setup
The experimental setup for Seed3D 1.0 encompasses the extensive data preprocessing pipeline, the robust data engineering infrastructure, and the specific configurations for model training and evaluation.
5.1. Data
The quality, diversity, and scale of 3D training data are paramount for the performance of 3D generation models. Seed3D 1.0 addresses the inherent complexity and heterogeneity of 3D data through a comprehensive preprocessing pipeline and scalable infrastructure.
5.1.1. Data Preprocessing
A multi-stage preprocessing pipeline systematically transforms raw, heterogeneous 3D asset collections into high-quality, diverse, and consistent datasets.
The following figure (Figure 5 from the original paper) shows the data preprocessing pipeline:
Figure 5: The Seed3D 1.0 data processing pipeline, showing the multi-stage flow from heterogeneous 3D data sources to deployed training sets. The automated preprocessing includes steps such as four-view rendering, texture detection, and SDF sampling.
-
Diversity-Oriented Data Sourcing:
- Data is acquired from diverse sources, including public repositories, licensed marketplaces, and synthetic generation platforms, prioritizing ethical and legal sourcing.
- The aim is to maximize coverage across critical dimensions such as geometric complexity, mesh topology, object categories (e.g., characters, furniture, architecture), artistic styles, and material properties.
- Raw collections often contain corrupted geometries, which are handled by subsequent stages.
-
Format Standardization and Conversion:
- Raw 3D assets (e.g., OBJ, FBX, GLTF, PLY) are converted into a unified GLB format.
- This involves extracting geometry and material information and normalizing coordinate systems. GLB (GL Transmission Format Binary) provides compact binary encoding and widespread compatibility.
-
Geometric Data Deduplication:
- To avoid training bias and improve dataset diversity, duplicate or near-duplicate meshes are removed.
- Method: A visual similarity-based deduplication pipeline is used:
- Each asset is rendered from four canonical viewpoints to generate RGB images and normal maps.
- A pretrained vision encoder (DINOv2 [43]) extracts compact feature representations from both RGB and normal maps; these features are concatenated across all views to form a final mesh representation.
- FAISS [23] (Facebook AI Similarity Search) is used for efficient large-scale nearest-neighbor search.
- Dual-threshold filtering based on cosine similarity and L2 distance balances effective duplicate removal with the preservation of legitimate geometric variations. A sketch of this step follows below.
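A hedged sketch of the dual-threshold deduplication step, assuming per-mesh descriptors have already been built by concatenating the four-view DINOv2 features; the threshold values are placeholders, not the paper's settings.

```python
import numpy as np
import faiss

def find_duplicate_pairs(features, cos_thresh=0.97, l2_thresh=0.5):
    """Dual-threshold near-duplicate detection over per-mesh descriptors (N, D)."""
    feats = np.ascontiguousarray(features, dtype="float32")
    normed = feats / np.linalg.norm(feats, axis=1, keepdims=True)
    index = faiss.IndexFlatIP(normed.shape[1])     # inner product == cosine on normalized vectors
    index.add(normed)
    sims, ids = index.search(normed, 2)            # k=2: nearest neighbor besides the item itself
    pairs = []
    for i, (sim, j) in enumerate(zip(sims[:, 1], ids[:, 1])):
        l2 = np.linalg.norm(feats[i] - feats[int(j)])
        if sim > cos_thresh and l2 < l2_thresh:    # both thresholds must agree
            pairs.append((i, int(j)))
    return pairs
```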
-
Mesh Orientation Canonization:
- Consistent mesh orientation is critical for effective 3D model training.
- Method: Automated orientation canonization aligns 3D assets to a standard pose:
- The same four-view renderings from the deduplication stage are used.
- Visual features are fed into a trained orientation classifier that predicts the canonical orientation.
- The predicted transformation is then applied to align the mesh, ensuring that geometrically similar objects have consistent spatial alignment.
-
Quality Filtering with Aesthetic Scoring and VLM Assessment:
- Raw collections often contain low-quality assets (poor geometry, unrealistic proportions, artifacts). A two-stage filtering system is implemented:
- Stage 1 (Aesthetic Scoring): An open-source model [48] evaluates the visual appeal of the four-view renderings, filtering out assets below a predefined threshold.
- Stage 2 (VLM-based Assessment): A fine-tuned VLM [3] performs a comprehensive assessment across three dimensions:
  - Quality classification: unusable, usable, or high-quality.
  - Category identification: characters, vehicles, furniture, etc.
  - Data type detection: synthetic vs. real-world scanned vs. scene-level data.
- Final Filtering: Only assets with acceptable aesthetic scores and "usable-or-higher" quality ratings are retained. Real-world scanned and scene-level data are excluded to ensure the dataset consists of high-quality individual 3D objects suitable for the model's training objective.
-
Multi-View Image Rendering:
- To bridge 3D geometry and 2D conditioning, high-quality multi-view rendered images are generated for each processed mesh using Blender's Cycles rendering engine [7].
- Rendering Settings: Physically-based rendering is used with diverse lighting conditions, camera viewpoints, and material assignments to create comprehensive visual representations.
  - For Geometry Generation: Reference images are rendered from randomly sampled viewpoints under stochastic illumination (30% point lights, 70% HDR environment maps).
  - For Multi-View Generation and PBR Estimation: Random HDRI (High Dynamic Range Image) environments from a curated library are sampled, and normalized 3D objects are rendered from orthogonal viewpoints.
- Outputs: Rendered RGB images, normal maps, and camera coordinate maps (CCMs). For PBR training, albedo and metallic-roughness maps are also rendered, along with one fully-lit reference view for appearance context.
- For UV Texture Synthesis: 3D meshes are unwrapped into UV layouts using xatlas [64], and albedo and CCMs are baked using Blender's baking system.
-
Mesh Remeshing:
- To enable valid SDF extraction for VAE training, arbitrary raw meshes are converted into watertight representations.
- Method: A CUDA-based remeshing pipeline efficiently removes internal structures while preserving external surface detail through four stages:
  - Voxelization: fast raster-like kernels [49] with boundary marking.
  - Signed Distance Floodfill: classifies interior and exterior regions.
  - Mesh Extraction: uses a threshold to preserve thin structures.
  - Final Mesh Generation: via Dual Marching Cubes [47], referencing the original mesh for zero-crossing normals.
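For illustration, extracting a watertight mesh from a sampled SDF grid can be done with an off-the-shelf marching-cubes implementation; the paper's CUDA pipeline uses Dual Marching Cubes [47], so the standard algorithm below is only a stand-in.

```python
import trimesh
from skimage import measure

def sdf_grid_to_mesh(sdf, voxel_size):
    """Extract the zero level set of a sampled SDF grid as a triangle mesh.
    sdf: (R, R, R) array of signed distances, negative inside the surface."""
    verts, faces, normals, _ = measure.marching_cubes(
        sdf, level=0.0, spacing=(voxel_size, voxel_size, voxel_size)
    )
    return trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
```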
5.1.2. Data Engineering Infrastructure
A comprehensive data engineering infrastructure ensures scalability, traceability, and seamless integration throughout the data pipeline.
The following figure (Figure 6 from the original paper) illustrates the data platform architecture:
Figure 6: The Seed3D 1.0 data platform architecture, comprising modules for data profiling, asset preview, data validation, and data packing, with distributed pipelines and heterogeneous elastic compute enabling efficient data processing.
-
Data Management and Indexing:
- All metadata (source, format, processing status, storage paths) for 3D assets is indexed in a MongoDB [40] database.
- A custom object-relational mapping (ORM) layer provides a standardized API for asset registration, updates, and querying, decoupling preprocessing logic from backend storage.
-
Storage and Visualization Platform:
- Raw files and intermediate outputs (rendered images, VLM annotations) are stored in a scalable object storage system; asset references are maintained in MongoDB.
- A web-based data platform provides visual inspection tools (filtering, tagging, thumbnail browsing, and a WebGL [24]-based 3D viewer) for interactive exploration and curation.
- For training data, processed assets (SDF samples, VAE latent codes) are packed into training-ready bundles and stored in a distributed HDFS [51] cluster. A dedicated data packing module allows users to curate and export structured datasets.
-
Distributed Processing with Ray Data:
- Ray Data [41] is used to build a scalable processing pipeline for 3D operations (VLM-based quality assessment, multi-view rendering, mesh remeshing).
- Heterogeneous Compute: A custom Kubernetes [9] operator launches CPU and GPU pods with appropriate resource allocation for each processing stage (e.g., rendering needs CPU, remeshing needs GPU).
- Cost Efficiency: Ray Data's elasticity and fault tolerance make it possible to use preemptible resources from cluster idle capacity; the system automatically launches replacement pods and reschedules tasks if instances are reclaimed.
- Fault Tolerance: Strategic checkpointing after each major processing stage enables pipeline restarts from intermediate points, minimizing computational waste during disruptions.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, a complete explanation is provided.
5.2.1. Geometry Generation Metrics
The paper uses ULIP [59] and Uni3D [71] models to measure similarity between generated meshes and input images, utilizing VLM-generated captions [3] as text conditioning. For each mesh, 8,192 surface points are sampled.
-
ULIP-I (Image-to-Image Similarity via ULIP features):
- Conceptual Definition: Measures the similarity between the generated 3D mesh and the input 2D image in the feature space of the ULIP (Unified Language-Image Pre-training) model. A higher value indicates better alignment between the generated 3D object and the source image.
- Mathematical Formula: The paper does not give an explicit formula; the score is typically the cosine similarity between the ULIP embedding of the input image and the ULIP embedding of the mesh's sampled surface point cloud: $\text{ULIP-I} = \cos\big(E_{\text{img}}(I),\ E_{\text{3D}}(P_M)\big)$.
- Symbol Explanation:
  - $I$: the input 2D image.
  - $P_M$: the point cloud sampled from the generated 3D mesh $M$ (8,192 surface points).
  - $E_{\text{img}}(\cdot)$: the ULIP image encoder.
  - $E_{\text{3D}}(\cdot)$: the ULIP 3D (point cloud) encoder.
  - $\cos(\cdot,\cdot)$: cosine similarity between two non-zero vectors in an inner-product space.
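All four alignment scores (ULIP-I/T and Uni3D-I/T below) reduce to a cosine similarity between embeddings from matching encoders; a minimal sketch, with the encoder outputs assumed to be (B, D) tensors:

```python
import torch.nn.functional as F

def alignment_score(query_emb, mesh_emb):
    """Cosine similarity between a query embedding (image or caption) and the mesh
    embedding, both produced by the same model family (ULIP or Uni3D)."""
    return F.cosine_similarity(query_emb, mesh_emb, dim=-1)
```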
-
ULIP-T (Text-to-Image/Mesh Similarity via ULIP features):
- Conceptual Definition: Measures the similarity between the generated 3D mesh and a text description (e.g., a VLM-generated caption) in the ULIP feature space. A higher value indicates that the generated 3D object aligns well with the semantic content described by the text.
- Mathematical Formula: Analogous to ULIP-I, using the ULIP text encoder: $\text{ULIP-T} = \cos\big(E_{\text{txt}}(c),\ E_{\text{3D}}(P_M)\big)$.
- Symbol Explanation:
  - $c$: the VLM-generated text caption describing the object.
  - $P_M$: the point cloud sampled from the generated 3D mesh $M$.
  - $E_{\text{txt}}(\cdot)$: the ULIP text encoder.
  - $E_{\text{3D}}(\cdot)$: the ULIP 3D (point cloud) encoder.
  - $\cos(\cdot,\cdot)$: as defined above.
-
Uni3D-I (Image-to-Image Similarity via Uni3D features):
- Conceptual Definition: Same as ULIP-I, but computed in the feature space of the Uni3D model. Higher values indicate better image-to-mesh alignment.
- Mathematical Formula: Analogous to ULIP-I, replacing the ULIP encoders with Uni3D encoders: $\text{Uni3D-I} = \cos\big(E^{\text{Uni3D}}_{\text{img}}(I),\ E^{\text{Uni3D}}_{\text{3D}}(P_M)\big)$.
- Symbol Explanation:
  - $I$: the input 2D image.
  - $P_M$: the point cloud sampled from the generated 3D mesh.
  - $E^{\text{Uni3D}}_{\text{img}}(\cdot)$, $E^{\text{Uni3D}}_{\text{3D}}(\cdot)$: the Uni3D image and 3D (point cloud) encoders.
  - $\cos(\cdot,\cdot)$: as defined above.
- Uni3D-T (Text-to-Mesh Similarity via Uni3D features):
  - Conceptual Definition: Same as ULIP-T, but uses the feature space of the Uni3D model to assess text-to-mesh similarity. Higher values indicate better alignment (a small sketch of this similarity computation follows the list).
  - Mathematical Formula: Analogous to ULIP-T, replacing the ULIP encoders with Uni3D encoders:
    $$\text{Uni3D-T} = \cos\!\big(E^{\text{Uni3D}}_{\text{pc}}(P_M),\, E^{\text{Uni3D}}_{\text{txt}}(c)\big)$$
  - Symbol Explanation:
    - $c$: The VLM-generated text caption.
    - $M$: The generated 3D mesh; $P_M$ denotes its sampled surface points.
    - $E^{\text{Uni3D}}_{\text{txt}}(\cdot)$: The text encoder of the Uni3D model.
    - $E^{\text{Uni3D}}_{\text{pc}}(\cdot)$, $\cos(\cdot, \cdot)$: As defined above.
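To make the four geometry metrics concrete, the sketch below computes the underlying cosine similarity; the encoders are stand-ins only (the real ULIP/Uni3D checkpoints and their preprocessing are not shown).

```python
import torch
import torch.nn.functional as F

embed_dim = 512
pc_proj = torch.nn.Linear(3, embed_dim)  # stand-in projection, not a real encoder

def encode_point_cloud(points: torch.Tensor) -> torch.Tensor:
    # Stand-in for the ULIP/Uni3D point-cloud encoder: project and pool.
    return F.normalize(pc_proj(points).mean(dim=0, keepdim=True), dim=-1)

def encode_image(image: torch.Tensor) -> torch.Tensor:
    # Stand-in for the ULIP/Uni3D image encoder: flatten and truncate.
    return F.normalize(image.flatten()[:embed_dim].unsqueeze(0), dim=-1)

points = torch.randn(8192, 3)     # surface points sampled from the generated mesh
image = torch.randn(3, 224, 224)  # the conditioning image

# ULIP-I / Uni3D-I style score: cosine similarity of the two embeddings
# (ULIP-T / Uni3D-T replace the image embedding with a text embedding).
score = F.cosine_similarity(encode_point_cloud(points), encode_image(image)).item()
print(score)
```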
5.2.2. Texture Generation Metrics
The paper employs several metrics to evaluate multi-view and PBR material generation, primarily focusing on CLIP [44]-based metrics for perceptual similarity.
- CLIP-FID (CLIP-based Fréchet Inception Distance) [19]:
  - Conceptual Definition: A widely used metric to assess the quality of generated images by comparing the distribution of features (in this case, CLIP features) of generated samples against real samples. A lower CLIP-FID score indicates that the generated images are more similar to the real images in terms of perceptual quality and diversity.
  - Mathematical Formula: The Fréchet Inception Distance between two multivariate Gaussian distributions $\mathcal{N}(\mu_r, \Sigma_r)$ and $\mathcal{N}(\mu_g, \Sigma_g)$ is given by:
    $$\text{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left(\Sigma_r + \Sigma_g - 2\left(\Sigma_r \Sigma_g\right)^{1/2}\right)$$
    For CLIP-FID, $(\mu_r, \Sigma_r)$ and $(\mu_g, \Sigma_g)$ are the means and covariances of feature embeddings extracted by a CLIP image encoder from the real (target) and generated image sets, respectively.
  - Symbol Explanation:
    - $\mu_r$: Mean feature vector of the real image set in CLIP's embedding space.
    - $\Sigma_r$: Covariance matrix of the real image set in CLIP's embedding space.
    - $\mu_g$: Mean feature vector of the generated image set in CLIP's embedding space.
    - $\Sigma_g$: Covariance matrix of the generated image set in CLIP's embedding space.
    - $\lVert \cdot \rVert_2^2$: Squared Euclidean distance (L2 norm).
    - $\operatorname{Tr}(\cdot)$: Trace of a matrix (sum of diagonal elements).
    - $(\cdot)^{1/2}$: Matrix square root.
- LPIPS (Learned Perceptual Image Patch Similarity) [68]:
  - Conceptual Definition: Measures the perceptual similarity between two images by comparing their feature activations in a pre-trained deep neural network (e.g., VGG, AlexNet, SqueezeNet). It is designed to correlate well with human perception of image similarity: a lower LPIPS score indicates that two images are perceived as more similar by humans.
  - Mathematical Formula:
    $$\text{LPIPS}(x, x_0) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \big\lVert w_l \odot \big(\hat{\phi}^{l}_{hw}(x) - \hat{\phi}^{l}_{hw}(x_0)\big) \big\rVert_2^2$$
  - Symbol Explanation:
    - $x$: The generated image.
    - $x_0$: The reference (ground-truth) image.
    - $l$: Index over the selected layers of the pre-trained network.
    - $\hat{\phi}^{l}(\cdot)$: The unit-normalized feature activations from layer $l$ of the pre-trained network.
    - $H_l, W_l$: Height and width of the feature map at layer $l$.
    - $w_l$: Learned weights for layer $l$, applied channel-wise ($\odot$ denotes channel-wise scaling).
    - $\lVert \cdot \rVert_2^2$: Squared L2 norm (Euclidean distance) used to compare feature activations.
- CMMD (CLIP Maximum Mean Discrepancy):
  - Conceptual Definition: A non-parametric statistical distance between two probability distributions. For CMMD, it quantifies how different the distribution of CLIP features of generated images is from that of real images. A lower CMMD implies the generated distribution is closer to the real distribution. It is often preferred when FID's Gaussian assumption does not hold well.
  - Mathematical Formula: The squared MMD between two distributions $P$ and $Q$ under a kernel $k$ is:
    $$\text{MMD}^2(P, Q) = \mathbb{E}_{x, x' \sim P}\big[k(x, x')\big] + \mathbb{E}_{y, y' \sim Q}\big[k(y, y')\big] - 2\,\mathbb{E}_{x \sim P,\, y \sim Q}\big[k(x, y)\big]$$
    For CMMD, $P$ and $Q$ are the distributions of CLIP image-encoder features for real and generated images, respectively, and $k$ is typically a Radial Basis Function (RBF) kernel or a linear kernel applied to those features.
  - Symbol Explanation:
    - $P, Q$: The probability distributions of real and generated image features (from CLIP).
    - $x, x'$: Samples from the real feature distribution $P$.
    - $y, y'$: Samples from the generated feature distribution $Q$.
    - $\mathbb{E}[\cdot]$: Expectation operator.
    - $k(\cdot, \cdot)$: A kernel function measuring similarity between two feature vectors.
- CLIP-I (CLIP Image Similarity):
  - Conceptual Definition: Measures the direct similarity between the CLIP embeddings of a generated image and its corresponding reference image, assessing how well the generated output retains the visual characteristics of the reference. A higher CLIP-I score indicates better retention of visual similarity (a combined sketch of the four texture metrics follows this list).
  - Mathematical Formula: Let $E_{\text{CLIP}}(x)$ be the CLIP image embedding of the generated image $x$ and $E_{\text{CLIP}}(x_0)$ that of the reference image $x_0$:
    $$\text{CLIP-I} = \cos\!\big(E_{\text{CLIP}}(x),\, E_{\text{CLIP}}(x_0)\big) = \frac{E_{\text{CLIP}}(x) \cdot E_{\text{CLIP}}(x_0)}{\lVert E_{\text{CLIP}}(x) \rVert \, \lVert E_{\text{CLIP}}(x_0) \rVert}$$
  - Symbol Explanation:
    - $x$: The generated image.
    - $x_0$: The reference (ground-truth) image.
    - $E_{\text{CLIP}}(\cdot)$: The image encoder of the CLIP model, producing a feature embedding for an image.
    - $\cos(\cdot, \cdot)$: Cosine similarity between two non-zero feature vectors.
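The sketch below computes the four texture metrics from precomputed features and image tensors; the random inputs, kernel bandwidth, and the choice of the lpips package with a VGG backbone are assumptions rather than the paper's exact evaluation code.

```python
import numpy as np
import torch
import lpips  # pip install lpips
from scipy import linalg

# CLIP-FID: Frechet distance between Gaussians fitted to CLIP features.
def clip_fid(real_feats: np.ndarray, gen_feats: np.ndarray) -> float:
    mu_r, sigma_r = real_feats.mean(axis=0), np.cov(real_feats, rowvar=False)
    mu_g, sigma_g = gen_feats.mean(axis=0), np.cov(gen_feats, rowvar=False)
    diff = mu_r - mu_g
    covmean, _ = linalg.sqrtm(sigma_r @ sigma_g, disp=False)  # matrix square root
    return float(diff @ diff + np.trace(sigma_r + sigma_g - 2.0 * covmean.real))

# CMMD: squared MMD with an RBF kernel over CLIP features (biased estimator).
def cmmd(real_feats: np.ndarray, gen_feats: np.ndarray, sigma: float = 10.0) -> float:
    x, y = torch.as_tensor(real_feats), torch.as_tensor(gen_feats)
    rbf = lambda a, b: torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma**2))
    return float(rbf(x, x).mean() + rbf(y, y).mean() - 2 * rbf(x, y).mean())

# CLIP-I: mean cosine similarity between paired CLIP embeddings.
def clip_i(gen_feats: np.ndarray, ref_feats: np.ndarray) -> float:
    g = torch.nn.functional.normalize(torch.as_tensor(gen_feats), dim=-1)
    r = torch.nn.functional.normalize(torch.as_tensor(ref_feats), dim=-1)
    return float((g * r).sum(dim=-1).mean())

# Placeholder CLIP features shaped (N, D); in practice these come from a CLIP image encoder.
real, fake = np.random.randn(128, 512), np.random.randn(128, 512)
print(clip_fid(real, fake), cmmd(real, fake), clip_i(fake, real))

# LPIPS via the lpips package (inputs are RGB tensors scaled to [-1, 1]).
lpips_fn = lpips.LPIPS(net="vgg")
img_gen = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder generated render
img_ref = torch.rand(1, 3, 256, 256) * 2 - 1  # placeholder reference render
with torch.no_grad():
    print(lpips_fn(img_gen, img_ref).item())  # lower = more perceptually similar
```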
5.3. Baselines
5.3.1. Geometry Generation Baselines
Seed3D 1.0's 1.5B-parameter Seed3D-DiT is compared against several state-of-the-art open-source methods for single-image to 3D mesh generation:
- TRELLIS [58]
- TripoSG [32]
- Step1X-3D [31]
- Direct3D-S2 [57]
- Hunyuan3D-2.1 [22] (a larger 3B-parameter model)
These baselines represent recent advancements in generative 3D models, often using latent diffusion, transformer architectures, and various conditioning strategies to generate meshes from 2D inputs.
5.3.2. Texture Generation Baselines
For multi-view generation and PBR estimation, Seed3D 1.0 is compared against:
- MVPainter [50]
- Hunyuan3D-Paint [70]
- UniTEX [34]
- MV-Adapter [21]
- Pandora3d [61]
- Hunyuan3D-2.1 [22]
These baselines are prominent models for generating textures, multi-view images, or PBR materials for 3D assets, often employing diffusion models and multi-view consistency techniques. The comparison is conducted using both image and geometry conditioning where applicable.
5.4. Training Infrastructure
Large-scale diffusion model training requires efficient computational resource utilization and robust failure handling. The authors developed a comprehensive training infrastructure incorporating hardware-aware optimizations, memory-efficient parallelism, and fault tolerance.
5.4.1. Kernel Fusion
- To maximize GPU utilization, torch.compile is integrated with custom CUDA kernels for performance-critical operations. Profiling identified memory-bound operations as the primary bottleneck.
- Multiple consecutive element-wise operations are fused into unified kernels, which reduces memory access overhead and improves arithmetic intensity.
- Optimized libraries such as FlashAttention [13] (for attention computation) and Apex fused optimizers (for weight updates) are employed to substantially reduce computational costs. Together, these optimizations reduce GPU idle time and improve end-to-end training throughput (a minimal fusion sketch follows this list).
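As a minimal illustration of what kernel fusion buys (the function and shapes are assumptions, not the paper's kernels), torch.compile can fuse a chain of element-wise operations into a single kernel:

```python
import torch

def gated_residual(x: torch.Tensor, gate: torch.Tensor, residual: torch.Tensor) -> torch.Tensor:
    # Three element-wise operations that eager mode would launch as separate
    # memory-bound kernels; the compiler can fuse them into one.
    return torch.nn.functional.silu(gate) * x + residual

compiled_gated_residual = torch.compile(gated_residual)

device = "cuda" if torch.cuda.is_available() else "cpu"
x = torch.randn(4, 4096, device=device)
gate, residual = torch.randn_like(x), torch.randn_like(x)
out = compiled_gated_residual(x, gate, residual)  # compiled (and fused) on first call
```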
5.4.2. Parallelism Strategy
- To scale diffusion model training across multiple GPUs, Hybrid Sharded Data Parallelism (HSDP) [69] is used. HSDP applies Fully Sharded Data Parallelism (FSDP) within each node (multiple GPUs on one machine) and replicates the sharded model group across nodes (multiple machines) with standard data parallelism.
- This hierarchical approach achieves memory-efficient sharding of weights and optimizer states while minimizing cross-node communication, allowing effective scaling to large cluster configurations with reduced performance degradation (a minimal HSDP sketch follows this list).
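A minimal HSDP sketch with PyTorch FSDP's HYBRID_SHARD strategy is shown below; it assumes launch under torchrun with NCCL available, and the stand-in transformer layer is not the paper's actual DiT block.

```python
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy

# Assumes `torchrun --nproc_per_node=<gpus> --nnodes=<nodes> ...` set up the environment.
dist.init_process_group("nccl")
torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())

# Stand-in for a diffusion-transformer block.
model = torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True).cuda()

# HYBRID_SHARD: shard parameters/gradients/optimizer state within each node (FSDP)
# and replicate across nodes (data parallelism), keeping all-gathers node-local.
fsdp_model = FSDP(model, sharding_strategy=ShardingStrategy.HYBRID_SHARD)
```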
5.4.3. Multi-Level Activation Checkpointing
- Memory constraints are a fundamental bottleneck for large diffusion transformers. While full gradient checkpointing [11] alleviates memory pressure, it introduces substantial recomputation overhead.
- To balance this trade-off, Multi-Level Activation Checkpointing (MLAC) [60] is employed. MLAC selectively checkpoints activations based on their recomputation cost, offloading high-cost tensors to CPU memory with asynchronous prefetching to overlap memory transfers with computation. This achieves significant memory savings with minimal performance impact compared to full checkpointing (a simplified checkpointing sketch follows this list).
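The sketch below shows plain selective activation checkpointing with PyTorch; it is a simplification of MLAC, which additionally offloads high-cost activations to CPU memory with asynchronous prefetching. The block stack and the every-other-block policy are assumptions.

```python
import torch
from torch.utils.checkpoint import checkpoint

# A hypothetical stack of transformer blocks.
blocks = torch.nn.ModuleList(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True) for _ in range(8)
)

def forward_with_selective_checkpointing(x: torch.Tensor) -> torch.Tensor:
    for i, block in enumerate(blocks):
        if i % 2 == 0:
            # Recompute this block's activations during backward instead of
            # storing them, trading extra compute for lower peak memory.
            x = checkpoint(block, x, use_reentrant=False)
        else:
            x = block(x)
    return x

x = torch.randn(2, 64, 512, requires_grad=True)
forward_with_selective_checkpointing(x).sum().backward()
```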
5.4.4. Training Stability and Fault Tolerance
- Large-scale distributed training is susceptible to hardware failures and communication disruptions. A comprehensive stability framework is implemented:
  - Proactive Failure Prevention: Machine health checks are performed before job launch to eliminate faulty nodes and potential stragglers.
  - Reactive Recovery: Flight-recorder capabilities track NCCL [42] (NVIDIA Collective Communications Library) communication patterns to identify problematic machines upon failures.
  - Centralized Monitoring: A system aggregates real-time performance metrics across the cluster, including the Effective Training Time Ratio (ETTR), communication patterns, and GPU utilization, providing comprehensive visibility for rapid diagnosis and resolution of bottlenecks (a toy ETTR calculation follows this list).
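For reference, the Effective Training Time Ratio can be thought of as the fraction of wall-clock job time spent on productive training; the toy calculation below uses made-up numbers and is not the paper's instrumentation.

```python
def effective_training_time_ratio(productive_seconds: float, total_job_seconds: float) -> float:
    # Productive time excludes restarts, checkpoint reloads, and stalls.
    return productive_seconds / total_job_seconds

# Example: 86,000 s of useful training out of a 90,000 s job -> ETTR ≈ 0.956.
print(effective_training_time_ratio(86_000, 90_000))
```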
6. Results & Analysis
The evaluation of Seed3D 1.0 includes quantitative benchmarks, qualitative analysis, and user studies, demonstrating its generation quality across geometry and texture.
6.1. Core Results Analysis
6.1.1. Geometry Generation
Quantitative Results:
The geometry generation performance was evaluated on a test set of 1,000 images covering diverse object categories and artistic styles. The metrics ULIP-T, ULIP-I, Uni3D-T, and Uni3D-I were used, which measure text-to-mesh and image-to-mesh similarity in the ULIP and Uni3D feature spaces. Higher scores indicate better performance.
The following are the results from Table 1 of the original paper:
| Models | ULIP-T (↑) | ULIP-I (↑) | Uni3D-T (↑) | Uni3D-I (↑) |
| --- | --- | --- | --- | --- |
| TRELLIS [58] | 0.0951 ± 0.0608 | 0.1686 ± 0.0826 | 0.2786 ± 0.0671 | 0.3754 ± 0.0713 |
| TripoSG [32] | 0.1312 ± 0.0574 | 0.2460 ± 0.0554 | 0.2657 ± 0.0652 | 0.3870 ± 0.0671 |
| Step1X-3D [31] | 0.1316 ± 0.0573 | 0.2441 ± 0.0527 | 0.2709 ± 0.0625 | 0.3837 ± 0.0687 |
| Direct3D-S2 [57] | 0.1203 ± 0.0555 | 0.2191 ± 0.0572 | 0.2571 ± 0.0582 | 0.3497 ± 0.0697 |
| Hunyuan3D-2.1 [22] | 0.1283 ± 0.0580 | 0.2376 ± 0.0593 | 0.2575 ± 0.0672 | 0.3709 ± 0.0769 |
| Seed3D 1.0 | 0.1319 ± 0.0572 | 0.2536 ± 0.0432 | 0.2800 ± 0.0634 | 0.3999 ± 0.0610 |
As shown in the table, Seed3D 1.0 achieves the highest scores across all metrics (ULIP-T, ULIP-I, Uni3D-T, Uni3D-I), indicating its superior performance in generating 3D geometry that is well-aligned with both input images and their semantic descriptions. Notably, Seed3D 1.0, with its 1.5B parameters, outperforms Hunyuan3D-2.1, which is a larger 3B-parameter model. This highlights the effectiveness of Seed3D 1.0's model architecture and training approach. The strong ULIP-I and Uni3D-I scores confirm excellent alignment between the generated geometry and the input images.
Qualitative Analysis: The geometry generation performance of Seed3D 1.0 is further supported by qualitative results.
The following figure (Figure 8 from the original paper) shows qualitative comparisons of geometry generation:
The figure shows a qualitative comparison of geometry generation between Seed3D 1.0 and several baseline methods; compared with the other methods, the meshes generated by Seed3D 1.0 exhibit finer geometric detail and better structural accuracy.
As shown in Figure 8, Seed3D 1.0 consistently generates superior results compared to baseline methods in terms of geometric detail preservation, structural accuracy, and overall shape fidelity. Visual inspection confirms that Seed3D 1.0 captures intricate features, such as the complex structures of architectural elements, the fine textures of woven baskets, and the precise geometry of mechanical objects like bicycles, all crucial for realistic simulation.
6.1.2. Texture Generation
Quantitative Results:
Texture generation models (Seed3D-MV for multi-view and Seed3D-PBR for PBR material estimation) were evaluated against several open-source methods.
The following are the results from Table 2 of the original paper, showing quantitative comparison for multi-view generation:
| Method | CLIP-FID (↓) | CMMD (↓) | CLIP-I (↑) | LPIPS (↓) |
| --- | --- | --- | --- | --- |
| MVPainter [50] | 31.7290 | 0.3254 | 0.8903 | 0.1420 |
| Hunyuan3D-Paint [70] | 18.8625 | 0.0825 | 0.9206 | 0.1162 |
| UniTEX [34] | 18.3285 | 0.0873 | 0.9230 | 0.1078 |
| MV-Adapter [21] | 11.6920 | 0.0312 | 0.9399 | 0.1012 |
| Seed3D 1.0 | 9.9752 | 0.0231 | 0.9484 | 0.0891 |
Table 2 shows that Seed3D-MV achieves state-of-the-art performance across all multi-view generation metrics (CLIP-FID, CMMD, CLIP-I, LPIPS). Lower scores are better for FID, CMMD, and LPIPS, while higher is better for CLIP-I. Seed3D 1.0 consistently outperforms all baselines, demonstrating its ability to generate high-quality and perceptually consistent multi-view images.
The following are the results from Table 3 of the original paper, showing quantitative comparison for PBR material generation:
| Method | CLIP-FID (↓) | CMMD (↓) | CLIP-I (↑) | LPIPS (↓) |
| --- | --- | --- | --- | --- |
| Pandora3d [61] | 37.7028 | 0.3650 | 0.8868 | 0.1229 |
| MVPainter [50] | 40.6763 | 0.4145 | 0.8724 | 0.1274 |
| Hunyuan3D-2.1 [22] | 36.3484 | 0.3026 | 0.8828 | 0.1318 |
| Seed3D 1.0 | 31.5984 | 0.2795 | 0.9000 | 0.1153 |
| Seed3D 1.0* | 23.3919 | 0.2191 | 0.9310 | 0.0843 |
Table 3 presents the PBR estimation results. Seed3D-PBR demonstrates the best performance among all methods when using multi-view images generated by Seed3D-MV as input. The table also reports results for Seed3D 1.0*, which uses ground-truth multi-view images as input. This variant represents the upper-bound performance when PBR estimation is decoupled from potential errors in multi-view generation, and its markedly better scores show that higher-quality inputs yield better materials. This implies that further improvements in Seed3D-MV would directly translate to even better PBR material quality.
Qualitative Analysis: Qualitative comparisons highlight Seed3D 1.0's superior texture and material quality.
The following figure (Figure 9 from the original paper) provides qualitative comparisons of texture generation:
The figure compares the quality of 3D assets generated by different models. Red boxes highlight improvements in detail preservation, texture sharpness, and material quality; best viewed zoomed in at 8×.
As shown in Figure 9, Seed3D 1.0 shows notable improvements in preserving fine-grained details from reference images and rendering clear text elements. It maintains strong alignment with reference images, especially for complex visual features like facial structures and textual patterns, which baseline methods often blur or lose. The generated PBR materials exhibit realistic surface properties, including appropriate metallic reflectance and skin subsurface scattering, contributing to photorealistic rendering results. For instance, in the steampunk clock example (third row), Seed3D 1.0 maintains sharp clarity for fine textual elements (numbers on clock face, mechanical components), demonstrating exceptional preservation of high-frequency texture details crucial for realistic 3D generation.
6.1.3. UV Enhancement Analysis
The effectiveness of Seed3D-UV in addressing incomplete texture maps is also demonstrated.
The following figure (Figure 10b from the original paper) shows an ablation of UV enhancement:
The figure compares results before and after UV enhancement: the top row shows the model without UV enhancement and the bottom row with it, where UV enhancement markedly improves texture detail and overall appearance quality.
As shown in Figure 10b, without UV enhancement, back-projection from limited viewpoints results in incomplete texture maps with missing regions due to self-occlusion. Seed3D-UV successfully inpaints these incomplete regions, producing complete and spatially coherent UV textures, which are essential for high-quality 3D assets.
6.2. User Study
A user study was conducted with 14 human evaluators to assess generation quality across 43 diverse test images.
The following figure (Figure 10a from the original paper) shows the user study results:
The figure is a radar chart showing how different 3D models score across multiple evaluation dimensions, including geometry quality, material & texture, and detail richness, comparing Seed3D, Rodin 1.5, Hunyuan3D-2.1, Tripo 2.5, and TRELLIS.
As shown in Figure 10a (the radar chart), evaluators compared 6 methods across multiple dimensions: visual clarity, faithful restoration, geometry quality, perspective & structure accuracy, material & texture realism, and detail richness. Seed3D 1.0 consistently received higher ratings across all dimensions, with particularly strong performance in geometry and material quality. This human evaluation further validates the perceptual quality and fidelity of assets generated by Seed3D 1.0.
6.3. Application: Simulation-ready Generation
Seed3D 1.0's core strength is generating assets suitable for physics-based simulation.
The following figure (Figure 11 from the original paper) illustrates simulation-ready asset generation for robotics:
The figure illustrates simulation-ready asset generation, showing the process from input image to 3D asset. Multiple viewpoints show a robot grasping and manipulating different objects (such as toys and a milk carton) in realistic scenes, and the cross-view comparison helps convey the feasibility and flexibility of the manipulation.
As depicted in Figure 11, given a single input image, Seed3D 1.0 produces 3D assets that can be directly integrated into NVIDIA Isaac Sim [38] for robotic manipulation testing. VLMs are used to estimate the scale of assets for real-world dimensions. Isaac Sim automatically generates collision meshes from the watertight, manifold geometry and applies default material properties (e.g., friction), enabling immediate physics simulation without manual tuning.
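As a rough illustration of the scale-calibration step (the VLM call and the Isaac Sim import itself are omitted; the target height, axis convention, and file names are assumptions), a generated mesh can be rescaled to real-world dimensions before being handed to the simulator:

```python
import trimesh

# Rescale a generated asset to a VLM-estimated real-world size before import
# into a physics simulator; values and paths are illustrative only.
mesh = trimesh.load("milk_carton.glb", force="mesh")

estimated_height_m = 0.20            # e.g., a VLM's size estimate for a milk carton
current_height_m = mesh.extents[2]   # vertical extent of the generated mesh (assumed Z-up)
mesh.apply_scale(estimated_height_m / current_height_m)

mesh.export("milk_carton_scaled.obj")  # ready for collision-mesh generation in the simulator
```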
Robotic manipulation experiments (grasping, multi-object interactions) conducted in Isaac Sim demonstrate the assets' utility. The physics engine provides real-time feedback on contact forces, object dynamics, and manipulation outcomes. The fine geometric details preserved by Seed3D 1.0 (e.g., on toys, electronic devices) are essential for realistic contact simulation and grasp planning. These environments offer three key benefits for embodied AI development:
- Scalable generation of training data through diverse manipulation scenarios.
- Interactive learning via physics feedback on action consequences.
- Diverse multi-view, multimodal observation data for building comprehensive evaluation benchmarks for vision-language-action (VLA) models.
6.4. Application: Scene Generation
Seed3D 1.0 extends its capabilities to scene-level generation through a factorized approach.
The following figure (Figure 12 from the original paper) illustrates factorized scene generation:
The figure shows scenes generated with Seed3D 1.0. Top: a prompt image of an office scene (left) alongside the generated object layout map and 3D scene (right); bottom: a prompt image of an urban scene (left) with its generated layout map and 3D scene (right), demonstrating complete scene generation.
As shown in Figure 12, given prompt images, a VLM is employed to identify objects and infer their spatial relationships, generating layout maps that specify object scales, positions, and orientations. The system then generates and textures individual objects using Seed3D 1.0. Finally, these generated objects are assembled according to the predicted layout, enabling coherent scene generation for diverse environments, from indoor offices to urban architectural scenes. This demonstrates the potential for creating complex, diverse, and realistic virtual worlds at scale.
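A toy version of the assembly step is sketched below; the layout entries, file names, and the use of trimesh for placement are assumptions standing in for the paper's VLM-predicted layout maps and rendering stack.

```python
import numpy as np
import trimesh

# Hypothetical layout predicted by a VLM: per-object asset, position (m),
# yaw rotation (deg), and scale.
layout = [
    {"asset": "desk.glb",  "position": [0.0, 0.0, 0.0], "yaw_deg": 0.0,   "scale": 1.0},
    {"asset": "chair.glb", "position": [0.6, 0.0, 0.0], "yaw_deg": 180.0, "scale": 1.0},
]

scene = trimesh.Scene()
for item in layout:
    mesh = trimesh.load(item["asset"], force="mesh")
    mesh.apply_scale(item["scale"])
    transform = trimesh.transformations.rotation_matrix(
        np.radians(item["yaw_deg"]), direction=[0, 0, 1]  # rotate about the vertical axis
    )
    transform[:3, 3] = item["position"]  # write the translation into the 4x4 matrix
    scene.add_geometry(mesh, transform=transform)

scene.export("assembled_scene.glb")
```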
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Seed3D 1.0, a pioneering foundation model that transforms single 2D images into high-fidelity, simulation-ready 3D assets. This system addresses the critical bottleneck in embodied AI development by providing scalable content diversity without sacrificing physics accuracy. Seed3D 1.0's innovation lies in its integrated pipeline, comprising Seed3D-DiT for accurate geometry, Seed3D-MV for multi-view synthesis, Seed3D-PBR for realistic material decomposition, and Seed3D-UV for robust texture completion. This is all underpinned by a sophisticated data infrastructure and optimized training systems.
Experimental results rigorously demonstrate Seed3D 1.0's state-of-the-art performance in both geometry and texture generation benchmarks, often outperforming larger baseline models. User studies further validate its superior visual clarity, geometric accuracy, and material realism. Crucially, Seed3D 1.0 generates assets with watertight, manifold geometry and PBR materials, ensuring seamless integration into physics engines like Isaac Sim for robotic manipulation and simulation training. The model also extends to scalable scene composition, enabling the creation of coherent environments. Ultimately, Seed3D 1.0 lays a strong foundation for advancing physics-based world simulators and training embodied agents capable of realistic physical interaction.
7.2. Limitations & Future Work
While the paper does not explicitly detail a "Limitations" section, some can be inferred from the context and common challenges in the field:
- Single-Image Input: While a strength for ease of use, relying on a single input image inherently limits the amount of 3D information available. This can lead to ambiguities or inaccuracies in regions occluded in the input view, even with Seed3D-UV completion. Complex or highly ambiguous objects might still pose challenges.
- Computational Cost: Generating high-fidelity 3D assets, especially with multiple diffusion models and extensive post-processing, is computationally intensive. Although optimized, inference time for complex scenes or very high-resolution assets might still be a practical limitation for real-time interactive generation.
- Generalization to Novel Object Categories/Styles: While the training data is diverse, the model's ability to generalize to truly novel or abstract object categories or artistic styles not well represented in the training data might vary.
- Dynamic Object Generation: The paper focuses on static 3D assets. Generating animated or deformable objects, or understanding dynamic interactions beyond simple rigid-body physics, would be a more advanced challenge.
- VLM Reliance for Scene Layout: The factorized scene generation relies on VLMs to infer spatial relationships from prompt images. The quality and coherence of generated scenes are thus dependent on the VLM's understanding capabilities, which can have their own biases or limitations.

Potential future research directions implicitly suggested or implied by the work include:
- Improving Geometric Fidelity from Limited Views: Developing even more robust methods for inferring complete and accurate 3D geometry from highly constrained 2D inputs.
- End-to-End Learning for PBR and Textures: Exploring more integrated or end-to-end approaches for PBR material generation, potentially reducing the reliance on sequential components and enabling deeper joint optimization.
- Dynamic Asset Generation: Extending the framework to generate dynamic 3D assets, such as deformable objects, liquids, or objects with moving parts, which are crucial for more complex simulations.
- Beyond Objects to Full Environments: Enhancing the scene generation capabilities to handle more complex environmental structures, finer architectural details, and a broader range of semantic scene understanding.
- Real-time Asset Editing and Control: Integrating interactive editing capabilities for designers or AI agents to modify generated assets in a loop.
- Broader Material Properties: Expanding the range of PBR material properties beyond albedo, metallic, and roughness to include transparency, subsurface scattering (beyond skin), emissive properties, and more.
7.3. Personal Insights & Critique
Seed3D 1.0 represents a significant leap forward in addressing a critical bottleneck for embodied AI: the scalable creation of high-fidelity, physics-compatible 3D content. The modular yet integrated design, breaking down the complex problem into geometry, multi-view texture, PBR, and UV completion stages, is a robust engineering choice. This modularity allows for focused development and optimization of each component while ensuring the overall pipeline delivers on the "simulation-ready" promise.
The emphasis on data infrastructure and preprocessing is particularly commendable. In 3D generative AI, the quality and diversity of training data are often overlooked but are fundamentally enabling factors. The detailed description of their multi-stage preprocessing pipeline and distributed data engineering is an invaluable insight into the practical challenges and solutions for building such large-scale 3D models. The ability to generate watertight, manifold meshes directly usable in physics engines is a game-changer for robotics and simulation, as it bypasses a massive amount of manual effort and opens up possibilities for generating vast, diverse training curricula for RL agents.
A potential area for improvement or future exploration could be the interplay between the sequential stages. While effective, a completely end-to-end differentiable model might offer advantages in terms of joint optimization and potentially richer cross-component feedback. However, the current sequential approach is likely more stable and manageable for such a complex task, especially at this fidelity.
The reliance on existing VLMs (DINOv2, RADIO, CLIP) for conditioning and evaluation is smart, leveraging powerful pre-trained models. However, it also means the model's understanding and semantic alignment are somewhat bounded by the capabilities and potential biases of these upstream VLMs.
From a broader perspective, Seed3D 1.0's methods and conclusions could be transformative for several domains beyond embodied AI:
- Game Development: Rapid prototyping and content creation for virtual worlds.
- AR/VR: Populating augmented and virtual reality environments with realistic assets.
- Digital Twin Technology: Generating accurate 3D models of real-world objects for various industrial applications.
- E-commerce: Creating interactive 3D product visualizations from single images.
The paper sets a high bar for the quality and utility of generated 3D assets, showcasing a clear path toward bridging the gap between generative AI capabilities and the demanding requirements of physical simulation. The public availability of Seed3D 1.0 through Volcano Engine further underscores its practical relevance and potential impact on the research community.