Paper status: completed

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

Published:05/30/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

CLAY is a large-scale 3D generative model using a multi-resolution VAE and latent diffusion Transformer, enabling controllable, high-quality 3D geometry and PBR texture generation from diverse multimodal inputs and primitives.

Abstract

In the realm of digital creativity, our potential to craft intricate 3D worlds from imagination is often hampered by the limitations of existing digital tools, which demand extensive expertise and efforts. To narrow this disparity, we introduce CLAY, a 3D geometry and material generator designed to effortlessly transform human imagination into intricate 3D digital structures. CLAY supports classic text or image inputs as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations, etc). At its core is a large-scale generative model composed of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), to extract rich 3D priors directly from a diverse range of 3D geometries. Specifically, it adopts neural fields to represent continuous and complete surfaces and uses a geometry generative module with pure transformer blocks in latent space. We present a progressive training scheme to train CLAY on an ultra large 3D model dataset obtained through a carefully designed processing pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY sets out to produce physically-based rendering (PBR) textures by employing a multi-view material diffusion model that can generate 2K resolution textures with diffuse, roughness, and metallic modalities. We demonstrate using CLAY for a range of controllable 3D asset creations, from sketchy conceptual designs to production ready assets with intricate details. Even first time users can easily use CLAY to bring their vivid 3D imaginations to life, unleashing unlimited creativity.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

CLAY: A Controllable Large-scale Generative Model for Creating High-quality 3D Assets

1.2. Authors

LONGWEN ZHANG*, ZIYU WANG*, QIXUAN ZHANG†, QIWEI QIU, ANQI PANG, HAORAN JIANG, WEI YANG, LAN XU‡, JINGYI YU‡

  • Affiliations:
    • LONGWEN ZHANG: ShanghaiTech University and Deemos Technology Co., Ltd., China
    • ZIYU WANG: ShanghaiTech University and Deemos Technology Co., Ltd., China
    • QIXUAN ZHANG: ShanghaiTech University and Deemos Technology Co., Ltd., China
    • QIWEI QIU: ShanghaiTech University, China
    • ANQI PANG: ShanghaiTech University, China
    • HAORAN JIANG: ShanghaiTech University and Deemos Technology Co., Ltd., China
    • WEI YANG: Huazhong University of Science and Technology, China
    • LAN XU‡: ShanghaiTech University, China
    • JINGYI YU‡: ShanghaiTech University, China

1.3. Journal/Conference

This paper is published as a preprint on arXiv. While not yet peer-reviewed and officially published in a specific journal or conference proceeding, arXiv is a highly reputable platform for disseminating cutting-edge research in computer science, including artificial intelligence and computer graphics, allowing for rapid sharing and feedback within the academic community.

1.4. Publication Year

2024

1.5. Abstract

The paper introduces CLAY, a novel 3D geometry and material generator designed to overcome the limitations of current digital tools for 3D asset creation, which often require extensive expertise. CLAY facilitates the transformation of human imagination into intricate 3D digital structures. It supports various input modalities, including classic text and image inputs, as well as 3D-aware controls from diverse primitives such as multi-view images, voxels, bounding boxes, point clouds, and implicit representations. At its core, CLAY utilizes a large-scale generative model comprising a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT). This architecture is designed to extract rich 3D priors directly from a diverse range of 3D geometries, representing continuous and complete surfaces using neural fields and employing pure transformer blocks in latent space for geometry generation. The model is trained using a progressive training scheme on an ultra-large 3D model dataset, meticulously processed through a custom pipeline, resulting in a 3D native geometry generator with 1.5 billion parameters. For appearance generation, CLAY produces physically-based rendering (PBR) textures, including diffuse, roughness, and metallic modalities, via a multi-view material diffusion model capable of generating 2K resolution textures. The paper demonstrates CLAY's application in generating controllable 3D assets, from conceptual designs to production-ready assets with intricate details, highlighting its ease of use even for first-time users.

Official Source: https://arxiv.org/abs/2406.13897 (Preprint) PDF Link: https://arxiv.org/pdf/2406.13897v1.pdf (Preprint)

2. Executive Summary

2.1. Background & Motivation

The core problem CLAY aims to solve is the disparity between human imagination and the capabilities of existing digital 3D creation tools. While humans can visualize complex 3D worlds, the actual process of crafting these visions into digital assets is often hindered by:

  • High demand for expertise: Current tools require significant artistic skill and technical knowledge in 3D modeling, texturing, and rendering.
  • Tedious manual labor: The creation workflow is time-consuming and labor-intensive.
  • Limitations of existing techniques:
    • 2D-to-3D methods: While leveraging advancements in 2D image generation, they often struggle with geometric fidelity, lack precise 3D controls, and can suffer from view inconsistencies (e.g., the "multi-head Janus problem"). They prioritize image quality over geometric accuracy.

    • 3D-native methods: These approaches train directly on 3D datasets, offering better geometric understanding. However, they are often limited by the scarcity of high-quality 3D datasets and the relatively small scale of their models, restricting their generation ability, diversity, and detail compared to handcrafted assets.

    • Entanglement of geometry and appearance: 3D assets inherently combine shape and material properties, making their joint generation challenging.

      The importance of this problem lies in its impact on various industries, including entertainment (film, games), design, and virtual reality. An ideal 3D creation tool should effortlessly translate abstract concepts into tangible, digital forms, supporting diverse controllable strategies for creation. The recent surge in AI-Generated Content (AIGC) in 2D has reignited hope for a similar revolution in 3D, highlighting the existing gap.

CLAY's entry point is to bridge this gap by developing a large-scale, controllable, 3D-native generative model that can produce high-quality geometry and physically-based rendering (PBR) textures. It aims to combine the strengths of both 2D-based and 3D-based generations by adopting a "pretrain-then-adaptation" paradigm, effectively mitigating the 3D data scarcity issue through careful data processing and architectural scaling.

2.2. Main Contributions / Findings

The primary contributions of CLAY are:

  • Large-scale 3D-Native Geometry Generator: Introduction of CLAY, a 3D-native geometry generator with an unprecedented scale of 1.5 billion parameters. This allows for significantly improved quality and diversity in 3D object generation compared to prior art.

  • Novel Generative Architecture: At its core, CLAY employs a large-scale generative model consisting of a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT). This architecture effectively extracts rich 3D priors from diverse geometries, represents continuous surfaces using neural fields, and generates geometries in a latent space using pure transformer blocks.

  • Progressive Training Scheme and Data Pipeline: A carefully designed progressive training scheme is introduced to efficiently train the large-scale model by gradually increasing latent size and model parameters. This is coupled with a robust data processing pipeline that standardizes diverse 3D data, including a remeshing protocol for watertightness and GPT-4V powered annotation for precise textual descriptions.

  • High-Quality Physically-Based Rendering (PBR) Texture Generation: CLAY integrates a multi-view material diffusion model capable of generating 2K resolution textures for diffuse, roughness, and metallic modalities. This produces production-ready PBR textures, enhancing the realism and usability of generated assets in existing CG pipelines.

  • Extensive Multi-modal Control and Adaptability: CLAY is designed as a versatile foundation model supporting a rich class of controllable adaptations. This includes classic text or image inputs, as well as 3D-aware controls from diverse primitives (multi-view images, voxels, bounding boxes, point clouds, implicit representations). It also supports LoRA-like fine-tuning for specific styles.

  • Demonstrated Superior Performance: Extensive qualitative and quantitative evaluations, including user studies, show that CLAY significantly outperforms state-of-the-art methods in terms of generation quality, geometric fidelity, diversity, and speed for both text-to-3D and image-to-3D tasks. It addresses issues like the "multi-head Janus problem" and generates smooth, detailed geometries with realistic PBR materials.

    These findings collectively solve the problem of generating high-quality, controllable 3D assets from imagination, making digital 3D creation more accessible and efficient for a broader range of users.

3. Prerequisite Knowledge & Related Work

This section provides foundational knowledge essential for understanding CLAY and contextualizes it within the existing research landscape.

3.1. Foundational Concepts

3.1.1. Variational Autoencoder (VAE)

A Variational Autoencoder (VAE) is a type of generative model that learns a compressed, probabilistic representation of data. It consists of two main parts:

  • Encoder: Maps input data (e.g., an image or 3D shape) to a statistical distribution (mean and variance) in a lower-dimensional latent space. Instead of a single point, it learns a distribution, allowing for probabilistic representation.

  • Decoder: Samples a point from this latent distribution and reconstructs the original input data.

    The VAE is trained to minimize both the reconstruction error (how well the output matches the input) and the Kullback-Leibler (KL) divergence between the learned latent distribution and a prior distribution (typically a standard normal distribution). This ensures that the latent space is well-structured and continuous, allowing for smooth interpolation and meaningful generation of new data by sampling from this learned latent space. In CLAY, a VAE is used to compress complex 3D geometry into a manageable latent code.
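As a toy illustration of the encoder/decoder/KL interplay (not CLAY's architecture), a minimal VAE objective in PyTorch might look like the sketch below; the layer sizes and single-linear-layer design are arbitrary assumptions made only for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    """Minimal VAE: the encoder outputs a Gaussian (mu, logvar); the decoder reconstructs from a sample."""
    def __init__(self, in_dim=784, latent_dim=16):
        super().__init__()
        self.enc = nn.Linear(in_dim, 2 * latent_dim)   # predicts mean and log-variance
        self.dec = nn.Linear(latent_dim, in_dim)

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        recon = self.dec(z)
        recon_loss = F.mse_loss(recon, x)                      # reconstruction term
        kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())  # KL to a standard normal prior
        return recon, recon_loss + kl

recon, loss = TinyVAE()(torch.rand(8, 784))
print(loss.item())
```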

3.1.2. Diffusion Models

Diffusion Models are a class of generative models that learn to reverse a gradual noisy process. They work in two phases:

  • Forward Diffusion Process: This process gradually adds Gaussian noise to the data (e.g., an image or latent code) over several timesteps, transforming it into pure noise.

  • Reverse Denoising Process: This is the generative part. A neural network is trained to predict and remove the noise at each step, gradually transforming a sample of pure noise back into a coherent data sample.

    Diffusion models have shown remarkable success in generating high-quality and diverse content, especially images. Latent Diffusion Models (LDMs) apply this diffusion process not directly in the high-dimensional pixel space, but in a compressed latent space learned by an autoencoder (like a VAE). This makes the diffusion process computationally more efficient. CLAY uses a latent Diffusion Transformer (DiT) to denoise the 3D geometry's latent codes.
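A minimal sketch of the forward noising step and the training target used by typical diffusion models; the linear beta schedule and tensor shapes below are generic illustrative assumptions, not CLAY's exact configuration.

```python
import torch

def forward_diffuse(x0, t, alphas_cumprod):
    """Sample x_t ~ q(x_t | x_0) = N(sqrt(a_bar_t) * x_0, (1 - a_bar_t) * I)."""
    a_bar = alphas_cumprod[t].view(-1, *([1] * (x0.dim() - 1)))
    noise = torch.randn_like(x0)
    return a_bar.sqrt() * x0 + (1.0 - a_bar).sqrt() * noise, noise

T = 1000
betas = torch.linspace(1e-4, 0.02, T)               # a simple linear schedule for illustration
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)
x0 = torch.randn(4, 16)                             # e.g., a batch of latent codes
t = torch.randint(0, T, (4,))
x_t, noise = forward_diffuse(x0, t, alphas_cumprod)
# Training step (epsilon-prediction variant): the denoiser is regressed onto the injected noise,
# i.e., loss = mse(denoiser(x_t, t), noise); the reverse process applies the denoiser step by step.
```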

3.1.3. Transformer

The Transformer architecture, introduced by Vaswani et al. (2017) in "Attention Is All You Need," is a neural network model that has revolutionized natural language processing and is now widely used in computer vision and other fields. Its core mechanism is self-attention and cross-attention.

  • Self-Attention: Allows each element in a sequence (e.g., a word in a sentence, a patch in an image, a point in a point cloud) to weigh the importance of all other elements in the same sequence when computing its representation. This helps capture long-range dependencies. The fundamental formula for attention is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:
    • $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
    • $d_k$ is the dimension of the key vectors, used for scaling to prevent the dot products from growing too large.
    • The term $QK^T$ computes similarity scores between queries and keys.
    • $\mathrm{softmax}$ normalizes these scores, turning them into weights.
    • These weights are then applied to the value matrix $V$.
  • Cross-Attention: Similar to self-attention, but the Query comes from one sequence (e.g., the latent code to be denoised) while the Key and Value come from another sequence (e.g., a conditional input like text embeddings). This allows the model to "attend" to information from different modalities. CLAY's DiT uses pure transformer blocks, leveraging both self-attention and cross-attention for geometry generation and conditioning (see the sketch after this list).
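To make the formula concrete, here is a minimal PyTorch sketch of scaled dot-product attention; the batch and shape conventions are illustrative assumptions rather than any particular library's API.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # q: (batch, n_q, d_k), k: (batch, n_kv, d_k), v: (batch, n_kv, d_v)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5   # similarity between queries and keys
    weights = F.softmax(scores, dim=-1)             # normalize scores into attention weights
    return weights @ v                              # weighted sum of values

# Self-attention: q, k, v all come from the same sequence.
# Cross-attention: q comes from one sequence (e.g., latent tokens),
# while k and v come from another (e.g., text embeddings).
q = k = v = torch.randn(2, 16, 64)
print(scaled_dot_product_attention(q, k, v).shape)  # (2, 16, 64)
```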

3.1.4. Neural Fields (NeRF, SDF, Occupancy Fields)

Neural Fields represent 3D scenes or objects as continuous functions learned by a neural network. Instead of explicit representations like meshes or point clouds, these implicit representations store information about the 3D shape or appearance at any continuous coordinate in space.

  • Neural Radiance Fields (NeRF): A neural network that takes a 3D coordinate (x,y,z) and viewing direction as input and outputs the color and opacity (density) at that point. This allows for novel view synthesis by volume rendering.

  • Signed Distance Fields (SDF): A function that for any point in 3D space, outputs its signed distance to the nearest surface of an object. Points inside the object have negative distances, points outside have positive distances, and points on the surface have zero distance. This explicitly defines surfaces.

  • Occupancy Fields: A function that outputs a binary value (0 or 1) indicating whether a given 3D point is outside or inside an object, respectively. This is a simpler implicit representation than SDF but also defines the object's volume.

    CLAY specifically uses neural fields to represent continuous and complete surfaces, leveraging occupancy fields for its VAE decoder output.
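As a toy illustration of an implicit occupancy representation (not CLAY's decoder), a small MLP can map any 3D coordinate to an occupancy probability; the hidden width and depth are arbitrary assumptions.

```python
import torch
import torch.nn as nn

class OccupancyField(nn.Module):
    """Tiny implicit field: maps a 3D coordinate to an occupancy probability in [0, 1]."""
    def __init__(self, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, p):                       # p: (..., 3) query coordinates
        return torch.sigmoid(self.net(p))

field = OccupancyField()
print(field(torch.rand(1024, 3)).shape)         # (1024, 1) occupancy values
```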

3.1.5. Physically-Based Rendering (PBR)

Physically-Based Rendering (PBR) is a collection of rendering techniques that aim to simulate how light interacts with surfaces in a physically accurate way, leading to more realistic visuals. PBR materials typically consist of several texture maps that define different material properties:

  • Diffuse (Albedo) Map: Defines the base color of the surface, representing the color when lit by a white light source.
  • Roughness Map: Controls how rough or smooth a surface is, influencing the spread and sharpness of reflections. Rougher surfaces scatter light more, leading to blurrier reflections.
  • Metallic Map: Indicates whether a surface is metallic (0 for non-metallic, 1 for metallic). Metallic surfaces reflect light differently than non-metallic ones (dielectrics), generally having colored specular reflections and no diffuse component. CLAY generates these PBR textures at high resolution to ensure production-quality rendering of its 3D assets.

3.1.6. LoRA (Low-Rank Adaptation)

Low-Rank Adaptation (LoRA) is an efficient fine-tuning technique for large pre-trained models. Instead of fine-tuning all the parameters of a large model, LoRA injects small, trainable low-rank matrices into the transformer layers. During fine-tuning, only these low-rank matrices are updated, significantly reducing the number of trainable parameters and computational cost, while still achieving performance comparable to full fine-tuning. This makes adapting large models to specific downstream tasks or styles much more efficient. CLAY uses LoRA for adapting its DiT to generate 3D content in specific styles.
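A minimal sketch of the LoRA idea applied to a single linear projection; the rank, scaling factor, and initialization below are common defaults assumed for illustration, not values reported in the paper.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen linear layer with a trainable low-rank update: W x + scale * (B A) x."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                  # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))  # zero init: update starts at zero
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + (x @ self.A.T @ self.B.T) * self.scale

lora = LoRALinear(nn.Linear(64, 64))
print(lora(torch.randn(2, 64)).shape)   # only A and B receive gradients during fine-tuning
```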

3.1.7. Marching Cubes

Marching Cubes is a computer graphics algorithm used to extract a polygon mesh (a collection of triangles) from a 3D implicit surface, typically represented by a scalar field (like an occupancy field or SDF). It works by iterating through small cubes (voxels) in the 3D scalar field. For each cube, it examines the scalar values at its eight corners to determine how the implicit surface intersects the cube. Based on pre-computed lookup tables, it then generates a set of triangles within that cube to approximate the surface. This process results in a triangulated surface (mesh) that represents the isosurface (e.g., the surface where the occupancy value is 0.5 or SDF is 0). CLAY uses Marching Cubes to convert the occupancy values from its VAE decoder into a mesh.
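As a usage example, scikit-image's Marching Cubes implementation can extract a triangle mesh from an occupancy grid; the spherical field below is just a stand-in for a learned occupancy field.

```python
import numpy as np
from skimage import measure

# A synthetic occupancy field on a 64^3 grid: 1 inside a sphere, 0 outside.
res = 64
coords = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(coords, coords, coords, indexing="ij")
occupancy = (x**2 + y**2 + z**2 < 0.5**2).astype(np.float32)

# Extract the isosurface where the occupancy value crosses 0.5.
verts, faces, normals, values = measure.marching_cubes(occupancy, level=0.5)
print(verts.shape, faces.shape)   # vertices in grid coordinates, triangle index list
```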

3.2. Previous Works

Previous research in 3D generation can be broadly categorized into methods that leverage 2D generation and methods that are inherently 3D-native.

3.2.1. Imposing 2D Images as Prior

This line of work attempts to leverage the significant progress made in 2D image generation, particularly with Diffusion Models like DALL·E (Ramesh et al. 2021), Imagen (Saharia et al. 2022), and Stable Diffusion (Rombach et al. 2022).

  • Score Distillation Sampling (SDS): Pioneered by DreamFusion (Poole et al. 2023), SDS optimizes a Neural Radiance Field (NeRF) using a pre-trained 2D diffusion model. It generates multiple views of a 3D object and uses a 2D diffusion model to evaluate and guide the optimization towards photorealistic images from various viewpoints. While innovative, early SDS struggled with consistency, often requiring extensive parameter tuning and long optimization times, and lacked explicit geometric controls.
    • Subsequent works (e.g., Chen et al. 2024, Huang et al. 2024, Lin et al. 2023, Wu et al. 2024, Yu et al. 2023b, Zhu et al. 2024) extended SDS to various neural fields like DMTet and 3D Gaussian Splatting, improving performance but still facing challenges like the multi-head Janus problem (where multiple faces appear in a 3D object due to lack of 3D consistency in 2D priors). Magic3D (Lin et al. 2023) is another prominent example in this category.
  • Multi-view Generation: To mitigate the Janus problem, methods like Zero-1-to-3 (Liu et al. 2023c) integrate view information into the image generation process by training additional mappings. Others (e.g., Blattmann et al. 2023, Li et al. 2023, Liu et al. 2024a, Long et al. 2024, Qiu et al. 2024, Shi et al. 2023, 2024) directly generate multi-view consistent images using enhanced attention mechanisms. MVDream (Shi et al. 2024) and RichDreamer (Qiu et al. 2024) fall into this category. These methods often require fine-tuning 2D diffusion models on multi-view rendering datasets (Objaverse by Deitke et al. 2023) or auxiliary multi-view datasets. SyncDreamer (Liu et al. 2024a) and Wonder3D (Long et al. 2024) use these multi-view results to extract 3D shapes (e.g., via NeuS by Wang et al. 2021a), while One-2-3-45 (Liu et al. 2023d) trains a generalizable NeuS for sparse view inputs.
  • Limitations: These 2D-based approaches primarily focus on image quality and often suffer from incomplete or coarse geometry, lacking fine details and geometric fidelity because 2D priors do not easily translate to coherent 3D priors.

3.2.2. Imposing 3D Geometry as Priors (3D-Native Approaches)

These methods attempt to train generative models directly from 3D datasets, aiming for a deeper understanding and preservation of geometric features.

  • Early 3D Convolutional Networks: Approaches like Choy et al. 2016, Fan et al. 2017, Groueix et al. 2018, Mescheder et al. 2019, Tang et al. 2019, 2021a primarily used 3D convolutional networks on voxel grids or similar structures.
  • Point Cloud-based Generation: Point-E (Nichol et al. 2022) used a pure transformer-based diffusion model directly on point clouds. While simple and efficient, transforming generated point clouds into precise mesh surfaces remains challenging.
  • Mesh-based Generation: Polygen (Nash et al. 2020) and MeshGPT (Siddiqui et al. 2024) directly represent meshes as point and surface sequences, capable of producing high-quality meshes but often limited by small, high-quality datasets.
  • Voxel-based Generation: XCube (Ren et al. 2024) simplifies geometry into multi-resolution voxels before diffusion, but faces challenges with complex prompts and broader applicability.
  • Implicit Representation-based Generation:
    • Methods using SDF (e.g., DeepSDF by Park et al. 2019, Mosaic-SDF by Yariv et al. 2024) or occupancy fields (Peng et al. 2020, Tang et al. 2021b), or both (Liu et al. 2024b, Zheng et al. 2023), train directly on 3D datasets. These provide more explicit surface learning but often require latent encoding of watertight meshes.
    • SDFusion (Cheng et al. 2023) and ShapeGPT (Yin et al. 2023) adopt a 3D VAE for encoding and reconstructing SDF fields, but are often limited by datasets like ShapeNet (Chang et al. 2015).
    • 3DGen (Gupta et al. 2023) uses a triplane VAE.
    • Shap-E (Jun and Nichol 2023), 3DShape2VecSet (Zhang et al. 2023c), and Michelangelo (Zhao et al. 2023) utilize transformers to encode input point clouds into parameters for decoding networks.

3.3. Technological Evolution

The field of 3D content generation has evolved significantly:

  1. Manual Modeling: Traditional 3D artists manually create models using specialized software.
  2. Procedural Generation: Early attempts to automate 3D creation through rule-based systems.
  3. Data-driven Synthesis (Traditional ML): Using traditional machine learning to generate 3D shapes from small datasets, often relying on explicit representations (voxels, point clouds, meshes).
  4. Deep Learning for 3D (2D-Prior based): Leveraging the success of Deep Learning in 2D vision. This involved using pre-trained 2D Convolutional Neural Networks (CNNs) or Diffusion Models to guide 3D reconstruction, often through optimization (SDS) or multi-view image generation. This brought photorealism but struggled with geometric fidelity and 3D consistency.
  5. Deep Learning for 3D (3D-Native): Developing Deep Learning models that directly learn from 3D data, using implicit representations (SDF, occupancy fields) or explicit ones (point clouds, meshes). This offers better geometric understanding but has been constrained by data scarcity and model scalability.
  6. Large-scale 3D Generative Models: The current frontier, where CLAY resides. This involves scaling up 3D-native architectures (Transformers, VAEs, Diffusion Models) and datasets to unprecedented sizes, employing sophisticated data processing and training schemes to achieve high-quality, diverse, and controllable 3D asset generation comparable to 2D AIGC.

3.4. Differentiation Analysis

CLAY differentiates itself from previous works through several key innovations:

  • Scale and 3D-Native Focus: Unlike 2D-based methods that "lift" 2D priors to 3D and often suffer from geometric inconsistencies, CLAY is a purely 3D-native generator. It scales its core geometry model to an unprecedented 1.5 billion parameters, a significant leap from previous 3D-native approaches which were often constrained by smaller model sizes and datasets. This scale allows CLAY to learn richer geometric features and generate higher-quality details.
  • Separation of Geometry and Appearance: CLAY explicitly separates the generation of geometry and appearance (PBR textures). This contrasts with many 2D-to-3D methods that try to generate both simultaneously, often leading to compromises in either geometric fidelity or texture quality. By separating them, CLAY can leverage specialized techniques for each, ensuring high quality for both.
  • Robust Data Processing Pipeline: CLAY addresses the critical challenge of 3D data scarcity and quality by introducing a meticulously designed data processing pipeline. This includes a novel remeshing protocol that ensures watertightness and preserves geometric features, and the use of GPT-4V for robust, detailed annotations. This unified and high-quality dataset is crucial for training a large foundation model, an aspect often overlooked or less rigorously handled in prior 3D-native works.
  • Multi-resolution VAE and Minimalistic DiT: CLAY introduces a multi-resolution VAE that effectively handles varying geometric details and an adaptive latent size. Its DiT is minimalistic yet powerful, scaling effectively with a progressive training scheme. This architectural and training innovation allows CLAY to efficiently capture fine geometric details while maintaining computational feasibility for large models.
  • High-Quality PBR Texture Generation: CLAY's multi-view material diffusion model is specifically designed to produce production-ready 2K resolution PBR textures (diffuse, roughness, metallic). This goes beyond simple color textures generated by many existing methods, providing the necessary material properties for realistic rendering, and improving upon techniques that either lack these attributes or model them inconsistently.
  • Extensive Multi-modal Control: While many methods offer text or image input, CLAY supports a diverse array of 3D-aware controls, including multi-view images, voxels, bounding boxes, point clouds, and partial point clouds with extension boxes. This rich set of conditional inputs, enabled by a flexible cross-attention conditioning scheme, offers unparalleled control over the generation process, allowing users to guide the model with various levels of abstraction and detail.
  • Speed and Quality Balance: CLAY achieves high-quality results in a significantly shorter time (approximately 45 seconds for geometry and texture) compared to many SDS-based methods that can take hours. This makes it more practical for interactive design workflows.

4. Methodology

CLAY's methodology is centered on building a large-scale 3D generative model that effectively separates geometry and texture generation, leveraging a Variational Autoencoder (VAE) and a Diffusion Transformer (DiT) in a compressed latent space. This approach prioritizes 3D-native understanding and high-quality outputs.

4.1. Principles

The core idea behind CLAY is to overcome the limitations of existing 3D generative models by:

  1. Operating in a compressed latent space: Analogous to successful 2D generative models, CLAY learns to denoise 3D data in a lower-dimensional latent representation. This significantly reduces computational complexity compared to working directly in the high-dimensional 3D space.

  2. Separating geometry and texture generation: CLAY takes a minimalist approach by decoupling the generation of 3D geometry from the generation of its material properties (textures). This allows each component to be optimized for its specific task, avoiding compromises that arise from entangled generation processes, especially in the context of lifting 2D priors to 3D. The paper's experiments suggest that scaling up 3D-native geometry generation with high-quality data leads to superior geometric details compared to 2D-based or 2D-assisted techniques.

  3. Scaling up foundational 3D architectures: CLAY extends the generative model of 3DShape2VecSet by introducing a new multi-resolution VAE for efficient geometric data encoding/decoding and an advanced latent Diffusion Transformer (DiT) for probabilistic geometry generation.

  4. Rigorous data standardization: Recognizing the critical role of data quality and quantity, CLAY employs a sophisticated data processing pipeline, including remeshing for geometric unification and GPT-4V powered annotation, to create an ultra-large, high-quality 3D dataset.

    The combination of these principles allows CLAY to generate high-quality 3D models from diverse conditional inputs like text, images, point clouds, and voxels.

4.2. Core Methodology In-depth

4.2.1. Representation and Model Architecture

CLAY's 3D generative model works by learning to denoise 3D data in a compressed latent space. The process starts by sampling a point cloud $\mathbf{X}$ from a 3D mesh surface $\mathbf{M}$. This point cloud $\mathbf{X} \in \mathbb{R}^{N \times 3}$ (where $N$ is the number of points and 3 represents the (x, y, z) coordinates) is then encoded into a latent code $\mathbf{Z} \in \mathbb{R}^{L \times 64}$ with a dynamic shape, where $L$ is the length of the latent code and 64 is the channel size. This encoding is performed by the encoder $\mathcal{E}$ of a transformer-based Variational Autoencoder (VAE): $ \mathbf{Z} = \mathcal{E}(\mathbf{X}) $ Subsequently, a Diffusion Transformer (DiT) is trained to denoise the latent code $\mathbf{Z}_t$ (the latent code $\mathbf{Z}$ corrupted with noise at timestep $t$). Finally, the VAE decoder $\mathcal{D}$ decodes the generated latent codes from the DiT (specifically, the denoised latent code $\mathbf{Z}_0$) into a neural field representation. This neural field outputs occupancy values in $[0, 1]$ for any given test coordinate $\mathbf{p}$ in space, indicating whether $\mathbf{p}$ is inside or outside the 3D shape: $ \mathcal{D}(\mathbf{Z}_0, \mathbf{p}) \rightarrow [0, 1] $ The overall architecture of the VAE and DiT is designed for scalability and multi-resolution processing.

As illustrated in Figure 3, the network design of CLAY's VAE and DiT emphasizes a minimalistic yet powerful approach. The VAE is structured to handle various geometric resolutions, while the DiT, composed of pure transformer blocks, is built for scalable training.

Fig. 10 (from the original paper). Evaluation of CLAY's ability to alter generated content by incorporating different geometric feature tags in the prompt, showcasing precise control over symmetry, angularity, polygon count, complexity, and character morphology.
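The generate-then-decode flow described above can be summarized in a short sketch; `dit`, `vae_decoder`, and the single-line scheduler update are placeholders standing in for CLAY's components, not a real API.

```python
import torch

def generate_shape(dit, vae_decoder, text_features,
                   latent_length=2048, channels=64, steps=100, grid_res=64):
    """Sketch of CLAY-style inference: denoise a latent set, then query occupancy on a grid.

    `dit(z, t, c)` is assumed to return predicted noise; `vae_decoder(z, p)` is assumed to
    return occupancy values in [0, 1]. Both are hypothetical placeholders.
    """
    z = torch.randn(1, latent_length, channels)          # start from pure noise in latent space
    for t in reversed(range(steps)):
        eps = dit(z, t, text_features)                   # noise prediction conditioned on text
        z = z - eps / steps                              # stand-in for a real scheduler update step
    # Dense occupancy queries on a regular grid (Marching Cubes would follow).
    axis = torch.linspace(-1.0, 1.0, grid_res)
    grid = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(1, -1, 3)
    return vae_decoder(z, grid)                          # occupancy in [0, 1]

# Usage with dummy stand-ins, just to show the shapes:
dummy_dit = lambda z, t, c: torch.zeros_like(z)
dummy_decoder = lambda z, p: torch.sigmoid(p.norm(dim=-1, keepdim=True))
occ = generate_shape(dummy_dit, dummy_decoder, None, latent_length=64, steps=5, grid_res=16)
print(occ.shape)   # (1, 16**3, 1)
```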

4.2.2. Multi-resolution VAE

CLAY's VAE module is based on the structure from 3DShape2VecSet, but with augmentations for scaling.

  • Encoder Structure: The encoder $\mathcal{E}$ embeds the input point cloud $\mathbf{X} \in \mathbb{R}^{N \times 3}$ (sampled from a mesh $\mathbf{M}$) into a latent code $\mathbf{Z}$. This involves a learnable embedding function and a cross-attention encoding module. The specific formulation for the VAE encoder is: $ \mathbf{Z} = \mathcal{E}(\mathbf{X}) = \mathrm{CrossAttn}(\mathrm{PosEmb}(\tilde{\mathbf{X}}), \mathrm{PosEmb}(\mathbf{X})) $ Where:
    • $\mathbf{X}$: The input point cloud, a set of $N$ points in 3D space.
    • $\tilde{\mathbf{X}}$: A down-sampled version of $\mathbf{X}$, typically at 1/4 scale. This reduces the length $L$ of the latent code to a quarter of the input point cloud size $N$, indicating a hierarchical encoding.
    • $\mathrm{PosEmb}(\cdot)$: A learnable positional embedding function. Positional embeddings are crucial in transformer architectures to inject information about the relative or absolute position of elements in a sequence, since transformers inherently process sequences without regard to order. Here, it embeds the 3D coordinates into a higher-dimensional space.
    • $\mathrm{CrossAttn}(\cdot, \cdot)$: The cross-attention mechanism. It takes two sets of embeddings (queries from the first input, keys/values from the second) and produces an output that combines information from both. In this case, it combines the positional embeddings of the down-sampled point cloud $\tilde{\mathbf{X}}$ and the original point cloud $\mathbf{X}$.
  • Decoder Structure: The VAE's decoder $\mathcal{D}$ consists of 24 self-attention layers followed by a cross-attention layer. It processes the latent codes $\mathbf{Z}$ and a list of query points $\mathbf{p}$ to output occupancy values (between 0 and 1, indicating the probability of being inside the object). The VAE decoder is formulated as: $ \mathcal{D}(\mathbf{Z}, \mathbf{p}) = \mathrm{CrossAttn}(\mathrm{PosEmb}(\mathbf{p}), \mathrm{SelfAttn}^{24}(\mathbf{Z})) $ Where:
    • $\mathbf{Z}$: The latent code generated by the encoder or denoised by the DiT.
    • $\mathbf{p}$: A query point (3D coordinate) for which the occupancy value is to be predicted.
    • $\mathrm{SelfAttn}^{24}(\mathbf{Z})$: 24 layers of self-attention applied to the latent code $\mathbf{Z}$. This allows the latent code to refine its internal representation by considering relationships among its own elements.
    • $\mathrm{CrossAttn}(\mathrm{PosEmb}(\mathbf{p}), \mathrm{SelfAttn}^{24}(\mathbf{Z}))$: The final cross-attention layer. It takes the positional embedding of the query point $\mathbf{p}$ as queries and the refined latent code (from the self-attention layers) as keys and values. This lets the decoder query the latent representation at specific 3D locations to determine occupancy.
  • Parameters: The VAE uses a hidden dimension of 512 with 8 attention heads, totaling 82 million parameters. The latent code size is $L \times 64$, where the length $L$ varies with the input point cloud size.
  • Multi-resolution Approach: To capture fine geometric details, CLAY adopts a multi-resolution sampling strategy for the point clouds. At each training iteration, the number of surface points $N$ sampled from the input mesh $\mathbf{M}$ is randomly chosen from 2048, 4096, or 8192. This ensures the VAE learns to handle and reconstruct details at different levels of granularity (a minimal sketch of this encoder/decoder structure follows this list).
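Here is the promised sketch of the cross-attention encoder and the $\mathrm{SelfAttn}^{24}$ + $\mathrm{CrossAttn}$ decoder; the linear positional embedding, the stride-based down-sampling, and the hidden sizes are simplifying assumptions, not CLAY's exact implementation.

```python
import torch
import torch.nn as nn

class PosEmb(nn.Module):
    """Learnable embedding of 3D coordinates into the model dimension (a simple linear map here)."""
    def __init__(self, dim=512):
        super().__init__()
        self.proj = nn.Linear(3, dim)
    def forward(self, p):                       # p: (B, N, 3)
        return self.proj(p)

class ShapeVAE(nn.Module):
    def __init__(self, dim=512, heads=8, dec_layers=24):
        super().__init__()
        self.pos = PosEmb(dim)
        self.enc_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.dec_self = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True)
                                       for _ in range(dec_layers)])
        self.dec_cross = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_occ = nn.Linear(dim, 1)

    def encode(self, x):                        # x: (B, N, 3) sampled surface points
        x_ds = x[:, ::4]                        # 1/4 down-sampling -> latent length L = N / 4
        q, kv = self.pos(x_ds), self.pos(x)
        z, _ = self.enc_attn(q, kv, kv)         # CrossAttn(PosEmb(X~), PosEmb(X))
        return z                                # (B, L, dim)

    def decode(self, z, p):                     # p: (B, Q, 3) query coordinates
        for attn in self.dec_self:              # SelfAttn^24(Z), with residual connections
            z = z + attn(z, z, z)[0]
        q = self.pos(p)
        h, _ = self.dec_cross(q, z, z)          # CrossAttn(PosEmb(p), refined Z)
        return torch.sigmoid(self.to_occ(h))    # occupancy in [0, 1]

vae = ShapeVAE()
occ = vae.decode(vae.encode(torch.rand(1, 2048, 3)), torch.rand(1, 128, 3))
print(occ.shape)   # (1, 128, 1)
```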

4.2.3. Coarse-to-fine DiT

CLAY's DiT (Diffusion Transformer) is responsible for denoising the latent codes in a coarse-to-fine manner.

  • Structure: It employs a minimalistic 24-layer pure transformer architecture. It includes cross-attention mechanisms to incorporate text prompt conditions, which are encoded into textual features $\mathbf{c}$ using CLIP-ViT-L/14.
  • Encoding Process: The DiT's input is a noisy latent code $\mathbf{Z}_t$ at timestep $t$. This $\mathbf{Z}_t$ is derived from a latent code $\mathbf{Z} \in \mathbb{R}^{L \times 64}$ encoded from $N = 4L$ surface points (so that $L$ is a quarter of the point cloud size, consistent with the VAE encoder).
  • Denoising Function: The DiT, denoted $\epsilon(\cdot)$, predicts the noise that was added to $\mathbf{Z}_t$ at timestep $t$. The DiT's noise prediction function is defined as: $ \epsilon(\mathbf{Z}_t, t, \mathbf{c}) = \{\mathrm{CrossAttn}(\mathrm{SelfAttn}(\mathbf{Z}_t \oplus \mathbf{t}), \mathbf{c})\}^{24} $ Where:
    • $\mathbf{Z}_t$: The noisy latent code at timestep $t$.
    • $t$: The current diffusion timestep, which is embedded before being fed to the network.
    • $\mathbf{c}$: The textual features (condition) derived from the text prompt using CLIP-ViT-L/14.
    • $\mathbf{Z}_t \oplus \mathbf{t}$: The concatenation of the noisy latent code $\mathbf{Z}_t$ with the embedded timestep $\mathbf{t}$. This provides the transformer with information about the current noise level.
    • $\mathrm{SelfAttn}(\cdot)$: Self-attention applied to the concatenated latent code and timestep embedding. This lets the model process the internal relationships within the noisy latent code, contextualized by the timestep.
    • $\mathrm{CrossAttn}(\cdot, \cdot)$: Cross-attention, where the output of the self-attention on $\mathbf{Z}_t \oplus \mathbf{t}$ acts as queries, and the textual features $\mathbf{c}$ act as keys and values. This mechanism conditions the noise prediction on the text prompt.
    • $\{\cdot\}^{24}$: This block (SelfAttn followed by CrossAttn) is repeated 24 times, forming the 24 layers of the transformer (a minimal sketch of one such block follows this list).
  • Progressive Training Scheme: To efficiently capture fine geometric details and ensure quicker convergence, CLAY employs a progressive training scheme for the DiT:
    1. Initial Stage: Training begins with a shorter latent code length, $L = 512$, using a higher learning rate.
    2. Gradual Increase: The latent code length is gradually increased to 1024, and then to 2048.
    3. Learning Rate Adjustment: With each increase in latent length, the learning rate is reduced based on empirical observations. This coarse-to-fine approach allows the model to first learn broader structures and then progressively refine details.
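The promised sketch of one pre-norm DiT block: self-attention over the timestep-augmented latent tokens, cross-attention to text features, then a 4x feed-forward with GeLU. Token counts, dimensions, and the single-token timestep handling are illustrative assumptions.

```python
import torch
import torch.nn as nn

class DiTBlock(nn.Module):
    """Pre-norm transformer block: SelfAttn over latent tokens, CrossAttn to text features, then an MLP."""
    def __init__(self, dim=768, heads=12):
        super().__init__()
        self.n1, self.n2, self.n3 = nn.LayerNorm(dim), nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, z, c):                            # z: (B, L+1, D) latent tokens + timestep token
        h = self.n1(z)
        z = z + self.self_attn(h, h, h)[0]              # SelfAttn(Z_t concatenated with t)
        z = z + self.cross_attn(self.n2(z), c, c)[0]    # CrossAttn conditioned on text features c
        return z + self.mlp(self.n3(z))

# A 24-layer DiT would stack this block; the timestep t is embedded and concatenated to z beforehand.
block = DiTBlock()
z = torch.randn(2, 513, 768)   # 512 latent tokens + 1 timestep token (illustrative)
c = torch.randn(2, 77, 768)    # text features projected to the model dimension (illustrative)
print(block(z, c).shape)
```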

4.2.4. Scaling-up Scheme

Scaling up CLAY involved specific architectural and training enhancements:

  • Architecture Enhancements: The VAE and DiT architectures are augmented with pre-normalization (normalizing inputs to layers before processing) and GeLU (Gaussian Error Linear Unit) activation functions. These are known to facilitate faster computation of attention mechanisms and improve training stability in large models. The feed-forward dimension within the transformer blocks is set to four times the model dimension.

  • Noise Scheduling: A discrete scheduler with 1000 timesteps is used for the diffusion process, with a cosine beta schedule (a specific way of defining the noise levels at different timesteps) during training. Following recent practices in diffusion training (Lin et al. 2024), CLAY enforces zero terminal SNR (Signal-to-Noise Ratio) by rescaling the beta values and adopts "v-prediction" as the training objective. This strategy promotes stable inference (a minimal sketch of such a schedule follows this list).

  • Model Sizes and Training: To evaluate the impact of model size, five DiT models were trained, ranging from 227 million to 1.5 billion parameters. The following are the results from Table 1 of the original paper:

    | Model size | nparams | nlayers | dmodel | nheads | dhead | Latent length | Batch size | Learning rate |
    |------------|---------|---------|--------|--------|-------|---------------|------------|---------------|
    | Tiny       | 227M    | 24      | 768    | 12     | 64    | 512           | 1024       | 1e-4          |
    | Small      | 392M    | 24      | 1024   | 16     | 64    | 512           | 16384      | 1e-5          |
    |            |         |         |        |        |       | 1024          | 8192       | 5e-6          |
    | Medium     | 600M    | 24      | 1280   | 16     | 80    | 512           | 16384      | 1e-4          |
    |            |         |         |        |        |       | 1024          | 8192       | 5e-5          |
    | Large      | 853M    | 24      | 1536   | 16     | 96    | 512           | 8192       | 1e-4          |
    |            |         |         |        |        |       | 1024          | 4096       | 1e-5          |
    |            |         |         |        |        |       | 2048          | 2048       | 5e-6          |
    | XL         | 1.5B    | 24      | 2048   | 16     | 128   | 512           | 4096       | 1e-4          |
    |            |         |         |        |        |       | 1024          | 2048       | 1e-5          |
    |            |         |         |        |        |       | 2048          | 1024       | 5e-6          |
    • nparams: Total number of parameters in the model.

    • nlayers: Number of transformer layers (fixed at 24).

    • dmodel: Model dimension (hidden size).

    • nheads: Number of attention heads.

    • dhead: Dimension of each attention head.

    • Latent length: The length of the latent code $L$.

    • Batch size: The number of samples processed per training iteration.

    • Learning rate: The step size for updating model weights during optimization.

      The smallest model (Tiny) is designed for preliminary experiments on a single node (8 NVidia A800 GPUs). Larger models use larger batch sizes for improved stability and faster convergence. The largest model (XL, 1.5B parameters) was trained on a cluster of 256 NVidia A800 GPUs for approximately 15 days, employing progressive training. The progressive scaling of DiT (following Gesmundo and Maile, 2023, on Head addition, Heads expansion, and Hidden dimension expansion) optimizes the learning trajectory by enhancing time efficiency, retaining knowledge, and reducing the risk of local optima.

  • Inference: During inference, a 100-timestep denoising process with linear-space timestep spacing is used for efficient 3D geometry generation. The VAE's geometry decoder then performs dense sampling on a $512^3$ grid to precisely determine occupancy values, which are converted to a mesh using Marching Cubes.
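As referenced in the noise-scheduling bullet above, a cosine beta schedule can be rescaled to zero terminal SNR in a few lines; the constants below are standard textbook values (Nichol and Dhariwal's cosine schedule, Lin et al.'s rescaling), since the paper's exact settings are not reproduced in this analysis.

```python
import math
import torch

def cosine_alphas_cumprod(T=1000, s=0.008):
    """Cosine schedule: cumulative alpha_bar_t over T timesteps."""
    t = torch.arange(T + 1, dtype=torch.float64) / T
    f = torch.cos((t + s) / (1 + s) * math.pi / 2) ** 2
    return (f / f[0])[1:]                              # alpha_bar_1 ... alpha_bar_T

def rescale_to_zero_terminal_snr(alphas_cumprod):
    """Zero-terminal-SNR rescaling (Lin et al. 2024): shift/scale sqrt(alpha_bar) so alpha_bar_T = 0."""
    a = alphas_cumprod.sqrt()
    a = (a - a[-1]) * (a[0] / (a[0] - a[-1]))          # keep alpha_bar_1, force alpha_bar_T to 0
    return a ** 2

alphas_bar = rescale_to_zero_terminal_snr(cosine_alphas_cumprod())
# With alpha_bar_T = 0 the terminal step is pure noise, which is why this rescaling is paired
# with a v-prediction training objective rather than plain epsilon-prediction.
print(float(alphas_bar[0]), float(alphas_bar[-1]))     # first stays ~unchanged, last is exactly 0
```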

4.2.5. Data Standardization for Pretraining

The quality and scale of 3D datasets are paramount for large-scale generative models. CLAY addresses the challenges of limited size, quality issues (non-watertight meshes), inconsistent orientations, and inaccurate annotations in existing datasets (ShapeNet, Objaverse) through a two-step process:

  • Filtering: Initially, unsuitable data (complex scenes, fragmented scans) is filtered out, resulting in a refined collection of 527K objects from ShapeNet and Objaverse.

  • Geometry Unification:

    • Problem: Non-watertight meshes (meshes with holes or gaps) make it difficult to predict accurate occupancy fields. Traditional remeshing tools often smooth out important geometric features like sharp edges.

    • CLAY's Solution: A standardized geometry remeshing protocol is proposed to ensure watertightness while preserving geometric fidelity.

    • Comparison with existing tools: While tools like Manifold (Huang et al. 2018a) and ManifoldPlus (Huang et al. 2020) are efficient, they tend to smooth features. mesh-to-sdf (Marian 2021) and Dual Octree Graph Networks (DOGN) (Wang 2022; Wang et al. 2022) compute Signed Distance Fields (SDF) or Unsigned Distance Fields (UDF) but are computationally costly.

    • CLAY's UDF-based approach: Inspired by DOGN, CLAY adopts Unsigned Distance Field (UDF) representation due to its seamless conversion capabilities and ability to correct vertex/face density inconsistencies.

    • Addressing Mesh Holes: Traditional Marching Cubes can produce thin shells from meshes with holes. CLAY tackles this by employing a grid-based visibility computation before isosurface extraction. A grid point is labeled "inside" if it is completely obscured from all angles, maximizing the volume for stable VAE training. This method ensures robustness even for non-watertight input meshes.

      The figure below (Figure 4 from the original paper) visually compares CLAY's remeshing method against existing approaches using cross-sectional analysis. CLAY's method effectively preserves geometric features while ensuring watertightness and maximizing internal volume, crucial for training robust 3D models.

      Fig. 11 (from the original paper). Evaluation of geometry diversity: the top-3 nearest samples retrieved from the dataset are shown alongside CLAY's outputs. CLAY generates high-quality geometries that match the description but remain distinct from their nearest neighbors in the dataset.

  • Geometry Annotation:

    • Importance of Prompts: Recognizing the impact of precise text prompts in 2D models like Stable Diffusion and SDXL, CLAY emphasizes accurate textual descriptions for 3D objects.
    • Automated Annotation: CLAY develops unique prompt tags and utilizes GPT-4V (OpenAI 2023) to produce detailed annotations. This enhances the model's ability to interpret and generate complex 3D geometries with nuanced details and diverse styles, by providing rich semantic information during training.

4.2.6. Asset Enhancement

To ensure that generated 3D assets are production-ready and compatible with existing Computer Graphics (CG) pipelines, CLAY includes a two-stage enhancement process: geometry optimization and material synthesis.

  • Mesh Quadrification and Atlasing:
    • Problem: The initial geometric meshes generated by Marching Cubes typically consist of millions of uneven triangles. This "triangle soup" is difficult to edit, challenging for game engines, and complicates UV unwrapping (the process of flattening a 3D surface into a 2D image for texture mapping).
    • Solution: CLAY transforms these triangle-faced meshes into quad-faced meshes (meshes composed primarily of quadrilaterals) using off-the-shelf tools such as Blender (Blender Online Community 2024) and QuadriFlow (Huang et al. 2018b). This process preserves key geometric features (sharp edges, flat surfaces) and simplifies UV unwrapping, leading to higher-quality final meshes.
  • Material Synthesis:
    • Problem: Physically-Based Rendering (PBR) materials (diffuse, metallic, roughness textures) are crucial for realism. Existing methods often generate only a subset of these, lack specific attribute supervision, or cannot produce rich material types.

    • CLAY's Solution: A multi-view Material Diffusion model is developed to synthesize a wide range of PBR materials.

    • Dataset: Over 40,000 objects with high-quality PBR materials are carefully selected from Objaverse (Deitke et al. 2023) for training.

    • Model Modification: CLAY modifies MVDream (Shi et al. 2024), a model originally designed for image-space generation, to support texture attribute generation with additional channels and modalities. It integrates three branches, each with skip connections, into the outermost convolutional layers of MVDream's UNet. This allows concurrent denoising across the diffuse, roughness, and metallic modalities while ensuring view consistency.

    • Training Process: The training selects orthogonal-view rendered texture images for each 3D object in the dataset. It uses a combination of full-parameter training for add-on layers and LoRA-based fine-tuning for internal layers, focusing on high-quality, view-consistent PBR materials.

    • Conditional Generation: The model synthesizes texture images from four camera viewpoints, precisely aligned with the input geometry. This is achieved by applying a pre-trained ControlNet (Zhang et al. 2023b), using each target view's rendered normal map as input. This also allows for image-based input customization via IPAdapter (Ye et al. 2023).

    • Texture Enhancement: To further improve texture detail, CLAY employs a targeted inpainting approach (as in Text2Tex by Chen et al. 2023b) and integrates advanced super-resolution techniques (Real-ESRGAN by Wang et al. 2021b and MultiDiffusion by Bar-Tal et al. 2023). This achieves 2K resolution textures in UV space, sufficient for most realistic rendering tasks.

      The following schematic (Figure 5 from the original paper) illustrates the CLAY system architecture and its asset enhancement pipeline.

      Fig. 12 (from the original paper). Geometry generation via single-image and multi-view image conditioning, with multi-view RGB and normal images generated by Wonder3D. Left: the input single image and CLAY's single-image result; middle: CLAY's geometry conditioned on 4 generated views with normal maps; right: Wonder3D's 6-view NeuS reconstruction.

4.2.7. Model Adaptation

CLAY, as a pretrained foundation model, is highly versatile and supports various adaptations for controllable 3D content generation.

  • LoRA Fine-tuning: CLAY directly supports Low-Rank Adaptation (LoRA) on the attention layers of its DiT. This enables efficient fine-tuning for generating 3D content in specific styles (e.g., transforming a LEGO duck into stone or pocket monster variants).

    The image below (Figure 6 from the original paper) demonstrates the adaptability of CLAY through LoRA fine-tuning. It shows a LEGO duck generated by CLAY, and then variants generated in stone and pocket monster styles after fine-tuning on specific datasets.

    Fig. 13 (from the original paper). Comparisons of CLAY vs. state-of-the-art methods on text-conditioned generation. From top to bottom: "Mythical creature dragon", "Stag deer", "Interstellar warship", "Space rocket", and "Eagle wood carving", highlighting CLAY's advantage in preserving detail and structure.

  • Conditional Generation: The minimalistic architecture of CLAY allows it to efficiently integrate diverse conditional modalities, either individually or in combination. These include:

    • Text prompts (natively supported)

    • Image/sketch

    • Voxel

    • Multi-view images

    • Point cloud

    • Bounding box

    • Partial point cloud with an extension box

      These conditions enable the model to generate content based on specific inputs or blend styles and user controls from multiple conditions.

4.2.8. Conditioning Scheme

CLAY extends its text prompt conditioning to incorporate additional conditions in parallel. This is achieved by leveraging pre-normalization (Xiong et al. 2020), which converts attention results into residuals, so that extra conditions can be added as parallel residuals alongside the text condition. The modification to the cross-attention mechanism for multi-modal conditioning is given by: $ \mathbf{Z} \leftarrow \mathbf{Z} + \mathrm{CrossAttn}(\mathbf{Z}, \mathbf{c}) + \sum_{i=1}^{n} \alpha_i \, \mathrm{CrossAttn}_i(\mathbf{Z}, \mathbf{c}_i) $ Where:

  • $\mathbf{Z}$: The current latent representation being processed.

  • $\mathrm{CrossAttn}(\mathbf{Z}, \mathbf{c})$: The original text conditioning, where $\mathbf{c}$ are the textual features.

  • $\mathrm{CrossAttn}_i(\mathbf{Z}, \mathbf{c}_i)$: The $i$-th additional trainable module for a specific condition.

  • $\mathbf{c}_i$: The $i$-th additional conditional input (e.g., image features, voxel features).

  • $\alpha_i$: A scalar that allows direct manipulation of the influence or weight of each additional condition.

    While this scheme is general, obtaining the embedded condition $\mathbf{c}_i$ requires careful calibration. For image/sketch conditions, a pre-trained DINOv2 (Oquab et al. 2024) model extracts features that are directly integrated. However, for spatially related modalities (voxel, multi-view images, point cloud, bounding box, partial point cloud), directly applying cross-attention on features might not preserve spatial information (a minimal sketch of the parallel-residual conditioning appears below).
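The promised sketch of the parallel-residual conditioning rule; pre-normalization is omitted for brevity, and the module sizes and the number of extra conditions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class MultiConditionCrossAttn(nn.Module):
    """Z <- Z + CrossAttn(Z, c_text) + sum_i alpha_i * CrossAttn_i(Z, c_i)."""
    def __init__(self, dim=768, heads=12, n_extra=2):
        super().__init__()
        self.text_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.extra_attn = nn.ModuleList([nn.MultiheadAttention(dim, heads, batch_first=True)
                                         for _ in range(n_extra)])

    def forward(self, z, c_text, extra_conds, alphas):
        out = z + self.text_attn(z, c_text, c_text)[0]            # native text conditioning
        for attn, c_i, a_i in zip(self.extra_attn, extra_conds, alphas):
            out = out + a_i * attn(z, c_i, c_i)[0]                # each extra condition is a parallel residual
        return out

layer = MultiConditionCrossAttn()
z = torch.randn(1, 512, 768)
c_text = torch.randn(1, 77, 768)
c_img, c_voxel = torch.randn(1, 257, 768), torch.randn(1, 512, 768)   # illustrative condition tokens
print(layer(z, c_text, [c_img, c_voxel], alphas=[1.0, 0.5]).shape)
```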

4.2.9. Spatial Control

To maintain spatial integrity for 3D conditions, CLAY devises a specific learning strategy:

  • Positional Embeddings for Spatial Features: Additional positional embeddings are learned for spatial features. This enables the attention layer to differentiate between point coordinates and their features.
  • Specific Cross-Attention Application: The cross-attention for spatial conditions is applied as: $ \mathrm{CrossAttn}_i(\mathbf{Z}, \mathbf{f} + \mathrm{PosEmb}(\mathbf{p})) $ Where:
    • $\mathbf{Z}$: The latent code.
    • $\mathbf{f} \in \mathbb{R}^{M \times C}$: The feature embedding learned during fine-tuning or extracted from a backbone network, where $M$ is the length and $C$ is the channel size of the embedding.
    • $\mathbf{p} \in \mathbb{R}^{M \times 3}$: Points sampled based on the type of condition used.
    • $\mathrm{PosEmb}(\mathbf{p})$: A learnable positional embedding for the sampled points $\mathbf{p}$. This method allows for effective integration of various 3D modalities by explicitly encoding their spatial information (see the sketch after this list).
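A sketch of the spatial conditioning term $\mathrm{CrossAttn}_i(\mathbf{Z}, \mathbf{f} + \mathrm{PosEmb}(\mathbf{p}))$; the linear positional embedding and the projection into the DiT width are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class SpatialCondition(nn.Module):
    """Builds keys/values as the feature embedding f plus a learnable positional embedding of points p."""
    def __init__(self, cond_dim=512, model_dim=768, heads=12):
        super().__init__()
        self.pos = nn.Linear(3, cond_dim)            # PosEmb(p): learnable embedding of 3D coordinates
        self.to_kv = nn.Linear(cond_dim, model_dim)  # project f + PosEmb(p) to the DiT width
        self.attn = nn.MultiheadAttention(model_dim, heads, batch_first=True)

    def forward(self, z, f, p):                      # z: (B, L, model_dim), f: (B, M, C), p: (B, M, 3)
        kv = self.to_kv(f + self.pos(p))             # f + PosEmb(p)
        return self.attn(z, kv, kv)[0]               # CrossAttn_i(Z, f + PosEmb(p))

cond = SpatialCondition()
z, f, p = torch.randn(1, 512, 768), torch.randn(1, 8, 512), torch.rand(1, 8, 3)  # e.g., bounding-box corners
print(cond(z, f, p).shape)
```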

4.2.10. Implementation of Conditions

Each condition involves independently training an additional $\mathrm{CrossAttn}_i(\cdot)$ module while keeping all other parameters fixed. The figure below (Figure 7 from the original paper) illustrates the overall framework of CLAY, highlighting the multi-modal conditional inputs and the transformer-based generation process.

The accompanying figure compares CLAY with other 3D generative models in generating diverse 3D assets (chairs, vehicles, dragon heads, swords) from different inputs, highlighting differences in quality and detail.

The following are the results from Table 2 of the original paper, detailing the specifications of each conditioning module:

| Conditioning        | nparams | M      | C    | Backbone     |
|---------------------|---------|--------|------|--------------|
| Image/Sketch        | 352M    | 257    | 1536 | DINOv2-Giant |
| Voxel               | 260M    | 8^3    | 512  | /            |
| Multi-view images   | 358M    | 8^3    | 768  | DINOv2-Small |
| Point cloud         | 252M    | 512    | 512  | /            |
| Bounding box        | 252M    | 8      | 512  | /            |
| Partial point cloud | 252M    | 2048+8 | 512  | /            |
  • nparams: Number of parameters in the conditioning module.

  • $M$: Length of the feature embedding.

  • $C$: Channel size of the feature embedding.

  • Backbone: Pre-trained model used to extract features for the condition.

  • Images and Sketches:

    • Features extracted using DINOv2 (ViT) for both patch and global features.
    • Integrated via cross-attention (as in Eqn. 4).
    • Trained using rendered RGB images and sketches from the dataset.
  • Voxel:

    • Initially, a $16^3$ voxel grid is constructed for each object, marking cells as occupied/vacant.
    • Down-sampled to an $8^3$ feature volume using 3D convolution.
    • Volume features $\mathbf{f} \in \mathbb{R}^{8^3 \times C}$ are combined with positional embeddings of the volume centers $\mathrm{PosEmb}(\mathbf{p})$, flattened, and integrated via cross-attention (see the sketch after this list).
  • Bounding Boxes:

    • Features $\mathbf{f} \in \mathbb{R}^{8 \times C}$ (representing the 8 corners of the box) are learned during condition fine-tuning.
    • Combined with positional embeddings $\mathrm{PosEmb}(\mathbf{p})$ for precise spatial control.
  • Sparse Point Cloud:

    • Feature embeddings $\mathbf{f} = 0$ (no explicit features, only positional information).
    • 512 points are sampled as $\mathbf{p}$, and the corresponding positional embeddings $\mathrm{PosEmb}(\mathbf{p})$ are learned.
  • Multi-view Images:

    • DINOv2 extracts features from various views (e.g., from Wonder3D).
    • Features are back-projected into a 3D volume, down-sampled, flattened, and integrated via cross-attention (similar to voxel condition).
  • Partial Point Cloud with Extension Box:

    • Addresses point cloud completion.
    • Input point cloud is merged with corner points of an extension box (specifying the missing region).
    • Combines approaches for bounding box conditioning and sparse point cloud conditioning by concatenating their features.
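As an example of the voxel pathway described above, a $16^3$ occupancy grid can be reduced to an $8^3$ feature volume with a strided 3D convolution and then flattened into condition tokens with center positional embeddings; the layer choices are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class VoxelCondition(nn.Module):
    """16^3 occupancy grid -> 8^3 feature volume -> flattened tokens plus center positional embeddings."""
    def __init__(self, channels=512):
        super().__init__()
        self.down = nn.Conv3d(1, channels, kernel_size=3, stride=2, padding=1)   # 16^3 -> 8^3
        self.pos = nn.Linear(3, channels)                                         # PosEmb of voxel centers

    def forward(self, voxels):                    # voxels: (B, 1, 16, 16, 16) with 0/1 occupancy
        f = self.down(voxels)                     # (B, C, 8, 8, 8)
        f = f.flatten(2).transpose(1, 2)          # (B, 8^3, C) condition tokens
        axis = (torch.arange(8, dtype=torch.float32) + 0.5) / 8 * 2 - 1            # voxel-center coordinates
        centers = torch.stack(torch.meshgrid(axis, axis, axis, indexing="ij"), dim=-1).reshape(1, -1, 3)
        return f + self.pos(centers.to(voxels.device))                             # c_i for cross-attention

tokens = VoxelCondition()(torch.randint(0, 2, (1, 1, 16, 16, 16)).float())
print(tokens.shape)   # (1, 512, 512): 8^3 tokens with 512 channels
```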

5. Experimental Setup

CLAY's experimental setup focuses on a rigorous training and evaluation process involving multiple model sizes, various conditioning types, and comprehensive metrics, including comparisons against state-of-the-art methods.

5.1. Datasets

CLAY is trained on an extensive dataset derived from ShapeNet (Chang et al. 2015) and Objaverse (Deitke et al. 2023).

  • Source and Scale: The initial dataset was filtered to 527K objects, undergoing a meticulously designed processing pipeline for standardization.
  • Characteristics: This pipeline included:
    • Geometry Unification: A remeshing protocol that converts various 3D surfaces into occupancy fields, preserving essential geometric features like sharp edges and flat surfaces, and ensuring watertightness. This is crucial as original meshes often suffer from non-watertightness, inconsistent orientations, or low quality.
    • Geometry Annotation: GPT-4V (OpenAI 2023) was used to produce robust annotations and unique prompt tags for each 3D model, accentuating geometric characteristics and providing rich textual descriptions for conditional generation.
  • Training Data Subsets:
    • Full Training Data: Used for training the Tiny-base to XL-base models with a latent code length of $L = 1024$.
    • High-Quality Subset: A subset of 300K objects from the full training data was used for training the Large-P and XL-P models (initially with $L = 1024$, then extended to $L = 2048$ for Large-P-HD and XL-P-HD). This subset ensures high quality for fine-grained geometry generation.
  • Adaptation Training: For LoRA fine-tuning and other conditioning modules, these modules were trained based on the XL-P model using the same high-quality subset data, with each module independently trained for 8 hours.
  • Effectiveness: These datasets, processed and augmented, are highly effective for validating the method's performance because they provide a large, consistent, and well-annotated collection of 3D shapes, which is critical for training large-scale generative models and enabling fine-grained control.

5.2. Evaluation Metrics

CLAY's performance is evaluated using a comprehensive suite of metrics for both text-to-shape and conditioned shape generation tasks, covering aspects of rendering quality, geometric fidelity, and semantic alignment.

5.2.1. Text-to-Shape Evaluation Metrics

These metrics are applied to a 16K text-shape pair validation set to assess the quality and alignment of generated 3D models from text prompts; a numpy sketch of the distribution metrics (FID/KID) follows this list.

  1. Render-FID (Fréchet Inception Distance on rendered images)

    • Conceptual Definition: FID measures the similarity between the feature distributions of real and generated images. A lower FID score indicates that the generated images are more similar to real images in terms of their visual quality and diversity. "Render-FID" specifically applies FID to 2D images rendered from the generated 3D shapes.
    • Mathematical Formula: $ \mathrm{FID}(x, g) = ||\mu_x - \mu_g||_2^2 + \mathrm{Tr}(\Sigma_x + \Sigma_g - 2(\Sigma_x \Sigma_g)^{1/2}) $
    • Symbol Explanation:
      • $x$: The feature vectors extracted from a set of real images.
      • $g$: The feature vectors extracted from a set of generated images.
      • $\mu_x$, $\mu_g$: The mean feature vectors for real and generated images, respectively.
      • $\Sigma_x$, $\Sigma_g$: The covariance matrices for real and generated images, respectively.
      • $||\cdot||_2^2$: The squared L2 norm (squared Euclidean distance).
      • $\mathrm{Tr}(\cdot)$: The trace of a matrix (sum of diagonal elements).
      • $(\cdot)^{1/2}$: Matrix square root.
  2. Render-KID (Kernel Inception Distance on rendered images)

    • Conceptual Definition: KID is an alternative to FID that uses a Maximum Mean Discrepancy (MMD) metric to compare distributions, often considered more robust to small sample sizes. Like FID, a lower KID score indicates higher similarity between real and generated image feature distributions. "Render-KID" applies this to 2D images rendered from 3D shapes.
    • Mathematical Formula: $ \mathrm{KID}(X, Y) = \mathrm{MMD}^2(X, Y) = E_{x_i, x_j \sim X} [k(x_i, x_j)] - 2E_{x_i \sim X, y_j \sim Y} [k(x_i, y_j)] + E_{y_i, y_j \sim Y} [k(y_i, y_j)] $
    • Symbol Explanation:
      • $X$: Set of feature vectors from real images.
      • $Y$: Set of feature vectors from generated images.
      • $k(\cdot, \cdot)$: A kernel function (typically a polynomial kernel) that measures similarity between two feature vectors.
      • $E[\cdot]$: Expectation (average).
  3. P-FID (Point-cloud Fréchet Inception Distance)

    • Conceptual Definition: Similar to FID, but applied in the 3D feature space. It measures the similarity between the feature distributions of real and generated 3D point clouds. PointNet++ (Qi et al. 2017) is used to extract 3D features. A lower P-FID indicates better quality and diversity of generated 3D geometries.
    • Mathematical Formula: Same as FID, but $x$ and $g$ are feature vectors extracted from point clouds.
  4. P-KID (Point-cloud Kernel Inception Distance)

    • Conceptual Definition: Similar to KID, but applied in the 3D feature space, using PointNet++ features. A lower P-KID indicates better quality and diversity of generated 3D geometries.
    • Mathematical Formula: Same as KID, but $X$ and $Y$ are sets of feature vectors extracted from point clouds.
  5. CLIP(I-T) (CLIP Image-Text Similarity)

    • Conceptual Definition: CLIP (Contrastive Language-Image Pre-training, Radford et al. 2021) learns to embed images and text into a shared latent space where semantically similar pairs are closer. CLIP(I-T) measures the cosine similarity between the CLIP image embedding of a rendered 3D object and the CLIP text embedding of its generating prompt. A higher score indicates better text-to-image alignment.
    • Mathematical Formula: $ \mathrm{CLIP}(I, T) = \frac{\mathrm{Embedding}(I) \cdot \mathrm{Embedding}(T)}{||\mathrm{Embedding}(I)|| \cdot ||\mathrm{Embedding}(T)||} $
    • Symbol Explanation:
      • $\mathrm{Embedding}(I)$: The feature vector generated by the CLIP image encoder for a rendered image $I$.
      • $\mathrm{Embedding}(T)$: The feature vector generated by the CLIP text encoder for a text prompt $T$.
      • $\cdot$: Dot product.
      • $||\cdot||$: L2 norm (magnitude of the vector).
  6. ULIP-T (ULIP Text-Shape Alignment)

    • Conceptual Definition: ULIP (Unified Language-Image Pre-training for 3D, Xue et al. 2023) is a multimodal pre-training framework that aligns language, 2D images, and 3D shapes in a common embedding space. ULIP-T specifically measures the alignment between a text caption $T$ and a generated 3D geometry $S$. A higher score indicates better semantic alignment.
    • Mathematical Formula: $ \mathrm{ULIP\text{-}T}(T, S) = \langle \mathbf{E}_T, \mathbf{E}_S \rangle $
    • Symbol Explanation:
      • $\mathbf{E}_T$: The normalized ULIP feature embedding of the caption $T$.
      • $\mathbf{E}_S$: The normalized ULIP feature embedding of the generated geometry $S$.
      • $\langle \cdot, \cdot \rangle$: Inner product (equivalent to cosine similarity for normalized vectors).
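
For concreteness, the following numpy/scipy sketch computes FID and polynomial-kernel KID on pre-extracted feature matrices (e.g., Inception features of rendered views or PointNet++ features of point clouds), plus the cosine-similarity alignment score underlying the CLIP/ULIP metrics. It mirrors the formulas above and is not the paper's evaluation code; feature dimensions and sample counts are placeholders.

```python
import numpy as np
from scipy import linalg


def fid(real: np.ndarray, fake: np.ndarray) -> float:
    """Frechet distance between Gaussian fits of two feature sets (rows = samples)."""
    mu_r, mu_f = real.mean(0), fake.mean(0)
    cov_r = np.cov(real, rowvar=False)
    cov_f = np.cov(fake, rowvar=False)
    covmean = linalg.sqrtm(cov_r @ cov_f).real               # matrix square root
    return float(((mu_r - mu_f) ** 2).sum() + np.trace(cov_r + cov_f - 2 * covmean))


def kid(real: np.ndarray, fake: np.ndarray, degree: int = 3) -> float:
    """Biased MMD^2 estimate with the standard polynomial kernel."""
    d = real.shape[1]
    k = lambda a, b: ((a @ b.T) / d + 1.0) ** degree
    return float(k(real, real).mean() - 2 * k(real, fake).mean() + k(fake, fake).mean())


def cosine_alignment(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """CLIP(I-T)/ULIP-style score: mean cosine similarity of paired embeddings."""
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return float((a * b).sum(axis=1).mean())


real_feats = np.random.randn(1000, 64)   # placeholder features (small dim for the toy run)
fake_feats = np.random.randn(1000, 64)
print(fid(real_feats, fake_feats), kid(real_feats, fake_feats),
      cosine_alignment(real_feats, fake_feats))
```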

5.2.2. Multi-modal-to-3D Evaluation Metrics

These metrics are used to assess the accuracy and quality of generated shapes when conditioned on various inputs (image, multi-view normal, voxel, etc.); a numpy sketch of the geometric metrics follows this list.

  1. Chamfer Distance (CD)

    • Conceptual Definition: Chamfer Distance is a metric used to compare two point clouds. It measures the average squared Euclidean distance from each point in one set to its closest point in the other set, and vice-versa. A lower CD indicates greater geometric similarity between the generated and ground truth shapes.
    • Mathematical Formula: For two point clouds $A = \{a_1, ..., a_N\}$ and $B = \{b_1, ..., b_M\}$: $ \mathrm{CD}(A, B) = \frac{1}{N} \sum_{a \in A} \min_{b \in B} ||a - b||_2^2 + \frac{1}{M} \sum_{b \in B} \min_{a \in A} ||b - a||_2^2 $
    • Symbol Explanation:
      • $A$, $B$: Two point clouds being compared.
      • $N$, $M$: Number of points in point clouds $A$ and $B$, respectively.
      • $a \in A$, $b \in B$: Individual points from the point clouds.
      • $||\cdot||_2^2$: Squared Euclidean distance.
      • $\min$: Minimum distance, finding the closest point.
  2. Earth Mover's Distance (EMD)

    • Conceptual Definition: Earth Mover's Distance, also known as Wasserstein distance, measures the minimum "cost" to transform one distribution into another. In the context of point clouds, it quantifies the minimum amount of "work" required to move points from one point cloud to match another, where "work" is defined as the sum of distances each point is moved. A lower EMD indicates higher similarity.
    • Mathematical Formula: For two point clouds $P_1$ and $P_2$, represented as discrete probability distributions: $ \mathrm{EMD}(P_1, P_2) = \min_{f_{ij} \ge 0} \sum_{i=1}^{N} \sum_{j=1}^{M} f_{ij} \, d(p_{1i}, p_{2j}) $ subject to $ \sum_j f_{ij} = \frac{1}{N} $ (for all $i$), $ \sum_i f_{ij} = \frac{1}{M} $ (for all $j$), and $ \sum_i \sum_j f_{ij} = 1 $.
    • Symbol Explanation:
      • $P_1 = \{p_{11}, ..., p_{1N}\}$, $P_2 = \{p_{21}, ..., p_{2M}\}$: Two point clouds.
      • $N$, $M$: Number of points in $P_1$ and $P_2$.
      • $f_{ij}$: Flow (amount of "earth" moved) from point $p_{1i}$ to $p_{2j}$.
      • $d(p_{1i}, p_{2j})$: Euclidean distance between points $p_{1i}$ and $p_{2j}$.
      • The constraints ensure that all "earth" from $P_1$ is moved to $P_2$.
  3. Voxel-IoU (Voxel Intersection over Union)

    • Conceptual Definition: Intersection over Union (IoU) is a common metric for object detection and segmentation. Voxel-IoU applies this to 3D voxel grids. It measures the overlap between the generated voxel representation and the ground truth voxel representation, divided by their total union. A higher Voxel-IoU indicates a more accurate reconstruction of the 3D shape's volume.
    • Mathematical Formula: $ \mathrm{Voxel-IoU}(V_G, V_P) = \frac{|V_G \cap V_P|}{|V_G \cup V_P|} $
    • Symbol Explanation:
      • $V_G$: Set of occupied voxels in the ground truth 3D shape.
      • $V_P$: Set of occupied voxels in the predicted (generated) 3D shape.
      • $|\cdot|$: Cardinality (number of occupied voxels).
      • $\cap$: Intersection of voxel sets.
      • $\cup$: Union of voxel sets.
  4. F-Score

    • Conceptual Definition: F-score (or F1-score) is the harmonic mean of precision and recall, often used to evaluate the accuracy of a test. In 3D shape generation, it can be adapted to measure how well the generated shape matches the ground truth, particularly in terms of surface reconstruction or point cloud matching. It balances the rate of correctly identified points/surfaces with the rate of incorrectly identified ones.
      • Precision: The proportion of correctly predicted positive instances among all positive predictions.
      • Recall: The proportion of correctly predicted positive instances among all actual positive instances.
    • Mathematical Formula: $ \mathrm{F-Score} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
    • Symbol Explanation:
      • Precision and Recall are calculated based on point-to-surface distances or closest point matches between the generated and ground truth shapes. For point clouds, points within a certain threshold distance of the ground truth surface are considered "true positives."
  5. ULIP-I (ULIP Image-Shape Alignment)

    • Conceptual Definition: Similar to ULIP-T, ULIP-I measures the alignment between an input conditional image $I$ and a generated 3D geometry $S$ in the ULIP common embedding space. A higher score indicates better semantic alignment between the input image and the generated shape.
    • Mathematical Formula: $ \mathrm{ULIP\text{-}I}(I, S) = \langle \mathbf{E}_I, \mathbf{E}_S \rangle $
    • Symbol Explanation:
      • $\mathbf{E}_I$: The normalized ULIP feature embedding of the input image $I$.
      • $\mathbf{E}_S$: The normalized ULIP feature embedding of the generated geometry $S$.
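
The following numpy/scipy sketch implements the geometric metrics above for small point clouds and voxel grids. EMD is computed via optimal assignment between equally sized point sets, which coincides with the flow formulation when N = M; the F-Score distance threshold is a placeholder. This is an illustration, not the paper's evaluation code.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)      # (N, M) squared distances
    return float(d2.min(1).mean() + d2.min(0).mean())


def earth_movers_distance(a: np.ndarray, b: np.ndarray) -> float:
    d = np.sqrt(((a[:, None, :] - b[None, :, :]) ** 2).sum(-1))
    rows, cols = linear_sum_assignment(d)                     # optimal point matching
    return float(d[rows, cols].mean())


def voxel_iou(occ_a: np.ndarray, occ_b: np.ndarray) -> float:
    inter = np.logical_and(occ_a, occ_b).sum()
    union = np.logical_or(occ_a, occ_b).sum()
    return float(inter / union)


def f_score(pred: np.ndarray, gt: np.ndarray, tau: float = 0.02) -> float:
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(-1)
    precision = (np.sqrt(d2.min(1)) < tau).mean()             # pred points near the gt surface
    recall = (np.sqrt(d2.min(0)) < tau).mean()                # gt points covered by the prediction
    return float(2 * precision * recall / (precision + recall + 1e-8))


p, q = np.random.rand(512, 3), np.random.rand(512, 3)         # placeholder point clouds
print(chamfer_distance(p, q), earth_movers_distance(p, q), f_score(p, q))
```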

5.2.3. Additional CLIP-based Metrics for SOTA Comparison

For quantitative comparison with state-of-the-art methods, CLAY renders 30 views of RGB images and normal maps for each generated 3D asset. Four additional CLIP-based metrics are applied to these views (a small sketch of the view-averaging protocol follows the list):

  • CLIP(N-I): Measures geometric alignment of the normal map with the input image.
  • CLIP(N-T): Measures geometric alignment of the normal map with the input text.
  • CLIP(I-I): Evaluates appearance by measuring similarity of rendered images with the input image.
  • CLIP(I-T): Evaluates appearance by measuring similarity of rendered images with the input text.

The average of these scores across the 30 views provides a comprehensive assessment.
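
A minimal sketch of this view-averaging protocol is given below: every rendered view is embedded, and the cosine similarity against the embedding of the conditioning text or image is averaged over the 30 views. The embed_* functions are placeholders standing in for a CLIP encoder rather than a specific library API.

```python
import numpy as np

N_VIEWS, DIM = 30, 512


def embed_views(views) -> np.ndarray:
    """Placeholder: would return CLIP image embeddings of the rendered views."""
    return np.random.randn(len(views), DIM)


def embed_condition(condition) -> np.ndarray:
    """Placeholder: would return the CLIP embedding of the input text or image."""
    return np.random.randn(DIM)


def view_averaged_clip_score(views, condition) -> float:
    v = embed_views(views)
    c = embed_condition(condition)
    v = v / np.linalg.norm(v, axis=1, keepdims=True)
    c = c / np.linalg.norm(c)
    return float((v @ c).mean())                              # mean cosine similarity over views


print(view_averaged_clip_score([f"view_{i:02d}.png" for i in range(N_VIEWS)],
                               "a space rocket"))
```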

5.3. Baselines

CLAY is compared against several state-of-the-art 3D generation models, categorized by their primary generation approach:

5.3.1. Text-to-3D Baselines

These methods take text prompts as input and generate 3D assets.

  • Shap-E (Jun and Nichol 2023): A transformer-based model that generates conditional 3D implicit functions (neural fields) from text or images, based on point clouds.
  • DreamFusion (Poole et al. 2023): Pioneer of Score Distillation Sampling (SDS), optimizing a NeRF using a pre-trained 2D diffusion model.
  • Magic3D (Lin et al. 2023): An enhancement of DreamFusion, focusing on high-resolution text-to-3D content creation with improved quality and faster optimization.
  • MVDream (Shi et al. 2024): A multi-view diffusion model that generates consistent multi-view images from text, which can then be used to reconstruct 3D shapes.
  • RichDreamer (Qiu et al. 2024): A generalizable normal-depth diffusion model designed to achieve detail richness in text-to-3D generation.

5.3.2. Image-to-3D Baselines

These methods take 2D images as input and generate 3D assets.

  • Shap-E (Jun and Nichol 2023): Also capable of image-to-3D generation.

  • Wonder3D (Long et al. 2024): Generates multi-view images from a single input image using cross-domain diffusion, followed by 3D reconstruction.

  • One-2-3-45++ (Liu et al. 2024b): An advanced version of One-2-3-45, aiming for fast single-image to 3D mesh generation without per-shape optimization.

  • DreamCraft3D (Sun et al. 2024): A hierarchical 3D generation method with bootstrapped diffusion prior, often achieving high quality but can be time-consuming.

  • Michelangelo (Zhao et al. 2023): A conditional 3D shape generation model based on shape-image-text aligned latent representation.

    The baselines are representative as they cover a spectrum of approaches:

  • SDS-based optimization (DreamFusion, Magic3D).

  • Multi-view image generation followed by reconstruction (MVDream, RichDreamer, Wonder3D).

  • Direct 3D-native generation (Shap-E, Michelangelo).

  • Single-image generalizable reconstruction (One-2-3-45++).

    For the comparisons, the authors used open-source code for Shap-E, MVDream, and RichDreamer, third-party implementations for DreamFusion and Magic3D, and the online demo for One-2-3-45++.

6. Results & Analysis

CLAY's results demonstrate its strong capabilities in generating high-quality, diverse, and controllable 3D assets, outperforming state-of-the-art methods in both qualitative and quantitative evaluations.

6.1. Core Results Analysis

6.1.1. Qualitative Showcase of CLAY's Versatility

CLAY demonstrates its versatility by generating a wide range of objects with intricate details and textures. From ancient tools to futuristic spacecraft, the generated assets reflect a diverse array of categories including vehicles, cultural artifacts, everyday items, and imaginative elements. This highlights its capacity for high-fidelity and varied 3D creations suitable for applications in gaming, film, and virtual simulations.

The image below (Figure 8 from the original paper) showcases a sample collection of 3D models generated by CLAY.


6.1.2. Diverse Conditioning Capabilities

CLAY's ability to respond to various input modalities is a key strength.

  • Image Conditioning: It can generate geometries that faithfully resemble input images, whether they are real-world photos, AI-generated concepts, or hand-drawn sketches.

  • Spatial Controls: CLAY allows for the creation of complex scenes, such as entire towns or bedrooms, simply from scattered bounding boxes.

  • Multi-view Images: It reliably reconstructs 3D geometries from multiple perspectives or normal maps.

  • Sparse Point Cloud: CLAY can serve as an effective surface reconstruction tool, outperforming methods like GCNO (Xu et al. 2023) by generating detailed 3D geometries from as few as 512 points (e.g., for a "knot" case).

  • Enhancement of Existing Geometries: It can improve 3D geometries generated by other techniques, maintaining sharp edges and flat surfaces that are often missing in prior art.

  • Diversity from Coarse Inputs: CLAY excels in generating diverse shapes from the same coarse input, transforming a single voxel input into vastly different objects like a futuristic monument, a medieval castle, an SUV, or a space shuttle, showcasing its "unlimited imagination" capability.

  • Completion and Editing: CLAY can complete missing parts from partially available geometry, functioning as both a geometry completion and editing tool. Examples include altering a monster's body or transforming a companion robot into a battle-ready counterpart.

    The following figure (Figure 9 from the original paper) illustrates CLAY's conditioning capabilities across different modalities.


6.1.3. Quantitative Evaluations of Text-to-3D Performance

The quantitative evaluation of text-to-3D generation across different model sizes reveals a clear trend:

  • Larger Models Excel: As model size increases from Tiny-base (227M parameters) to XL-P-HD (1.5B parameters with a longer, high-definition latent code), performance improves across all metrics (render-FID, render-KID, P-FID, P-KID, CLIP(I-T), ULIP-T), apart from minor fluctuations among the smaller models.

  • High-Quality Data (P-suffix) and High-Definition (HD-suffix): Models trained on the high-quality subset (-P suffix) and those with longer latent code lengths (-HD suffix) generally achieve better scores. For instance, XL-P and XL-P-HD show superior performance compared to XL-base, indicating the effectiveness of the refined dataset and higher resolution latent representation.

  • Text-Shape Alignment: ULIP-T scores, which measure text-shape alignment, also consistently improve with larger models and refined training, demonstrating better semantic understanding.

    The following are the results from Table 3 of the original paper:

    Model name Latent length render-FID↓ render-KID(×103)↓ P-FID↓ P-KID(×103)↓ CLIP(I-T)↑ ULIP-T↑
    Tiny-base 1024 12.2241 3.4861 2.3905 4.1187 0.2242 0.1321
    Small-base 1024 11.2982 4.2074 1.9332 4.1386 0.2319 0.1509
    Medium-base 1024 13.0596 5.4561 1.4714 2.7708 0.2311 0.1511
    Large-base 1024 6.5732 2.3617 0.8650 1.6377 0.2358 0.1559
    XL-base 1024 5.2961 1.8640 0.7825 1.3805 0.2366 0.1554
    Large-P 1024 5.7080 1.9997 0.7148 1.2202 0.2360 0.1565
    XL-P 1024 4.0196 1.2773 0.6360 1.0761 0.2371 0.1564
    Large-P-HD 2048 5.5634 1.8234 0.6394 0.9170 0.2374 0.1578
    XL-P-HD 2048 4.4779 1.4486 0.5072 0.5180 0.2372 0.1569

6.1.4. Quantitative Evaluations of Multi-modal-to-3D Performance

Evaluation of various conditioning modules using XL-P as the base model shows that CLAY achieves high fidelity with just a single condition, and combining conditions further improves geometric details and alignment.

  • MVN (Multi-view Normal) Conditioning: Among single conditions, MVN exhibits outstanding performance, achieving the lowest EMD, P-FID, and P-KID and the highest Voxel-IoU, F-Score, ULIP-T, and ULIP-I (only Voxel conditioning attains a lower CD). This indicates its precision in geometric reconstruction.

  • Combined Conditions: Adding text to MVN (Text-MVN) further improves scores (e.g., lowest P-FID and P-KID), confirming that additional conditions can refine the generated geometry.

  • Alignment: ULIP-T and ULIP-I scores generally show good alignment with input text and image conditions, respectively.

    The following are the results from Table 4 of the original paper:

    Condition CD(×103)↓ EMD(×102)↓ Voxel-IoU↑ F-Score↑ P-FID↓ P-KID(×103)↓ ULIP-T↑ ULIP-I↑
    Image 12.4092 17.6155 0.4513 0.4070 0.9946 1.9889 0.1329 0.2066
    MVN 0.9924 5.7283 0.7697 0.8218 0.3038 0.2420 0.1393 0.2220
    Voxel 0.5676 8.4254 0.6273 0.6049 2.6963 5.0008 0.1186 0.1837
    Image-Bbox 5.4733 14.0811 0.5122 0.4909 1.5884 3.2994 0.1275 0.2028
    Image-Voxel 0.7491 8.1174 0.6514 0.6541 2.4866 6.8767 0.1262 0.2017
    Text-Image 7.7198 14.5489 0.4980 0.4609 0.7996 1.4489 0.1407 0.2122
    Text-MVN 0.7301 5.4034 0.7842 0.8358 0.2184 0.1233 0.1424 0.2240
    Text-Bbox 5.6421 14.6170 0.4921 0.4659 2.0074 4.0355 0.1417 0.1838
    Text-Voxel 0.6090 7.4981 0.6737 0.6689 1.0427 1.0903 0.1397 0.2036

6.1.5. Prompt Engineering

CLAY demonstrates effective control over generated geometry through prompt engineering:

  • Geometric Style Control: Tags like "asymmetric geometry" successfully yield asymmetric tables and churches. "Sharp edges" vs. "smooth edges" transform characters like Pikachu and dogs into more rounded forms.
  • Complexity Control: CLAY can transform high-polygon models (aircrafts, tanks) into low-polygon variants, or conversely, generate intricate details for chandeliers and sofas with a "complex geometry" tag.
  • Anthropomorphic Transformations: Adding "character" to prompts can turn inanimate objects (e.g., a fireplug, a mailbox) into anthropomorphic figures, highlighting the model's ability to interpret and apply abstract concepts.

Taken together, these results indicate that the annotated tags applied during training enable the model to generate geometries with the desired complexity and style, significantly enhancing the quality and specificity of generated shapes. A short illustration of tag-based prompt construction is given below.
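
The snippet below is a purely illustrative sketch of how such tags can be composed into a conditioning prompt; generate_geometry is a hypothetical stand-in, not CLAY's actual interface.

```python
BASE_PROMPT = "a wooden dining table"

# Tag names taken from the examples in the text above.
STYLE_TAGS = {
    "asymmetric": "asymmetric geometry",
    "sharp": "sharp edges",
    "smooth": "smooth edges",
    "complex": "complex geometry",
    "character": "character",
}


def build_prompt(base: str, *tags: str) -> str:
    return ", ".join([base] + [STYLE_TAGS[t] for t in tags])


def generate_geometry(prompt: str) -> None:
    """Hypothetical stand-in for a text-conditioned geometry generator."""
    print(f"conditioning on: {prompt!r}")


generate_geometry(build_prompt(BASE_PROMPT, "asymmetric", "sharp"))
generate_geometry(build_prompt(BASE_PROMPT, "complex", "smooth"))
```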

The figure below (Figure 10 from the original paper) illustrates CLAY's ability to alter generated content by incorporating different geometric feature tags in the prompt.


6.1.6. Geometry Diversity

CLAY excels at generating high-quality geometries with rich diversity, often producing novel shapes distinct from its training data.

  • Novel Shapes from Text: With text inputs, CLAY can generate entirely new shapes that do not directly correspond to any existing samples in its dataset.
  • Novel Structural Combinations from Images: For image inputs, CLAY accurately reconstructs the image content while introducing novel structural combinations. For example, an AI-generated airplane concept, featuring a passenger plane fuselage merged with square air intakes and fighter jet tail fins (a design unseen in training data), is accurately generated in 3D by CLAY, capturing a high degree of resemblance to the provided image. This demonstrates CLAY's generative capacity to go beyond mere memorization and create truly novel, yet semantically aligned, 3D forms.

The following figure (Figure 11 from the original paper) evaluates the geometry diversity.


6.1.7. Effectiveness of MVN Conditioning

Multi-view Normal (MVN) conditioning provides precise control over 3D geometry generation.

  • Contrast with Single-Image: While single-image conditioning allows for more creative liberty, MVN conditioning harnesses multiple perspectives to deliver detailed and precise control, akin to pixel-aligned sparse-view reconstruction.

  • Example: Using an initial image of a panther's head, single-image conditioning yields a solid 3D geometry. However, when Wonder3D is used to generate multi-view images and corresponding normal maps, it results in a panther face mask with a notably thin surface. CLAY's MVN conditioning successfully leverages these multi-view normal maps to faithfully synthesize this thin surface, distinguishing itself from traditional NeuS methods applied to Wonder3D's outputs, which might produce thicker or less accurate geometry. This highlights the efficiency and precision of CLAY's MVN conditioning for guiding detailed 3D geometry generation.

    The figure below (Figure 12 from the original paper) shows geometry generation via single image and multi-view image conditioning.


6.1.8. Running Time

On a single Nvidia A100 GPU, CLAY's inference time breakdown is:

  • Shape Latent Generation: Approximately 4 seconds.
  • Latent Decoding: Approximately 1 second (due to efficient adaptive sampling).
  • Mesh Processing: Approximately 8 seconds.
  • PBR Generation: Approximately 32 seconds.
  • Total Generation Time: Cumulatively, about 45 seconds for a complete, high-quality 3D asset with PBR textures.

6.2. Comparisons with SOTA

6.2.1. Qualitative Comparison

CLAY is qualitatively compared against Shap-E, DreamFusion, Magic3D, MVDream, and RichDreamer for text-to-3D tasks.

  • Text-to-3D:
    • Shap-E: Faster but often lacks complete geometry structures.

    • DreamFusion & Magic3D: SDS optimization methods that frequently exhibit multi-face "Janus" artifacts due to view inconsistencies.

    • MVDream & RichDreamer: Generate multi-view images for SDS, producing consistent geometries but often lacking surface smoothness and requiring long optimization times.

    • CLAY: Produces high-quality 3D assets in about 45 seconds. The generated geometries feature smooth surfaces, intricate details, and better alignment with text prompts, without the typical artifacts of 2D-lifted methods.

      The figure below (Figure 13 from the original paper) illustrates these comparisons using normal maps for text-conditioned generation, with examples like "Mythical creature dragon," "Stag deer," and "Interstellar warship."


CLAY is also compared against Shap-E, Wonder3D, One-2-3-45++, DreamCraft3D, and Michelangelo for image-to-3D generation.

  • Image-to-3D:
    • Shap-E: Fast but struggles to accurately reconstruct input images, resulting in incomplete geometries.

    • Wonder3D: Relies on multi-view images and normal prediction followed by NeuS reconstruction, often yielding coarse and incomplete geometries due to inconsistencies in multi-view outputs.

    • One-2-3-45++: Efficient in creating smooth geometries but lacks details and struggles with symmetry, especially on complex objects.

    • DreamCraft3D: An SDS optimization method that produces high-quality output but is time-consuming and can result in uneven surfaces.

    • Michelangelo: Generates geometry only (color is manually assigned for rendering); although fast, its outputs fall short of CLAY's level of detail and alignment.

    • CLAY: Quickly generates detailed and high-quality geometries along with high-quality PBR textures.

      The figure below (Figure 14 from the original paper) shows comparisons between CLAY and other state-of-the-art models for image-to-3D generation.


6.2.2. Quantitative Comparisons

Quantitative comparisons are performed on a GPT-4 generated test set (50 text prompts for text-to-3D and 50 images for image-to-3D). Metrics include ULIP-T, ULIP-I, and four CLIP-based metrics (CLIP(N-I), CLIP(N-T), CLIP(I-I), CLIP(I-T)) averaged over 30 rendered views.

  • Superior Performance: CLAY consistently outperforms all state-of-the-art techniques across all metrics for both text-to-3D and image-to-3D tasks. This highlights its superior geometric and appearance alignment, as well as overall generation quality.

  • Efficiency: CLAY achieves this superior performance in significantly less time (approximately 45 seconds) compared to many optimization-based methods that take hours.

    The following are the results from Table 5 of the original paper:

    Method CLIP(N-T)↑ CLIP(I-T)↑ ULIP-T↑ ULIP-I↑ Time
    Text-to-3D
    Shap-E 0.1761 0.2081 0.1160 / ~10s
    DreamFusion 0.1549 0.1781 0.0566 / ~1.5h
    Magic3D 0.1553 0.2034 0.0661 / ~1.5h
    MVDream 0.1786 0.2237 0.1351 / ~1.5h
    RichDreamer 0.1891 0.2281 0.1503 / ~2h
    CLAY 0.1948 0.2324 0.1705 / ~45s
    Image-to-3D
    Shap-E 0.6315 0.6971 / 0.1307 ~10s
    Wonder3D 0.6489 0.7220 / 0.1520 ~4min
    DreamCraft3D 0.6641 0.7718 / 0.1706 ~4h
    One-2-3-45++ 0.6271 0.7574 / 0.1743 ~90s
    Michelangelo 0.6726 / / 0.1899 ~10s
    CLAY 0.6848 0.7769 / 0.2140 ~45s
  • CLIP(N-T): CLIP score for Normal maps vs Text.

  • CLIP(I-T): CLIP score for Rendered Images vs Text.

  • ULIP-T: ULIP score for Text vs Shape.

  • ULIP-I: ULIP score for Image vs Shape.

  • Time: Approximate generation time.

6.2.3. PBR Material Comparison

CLAY's material generation is compared with MVDream and RichDreamer using the text prompt "Space rocket."

  • MVDream: Lacks PBR materials and thus cannot accurately reproduce specular highlights under varying lighting conditions.

  • RichDreamer: Employs an albedo diffusion model and attempts to distinguish albedo from complex lighting. However, it often models highlights as fixed surface textures, failing to capture the view-dependent nature of specular reflections (e.g., highlights on the rocket's head appear static).

  • CLAY: Faithfully models PBR materials, resulting in metallic surfaces that exhibit realistic highlights that move consistently with changing environment lighting. This highlights the advantage of CLAY's separate and specialized PBR material generation.
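
To make the distinction concrete, the sketch below evaluates a generic GGX/Schlick microfacet specular term (not the renderer used in the paper): with fixed light, normal, roughness, and metallic values, the specular response changes as the camera moves, which is exactly what a baked-in highlight texture cannot reproduce.

```python
import numpy as np


def normalize(v: np.ndarray) -> np.ndarray:
    return v / np.linalg.norm(v)


def ggx_specular(n, v, l, roughness, metallic, base_color=0.9):
    """Generic Cook-Torrance style specular term (GGX distribution, Schlick Fresnel)."""
    h = normalize(v + l)                                      # half vector
    a2 = max(roughness, 1e-3) ** 4                            # alpha^2 with alpha = roughness^2
    ndh, ndv, ndl = (max(float(n @ x), 1e-4) for x in (h, v, l))
    d = a2 / (np.pi * (ndh ** 2 * (a2 - 1.0) + 1.0) ** 2)     # GGX normal distribution
    k = (roughness + 1.0) ** 2 / 8.0
    g = (ndv / (ndv * (1 - k) + k)) * (ndl / (ndl * (1 - k) + k))  # Smith geometry term
    f0 = 0.04 * (1 - metallic) + base_color * metallic        # Fresnel at normal incidence
    f = f0 + (1 - f0) * (1 - float(v @ h)) ** 5               # Schlick Fresnel
    return d * g * f / (4.0 * ndv * ndl)


n = np.array([0.0, 0.0, 1.0])                                 # fixed surface normal
l = normalize(np.array([0.3, 0.2, 1.0]))                      # fixed light direction
for angle in (0.0, 0.3, 0.6):                                 # only the camera moves
    v = normalize(np.array([np.sin(angle), 0.0, np.cos(angle)]))
    print(f"view angle {angle:.1f} rad -> specular {ggx_specular(n, v, l, 0.3, 1.0):.3f}")
```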

    The figure below (Figure 15 from the original paper) compares rendering results under two distinct lighting conditions.


6.2.4. User Studies

A comprehensive user study involved 150 volunteers, each evaluating 15 randomly chosen questions drawn from a pool of 5 GPT-4 generated text prompts and 15 Stable Diffusion generated images. Users indicated their preferred method for both appearance quality and geometry quality.

  • Text-to-3D:
    • CLAY secured 67.4% of votes for appearance and 78.9% for geometry.
    • It significantly surpassed the second-ranked RichDreamer, which had notably longer optimization times (~2 hours vs. CLAY's ~45 seconds).
  • Image-to-3D:
    • CLAY garnered 85.4% of votes for appearance and 91.2% for geometry.

These results indicate a strong user preference for CLAY in terms of both visual appeal and geometric fidelity, confirming its effectiveness in practice.

The figure below (Figure 16 from the original paper) shows user studies of CLAY vs. state-of-the-art methods.


6.3. Ablation Studies / Parameter Analysis

While the paper doesn't present a dedicated "Ablation Studies" section, the quantitative evaluation of different model sizes (Table 3) and conditioning types (Table 4) serves as an implicit form of parameter and component analysis:

  • Model Size Impact (Table 3): This table effectively shows an ablation of model scale. It demonstrates that increasing $n_{\mathrm{params}}$, $d_{\mathrm{model}}$, and the latent length ($L$) directly correlates with improved render-FID, P-FID, CLIP(I-T), and ULIP-T scores. This validates the effectiveness of scaling up the model's capacity (parameters) and resolution (latent length) in achieving higher quality and better text-shape alignment. The progressive training scheme, coupled with these scaling factors, is crucial for realizing these gains.
  • Conditioning Module Effectiveness (Table 4): This table acts as an ablation study for the various conditioning mechanisms. It highlights:
    • The baseline performance of individual conditions (Image, MVN, Voxel).
    • The specific strength of MVN conditioning in achieving superior geometric accuracy (CD, EMD, Voxel-IoU, F-Score, P-FID, P-KID). This validates that explicit 3D-aware inputs are highly effective.
    • The incremental improvement gained by combining conditions (e.g., Text-MVN generally outperforms MVN alone in semantic metrics like ULIP-T), demonstrating that the parallel conditioning scheme successfully integrates multiple sources of guidance.
    • The trade-offs between different conditions: Voxel and MVN provide excellent geometric control, while Image and Text-Image contribute more to appearance and broader semantic interpretation.

These analyses implicitly confirm the design choices of CLAY's multi-resolution VAE, minimalistic DiT, progressive training, and multi-modal conditioning scheme.

7. Conclusion & Reflections

7.1. Conclusion Summary

CLAY represents a significant advancement in 3D generative modeling, effectively bridging the gap between human imagination and digital creation. It introduces a large-scale (1.5 billion parameters), controllable generative model for high-quality 3D assets that supports diverse multi-modal inputs, including text, images, multi-view images, voxels, bounding boxes, and point clouds. At its core, CLAY utilizes a multi-resolution Variational Autoencoder (VAE) and a minimalistic latent Diffusion Transformer (DiT), trained with a progressive scheme on an ultra-large, meticulously processed 3D dataset. A robust data pipeline ensures watertight geometry and precise GPT-4V powered annotations. For appearance, CLAY generates production-ready 2K resolution Physically-Based Rendering (PBR) textures (diffuse, roughness, metallic) using a multi-view material diffusion model. Comprehensive evaluations, including user studies, demonstrate CLAY's superior performance over state-of-the-art methods in terms of geometric fidelity, diversity, material realism, and generation speed, making it highly effective for both conceptual design and production-ready asset creation.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Not Fully End-to-End: CLAY currently operates in distinct stages for geometry and material generation, requiring additional post-processing steps like remeshing and UV unwrapping.
    • Future Work: Explore integrated model architectures to concurrently generate geometry and PBR materials. This would necessitate automatic schemes to produce geometry with consistent topology, eliminating manual intermediate steps.
  • Training Data Quantity and Quality: While CLAY is trained on a substantially large dataset, there is still room for improvement in both the quantity and quality, especially when compared to the vast 2D image datasets used for models like Stable Diffusion.
  • Complex "Composed Objects": CLAY shows robustness in generating single objects but struggles with complex "composed objects" (e.g., "a tiger riding a motorcycle"), particularly with text-only inputs. This is attributed to insufficient training data for such complex compositions and a lack of detailed textual descriptions.
    • Future Work: This issue could potentially be mitigated through a text-to-image-to-3D workflow, similar to approaches in Wonder3D and One-2-3-45++. As the community augments datasets with more diverse 3D shapes and corresponding text, CLAY's ability to handle complexity should improve.
  • Dynamic Object Generation: The current model focuses on static 3D assets.
    • Future Work: Extend CLAY to dynamic object generation. The authors suggest that the generated geometries might be semantically partitionable into meaningful parts, which could facilitate motion and interaction (e.g., as explored in Singer et al. 2023 and Ling et al. 2024).

7.3. Personal Insights & Critique

7.3.1. Strengths and Innovations

  • Scalability and 3D-Native Approach: CLAY's commitment to a purely 3D-native approach, scaled to 1.5 billion parameters, is a significant step. It correctly identifies that directly learning 3D priors from 3D data, rather than lifting 2D priors, is key to geometric fidelity. This scale, combined with progressive training, is a crucial innovation.
  • Rigorous Data Pipeline: The emphasis on data standardization, including remeshing for watertightness and GPT-4V for annotation, addresses a fundamental bottleneck in 3D generative AI. This meticulous data preparation is often underestimated but vital for training large foundation models.
  • Separation of Concerns (Geometry & Appearance): Decoupling geometry and PBR material generation allows for specialized, high-quality solutions for each. This is a pragmatic choice that yields superior results compared to attempts to force joint generation, especially for production-ready assets.
  • Multi-modal Control Depth: The extensive range of conditional inputs, from abstract text to precise 3D primitives (voxels, point clouds, bounding boxes), offers unprecedented control to users, catering to different stages of the design process. The cross-attention based conditioning scheme is elegant and flexible.
  • Practicality and Speed: Achieving high-quality results in approximately 45 seconds is a major breakthrough for practical applications, making CLAY viable for interactive design workflows where SDS-based methods often fail due to long optimization times.

7.3.2. Potential Issues and Unverified Assumptions

  • Dependency on External Tools: While CLAY's modular design allows leveraging existing tools for quadrification and UV unwrapping, it also means the workflow isn't fully integrated. The quality of these external tools can impact the final output. The future work on integrating these steps is critical.
  • Computational Cost for Training: Training a 1.5 billion parameter model on 256 Nvidia A800 GPUs for 15 days is immensely expensive. While the inference is fast, the barrier to entry for replicating or significantly extending such models remains very high for most research labs.
  • Ethical Concerns: As acknowledged by the authors, the high generalization capability of AIGC models like CLAY carries risks of misuse, including generating deceptive content or violating intellectual property. The reliance on pre-trained feature encoders (CLIP, DINO) means that biases present in their training data could be propagated.
  • Limited "Composed Object" Generation: The inability to robustly generate complex "composed objects" from text-only prompts highlights a current limitation in semantic understanding and compositional reasoning, which is a common challenge in AIGC. The proposed text-to-image-to-3D workaround, while effective, implicitly reintroduces a dependency on 2D priors for complex scenes.

7.3.3. Applications and Future Impact

CLAY's methods and conclusions are highly transferable and applicable across various domains:

  • Entertainment Industry: Revolutionizing asset creation for video games, film, and animation, enabling faster iteration and richer content.

  • Design and Engineering: Rapid prototyping and conceptual design in product development, architecture, and industrial design.

  • Virtual and Augmented Reality (VR/AR): Populating virtual worlds with high-quality, diverse objects on demand.

  • Digital Preservation: Automatically generating 3D models from sparse scans or textual descriptions of historical artifacts.

    The work strongly suggests that the future of 3D AIGC lies in large-scale 3D-native foundation models coupled with intelligent data curation and multi-modal control. As 3D datasets continue to grow and computational resources become more accessible, models like CLAY will likely become standard tools, fundamentally altering the landscape of digital creation.
