Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
TL;DR Summary
Dora-VAE enhances 3D shape VAEs using sharp edge sampling and dual cross-attention, improving fine detail reconstruction. Dora-bench benchmark and Sharp Normal Error metric enable precise evaluation of shape complexity and reconstruction quality.
Abstract
Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. Such sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves comparable reconstruction quality to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8× smaller (1,280 vs. >10,000 codes).
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders
1.2. Authors
The paper lists Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan as authors.
- Rui Chen, Yixun Liang, Weiyu Li, Jiarui Liu, Xiaoxiao Long, Ping Tan: Affiliated with The Hong Kong University of Science and Technology and LightIllusions. Ping Tan and Jianfeng Zhang are indicated as corresponding authors. Their research likely focuses on computer graphics, 3D vision, and AI for creative applications.
- Jianfeng Zhang, Guan Luo, Xiu Li, Jiashi Feng: Affiliated with ByteDance Seed and Tsinghua University (Guan Luo). Their research background typically includes deep learning, computer vision, and large-scale model development, often with an industry application focus.
1.3. Journal/Conference
The paper was published on arXiv, a preprint server, with a publication timestamp of 2024-12-23T18:59:06.000Z. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a widely respected platform for disseminating cutting-edge research in fields like AI and computer graphics, allowing rapid sharing of findings. The authors' affiliations with prominent academic institutions and industry labs (HKUST, ByteDance, Tsinghua) suggest the work is of high academic caliber.
1.4. Publication Year
2024
1.5. Abstract
The paper addresses a critical limitation in 3D content generation pipelines that utilize Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion models. The prevalent uniform point sampling strategy in VAE training often results in a loss of fine geometric details, compromising reconstruction quality and downstream generation tasks. To counter this, the authors introduce Dora-VAE, a novel approach featuring a sharp edge sampling (SES) strategy that prioritizes geometrically complex regions and a dual cross-attention mechanism for enhanced encoding. This method significantly improves the preservation of fine-grained shape features. Additionally, the paper proposes Dora-bench, a benchmark that quantifies shape complexity based on sharp edge density and introduces a new metric, Sharp Normal Error (SNE), to specifically assess reconstruction accuracy in these salient geometric areas. Extensive experiments on Dora-bench demonstrate that Dora-VAE achieves reconstruction quality comparable to state-of-the-art dense models like XCube-VAE while requiring a significantly smaller latent space (at least 8× smaller), making it more efficient for downstream tasks.
1.6. Original Source Link
Official Source: https://arxiv.org/abs/2412.17808 PDF Link: https://arxiv.org/pdf/2412.17808v3.pdf Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant loss of geometric details in 3D shape reconstruction when using Variational Autoencoders (VAEs) for 3D content generation. This issue primarily arises from the widely adopted uniform point sampling strategy during VAE training.
This problem is critical because 3D content generation, especially with the rise of AI-powered methods, is vital for various industries like games, movies, and AR/VR. The quality of generated 3D shapes heavily depends on the VAE's ability to faithfully encode and reconstruct these shapes into compact latent representations, which then serve as input for diffusion models. If the VAE loses details during this encoding/reconstruction process, the subsequent diffusion model will generate lower-quality, less intricate 3D assets.
Specific challenges or gaps in prior research include:
- Volume-based methods (e.g., XCube): While capable of high-fidelity reconstruction by processing millions of voxelized points, they produce very large latent codes (often more than 10,000 tokens). This large latent space significantly complicates and slows down the training of subsequent diffusion models, making them impractical for many generative pipelines.
- Vector-set (Vecset) methods (e.g., 3DShape2VecSet, Craftsman): These methods use transformers to achieve highly compact latent representations (hundreds to thousands of tokens), which is ideal for efficient diffusion model training. However, due to computational constraints, they typically sample only a few thousand points. When combined with uniform sampling, this leads to substantial information loss, particularly in geometrically complex regions, resulting in poor detail preservation.

The paper's entry point and innovative idea is to improve the reconstruction quality of Vecset-based VAEs without sacrificing their compact latent representations. They achieve this by introducing an intelligent sampling strategy that prioritizes geometrically salient features, coupled with an encoding mechanism designed to leverage these rich details.
2.2. Main Contributions / Findings
The paper makes several primary contributions to the field of 3D content generation and VAEs:
- Introduction of Dora-VAE for High-Quality 3D Reconstruction with Compact Latent Representations: The authors propose Dora-VAE, a novel 3D VAE model that significantly enhances reconstruction quality, particularly in preserving fine geometric details, while maintaining a compact latent space. This is achieved through:
  - Sharp Edge Sampling (SES): A new importance sampling algorithm designed for 3D VAE training that adaptively samples points from geometrically salient regions of a mesh, ensuring critical details are captured.
  - Dual Cross-Attention Architecture: A novel encoding mechanism that effectively processes both the uniformly sampled points and the detail-rich points sampled by SES, allowing the VAE to focus on crucial geometric features.
- Development of Dora-Bench for Rigorous 3D VAE Evaluation: To systematically and more rigorously evaluate 3D VAE reconstruction quality, the paper introduces Dora-bench. This benchmark:
  - Categorizes Shapes by Geometric Complexity: Divides test shapes into four levels of detail based on the density of sharp edges, providing a more nuanced assessment than random selection.
  - Introduces Sharp Normal Error (SNE): A novel metric specifically designed to quantify reconstruction accuracy at salient geometric features, addressing the limitations of general metrics like Chamfer Distance and F-Score, which often overlook fine details.
- Demonstrated Superior Performance and Efficiency: Extensive experiments on Dora-bench show that Dora-VAE achieves comparable or superior reconstruction quality to state-of-the-art dense volume-based methods (like XCube-VAE) while requiring a latent space that is at least 8× smaller (1,280 vs. >10,000 codes). This efficiency makes Dora-VAE particularly well-suited for training downstream latent diffusion models, significantly enhancing the quality of generated 3D shapes in applications like single-image 3D generation.

These findings directly solve the problem of balancing compact latent representations with high-fidelity geometric detail preservation, a long-standing challenge in 3D VAE design for generative pipelines.
3. Prerequisite Knowledge & Related Work
This section outlines the foundational concepts necessary to understand the paper, summarizes relevant prior research, discusses the technological evolution in the field, and highlights how Dora-VAE differentiates itself.
3.1. Foundational Concepts
3.1.1. Variational Autoencoders (VAEs)
A Variational Autoencoder (VAE) is a type of generative model that learns a compressed, continuous representation (called a latent space) of input data. It consists of two main parts:
- Encoder: Maps input data (e.g., a 3D shape) to a distribution over the latent space. Instead of directly outputting a latent vector, it outputs parameters (mean and variance) of a probability distribution (typically a Gaussian) from which a latent vector is sampled. This makes the latent space continuous and allows for smooth interpolations.
- Decoder: Takes a sampled latent vector from the latent space and reconstructs the original input data.
VAEs are trained to minimize two loss components: a reconstruction loss (how well the output matches the input) and a Kullback-Leibler (KL) divergence loss, which regularizes the latent space by forcing the learned distribution to be close to a prior distribution (e.g., a standard Gaussian). This regularization ensures the latent space is well-structured and facilitates new data generation by sampling from this prior.
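As a minimal illustration of this two-term objective, the sketch below combines an MSE reconstruction term with the closed-form Gaussian KL and a reparameterized sampler. The 0.001 KL weight mirrors the value reported later in the implementation details; everything else (names, shapes) is generic, not the paper's code.

```python
# A minimal sketch of the standard VAE objective, assuming a Gaussian
# posterior with predicted mean `mu` and log-variance `logvar`.
import torch

def vae_loss(x_hat, x, mu, logvar, kl_weight=0.001):
    # Reconstruction term: how well the decoder output matches the input.
    recon = torch.nn.functional.mse_loss(x_hat, x)
    # Closed-form KL divergence between N(mu, sigma^2) and the standard
    # normal prior, averaged over the batch.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl_weight * kl

def reparameterize(mu, logvar):
    # Sample z = mu + sigma * eps so gradients flow through mu and logvar.
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * logvar) * eps
```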
3.1.2. 3D Shape Representation
3D shapes can be represented in various ways, each with its own advantages and disadvantages:
- Meshes: A mesh is a collection of vertices, edges, and faces that defines the shape of a 3D object. Triangular meshes are common, where faces are triangles. They are good for representing surfaces with fine details.
- Point Clouds: A set of discrete points in 3D space, sampled from the surface of an object. Point clouds are flexible and can represent complex geometry, but lack explicit connectivity information (like faces or edges).
- Voxels: A 3D grid where each cell (voxel) indicates whether it's occupied by the object or not. Similar to pixels in 2D, but in 3D. They are good for representing solid objects but can be computationally expensive and memory-intensive for high resolutions due to their cubic growth.
- Occupancy Fields: A continuous function that predicts whether any given 3D coordinate is inside or outside an object. The surface of the object is implicitly defined as the zero-level set of this function. This representation is powerful for generative tasks as it allows for continuous shape manipulation.
3.1.3. Diffusion Models
Diffusion Models are a class of generative models that have achieved state-of-the-art results in image and now 3D content generation. They work by iteratively denoising a noisy input (e.g., a latent vector) until it resembles a data sample from the training distribution.
- Forward Diffusion Process: Gradually adds Gaussian noise to data samples until they become pure noise.
- Reverse Diffusion Process: A neural network learns to reverse this process, starting from pure noise and iteratively predicting and removing noise to generate a clean data sample.
In the context of 3D generation, latent diffusion models often operate on the compact latent representations learned by VAEs, making the diffusion process more efficient.
3.1.4. Attention Mechanism / Transformers
The Attention mechanism is a key component in modern deep learning, especially for processing sequential or set-based data. It allows a model to weigh the importance of different parts of the input data when processing a specific element.
- Self-Attention: Enables a model to learn relationships between different elements within a single input sequence/set. For example, in a sentence, it helps understand how each word relates to other words. For a point cloud, it helps understand relationships between points.
- Cross-Attention: Used in encoder-decoder architectures where the model needs to attend to information from a different sequence/set. For example, a decoder might use cross-attention to focus on relevant parts of the encoder's output when generating its own output.
The Transformer architecture relies on attention mechanisms (multi-head self-attention and cross-attention layers) to process data, replacing traditional recurrent or convolutional layers. Transformers are particularly effective for vector-set based 3D VAEs because point clouds can be viewed as unordered sets of vectors.
The core Attention mechanism, as introduced in "Attention Is All You Need" (Vaswani et al., 2017), is calculated as:
$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
$
Where:
- $Q$ (Query), $K$ (Key), $V$ (Value) are matrices derived from the input embeddings.
- $Q$ represents the query for information.
- $K$ represents the keys to match the query against.
- $V$ represents the values containing the information to be retrieved.
- $QK^T$ calculates the dot product similarity between queries and keys.
- $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent the dot products from growing too large and pushing the softmax function into regions with tiny gradients.
- $\mathrm{softmax}$ normalizes the scores, turning them into probability distributions.
- The result is a weighted sum of the values, where weights are determined by the attention scores.
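A direct NumPy transcription of this formula (single head, no masking) makes the mechanics concrete:

```python
# A minimal NumPy sketch of scaled dot-product attention as defined above.
import numpy as np

def attention(Q, K, V):
    # Q: (n_q, d_k), K: (n_k, d_k), V: (n_k, d_v)
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n_q, n_k) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of values
```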
3.1.5. Sampling Strategies
Sampling refers to selecting a subset of points from a larger set or continuous surface.
- Uniform Sampling: Selects points randomly across the surface with equal probability. While simple, it often fails to capture fine details in regions of high curvature if the total number of samples is limited.
- Poisson Disk Sampling: A sampling method that generates points such that no two points are closer than a specified minimum distance, resulting in a more even distribution than pure random sampling.
- Farthest Point Sampling (FPS): An algorithm for selecting a subset of points from a point cloud. It starts with an arbitrary point and then iteratively selects the point farthest from the already selected points. This ensures good coverage and is often used for downsampling point clouds while preserving geometric structure.
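A minimal NumPy sketch of the iterative FPS selection just described (this helper is reused in a later sketch):

```python
# Farthest Point Sampling on an (N, 3) point array.
import numpy as np

def farthest_point_sampling(points, n_samples):
    n = points.shape[0]
    selected = np.zeros(n_samples, dtype=np.int64)
    # Track each point's distance to the nearest already-selected point.
    dist = np.full(n, np.inf)
    selected[0] = 0  # start from an arbitrary point
    for i in range(1, n_samples):
        d = np.linalg.norm(points - points[selected[i - 1]], axis=1)
        dist = np.minimum(dist, d)
        selected[i] = int(np.argmax(dist))  # farthest from the selected set
    return points[selected]
```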
3.1.6. Dihedral Angle
In geometry, a dihedral angle is the angle between two intersecting planes or faces of a polyhedron (like a mesh). In this paper's formulation, it is measured via the angle between the normal vectors of two adjacent faces: nearly parallel normals indicate a flat region, while a large angle between the normals indicates a sharp edge or high-curvature region, which is crucial for defining geometric details.
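In code, the sharp-edge test reduces to an arccosine of the normals' dot product; a tiny sketch:

```python
# Angle between the unit normals of two faces sharing an edge.
import numpy as np

def edge_angle(n1, n2):
    # Clip for numerical safety before arccos.
    cos_theta = np.clip(np.dot(n1, n2), -1.0, 1.0)
    return np.arccos(cos_theta)  # 0 for coplanar faces, larger for sharper edges
```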
3.2. Previous Works
3.2.1. Importance Sampling in Point Clouds
Importance sampling techniques are well-established in point cloud processing. For instance, APES [57] (Attention-Based Point Cloud Edge Sampling) uses attention to sample points for classification and segmentation. However, these methods typically operate on pre-existing point clouds. Dora-VAE's contribution is applying importance sampling directly from mesh surfaces for VAE training, which is different because the goal is to preserve complete geometric information from the source mesh during the initial sampling stage.
3.2.2. 3D Shape VAEs
Previous 3D VAEs fall into two main categories:
- Volume-based methods (e.g., XCube [45]): These approaches leverage sparse convolution to encode densely voxelized surfaces. They achieve very high reconstruction quality, excelling at preserving geometric details due to their dense representation. However, their primary drawback is the generation of very large latent codes (often more than 10,000 tokens), which are computationally prohibitive for training latent diffusion models. XCube is a prime example of this trade-off.
- Vector Set-based (Vecset) approaches (e.g., 3DShape2VecSet [63], Craftsman [31], CLAY [65]): These methods encode uniformly sampled surface points using Transformers, resulting in highly compact latent spaces (hundreds to thousands of tokens). This compactness is highly desirable for efficient diffusion model training. However, their reliance on uniform sampling, especially when constrained to a limited number of points due to the Transformer's quadratic complexity, often leads to information loss and poor preservation of intricate geometric details. 3DShape2VecSet is the foundational model that Dora-VAE builds upon.
3.2.3. 3D Content Creation
Current 3D generation methods can be broadly categorized:
- Optimization-based methods (e.g., DreamFusion [43]): These methods use Score Distillation Sampling (SDS) to optimize 3D representations (like NeRFs or Gaussian Splatting) using 2D diffusion model priors. They can achieve photorealistic results but are often slow, suffer from training instability, and may struggle with geometric consistency.
- Large Reconstruction Models (LRM-based, e.g., LRM [22]): These models perform large-scale sparse-view reconstruction to generate 3D content efficiently. While fast, they typically lack explicit geometric priors, which can lead to compromised geometric fidelity and inconsistent surface details.
- 3D Native Generative Models (e.g., 3DShape2VecSet [63]): This group, to which Dora-VAE belongs, adopts a two-stage pipeline: first, a 3D VAE encodes shapes into a compact latent space, and then a conditional latent diffusion model is trained on this latent space for generation. This approach generally ensures better geometric consistency due to the VAE's inherent geometric constraints. The quality of generated shapes is fundamentally limited by the VAE's reconstruction capability. Recent works [31, 62] have highlighted that improving VAE reconstruction directly translates to enhanced downstream generation quality, motivating the focus of Dora-VAE.
3.3. Technological Evolution
The field of 3D content generation has rapidly evolved from demanding manual expertise to leveraging AI for automation and accessibility. Early efforts focused on direct 3D model manipulation or procedural generation. The advent of deep learning brought about voxel-based and point cloud-based generative models. The success of VAEs and Generative Adversarial Networks (GANs) in 2D inspired their application to 3D, leading to methods like 3DShape2VecSet that encode 3D shapes into latent spaces. Concurrently, diffusion models revolutionized 2D image generation, quickly extending to 3D. The current paradigm often involves a two-stage process: encoding complex 3D geometry into a compact, manipulable latent representation using a VAE, followed by a diffusion model generating within this latent space. The core challenge in this evolution has been balancing the fidelity of 3D reconstruction (especially fine details) with the compactness of the latent representation (essential for efficient diffusion). Dora-VAE fits into this timeline by directly addressing the detail-loss problem inherent in compact Vecset-based VAEs, pushing the frontier towards higher fidelity in efficient 3D native generation.
3.4. Differentiation Analysis
Compared to the main methods in related work, Dora-VAE introduces several core differences and innovations:
- Addressing Uniform Sampling Limitation: The most significant differentiation is Dora-VAE's direct attack on the uniform sampling strategy, which is a common bottleneck for Vecset-based VAEs in preserving details. Unlike 3DShape2VecSet or Craftsman, which rely on uniform sampling, Dora-VAE integrates a novel Sharp Edge Sampling (SES) strategy. This is a fundamental shift from generic sampling to importance sampling for 3D VAE learning, a concept previously unexplored in this specific context (i.e., sampling from meshes for VAE training, rather than processing existing point clouds).
- Specialized Encoding for Salient Features: To fully leverage the detail-rich point clouds generated by SES, Dora-VAE introduces a dual cross-attention architecture. This is an innovation over simpler Transformer-based VAEs that might process all sampled points uniformly. The dual attention explicitly separates and processes features from salient and uniform regions, allowing the model to better focus on critical details.
- Comprehensive Evaluation Methodology: Dora-VAE goes beyond just proposing a new model by also introducing Dora-bench. This benchmark is novel because it systematically categorizes shapes by geometric complexity rather than using randomly selected test sets, enabling a more nuanced evaluation of detail preservation, and because it proposes a new metric, Sharp Normal Error (SNE), specifically tailored to assess reconstruction quality in salient geometric features, which existing metrics like Chamfer Distance or F-Score often fail to adequately capture.
- Balancing Compactness and Fidelity: While XCube-VAE achieves high fidelity, it does so at the cost of extremely large latent codes, making it unsuitable for diffusion models. Vecset-based methods achieve compactness but sacrifice detail. Dora-VAE uniquely offers comparable reconstruction quality to XCube-VAE while maintaining a highly compact latent space (8× smaller), effectively bridging this gap and providing a more practical solution for 3D native generative models.

In essence, Dora-VAE differentiates itself by innovating at both the input data preparation stage (sampling) and the model architecture stage (encoding), specifically targeting the preservation of fine geometric details within a compact latent representation, complemented by a rigorous, detail-focused evaluation benchmark.
4. Methodology
The methodology section describes Dora-VAE, a novel approach for high-quality 3D reconstruction, and Dora-Bench, a new benchmark for evaluating 3D VAEs. The core idea revolves around intelligent sampling and encoding strategies to preserve fine geometric details within compact latent representations.
4.1. Principles
The core principle behind Dora-VAE is that for Variational Autoencoders (VAEs) to effectively encode 3D shapes into compact latent representations without losing fine geometric details, the input sampling process must prioritize regions of high geometric complexity. Traditional uniform sampling fails to do this, especially when computational constraints limit the total number of sampled points. By focusing on sharp edges as indicators of salient geometric features and designing a dual cross-attention mechanism to specifically process these feature-rich regions, Dora-VAE aims to achieve high-fidelity reconstruction. Complementing this, Dora-Bench is built on the principle that a robust evaluation of 3D VAEs requires assessing their performance across varying levels of shape complexity and specifically measuring their ability to reconstruct salient geometric features through a dedicated metric.
4.2. Core Methodology In-depth
The Dora-VAE pipeline enhances 3DShape2VecSet, a transformer-based 3D VAE. Figure 2 provides an overview of the proposed Dora-VAE pipeline.
4.2.1. Preliminary: 3DShape2VecSet
Dora-VAE builds upon the 3DShape2VecSet [63] architecture, which encodes uniformly sampled surface points into compact latent codes. The 3DShape2VecSet pipeline consists of three key steps:
4.2.1.1. Surface Sampling
Given a 3D surface $S$, 3DShape2VecSet first uniformly samples $N_d$ points from $S$ using Poisson disk sampling [61] to create a dense point cloud $P_d$. Then, Farthest Point Sampling (FPS) [39] is applied to $P_d$ to downsample it to $N_s$ points, forming a sparse point cloud $P_s$.
The mathematical representation for this step is: $ P_d = \{ p_d^i \in S \mid i = 1, \ldots, N_d \}, \quad P_s = \mathrm{FPS}(P_d, N_s). $ Where:
- $S$: The input 3D surface (mesh).
- $N_d$: The target number of dense points to be sampled.
- $p_d^i$: The $i$-th point in the dense point cloud $P_d$.
- $P_d$: The dense point cloud sampled from the surface $S$. Poisson disk sampling is used to ensure a relatively even distribution of these points.
- $\mathrm{FPS}(\cdot, \cdot)$: The Farthest Point Sampling function, which takes the dense point cloud and downsamples it to $N_s$ points, ensuring good coverage of the shape.
- $N_s$: The target number of sparse points after downsampling.
- $P_s$: The sparse point cloud, which serves as the query in the subsequent feature encoding step.
4.2.1.2. Feature Encoding
Next, the 3DShape2VecSet model computes point cloud features using a cross-attention mechanism between the sparse point cloud $P_s$ (as query) and the dense point cloud $P_d$ (as key and value). This is followed by self-attention layers to generate the compact latent code $z$.
The mathematical representation for this step is: $ C = \mathrm{CrossAttn}(P_s, P_d, P_d), \quad z = \mathrm{SelfAttn}(C). $ Where:
- $\mathrm{CrossAttn}(Q, K, V)$: The cross-attention mechanism, where $Q$ is the Query, $K$ is the Key, and $V$ is the Value. Here, $P_s$ acts as the query, while $P_d$ provides both the keys and values. This allows the sparse points to attend to features from the dense points.
- $C$: The resulting point cloud feature representation.
- $\mathrm{SelfAttn}(\cdot)$: The self-attention mechanism, which processes the feature $C$ to learn relationships within the feature set itself.
- $z$: The final compact latent code, which is a compressed representation of the 3D shape.
4.2.1.3. Geometry Decoding
Finally, the latent code $z$ is further processed through self-attention layers. This processed latent code is then used with cross-attention to predict occupancy values for randomly sampled spatial query points $Q_{space}$. Occupancy values indicate whether a given 3D point lies inside (occupied) or outside (empty) the reconstructed 3D shape.
The mathematical representation for this step is: $ \hat{O} = \mathrm{CrossAttn}(Q_{space}, \mathrm{SelfAttn}(z)). $ Where:
- $\mathrm{SelfAttn}(z)$: Self-attention applied to the latent code $z$ to enrich its representation for decoding.
- $Q_{space}$: Randomly sampled 3D spatial query points, typically within a bounding box enclosing the object, for which the decoder predicts occupancy.
- $\mathrm{CrossAttn}(\cdot)$: Cross-attention where the query points $Q_{space}$ are queries, and the output of $\mathrm{SelfAttn}(z)$ provides keys and values. This allows the decoder to predict occupancy for arbitrary 3D points based on the learned shape representation.
- $\hat{O}$: The predicted occupancy values for the query points $Q_{space}$, which implicitly define the reconstructed 3D surface.
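The three steps can be summarized in a schematic (deliberately simplified) PyTorch sketch. Here `torch.nn.MultiheadAttention` stands in for the paper's attention blocks, a single layer replaces each stack of layers, and a plain linear lift of xyz coordinates replaces the real point embedding — placeholders, not the actual architecture.

```python
# A schematic sketch of the 3DShape2VecSet encode/decode pipeline.
import torch
import torch.nn as nn

class VecSetVAESketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.embed = nn.Linear(3, dim)  # lift xyz points to feature space
        self.cross_enc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_enc = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_dec = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_dec = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.occ_head = nn.Linear(dim, 1)  # per-query occupancy prediction

    def forward(self, p_sparse, p_dense, q_space):
        ps, pd, q = self.embed(p_sparse), self.embed(p_dense), self.embed(q_space)
        c, _ = self.cross_enc(ps, pd, pd)   # C = CrossAttn(P_s, P_d, P_d)
        z, _ = self.self_enc(c, c, c)       # z = SelfAttn(C)
        h, _ = self.self_dec(z, z, z)       # refine latent for decoding
        o, _ = self.cross_dec(q, h, h)      # O_hat = CrossAttn(Q_space, SelfAttn(z))
        return self.occ_head(o).squeeze(-1) # occupancy logits per query point
```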
4.2.2. Dora-VAE
Figure 2 provides an overview of the Dora-VAE pipeline.
Figure 2. Overview of Dora-VAE. (a) We utilize the proposed sharp edge sampling technique to extract both salient and uniform points from the input mesh. These points are then combined with dense points, effectively capturing both salient regions and smooth areas. (b) To enhance the encoding of point clouds sampled through sharp edge sampling, we design a dual cross-attention architecture.
For each input mesh, Dora-VAE augments the uniformly sampled point cloud $P_u$ with additional salient points $P_a$ obtained through the proposed sharp edge sampling (SES) strategy. These two sets combine to form the dense point cloud $P_d$. During encoding, Dora-VAE employs a dual cross-attention mechanism to process $P_u$ and $P_a$ separately, summing the results before passing them to self-attention to compute the latent code $z$. The VAE training largely follows 3DShape2VecSet, supervised by a loss evaluated on the occupancy field.
4.2.2.1. Sharp Edge Sampling (SES)
The Sharp Edge Sampling (SES) algorithm is designed to effectively sample points from geometrically salient regions of a 3D mesh. To ensure overall surface coverage, SES also incorporates uniformly sampled points. The final dense point cloud $P_d$ is a union of uniformly sampled points $P_u$ and points specifically sampled from salient regions $P_a$, i.e., $P_d = P_u \cup P_a$. The process for computing the salient points $P_a$ involves two main steps:
4.2.2.1.1. Salient Edges Detection
Given a triangular mesh, SES identifies a set of salient edges by analyzing the dihedral angles between adjacent faces. The dihedral angle directly measures surface curvature along mesh edges. For each edge $e$ shared by two adjacent faces $f_1$ and $f_2$, the dihedral angle is calculated using their normal vectors.
The dihedral angle $\theta_e$ for an edge $e$ shared by faces $f_1$ and $f_2$ is computed as: $ \theta_e = \arccos\left( \frac{ \mathbf{n}_{f_1} \cdot \mathbf{n}_{f_2} }{ \| \mathbf{n}_{f_1} \| \, \| \mathbf{n}_{f_2} \| } \right), $ Where:
- $\mathbf{n}_{f_1}$: The normal vector of face $f_1$.
- $\mathbf{n}_{f_2}$: The normal vector of face $f_2$.
- $\cdot$: The dot product operator.
- $\| \cdot \|$: The Euclidean norm (magnitude) of a vector.
- $\arccos$: The arccosine function, which returns the angle whose cosine is the argument. The result is the angle between the two face normals; a larger $\theta_e$ (approaching $\pi$ radians, i.e., 180 degrees) indicates a sharper edge.
The salient edge set $\Gamma$ then includes all edges whose dihedral angle exceeds a predefined threshold $\tau$: $ \Gamma = \{ e \mid \theta_e > \tau \} $ Where:
- $e$: An edge in the mesh.
- $\tau$: A predefined threshold angle. Edges with a dihedral angle greater than $\tau$ are considered salient. Let $N_\Gamma$ denote the total number of salient edges detected.
4.2.2.1.2. Salient Points Sampling
For each salient edge $e \in \Gamma$, its two vertices, $v_{e,1}$ and $v_{e,2}$, are collected into a salient vertex set $P_\Gamma$. Duplicate vertices arising from connected salient edges are included only once.
$
P_\Gamma = \{ v_{e,1}, v_{e,2} \mid e \in \Gamma \},
$
Where:
- $v_{e,1}, v_{e,2}$: The two vertices defining the salient edge $e$.
- $P_\Gamma$: The set of all unique vertices that are part of at least one salient edge. Let $N_V$ denote the number of unique vertices in $P_\Gamma$.
Given a target number of salient points $N_{\mathrm{desired}}$, the salient point set $P_a$ is generated based on the available salient vertices, considering three cases:
$
P_a = \begin{cases} \mathrm{FPS}(P_\Gamma, N_{\mathrm{desired}}), & \text{if } N_{\mathrm{desired}} \le N_V, \\ P_\Gamma \cup P_{\mathrm{interpolated}}, & \text{if } 0 < N_V < N_{\mathrm{desired}}, \\ \emptyset, & \text{if } N_V = 0. \end{cases}
$
Where:
- $\mathrm{FPS}(P_\Gamma, N_{\mathrm{desired}})$: If the number of unique salient vertices ($N_V$) is greater than or equal to the desired number of salient points ($N_{\mathrm{desired}}$), Farthest Point Sampling is used to downsample $P_\Gamma$ to exactly $N_{\mathrm{desired}}$ points.
- $P_\Gamma \cup P_{\mathrm{interpolated}}$: If $N_V$ is less than $N_{\mathrm{desired}}$ (meaning there aren't enough salient vertices), all vertices from $P_\Gamma$ are included. Additional points, $P_{\mathrm{interpolated}}$, are generated by uniformly sampling points along each salient edge in $\Gamma$. This ensures that the target number of salient points is met by distributing the remaining points along the detected sharp features.
- $\emptyset$: If no salient edges are detected ($N_V = 0$), the salient point set remains empty.
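To make the two-step procedure concrete, the sketch below strings the pieces together in NumPy under simplifying assumptions: the mesh's edge list and per-edge adjacent-face normals are precomputed, and `farthest_point_sampling` is the helper from the FPS sketch in Section 3.1.5. The random per-edge interpolation used to pad the point set is an illustrative stand-in for the paper's uniform sampling along each salient edge.

```python
# A NumPy sketch of sharp edge sampling (SES) as described above.
import numpy as np
# farthest_point_sampling: as defined in the earlier FPS sketch.

def sharp_edge_sampling(vertices, edges, edge_face_normals, n_desired, tau):
    # edges: (E, 2) vertex indices; edge_face_normals: (E, 2, 3) unit normals
    # of the two faces adjacent to each edge.
    n1, n2 = edge_face_normals[:, 0], edge_face_normals[:, 1]
    theta = np.arccos(np.clip((n1 * n2).sum(-1), -1.0, 1.0))
    salient = edges[theta > tau]                 # salient edge set Gamma
    if len(salient) == 0:
        return np.empty((0, 3))                  # case N_V = 0
    verts = vertices[np.unique(salient)]         # unique salient vertices P_Gamma
    if len(verts) >= n_desired:
        return farthest_point_sampling(verts, n_desired)  # FPS to target count
    # Not enough vertices: pad with points sampled along salient edges.
    extra = n_desired - len(verts)
    t = np.random.rand(extra, 1)
    idx = np.random.randint(len(salient), size=extra)
    a, b = vertices[salient[idx, 0]], vertices[salient[idx, 1]]
    return np.concatenate([verts, a + t * (b - a)])  # P_Gamma + interpolated
```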
4.2.2.2. Dual Cross Attention
After obtaining the point cloud $P_d$ (which is $P_u \cup P_a$) through the SES strategy, Dora-VAE employs a dual cross-attention architecture to effectively encode both the uniform and salient regions. Following 3DShape2VecSet, $P_u$ and $P_a$ are first downsampled separately using FPS to form the sparse point cloud $P_s$.
The separate downsampling for $P_u$ and $P_a$ is: $ P_s = \mathrm{FPS}(P_u, N_{s,1}) \cup \mathrm{FPS}(P_a, N_{s,2}), $ Where:
- $P_u$: Uniformly sampled points.
- $P_a$: Salient points sampled by SES.
- $N_{s,1}$: The number of sparse points sampled from $P_u$.
- $N_{s,2}$: The number of sparse points sampled from $P_a$.
- $\mathrm{FPS}$: Farthest Point Sampling.
- $P_s$: The combined sparse point cloud used as queries for the cross-attention mechanism.
Next, cross-attention features are computed separately for the uniform points ($P_u$) and salient points ($P_a$) using the combined sparse point cloud $P_s$ as the query: $ C_u = \mathrm{CrossAttn}(P_s, P_u, P_u), \quad C_a = \mathrm{CrossAttn}(P_s, P_a, P_a) $ Where:
- $C_u$: The feature representation derived from the uniform points $P_u$, with $P_s$ as the query.
- $C_a$: The feature representation derived from the salient points $P_a$, with $P_s$ as the query.
This dual attention design allows the model to focus separately on features from uniform regions (capturing overall structure) and salient regions (capturing fine details).
The final point cloud feature combines both attention results by summing them:
$
C = C_u + C_a.
$
This combined feature is then used, similar to 3DShape2VecSet, to predict the occupancy field through self-attention blocks. The entire model, with parameters $\psi$, is optimized using Mean Squared Error (MSE) loss against the ground truth occupancy $O$.
The optimization objective with MSE loss is:
$
\nabla_{\psi} \mathcal{L}_{\mathrm{MSE}}(\hat{O}, O) = \mathbb{E}\left[ 2 (\hat{O} - O) \frac{\partial \hat{O}}{\partial \psi} \right].
$
Where:
- $\mathcal{L}_{\mathrm{MSE}}$: The Mean Squared Error loss function.
- $\hat{O}$: The predicted occupancy values.
- $O$: The ground truth occupancy values.
- $\mathbb{E}$: Expectation over the training data.
- $\frac{\partial \hat{O}}{\partial \psi}$: The gradient of the predicted occupancy with respect to the model parameters $\psi$. This term is part of the gradient calculation for optimizing the model.
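As a schematic illustration of this encoding path, the sketch below wires two `torch.nn.MultiheadAttention` modules as the dual cross-attention branches and sums their outputs; the embedding layer, dimensions, and module choices are placeholders rather than the paper's exact architecture.

```python
# A sketch of dual cross-attention: shared sparse queries P_s attend once to
# the uniform points and once to the salient points, and the results are summed.
import torch
import torch.nn as nn

class DualCrossAttentionSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        self.attn_u = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, p_s, p_u, p_a):
        q = self.embed(p_s)              # shared sparse queries P_s
        ku = self.embed(p_u)             # uniform points P_u as keys/values
        ka = self.embed(p_a)             # salient points P_a as keys/values
        c_u, _ = self.attn_u(q, ku, ku)  # C_u = CrossAttn(P_s, P_u, P_u)
        c_a, _ = self.attn_a(q, ka, ka)  # C_a = CrossAttn(P_s, P_a, P_a)
        return c_u + c_a                 # C = C_u + C_a
```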
4.2.3. Dora-Bench
Dora-Bench is proposed to facilitate a more rigorous and systematic evaluation of VAE performance by addressing the limitations of conventional evaluation protocols that use randomly selected shapes and general metrics.
4.2.3.1. Geometric Complexity-based Evaluation
Dora-Bench categorizes test shapes based on their geometric complexity. This complexity is quantified using the number of salient edges $N_\Gamma$ (as defined in the salient edge detection step above). Shapes are classified into four levels by thresholding $N_\Gamma$:
- Level 1 (Less Detail): Shapes with relatively few sharp features.
- Level 2 (Moderate Detail): Shapes with a moderate number of sharp features.
- Level 3 (Rich Detail): Shapes with many intricate details and sharp features.
- Level 4 (Very Rich Detail): Shapes with extremely high geometric complexity, characterized by a very large number of sharp edges.

The benchmark curates test shapes from various public datasets, including GSO [18], ABO [14], Meta [3], and Objaverse [16], to ensure a diverse range of geometric complexities.
The following figure (Figure 3 from the original paper) illustrates the distribution and examples of shapes across different complexity levels in Dora-Bench:
Figure 3. Our proposed benchmark includes 3D shapes from the ABO [14], GSO [18], Meta [3], and Objaverse [16] datasets. (a) The histogram of different datasets across different shape complexities. (b) The pie chart of the total counts by shape complexities. (c) Sample shapes of different shape complexities.
4.2.3.2. Sharp Normal Error (SNE)
To specifically assess the preservation of fine geometric details, Dora-Bench introduces Sharp Normal Error (SNE). Unlike general metrics, SNE focuses on the normal map differences between reconstructed and ground truth shapes within geometrically significant areas. The process is as follows:
- Normal Map Rendering: Normal maps are rendered for both the ground truth shape and the reconstructed shape from multiple viewpoints.
- Salient Region Identification: Canny edge detection is applied to these normal maps to identify sharp regions, which are then dilated to create evaluation masks. These masks delineate the areas where fine details are expected.
- Error Calculation: The Mean Squared Error (MSE) is computed between the ground truth and reconstructed normal maps, but only within the masked salient areas.

The following figure (Figure 4 from the original paper) visualizes the process of computing Sharp Normal Error (SNE):
Figure 4. The process of computing sharp normal errors (SNE). We compute MSE loss in the sharp regions of the normal.
This focused evaluation allows SNE to directly quantify how well VAEs preserve sharp geometric features, addressing a critical limitation of other common metrics.
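Under the assumption that the two normal maps have already been rendered and aligned, a minimal OpenCV sketch of this masked-MSE computation could look as follows. The 20/200 Canny thresholds match the settings reported in the implementation details; the dilation kernel size and the grayscale conversion before Canny are our own placeholders.

```python
# A sketch of the SNE computation: Canny edges on the ground-truth normal map,
# dilated into a mask, then the mean squared normal difference inside the mask.
import cv2
import numpy as np

def sharp_normal_error(normal_gt, normal_rec, low=20, high=200, dilate_px=5):
    # normal_gt, normal_rec: (H, W, 3) float arrays with components in [-1, 1].
    gt_u8 = ((normal_gt * 0.5 + 0.5) * 255).astype(np.uint8)
    edges = cv2.Canny(cv2.cvtColor(gt_u8, cv2.COLOR_RGB2GRAY), low, high)
    mask = cv2.dilate(edges, np.ones((dilate_px, dilate_px), np.uint8)) > 0
    if not mask.any():
        return 0.0
    diff = ((normal_gt - normal_rec) ** 2).sum(-1)  # per-pixel squared L2
    return float(diff[mask].mean())
```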
5. Experimental Setup
This section details the datasets used, the evaluation metrics employed, and the baseline models against which Dora-VAE was compared.
5.1. Datasets
The Dora-VAE model was trained on a filtered subset of approximately 400,000 3D meshes from the Objaverse [16] dataset. This large-scale dataset was chosen to ensure comprehensive training. Low-quality meshes with issues like missing faces or severe self-intersections were filtered out to enhance training stability.
For the Dora-bench benchmark, test shapes were curated from multiple public datasets to ensure diverse geometric complexities:
- ABO [14] (Amazon Berkeley Objects)
- GSO [18] (Google Scanned Objects)
- Meta [3] (Digital Twin Catalog from Meta)
- Objaverse [16] test set

The Dora-bench categorizes these models into four detail levels (Level 1 to Level 4), with approximately 800 samples per level. Due to the scarcity of highly detailed models in ABO, GSO, and Meta, Level 4 samples were predominantly sourced from the Objaverse test set.
An example of data distribution across complexity levels in Dora-bench is shown in Figure 3 from the Methodology section, depicting histograms and example meshes for each level.
5.2. Evaluation Metrics
The reconstruction quality of Dora-VAE and baselines was evaluated using 1 million (1M) sampled points from the input meshes, comparing them with their decoded counterparts. Three primary metrics were used:
5.2.1. F-score
- Conceptual Definition: F-score (also known as F-measure or F1-score) assesses the reconstruction accuracy by computing the precision and recall of point correspondences between the reconstructed shape and the ground truth within a specified distance threshold $r$. It essentially measures how well the reconstructed shape covers the ground truth and how many of its points are close to the ground truth. A higher F-score indicates better reconstruction quality.
- Mathematical Formula: The F-score is typically calculated as the harmonic mean of precision and recall. For point cloud reconstruction, precision and recall are defined based on distances between points. Given two point clouds $P_{rec}$ (reconstructed) and $P_{gt}$ (ground truth):
  - Precision: The proportion of points in $P_{rec}$ that are "close" to a point in $P_{gt}$. $ \mathrm{Precision}(P_{rec}, P_{gt}, r) = \frac{1}{|P_{rec}|} \sum_{p \in P_{rec}} \mathbb{I}(\min_{q \in P_{gt}} \|p - q\| \le r) $
  - Recall: The proportion of points in $P_{gt}$ that are "close" to a point in $P_{rec}$. $ \mathrm{Recall}(P_{rec}, P_{gt}, r) = \frac{1}{|P_{gt}|} \sum_{q \in P_{gt}} \mathbb{I}(\min_{p \in P_{rec}} \|p - q\| \le r) $
  - F-score: $ \mathrm{F\text{-}score}(P_{rec}, P_{gt}, r) = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $
- Symbol Explanation:
  - $P_{rec}$: The point cloud sampled from the reconstructed 3D shape.
  - $P_{gt}$: The point cloud sampled from the ground truth 3D shape.
  - $r$: A distance threshold. Points within this distance are considered a match.
  - $|P_{rec}|$, $|P_{gt}|$: The number of points in the reconstructed and ground truth point clouds, respectively.
  - $\mathbb{I}(\cdot)$: The indicator function, which is 1 if its argument is true, and 0 otherwise.
  - $\min_{q \in P_{gt}} \|p - q\|$: The minimum Euclidean distance from point $p$ in $P_{rec}$ to any point in $P_{gt}$.

The paper reports F-score at two thresholds: F-score (0.01) and F-score (0.005), with shapes normalized to a common coordinate range.
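A minimal sketch of this computation using SciPy KD-trees for the nearest-neighbor queries (point clouds assumed pre-sampled from both surfaces):

```python
# Point-cloud F-score at threshold r, matching the precision/recall above.
import numpy as np
from scipy.spatial import cKDTree

def f_score(p_rec, p_gt, r=0.01):
    # p_rec, p_gt: (N, 3) point arrays sampled from the two surfaces.
    d_rec = cKDTree(p_gt).query(p_rec)[0]  # dist from each rec point to GT
    d_gt = cKDTree(p_rec).query(p_gt)[0]   # dist from each GT point to rec
    precision = (d_rec <= r).mean()
    recall = (d_gt <= r).mean()
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```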
5.2.2. Chamfer Distance (CD)
- Conceptual Definition: Chamfer Distance (CD) is a widely used metric for measuring the similarity between two point clouds. It calculates the average squared Euclidean distance from each point in one point cloud to its nearest neighbor in the other point cloud, and vice versa. A lower CD indicates greater similarity between the shapes.
- Mathematical Formula: Given two point clouds $P_{rec}$ (reconstructed) and $P_{gt}$ (ground truth): $ \mathrm{CD}(P_{rec}, P_{gt}) = \frac{1}{|P_{rec}|} \sum_{p \in P_{rec}} \min_{q \in P_{gt}} \|p - q\|_2^2 + \frac{1}{|P_{gt}|} \sum_{q \in P_{gt}} \min_{p \in P_{rec}} \|q - p\|_2^2 $
- Symbol Explanation:
  - $P_{rec}$: The point cloud sampled from the reconstructed 3D shape.
  - $P_{gt}$: The point cloud sampled from the ground truth 3D shape.
  - $|P_{rec}|$, $|P_{gt}|$: The number of points in the reconstructed and ground truth point clouds, respectively.
  - $\min_{q \in P_{gt}} \|p - q\|_2^2$: The squared Euclidean distance from point $p$ in $P_{rec}$ to its nearest neighbor in $P_{gt}$.
  - $\min_{p \in P_{rec}} \|q - p\|_2^2$: The squared Euclidean distance from point $q$ in $P_{gt}$ to its nearest neighbor in $P_{rec}$.

The paper reports CD multiplied by 10000 for presentation (e.g., CD × 10000).
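The same KD-tree setup gives a compact Chamfer Distance sketch:

```python
# Squared Chamfer Distance between two (N, 3) point clouds.
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(p_rec, p_gt):
    d_rec = cKDTree(p_gt).query(p_rec)[0]  # nearest-GT distance per rec point
    d_gt = cKDTree(p_rec).query(p_gt)[0]   # nearest-rec distance per GT point
    return float((d_rec ** 2).mean() + (d_gt ** 2).mean())
```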
5.2.3. Sharp Normal Error (SNE)
- Conceptual Definition: As proposed in Section 3.3.2 of the paper, Sharp Normal Error (SNE) specifically evaluates reconstruction quality in salient regions by measuring normal map differences between the reconstructed and ground truth shapes. It focuses on how accurately sharp edges and fine geometric details are preserved, unlike global metrics that might average out local errors. A lower SNE indicates better preservation of sharp features.
- Mathematical Formula: The SNE is computed as the Mean Squared Error (MSE) between the normal maps of the ground truth ($N_{gt}$) and reconstructed ($N_{rec}$) shapes, but only within masked salient areas ($M$). $ \mathrm{SNE} = \frac{1}{|M|} \sum_{(x,y) \in M} \|N_{gt}(x,y) - N_{rec}(x,y)\|_2^2 $
- Symbol Explanation:
  - $N_{gt}(x,y)$: The normal vector at pixel (x,y) in the ground truth normal map.
  - $N_{rec}(x,y)$: The normal vector at pixel (x,y) in the reconstructed normal map.
  - $M$: The set of pixels within the evaluation mask, which defines the salient regions detected by Canny edge detection and dilation.
  - $|M|$: The number of pixels in the masked area.
  - $\| \cdot \|_2^2$: The squared Euclidean distance (squared L2 norm) between the normal vectors.

The paper reports SNE multiplied by 100 (e.g., SNE × 100).
5.2.4. Latent Code Length (LCL)
- Conceptual Definition: Latent Code Length (LCL) refers to the dimensionality or number of tokens in the compact latent representation generated by the VAE. While not a performance metric in itself, it is a crucial factor for efficiency in downstream tasks like diffusion model training. Shorter latent codes indicate higher compression and faster processing, but typically pose a challenge for maintaining reconstruction quality. For a fair comparison, LCL is reported as it often correlates with reconstruction capability.
5.3. Baselines
Dora-VAE was compared against several state-of-the-art approaches:
- XCube-VAE [45]: A volumetric method known for achieving high reconstruction quality but generating very large latent codes (>10,000 tokens). This represents the high-fidelity, low-compression end of the spectrum.
- XCube-VAE† [45]: A fine-tuned version of the original XCube-VAE. The authors fine-tuned XCube-VAE on the same dataset used for Dora-VAE to ensure a fair comparison, as the original XCube might have been trained on different data. This baseline provides a strong upper bound for reconstruction quality.
- Craftsman-VAE [31]: This model is a fine-tuned version of 3DShape2VecSet [63] (a vector set-based method) for shorter latent codes, trained on Objaverse. It represents the compact latent space, potentially lower-fidelity end of the spectrum, and is a direct competitor in the Vecset-based category.
- 3DShape2VecSet [63]: (Mentioned in supplementary materials) The original vector set-based VAE that Dora-VAE builds upon. It was originally trained on ShapeNet [6], a smaller dataset, so its performance is generally lower, but it is important for context.

VAE models from Direct3D [58] and CLAY [65] were excluded from the primary comparison due to their implementations not being publicly available at the time of submission. XCube-VAE was also excluded from downstream diffusion model comparisons due to its impractical >10,000-dimensional latent codes.
5.4. Implementation Details
- Mesh Preprocessing: All meshes were preprocessed using CLAY [65] methods to ensure watertight 3D models.
- Training Data: Approximately 400,000 3D meshes filtered from Objaverse [16], with low-quality meshes removed.
- Training Environment: Trained on 32 A100 GPUs for two days.
- Training Parameters: Batch size of 2048, learning rate of 5e-5.
- Efficiency Techniques: Flash-Attention-v2 [15], mixed-precision training with FP16, and gradient checkpointing [12] were used to optimize memory and training efficiency.
- Sharp Edge Sampling (SES) Parameters: A target number of salient points $N_{\mathrm{desired}}$ and a dihedral angle threshold $\tau$.
- Sharp Normal Error (SNE) Parameters: Low threshold of 20 and high threshold of 200 for Canny edge detection.
- VAE Architecture: Follows successful designs [31, 67] with 8 self-attention layers in the encoder and 16 in the decoder.
- Latent Code Length (LCL) Multi-resolution Training: During training, the latent code length was randomly selected between 256 and 1280. This strategy, adopted from CLAY [65], facilitates progressive training for subsequent diffusion stages.
- KL Divergence Weight: Set to 0.001.
- Spatial Query Points ($Q_{space}$): Constructed by combining points randomly sampled near the mesh surface and points uniformly sampled within the spatial range.
- Diffusion Model for Image-to-3D: A conditional diffusion model based on the DiT [7, 41] architecture, similar to Direct3D [58] and CLAY [65]. It conditions on image features extracted by DINOv2 [40] from single-view images rendered using BlenderProc [17]. This model has 0.39 billion parameters and was trained on 32 A100 GPUs for three days.
6. Results & Analysis
This section presents the experimental results, comparing Dora-VAE qualitatively and quantitatively against baselines, and analyzing its performance, including ablation studies and its application in single-image to 3D generation.
6.1. Core Results Analysis
6.1.1. Qualitative Comparison
The visual comparisons in Figure 5 demonstrate Dora-VAE's effectiveness in preserving geometric details across varying levels of shape complexity, especially for L3 and L4 shapes.
The following figure (Figure 5 from the original paper) shows visual comparisons of different methods across different complexity levels:
Figure 5. Visual comparisons of different methods on Dora-bench. Our method consistently shows better reconstruction quality, especially in preserving geometric details (e.g., sharp edges), compared to baselines. This is evident in the normal maps and overall shape fidelity, particularly for complex shapes (L3 and L4).
- Low Complexity (L1 and L2): For shapes with lower geometric complexity, most methods achieve comparable reconstruction quality. This indicates that simple shapes are generally easy for modern VAEs to encode and reconstruct.
- High Complexity (L3 and L4): The advantages of Dora-VAE become pronounced when dealing with shapes of higher complexity. Dora-VAE visibly maintains finer details, such as sharp edges and intricate surface variations, which are often missed by other methods.
- XCube-VAE vs. Dora-VAE: While XCube-VAE can achieve similar visual quality, it requires a significantly larger latent code length (LCL) (over 10,000 tokens) compared to Dora-VAE's 1,280 tokens. This implies Dora-VAE provides a far more efficient representation for similar visual fidelity.
- Craftsman-VAE: Craftsman-VAE shows a noticeable degradation in reconstruction quality for complex shapes, failing to capture fine geometric details. This aligns with the paper's motivation that uniform sampling limits detail preservation.

Additional visual comparisons provided in the supplementary material (Figures S8 and S9) further support these observations, particularly highlighting XCube's tendency for geometric deviation from the ground truth despite rich visual details, which is attributed to quantization errors during mesh extraction using NKSR [24].
The following figure (Figure S8 from the original paper) illustrates visual comparisons for Level 3 and 4 examples:
Figure S8. Qualitative comparison of the normal reconstruction results for Level 3 and Level 4 models (details from Table S3). Our method (Ours full) consistently shows superior performance in preserving fine geometric details, leading to more accurate normal maps compared to baselines such as Craftsman, 3DShape2VecSet, and Xcube†.
The following figure (Figure S9 from the original paper) provides more visual comparisons of Level 3 and 4 examples:
Figure S9. Further qualitative comparison of the normal reconstruction results for Level 3 and Level 4 models (details from Table S3). This figure reinforces the superior detail preservation of our Dora-VAE, especially in complex regions, compared to Craftsman, 3DShape2VecSet, and Xcube†, as evidenced by the fidelity of the reconstructed normal maps.
6.1.2. Quantitative Comparison
The following are the results from Table 1 of the original paper:
| Methods | LCL | ↑ F(0.01)×100, L1 | L2 | L3 | L4 | ↑ F(0.005)×100, L1 | L2 | L3 | L4 | ↓ CD×10000, L1 | L2 | L3 | L4 | ↓ SNE×100, L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xcube [45] | >10000 | 98.968 | 98.799 | 98.615 | 98.226 | 95.525 | 93.872 | 92.322 | 85.365 | 6.315 | 6.288 | 7.935 | 9.926 | 1.579 | 1.432 | 1.430 | 1.679 |
| Xcube† [45] | >10000 | 99.393 | 99.794 | 99.824 | 99.079 | 96.753 | 95.535 | 93.422 | 87.365 | 4.015 | 4.142 | 5.740 | 7.627 | 1.543 | 1.408 | 1.259 | 1.639 |
| Craftsman [31] | 256 | 98.016 | 95.874 | 91.756 | 81.739 | 87.994 | 82.549 | 73.000 | 57.379 | 4.389 | 9.129 | 14.530 | 33.441 | 1.906 | 1.873 | 2.191 | 3.933 |
| Ours w/o SES, DCA | 1280 | 99.964 | 99.925 | 99.678 | 97.890 | 96.561 | 95.975 | 91.618 | 83.124 | 2.236 | 2.506 | 4.444 | 6.432 | 1.448 | 1.215 | 1.205 | 1.828 |
| Ours w/o DCA | 1280 | 99.944 | 99.814 | 97.294 | 96.779 | 95.977 | 94.623 | 88.406 | 79.240 | 2.422 | 2.983 | 3.980 | 6.196 | 1.496 | 1.313 | 1.352 | 2.207 |
| Ours full | 256 | 99.507 | 98.986 | 96.669 | 89.577 | 93.272 | 90.466 | 82.386 | 68.669 | 3.356 | 5.202 | 10.276 | 24.527 | 1.555 | 1.410 | 1.618 | 3.035 |
| Ours full | 1280 | 99.988 | 99.955 | 99.880 | 99.170 | 97.038 | 96.831 | 93.458 | 87.473 | 2.097 | 2.500 | 3.945 | 5.265 | 1.433 | 1.186 | 1.137 | 1.579 |
The quantitative results in Table 1, evaluated on Dora-bench, demonstrate Dora-VAE's superior performance across all complexity levels, with a particularly significant advantage for L3 and L4 shapes.
- Overall Performance: Dora-VAE (full model with 1280 LCL) achieves the highest F-score and lowest CD and SNE across almost all complexity levels. For L4 shapes (very rich detail), its F-score (0.01) is 99.170, F-score (0.005) is 87.473, CD × 10000 is 5.265, and SNE × 100 is 1.579.
- Comparison with XCube-VAE†: Dora-VAE with 1280 LCL achieves comparable or better F-scores and SNE values than XCube-VAE† (which uses >10,000 LCL), while significantly outperforming it in CD. For instance, Dora-VAE achieves a CD × 10000 of 2.097 for L1 (vs. 4.015 for XCube-VAE†). This is remarkable given that Dora-VAE's latent space is at least 8× smaller. The paper attributes XCube-VAE's relatively higher CD to quantization errors introduced by NKSR [24] during mesh extraction, despite its dense representation.
- Comparison with Craftsman-VAE: Craftsman-VAE (256 LCL) shows significantly lower performance across all metrics, especially for L3 and L4 shapes, confirming its struggle with geometric detail preservation due to uniform sampling and shorter latent codes. For L4, its F-score (0.005) is 57.379 and SNE × 100 is 3.933, much worse than Dora-VAE's 87.473 and 1.579, respectively.
- SNE Metric Validation: The superior performance in SNE for Dora-VAE is particularly important, validating the effectiveness of the sharp edge sampling strategy. For L4 shapes, Dora-VAE's SNE × 100 is 1.579 compared to XCube-VAE†'s 1.639, which aligns with the qualitative observation of better preservation of fine details.

The supplementary material also includes Table S2, which shows 3DShape2VecSet's consistent underperformance due to limited training data.
The following are the results from Table S2 of the original paper (from supplementary):
| Methods | LCL | ↑ F(0.01)×100, L1 | L2 | L3 | L4 | ↑ F(0.005)×100, L1 | L2 | L3 | L4 | ↓ CD×10000, L1 | L2 | L3 | L4 | ↓ SNE×100, L1 | L2 | L3 | L4 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Xcube [45] | >10000 | 98.968 | 98.799 | 98.615 | 98.226 | 95.525 | 93.872 | 92.322 | 85.365 | 6.315 | 6.288 | 7.935 | 9.926 | 1.579 | 1.432 | 1.430 | 1.679 |
| Xcube† [45] | >10000 | 99.393 | 99.794 | 99.824 | 99.079 | 96.753 | 95.535 | 93.422 | 87.365 | 4.015 | 4.142 | 5.740 | 7.627 | 1.543 | 1.408 | 1.259 | 1.639 |
| VecSet [63] | 512 | 94.768 | 88.890 | 80.126 | 59.347 | 77.545 | 67.929 | 55.516 | 34.619 | 27.380 | 42.075 | 100.975 | 159.151 | 2.939 | 3.056 | 3.470 | 6.034 |
| Craftsman [31] | 256 | 98.016 | 95.874 | 91.756 | 81.739 | 87.994 | 82.549 | 73.000 | 57.379 | 4.389 | 9.129 | 14.530 | 33.441 | 1.906 | 1.873 | 2.191 | 3.933 |
| Ours w/o SES, DCA | 1280 | 99.964 | 99.925 | 99.678 | 97.890 | 96.561 | 95.975 | 91.618 | 83.124 | 2.236 | 2.506 | 4.444 | 6.432 | 1.448 | 1.215 | 1.205 | 1.828 |
| Ours w/o DCA | 1280 | 99.944 | 99.814 | 97.294 | 96.779 | 95.977 | 94.623 | 88.406 | 79.240 | 2.422 | 2.983 | 3.980 | 6.196 | 1.496 | 1.313 | 1.352 | 2.207 |
| Ours full | 256 | 99.507 | 98.986 | 96.669 | 89.577 | 93.272 | 90.466 | 82.386 | 68.669 | 3.356 | 5.202 | 10.276 | 24.527 | 1.555 | 1.410 | 1.618 | 3.035 |
| Ours full | 1280 | 99.988 | 99.955 | 99.880 | 99.170 | 97.038 | 96.831 | 93.458 | 87.473 | 2.097 | 2.500 | 3.945 | 5.265 | 1.433 | 1.186 | 1.137 | 1.579 |
6.1.3. Application: Single Image to 3D
Dora-VAE's effectiveness is further validated by its application to single-image 3D generation using latent diffusion models. A diffusion model based on the DiT [42] architecture was implemented, similar to CLAY [65]. For comparison, Craftsman-VAE was fine-tuned on the same dataset (Craftsman-VAE†). XCube-VAE was excluded from this comparison due to its high-dimensional latent codes (>10,000 tokens) being impractical for diffusion model training.
The following figure (Figure 7 from the original paper) displays generation results from diffusion models trained with Dora-VAE and Craftsman-VAE†:
Figure 7. The diffusion results of single-image-to-3D generation trained on our Dora-VAE and Craftsman-VAE†. The 3D geometry generated by the diffusion model trained on our proposed Dora-VAE has more details under the same experimental environment.
Both models used identical architectures (0.39B parameters) and training conditions (same dataset, 32 A100 GPUs, 3 days). The visual results in Figure 7 clearly show that the diffusion model trained with Dora-VAE generates 3D shapes with significantly better preservation of geometric details (e.g., sharper edges, more defined features) compared to Craftsman-VAE†. This directly validates that Dora-VAE's improved reconstruction capabilities translate into higher-quality outputs for downstream generative tasks.
Further comparisons in the supplementary material (Figures S10 and S11) against LRM-based methods (MeshFormer [36], CRM [55]) and a commercial solution (Tripo v2.0 [2]) show:
- Superiority over LRM-based methods: Dora-VAE-based generation achieves superior geometric detail and fidelity compared to MeshFormer and CRM, whose limitations are attributed to a lack of explicit geometric constraints.
- Comparability with commercial solutions: Dora-VAE achieves geometric quality comparable to Tripo v2.0, a leading commercial solution, despite using significantly more constrained resources (3 days of training on 32 A100 GPUs). This highlights Dora-VAE's efficiency and effectiveness.

The following figure (Figure S10 from the original paper) shows a qualitative comparison of the Image-to-3D results:
This image is a qualitative comparison of Image-to-3D results from different methods. It shows the input images alongside surface normal maps of 3D models generated by Ours, MeshFormer, CRM, and Tripo v2.0, illustrating the differences in detail and structural reconstruction across methods.
Figure S10. Qualitative comparison of the Image-to-3D results. This figure compares outputs from our Dora-VAE-based diffusion model against MeshFormer, CRM, and Tripo v2.0. Our method consistently produces 3D geometry with richer and more accurate details, demonstrating its superior capability in preserving fine features from a single input image.
The following figure (Figure S11 from the original paper) provides more qualitative comparison of the Image-to-3D results:
This image is a chart comparing normal-map results for several input images reconstructed by different 3D methods (including Ours, MeshFormer, CRM, and Tripo v2.0), illustrating the proposed method's advantage in detail and geometric structure.
Figure S11. Qualitative comparison of the Image-to-3D results. Further examples show the Dora-VAE-based diffusion model's ability to generate highly detailed and geometrically faithful 3D shapes from single images, outperforming LRM-based methods and achieving results competitive with commercial solutions.
6.2. Ablation Studies / Parameter Analysis
To quantify the contribution of each proposed component, Dora-VAE was subjected to ablation studies, comparing the full model with two variants under identical training conditions:
- Ours w/o SES, DCA: This variant removes both the sharp edge sampling (SES) strategy and the dual cross-attention (DCA) mechanism. Essentially, it reverts to using only uniformly sampled point clouds obtained with Poisson disk sampling, similar to existing Vecset-based VAEs, while maintaining the same total number of dense points. (A code sketch of SES appears after this list.)
- Ours w/o DCA: This variant retains sharp edge sampling (SES) but removes the dual cross-attention (DCA). Instead, it uses a single cross-attention mechanism, similar to the one adopted by 3DShape2VecSet [63], processing all sampled points (uniform + salient) together.
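As a minimal sketch of the SES idea (edges whose dihedral angle exceeds a threshold are treated as sharp, and extra points are drawn from them), the snippet below uses trimesh. The angle threshold and point budgets are illustrative rather than the paper's values, and plain uniform surface sampling stands in for the Poisson disk sampling used on the uniform branch.

```python
import numpy as np
import trimesh

def sharp_edge_sample(mesh, angle_thresh=np.radians(15),
                      n_uniform=16384, n_salient=16384):
    """Sketch of sharp edge sampling; thresholds/budgets are illustrative."""
    # Uniform surface samples (the baseline every vecset-style VAE uses).
    uniform_pts, _ = trimesh.sample.sample_surface(mesh, n_uniform)

    # Edges whose dihedral angle exceeds the threshold count as sharp.
    sharp = mesh.face_adjacency_angles > angle_thresh
    edges = mesh.face_adjacency_edges[sharp]          # (E, 2) vertex indices
    if len(edges) == 0:
        return uniform_pts, uniform_pts[:0]

    # Draw salient points by interpolating along the sharp edges.
    v0, v1 = mesh.vertices[edges[:, 0]], mesh.vertices[edges[:, 1]]
    lengths = np.linalg.norm(v1 - v0, axis=1)
    idx = np.random.choice(len(edges), n_salient, p=lengths / lengths.sum())
    t = np.random.rand(n_salient, 1)
    salient_pts = v0[idx] * (1 - t) + v1[idx] * t
    return uniform_pts, salient_pts
```

Weighting the draw by edge length keeps the salient samples roughly uniform along the sharp edges instead of clustering on short edges.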
The following figure (Figure 6 from the original paper) illustrates the ablation study results:

This image is a schematic from the paper, showing a visual comparison of normal-map reconstructions of various 3D models by different methods (including Ours, Craftsman, 3DShape2VecSet, and Xcube) at different complexity levels (Level 3 and Level 4).
Figure 6. Ablation studies of our method. Given the ground-truth mesh, we reconstruct it with our full model and its ablated variants, highlighting significant reconstruction discrepancies with red boxes. This visually demonstrates the contribution of SES and DCA to detail preservation.
As shown in Figure 6 and quantitatively in Table 1 (rows for "Ours w/o SES, DCA" and "Ours w/o DCA"), the full Dora-VAE model consistently outperforms both ablated variants.
- Impact of SES and DCA (Ours w/o SES, DCA vs. Ours full): The "Ours w/o SES, DCA" variant (1280 LCL) performs noticeably worse than the "Ours full" model (1280 LCL). For instance, its CD x 10000 for L4 is 6.432 compared to 5.265 for "Ours full", and its SNE x 100 for L4 is 1.828 compared to 1.579. This significant drop in performance, especially in CD and SNE on complex shapes, highlights the crucial role of both sharp edge sampling and dual cross-attention in preserving fine geometric details. Visually, the red boxes in Figure 6 show clear degradation without these components.
- Impact of DCA (Ours w/o DCA vs. Ours full): The "Ours w/o DCA" variant (1280 LCL), while benefiting from SES, still performs worse than the "Ours full" model. Its CD x 10000 for L4 is 6.196, and its SNE x 100 for L4 is 2.207. This indicates that merely sampling salient points is not enough; the dual cross-attention mechanism is essential for effectively leveraging these detail-rich points during encoding, allowing the model to focus on them distinctly (a minimal sketch of such a block follows below).

These ablation studies clearly validate the effectiveness and necessity of both the Sharp Edge Sampling (SES) strategy and the Dual Cross-Attention (DCA) architecture for Dora-VAE's high-fidelity reconstruction.
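For illustration, here is a minimal sketch of a dual cross-attention block in PyTorch: a shared set of learnable latent queries attends separately to the uniform and salient point features before fusion. The addition-based fusion, dimensions, and layer layout are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    """Sketch of a dual cross-attention encoder block (fusion by addition
    is an assumption; the paper's exact layout may differ)."""
    def __init__(self, dim=512, heads=8, n_latents=1280):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, dim))  # learnable queries
        self.attn_uniform = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_salient = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, feat_uniform, feat_salient):
        # feat_uniform: (B, N_u, dim) embeddings of uniformly sampled points
        # feat_salient: (B, N_s, dim) embeddings of sharp-edge samples
        q = self.latents.unsqueeze(0).expand(feat_uniform.shape[0], -1, -1)
        out_u, _ = self.attn_uniform(q, feat_uniform, feat_uniform)
        out_s, _ = self.attn_salient(q, feat_salient, feat_salient)
        # Each latent token aggregates global structure and sharp details
        # through separate attention pathways before fusion.
        return self.norm(out_u + out_s)
```

Keeping the two pathways separate prevents the much larger uniform set from drowning out the salient features in a single attention pass, which is consistent with the degradation seen in the "Ours w/o DCA" ablation.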
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Dora-VAE, a novel Variational Autoencoder designed to overcome the limitations of existing 3D shape VAEs in preserving fine geometric details while maintaining compact latent representations. At its core, Dora-VAE innovates through a sharp edge sampling (SES) strategy, which prioritizes points from geometrically salient regions, and a dual cross-attention architecture specifically engineered to encode these detail-rich point clouds effectively. To provide a rigorous evaluation framework, the authors developed Dora-bench, a benchmark that categorizes 3D shapes by geometric complexity and features a new metric, Sharp Normal Error (SNE), for assessing how accurately fine details are reconstructed. Extensive experiments on Dora-bench demonstrate that Dora-VAE significantly outperforms existing methods across complexity levels. Furthermore, its superior reconstruction directly translates to higher quality in downstream tasks, as evidenced by single-image 3D generation with diffusion models, where it produces more geometrically detailed shapes with a latent space at least 8× smaller than state-of-the-art dense models.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Current Limitations: The primary limitation is the challenge of maintaining high-quality reconstructions when the number of latent tokens is further reduced. While Dora-VAE achieves state-of-the-art quality with 1,280 tokens, compressing significantly beyond this point remains difficult, especially compared to advances in 2D image compression (e.g., Deep Compression Autoencoder (DC-AE) [8]).
- Future Directions:
- Enhanced Compression Efficiency: Future work aims to explore novel techniques to increase the compression rate of 3D VAEs while preserving reconstruction quality, potentially bridging the efficiency gap between 2D and 3D compression methods.
  - Advanced Diffusion Models: Leveraging Dora-VAE's superior reconstruction capabilities, the authors plan to develop more powerful image-to-3D diffusion models. They believe that improved VAE reconstruction can directly elevate the performance ceiling of diffusion models, leading to higher-quality generation results under similar training conditions.
7.3. Personal Insights & Critique
Dora-VAE presents a compelling solution to a critical problem in 3D generative AI: the trade-off between latent space compactness and geometric detail fidelity. The paper's strength lies not just in proposing a new model but also in establishing a more rigorous benchmarking framework (Dora-bench) and a specialized evaluation metric (SNE). This holistic approach is commendable, as a good benchmark is often as important as the model itself for advancing a field.
Inspirations:
- Task-Specific Importance Sampling: The idea of sharp edge sampling is a powerful demonstration of how domain-specific knowledge (identifying salient edges via dihedral angles) can be integrated into general deep learning pipelines for significant performance gains. This principle could be applied to other data modalities where specific features are disproportionately important.
- Dual Processing Pathways: The dual cross-attention mechanism, which separates and then combines features from uniform and salient regions, is an elegant way to handle heterogeneous information within a single input. This pattern could benefit other scenarios where input data has both general and "highlighted" components.
- Holistic Evaluation: The emphasis on complexity-based evaluation and the SNE metric highlights the need for tailored evaluation metrics when generic ones fall short. Many fields might benefit from metrics that focus on "important" features rather than overall similarity alone.
Potential Issues/Critique:
- Generalizability of SES Parameters: The paper uses fixed hyperparameter values for SES (such as the dihedral-angle threshold). While these work well, the optimal values might vary significantly across different datasets or types of 3D models. A more adaptive or data-driven approach to setting these parameters could further enhance robustness.
- Complexity of Mesh Preprocessing: The method relies on watertight 3D models and mesh processing (e.g., Canny edge detection, dilation). For real-world, noisy 3D scan data or imperfect models, this preprocessing might introduce its own challenges, potentially degrading the quality of SES.
- Beyond Sharp Edges: While sharp edges are crucial, some fine details are not characterized by sharp angles (e.g., intricate textures, subtle curvatures), and the current SES might not capture them. Future work could explore more generalized saliency detection mechanisms.
- Latent Code Structure: While the latent code is compact, its interpretability or disentanglement could be explored. Can specific parts of the latent code be directly linked to sharp features or global structure? This could enable more controllable 3D generation.
- Long-Term Compression: The stated difficulty of further reducing latent tokens is a significant challenge. Addressing it might require fundamentally different latent space architectures or VAE objectives beyond the standard KL divergence.

Overall, Dora-VAE represents a solid advancement, offering a practical and effective solution for high-fidelity, compact 3D shape representation, directly impacting the quality of AI-powered 3D content creation.