Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

Published:12/24/2024
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Dora-VAE enhances 3D shape VAEs using sharp edge sampling and dual cross-attention, improving fine detail reconstruction. Dora-bench benchmark and Sharp Normal Error metric enable precise evaluation of shape complexity and reconstruction quality.

Abstract

Recent 3D content generation pipelines commonly employ Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion-based generation. However, the widely adopted uniform point sampling strategy in Shape VAE training often leads to a significant loss of geometric details, limiting the quality of shape reconstruction and downstream generation tasks. We present Dora-VAE, a novel approach that enhances VAE reconstruction through our proposed sharp edge sampling strategy and a dual cross-attention mechanism. By identifying and prioritizing regions with high geometric complexity during training, our method significantly improves the preservation of fine-grained shape features. Such sampling strategy and the dual attention mechanism enable the VAE to focus on crucial geometric details that are typically missed by uniform sampling approaches. To systematically evaluate VAE reconstruction quality, we additionally propose Dora-bench, a benchmark that quantifies shape complexity through the density of sharp edges, introducing a new metric focused on reconstruction accuracy at these salient geometric features. Extensive experiments on the Dora-bench demonstrate that Dora-VAE achieves comparable reconstruction quality to the state-of-the-art dense XCube-VAE while requiring a latent space at least 8× smaller (1,280 vs. > 10,000 codes).

1. Bibliographic Information

1.1. Title

Dora: Sampling and Benchmarking for 3D Shape Variational Auto-Encoders

1.2. Authors

The paper lists Rui Chen, Jianfeng Zhang, Yixun Liang, Guan Luo, Weiyu Li, Jiarui Liu, Xiu Li, Xiaoxiao Long, Jiashi Feng, and Ping Tan as authors.

  • Rui Chen, Yixun Liang, Weiyu Li, Jiarui Liu, Xiaoxiao Long, Ping Tan: Affiliated with The Hong Kong University of Science and Technology and LightIllusions. Ping Tan and Jianfeng Zhang are indicated as corresponding authors. Their research likely focuses on computer graphics, 3D vision, and AI for creative applications.
  • Jianfeng Zhang, Guan Luo, Xiu Li, Jiashi Feng: Affiliated with ByteDance Seed and Tsinghua University (Guan Luo). Their research background typically includes deep learning, computer vision, and large-scale model development, often with an industry application focus.

1.3. Journal/Conference

The paper was published on arXiv, a preprint server, with a publication timestamp of 2024-12-23T18:59:06.000Z. As a preprint, it has not yet undergone formal peer review for a specific journal or conference. However, arXiv is a widely respected platform for disseminating cutting-edge research in fields like AI and computer graphics, allowing rapid sharing of findings. The authors' affiliations with prominent academic institutions and industry labs (HKUST, ByteDance, Tsinghua) suggest the work is of high academic caliber.

1.4. Publication Year

2024

1.5. Abstract

The paper addresses a critical limitation in 3D content generation pipelines that utilize Variational Autoencoders (VAEs) to encode shapes into compact latent representations for diffusion models. The prevalent uniform point sampling strategy in VAE training often results in a loss of fine geometric details, compromising reconstruction quality and downstream generation tasks. To counter this, the authors introduce Dora-VAE, a novel approach featuring a sharp edge sampling (SES) strategy that prioritizes geometrically complex regions and a dual cross-attention mechanism for enhanced encoding. This method significantly improves the preservation of fine-grained shape features. Additionally, the paper proposes Dora-bench, a benchmark that quantifies shape complexity based on sharp edge density and introduces a new metric, Sharp Normal Error (SNE), to specifically assess reconstruction accuracy in these salient geometric areas. Extensive experiments on Dora-bench demonstrate that Dora-VAE achieves reconstruction quality comparable to state-of-the-art dense models like XCube-VAE while requiring a significantly smaller latent space (at least 8× smaller), making it more efficient for downstream tasks.

Official Source: https://arxiv.org/abs/2412.17808
PDF Link: https://arxiv.org/pdf/2412.17808v3.pdf
Publication Status: Preprint on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the significant loss of geometric details in 3D shape reconstruction when using Variational Autoencoders (VAEs) for 3D content generation. This issue primarily arises from the widely adopted uniform point sampling strategy during VAE training.

This problem is critical because 3D content generation, especially with the rise of AI-powered methods, is vital for various industries like games, movies, and AR/VR. The quality of generated 3D shapes heavily depends on the VAE's ability to faithfully encode and reconstruct these shapes into compact latent representations, which then serve as input for diffusion models. If the VAE loses details during this encoding/reconstruction process, the subsequent diffusion model will generate lower-quality, less intricate 3D assets.

Specific challenges or gaps in prior research include:

  1. Volume-based methods (e.g., XCube): While capable of high-fidelity reconstruction by processing millions of voxelized points, they produce very large latent codes (often >10,000 tokens). This large latent space significantly complicates and slows down the training of subsequent diffusion models, making them impractical for many generative pipelines.

  2. Vector-set (Vecset) methods (e.g., 3DShape2VecSet, Craftsman): These methods use transformers to achieve highly compact latent representations (hundreds to thousands of tokens), which is ideal for efficient diffusion model training. However, due to computational constraints, they typically sample only a few thousand points. When combined with uniform sampling, this leads to substantial information loss, particularly in geometrically complex regions, resulting in poor detail preservation.

    The paper's entry point and innovative idea is to improve the reconstruction quality of Vecset-based VAEs without sacrificing their compact latent representations. They achieve this by introducing an intelligent sampling strategy that prioritizes geometrically salient features, coupled with an encoding mechanism designed to leverage these rich details.

2.2. Main Contributions / Findings

The paper makes several primary contributions to the field of 3D content generation and VAEs:

  1. Introduction of Dora-VAE for High-Quality 3D Reconstruction with Compact Latent Representations: The authors propose Dora-VAE, a novel 3D VAE model that significantly enhances reconstruction quality, particularly in preserving fine geometric details, while maintaining a compact latent space. This is achieved through:

    • Sharp Edge Sampling (SES): A new importance sampling algorithm designed for 3D VAE training that adaptively samples points from geometrically salient regions of a mesh, ensuring critical details are captured.
    • Dual Cross-Attention Architecture: A novel encoding mechanism that effectively processes both the uniformly sampled points and the detail-rich points sampled by SES, allowing the VAE to focus on crucial geometric features.
  2. Development of Dora-Bench for Rigorous 3D VAE Evaluation: To systematically and more rigorously evaluate 3D VAE reconstruction quality, the paper introduces Dora-bench. This benchmark:

    • Categorizes Shapes by Geometric Complexity: Divides test shapes into four levels of detail based on the density of sharp edges ($N_\Gamma$), providing a more nuanced assessment than random selection.
    • Introduces Sharp Normal Error (SNE): A novel metric specifically designed to quantify reconstruction accuracy at salient geometric features, addressing the limitations of general metrics like Chamfer Distance and F-Score which often overlook fine details.
  3. Demonstrated Superior Performance and Efficiency: Extensive experiments on Dora-bench show that Dora-VAE achieves comparable or superior reconstruction quality to state-of-the-art dense volume-based methods (like XCube-VAE) while requiring a latent space that is at least 8× smaller (1,280 vs. >10,000 codes). This efficiency makes Dora-VAE particularly well-suited for training downstream latent diffusion models, significantly enhancing the quality of generated 3D shapes in applications like single-image 3D generation.

    These findings directly solve the problem of balancing compact latent representations with high-fidelity geometric detail preservation, a long-standing challenge in 3D VAE design for generative pipelines.

3. Prerequisite Knowledge & Related Work

This section outlines the foundational concepts necessary to understand the paper, summarizes relevant prior research, discusses the technological evolution in the field, and highlights how Dora-VAE differentiates itself.

3.1. Foundational Concepts

3.1.1. Variational Autoencoders (VAEs)

A Variational Autoencoder (VAE) is a type of generative model that learns a compressed, continuous representation (called a latent space) of input data. It consists of two main parts:

  • Encoder: Maps input data (e.g., a 3D shape) to a distribution over the latent space. Instead of directly outputting a latent vector, it outputs parameters (mean and variance) of a probability distribution (typically a Gaussian) from which a latent vector $z$ is sampled. This makes the latent space continuous and allows for smooth interpolations.
  • Decoder: Takes a sampled latent vector $z$ from the latent space and reconstructs the original input data.

VAEs are trained to minimize two loss components: a reconstruction loss (how well the output matches the input) and a Kullback-Leibler (KL) divergence loss, which regularizes the latent space by forcing the learned distribution to be close to a prior distribution (e.g., a standard Gaussian). This regularization ensures the latent space is well-structured and facilitates new data generation by sampling from this prior.
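
To make the sampling step and the KL term concrete, here is a minimal sketch in plain Python. It assumes a diagonal Gaussian posterior parameterized by a mean and a log-variance per latent dimension and a standard Gaussian prior; the function names and the example values are illustrative and are not taken from the paper.

```python
import math
import random


def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, 1), one value per dimension."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over dimensions."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


rng = random.Random(0)
mu = [0.5, -1.0, 0.0]        # hypothetical encoder output (mean)
log_var = [0.0, -2.0, 0.3]   # hypothetical encoder output (log-variance)

z = reparameterize(mu, log_var, rng)   # latent vector passed to the decoder
kl = kl_to_standard_normal(mu, log_var)
print(len(z), kl > 0.0)
```

The KL term is zero only when the posterior already equals the prior (zero mean, unit variance), which is why it acts as a regularizer on the latent space.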

3.1.2. 3D Shape Representation

3D shapes can be represented in various ways, each with its own advantages and disadvantages:

  • Meshes: A mesh is a collection of vertices, edges, and faces that defines the shape of a 3D object. Triangular meshes are common, where faces are triangles. They are good for representing surfaces with fine details.
  • Point Clouds: A set of discrete points in 3D space, sampled from the surface of an object. Point clouds are flexible and can represent complex geometry, but lack explicit connectivity information (like faces or edges).
  • Voxels: A 3D grid where each cell (voxel) indicates whether it's occupied by the object or not. Similar to pixels in 2D, but in 3D. They are good for representing solid objects but can be computationally expensive and memory-intensive for high resolutions due to their cubic growth.
  • Occupancy Fields: A continuous function that predicts whether any given 3D coordinate is inside or outside an object. The surface of the object is implicitly defined as a level set of this function (for example, the 0.5 decision boundary when the function outputs an occupancy probability). This representation is powerful for generative tasks as it allows for continuous shape manipulation.

3.1.3. Diffusion Models

Diffusion Models are a class of generative models that have achieved state-of-the-art results in image and now 3D content generation. They work by iteratively denoising a noisy input (e.g., a latent vector) until it resembles a data sample from the training distribution.

  • Forward Diffusion Process: Gradually adds Gaussian noise to data samples until they become pure noise.
  • Reverse Diffusion Process: A neural network learns to reverse this process, starting from pure noise and iteratively predicting and removing noise to generate a clean data sample. In the context of 3D generation, latent diffusion models often operate on the compact latent representations learned by VAEs, making the diffusion process more efficient.
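
As an illustration of the forward process, the sketch below assumes a DDPM-style formulation with a linear variance schedule, in which a noisy sample at step $t$ can be drawn in closed form as $x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon$. The schedule endpoints and the number of steps are assumed values chosen for the example; the analysis above does not state which diffusion formulation the downstream model uses.

```python
import math
import random


def alpha_bar(t, betas):
    """Cumulative product of (1 - beta_s) for s = 0..t."""
    prod = 1.0
    for beta in betas[: t + 1]:
        prod *= 1.0 - beta
    return prod


def forward_noise(x0, t, betas, rng):
    """Sample x_t given x_0 in one shot (assumed DDPM-style closed form)."""
    a = alpha_bar(t, betas)
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0)
            for x in x0]


T = 1000  # assumed number of diffusion steps
betas = [1e-4 + (0.02 - 1e-4) * i / (T - 1) for i in range(T)]  # assumed linear schedule

rng = random.Random(0)
latent = [0.3, -1.2, 0.8]                    # stand-in for a VAE latent code
slightly_noisy = forward_noise(latent, 10, betas, rng)
almost_pure_noise = forward_noise(latent, T - 1, betas, rng)
```

Because $\bar{\alpha}_t$ shrinks toward zero as $t$ grows, the signal term vanishes and the sample approaches pure Gaussian noise.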

3.1.4. Attention Mechanism / Transformers

The Attention mechanism is a key component in modern deep learning, especially for processing sequential or set-based data. It allows a model to weigh the importance of different parts of the input data when processing a specific element.

  • Self-Attention: Enables a model to learn relationships between different elements within a single input sequence/set. For example, in a sentence, it helps understand how each word relates to other words. For a point cloud, it helps understand relationships between points.
  • Cross-Attention: Used in encoder-decoder architectures where the model needs to attend to information from a different sequence/set. For example, a decoder might use cross-attention to focus on relevant parts of the encoder's output when generating its own output.

The Transformer architecture is built around attention mechanisms (multi-head self-attention and cross-attention layers, interleaved with position-wise feed-forward layers) rather than recurrent or convolutional layers. Transformers are particularly effective for vector-set based 3D VAEs because point clouds can be viewed as unordered sets of vectors.

The core Attention mechanism, as introduced in "Attention Is All You Need" (Vaswani et al., 2017), is calculated as: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ Where:

  • $Q$ (Query), $K$ (Key), and $V$ (Value) are matrices derived from the input embeddings.
  • $Q$ represents the query for information.
  • $K$ represents the keys to match the query against.
  • $V$ represents the values containing the information to be retrieved.
  • $QK^T$ calculates the dot-product similarity between queries and keys.
  • $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent the dot products from growing too large and pushing the softmax function into regions with tiny gradients.
  • $\mathrm{softmax}$ normalizes the scores, turning them into probability distributions.
  • The result is a weighted sum of the values, where weights are determined by the attention scores.
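
The formula can be traced step by step with a small sketch in plain Python. It implements single-head scaled dot-product attention on lists of vectors and omits the learned projection matrices that produce $Q$, $K$, and $V$ in a real Transformer; the helper names are illustrative.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        # Scaled dot-product similarity of this query against every key.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


Q = [[1.0, 0.0]]
K = [[1.0, 1.0], [1.0, 1.0]]   # identical keys give equal attention weights
V = [[0.0, 2.0], [2.0, 4.0]]
print(attention(Q, K, V))      # the average of the two value vectors
```

With identical keys the weights are uniform, so the output is simply the mean of the values; distinct keys shift the weights toward the keys most similar to the query.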

3.1.5. Sampling Strategies

Sampling refers to selecting a subset of points from a larger set or continuous surface.

  • Uniform Sampling: Selects points randomly across the surface with equal probability. While simple, it often fails to capture fine details in regions of high curvature if the total number of samples is limited.
  • Poisson Disk Sampling: A sampling method that generates points such that no two points are closer than a specified minimum distance, resulting in a more even distribution than pure random sampling.
  • Farthest Point Sampling (FPS): An algorithm for selecting a subset of points from a point cloud. It starts with an arbitrary point and then iteratively selects the point farthest from the already selected points. This ensures good coverage and is often used for downsampling point clouds while preserving geometric structure.
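
Farthest Point Sampling is short enough to sketch directly. The version below is a straightforward O(N·k) implementation in plain Python that returns indices into the input list; the choice of starting index is arbitrary, as in the description above.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen greedily to be mutually far apart."""
    if k <= 0 or not points:
        return []
    k = min(k, len(points))
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=lambda i: min_dist[i])
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (10.0, 0.0, 0.0)]
print(farthest_point_sampling(points, 3))  # -> [0, 3, 2]
```

Starting from index 0, the farthest point is index 3; after that, index 2 is the point whose nearest selected neighbor is farthest away, so it is picked before index 1.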

3.1.6. Dihedral Angle

In geometry, a dihedral angle is the angle between two intersecting planes or faces of a polyhedron (like a mesh). In the context of 3D meshes, this paper measures it as the angle between the normal vectors of two adjacent faces. Under this convention, an angle near 0 means the two faces are almost coplanar (a smooth region), while a large angle between the normals indicates a sharp edge or high-curvature region, which is crucial for defining geometric details.

3.2. Previous Works

3.2.1. Importance Sampling in Point Clouds

Importance sampling techniques are well-established in point cloud processing. For instance, APES [57] (Attention-Based Point Cloud Edge Sampling) uses attention to sample points for classification and segmentation. However, these methods typically operate on pre-existing point clouds. Dora-VAE's contribution is applying importance sampling directly from mesh surfaces for VAE training, which is different because the goal is to preserve complete geometric information from the source mesh during the initial sampling stage.

3.2.2. 3D Shape VAEs

Previous 3D VAEs fall into two main categories:

  • Volume-based methods (e.g., XCube [45]): These approaches leverage sparse convolution to encode densely voxelized surfaces. They achieve very high reconstruction quality, excelling at preserving geometric details due to their dense representation. However, their primary drawback is the generation of very large latent codes (often >10,000 tokens), which are computationally prohibitive for training latent diffusion models. XCube is a prime example of this trade-off.
  • Vector Set-based (Vecset) approaches (e.g., 3DShape2VecSet [63], Craftsman [31], CLAY [65]): These methods encode uniformly sampled surface points using Transformers, resulting in highly compact latent spaces (hundreds to thousands of tokens). This compactness is highly desirable for efficient diffusion model training. However, their reliance on uniform sampling, especially when constrained to a limited number of points due to Transformer's quadratic complexity, often leads to information loss and poor preservation of intricate geometric details. 3DShape2VecSet is the foundational model that Dora-VAE builds upon.

3.2.3. 3D Content Creation

Current 3D generation methods can be broadly categorized:

  • Optimization-based methods (e.g., DreamFusion [43]): These methods use Score Distillation Sampling (SDS) to optimize 3D representations (like NeRFs or Gaussian Splatting) using 2D diffusion model priors. They can achieve photorealistic results but are often slow, suffer from training instability, and may struggle with geometric consistency.
  • Large Reconstruction Models (LRM-based, e.g., LRM [22]): These models perform large-scale sparse-view reconstruction to generate 3D content efficiently. While fast, they typically lack explicit geometric priors, which can lead to compromised geometric fidelity and inconsistent surface details.
  • 3D Native Generative Models (e.g., 3DShape2VecSet [63]): This group, to which Dora-VAE belongs, adopts a two-stage pipeline: first, a 3D VAE encodes shapes into a compact latent space, and then a conditional latent diffusion model is trained on this latent space for generation. This approach generally ensures better geometric consistency due to the VAE's inherent geometric constraints. The quality of generated shapes is fundamentally limited by the VAE's reconstruction capability. Recent works [31, 62] have highlighted that improving VAE reconstruction directly translates to enhanced downstream generation quality, motivating the focus of Dora-VAE.

3.3. Technological Evolution

The field of 3D content generation has rapidly evolved from demanding manual expertise to leveraging AI for automation and accessibility. Early efforts focused on direct 3D model manipulation or procedural generation. The advent of deep learning brought about voxel-based and point cloud-based generative models. The success of VAEs and Generative Adversarial Networks (GANs) in 2D inspired their application to 3D, leading to methods like 3DShape2VecSet that encode 3D shapes into latent spaces. Concurrently, diffusion models revolutionized 2D image generation, quickly extending to 3D. The current paradigm often involves a two-stage process: encoding complex 3D geometry into a compact, manipulable latent representation using a VAE, followed by a diffusion model generating within this latent space. The core challenge in this evolution has been balancing the fidelity of 3D reconstruction (especially fine details) with the compactness of the latent representation (essential for efficient diffusion). Dora-VAE fits into this timeline by directly addressing the detail-loss problem inherent in compact Vecset-based VAEs, pushing the frontier towards higher fidelity in efficient 3D native generation.

3.4. Differentiation Analysis

Compared to the main methods in related work, Dora-VAE introduces several core differences and innovations:

  1. Addressing Uniform Sampling Limitation: The most significant differentiation is Dora-VAE's direct attack on the uniform sampling strategy, which is a common bottleneck for Vecset-based VAEs in preserving details. Unlike 3DShape2VecSet or Craftsman which rely on uniform sampling, Dora-VAE integrates a novel Sharp Edge Sampling (SES) strategy. This is a fundamental shift from generic sampling to importance sampling for 3D VAE learning, a concept previously unexplored in this specific context (i.e., sampling from meshes for VAE training, rather than processing existing point clouds).

  2. Specialized Encoding for Salient Features: To fully leverage the detail-rich point clouds generated by SES, Dora-VAE introduces a dual cross-attention architecture. This is an innovation over simpler Transformer-based VAEs that might process all sampled points uniformly. The dual attention explicitly separates and processes features from salient and uniform regions, allowing the model to better focus on critical details.

  3. Comprehensive Evaluation Methodology: Dora-VAE goes beyond just proposing a new model by also introducing Dora-bench. This benchmark is novel because:

    • It systematically categorizes shapes by geometric complexity rather than using randomly selected test sets, enabling a more nuanced evaluation of detail preservation.
    • It proposes a new metric, Sharp Normal Error (SNE), specifically tailored to assess reconstruction quality in salient geometric features, which existing metrics like Chamfer Distance or F-Score often fail to adequately capture.
  4. Balancing Compactness and Fidelity: While XCube-VAE achieves high fidelity, it does so at the cost of extremely large latent codes, making it impractical for training latent diffusion models. Vecset-based methods achieve compactness but sacrifice detail. Dora-VAE offers reconstruction quality comparable to XCube-VAE while maintaining a highly compact latent space (at least 8× smaller), effectively bridging this gap and providing a more practical solution for 3D native generative models.

    In essence, Dora-VAE differentiates itself by innovating at both the input data preparation stage (sampling) and the model architecture stage (encoding), specifically targeting the preservation of fine geometric details within a compact latent representation, complemented by a rigorous, detail-focused evaluation benchmark.

4. Methodology

The methodology section describes Dora-VAE, a novel approach for high-quality 3D reconstruction, and Dora-Bench, a new benchmark for evaluating 3D VAEs. The core idea revolves around intelligent sampling and encoding strategies to preserve fine geometric details within compact latent representations.

4.1. Principles

The core principle behind Dora-VAE is that for Variational Autoencoders (VAEs) to effectively encode 3D shapes into compact latent representations without losing fine geometric details, the input sampling process must prioritize regions of high geometric complexity. Traditional uniform sampling fails to do this, especially when computational constraints limit the total number of sampled points. By focusing on sharp edges as indicators of salient geometric features and designing a dual cross-attention mechanism to specifically process these feature-rich regions, Dora-VAE aims to achieve high-fidelity reconstruction. Complementing this, Dora-Bench is built on the principle that a robust evaluation of 3D VAEs requires assessing their performance across varying levels of shape complexity and specifically measuring their ability to reconstruct salient geometric features through a dedicated metric.

4.2. Core Methodology In-depth

The Dora-VAE pipeline enhances 3DShape2VecSet, a transformer-based 3D VAE. Figure 2 provides an overview of the proposed Dora-VAE pipeline.

4.2.1. Preliminary: 3DShape2VecSet

Dora-VAE builds upon the 3DShape2VecSet [63] architecture, which encodes uniformly sampled surface points into compact latent codes. The 3DShape2VecSet pipeline consists of three key steps:

4.2.1.1. Surface Sampling

Given a 3D surface $S$, 3DShape2VecSet first uniformly samples $N_d$ points from $S$ using Poisson disk sampling [61] to create a dense point cloud $P_d$. Then, Farthest Point Sampling (FPS) [39] is applied to $P_d$ to downsample it to $N_s$ points, forming a sparse point cloud $P_s$.

The mathematical representation for this step is: $ P_d = \{ p_d^i \in S \mid i = 1, \ldots, N_d \}, \quad P_s = \mathrm{FPS}(P_d, N_s). $ Where:

  • $S$: The input 3D surface (mesh).
  • $N_d$: The target number of dense points to be sampled.
  • $p_d^i$: The $i$-th point in the dense point cloud $P_d$.
  • $P_d$: The dense point cloud sampled from the surface $S$. Poisson disk sampling is used to ensure a relatively even distribution of these points.
  • $\mathrm{FPS}(P_d, N_s)$: The Farthest Point Sampling function, which takes the dense point cloud $P_d$ and downsamples it to $N_s$ points, ensuring good coverage of the shape.
  • $N_s$: The target number of sparse points after downsampling.
  • $P_s$: The sparse point cloud, which serves as a query in the subsequent feature encoding step.
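
A rough sketch of drawing the dense point cloud from a triangle mesh follows. Note the substitution: instead of the Poisson disk sampling used by 3DShape2VecSet, it uses simple area-weighted random sampling with barycentric coordinates, which is uniform in expectation but does not enforce a minimum spacing between points. The mesh layout (a vertex list plus index triples) and the function names are assumptions made for this example.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of a 3D triangle from the cross product of two edge vectors."""
    u = [b[d] - a[d] for d in range(3)]
    v = [c[d] - a[d] for d in range(3)]
    cross = (u[1] * v[2] - u[2] * v[1],
             u[2] * v[0] - u[0] * v[2],
             u[0] * v[1] - u[1] * v[0])
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, faces, n, rng):
    """Draw n points on the mesh surface, uniformly with respect to area."""
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    chosen = rng.choices(range(len(faces)), weights=areas, k=n)
    points = []
    for fi in chosen:
        a, b, c = (vertices[i] for i in faces[fi])
        # Square-root trick for uniform barycentric coordinates in a triangle.
        s = math.sqrt(rng.random())
        r = rng.random()
        u, v, w = 1.0 - s, s * (1.0 - r), s * r
        points.append(tuple(u * a[d] + v * b[d] + w * c[d] for d in range(3)))
    return points


# Unit square in the z = 0 plane, split into two triangles.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2), (0, 2, 3)]
P_d = sample_surface(vertices, faces, 100, random.Random(0))
```

The resulting `P_d` plays the role of the dense point cloud; applying Farthest Point Sampling to it would then yield the sparse set $P_s$.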

4.2.1.2. Feature Encoding

Next, the 3DShape2VecSet model computes point cloud features $C$ using a cross-attention mechanism between the sparse point cloud $P_s$ (as query) and the dense point cloud $P_d$ (as key and value). This is followed by self-attention layers to generate the compact latent code $z$.

The mathematical representation for this step is: $ C = \mathrm{CrossAttn}(P_s, P_d, P_d), \quad z = \mathrm{SelfAttn}(C). $ Where:

  • $\mathrm{CrossAttn}(Q, K, V)$: The cross-attention mechanism, where $Q$ is the Query, $K$ is the Key, and $V$ is the Value. Here, $P_s$ acts as the query, while $P_d$ provides both the keys and values. This allows the sparse points to attend to features from the dense points.
  • $C$: The resulting point cloud feature representation.
  • $\mathrm{SelfAttn}(X)$: The self-attention mechanism, which processes the feature $X$ to learn relationships within the feature set itself.
  • $z$: The final compact latent code, which is a compressed representation of the 3D shape.

4.2.1.3. Geometry Decoding

Finally, the latent code $z$ is further processed through self-attention layers. This processed latent code is then used with cross-attention to predict occupancy values for randomly sampled spatial query points $Q_{space} \in \mathbb{R}^3$. Occupancy values indicate whether a given 3D point lies inside (occupied) or outside (empty) the reconstructed 3D shape.

The mathematical representation for this step is: $ \hat{O} = \mathrm{CrossAttn}(Q_{space}, \mathrm{SelfAttn}(z)). $ Where:

  • $\mathrm{SelfAttn}(z)$: Self-attention applied to the latent code $z$ to enrich its representation for decoding.
  • $Q_{space}$: Randomly sampled 3D spatial query points, typically within a bounding box enclosing the object, for which the decoder predicts occupancy.
  • $\mathrm{CrossAttn}(Q_{space}, \mathrm{SelfAttn}(z))$: Shorthand for $\mathrm{CrossAttn}(Q_{space}, \mathrm{SelfAttn}(z), \mathrm{SelfAttn}(z))$, i.e., cross-attention where $Q_{space}$ are the queries and the output of $\mathrm{SelfAttn}(z)$ provides both keys and values. This allows the decoder to predict occupancy for arbitrary 3D points based on the learned shape representation.
  • $\hat{O}$: The predicted occupancy values for the query points $Q_{space}$, which implicitly define the reconstructed 3D surface.

4.2.2. Dora-VAE

Figure 2 provides an overview of the Dora-VAE pipeline.

[Image: schematic of Figure 2 from the paper, showing the two core parts of Dora-VAE: (a) the sharp edge sampling strategy draws uniform points $P_u$ and salient points $P_a$ from the input mesh and combines them with the dense points $P_d$; (b) the dual cross-attention architecture encodes these point clouds, and the mesh is finally reconstructed.]

Figure 2. Overview of Dora-VAE. (a) We utilize the proposed sharp edge sampling technique to extract both salient and uniform points from the input mesh. These points are then combined with dense points, effectively capturing both salient regions and smooth areas. (b) To enhance the encoding of point clouds sampled through sharp edge sampling, we design a dual cross-attention architecture.

For each input mesh, Dora-VAE augments the uniformly sampled point cloud $P_u$ with additional important points $P_a$ obtained through the proposed sharp edge sampling (SES) strategy. These two sets combine to form the dense point cloud $P_d = P_u \cup P_a$. During encoding, Dora-VAE employs a dual cross-attention mechanism to process $P_u$ and $P_a$ separately, summing the results before passing them to self-attention to compute the latent code $z$. The VAE training largely follows 3DShape2VecSet, supervised by a loss evaluated on the occupancy field.
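
The data flow of the dual cross-attention step can be sketched structurally as follows. This is only a toy illustration under strong assumptions: raw 3D coordinates stand in for learned point embeddings, there are no learned projection weights or multiple heads, both branches are assumed to share the same sparse queries, and both point sets are assumed to be non-empty. The function names are hypothetical.

```python
import math


def attention(Q, K, V):
    """Single-head scaled dot-product attention on lists of row vectors."""
    d_k = len(K[0])
    output = []
    for q in Q:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in K]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        output.append([sum(w * v[j] for w, v in zip(weights, V))
                       for j in range(len(V[0]))])
    return output


def dual_cross_attention(P_s, P_u, P_a):
    """Attend from the sparse queries to each point set separately, then sum."""
    C_u = attention(P_s, P_u, P_u)   # branch over uniformly sampled points
    C_a = attention(P_s, P_a, P_a)   # branch over salient (sharp-edge) points
    return [[x + y for x, y in zip(row_u, row_a)] for row_u, row_a in zip(C_u, C_a)]


P_s = [(0.0, 0.0, 0.0)]                       # sparse query points
P_u = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]      # uniform samples
P_a = [(1.0, 1.0, 1.0)]                       # salient samples
C = dual_cross_attention(P_s, P_u, P_a)
print(C)  # summed features, which would then go through self-attention
```

Keeping the two branches separate means the salient points get their own softmax normalization, so they cannot be drowned out by the uniform points within a single attention distribution.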

4.2.2.1. Sharp Edge Sampling (SES)

The Sharp Edge Sampling (SES) algorithm is designed to effectively sample points from geometrically salient regions of a 3D mesh. To ensure overall surface coverage, SES also incorporates uniformly sampled points. The final dense point cloud $P_d$ is the union of uniformly sampled points $P_u$ and points specifically sampled from salient regions $P_a$, i.e., $P_d = P_u \cup P_a$. The process for computing salient points $P_a$ involves two main steps:

4.2.2.1.1. Salient Edges Detection

Given a triangular mesh, SES identifies a set of salient edges $\Gamma$ by analyzing the dihedral angles between adjacent faces. The dihedral angle measures how sharply the surface bends along a mesh edge. For each edge $e$ shared by two adjacent faces $f_1$ and $f_2$, the dihedral angle $\theta_e$ is calculated using their normal vectors.

The dihedral angle $\theta_e$ for an edge $e$ shared by faces $f_1$ and $f_2$ is computed as: $ \theta_e = \arccos\left( \frac{\mathbf{n}_{f_1} \cdot \mathbf{n}_{f_2}}{\|\mathbf{n}_{f_1}\| \, \|\mathbf{n}_{f_2}\|} \right), $ Where:

  • $\mathbf{n}_{f_1}$: The normal vector of face $f_1$.
  • $\mathbf{n}_{f_2}$: The normal vector of face $f_2$.
  • $\cdot$: The dot product operator.
  • $\| \cdot \|$: The Euclidean norm (magnitude) of a vector.
  • $\arccos(\cdot)$: The arccosine function, which returns the angle whose cosine is the argument. The result $\theta_e$ is the angle between the two face normals: it is near 0 where adjacent faces are almost coplanar and grows as the edge becomes sharper.

The salient edge set $\Gamma$ then includes all edges whose dihedral angle $\theta_e$ exceeds a predefined threshold $\tau$: $ \Gamma = \{ e \mid \theta_e > \tau \} $ Where:

  • $e$: An edge in the mesh.
  • $\tau$: A predefined threshold angle. Edges with a dihedral angle greater than $\tau$ are considered salient.

Let $N_\Gamma = |\Gamma|$ represent the total number of salient edges detected.
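
The detection step can be sketched in a few lines of plain Python. The sketch assumes a triangle mesh stored as a vertex list plus index triples, consistently oriented and non-degenerate faces, and it skips edges that do not have exactly two adjacent faces. The 30-degree threshold is a placeholder for illustration only; the value of the threshold is not given in the text above.

```python
import math
from collections import defaultdict


def unit_normal(vertices, face):
    """Unit normal of a triangle; assumes the face has non-zero area."""
    a, b, c = (vertices[i] for i in face)
    u = [b[d] - a[d] for d in range(3)]
    v = [c[d] - a[d] for d in range(3)]
    n = (u[1] * v[2] - u[2] * v[1],
         u[2] * v[0] - u[0] * v[2],
         u[0] * v[1] - u[1] * v[0])
    length = math.sqrt(sum(x * x for x in n))
    return tuple(x / length for x in n)


def detect_salient_edges(vertices, faces, tau):
    """Return edges whose angle between adjacent face normals exceeds tau."""
    normals = [unit_normal(vertices, face) for face in faces]
    edge_to_faces = defaultdict(list)
    for fi, (i, j, k) in enumerate(faces):
        for a, b in ((i, j), (j, k), (k, i)):
            edge_to_faces[(min(a, b), max(a, b))].append(fi)
    salient = []
    for edge, adjacent in edge_to_faces.items():
        if len(adjacent) != 2:
            continue  # boundary or non-manifold edge: no well-defined angle
        n1, n2 = normals[adjacent[0]], normals[adjacent[1]]
        # Clamp to [-1, 1] so rounding error cannot break acos.
        cos_theta = max(-1.0, min(1.0, sum(p * q for p, q in zip(n1, n2))))
        if math.acos(cos_theta) > tau:
            salient.append(edge)
    return sorted(salient)


# Two triangles sharing edge (0, 1), folded at a right angle.
vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
faces = [(0, 1, 2), (1, 0, 3)]
tau = math.radians(30.0)  # placeholder threshold, not the paper's value
print(detect_salient_edges(vertices, faces, tau))  # -> [(0, 1)]
```

The two face normals here are (0, 0, 1) and (0, 1, 0), so the angle between them is 90 degrees and the shared edge is flagged. Flattening the fold so that both triangles lie in the same plane gives an angle of 0 and an empty result.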

4.2.2.1.2. Salient Points Sampling

For each salient edge $e \in \Gamma$, its two vertices, $v_{e,1}$ and $v_{e,2}$, are collected into a salient vertex set $P_\Gamma$. Duplicate vertices arising from connected salient edges are included only once.

$ P_\Gamma = \{ v_{e,1}, v_{e,2} \mid e \in \Gamma \}, $

Where:

  • $v_{e,1}, v_{e,2}$: The two vertices defining the salient edge $e$.
  • $P_\Gamma$: The set of all unique vertices that are part of at least one salient edge.

Let $N_V = |P_\Gamma|$ denote the number of unique vertices in $P_\Gamma$.

Given a target number of salient points $N_{\mathrm{desired}}$, the salient point set $P_a$ is generated based on the available salient vertices, considering three cases:

$ P_a = \begin{cases} \mathrm{FPS}(P_\Gamma, N_{\mathrm{desired}}), & \text{if } N_{\mathrm{desired}} \le N_V, \\ P_\Gamma \cup P_{\mathrm{interpolated}}, & \text{if } 0 < N_V < N_{\mathrm{desired}}, \\ \emptyset, & \text{if } N_V = 0. \end{cases} $

Where:

  • $\mathrm{FPS}(P_\Gamma, N_{\mathrm{desired}})$: If the number of unique salient vertices ($N_V$) is greater than or equal to the desired number of salient points ($N_{\mathrm{desired}}$), Farthest Point Sampling is used to downsample $P_\Gamma$ to exactly $N_{\mathrm{desired}}$ points.
  • $P_\Gamma \cup P_{\mathrm{interpolated}}$: If $N_V$ is less than $N_{\mathrm{desired}}$ (meaning there aren't enough salient vertices), all vertices from $P_\Gamma$ are included. Additional points, $P_{\mathrm{interpolated}}$, are generated by uniformly sampling $(N_{\mathrm{desired}} - N_V) / N_\Gamma$ points along each salient edge in $\Gamma$. This ensures that the target number of salient points is met by distributing the remaining points along the detected sharp features.
  • $\emptyset$: If no salient edges are detected ($N_V = 0$), the salient point set $P_a$ remains empty.
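The three cases can be sketched in pure Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: FPS starts from the first point, the per-edge count is rounded up (so the result can slightly exceed the target), and interpolated points are placed at evenly spaced interior positions. The names `fps` and `sample_salient_points` are hypothetical.

```python
import math

def fps(points, k):
    """Greedy farthest point sampling: each pick is farthest from those already chosen."""
    chosen = [0]                                    # assumption: start from the first point
    dist = [math.dist(p, points[0]) for p in points]
    while len(chosen) < k:
        nxt = max(range(len(points)), key=dist.__getitem__)
        chosen.append(nxt)
        dist = [min(d, math.dist(p, points[nxt])) for d, p in zip(dist, points)]
    return [points[i] for i in chosen]

def sample_salient_points(vertices, gamma, n_desired):
    """Three-case rule for the salient point set, given salient edges as index pairs."""
    ids = sorted({v for e in gamma for v in e})     # unique salient vertices
    p_gamma = [vertices[i] for i in ids]
    n_v = len(p_gamma)
    if n_v == 0:                                    # no salient edges: empty set
        return []
    if n_desired <= n_v:                            # enough vertices: downsample with FPS
        return fps(p_gamma, n_desired)
    # Too few vertices: add evenly spaced points along every salient edge.
    per_edge = math.ceil((n_desired - n_v) / len(gamma))   # assumption: round up
    interpolated = []
    for i, j in gamma:
        a, b = vertices[i], vertices[j]
        for s in range(1, per_edge + 1):
            t = s / (per_edge + 1)                  # interior positions only
            interpolated.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return p_gamma + interpolated

verts = [(0, 0, 0), (1, 0, 0), (1, 1, 0)]
edges = [(0, 1), (1, 2)]
print(sample_salient_points(verts, edges, 2))
print(sample_salient_points(verts, edges, 5))
```

Because of the rounding, a caller that needs exactly the target count would have to subsample the result afterwards; how the paper handles non-integer per-edge counts is not stated in this analysis.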

4.2.2.2. Dual Cross Attention

After obtaining the point clouds $P_d$ (which is $P_u \cup P_a$) through the SES strategy, Dora-VAE employs a dual cross-attention architecture to effectively encode both the uniform and salient regions. Following 3DShape2VecSet, $P_u$ and $P_a$ are first downsampled separately using FPS to form the sparse point cloud $P_s$.

The separate downsampling for $P_u$ and $P_a$ is:

$ P_s = \mathrm{FPS}(P_u, N_{s,1}) \cup \mathrm{FPS}(P_a, N_{s,2}), $

Where:

  • $P_u$: Uniformly sampled points.

  • $P_a$: Salient points sampled by SES.

  • $N_{s,1}$: The number of sparse points sampled from $P_u$.

  • $N_{s,2}$: The number of sparse points sampled from $P_a$.

  • $\mathrm{FPS}(\cdot)$: Farthest Point Sampling.

  • $P_s$: The combined sparse point cloud used as queries for the cross-attention mechanism.

Next, cross-attention features are computed separately for the uniform points ($P_u$) and salient points ($P_a$) using the combined sparse point cloud $P_s$ as the query:

$ \begin{array}{c} C_u = \mathrm{CrossAttn}(P_s, P_u, P_u) \\ C_a = \mathrm{CrossAttn}(P_s, P_a, P_a) \end{array} $

Where:

  • $C_u$: The feature representation derived from the uniform points $P_u$, with $P_s$ as the query.

  • $C_a$: The feature representation derived from the salient points $P_a$, with $P_s$ as the query.

This dual attention design allows the model to focus separately on features from uniform regions (capturing overall structure) and salient regions (capturing fine details).

The final point cloud feature $C$ combines both attention results by summing them:

$ C = C_u + C_a. $

This combined feature $C$ is then used, similar to 3DShape2VecSet, to predict the occupancy field $\hat{O}$ through self-attention blocks. The entire model, with parameters $\psi$, is optimized using Mean Squared Error (MSE) loss against the ground truth occupancy $O$.
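The data flow of the dual cross-attention can be sketched with a toy single-head attention in pure Python. This is a deliberately simplified illustration: it operates on raw coordinates and omits the positional embeddings, learned query/key/value projections, and multi-head structure that a real model would use. The function names are hypothetical.

```python
import math

def cross_attn(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections (simplification)."""
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]       # numerically stable softmax weights
        z = sum(w)
        out.append([sum(wi * v[c] for wi, v in zip(w, values)) / z
                    for c in range(len(values[0]))])
    return out

def dual_cross_attn(p_s, p_u, p_a):
    """Attend from the sparse queries to each dense set separately, then sum."""
    c_u = cross_attn(p_s, p_u, p_u)                 # features from uniform points
    c_a = cross_attn(p_s, p_a, p_a)                 # features from salient points
    return [[x + y for x, y in zip(ru, ra)] for ru, ra in zip(c_u, c_a)]

p_s = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]            # sparse queries
p_u = [(1.0, 0.0, 0.0)]                             # uniform points
p_a = [(0.0, 1.0, 0.0)]                             # salient points
print(dual_cross_attn(p_s, p_u, p_a))
```

The point of the separation is visible in the structure: each softmax is normalised over one point set only, so salient points never compete with the much larger uniform set for attention weight before the two results are summed.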

The gradient of the MSE loss with respect to the model parameters, which drives the optimization, is:

$ \nabla_\psi \mathcal{L}_{\mathrm{MSE}}(\hat{O}, O) = \mathbb{E} \left[ 2 (\hat{O} - O) \frac{\partial \hat{O}}{\partial \psi} \right]. $

Where:

  • $\mathcal{L}_{\mathrm{MSE}}$: The Mean Squared Error loss function.
  • $\hat{O}$: The predicted occupancy values.
  • $O$: The ground truth occupancy values.
  • $\mathbb{E}[\cdot]$: Expectation over the training data.
  • $\frac{\partial \hat{O}}{\partial \psi}$: The gradient of the predicted occupancy with respect to the model parameters $\psi$. This term is part of the gradient calculation for optimizing the model.
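The gradient expression follows from the chain rule. As a brief worked step, assuming the usual definition of MSE as an expected squared difference (the loss itself is not written out in this analysis) and that the ground truth $O$ does not depend on $\psi$:

$ \mathcal{L}_{\mathrm{MSE}}(\hat{O}, O) = \mathbb{E} \left[ (\hat{O} - O)^2 \right] \quad \Longrightarrow \quad \nabla_\psi \mathcal{L}_{\mathrm{MSE}}(\hat{O}, O) = \mathbb{E} \left[ 2 (\hat{O} - O) \frac{\partial \hat{O}}{\partial \psi} \right]. $

The factor $2(\hat{O} - O)$ is the derivative of the squared difference with respect to the prediction, and $\frac{\partial \hat{O}}{\partial \psi}$ carries it back to the parameters.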

4.2.3. Dora-Bench

Dora-Bench is proposed to facilitate a more rigorous and systematic evaluation of VAE performance by addressing the limitations of conventional evaluation protocols that use randomly selected shapes and general metrics.

4.2.3.1. Geometric Complexity-based Evaluation

Dora-Bench categorizes test shapes based on their geometric complexity. This complexity is quantified using the number of salient edges $N_\Gamma$ (as defined in Section 3.2.1, where $N_\Gamma = |\Gamma|$). Shapes are classified into four levels:

  • Level 1 (Less Detail): $0 < N_\Gamma \leq 5000$. These shapes have relatively few sharp features.

  • Level 2 (Moderate Detail): $5000 < N_\Gamma \leq 10000$. Shapes with a moderate number of sharp features.

  • Level 3 (Rich Detail): $10000 < N_\Gamma \leq 50000$. Shapes with many intricate details and sharp features.

  • Level 4 (Very Rich Detail): $N_\Gamma > 50000$. Shapes with extremely high geometric complexity, characterized by a very large number of sharp edges.

    The benchmark curates test shapes from various public datasets, including GSO [18], ABO [14], Meta [3], and Objaverse [16], to ensure a diverse range of geometric complexities.
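The four level thresholds listed above map directly onto a small helper. The function name `dora_bench_level` is hypothetical, and returning `None` for a shape with no salient edges is an assumption, since that case falls outside the four listed ranges.

```python
def dora_bench_level(n_gamma):
    """Map a salient-edge count to a Dora-Bench level (1 to 4)."""
    if n_gamma <= 0:
        return None       # assumption: shapes without salient edges are unclassified
    if n_gamma <= 5000:
        return 1          # less detail
    if n_gamma <= 10000:
        return 2          # moderate detail
    if n_gamma <= 50000:
        return 3          # rich detail
    return 4              # very rich detail

for count in (1200, 7500, 32000, 120000):
    print(count, dora_bench_level(count))
```

Note that the upper bounds are inclusive, so a shape with exactly 5000 salient edges is Level 1 and one with 5001 is Level 2.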

The following figure (Figure 3 from the original paper) illustrates the distribution and examples of shapes across different complexity levels in Dora-Bench:

Figure 3. Our proposed benchmark include 3D shapes from the ABO [14], GSO [18], Meta [3], and Objaverse [16] datasets. (a) The histogram of different datasets across different shape complexities. (b) The pie chart of the total counts by shape complexities. (c) Sample shapes of different shape complexities.

4.2.3.2. Sharp Normal Error (SNE)

To specifically assess the preservation of fine geometric details, Dora-Bench introduces Sharp Normal Error (SNE). Unlike general metrics, SNE focuses on the normal map differences between reconstructed and ground truth shapes within geometrically significant areas. The process is as follows:

  1. Normal Map Rendering: Normal maps are rendered for both the ground truth shape and the reconstructed shape from multiple viewpoints.

  2. Salient Region Identification: Canny edge detection is applied to these normal maps to identify sharp regions, which are then dilated to create evaluation masks. These masks delineate the areas where fine details are expected.

  3. Error Calculation: The Mean Squared Error (MSE) is computed between the ground truth and reconstructed normal maps, but only within the masked salient areas.

    The following figure (Figure 4 from the original paper) visualizes the process of computing Sharp Normal Error (SNE):

Figure 4. The process of computing sharp normal errors (SNE). We compute MSE loss in the sharp regions of the normal.

This focused evaluation allows SNE to directly quantify how well VAEs preserve sharp geometric features, addressing a critical limitation of other common metrics.
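The masking and error steps can be sketched in pure Python. The sketch assumes the edge map has already been produced by an edge detector such as Canny (not implemented here), represents normal maps as nested lists of 3-tuples, and uses a square dilation window whose size is an arbitrary choice. The names `dilate` and `sharp_normal_error` are hypothetical, and the exact averaging convention used by the paper may differ.

```python
def dilate(mask, radius=1):
    """Binary dilation with a (2 * radius + 1) square window."""
    h, w = len(mask), len(mask[0])
    out = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = True
    return out

def sharp_normal_error(n_gt, n_rec, edges, radius=1):
    """Mean squared normal difference over the dilated edge mask."""
    mask = dilate(edges, radius)
    total, count = 0.0, 0
    for y, row in enumerate(mask):
        for x, inside in enumerate(row):
            if inside:
                total += sum((a - b) ** 2 for a, b in zip(n_gt[y][x], n_rec[y][x]))
                count += 1
    return total / count if count else 0.0          # assumption: 0 when the mask is empty

# 3 x 3 toy maps: the reconstruction is wrong only at the centre pixel.
gt = [[(0.0, 0.0, 1.0)] * 3 for _ in range(3)]
rec = [row[:] for row in gt]
rec[1][1] = (0.0, 1.0, 0.0)
edge_map = [[True, False, False], [False, False, False], [False, False, False]]
print(sharp_normal_error(gt, rec, edge_map))
```

In the toy case the dilated mask covers four pixels, only one of which carries an error, which illustrates how the metric ignores everything outside the sharp regions.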

5. Experimental Setup

This section details the datasets used, the evaluation metrics employed, and the baseline models against which Dora-VAE was compared.

5.1. Datasets

The Dora-VAE model was trained on a filtered subset of approximately 400,000 3D meshes from the Objaverse [16] dataset. This large-scale dataset was chosen to ensure comprehensive training. Low-quality meshes with issues like missing faces or severe self-intersections were filtered out to enhance training stability.

For the Dora-bench benchmark, test shapes were curated from multiple public datasets to ensure diverse geometric complexities:

  • ABO [14] (Amazon Berkeley Objects)

  • GSO [18] (Google Scanned Objects)

  • Meta [3] (Digital Twin Catalog from Meta)

  • Objaverse [16] test set

    The Dora-bench categorizes these models into four detail levels (Level 1 to Level 4), with approximately 800 samples per level. Due to the scarcity of highly detailed models in ABO, GSO, and Meta, Level 4 samples were predominantly sourced from the Objaverse test set.

An example of data distribution across complexity levels in Dora-bench is shown in Figure 3 from the Methodology section, depicting histograms and example meshes for each level.

5.2. Evaluation Metrics

The reconstruction quality of Dora-VAE and baselines was evaluated using 1 million (1M) sampled points from the input meshes, comparing them with their decoded counterparts. Three primary metrics were used:

5.2.1. F-score (rr)

  • Conceptual Definition: F-score (also known as the F-measure, or the F1-score when precision and recall are weighted equally, as here) assesses the reconstruction accuracy by computing the precision and recall of point correspondences between the reconstructed shape and the ground truth within a specified distance threshold $r$. It captures both how many reconstructed points lie close to the ground truth (precision) and how much of the ground truth is covered by the reconstruction (recall). A higher F-score indicates better reconstruction quality.
  • Mathematical Formula: The F-score is typically calculated as the harmonic mean of precision and recall. For point cloud reconstruction, precision and recall are defined based on distances between points. Given two point clouds $P_{rec}$ (reconstructed) and $P_{gt}$ (ground truth):
    • Precision: The proportion of points in $P_{rec}$ that are "close" to a point in $P_{gt}$. $ \mathrm{Precision}(P_{rec}, P_{gt}, r) = \frac{1}{|P_{rec}|} \sum_{p \in P_{rec}} \mathbb{I}\left(\min_{q \in P_{gt}} \|p - q\| \le r\right) $
    • Recall: The proportion of points in $P_{gt}$ that are "close" to a point in $P_{rec}$. $ \mathrm{Recall}(P_{rec}, P_{gt}, r) = \frac{1}{|P_{gt}|} \sum_{q \in P_{gt}} \mathbb{I}\left(\min_{p \in P_{rec}} \|p - q\| \le r\right) $
    • F-score: $ \text{F-score}(P_{rec}, P_{gt}, r) = 2 \cdot \frac{\mathrm{Precision}(P_{rec}, P_{gt}, r) \cdot \mathrm{Recall}(P_{rec}, P_{gt}, r)}{\mathrm{Precision}(P_{rec}, P_{gt}, r) + \mathrm{Recall}(P_{rec}, P_{gt}, r)} $
  • Symbol Explanation:
    • $P_{rec}$: The point cloud sampled from the reconstructed 3D shape.
    • $P_{gt}$: The point cloud sampled from the ground truth 3D shape.
    • $r$: A distance threshold. Points within this distance are considered a match.
    • $|P_{rec}|$, $|P_{gt}|$: The number of points in the reconstructed and ground truth point clouds, respectively.
    • $\mathbb{I}(\cdot)$: The indicator function, which is 1 if its argument is true, and 0 otherwise.
    • $\min_{q \in P_{gt}} \|p - q\|$: The minimum Euclidean distance from point $p$ in $P_{rec}$ to any point $q$ in $P_{gt}$.

    The paper reports F-score at two thresholds, F-score (0.01) and F-score (0.005), with shapes normalized to the $[-1, 1]$ range.
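The precision, recall, and F-score definitions above translate into a few lines of pure Python. The brute-force nearest-neighbour search is an assumption made for brevity and is only practical for small point sets; an evaluation on 1M points would need a spatial index. The name `f_score` is hypothetical.

```python
import math

def f_score(p_rec, p_gt, r):
    """F-score at distance threshold r, using brute-force nearest neighbours."""
    def frac_within(src, dst):
        # Fraction of points in src whose nearest neighbour in dst is within r.
        hits = sum(1 for p in src if min(math.dist(p, q) for q in dst) <= r)
        return hits / len(src)
    precision = frac_within(p_rec, p_gt)
    recall = frac_within(p_gt, p_rec)
    if precision + recall == 0:
        return 0.0                                  # assumption: define F-score as 0 here
    return 2 * precision * recall / (precision + recall)

gt = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
rec = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5)]            # second point is 0.5 away from the surface
print(f_score(rec, gt, r=0.1))
```

Tightening the threshold from 0.01 to 0.005, as the paper does, makes the metric more sensitive to small surface deviations.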

5.2.2. Chamfer Distance (CD)

  • Conceptual Definition: Chamfer Distance (CD) is a widely used metric for measuring the similarity between two point clouds. It calculates the average squared Euclidean distance from each point in one point cloud to its nearest neighbor in the other point cloud, and vice versa. A lower CD indicates greater similarity between the shapes.
  • Mathematical Formula: Given two point clouds $P_{rec}$ (reconstructed) and $P_{gt}$ (ground truth): $ \mathrm{CD}(P_{rec}, P_{gt}) = \frac{1}{|P_{rec}|} \sum_{p \in P_{rec}} \min_{q \in P_{gt}} \|p - q\|_2^2 + \frac{1}{|P_{gt}|} \sum_{q \in P_{gt}} \min_{p \in P_{rec}} \|q - p\|_2^2 $
  • Symbol Explanation:
    • $P_{rec}$: The point cloud sampled from the reconstructed 3D shape.
    • $P_{gt}$: The point cloud sampled from the ground truth 3D shape.
    • $|P_{rec}|$, $|P_{gt}|$: The number of points in the reconstructed and ground truth point clouds, respectively.
    • $\min_{q \in P_{gt}} \|p - q\|_2^2$: The squared Euclidean distance from point $p$ in $P_{rec}$ to its nearest neighbor $q$ in $P_{gt}$.
    • $\min_{p \in P_{rec}} \|q - p\|_2^2$: The squared Euclidean distance from point $q$ in $P_{gt}$ to its nearest neighbor $p$ in $P_{rec}$.

    The paper reports CD multiplied by 10000 for presentation (e.g., CD x 10000).
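A direct pure-Python transcription of the Chamfer Distance formula above is shown below. As with any brute-force version, it is quadratic in the number of points and is meant only to make the two directional terms concrete; the name `chamfer_distance` is hypothetical.

```python
import math

def chamfer_distance(p_rec, p_gt):
    """Symmetric Chamfer Distance with squared Euclidean nearest-neighbour terms."""
    def mean_sq_nn(src, dst):
        # Average squared distance from each point in src to its nearest point in dst.
        return sum(min(math.dist(p, q) ** 2 for q in dst) for p in src) / len(src)
    return mean_sq_nn(p_rec, p_gt) + mean_sq_nn(p_gt, p_rec)

rec = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
gt = [(0.0, 0.0, 0.0)]
print(chamfer_distance(rec, gt))
```

The two terms are not interchangeable: in the usage example the stray reconstructed point contributes to the first term only, while the ground truth point is perfectly matched in the second.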

5.2.3. Sharp Normal Error (SNE)

  • Conceptual Definition: As proposed in Section 3.3.2, Sharp Normal Error (SNE) specifically evaluates reconstruction quality in salient regions by measuring normal map differences between the reconstructed and ground truth shapes. It focuses on how accurately sharp edges and fine geometric details are preserved, unlike global metrics that might average out local errors. A lower SNE indicates better preservation of sharp features.
  • Mathematical Formula: The SNE is computed as the Mean Squared Error (MSE) between the normal maps of the ground truth ($N_{gt}$) and reconstructed ($N_{rec}$) shapes, but only within masked salient areas ($M$). $ \mathrm{SNE} = \frac{1}{|M|} \sum_{(x,y) \in M} \|N_{gt}(x,y) - N_{rec}(x,y)\|_2^2 $
  • Symbol Explanation:
    • $N_{gt}(x,y)$: The normal vector at pixel $(x,y)$ in the ground truth normal map.
    • $N_{rec}(x,y)$: The normal vector at pixel $(x,y)$ in the reconstructed normal map.
    • $M$: The set of pixels within the evaluation mask, which defines the salient regions detected by Canny edge detection and dilation.
    • $|M|$: The number of pixels in the masked area.
    • $\|\cdot\|_2^2$: The squared Euclidean distance (squared L2 norm) between the normal vectors.

    The paper reports SNE multiplied by 100 (e.g., SNE x 100).

5.2.4. Latent Code Length (LCL)

  • Conceptual Definition: Latent Code Length (LCL) refers to the dimensionality or number of tokens in the compact latent representation generated by the VAE. While not a performance metric in itself, it's a crucial factor for efficiency in downstream tasks like diffusion model training. Shorter latent codes indicate higher compression and faster processing, but typically pose a challenge for maintaining reconstruction quality. For a fair comparison, LCL is reported as it often correlates with reconstruction capability.

5.3. Baselines

Dora-VAE was compared against several state-of-the-art approaches:

  1. XCube-VAE [45]: A volumetric method known for achieving high reconstruction quality but generating very large latent codes ($>10,000$ tokens). This represents the high-fidelity, low-compression end of the spectrum.

  2. XCube-VAE† [45]: A fine-tuned version of the original XCube-VAE. The authors fine-tuned XCube-VAE on the same dataset used for Dora-VAE to ensure a fair comparison, as the original XCube might have been trained on different data. This baseline provides a strong upper bound for reconstruction quality.

  3. Craftsman-VAE [31]: This model is a fine-tuned version of 3DShape2VecSet [63] (a vector set-based method) for shorter latent codes, trained on Objaverse. It represents the compact latent space, potentially lower-fidelity end of the spectrum, and is a direct competitor in the Vecset-based category.

  4. 3DShape2VecSet [63]: (Mentioned in supplementary materials) The original vector set-based VAE that Dora-VAE builds upon. It was originally trained on ShapeNet [6], a smaller dataset, so its performance is generally lower, but it is important for context.

VAE models from Direct3D [58] and CLAY [65] were excluded from the primary comparison because their implementations were not publicly available at the time of submission. XCube-VAE was also excluded from downstream diffusion model comparisons because its latent code length of more than 10,000 tokens is impractical for diffusion model training.

5.4. Implementation Details

  • Mesh Preprocessing: All meshes were preprocessed using CLAY [65] methods to ensure watertight 3D models.
  • Training Data: Approximately 400,000 3D meshes filtered from Objaverse [16], with low-quality meshes removed.
  • Training Environment: Trained on 32 A100 GPUs for two days.
  • Training Parameters: Batch size of 2048, learning rate of 5e-5.
  • Efficiency Techniques: Flash-Attention-v2 [15], mixed-precision training with FP16, and gradient checkpointing [12] were used to optimize memory and training efficiency.
  • Sharp Edge Sampling (SES) Parameters: $N_{\mathrm{desired}} = 16384$ (target number of salient points) and $\tau = 30$ (dihedral angle threshold).
  • Sharp Normal Error (SNE) Parameters: Low threshold of 20 and high threshold of 200 for Canny edge detection.
  • VAE Architecture: Follows successful designs [31, 67] with 8 self-attention layers in the encoder and 16 in the decoder.
  • Latent Code Length (LCL) Multi-resolution Training: During training, $N_s$ (the latent code length) was randomly selected between 256 and 1280. This strategy, adopted from CLAY [65], facilitates progressive training for subsequent diffusion stages.
  • KL Divergence Weight: Set to 0.001.
  • Spatial Query Points ($Q_{space}$): Constructed by combining points randomly sampled near the mesh surface and points uniformly sampled within the $[-1, 1]$ spatial range.
  • Diffusion Model for Image-to-3D: A conditional diffusion model based on the DiT [7, 41] architecture, similar to Direct3D [58] and CLAY [65]. It conditions on image features extracted by DINOv2 [40] from single-view images rendered using BlenderProc [17]. This model has 0.39 billion parameters and was trained on 32 A100 GPUs for three days.

6. Results & Analysis

This section presents the experimental results, comparing Dora-VAE qualitatively and quantitatively against baselines, and analyzing its performance, including ablation studies and its application in single-image to 3D generation.

6.1. Core Results Analysis

6.1.1. Qualitative Comparison

The visual comparisons in Figure 5 demonstrate Dora-VAE's effectiveness in preserving geometric details across varying levels of shape complexity, especially for L3 and L4 shapes.

The following figure (Figure 5 from the original paper) shows visual comparisons of different methods across different complexity levels:

Figure 5. Visual comparisons of different methods on Dora-bench. Our method consistently shows better reconstruction quality, especially in preserving geometric details (e.g., sharp edges), compared to baselines. This is evident in the normal maps and overall shape fidelity, particularly for complex shapes (L3 and L4).

  • Low Complexity (L1 and L2): For shapes with lower geometric complexity, most methods achieve comparable reconstruction quality. This indicates that simple shapes are generally easy for modern VAEs to encode and reconstruct.

  • High Complexity (L3 and L4): The advantages of Dora-VAE become pronounced when dealing with shapes of higher complexity. Dora-VAE visibly maintains finer details, such as sharp edges and intricate surface variations, which are often missed by other methods.

  • XCube-VAE vs. Dora-VAE: While XCube-VAE can achieve similar visual quality, it requires a significantly larger latent code length (LCL) (over 10,000 tokens) compared to Dora-VAE's 1,280 tokens. This implies Dora-VAE provides a far more efficient representation for similar visual fidelity.

  • Craftsman-VAE: Craftsman-VAE shows a noticeable degradation in reconstruction quality for complex shapes, failing to capture fine geometric details. This aligns with the paper's motivation that uniform sampling limits detail preservation.

    Additional visual comparisons provided in the supplementary material (Figures S8 and S9) further support these observations, particularly highlighting XCube's tendency for geometric deviation from the ground truth despite rich visual details, which is attributed to quantization errors during mesh extraction using NKSR [24].

The following figure (Figure S8 from the original paper) illustrates visual comparisons for Level 3 and 4 examples:

Figure S8. Qualitative comparison of the normal reconstruction results for Level 3 and Level 4 models (details from Table S3). Our method (Ours full) consistently shows superior performance in preserving fine geometric details, leading to more accurate normal maps compared to baselines such as Craftsman, 3DShape2VecSet, and Xcube†.

The following figure (Figure S9 from the original paper) provides more visual comparisons of Level 3 and 4 examples:

Figure S9. Further qualitative comparison of the normal reconstruction results for Level 3 and Level 4 models (details from Table S3). This figure reinforces the superior detail preservation of our Dora-VAE, especially in complex regions, compared to Craftsman, 3DShape2VecSet, and Xcube†, as evidenced by the fidelity of the reconstructed normal maps.

6.1.2. Quantitative Comparison

The following are the results from Table 1 of the original paper:

| Methods | LCL | ↑ F-score(0.01) × 100 |  |  |  | ↑ F-score(0.005) × 100 |  |  |  | ↓ CD × 10000 |  |  |  | ↓ SNE × 100 |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 |
| Xcube [45] | >10000 | 98.968 | 98.799 | 98.615 | 98.226 | 95.525 | 93.872 | 92.322 | 85.365 | 6.315 | 6.288 | 7.935 | 9.926 | 1.579 | 1.432 | 1.430 | 1.679 |
| Xcube† [45] | >10000 | 99.393 | 99.794 | 99.824 | 99.079 | 96.753 | 95.535 | 93.422 | 87.365 | 4.015 | 4.142 | 5.740 | 7.627 | 1.543 | 1.408 | 1.259 | 1.639 |
| Craftsman [31] | 256 | 98.016 | 95.874 | 91.756 | 81.739 | 87.994 | 82.549 | 73.000 | 57.379 | 4.389 | 9.129 | 14.530 | 33.441 | 1.906 | 1.873 | 2.191 | 3.933 |
| Ours w/o SES, DCA | 1280 | 99.964 | 99.925 | 99.678 | 97.890 | 96.561 | 95.975 | 91.618 | 83.124 | 2.236 | 2.506 | 4.444 | 6.432 | 1.448 | 1.215 | 1.205 | 1.828 |
| Ours w/o DCA | 1280 | 99.944 | 99.814 | 97.294 | 96.779 | 95.977 | 94.623 | 88.406 | 79.240 | 2.422 | 2.983 | 3.980 | 6.196 | 1.496 | 1.313 | 1.352 | 2.207 |
| Ours full | 256 | 99.507 | 98.986 | 96.669 | 89.577 | 93.272 | 90.466 | 82.386 | 68.669 | 3.356 | 5.202 | 10.276 | 24.527 | 1.555 | 1.410 | 1.618 | 3.035 |
| Ours full | 1280 | 99.988 | 99.955 | 99.880 | 99.170 | 97.038 | 96.831 | 93.458 | 87.473 | 2.097 | 2.500 | 3.945 | 5.265 | 1.433 | 1.186 | 1.137 | 1.579 |

The quantitative results in Table 1, evaluated on Dora-bench, demonstrate Dora-VAE's superior performance across all complexity levels, with a particularly significant advantage for L3 and L4 shapes.

  • Overall Performance: Dora-VAE (full model with 1280 LCL) achieves the highest F-score and lowest CD and SNE across almost all complexity levels. For L4 shapes (very rich detail), its F-score (0.01) is 99.170, F-score (0.005) is 87.473, CD x 10000 is 5.265, and SNE x 100 is 1.579.

  • Comparison with XCube-VAE†: Dora-VAE with 1280 LCL achieves comparable or better F-scores and SNE values than XCube-VAE† (which uses $>10000$ LCL), while significantly outperforming it in CD. For instance, Dora-VAE achieves a CD x 10000 of 2.097 for L1 (vs. 4.015 for XCube-VAE†), representing a $47.77\%$ improvement. This is remarkable given that Dora-VAE's latent space is at least $8 \times$ smaller. The paper attributes XCube-VAE's relatively higher CD to quantization errors introduced by NKSR [24] during mesh extraction, despite its dense representation.

  • Comparison with Craftsman-VAE: Craftsman-VAE (256 LCL) shows significantly lower performance than Dora-VAE across all metrics, especially for L3 and L4 shapes, which is consistent with the difficulty of preserving geometric detail under uniform sampling and shorter latent codes. For L4, its F-score (0.005) is 57.379 and SNE x 100 is 3.933, much worse than the 87.473 and 1.579 of Dora-VAE at 1280 LCL. The gap persists at matched code length: Dora-VAE at 256 LCL reaches 68.669 and 3.035 on the same two L4 metrics.

  • SNE Metric Validation: The superior performance in SNE for Dora-VAE is particularly important, validating the effectiveness of the sharp edge sampling strategy. For L4 shapes, Dora-VAE's SNE x 100 is 1.579 compared to XCube-VAE†'s 1.639, a $3.7\%$ improvement, which aligns with the qualitative observation of better preservation of fine details.
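The percentage improvements quoted in the bullets above are relative reductions of lower-is-better metrics. As a quick arithmetic check against the Table 1 values, a small helper (the name `relative_improvement` is hypothetical) can recompute them:

```python
def relative_improvement(baseline, ours):
    """Relative reduction of a lower-is-better metric, in percent."""
    return 100.0 * (baseline - ours) / baseline

# L1 CD x 10000: XCube-VAE† vs. Dora-VAE (1280 LCL)
print(round(relative_improvement(4.015, 2.097), 2))
# L4 SNE x 100: XCube-VAE† vs. Dora-VAE (1280 LCL)
print(round(relative_improvement(1.639, 1.579), 1))
```

Both values agree with the figures stated in the text once rounded to the same number of decimals.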

    The supplementary material also includes Table S2, which shows 3DShape2VecSet's consistent underperformance due to limited training data.

The following are the results from Table S2 of the original paper (from supplementary):

| Methods | LCL | ↑ F-score(0.01) × 100 |  |  |  | ↑ F-score(0.005) × 100 |  |  |  | ↓ CD × 10000 |  |  |  | ↓ SNE × 100 |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  |  | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 | L1 | L2 | L3 | L4 |
| Xcube [45] | >10000 | 98.968 | 98.799 | 98.615 | 98.226 | 95.525 | 93.872 | 92.322 | 85.365 | 6.315 | 6.288 | 7.935 | 9.926 | 1.579 | 1.432 | 1.430 | 1.679 |
| Xcube† [45] | >10000 | 99.393 | 99.794 | 99.824 | 99.079 | 96.753 | 95.535 | 93.422 | 87.365 | 4.015 | 4.142 | 5.740 | 7.627 | 1.543 | 1.408 | 1.259 | 1.639 |
| VecSet [63] | 512 | 94.768 | 88.890 | 80.126 | 59.347 | 77.545 | 67.929 | 55.516 | 34.619 | 27.380 | 42.075 | 100.975 | 159.151 | 2.939 | 3.056 | 3.470 | 6.034 |
| Craftsman [31] | 256 | 98.016 | 95.874 | 91.756 | 81.739 | 87.994 | 82.549 | 73.000 | 57.379 | 4.389 | 9.129 | 14.530 | 33.441 | 1.906 | 1.873 | 2.191 | 3.933 |
| Ours w/o SES, DCA | 1280 | 99.964 | 99.925 | 99.678 | 97.890 | 96.561 | 95.975 | 91.618 | 83.124 | 2.236 | 2.506 | 4.444 | 6.432 | 1.448 | 1.215 | 1.205 | 1.828 |
| Ours w/o DCA | 1280 | 99.944 | 99.814 | 97.294 | 96.779 | 95.977 | 94.623 | 88.406 | 79.240 | 2.422 | 2.983 | 3.980 | 6.196 | 1.496 | 1.313 | 1.352 | 2.207 |
| Ours full | 256 | 99.507 | 98.986 | 96.669 | 89.577 | 93.272 | 90.466 | 82.386 | 68.669 | 3.356 | 5.202 | 10.276 | 24.527 | 1.555 | 1.410 | 1.618 | 3.035 |
| Ours full | 1280 | 99.988 | 99.955 | 99.880 | 99.170 | 97.038 | 96.831 | 93.458 | 87.473 | 2.097 | 2.500 | 3.945 | 5.265 | 1.433 | 1.186 | 1.137 | 1.579 |

6.1.3. Application: Single Image to 3D

Dora-VAE's effectiveness is further validated by its application to single-image 3D generation using latent diffusion models. A diffusion model based on the DiT [42] architecture was implemented, similar to CLAY [65]. For comparison, Craftsman-VAE was fine-tuned on the same dataset (Craftsman-VAE†). XCube-VAE was excluded from this comparison because its latent code length ($>10,000$ tokens) is impractical for diffusion model training.

The following figure (Figure 7 from the original paper) displays generation results from diffusion models trained with Dora-VAE and Craftsman-VAE†:

Figure 7. The diffusion results of the single image to 3D generation trained on our Dora-VAE and Craftsman†. The 3D geometry generated by the diffusion model trained on our proposed DoraVAE has more details under the same experimental environment.

Both models used identical architectures (0.39B parameters) and training conditions (same dataset, 32 A100 GPUs, 3 days). The visual results in Figure 7 clearly show that the diffusion model trained with Dora-VAE generates 3D shapes with significantly better preservation of geometric details (e.g., sharper edges, more defined features) compared to Craftsman-VAE†. This directly validates that Dora-VAE's improved reconstruction capabilities translate into higher-quality outputs for downstream generative tasks.

Further comparisons in the supplementary material (Figures S10 and S11) against LRM-based methods (MeshFormer [36], CRM [55]) and a commercial solution (Tripo v2.0 [2]) show:

  • Superiority over LRM-based methods: Dora-VAE-based generation achieves superior geometric detail and fidelity compared to MeshFormer and CRM, whose limitations are attributed to a lack of explicit geometric constraints.

  • Comparability with commercial solutions: Dora-VAE achieves geometric quality comparable to Tripo v2.0, a leading commercial solution, despite using significantly more constrained resources (3 days training on 32 A100 GPUs with $\sim 400,000$ training samples). This highlights Dora-VAE's efficiency and effectiveness.

    The following figure (Figure S10 from the original paper) shows a qualitative comparison of the Image-to-3D results:

Figure S10. Qualitative comparison of the Image-to-3D results. This figure compares outputs from our Dora-VAE-based diffusion model against MeshFormer, CRM, and Tripo v2.0. Our method consistently produces 3D geometry with richer and more accurate details, demonstrating its superior capability in preserving fine features from a single input image.

The following figure (Figure S11 from the original paper) provides more qualitative comparison of the Image-to-3D results:

Figure S11. Qualitative comparison of the Image-to-3D results. Further examples show the Dora-VAE-based diffusion model's ability to generate highly detailed and geometrically faithful 3D shapes from single images, outperforming LRM-based methods and achieving results competitive with commercial solutions.

6.2. Ablation Studies / Parameter Analysis

To quantify the contribution of each proposed component, Dora-VAE was subjected to ablation studies, comparing the full model with two variants under identical training conditions:

  1. Ours w/o SES, DCA: This variant completely removes both the sharp edge sampling (SES) strategy and the dual cross-attention (DCA) mechanism. Essentially, it reverts to using only uniformly sampled point clouds with Poisson disk sampling, similar to existing Vecset-based VAEs, while maintaining the same total number of dense points ($N_d$).

  2. Ours w/o DCA: This variant retains the sharp edge sampling (SES) but removes the dual cross-attention (DCA). Instead, it uses a single cross-attention mechanism, similar to the one adopted by 3DShape2VecSet [63], processing all sampled points (uniform + salient) together.

    The following figure (Figure 6 from the original paper) illustrates the ablation study results:

Figure 6. Ablation studies of our method. Given the ground truth of mesh, we employ both our full model and its variations to reconstruct the ground truth mesh, highlighting significant reconstruction discrepancies with red boxes. This visually demonstrates the contribution of SES and DCA to detail preservation.

As shown in Figure 6 and quantitatively in Table 1 (rows for "Ours w/o SES, DCA" and "Ours w/o DCA"), the full Dora-VAE model consistently outperforms both ablated variants.

  • Impact of SES and DCA (Ours w/o SES, DCA vs. Ours full):
    • The "Ours w/o SES, DCA" variant (1280 LCL) performs noticeably worse than the "Ours full" model (1280 LCL). For instance, its CD x 10000 for L4 is 6.432 compared to 5.265 for "Ours full". Its SNE x 100 for L4 is 1.828 compared to 1.579 for "Ours full". This significant drop in performance, especially for CD and SNE on complex shapes, highlights the crucial role of both sharp edge sampling and dual cross-attention in preserving fine geometric details. Visually, the red boxes in Figure 6 show clear degradation without these components.
  • Impact of DCA (Ours w/o DCA vs. Ours full):
    • The "Ours w/o DCA" variant (1280 LCL), while benefiting from SES, still performs worse than the "Ours full" model. Its CD x 10000 for L4 is 6.196 and its SNE x 100 for L4 is 2.207, compared with 5.265 and 1.579 for "Ours full". This indicates that merely sampling salient points is not enough; the dual cross-attention mechanism is essential for effectively leveraging these detail-rich points during encoding, because it lets the model attend to them separately.

      These ablation studies clearly validate the effectiveness and necessity of both the Sharp Edge Sampling (SES) strategy and the Dual Cross-Attention (DCA) architecture for Dora-VAE to achieve its high-fidelity reconstruction capabilities.
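To put the L4 figures quoted above on a common scale, the gap between each ablated variant and the full model can be expressed as a relative increase in error. The snippet below is only a worked calculation on the numbers cited in this section; the helper name `relative_increase` is introduced here for illustration.

```python
def relative_increase(variant, full):
    """Percentage by which the variant's error exceeds the full model's."""
    return (variant - full) / full * 100.0

# L4 values as quoted in this section (lower is better for both metrics).
full_model = {"CD x 10000": 5.265, "SNE x 100": 1.579}
ablations = {
    "Ours w/o SES, DCA": {"CD x 10000": 6.432, "SNE x 100": 1.828},
    "Ours w/o DCA": {"CD x 10000": 6.196, "SNE x 100": 2.207},
}

gaps = {
    name: {
        metric: round(relative_increase(value, full_model[metric]), 1)
        for metric, value in metrics.items()
    }
    for name, metrics in ablations.items()
}

for name, metrics in gaps.items():
    for metric, pct in metrics.items():
        print(f"{name}: {metric} is {pct}% higher than Ours full")
```

On these quoted values, both variants show a CD increase of roughly a fifth, while the SNE gap for "Ours w/o DCA" is the largest of the four.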

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Dora-VAE, a novel Variational Autoencoder designed to overcome the limitations of existing 3D shape VAEs in preserving fine geometric details while maintaining compact latent representations. At its core, Dora-VAE innovates through a sharp edge sampling (SES) strategy, which prioritizes points from geometrically salient regions, and a dual cross-attention architecture, specifically engineered to effectively encode these detail-rich point clouds. To provide a rigorous evaluation framework, the authors developed Dora-bench, a benchmark that systematically categorizes 3D shapes by their geometric complexity and features a new metric, Sharp Normal Error (SNE), for assessing reconstruction accuracy of fine details. Extensive experiments on Dora-bench conclusively demonstrate that Dora-VAE significantly outperforms existing methods across various complexity levels. Furthermore, its superior reconstruction capabilities directly translate to enhanced quality in downstream tasks, as evidenced by its application in single-image 3D generation through diffusion models, where it generates more geometrically detailed shapes with a latent space at least $8\times$ smaller than state-of-the-art dense models.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Current Limitations: The primary limitation is the challenge in maintaining high-quality reconstructions when the number of latent tokens is further reduced. While Dora-VAE achieves state-of-the-art quality with 1,280 tokens, significantly compressing beyond this point remains difficult, especially compared to advancements in 2D image compression (e.g., Deep Compression Autoencoder (DC-AE) [8]).
  • Future Directions:
    1. Enhanced Compression Efficiency: Future work aims to explore novel techniques to increase the compression rate of 3D VAEs while preserving reconstruction quality, potentially bridging the efficiency gap between 2D and 3D compression methods.
    2. Advanced Diffusion Models: Leveraging Dora-VAE's superior reconstruction capabilities, the authors plan to develop more powerful image-to-3D diffusion models. They believe that improved VAE reconstruction can directly elevate the performance ceiling of diffusion models, leading to higher-quality generation results under similar training conditions.

7.3. Personal Insights & Critique

Dora-VAE presents a compelling solution to a critical problem in 3D generative AI: the trade-off between latent space compactness and geometric detail fidelity. The paper's strength lies not just in proposing a new model but also in establishing a more rigorous benchmarking framework (Dora-bench) and a specialized evaluation metric (SNE). This holistic approach is commendable, as a good benchmark is often as important as the model itself for advancing a field.

Inspirations:

  • Task-Specific Importance Sampling: The idea of sharp edge sampling is a powerful demonstration of how domain-specific knowledge (identifying salient edges via dihedral angles) can be integrated into general deep learning pipelines to achieve significant performance gains. This principle could be applied to other data modalities where specific features are disproportionately important.
  • Dual Processing Pathways: The dual cross-attention mechanism, which separates and then combines features from uniform and salient regions, is an elegant solution to handling heterogeneous information within a single input. This pattern could be beneficial in other scenarios where input data has both general and "highlighted" components.
  • Holistic Evaluation: The emphasis on complexity-based evaluation and the SNE metric highlights the need for tailored evaluation metrics when generic ones fall short. Many fields might benefit from defining metrics that focus on "important" features rather than just overall similarity.
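The dihedral-angle idea mentioned under task-specific importance sampling can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: it assumes a consistently oriented triangle mesh, measures each edge by the angle between the normals of its two adjacent faces, and treats the threshold as degrees. The function name `sharp_edges` and its parameters are hypothetical.

```python
import numpy as np

def sharp_edges(vertices, faces, tau_deg=30.0):
    """Return edges whose two adjacent face normals differ by more than tau_deg.

    Edges are reported as (i, j) vertex-index pairs with i < j.
    Boundary and non-manifold edges (not exactly two faces) are skipped.
    """
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)

    # Unit normal of every triangle.
    normals = np.cross(v[f[:, 1]] - v[f[:, 0]], v[f[:, 2]] - v[f[:, 0]])
    normals = normals / np.linalg.norm(normals, axis=1, keepdims=True)

    # Map each undirected edge to the faces that contain it.
    edge_faces = {}
    for face_index, (a, b, c) in enumerate(f):
        for i, j in ((a, b), (b, c), (c, a)):
            key = (int(min(i, j)), int(max(i, j)))
            edge_faces.setdefault(key, []).append(face_index)

    sharp = []
    for edge, adjacent in edge_faces.items():
        if len(adjacent) != 2:
            continue
        cos_angle = np.clip(normals[adjacent[0]] @ normals[adjacent[1]], -1.0, 1.0)
        if np.degrees(np.arccos(cos_angle)) > tau_deg:
            sharp.append(edge)
    return sharp
```

Points would then be sampled preferentially along the returned edges; how many to draw and how to mix them with uniform samples is a separate design choice that this sketch does not cover.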

Potential Issues/Critique:

  • Generalizability of SES Parameters: The paper uses $N_{\text{desired}} = 16384$ and $\tau = 30$ for SES. While these work well, the optimal values might vary significantly across different datasets or types of 3D models. A more adaptive or data-driven approach to setting these parameters could further enhance robustness.

  • Complexity of Mesh Preprocessing: The method relies on watertight 3D models and mesh processing (e.g., Canny edge detection, dilation). For real-world, noisy 3D scan data or imperfect models, this preprocessing step might introduce its own challenges or limitations, potentially affecting the quality of SES.

  • Beyond Sharp Edges: While sharp edges are crucial, some fine details might not be characterized by sharp angles (e.g., intricate textures, subtle curvatures). The current SES might not capture these types of details. Future work could explore more generalized saliency detection mechanisms.

  • Latent Code Structure: While the latent code is compact, its interpretability or disentanglement could be explored. Can specific parts of the latent code be directly linked to sharp features or global structure? This could aid in more controllable 3D generation.

  • Long-Term Compression: The stated limitation of further reducing latent tokens is a significant challenge. Addressing this might require exploring fundamentally different latent space architectures or VAE objectives beyond standard KL divergence.

    Overall, Dora-VAE represents a solid advancement, offering a practical and effective solution for high-fidelity, compact 3D shape representation, directly impacting the quality of AI-powered 3D content creation.
