Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
TL;DR Summary
Ultra3D accelerates high-fidelity 3D generation by using `VecSet` for efficient coarse layout and a novel Part Attention mechanism, which localizes computation within semantic regions. This significantly boosts efficiency (up to 6.7x speed-up) while delivering state-of-the-art results at 1024 resolution.
Abstract
Recent advances in sparse voxel representations have significantly improved the quality of 3D content generation, enabling high-resolution modeling with fine-grained geometry. However, existing frameworks suffer from severe computational inefficiencies due to the quadratic complexity of attention mechanisms in their two-stage diffusion pipelines. In this work, we propose Ultra3D, an efficient 3D generation framework that significantly accelerates sparse voxel modeling without compromising quality. Our method leverages the compact VecSet representation to efficiently generate a coarse object layout in the first stage, reducing token count and accelerating voxel coordinate prediction. To refine per-voxel latent features in the second stage, we introduce Part Attention, a geometry-aware localized attention mechanism that restricts attention computation within semantically consistent part regions. This design preserves structural continuity while avoiding unnecessary global attention, achieving up to 6.7x speed-up in latent generation. To support this mechanism, we construct a scalable part annotation pipeline that converts raw meshes into part-labeled sparse voxels. Extensive experiments demonstrate that Ultra3D supports high-resolution 3D generation at 1024 resolution and achieves state-of-the-art performance in both visual fidelity and user preference.
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Ultra3D: Efficient and High-Fidelity 3D Generation with Part Attention
- Authors: Yiwen Chen, Zhihao Li, Yikai Wang, Hu Zhang, Qin Li, Chi Zhang, Guosheng Lin.
- Affiliations: Nanyang Technological University, Math Magic, Tsinghua University, Beijing Normal University, Westlake University. The authors are affiliated with prominent academic and industry research institutions, indicating a strong background in computer vision and machine learning.
- Journal/Conference: The paper is available on arXiv, which is a preprint server. This means it has not yet undergone formal peer review for a specific conference or journal at the time of this analysis, but it represents cutting-edge research being shared with the community. The citation format suggests a 2025 publication year, likely indicating a submission to a future conference.
- Publication Year: 2025 (as per citation format in the text).
- Abstract: The paper introduces Ultra3D, a framework for generating high-fidelity 3D content efficiently. It addresses the computational inefficiency of existing two-stage diffusion pipelines that use sparse voxel representations, which suffer from the quadratic complexity of attention mechanisms. Ultra3D improves efficiency in two ways. First, it uses the compact VecSet representation to quickly generate a coarse object layout. Second, it introduces Part Attention, a localized attention mechanism that confines computations within semantically consistent object parts during the refinement stage, achieving up to a 6.7x speed-up. To enable this, the authors developed a scalable pipeline to annotate raw meshes with part labels. Experiments show that Ultra3D can generate 3D models at 1024 resolution and achieves state-of-the-art results in visual quality and user preference.
- Original Source Link: https://arxiv.org/pdf/2507.17745
- PDF Link: http://arxiv.org/pdf/2507.17745v3
- Publication Status: Preprint on arXiv.
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: High-fidelity 3D content generation is computationally expensive. Recent state-of-the-art methods rely on sparse voxel representations, which capture fine-grained geometry but require a two-stage generation process using Diffusion Transformers (DiT). The attention mechanism within these transformers has quadratic complexity ($O(N^2)$, where $N$ is the number of voxels), making it a severe bottleneck, especially at high resolutions where the number of voxels can be in the tens of thousands.
- Importance & Gaps: As the demand for 3D assets in gaming, VR/AR, and digital media grows, efficient generation is crucial. Prior methods faced a difficult trade-off: either use low resolutions to maintain speed, sacrificing quality, or use high resolutions, incurring prohibitive computational costs and long generation times. There was a clear need for a method that could achieve both high fidelity and high efficiency.
- Fresh Angle: Ultra3D tackles this problem by re-architecting the two-stage pipeline. Instead of using a computationally heavy method for the initial coarse layout, it employs a highly efficient representation (VecSet). For the detail-refinement stage, it replaces the inefficient global attention with a novel, geometry-aware Part Attention mechanism that localizes computation, drastically reducing the overhead without compromising structural integrity.
- Main Contributions / Findings (What):
- A Hybrid Two-Stage Framework (Ultra3D): The paper proposes a novel pipeline that first uses the compact and efficient VecSet representation to generate a coarse 3D mesh, which is then voxelized. This is significantly faster than directly generating voxel coordinates with a DiT. The second stage then refines this structure by generating per-voxel latent features.
- Part Attention Mechanism: This is the core technical innovation. It is a localized attention mechanism tailored for sparse voxels. By grouping voxels into semantically consistent parts and restricting self-attention and cross-attention computations within these groups, it avoids unnecessary global computations. This achieves a speed-up of up to 6.7x in latent generation.
- Scalable Part Annotation Pipeline: To provide the part labels required by Part Attention, the authors developed an efficient pipeline that automatically converts large-scale raw 3D mesh datasets into sparse voxels with part annotations, taking only a few seconds per mesh.
- State-of-the-Art Performance: Ultra3D is shown to generate high-resolution (1024) 3D meshes that outperform previous methods in visual fidelity and user preference studies, all while being significantly faster (up to 3.3x overall speed-up).
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Sparse Voxel Representation: A method to represent a 3D object by only storing information for the "active" or occupied cells (voxels) in a 3D grid. Each active voxel is typically associated with a coordinate and a feature vector that describes local geometry and texture. This is more memory-efficient than a dense grid for representing surfaces.
- Diffusion Models: A class of generative models that learn to create data by reversing a gradual noising process. They start with random noise and iteratively denoise it, guided by a learned model, to produce a sample (e.g., an image or 3D model).
- Diffusion Transformer (DiT): A specific architecture for diffusion models that replaces the commonly used U-Net backbone with a Transformer. Transformers are powerful for modeling long-range dependencies through their self-attention mechanism, but this comes at a high computational cost.
- Attention Mechanism: The core component of Transformers. It allows the model to weigh the importance of different parts of the input data (tokens) when processing a specific token. Self-attention relates different positions of a single sequence, while cross-attention relates positions of two different sequences (e.g., a 3D model and a conditioning image). Its computational complexity is quadratic with respect to the sequence length.
- Vector Set (VecSet) Representation: A compact 3D representation where an object is encoded as an unordered set of latent vectors. Each vector captures local shape attributes. Because the set is small (a few thousand tokens), generating it with a diffusion model is very fast, but it struggles to capture fine-grained surface details compared to sparse voxels.
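To give a rough sense of why the sparse voxel form is more memory-efficient than a dense grid, the back-of-envelope sketch below compares the two storage schemes. The grid size, occupancy fraction, and feature width are illustrative assumptions, not numbers from the paper.

```python
# Back-of-envelope comparison of dense-grid vs. sparse-voxel storage.
# All sizes below are illustrative assumptions.
res, feat_dim = 128, 8     # 128^3 grid, 8-dim feature per voxel
occupancy = 0.02           # assume ~2% of cells lie on the object surface

# Dense grid: a feature vector for every cell, occupied or not.
dense_values = res ** 3 * feat_dim

# Sparse voxels: (x, y, z) coordinates plus a feature vector,
# stored only for the occupied cells.
n_active = int(res ** 3 * occupancy)
sparse_values = n_active * (3 + feat_dim)

print(f"dense:  {dense_values:,} stored values")
print(f"sparse: {sparse_values:,} stored values")
print(f"sparse is ~{dense_values / sparse_values:.0f}x smaller")
```

For surface-like objects the occupied fraction shrinks as resolution grows, so the gap widens further at the 1024 resolutions discussed later.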
- Previous Works & Technological Evolution: The paper categorizes related 3D generation methods into three main trends:
- Vector Set-Based Generation (3DShape2Vecset): These methods are very efficient due to the compact VecSet representation. They excel at generating the overall shape quickly but lack the precision for fine-grained surface geometry. Ultra3D strategically uses this approach for its efficient first stage.
- Sparse Voxel-Based Generation (Trellis): These methods achieve superior geometric fidelity by associating latent features with sparse voxels. Trellis introduced the two-stage pipeline (predict coordinates, then features) that Ultra3D builds upon. However, their reliance on full attention in DiTs makes them slow and resource-intensive, a key limitation Ultra3D aims to solve.
- Autoregressive Mesh Generation (MeshGPT): These methods generate a mesh vertex-by-vertex or face-by-face, similar to how language models generate text. They can produce artist-friendly topology but are also computationally expensive due to the long sequences of tokens they must process.
- Differentiation: Ultra3D differentiates itself from its direct predecessor, Trellis, in two critical ways:
- Stage 1 (Coarse Layout): Trellis compresses voxel coordinates into a dense feature grid and uses a DiT to generate it. This is still slow at high resolutions. Ultra3D replaces this with a much faster VecSet-based generator that produces a coarse mesh, which is then voxelized.
- Stage 2 (Detail Refinement): Trellis uses a DiT with standard (full) self-attention to generate per-voxel latent features, which is the main computational bottleneck. Ultra3D introduces Part Attention, which restricts attention within semantic parts, dramatically improving efficiency while maintaining quality by preserving geometric continuity.
4. Methodology (Core Technology & Implementation)
The core of Ultra3D is a two-stage pipeline optimized for both speed and quality, with Part Attention as its key innovation.
(Image 1: a schematic of Ultra3D's overall generation pipeline. An image condition is first passed through a DiT to generate a coarse 3D mesh; after part segmentation and sparse voxel annotation, a Part-DiT with localized Part Attention modules refines it, each module performing self-attention independently within each part plus cross-attention to the input image features, producing a high-fidelity refined mesh.)
As shown in Image 1, the pipeline starts with a condition (e.g., an image), generates a coarse mesh using a VecSet-based DiT, annotates this mesh with part labels, and then uses a Part-DiT with the novel Part Attention mechanism to generate the final, refined mesh.
4.1 Part Attention
The motivation for Part Attention is that for refining local surface details on an already structured coarse model, global attention across all voxels is redundant and inefficient. A localized approach is better, but simple spatial partitioning (like window attention) fails because it doesn't respect the semantic structure of 3D objects (see Image 3). Part Attention solves this by using semantic part information for grouping.
- Part Self Attention: During self-attention, a voxel token only attends to other voxel tokens within the same part group. This is implemented via an attention mask: letting $p_i$ denote the part index assigned to token $i$, the attention logit from token $i$ to token $j$ is set to $-\infty$ (so the attention weight $A_{ij}$ becomes zero) whenever $p_i \neq p_j$.
  - Explanation:
    - $A_{ij}$: The attention score from token $i$ to token $j$.
    - $p_i, p_j$: The part group indices assigned to tokens $i$ and $j$, respectively. By enforcing this, the computation is broken down into smaller, independent attention calculations for each part, reducing the complexity from $O(N^2)$ to approximately $O(N^2/P)$, where $P$ is the number of parts (assuming roughly balanced part sizes).
- Part Cross Attention: For image-to-3D tasks, cross-attention between 3D voxels and 2D image features is also costly. Part Cross Attention localizes this as well. Each 3D part group is projected onto the 2D image plane using camera parameters. A 3D voxel token is then only allowed to attend to the image patch tokens corresponding to its own part's projection: the cross-attention weight $A_{ij}$ is masked out unless $p_i \in S_j$.
  - Explanation:
    - $A_{ij}$: The attention score from 3D voxel token $i$ to 2D image token $j$.
    - $p_i$: The part index of voxel token $i$.
    - $S_j$: The set of part indices that are projected onto the location of image token $j$. This ensures that, for example, voxels corresponding to a character's "head" only attend to the "head" region in the input image, preserving semantic consistency and improving efficiency.
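The masking rule for Part Self Attention can be sketched in a few lines of NumPy. This is an illustrative reimplementation of the idea only (single head, no learned projections, hypothetical function name), not the paper's code:

```python
import numpy as np

def part_self_attention(x, parts):
    """Single-head attention where token i may only attend to tokens j
    with parts[i] == parts[j]. Illustrative: q = k = v = x.

    x:     (N, d) voxel token features
    parts: (N,)   integer part index per token
    """
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                # (N, N) attention logits
    mask = parts[:, None] != parts[None, :]      # True where parts differ
    scores = np.where(mask, -np.inf, scores)     # block cross-part attention
    scores -= scores.max(axis=1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)                     # exp(-inf) -> weight 0
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ x

# Toy example: 4 voxel tokens split into 2 parts.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
parts = np.array([0, 0, 1, 1])
out = part_self_attention(x, parts)
```

Because the mask makes the score matrix block-diagonal, running attention on each part separately gives the identical result, which is what reduces the cost from $O(N^2)$ toward $O(N^2/P)$ for $P$ balanced parts.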
4.2 ULTRA3D Pipeline
- Stage 1: VecSet-based Sparse Voxel Generation:
- Instead of generating a dense grid of voxel coordinates like Trellis, Ultra3D first uses a VecSet-based diffusion model to generate a coarse 512-resolution mesh from the input condition. This process is very fast, as VecSet uses only a few thousand tokens.
- This coarse mesh is then voxelized to produce the sparse voxel coordinates (e.g., at 64 or 128 resolution). The lower surface quality of the VecSet output is not an issue, as it is only used to define the overall structure.
- Stage 2: Sparse Latent Generation with Part-DiT:
- With the sparse voxel coordinates fixed from Stage 1, a second DiT model is used to generate the per-voxel latent features.
- This DiT architecture is modified to be efficient. It is composed of repeating blocks, where each block contains one full-attention layer followed by three Part Attention layers.
- To make even the full-attention layer efficient, it operates at a lower resolution. The sparse voxel features are downsampled, full attention is applied (allowing for global style communication between parts), and the result is upsampled and fused back. This is shown in the "Part-DiT" diagram in Image 1.
- Part labels for Part Attention are generated on-the-fly for the coarse mesh from Stage 1, using the annotation pipeline described next.
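The block layout above can be sketched as follows. Only the layer layout (one global layer at reduced resolution plus three Part Attention layers) comes from the text; the per-part mean pooling standing in for the paper's spatial downsampling, the residual fusion, and all function names are illustrative assumptions:

```python
import numpy as np

def attention(q, kv):
    """Plain single-head attention with q = queries, kv = keys = values."""
    d = q.shape[1]
    s = q @ kv.T / np.sqrt(d)
    s -= s.max(axis=1, keepdims=True)
    w = np.exp(s)
    w /= w.sum(axis=1, keepdims=True)
    return w @ kv

def part_attention(x, parts):
    """Run attention independently inside each part group."""
    out = np.empty_like(x)
    for p in np.unique(parts):
        idx = parts == p
        out[idx] = attention(x[idx], x[idx])
    return out

def part_dit_block(x, parts):
    """One Part-DiT-style block: a global layer at reduced resolution,
    then three Part Attention layers. Here the 'downsampling' pools each
    part to a single mean token (a stand-in for the paper's spatial
    downsampling of sparse voxel features)."""
    x = x.copy()
    part_ids = np.unique(parts)
    # Downsample, let the pooled tokens communicate globally, fuse back.
    pooled = np.stack([x[parts == p].mean(axis=0) for p in part_ids])
    mixed = attention(pooled, pooled)
    for i, p in enumerate(part_ids):
        x[parts == p] += mixed[i]
    # Three localized refinement layers (residual).
    for _ in range(3):
        x = x + part_attention(x, parts)
    return x

rng = np.random.default_rng(1)
tokens = rng.normal(size=(6, 4))
labels = np.array([0, 0, 0, 1, 1, 1])
refined = part_dit_block(tokens, labels)
```

The design point this sketch captures is the split of labor: the cheap low-resolution global pass keeps styles consistent across parts, while the expensive per-token refinement stays confined to each part.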
4.3 Sparse Voxel Part Annotation Pipeline
Since large-scale datasets with part annotations are rare, the authors created an automatic pipeline.
- Process: For a given raw mesh, the pipeline first samples a point cloud from its surface. This point cloud is fed into PartField, a pre-trained part segmentation model, which outputs a feature field. The original mesh is voxelized, and for each voxel, the features of the points inside it are averaged. Finally, Agglomerative Clustering is applied to these voxel features to segment the object into a fixed number of parts (empirically set to 8 for training).
- Quality Filtering: To ensure high-quality annotations, two filtering metrics are applied:
- Sum of Squared Ratios: Measures imbalance in part sizes. A high value (e.g., one part dominates the object) often indicates poor segmentation.
- Neighborhood Inconsistency: Measures the proportion of voxels whose neighbors belong to a different part. High inconsistency suggests fragmented, noisy segmentation. Samples exceeding predefined thresholds on these metrics are discarded. Image 6 shows that most samples have low (good) scores on both metrics, indicating the pipeline's reliability.
(Image 6: percentile curves for the two part-annotation filtering metrics. The left plot shows the sum of squared ratios, reflecting how balanced the part distribution is; the right plot shows neighborhood inconsistency, the proportion of neighboring voxels with differing labels. Both curves stay low and stable for most samples, indicating high annotation quality.)
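The two filtering metrics can be sketched as follows. The paper does not spell out the exact formulas, so the neighborhood convention used here (6-connectivity, "at least one differing occupied neighbor") is an assumption:

```python
import numpy as np

def sum_of_squared_ratios(labels):
    """Sum over parts of (part size / total)^2. Close to 1 when one part
    dominates; equals 1/k for k perfectly balanced parts."""
    _, counts = np.unique(labels, return_counts=True)
    ratios = counts / counts.sum()
    return float((ratios ** 2).sum())

def neighborhood_inconsistency(label_grid):
    """Fraction of occupied voxels that have at least one 6-connected
    occupied neighbor with a different part label. label_grid holds a
    part id per cell and -1 for empty cells."""
    occ = label_grid >= 0
    inconsistent = np.zeros_like(occ)
    for axis in range(3):
        for shift in (1, -1):
            nb = np.roll(label_grid, shift, axis=axis)
            nb_occ = np.roll(occ, shift, axis=axis).copy()
            # np.roll wraps around; invalidate the wrapped edge slice.
            edge = [slice(None)] * 3
            edge[axis] = slice(0, 1) if shift == 1 else slice(-1, None)
            nb_occ[tuple(edge)] = False
            inconsistent |= occ & nb_occ & (nb != label_grid)
    return float(inconsistent.sum() / occ.sum())
```

On a toy 2x2x2 grid split cleanly into two halves, every voxel sits on the part boundary, so the inconsistency is 1.0; a single-part grid scores 0.0. Real segmentations fall in between, and samples above the thresholds are dropped.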
5. Experimental Setup
- Datasets: The models were trained on a private large-scale 3D dataset. The part annotation pipeline described in Section 4.3 was used to process this dataset. Samples were filtered out if the sum of squared part ratios exceeded 25% or neighborhood inconsistency exceeded 25%.
- Evaluation Metrics:
- User Study: The primary method for evaluating visual quality. Participants were shown image-3D mesh pairs and asked to choose the best result based on fidelity and consistency with the input image. This subjective metric is crucial for generative tasks where objective scores may not capture aesthetic quality.
- Efficiency (Speed-up Rate): This metric measures the computational performance improvement.
  - Conceptual Definition: It is the ratio of the time taken by a baseline method (e.g., using full attention) to the time taken by the proposed method (Part Attention) to complete the same task (e.g., one inference step). A higher value indicates greater acceleration.
  - Formula: $\text{Speed-up} = T_{\text{baseline}} / T_{\text{ours}}$
  - Symbol Explanation:
    - $T_{\text{baseline}}$: Wall-clock time for the baseline model.
    - $T_{\text{ours}}$: Wall-clock time for the Ultra3D model.
- Baselines:
- Direct3D-S2: A concurrent state-of-the-art method for image-to-3D generation.
- Commercial Model A: An unnamed high-quality commercial 3D generation model.
- Trellis and Hi3DGen are also included in qualitative comparisons (Image 7).
- For ablation studies, Ultra3D with Part Attention was compared against two variants:
  - Ours-Full: A version where Part Attention is replaced with standard full attention.
  - Ours-Naive (3D Window Attention): A version where Part Attention is replaced with a naive spatial windowing attention.
6. Results & Analysis
- Core Results:
- Qualitative Comparison: As shown in Image 7, Ultra3D produces 3D models with significantly finer geometric details and higher fidelity compared to Trellis, Hi3DGen, Direct3D-S2, and Commercial Model A. The details highlighted in red boxes show its superior ability to capture complex surfaces like dragon scales, armor engravings, and animal fur.
  (Image 7: normal-map renderings comparing Ultra3D with Trellis, Hi3DGen, Direct3D-S2, and Commercial Model A on four input models; red-box close-ups show that Ultra3D is finer in surface detail and structural continuity and more consistent with the input image.)
- User Study: The user study results, transcribed from Table 1, confirm the qualitative findings. (Manual transcription of Table 1)

  (a) Comparison with Other Methods

  | Model | Direct3D-S2 | Commercial Model A | Ours |
  | --- | --- | --- | --- |
  | Select. | 7.2% | 24.3% | 68.5% |

  Ultra3D was overwhelmingly preferred by users (68.5%) over strong competitors, validating its state-of-the-art visual quality.
- Ablations / Parameter Sensitivity:
- Part Attention vs. Full Attention vs. 3D Window Attention: Image 3 visually demonstrates the importance of semantic partitioning. Full Attention produces good results but is slow. 3D Window Attention creates blocky artifacts and style inconsistencies because its fixed spatial partitions cut across meaningful object parts. Part Attention matches the quality of Full Attention while being efficient.
  (Image 3: 3D models generated under different attention mechanisms alongside a real reference photo. 3D Window Attention's fixed spatial partitioning misaligns with semantic boundaries and yields style inconsistencies, while Part Attention groups regions by semantically consistent parts, producing more coherent structure and higher-quality detail.)
- The user study (Tables 1b and 1c) quantifies this. (Manual transcription of Table 1)

  (b) Full Attention vs. Part Attention

  | Model | Ours-Full | Ours | No Pref. |
  | --- | --- | --- | --- |
  | Select. | 12.4% | 8.9% | 78.7% |

  (c) 3D Window vs. Part Attention

  | Model | Ours-Naive | Ours | No Pref. |
  | --- | --- | --- | --- |
  | Select. | 2.1% | 63.7% | 34.2% |

  Users found Part Attention and Full Attention to be of comparable quality (78.7% "No Preference"), confirming that Part Attention does not compromise fidelity. However, Part Attention was strongly preferred over 3D Window Attention (63.7% vs. 2.1%).
- Efficiency Gains: Table 2 shows the significant speed-ups from Part Attention. (Manual transcription of Table 2)

  | | Part Self Attention | Part Cross Attention | DiT Training | DiT Inference |
  | --- | --- | --- | --- | --- |
  | Speedup Rate | 6.7× | 4.1× | 3.1× | 3.3× |

  Part Self Attention alone is 6.7x faster than full self-attention. This leads to a 3.1x speed-up in overall DiT training and a 3.3x speed-up in inference, reducing generation time from over 15 minutes to just 4 minutes per sample.
- Impact of Resolution: Image 4 shows that both mesh resolution and the sparse voxel resolution used for attention computation are crucial for quality. Prior methods were forced to downsample voxels before attention to manage costs, resulting in loss of detail. Ultra3D's efficiency allows it to operate on higher-resolution sparse voxels (128), enabling superior final quality at a 1024 mesh resolution.
  (Image 4: normal-map quality under different mesh/sparse-voxel resolution configurations, 512-64 with and without downsampling versus 1024-128 with downsampling and with Ultra3D's method. Zoomed-in details show that supporting higher sparse voxel resolution yields higher-quality generation.)
- Robustness of Part Annotation: Image 5 demonstrates that the model is not sensitive to the exact number of part groups used at inference time. Despite being trained with exactly 8 parts, it generates high-quality results when given 4, 12, or 16 parts. This suggests that the model learns a general concept of local refinement rather than overfitting to a specific partition count.
  (Image 5: the effect of partitioning into 4, 8, 12, and 16 parts. The top row shows part-colored segmentations, with colors marking the different parts used by Part Attention; the bottom row shows the corresponding surface-normal renderings, whose structural continuity and detail quality remain stable across partition counts.)
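An idealized cost model makes the source of these speed-ups concrete: full attention scales with $N^2$ over all $N$ voxel tokens, while Part Attention sums $n_p^2$ over parts. The token count and the perfectly balanced 8-way split below are illustrative assumptions, not measurements from the paper:

```python
# Idealized attention-cost estimate (illustrative numbers only).
N = 40_000                       # hypothetical number of active voxel tokens
part_sizes = [N // 8] * 8        # 8 equal parts (the paper trains with 8)

full_cost = N ** 2                           # full attention ~ N^2
part_cost = sum(n ** 2 for n in part_sizes)  # part attention ~ sum n_p^2

print(f"ideal speed-up: {full_cost / part_cost:.1f}x")
```

With $P$ equally sized parts the ideal ratio is exactly $P$ (here 8x); the measured 6.7x for Part Self Attention sits below this ideal, which is consistent with unbalanced part sizes and per-part overheads.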
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully addresses the critical efficiency bottleneck in high-resolution 3D generation using sparse voxels. By introducing the Ultra3D framework, which combines the speed of VecSet for coarse modeling with a novel Part Attention mechanism for efficient detail refinement, the authors achieve a new state of the art. The framework delivers superior visual quality at high resolutions (1024) while being significantly faster than previous methods. The scalable part annotation pipeline also represents a valuable contribution for enabling such part-aware models.
Limitations & Future Work:
- Dependence on External Part Segmenter: The quality of
Part Attentionis directly tied to the quality of the part labels provided by thePartFieldmodel. Errors or inconsistencies from this external model could propagate and degrade generation quality. Future work could explore jointly learning part segmentation and generation, or using self-supervised methods to discover parts. - Fixed Number of Parts During Training: The model was trained with a fixed number of 8 parts for practical reasons. While it shows robustness at inference, an adaptive mechanism to determine the optimal number of parts per object could potentially yield even better performance and efficiency trade-offs.
- Two-Stage Disconnection: The two stages (coarse generation and refinement) are trained and executed separately. An end-to-end trainable model might lead to better overall optimization, though it would be more complex to design.
- Dependence on External Part Segmenter: The quality of
- Personal Insights & Critique: Ultra3D is an excellent example of thoughtful system design and engineering. Instead of pursuing a single, monolithic model, it cleverly combines the strengths of two different 3D representations (VecSet for speed, sparse voxels for detail) to optimize the overall pipeline. The core idea of Part Attention is both intuitive and highly effective. It acknowledges that not all information is equally relevant at all stages of generation and that leveraging semantic structure is a powerful way to reduce computational complexity. This principle is highly transferable to other domains dealing with structured data, not just 3D models. A minor critique is the reliance on a private dataset, which makes direct reproduction of the results by the broader research community challenging. However, the developed part annotation pipeline is a step toward mitigating this, as it could be applied to public datasets like Objaverse. Overall, Ultra3D represents a significant and practical step forward in making high-quality 3D content creation more accessible and scalable.