Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
TL;DR Summary
Direct3D-S2 employs Spatial Sparse Attention to efficiently generate gigascale 3D shapes from sparse volumetric data, and pairs it with a unified sparse VAE design that boosts training efficiency and stability while drastically reducing computational costs.
Abstract
Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at 1024³ resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at 256³ resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention
1.2. Authors
Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, Yao Yao. Affiliations include Nanjing University, DreamTech, Fudan University, and University of Oxford.
1.3. Journal/Conference
The paper is available as a preprint on arXiv (arXiv preprint arXiv:2505.17412). The abstract indicates it was "Published at (UTC): 2025-05-23T02:58:01.000Z", suggesting it might be accepted for a future conference or journal in 2025, but as of the provided information, it is an arXiv preprint. arXiv is a well-regarded open-access repository for preprints of scientific papers, particularly influential in fields like machine learning and computer vision for early dissemination of research.
1.4. Publication Year
2025 (based on the provided publication date).
1.5. Abstract
Generating high-resolution 3D shapes using volumetric representations like Signed Distance Functions (SDFs) faces significant computational and memory challenges. The paper introduces Direct3D-S2, a scalable 3D generation framework built upon sparse volumes, which achieves superior output quality with considerably reduced training costs. A central innovation is the Spatial Sparse Attention (SSA) mechanism, designed to boost the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA efficiently processes large token sets within sparse volumes, leading to a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. The framework also incorporates a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. This unified design, unlike previous methods with heterogeneous representations in 3D VAEs, substantially improves training efficiency and stability. Trained on publicly available datasets, Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency but also enables training at 1024³ resolution using just 8 GPUs, a task that typically demands at least 32 GPUs for volumetric representations at 256³ resolution. This advancement makes gigascale 3D generation both practical and accessible.
1.6. Original Source Link
https://arxiv.org/abs/2505.17412 PDF Link: https://arxiv.org/pdf/2505.17412v2.pdf
2. Executive Summary
2.1. Background & Motivation
The creation of high-quality 3D models directly from text or images holds immense potential for various applications, including virtual reality, gaming, product prototyping, and computer-aided design. However, generating these 3D shapes, especially at high resolutions, using volumetric representations such as Signed Distance Functions (SDFs), is plagued by substantial computational and memory demands.
Prior research in large-scale 3D generative models has explored two main avenues:

- Implicit Latent Representations: Methods like 3DShape2Vecset, CLAY, and TripoSG leverage neural fields and Variational Autoencoders (VAEs) to encode 3D shapes into compact latent codes. While beneficial for scalability, these often rely on VAEs with asymmetric 3D representations (e.g., converting point clouds to 1D vectors and then to SDF fields), leading to lower training efficiency and requiring vast computational resources (e.g., hundreds of GPUs).
- Explicit Latent Representations: Approaches such as Direct3D, XCube, and Trellis use more interpretable representations like tri-planes or sparse voxels. While offering simpler training and direct editing, these methods are often limited in output resolution due to high memory demands, and scaling to higher resolutions with sufficient latent tokens and valid voxels remains challenging. A primary bottleneck is the quadratic computational cost of full attention in Diffusion Transformers (DiT), which makes high-resolution training prohibitively expensive.

The core problem the paper aims to solve is the scalability barrier in high-resolution 3D shape generation, specifically the computational and memory challenges associated with volumetric representations and the inefficiency of attention mechanisms in DiT for large, sparse 3D data. The paper's entry point is to unify sparse volumetric representations across the entire generative pipeline and to dramatically improve the efficiency of attention mechanisms for these sparse data structures.
2.2. Main Contributions / Findings
The paper introduces Direct3D-S2, a novel framework that makes gigascale 3D generation practical and accessible. Its primary contributions and findings are:

- Spatial Sparse Attention (SSA) Mechanism: The paper's key innovation, SSA, is specifically designed to enhance the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. By intelligently processing large token sets within sparse volumes, SSA significantly reduces computational overhead, achieving a 3.9x speedup in the forward pass and an even more impressive 9.6x speedup in the backward pass compared to FlashAttention-2 with 128k tokens.
- Unified Sparse SDF VAE: Direct3D-S2 incorporates a Variational Autoencoder (VAE) that maintains a consistent sparse volumetric format across the input, latent, and output stages. This symmetric, unified design eliminates the cross-modality translations common in previous heterogeneous 3D VAEs, thereby substantially improving training efficiency, stability, and geometric fidelity.
- Scalability and Efficiency Breakthrough: The framework enables training at an unprecedented 1024³ resolution using only 8 GPUs. This is a significant leap, as previous state-of-the-art volumetric methods typically require at least 32 GPUs to train at the much lower 256³ resolution. This demonstrates Direct3D-S2's capability to make "gigascale" 3D generation both practical and accessible.
- Superior Generation Quality: Extensive experiments confirm that Direct3D-S2 not only offers superior efficiency but also surpasses state-of-the-art methods in generation quality, producing highly detailed 3D shapes. Quantitative evaluations using ULIP-2, Uni3D, and OpenShape metrics, as well as qualitative comparisons and user studies, affirm its leading performance.
- Sparse Conditioning Mechanism: A novel sparse conditioning mechanism selectively extracts and processes foreground tokens from input images, reducing computational overhead and improving alignment between generated meshes and conditional images by mitigating background noise.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Direct3D-S2, a reader should be familiar with the following core concepts:
- 3D Shape Generation: The overarching task of creating three-dimensional digital models. This can involve various input modalities (e.g., text, images) and output representations (e.g., meshes, point clouds, voxels).
- Volumetric Representations: A method of representing 3D objects as a grid of discrete elements (voxels) in 3D space. Each voxel can store information about the object at that location.
  - Signed Distance Functions (SDFs): A specific type of volumetric representation where each point in space is assigned a value representing its shortest distance to the surface of an object. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the object; points exactly on the surface have an SDF value of zero. SDFs are useful for tasks like shape reconstruction and boolean operations.
  - Watertight Meshes: A 3D mesh is "watertight" if it completely encloses a volume without any holes or gaps in its surface. This property is crucial for generating SDFs and for many 3D printing or simulation applications.
  - Sparse Volumes: In a dense volumetric representation, every voxel in a fixed grid stores data. In contrast, sparse volumes only store and process the "active" or non-empty voxels, significantly reducing memory and computational requirements when most of the space is empty (e.g., for a hollow object).
- Variational Autoencoder (VAE): A type of generative neural network that learns a compressed, continuous latent space representation of data.
  - Encoder: Maps input data (e.g., an image or 3D shape) to a distribution over the latent space (typically a Gaussian distribution, characterized by its mean and variance).
  - Decoder: Maps samples from the latent space back to the original data space, reconstructing the input.
  - Latent Space: A lower-dimensional representation of the input data, capturing its essential features. In a VAE, this space is designed to be continuous and easily traversable, allowing for novel data generation by sampling from it.
  - KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In VAEs, it is used to regularize the latent space by forcing the encoded distributions to be close to a standard normal distribution.
- Diffusion Models: A class of generative models that learn to reverse a diffusion process. They gradually add noise to data until it becomes pure noise, then learn to "denoise" it step-by-step to generate new data from noise.
- Transformers: A neural network architecture that revolutionized sequence modeling. Key components include:
- Attention Mechanism: A core component that allows the model to weigh the importance of different parts of the input sequence (or tokens) when processing a specific part.
- Self-Attention: A variant of attention where the model computes attention weights between different parts of the same input sequence, allowing it to capture internal dependencies.
- Cross-Attention: Used in conditional generation, where the model computes attention between a target sequence (e.g., noisy tokens) and a conditioning sequence (e.g., image features or text embeddings).
- Diffusion Transformer (DiT): A Transformer-based neural network architecture specifically adapted for diffusion models. Instead of U-Net architectures, DiTs use Transformers to predict the noise or velocity field in diffusion models, showing strong performance and scalability.
- Rectified Flow: A type of generative model that defines a straight-line trajectory between a data distribution and a simple prior distribution (like a standard normal distribution). The model learns to predict the velocity field along these straight paths, offering an alternative to traditional diffusion models with potentially faster sampling.
- GPU Kernel: A specialized program or function designed to run directly on the Graphics Processing Unit (GPU). GPU kernels are highly optimized for parallel computation and are crucial for accelerating computationally intensive tasks like neural network training.
- Triton: A domain-specific language (DSL) and compiler developed by OpenAI that allows researchers to write highly efficient GPU kernels for deep learning operations, often outperforming hand-tuned CUDA kernels. It abstracts away some of the complexities of CUDA programming, making it easier to optimize tensor operations (a minimal example follows below).
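To give a concrete feel for what a Triton kernel looks like, here is the standard vector-addition example in the style of the official Triton tutorials. It is unrelated to the paper's SSA kernels; it only illustrates the programming model (block-parallel programs with masked loads and stores) and requires a CUDA GPU to run.

```python
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    # Each program instance processes one BLOCK_SIZE-wide chunk of the vectors.
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements               # guard the final partial block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

def add(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = (triton.cdiv(n, 1024),)            # one program per 1024-element block
    add_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out

x = torch.randn(10000, device="cuda")
y = torch.randn(10000, device="cuda")
print(torch.allclose(add(x, y), x + y))       # True
```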
3.2. Previous Works
The paper contextualizes its contributions by discussing several categories of prior work:
- Multi-view Generation and 3D Reconstruction: These methods (e.g.,
[16, 22, 23, 42]) often start withmulti-view diffusion models([38]) trained on 2D image priors (likeStable Diffusion[33]) to generate consistent multi-view images of a 3D shape. These images are then used to reconstruct the 3D shape via generalizedsparse-view reconstruction models([21, 27, 36, 43, 48]).- Limitation: They struggle with
multi-view consistencyandshape quality, often producing artifacts. They also rely onrendering-based supervision(NeRF[28],DMTet[34]), which adds complexity and computational overhead.
- Limitation: They struggle with
- Large Scale 3D Latent Diffusion Model: Inspired by 2D
Latent Diffusion Models (LDMs)[33], these extendLDMsto 3D.- Implicit Vecset-based Methods: Examples include
3DShape2Vecset[47],Michelangelo[50],CLAY[49], andCraftsMan3D[17]. They represent 3D shapes using alatent vecset(a set of vectors) and reconstruct meshes vianeural SDFsoroccupancy fields.- Limitation: Constrained by
vecsetsize; largervecsetslead to more complex mappings and longer training times. They often use asymmetricVAEarchitectures (e.g., point cloud input to 1D vector latent toSDFoutput), which reduces efficiency.
- Limitation: Constrained by
- Voxel-based Methods: Examples include
XCube[32],Trellis[40], and[11, 45]. These usevoxel gridsas latent representations, offering better interpretability and simpler training.- Limitation: Face significant challenges in
latent resolutiondue to the cubic growth ofGPU memory requirementsand the high computational costs ofattention mechanismsinTransformers.XCube[32]can generate sparse volumes but is limited to millions of valid voxels, impacting final quality.Trellis[40]integrates sparse voxel representations with rendering supervision.
- Limitation: Face significant challenges in
- Implicit Vecset-based Methods: Examples include
- Efficient Large Tokens Generation: Addresses the challenge of efficiently processing a large number of tokens.
- Native Sparse Attention (NSA)
[46]: A technique that uses adaptive token compression, integrating compression, selection, and windowing to identify relevant tokens. It was designed for 1D sequences and applied to large language models ([31, 46]) and video generation ([35]).- Limitation:
NSAis not directly applicable to unstructured, sparse 3D data because its 1D block partitioning doesn't preserve 3D spatial coherence.
- Limitation:
- Linear Attention
[13]: Reduces attention complexity by approximating attention weights with linear functions ([41, 53, 26]).- Limitation: Can lead to a significant performance decline due to the absence of non-linear similarity.
- Native Sparse Attention (NSA)
3.3. Technological Evolution
The evolution of 3D generation has moved from early explicit mesh-based methods to implicit neural representations (like NeRFs and SDFs) due to their compactness and flexibility. More recently, the success of Latent Diffusion Models in 2D has inspired their adaptation to 3D. This has led to the development of 3D VAEs to compress 3D data into latent spaces, and Diffusion Transformers (DiT) to generate in these latent spaces.
However, scaling these models to high resolutions (e.g., 1024³) has remained a major hurdle. The challenge stems from the inherent cubic growth of volumetric data and the quadratic complexity of standard Transformer attention mechanisms with respect to sequence length (number of tokens). Early solutions tried to reduce the number of tokens through packing or coarse representations.
Direct3D-S2 fits into this evolution by pushing the boundaries of voxel-based latent diffusion. It addresses the attention bottleneck head-on by developing a specialized Spatial Sparse Attention (SSA) mechanism that can effectively operate on sparse 3D data, overcoming the limitations of 1D sparse attention methods. Furthermore, it tackles the VAE inefficiency by introducing a fully symmetric, sparse volumetric VAE that maintains consistency throughout the pipeline. This work represents a significant step towards practical, high-resolution, high-fidelity 3D content creation.
3.4. Differentiation Analysis
Compared to the main methods in related work, Direct3D-S2 introduces several core differences and innovations:

- Unified Sparse Volumetric VAE Design: Unlike previous 3D VAEs that often use heterogeneous or asymmetric representations (e.g., point cloud input, 1D vector latent, dense volume output, or reliance on differentiable rendering to bridge latent spaces), Direct3D-S2 employs a symmetric encoding-decoding network that consistently uses a sparse volumetric format across input, latent, and output stages. This unified approach eliminates costly cross-modality translations, leading to significantly improved training efficiency, stability, and geometric fidelity.
- Spatial Sparse Attention (SSA) for 3D Data: While Native Sparse Attention (NSA) inspired SSA, NSA was designed for 1D sequences and cannot effectively handle the spatial coherence required for unstructured, sparse 3D data. Direct3D-S2's SSA explicitly redesigns the block partitioning to preserve 3D spatial coherence and revises the compression, selection, and window modules to accommodate the irregular nature of sparse volumetric tokens. This adaptation is crucial for enabling efficient DiT computation on gigascale 3D data.
- Unprecedented Scalability and Resource Efficiency: Direct3D-S2 demonstrates the ability to train Diffusion Transformers at 1024³ resolution using only 8 GPUs, a dramatic improvement over existing volumetric methods that typically require 32 or more GPUs for merely 256³ resolution. This leap in efficiency makes high-resolution 3D generation practically achievable.
- Sparse Conditioning Mechanism: The sparse conditioning mechanism is a practical improvement over standard cross-attention in image-to-3D models. By selectively processing only foreground tokens from conditional images (instead of all pixel-level features), it reduces computational overhead and improves alignment by focusing on relevant visual information.
- Direct Volumetric Generation: Many multi-view generation methods rely on rendering-based supervision and subsequent 3D reconstruction, which can introduce artifacts and complexity. Direct3D-S2 directly generates 3D shapes in a volumetric SDF format, maintaining geometric precision through its SS-VAE and SSA-enhanced DiT.

In essence, Direct3D-S2 differentiates itself by pioneering an efficient and coherent sparse volumetric pipeline from VAE encoding to DiT generation, enabled by a 3D-aware sparse attention mechanism, leading to breakthroughs in resolution, quality, and computational accessibility.
4. Methodology
The Direct3D-S2 framework is designed for scalable and efficient high-resolution 3D shape generation, leveraging sparse volumetric representations. It consists of two main components: a Sparse SDF VAE (SS-VAE) for encoding and decoding 3D shapes into sparse latent representations, and a Diffusion Transformer (DiT) with a novel Spatial Sparse Attention (SSA) mechanism for generating these latent representations.
4.1. Principles
The core idea behind Direct3D-S2 is to overcome the computational and memory challenges of high-resolution 3D generation by focusing on sparse data structures and designing attention mechanisms tailored for them.
- Sparsity: Instead of processing dense volumetric grids, the framework operates exclusively on sparse voxels (only those near the object's surface or containing valid data). This drastically reduces the number of data points (tokens) to be processed.
- Unified Representation: A consistent sparse volumetric format is maintained throughout the entire VAE pipeline (input, latent, output). This symmetry simplifies the architecture and improves efficiency by avoiding heterogeneous data conversions.
- Spatial Awareness in Attention: The Transformer attention mechanism is adapted to be spatially aware for 3D data. Standard full attention is quadratically expensive, and existing sparse attention (e.g., NSA) is not designed for 3D spatial coherence. The SSA mechanism intelligently groups and selects tokens based on their 3D coordinates, allowing for efficient processing of large, sparse token sets.
- Progressive Training: A multi-resolution training strategy for the VAE and a progressive training strategy for the DiT accelerate convergence and enable training at very high resolutions.
4.2. Core Methodology In-depth (Layer by Layer)
The Direct3D-S2 framework is composed of an SS-VAE (upper half of Figure 2) and an SS-DiT with SSA and a sparse conditioning mechanism (lower half of Figure 2).
The following figure (Figure 6 from the original paper) provides a schematic diagram of the Direct3D-S2 framework, illustrating the overall structure and data flow of the SS-VAE and SS-DiT modules, including multi-resolution sparse SDF encoding and decoding, spatial sparse attention mechanism, and the final 3D mesh generation process.
4.2.1. Sparse SDF VAE (SS-VAE)
The SS-VAE is designed to encode high-resolution sparse SDF volumes into a compact sparse latent representation and then reconstruct them. This addresses the challenge of directly processing dense SDF volumes, which is computationally prohibitive.
Input Data Processing:
Given a mesh represented as an SDF volume at a given resolution (e.g., $1024^3$), the SS-VAE focuses on valid sparse voxels whose absolute SDF values fall below a predefined threshold $\tau$. This effectively limits processing to regions near the object's surface.
The input is formally defined as:
$
V = \left\{ \left(\mathbf{x}_i, s(\mathbf{x}_i)\right) \,\middle|\, |s(\mathbf{x}_i)| < \tau \right\}_{i=1}^{|V|}
$
Here, $s(\mathbf{x}_i)$ denotes the SDF value at the 3D position $\mathbf{x}_i$, and $|V|$ is the total number of valid sparse voxels.
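As a small illustration of how such a valid voxel set can be obtained from a dense SDF grid, here is a minimal PyTorch sketch; the grid size, the analytic sphere SDF, and the threshold value are toy assumptions, not the paper's data pipeline.

```python
import torch

def extract_sparse_voxels(sdf: torch.Tensor, tau: float = 0.03):
    """Collect voxels whose |SDF| is below tau from a dense grid.
    sdf: (N, N, N) tensor of signed distances.
    Returns integer voxel coordinates (M, 3) and their SDF values (M,)."""
    mask = sdf.abs() < tau                 # keep only the near-surface shell
    coords = mask.nonzero(as_tuple=False)  # (M, 3) voxel indices
    values = sdf[mask]                     # (M,) SDF values at those voxels
    return coords, values

# Toy usage: an analytic sphere SDF on a 64^3 grid.
N = 64
axes = [torch.linspace(-1, 1, N)] * 3
grid = torch.stack(torch.meshgrid(*axes, indexing="ij"), dim=-1)
sdf = grid.norm(dim=-1) - 0.5              # signed distance to a radius-0.5 sphere
coords, values = extract_sparse_voxels(sdf)
print(coords.shape, values.shape)          # only a thin shell of voxels is kept
```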
Symmetric Network Architecture:
The SS-VAE employs a symmetric encoder-decoder network architecture.

- Encoder:
  - Local Feature Extraction: The encoder first extracts local geometric features using a series of residual sparse 3D CNN blocks.
  - Downsampling: These CNN blocks are interleaved with 3D mean pooling operations, which progressively downsample the spatial resolution.
  - Transformer Processing: The sparse voxels, now treated as variable-length tokens, are then processed by shifted window attention mechanisms. This step captures local contextual information among the valid voxels.
  - Positional Encoding: Inspired by Trellis [40], the feature of each valid voxel is augmented with positional encoding based on its 3D coordinates before being fed into the 3D shifted window attention layers. This helps the model understand the spatial arrangement of tokens.
  - Output: This hybrid design outputs a sparse latent representation at a reduced resolution (the input resolution divided by the downsampling factor).
- Decoder: The decoder mirrors the encoder's structure, using attention layers and sparse 3D CNN blocks to progressively upsample the latent representation and reconstruct the SDF volume (a simplified dense sketch of one encoder stage is shown below).
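The following is a heavily simplified, dense stand-in for one encoder stage, meant only to illustrate the "CNN blocks interleaved with mean pooling" pattern; the actual SS-VAE uses sparse 3D convolutions and shifted-window attention over the active voxels, which are omitted here.

```python
import torch
import torch.nn as nn

class ToyEncoderStage(nn.Module):
    """Dense stand-in for one SS-VAE encoder stage: conv blocks + mean-pool
    downsampling. The real model operates on sparse voxels; this toy is dense."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(c_in, c_out, 3, padding=1), nn.SiLU(),
            nn.Conv3d(c_out, c_out, 3, padding=1), nn.SiLU())
        self.down = nn.AvgPool3d(2)          # 3D mean pooling halves the resolution

    def forward(self, x):                    # x: (B, C, D, H, W)
        return self.down(self.conv(x))

feat = torch.randn(1, 1, 64, 64, 64)         # toy dense SDF grid
print(ToyEncoderStage(1, 32)(feat).shape)    # (1, 32, 32, 32, 32)
```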
Training Losses:
The training objective for the SS-VAE aims to ensure accurate reconstruction and a well-behaved latent space.
The decoded sparse voxels comprise the original input voxels $\tilde{V}_{\mathrm{in}}$ and any additional valid voxels $\tilde{V}_{\mathrm{ext}}$ that the decoder may generate. Supervision is enforced on the SDF values at all of these spatial positions. To enhance geometric fidelity, additional supervision is applied to active voxels near sharp edges, identified as regions with high-curvature variation on the mesh surface (denoted $\tilde{V}_{\mathrm{sharp}}$). A KL divergence regularization term is also imposed on the latent representation to constrain excessive variations in the latent space, encouraging it to follow a simple prior distribution (e.g., a standard normal).

The reconstruction loss for each category of voxels is formulated as:
$
\mathcal{L}_c = \frac{1}{|\tilde{V}_c|} \sum_{(\mathbf{x}, \tilde{s}(\mathbf{x})) \in \tilde{V}_c} \left\| s(\mathbf{x}) - \tilde{s}(\mathbf{x}) \right\|_2^2, \quad c \in \{\mathrm{in}, \mathrm{ext}, \mathrm{sharp}\}
$
Where:
- $\mathcal{L}_c$: The L2 reconstruction loss for voxel set $\tilde{V}_c$.
- $|\tilde{V}_c|$: The number of voxels in the set $\tilde{V}_c$.
- $(\mathbf{x}, \tilde{s}(\mathbf{x}))$: A voxel at position $\mathbf{x}$ with its reconstructed SDF value $\tilde{s}(\mathbf{x})$.
- $s(\mathbf{x})$: The ground-truth SDF value at position $\mathbf{x}$.
- $\tilde{V}_{\mathrm{in}}$: The set of original input voxels.
- $\tilde{V}_{\mathrm{ext}}$: The set of extra valid voxels generated by the decoder.
- $\tilde{V}_{\mathrm{sharp}}$: The set of active voxels near sharp edges of the mesh.

The overall training objective for the SS-VAE is:
$
\mathcal{L}_{\mathrm{total}} = \sum_c \lambda_c \mathcal{L}_c + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}}
$
Where:
- $\mathcal{L}_{\mathrm{total}}$: The total training loss for the SS-VAE.
- $\lambda_c$: Weighting coefficients for the respective reconstruction loss terms.
- $\lambda_{\mathrm{KL}}$: Weighting coefficient for the KL divergence regularization term.
- $\mathcal{L}_{\mathrm{KL}}$: The KL divergence loss, which regularizes the latent space.

Multi-resolution Training: To enhance training efficiency and allow the SS-VAE to handle meshes across varying resolutions, a multi-resolution training paradigm is used. During each training iteration, a target resolution is randomly sampled from a candidate set of resolutions, and the input SDF volume is trilinearly interpolated to the selected resolution before being fed into the SS-VAE. Trilinear interpolation estimates the value of a function at an intermediate point within a 3D grid, given its values at the surrounding grid points. A code sketch of the loss computation follows below.
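Below is a minimal sketch of the weighted SDF reconstruction loss plus KL regularization described above; the per-category weights, tensor shapes, and the way voxel categories are passed in are illustrative assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def ss_vae_loss(pred_sdf, gt_sdf, category, mu, logvar, lambda_kl=1e-3):
    """Weighted SDF reconstruction loss plus KL regularization.

    pred_sdf, gt_sdf: (M,) predicted / ground-truth SDF values at decoded voxels.
    category: length-M list with entries in {"in", "ext", "sharp"}.
    mu, logvar: parameters of the encoder's latent Gaussian.
    The per-category weights and lambda_kl below are illustrative, not the paper's values.
    """
    weights = {"in": 1.0, "ext": 1.0, "sharp": 2.0}
    cat_idx = torch.tensor([list(weights).index(c) for c in category])
    loss = 0.0
    for idx, w in enumerate(weights.values()):
        mask = cat_idx == idx
        if mask.any():
            loss = loss + w * F.mse_loss(pred_sdf[mask], gt_sdf[mask])
    # Standard VAE KL term against a unit Gaussian prior.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return loss + lambda_kl * kl
```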
4.2.2. Spatial Sparse Attention and DiT
After the SS-VAE encodes 3D shapes into sparse latent representations, a rectified flow transformer-based 3D shape generator (SS-DiT) is trained on these latents, conditioned on input images. The SS-DiT incorporates the novel Spatial Sparse Attention (SSA) mechanism and a sparse conditioning mechanism.
The following figure (Figure 7 from the original paper) is a schematic from the Direct3D-S2 paper showing the three branches of the Spatial Sparse Attention mechanism: Sparse 3D Compression, Spatial Blockwise Selection, and Sparse 3D Window, and how they produce Compressed Attention, Selected Attention, and Window Attention, which are then combined into a gated output.
Standard Full Attention (Context):
For context, the standard full attention mechanism, given query, key, and value tokens $\mathbf{q}, \mathbf{k}, \mathbf{v} \in \mathbb{R}^{N \times d}$, where $N$ is the token length and $d$ is the head dimension, is formulated as:
$
\mathbf{o}_t = \mathrm{Attn}(\mathbf{q}_t, \mathbf{k}, \mathbf{v}) = \sum_{i=1}^{N} \frac{\mathbf{p}_{t,i}\,\mathbf{v}_i}{\sum_{j=1}^{N}\mathbf{p}_{t,j}}
$
And the unnormalized attention weights are computed as:
$
\mathbf{p}_{t,j} = \exp\left(\frac{\mathbf{q}_t^{\intercal}\mathbf{k}_j}{\sqrt{d}}\right)
$
Where:
- $\mathbf{q}_t$: The query vector for the $t$-th token.
- $\mathbf{k}$: The matrix of key vectors.
- $\mathbf{v}$: The matrix of value vectors.
- $\mathbf{o}_t$: The output vector for the $t$-th token.
- $N$: The total number of tokens.
- $d$: The dimension of the attention head.
- $\mathbf{q}_t^{\intercal}\mathbf{k}_j$: The dot-product similarity between the query $\mathbf{q}_t$ and the key $\mathbf{k}_j$.
- $\sqrt{d}$: A scaling factor that prevents large dot products from pushing the softmax function into regions with tiny gradients.

This full attention has a quadratic computational cost with respect to $N$, which becomes prohibitive for the very large token counts (on the order of 100k or more) that arise at the highest resolutions.
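For reference, a naive dense implementation of the formula above looks like the following sketch; the token count and head dimension are toy values, and the point is that the $(N, N)$ score matrix is what makes full attention quadratic in the number of tokens.

```python
import torch

def full_attention(q, k, v):
    """Naive scaled dot-product attention over all N tokens."""
    d = q.shape[-1]
    scores = q @ k.transpose(-2, -1) / d ** 0.5   # (N, N) — quadratic memory and compute
    weights = scores.softmax(dim=-1)
    return weights @ v                             # (N, d)

# With ~100k sparse-voxel tokens the (N, N) matrix alone would hold ~1e10 entries,
# which is why the paper replaces this with Spatial Sparse Attention.
q = k = v = torch.randn(4096, 64)                  # small toy size
print(full_attention(q, k, v).shape)
```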
Spatial Sparse Attention (SSA):
SSA is proposed to overcome the computational inefficiency of full attention when dealing with sparse volumetric data and large token sets. It addresses the limitations of Native Sparse Attention (NSA) [46], which treats tokens as a 1D sequence, leading to issues with 3D spatial coherence and unstable training for sparse 3D data.
SSA partitions key and value tokens into spatially coherent blocks based on their 3D coordinates: the 3D space is divided into fixed-size subgrids, and the active tokens within the same subgrid are grouped into one block.

The SSA mechanism comprises three core modules: sparse 3D compression, spatial blockwise selection, and sparse 3D window. The attention computation combines these modules through gated aggregation:
$
\mathbf{o}_t = \omega_t^{\mathrm{cmp}}\,\mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{cmp}}, \mathbf{v}_t^{\mathrm{cmp}}) + \omega_t^{\mathrm{slc}}\,\mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{slc}}, \mathbf{v}_t^{\mathrm{slc}}) + \omega_t^{\mathrm{win}}\,\mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{win}}, \mathbf{v}_t^{\mathrm{win}})
$
Where:
- $\mathbf{o}_t$: The final output vector for query $\mathbf{q}_t$.
- $\mathrm{Attn}(\cdot)$: The standard attention function, applied only to the selected key-value pairs.
- $\mathbf{k}_t^{\mathrm{cmp}}, \mathbf{v}_t^{\mathrm{cmp}}$: Key and value tokens produced by the sparse 3D compression module for query $\mathbf{q}_t$.
- $\mathbf{k}_t^{\mathrm{slc}}, \mathbf{v}_t^{\mathrm{slc}}$: Key and value tokens selected by the spatial blockwise selection module for query $\mathbf{q}_t$.
- $\mathbf{k}_t^{\mathrm{win}}, \mathbf{v}_t^{\mathrm{win}}$: Key and value tokens gathered by the sparse 3D window module for query $\mathbf{q}_t$.
- $\omega_t^{\mathrm{cmp}}, \omega_t^{\mathrm{slc}}, \omega_t^{\mathrm{win}}$: Gating scores for each module, obtained by applying a linear layer followed by a sigmoid activation to the input features; these scores dynamically weigh the contributions of each attention branch.

Let's break down each module (a minimal code sketch of the gated combination appears after the module descriptions):
-
Sparse 3D Compression: This module extracts block-level representations of the input tokens.
  - Intra-block Positional Encoding: Each token within a compression block is augmented with an intra-block positional encoding.
  - Compression: Sparse 3D convolution followed by sparse 3D mean pooling is applied to compress the entire block. The block-level key token is computed as $ \mathbf{k}_t^{\mathrm{cmp}} = \delta(\mathbf{k}_t + \mathrm{PE}(\mathbf{k}_t)) $, where $\mathbf{k}_t^{\mathrm{cmp}}$ is the block-level key token for query $\mathbf{q}_t$, $\mathbf{k}_t$ is the original key token (or an aggregated representation), $\mathrm{PE}(\cdot)$ is the absolute positional encoding function, $\delta(\cdot)$ denotes the combined sparse 3D convolution and sparse 3D mean pooling operations, and the compression block size determines how many tokens are merged into one block-level token. This module captures coarse-grained global information while reducing the number of tokens.
-
Spatial Blockwise Selection: This module retains token-level features for fine details, leveraging the sparse 3D compression module to determine which blocks are most relevant.
  - Attention Score Computation: It computes attention scores between the query and each compression block.
  - Block Selection: All tokens within the top-k blocks exhibiting the highest scores are selected.
  - Resolution Constraint: The resolution of the selection blocks must be greater than, and divisible by, the resolution of the compression blocks.
  - Relevance Score Aggregation: The relevance score for a selection block is aggregated from its constituent compression blocks. GQA (Grouped-Query Attention) [4] is used for a further efficiency improvement, where the attention scores of the shared query heads within each group are accumulated: $ \mathbf{s}_t^{\mathrm{slc}} = \sum_{i \in \mathcal{B}_{\mathrm{cmp}}} \sum_{h=1}^{h_s} s_{t,h}^{\mathrm{cmp},i} $, where $\mathbf{s}_t^{\mathrm{slc}}$ is the aggregated relevance score of a selection block for query $\mathbf{q}_t$, $\mathcal{B}_{\mathrm{cmp}}$ is the set of compression blocks within the selection block, $h_s$ is the number of shared heads within a group in GQA, and $s_{t,h}^{\mathrm{cmp},i}$ is the attention score for query $t$, head $h$, and compression block $i$. The top-k selection blocks with the highest scores are chosen, and all tokens within them form $\mathbf{k}_t^{\mathrm{slc}}$ and $\mathbf{v}_t^{\mathrm{slc}}$ for attention computation.

Triton Kernel Implementation: The spatial blockwise selection attention kernel is implemented using Triton [37] to address two challenges arising from sparse 3D voxel structures:
  - The number of tokens varies across blocks.
  - Tokens within the same block may not be contiguous in High Bandwidth Memory (HBM).
To handle this, input tokens are sorted by block indices, and the starting index of each block is passed as a kernel input; the inner loop then uses these indices to dynamically load the corresponding block tokens. The complete procedure of the forward pass is formalized in Algorithm 1.
Algorithm 1: Spatial Blockwise Selection Attention Forward Pass
- Require: q ∈ ℝ^(N × (h_kv × h_s) × d), k ∈ ℝ^(N × h_kv × d), v ∈ ℝ^(N × h_kv × d); number of key/value heads h_kv; number of shared heads h_s; number of selected blocks T; indices of the selected blocks I ∈ ℝ^(N × h_kv × T); number of divided key/value blocks N_b; block start offsets C ∈ ℝ^(N_b + 1); block size B_k.
- Output: o ∈ ℝ^(N × (h_kv × h_s) × d), l ∈ ℝ^(N × (h_kv × h_s)).
- 1: Divide the output o into (N, h_kv) blocks, each of size h_s × d.
- 2: Divide the logsumexp l into (N, h_kv) blocks, each of size h_s.
- 3: Sort all tokens within q, k and v according to their respective block indices.
- 4: for t = 1 to N do
- 5: for h = 1 to h_kv do
- 6: Initialize o_(t,h) = (0)_(h_s × d) ∈ ℝ^(h_s × d), logsumexp l_(t,h) = (0)_(h_s) ∈ ℝ^(h_s), and m_(t,h) = (-inf)_(h_s) ∈ ℝ^(h_s).
- 7: Load q_(t,h) ∈ ℝ^(h_s × d) and I_(t,h) ∈ ℝ^T from HBM to on-chip SRAM.
- 8: for j = 1 to T do
- 9: Load the starting token index b_s = C[I_(t,h)^(j)] and the ending token index b_e = C[I_(t,h)^(j) + 1] - 1 of the I_(t,h)^(j)-th block from HBM to on-chip SRAM.
- 10: for i = b_s to b_e step B_k do
- 11: Load k_i and v_i ∈ ℝ^(B_k × d) from HBM to on-chip SRAM.
- 12: Compute s_(t,h)^(i) = q_(t,h) k_i^T ∈ ℝ^(h_s × B_k).
- 13: Compute m_(t,h)^(i) = max(m_(t,h), rowmax(s_(t,h)^(i))) ∈ ℝ^(h_s).
- 14: Compute p_(t,h)^(i) = e^(s_(t,h)^(i) - m_(t,h)^(i)) ∈ ℝ^(h_s × B_k).
- 15: Update o_(t,h) = e^(m_(t,h) - m_(t,h)^(i)) o_(t,h) + p_(t,h)^(i) v_i.
- 16: Update l_(t,h) = m_(t,h)^(i) + log(e^(l_(t,h) - m_(t,h)^(i)) + rowsum(p_(t,h)^(i))), and set m_(t,h) = m_(t,h)^(i).
- 17: end for
- 18: end for
- 19: Compute o_(t,h) = e^(m_(t,h) - l_(t,h)) o_(t,h).
- 20: Write o_(t,h) and l_(t,h) to HBM as the (t,h)-th block of o and l, respectively.
- 21: end for
- 22: end for
- 23: Return the output o and the logsumexp l.

Where:
- q: Query tokens.
- k: Key tokens.
- v: Value tokens.
- N: Total number of tokens.
- h_kv: Number of key/value heads.
- h_s: Number of shared heads within a group.
- d: Head dimension.
- T: Number of selected blocks.
- I: Indices of the selected blocks for each token and head.
- N_b: Number of divided key/value blocks.
- C: An array storing, for each block, the starting index into the sorted tokens; C[j] gives the starting index of block j.
- B_k: Block size for loading k_i and v_i within a selected block.
- o: Output tokens.
- l: logsumexp values kept for numerical stability (used in softmax normalization).
- o_(t,h): Output for query t, head h.
- l_(t,h): logsumexp for query t, head h.
- m_(t,h): Running maximum for query t, head h (the standard softmax stabilization trick).
- q_(t,h): Query tokens for query t, head h.
- I_(t,h): Selected block indices for query t, head h.
- b_s, b_e: Starting and ending token indices of a selected block.
- k_i, v_i: Key and value tokens loaded in chunks of size B_k.
- s_(t,h)^(i): Attention scores (dot products) between q_(t,h) and k_i.
- m_(t,h)^(i): Updated maximum for softmax stabilization.
- p_(t,h)^(i): Exponentiated attention scores (pre-normalization).
- rowmax(·): Row-wise maximum.
- rowsum(·): Row-wise sum.

The algorithm efficiently computes attention by iterating over the selected blocks and processing their tokens in small chunks, accumulating partial results while maintaining numerical stability.
-
Sparse 3D Window: This auxiliary module explicitly incorporates localized feature interactions.
  - Window Partitioning: The input tokens are partitioned into fixed-size windowed regions.
  - Contextual Computation: For each token, the active tokens within its corresponding window are dynamically aggregated to form $\mathbf{k}_t^{\mathrm{win}}$ and $\mathbf{v}_t^{\mathrm{win}}$.
  - Localized Self-Attention: Self-attention is then calculated exclusively over this constructed token subset, ensuring local detail preservation.
4.2.3. Sparse Conditioning Mechanism
Existing image-to-3D models ([17, 39, 49]) typically use DINO-v2 [29] to extract pixel-level features from conditional images and then perform cross-attention. However, background regions (often more than half the image) introduce unnecessary computational overhead and can degrade alignment.
The sparse conditioning mechanism addresses this by selectively extracting and processing sparse foreground tokens from input images for cross-attention.
Given an input image $\mathcal{T}$, the sparse conditioning tokens $\mathbf{c}$ are computed as follows:
$
\mathbf{c} = \mathrm{Linear}(f(E_{\mathrm{DINO}}(\mathcal{T}))) + \mathrm{PE}(f(E_{\mathrm{DINO}}(\mathcal{T})))
$
Where:
- $\mathbf{c}$: The finalized sparse conditioning tokens.
- $\mathcal{T}$: The input image.
- $E_{\mathrm{DINO}}$: The DINO-v2 encoder [29], which extracts rich visual features.
- $f(\cdot)$: An operation that extracts foreground tokens based on a mask segmenting the foreground object.
- $\mathrm{PE}(\cdot)$: Absolute positional encoding, which adds spatial information to the extracted features.
- $\mathrm{Linear}(\cdot)$: A linear layer that projects the features into the appropriate dimension.

These sparse conditioning tokens are then used to perform cross-attention with the noisy latent tokens in the DiT; a minimal sketch of the foreground-token selection is shown below.
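A minimal sketch of the foreground-token selection might look as follows; the per-patch mask source, the threshold, and the tensor shapes are assumptions for illustration, not the paper's exact pipeline.

```python
import torch

def sparse_condition_tokens(dino_features, fg_mask, proj, pos_embed, thresh=0.5):
    """Keep only foreground patch tokens for cross-attention conditioning.

    dino_features: (P, D) patch features from a DINO-v2-like encoder.
    fg_mask: (P,) per-patch foreground probability (e.g., from an alpha matte).
    proj: torch.nn.Linear projecting D -> model dim; pos_embed: (P, model_dim).
    """
    keep = fg_mask > thresh                          # drop background patches
    return proj(dino_features[keep]) + pos_embed[keep]   # (P_fg, model_dim)

# Toy usage with random tensors standing in for real encoder outputs.
P, D, M = 1024, 768, 512
proj = torch.nn.Linear(D, M)
tokens = sparse_condition_tokens(torch.randn(P, D), torch.rand(P), proj, torch.randn(P, M))
print(tokens.shape)   # only foreground tokens are passed to cross-attention
```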
4.2.4. Rectified Flow
The SS-DiT is trained using the rectified flow objective [10, 19]. Rectified flow defines a forward process as a linear trajectory between the data distribution and a standard normal distribution:
$
\mathbf{x}(t) = (1 - t)\mathbf{x}_0 + t\epsilon
$
Where:
- $\mathbf{x}(t)$: The noisy sample at timestep $t$.
- $\mathbf{x}_0$: The original data sample (the latent representation from the SS-VAE).
- $\epsilon$: A standard Gaussian noise vector.
- $t$: The timestep, ranging from 0 to 1.

The generative model is trained to predict the velocity field that pushes noisy samples back towards the data distribution. The training loss uses conditional flow matching, formulated as:
$
\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left\| \mathbf{v}_\theta(\mathbf{x}_t, \mathbf{c}, t) - (\epsilon - \mathbf{x}_0) \right\|_2^2
$
Where:
- $\mathcal{L}_{\mathrm{CFM}}$: The conditional flow matching loss.
- $\mathbb{E}_{t, \mathbf{x}_0, \epsilon}$: Expectation over timesteps $t$, data samples $\mathbf{x}_0$, and noise $\epsilon$.
- $\mathbf{v}_\theta(\mathbf{x}_t, \mathbf{c}, t)$: The neural network (the SS-DiT) parameterized by $\theta$, which predicts the velocity field at state $\mathbf{x}_t$, conditioned on the sparse conditioning tokens $\mathbf{c}$, at timestep $t$.
- $\epsilon - \mathbf{x}_0$: The target velocity field (the direction and magnitude needed to move from the data to the noise). The objective is to minimize the L2 distance between the predicted velocity field and the true velocity field.
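A compact sketch of one conditional flow matching training step under the formulation above; the `model(xt, cond, t)` call signature is a placeholder for the SS-DiT, not its real interface.

```python
import torch

def rectified_flow_loss(model, x0, cond):
    """Noise the latents along a straight path and regress the model's
    velocity prediction onto (eps - x0)."""
    t = torch.rand(x0.shape[0], *([1] * (x0.dim() - 1)))   # per-sample timestep in [0, 1)
    eps = torch.randn_like(x0)
    xt = (1 - t) * x0 + t * eps                             # linear interpolation to noise
    v_pred = model(xt, cond, t)
    return ((v_pred - (eps - x0)) ** 2).mean()
```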
5. Experimental Setup
5.1. Datasets
Direct3D-S2 is trained on a combination of publicly available 3D datasets, which undergo a rigorous curation process.
- Source Datasets:
  - Objaverse [9]: A large-scale collection of annotated 3D objects.
  - Objaverse-XL [8]: An even larger universe of over 10 million 3D objects.
  - ShapeNet [5]: An information-rich 3D model repository.
- Data Curation: Due to the presence of low-quality meshes in these raw datasets, the authors curated approximately 452k high-quality 3D assets through rigorous filtering.
- 3D Representation Preparation: Following established geometry processing approaches [49], the original non-watertight meshes are converted into watertight meshes. Ground-truth SDF volumes are then computed from these watertight meshes, serving as both input and supervision for the SS-VAE.
- Image Conditioning Data: For training the image-conditioned DiT, 45 RGB images are rendered per mesh at a fixed resolution, using random camera parameters (elevation angle, azimuth angle, and focal length sampled within predefined ranges) to generate diverse views.
- Benchmark for Evaluation: To rigorously evaluate the geometric fidelity of meshes generated by Direct3D-S2, a challenging benchmark is established using highly detailed images sourced from professional communities such as Neural4D [3], Meshy [2], and CivitAI [1].
5.2. Evaluation Metrics
The geometric fidelity and alignment between generated meshes and conditional input images are quantitatively assessed using three multimodal metrics: ULIP-2, Uni3D, and OpenShape. These metrics are designed to measure the similarity between 3D shapes and their corresponding 2D images or textual descriptions in a shared embedding space. A higher value for these metrics indicates better performance (closer alignment).
5.2.1. ULIP-2
- Conceptual Definition: ULIP-2 (Unified Language-Image Pre-training 2) is a framework for scalable multimodal pre-training for 3D understanding. It learns a joint embedding space for 3D shapes, images, and text. The metric typically measures the cosine similarity between the embeddings of a generated 3D mesh and its corresponding input image (or text description) within this unified space. A higher cosine similarity implies better alignment between the generated shape and the conditioning input.
- Mathematical Formula: Let $E_{3D}(\text{mesh})$ be the embedding of a 3D mesh and $E_{IMG}(\text{image})$ be the embedding of an image, both projected into the ULIP-2 shared multimodal embedding space. The ULIP-2 score for a given mesh-image pair is typically the cosine similarity:
$
\text{ULIP-2 Score} = \frac{E_{3D}(\text{mesh}) \cdot E_{IMG}(\text{image})}{\|E_{3D}(\text{mesh})\| \, \|E_{IMG}(\text{image})\|}
$
- Symbol Explanation:
  - $E_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the ULIP-2 embedding space.
  - $E_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the ULIP-2 embedding space.
  - $\cdot$: Dot product operation.
  - $\|\cdot\|$: L2 norm (magnitude) of a vector.
5.2.2. Uni3D
- Conceptual Definition: Uni3D aims to explore unified 3D representation at scale by pre-training a model to understand 3D data across various modalities. Similar to ULIP-2, it provides a joint embedding space where 3D shapes can be compared with images or text. When used for evaluation, the Uni3D metric quantifies the semantic and geometric consistency between a generated 3D object and its input image by measuring similarity in this learned multimodal embedding space.
- Mathematical Formula: Let $U_{3D}(\text{mesh})$ be the Uni3D embedding for a 3D mesh and $U_{IMG}(\text{image})$ be the Uni3D embedding for an image. The Uni3D score is typically the cosine similarity:
$
\text{Uni3D Score} = \frac{U_{3D}(\text{mesh}) \cdot U_{IMG}(\text{image})}{\|U_{3D}(\text{mesh})\| \, \|U_{IMG}(\text{image})\|}
$
- Symbol Explanation:
  - $U_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the Uni3D embedding space.
  - $U_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the Uni3D embedding space.
  - $\cdot$: Dot product operation.
  - $\|\cdot\|$: L2 norm (magnitude) of a vector.
5.2.3. OpenShape
- Conceptual Definition: OpenShape focuses on scaling up 3D shape representation towards open-world understanding, providing a robust representation that can generalize to novel categories and unseen data. The OpenShape metric evaluates the quality of generated 3D shapes by assessing their alignment with the conditional input (e.g., an image) in a shared semantic space. It measures how well the generated shape captures the visual and semantic content of the input.
- Mathematical Formula: Let $O_{3D}(\text{mesh})$ be the OpenShape embedding for a 3D mesh and $O_{IMG}(\text{image})$ be the OpenShape embedding for an image. The OpenShape score is typically the cosine similarity:
$
\text{OpenShape Score} = \frac{O_{3D}(\text{mesh}) \cdot O_{IMG}(\text{image})}{\|O_{3D}(\text{mesh})\| \, \|O_{IMG}(\text{image})\|}
$
- Symbol Explanation:
  - $O_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the OpenShape embedding space.
  - $O_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the OpenShape embedding space.
  - $\cdot$: Dot product operation.
  - $\|\cdot\|$: L2 norm (magnitude) of a vector.
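All three metrics reduce to a cosine similarity between a mesh embedding and an image embedding from the corresponding pretrained encoder. A generic sketch, with random tensors standing in for real encoder outputs:

```python
import torch
import torch.nn.functional as F

def embedding_similarity(mesh_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a 3D-shape embedding and an image embedding,
    the score form shared by the ULIP-2 / Uni3D / OpenShape evaluations."""
    return F.cosine_similarity(mesh_emb.unsqueeze(0), image_emb.unsqueeze(0)).squeeze(0)

score = embedding_similarity(torch.randn(1024), torch.randn(1024))
print(float(score))   # higher means better mesh-image alignment
```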
5.3. Baselines
The Direct3D-S2 method is compared against several state-of-the-art image-to-3D approaches and internal ablation configurations:
- For Quantitative Image-to-3D Comparison (Table 2):
  - Trellis [40]: Integrates sparse voxel representations with rendering supervision for VAE training and structured 3D latents for scalable generation.
  - Hunyuan3D 2.0 [51]: A diffusion model for high-resolution textured 3D asset generation.
  - TripoSG [18]: High-fidelity 3D shape synthesis using large-scale rectified flow models.
  - Hi3DGen [45]: High-fidelity 3D geometry generation from images via normal bridging.
- For Qualitative VAE Reconstruction Comparison (Figure 6):
  - Trellis [40]
  - XCube [32]: Large-scale 3D generative modeling using sparse voxel hierarchies.
  - Dora [6]: Sampling and benchmarking for 3D shape variational auto-encoders.
- For Ablation Studies on SSA (Figure 10):
  - Full attention: A variant that applies full attention but uses Trellis's latent packing strategy to group tokens in local regions and reduce the token count. This serves as a proxy for what a full attention DiT might attempt in order to cope with high token counts.
  - NSA (Native Sparse Attention): An implementation of NSA [46] that processes latent tokens as 1D sequences with fixed-length block partitioning, explicitly disregarding 3D spatial coherence.
- For Ablation Studies on SSA Modules (Figure 9):
  - win: Only the sparse 3D window module.
  - win + cmp: The sparse 3D window combined with sparse 3D compression.
  - cmp + slc: Sparse 3D compression combined with spatial blockwise selection (without the window).
- For Ablation Studies on Sparse Conditioning (Figure 11):
  - A baseline without the sparse conditioning mechanism, i.e., cross-attention is performed on all DINO-v2 features from the input image, including the background.

These baselines represent state-of-the-art methods or relevant internal configurations chosen to demonstrate the effectiveness and necessity of Direct3D-S2's proposed components.
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that Direct3D-S2 achieves superior performance in high-resolution 3D generation, both quantitatively and qualitatively, while significantly reducing computational demands.
Quantitative Superiority in Image-to-3D Task: The following are the results from Table 2 of the original paper:
| Methods | ULIP-2 ↑ | Uni3D ↑ | OpenShape ↑ |
|---|---|---|---|
| Trellis [40] | 0.2825 | 0.3755 | 0.1732 |
| Hunyuan3D 2.0 [51] | 0.2535 | 0.3738 | 0.1699 |
| TripoSG [18] | 0.2626 | 0.3870 | 0.1728 |
| Hi3DGen [45] | 0.2725 | 0.3723 | 0.1689 |
| Ours | 0.3111 | 0.3931 | 0.1752 |
As shown in Table 2, Direct3D-S2 consistently outperforms all compared state-of-the-art methods across all three multimodal evaluation metrics (ULIP-2, Uni3D, OpenShape). This indicates that the meshes generated by Direct3D-S2 exhibit better alignment and consistency with the input images, validating its effectiveness in capturing both high-level semantics and fine-grained geometric details. For instance, Direct3D-S2 achieves a ULIP-2 score of 0.3111, notably higher than the second-best Trellis at 0.2825.
Qualitative Superiority in Detail Generation:
The qualitative comparisons (Figure 4, Figure 12) strongly highlight Direct3D-S2's advantage. While other methods produce generally satisfactory results, they often struggle with capturing intricate structures due to resolution limitations. Direct3D-S2, thanks to its SSA mechanism, can generate highly detailed meshes, demonstrating superior quality even for complex elements like railings, tree branches, or delicate character features. This is a direct consequence of its ability to handle gigascale resolutions effectively.
The following figure (Figure 8 from the original paper) shows qualitative comparisons of different image-to-3D reconstruction methods in Figure 4 of the paper. The left column shows input images, the middle columns present normal maps and zoomed-in details from five methods, and the rightmost column shows the results of the proposed method with richer details closely matching the input.
The following figure (Figure 13 from the original paper) is a multi-row, multi-column 3D model comparison illustrating differences in modeling detail and reconstruction quality between Direct3D-S2 and other 3D generation methods (Trellis, Hunyuan3D 2.0, TripoSG, Hi3DGen). The left column shows the original color images, while the right columns present grayscale 3D models generated by the different methods, highlighting the superior detail representation of Direct3D-S2.
The following figure (Figure 14 from the original paper) is a schematic illustration comparing different models (Model N, M, R, T, and the authors' method) in generating 3D models from six sets of 2D character images, highlighting the superiority of the authors' method in detail restoration and shape consistency.
User Study Validation:
The user study conducted with 40 participants (Figure 5) further corroborates the model's superior performance. Direct3D-S2 achieved the highest scores across both image consistency and overall geometric quality, demonstrating statistically significant superiority over other approaches. This indicates that human evaluators perceive Direct3D-S2's outputs as more faithful to the input images and geometrically sound.
The following figure (Figure 9 from the original paper) is a chart showing user rating comparisons of different methods on image consistency and overall quality. The x-axis represents "Image Consistency" and "Overall Quality", and the y-axis shows rating values, with the "Ours" method achieving the highest scores in both metrics.
Superior VAE Reconstruction Quality:
The SS-VAE demonstrates superior reconstruction accuracy (Figure 6), especially for complex geometries at 1024³ resolution. Crucially, this is achieved with significantly reduced training costs: 2 days on 8 A100 GPUs, versus the 32 or more GPUs typically required by competing methods for equivalent training durations. This underscores the efficiency and stability gains from the unified sparse SDF reconstruction framework.
The following figure (Figure 10 from the original paper) is a comparative visualization of 3D model reconstruction results, including Trellis, XCube, Dora, and the proposed method at different resolutions, alongside the ground truth, demonstrating superior detail and shape consistency of the proposed approach.
6.2. Data Presentation (Tables)
The following are the results from Table 1 of the original paper:
| Res. | NT | LR | BS | TT |
|---|---|---|---|---|
| 256³ | ≈2058 | 1e-4 | 8×8 | 2 days |
| 384³ | ≈5510 | 1e-4 | 8×8 | 2 days |
| 512³ | ≈10655 | 5e-5 | 8×8 | 2 days |
| 1024³ | ≈45904 | 2e-5 | 2×8 | 1 day |
Table 1 outlines the training configurations for the DiT at different voxel resolutions. Res. denotes resolution, NT the approximate number of latent tokens, LR the learning rate, BS the batch size (batch size per GPU × number of GPUs), and TT the total training time. As resolution increases, the number of tokens grows significantly, necessitating adjustments in learning rate and batch size to manage the computational load. Notably, even at 1024³ resolution, the model can be trained with a relatively small batch size (2 samples per GPU, 8 GPUs total) for a short duration (1 day), highlighting its efficiency.
6.3. Ablation Studies / Parameter Analysis
The paper conducts several ablation studies to validate the effectiveness of its core components.
6.3.1. Image-to-3D Generation in Different Resolution
The authors present generation results across four resolutions (Figure 8). The following figure (Figure 12 from the original paper) illustrates Figure 8 of the paper, showing the image-to-3D generation results of the Direct3D-S2 model at resolutions of 256³, 384³, 512³, and 1024³. The leftmost column shows the input images, followed by the generated 3D shapes at increasing resolutions.
The results clearly show that increasing resolution progressively improves mesh quality. At the lower resolutions (256³ and 384³), meshes exhibit limited geometric detail and occasional misalignment. At 512³, details are enhanced. At 1024³, meshes achieve sharper edges and superior alignment with the input image details, demonstrating the direct impact of higher resolution on fidelity and the model's ability to exploit it.
6.3.2. Effect of Each Module in SSA
Ablation studies on the three modules of SSA were conducted at a fixed resolution (Figure 9).
The following figure (Figure 13 from the original paper) is an illustration of shape reconstruction results, showing the effects of different feature windows (win), comparison module (cmp), and selection module (slc) on 3D model surface normals. The left side shows the input images, and the right side displays normal visualizations under various module combinations, demonstrating improved detail restoration.
- win (sparse 3D window only): Generated meshes showed detailed structures but suffered from surface irregularities due to the lack of global context modeling.
- win + cmp (window + compression): Introducing the sparse 3D compression module showed minimal change in output quality. This is expected, as cmp primarily generates block-level representations used to compute attention scores for selection, rather than directly improving detail.
- win + cmp + slc (full SSA): After incorporating the spatial blockwise selection module, the model could focus on globally important regions, leading to a notable improvement in mesh quality and yielding smooth, geometrically consistent surfaces.
- cmp + slc (compression + selection, no window): Omitting the window module did not cause a significant drop in quality, but it slowed convergence. This suggests that while global compression and selection are crucial for quality, the local feature interaction provided by the window module contributes to more stable and faster training convergence.
6.3.3. Runtime of Different Attention Mechanisms
A custom Triton [37] GPU kernel was implemented for SSA, and its forward and backward execution times were compared with FlashAttention-2 [7] (using Xformers [15]).
The following figure (Figure 11 from the original paper) is two bar charts showing the forward and backward computation times comparison between Spatial Sparse Attention and Flash Attention 2 at varying numbers of tokens, highlighting the significant acceleration of SSA for large-scale token processing.
The results (Figure 7) indicate that SSA performs comparably to FlashAttention-2 for low token counts. However, as the number of tokens increases, SSA's speed advantage becomes pronounced: at 128k tokens, SSA achieves a 3.9x speedup in the forward pass and an impressive 9.6x speedup in the backward pass. This demonstrates the efficiency of the proposed SSA mechanism for gigascale data.
6.3.4. Effectiveness of SSA
Ablation studies were conducted at a fixed resolution to validate the robustness of SSA (Figure 10).
The following figure (Figure 2 from the original paper) is a chart illustrating the effect of the proposed SSA mechanism compared to full attention and NSA mechanisms on 3D model normal maps, featuring two groups of models with different complexity, highlighting SSA's advantage in detail preservation.
- Full attention (with latent packing): This variant produced meshes with high-frequency surface artifacts, attributed to its forced packing operation, which disrupts local geometric continuity by aggregating tokens in a manner that does not respect fine-grained spatial relationships.
- NSA (Native Sparse Attention): The NSA implementation exhibited training instability due to positional ambiguity in its 1D block partitioning, leading to less smooth and lower-quality meshes. This highlights the importance of 3D spatial coherence in sparse attention for volumetric data (see the partitioning sketch after this list).
- Our proposed SSA: In contrast, SSA not only preserved the fine details of the meshes but also yielded smoother and more organized surfaces, validating its effectiveness in handling sparse 3D data efficiently and accurately by respecting spatial coherence during attention computation.
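The "positional ambiguity" point is easiest to see in code: a 1D partition groups tokens purely by their position in the flattened sequence, while a coordinate-based partition groups them by the 3D block that contains them. The sketch below is a hedged illustration of that distinction; `block_size` and the block-id flattening scheme are assumptions, not the paper's exact scheme.

```python
import torch

def blocks_1d(num_tokens, chunk=64):
    """NSA-style 1D partition: tokens are grouped by sequence order, so spatially
    adjacent voxels can end up in different blocks."""
    return torch.arange(num_tokens) // chunk

def blocks_3d(coords, block_size=8):
    """Coordinate-based partition: each active voxel (x, y, z) is assigned to the
    3D block that contains it, preserving spatial coherence."""
    bx, by, bz = (coords // block_size).unbind(-1)     # integer block coordinates
    return bx * 1_000_000 + by * 1_000 + bz            # flatten to a single block id

coords = torch.randint(0, 1024, (5, 3))                # a few active voxels in a 1024^3 grid
print(blocks_1d(5))
print(blocks_3d(coords))
```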
6.3.5. Effect of Sparse Conditioning Mechanism
Ablation experiments were performed at a fixed resolution to assess the sparse conditioning mechanism (Figure 11).
The following figure (Figure 3 from the original paper) is a chart showing ablation study results of the sparse conditioning mechanism. It compares differences in 3D model details without sparse conditioning (w/o sparse conditioning) and with sparse conditioning (w/ sparse conditioning), highlighting improved detail preservation in models generated with sparse conditioning.
The results demonstrate that by excluding non-foreground conditioning tokens, the generated meshes achieve notably better alignment with the input images. This confirms that focusing the cross-attention mechanism on relevant foreground features improves the conditioning process, leading to more accurate and higher-quality 3D outputs.
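A hedged sketch of what such a conditioning step might look like is shown below: background patch tokens are dropped before cross-attention so the latent tokens only attend to foreground features. The `foreground_mask`, token counts, and single-head attention here are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def sparse_cross_attention(latent_tokens, image_tokens, foreground_mask):
    """latent_tokens: (L, d) sparse 3D latents; image_tokens: (P, d) image patch features;
    foreground_mask: (P,) bool, True where the patch belongs to the object."""
    cond = image_tokens[foreground_mask]                      # keep foreground tokens only
    d = latent_tokens.shape[-1]
    attn = F.softmax(latent_tokens @ cond.T / d ** 0.5, dim=-1)
    return attn @ cond                                        # (L, d) conditioned features

latents = torch.randn(4096, 64)
patches = torch.randn(1370, 64)                               # e.g. ViT-style patch tokens
mask = torch.rand(1370) > 0.5                                 # would come from alpha/segmentation
out = sparse_cross_attention(latents, patches, mask)
```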
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Direct3D-S2, a novel and highly effective framework for high-resolution 3D shape generation. The core of its innovation lies in the Spatial Sparse Attention (SSA) mechanism, which dramatically accelerates the training and inference speeds of Diffusion Transformers (DiT) when processing sparse volumetric data. Complementing this, the Direct3D-S2 framework incorporates a fully end-to-end symmetric sparse SDF VAE that ensures consistent volumetric representation across all stages (input, latent, output), thereby enhancing training stability and efficiency.
The extensive experimental results demonstrate that Direct3D-S2 outperforms existing state-of-the-art image-to-3D methods in generation quality. Crucially, it achieves this while requiring significantly fewer computational resources, notably enabling training at an unprecedented 1024³ resolution using only 8 GPUs, a task previously requiring substantially more hardware even at lower resolutions. This makes gigascale 3D generation both practical and accessible.
7.2. Limitations & Future Work
The authors acknowledge a key limitation of the proposed Spatial Sparse Attention (SSA): the forward pass exhibits a notably smaller acceleration ratio compared to the backward pass. This discrepancy is primarily attributed to the computational overhead introduced by topk sorting operations during the forward pass, which are necessary for selecting the most relevant blocks.
For future work, the authors prioritize optimizing these topk sorting operations. Reducing this overhead would further enhance the efficiency of the forward pass, making the SSA mechanism even more balanced and powerful across both training and inference.
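As a rough illustration of where that overhead sits (not the authors' kernel), the selection step amounts to a per-query top-k over block scores, e.g. with `torch.topk`; under standard autograd the backward pass can reuse the saved indices, which is consistent with the forward pass bearing most of the sorting cost. The shapes below are hypothetical.

```python
import torch

# Hypothetical shapes: every query (or query block) ranks all candidate blocks and keeps top-k.
num_queries, num_blocks, top_k = 100_000, 2048, 16
block_scores = torch.randn(num_queries, num_blocks, device="cuda")

# The ranking itself is the forward-pass overhead the authors point to; the indices it
# produces are saved, so the backward pass does not need to repeat the sort.
values, indices = torch.topk(block_scores, k=top_k, dim=-1)
```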
7.3. Personal Insights & Critique
Direct3D-S2 represents a significant leap forward in 3D content generation. The ability to train gigascale (1024³) 3D models on just 8 GPUs is a groundbreaking achievement that democratizes high-fidelity 3D creation. It drastically lowers the barrier to entry for researchers and practitioners, moving 3D generation from the realm of massive compute clusters to more accessible setups.
The paper's methodological innovations are well-reasoned and elegantly address long-standing challenges:
- Unified Sparse Volumetric VAE: The symmetric design of the SS-VAE is highly commendable. By maintaining a consistent sparse volumetric format throughout, it cleverly sidesteps the inefficiencies and potential geometric approximations introduced by the heterogeneous representations common in prior works. This holistic approach to data representation likely contributes significantly to both stability and fidelity.
- Spatial Sparse Attention (SSA): This is the true centerpiece. The adaptation of sparse attention principles from 1D sequences (NSA) to the complexities of sparse 3D data is a sophisticated technical contribution. The explicit consideration of 3D spatial coherence through coordinate-based block partitioning, and the thoughtful combination of compression, selection, and windowing modules, showcase a deep understanding of the problem space. The custom Triton kernel implementation further demonstrates a commitment to practical, high-performance solutions.
- Sparse Conditioning Mechanism: This is a subtle but impactful improvement. Recognizing and mitigating the noise and inefficiency introduced by background pixels in conditional images is a practical refinement that directly contributes to better alignment and reduced computational waste.
Potential Issues and Areas for Improvement:
- The topk sorting overhead in the forward pass is a valid limitation. While the backward pass benefits immensely, further optimizing the forward pass would solidify SSA's efficiency. Techniques like approximate top-k or hardware-aware sorting optimizations could be explored.
- Generalizability of Sparsity: The effectiveness of sparse volumetric representations relies heavily on the assumption that most of the 3D space is empty. For highly complex, dense objects or scenes (e.g., volumetric data like clouds or granular materials), the benefits of sparsity might diminish, or the definition of "active" voxels might become more challenging. The current approach focuses on SDFs, which inherently represent surfaces and thus have sparse "active" regions.
- Topology Flexibility: While SDFs are powerful for representing complex shapes, they can be more restrictive for specific topological operations than explicit mesh manipulation. Future work could explore how this high-resolution sparse SDF generation could be seamlessly integrated with mesh editing tools or other explicit representations.
- Data Quality Dependence: The paper highlights the rigorous curation of 452k high-quality 3D assets, underscoring the critical role of data quality in achieving gigascale results. While Direct3D-S2 is powerful, its performance will likely remain highly dependent on the quality and diversity of the underlying 3D datasets.

Inspirationally, Direct3D-S2 exemplifies how deep architectural insights combined with hardware-aware optimization can lead to paradigm-shifting capabilities. Its methods could transfer to other domains involving large, sparse data structures (e.g., medical imaging, scientific simulations, or large-scale graph processing), where efficient attention mechanisms for irregular data are crucial. The emphasis on unified representations and specialized attention mechanisms offers a blueprint for future gigascale generative models beyond 3D.