
Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

Published: 05/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

Direct3D-S2 employs Spatial Sparse Attention to efficiently generate gigascale 3D shapes using sparse volumetric data, combining a unified sparse VAE design that boosts training efficiency and stability while drastically reducing computational costs.

Abstract

Generating high-resolution 3D shapes using volumetric representations such as Signed Distance Functions (SDFs) presents substantial computational and memory challenges. We introduce Direct3D-S2, a scalable 3D generation framework based on sparse volumes that achieves superior output quality with dramatically reduced training costs. Our key innovation is the Spatial Sparse Attention (SSA) mechanism, which greatly enhances the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA allows the model to effectively process large token sets within sparse volumes, substantially reducing computational overhead and achieving a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. Our framework also includes a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. Compared to previous methods with heterogeneous representations in 3D VAE, this unified design significantly improves training efficiency and stability. Our model is trained on publicly available datasets, and experiments demonstrate that Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency, but also enables training at $1024^3$ resolution using only 8 GPUs, a task typically requiring at least 32 GPUs for volumetric representations at $256^3$ resolution, thus making gigascale 3D generation both practical and accessible. Project page: https://www.neural4d.com/research/direct3d-s2.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Direct3D-S2: Gigascale 3D Generation Made Easy with Spatial Sparse Attention

1.2. Authors

Shuang Wu, Youtian Lin, Feihu Zhang, Yifei Zeng, Yikang Yang, Yajie Bao, Jiachen Qian, Siyu Zhu, Xun Cao, Philip Torr, Yao Yao. Affiliations include Nanjing University, DreamTech, Fudan University, and University of Oxford.

1.3. Journal/Conference

The paper is available as a preprint on arXiv (arXiv:2505.17412), posted on 2025-05-23 (UTC). Based on the available information, it has not yet been published at a specific conference or journal. arXiv is a well-regarded open-access repository for preprints of scientific papers, particularly influential in fields like machine learning and computer vision for early dissemination of research.

1.4. Publication Year

2025 (based on the provided publication date).

1.5. Abstract

Generating high-resolution 3D shapes using volumetric representations like Signed Distance Functions (SDFs) faces significant computational and memory challenges. The paper introduces Direct3D-S2, a scalable 3D generation framework built upon sparse volumes, which achieves superior output quality with considerably reduced training costs. A central innovation is the Spatial Sparse Attention (SSA) mechanism, designed to boost the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. SSA efficiently processes large token sets within sparse volumes, leading to a 3.9x speedup in the forward pass and a 9.6x speedup in the backward pass. The framework also incorporates a variational autoencoder (VAE) that maintains a consistent sparse volumetric format across input, latent, and output stages. This unified design, unlike previous methods with heterogeneous representations in 3D VAEs, substantially improves training efficiency and stability. Trained on publicly available datasets, Direct3D-S2 not only surpasses state-of-the-art methods in generation quality and efficiency but also enables training at $1024^3$ resolution using just 8 GPUs, a task that typically demands at least 32 GPUs for volumetric representations at $256^3$ resolution. This advancement makes gigascale 3D generation both practical and accessible.

Paper Link: https://arxiv.org/abs/2505.17412 | PDF Link: https://arxiv.org/pdf/2505.17412v2.pdf

2. Executive Summary

2.1. Background & Motivation

The creation of high-quality 3D models directly from text or images holds immense potential for various applications, including virtual reality, gaming, product prototyping, and computer-aided design. However, generating these 3D shapes, especially at high resolutions, using volumetric representations such as Signed Distance Functions (SDFs), is plagued by substantial computational and memory demands.

Prior research in large-scale 3D generative models has explored two main avenues:

  1. Implicit Latent Representations: Methods like 3DShape2Vecset, CLAY, and TripoSG leverage neural fields and Variational Autoencoders (VAEs) to encode 3D shapes into compact latent codes. While beneficial for scalability, these often rely on VAEs with asymmetric 3D representations (e.g., converting point clouds to 1D vectors and then to SDF fields), leading to lower training efficiency and requiring vast computational resources (e.g., hundreds of GPUs).

  2. Explicit Latent Representations: Approaches such as Direct3D, XCube, and Trellis use more interpretable representations like tri-planes or sparse voxels. While offering simpler training and direct editing, these methods are often limited in output resolution due to high memory demands. Scaling to resolutions like $1024^3$ with sufficient latent tokens and valid voxels remains challenging. A primary bottleneck is the quadratic computational cost of full attention in Diffusion Transformers (DiT), which makes high-resolution training prohibitively expensive.

    The core problem the paper aims to solve is the scalability barrier in high-resolution 3D shape generation, specifically addressing the computational and memory challenges associated with volumetric representations and the inefficiency of attention mechanisms in DiT for large, sparse 3D data. The paper's entry point is to unify sparse volumetric representations across the entire generative pipeline and dramatically improve the efficiency of attention mechanisms for these sparse data structures.

2.2. Main Contributions / Findings

The paper introduces Direct3D-S2, a novel framework that makes gigascale 3D generation practical and accessible. Its primary contributions and findings are:

  1. Spatial Sparse Attention (SSA) Mechanism: The paper's key innovation, SSA, is specifically designed to enhance the efficiency of Diffusion Transformer (DiT) computations on sparse volumetric data. By intelligently processing large token sets within sparse volumes, SSA significantly reduces computational overhead, achieving a remarkable 3.9x speedup in the forward pass and an even more impressive 9.6x speedup in the backward pass compared to FlashAttention-2 at $1024^3$ resolution with 128k tokens.
  2. Unified Sparse SDF VAE: Direct3D-S2 incorporates a Variational Autoencoder (VAE) that maintains a consistent sparse volumetric format across the input, latent, and output stages. This symmetric and unified design eliminates the need for cross-modality translations common in previous heterogeneous 3D VAEs, thereby substantially improving training efficiency, stability, and geometric fidelity.
  3. Scalability and Efficiency Breakthrough: The framework enables training at an unprecedented $1024^3$ resolution using only 8 GPUs. This is a significant leap, as previous state-of-the-art volumetric methods typically require at least 32 GPUs to train at a much lower $256^3$ resolution. This finding demonstrates Direct3D-S2's capability to make "gigascale" 3D generation both practical and accessible.
  4. Superior Generation Quality: Extensive experiments confirm that Direct3D-S2 not only offers superior efficiency but also surpasses state-of-the-art methods in generation quality, producing highly detailed 3D shapes. Quantitative evaluations using ULIP-2, Uni3D, and OpenShape metrics, as well as qualitative comparisons and user studies, affirm its leading performance.
  5. Sparse Conditioning Mechanism: A novel sparse conditioning mechanism is introduced to selectively extract and process foreground tokens from input images, reducing computational overhead and improving alignment between generated meshes and conditional images by mitigating background noise.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand Direct3D-S2, a reader should be familiar with the following core concepts:

  • 3D Shape Generation: The overarching task of creating three-dimensional digital models. This can involve various input modalities (e.g., text, images) and output representations (e.g., meshes, point clouds, voxels).
  • Volumetric Representations: A method of representing 3D objects as a grid of discrete elements (voxels) in 3D space. Each voxel can store information about the object at that location.
    • Signed Distance Functions (SDFs): A specific type of volumetric representation where each point in space is assigned a value representing its shortest distance to the surface of an object. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the object. Points exactly on the surface have an SDF value of zero. SDFs are useful for tasks like shape reconstruction and boolean operations.
    • Watertight Meshes: A 3D mesh is "watertight" if it completely encloses a volume without any holes or gaps in its surface. This property is crucial for generating SDFs and for many 3D printing or simulation applications.
    • Sparse Volumes: In a dense volumetric representation, every voxel in a fixed grid stores data. In contrast, sparse volumes only store and process the "active" or non-empty voxels, significantly reducing memory and computational requirements when most of the space is empty (e.g., for a hollow object).
  • Variational Autoencoder (VAE): A type of generative neural network that learns a compressed, continuous latent space representation of data.
    • Encoder: Maps input data (e.g., an image or 3D shape) to a distribution over the latent space (typically a Gaussian distribution, characterized by its mean and variance).
    • Decoder: Maps samples from the latent space back to the original data space, reconstructing the input.
    • Latent Space: A lower-dimensional representation of the input data, capturing its essential features. In a VAE, this space is designed to be continuous and easily traversable, allowing for novel data generation by sampling from it.
    • KL Divergence (Kullback-Leibler Divergence): A measure of how one probability distribution diverges from a second, expected probability distribution. In VAEs, it's used to regularize the latent space by forcing the encoded distributions to be close to a standard normal distribution.
  • Diffusion Models: A class of generative models that learn to reverse a diffusion process. They gradually add noise to data until it becomes pure noise, then learn to "denoise" it step-by-step to generate new data from noise.
  • Transformers: A neural network architecture that revolutionized sequence modeling. Key components include:
    • Attention Mechanism: A core component that allows the model to weigh the importance of different parts of the input sequence (or tokens) when processing a specific part.
    • Self-Attention: A variant of attention where the model computes attention weights between different parts of the same input sequence, allowing it to capture internal dependencies.
    • Cross-Attention: Used in conditional generation, where the model computes attention between a target sequence (e.g., noisy tokens) and a conditioning sequence (e.g., image features or text embeddings).
  • Diffusion Transformer (DiT): A Transformer-based neural network architecture specifically adapted for diffusion models. Instead of U-Net architectures, DiTs use Transformers to predict the noise or velocity field in diffusion models, showing strong performance and scalability.
  • Rectified Flow: A type of generative model that defines a straight-line trajectory between a data distribution and a simple prior distribution (like a standard normal distribution). The model learns to predict the velocity field along these straight paths, offering an alternative to traditional diffusion models with potentially faster sampling.
  • GPU Kernel: A specialized program or function designed to run directly on the Graphics Processing Unit (GPU). GPU kernels are highly optimized for parallel computation and are crucial for accelerating computationally intensive tasks like neural network training.
  • Triton: A domain-specific language (DSL) and compiler developed by OpenAI that allows researchers to write highly efficient GPU kernels for deep learning operations, often outperforming hand-tuned CUDA kernels. It abstracts away some of the complexities of CUDA programming, making it easier to optimize tensor operations.

3.2. Previous Works

The paper contextualizes its contributions by discussing several categories of prior work:

  • Multi-view Generation and 3D Reconstruction: These methods (e.g., [16, 22, 23, 42]) often start with multi-view diffusion models ([38]) trained on 2D image priors (like Stable Diffusion [33]) to generate consistent multi-view images of a 3D shape. These images are then used to reconstruct the 3D shape via generalized sparse-view reconstruction models ([21, 27, 36, 43, 48]).
    • Limitation: They struggle with multi-view consistency and shape quality, often producing artifacts. They also rely on rendering-based supervision (NeRF [28], DMTet [34]), which adds complexity and computational overhead.
  • Large Scale 3D Latent Diffusion Model: Inspired by 2D Latent Diffusion Models (LDMs) [33], these extend LDMs to 3D.
    • Implicit Vecset-based Methods: Examples include 3DShape2Vecset [47], Michelangelo [50], CLAY [49], and CraftsMan3D [17]. They represent 3D shapes using a latent vecset (a set of vectors) and reconstruct meshes via neural SDFs or occupancy fields.
      • Limitation: Constrained by vecset size; larger vecsets lead to more complex mappings and longer training times. They often use asymmetric VAE architectures (e.g., point cloud input to 1D vector latent to SDF output), which reduces efficiency.
    • Voxel-based Methods: Examples include XCube [32], Trellis [40], and [11, 45]. These use voxel grids as latent representations, offering better interpretability and simpler training.
      • Limitation: Face significant challenges in latent resolution due to the cubic growth of GPU memory requirements and the high computational costs of attention mechanisms in Transformers. XCube [32] can generate $1024^3$ sparse volumes but is limited to millions of valid voxels, impacting final quality. Trellis [40] integrates $256^3$ sparse voxel representations with rendering supervision.
  • Efficient Large Tokens Generation: Addresses the challenge of efficiently processing a large number of tokens.
    • Native Sparse Attention (NSA) [46]: A technique that uses adaptive token compression, integrating compression, selection, and windowing to identify relevant tokens. It was designed for 1D sequences and applied to large language models ([31, 46]) and video generation ([35]).
      • Limitation: NSA is not directly applicable to unstructured, sparse 3D data because its 1D block partitioning doesn't preserve 3D spatial coherence.
    • Linear Attention [13]: Reduces attention complexity by approximating attention weights with linear functions ([41, 53, 26]).
      • Limitation: Can lead to a significant performance decline due to the absence of non-linear similarity.

3.3. Technological Evolution

The evolution of 3D generation has moved from early explicit mesh-based methods to implicit neural representations (like NeRFs and SDFs) due to their compactness and flexibility. More recently, the success of Latent Diffusion Models in 2D has inspired their adaptation to 3D. This has led to the development of 3D VAEs to compress 3D data into latent spaces, and Diffusion Transformers (DiT) to generate in these latent spaces.

However, scaling these models to high resolutions (e.g., $1024^3$) has remained a major hurdle. The challenge stems from the inherent cubic growth of volumetric data and the quadratic complexity of standard Transformer attention mechanisms with respect to sequence length (number of tokens). Early solutions tried to reduce the number of tokens through packing or coarse representations.

Direct3D-S2 fits into this evolution by pushing the boundaries of voxel-based latent diffusion. It addresses the attention bottleneck head-on by developing a specialized Spatial Sparse Attention (SSA) mechanism that can effectively operate on sparse 3D data, overcoming the limitations of 1D sparse attention methods. Furthermore, it tackles the VAE inefficiency by introducing a fully symmetric, sparse volumetric VAE that maintains consistency throughout the pipeline. This work represents a significant step towards practical, high-resolution, high-fidelity 3D content creation.

3.4. Differentiation Analysis

Compared to the main methods in related work, Direct3D-S2 introduces several core differences and innovations:

  1. Unified Sparse Volumetric VAE Design: Unlike previous 3D VAEs that often use heterogeneous or asymmetric representations (e.g., point cloud input, 1D vector latent, dense volume output, or reliance on differentiable rendering to bridge latent spaces), Direct3D-S2 employs a symmetric encoding-decoding network that consistently uses a sparse volumetric format across input, latent, and output stages. This unified approach eliminates costly cross-modality translations, leading to significantly improved training efficiency, stability, and geometric fidelity.

  2. Spatial Sparse Attention (SSA) for 3D Data: While Native Sparse Attention (NSA) inspired SSA, NSA was designed for 1D sequences and cannot effectively handle the spatial coherence required for unstructured, sparse 3D data. Direct3D-S2's SSA explicitly redesigns the block partitioning to preserve 3D spatial coherence and revises compression, selection, and window modules to accommodate the irregular nature of sparse volumetric tokens. This adaptation is crucial for enabling efficient DiT computation on gigascale 3D data.

  3. Unprecedented Scalability and Resource Efficiency: Direct3D-S2 demonstrates the ability to train Diffusion Transformers at $1024^3$ resolution using only 8 GPUs. This is a dramatic improvement over existing volumetric methods that typically require 32 or more GPUs for merely $256^3$ resolution. This leap in efficiency makes high-resolution 3D generation practically achievable.

  4. Sparse Conditioning Mechanism: The introduction of a sparse conditioning mechanism is a practical improvement over standard cross-attention in image-to-3D models. By selectively processing only foreground tokens from conditional images (instead of all pixel-level features), it reduces computational overhead and improves alignment by focusing on relevant visual information.

  5. Direct Volumetric Generation: Many multi-view generation methods rely on rendering-based supervision and subsequent 3D reconstruction, which can introduce artifacts and complexity. Direct3D-S2 directly generates 3D shapes in a volumetric SDF format, maintaining geometric precision through its SS-VAE and SSA-enhanced DiT.

    In essence, Direct3D-S2 differentiates itself by pioneering an efficient and coherent sparse volumetric pipeline from VAE encoding to DiT generation, enabled by a 3D-aware sparse attention mechanism, leading to breakthroughs in resolution, quality, and computational accessibility.

4. Methodology

The Direct3D-S2 framework is designed for scalable and efficient high-resolution 3D shape generation, leveraging sparse volumetric representations. It consists of two main components: a Sparse SDF VAE (SS-VAE) for encoding and decoding 3D shapes into sparse latent representations, and a Diffusion Transformer (DiT) with a novel Spatial Sparse Attention (SSA) mechanism for generating these latent representations.

4.1. Principles

The core idea behind Direct3D-S2 is to overcome the computational and memory challenges of high-resolution 3D generation by focusing on sparse data structures and designing attention mechanisms tailored for them.

  1. Sparsity: Instead of processing dense volumetric grids, the framework operates exclusively on sparse voxels (only those near the object's surface or containing valid data). This drastically reduces the number of data points (tokens) to be processed.
  2. Unified Representation: Maintain a consistent sparse volumetric format throughout the entire VAE pipeline (input, latent, output). This symmetry simplifies the architecture and improves efficiency by avoiding heterogeneous data conversions.
  3. Spatial Awareness in Attention: Adapt the Transformer attention mechanism to be spatially aware for 3D data. Standard full attention is quadratically expensive, and existing sparse attention (e.g., NSA) is not designed for 3D spatial coherence. The SSA mechanism intelligently groups and selects tokens based on their 3D coordinates, allowing for efficient processing of large, sparse token sets.
  4. Progressive Training: Utilize a multi-resolution training strategy for the VAE and a progressive training strategy for the DiT to accelerate convergence and enable training at very high resolutions.

4.2. Core Methodology In-depth (Layer by Layer)

The Direct3D-S2 framework is composed of an SS-VAE (upper half of Figure 2) and an SS-DiT with SSA and a sparse conditioning mechanism (lower half of Figure 2).

The following figure (Figure 6 from the original paper) provides a schematic diagram of the Direct3D-S2 framework, illustrating the overall structure and data flow of the SS-VAE and SS-DiT modules, including multi-resolution sparse SDF encoding and decoding, spatial sparse attention mechanism, and the final 3D mesh generation process.


4.2.1. Sparse SDF VAE (SS-VAE)

The SS-VAE is designed to encode high-resolution sparse SDF volumes into a compact sparse latent representation and then reconstruct them. This addresses the challenge of directly processing dense $R^3$ SDF volumes, which is computationally prohibitive.

Input Data Processing: Given a mesh represented as an SDF volume $V$ with resolution $R^3$ (e.g., $1024^3$), the SS-VAE focuses on valid sparse voxels whose absolute SDF values fall below a predefined threshold $\tau$. This effectively limits processing to regions near the object's surface. The input is formally defined as: $ V = \{ (\mathbf{x}_i, s(\mathbf{x}_i)) \,\big|\, |s(\mathbf{x}_i)| < \tau \}_{i=1}^{|V|} $ Here, $s(\mathbf{x}_i)$ represents the SDF value at a 3D position $\mathbf{x}_i$, and $|V|$ is the total number of valid sparse voxels.
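To make the sparse input construction concrete, here is a minimal PyTorch sketch (not the authors' code) that extracts the valid sparse voxels from a dense SDF grid given a threshold; the tensor names, grid size, and threshold value are illustrative assumptions.

```python
import torch

def extract_sparse_sdf(sdf_volume: torch.Tensor, tau: float = 0.02):
    """Keep only voxels with |SDF| < tau, i.e. voxels near the surface.

    sdf_volume: dense (R, R, R) tensor of signed distance values.
    Returns integer coordinates (M, 3) and the corresponding SDF values (M,).
    """
    mask = sdf_volume.abs() < tau          # (R, R, R) boolean mask of valid voxels
    coords = mask.nonzero(as_tuple=False)  # (M, 3) integer voxel positions
    values = sdf_volume[mask]              # (M,) SDF values at those positions
    return coords, values

# Example with a toy 64^3 grid: only a thin shell around the zero level set survives.
coords, values = extract_sparse_sdf(torch.randn(64, 64, 64) * 0.1, tau=0.02)
```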

Symmetric Network Architecture: The SS-VAE employs a symmetric encoder-decoder network architecture.

  • Encoder:

    1. Local Feature Extraction: The encoder first extracts local geometric features using a series of residual sparse 3D CNN blocks.
    2. Downsampling: These CNN blocks are interleaved with 3D mean pooling operations, which progressively downsample the spatial resolution.
    3. Transformer Processing: The sparse voxels, now treated as variable-length tokens, are then processed by shifted window attention mechanisms. This step captures local contextual information among the valid voxels.
    4. Positional Encoding: Inspired by Trellis [40], the feature of each valid voxel is augmented with positional encoding based on its 3D coordinates before being fed into the 3D shift window attention layers. This helps the model understand the spatial arrangement of tokens.
    5. Output: This hybrid design outputs a sparse latent representation $\mathbf{z} = E(V)$ at a reduced resolution $(\frac{R}{f})^3$, where $f$ is the downsampling factor.
  • Decoder: The decoder $\tilde{V} = D(\mathbf{z})$ mirrors the encoder's structure. It uses attention layers and sparse 3D CNN blocks to progressively upsample the latent representation $\mathbf{z}$ and reconstruct the SDF volume $\tilde{V}$.

Training Losses: The training objective for the SS-VAE aims to ensure accurate reconstruction and a well-behaved latent space. The decoded sparse voxels $\tilde{V}$ comprise the original input voxels $\tilde{V}_{\mathrm{in}}$ and any additional valid voxels $\tilde{V}_{\mathrm{extra}}$ that the decoder might generate. Supervision is enforced on the SDF values across all these spatial positions. To enhance geometric fidelity, additional supervision is applied to active voxels near sharp edges, identified as regions with high-curvature variations on the mesh surface ($\tilde{V}_{\mathrm{sharp}}$). A KL divergence regularization term is also imposed on the latent representation $\mathbf{z}$ to constrain excessive variations in the latent space, encouraging it to follow a simple prior distribution (e.g., standard normal).

The reconstruction loss $\mathcal{L}_c$ for each category of voxels is formulated as: $ \mathcal{L}_c = \frac{1}{\vert \tilde{V}_c \vert} \sum_{(\mathbf{x}, \tilde{s}(\mathbf{x})) \in \tilde{V}_c} \left\| s(\mathbf{x}) - \tilde{s}(\mathbf{x}) \right\|_2^2, \quad c \in \{ \mathrm{in}, \mathrm{ext}, \mathrm{sharp} \} $ Where:

  • $\mathcal{L}_c$: The L2 reconstruction loss for voxel set $c$.

  • $\vert \tilde{V}_c \vert$: The number of voxels in the set $\tilde{V}_c$.

  • $(\mathbf{x}, \tilde{s}(\mathbf{x}))$: A voxel at position $\mathbf{x}$ with its reconstructed SDF value $\tilde{s}(\mathbf{x})$.

  • $s(\mathbf{x})$: The ground-truth SDF value at position $\mathbf{x}$.

  • $\tilde{V}_{\mathrm{in}}$: The set of original input voxels.

  • $\tilde{V}_{\mathrm{ext}}$: The set of extra valid voxels generated by the decoder.

  • $\tilde{V}_{\mathrm{sharp}}$: The set of active voxels near sharp edges of the mesh.

    The overall training objective $\mathcal{L}_{\mathrm{total}}$ for the SS-VAE is: $ \mathcal{L}_{\mathrm{total}} = \sum_c \lambda_c \mathcal{L}_c + \lambda_{\mathrm{KL}} \mathcal{L}_{\mathrm{KL}} $ Where:

  • $\mathcal{L}_{\mathrm{total}}$: The total training loss for the SS-VAE.

  • $\lambda_{\mathrm{in}}, \lambda_{\mathrm{ext}}, \lambda_{\mathrm{sharp}}$: Weighting coefficients for the respective reconstruction loss terms.

  • $\lambda_{\mathrm{KL}}$: Weighting coefficient for the KL divergence regularization term.

  • $\mathcal{L}_{\mathrm{KL}}$: The KL divergence loss, which regularizes the latent space.
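As a rough illustration of how these terms combine, the following sketch computes a weighted sum of per-category SDF reconstruction losses plus a Gaussian KL term. The dictionary interface, loss weights, and diagonal-Gaussian latent parameterization are assumptions of this sketch, not values from the paper.

```python
import torch
import torch.nn.functional as F

def ss_vae_loss(pred_sdf, gt_sdf, mu, logvar,
                lambda_in=1.0, lambda_ext=1.0, lambda_sharp=1.0, lambda_kl=1e-3):
    """pred_sdf / gt_sdf: dicts with keys 'in', 'ext', 'sharp', each holding a
    (M_c,) tensor of predicted / ground-truth SDF values for that voxel set.
    mu, logvar: parameters of the diagonal Gaussian latent distribution."""
    weights = {"in": lambda_in, "ext": lambda_ext, "sharp": lambda_sharp}
    recon = sum(weights[c] * F.mse_loss(pred_sdf[c], gt_sdf[c]) for c in weights)
    # KL( N(mu, sigma^2) || N(0, I) ), averaged over latent dimensions.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + lambda_kl * kl
```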

    Multi-resolution Training: To enhance training efficiency and allow the SS-VAE to handle meshes across varying resolutions, a multi-resolution training paradigm is used. During each training iteration, a target resolution is randomly sampled from a candidate set $\{256^3, 384^3, 512^3, 1024^3\}$. The input SDF volume is then trilinearly interpolated to the selected resolution before being fed into the SS-VAE. Trilinear interpolation is a method of estimating the value of a function at an intermediate point within a 3D grid, given its values at the grid points.
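A minimal sketch of this multi-resolution sampling step is shown below. It resamples a dense SDF grid with `torch.nn.functional.interpolate` for clarity; the paper operates on sparse volumes, so treat this purely as an illustration of the random-resolution trilinear resampling.

```python
import random
import torch
import torch.nn.functional as F

def sample_training_volume(sdf_volume: torch.Tensor,
                           resolutions=(256, 384, 512, 1024)) -> torch.Tensor:
    """Randomly pick a target resolution and trilinearly resample the SDF grid to it.

    sdf_volume: (R, R, R) dense SDF grid (dense only for illustration).
    """
    target = random.choice(resolutions)
    vol = sdf_volume[None, None]                      # (1, 1, R, R, R) for interpolate
    vol = F.interpolate(vol, size=(target,) * 3,
                        mode="trilinear", align_corners=True)
    return vol[0, 0]                                  # (target, target, target)
```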

4.2.2. Spatial Sparse Attention and DiT

After the SS-VAE encodes 3D shapes into latent representations $\mathbf{z}$, a rectified flow transformer-based 3D shape generator (SS-DiT) is trained on these latents, conditioned on input images. The SS-DiT incorporates the novel Spatial Sparse Attention (SSA) mechanism and a sparse conditioning mechanism.

The following figure (Figure 7 from the original paper) is a schematic from the Direct3D-S2 paper showing the three branches of the Spatial Sparse Attention mechanism: Sparse 3D Compression, Spatial Blockwise Selection, and Sparse 3D Window, and how they produce Compressed Attention, Selected Attention, and Window Attention, which are then combined into a gated output.


Standard Full Attention (Context): For context, the standard full attention mechanism, given input tokens $\mathbf{q}, \mathbf{k}, \mathbf{v} \in \mathbb{R}^{N \times d}$, where $N$ is the token length and $d$ is the head dimension, is formulated as: $ \mathbf{o}_t = \mathrm{Attn}(\mathbf{q}_t, \mathbf{k}, \mathbf{v}) = \sum_{i=1}^N \frac{\mathbf{p}_{t,i} \, \mathbf{v}_i}{\sum_{j=1}^N \mathbf{p}_{t,j}} $ where the attention weights $\mathbf{p}_{t,j}$ are computed as: $ \mathbf{p}_{t,j} = \exp\left(\frac{\mathbf{q}_t^\intercal \mathbf{k}_j}{\sqrt{d}}\right) $ Where:

  • $\mathbf{q}_t$: The query vector for the $t$-th token.
  • $\mathbf{k}$: The matrix of key vectors.
  • $\mathbf{v}$: The matrix of value vectors.
  • $\mathbf{o}_t$: The output vector for the $t$-th token.
  • $N$: The total number of tokens.
  • $d$: The dimension of the attention head (used to scale the dot product).
  • $\mathbf{q}_t^\intercal \mathbf{k}_j$: The dot product similarity between the query $\mathbf{q}_t$ and the key $\mathbf{k}_j$.
  • $\sqrt{d}$: A scaling factor to prevent large dot products from pushing the softmax function into regions with tiny gradients. This full attention has a quadratic computational cost with respect to $N$, which becomes prohibitive for large token lengths (e.g., over 100k at $1024^3$ resolution).
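For reference, a single-head version of this full attention can be written in a few lines of PyTorch; the explicit (N, N) score matrix is exactly the term whose quadratic growth SSA is designed to avoid (names and shapes here are illustrative).

```python
import math
import torch

def full_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Reference scaled dot-product attention for one head.

    q, k, v: (N, d) tensors. The (N, N) score matrix below is what becomes
    prohibitive once N exceeds ~100k tokens at gigascale resolution.
    """
    scores = q @ k.transpose(-2, -1) / math.sqrt(q.shape[-1])  # (N, N) similarities
    weights = scores.softmax(dim=-1)                            # row-wise softmax
    return weights @ v                                          # (N, d) outputs
```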

Spatial Sparse Attention (SSA): SSA is proposed to overcome the computational inefficiency of full attention when dealing with sparse volumetric data and large token sets. It addresses the limitations of Native Sparse Attention (NSA) [46], which treats tokens as a 1D sequence, leading to issues with 3D spatial coherence and unstable training for sparse 3D data. SSA partitions key and value tokens into spatially coherent blocks based on their 3D coordinates: the 3D space is divided into subgrids of size $m^3$, and active tokens within the same subgrid are grouped into one block. The SSA mechanism comprises three core modules: sparse 3D compression, spatial blockwise selection, and sparse 3D window. The attention computation combines these modules through gated aggregation (a simplified code sketch of this gated combination is given after the module descriptions below): $ \mathbf{o}_t = \omega_t^{\mathrm{cmp}} \mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{cmp}}, \mathbf{v}_t^{\mathrm{cmp}}) + \omega_t^{\mathrm{slc}} \mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{slc}}, \mathbf{v}_t^{\mathrm{slc}}) + \omega_t^{\mathrm{win}} \mathrm{Attn}(\mathbf{q}_t, \mathbf{k}_t^{\mathrm{win}}, \mathbf{v}_t^{\mathrm{win}}) $ Where:

  • $\mathbf{o}_t$: The final output vector for query $\mathbf{q}_t$.

  • $\mathrm{Attn}(\cdot, \cdot, \cdot)$: The standard attention function, applied to the selected key-value pairs.

  • $\mathbf{k}_t^{\mathrm{cmp}}, \mathbf{v}_t^{\mathrm{cmp}}$: Key and value tokens produced by the sparse 3D compression module for query $\mathbf{q}_t$.

  • $\mathbf{k}_t^{\mathrm{slc}}, \mathbf{v}_t^{\mathrm{slc}}$: Key and value tokens selected by the spatial blockwise selection module for query $\mathbf{q}_t$.

  • $\mathbf{k}_t^{\mathrm{win}}, \mathbf{v}_t^{\mathrm{win}}$: Key and value tokens gathered by the sparse 3D window module for query $\mathbf{q}_t$.

  • $\omega_t^{\mathrm{cmp}}, \omega_t^{\mathrm{slc}}, \omega_t^{\mathrm{win}}$: Gating scores for each module, obtained by applying a linear layer followed by a sigmoid activation to the input features. These scores dynamically weigh the contributions of each attention module.

    Let's break down each module:

  1. Sparse 3D Compression: This module extracts block-level representations of the input tokens.

    • Intra-block Positional Encoding: Each token within a block of size $m_{\mathrm{cmp}}^3$ is augmented with intra-block positional encoding.
    • Compression: Sparse 3D convolution followed by sparse 3D mean pooling is applied to compress the entire block. The block-level key token $\mathbf{k}_t^{\mathrm{cmp}}$ is computed as: $ \mathbf{k}_t^{\mathrm{cmp}} = \delta(\mathbf{k}_t + \mathrm{PE}(\mathbf{k}_t)) $ Where:
    • $\mathbf{k}_t^{\mathrm{cmp}}$: The block-level key token for query $\mathbf{q}_t$.
    • $\mathbf{k}_t$: The original key token (or an aggregated representation).
    • $\mathrm{PE}(\cdot)$: Absolute positional encoding function.
    • $\delta(\cdot)$: Represents the operations of sparse 3D convolution and sparse 3D mean pooling.
    • $m_{\mathrm{cmp}}^3$: The size of the compression block. This module captures coarse-grained global information while reducing the number of tokens.
  2. Spatial Blockwise Selection: This module retains token-level features for fine details. It leverages the sparse 3D compression module to determine which blocks are most relevant.

    • Attention Score Computation: It computes attention scores $\mathbf{s}_{\mathrm{cmp}}$ between the query $\mathbf{q}$ and each compression block.
    • Block Selection: All tokens within the top-k blocks exhibiting the highest scores are selected.
    • Resolution Constraint: The resolution $m_{\mathrm{slc}}$ of the selection blocks must be greater than and divisible by the resolution $m_{\mathrm{cmp}}$ of the compression blocks.
    • Relevance Score Aggregation: The relevance score $\mathbf{s}_t^{\mathrm{slc}}$ for a selection block is aggregated from its constituent compression blocks. GQA (Grouped-Query Attention) [4] is used for further efficiency improvement, where attention scores of shared query heads within each group are accumulated: $ \mathbf{s}_t^{\mathrm{slc}} = \sum_{i \in \mathcal{B}_{\mathrm{cmp}}} \sum_{h=1}^{h_s} s_{t,h}^{\mathrm{cmp},i} $ Where:
    • $\mathbf{s}_t^{\mathrm{slc}}$: The aggregated relevance score of a selection block for query $\mathbf{q}_t$.
    • $\mathcal{B}_{\mathrm{cmp}}$: The set of compression blocks within the selection block.
    • $h_s$: The number of shared heads within a group in GQA.
    • $s_{t,h}^{\mathrm{cmp},i}$: The attention score for query $\mathbf{q}_t$, head $h$, and compression block $i$. The top-k selection blocks with the highest $\mathbf{s}_t^{\mathrm{slc}}$ scores are chosen, and all tokens within them form $\mathbf{k}_t^{\mathrm{slc}}$ and $\mathbf{v}_t^{\mathrm{slc}}$ for attention computation.

    Triton Kernel Implementation: The spatial blockwise selection attention kernel is implemented using Triton [37] to address challenges arising from sparse 3D voxel structures:

    • The number of tokens varies across blocks.
    • Tokens within the same block may not be contiguous in High Bandwidth Memory (HBM). To handle this, input tokens are sorted by their block indices, and the starting index array $\mathcal{C}$ of each block is passed as kernel input. In the inner loop, $\mathcal{C}$ dynamically controls loading of the corresponding block tokens.

    The complete procedure of the forward pass is formalized in Algorithm 1.

    Algorithm 1 Spatial Blockwise Selection Attention Forward Pass
    
    Require: q ∈ ℝ^(N × (h_kv × h_s) × d), k ∈ ℝ^(N × h_kv × d) and v ∈ ℝ^(N × h_kv × d), 
             number of key/value heads h_kv, number of the shared heads h_s, 
             number of the selected blocks T, indices of the selected blocks I ∈ ℝ^(N × h_kv × T), 
             the number of divided key/value blocks N_b, C ∈ ℝ^(N_b + 1), block size B_k.
    Output: o ∈ ℝ^(N × (h_kv × h_s) × d), l ∈ ℝ^(N × (h_kv × h_s))
    
    1: Divide the output o ∈ ℝ^(N × (h_kv × h_s) × d) into (N, h_kv) blocks, each of size h_s × d
    2: Divide the logsumexp l ∈ ℝ^(N × (h_kv × h_s)) into (N, h_kv) blocks, each of size h_s.
    3: Sort all tokens within q, k and v according to their respective block indices.
    4: for t = 1 to N do
    5:   for h = 1 to h_kv do
    6:     Initialize o_(t,h) = (0)_(h_s × d) ∈ ℝ^(h_s × d), 
                       logsumexp l_(t,h) = (0)_(h_s) ∈ ℝ^(h_s), 
                       and m_(t,h) = (-inf)_(h_s) ∈ ℝ^(h_s).
    7:     Load q_(t,h) ∈ ℝ^(h_s × d) and I_(t,h) ∈ ℝ^T from HBM to on-chip SRAM.
    8:     for j = 1 to T do
    9:       Load b_s = C^((I_(t,h)^(j))) and ending token index b_e = C^((I_(t,h)^(j)) + 1) - 1 
                      of the I_(t,h)^(j)th block from HBM to on-chip SRAM.
    10:      for i = b_s to b_e by B_k do
    11:        Load k_i and v_i ∈ ℝ^(B_k × d) from HBM to on-chip SRAM.
    12:        Compute s_(t,h)^(i) = q_(t,h) k_i^T ∈ ℝ^(h_s × B_k).
    13:        Compute m_(t,h)^(i) = max(m_(t,h), rowmax(s_(t,h)^(i))) ∈ ℝ^(h_s).
    14:        Compute p_(t,h)^(i) = e^(s_(t,h)^(i) - m_(t,h)^(i)) ∈ ℝ^(h_s × B_k).
    15:        Update o_(t,h) = e^(m_(t,h) - m_(t,h)^(i)) o_(t,h) + p_(t,h)^(i) v_i.
    16:        Update l_(t,h) = m_(t,h)^(i) + log(e^(l_(t,h) - m_(t,h)^(i)) + rowsum(p_(t,h)^(i))), 
                       m_(t,h) = m_(t,h)^(i).
    17:      end for
    18:    end for
    19:    Compute o_(t,h) = e^(m_(t,h) - l_(t,h)) o_(t,h).
    20:    Write o_(t,h) and l_(t,h) to HBM as the (t,h)-th block of o and l, respectively.
    21:  end for
    22: end for
    23: Return the output o and the logsumexp l.
    

    Where:

    • $\mathbf{q}$: Query tokens.
    • $\mathbf{k}$: Key tokens.
    • $\mathbf{v}$: Value tokens.
    • $N$: Total number of tokens.
    • $h_{kv}$: Number of key/value heads.
    • $h_s$: Number of shared heads within a group.
    • $d$: Head dimension.
    • $T$: Number of selected blocks.
    • $\mathbf{I}$: Indices of the selected blocks for each token and head.
    • $N_b$: Number of divided key/value blocks.
    • $\mathcal{C}$: An array storing the starting index in the sorted tokens for each block; $\mathcal{C}^{(j)}$ gives the starting index of block $j$.
    • $B_k$: Block size for loading $\mathbf{k}$ and $\mathbf{v}$ within a selected block.
    • $\mathbf{o}$: Output tokens.
    • $l$: logsumexp values for numerical stability (used in softmax normalization).
    • $\mathbf{o}_{t,h}$: Output for query $t$, head $h$.
    • $l_{t,h}$: logsumexp for query $t$, head $h$.
    • $\mathbf{m}_{t,h}$: Running maximum for query $t$, head $h$ (the online-softmax normalization trick).
    • $\mathbf{q}_{t,h}$: Query tokens for query $t$, head $h$.
    • $\mathbf{I}_{t,h}$: Selected block indices for query $t$, head $h$.
    • $b_s, b_e$: Starting and ending token indices of a selected block.
    • $\mathbf{k}_i, \mathbf{v}_i$: Key and value tokens loaded in chunks.
    • $\mathbf{s}_{t,h}^{(i)}$: Attention scores (dot products) between $\mathbf{q}_{t,h}$ and $\mathbf{k}_i$.
    • $\mathbf{m}_{t,h}^{(i)}$: Updated maximum for softmax stabilization.
    • $\mathbf{p}_{t,h}^{(i)}$: Exponentiated attention scores (pre-normalization).
    • $\operatorname{rowmax}(\cdot)$: Row-wise maximum.
    • $\operatorname{rowsum}(\cdot)$: Row-wise sum. The algorithm efficiently computes attention by iterating through selected blocks and processing tokens in smaller chunks, accumulating partial results while maintaining numerical stability.
  3. Sparse 3D Window: This auxiliary module explicitly incorporates localized feature interactions.

    • Window Partitioning: The input tokens are partitioned into windowed regions of size $m_{\mathrm{win}}^3$.
    • Contextual Computation: For each token, its contextual computation is formulated by dynamically aggregating the active tokens within its corresponding window to form $\mathbf{k}_t^{\mathrm{win}}$ and $\mathbf{v}_t^{\mathrm{win}}$.
    • Localized Self-Attention: Self-attention is then calculated exclusively over this constructed token subset, ensuring local detail preservation.
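As referenced above, the gated combination of the three branches can be sketched as follows. This is only an illustration of the aggregation arithmetic: in the actual method the compressed, selected, and window key/value sets are built per query block by custom Triton kernels, whereas here each branch receives a precomputed key/value pair, and all shapes and the use of `scaled_dot_product_attention` are assumptions of this sketch rather than the authors' implementation.

```python
import torch
import torch.nn.functional as F

def ssa_gated_combine(q, kv_cmp, kv_slc, kv_win, gate_logits):
    """Illustrative gated aggregation of the three SSA branches for one head.

    q:              (N, d) query tokens.
    kv_cmp/slc/win: (key, value) pairs, each of shape (M_x, d), standing in for the
                    outputs of the compression, selection, and window modules.
    gate_logits:    (N, 3) per-token gating logits from a linear layer on the inputs.
    """
    gates = torch.sigmoid(gate_logits)       # omega_cmp, omega_slc, omega_win in [0, 1]
    branch_outputs = []
    for k, v in (kv_cmp, kv_slc, kv_win):
        # Add a leading batch dim; scaled_dot_product_attention expects (..., L, d).
        o = F.scaled_dot_product_attention(q[None], k[None], v[None])[0]  # (N, d)
        branch_outputs.append(o)
    out = torch.stack(branch_outputs, dim=-1)      # (N, d, 3)
    return (out * gates[:, None, :]).sum(dim=-1)   # gate-weighted sum over branches
```

In the paper, the selected and window branches restrict each query to its top-k blocks and its local window respectively; the sketch above ignores that per-query structure for brevity.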

4.2.3. Sparse Conditioning Mechanism

Existing image-to-3D models ([17, 39, 49]) typically use DINO-v2 [29] to extract pixel-level features from conditional images and then perform cross-attention. However, background regions (often more than half the image) introduce unnecessary computational overhead and can degrade alignment. The sparse conditioning mechanism addresses this by selectively extracting and processing sparse foreground tokens from input images for cross-attention. Given an input image $\mathcal{T}$, the sparse conditioning tokens $\mathbf{c}$ are computed as follows: $ \mathbf{c} = \mathrm{Linear}(f(E_{\mathrm{DINO}}(\mathcal{T}))) + \mathrm{PE}(f(E_{\mathrm{DINO}}(\mathcal{T}))) $ Where:

  • $\mathbf{c}$: The finalized sparse conditioning tokens.
  • $\mathcal{T}$: The input image.
  • $E_{\mathrm{DINO}}$: The DINO-v2 encoder [29], which extracts rich visual features.
  • $f(\cdot)$: An operation that extracts foreground tokens based on a mask (presumably generated to segment the foreground object).
  • $\mathrm{PE}(\cdot)$: Absolute positional encoding, which adds spatial information to the extracted features.
  • $\mathrm{Linear}(\cdot)$: A linear layer that projects the features into the appropriate dimension. These sparse conditioning tokens $\mathbf{c}$ are then used to perform cross-attention with the noisy latent tokens in the DiT.
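A minimal sketch of this sparse conditioning step is given below, assuming DINO-v2 patch features and a foreground mask are already available; the feature dimensions, the learned positional embedding, and the module interface are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class SparseConditioner(nn.Module):
    """Keep only foreground image tokens, project them, and add positional encoding."""

    def __init__(self, dino_dim: int = 1024, cond_dim: int = 768, max_patches: int = 4096):
        super().__init__()
        self.proj = nn.Linear(dino_dim, cond_dim)
        self.pos_emb = nn.Embedding(max_patches, cond_dim)  # one embedding per patch index

    def forward(self, dino_tokens: torch.Tensor, fg_mask: torch.Tensor) -> torch.Tensor:
        """dino_tokens: (P, dino_dim) patch features from the image encoder.
        fg_mask: (P,) boolean mask marking foreground patches."""
        idx = fg_mask.nonzero(as_tuple=True)[0]        # indices of foreground patches
        tokens = dino_tokens[idx]                      # discard background tokens
        return self.proj(tokens) + self.pos_emb(idx)   # c = Linear(f(.)) + PE(f(.))
```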

4.2.4. Rectified Flow

The SS-DiT is trained using the rectified flow objective [10, 19]. Rectified flow defines a forward process as a linear trajectory between the data distribution and a standard normal distribution: $ \mathbf{x}(t) = (1 - t)\mathbf{x}_0 + t\epsilon $ Where:

  • $\mathbf{x}(t)$: The noisy sample at timestep $t$.
  • $\mathbf{x}_0$: The original data sample (the latent representation $\mathbf{z}$ from the SS-VAE).
  • $\epsilon$: A standard Gaussian noise vector.
  • $t$: The timestep, ranging from 0 to 1. The generative model is trained to predict the velocity field that pushes noisy samples $\mathbf{x}(t)$ back towards the data distribution. The training loss uses conditional flow matching, formulated as: $ \mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{t, \mathbf{x}_0, \epsilon} \left\| \mathbf{v}_\theta(\mathbf{x}_t, \mathbf{c}, t) - (\epsilon - \mathbf{x}_0) \right\|_2^2 $ Where:
  • $\mathcal{L}_{\mathrm{CFM}}$: The conditional flow matching loss.
  • $\mathbb{E}_{t, \mathbf{x}_0, \epsilon}$: Expectation over timesteps $t$, data samples $\mathbf{x}_0$, and noise $\epsilon$.
  • $\mathbf{v}_\theta(\mathbf{x}_t, \mathbf{c}, t)$: The SS-DiT network, parameterized by $\theta$, which predicts the velocity field at state $\mathbf{x}_t$, conditioned on the sparse conditioning tokens $\mathbf{c}$, at timestep $t$.
  • $(\epsilon - \mathbf{x}_0)$: The target velocity field (the direction and magnitude needed to move from the data to the noise). The objective is to minimize the L2 distance between the predicted and true velocity fields.
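One training step of this objective can be sketched as follows; the model interface `model(x_t, cond, t)` and the batch layout are assumptions made for illustration.

```python
import torch
import torch.nn.functional as F

def rectified_flow_loss(model, x0: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """Conditional flow matching loss for a batch of clean latents x0 of shape (B, N, C)."""
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device)          # one timestep per sample in [0, 1)
    eps = torch.randn_like(x0)                   # standard Gaussian noise
    t_b = t.view(b, *([1] * (x0.dim() - 1)))     # broadcast t over remaining dims
    x_t = (1.0 - t_b) * x0 + t_b * eps           # linear interpolation path
    v_pred = model(x_t, cond, t)                 # predicted velocity field
    return F.mse_loss(v_pred, eps - x0)          # match the target velocity (eps - x0)
```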

5. Experimental Setup

5.1. Datasets

Direct3D-S2 is trained on a combination of publicly available 3D datasets, which undergo a rigorous curation process.

  • Source Datasets:
    • Objaverse [9]: A large-scale collection of annotated 3D objects.
    • Objaverse-XL [8]: An even larger universe of over 10 million 3D objects.
    • ShapeNet [5]: An information-rich 3D model repository.
  • Data Curation: Due to the presence of low-quality meshes in these raw datasets, the authors curated approximately 452k high-quality 3D assets through rigorous filtering.
  • 3D Representation Preparation: Following established geometry processing approaches [49], the original non-watertight meshes are converted into watertight meshes. Subsequently, ground-truth SDF volumes are computed from these watertight meshes, serving as both input and supervision for the SS-VAE.
  • Image Conditioning Data: For training the image-conditioned DiT, 45 RGB images are rendered per mesh.
    • Resolution: Rendered at $1024 \times 1024$ resolution.
    • Camera Parameters: Random camera parameters are used to generate diverse views.
      • Elevation angles: $10^\circ$ to $40^\circ$.
      • Azimuth angles: $[0^\circ, 180^\circ]$.
      • Focal lengths: $30\,\mathrm{mm}$ to $100\,\mathrm{mm}$.
  • Benchmark for Evaluation: To rigorously evaluate the geometric fidelity of meshes generated by Direct3D-S2, a challenging benchmark is established using highly detailed images sourced from professional communities like Neural4D [3], Meshy [2], and CivitAI [1].

5.2. Evaluation Metrics

The geometric fidelity and alignment between generated meshes and conditional input images are quantitatively assessed using three multimodal metrics: ULIP-2, Uni3D, and OpenShape. These metrics are designed to measure the similarity between 3D shapes and their corresponding 2D images or textual descriptions in a shared embedding space. A higher value for these metrics indicates better performance (closer alignment).

5.2.1. ULIP-2

  • Conceptual Definition: ULIP-2 (Unified Language-Image Pre-training 2) is a framework for scalable multimodal pre-training for 3D understanding. It learns a joint embedding space for 3D shapes, images, and text. The metric typically measures the cosine similarity between the embeddings of a generated 3D mesh and its corresponding input image (or text description) within this unified space. A higher cosine similarity implies better alignment between the generated shape and the conditioning input.
  • Mathematical Formula: Let $E_{3D}(\text{mesh})$ be the embedding of a 3D mesh and $E_{IMG}(\text{image})$ be the embedding of an image, both projected into the ULIP-2 shared multimodal embedding space. The ULIP-2 score for a given mesh-image pair is typically the cosine similarity: $ \text{ULIP-2 Score} = \frac{E_{3D}(\text{mesh}) \cdot E_{IMG}(\text{image})}{\|E_{3D}(\text{mesh})\| \, \|E_{IMG}(\text{image})\|} $
  • Symbol Explanation:
    • $E_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the ULIP-2 embedding space.
    • $E_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the ULIP-2 embedding space.
    • \cdot: Dot product operation.
    • \|\cdot\|: L2 norm (magnitude) of a vector.

5.2.2. Uni3D

  • Conceptual Definition: Uni3D aims to explore unified 3D representation at scale by pre-training a model to understand 3D data across various modalities. Similar to ULIP-2, it provides a joint embedding space where 3D shapes can be compared with images or text. The Uni3D metric, when used for evaluation, quantifies the semantic and geometric consistency between a generated 3D object and its input image by measuring the distance or similarity in this learned multimodal embedding space.
  • Mathematical Formula: Let $U_{3D}(\text{mesh})$ be the Uni3D embedding for a 3D mesh and $U_{IMG}(\text{image})$ be the Uni3D embedding for an image. The Uni3D score is typically the cosine similarity: $ \text{Uni3D Score} = \frac{U_{3D}(\text{mesh}) \cdot U_{IMG}(\text{image})}{\|U_{3D}(\text{mesh})\| \, \|U_{IMG}(\text{image})\|} $
  • Symbol Explanation:
    • $U_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the Uni3D embedding space.
    • $U_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the Uni3D embedding space.
    • \cdot: Dot product operation.
    • \|\cdot\|: L2 norm (magnitude) of a vector.

5.2.3. OpenShape

  • Conceptual Definition: OpenShape focuses on scaling up 3D shape representation towards open-world understanding, providing a robust representation that can generalize to novel categories and unseen data. The OpenShape metric evaluates the quality of generated 3D shapes by assessing their alignment with the conditional input (e.g., an image) in a shared semantic space. It measures how well the generated shape captures the visual and semantic content of the input.
  • Mathematical Formula: Let $O_{3D}(\text{mesh})$ be the OpenShape embedding for a 3D mesh and $O_{IMG}(\text{image})$ be the OpenShape embedding for an image. The OpenShape score is typically the cosine similarity: $ \text{OpenShape Score} = \frac{O_{3D}(\text{mesh}) \cdot O_{IMG}(\text{image})}{\|O_{3D}(\text{mesh})\| \, \|O_{IMG}(\text{image})\|} $
  • Symbol Explanation:
    • $O_{3D}(\text{mesh})$: The feature vector (embedding) of the generated 3D mesh in the OpenShape embedding space.
    • $O_{IMG}(\text{image})$: The feature vector (embedding) of the input 2D image in the OpenShape embedding space.
    • \cdot: Dot product operation.
    • \|\cdot\|: L2 norm (magnitude) of a vector.
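Since all three metrics reduce to a cosine similarity between a shape embedding and an image embedding produced by the respective pretrained encoders, the scoring step itself is straightforward; the snippet below illustrates only that final comparison (the encoders are not shown and are assumed to be available).

```python
import torch
import torch.nn.functional as F

def alignment_score(mesh_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between a (D,) 3D-shape embedding and a (D,) image embedding."""
    return F.cosine_similarity(mesh_emb[None], image_emb[None], dim=-1)[0]
```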

5.3. Baselines

The Direct3D-S2 method is compared against several state-of-the-art image-to-3D approaches and internal ablation configurations:

  • For Quantitative Image-to-3D Comparison (Table 2):
    • Trellis [40]: A method that integrates sparse voxel representations and uses rendering supervision for VAE training, and structured 3D latents for scalable generation.
    • Hunyuan3D 2.0 [51]: A diffusion model for high-resolution textured 3D asset generation.
    • TripoSG [18]: A method for high-fidelity 3D shape synthesis using large-scale rectified flow models.
    • Hi3DGen [45]: Focuses on high-fidelity 3D geometry generation from images via normal bridging.
  • For Qualitative VAE Reconstruction Comparison (Figure 6):
    • Trellis [40]
    • XCube [32]: A method for large-scale 3D generative modeling using sparse voxel hierarchies.
    • Dora [6]: A method focusing on sampling and benchmarking for 3D shape variational auto-encoders.
  • For Ablation Studies on SSA (Figure 10):
    • Full attention: A variant that applies full attention but uses Trellis's latent packing strategy to group tokens in $2^3$ local regions to reduce the token count. This serves as a proxy for what a full attention DiT might attempt to do to cope with high token counts.
    • NSA (Native Sparse Attention): An implementation of NSA [46] that processes latent tokens as 1D sequences with fixed-length block partitioning, explicitly disregarding 3D spatial coherence.
  • For Ablation Studies on SSA Modules (Figure 9):
    • win: Only the sparse 3D window module.
    • win + cmp: Sparse 3D window combined with sparse 3D compression.
    • cmp + slc: Sparse 3D compression combined with spatial blockwise selection (without the window).
  • For Ablation Studies on Sparse Conditioning (Figure 11):
    • A baseline without the sparse conditioning mechanism, implying cross-attention is performed on all DINO-v2 features from the input image, including background.

      These baselines represent state-of-the-art or relevant internal configurations to demonstrate the effectiveness and necessity of Direct3D-S2's proposed components.

6. Results & Analysis

6.1. Core Results Analysis

The experimental results demonstrate that Direct3D-S2 achieves superior performance in high-resolution 3D generation, both quantitatively and qualitatively, while significantly reducing computational demands.

Quantitative Superiority in Image-to-3D Task: The following are the results from Table 2 of the original paper:

Methods ULIP-2 ↑ Uni3D ↑ OpenShape ↑
Trellis [40] 0.2825 0.3755 0.1732
Hunyuan3D 2.0 [51] 0.2535 0.3738 0.1699
TripoSG [18] 0.2626 0.3870 0.1728
Hi3DGen [45] 0.2725 0.3723 0.1689
Ours 0.3111 0.3931 0.1752

As shown in Table 2, Direct3D-S2 consistently outperforms all compared state-of-the-art methods across all three multimodal evaluation metrics (ULIP-2, Uni3D, OpenShape). This indicates that the meshes generated by Direct3D-S2 exhibit better alignment and consistency with the input images, validating its effectiveness in capturing both high-level semantics and fine-grained geometric details. For instance, Direct3D-S2 achieves a ULIP-2 score of 0.3111, notably higher than the second-best Trellis at 0.2825.

Qualitative Superiority in Detail Generation: The qualitative comparisons (Figure 4, Figure 12) strongly highlight Direct3D-S2's advantage. While other methods produce generally satisfactory results, they often struggle with capturing intricate structures due to resolution limitations. Direct3D-S2, thanks to its SSA mechanism, can generate highly detailed meshes, demonstrating superior quality even for complex elements like railings, tree branches, or delicate character features. This is a direct consequence of its ability to handle gigascale resolutions effectively.

The following figure (Figure 8 from the original paper) shows qualitative comparisons of different image-to-3D reconstruction methods in Figure 4 of the paper. The left column shows input images, the middle columns present normal maps and zoomed-in details from five methods, and the rightmost column shows the results of the proposed method with richer details closely matching the input.

Figure 4. Qualitative comparisons between other image-to-3D methods and our approach.

The following figure (Figure 13 from the original paper) is a multi-row, multi-column 3D model comparison showing differences in modeling detail and reconstruction quality between Direct3D-S2 and other 3D generation methods (Trellis, Hunyuan-2.0, TripoSG, Hi3DGen). The left column shows the original color images, while the right columns present the grayscale 3D models generated by each method, highlighting the superior detail representation of Direct3D-S2.


The following figure (Figure 14 from the original paper) is a schematic illustration comparing different models (Model N, M, R, T, and the authors' method) in generating 3D models from six sets of 2D character images, highlighting the superiority of the authors' method in detail restoration and shape consistency.


User Study Validation: The user study conducted with 40 participants (Figure 5) further corroborates the model's superior performance. Direct3D-S2 achieved the highest scores across both image consistency and overall geometric quality, demonstrating statistically significant superiority over other approaches. This indicates that human evaluators perceive Direct3D-S2's outputs as more faithful to the input images and geometrically sound.

The following figure (Figure 9 from the original paper) is a chart showing user rating comparisons of different methods on image consistency and overall quality. The x-axis represents "Image Consistency" and "Overall Quality", and the y-axis shows rating values, with the "Ours" method achieving the highest scores in both metrics.

Figure 5. User Study for Image-to-3D Generation.

Superior VAE Reconstruction Quality: The SS-VAE demonstrates superior reconstruction accuracy (Figure 6), especially for complex geometries at $1024^3$ resolution. Crucially, this is achieved with significantly reduced training cost: 2 days on 8 A100 GPUs, whereas competing methods typically require at least 32 GPUs for comparable training durations. This underscores the efficiency and stability gains from the unified sparse SDF reconstruction framework.

The following figure (Figure 10 from the original paper) is a comparative visualization of 3D model reconstruction results, including Trellis, XCube, Dora, and the proposed method at different resolutions, alongside the ground truth, demonstrating superior detail and shape consistency of the proposed approach.


6.2. Data Presentation (Tables)

The following are the results from Table 1 of the original paper:

Res. NT LR BS TT
$256^3$ ≈2058 1e-4 8×8 2 days
$384^3$ ≈5510 1e-4 8×8 2 days
$512^3$ ≈10655 5e-5 8×8 2 days
$1024^3$ ≈45904 2e-5 2×8 1 day

Table 1 outlines the training configurations for the DiT at different voxel resolutions. Res. denotes resolution, NT the approximate number of latent tokens, LR the learning rate, BS the batch size (batch size per GPU × number of GPUs), and TT the total training time. As resolution increases, the number of tokens grows significantly, necessitating adjustments to the learning rate and batch size to manage the computational load. Notably, even at $1024^3$ resolution, the model can be trained with a relatively small batch size (2 samples per GPU on 8 GPUs) in a short time (1 day), highlighting its efficiency.
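
For quick reference, the configurations in Table 1 can be written down as a small Python dictionary. The values are copied from the table; the field names and the per-GPU reading of the BS column are our own shorthand, not the paper's notation:

```python
# Table 1 training configurations, keyed by voxel resolution.
# Field names are our own shorthand; "BS" is read as batch-per-GPU x GPUs.
DIT_TRAIN_CONFIGS = {
    256:  {"latent_tokens": 2058,  "lr": 1e-4, "batch_per_gpu": 8, "gpus": 8, "train_time": "2 days"},
    384:  {"latent_tokens": 5510,  "lr": 1e-4, "batch_per_gpu": 8, "gpus": 8, "train_time": "2 days"},
    512:  {"latent_tokens": 10655, "lr": 5e-5, "batch_per_gpu": 8, "gpus": 8, "train_time": "2 days"},
    1024: {"latent_tokens": 45904, "lr": 2e-5, "batch_per_gpu": 2, "gpus": 8, "train_time": "1 day"},
}
```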

6.3. Ablation Studies / Parameter Analysis

The paper conducts several ablation studies to validate the effectiveness of its core components.

6.3.1. Image-to-3D Generation at Different Resolutions

The authors present generation results across four resolutions $\{256^3, 384^3, 512^3, 1024^3\}$ (Figure 8). The following figure (Figure 12 from the original paper) illustrates Figure 8 of the paper, showing the image-to-3D generation results of the Direct3D-S2 model at $256^3$, $384^3$, $512^3$, and $1024^3$ resolution. The leftmost column shows the input images, followed by the generated 3D shapes at increasing resolutions.

Figure 8. Image-to-3D generation results of Direct3D-S2 at $256^3$, $384^3$, $512^3$, and $1024^3$ resolution; the input image is on the left, followed by the shapes generated at increasing resolutions.

The results clearly show that increasing resolution progressively improves mesh quality. At lower resolutions ($256^3$ and $384^3$), meshes exhibit limited geometric details and sometimes misalignment. At $512^3$, details are enhanced. At $1024^3$, meshes achieve sharper edges and superior alignment with input image details, demonstrating the direct impact of higher resolution on fidelity and the model's ability to leverage it.

6.3.2. Effect of Each Module in SSA

Ablation studies on the three modules of SSA were conducted at $512^3$ resolution (Figure 9). The following figure (Figure 13 from the original paper) illustrates the shape reconstruction results, showing the effect of the sparse 3D window (win), compression (cmp), and selection (slc) modules on the surface normals of the generated models. The left side shows the input images, and the right side shows normal visualizations under various module combinations, demonstrating how the combination of modules improves detail restoration.


  • win (Sparse 3D Window only): Generated meshes showed detailed structures but suffered from surface irregularities due to a lack of global context modeling.
  • win + cmp (Window + Compression): Introducing the sparse 3D compression module showed minimal performance changes in terms of output quality. This is expected, as cmp primarily generates block-level representations used to compute attention scores for selection, rather than directly improving detail.
  • win + cmp + slc (Full SSA): After incorporating the spatial blockwise selection module, the model could focus on globally important regions, leading to a notable improvement in mesh quality, resulting in smooth and geometrically consistent surfaces.
  • cmp + slc (Compression + Selection, no Window): Not utilizing the window module did not cause a significant drop in model performance, but it slowed convergence. This suggests that while global compression and selection are crucial for quality, the local feature interaction provided by the window module contributes to more stable and faster training convergence. A conceptual sketch of how the three branches combine is given after this list.
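
To make the roles of the three branches concrete, the sketch below gives a dense, single-head PyTorch reference of how win, cmp, and slc could be combined. It is not the authors' Triton kernel: the mean-pooled block compression, the equal-weight averaging of the branch outputs (the paper, like NSA, presumably uses learned gating), and the block/window sizes are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def ssa_reference(q, k, v, block_ids, coords, top_k=4, window=4):
    """Dense single-head reference sketch of the three SSA branches.

    q, k, v   : (N, D) per-token queries / keys / values
    block_ids : (N,)   index of the 3D block each token falls into
    coords    : (N, 3) integer voxel coordinates of each token
    Not the authors' kernel: compression uses mean pooling and the three
    branch outputs are averaged instead of gated.
    """
    N, D = q.shape
    scale = D ** -0.5
    num_blocks = int(block_ids.max().item()) + 1

    # cmp: mean-pool tokens into block-level keys/values and attend to them.
    k_cmp = torch.zeros(num_blocks, D, dtype=k.dtype).index_add_(0, block_ids, k)
    v_cmp = torch.zeros(num_blocks, D, dtype=v.dtype).index_add_(0, block_ids, v)
    counts = torch.bincount(block_ids, minlength=num_blocks).clamp(min=1).unsqueeze(1)
    k_cmp, v_cmp = k_cmp / counts, v_cmp / counts
    scores_cmp = (q @ k_cmp.T) * scale                      # (N, num_blocks)
    out_cmp = F.softmax(scores_cmp, dim=-1) @ v_cmp

    # slc: keep the top-k blocks per query (plus the query's own block)
    # and attend only to the tokens inside those blocks.
    topk_blocks = scores_cmp.topk(min(top_k, num_blocks), dim=-1).indices
    keep = (block_ids.unsqueeze(0) == topk_blocks.unsqueeze(-1)).any(dim=1)
    keep |= block_ids.unsqueeze(1) == block_ids.unsqueeze(0)
    scores_tok = (q @ k.T) * scale
    out_slc = F.softmax(scores_tok.masked_fill(~keep, float("-inf")), dim=-1) @ v

    # win: local attention restricted to a cubic 3D neighbourhood.
    near = (coords.unsqueeze(1) - coords.unsqueeze(0)).abs().amax(dim=-1) < window
    out_win = F.softmax(scores_tok.masked_fill(~near, float("-inf")), dim=-1) @ v

    return (out_cmp + out_slc + out_win) / 3.0
```

Here `block_ids` would come from a coordinate-based 3D partitioning of the active voxels, consistent with the spatial coherence argument made for SSA in Section 6.3.4 below.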

6.3.3. Runtime of Different Attention Mechanisms

A custom Triton [37] GPU kernel was implemented for SSA, and its forward and backward execution times were compared with FlashAttention-2 [7] (using Xformers [15]). The following figure (Figure 11 from the original paper) consists of two bar charts comparing the forward and backward computation times of Spatial Sparse Attention and FlashAttention-2 at varying numbers of tokens, highlighting the significant acceleration of SSA for large-scale token processing.

Figure 7. Comparison of the forward and backward time of our SSA and FlashAttention-2.

The results (Figure 7) indicate that SSA performs comparably to FlashAttention-2 for a low number of tokens. However, as the number of tokens increases, SSA's speed advantage becomes pronounced. Specifically, at 128k tokens (relevant for $1024^3$ resolution), SSA achieves a 3.9x speedup in the forward pass and an impressive 9.6x speedup in the backward pass. This demonstrates the efficiency of the proposed SSA mechanism for gigascale data.
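
The measurement setup can be reproduced in spirit with a small timing harness like the one below. It uses PyTorch's scaled_dot_product_attention (which dispatches to a FlashAttention-style kernel on recent GPUs) as a dense baseline and assumes a CUDA device; the SSA Triton kernel itself is not included, so this only illustrates how such forward/backward timings are taken, not the reported speedups.

```python
import time
import torch
import torch.nn.functional as F

def bench_attention(attn_fn, num_tokens, heads=16, dim=64, device="cuda"):
    """Time one forward and one backward pass of `attn_fn` at a given token count."""
    q, k, v = (torch.randn(1, heads, num_tokens, dim, device=device,
                           dtype=torch.float16, requires_grad=True)
               for _ in range(3))
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    out = attn_fn(q, k, v)                    # forward pass
    torch.cuda.synchronize()
    t_fwd = time.perf_counter() - t0

    t0 = time.perf_counter()
    out.sum().backward()                      # backward pass
    torch.cuda.synchronize()
    t_bwd = time.perf_counter() - t0
    return t_fwd, t_bwd

if __name__ == "__main__":
    # Token counts up to ~128k, the scale relevant for 1024^3 generation.
    for n in (8_192, 32_768, 131_072):
        fwd, bwd = bench_attention(F.scaled_dot_product_attention, n)
        print(f"{n:>7} tokens  forward {fwd*1e3:7.1f} ms  backward {bwd*1e3:7.1f} ms")
```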

6.3.4. Effectiveness of SSA

Ablation studies were conducted at $512^3$ resolution to validate the robustness of SSA (Figure 10). The following figure (Figure 2 from the original paper) illustrates the effect of the proposed SSA mechanism, compared with full attention and NSA, on the normal maps of generated 3D models, featuring two groups of models of different complexity and highlighting SSA's advantage in detail preservation.

Figure 10. Ablation studies of our proposed SSA mechanism.

  • Full attention (with latent packing): This variant produced meshes with high-frequency surface artifacts. This is attributed to its forced packing operation, which disrupts local geometric continuity by aggregating tokens in a manner that doesn't respect fine-grained spatial relationships.
  • NSA (Native Sparse Attention): The NSA implementation exhibited training instability due to positional ambiguity in its 1D block partitioning, leading to less smooth and lower-quality meshes. This highlights the importance of 3D spatial coherence in sparse attention for volumetric data.
  • Our proposed SSA: In contrast, SSA not only preserved the fine details of the meshes but also yielded smoother and more organized surfaces. This validates the effectiveness of SSA in handling sparse 3D data efficiently and accurately by respecting spatial coherence during attention computation (a sketch of coordinate-based 3D block partitioning versus 1D chunking follows this list).
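
The difference between NSA-style 1D chunking and coordinate-aware 3D partitioning can be illustrated with a few lines of PyTorch. The block sizes and the use of torch.unique to assign block ids are illustrative choices, not the paper's exact implementation.

```python
import torch

def partition_blocks_3d(coords, block_size=8):
    """Group sparse-voxel tokens into 3D blocks by their coordinates.

    coords : (N, 3) integer voxel coordinates of the active voxels.
    Tokens that are close in 3D space share a block, regardless of their
    position in the token list.
    """
    block_coords = coords // block_size                     # (N, 3) block indices
    # Collapse each occupied 3D block index into a single block id.
    _, block_ids = torch.unique(block_coords, dim=0, return_inverse=True)
    return block_ids

def partition_blocks_1d(num_tokens, block_size=64):
    """NSA-style 1D chunking by token position: spatially unaware."""
    return torch.arange(num_tokens) // block_size

# Two voxels that are adjacent in space but far apart in the token list end
# up in the same 3D block, while 1D chunking separates them.
coords = torch.tensor([[0, 0, 0], [100, 3, 7], [1, 0, 0]])
print(partition_blocks_3d(coords))            # tensor([0, 1, 0])
print(partition_blocks_1d(3, block_size=2))   # tensor([0, 0, 1])
```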

6.3.5. Effect of Sparse Conditioning Mechanism

Ablation experiments were performed at $512^3$ resolution to assess the sparse conditioning mechanism (Figure 11). The following figure (Figure 3 from the original paper) shows the ablation results for the sparse conditioning mechanism, comparing the detail of 3D models generated without sparse conditioning (w/o sparse conditioning) and with it (w/ sparse conditioning), and highlighting the improved detail preservation when sparse conditioning is used.

Figure 11. Ablation studies for sparse conditioning mechanism.

The results demonstrate that by excluding non-foreground conditioning tokens, the generated meshes achieve notably better alignment with the input images. This confirms that focusing the cross-attention mechanism on relevant foreground features improves the conditioning process, leading to more accurate and higher-quality 3D outputs.
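
A minimal sketch of such foreground-token filtering is given below, assuming patch tokens from a ViT-style image encoder and a foreground (alpha) mask for the input image. The pooling scheme, threshold, and patch-grid layout are illustrative assumptions rather than the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def filter_foreground_tokens(patch_feats, alpha_mask, patch_grid, keep_thresh=0.5):
    """Drop conditioning tokens whose image patches are mostly background.

    patch_feats : (P, C) patch tokens from an image encoder
    alpha_mask  : (H, W) foreground mask of the input image, values in [0, 1]
    patch_grid  : (gh, gw) number of patches per side, with gh * gw == P
    """
    gh, gw = patch_grid
    # Average foreground coverage per patch via adaptive pooling.
    coverage = F.adaptive_avg_pool2d(alpha_mask[None, None].float(), (gh, gw))
    keep = coverage.flatten() > keep_thresh                 # (P,) boolean
    return patch_feats[keep]                                # foreground tokens only

# Example: 14 x 14 = 196 patch tokens, of which only foreground ones are kept.
feats = torch.randn(196, 1024)
mask = torch.zeros(224, 224)
mask[60:160, 80:180] = 1.0                                  # synthetic foreground box
cond_tokens = filter_foreground_tokens(feats, mask, (14, 14))
print(cond_tokens.shape)   # fewer than 196 tokens are passed to cross-attention
```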

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces Direct3D-S2, a novel and highly effective framework for high-resolution 3D shape generation. The core of its innovation lies in the Spatial Sparse Attention (SSA) mechanism, which dramatically accelerates the training and inference speeds of Diffusion Transformers (DiT) when processing sparse volumetric data. Complementing this, the Direct3D-S2 framework incorporates a fully end-to-end symmetric sparse SDF VAE that ensures consistent volumetric representation across all stages (input, latent, output), thereby enhancing training stability and efficiency.

The extensive experimental results unequivocally demonstrate that Direct3D-S2 outperforms existing state-of-the-art image-to-3D methods in terms of generation quality. Crucially, it achieves this while requiring significantly fewer computational resources, notably enabling training at an unprecedented $1024^3$ resolution using only 8 GPUs, a task previously requiring substantially more hardware even for lower resolutions. This makes gigascale 3D generation both practical and accessible.

7.2. Limitations & Future Work

The authors acknowledge a key limitation of the proposed Spatial Sparse Attention (SSA): the forward pass exhibits a notably smaller acceleration ratio compared to the backward pass. This discrepancy is primarily attributed to the computational overhead introduced by topk sorting operations during the forward pass, which are necessary for selecting the most relevant blocks.

For future work, the authors prioritize optimizing these topk sorting operations. Reducing this overhead would further enhance the efficiency of the forward pass, making the SSA mechanism even more balanced and powerful across both training and inference.
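
The bottleneck the authors point to is the per-query top-k selection over block-level scores that precedes the sparse attention itself. A minimal illustration, with made-up sizes, is shown below.

```python
import torch

# Minimal illustration (made-up sizes) of the block-selection step identified
# as the forward-pass bottleneck: each query must select its top-k blocks
# from the compressed attention scores before sparse attention runs.
num_queries, num_blocks, top_k = 32_768, 1_024, 16
scores_cmp = torch.randn(num_queries, num_blocks)           # block-level scores
topk_scores, topk_idx = scores_cmp.topk(top_k, dim=-1)      # (num_queries, top_k)
# topk_idx then drives the gather of tokens for the selection branch;
# approximate top-k or fused selection kernels are the kind of optimization
# suggested as future work.
```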

7.3. Personal Insights & Critique

Direct3D-S2 represents a significant leap forward in the field of 3D content generation. The ability to train gigascale ($1024^3$) 3D models on just 8 GPUs is a groundbreaking achievement that democratizes high-fidelity 3D creation. This drastically lowers the barrier to entry for researchers and practitioners, moving 3D generation from the realm of massive compute clusters to more accessible setups.

The paper's methodological innovations are well-reasoned and elegantly address long-standing challenges:

  1. Unified Sparse Volumetric VAE: The symmetric design of the SS-VAE is highly commendable. By maintaining a consistent sparse volumetric format throughout, it cleverly sidesteps the inefficiencies and potential geometric approximations introduced by heterogeneous representations common in prior works. This holistic approach to data representation likely contributes significantly to both stability and fidelity.

  2. Spatial Sparse Attention (SSA): This is the true centerpiece. The adaptation of sparse attention principles from 1D sequences (NSA) to the complexities of sparse 3D data is a sophisticated technical contribution. The explicit consideration of 3D spatial coherence through coordinate-based block partitioning and the thoughtful combination of compression, selection, and windowing modules showcase a deep understanding of the problem space. The custom Triton kernel implementation further demonstrates a commitment to practical, high-performance solutions.

  3. Sparse Conditioning Mechanism: This is a subtle but impactful improvement. Recognizing and mitigating the noise and inefficiency introduced by background pixels in conditional images is a practical refinement that directly contributes to better alignment and reduced computational waste.

    Potential Issues and Areas for Improvement:

  • The topk sorting overhead in the forward pass is a valid limitation. While the backward pass benefits immensely, further optimizing the forward pass would solidify SSA's efficiency. Techniques like approximate topk or hardware-aware sorting optimizations could be explored.

  • Generalizability of Sparsity: The effectiveness of sparse volumetric representations heavily relies on the assumption that most of the 3D space is empty. For highly complex, dense objects or scenes (e.g., volumetric data like clouds or granular materials), the benefits of sparsity might diminish, or the definition of "active" voxels might become more challenging. The current approach focuses on SDFs, which inherently represent surfaces and thus have sparse "active" regions.

  • Topology Flexibility: While SDFs are powerful for representing complex shapes, they can sometimes be more restrictive for specific topological operations compared to explicit mesh manipulation. Future work could explore how this high-resolution sparse SDF generation could be seamlessly integrated with mesh editing tools or other explicit representations.

  • Data Quality Dependence: The paper highlights the rigorous curation of 452k high-quality 3D assets. This underscores the critical role of data quality in achieving gigascale results. While Direct3D-S2 is powerful, its performance will likely remain highly dependent on the quality and diversity of the underlying 3D datasets.

    Inspirationally, Direct3D-S2 exemplifies how deep architectural insights combined with hardware-aware optimization can lead to paradigm-shifting capabilities. Its methods could be transferable to other domains involving large, sparse data structures (e.g., medical imaging, scientific simulations, or even large-scale graph processing), where efficient attention mechanisms for irregular data are crucial. The emphasis on unified representations and specialized attention mechanisms offers a blueprint for future gigascale generative models beyond 3D.
