GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation
TL;DR Summary
This study introduces GeoLRM, a geometry-aware large reconstruction model that efficiently generates high-quality 3D assets with 512k Gaussians from 21 input images using only 11 GB of GPU memory. It addresses limitations of existing methods by utilizing a novel 3D-aware transformer that operates on sparse 3D anchor points and integrates image features via deformable cross-attention.
Abstract
In this work, we introduce the Geometry-Aware Large Reconstruction Model (GeoLRM), an approach which can predict high-quality assets with 512k Gaussians and 21 input images in only 11 GB GPU memory. Previous works neglect the inherent sparsity of 3D structure and do not utilize explicit geometric relationships between 3D and 2D images. This limits these methods to a low-resolution representation and makes it difficult to scale up to the dense views for better quality. GeoLRM tackles these issues by incorporating a novel 3D-aware transformer structure that directly processes 3D points and uses deformable cross-attention mechanisms to effectively integrate image features into 3D representations. We implement this solution through a two-stage pipeline: initially, a lightweight proposal network generates a sparse set of 3D anchor points from the posed image inputs; subsequently, a specialized reconstruction transformer refines the geometry and retrieves textural details. Extensive experimental results demonstrate that GeoLRM significantly outperforms existing models, especially for dense view inputs. We also demonstrate the practical applicability of our model with 3D generation tasks, showcasing its versatility and potential for broader adoption in real-world applications. The project page: https://linshan-bin.github.io/GeoLRM/.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
GeoLRM: Geometry-Aware Large Reconstruction Model for High-Quality 3D Gaussian Generation. The title clearly states the paper's central contribution: a new model named GeoLRM for generating high-quality 3D models represented as 3D Gaussians. Key features highlighted are its "Geometry-Aware" nature and its classification as a "Large Reconstruction Model."
1.2. Authors
The authors are Chubin Zhang, Hongliang Song, Yi Wei, Yu Chen, Jiwen Lu, and Yansong Tang. Their affiliations include Tsinghua University (Tsinghua Shenzhen International Graduate School and Department of Automation) and Alibaba Group. This collaboration between a top academic institution and a major tech company suggests a strong blend of theoretical research and practical application focus, particularly in areas like computer graphics, computer vision, and large-scale model development. Jiwen Lu and Yansong Tang are well-known researchers in computer vision and deep learning.
1.3. Journal/Conference
The paper was published on arXiv as a preprint. ArXiv is a widely used open-access repository for scientific papers, often used by researchers to share their work before or during the formal peer-review process for a conference or journal. The specific conference or journal it was submitted to is not mentioned, but given the topic and timing, it would be a strong candidate for top-tier computer vision conferences like CVPR, ICCV, ECCV, or ICLR.
1.4. Publication Year
The first version was submitted to arXiv on June 21, 2024.
1.5. Abstract
The abstract introduces the Geometry-Aware Large Reconstruction Model (GeoLRM), a method for generating high-quality 3D assets. The model can produce assets with 512,000 Gaussians from 21 input images, using only 11 GB of GPU memory. The authors identify two key limitations in prior work: neglecting the sparsity of 3D structures and failing to use explicit geometric relationships between 3D and 2D. GeoLRM addresses this with a novel 3D-aware transformer that directly processes 3D points and uses deformable cross-attention to integrate image features. The methodology is a two-stage pipeline: a lightweight proposal network first generates sparse 3D anchor points, and then a reconstruction transformer refines these points to produce detailed geometry and texture. The paper claims significant outperformance over existing models, especially with dense input views, and demonstrates its practical use in 3D generation tasks.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2406.15333
- PDF Link: https://arxiv.org/pdf/2406.15333v2.pdf
- Publication Status: This is a preprint available on arXiv and has not yet been formally published in a peer-reviewed venue at the time of this analysis.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the efficient and high-quality generation of 3D assets from a set of 2D images, a task known as Multi-View 3D Reconstruction. While traditional methods are slow and optimization-based, recent feed-forward Large Reconstruction Models (LRMs) have accelerated this process. However, existing LRMs face two major challenges:
- Inefficient Representation: Many models use dense representations like triplanes, which allocate memory and computation to empty space. The paper highlights that visible surfaces in typical 3D objects occupy only about 5% of the total volume, making dense representations wasteful. Other methods using pixel-aligned Gaussians are too dependent on input views and struggle to reconstruct unseen areas.
- Ineffective Feature Fusion: Previous models often ignore the explicit geometric projection relationship between 3D space and 2D images. They typically use dense attention mechanisms where every 3D query attends to all image features, which is computationally expensive and inefficient. This limitation prevents them from effectively leveraging a larger number of input views to improve quality.

The paper's innovative idea is to design a geometry-aware model that directly tackles these issues. Instead of dense representations, GeoLRM operates on a sparse set of 3D anchor points. It then uses deformable cross-attention, which leverages the known camera poses to project these 3D points onto the 2D images and focus attention only on relevant local features. This geometry-aware attention mechanism is far more efficient and allows the model to scale to a much larger number of input views, leading to higher-quality reconstructions.
2.2. Main Contributions / Findings
The paper presents the following primary contributions:
- A Two-Stage Sparse Pipeline: GeoLRM introduces a coarse-to-fine pipeline that respects the sparse nature of 3D objects. A lightweight proposal network first identifies occupied regions in 3D space, generating a sparse set of 3D anchor points. This avoids the massive computational cost of processing the entire 3D volume.
- Geometry-Aware Deformable Attention: The model fully exploits the camera projection geometry. By using deformable cross-attention, 3D tokens are projected onto the 2D feature maps, and attention is sampled from local neighborhoods. This drastically reduces the complexity of the attention mechanism from dense to sparse, enabling the use of many more input views.
- Scalability to Dense Inputs: A key finding is that GeoLRM's performance improves as the number of input views increases (up to 21 images tested). This is a significant advantage over previous LRM-based methods, whose performance often stagnates or degrades with more views. This makes GeoLRM suitable for integrating with multi-view synthesis or video generation models.

The main conclusion is that by explicitly incorporating 3D geometric principles into the transformer architecture, it is possible to create a Large Reconstruction Model that is not only more efficient but also scales effectively to produce higher-fidelity 3D assets from denser image sets.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. 3D Gaussian Splatting
3D Gaussian Splatting is a novel 3D scene representation technique that has gained prominence for its ability to render high-quality images in real-time. Instead of representing a scene with a continuous field (like NeRF) or a discrete mesh, it uses a collection of 3D Gaussians. Each Gaussian is a point in 3D space with several properties:
- Position (x, y, z): The center of the Gaussian.
- Covariance Matrix (represented by Scale and Rotation): Defines the shape and orientation of the Gaussian ellipsoid. A 3D scale vector and a rotation quaternion are typically used.
- Color (RGB): The color of the Gaussian.
- Opacity (α): How transparent or opaque the Gaussian is.
To render an image from a specific viewpoint, these 3D Gaussians are "splatted" (projected) onto the 2D image plane, creating 2D Gaussians. These 2D splats are then blended together in depth order to form the final pixel colors. This process is highly parallelizable on GPUs, enabling very fast rendering.
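To make the blending step concrete, here is a toy sketch of front-to-back alpha compositing for a single pixel covered by several already-projected splats. This is a conceptual illustration only (the function name and setup are ours), not the highly optimized CUDA rasterizer used in practice.

```python
import torch

def composite_splats(colors: torch.Tensor, alphas: torch.Tensor, depths: torch.Tensor) -> torch.Tensor:
    """Toy front-to-back alpha compositing for the 2D splats covering one pixel.
    colors: (N, 3), alphas: (N,), depths: (N,) for the N Gaussians hitting the pixel."""
    order = torch.argsort(depths)                       # sort splats from near to far
    colors, alphas = colors[order], alphas[order]
    # Transmittance: how much light survives all splats in front of the current one.
    transmittance = torch.cumprod(torch.cat([torch.ones(1), 1.0 - alphas[:-1]]), dim=0)
    weights = alphas * transmittance
    return (weights[:, None] * colors).sum(dim=0)       # final blended pixel colour
```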
3.1.2. Transformer Architecture
A Transformer is a deep learning model architecture that relies on the self-attention mechanism. It was originally proposed for natural language processing tasks but is now widely used in computer vision. Unlike recurrent neural networks (RNNs) that process data sequentially, transformers process entire sequences of data (tokens) at once. The key component is the self-attention layer, which allows the model to weigh the importance of different tokens in the input sequence when producing a representation for each token.
3.1.3. Deformable Attention
Deformable Attention is an efficient attention mechanism, notably used in models like Deformable DETR. In standard attention, a query attends to all key positions in a feature map. This is computationally expensive, especially for high-resolution images. Deformable attention improves this by having the query attend to only a small, fixed number of key sampling points around a reference point. Crucially, the locations of these sampling points are not fixed but are learned by the network. The model predicts offsets from the reference point, allowing it to "deform" the sampling grid and focus on the most informative regions. In GeoLRM, this is adapted to a 3D-to-2D context: a 3D anchor point is projected onto the 2D image plane (the reference point), and the model learns to sample features from locations around this projection, correcting for geometric inaccuracies.
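As a rough illustration of the 2D case, below is a minimal single-scale deformable attention module. Shapes, names, and the offset scaling are assumptions for the sketch, not the exact Deformable DETR or GeoLRM code: each query predicts K sampling offsets around a given reference point, samples the feature map bilinearly at those locations, and blends the samples with learned attention weights.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleDeformableAttention(nn.Module):
    """Minimal single-scale deformable attention sketch (illustrative assumptions only)."""
    def __init__(self, dim: int, num_points: int = 8):
        super().__init__()
        self.num_points = num_points
        self.offset_head = nn.Linear(dim, num_points * 2)  # (dx, dy) per sampling point
        self.weight_head = nn.Linear(dim, num_points)      # attention weight per sampling point
        self.value_proj = nn.Conv2d(dim, dim, kernel_size=1)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, feat_map):
        # queries:    (B, Q, C) query features
        # ref_points: (B, Q, 2) reference points in normalized [0, 1] image coordinates
        # feat_map:   (B, C, H, W) a single-scale 2D feature map
        B, Q, _ = queries.shape
        K = self.num_points

        offsets = self.offset_head(queries).view(B, Q, K, 2)
        weights = self.weight_head(queries).softmax(dim=-1)            # (B, Q, K)

        # Deform the sampling grid around each reference point (small learned offsets).
        sample_pts = ref_points[:, :, None, :] + 0.05 * offsets.tanh()
        grid = sample_pts * 2.0 - 1.0                                  # grid_sample expects [-1, 1]

        values = self.value_proj(feat_map)
        sampled = F.grid_sample(values, grid, mode="bilinear", align_corners=False)
        # sampled: (B, C, Q, K) -> weighted sum over the K sampling points per query
        out = (sampled * weights[:, None, :, :]).sum(dim=-1)           # (B, C, Q)
        return self.out_proj(out.transpose(1, 2))                      # (B, Q, C)
```

In GeoLRM's 3D-to-2D variant, the reference point comes from projecting a 3D anchor point with the known camera parameters, and the per-view results are further combined with learned view weights.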
3.2. Previous Works
3.2.1. Optimization-based 3D Reconstruction
- Structure-from-Motion (SfM) and Multi-View Stereo (MVS): These are traditional computer vision techniques. SfM estimates the 3D structure of a scene and camera poses from a set of images. MVS then uses these poses to reconstruct a dense 3D model (e.g., a point cloud or mesh). These methods are foundational but can be slow and less robust for complex scenes or textures.
- Neural Radiance Fields (NeRF): NeRF represents a 3D scene as a continuous function (a small MLP) that maps a 3D coordinate and a viewing direction to a color and density. This representation can capture photorealistic details but requires a lengthy per-scene optimization process and is slow to render. 3D Gaussian Splatting was introduced as a faster alternative.
3.2.2. Large Reconstruction Models (LRMs)
LRMs are a new paradigm of feed-forward models trained on large 3D datasets to perform reconstruction in a single pass, eliminating the need for per-scene optimization.
- LRM (The Original): This pioneering work by Hong et al. used a transformer to map 2D image features to an implicit 3D representation called a triplane. A triplane represents a 3D volume using three orthogonal 2D feature planes. While effective, it is a dense representation.
- InstantMesh / CRM: These models also use feed-forward architectures but output different representations. InstantMesh reconstructs a 3D mesh, while CRM uses a convolutional architecture. Both often rely on triplanes or similar dense structures internally, sharing the same limitations of inefficiency and difficulty in scaling with more input views.
- LGM / GRM: These models generate 3D Gaussians. LGM uses a U-Net architecture, and GRM translates image pixels directly to a set of pixel-aligned 3D Gaussians. This pixel-aligned approach tightly couples the 3D representation to the input views, making it hard to reconstruct areas not visible in the input images.
3.2.3. 3D Generation
This area focuses on creating 3D content, often from text prompts or single images.
- Score Distillation Sampling (SDS): Popularized by DreamFusion, this technique uses a pre-trained 2D text-to-image diffusion model to "guide" the optimization of a 3D representation (like NeRF or 3D Gaussians). It does not require 3D data but can suffer from issues like the multi-face Janus problem (e.g., a face appearing on both the front and back of a head).
- 3D-Aware Diffusion Models: Models like Zero-1-to-3 and SV3D are fine-tuned on large datasets of 3D objects to generate novel, consistent views of an object from a single input image. GeoLRM leverages these models by taking their generated multi-view images as input for high-quality 3D reconstruction.
3.3. Technological Evolution
The field has evolved from slow, per-scene optimization methods (SfM, NeRF) to fast, feed-forward models (LRMs) trained on large-scale data. Within LRMs, the evolution is moving from dense, inefficient representations (triplanes) towards more explicit and sparse ones (3D Gaussians). The way 2D features are integrated is also evolving, from dense attention mechanisms to more structured and efficient approaches.
GeoLRM fits into this timeline as a next-generation LRM. It combines the speed of feed-forward models with the explicit, high-quality representation of 3D Gaussians. Its key innovation lies in moving away from both dense triplanes and view-dependent pixel-aligned Gaussians by proposing a sparse, geometry-aware transformer that can effectively leverage dense multi-view inputs.
3.4. Differentiation Analysis
Compared to previous LRMs, GeoLRM is differentiated in two main ways:
- Representation: Instead of triplanes or pixel-aligned Gaussians, GeoLRM uses a sparse set of 3D anchor points that represent occupied space. This is more memory-efficient and scalable than triplanes and less view-dependent than pixel-aligned Gaussians.
- Attention Mechanism: Instead of dense cross-attention between 3D queries and all 2D image features, GeoLRM uses deformable cross-attention guided by camera geometry. This projects 3D points onto 2D feature maps and samples features locally, making the process computationally efficient and allowing it to benefit from more input views. This is a significant departure from methods like LRM and InstantMesh, whose performance does not scale well with more views.

The overall pipeline of GeoLRM is shown in the figure below.
Figure: Overview of the GeoLRM pipeline. A Proposal Transformer converts dense tokens into an occupancy grid, a Reconstruction Transformer refines the geometry, and the model finally produces 3D Gaussians. The pipeline relies on self-attention and deformable cross-attention.
4. Methodology
4.1. Principles
The core principle of GeoLRM is to build a 3D reconstruction model that is aware of the underlying geometry of multi-view imaging and the inherent sparsity of 3D objects. It achieves this through a two-stage, coarse-to-fine process.
- Sparsity: The model first prunes the vast, empty 3D space by using a lightweight Proposal Network to predict a coarse occupancy grid. This ensures that subsequent, more computationally intensive steps only focus on regions that actually contain the object.
- Geometry Awareness: The model then uses a Reconstruction Transformer that explicitly leverages camera projection information. Instead of naively correlating 3D and 2D features, it uses deformable cross-attention to project 3D anchor points onto the 2D images and intelligently sample features from relevant locations. This makes feature lifting from 2D to 3D highly efficient and robust.
4.2. Core Methodology In-depth (Layer by Layer)
The GeoLRM pipeline takes a set of images with their corresponding camera intrinsics and extrinsics as input.
4.2.1. Hierarchical Image Encoder
The model first extracts both high-level semantic features and low-level detail features from each input image. This hierarchical approach provides a richer set of information for the subsequent reconstruction steps.
- High-Level Features: For each input image $I^v$, a pre-trained DINOv2 model is used as a vision foundation model to extract high-level semantic features. DINOv2 is known for its strong understanding of 3D geometry from single images. $ \mathcal{F}_H^v = \mathrm{DINOv2}(I^v) $ Here, $\mathcal{F}_H^v$ is the high-level feature map for view $v$.
- Low-Level Features: To capture fine-grained details like textures and precise geometric cues, low-level features are also created. These are formed by combining two sources of information:
  - Plücker Ray Embeddings: For each pixel, its corresponding camera ray is encoded using Plücker coordinates. A ray is defined by its origin $\mathbf{o}$ and direction $\mathbf{d}$; its Plücker coordinate is the 6D vector formed from the direction and the moment $\mathbf{o} \times \mathbf{d}$. This explicitly encodes the camera geometry for each pixel. These embeddings are denoted as $R^v$.
  - RGB Values: The raw pixel colors are used directly. The ray embeddings and RGB values are concatenated and passed through a convolutional layer to produce the low-level feature map $\mathcal{F}_L^v$. $ \mathcal{F}_L^v = \mathrm{Conv}(\mathrm{Concat}(I^v, R^v)) $

Both $\mathcal{F}_H^v$ and $\mathcal{F}_L^v$ are provided to the transformers for feature fusion (a minimal sketch of the ray-embedding construction is given below).
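The following is a minimal sketch of a per-pixel Plücker ray map and the low-level feature layer, under assumed conventions (pinhole intrinsics, camera-to-world extrinsics, illustrative channel sizes); the exact ordering and normalization in the paper may differ.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def plucker_ray_map(K: torch.Tensor, c2w: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """Per-pixel Plücker ray embedding (conventions assumed, not taken from the paper).
    Returns a (6, H, W) map holding the world-space ray direction and its moment o x d.
    K:   (3, 3) pinhole intrinsics
    c2w: (4, 4) camera-to-world extrinsics"""
    ys, xs = torch.meshgrid(
        torch.arange(H, dtype=torch.float32),
        torch.arange(W, dtype=torch.float32),
        indexing="ij",
    )
    pix = torch.stack([xs + 0.5, ys + 0.5, torch.ones_like(xs)], dim=-1)   # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                                 # back-project to camera space
    dirs_world = F.normalize(dirs_cam @ c2w[:3, :3].T, dim=-1)             # rotate to world space
    origin = c2w[:3, 3].expand_as(dirs_world)                              # camera center for every pixel
    moment = torch.cross(origin, dirs_world, dim=-1)                       # o x d
    return torch.cat([dirs_world, moment], dim=-1).permute(2, 0, 1)        # (6, H, W)

# Low-level features: concatenate RGB with the ray map and apply a conv layer.
conv = nn.Conv2d(3 + 6, 64, kernel_size=3, padding=1)                      # 64 channels is illustrative
K = torch.tensor([[256.0, 0.0, 128.0], [0.0, 256.0, 128.0], [0.0, 0.0, 1.0]])
image = torch.rand(3, 256, 256)
rays = plucker_ray_map(K, torch.eye(4), 256, 256)
low_level = conv(torch.cat([image, rays], dim=0).unsqueeze(0))             # (1, 64, 256, 256)
```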
4.2.2. Two-Stage Transformer Pipeline
Stage 1: Proposal Network
The goal of this stage is to efficiently identify which parts of the 3D space are occupied by the object.
- Dense Initial Anchors: The process starts with a low-resolution dense grid of anchor points covering the entire normalized 3D space. Each anchor point is initialized with a learnable feature vector.
- Proposal Transformer: These anchor point features are fed as tokens into a Geometry-aware Transformer (a smaller version with 6 layers). This transformer uses self-attention and deformable cross-attention (explained below) to aggregate information from the hierarchical image features ($\mathcal{F}_H^v$, $\mathcal{F}_L^v$).
- Occupancy Prediction: The output features from the transformer are upsampled via a linear layer to a higher-resolution grid that represents the probability of occupancy for each voxel. A combination of binary cross-entropy and scene-class affinity loss is used for supervision (a rough sketch of this step is shown below).
Stage 2: Reconstruction Transformer
This stage takes the sparse set of occupied voxels identified in Stage 1 and refines their geometry and appearance.
-
- Sparse Anchor Points: Only the voxels predicted as "occupied" by the proposal network are used as 3D anchor points for this stage. This dramatically reduces the number of tokens the transformer needs to process (down to a maximum of 16k). Each anchor point is again initialized with a learnable feature vector.
- Reconstruction Transformer: These sparse anchor features are processed by a deeper Geometry-aware Transformer (16 layers). This transformer has the same architecture as the proposal transformer but is responsible for producing detailed features for the final 3D Gaussians.

The architecture of each block in the Geometry-aware Transformer is as follows:
- Self-Attention with 3D RoPE: The transformer first applies self-attention among the 3D anchor point tokens. To inject positional information, the paper extends Rotary Positional Embedding (RoPE) to 3D: the feature vector of each token is split into three parts, and 1D RoPE is applied to each part using its x, y, and z coordinates, respectively. This allows the attention mechanism to be aware of the relative spatial positions of the anchor points (a minimal sketch of this 3D RoPE follows the block description below). $ \mathcal{F}_A^{self} = \mathcal{F}_A^{in} + \mathrm{SelfAttn}(\mathrm{RMSNorm}(\mathcal{F}_A^{in})) $
- Deformable Cross-Attention: This is the key mechanism for lifting 2D image features into the 3D representation. For each 3D anchor point feature $f_i$ at spatial coordinate $\mathbf{x}_i$, it attends to the multi-view image feature maps $\{\mathcal{F}^v\}_{v=1}^V$.
  - The 3D point is projected onto each 2D feature map using the known camera parameters, yielding a reference point $\mathbf{p}_{iv}$.
  - Instead of attending to all pixels, the network predicts small 2D offsets $\Delta\mathbf{p}_{ivk}$ and attention weights $A_k$ from the query feature $f_i$.
  - Features are sampled from the feature map at the deformed locations using bilinear interpolation.
  - The final feature is a weighted sum of these sampled features, also weighted by a per-view importance weight $w_v$. The formula is: $ \operatorname{DeformAttn}(f_i, \mathbf{x}_i, \{\mathcal{F}^v\}_{v=1}^V) = \sum_{v=1}^V w_v \left[ \sum_{k=1}^K A_k \, \mathcal{F}^v \langle \mathbf{p}_{iv} + \Delta\mathbf{p}_{ivk} \rangle \right] $ where $\langle \cdot \rangle$ denotes the bilinear interpolation. This allows the model to correct for small geometric errors in the anchor point positions and to weigh views based on their quality or relevance. The full block operation is: $ \mathcal{F}_A^{cross} = \mathcal{F}_A^{self} + \mathrm{DeformCrossAttn}(\mathrm{RMSNorm}(\mathcal{F}_A^{self}), \{(\mathcal{F}_H^v, \mathcal{F}_L^v)\}_{v=1}^V) $
- Feed-Forward Network (FFN): Finally, the features are passed through a standard FFN layer. The block uses RMSNorm for normalization and SiLU for activation, which improves training stability. $ \mathcal{F}_A^{out} = \mathcal{F}_A^{cross} + \mathrm{FFN}(\mathrm{RMSNorm}(\mathcal{F}_A^{cross})) $
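As referenced above, here is a minimal sketch of the 3D RoPE idea. The channel split, base frequency, and the fact that it is shown on generic token features rather than on the attention queries and keys are simplifications and assumptions.

```python
import torch

def rope_1d(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard 1D rotary embedding ("rotate-half" form) applied to the last dim of x.
    x: (..., N, D) with D even, pos: (..., N) scalar coordinates."""
    half = x.shape[-1] // 2
    freqs = 1.0 / (base ** (torch.arange(half, dtype=x.dtype, device=x.device) / half))
    angles = pos[..., None] * freqs                     # (..., N, D/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_3d(tokens: torch.Tensor, coords: torch.Tensor) -> torch.Tensor:
    """Hypothetical 3D RoPE in the spirit of the paper: split the channel dimension
    into three equal parts and rotate each part by the x, y, and z coordinate of the
    corresponding anchor point."""
    D = tokens.shape[-1]
    assert D % 6 == 0, "channels must split into three even-sized parts"
    parts = tokens.split(D // 3, dim=-1)
    rotated = [rope_1d(p, coords[..., axis]) for axis, p in enumerate(parts)]
    return torch.cat(rotated, dim=-1)

# Example: sparse anchor tokens with integer voxel coordinates.
tokens = torch.randn(1, 16384, 768)
coords = torch.randint(0, 64, (1, 16384, 3)).float()
tokens_with_pos = rope_3d(tokens, coords)
```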
4.2.3. Post-processing and Rendering
The output tokens from the reconstruction transformer are decoded into 3D Gaussians. Each output token (corresponding to one anchor point) is passed through a small MLP to predict the parameters for multiple (32 in the paper) 3D Gaussians. For each Gaussian, the MLP predicts:

- Offset: A 3D offset relative to the anchor point's position.
- Color: An RGB color.
- Scale: A 3D scale vector.
- Rotation: A 4D quaternion for rotation.
- Opacity: A scalar opacity value.

To ensure training stability, the predicted raw outputs for the offset, scale, and opacity are passed through Sigmoid activation functions and scaled by predefined maximum values to constrain their ranges. The final collection of all Gaussians is then rendered into images using the standard 3D Gaussian Splatting rasterizer.
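A minimal sketch of such a decoding head, assuming illustrative channel sizes, 32 Gaussians per anchor, and hypothetical bounds for the sigmoid-constrained offset and scale:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GaussianHead(nn.Module):
    """Minimal sketch of a per-anchor Gaussian decoder (sizes and bounds are assumptions).
    Each anchor token is mapped by an MLP to the parameters of N Gaussians; offset, scale,
    and opacity are squashed with sigmoids and scaled to stay within predefined ranges."""
    def __init__(self, dim: int = 768, n_gaussians: int = 32,
                 max_offset: float = 0.05, max_scale: float = 0.02):
        super().__init__()
        self.n = n_gaussians
        self.max_offset = max_offset       # hypothetical bound on |offset|
        self.max_scale = max_scale         # hypothetical bound on scale
        # Per Gaussian: 3 offset + 3 RGB + 3 scale + 4 quaternion + 1 opacity = 14 values.
        self.mlp = nn.Sequential(nn.Linear(dim, dim), nn.SiLU(), nn.Linear(dim, n_gaussians * 14))

    def forward(self, tokens: torch.Tensor, anchors: torch.Tensor) -> dict:
        # tokens: (B, A, C) anchor features, anchors: (B, A, 3) anchor positions
        raw = self.mlp(tokens).view(*tokens.shape[:2], self.n, 14)
        offset, rgb, scale, quat, opacity = raw.split([3, 3, 3, 4, 1], dim=-1)
        xyz = anchors[:, :, None, :] + (offset.sigmoid() - 0.5) * 2 * self.max_offset
        return {
            "xyz": xyz,                                   # (B, A, N, 3) Gaussian centers
            "rgb": rgb.sigmoid(),
            "scale": scale.sigmoid() * self.max_scale,
            "rotation": F.normalize(quat, dim=-1),        # unit quaternion
            "opacity": opacity.sigmoid(),
        }
```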
4.3. Training Objectives
The model is trained in two stages.
- Proposal Network Training: This is supervised as a binary classification task on the high-resolution occupancy grid. Since most voxels are empty (a highly imbalanced problem), the loss combines binary cross-entropy loss and scene-class affinity loss.
- Reconstruction Network Training: The full model is trained with a composite loss function that supervises the rendered images, alpha masks, and depth maps against ground truth data from target camera views. The total loss is: $ \mathcal{L} = \sum_{t=1}^T \left( \mathcal{L}_{\mathrm{img}}(\hat{I}_t, I_t) + \mathcal{L}_{\mathrm{mask}}(\hat{M}_t, M_t) + 0.2\, \mathcal{L}_{\mathrm{depth}}(\hat{D}_t, D_t, I_t) \right) $ The individual loss components are:
  - Image Loss: A combination of L2 loss and LPIPS perceptual loss. $ \mathcal{L}_{\mathrm{img}}(\hat{I}_t, I_t) = \|\hat{I}_t - I_t\|_2 + 2\,\mathcal{L}_{\mathrm{LPIPS}}(\hat{I}_t, I_t) $
  - Mask Loss: Simple L2 loss on the rendered alpha mask. $ \mathcal{L}_{\mathrm{mask}}(\hat{M}_t, M_t) = \|\hat{M}_t - M_t\|_2 $
  - Depth Loss: A scale-invariant depth loss that is weighted by the image gradients $\Delta I_t$. This encourages smoother geometry, especially in textureless regions. $ \mathcal{L}_{\mathrm{depth}}(\hat{D}_t, D_t, I_t) = \frac{1}{|\hat{D}_t|} \left\| \exp(-\Delta I_t) \odot \log(1 + |\hat{D}_t - D_t|) \right\|_1 $ where $\odot$ is element-wise multiplication.
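A rough PyTorch sketch of this composite supervision for one target view. Mean-squared error stands in for the L2 terms, the LPIPS term is delegated to an external callable (e.g. from the `lpips` package), and the image-gradient approximation is an assumption.

```python
import torch
import torch.nn.functional as F

def reconstruction_loss(pred_img, gt_img, pred_mask, gt_mask, pred_depth, gt_depth, lpips_fn=None):
    """Sketch of the composite loss described above. All tensors are (B, C_or_1, H, W)."""
    # Image loss: L2 plus a perceptual term with weight 2.
    l_img = F.mse_loss(pred_img, gt_img)
    if lpips_fn is not None:
        l_img = l_img + 2.0 * lpips_fn(pred_img, gt_img).mean()

    # Mask loss: plain L2 on the rendered alpha.
    l_mask = F.mse_loss(pred_mask, gt_mask)

    # Gradient-weighted depth loss: approximate |∇I| with finite differences and
    # down-weight depth errors across strong image edges.
    gx = (gt_img[..., :, 1:] - gt_img[..., :, :-1]).abs().mean(dim=1, keepdim=True)
    gy = (gt_img[..., 1:, :] - gt_img[..., :-1, :]).abs().mean(dim=1, keepdim=True)
    grad = F.pad(gx, (0, 1)) + F.pad(gy, (0, 0, 0, 1))                  # (B, 1, H, W)
    l_depth = (torch.exp(-grad) * torch.log1p((pred_depth - gt_depth).abs())).mean()

    return l_img + l_mask + 0.2 * l_depth
```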
5. Experimental Setup
5.1. Datasets
- G-buffer Objaverse (GObjaverse): This is the primary training dataset. It is derived from the large-scale Objaverse dataset of 3D models. GObjaverse provides high-quality renderings for ~280,000 objects, including RGB, depth, normal, and albedo images. For each object, 38 views are rendered from two orbital paths and top/bottom viewpoints, ensuring comprehensive coverage. The objects are normalized to fit within a cube.
- Google Scanned Objects (GSO): This dataset is used for evaluation. It contains high-quality 3D scans of real-world household items. The paper uses a random subset of 100 objects, rendered in the same manner as GObjaverse.
- OmniObject3D: This is another evaluation dataset, containing diverse real-world objects. Again, a random subset of 100 objects is used for evaluation.
5.2. Evaluation Metrics
The paper evaluates both the visual quality of rendered images and the geometric accuracy of the reconstructed 3D models.
5.2.1. 2D Visual Quality Metrics
- PSNR (Peak Signal-to-Noise Ratio):
  - Conceptual Definition: PSNR measures the quality of a reconstructed image by comparing it to a ground-truth image. It quantifies the ratio between the maximum possible power of a signal (the pixel values) and the power of the corrupting noise that affects its fidelity. A higher PSNR value indicates better reconstruction quality (a small NumPy implementation is sketched after this list of metrics).
  - Mathematical Formula: $ \mathrm{PSNR} = 20 \cdot \log_{10}(\mathrm{MAX}_I) - 10 \cdot \log_{10}(\mathrm{MSE}) $
  - Symbol Explanation:
    - $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for 8-bit grayscale images).
    - $\mathrm{MSE}$: The Mean Squared Error between the ground-truth image $I$ and the predicted image $\hat{I}$, calculated as $\frac{1}{mn}\sum_{i=0}^{m-1}\sum_{j=0}^{n-1}[I(i,j) - \hat{I}(i,j)]^2$.
- SSIM (Structural Similarity Index Measure):
  - Conceptual Definition: SSIM is a perceptual metric that measures image quality based on the degradation of structural information. It is considered closer to human perception than PSNR because it evaluates changes in luminance, contrast, and structure. Values range from -1 to 1, with 1 indicating a perfect match.
  - Mathematical Formula: $ \mathrm{SSIM}(x, y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
  - Symbol Explanation:
    - x, y: The two image windows being compared.
    - $\mu_x, \mu_y$: The average pixel values of $x$ and $y$.
    - $\sigma_x^2, \sigma_y^2$: The variances of $x$ and $y$.
    - $\sigma_{xy}$: The covariance of $x$ and $y$.
    - $c_1, c_2$: Small constants to stabilize the division.
- LPIPS (Learned Perceptual Image Patch Similarity):
  - Conceptual Definition: LPIPS measures the perceptual similarity between two images. It computes the distance between deep features extracted from the images using a pre-trained neural network (e.g., VGG or AlexNet). It has been shown to correlate well with human judgments of image similarity. A lower LPIPS score indicates that the two images are more perceptually similar.
  - Mathematical Formula: $ d(x, x_0) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \| w_l \odot (\hat{y}_{hw}^l - \hat{y}_{0hw}^l) \|_2^2 $
  - Symbol Explanation:
    - $d(x, x_0)$: The LPIPS distance between images $x$ and $x_0$.
    - $\hat{y}^l, \hat{y}_0^l$: The deep features extracted from layer $l$ of the network for each image, normalized channel-wise.
    - $w_l$: A learned channel-wise weight vector for layer $l$.
    - $H_l, W_l$: The height and width of the feature map at layer $l$.
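As referenced above, a small NumPy implementation of the PSNR formula (SSIM and LPIPS are typically computed with existing libraries and are omitted here):

```python
import numpy as np

def psnr(gt: np.ndarray, pred: np.ndarray, max_val: float = 255.0) -> float:
    """PSNR following the formula above; inputs are images with values in [0, max_val]."""
    mse = np.mean((gt.astype(np.float64) - pred.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")                     # identical images
    return 20 * np.log10(max_val) - 10 * np.log10(mse)
```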
5.2.2. 3D Geometric Accuracy Metrics
- CD (Chamfer Distance):
  - Conceptual Definition: Chamfer Distance measures the dissimilarity between two point clouds. It calculates the average distance from each point in one set to its nearest neighbor in the other set. A lower CD value indicates a better geometric match.
  - Mathematical Formula: $ d_{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|x - y\|_2^2 $
  - Symbol Explanation:
    - $S_1, S_2$: The two point clouds being compared.
    - x, y: Points in the respective point clouds.
- F-Score:
  - Conceptual Definition: F-Score is another metric for evaluating point cloud similarity. It is the harmonic mean of precision and recall, where precision is the percentage of reconstructed points that are close (within a threshold $\tau$) to the ground truth, and recall is the percentage of ground truth points that have a close reconstructed point. A higher F-Score indicates better accuracy.
  - Mathematical Formula: $ \text{F-Score} = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} $
  - Symbol Explanation:
    - $P$, $G$: The predicted and ground-truth point clouds.
    - $\tau$: A distance threshold (set to 0.2 in the paper).
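A brute-force NumPy sketch of both geometric metrics for small point clouds (a real evaluation pipeline would use a KD-tree; the function name and interface are ours):

```python
import numpy as np

def chamfer_and_fscore(pred: np.ndarray, gt: np.ndarray, tau: float = 0.2):
    """Chamfer Distance and F-Score between two point clouds, following the formulas above.
    pred: (N, 3), gt: (M, 3). Uses O(N*M) memory, so only suitable for small clouds."""
    d2 = ((pred[:, None, :] - gt[None, :, :]) ** 2).sum(axis=-1)  # squared pairwise distances
    chamfer = d2.min(axis=1).mean() + d2.min(axis=0).mean()       # symmetric squared-distance CD

    d_pred_to_gt = np.sqrt(d2.min(axis=1))                        # nearest GT point per prediction
    d_gt_to_pred = np.sqrt(d2.min(axis=0))                        # nearest prediction per GT point
    precision = (d_pred_to_gt < tau).mean()
    recall = (d_gt_to_pred < tau).mean()
    fscore = 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
    return chamfer, fscore
```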
5.3. Baselines
The paper compares GeoLRM with several state-of-the-art LRM-based methods:
- LGM (Large Multi-view Gaussian Model): A U-Net based model for generating 3D Gaussians.
- CRM (Convolutional Reconstruction Model): A model using a convolutional architecture to produce a textured mesh.
- InstantMesh: A transformer-based model that reconstructs a 3D mesh from sparse views, often relying on triplanes.

The paper excludes comparisons with proprietary models and single-image methods like OpenLRM and TripoSR to ensure a fair comparison on the multi-view reconstruction task.
6. Results & Analysis
6.1. Core Results Analysis
The core findings of the experiments demonstrate that GeoLRM sets a new state-of-the-art in multi-view 3D reconstruction, particularly when scaling to denser input views.
6.1.1. Comparison with Baselines on GSO and OmniObject3D
The following are the results from Table 1 of the original paper, showing performance on the Google Scanned Objects (GSO) dataset with 6 input views:
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CD ↓ | FS ↑ | Inf. Time (s) | Memory (GB) |
|---|---|---|---|---|---|---|---|
| LGM | 20.76 | 0.832 | 0.227 | 0.295 | 0.703 | 0.07 | 7.23 |
| CRM | 22.78 | 0.843 | 0.190 | 0.213 | 0.831 | 0.30 | 5.93 |
| InstantMesh | 23.19 | 0.856 | 0.166 | 0.186 | 0.854 | 0.78 | 23.12 |
| Ours | 23.57 | 0.872 | 0.167 | 0.167 | 0.892 | 0.67 | 4.92 |
The following are the results from Table 2 of the original paper, showing performance on the OmniObject3D dataset:
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ | CD ↓ | FS ↑ |
|---|---|---|---|---|---|
| LGM | 21.94 | 0.824 | 0.203 | 0.256 | 0.787 |
| CRM | 23.12 | 0.855 | 0.175 | 0.204 | 0.810 |
| InstantMesh | 23.86 | 0.860 | 0.139 | 0.178 | 0.834 |
| Ours | 24.74 | 0.883 | 0.134 | 0.156 | 0.863 |
Analysis:
- GeoLRM achieves the best performance on almost all metrics (PSNR, SSIM, CD, F-Score) on both datasets, demonstrating superior visual quality and geometric accuracy.
- InstantMesh has a slightly better LPIPS on GSO, which the authors attribute to the inherent smoothing effect of its mesh-based representation. However, GeoLRM's superior CD and F-Score highlight its strength in capturing precise geometry, a direct benefit of its geometry-aware design.
- Notably, GeoLRM is highly efficient, requiring only 4.92 GB of GPU memory for inference, significantly less than InstantMesh (23.12 GB). This efficiency is a direct result of its sparse pipeline and geometry-aware attention.
6.1.2. Scalability with Dense Input Views
This is a key experiment highlighting the main advantage of GeoLRM. The following results are from Table 3 of the original paper:
| Num Input | PSNR (InstantMesh) | PSNR (Ours) | SSIM (InstantMesh) | SSIM (Ours) | Inf. Time s (InstantMesh) | Inf. Time s (Ours) | Memory GB (InstantMesh) | Memory GB (Ours) |
|---|---|---|---|---|---|---|---|---|
| 4 | 22.87 | 22.84 | 0.832 | 0.851 | 0.68 | 0.51 | 22.09 | 4.30 |
| 8 | 23.22 | 23.82 | 0.861 | 0.883 | 0.87 | 0.84 | 24.35 | 5.50 |
| 12 | 23.05 | 24.43 | 0.843 | 0.892 | 1.07 | 1.16 | 24.62 | 6.96 |
| 16 | 23.15 | 24.79 | 0.861 | 0.903 | 1.30 | 1.51 | 26.69 | 8.23 |
| 20 | 23.25 | 25.13 | 0.895 | 0.905 | 1.62 | 1.84 | 28.73 | 9.43 |
Analysis:
- GeoLRM shows consistent and significant improvement as the number of input views increases. Its PSNR steadily rises from 22.84 (4 views) to 25.13 (20 views).
- In stark contrast, InstantMesh's performance stagnates and even degrades (e.g., PSNR drops from 8 to 12 views). The authors suggest this is because its low-resolution triplane representation becomes a bottleneck, and its dense attention mechanism cannot effectively process the large volume of image tokens.
- GeoLRM's memory usage scales gracefully, increasing from 4.3 GB to only 9.43 GB with 20 views, whereas InstantMesh's memory grows more steeply and starts from a much higher baseline. This validates the efficiency of the proposed geometry-aware deformable attention.

The qualitative comparisons in the paper further support these findings. Figure 3 shows GeoLRM producing much finer details and textures compared to baselines. Figure 4 demonstrates that GeoLRM successfully reconstructs high-quality assets from the 21 views generated by SV3D, while InstantMesh fails to produce a satisfactory result from the same inputs.
Figure: Qualitative comparison of different image-to-3D methods. The top row shows the input images; below are the 3D models produced by TripoSR, LGM, CRM, InstantMesh, and our model, illustrating the differences in reconstruction quality across methods.
6.2. Ablation Studies / Parameter Analysis
The ablation studies in Tables 4 and 5 systematically validate the key design choices of GeoLRM.
The following are the results from Table 4 of the original paper:
| Method | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
|---|---|---|---|
| W/o Plücker rays | 20.64 | 0.826 | 0.244 |
| W/o low-level features | 20.29 | 0.817 | 0.246 |
| W/o high-level features | 15.85 | 0.798 | 0.289 |
| W/o 3D RoPE | 20.52 | 0.827 | 0.224 |
| Fixed # input views | 20.97 | 0.839 | 0.220 |
| Full model | 20.73 | 0.831 | 0.216 |
The following results compare training with a fixed number of input views against the dynamic-view training used by the full model, evaluated with different numbers of test-time inputs:

| Method | PSNR ↑ (4 Inputs) | SSIM ↑ (4 Inputs) | PSNR ↑ (8 Inputs) | SSIM ↑ (8 Inputs) | PSNR ↑ (12 Inputs) | SSIM ↑ (12 Inputs) |
|---|---|---|---|---|---|---|
| Fixed # input views | 19.72 | 0.822 | 20.85 | 0.833 | 21.43 | 0.838 |
| Full model | 19.94 | 0.835 | 21.16 | 0.840 | 22.04 | 0.853 |
Analysis:
- Hierarchical Image Encoder: Removing high-level features (DINOv2) causes a catastrophic drop in performance (PSNR from 20.73 to 15.85), confirming their importance for semantic understanding. Removing low-level features (Plücker rays, RGB) also hurts performance, leading to a loss of texture detail, as seen in Figure 5.
- Dynamic Input: Training with a fixed number of input views (6) gives slightly better PSNR/SSIM when testing with exactly 6 views. However, the "Full model," trained with a dynamic number of views (1-7), shows much better generalization when tested with varying numbers of inputs (4, 8, 12), demonstrating its superior robustness and flexibility.
- Deformable Attention: As shown in Table 5, using deformable attention with learnable offsets is crucial. Using 0 sampling points (i.e., just projecting and sampling at a single point without deformation) yields very poor results (PSNR 19.52). Increasing the number of sampling points from 4 to 8 brings a significant performance boost. The authors chose 8 points as a good trade-off between performance and computational cost.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces GeoLRM, a geometry-aware large reconstruction model that significantly advances the state of the art in multi-view 3D generation. By pioneering a two-stage, sparse pipeline and a novel geometry-aware transformer with deformable cross-attention, GeoLRM overcomes the key limitations of previous LRMs. Its main contributions are threefold: it leverages 3D sparsity for efficiency, explicitly uses camera geometry for effective feature fusion, and demonstrates unprecedented scalability to dense input views. The experimental results confirm that GeoLRM not only outperforms existing models in reconstruction quality and geometric accuracy but also does so with remarkable efficiency. The model's ability to improve with more input data makes it a powerful and practical tool for high-quality 3D content creation.
7.2. Limitations & Future Work
The authors candidly acknowledge one main limitation:
- Non-End-to-End Pipeline: The two-stage process (proposal then reconstruction) is not fully end-to-end. This can lead to error accumulation, where inaccuracies in the initial proposal stage negatively impact the final reconstruction. The current reliance on a separate proposal network is a necessary trade-off to manage the computational cost of processing the entire 3D space.
For future work, the authors propose focusing on developing a seamless, end-to-end solution. This would involve integrating the proposal and reconstruction stages, which could reduce error propagation, optimize processing time, and enhance the model's overall robustness for a wider range of applications, including real-time scenarios.
7.3. Personal Insights & Critique
GeoLRM presents a very compelling and well-executed idea. The paper is a strong example of how incorporating classical geometric principles into modern deep learning architectures can lead to significant breakthroughs.
Inspirations and Strengths:
- First-Principles Thinking: The core idea of using deformable attention guided by camera projections is brilliant. It directly addresses the computational bottleneck of dense attention and is a much more principled way to perform 2D-to-3D feature lifting than what previous LRMs have done.
- Scalability as a Key Feature: The focus on scalability with more views is a crucial contribution. As multi-view image/video generation models (like SV3D) improve, the ability to effectively consume their output will become a defining feature of top-tier reconstruction models. GeoLRM is well-positioned for this future.
- Efficiency: The impressive performance achieved with such low memory usage makes the model highly practical and accessible to a wider range of users who may not have access to top-of-the-line GPUs.
Potential Issues and Areas for Improvement:
- Dependence on Accurate Poses: The method's geometry-aware nature heavily relies on accurate camera pose information. While this is standard for multi-view reconstruction, its performance might degrade significantly if poses are noisy or inaccurate. An interesting extension would be to make the model robust to pose errors or even jointly refine poses, as explored in some NeRF variants.
- Generalization to Complex Scenes: The model was trained and evaluated on single-object datasets (Objaverse, GSO). Its ability to handle large, complex scenes with multiple objects, occlusions, and diverse backgrounds remains to be seen. The sparse proposal network might struggle to capture intricate, non-contiguous structures in such scenarios.
- The "End-to-End" Challenge: The limitation identified by the authors is indeed significant. An end-to-end model could potentially learn a more optimal joint representation for both occupancy and final appearance. This might be achievable with techniques like sparse convolutional networks or octree-based structures that can hierarchically represent sparse 3D data efficiently within a single network.

Overall, GeoLRM is a landmark paper in the field of large reconstruction models. Its elegant fusion of transformer architecture with fundamental geometric insights provides a clear and powerful path forward for creating high-fidelity 3D assets efficiently.