Objaverse-XL: A Universe of 10M+ 3D Objects
TL;DR Summary
The paper introduces Objaverse-XL, a dataset of over 10 million 3D objects, addressing the scarcity of high-quality data for 3D vision tasks. Training Zero123 for novel view synthesis on over 100 million multi-view images rendered from the dataset yields strong zero-shot generalization, opening the door to further innovation in 3D vision.
Abstract
Natural language processing and 2D vision models have attained remarkable proficiency on many tasks primarily by escalating the scale of training data. However, 3D vision tasks have not seen the same progress, in part due to the challenges of acquiring high-quality 3D data. In this work, we present Objaverse-XL, a dataset of over 10 million 3D objects. Our dataset comprises deduplicated 3D objects from a diverse set of sources, including manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic and antique artifacts. Representing the largest scale and diversity in the realm of 3D datasets, Objaverse-XL enables significant new possibilities for 3D vision. Our experiments demonstrate the improvements enabled with the scale provided by Objaverse-XL. We show that by training Zero123 on novel view synthesis, utilizing over 100 million multi-view rendered images, we achieve strong zero-shot generalization abilities. We hope that releasing Objaverse-XL will enable further innovations in the field of 3D vision at scale.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is "Objaverse-XL: A Universe of 10M+ 3D Objects." This title refers to a new, large-scale dataset of 3D objects designed to address the data scarcity in 3D vision tasks.
1.2. Authors
The paper lists numerous authors, with Matt Deitke, Ruoshi Liu, and Matthew Wallingford among the primary contributors. The affiliations include the Allen Institute for AI, the University of Washington, Seattle, Columbia University, Stability AI, California Institute of Technology, and LAION. Ali Farhadi and Ludwig Schmidt are noted for equal senior contributions. This diverse authorship indicates a collaborative effort spanning multiple research institutions and industry partners, bringing together expertise in AI, computer vision, and large-scale data systems.
1.3. Journal/Conference
The paper is published on arXiv, a preprint server, indicating it has not yet undergone formal peer review for a specific journal or conference at the time of this publication. However, arXiv is a widely recognized platform for disseminating cutting-edge research in AI and computer science, allowing for rapid sharing of findings. Given the affiliations of the authors and the impact of large datasets in AI, this work is likely intended for a top-tier computer vision or machine learning conference (e.g., NeurIPS, CVPR, ICCV) or journal.
1.4. Publication Year
The paper was published on arXiv on 2023-07-11 (UTC).
1.5. Abstract
The abstract introduces Objaverse-XL, a dataset comprising over 10 million deduplicated 3D objects gathered from various sources, including manually designed models, photogrammetry scans of real-world items and landmarks, and professional scans of historical artifacts. The core problem it addresses is the lack of large-scale, high-quality 3D data, which has hindered progress in 3D vision tasks compared to 2D vision and natural language processing. Objaverse-XL aims to fill this gap by providing unprecedented scale and diversity in 3D data. The main finding highlighted is that training models like Zero123 for novel view synthesis on Objaverse-XL, utilizing over 100 million multi-view rendered images, leads to strong zero-shot generalization abilities. The authors hope this release will catalyze further innovations in large-scale 3D vision.
1.6. Original Source Link
https://arxiv.org/abs/2307.05663v1 Publication Status: Preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2307.05663.pdf
2. Executive Summary
2.1. Background & Motivation
The paper addresses a significant challenge in the field of 3D computer vision: the scarcity of high-quality, large-scale 3D training data. While natural language processing (NLP) and 2D vision have seen remarkable advancements driven by massive datasets (e.g., LAION-5B for images, LLaMA for language), 3D vision tasks such as 3D object generation and reconstruction have lagged. This lag is primarily due to the inherent difficulties and high costs associated with acquiring and annotating 3D data, often relying on labor-intensive methods like professional 3D designers or small-scale photogrammetry. Existing 3D datasets, like ShapeNet, are orders of magnitude smaller than their 2D and NLP counterparts, limiting the capabilities of learning-driven methods in 3D.
This problem is particularly important in the current technological landscape due to the increasing demand for and interest in augmented reality (AR) and virtual reality (VR) technologies, as well as advancements in generative AI that seek to create 3D content. The gap in data availability acts as a bottleneck for developing robust, generalizable 3D models.
The paper's innovative idea is to leverage the recent proliferation of 3D content available across various online platforms (e.g., GitHub, Thingiverse, Sketchfab, Polycam, Smithsonian Institute) by systematically crawling and curating this data into an unprecedentedly large dataset. This approach mirrors how large-scale 2D and text datasets were assembled from web-crawled sources, aiming to bring similar scaling benefits to 3D vision.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Objaverse-XL Dataset: The introduction of Objaverse-XL, a web-crawled dataset containing over 10 million deduplicated 3D objects. This represents a significant leap in scale, being an order of magnitude larger than its predecessor (Objaverse 1.0) and two orders of magnitude larger than traditional datasets like ShapeNet. The dataset is sourced from diverse platforms, ensuring a rich variety of object types, geometries, and textures.
- Demonstration of Scaling Benefits: The paper experimentally demonstrates that training state-of-the-art 3D models, such as Zero123 for novel view synthesis and PixelNeRF for neural radiance fields, on the massive scale of Objaverse-XL significantly improves their performance. These improvements are observed in:
  - Zero-shot generalization: Zero123-XL (Zero123 trained on Objaverse-XL) shows substantially better zero-shot generalization to challenging modalities like photorealistic assets, cartoons, drawings, and sketches, compared to its counterpart trained on Objaverse 1.0. This indicates that larger datasets help models learn more robust and transferable representations.
  - Consistent performance gains with scale: Experiments show that performance continues to improve as the dataset size increases from thousands to millions of objects, with few signs of saturation.
  - Fine-tuning benefits: Pretraining PixelNeRF on Objaverse-XL leads to improved performance when fine-tuned on other datasets like DTU and ShapeNet, showcasing the value of Objaverse-XL as a pretraining corpus.
- Comprehensive Data Curation and Analysis: The authors detail their methodology for crawling, deduplicating, and extracting metadata from the diverse 3D sources. They also provide analyses of the dataset's properties, including NSFW filtering, face detection, and photogrammetry hole detection, to ensure data quality and address potential concerns.
These findings collectively solve the problem of data scarcity in 3D vision by providing a robust, large-scale foundation for training advanced 3D models. The results suggest that scaling data in 3D vision can lead to similar breakthroughs observed in 2D vision and NLP, enabling new possibilities for 3D content generation, reconstruction, and understanding.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To understand Objaverse-XL and its implications, a few foundational concepts in AI and 3D vision are helpful:
- Natural Language Processing (NLP): A field of artificial intelligence that enables computers to understand, interpret, and generate human language. Models like GPT-2 or LLaMA (mentioned in the paper) are examples of large language models trained on massive text datasets for tasks like language generation and comprehension.
- 2D Vision: The field of computer science that trains machines to "see" and interpret visual information from 2D images. Datasets like ImageNet (millions of images) and LAION-5B (billions of images) are fundamental for training powerful 2D vision models.
- 3D Vision: The field concerned with enabling computers to understand the 3D world from various data sources (e.g., images, depth sensors, 3D models). Tasks include 3D object generation, reconstruction, pose estimation, and scene understanding.
- Datasets for Machine Learning: Collections of data used to train and evaluate machine learning models. The scale and diversity of these datasets are crucial for a model's performance and generalization ability.
- Deep Learning: A subfield of machine learning that uses artificial neural networks with multiple layers (deep neural networks) to learn representations from data.
- Zero-shot Generalization: The ability of a machine learning model to perform a task on data types or categories it has not explicitly been trained on. This is a highly desirable property, often achieved by models trained on very large and diverse datasets.
- Novel View Synthesis: The task of generating new views of a 3D object or scene from a limited number of input views. For example, given a single image of an object, generating what it would look like from a different angle.
- Diffusion Models: A class of generative models that learn to reverse a diffusion process (gradually adding noise to data) to generate new data samples. They have shown remarkable success in image generation (e.g., Stable Diffusion).
- Neural Radiance Fields (NeRFs): A technique for synthesizing novel views of a complex 3D scene by optimizing a neural network to represent the scene's volumetric radiance field. It learns to map 3D coordinates to color and density, allowing for photorealistic rendering from any viewpoint.
- Photogrammetry: The science of making measurements from photographs, often used to create 3D models of real-world objects or scenes by stitching together multiple 2D images.
- CAD Models (Computer-Aided Design): Digital 3D models created using CAD software, commonly used in engineering and design.
- Image Embeddings (CLIP embeddings): Vector representations of images generated by models like CLIP (Contrastive Language-Image Pre-training). These embeddings capture semantic information about the image and can be used for tasks like similarity search or classification.
3.2. Previous Works
The paper extensively references prior work, particularly concerning datasets and 3D applications.
3.2.1. Pre-training Datasets (General AI)
- GPT-2 [49], Chinchilla [25], LLaMA [57]: Large language models. GPT-2 was a groundbreaking model demonstrating promising zero-shot results on NLP benchmarks using roughly 30 billion language tokens. Chinchilla and LLaMA surpassed GPT-2 by consuming trillions of web-crawled tokens, highlighting the importance of data scale in NLP.
- ImageNet [15]: With 1 million images, it was the gold standard for representation learning in computer vision for a long time, used for tasks like object detection and instance segmentation.
- LAION-5B [55]: A web-crawled dataset of billions of image-text pairs, which powered advances in generative AI (e.g., Stable Diffusion) and visual-language representations (e.g., CLIP, Flamingo).
- CLIP [50]: Contrastive Language-Image Pre-training. A model trained on a vast dataset of image-text pairs to learn highly transferable visual representations. It aligns images and text in a shared embedding space. For instance, CLIP embeddings are used in Objaverse-XL for metadata extraction and filtering.
  - How CLIP works (simplified): CLIP consists of an image encoder and a text encoder. During training, it learns to embed images and their corresponding text captions close to each other in a multi-modal embedding space, while pushing apart embeddings of mismatched pairs. This is achieved by maximizing the cosine similarity of correct image-text pairs and minimizing it for incorrect pairs.
  - Cosine Similarity: A measure of similarity between two non-zero vectors of an inner product space, defined as the cosine of the angle between them: $ \text{cosine\_similarity}(\mathbf{A}, \mathbf{B}) = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \|\mathbf{B}\|} $ where $\mathbf{A}$ and $\mathbf{B}$ are vectors, $\mathbf{A} \cdot \mathbf{B}$ is their dot product, and $\|\mathbf{A}\|$ and $\|\mathbf{B}\|$ are their magnitudes. (A small numeric sketch follows this list.)
- SAM [31]: Segment Anything Model. A recent vision model that demonstrates powerful zero-shot segmentation capabilities, further emphasizing the trend of large-scale pre-training.
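The cosine-similarity measure above is straightforward to verify numerically. The snippet below is a minimal sketch (not from the paper), with 768-dimensional random vectors standing in for CLIP ViT-L/14 embeddings.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine of the angle between two embedding vectors (the formula above)."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy example: random vectors mimic the 768-d size of CLIP ViT-L/14 embeddings.
rng = np.random.default_rng(0)
img_emb, txt_emb = rng.normal(size=768), rng.normal(size=768)
print(cosine_similarity(img_emb, txt_emb))  # near 0 for unrelated random vectors
print(cosine_similarity(img_emb, img_emb))  # exactly 1.0 for identical vectors
```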
3.2.2. 3D Datasets
- ShapeNet [9]: A foundational 3D dataset, providing a collection of textured CAD models (around 51K after filtering) with semantic labels. While influential, its objects are often low-resolution with simplistic textures.
- ABO [13], GSO [17], OmniObjects3D [63]: These datasets improved texture quality over ShapeNet but remained significantly smaller (e.g., 15K CAD models for the largest).
- Objaverse 1.0 [14]: A more recent 3D dataset of 800K high-quality 3D models with diverse textures and geometries, a significant step up from ShapeNet. Objaverse-XL extends this work.
3.2.3. 3D Applications
- 3D Reconstruction: Methods exploring novel representations [12, 59, 38, 39], network architectures [22, 64], and differentiable rendering [30, 10, 52, 34, 35]. Many of these initially experimented on small datasets like ShapeNet.
- Generative AI in 3D:
  - DreamFusion [48] and Magic3D [33]: Demonstrated text-to-3D shape generation by leveraging text-to-image models.
  - Point-E [42] and Shap-E [28]: Trained for text-to-3D generation using 3D models from undisclosed sources.
  - Zero123 [36]: An image-conditioned diffusion model for novel view synthesis, trained on Objaverse 1.0. This paper's Zero123-XL builds directly upon it by using the larger Objaverse-XL dataset.
  - Stable Dreamfusion [56]: Replaced the text-to-image model in DreamFusion with the 3D-informed Zero123 for improved 3D generations.
- PixelNeRF [64]: A method for constructing NeRF models that generalize across scenes with few input images. The paper uses Objaverse-XL to train PixelNeRF at an unprecedented scale.
3.3. Technological Evolution
The evolution of AI has clearly demonstrated a trend: scale matters.
- Early AI (Pre-Deep Learning): Relied on smaller, carefully curated datasets and hand-engineered features.
- Deep Learning Revolution (ImageNet Era): ImageNet provided a sufficiently large dataset (1M images) to train deep convolutional neural networks, leading to breakthroughs in image recognition.
- Web-Scale Data Era (LAION, LLaMA): The shift to web-crawled datasets (billions/trillions of data points) for 2D images and text led to the development of highly powerful, general-purpose models like CLIP, Stable Diffusion, and large language models (LLMs). These models exhibit remarkable zero-shot and few-shot learning capabilities.

3D vision, however, has historically lagged due to the difficulty of acquiring 3D data. Datasets like ShapeNet (51K objects) and even Objaverse 1.0 (800K objects) are significantly smaller than their 2D and NLP counterparts. This paper's work with Objaverse-XL (10M+ objects) represents the current apex of this scaling trend in 3D, attempting to bridge the data gap and enable advancements similar to those seen in other modalities.
3.4. Differentiation Analysis
Compared to prior main methods and datasets, Objaverse-XL's core differences and innovations are:
- Scale: It is the largest 3D dataset by an order of magnitude (10.2M objects) compared to previous large datasets like Objaverse 1.0 (800K) and ShapeNet (51K). This scale is critical for training larger, more generalizable 3D models, mirroring the success of web-scale datasets in 2D vision and NLP.
- Diversity of Sources: Objaverse-XL aggregates 3D objects from a much wider range of sources (GitHub, Thingiverse, Sketchfab, Polycam, Smithsonian Institution) than previous datasets. This includes not just CAD models but also photogrammetry scans of real-world objects, historical artifacts, and user-generated content, leading to higher diversity in object types, styles, and quality.
- Real-world Complexity: By including photogrammetry scans and assets from public repositories, the dataset inherently captures more real-world geometric and textural complexity, as well as stylistic variations (e.g., cartoons, sketches), which are often lacking in purely CAD-based datasets.
- Impact on Zero-shot Generalization: The paper specifically highlights the improved zero-shot generalization capabilities of models trained on Objaverse-XL, especially for challenging and complex modalities. This is a direct benefit of the increased scale and diversity, distinguishing it from prior datasets that might lead to models with more limited generalization.
- Comprehensive Metadata: The dataset provides rich metadata, including source information, Blender-extracted properties (polygon/vertex/edge counts, material/texture counts), and CLIP embeddings, which facilitate various analyses and filtering strategies (e.g., NSFW detection, aesthetic scoring, alignment finetuning).
- Bridging the 3D Data Gap: Objaverse-XL is positioned as a direct response to the bottleneck of data scarcity in 3D vision, aiming to provide the necessary foundation for large-scale training of generative and predictive models, similar to how LAION-5B enabled the current generation of 2D image synthesis models.
4. Methodology
The core methodology revolves around the creation and analysis of the Objaverse-XL dataset, followed by experimental validation using state-of-the-art 3D vision models.
4.1. Principles
The core idea behind Objaverse-XL is that data scale and diversity are paramount for advancing 3D vision, just as they have been for 2D vision and natural language processing. The theoretical basis is rooted in the empirical observation that large models trained on massive, diverse datasets exhibit superior generalization and emergent zero-shot capabilities. The intuition is that by curating a dataset of unprecedented size (over 10 million 3D objects) from a wide array of web sources, models trained on this data will learn more robust, comprehensive representations of the 3D world, leading to significant improvements in tasks like novel view synthesis and 3D generation.
4.2. Core Methodology In-depth
The methodology can be broken down into three main phases: Data Collection and Curation, Metadata Extraction and Analysis, and Experimental Validation.
4.2.1. Data Collection and Curation (Objaverse-XL Composition)
Objaverse-XL is assembled from diverse web sources, each contributing different types and qualities of 3D data. The overall process involves identifying sources, crawling for 3D files, attempting to import and render these files, and then deduplicating them.
4.2.1.1. Source Identification and Crawling
The authors identify five primary sources:
- GitHub: A popular platform for code hosting.
  - Process: The authors index 37 million public files that carry common 3D object extensions (e.g., .obj, .glb, .gltf, .usdz, .usd, .usda, .fbx, .stl, .dae, .ply, .abc, and .blend). These extensions are chosen for their compatibility with Blender, which is used for rendering.
  - Filtering: Only objects from "base" GitHub repositories (non-forked, or forks with more stars than the original) are indexed.
  - Contribution: Files come from over 500K repositories.
- Thingiverse: A platform for sharing 3D printable objects.
  - Process: Approximately 3.5 million objects are indexed and downloaded. These are predominantly released under Creative Commons licenses.
  - Characteristics: The vast majority are STL files, which are watertight meshes that are typically untextured. For rendering, colors are randomized to increase diversity.
- Sketchfab: An online platform for publishing and sharing 3D models.
  - Process: Data is sourced specifically from Objaverse 1.0 [14], which consists of 800K Creative Commons-licensed 3D models.
  - Characteristics: Each model is distributed as a standardized GLB file, freely usable and modifiable, covering a broad array of object types.
- Polycam: A 3D scanning mobile application.
  - Process: The focus is on objects from its "explore" functionality that are designated as "savable" and governed by a Creative Commons Attribution 4.0 International License (CC-BY 4.0).
  - Contribution: 72K such objects were indexed.
- Smithsonian Institution: Digitized collection of historical and cultural artifacts.
  - Process: A set of 2.4K models, all licensed under a CC0 license (public domain), are included.
  - Characteristics: Primarily scans of real-world artifacts, distributed in a standardized compressed GLB format.
4.2.1.2. Deduplication and Rendering
After initial crawling, a rigorous deduplication and rendering process is applied:
- Deduplication: Objects are deduplicated by file content hash. This step removes approximately 23 million files, indicating a significant amount of redundant data across sources. (A minimal code sketch follows this list.)
- Import and Rendering: Among the remaining files, only those that can be successfully imported and rendered in Blender are included.
  - Challenges: Files may fail to render due to import compatibility issues (e.g., FBX ASCII files are not natively importable to Blender), absence of meshes, or invalid 3D file formats (e.g., an .obj file that is actually a C compiler file).
  - Outcome: 5.5 million successfully rendered files are retained from GitHub.
- Final Count: Objaverse-XL ultimately comprises 10.2 million unique 3D objects after combining and processing all sources.
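To make the extension filtering and content-hash deduplication concrete, here is a minimal sketch under simple assumptions (a local directory of already-downloaded files; the directory name is hypothetical). It is not the authors' released pipeline.

```python
import hashlib
from pathlib import Path

# The Blender-compatible extensions listed in Section 4.2.1.1.
EXTENSIONS = {".obj", ".glb", ".gltf", ".usdz", ".usd", ".usda",
              ".fbx", ".stl", ".dae", ".ply", ".abc", ".blend"}

def sha256_of_file(path: Path) -> str:
    """Hash the file content in 1 MiB chunks so large meshes need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def deduplicate(root: str) -> dict[str, Path]:
    """Keep one representative file per unique content hash."""
    unique: dict[str, Path] = {}
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() in EXTENSIONS:
            unique.setdefault(sha256_of_file(path), path)
    return unique

# Example (hypothetical directory): unique_objects = deduplicate("crawled_3d_files/")
```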
The following figure (Figure 4c from the original paper) shows the contribution of various sources to Objaverse-XL:
The image is an illustration showing more than 50 diverse 3D objects, including buildings, tools, toys, and creatures, highlighting the richness and diversity of the Objaverse-XL dataset.
4.2.2. Metadata Extraction and Analysis
Each object in Objaverse-XL is enriched with metadata from its source and additional extracted properties.
4.2.2.1. Source Metadata
- Information like popularity (e.g., GitHub stars), license type, and textual descriptions (e.g., file names on GitHub) are gathered directly from the source platforms.
4.2.2.2. Blender Metadata
For every object rendered, the following metadata is extracted using Blender:
- sha256: A cryptographic hash of the file content for identification.
- file-size: Size of the 3D model file.
- polygon-count, vertex-count, edge-count: Geometric complexity metrics.
- material-count, texture-count: Information about surface properties.
- object-count, animation-count: Number of distinct objects and animations within the file.
- linked-files: References to external files.
- scene-dimensions: Bounding box dimensions of the object in 3D space.
- missing-textures: Indicates whether any textures are missing. Missing textures are randomized during rendering.
- Animated Objects: Analysis of Blender metadata reveals a significant increase in animated objects (from 41K in Objaverse 1.0 to 459K in Objaverse-XL) and objects with armature (from 34K to 438K), indicating greater dynamic complexity.
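As a sketch of how such statistics can be gathered, the snippet below uses Blender's bpy API to import a single GLB file into an empty scene and count its geometry. It is an illustrative approximation, not the authors' released rendering script, and the script and file names are hypothetical.

```python
# Run inside Blender's bundled Python, e.g.:  blender --background --python extract_metadata.py
import bpy

def extract_metadata(glb_path: str) -> dict:
    """Import one GLB file into an empty scene and collect simple complexity statistics."""
    bpy.ops.wm.read_factory_settings(use_empty=True)  # reset to an empty scene
    bpy.ops.import_scene.gltf(filepath=glb_path)
    meshes = [obj for obj in bpy.data.objects if obj.type == "MESH"]
    return {
        "polygon-count": sum(len(obj.data.polygons) for obj in meshes),
        "vertex-count": sum(len(obj.data.vertices) for obj in meshes),
        "edge-count": sum(len(obj.data.edges) for obj in meshes),
        "material-count": len(bpy.data.materials),
        "texture-count": len(bpy.data.images),
        "object-count": len(bpy.data.objects),
        "animation-count": len(bpy.data.actions),
        "armature-count": sum(1 for obj in bpy.data.objects if obj.type == "ARMATURE"),
    }

# metadata = extract_metadata("/path/to/model.glb")  # hypothetical path
```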
4.2.2.3. Model Metadata (CLIP Features)
- CLIP ViT-L/14 Embedding: For each object, a CLIP ViT-L/14 image embedding is extracted.
  - Process: 12 different renders of the object are generated from random camera positions inside a hollow sphere. The CLIP embeddings of these 12 rendered images are then averaged to obtain a robust embedding for the 3D object (see the sketch after this list).
  - Use Cases: These CLIP embeddings are used to predict various metadata properties, including aesthetic scores, NSFW predictions, face detection, and detection of holes in photogrammetry renders.
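A minimal sketch of the embedding-averaging step, assuming the open-source `open_clip` package and hypothetical render paths (this is not the authors' code):

```python
import torch
import open_clip
from PIL import Image

# ViT-L/14 with the original OpenAI weights, matching the embedding described above.
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
model.eval()

@torch.no_grad()
def object_embedding(render_paths: list[str]) -> torch.Tensor:
    """Average the L2-normalized CLIP image embeddings of an object's rendered views."""
    batch = torch.stack([preprocess(Image.open(p).convert("RGB")) for p in render_paths])
    features = model.encode_image(batch)
    features = features / features.norm(dim=-1, keepdim=True)
    return features.mean(dim=0)

# Hypothetical usage with 12 renders of one object:
# embedding = object_embedding([f"renders/obj_0001/view_{i:02d}.png" for i in range(12)])
```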
4.2.2.4. Data Analysis and Filtering
Several analyses are performed using the extracted metadata:
- NSFW Annotations:
  - Process: Each 3D object is rendered in 12 random views. Each rendered image is passed through an NSFW classifier trained on the LAION-5B dataset [55] using CLIP ViT-L/14 features.
  - Filtering Criteria: An image is marked NSFW if its NSFW score is above 0.9; a 3D object is marked NSFW if at least 3 of its rendered images are deemed NSFW.
  - Outcome: Only 815 objects out of 10M are filtered out, demonstrating the effectiveness of the process and the relatively low prevalence of NSFW content. The high score threshold and the multi-view criterion guard against false positives on innocuous 3D objects caused by the distribution shift from natural images to synthetic renders.
- Face Detection:
  - Process: A face detector [20] is applied to the 12 rendered images per object. An object is considered to contain faces if at least 3 images have a detected face.
  - Outcome: Approximately 266K objects (out of 10M) are estimated to include faces. However, these often stem from dolls, historical sculptures, and anthropomorphic animations, reducing privacy concerns.
- Photogrammetry Hole Detection:
  - Motivation: Scanned 3D objects (e.g., from Polycam) can have "holes" if certain parts (e.g., the back or bottom) were not scanned, leading to noisy or low-fidelity renders from specific viewpoints.
  - Process: 1.2K Polycam renders are manually annotated as "good" (label 1) or "bad" (label 0). A binary classifier (a 2-layer MLP) is trained on CLIP ViT-L/14 features of the rendered images to predict "bad renders." This classifier achieves over 90% cross-validation accuracy with a "render score" threshold of 0.5. (A small sketch of such a classifier follows this list.)
  - Outcome: Out of 71K Polycam objects, 38.20% of renders are "bad," with 58K objects having at least 2 bad renders. This analysis helps identify and potentially filter lower-quality scans.

The following figure (Figure 4 from the original paper) shows an analysis of metadata from Objaverse-XL, including distributions of polygon/vertex/edge counts and source contributions. The accompanying image shows an illustration of more than 50 diverse 3D objects, including buildings, tools, toys, and creatures, highlighting the richness and diversity of the Objaverse-XL dataset.
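The hole-detection classifier described above is small enough to sketch directly. The following is a minimal PyTorch version under stated assumptions (the hidden width of 256 and the Adam learning rate are illustrative choices, not values from the paper); inputs are 768-dimensional CLIP ViT-L/14 image embeddings of individual renders.

```python
import torch
import torch.nn as nn

class RenderQualityClassifier(nn.Module):
    """2-layer MLP mapping a CLIP ViT-L/14 embedding (768-d) to P(render is 'good')."""
    def __init__(self, embed_dim: int = 768, hidden_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, clip_features: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.net(clip_features)).squeeze(-1)

model = RenderQualityClassifier()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.BCELoss()

def train_step(features: torch.Tensor, labels: torch.Tensor) -> float:
    """One gradient step on (N, 768) CLIP features with labels 1 = good, 0 = bad."""
    optimizer.zero_grad()
    loss = criterion(model(features), labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()

def is_bad_render(features: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Flag renders whose predicted 'render score' falls below the 0.5 threshold."""
    with torch.no_grad():
        return model(features) < threshold
```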
4.2.3. Experimental Validation
The dataset's utility is demonstrated by training and evaluating state-of-the-art 3D vision models on Objaverse-XL.
4.2.3.1. Novel View Synthesis with Zero123-XL
- Model: Zero123 [36] is a view-conditioned diffusion model for novel view synthesis. Its weights are initialized from Stable Diffusion to leverage strong 2D image generation capabilities.
- Zero123-XL: The Zero123 framework trained on the larger Objaverse-XL dataset.
- Training Details (Appendix A.1):
  - Batch size: 2048 (the learning rate and remaining hyperparameters are given in Appendix A.1 of the paper).
  - Two-stage finetuning:
    - First stage: Trained for 375K iterations on the entire Objaverse-XL.
    - Second stage (Alignment Finetuning): Trained for 65K iterations with a reduced learning rate on a high-quality subset of Objaverse-XL (1.3 million objects).
    - High-quality subset selection: Based on proxy estimation of human preference using heuristics such as vertex count, face count, source website popularity, and data source. This is inspired by InstructGPT [44] and LIMA [66], which show the benefits of finetuning on high-quality data. (A heuristic sketch of such a filter follows this list.)
- Scaling Experiment: For evaluating performance across different dataset sizes (Figure 6), smaller datasets (below 800K) are randomly sampled subsets of Objaverse 1.0.
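Since the exact selection rule is not published, the following is only a hypothetical illustration of how such heuristics (vertex count, face count, source popularity, data source) could be combined into a subset filter; every threshold here is an assumption.

```python
TRUSTED_SOURCES = {"sketchfab", "polycam", "smithsonian"}  # hypothetical grouping

def in_alignment_subset(obj: dict) -> bool:
    """Proxy for 'human-preferred' objects; all cutoffs are illustrative, not from the paper."""
    detailed_enough = obj.get("vertex_count", 0) >= 1_000 and obj.get("face_count", 0) >= 1_000
    popular_enough = obj.get("stars", 0) > 0 or obj.get("likes", 0) > 0
    return detailed_enough and (popular_enough or obj.get("source") in TRUSTED_SOURCES)

# Hypothetical usage over a list of per-object metadata dictionaries:
# alignment_subset = [o for o in all_objects if in_alignment_subset(o)]
```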
4.2.3.2. Novel View Synthesis with PixelNeRF
- Model: PixelNeRF [64] is a method for constructing NeRF models that generalize across scenes from few input images. Standard NeRFs require many input views of a single scene, whereas PixelNeRF aims to generalize from fewer views.
- Training on Objaverse-XL: PixelNeRF is trained on over two million objects from Objaverse-XL, an order of magnitude more than the datasets previously used for this model. This training uses 24 million rendered images.
- Evaluation: Performance is evaluated on a held-out set of Objaverse-XL objects.
- Fine-tuning: The PixelNeRF model pretrained on Objaverse-XL is then fine-tuned on other datasets like DTU [2] and ShapeNet [9] to assess transfer learning capabilities.

The models are trained to synthesize novel views given an input view, a task that scales easily because many view pairs can be rendered from the 3D objects in Objaverse-XL. This framework enables a direct comparison of performance improvements due to data scale.
The following figure (Figure 6 from the original paper) shows novel view synthesis at scale for PixelNeRF and Zero123:
The image is a chart showing over 100K 3D objects from different sources, such as manually designed models, photogrammetry scans, and professional scans of historical artifacts, illustrating the diversity and richness of 3D objects in the Objaverse-XL dataset.
5. Experimental Setup
5.1. Datasets
The primary dataset used for training and pre-training is Objaverse-XL. For evaluation and fine-tuning, other established 3D datasets are utilized.
- Objaverse-XL:
  - Source: Web-crawled from GitHub, Thingiverse, Sketchfab, Polycam, and the Smithsonian Institution.
  - Scale: Over 10 million deduplicated 3D objects; specifically, 10.2M unique objects are compiled.
    - GitHub: 5.5 million objects (after filtering and successful rendering).
    - Thingiverse: ~3.5 million objects.
    - Sketchfab (from Objaverse 1.0): 800K objects.
    - Polycam: 71K unique objects (after deduplication).
    - Smithsonian Institution: 2.4K objects.
  - Characteristics: High diversity in object types, styles, and quality. Includes manually designed objects, photogrammetry scans of landmarks and everyday items, and professional scans of historic/antique artifacts. Objects from Thingiverse are predominantly untextured STL meshes, while others are textured CAD models or scans (e.g., GLB files).
  - Domain: General 3D objects, spanning a vast range of real-world and artistic items.
  - Data Sample: The paper includes several figures (e.g., Figure 1, Figure 3) showing diverse examples of 3D objects from Objaverse-XL, illustrating items like furniture, vehicles, characters, tools, and abstract shapes. For instance, Figure 1 shows examples of objects rendered in a scene, and Figure 3 specifically shows examples from GitHub, Thingiverse, Polycam, and the Smithsonian.
    - Figure 1: Objaverse-XL includes a ginormous collection of diverse 3D objects from a variety of sources. Here, we show examples of objects in Objaverse-XL rendered in a scene.
    - Figure 3: Examples of 3D objects from various sources of Objaverse-XL spanning GitHub, Thingiverse, Polycam, the Smithsonian Institution, and Sketchfab. Objects from Thingiverse do not include color information, so each object's primary color is randomized during rendering.
  - Rationale: Chosen as the primary large-scale training dataset to overcome data scarcity in 3D vision and leverage the benefits of scale.
- Objaverse 1.0 [14]:
  - Source: Sketchfab.
  - Scale: 800K 3D models.
  - Characteristics: High-quality and diverse textures, geometry, and object types.
  - Domain: General 3D objects.
  - Rationale: Used as a baseline for Zero123 performance comparison and for sampling smaller dataset sizes in scaling experiments.
- Google Scanned Objects (GSO) [17]:
  - Source: Google.
  - Scale: Approximately 1K high-quality 3D scanned household items.
  - Characteristics: Photorealistic 3D scans.
  - Domain: Household items.
  - Rationale: Used for quantitative evaluation of novel view synthesis performance for Zero123-XL.
- DTU [2]:
  - Source: Technical University of Denmark.
  - Scale: Multi-view image datasets of various real-world objects.
  - Characteristics: Real-world objects with ground-truth camera parameters, often used for multi-view stereo and novel view synthesis tasks.
  - Domain: Various small-to-medium scale objects.
  - Rationale: Used for fine-tuning and evaluation of PixelNeRF to demonstrate generalization benefits.
- ShapeNet [9]:
  - Source: Crowdsourced/professional 3D designers.
  - Scale: 51K models (after filtering from 3M theoretical CAD models).
  - Characteristics: Textured CAD models labeled with semantic categories; often low resolution with simplistic textures.
  - Domain: Common object categories (e.g., chairs, tables, airplanes).
  - Rationale: Used for fine-tuning and evaluation of PixelNeRF to demonstrate generalization benefits, representing a more traditional 3D dataset.
5.2. Evaluation Metrics
The paper uses several standard evaluation metrics for novel view synthesis and image quality:
- PSNR (Peak Signal-to-Noise Ratio):
  - Conceptual Definition: PSNR quantifies the quality of a reconstructed image compared to an original reference image. It is most commonly used to measure the quality of reconstruction of lossy compression codecs (e.g., for image compression). A higher PSNR generally indicates a higher-quality (less noisy) reconstruction.
  - Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $
  - Symbol Explanation:
    - $\mathrm{MAX}_I$: The maximum possible pixel value of the image. For an 8-bit grayscale image, this is 255. For color images, it is typically 255 per channel.
    - $\mathrm{MSE}$: The mean squared error between the ground-truth image $I$ and the generated image $K$: $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ where $M$ and $N$ are the dimensions of the image, $I(i,j)$ is the pixel value of the original image at coordinate $(i,j)$, and $K(i,j)$ is the pixel value of the generated image at coordinate $(i,j)$.
- SSIM (Structural Similarity Index Measure):
  - Conceptual Definition: SSIM is a perception-based model that treats image degradation as a perceived change in structural information, while also incorporating perceptual phenomena such as luminance masking and contrast masking. It is a full-reference metric, meaning it measures the quality of an image against a ground-truth image. A value of 1 indicates perfect similarity, while 0 indicates no similarity.
  - Mathematical Formula: $ \mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = [l(\mathbf{x}, \mathbf{y})]^{\alpha} \cdot [c(\mathbf{x}, \mathbf{y})]^{\beta} \cdot [s(\mathbf{x}, \mathbf{y})]^{\gamma} $ Typically, $\alpha = \beta = \gamma = 1$, and the formula simplifies to: $ \mathrm{SSIM}(\mathbf{x}, \mathbf{y}) = \frac{(2\mu_x\mu_y + C_1)(2\sigma_{xy} + C_2)}{(\mu_x^2 + \mu_y^2 + C_1)(\sigma_x^2 + \sigma_y^2 + C_2)} $
  - Symbol Explanation:
    - $\mathbf{x}$, $\mathbf{y}$: Two image patches (ground-truth and generated, respectively).
    - $\mu_x$, $\mu_y$: The means of $\mathbf{x}$ and $\mathbf{y}$.
    - $\sigma_x^2$, $\sigma_y^2$: The variances of $\mathbf{x}$ and $\mathbf{y}$.
    - $\sigma_{xy}$: The covariance of $\mathbf{x}$ and $\mathbf{y}$.
    - $C_1 = (k_1 L)^2$, $C_2 = (k_2 L)^2$: Two constants that avoid division by zero, where $L$ is the dynamic range of the pixel values (e.g., 255 for 8-bit images) and $k_1$, $k_2$ are small constants (e.g., $k_1 = 0.01$, $k_2 = 0.03$).
    - $l(\cdot,\cdot)$, $c(\cdot,\cdot)$, $s(\cdot,\cdot)$: Luminance, contrast, and structure comparison functions, respectively.
- LPIPS (Learned Perceptual Image Patch Similarity):
  - Conceptual Definition: LPIPS uses features from a pre-trained deep neural network (e.g., AlexNet or VGG) to measure the perceptual similarity between two images. Unlike traditional metrics such as MSE or PSNR, LPIPS is designed to correlate better with human judgments of image similarity. A lower LPIPS score indicates higher perceptual similarity.
  - Mathematical Formula: $ \mathrm{LPIPS}(\mathbf{x}, \mathbf{x}_0) = \sum_l \frac{1}{H_l W_l} \| w_l \odot (\phi_l(\mathbf{x}) - \phi_l(\mathbf{x}_0)) \|_2^2 $
  - Symbol Explanation:
    - $\mathbf{x}$, $\mathbf{x}_0$: The two images being compared (ground-truth and generated).
    - $\phi_l(\cdot)$: The feature stack from layer $l$ of a pre-trained deep network (e.g., AlexNet).
    - $w_l$: A learned scalar weight for each channel in layer $l$.
    - $\odot$: Element-wise product.
    - $\| \cdot \|_2^2$: Squared $\ell_2$ norm.
    - $H_l$, $W_l$: Height and width of the feature map at layer $l$.
- FID (Frechet Inception Distance):
  - Conceptual Definition: FID measures the similarity between two sets of images, typically a set of real images and a set of generated images. It computes the Fréchet distance between two Gaussian distributions fitted to the feature representations (extracted from an Inception-v3 network) of the real and generated images. A lower FID score indicates higher quality and diversity of generated images, implying they are more similar to real images.
  - Mathematical Formula: $ \mathrm{FID} = \|\mu_x - \mu_y\|_2^2 + \mathrm{Tr}\left(\Sigma_x + \Sigma_y - 2(\Sigma_x \Sigma_y)^{1/2}\right) $
  - Symbol Explanation:
    - $\mu_x$, $\mu_y$: The mean feature vectors of the real and generated image sets, respectively, extracted from an Inception-v3 activation layer.
    - $\Sigma_x$, $\Sigma_y$: The covariance matrices of the feature vectors for the real and generated image sets.
    - $\| \cdot \|_2^2$: Squared $\ell_2$ norm.
    - $\mathrm{Tr}(\cdot)$: Trace of a matrix.
    - $(\Sigma_x \Sigma_y)^{1/2}$: Matrix square root.
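The PSNR formula above reduces to a few lines of NumPy, while LPIPS is usually computed with the third-party `lpips` Python package rather than reimplemented. The snippet below is a minimal sketch of both (random arrays and tensors stand in for real rendered views; it is not code from the paper).

```python
import numpy as np
import torch
import lpips  # pip install lpips

def psnr(reference: np.ndarray, generated: np.ndarray, max_value: float = 255.0) -> float:
    """PSNR in dB from the MSE-based formula above, for images with values in [0, max_value]."""
    mse = np.mean((reference.astype(np.float64) - generated.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

# Sanity check: a uniform offset of 10 gray levels gives 10*log10(255^2 / 100) ≈ 28.1 dB.
reference = np.full((64, 64), 100, dtype=np.uint8)
generated = np.full((64, 64), 110, dtype=np.uint8)
print(round(psnr(reference, generated), 1))  # 28.1

# LPIPS with the AlexNet backbone, the variant most commonly reported.
loss_fn = lpips.LPIPS(net="alex")
img0 = torch.rand(1, 3, 256, 256) * 2 - 1  # LPIPS expects (N, 3, H, W) tensors in [-1, 1]
img1 = torch.rand(1, 3, 256, 256) * 2 - 1
with torch.no_grad():
    print(loss_fn(img0, img1).item())  # lower = perceptually more similar
```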
5.3. Baselines
The paper primarily compares Zero123-XL against its predecessor, Zero123 trained on Objaverse 1.0. For PixelNeRF, it compares the Objaverse-XL-trained version against a base PixelNeRF (trained from scratch or on smaller datasets).
- Zero123 (trained on Objaverse 1.0): The direct baseline for Zero123-XL. It uses the same architecture and methodology but is trained on a dataset an order of magnitude smaller (800K objects vs. 10.2M objects). This comparison directly validates the impact of data scale.
- PixelNeRF (base / trained from scratch): PixelNeRF models trained without the benefit of Objaverse-XL's large-scale pre-training, representing the conventional approach of training on smaller, specialized datasets like DTU or ShapeNet.

These baselines are representative because they either use the same model architecture with different data scales (for Zero123) or represent standard training approaches for the specific task (for PixelNeRF), allowing for a clear assessment of Objaverse-XL's impact.
6. Results & Analysis
6.1. Core Results Analysis
The core results demonstrate that training state-of-the-art 3D vision models on Objaverse-XL significantly enhances their performance, particularly in zero-shot generalization and consistency with increased data scale.
6.1.1. Zero-shot Generalization with Zero123-XL
The paper highlights that Zero123-XL (trained on Objaverse-XL) achieves substantially better zero-shot generalization than Zero123 (trained on Objaverse 1.0). This improvement is visually demonstrated across various challenging modalities:
- Photorealistic assets, cartoons, drawings, and sketches: Zero123-XL generates novel views that are more consistent with the input view and that preserve the original style and geometric details, even for highly stylized inputs like sketches. In contrast, Zero123 often interprets such inputs as 2D planes, performing simpler homography-like transformations instead of true 3D rotations.
- Improved viewpoint control: Zero123-XL shows better control over camera pose transformations, generating more plausible and accurate novel views.

The following figure (Figure 5 from the original paper) showcases novel view synthesis on in-the-wild images, comparing Zero123-XL and Zero123:

Figure 5: Novel view synthesis on in-the-wild images. Comparison between Zero123-XL trained on Objaverse-XL and Zero123 trained on Objaverse. Starting from the input view, the task is to generate an image of the object under a specific camera pose transformation. The camera poses are shown beside each example. Significant improvement can be found by training with more data, especially for categories including people (1st row), anime (2nd row), cartoon (3rd row), fruit (4th row), and sketches (5th row). Additionally, viewpoint control is significantly improved (see 2nd row).

These qualitative results strongly validate that the increased scale and diversity of Objaverse-XL enable Zero123-XL to learn more robust and generalizable 3D representations.
6.1.2. Quantitative Improvement with Scale
Quantitatively, the novel view synthesis performance consistently improves with larger dataset sizes.
The following figure (Figure 6 from the original paper) illustrates novel view synthesis at scale for PixelNeRF and Zero123:
Figure 6: Novel view synthesis at scale. Left: PixelNeRF [64] trained on varying scales of data and evaluated on a held-out subset of Objaverse-XL. Right: Zero123 [36] trained on varying scales of data and evaluated on a zero-shot dataset. Note that the 800K datapoint is Zero123 and the 10M datapoint is Zero123-XL. The synthesis quality consistently improves with scale. LPIPS is scaled up 10 times for visualization.
The graph shows a clear trend: as the number of objects (dataset size) increases, the LPIPS (Learned Perceptual Image Patch Similarity) score decreases for both PixelNeRF and Zero123. A lower LPIPS indicates higher perceptual similarity between the generated and ground-truth novel views, meaning better synthesis quality. This trend holds consistently, suggesting that the models continue to benefit from more data even at the scale of millions of objects, with no signs of performance saturation. The 10M datapoint (representing Objaverse-XL) shows the lowest LPIPS values for Zero123-XL, indicating its superior performance.
6.1.3. Generalization to Downstream Datasets with PixelNeRF
Pretraining PixelNeRF on Objaverse-XL significantly improves its performance when fine-tuned on other established 3D datasets.
The following are the results from Table 3 of the original paper:
| PixelNeRF | DTU [2] | ShapeNet [9] |
|---|---|---|
| Base | 15.32 | 22.71 |
| w/ Objaverse-XL | 17.53 ± .37 | 24.22 ± .55 |
The table shows the PSNR (Peak Signal-to-Noise Ratio) for PixelNeRF models.
- Base: PixelNeRF trained from scratch, without Objaverse-XL pretraining.
- w/ Objaverse-XL: PixelNeRF pretrained on Objaverse-XL and then fine-tuned on the respective dataset (DTU or ShapeNet).

For both DTU and ShapeNet, the PSNR values are significantly higher when PixelNeRF is pretrained on Objaverse-XL: for DTU, PSNR increases from 15.32 to 17.53, and for ShapeNet, from 22.71 to 24.22. A higher PSNR indicates better image reconstruction quality. This demonstrates the effectiveness of Objaverse-XL as a large-scale pretraining corpus, allowing models to learn more generalizable features that transfer well to different domains.
6.2. Ablation Studies / Parameter Analysis
6.2.1. Alignment Finetuning for Zero123-XL
Inspired by InstructGPT and LIMA, the authors conducted an ablation study to evaluate the effect of fine-tuning Zero123-XL on a high-quality alignment subset of Objaverse-XL.
- Alignment Subset: A subset of 1.3 million objects selected based on proxy estimations of human preference (heuristics like vertex count, face count, popularity on the source website, and data source).
- Process: After pretraining the base Zero123-XL model on the entire Objaverse-XL, a second stage of fine-tuning was performed on this alignment subset with a reduced learning rate.

The following are the results from Table 2 of the original paper:

| Zero123-XL | PSNR (↑) | SSIM (↑) | LPIPS (↓) | FID (↓) |
|---|---|---|---|---|
| Base | 18.225 | 0.877 | 0.088 | 0.070 |
| w/ Alignment Finetuning | 19.876 | 0.888 | 0.075 | 0.056 |

The table presents the zero-shot generalization performance on the Google Scanned Objects [17] dataset.
- Base: Zero123-XL trained on the entire Objaverse-XL.
- w/ Alignment Finetuning: Zero123-XL with the additional fine-tuning stage on the high-quality subset.

The results show significant improvements across all metrics after alignment fine-tuning:
- PSNR (↑): increases from 18.225 to 19.876.
- SSIM (↑): increases from 0.877 to 0.888.
- LPIPS (↓): decreases from 0.088 to 0.075.
- FID (↓): decreases from 0.070 to 0.056.

These improvements demonstrate that even after pre-training on a massive dataset, a targeted fine-tuning phase using a curated, high-quality subset can further align the model's outputs with desired properties, leading to better zero-shot generalization performance. This highlights the importance of data quality in addition to quantity.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Objaverse-XL, a groundbreaking dataset comprising over 10 million unique 3D objects collected from diverse web sources. This dataset represents an unprecedented scale and diversity in the realm of 3D data, addressing the critical bottleneck of data scarcity that has hindered progress in 3D vision tasks. The authors rigorously detail the data collection, curation, and metadata extraction processes, including robust deduplication, NSFW filtering, and quality analysis. Through comprehensive experiments, the paper empirically demonstrates that training state-of-the-art models like Zero123 and PixelNeRF on Objaverse-XL leads to significant improvements in zero-shot generalization abilities for novel view synthesis and enhanced performance when fine-tuning on downstream tasks. The consistent performance gains observed with increasing data scale underscore the potential of large-scale 3D datasets to drive future innovations in 3D computer vision.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Scale Gap: Despite being orders of magnitude larger than previous 3D datasets, Objaverse-XL (10M+ objects) is still orders of magnitude smaller than modern billion-scale image-text datasets (e.g., LAION-5B). Future work should explore ways to further scale 3D datasets and simplify 3D content capture and creation.
- Data Redundancy/Utility: Not all samples in Objaverse-XL may be equally necessary or effective for training high-performance models. Future research could focus on intelligent data selection strategies to identify the most impactful data points for training.
- Focus on Generative Tasks: The current work primarily demonstrates the benefits of Objaverse-XL for generative tasks (novel view synthesis). Future work could explore its utility for discriminative tasks such as 3D segmentation and detection.
- Ethical Considerations: While NSFW filtering and face detection were performed, the potential for confidential or sensitive data within web-crawled content remains a concern. The paper notes that individuals were not notified about data collection and consent was not explicitly obtained, which are common limitations for large web-scraped datasets.
7.3. Personal Insights & Critique
This paper marks a pivotal moment for 3D computer vision, akin to the impact of ImageNet or LAION-5B on 2D vision. The sheer scale of Objaverse-XL, coupled with its diverse data sources, offers a much-needed foundation for developing more robust and generalizable 3D models.
Inspirations:
- Replication of Scaling Laws: The paper effectively demonstrates that the "scaling laws" observed in NLP and 2D vision hold true for 3D as well. This provides strong motivation for continued efforts in large-scale data collection for 3D.
- Democratization of 3D AI: By open-sourcing such a massive dataset, the authors are democratizing access to crucial resources, which can accelerate research and development in 3D content creation, AR/VR, robotics, and simulation.
- Value of Hybrid Data Sources: The successful integration of diverse data sources, from professionally scanned artifacts to user-generated STL files and code-repository assets, highlights the potential of combining varied data types to achieve comprehensive coverage.
- Alignment Finetuning: The effectiveness of alignment fine-tuning on a high-quality subset, even after massive pre-training, is a valuable insight. It suggests a powerful paradigm where quantity provides broad generalization and quality refines model behavior toward human preferences, a lesson relevant across many AI domains.
Potential Issues or Areas for Improvement:
- Data Quality Heterogeneity: While diversity is a strength, the vast differences in quality (e.g., highly detailed photogrammetry vs. simple untextured STL files) could pose challenges for models. While the paper addresses this with quality filtering and aesthetic scoring, the impact of this heterogeneity on different downstream tasks might warrant further investigation.
- Licensing Complexity: The reliance on various Creative Commons licenses and terms of service from different platforms introduces legal complexities for users. Although the paper provides disclaimers, navigating these licenses for commercial or derivative use could be challenging for some researchers and developers.
- Computational Cost: Training models on 10 million 3D objects, which translates to over 100 million rendered images, is computationally intensive. While the benefits are clear, this raises barriers for researchers with limited computational resources, potentially creating a divide in the field.
- Bias in Web-Scraped Data: Like all web-scraped datasets, Objaverse-XL is susceptible to biases present in the internet's 3D content. While NSFW and face detection are performed, other subtle biases (e.g., cultural representation, object categories) might exist and could influence model behavior. A more thorough "datasheet for datasets" analysis regarding potential biases would be beneficial.
- Dynamic Nature of Web Sources: The reliance on external links for GitHub and Thingiverse means the dataset's long-term integrity depends on the persistence of these external resources. While the authors expect stability, link rot is a known issue for web-crawled datasets. Providing more robust archival mechanisms for the source files (if legally permissible) would enhance the dataset's longevity.
Overall, Objaverse-XL is an essential contribution that pushes the boundaries of 3D vision. Its release is likely to inspire a new generation of 3D models capable of unprecedented levels of understanding and generation, fundamentally reshaping the landscape of 3D AI.