
OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation

Published: 01/19/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

OmniObject3D is a large-vocabulary 3D object dataset with 6,000 high-quality real scans across 190 categories, featuring rich annotations. It aims to advance 3D perception, reconstruction, and generation research with four evaluation tasks.

Abstract

Recent advances in modeling 3D objects mostly rely on synthetic datasets due to the lack of large-scale realscanned 3D databases. To facilitate the development of 3D perception, reconstruction, and generation in the real world, we propose OmniObject3D, a large vocabulary 3D object dataset with massive high-quality real-scanned 3D objects. OmniObject3D has several appealing properties: 1) Large Vocabulary: It comprises 6,000 scanned objects in 190 daily categories, sharing common classes with popular 2D datasets (e.g., ImageNet and LVIS), benefiting the pursuit of generalizable 3D representations. 2) Rich Annotations: Each 3D object is captured with both 2D and 3D sensors, providing textured meshes, point clouds, multiview rendered images, and multiple real-captured videos. 3) Realistic Scans: The professional scanners support highquality object scans with precise shapes and realistic appearances. With the vast exploration space offered by OmniObject3D, we carefully set up four evaluation tracks: a) robust 3D perception, b) novel-view synthesis, c) neural surface reconstruction, and d) 3D object generation. Extensive studies are performed on these four benchmarks, revealing new observations, challenges, and opportunities for future research in realistic 3D vision.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

The title of the paper is OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation. It clearly indicates that the paper introduces a new 3D object dataset, emphasizing its large vocabulary, real-scanned nature, and applicability to tasks such as 3D perception, reconstruction, and generation.

1.2. Authors

The authors are:

  • Tong Wu

  • Jiarui Zhang

  • Xiao Fu

  • Yuxin Wang

  • Jiawei Ren

  • Liang Pan

  • Wayne Wu

  • Lei Yang

  • Jiaqi Wang

  • Chen Qian

  • Dahua Lin

  • Ziwei Liu

    Their affiliations include:

  • Shanghai Artificial Intelligence Laboratory

  • The Chinese University of Hong Kong

  • SenseTime Research

  • Hong Kong University of Science and Technology

  • S-Lab, Nanyang Technological University

    The diverse affiliations across leading AI labs and universities suggest a collaborative effort from prominent researchers in the field of computer vision and 3D reconstruction.

1.3. Journal/Conference

The paper is published as a preprint on arXiv.org. While arXiv is not a peer-reviewed journal or conference, it is a widely recognized platform for sharing cutting-edge research in computer science and other fields before formal publication. Papers published here are often submitted to major conferences like CVPR, ICCV, or ECCV, or journals like IJCV.

1.4. Publication Year

The paper was published on arXiv on January 18, 2023 (2023-01-18T18:14:18 UTC).

1.5. Abstract

The paper addresses the challenge in 3D object modeling, which often relies on synthetic datasets due to the scarcity of large-scale, real-scanned 3D databases. To bridge this gap and advance 3D perception, reconstruction, and generation in real-world scenarios, the authors propose OmniObject3D. This dataset is characterized by several key properties:

  1. Large Vocabulary: It comprises 6,000 scanned objects across 190 daily categories. These categories align with popular 2D datasets like ImageNet and LVIS, fostering the development of generalizable 3D representations.

  2. Rich Annotations: Each 3D object is meticulously captured using both 2D and 3D sensors, yielding textured meshes, point clouds, multiview rendered images, and multiple real-captured videos.

  3. Realistic Scans: Professional scanning equipment ensures high-quality object scans, characterized by precise shapes and realistic appearances.

    Leveraging the extensive data and annotations of OmniObject3D, the authors establish four distinct evaluation tracks: a) Robust 3D Perception b) Novel-View Synthesis c) Neural Surface Reconstruction d) 3D Object Generation

Extensive experiments conducted on these benchmarks reveal novel observations, highlight existing challenges, and identify future research opportunities in the domain of realistic 3D vision.

The original source link is https://arxiv.org/abs/2301.07525, and the PDF link is https://arxiv.org/pdf/2301.07525v2.pdf. This is a preprint published on arXiv.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper aims to solve is the scarcity of large-scale, high-quality, real-world 3D object datasets. Current advances in 3D object modeling predominantly rely on synthetic datasets, such as ShapeNet or ModelNet. While these synthetic datasets are abundant and readily available, they suffer from significant limitations:

  • Appearance and Distribution Gaps: Synthetic data inherently differs from real-world data in terms of visual appearance, texture realism, and object distribution. This sim-to-real domain gap is a major hurdle.

  • Hindrance to Real-Life Applications: Models trained on synthetic data often fail to generalize effectively to real-world scenarios, limiting their practical applicability in tasks like robotics, augmented reality, and autonomous driving.

    Existing real-world 3D datasets, while a step in the right direction, are unsatisfactory due to various limitations:

  • CO3D: Provides 19k videos but only 20% have accurate point clouds, and textured meshes are absent.

  • GSO: Contains only 1k scanned objects across a narrow set of 17 household classes.

  • AKB-48: Focuses on 2k articulated objects for robotics, leading to a limited semantic distribution not suitable for general 3D research.

  • DTU and BlendedMVS: Small scale and lack category annotations.

  • ScanObjectNN: Contains noisy, incomplete point clouds, often with multiple objects in a scene.

    The importance of addressing this problem is paramount for 3D vision to move beyond academic benchmarks and achieve robust real-world applications. The paper's innovative idea is to systematically build a large-vocabulary, high-quality, real-scanned 3D object dataset called OmniObject3D to facilitate realistic 3D perception, reconstruction, and generation.

2.2. Main Contributions / Findings

The paper's primary contributions revolve around the creation and benchmarking of OmniObject3D:

  1. A Novel Large-Vocabulary 3D Object Dataset (OmniObject3D):

    • It comprises 6,000 high-quality textured meshes scanned from real-world objects, making it the largest real-world 3D object dataset with accurate 3D meshes to date.
    • It covers 190 daily categories, significantly expanding the semantic scope compared to previous real-world datasets. These categories overlap with popular 2D datasets (ImageNet, LVIS) and 3D datasets (ShapeNet), promoting generalizable 3D representations.
    • Each object comes with rich annotations, including textured 3D meshes, sampled point clouds, posed multi-view images (rendered by Blender), and real-captured video frames with foreground masks and COLMAP camera poses.
    • The scans are of high fidelity, captured by professional scanners, ensuring precise shapes and realistic appearances with high-frequency textures.
  2. Establishment of Four Comprehensive Evaluation Tracks:

    • Robust 3D Perception: Provides a benchmark for point cloud classification against out-of-distribution (OOD) styles (sim-to-real gap) and OOD corruptions (e.g., jittering, missing points), allowing for fine-grained analysis.
    • Novel-View Synthesis (NVS): Offers a diverse dataset for evaluating both single-scene and cross-scene NVS methods, pushing towards more generalizable and robust algorithms.
    • Neural Surface Reconstruction: Enables evaluation of dense-view and sparse-view surface reconstruction, particularly highlighting challenges with complex geometries and textures.
    • 3D Object Generation: Provides a new large-vocabulary, realistic dataset for training and evaluating 3D generative models, revealing semantic distribution biases and varied exploration difficulties.
  3. New Observations, Challenges, and Opportunities:

    • 3D Perception: Reveals that performance on clean synthetic data has little correlation with OOD-style robustness. Advanced point grouping methods (CurveNet, GDANet) show robustness to both OOD styles and corruptions. The combination of OOD style + OOD corruption is a particularly challenging setting.

    • NVS: Voxel-based methods (Plenoxels) excel at high-frequency textures but are less stable with concave geometry or dark objects. OmniObject3D is beneficial for learning strong generalizable priors across scenes. Real-captured videos introduce additional challenges due to motion blur and SfM inaccuracies.

    • Surface Reconstruction: Identifies "hard" categories with dark/low-texture appearances, concave geometries, or complex/thin structures. Demonstrates that sparse-view reconstruction remains a significant challenge, with NeuS surprisingly strong as a baseline and issues with MonoSDF's reliance on estimated depth accuracy.

    • 3D Object Generation: Highlights semantic distribution biases in generative models when trained on large-vocabulary datasets, where certain categories or groups dominate generation. Identifies challenges in generating complex textures and achieving disentanglement between geometry and texture.

      The key conclusions are that OmniObject3D effectively addresses the critical need for a large-scale, high-quality real-world 3D object dataset. Its rich annotations and diverse content enable comprehensive benchmarking, which in turn reveals specific weaknesses and strengths of current 3D vision models across various tasks, paving the way for future research in realistic 3D vision.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand this paper, a novice reader should be familiar with the following core concepts:

  • 3D Object Datasets: Collections of 3D models or scans used for training and evaluating computer vision models. They can be synthetic (CAD models) or real-scanned. Key properties include the number of objects, categories, and types of annotations (e.g., meshes, point clouds, images).

  • Textured Meshes: A common representation for 3D objects, consisting of a polygonal mesh (a collection of vertices, edges, and faces that define the shape) and a texture map (an image applied to the mesh's surface to give it color and detail).

  • Point Clouds: A set of data points in a three-dimensional coordinate system. These points represent the external surface of an object or environment. Point clouds are typically generated by 3D scanners. They lack explicit topological information (like faces or edges) that meshes have.

  • Multi-view Images: A collection of 2D images of a 3D object or scene captured from different camera viewpoints. These are crucial for tasks like 3D reconstruction and novel-view synthesis.

  • Novel-View Synthesis (NVS): The task of generating a realistic image of a 3D scene or object from a new, unseen viewpoint, given a set of existing 2D images.

  • Neural Radiance Field (NeRF): A neural network-based method for novel-view synthesis. It represents a 3D scene as a continuous function that maps 3D coordinates (x, y, z) and viewing direction (θ, φ) to a color (RGB) and a density (σ). An MLP (Multi-Layer Perceptron) implicitly learns this function. Images from novel views are synthesized by volume rendering, where rays are cast through the scene, and color and density are integrated along these rays.

  • Neural Surface Reconstruction: The process of recovering the 3D surface geometry of an object or scene from 2D images using neural networks. This often involves implicit surface representations like Signed Distance Functions (SDFs).

  • Signed Distance Function (SDF): An implicit representation of a 3D surface. For any point in 3D space, an SDF returns the shortest distance from that point to the surface. The sign of the distance indicates whether the point is inside (negative) or outside (positive) the object, with points on the surface having a distance of zero.

  • 3D Object Generation: The task of creating new, diverse 3D models of objects, often conditioned on text prompts, images, or learned latent codes. This can involve generating meshes, point clouds, or implicit representations.

  • Point Cloud Perception: Tasks that involve analyzing and understanding 3D point cloud data, such as classification (assigning a category label to a point cloud), segmentation (assigning a label to each point), and object detection.

  • Out-of-Distribution (OOD) Data: Data that differs significantly from the data a model was trained on. In 3D vision, OOD styles (e.g., synthetic vs. real) and OOD corruptions (e.g., noise, missing points) are common challenges.

  • Sim-to-Real Gap: The performance degradation observed when a model trained on synthetic (simulated) data is deployed in real-world environments. This gap arises from differences in data distributions and characteristics.

  • COLMAP: A general-purpose Structure-from-Motion (SfM) and Multi-View Stereo (MVS) pipeline. SfM reconstructs 3D camera poses and sparse 3D point clouds from a set of 2D images, while MVS densifies this sparse reconstruction into a dense point cloud or mesh.

  • Blender: A free and open-source 3D computer graphics software toolset used for creating animated films, visual effects, art, 3D printed models, motion graphics, interactive 3D applications, virtual reality, and video games. It's used here for rendering multi-view images from 3D models.

3.2. Previous Works

The paper extensively discusses prior 3D object datasets and related methods across various tasks.

3.2.1. 3D Object Datasets

  • Synthetic CAD Models:

    • ShapeNet [9]: A large repository of 3D CAD models. Contains 51,300 models across 55 categories. Widely used for 3D shape analysis, completion, and generation.
    • ModelNet [96]: Derived from ShapeNet, consisting of 12,311 models in 40 categories. Popular for 3D object classification and segmentation benchmarks.
    • 3D-FUTURE [26] & ABO [16]: High-quality CAD models with rich geometric details and informative textures, focusing on furniture and objects.
    • Toys4K [83]: Another synthetic dataset with 4k toys in 105 categories.
    • Critique: While large, these synthetic datasets suffer from the sim-to-real gap.
  • Real-world Scanned Datasets (Limited Scale or Scope):

    • DTU [1] & BlendedMVS [102]: Photo-realistic datasets for multi-view stereo. Critique: Small scale, lack category annotations, not suitable for general 3D object research.
    • ScanObjectNN [87]: A real-world point cloud dataset from scanned indoor scenes. Contains 15,000 objects in 15 categories. Critique: Point clouds are often incomplete, noisy, and multiple objects can coexist in a scene, complicating object-level analysis.
    • GSO (Google Scanned Objects) [21]: 1,030 scanned objects with fine geometries and textures, but limited to 17 household items. Critique: Narrow semantic scope.
    • AKB-48 [49]: Focuses on robotics manipulation with 2,037 articulated object models in 48 categories. Critique: Specialized for articulated objects, limiting general 3D research.
    • CO3D [74]: Contains 19,000 object-centric videos. Critique: Only 20% have accurate point clouds reconstructed by COLMAP, and they do not provide textured meshes.

3.2.2. Robust 3D Perception

  • OOD Corruptions:
    • Works like [13, 45, 71, 92] study robustness to OOD corruptions (e.g., jittering, random point missing) by applying them to clean test sets.
    • ModelNet-C [75]: A standard corruption test suite built on ModelNet.
    • Critique: These works do not account for OOD styles (sim-to-real gap).
  • Sim-to-Real Domain Gap:
    • Works like [3, 74] evaluate this by training on synthetic datasets (ModelNet-40) and testing on noisy real-world sets (ScanObjectNN).
    • Critique: This approach conflates OOD styles and OOD corruptions, making independent analysis difficult.

3.2.3. Neural Radiance Field (NeRF) and Neural Surface Reconstruction

  • NeRF [60]: The foundational work representing scenes as MLPs.

    • NeRF Function: A NeRF model learns a continuous 5D function $F_{\Theta}: (x, y, z, \theta, \phi) \rightarrow (R, G, B, \sigma)$, where (x, y, z) is a 3D point, $(\theta, \phi)$ is the 2D viewing direction, (R, G, B) is the color, and $\sigma$ is the volume density.
    • Volume Rendering: To render a pixel color $C(\mathbf{r})$ for a ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$ (origin $\mathbf{o}$, direction $\mathbf{d}$), NeRF uses numerical integration: $ C(\mathbf{r}) = \sum_{i=1}^{N} T_i (1 - \exp(-\sigma_i \delta_i)) c_i $ where $T_i = \exp\left(-\sum_{j=1}^{i-1} \sigma_j \delta_j\right)$, $c_i$ is the color, $\sigma_i$ is the density, and $\delta_i$ is the distance between adjacent samples along the ray.
    • Explanation: NeRF maps a 3D coordinate and viewing direction to color and density using a neural network. To generate an image, it casts rays from the camera through each pixel. Along each ray, it samples points, queries their color and density from the network, and then uses the volume rendering formula to accumulate these values into a final pixel color. The term $\sigma_i \delta_i$ represents the opacity of a small segment along the ray, and $T_i$ is the accumulated transmittance, representing the probability that light reaches segment $i$ without being obstructed. A minimal sketch of this accumulation appears after this list.
  • NeRF Improvements:

    • Quality: mip-NeRF [5], NeRF in the Dark [59], mip-NeRF 360 [6].
    • Efficiency: TensoRF [10], Plenoxels [25], Instant NGP [62], Direct Voxel Grid Optimization [84].
  • NeRF Generalization: MVSNeRF [11], pixelNeRF [105], IBRNet [91], NeuRay [52], NeRFormer [74], GNT [88]. These aim to learn priors across multiple scenes.

  • Neural Surface Reconstruction:

    • Implicit surface representations (e.g., SDF) combined with NeRF: NeuS [90], VolSDF [103], Voxurf [95]. These achieve accurate, mask-free surface reconstruction.
    • NeuS [90] and VolSDF [103] bridge neural volume rendering with implicit surface representation. NeuS proposes a different volume rendering formulation from NeRF to more accurately recover surfaces from SDFs.
      • NeuS's SDF-based Rendering: NeuS renders color along a ray from an SDF $f(\mathbf{x})$ by converting signed distances into opacity with a logistic sigmoid $\Phi_s(x) = (1 + e^{-sx})^{-1}$, whose sharpness $s$ is learnable. The opaque density along a ray $\mathbf{p}(t) = \mathbf{o} + t\mathbf{d}$ is defined as $ \rho(t) = \max\left( \frac{-\frac{\mathrm{d}\Phi_s}{\mathrm{d}t}\big(f(\mathbf{p}(t))\big)}{\Phi_s\big(f(\mathbf{p}(t))\big)}, 0 \right) $ and the rendered color is $ C(\mathbf{r}) = \int_0^{+\infty} T(t) \, \rho(t) \, c(\mathbf{p}(t), \mathbf{d}) \, dt $ where $T(t) = \exp\left(-\int_0^t \rho(u) \, du\right)$ is the accumulated transmittance and $c(\mathbf{p}(t), \mathbf{d})$ is the color at point $\mathbf{p}(t)$ viewed from direction $\mathbf{d}$.
      • Explanation: The learnable sharpness $s$ controls how quickly opacity rises as a ray approaches the zero-level set of the SDF, so the rendering weights concentrate near the surface. This makes the surface explicit in the representation and allows high-quality meshes to be extracted from the learned SDF. In practice, the integral is approximated via numerical summation along sampled points.
    • Voxurf [95] uses an explicit volumetric representation for acceleration.
    • Sparse-view reconstruction: SparseNeuS [54], MonoSDF [106]. These exploit generalizable priors or geometric cues from pre-trained networks.
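
To make the volume-rendering sum above concrete, here is a minimal NumPy sketch of the discrete accumulation along a single ray. The densities, colors, and sample spacings are placeholder arrays rather than outputs of a trained radiance field.

```python
import numpy as np

def composite_ray(sigmas, colors, deltas):
    """Discrete volume rendering along one ray (the summation written above).

    sigmas: (N,) volume densities at the sampled points
    colors: (N, 3) RGB colors at the sampled points
    deltas: (N,) distances between adjacent samples
    Returns the accumulated pixel color of shape (3,).
    """
    alphas = 1.0 - np.exp(-sigmas * deltas)           # opacity of each ray segment
    # T_i: probability that light reaches sample i without being absorbed earlier.
    trans = np.concatenate(([1.0], np.cumprod(1.0 - alphas[:-1])))
    weights = trans * alphas                          # per-sample contribution
    return (weights[:, None] * colors).sum(axis=0)

# Toy usage with random placeholder values (not a trained radiance field).
rng = np.random.default_rng(0)
print(composite_ray(rng.uniform(0, 5, 64), rng.uniform(0, 1, (64, 3)), np.full(64, 0.02)))
```

The running product over `1 - alphas` is the discrete counterpart of the transmittance $T_i$ in the formula above.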

3.2.4. 3D Object Generation

  • Early approaches (voxels): Extend 2D generation to 3D voxels [27, 35, 55, 82, 94]. Critique: High computational cost for high resolution.
  • Other 3D data formulations: Point clouds [2, 61, 101, 109], octree [39], implicit representations [14, 57]. Critique: Challenging to generate complex and textured surfaces.
  • Textured 3D Meshes: Textured3DGAN [70], DIBR [12] (deform template meshes, limited complexity), PolyGen [63], SurfGen [56], GET3D [29].
    • GET3D [29]: State-of-the-art for generating diverse meshes with rich geometry and textures in two branches. The paper specifically benchmarks GET3D.
    • Critique: Training generative models on large-vocabulary, realistic datasets like OmniObject3D is still challenging.

3.3. Technological Evolution

The evolution of 3D vision has largely followed the availability of data and computational power.

  1. Early 3D Vision (Pre-2015): Relied heavily on traditional geometric methods and smaller, often manually created, 3D models.

  2. Synthetic Data Era (2015-2019): Datasets like ShapeNet and ModelNet enabled the rise of deep learning for 3D tasks. Models like PointNet and PointNet++ revolutionized point cloud processing. Generative models started exploring 3D data.

  3. NeRF and Implicit Representations (2020-Present): NeRF burst onto the scene, offering unprecedented realism in novel-view synthesis. This led to a surge in research on implicit neural representations for geometry and appearance, alongside efforts to improve NeRF's efficiency, quality, and generalization. Simultaneously, research in 3D generation moved towards more complex and textured outputs.

  4. Real-world Data Push (Recent): The limitations of synthetic data became increasingly apparent. Efforts began to gather larger-scale, real-world 3D data (CO3D, GSO, AKB-48), often facing challenges in quality, scale, or breadth.

    This paper's work, OmniObject3D, fits squarely into the "Real-world Data Push" phase, aiming to overcome the limitations of previous real-world datasets by providing an unprecedented scale and quality for real-scanned objects. It seeks to close the sim-to-real gap and accelerate research on realistic 3D vision.

3.4. Differentiation Analysis

Compared to prior works, OmniObject3D offers several core innovations:

  • Scale and Vocabulary: It is the largest real-world 3D object dataset with accurate 3D meshes, comprising 6,000 objects in 190 categories. This is significantly larger and more diverse than GSO (1k objects, 17 cats) or AKB-48 (2k articulated objects, 48 cats). Its category overlap with ImageNet and LVIS is also a key differentiator for generalizable representation learning.

  • Richness of Annotations: Unlike CO3D (which lacks meshes and has limited point clouds), OmniObject3D provides a comprehensive suite of data per object: textured meshes, point clouds, multi-view rendered images, and real-captured videos with foreground masks and COLMAP poses. This multi-modal annotation supports a wider range of 3D tasks.

  • Quality and Realism: The use of professional scanners ensures high-fidelity geometry and realistic textures, which is a step above many datasets derived from consumer-grade scanning or less controlled environments. This directly addresses the sim-to-real gap by providing truly realistic data.

  • Systematic Benchmarking: The paper not only releases the dataset but also meticulously sets up and evaluates four distinct, challenging benchmarks (robust 3D perception, novel-view synthesis, neural surface reconstruction, and 3D object generation). This structured evaluation framework is crucial for guiding future research.

  • Fine-grained Robustness Analysis: For 3D perception, OmniObject3D is the first clean real-world point cloud object dataset that allows for disentangling OOD styles and OOD corruptions, enabling a more granular understanding of model robustness than previous benchmarks.

    In essence, OmniObject3D differentiates itself by combining unprecedented scale, rich multi-modal annotations, high realism, and a structured benchmarking approach, directly addressing the key limitations of existing 3D object datasets.

4. Methodology

The core methodology presented in the paper is the data collection, processing, and annotation pipeline for the OmniObject3D dataset. This involves several stages to ensure the dataset's large vocabulary, rich annotations, and realistic scans.

4.1. Principles

The overarching principle behind the OmniObject3D dataset creation is to provide a large-scale, high-quality, real-world 3D object dataset that can bridge the synthetic-to-real domain gap and facilitate research in various 3D vision tasks. The methodology is designed to:

  1. Maximize Vocabulary Diversity: Cover a wide range of daily object categories, aligning with common 2D and 3D datasets.
  2. Ensure Annotation Richness: Provide multi-modal data for each object, including textured meshes, point clouds, rendered multi-view images, and real-captured videos.
  3. Guarantee Realistic High-Fidelity Scans: Utilize professional scanning equipment to capture precise geometry and realistic appearances.
  4. Enable Comprehensive Benchmarking: Structure the data to support diverse evaluation tracks for perception, reconstruction, and generation.

4.2. Core Methodology In-depth (Layer by Layer)

The data creation process is broken down into four main stages: Category List Definition, Object Collection Pipeline, Image Rendering and Point Cloud Sampling, and Video Capturing and Annotation.

4.2.1. Category List Definition

  • Process: The first step involves carefully defining a category list. This is done by aggregating and cross-referencing categories from several popular existing 2D and 3D datasets, such as ShapeNet [9], ImageNet [19], LVIS [33], Open Images [47], MS-COCO [48], Objects365 [80], and ModelNet [96].
  • Goal: The aim is to create a list of objects that are both commonly-distributed (frequently encountered in daily life) and highly diverse (spanning a wide range of shapes, sizes, and functionalities). The list is also dynamically expanded during collection to include reasonable new classes.
  • Result: The final dataset comprises 190 widely-spread categories, ensuring a rich library of texture, geometry, and semantic information.

4.2.2. Object Collection Pipeline

  • Process: After defining categories, a variety of physical objects from each category are collected. These objects are then meticulously 3D scanned using professional-grade equipment.
    • Scanners Used:
      • Shining 3D scanner
      • Artec Eva 3D scanner
    • These scanners are selected to accommodate objects of different scales.
  • Scanning Time: The time required for scanning varies significantly:
    • 15 minutes for small, rigid objects with simple geometry (e.g., an apple, a toy).
    • Up to an hour for non-rigid, complex, or large objects (e.g., a bed, a kite).
  • Object Manipulation: For approximately 10% of the objects, common real-world manipulations (e.g., a bite taken out of an apple, an object cut into pieces) are performed to capture naturalistic variations.
  • Scale Preservation: The 3D scans faithfully retain the real-world scale of each object.
  • Canonical Pose Alignment: While scans preserve real-world scale, their initial poses are not strictly aligned. Therefore, a canonical pose is pre-defined for each category, and objects within that category are manually aligned to this standard pose.
  • Quality Control: Each scan undergoes a quality check, and only ~83% of the collected scans that meet high-quality standards are reserved for the dataset.
  • Result: 6,000 high-quality textured meshes are obtained, representing precise shapes with geometric details and realistic appearances with high-frequency textures.

4.2.3. Image Rendering and Point Cloud Sampling

This stage generates additional data modalities from the collected 3D meshes to support diverse research objectives.

  • Multi-view Image Rendering:

    • Tool: Blender [17] (a professional 3D computer graphics software) is used for rendering.
    • Settings: Object-centric and photo-realistic multi-view images are rendered.
    • Camera Poses: Accurate camera poses are generated and stored alongside the images.
    • Viewpoints: Images are rendered from 100 random viewpoints sampled on the upper hemisphere.
    • Resolution: Rendered images have a resolution of 800 x 800 pixels.
    • Additional Cues: High-resolution mid-level cues like depth maps and normal maps are also produced for each rendered image, which can be useful for various research applications.
  • Point Cloud Sampling:

    • Tool: The Open3D toolbox [111] is used to sample point clouds from the textured 3D meshes.
    • Method: Uniform sampling is applied.
    • Resolution: Multi-resolution point clouds are sampled, with $2^n$ points for $n \in \{10, 11, 12, 13, 14\}$, i.e., 1,024, 2,048, 4,096, 8,192, and 16,384 points, respectively (a minimal sampling sketch appears after this list).
  • Data Generation Pipeline: Beyond the pre-generated data, a pipeline is provided to allow users to generate new data with custom camera distributions, lighting, and point sampling methods as needed.

  • Result: For each 3D mesh, 100 rendered images (with corresponding depth and normal maps), and multi-resolution point clouds are generated.
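
As referenced in the point cloud sampling step above, the following is a minimal sketch of multi-resolution uniform sampling with the Open3D toolbox. The mesh path is a hypothetical placeholder, and the dataset's official generation pipeline may use different settings.

```python
import open3d as o3d

# Path is a hypothetical placeholder for one of the released textured meshes.
mesh = o3d.io.read_triangle_mesh("example_scan/mesh.obj")

# Uniformly sample point clouds at 2^10 ... 2^14 points, as described above.
for n in range(10, 15):
    num_points = 2 ** n
    pcd = mesh.sample_points_uniformly(number_of_points=num_points)
    o3d.io.write_point_cloud(f"points_{num_points}.ply", pcd)
```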

    The following figure (Figure S2 from the original paper) shows examples of the Blender rendered results, including RGB, depth, and normal maps.

    Figure S2. Examples of the Blender [17] rendered results, showing for each object its RGB image (first column), the corresponding depth map highlighting the object's shape (second column), and the normal map reflecting surface orientation (third column).

4.2.4. Video Capturing and Annotation

This stage involves capturing real-world video footage of the scanned objects and annotating them with camera poses and foreground masks.

  • Video Capture:

    • Device: An iPhone 12 Pro mobile phone is used.
    • Setup: The object is placed on or beside a calibration board.
    • Coverage: Each video captures a full 360-degree range around the object.
  • Frame Filtering:

    • QR Codes on the calibration board are used to recognize its square corners.
    • Blurry frames, where fewer than 8 corners are recognized, are filtered out.
  • Camera Pose Estimation (COLMAP):

    • Sampling: 200 frames are uniformly sampled from the filtered video.
    • Tool: COLMAP [77] (a Structure-from-Motion and Multi-View Stereo pipeline) is applied to these sampled frames.
    • Annotation: COLMAP annotates the frames with camera poses.
    • Absolute Scale Recovery: The scales of the calibration board in both the SfM coordinate space (output by COLMAP) and the real world are used to recover the absolute scale of the SfM coordinate system.
  • Foreground Mask Generation:

    • Pipeline: A two-stage matting pipeline is developed.
    • Models: It is based on U2-Net [73] and FBA [24] matting models.
    • Process: Initially, the Rembg tool is used on image frames to remove backgrounds and generate 3,000 good pseudo segmentation labels. The pipeline is then refined by fine-tuning with these pseudo labels to boost segmentation ability.
  • Result: Real-captured video frames with corresponding foreground masks and COLMAP-estimated camera poses are provided for each object.
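
For the camera pose estimation step, a minimal sketch of the standard COLMAP sparse-reconstruction stages (feature extraction, matching, incremental mapping) invoked from Python is shown below. All paths are placeholders, and the paper's calibration-board constraints and exact configuration are not reproduced.

```python
import os
import subprocess

IMAGES = "frames/"    # placeholder: the ~200 uniformly sampled video frames
DB = "colmap.db"      # placeholder path for the COLMAP feature database
SPARSE = "sparse/"    # placeholder output directory for the sparse reconstruction
os.makedirs(SPARSE, exist_ok=True)

# Standard COLMAP SfM stages: feature extraction, matching, incremental mapping.
subprocess.run(["colmap", "feature_extractor",
                "--database_path", DB, "--image_path", IMAGES], check=True)
subprocess.run(["colmap", "exhaustive_matcher", "--database_path", DB], check=True)
subprocess.run(["colmap", "mapper", "--database_path", DB,
                "--image_path", IMAGES, "--output_path", SPARSE], check=True)
# The recovered poses are defined only up to an unknown scale; the paper uses the
# calibration board to convert them to real-world (absolute) scale, not shown here.
```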

    The following figure (Figure S3 from the original paper) illustrates examples of the segmentation process, object manipulations, and a comparison of COLMAP sparse reconstruction with the professional textured mesh from the scanner.

    Figure S3. Examples of the segmentation (a), manipulation (b), and reconstruction (c). In (c), the missing bottom of the SfM reconstruction from video frames is due to its contact with the table.

The full methodology ensures that OmniObject3D provides a diverse, high-quality, and richly annotated dataset suitable for a wide array of 3D vision research.

5. Experimental Setup

The paper establishes four main evaluation tracks to demonstrate the utility of OmniObject3D. For each track, specific datasets, evaluation metrics, and baselines are used.

5.1. Datasets

The OmniObject3D dataset itself serves as the primary dataset for all evaluation tracks, often in conjunction with other datasets for training or comparison.

5.1.1. OmniObject3D Characteristics

  • Source: Real-scanned objects.
  • Scale: 6,000 objects.
  • Categories: 190 daily categories.
  • Annotations per object:
    • High-quality textured meshes.
    • Multi-resolution point clouds ($2^{10}$ to $2^{14}$ points).
    • 100 multi-view rendered images (800x800 pixels) with depth and normal maps.
    • Real-captured videos (200 frames) with foreground masks and COLMAP camera poses.
  • Distribution: Long-tailed distribution with an average of ~30 objects per category.
  • Purpose: Facilitates realistic 3D perception, reconstruction, and generation.

5.1.2. Datasets for Specific Tracks

Robust 3D Perception:

  • Training Data: ModelNet-40 [96] (synthetic dataset for object classification, 12,311 CAD models in 40 categories).
  • Test Data:
    • OmniObject3D: Used to evaluate OOD style robustness (sim-to-real gap).
    • OmniObject3D-C: OmniObject3D corrupted with common corruptions (e.g., Scale, Jitter, Drop Global/Local, Add Global/Local, Rotate) as described in ModelNet-C [75]. Used to evaluate OOD corruption robustness.

Novel View Synthesis (NVS):

  • Single-Scene NVS:
    • OmniObject3D rendered images (100 views per object): 1/8 of the images are randomly sampled as the test set, and the remaining 7/8 are used for training.
    • OmniObject3D iPhone videos: Used for qualitative and quantitative comparisons under SfM-wo-bg (SfM-estimated poses, foreground only) and SfM-w-bg (SfM-estimated poses, with background) settings.
  • Cross-Scene NVS:
    • OmniObject3D rendered images: 10 specific categories with varied scenes are selected as the test set. For each category, 3 scenes are chosen for testing, and the rest for training. 3 source views are used as input, and 10 test views (from remaining 97 views by FPS sampling) are evaluated.

Neural Surface Reconstruction:

  • Dense-View Surface Reconstruction:
    • OmniObject3D rendered images: 3 objects per category are selected. 100 views per object are used for training.
  • Sparse-View Surface Reconstruction:
    • OmniObject3D rendered images: 3 views are sampled using Farthest Point Sampling (FPS) for input. For SparseNeuS and MVSNeRF, FPS is conducted among the nearest 30 camera poses from a random reference view.
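
Farthest Point Sampling, used above to pick well-spread input views, can be sketched in a few lines of NumPy. This is a generic greedy implementation that selects views by their camera positions; it is an illustration, not the authors' exact code.

```python
import numpy as np

def farthest_point_sampling(points, k, seed=0):
    """Greedy FPS: pick k points (e.g., camera positions) that are mutually far apart.

    points: (N, D) array; k: number of samples to select.
    Returns the indices of the selected points.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    selected = [int(rng.integers(n))]                 # start from a random point
    dist = np.full(n, np.inf)
    for _ in range(k - 1):
        # Distance of every point to the closest already-selected point.
        dist = np.minimum(dist, np.linalg.norm(points - points[selected[-1]], axis=1))
        selected.append(int(dist.argmax()))           # pick the farthest remaining point
    return np.array(selected)

# Example: choose 3 well-spread camera positions out of 100 candidate viewpoints.
cams = np.random.default_rng(1).normal(size=(100, 3))
print(farthest_point_sampling(cams, 3))
```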

3D Object Generation:

  • Training Data: Subsets of OmniObject3D are used:
    • Fruits, Furniture, Toys: Representative categories.
    • Rand-100: 100 randomly selected categories.
    • Each subset is split into 80% training and 20% testing.
  • Input for Generation Model: Multi-view images (24 inward-facing views per object) rendered with Blender.

5.2. Evaluation Metrics

5.2.1. Robust 3D Perception

  • Overall Accuracy (OA):
    • Conceptual Definition: Measures the proportion of correctly classified samples out of the total number of samples. It's a fundamental metric for classification tasks, indicating the general correctness of a model's predictions.
    • Mathematical Formula: $ \mathrm{OA} = \frac{\text{Number of Correct Predictions}}{\text{Total Number of Predictions}} $
    • Symbol Explanation:
      • Number of Correct Predictions: The count of samples where the model's predicted class matches the true class.
      • Total Number of Predictions: The total number of samples evaluated.
  • Mean Corruption Error (mCE):
    • Conceptual Definition: A metric used to quantify a model's robustness to various types of corruptions. It averages the corruption error across different corruption types and severity levels, often normalized by a baseline model's performance to provide a relative measure. Lower mCE indicates better robustness.
    • Mathematical Formula: The paper reports DGCNN-normalized mCE [75]. Following ModelNet-C, the Corruption Error for each corruption type $c$ is $ \mathrm{CE}_c = \frac{E_c^{\text{model}}}{E_c^{\text{baseline}}} $ and the mean Corruption Error is $ \mathrm{mCE} = \frac{1}{C} \sum_{c=1}^{C} \mathrm{CE}_c $ (a small numeric sketch follows this list).
    • Symbol Explanation:
      • $E_c^{\text{model}}$: Error rate of the evaluated model on corruption type $c$.
      • $E_c^{\text{baseline}}$: Error rate of a specified baseline model (e.g., DGCNN) on corruption type $c$.
      • $C$: Total number of corruption types.
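
As a small numeric sketch of the normalization described above (with DGCNN as the reference baseline, following ModelNet-C), mCE can be computed from per-corruption error rates as follows; the numbers below are illustrative placeholders, not values from the paper.

```python
import numpy as np

def mean_corruption_error(model_err, baseline_err):
    """mCE: average of per-corruption error rates normalized by a baseline model.

    model_err, baseline_err: arrays of shape (C,) holding error rates (1 - OA)
    for each of the C corruption types.
    """
    ce = np.asarray(model_err) / np.asarray(baseline_err)   # CE_c per corruption type
    return float(ce.mean())

# Illustrative numbers only (not taken from the paper's tables).
dgcnn_err = np.array([0.30, 0.45, 0.25])
model_err = np.array([0.27, 0.50, 0.20])
print(mean_corruption_error(model_err, dgcnn_err))  # < 1 means more robust than DGCNN
```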

5.2.2. Novel View Synthesis (NVS)

  • Peak Signal-to-Noise Ratio (PSNR):

    • Conceptual Definition: A widely used metric to measure the quality of reconstruction of lossy compression codecs or, in this case, the quality of a synthesized image compared to its ground truth. It quantifies the ratio between the maximum possible power of a signal and the power of corrupting noise that affects the fidelity of its representation. Higher PSNR indicates better quality.
    • Mathematical Formula: $ \mathrm{PSNR} = 10 \cdot \log_{10} \left( \frac{\mathrm{MAX}_I^2}{\mathrm{MSE}} \right) $ where $ \mathrm{MSE} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} [I(i,j) - K(i,j)]^2 $ (a short numeric sketch follows this metric list).
    • Symbol Explanation:
      • $\mathrm{MAX}_I$: The maximum possible pixel value of the image (e.g., 255 for an 8-bit image).
      • $\mathrm{MSE}$: Mean Squared Error between the original image $I$ and the reconstructed image $K$.
      • M, N: Dimensions of the image.
      • I(i,j): Pixel value at position (i,j) in the original image.
      • K(i,j): Pixel value at position (i,j) in the reconstructed image.
  • Structural Similarity Index Measure (SSIM) [93]:

    • Conceptual Definition: A perceptual metric that quantifies the similarity between two images. It considers image degradation as perceived change in structural information, also incorporating luminance and contrast changes. Higher SSIM indicates better perceived quality.
    • Mathematical Formula: $ \mathrm{SSIM}(x,y) = \frac{(2\mu_x\mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)} $
    • Symbol Explanation:
      • x, y: Two image patches being compared.
      • $\mu_x, \mu_y$: Mean intensities of $x$ and $y$.
      • $\sigma_x, \sigma_y$: Standard deviations (contrast) of $x$ and $y$.
      • $\sigma_{xy}$: Covariance of $x$ and $y$ (structural similarity).
      • $c_1 = (K_1 L)^2$, $c_2 = (K_2 L)^2$: Constants to prevent division by zero, where $L$ is the dynamic range of pixel values and $K_1, K_2 \ll 1$.
  • Learned Perceptual Image Patch Similarity (LPIPS) [108]:

    • Conceptual Definition: A metric that uses features from a pre-trained deep neural network (like VGG or AlexNet) to measure the perceptual distance between two images. It correlates better with human judgment of image similarity than traditional metrics like PSNR or SSIM. Lower LPIPS indicates higher perceptual similarity.
    • Mathematical Formula: LPIPS is typically computed by:
      1. Extracting feature maps $\phi_l(x)$ and $\phi_l(y)$ from images $x$ and $y$ using a deep network $\phi$.
      2. Normalizing the features in the channel dimension.
      3. Scaling each channel by learned weights $\mathbf{w}_l$ and calculating the $L_2$ distance between the feature maps at each layer $l$.
      4. Summing over layers and spatial locations: $ \mathrm{LPIPS}(x,y) = \sum_l \frac{1}{H_l W_l} \sum_{h,w} \Vert \mathbf{w}_l \odot (\phi_l(x)_{h,w} - \phi_l(y)_{h,w}) \Vert_2^2 $
    • Symbol Explanation:
      • x, y: Two images.
      • $\phi_l$: Feature map extracted from layer $l$ of a pre-trained network.
      • $\mathbf{w}_l$: Learnable scaling factors for each channel at layer $l$.
      • $H_l, W_l$: Height and width of the feature map at layer $l$.
      • $\odot$: Element-wise multiplication.
  • Ldepth (Depth Error):

    • Conceptual Definition: Measures the difference between the predicted depth map and the ground-truth depth map. Lower Ldepth indicates more accurate geometry reconstruction. The paper does not specify the exact formula; it commonly refers to an L1 or L2 error on depth, and an L1 error is assumed here for illustration.
    • Mathematical Formula (example for L1): $ \mathrm{L_{depth}} = \frac{1}{MN} \sum_{i=0}^{M-1} \sum_{j=0}^{N-1} \left| \mathrm{Depth}_{pred}(i,j) - \mathrm{Depth}_{gt}(i,j) \right| $
    • Symbol Explanation:
      • $\mathrm{Depth}_{pred}(i,j)$: Predicted depth value at pixel (i,j).
      • $\mathrm{Depth}_{gt}(i,j)$: Ground-truth depth value at pixel (i,j).
      • M, N: Dimensions of the depth map.

5.2.3. Neural Surface Reconstruction

  • Chamfer Distance (CD):
    • Conceptual Definition: A metric used to measure the similarity between two point sets or surfaces. It computes the average closest point distance from each point in one set to the other set, and vice-versa. Lower CD indicates higher similarity between the reconstructed and ground truth surfaces.
    • Mathematical Formula: For two point sets $S_1$ and $S_2$: $ \mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 + \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|y - x\|_2^2 $ (see the sketch after this list).
    • Symbol Explanation:
      • $S_1, S_2$: The two point sets (or sampled points from surfaces) being compared.
      • $|S_1|, |S_2|$: Number of points in each set.
      • $x \in S_1, y \in S_2$: Points from the respective sets.
      • $\min_{y \in S_2} \|x - y\|_2^2$: Squared Euclidean distance from point $x$ to its nearest neighbor in $S_2$.
5.2.4. 3D Object Generation

  • Coverage (Cov):
    • Conceptual Definition: Measures the diversity of generated shapes. It quantifies how well the generated point clouds "cover" the distribution of the real (test) shapes. Higher Cov indicates greater diversity.
    • Mathematical Formula: The paper states CD is used to compute Coverage. A common formulation (from works like PointFlow) measures the fraction of reference shapes that are "covered" by the generated shapes, e.g., by checking whether the minimum CD from each reference shape to the generated set is below a threshold: $ \mathrm{Coverage}(S_{gen}, S_{ref}) = \frac{1}{|S_{ref}|} \sum_{x \in S_{ref}} \mathbb{I} \left( \min_{y \in S_{gen}} \|x - y\|_2^2 \leq \tau \right) $
    • Symbol Explanation:
      • $S_{gen}$: Set of generated shapes (or their point clouds).
      • $S_{ref}$: Set of reference (test) shapes.
      • $\mathbb{I}(\cdot)$: Indicator function, which is 1 if the condition is true and 0 otherwise.
      • $\tau$: A threshold distance.
  • Minimum Matching Distance (MMD):
    • Conceptual Definition: Measures the quality or realism of generated shapes. It quantifies how similar, on average, each generated shape is to its closest match in the real (test) data. Lower MMD indicates better quality.
    • Mathematical Formula: The paper states CD is used to compute MMD. A common formulation (from works like PointFlow) is the average minimum CD from generated shapes to reference shapes (see the sketch after this metric list): $ \mathrm{MMD}(S_{gen}, S_{ref}) = \frac{1}{|S_{gen}|} \sum_{x \in S_{gen}} \min_{y \in S_{ref}} \|x - y\|_2^2 $
    • Symbol Explanation:
      • $S_{gen}$: Set of generated shapes.
      • $S_{ref}$: Set of reference (test) shapes.
      • $x \in S_{gen}$: A generated shape.
      • $\min_{y \in S_{ref}} \|x - y\|_2^2$: Squared Euclidean distance from $x$ to its nearest neighbor in $S_{ref}$.
  • Fréchet Inception Distance (FID) [37]:
    • Conceptual Definition: A metric for evaluating the quality of images generated by generative models. It calculates the Fréchet distance between the feature distributions of real images and generated images, typically using a pre-trained Inception-v3 network. Lower FID indicates higher quality and realism.
    • Mathematical Formula: For two multivariate Gaussian distributions $\mathcal{N}(\mu_1, \Sigma_1)$ and $\mathcal{N}(\mu_2, \Sigma_2)$ representing the feature embeddings of real and generated images: $ \mathrm{FID} = \Vert \mu_1 - \mu_2 \Vert_2^2 + \mathrm{Tr}\left(\Sigma_1 + \Sigma_2 - 2(\Sigma_1 \Sigma_2)^{1/2}\right) $
    • Symbol Explanation:
      • $\mu_1, \mu_2$: Mean feature vectors of real and generated images.
      • $\Sigma_1, \Sigma_2$: Covariance matrices of real and generated images.
      • $\Vert \cdot \Vert_2^2$: Squared $L_2$ norm.
      • $\mathrm{Tr}(\cdot)$: Trace of a matrix.
  • $\mathrm{FID^{ref}}$:
    • Conceptual Definition: A reference FID score calculated between the training set and the test set. This value indicates the inherent FID "gap" between the train and test splits, providing context for interpreting the FID between generated and test data. If the FID for generated data is close to $\mathrm{FID^{ref}}$, it suggests the model is generating samples similar to the data distribution.
    • Mathematical Formula: Same as FID, but applied between the training and testing sets.
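
Given a matrix of pairwise Chamfer Distances between generated and reference shapes, Coverage and MMD as written above can be sketched as follows. The threshold-based Coverage follows the formulation stated here; the paper's evaluation may instead use the PointFlow-style nearest-neighbor matching convention.

```python
import numpy as np

def coverage(dist_ref_to_gen, tau):
    """Fraction of reference (test) shapes whose nearest generated shape is within tau.

    dist_ref_to_gen: (R, G) matrix of Chamfer Distances from each reference shape
    to each generated shape; tau: distance threshold.
    """
    return float((dist_ref_to_gen.min(axis=1) <= tau).mean())

def minimum_matching_distance(dist_gen_to_ref):
    """Average Chamfer Distance from each generated shape to its closest reference shape.

    dist_gen_to_ref: (G, R) matrix of Chamfer Distances.
    """
    return float(dist_gen_to_ref.min(axis=1).mean())

# Toy usage with a random distance matrix standing in for real CD values.
d = np.random.default_rng(0).uniform(0, 10, size=(50, 40))  # 50 generated vs 40 reference
print(coverage(d.T, tau=2.0), minimum_matching_distance(d))
```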

5.3. Baselines

5.3.1. Robust 3D Perception

The paper benchmarks ten state-of-the-art point cloud classification models:

  • DGCNN [92]

  • PointNet [71]

  • PointNet++ [72]

  • RSCNN [51]

  • Simple View [30]

  • GDANet [99]

  • PAConv [98]

  • CurveNet [97]

  • PCT [32]

  • RPC [75]

    These baselines are representative of different architectural designs for point cloud processing, including early pioneering works (PointNet, PointNet++), graph-based methods (DGCNN, RSCNN), view-based methods (SimpleView), and more recent advancements focusing on attention (PCT) or robustness (RPC, CurveNet, GDANet).

5.3.2. Novel View Synthesis (NVS)

  • Single-Scene NVS:
    • NeRF [60]: The foundational implicit representation method.
    • mip-NeRF [5]: An improved version of NeRF addressing aliasing.
    • Plenoxels [25]: A voxel-based method known for efficiency and handling high-frequency details.
  • Cross-Scene NVS:
    • MVSNeRF [11]: Generalizable NeRF reconstruction from multi-view stereo.
    • IBRNet [91]: Image-based rendering network for novel views.
    • pixelNeRF [105]: NeRF from one or a few images, learning priors across scenes.

5.3.3. Neural Surface Reconstruction

  • Dense-View Surface Reconstruction:
    • NeuS [90]: Neural implicit surfaces with volume rendering for multi-view reconstruction (SDF-based).
    • VolSDF [103]: Volume rendering of neural implicit surfaces based on SDF.
    • Voxurf [95]: Voxel-based efficient and accurate neural surface reconstruction.
  • Sparse-View Surface Reconstruction:
    • NeuS [90] (with sparse-view input): As a strong baseline.
    • MonoSDF [106]: Explores monocular geometric cues for neural implicit surface reconstruction.
    • SparseNeuS [54]: Generalizable neural surface prediction from sparse views.
    • pixelNeRF [105] and MVSNeRF [11]: Their extracted geometries are evaluated.

5.3.4. 3D Object Generation

  • GET3D [29]: A state-of-the-art generative model that directly generates explicit textured 3D meshes. It is chosen due to its ability to generate diverse meshes with rich geometry and textures.

6. Results & Analysis

The paper conducts extensive studies across four evaluation tracks using OmniObject3D.

6.1. Core Results Analysis

6.1.1. Robust 3D Perception

The OmniObject3D dataset enables a unique fine-grained analysis of point cloud classification robustness by disentangling OOD styles and OOD corruptions. Models are trained on ModelNet-40 (synthetic) and tested on OmniObject3D (real-world, clean) for OOD style and OmniObject3D-C (real-world, corrupted) for OOD corruption.

The following are the results from Table 2 of the original paper:

Method | mCE† (↓) | OAClean (↑) | OAstyle (↑) | mCE (↓)
DGCNN [92] | 1.000 | 0.926 | 0.448 | 1.000
PointNet [71] | 1.422 | 0.907 | 0.466 | 0.969
PointNet++ [72] | 1.072 | 0.930 | 0.407 | 1.066
RSCNN [51] | 1.130 | 0.923 | 0.393 | 1.076
SimpleView [30] | 1.047 | 0.939 | 0.476 | 0.990
GDANet [99] | 0.892 | 0.934 | 0.497 | 0.920
PAConv [98] | 1.104 | 0.936 | 0.403 | 1.073
CurveNet [97] | 0.927 | 0.938 | 0.500 | 0.929
PCT [32] | 0.925 | 0.930 | 0.459 | 0.940
RPC [75] | 0.863 | 0.930 | 0.472 | 0.936

Observations:

  1. OAClean vs. OAstyle Correlation: There is little correlation between performance on a clean synthetic test set (OAClean) and OOD-style robustness (OAstyle). For example, SimpleView achieves the best OAClean (0.939) but only a mediocre OAstyle (0.476). This highlights that strong performance on synthetic data does not guarantee generalization to real-world styles.

  2. Robustness of Advanced Grouping: GDANet (0.497 OAstyle, 0.920 mCE) and CurveNet (0.500 OAstyle, 0.929 mCE) demonstrate strong robustness to both OOD styles and OOD corruptions. These methods employ advanced point grouping strategies (frequency-based and curve-based, respectively), suggesting that effective local feature aggregation is key for generalization.

  3. Challenge of Combined OOD: The OOD style + OOD corruption setting (measured by mCE on OmniObject3D-C) is significantly more challenging. RPC [75], which is specifically designed for robustness to OOD corruptions and achieves the best mCE† (0.863) on ModelNet-C, shows inferior mCE (0.936) on the real-world corrupted OmniObject3D-C. This indicates that models optimized for synthetic corruptions may not perform as well when exposed to real-world data distributions AND corruptions simultaneously.

    Conclusion: The results reveal that robust point cloud perception models, capable of handling both OOD styles and OOD corruptions simultaneously, are still under-explored. OmniObject3D provides a crucial benchmark for this comprehensive understanding.

6.1.2. Novel View Synthesis (NVS)

Single-Scene NVS: The paper evaluates NeRF [60], mip-NeRF [5], and Plenoxels [25] on OmniObject3D objects. 1/8 of the images are used for testing, and the remaining 7/8 for training.

The following are the results from Table 3 of the original paper:

Method | PSNR (↑) / SD | SSIM (↑) / SD | LPIPS (↓) / SD
NeRF [60] | 34.01 / 3.46 | 0.953 / 0.029 | 0.068 / 0.061
mip-NeRF [5] | 39.86 / 4.58 | 0.974 / 0.013 | 0.084 / 0.048
Plenoxels [25] | 41.04 / 6.84 | 0.982 / 0.031 | 0.030 / 0.031

Observations:

  1. Plenoxels Performance: Plenoxels achieves the best average PSNR (41.04), SSIM (0.982), and especially LPIPS (0.030), suggesting excellent quality for modeling high-frequency appearances.
  2. Stability: However, Plenoxels shows relatively higher standard deviation (SD) for PSNR (6.84) and SSIM (0.031) compared to NeRF and mip-NeRF, indicating less stability across diverse scenes. It tends to introduce artifacts with concave geometry (e.g., bowls) or dark foreground objects.
  3. MLP-based Robustness: NeRF and mip-NeRF are more robust to challenging cases like dark textures and concave geometries.

Impact of Data Type (Rendered vs. Real-Captured Videos): Experiments comparing Blender rendered data with iPhone videos (processed with SfM-wo-bg and SfM-w-bg) show that Blender data yields the best quality. SfM-wo-bg (foreground only) performs slightly worse due to motion blur and SfM pose inaccuracies, while SfM-w-bg (with background) performs the worst due to the additional challenge of unbounded scenes. This highlights challenges in casual video capture for NeRF-like methods.

The following are the results from Table R2 of the original paper:

Method | Data-type | PSNR (↑)
NeRF [60] | SfM-w-bg | 22.92
NeRF [60] | SfM-wo-bg | 24.70
NeRF [60] | Blender | 28.07
mip-NeRF [5] | SfM-w-bg | 23.29
mip-NeRF [5] | SfM-wo-bg | 25.62
mip-NeRF [5] | Blender | 31.25
Plenoxels [25] | SfM-w-bg | 14.06
Plenoxels [25] | SfM-wo-bg | 19.18
Plenoxels [25] | Blender | 28.07

Cross-Scene NVS: The paper evaluates MVSNeRF [11], IBRNet [91], and pixelNeRF [105] on 10 categories for generalization.

The following PSNR results are from Table 4 of the original paper (the full table also reports SSIM, LPIPS, and Ldepth):

Method | Train | PSNR (↑)
MVSNeRF [11] | All* | 17.49
MVSNeRF [11] | Cat. | 17.54
MVSNeRF [11] | All*-ft. | 25.70
MVSNeRF [11] | Cat.-ft. | 25.52
IBRNet [91] | All* | 19.39
IBRNet [91] | Cat. | 19.03
IBRNet [91] | All*-ft. | 26.89
IBRNet [91] | Cat.-ft. | 25.67
pixelNeRF [105] | All* | 22.16

Observations:

  1. Generalizable Priors: MVSNeRF and pixelNeRF trained on all categories (All*) achieve competitive or even superior results to models trained only on a specific category (Cat.), especially in geometric metrics (Ldepth). This suggests that OmniObject3D effectively provides rich information for learning strong generalizable priors.
  2. Fine-tuning Benefits: After fine-tuning (-ft.), IBRNet (All*-ft.) achieves the best NVS results, comparable to scene-specific methods. This indicates that OmniObject3D provides a good pre-training base for fine-tuning on new scenes.
  3. Method-specific weaknesses: IBRNet struggles with geometry from sparse inputs, being better suited for dense-view generalization. MVSNeRF lags in visual performance when test frames are widely distributed due to potential inaccuracies in cost volume for large viewpoint changes.
  4. Unaligned Coordinate Systems: Evaluating pixelNeRF-U on unaligned coordinate systems shows a significant drop in PSNR and more blurry/irregular shapes. This indicates that current NeRF methods implicitly rely on canonical coordinate systems, and misalignment impairs learned variance.

6.1.3. Neural Surface Reconstruction

Dense-View Surface Reconstruction: The paper evaluates NeuS [90], VolSDF [103], and Voxurf [95] using 100 views per object. Categories are split into Hard, Medium, and Easy based on reconstruction difficulty.

The following are the results from Table 5 of the original paper:

Chamfer Distance × 10³ (↓)

Method | Hard | Medium | Easy | Avg
NeuS [90] | 9.26 | 5.63 | 3.46 | 6.09
VolSDF [103] | 10.06 | 4.94 | 2.86 | 5.92
Voxurf [95] | 9.01 | 4.98 | 2.58 | 5.49
Avg | 9.44 | 5.19 | 2.97 | 5.83

Observations:

  1. Difficulty Levels: A clear margin exists in Chamfer Distance between Hard, Medium, and Easy categories. Hard categories typically involve dark/low-texture objects, concave geometries, or complex/thin structures (e.g., pan, vase, durian). Easy cases are usually simple geometries with proper textures.

  2. Voxurf Performance: Voxurf achieves the best average CD (5.49), especially in Easy (2.58) and Medium (4.98) categories, suggesting its efficiency and fine geometry reconstruction capabilities.

    The following figure (Figure 5 from the original paper) shows the performance distribution of dense-view surface reconstruction, illustrating the imbalance across categories.

    Figure 5. Performance distribution of dense-view surface reconstruction. The averaged results of the three methods is imbalanced. The colored area denotes a smoothed range of results.

    Sparse-View Surface Reconstruction: The paper evaluates NeuS [90], MonoSDF [106], SparseNeuS [54], pixelNeRF [105], and MVSNeRF [11] with 3 views.

The following are the results from Table 6 of the original paper:

Chamfer Distance × 10³ (↓)

Method | Train | Hard | Medium | Easy | Avg
NeuS [90] | Single | 29.35 | 27.62 | 24.79 | 27.33
MonoSDF [106] | Single | 35.14 | 35.35 | 32.76 | 34.68
SparseNeuS [54] | 1 cat. | 34.05 | 31.32 | 31.14 | 32.36
SparseNeuS [54] | 10 cats. | 30.75 | 30.11 | 28.37 | 29.87
SparseNeuS [54] | All cats. | 26.13 | 26.08 | 22.13 | 25.00
SparseNeuS [54] | Easy | 28.39 | 26.65 | 23.76 | 26.48
SparseNeuS [54] | Medium | 27.38 | 26.66 | 23.08 | 25.87
SparseNeuS [54] | Hard | 27.42 | 26.95 | 24.63 | 26.47
MVSNeRF [11] | All cats. | 56.68 | 48.09 | 48.70 | 51.16
pixelNeRF [105] | All cats. | 63.31 | 59.91 | 61.47 | 61.56

Observations:

  1. Overall Challenge: All methods show apparent artifacts in sparse-view reconstruction, indicating that this remains a challenging problem.

  2. SparseNeuS Performance: SparseNeuS trained on sufficient data (All cats.) demonstrates the best quantitative performance on average (25.00 CD), learning generalizable priors.

  3. NeuS Baseline Strength: NeuS with sparse-view input achieves surprisingly good performance (27.33 CD) without specialized sparse-view techniques, especially for thin structures when combined with FPS sampling. However, it can suffer from local geometry ambiguity.

  4. MonoSDF Limitations: MonoSDF, which relies on estimated geometry cues, performs worse than NeuS in OmniObject3D (34.68 CD). It struggles when depth/normal estimations are inaccurate, even though it performs well on DTU. This suggests a domain gap or reliance on cue accuracy that is not always met.

  5. Generalized NeRF Surfaces: Surfaces extracted from pixelNeRF and MVSNeRF (generalized NeRF models) are of relatively low quality, indicating that their density fields don't always translate to precise surface geometry in sparse settings.

  6. Impact of View Count: Increasing view numbers from 2 to 8 significantly improves accuracy for NeuS and MonoSDF, but even 8 views still lag behind dense-view performance.

    The following figure (Figure 4 from the original paper) shows qualitative comparisons of neural surface reconstruction results for both dense-view and sparse-view settings.

    Figure 4. Neural surface reconstruction results for both dense-view and sparse-view settings. The left half shows three dense-view cases with their multi-view input images; the right half compares sparse-view results from different methods (e.g., NeuS, VolSDF), illustrating how well each recovers object shape and detail.
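The FPS view selection referenced in observation 3 can be illustrated in a few lines: greedily pick camera positions that are as far apart as possible, so that a small view budget still covers the object well. This is a generic farthest-point-sampling sketch, not the paper's exact implementation.

```python
import numpy as np

def farthest_point_sampling(points: np.ndarray, k: int, start: int = 0) -> np.ndarray:
    """Greedy FPS: select k indices so that each newly added point is as far
    as possible from everything selected so far."""
    chosen = [start]
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for _ in range(k - 1):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(points - points[nxt], axis=1))
    return np.array(chosen)

# Example: pick 3 well-spread views out of 100 candidate camera positions.
cam_centers = np.random.rand(100, 3)
view_ids = farthest_point_sampling(cam_centers, k=3)
```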

6.1.4. 3D Object Generation

The paper evaluates GET3D [29] on subsets of OmniObject3D.

Qualitative Results: GET3D generates realistic textures and coherent shapes with fine geometric details (e.g., lychee, pineapple). Shape interpolation shows smooth transitions between semantically different instances.

The following figure (Figure 7 from the original paper) shows qualitative results of GET3D on OmniObject3D.

Figure 7. Qualitative results of GET3D on OmniObject3D. The generated objects span diverse categories from the dataset (e.g., fruits, household items, and vehicles), with realistic shapes and colors.

The following figure (Figure 8 from the original paper) shows shape interpolation results.

Figure 8. Shape interpolation. We interpolate both geometry and texture latent codes from left to right.
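The interpolation in Figure 8 amounts to linearly blending the geometry and texture latent codes the generator is conditioned on. The sketch below assumes a GET3D-like generator exposing separate `z_geo` and `z_tex` inputs; the names and call signature are placeholders, not GET3D's actual API.

```python
import torch

@torch.no_grad()
def interpolate_latents(generator, z_geo_a, z_geo_b, z_tex_a, z_tex_b, steps=8):
    """Walk from instance A to instance B in both latent spaces, returning one
    generated output per interpolation step."""
    outputs = []
    for t in torch.linspace(0.0, 1.0, steps):
        z_geo = torch.lerp(z_geo_a, z_geo_b, t)   # blend geometry code
        z_tex = torch.lerp(z_tex_a, z_tex_b, t)   # blend texture code
        outputs.append(generator(z_geo, z_tex))   # placeholder generator call
    return outputs
```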

Semantic Distribution: Training an unconditional model on 100 randomly selected categories reveals an imbalanced generation distribution.

The following figure (Figure 6 from the original paper) presents the category distribution of generated shapes.

Figure 6. The category distribution of the generated shapes. (a) shows a weak positive correlation between the number of generated shapes and training shapes per category. (b) visualizes the correlation matrix among different categories by Chamfer Distance between their mean shapes. (c) visualizes categories being clustered into eight groups by KMeans. (d) presents a clear training-generation relation in the group-level statistics.

Observations:

  1. Weak Positive Correlation: The number of generated shapes per category shows a weak positive correlation with the number of training shapes.
  2. Category Correlation and Grouping: Categories are not independent and can be grouped. For instance, Group 1 (18 categories, 587 training shapes) has relatively low inner-group divergence (many similar fruits/vegetables) and becomes the most popular group among generated shapes. Group 2 (27 categories, 883 training shapes) has high inner-group divergence, which prevents it from dominating. This highlights how cross-class relationships and inner-group divergence shape the generation distribution (a sketch of the grouping procedure follows this list).
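One plausible way to reproduce this grouping, assuming a mean point cloud per category is available: build the pairwise Chamfer-Distance matrix between mean shapes and run KMeans on its rows. The paper does not spell out its exact feature construction, so treat this as a reconstruction of the idea rather than the authors' code.

```python
import numpy as np
from scipy.spatial import cKDTree
from sklearn.cluster import KMeans

def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer Distance between two (N, 3) point clouds."""
    return 0.5 * (cKDTree(b).query(a)[0].mean() + cKDTree(a).query(b)[0].mean())

def group_categories(mean_shapes: list, n_groups: int = 8) -> np.ndarray:
    """Cluster categories into n_groups, using each category's row of the
    pairwise Chamfer-Distance matrix as its feature vector."""
    n = len(mean_shapes)
    dist = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            dist[i, j] = dist[j, i] = chamfer(mean_shapes[i], mean_shapes[j])
    return KMeans(n_clusters=n_groups, n_init=10, random_state=0).fit_predict(dist)
```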

Diversity and Quality: GET3D is evaluated on the Fruits, Furniture, Toys, and Rand-100 subsets using Coverage (Cov, a diversity measure), Minimum Matching Distance (MMD, a fidelity measure), FID, and $\mathrm{FID^{ref}}$.

The following are the results from Table 7 of the original paper:

| Split | #Objs | #Cats | Cov (% ↑) | MMD (↓) | FID (↓) | $\mathrm{FID^{ref}}$ (↓) |
| --- | --- | --- | --- | --- | --- | --- |
| Furniture | 265 | 17 | 67.92 | 4.27 | 87.39 | 58.40 |
| Fruits | 610 | 17 | 46.72 | 3.32 | 105.31 | 87.15 |
| Toys | 339 | 7 | 55.22 | 2.78 | 122.77 | 41.40 |
| Rand-100 | 2951 | 100 | 61.70 | 3.89 | 46.57 | 8.65 |
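Cov and MMD here follow the standard point-cloud generation metrics: Coverage is the fraction of reference shapes that are the nearest neighbour of at least one generated shape, and MMD averages, over the reference shapes, the distance to the closest generated shape. The sketch below assumes a precomputed generated-to-reference distance matrix (e.g., Chamfer Distance); the paper's exact distance and sampling settings are not reproduced here.

```python
import numpy as np

def cov_and_mmd(dist: np.ndarray) -> tuple:
    """Compute Coverage and Minimum Matching Distance from dist[i, j],
    the distance between generated shape i and reference shape j.

    Cov: fraction of reference shapes matched as the nearest neighbour of at
         least one generated shape (higher = more diverse).
    MMD: mean over reference shapes of the distance to the closest generated
         shape (lower = higher fidelity).
    """
    matched_refs = np.unique(dist.argmin(axis=1))   # nearest reference per generated shape
    cov = matched_refs.size / dist.shape[1]
    mmd = dist.min(axis=0).mean()                   # closest generated shape per reference
    return cov, mmd

# Example with a random 500 (generated) x 300 (reference) distance matrix.
d = np.random.rand(500, 300)
cov, mmd = cov_and_mmd(d)
print(f"Cov: {100 * cov:.2f}%  MMD: {mmd:.4f}")
```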

Observations:

  1. Furniture: Suffers from the lowest generation quality (highest MMD, 4.27), likely due to a small training set (265 objects across 17 categories) and the inherent complexity of furniture shapes.

  2. Fruits: Has higher quality (MMD 3.32) and lower diversity (Cov 46.72%) despite the same number of categories as Furniture, possibly because fruits often share similar structures.

  3. Toys: Achieves the best quality (MMD 2.78) while training on only 7 categories, indicating that concentrated, coherent categories can lead to better generation.

  4. Rand-100: The most difficult split, showing a trade-off between quality (MMD 3.89) and diversity (Cov 61.70%). FID and $\mathrm{FID^{ref}}$ are relatively low, largely reflecting the much larger training set.

  5. Disentanglement Issues: Disentangled interpolation experiments show that geometry and texture latent codes are not fully disentangled, as geometry code can sometimes affect texture, especially when categories, geometry, and texture are highly correlated in the dataset. Complex textures (e.g., book covers) also remain challenging to generate well.

    Conclusion: Training generative models on a large-vocabulary, realistic dataset like OmniObject3D is promising but challenging. Key issues include semantic distribution bias, varying exploration difficulties in different groups, and limitations in disentanglement and complex texture generation.

6.2. Data Presentation (Tables)

All tables from the original paper's main text and supplementary materials, as used in the analysis above, have been transcribed completely.

6.3. Ablation Studies / Parameter Analysis

The paper includes several analyses that resemble ablation studies or parameter analyses:

  • Impact of Data Type on NVS: The comparison between Blender rendered images, SfM-wo-bg, and SfM-w-bg for single-scene NVS acts as an ablation on the realism and complexity of input data. It shows a clear performance drop from ideal rendered data to noisy real-captured videos, particularly with background present.

  • Training Strategy for Cross-Scene NVS: The comparison between training on all categories (All*) and training on individual categories (Cat.) for cross-scene NVS explores the effect of broader versus narrower training distributions on generalizability. The finetuning (-ft) results further analyze the benefit of adapting pre-trained models to specific test scenes.

  • View Count for Sparse-View Reconstruction: The supplementary material presents results for NeuS and MonoSDF with 2, 3, 5, 8 views, effectively an ablation on the number of input views for sparse reconstruction. This demonstrates the performance curve as view density changes.

  • View Selection Range for Cost Volume Initialization: For MVSNeRF, the paper analyzes how large a pool of nearest camera poses (ranging from 10 to 50) to draw source views from via FPS, and how this choice affects geometric quality. This is a parameter analysis to find a suitable trade-off (a pool of 30 is selected).

  • Semantic Distribution in 3D Generation: The grouping of categories using KMeans and analyzing generation statistics at a group level (Figure 6) serves as an ablation/analysis of how inter- and intra-group semantic relationships influence the generative model's output distribution. This highlights the bias introduced by dataset composition.

  • Disentangled Interpolation (Supplementary): The interpolation of geometry and texture latent codes separately (Figure S13) is an analysis of the disentanglement capabilities of GET3D on OmniObject3D, revealing that disentanglement is not perfect and latent codes can interact.

    These analyses provide valuable insights into the behavior of different models under varying conditions and dataset characteristics provided by OmniObject3D, going beyond mere performance numbers to explain why certain results are observed.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper OmniObject3D: Large-Vocabulary 3D Object Dataset for Realistic Perception, Reconstruction and Generation introduces a monumental contribution to the 3D vision community: OmniObject3D, a large-scale, high-quality dataset of 6,000 real-scanned 3D objects spanning 190 diverse daily categories. Each object is meticulously annotated with textured meshes, point clouds, multi-view rendered images, and real-captured videos with camera poses and foreground masks.

This dataset addresses the critical sim-to-real gap that has plagued 3D vision research relying on synthetic data. The authors establish four comprehensive evaluation tracks—robust 3D perception, novel-view synthesis, neural surface reconstruction, and 3D object generation—and conduct extensive experiments with state-of-the-art methods. These studies reveal crucial observations: current models struggle with combined OOD styles and corruptions in perception, real-captured videos introduce significant challenges for NVS, sparse-view reconstruction remains unsolved, and 3D generative models exhibit semantic distribution biases when trained on large-vocabulary datasets.

Overall, OmniObject3D not only provides an invaluable resource for developing and evaluating realistic 3D vision technologies but also rigorously identifies key challenges and opportunities for future research.

7.2. Limitations & Future Work

The authors implicitly and explicitly highlight several limitations and suggest future work:

  • Robust 3D Perception:
    • Limitation: Performance on clean synthetic data has low correlation with OOD-style robustness. Current models struggle with OOD styles combined with OOD corruptions.
    • Future Work: Develop point cloud perception models robust against both OOD styles and OOD corruptions.
  • Novel-View Synthesis:
    • Limitation: Voxel-based methods (Plenoxels) can be unstable with concave geometries or dark objects. Real-captured videos introduce significant challenges (motion blur, SfM inaccuracies) compared to rendered data.
    • Future Work: Pursue more generalizable and robust novel-view synthesis methods, especially those capable of handling casually captured videos and unaligned coordinate systems. Explore how generic methods can achieve both accurate shape contour and geometry.
  • Neural Surface Reconstruction:
    • Limitation: Sparse-view surface reconstruction is not yet well-solved, with current methods showing apparent artifacts. MonoSDF struggles when estimated depth/normal cues are inaccurate.
    • Future Work: Study generalizable surface reconstruction pipelines and robust strategies for utilizing estimated geometric cues in sparse-view settings. Develop methods that can maintain coherent global shapes while also capturing local geometric details accurately.
  • 3D Object Generation:
    • Limitation: Generative models trained on large-vocabulary datasets exhibit semantic distribution bias. The disentanglement of geometry and texture is imperfect, and complex textures are challenging to generate.
    • Future Work: Investigate methods to mitigate semantic distribution bias, improve disentanglement between geometry and texture, and enhance the generation of complex textures in large-vocabulary, realistic 3D object generation.

7.3. Personal Insights & Critique

This paper presents a highly valuable and timely contribution to the 3D vision field. The sheer scale and meticulous annotation of OmniObject3D represent a significant step forward in addressing the sim-to-real gap. My key insights and critiques are:

  • Impact on Research: The dataset's comprehensiveness, particularly its inclusion of textured meshes, point clouds, rendered images, and real videos, will undoubtedly catalyze research across multiple 3D tasks. The alignment of categories with ImageNet and LVIS is a smart design choice that facilitates cross-modal learning and transfer. The fine-grained OOD robustness analysis for point clouds is particularly insightful and will guide the development of more robust perception systems.

  • Dataset Quality and Effort: The dedication to professional scanning and manual canonical pose alignment speaks to the high quality. The challenges in scanning certain object types (e.g., non-rigid, complex) and the 10% manipulation objects add crucial realism. The explicit mention of data generation pipelines for users to customize views/sampling further enhances its utility.

  • Critique on Benchmarking Depth: While the paper sets up four tracks and performs extensive studies, the depth of analysis for each SOTA model within each track could be expanded. For instance, explaining why GDANet and CurveNet are robust (e.g., their specific architectural components) more thoroughly in the results section would be beneficial for a beginner. However, this is often constrained by paper length limits.

  • Realism of "Real-Captured Videos": The findings on SfM-w-bg being the worst for NVS are critical. While OmniObject3D aims for realism, the reliance on iPhone 12 Pro for videos and COLMAP for poses, while practical, might not represent the absolute cutting edge of multi-view capture systems. Future work could explore incorporating data from more advanced capture setups or more robust SfM/MVS pipelines for casual videos.

  • Potential for Multimodal Foundation Models: The rich, multimodal nature of OmniObject3D makes it an ideal candidate for training 3D multimodal foundation models. This is a clear, unstated opportunity. Such models could learn powerful joint representations across point clouds, meshes, and images, leading to stronger generalization and cross-task capabilities.

  • Long-tail Distribution Challenge: The dataset has a long-tailed distribution. While common in real-world data, this presents inherent challenges for generative models (as observed in semantic bias) and potentially for perception models in low-data categories. Future research using OmniObject3D could focus on few-shot 3D learning or long-tail 3D generation techniques to address this.

  • Ethical Considerations: The paper briefly mentions regulating data usage "to avoid potential negative social impacts." This is good practice, and expanding on the specific types of impacts considered and the mechanisms for regulation would strengthen this aspect.

    In conclusion, OmniObject3D is a landmark dataset that will significantly push the boundaries of realistic 3D vision research. Its meticulously crafted content and thoughtful benchmarking framework provide a solid foundation for tackling the next generation of challenges in 3D perception, reconstruction, and generation.
