ABO: Dataset and Benchmarks for Real-World 3D Object Understanding
TL;DR Summary
The paper presents the Amazon Berkeley Objects (ABO) dataset, bridging the gap between real and virtual 3D worlds with product images and artist-created models. It establishes benchmarks to assess state-of-the-art methods in single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.
Abstract
We introduce Amazon Berkeley Objects (ABO), a new large-scale dataset designed to help bridge the gap between real and virtual 3D worlds. ABO contains product catalog images, metadata, and artist-created 3D models with complex geometries and physically-based materials that correspond to real, household objects. We derive challenging benchmarks that exploit the unique properties of ABO and measure the current limits of the state-of-the-art on three open problems for real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The central topic of the paper is the introduction of a new dataset called Amazon Berkeley Objects (ABO) and benchmarks derived from it, aimed at advancing real-world 3D object understanding.
1.2. Authors
The authors are Jasmine Collins, Shubham Goel, Kenan Deng, Achleshwar Luthra, Leon Xu, Erhan Gundogdu, Xi Zhang, Tomas F. Yago Vicente, Thomas Dideriksen, Himanshu Arora, Matthieu Guillaumin, and Jitendra Malik. Their affiliations include UC Berkeley, Amazon, and BITS Pilani, indicating a collaboration between academia and industry. Jitendra Malik is a prominent researcher in computer vision.
1.3. Journal/Conference
The paper was published on 2022-06-01 (UTC). The specific conference or journal is not explicitly named in the provided text, but the publication date and topic suggest a major computer vision venue in 2022; papers by these authors on these topics are frequently presented at top-tier conferences such as CVPR, ICCV, or ECCV.
1.4. Publication Year
2022, per the publication timestamp above.
1.5. Abstract
This paper introduces the Amazon Berkeley Objects (ABO) dataset, a new large-scale resource designed to bridge the gap between real and virtual 3D representations. ABO comprises product catalog images, extensive metadata, and artist-created 3D models featuring complex geometries and physically-based materials, all corresponding to actual household objects. The authors leverage the unique properties of ABO to establish challenging benchmarks for three key open problems in real-world 3D object understanding: single-view 3D reconstruction, material estimation, and cross-domain multi-view object retrieval. Through these benchmarks, the paper evaluates the current state-of-the-art methods and identifies their limitations.
1.6. Original Source Link
/files/papers/693a3605e65c1507e459c744/paper.pdf (a direct link to the paper's PDF file).
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the significant gap between the progress in 2D image recognition, largely driven by abundant large-scale datasets, and the comparatively slower progress in 3D computer vision due to the scarcity of high-quality, large-scale 3D datasets.
This problem is important because while synthetic 3D datasets (like ShapeNet) offer scale, they often lack realism, texture, and diverse geometries, leading to models that perform well on synthetic data but fail to generalize to real-world images. Existing datasets that attempt to link 3D models to real images often suffer from approximate matches, small scale, limited categories, or lack of physically-based materials. Datasets created via classical 3D reconstruction from real images are typically small, labor-intensive, and often lack corresponding in-context real images or realistic reflectance properties.
The paper's entry point is to leverage the vast resources of Amazon.com product listings, which include real catalog images, rich metadata, and artist-created, high-quality 3D models with physically-based rendering (PBR) materials, to create a dataset that overcomes these limitations. The innovative idea is to use this commercial data to bridge the "synthetic-to-real" domain gap for 3D object understanding tasks.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Introduction of the Amazon Berkeley Objects (ABO) Dataset: This is a novel, large-scale dataset featuring 147,702 product listings, 398,212 unique catalog images, detailed metadata, and crucially, 7,953 artist-created 3D models with complex geometries and high-resolution Physically-Based Rendering (PBR) materials. ABO is unique in combining real-world images with high-quality, PBR-ready 3D models from diverse categories.
- Derivation of Challenging Benchmarks: The authors use ABO to create benchmarks for three open problems in 3D object understanding:
- Single-View 3D Reconstruction: Evaluating how well models trained on synthetic data (ShapeNet) generalize to realistic ABO objects.
- Material Estimation: Providing a baseline for predicting spatially-varying Bidirectional Reflectance Distribution Functions (SV-BRDFs) from single- and multi-view images of complex real-world objects, enabled by ABO's PBR materials.
- Cross-Domain Multi-View Object Retrieval: A challenging benchmark that leverages ABO's 3D models to generate diverse viewpoints and scenes, evaluating the robustness of Deep Metric Learning (DML) algorithms to viewpoint and domain shifts.
- Extensive Evaluation of State-of-the-Art Methods: The paper measures the current limits of existing methods on these new benchmarks.
Key conclusions and findings include:
- Significant Domain Gap in 3D Reconstruction: State-of-the-art 3D reconstruction models trained on synthetic ShapeNet data show a large performance drop when tested on realistic ABO objects, even within the same categories. This highlights the challenge posed by real-world object complexities and textures.
- Effectiveness of Multi-View for Material Estimation: Incorporating multiple views significantly improves the accuracy of SV-BRDF material estimation, particularly for view-dependent properties like roughness and metallicness, and benefits from 3D structure-based alignment. The proposed baseline demonstrates reasonable performance on real catalog images despite the domain gap.
- Challenging Nature of Cross-Domain Retrieval: The ABO retrieval benchmark proves highly challenging, with ImageNet-pretrained baselines performing poorly and even DML methods achieving significantly lower Recall@1 scores compared to existing benchmarks. This indicates that current DML methods are likely saturated on older datasets and that ABO offers a valuable new challenge.
- Viewpoint Sensitivity in Retrieval: Retrieval performance degrades rapidly for query images with extreme azimuth and elevation angles, revealing a critical area for future DML research.
These findings collectively underscore the need for more realistic and diverse 3D datasets like ABO to drive progress in 3D computer vision and bridge the gap towards real-world applications.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a foundational grasp of several computer vision and 3D graphics concepts is essential:
- 3D Computer Vision: This field aims to enable computers to understand the 3D structure, shape, and properties of objects and scenes from images or video. Unlike 2D computer vision which deals with flat images, 3D CV seeks to infer depth, volume, and material properties. It's crucial for applications like augmented reality (AR), virtual reality (VR), robotics, and autonomous navigation.
- 3D Object Understanding: This is a subfield of 3D computer vision focused on tasks such as:
- 3D Reconstruction: Creating a 3D model (e.g., mesh, point cloud) of an object from one or more 2D images.
- Material Estimation: Inferring the physical properties of an object's surface (e.g., color, shininess, texture) that determine how it interacts with light.
- Object Retrieval: Finding 3D models or images of similar objects based on a query image or 3D model.
- 3D Representations: Different ways to represent 3D objects in a computer:
- Meshes: A collection of vertices (points in 3D space) connected by edges to form triangular or quadrilateral faces, approximating the object's surface. They are common for representing solid objects.
- Voxels: A 3D grid where each cell (voxel) indicates whether it's occupied by the object or empty. Similar to pixels in 2D, but in 3D. They are easy to process but can be memory-intensive for high resolutions.
- Point Clouds: A set of discrete data points in 3D space, representing the exterior surface of an object. They are often generated by 3D scanners.
- Implicit Functions: Mathematical functions that define a surface as a level set, e.g., the set of points where $f(x, y, z) = 0$. They can represent complex, smooth shapes efficiently and are gaining popularity in deep learning for 3D.
- Physically-Based Rendering (PBR): A collection of rendering techniques that aim to simulate light's interaction with materials in a way that is physically accurate, producing highly realistic images. PBR materials are defined by parameters that correspond to physical properties, making them consistent under various lighting conditions.
- Spatially-Varying Bidirectional Reflectance Distribution Function (SV-BRDF):
- BRDF: A mathematical function that describes how light reflects off an opaque surface. It specifies how much light from any incoming direction reflects out in any outgoing direction. It's a 4D function (two angles for incoming light, two for outgoing).
- SV-BRDF: Extends BRDF by allowing the material properties (and thus the BRDF) to vary across the surface of an object. This means different parts of an object can have different material appearances (e.g., a metal handle attached to a wooden body).
- Disney PBR Model (glTF 2.0): A widely used and simplified PBR material model, adopted by many rendering engines. It defines materials using a small set of intuitive parameters like base color (albedo), metallic (how metallic the surface is), roughness (how rough or smooth the surface is, affecting specularity), and normal map (for fine surface details).
- 6-Degrees of Freedom (6-DOF) Pose Estimation: Determining the 3D position (translation: x, y, z) and orientation (rotation: roll, pitch, yaw) of an object relative to a camera or a global coordinate system.
- Differentiable Rendering: A technique that makes the rendering process (simulating how light interacts with 3D objects to produce a 2D image) differentiable. This means that gradients can be computed through the rendering process, allowing neural networks to learn to generate 3D models directly from 2D images by optimizing a loss function defined in image space. PyTorch3D is a library that implements differentiable rendering.
- Deep Metric Learning (DML): A machine learning paradigm where the goal is to learn an embedding space where semantically similar data points are close together, and dissimilar data points are far apart. It's crucial for tasks like image retrieval, face recognition, and clustering. Various loss functions exist (e.g., Contrastive Loss, Triplet Margin Loss, NTXent, NormSoftmax, ProxyNCA, Multi-similarity Loss).
- Convolutional Neural Networks (CNNs): A class of deep neural networks commonly used for analyzing visual imagery. They use convolutional layers to automatically and adaptively learn spatial hierarchies of features.
- U-Net: A CNN architecture originally developed for biomedical image segmentation. It has a U-shaped structure with an encoder (downsampling path) and a decoder (upsampling path) with skip connections, allowing it to combine high-level semantic information with fine-grained spatial details.
- ResNet (Residual Network): A type of CNN that introduces "residual connections" or "skip connections" that allow the network to bypass one or more layers. This helps in training very deep networks by addressing the vanishing gradient problem.
3.2. Previous Works
The paper contextualizes ABO by discussing existing datasets and methods, highlighting their limitations:
- 2D Image Recognition Datasets:
- ImageNet [15], COCO [44], LVIS [26]: These datasets, with their vast scale and diverse annotations (class labels, segmentation masks), have fueled significant progress in 2D computer vision. The paper notes that their success is due to the relative ease of collecting 2D annotations.
- Synthetic 3D Datasets:
- ShapeNet [10], 3D-Future [19], Thingi10k [72]: These provide large collections of CAD models. However, the paper points out that many models are low quality, untextured, or do not exist in the real world. This leads to methods that work well on clear-background renderings but struggle to generalize to real images and complex geometries.
- 2D-3D Alignment Datasets:
- Pascal3D+ [66], ObjectNet3D [65]: These datasets link existing 3D models (often CAD) to real-world images, with human annotators aligning pose. Their limitation is that the shape and pose matches are often approximate, and they inherit the quality limitations of the underlying CAD model datasets (poor coverage, basic geometries/textures).
- IKEA [42, 43], Pix3D [59]: These improved upon the above by providing exact, pixel-aligned 3D models for real images. However, they are relatively small (90 and 395 unique 3D models, respectively) and cover few categories. Pix3D models are untextured, limiting tasks like material prediction.
- Reconstructed 3D Datasets (from real images/videos):
- Object Scans [12], Objectron [3], Google Scans [56], BigBIRD [57], CO3D [55]: These datasets create 3D reconstructions from real images or videos, faithfully representing real objects. The main drawbacks are small scale (due to manual effort in collection), objects often imaged in controlled lab settings, and usually lacking corresponding in-context real images. Furthermore, textured surfaces are often assumed Lambertian, lacking realistic reflectance properties (PBR). CO3D specifically does not provide full 3D mesh reconstructions.
- Material-Specific Datasets:
- PhotoShapes [53]: Augmented ShapeNet CAD models with spatially-varying (SV-) BRDFs, but limited to a single category (chairs).
- [17, 20]: Provide high-quality SV-BRDF maps but only for planar surfaces.
- [32]: Contains only homogeneous BRDFs (i.e., uniform material properties across the surface) for various objects.
- [41, 7]: Introduce datasets with full SV-BRDFs, but their models are procedurally generated and do not correspond to real objects.
- 3D Shape Reconstruction Methods: The paper evaluates methods like 3D-R2N2 [13] (voxels), GenRe [71] (spherical maps), Occupancy Networks [48] (implicit functions), and Mesh R-CNN [22] (meshes). These are typically trained on ShapeNet and are mostly category-specific (except [71] claiming category-agnosticism).
- 2D/3D Image Retrieval Datasets:
- [40]: Explored joint embeddings for shapes and images but constrained by ShapeNet's limitations (e.g., cross-view retrieval only for chairs/cars).
- CARS-196 [36], CUB-200-2011 [62], In-Shop Clothes [45], SOP (Ebay) [52]: Datasets for Deep Metric Learning (DML) evaluation, focusing on fine-grained instances/categories of a few object types. The paper notes their limited diversity and structure, leading to near-saturation of state-of-the-art DML algorithms.
3.3. Technological Evolution
The evolution of 3D computer vision datasets has generally moved from:
1. Synthetic, CAD-based models: Easy to scale, but lacking realism and often textures (e.g., ShapeNet).
2. Approximate 2D-3D alignment: Linking existing CAD models to real images, improving real-world relevance but still limited by CAD quality and imprecise annotations (e.g., Pascal3D+, ObjectNet3D).
3. Exact 2D-3D alignment for limited categories: Manually intensive, but providing precise ground truth for real images (e.g., IKEA, Pix3D). Still small scale and often untextured.
4. 3D reconstruction from real scans/videos: Capturing real-world geometry, but often in controlled settings, small scale, and lacking PBR materials (e.g., Object Scans, CO3D).
5. Specialized material datasets: Focusing on PBR but often limited to specific materials, planar surfaces, or procedurally generated shapes (e.g., PhotoShapes).
ABO fits into this evolution by combining the strengths of various approaches while addressing their weaknesses. It provides large-scale, artist-created 3D models that represent real-world household objects, come with PBR materials, and are accompanied by corresponding real catalog images and rich metadata.
3.4. Differentiation Analysis
Compared to the main methods and datasets in related work, ABO introduces several core differences and innovations:
- Scale and Realism: ABO is large-scale (nearly 8K 3D models, 147K products, 398K images) and features highly realistic 3D models derived from actual product listings. This contrasts with synthetic datasets (e.g., ShapeNet), which are large but less realistic, and real-world 3D datasets (e.g., Pix3D, Object Scans), which are realistic but small-scale.
- Physically-Based Rendering (PBR) Materials: ABO's 3D models come with high-resolution, artist-created PBR materials (SV-BRDFs). This is a crucial distinction from most prior datasets, where 3D models are often untextured or have simplistic texture models (e.g., Pix3D, ShapeNet, Google Scans), and it enables material estimation tasks previously not possible at this scale and realism.
- Corresponding Real-World Images and Metadata: ABO pairs its 3D models with multi-view product catalog images and rich structured metadata. This multi-modal data allows for cross-domain tasks and detailed analysis that go beyond shape or pose.
- Diverse Categories: With 63 categories for 3D models, ABO is more diverse than many existing datasets that focus on a handful of categories (e.g., Pix3D's 9 categories, PhotoShape's 1 category).
- Automated Pose Annotation: The paper details an automatic pipeline for 6-DOF pose annotations using differentiable rendering, reducing the manual annotation burden seen in datasets like Pascal3D+ or Pix3D.
- Challenging Benchmarks: The dataset's unique properties allow for the creation of novel and challenging benchmarks that expose limitations of state-of-the-art methods in scenarios closer to real-world deployment (e.g., measuring the domain gap for ShapeNet-trained models, and cross-domain retrieval with varied viewpoints and complex backgrounds).
The following are the results from Table 1 of the original paper:
| Dataset | # Models | # Classes | Real images | Full 3D | PBR |
|---|---|---|---|---|---|
| ShapeNet [10] | 51.3K | 55 | X | ✓ | X |
| 3D-Future [19] | 16.6K | 8 | X | ✓ | X |
| Google Scans [56] | 1K | - | X | ✓ | X |
| CO3D [55] | 18.6K | 50 | ✓ | X | X |
| IKEA [43] | 219 | 11 | ✓ | ✓ | X |
| Pix3D [59] | 395 | 9 | ✓ | ✓ | X |
| PhotoShape [53] | 5.8K | 1 | X | ✓ | ✓ |
| ABO (Ours) | 8K | 63 | ✓ | ✓ | ✓ |
Table 1 clearly illustrates ABO's differentiation, being the only dataset listed that satisfies all criteria: having a substantial number of models (8K), a high number of classes (63), real images, full 3D models, and PBR materials.
4. Methodology
4.1. Principles
The core idea behind ABO is to create a large-scale, highly realistic, and diverse dataset for 3D object understanding by leveraging Amazon's extensive product data. This data includes product catalog images, rich textual and categorical metadata, and artist-created 3D models. The theoretical basis is that progress in 3D computer vision, similar to 2D vision, requires large, diverse, and realistic datasets. The intuition is that by providing 3D models with Physically-Based Rendering (PBR) materials that correspond to real-world objects and are accompanied by multi-view real images and detailed metadata, researchers can develop and evaluate more robust 3D vision algorithms that generalize better to real-world scenarios.
4.2. Core Methodology In-depth (Layer by Layer)
The ABO dataset is constructed from Amazon.com product listings and subsequently used to derive three main benchmarks.
4.2.1. ABO Dataset Properties and Composition
The ABO dataset originates from worldwide product listings, metadata, images, and 3D models provided by Amazon.com.
- Product Listings: It contains 147,702 listings of products from 576 product types sold across various Amazon-owned platforms. Each listing is identified by an item ID.
- Metadata: Each listing is provided with structured metadata, which is publicly available on the product's webpage. This includes up to 18 unique attributes such as category, color, material, weight, and dimensions (as illustrated in Figure 3, which shows sample catalog images alongside their attributes). The following image (Figure 3 from the original paper) shows sample catalog images and attributes that accompany ABO objects:
(Figure 3: sample product displays of furniture and household items, such as a black metal bottle rack and a brown wooden cabinet, shown alongside attributes including name, category, material, color, and weight.)
The following image (Figure 4 from the original paper) shows the distribution of 3D model categories in ABO:
(Figure 4: bar chart of the number of 3D models per category, with the y-axis on a log scale and categories such as chairs, planters, and desks on the x-axis.)
Figure 4 indicates the diverse range of categories (63 unique categories with 3D models) and their distribution (note the y-axis is on a log scale), showing a rich variety of objects.
- Images: The dataset includes 398,212 high-resolution catalog images. For 8,222 products, "360° View" turntable-style images are available, captured at regular azimuth intervals.
- 3D Models: A critical component is the inclusion of 7,953 artist-created, high-quality 3D models, provided in the glTF 2.0 format.
- Orientation: The 3D models are oriented in a canonical coordinate system, meaning the "front" of all objects (when well-defined) are aligned.
- Scale: Each model has a scale corresponding to real-world units, which is crucial for realistic applications.
- Material: They feature complex geometries and high-resolution, Physically-Based Rendering (PBR) materials, allowing for photorealistic rendering.
- Category Annotation: To facilitate comparison with existing methods (especially those trained on ShapeNet), each 3D model has category annotations mapped to noun synsets under the WordNet [49] taxonomy.
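As a concrete illustration of the WordNet mapping mentioned above, the snippet below looks up a noun synset for a category name using NLTK. The paper only states that categories were mapped to WordNet synsets; the use of NLTK and the `category_to_synset` helper are assumptions for illustration.

```python
# Hypothetical illustration of mapping an ABO category name to a WordNet noun synset
# with NLTK; the paper does not describe its exact tooling, so treat this as a sketch.
import nltk
from nltk.corpus import wordnet as wn

nltk.download("wordnet", quiet=True)            # one-time corpus download

def category_to_synset(category: str):
    """Return the first noun synset for a category name, e.g. 'chair' -> chair.n.01."""
    synsets = wn.synsets(category.replace(" ", "_"), pos=wn.NOUN)
    return synsets[0] if synsets else None

print(category_to_synset("chair"))              # Synset('chair.n.01')
```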
4.2.2. Catalog Image Pose Annotations
To enable tasks that require precise object placement, the dataset provides 6-Degrees of Freedom (6-DOF) pose annotations for 6,334 catalog images.
- Automatic Pipeline: The authors developed an automated pipeline for pose estimation, which contrasts with previous approaches that relied on human annotators for alignment. This pipeline leverages:
- Knowledge of the 3D model corresponding to the object in the image.
- Off-the-shelf instance masks [28, 34] (e.g., from Mask R-CNN or PointRend) to identify the object's pixels in the 2D image.
- Differentiable rendering to bridge the gap between 3D model and 2D image.
- Pose Optimization: For each instance mask $M$ (a binary mask representing the object in the image), the pipeline estimates the optimal rotation matrix $R \in SO(3)$ (the Special Orthogonal Group, representing 3D rotations) and translation vector $t$ such that the silhouette generated by rendering the 3D model with $(R, t)$ best matches the instance mask $M$. This is formulated as a minimization problem (a code sketch follows at the end of this subsection):
$$R^*, t^* = \arg\min_{R \in SO(3),\; t \in \mathbb{R}^3} \left\| \mathcal{R}(R, t) - M \right\|$$
Where:
- $R^*$, $t^*$: The optimal rotation matrix and translation vector, respectively, that minimize the loss.
- $\mathcal{R}(R, t)$: A differentiable renderer, implemented using PyTorch3D [54]. This function takes the 3D model (implicitly) and the current pose parameters and renders its 2D silhouette. The "differentiable" aspect means that gradients can be computed through this rendering process, allowing for optimization using gradient-based methods.
- $M$: The ground truth instance mask of the object in the catalog image.
- $\|\cdot\|$: A norm (e.g., the L2 norm) measuring the difference between the rendered silhouette and the ground truth mask. Minimizing this norm means finding the pose that makes the rendered 3D model's silhouette best match the observed 2D mask.
- Optimization Details:
- For each instance mask, 24 different runs are initialized with random rotations.
- The pose parameters are optimized for 1,000 steps using the Adam optimizer [47].
- The rotation matrix is parameterized using the symmetric orthogonalization procedure described in [38].
- The pose that yields the lowest loss among the 24 runs is selected and then subjected to a final human verification step to ensure correctness. The following image (Figure 2 from the original paper) shows posed 3D models in catalog images:
(Figure 2: illustration of 6-DOF pose annotations generated from instance masks, showing household products such as sofas and chairs from different viewpoints with the posed 3D models overlaid on the real objects.)
Figure 2 visually confirms how the instance masks are used to automatically generate 6-DOF pose annotations, showing the 3D model rendered onto the image with the estimated pose.
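The sketch below illustrates the silhouette-fitting formulation above in plain PyTorch. The paper uses PyTorch3D's differentiable mesh renderer; here a toy soft point-splatting "silhouette" stands in for it, and the SVD-based symmetric orthogonalization, learning rate, and image resolution are illustrative assumptions rather than the paper's exact settings.

```python
# Minimal sketch of silhouette-based 6-DOF pose fitting. A toy differentiable
# "renderer" replaces PyTorch3D's mesh renderer; treat this as illustrative only.
import torch

def symmetric_orthogonalization(m: torch.Tensor) -> torch.Tensor:
    """Project an unconstrained 3x3 matrix onto SO(3) via SVD."""
    u, _, vt = torch.linalg.svd(m)
    det = torch.det(u @ vt)
    d = torch.diag(torch.stack([det.new_tensor(1.0), det.new_tensor(1.0), det]))
    return u @ d @ vt

def soft_silhouette(points: torch.Tensor, R: torch.Tensor, t: torch.Tensor,
                    size: int = 64, sigma: float = 0.02) -> torch.Tensor:
    """Toy differentiable silhouette: splat orthographically projected model points."""
    p = points @ R.T + t                                         # pose the point-sampled model
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, size),
                            torch.linspace(-1, 1, size), indexing="ij")
    grid = torch.stack([xs, ys], dim=-1).reshape(-1, 2)          # image-plane pixel grid
    d2 = ((grid[:, None, :] - p[None, :, :2]) ** 2).sum(-1)      # pixel-to-point distances
    return torch.exp(-d2 / sigma).max(dim=1).values.reshape(size, size)

def fit_pose(points: torch.Tensor, mask: torch.Tensor, steps: int = 1000):
    m = torch.randn(3, 3, requires_grad=True)    # unconstrained rotation parameters
    t = torch.zeros(3, requires_grad=True)
    opt = torch.optim.Adam([m, t], lr=0.05)      # learning rate is a placeholder
    loss = None
    for _ in range(steps):
        R = symmetric_orthogonalization(m)
        loss = ((soft_silhouette(points, R, t) - mask) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return symmetric_orthogonalization(m).detach(), t.detach(), float(loss)

# As in the paper's pipeline, one would run several random restarts (24 in the text)
# and keep the pose with the lowest silhouette loss, followed by human verification.
```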
4.2.3. Material Estimation Dataset
To facilitate research in material estimation, a specialized dataset of rendered images with ground truth material properties is created from ABO's 3D models.
- Material Parameterization: The Disney [9] base color, metallic, and roughness parameters, defined in the glTF 2.0 specification [25], are used. These parameters describe the SV-BRDF of the object.
- Rendering Process:
- Images are rendered at a fixed resolution.
- 91 camera positions are used, distributed along an upper icosphere (a sphere approximated by triangles) around the object, ensuring diverse viewpoints (see the sketch at the end of this subsection).
- A fixed camera field-of-view (FOV) is used.
- Blender's [14] Cycles path tracer is employed, which is a physically-accurate renderer, ensuring high photorealism.
- Lighting and Backgrounds: To simulate diverse and realistic lighting conditions and backgrounds, each scene is illuminated using 3 random environment maps selected from a collection of 108 indoor HDRIs [23] (High Dynamic Range Images, providing realistic environmental lighting).
- Ground Truth Generation: For each rendered image, the corresponding ground truth data is generated:
- base color map
- metallicness map
- roughness map
- normal map (surface orientation)
- depth map (distance from camera to surface)
- segmentation mask (object outline)
- Scale: The resulting material estimation dataset comprises 2.1 million rendered images, along with their camera intrinsics and extrinsics.
- Dataset Curation for Material Estimation: Objects with transparencies are omitted, resulting in 7,679 models. These are split into a non-overlapping train/test set of 6,897 and 782 models, respectively. For testing generalization to new lighting, 10 out of 108 HDRI environment maps are reserved exclusively for the test set.
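To make the camera setup concrete, the sketch below places viewpoints on an upper hemisphere around an object. The paper uses 91 positions on an upper icosphere; the Fibonacci-spiral sampling, the `upper_hemisphere_cameras` helper, and the radius below are assumptions for illustration only.

```python
# A minimal sketch of spreading n camera positions over the upper hemisphere,
# approximating the 91 upper-icosphere viewpoints described in the text.
import numpy as np

def upper_hemisphere_cameras(n: int = 91, radius: float = 2.0) -> np.ndarray:
    """Return n camera positions (x, y, z) with z >= 0, roughly evenly spread."""
    i = np.arange(n)
    golden = (1 + 5 ** 0.5) / 2
    z = i / max(n - 1, 1)                       # heights from equator (0) to pole (1)
    theta = 2 * np.pi * i / golden              # azimuths along a Fibonacci spiral
    r_xy = np.sqrt(np.clip(1 - z ** 2, 0, 1))   # radius in the xy-plane
    return radius * np.stack([r_xy * np.cos(theta), r_xy * np.sin(theta), z], axis=1)

cams = upper_hemisphere_cameras()
print(cams.shape)   # (91, 3); each row can be fed to a renderer's look-at camera
```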
4.2.4. Single-View 3D Reconstruction Benchmark
This benchmark evaluates how well existing 3D reconstruction methods, predominantly trained on synthetic datasets like ShapeNet, transfer to more realistic objects from ABO.
- Object Subset: Only ABO models that fall into ShapeNet training categories are considered, to isolate the domain gap from cross-category generalization. This covers 6 classes (bench, chair, couch, cabinet, lamp, table), capturing 4,170 of the 7,953 3D models.
- Rendered Dataset: A separate dataset of rendered images (distinct from the material estimation dataset) is created for this benchmark.
- Objects are rendered on a blank (white) background.
- 30 viewpoints are rendered for each mesh, using Blender [14].
- A camera field-of-view (FOV) is chosen such that the entire object is visible.
- Camera azimuth and elevation are uniformly sampled on a unit sphere, with a lower limit on elevations to avoid uncommon bottom views.
- Evaluation Protocol: The evaluation largely follows [22].
- View-Space Evaluation: For methods like GenRe and Mesh R-CNN that predict in "view-space" (pose aligned to the image view), depth ambiguity is resolved by aligning predicted and ground truth (GT) meshes. Known camera extrinsics transform the GT mesh to view-space, and then a Chamfer-distance-minimizing depth search (51 candidates) is performed after normalizing average vertex depths (a sketch of this alignment follows after this list). Meshes are scaled such that the longest edge of the GT bounding box is 10, following [18, 22].
- Canonical-Space Evaluation: For methods like 3D-R2N2 and Occupancy Networks that predict in "canonical space" (category-specific, consistent pose), models are aligned with GT shapes using a single, manually-set rotation for cross-category semantic alignment. Relative translation and scale are optimized to minimize Chamfer distance after mean-centering and re-scaling.
- Voxel to Mesh Conversion: For 3D-R2N2, which predicts a voxel grid, it is converted to a mesh using an efficient protocol from [22] (replacing occupied voxels with cubes, merging vertices, removing internal faces) rather than Marching Cubes [46].
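The sketch below illustrates the view-space depth alignment referenced in the evaluation protocol: a brute-force search over candidate depth offsets that keeps the offset minimizing a naive Chamfer distance. The search range and the O(N·M) Chamfer helper are assumptions for illustration; this is not the exact evaluation code of [22].

```python
# Minimal sketch of Chamfer-minimizing depth alignment over 51 candidate offsets.
import torch

def chamfer_sq(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Symmetric squared Chamfer distance between point sets a (N,3) and b (M,3)."""
    d = torch.cdist(a, b) ** 2                      # pairwise squared distances
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def align_depth(pred_pts: torch.Tensor, gt_pts: torch.Tensor, n_candidates: int = 51):
    """Search candidate z-offsets for the prediction and keep the best Chamfer value."""
    pred = pred_pts.clone()
    pred[:, 2] -= pred[:, 2].mean()                 # normalize average vertex depth
    center = gt_pts[:, 2].mean()
    best_cd, best_dz = float("inf"), 0.0
    for dz in torch.linspace(center - 1.0, center + 1.0, n_candidates):  # assumed range
        shifted = pred.clone()
        shifted[:, 2] += dz
        cd = chamfer_sq(shifted, gt_pts).item()
        if cd < best_cd:
            best_cd, best_dz = cd, float(dz)
    return best_cd, best_dz
```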
4.2.5. Material Prediction Baseline
A U-Net-based model with a ResNet-34 backbone is proposed as a simple baseline for estimating SV-BRDFs from single- and multi-view images. The following image (Figure 10 from the original paper) shows the network architecture for material estimation:
(Figure 10: schematic of the ResNet-34-based encoder-decoder architecture, showing the encoder and decoder modules, skip connections, and inputs consisting of a reference view and neighboring views combined by projection.)
- Network Architecture (Figure 10):
- Encoder: A common encoder based on a ResNet-34 backbone (using the conv1-conv5 blocks) takes the RGB image(s) as input.
- Multi-head Decoder: The encoder output is fed into a multi-head decoder. Each head outputs one component of the SV-BRDF separately (e.g., base color, roughness, metallicness, normals).
- U-Net Structure: The architecture uses skip connections from the encoder to the decoder, characteristic of a U-Net, to preserve spatial details.
- Single-View Network (SV-net): The U-Net takes a single RGB image as input and outputs the SV-BRDF parameters.
- Multi-View Network (MV-net):
- Image Alignment: Inspired by [7, 17], images from multiple viewpoints are aligned by projection using depth maps.
- Input Data: The original reference image and its projected neighboring image pairs are bundled as input to the network.
- Architecture Reuse: The single-view U-Net architecture is reused.
- Arbitrary Number of Inputs: Global max pooling is used to handle a variable number of input images.
- Pixel-Level Correspondence: The MV-net uses camera poses to establish pixel-level correspondences. For a pixel $p$ in one viewpoint with homogeneous image coordinate $\tilde{p}$ and depth $d$, its corresponding pixel $\tilde{p}'$ in another viewpoint is computed (up to the perspective division) as:
$$\tilde{p}' \sim K \left( R \, d \, K^{-1} \tilde{p} + t \right)$$
Where:
- $K$: The camera intrinsic matrix (describing camera properties such as focal length and principal point).
- $R$: The rotation matrix between the two viewpoints.
- $t$: The translation vector between the two viewpoints.
- $d$: The depth of pixel $p$. The formula back-projects the pixel to a 3D point (derived from $d$ and $K^{-1}\tilde{p}$), transforms it into the other camera's coordinate system, and projects it onto that camera's 2D image plane (a code sketch follows at the end of this subsection).
- Occlusion Handling: Pixels that are occluded in other views are identified using a depth-based occlusion test and filled with values from the reference view.
- Training Details:
- Input: rendered images, resized to a fixed input resolution.
- Views: For training, 40 views are randomly subsampled from the icosphere for each object. For the multi-view network, for each reference view, its immediate 4 adjacent views are selected as neighboring views.
- Loss Function: Mean Squared Error (MSE) is used for the base color, roughness, metallicness, surface normal, and render losses.
- Differentiable Rendering Loss: Similar to [16], a differentiable rendering layer is utilized. This layer renders a flash-illuminated image from the network's material predictions and compares it to a similarly rendered ground truth image. This render loss helps regularize the network and guide the training process, enforcing perceptual realism.
- Direct Supervision: Ground truth material maps provide direct supervision for the network's outputs.
- Optimizer: The AdamW optimizer [47] is used for 17 epochs.
- Full Texture Map Reconstruction: Beyond per-view prediction, the paper describes a pipeline to generate a full textured PBR model. This involves back-projecting the per-view predicted material maps to the UV domain (a 2D texture coordinate space) and then using a learned encoder-decoder network to aggregate and smooth these predictions into a complete UV map for the entire object. The following image (Figure 11 from the original paper) shows the pipeline for full UV map prediction:
(Figure 11: schematic of the full UV map prediction pipeline; per-view predictions are back-projected into partial UV volumes, processed per material property, and aggregated into a complete UV map.)
Figure 11 illustrates how per-view predictions are back-projected onto the UV space and then processed by an encoder-decoder network for smoothing and aggregation, leading to a complete UV map for the material properties.
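The sketch below implements the depth-based pixel reprojection from the equation above. It assumes shared pinhole intrinsics for both views and omits the depth-based occlusion test; the `reproject_pixels` helper name is an assumption.

```python
# Minimal sketch of warping reference-view pixels into a neighboring view using
# the reference depth map, intrinsics K, and the relative pose (R, t).
import torch

def reproject_pixels(pixels: torch.Tensor, depths: torch.Tensor,
                     K: torch.Tensor, R: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Warp (N, 2) reference-view pixel coordinates into a neighboring view.

    pixels: (N, 2) pixel coordinates in the reference view
    depths: (N,) depths of those pixels in the reference view
    K: (3, 3) camera intrinsics; R (3, 3), t (3,): relative pose from the
       reference camera to the neighboring camera.
    """
    ones = torch.ones(pixels.shape[0], 1)
    p_h = torch.cat([pixels, ones], dim=1).T           # homogeneous pixels, (3, N)
    rays = torch.linalg.inv(K) @ p_h                   # back-projected rays, (3, N)
    X_ref = rays * depths                              # 3D points in the reference frame
    X_nbr = R @ X_ref + t[:, None]                     # points in the neighbor frame
    p_nbr = K @ X_nbr                                  # project into the neighbor image
    return (p_nbr[:2] / p_nbr[2:]).T                   # (N, 2) neighbor-view pixel coords
```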
4.2.6. Multi-View Cross-Domain Object Retrieval Benchmark
This benchmark evaluates the robustness of Deep Metric Learning (DML) algorithms to viewpoint changes and domain shifts using ABO's diverse images and 3D models.
- Dataset Curation:
- Object Selection: Focus on rigid objects, removing non-rigid items like garments and home linens.
- Near-Duplicate Detection: A hierarchical Union-Find algorithm is applied to detect and group near-duplicate products (e.g., different sizes of the same shoe) based on shared imagery. Each unique instance is assigned an instance id.
- Product Grouping: Product groups are formed for items from product lines that share design details, materials, or patterns (and thus may share common images). All instances within a group are assigned to the same data split (train, val, or test) to avoid data leakage.
- Splits:
- Train Set: 49,066 instances (3,993 with 3D models), 187,912 catalog images, and 110,928 rendered images (298,840 total).
- Validation Set: 854 instances with 3D models. The val-target images are catalog images (4,707); the val-query images are 8 rendered images per environment map.
- Test Set: 836 instances with 3D models. The test-target images are catalog images (4,313); the test-query images are 8 rendered images per environment map.
- Cross-Domain Aspect: The rendered images used as queries have complex and cluttered indoor backgrounds and diverse viewpoints (generated from the 3D models), forming a different domain than the catalog images used as targets, which typically have cleaner backgrounds and standard viewpoints. This creates a challenging cross-domain retrieval task.
- Methodology for DML Evaluation:
- DML Methods: State-of-the-art DML methods are compared, covering major approaches:
- Classification-based: NormSoftmax [70]
- Proxy-based: ProxyNCA [50]
- Tuple-based: Contrastive, TripletMargin, NTXent [11], Multi-similarity [63]
- Framework: PyTorch Metric Learning [2] implementations are used within the Powerful Benchmarker framework [1] for fair comparisons and Bayesian hyperparameter optimization.
- Backbone: ResNet-50 [29] (pre-trained on ImageNet) is used as the backbone network.
- Embedding: The ResNet-50 output is projected to a 128D embedding after a LayerNorm [4] layer.
- Training Details:
- BatchNorm parameters are not frozen.
- Image preprocessing: Images are padded to undistorted squares and resized to the network input size.
- Batch size: 256 samples with 4 samples per class (except for NormSoftmax and ProxyNCA, which use 32 samples with 1 sample per class, as this gave better results).
- Epochs: All losses are trained for 1000 epochs.
- Early Stopping: Based on the validation Recall@1 metric, computed every other epoch.
- Balancing: Critical for good performance, the training process balances classes with and without renderings in each batch. This ensures that the diverse viewpoints and scenes from renderings are exploited effectively and that negative pairs are sufficiently sampled.
- Optimizer: RMSProp with momentum 0.9; for the metric losses, the learning rate is further optimized via Bayesian hyperparameter optimization (a minimal training-step sketch follows at the end of this subsection).
The following are the results from Table 6 of the original paper:
| Data subset | 3D Recon. | Material Est. | Retrieval |
|---|---|---|---|
| **Train** |  |  |  |
| No-BG Renders | ✓ |  |  |
| BG Renders |  | ✓ | ✓ |
| Catalog Images |  |  | ✓ |
| **Test** |  |  |  |
| No-BG Renders | ✓ |  |  |
| BG Renders |  | ✓ | ✓ |
| Catalog Images | ✓ | ✓ | ✓ |
Table 6 explicitly outlines how different data subsets of ABO (No-BG Renders, BG Renders, Catalog Images) are used for training and testing in each of the three benchmark experiments (3D Reconstruction, Material Estimation, Retrieval). "No-BG Renders" refers to white-background rendered images, while "BG Renders" refers to images from the Material Estimation Dataset with diverse backgrounds.
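A minimal sketch of the retrieval training setup described above is given below: an ImageNet-pretrained ResNet-50 trunk followed by a LayerNorm and a 128-D projection, trained with one of the evaluated losses from pytorch-metric-learning. The input resolution, learning rate, and the `EmbeddingNet` wrapper are placeholders rather than the paper's tuned values, and the class-balanced batch construction is not reproduced here.

```python
# Minimal sketch of the DML baseline: ResNet-50 trunk + LayerNorm + 128-D embedding,
# one metric-learning loss, and a single RMSProp training step.
import torch
import torch.nn as nn
import torchvision
from pytorch_metric_learning import losses

class EmbeddingNet(nn.Module):
    def __init__(self, dim: int = 128):
        super().__init__()
        # Pretrained on ImageNet; the weights API may vary by torchvision version.
        backbone = torchvision.models.resnet50(weights="IMAGENET1K_V1")
        feat_dim = backbone.fc.in_features        # 2048 for ResNet-50
        backbone.fc = nn.Identity()               # keep pooled features
        self.backbone = backbone
        self.norm = nn.LayerNorm(feat_dim)
        self.proj = nn.Linear(feat_dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(self.norm(self.backbone(x)))

model = EmbeddingNet()
loss_fn = losses.ContrastiveLoss()                # one of the evaluated losses
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-5, momentum=0.9)  # lr is a placeholder

# One training step on a toy batch of images and instance-id labels:
images = torch.randn(8, 3, 224, 224)              # input resolution assumed here
labels = torch.tensor([0, 0, 1, 1, 2, 2, 3, 3])   # 4 samples per class, as in the text
embeddings = model(images)
loss = loss_fn(embeddings, labels)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```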
5. Experimental Setup
5.1. Datasets
The primary dataset used in this paper is the Amazon Berkeley Objects (ABO) dataset, which is the core contribution of the work.
- ABO Dataset:
- Source: Derived from Amazon.com product listings.
- Scale and Characteristics:
- Product Listings: 147,702 unique product listings.
- Catalog Images: 398,212 high-resolution real-world product images.
- Metadata: Up to 18 unique attributes per product, including category, color, material, weight, and dimensions.
- 360° View Images: Available for 8,222 products, showing turntable-style views.
- 3D Models: 7,953 artist-created high-quality 3D meshes in glTF 2.0 format. These models have complex geometries, are canonically oriented, scaled to real-world units, and crucially, incorporate high-resolution Physically-Based Rendering (PBR) materials (Spatially-Varying BRDFs).
- Categories: 63 distinct categories for 3D models, mapped to WordNet synsets.
- Pose Annotations: 6-DOF pose annotations for 6,334 catalog images, automatically generated using a differentiable rendering pipeline.
- Domain: Real-world household objects, providing a diverse and realistic domain.
- ABO Subsets for Benchmarks:
- Single-View 3D Reconstruction: A subset of ABO models (4,170 models from 6 ShapeNet-overlapping categories) is used. Images are rendered on a blank background (No-BG Renders) from 30 viewpoints.
- Material Estimation: 7,679 ABO models (excluding transparent objects) are used. Images are rendered with diverse lighting (3 random HDRIs from 108) and backgrounds (BG Renders) from 91 camera positions, generating 2.1 million images with ground truth material maps.
- Multi-View Cross-Domain Object Retrieval: A curated subset of ABO products focusing on rigid objects (29,988 groups, 50,756 instances). Uses both real catalog images and synthetic rendered images (with complex backgrounds and diverse viewpoints) from 3D models.
- Train Split: 49,066 instances (3,993 with 3D models), 187,912 catalog images, 110,928 rendered images.
- Validation/Test Splits: Constructed from instances with 3D models, using rendered images as queries and catalog images as targets to create a cross-domain challenge.
- External Datasets:
- ShapeNet [10]: A large-scale database of synthetic 3D CAD models. Used to pre-train the baseline models for single-view 3D reconstruction, allowing the authors to measure the domain gap when these models are applied to ABO.
- LVIS [26] and COCO [44]: Large datasets for instance segmentation. Used to train the Mask R-CNN [28] and PointRend [34] models, which provide the instance masks necessary for ABO's automated 6-DOF pose annotation pipeline.
- ImageNet [15]: A large-scale image dataset for object recognition. Used to pre-train the ResNet-50 backbone for the Deep Metric Learning (DML) methods in the retrieval benchmark.
These datasets were chosen because ABO itself provides the realistic and diverse data needed to address the research questions. ShapeNet, LVIS, COCO, and ImageNet are standard, widely-used benchmarks for their respective tasks, making them appropriate for evaluating the transferability and generalization capabilities of existing state-of-the-art models to the new ABO dataset.
5.2. Evaluation Metrics
For each benchmark, specific evaluation metrics are used to quantify performance.
5.2.1. Single-View 3D Reconstruction
The evaluation of 3D reconstruction quality is performed using two common metrics: Chamfer Distance and Absolute Normal Consistency.
- Chamfer Distance (CD):
- Conceptual Definition: Chamfer Distance measures the average squared Euclidean distance between each point in one point set to its closest point in another point set. It's a common metric for comparing two shapes, typically represented as point clouds or sampled surfaces from meshes. A lower CD indicates better similarity between the reconstructed shape and the ground truth.
- Mathematical Formula: Given two point sets $S_1$ and $S_2$, the Chamfer Distance is defined as:
$$\mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} \|x - y\|_2^2 \;+\; \frac{1}{|S_2|} \sum_{y \in S_2} \min_{x \in S_1} \|y - x\|_2^2$$
- Symbol Explanation:
- $S_1$: The set of points sampled from the reconstructed 3D model.
- $S_2$: The set of points sampled from the ground truth 3D model.
- $|S_1|$, $|S_2|$: The number of points in $S_1$ and $S_2$, respectively.
- $x$: A point in set $S_1$; $y$: a point in set $S_2$.
- $\min_{y \in S_2} \|x - y\|_2^2$: The squared Euclidean distance from point $x$ to its nearest neighbor in $S_2$ (and symmetrically from $y$ to its nearest neighbor in $S_1$).
- $\|\cdot\|_2$: The Euclidean (L2) norm.
- Absolute Normal Consistency (ANC):
- Conceptual Definition: Absolute Normal Consistency measures the similarity of surface normals between the reconstructed 3D model and the ground truth model. For each point on the reconstructed surface, it finds the closest point on the ground truth surface and calculates the absolute dot product (cosine similarity) between their normals. A higher ANC value (closer to 1) indicates better alignment of surface orientations.
- Mathematical Formula: Given two point sets $S_1$ and $S_2$ with corresponding normals, Absolute Normal Consistency is often approximated as:
$$\mathrm{ANC}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \left| n_x \cdot n_{y^*(x)} \right|$$
where $y^*(x) = \arg\min_{y \in S_2} \|x - y\|_2$ is the closest point in $S_2$ to $x$. This is typically averaged over both directions (from $S_1$ to $S_2$ and from $S_2$ to $S_1$).
- Symbol Explanation:
- $S_1$: The set of points (and their normals) sampled from the reconstructed 3D model.
- $S_2$: The set of points (and their normals) sampled from the ground truth 3D model.
- $x$: A point in set $S_1$.
- $y^*(x)$: The point in $S_2$ that is closest to $x$.
- $n_x$, $n_{y^*(x)}$: The surface normal vectors at $x$ and $y^*(x)$, respectively.
- $\cdot$: The dot product between two vectors.
- $|\cdot|$: The absolute value.
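A minimal sketch of these two reconstruction metrics, computed on point samples with a naive O(N·M) nearest-neighbor search, is shown below; the exact evaluation code used by the cited methods may differ in details such as point counts and scaling.

```python
# Sketch of Chamfer Distance and Absolute Normal Consistency on point samples.
import torch

def chamfer_distance(s1: torch.Tensor, s2: torch.Tensor) -> torch.Tensor:
    """Symmetric squared Chamfer distance between point sets s1 (N,3) and s2 (M,3)."""
    d = torch.cdist(s1, s2) ** 2
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def absolute_normal_consistency(s1, n1, s2, n2) -> torch.Tensor:
    """Average |n_x . n_y*| over nearest neighbors, averaged over both directions."""
    d = torch.cdist(s1, s2)
    nn12 = d.argmin(dim=1)          # for each point in s1, index of closest point in s2
    nn21 = d.argmin(dim=0)          # for each point in s2, index of closest point in s1
    anc_12 = (n1 * n2[nn12]).sum(dim=1).abs().mean()
    anc_21 = (n2 * n1[nn21]).sum(dim=1).abs().mean()
    return 0.5 * (anc_12 + anc_21)

# Example with random points and unit normals:
# s1, s2 = torch.rand(1000, 3), torch.rand(1200, 3)
# n1 = torch.nn.functional.normalize(torch.randn(1000, 3), dim=1)
# n2 = torch.nn.functional.normalize(torch.randn(1200, 3), dim=1)
# print(chamfer_distance(s1, s2), absolute_normal_consistency(s1, n1, s2, n2))
```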
5.2.2. Material Estimation
For material estimation, standard error metrics are used for different material properties, and cosine similarity for normals.
- Root Mean Squared Error (RMSE):
- Conceptual Definition: RMSE is a frequently used measure of the differences between values predicted by a model or an estimator and the values observed. It is the square root of the average of the squared errors, and it gives relatively high weight to large errors. A lower RMSE indicates better prediction accuracy. It is applied to the base color, roughness, metallicness, and rendering losses.
- Mathematical Formula: Given predictions $\hat{y}_i$ and corresponding ground truth values $y_i$:
$$\mathrm{RMSE} = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left( y_i - \hat{y}_i \right)^2}$$
- Symbol Explanation:
- $N$: The total number of observations or pixels.
- $y_i$: The ground truth value for the $i$-th observation (e.g., ground truth base color at a pixel).
- $\hat{y}_i$: The predicted value for the $i$-th observation (e.g., predicted base color at a pixel).
- Cosine Similarity (for Normals):
- Conceptual Definition: Cosine similarity measures the cosine of the angle between two non-zero vectors. When used for surface normals, it quantifies how similar two normal vectors are in direction. A value of 1 indicates identical direction, 0 indicates orthogonality, and -1 indicates opposite directions. For normal consistency, higher values (closer to 1) are better.
- Mathematical Formula: Given a ground truth normal $n$ and a predicted normal $\hat{n}$:
$$\cos(n, \hat{n}) = \frac{n \cdot \hat{n}}{\|n\| \, \|\hat{n}\|}$$
- Symbol Explanation:
- $n$: The ground truth normal vector.
- $\hat{n}$: The predicted normal vector.
- $n \cdot \hat{n}$: The dot product of the vectors.
- $\|n\|$, $\|\hat{n}\|$: The magnitudes (lengths) of the two vectors.
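The material-estimation metrics above can be computed as in the sketch below, assuming the maps are stored as tensors; the helper names are illustrative.

```python
# Sketch of per-map RMSE and mean cosine similarity for normal maps.
import torch

def rmse(pred: torch.Tensor, gt: torch.Tensor) -> torch.Tensor:
    """Root mean squared error over all pixels/channels."""
    return torch.sqrt(torch.mean((gt - pred) ** 2))

def normal_cosine_similarity(pred_n: torch.Tensor, gt_n: torch.Tensor) -> torch.Tensor:
    """Mean cosine similarity between predicted and ground-truth normal maps (H, W, 3)."""
    cos = torch.nn.functional.cosine_similarity(
        pred_n.reshape(-1, 3), gt_n.reshape(-1, 3), dim=1
    )
    return cos.mean()
```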
5.2.3. Multi-View Cross-Domain Object Retrieval
For object retrieval, several metrics commonly used in information retrieval and Deep Metric Learning (DML) are employed.
- Recall@k:
- Conceptual Definition: Recall@k measures the proportion of queries for which at least one relevant item (belonging to the same class as the query) is found among the top-$k$ retrieved results. It indicates how well the system can find any correct item within a limited number of top predictions.
- Mathematical Formula:
$$\mathrm{Recall@}k = \frac{\#\{\text{queries with at least one relevant item in the top-}k\}}{\#\{\text{queries}\}}$$
- Symbol Explanation:
- top-$k$: The first $k$ items in the ranked list of retrieved results.
- relevant item: An item that belongs to the same class or instance group as the query item.
- Mean Average Precision (MAP):
- Conceptual Definition: MAP is a single-number metric that provides a holistic measure of a retrieval system's performance, taking into account both precision (how many retrieved items are relevant) and recall (how many relevant items are retrieved) at different levels of recall. For each query, Average Precision (AP) is calculated by averaging the precision values at each point where a new relevant document is retrieved. MAP is then the mean of these APs over all queries. It gives higher scores to rankings where relevant items appear earlier.
- Mathematical Formula:
$$\mathrm{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \mathrm{AP}(q), \qquad \mathrm{AP}(q) = \frac{1}{R_q}\sum_{k=1}^{n_q} P(k)\, \mathrm{rel}(k)$$
- Symbol Explanation:
- $Q$: The total number of queries.
- $\mathrm{AP}(q)$: The Average Precision for query $q$.
- $n_q$: The total number of retrieved items for query $q$.
- $P(k)$: The precision at cut-off $k$ in the ranked list for query $q$ (i.e., the proportion of relevant items among the top $k$ retrieved).
- $\mathrm{rel}(k)$: A binary indicator function, equal to 1 if the item at rank $k$ is relevant, and 0 otherwise.
- $R_q$: The total number of relevant items for query $q$.
- Mean Average Precision at R (MAP@R):
- Conceptual Definition: Similar to MAP, but Average Precision is calculated only up to the $R$-th retrieved item, where $R$ is the total number of ground truth relevant items for that specific query. This metric focuses on the initial part of the ranked list, up to the point where all possible relevant items could have been retrieved.
- Mathematical Formula:
$$\mathrm{MAP@R} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{R_q} \sum_{k=1}^{R_q} P(k)\, \mathrm{rel}(k)$$
- Symbol Explanation:
- $R_q$: The total number of relevant items for query $q$.
- Other symbols are as defined for MAP.
- R-Precision:
- Conceptual Definition: R-Precision is the precision at the $R$-th position in the ranked list, where $R$ is the total number of relevant documents for the query. It measures the proportion of relevant items among the first $R$ retrieved items; it is essentially precision at a recall level of 1.0 (if all relevant items were indeed retrieved by rank $R$).
- Mathematical Formula:
$$\text{R-Precision} = \frac{\#\{\text{relevant items among the top-}R_q\}}{R_q}$$
- Symbol Explanation:
- $R_q$: The total number of relevant items for query $q$.
- top-$R_q$: The first $R_q$ items in the ranked list for query $q$.
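The retrieval metrics above, computed for a single query from its ranked list of retrieved labels, can be sketched as below; averaging over all queries yields the reported numbers. The helper names are illustrative.

```python
# Sketch of per-query Recall@k, R-Precision, and MAP@R from a ranked label list.
from typing import List

def recall_at_k(ranked_labels: List[int], query_label: int, k: int) -> float:
    """1.0 if any of the top-k retrieved items is relevant, else 0.0."""
    return float(any(l == query_label for l in ranked_labels[:k]))

def r_precision(ranked_labels: List[int], query_label: int, num_relevant: int) -> float:
    """Fraction of relevant items among the first R retrieved (R = num_relevant)."""
    top_r = ranked_labels[:num_relevant]
    return sum(l == query_label for l in top_r) / num_relevant

def map_at_r(ranked_labels: List[int], query_label: int, num_relevant: int) -> float:
    """Average precision over the first R positions only (MAP@R for one query)."""
    hits, ap_sum = 0, 0.0
    for rank, label in enumerate(ranked_labels[:num_relevant], start=1):
        if label == query_label:
            hits += 1
            ap_sum += hits / rank          # precision at this relevant position
    return ap_sum / num_relevant
```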
5.3. Baselines
5.3.1. Single-View 3D Reconstruction
The paper evaluates the following state-of-the-art 3D reconstruction methods, all pre-trained on ShapeNet, to measure their generalization to ABO objects:
- 3D-R2N2 [13]: Predicts 3D shapes as voxel grids.
- GenRe [71]: Predicts 3D shapes using spherical maps. It also takes a silhouette mask as input during both training and testing.
- Occupancy Networks [48]: Predicts 3D shapes using implicit functions, which represent the object as the decision boundary of a neural network.
- Mesh R-CNN [22]: Predicts 3D shapes as meshes, often built upon a 2D object detection framework.
These methods represent diverse approaches to 3D representation (voxels, spherical maps, implicit functions, meshes) and coordinate systems (canonical vs. view-space), making them good candidates to assess the domain gap.
5.3.2. Material Estimation
The paper proposes its own U-Net-based models with a ResNet-34 backbone as baselines for material estimation:
- Single-View Network (SV-net): Estimates SV-BRDFs from a single input image.
- Multi-View Network (MV-net): Estimates SV-BRDFs by leveraging multiple input images. This network has two variants:
- MV-net (with projection): Uses 3D structure-based alignment (pixel projection using depth maps) to combine information from neighboring views.
- MV-net (no projection): A variant of the multi-view network that does not use 3D structure-based alignment, serving as an ablation to show the benefit of explicit 3D information.
5.3.3. Multi-View Cross-Domain Object Retrieval
The DML methods evaluated against the ABO retrieval benchmark are standard approaches in the field:
- Pre-trained: A ResNet-50 [29] backbone directly pre-trained on ImageNet, serving as a basic baseline without explicit DML training.
- Contrastive Loss: A tuple-based loss function that pulls positive pairs (similar items) closer and pushes negative pairs (dissimilar items) farther apart in the embedding space.
- Multi-similarity Loss [63]: A tuple-based loss that dynamically weights positive and negative pairs based on their similarity values, aiming to select the most informative samples.
- NormSoftmax [70]: A classification-based approach that treats each instance/class as a distinct category and uses a softmax function, often with temperature scaling, to learn embeddings.
- NTXent [11]: (Normalized Temperature-scaled Cross-Entropy Loss) A tuple-based loss function widely used in self-supervised contrastive learning, designed to maximize agreement between different augmented views of the same data point.
- ProxyNCA [50]: A proxy-based loss where a small set of "proxies" (learnable embedding vectors) represent each class, and samples are pulled towards their class proxies and pushed away from others.
- TripletMargin Loss: A tuple-based loss that requires an anchor, a positive example (similar to the anchor), and a negative example (dissimilar to the anchor). It enforces that the distance between the anchor and positive is smaller than the distance between the anchor and negative by a certain margin.
These baselines are implemented using PyTorch Metric Learning [2] and evaluated with the Powerful Benchmarker framework [1] for fair comparison.
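For reference, the sketch below shows how the listed losses can be instantiated with pytorch-metric-learning, the library named above. The hyperparameters shown are library defaults or placeholders, not the benchmark's tuned values, and the exact constructor arguments should be checked against the installed library version.

```python
# Sketch: instantiating the evaluated DML losses with pytorch-metric-learning.
from pytorch_metric_learning import losses

num_classes, embedding_size = 49066, 128   # train instances and embedding dim from the text

dml_losses = {
    "Contrastive": losses.ContrastiveLoss(),
    "TripletMargin": losses.TripletMarginLoss(margin=0.05),
    "Multi-similarity": losses.MultiSimilarityLoss(),
    "NTXent": losses.NTXentLoss(temperature=0.07),
    "NormSoftmax": losses.NormalizedSoftmaxLoss(
        num_classes=num_classes, embedding_size=embedding_size
    ),
    "ProxyNCA": losses.ProxyNCALoss(
        num_classes=num_classes, embedding_size=embedding_size
    ),
}
# Each loss is called the same way: loss = dml_losses[name](embeddings, labels)
```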
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Single-View 3D Reconstruction
The paper investigates how well ShapeNet-trained 3D reconstruction models generalize to more realistic ABO objects. The following are the results from Table 3 of the original paper:
|  | Chamfer Distance (↓) |  |  |  |  |  | Absolute Normal Consistency (↑) |  |  |  |  |  |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
|  | bench | chair | couch | cabinet | lamp | table | bench | chair | couch | cabinet | lamp | table |
| 3D R2N2 [13] | 2.46/0.85 | 1.46/0.77 | 1.15/0.59 | 1.88/0.25 | 3.79/2.02 | 2.83/0.66 | 0.51/0.55 | 0.59/0.61 | 0.57/0.62 | 0.53/0.67 | 0.51/0.54 | 0.51/0.65 |
| Occ Nets [48] | 1.72/0.51 | 0.72/0.39 | 0.86/0.30 | 0.80/0.23 | 2.53/1.66 | 1.79/0.41 | 0.66/0.68 | 0.67/0.76 | 0.70/0.77 | 0.71/0.77 | 0.65/0.69 | 0.67/0.78 |
| GenRe [71] | 1.54/2.86 | 0.89/0.79 | 1.08/2.18 | 1.40/2.03 | 3.72/2.47 | 2.26/2.37 | 0.63/0.56 | 0.69/0.67 | 0.66/0.60 | 0.62/0.59 | 0.59/0.57 | 0.61/0.59 |
| Mesh R-CNN [22] | 1.05/0.09 | 0.78/0.13 | 0.45/0.10 | 0.80/0.11 | 1.97/0.24 | 1.15/0.12 | 0.62/0.65 | 0.62/0.70 | 0.62/0.72 | 0.65/0.74 | 0.57/0.66 | 0.62/0.74 |
Table 3 presents the Chamfer Distance (lower is better, indicated by ) and Absolute Normal Consistency (higher is better, indicated by ) for four methods across six object categories. Each cell contains two values: ShapeNet performance / ABO performance.
Key observations:
- Significant Domain Gap: For nearly all categories and metrics, there is a substantial performance drop from ShapeNet to ABO. For example, Mesh R-CNN's Chamfer Distance for 'bench' increases from 0.09 on ShapeNet to 1.05 on ABO, indicating a much worse reconstruction. This strongly validates the hypothesis that objects in ABO, despite being from the same categories, are out-of-distribution and more challenging for models trained on synthetic ShapeNet data.
- Mesh R-CNN Performance: Mesh R-CNN [22] generally outperforms other methods in terms of Chamfer Distance on both ShapeNet and ABO, suggesting its mesh-based representation and training approach are robust.
- Occupancy Networks Performance: Occupancy Networks [48] performs best in terms of Absolute Normal Consistency, especially on ShapeNet.
- Challenging Categories: The lamp category exhibits a particularly large performance drop, especially in Chamfer Distance. Qualitative results (Figure 5) suggest this is due to the difficulty of reconstructing thin structures, which are common in lamps and less frequently or realistically represented in ShapeNet. The following image (Figure 5 from the original paper) shows qualitative results for single-view 3D reconstruction:
(Figure 5: comparison of 3D reconstructions produced by R2N2, Occupancy Networks, GenRe, and Mesh R-CNN from an input image, alongside the ground truth (GT) models.)
Figure 5 illustrates the qualitative results of different reconstruction methods, highlighting the performance drop from ShapeNet-trained networks to ABO objects. The lamp example clearly shows difficulties in reconstructing thin structures, where both GenRe and Mesh R-CNN struggle.
The following are the results from Table 7 of the original paper:
| Chamfer (↓) | Abs. Normal Consistency (↑) | |
|---|---|---|
| 3D R2N2 [13] | 1.97 | 0.55 |
| OccNets [48] | 1.19 | 0.70 |
| GenRe [71] | 1.61 | 0.66 |
| Mesh R-CNN [22] | 0.82 | 0.62 |
Table 7 provides an aggregated view of the 3D reconstruction performance on the ABO test split, averaged across all categories. Mesh R-CNN achieves the lowest Chamfer Distance (0.82), confirming its superior performance in shape reconstruction. Occupancy Networks achieves the highest Absolute Normal Consistency (0.70), indicating better normal alignment. These aggregate results reinforce the findings from the per-category breakdown in Table 3.
The following image (Figure 12 from the original paper) shows additional qualitative examples of 3D reconstruction:
(Figure 12: input images and the corresponding 3D reconstructions produced by GenRe and Mesh R-CNN for three different objects.)
Figure 12 provides additional qualitative examples for Mesh R-CNN and GenRe on various ABO objects. It reinforces the observation that both methods struggle with thin structures, but in different ways: GenRe tends to omit them, while Mesh R-CNN produces a more "hulled" or filled-in reconstruction.
6.1.2. Material Prediction
The paper presents a baseline approach for material estimation and evaluates the impact of multi-view inputs and 3D structure. The following are the results from Table 4 of the original paper:
| Metric | SV-net | MV-net (no proj.) | MV-net |
|---|---|---|---|
| Base Color (↓) | 0.129 | 0.132 | 0.127 |
| Roughness (↓) | 0.163 | 0.155 | 0.129 |
| Metallicness (↓) | 0.170 | 0.167 | 0.162 |
| Normals (↑) | 0.970 | 0.949 | 0.976 |
| Render (↓) | 0.096 | 0.090 | 0.086 |
Table 4 shows the material estimation results comparing the single-view network (SV-net), multi-view network without projection (MV-net no proj.), and multi-view network with projection (MV-net). RMSE (lower is better) is used for Base Color, Roughness, Metallicness, and Render loss, while Cosine Similarity (higher is better) is used for Normals.
Key observations:
- Multi-View Benefit: The MV-net (with projection) generally outperforms the SV-net across all metrics. This is particularly noticeable for Roughness (0.129 vs 0.163) and Metallicness (0.162 vs 0.170), which are view-dependent specular components. This confirms that incorporating multiple views helps in disentangling and accurately predicting material properties.
Value of 3D Structure: Comparing
MV-net(no proj.) toMV-net(with proj.), the latter consistently achieves better performance (e.g.,Roughness0.129 vs 0.155,Normals0.976 vs 0.949). This ablation demonstrates that explicitly using 3D structure-based alignment (pixel projection) to integrate information from neighboring views significantly improves material estimation accuracy. -
Reasonable Performance: Even
SV-netachieves a highNormalsscore (0.970), indicating good surface orientation prediction. TheRenderloss also decreases with multi-view, suggesting more photorealistic material predictions.The following image (Figure 6 from the original paper) shows qualitative material estimation results:
该图像是一个示意图,展示了SV-net和MV-net在物体重建任务中对基础色、粗糙度、金属感和法线的估计。图中展示了来自不同角度的输入图像以及各自的GT(真实值)对比,突显了材料属性的估计能力。
Figure 6 provides qualitative results for SV-net and MV-net, showing input images and their estimated base color, roughness, metallicness, and normal maps compared to ground truth. These visual examples support the quantitative findings, demonstrating the networks' ability to estimate these properties.
The following image (Figure 7 from the original paper) shows qualitative multi-view material estimation results on real catalog images:
该图像是图表,展示了多视角材料估计的结果。左侧为输入的真实目录图像,右侧依次为基础颜色、粗糙度、金属感、法线和重新照明的可视化效果,突出不同材料属性的估计情况。
Figure 7 showcases the transferability of the trained MV-net to real catalog images. Despite the domain gap (lighting, background differences between synthetic training data and real images), the network makes reasonable predictions for base color, roughness, metallicness, and normals, which are then used for relighting the object. The example in the last row, where the network fails to accurately infer the true base color, highlights challenges posed by self-shadowing in real images.
6.1.3. Multi-View Cross-Domain Object Retrieval
This benchmark evaluates Deep Metric Learning (DML) methods on the challenging task of retrieving objects across different image domains (rendered queries vs. catalog targets). The following are the results from Table 5 of the original paper:
| Recall@k (%) | Rendered, k=1 | Rendered, k=2 | Rendered, k=4 | Rendered, k=8 | Catalog, k=1 |
|---|---|---|---|---|---|
| Pre-trained | 5.0 | 8.1 | 11.4 | 15.3 | 18.0 |
| Contrastive | 28.6 | 38.3 | 48.9 | 59.1 | 39.7 |
| Multi-similarity | 23.1 | 32.2 | 41.9 | 52.1 | 38.0 |
| NormSoftmax | 30.0 | 40.3 | 50.2 | 60.0 | 35.5 |
| NTXent | 23.9 | 33.0 | 42.6 | 52.0 | 37.5 |
| ProxyNCA | 29.4 | 39.5 | 50.0 | 60.1 | 35.6 |
| TripletMargin | 22.1 | 31.1 | 41.3 | 51.9 | 36.9 |
Table 5 presents the Recall@k percentages for various DML methods when using rendered images as queries (first four columns) and Recall@1 when using catalog images as queries (last column).
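As a reference for how these numbers are computed, below is a minimal sketch of Recall@k under cosine similarity between L2-normalized embeddings; the exact gallery construction and tie handling used in the benchmark are assumptions and may differ.

```python
import numpy as np


def recall_at_k(query_emb, query_labels, gallery_emb, gallery_labels, ks=(1, 2, 4, 8)):
    """Fraction of queries whose top-k nearest gallery items contain at least one
    item with the same (product) label, using cosine similarity for ranking."""
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    sims = q @ g.T                     # (num_queries, num_gallery)
    order = np.argsort(-sims, axis=1)  # best match first
    hits = query_labels[:, None] == gallery_labels[order]
    return {k: float(hits[:, :k].any(axis=1).mean()) for k in ks}
```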
Key observations:
- Challenging Benchmark: The `Pre-trained` ResNet-50 (on ImageNet) performs very poorly, with `Recall@1` of 5.0% for rendered queries, confirming the challenging nature of this cross-domain multi-view retrieval task. This is significantly lower than typical performance on existing DML benchmarks.
- DML Improves Performance: All DML methods show substantial improvements over the pre-trained baseline, with `Recall@1` reaching roughly 22-30% for rendered queries. This validates the need for metric learning in this challenging scenario.
- Performance Gaps among DML Methods: `NormSoftmax`, `ProxyNCA`, and `Contrastive` generally achieve higher `Recall@1` than `Multi-similarity`, `NTXent`, and `TripletMargin` when using rendered queries. This gap is not always apparent in other datasets, suggesting ABO reveals nuances in DML method performance.
- Easier Catalog-to-Catalog Retrieval: When using catalog images as queries (last column), `Recall@1` scores are significantly higher for all DML methods (ranging from 35.5% to 39.7%). This confirms that the cross-domain aspect and diverse viewpoints of rendered images make the task much harder.
- Saturation of Existing Benchmarks: The overall low performance on ABO (e.g., ~30% `Recall@1` for the best DML method on rendered queries), compared to the near-saturation reported on existing benchmarks, underscores the value of ABO in pushing research forward.

The following image (Figure 8 from the original paper) shows Recall@1 as a function of query azimuth and elevation:
This image is a chart showing how Recall@1 varies with the query's product viewpoint (query azimuth and elevation). The curves in the top half show that retrieval performance drops rapidly beyond certain azimuth and elevation angles. The bar charts in the bottom half compare the recall of different methods across elevation ranges, reflecting their performance differences.
Figure 8 illustrates how Recall@1 degrades as query azimuth and elevation angles deviate from typical product viewpoints. For all methods, retrieval performance drops significantly once queries move beyond the azimuth and elevation ranges common in catalog imagery. This highlights the sensitivity of current DML algorithms to viewpoint changes and points to a crucial area for future research.
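The viewpoint analysis in Figure 8 amounts to binning per-query retrieval outcomes by camera angle. A minimal sketch of such a breakdown is shown below, assuming a boolean hit indicator per query and its azimuth in degrees; the bin width and exact binning scheme are assumptions, not the paper's exact plotting code.

```python
import numpy as np


def recall_by_azimuth(hit_at_1, azimuth_deg, bin_width=30):
    """Average Recall@1 within azimuth bins, as in a Figure-8-style viewpoint analysis."""
    bins = np.arange(0, 360 + bin_width, bin_width)
    idx = np.digitize(np.asarray(azimuth_deg) % 360, bins) - 1
    hit_at_1 = np.asarray(hit_at_1, dtype=float)
    return {f"{bins[b]}-{bins[b + 1]} deg": float(hit_at_1[idx == b].mean())
            for b in range(len(bins) - 1) if np.any(idx == b)}
```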
6.2. Data Presentation (Tables)
The following are the results from Table 8 of the original paper:
| Loss | Recall@1 (%) | Recall@2 (%) | Recall@4 (%) | Recall@8 (%) | MAP (%) | MAP@R (%) | R-Precision (%) |
|---|---|---|---|---|---|---|---|
| Pre-trained | 4.97 | 8.10 | 11.41 | 15.30 | 7.69 | 2.27 | 3.44 |
| Contrastive | 28.56 | 38.34 | 48.85 | 59.10 | 31.19 | 14.16 | 19.19 |
| Multi-similarity | 23.12 | 32.24 | 41.86 | 52.13 | 26.77 | 11.72 | 16.29 |
| NormSoftmax | 30.02 | 40.32 | 50.19 | 59.96 | 32.61 | 14.03 | 18.76 |
| NTXent | 23.86 | 33.04 | 42.59 | 51.98 | 27.00 | 12.05 | 16.51 |
| ProxyNCA | 29.36 | 39.47 | 50.05 | 60.11 | 32.38 | 14.05 | 19.00 |
| TripletMargin | 22.15 | 31.10 | 41.32 | 51.90 | 25.80 | 10.87 | 15.41 |
Table 8 presents a more complete set of metrics for the multi-view cross-domain object retrieval benchmark, where rendered images are used as queries against a gallery of catalog images from both train and test classes. It includes Recall@k for k = 1, 2, 4, 8, MAP, MAP@R, and R-Precision. The results reaffirm the challenging nature of the task and the relative performance of DML methods, with NormSoftmax, Contrastive, and ProxyNCA generally performing best. All metrics are significantly lower than typically observed on less challenging benchmarks, confirming the value of ABO for advanced DML research.
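MAP@R and R-Precision evaluate the quality of the entire top-R ranking rather than a single hit. A minimal per-query sketch is given below, assuming a ranked array of gallery labels and defining R as the number of gallery items sharing the query's label; averaging over all queries and the benchmark's exact relevance definition are left out.

```python
import numpy as np


def r_precision(ranked_labels, query_label):
    """Precision within the top-R results, where R is the number of relevant gallery items."""
    relevant = np.asarray(ranked_labels) == query_label
    R = int(relevant.sum())
    return float(relevant[:R].mean()) if R > 0 else 0.0


def map_at_r(ranked_labels, query_label):
    """MAP@R: precision is accumulated only at ranks that are relevant, then divided by R."""
    relevant = np.asarray(ranked_labels) == query_label
    R = int(relevant.sum())
    if R == 0:
        return 0.0
    top = relevant[:R]
    precisions = np.cumsum(top) / (np.arange(R) + 1)
    return float(np.sum(precisions * top) / R)
```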
The following are the results from Table 9 of the original paper:
| Loss | Recall@1 (%) | Recall@2 (%) | Recall@4 (%) | Recall@8 (%) | MAP (%) | MAP@R (%) | R-Precision (%) |
|---|---|---|---|---|---|---|---|
| Pre-trained | 17.99 | 23.93 | 31.72 | 38.65 | 22.57 | 6.99 | 9.55 |
| Contrastive | 39.67 | 52.21 | 64.41 | 71.64 | 42.96 | 22.52 | 28.07 |
| Multi-similarity | 38.05 | 50.06 | 61.79 | 68.17 | 40.87 | 21.06 | 26.32 |
| NormSoftmax | 35.50 | 46.70 | 57.38 | 64.78 | 38.07 | 18.63 | 23.42 |
| NTXent | 37.51 | 49.34 | 61.37 | 69.23 | 40.12 | 20.03 | 25.32 |
| ProxyNCA | 35.64 | 46.53 | 57.36 | 65.06 | 38.50 | 18.81 | 23.65 |
| TripletMargin | 36.87 | 48.34 | 60.98 | 69.44 | 40.03 | 19.94 | 25.46 |
Table 9 shows the retrieval results when catalog images are used as queries against the same gallery of catalog images. As expected, all metrics are significantly higher compared to Table 8 (rendered queries), demonstrating that the "cross-domain" aspect (rendered vs. catalog images) is the primary source of difficulty in the benchmark. Contrastive loss performs best in this setting. The comparison between Table 8 and Table 9 clearly highlights the impact of viewpoint and domain diversity on retrieval performance.
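For reference, a minimal PyTorch-style sketch of a NormSoftmax-style loss (L2-normalized embeddings classified against L2-normalized class proxies) is shown below; the temperature, embedding dimension, and proxy initialization are assumptions and not the benchmark's exact training configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class NormSoftmaxLoss(nn.Module):
    """Classification-style DML loss: cosine similarity between normalized
    embeddings and normalized class proxies, scaled by a temperature."""

    def __init__(self, num_classes, embed_dim, temperature=0.05):
        super().__init__()
        self.proxies = nn.Parameter(torch.randn(num_classes, embed_dim) * 0.01)
        self.temperature = temperature

    def forward(self, embeddings, labels):
        z = F.normalize(embeddings, dim=1)
        w = F.normalize(self.proxies, dim=1)
        logits = z @ w.t() / self.temperature
        return F.cross_entropy(logits, labels)


# Toy usage: a batch of 8 embeddings (dim 128) over 10 hypothetical product classes.
loss_fn = NormSoftmaxLoss(num_classes=10, embed_dim=128)
loss = loss_fn(torch.randn(8, 128), torch.randint(0, 10, (8,)))
```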
6.3. Ablation Studies / Parameter Analysis
- Impact of 3D Structure in Multi-View Material Estimation: The ablation study presented in Table 4 (`MV-net` vs. `MV-net (no proj.)`) explicitly demonstrates the benefit of using 3D structure-based alignment (pixel projection) for multi-view material estimation. `MV-net` (with projection) consistently outperforms `MV-net (no proj.)` across all metrics, indicating that leveraging explicit 3D geometry is crucial for accurately disentangling and estimating SV-BRDF parameters, especially view-dependent ones like roughness and metallicness. Even without explicit projection, `MV-net (no proj.)` still outperforms the `SV-net` for roughness and metallicness, suggesting that simply having multiple views, even unaligned, provides some benefit.
- Viewpoint Sensitivity in Retrieval: While not a typical ablation of model components, the analysis of `Recall@1` as a function of azimuth and elevation angles (Figure 8) serves as a critical parameter analysis. It reveals that the performance of DML algorithms degrades sharply for extreme viewpoints, i.e., those uncommon in standard product catalog images. This analysis points towards the need for DML losses or architectures that explicitly model, or are invariant to, geometric transformations.

The following image (Figure 13 from the original paper) shows qualitative retrieval results:
This image is an illustration showing various home products, including nightstands, air conditioners, and carts, with red and green boxes marking the similarity of retrieved items. The illustration is intended to show the correspondence between real-world images and virtual 3D models.
Figure 13 provides qualitative retrieval results for NormSoftmax, ProxyNCA, and Contrastive methods across different query elevations (low, mid, high). It showcases both success and failure cases. For instance, a low-elevation table query might successfully retrieve similar tables, while a high-elevation cart query might be more challenging, reflecting the quantitative findings from Figure 8.
7. Conclusion & Reflections
7.1. Conclusion Summary
This work successfully introduces Amazon Berkeley Objects (ABO), a groundbreaking large-scale dataset that addresses critical limitations of existing 3D datasets. ABO is unique in providing artist-created 3D models with complex geometries and Physically-Based Rendering (PBR) materials, corresponding to real-world household objects, alongside rich metadata and multi-view catalog images. The paper demonstrates that ABO serves as a challenging testbed for state-of-the-art 3D computer vision.
Key findings from the derived benchmarks include:
- Single-View 3D Reconstruction: Models trained on synthetic ShapeNet data exhibit a significant domain gap when applied to ABO objects, underscoring the need for more realistic training data.
- Material Estimation: A multi-view network, especially when incorporating 3D structure-based alignment, substantially improves SV-BRDF material estimation accuracy for complex geometries, and it transfers reasonably well to real catalog images.
- Cross-Domain Multi-View Object Retrieval: ABO provides a highly challenging benchmark for Deep Metric Learning (DML), revealing performance gaps among state-of-the-art methods that are not apparent in older datasets. Retrieval performance is particularly sensitive to extreme viewpoints, highlighting a major area for improvement.
Overall, ABO bridges the gap between real and virtual 3D worlds, enabling the development and rigorous evaluation of more robust and generalizable 3D object understanding algorithms.
7.2. Limitations & Future Work
The authors point out several avenues for future research and current limitations:
- Text Annotations: The dataset's large volume of text annotations (product descriptions and keywords) is not explored in this work; these annotations could enable future language-and-vision tasks such as predicting styles, patterns, captions, or keywords from product images.
- Non-Rigid Products: ABO contains non-rigid products (apparel, home linens) whose 3D models were not explicitly used in the presented benchmarks. These could enable research on deformable object modeling.
- Robotics Research: The 3D objects in ABO correspond to items found in a home and include associated object `weight` and `dimensions`. This data can directly benefit robotics research by supporting simulations of manipulation and navigation tasks.
- Geometric Information in DML: For the multi-view object retrieval task, the authors note that current DML losses do not explicitly model the geometric information available in the training data (e.g., azimuth and elevation angles). This is a promising direction to improve robustness to viewpoint changes.
7.3. Personal Insights & Critique
This paper presents a highly valuable contribution to the 3D computer vision community. The ABO dataset is a significant step forward because it tackles several persistent challenges simultaneously: scale, realism, physically-based materials, and multi-modal data (3D, images, metadata).
Inspirations and Applications:
- E-commerce and AR/VR: The direct relevance to Amazon's product catalog makes ABO immediately applicable to e-commerce (e.g., improved product search, virtual try-on, automated content generation) and immersive experiences (e.g., realistic object placement in AR/VR).
- Robotics: The detailed 3D models with real-world dimensions and weights are invaluable for training robotic agents to interact with household objects, improving grasping, manipulation, and navigation in cluttered environments.
- Computer Graphics: The PBR materials and high-quality 3D models are a rich resource for researchers in computer graphics for tasks like scene synthesis, relighting, and material editing.
- Domain Adaptation and Generalization: The benchmarks clearly expose the limitations of current models trained on synthetic data, strongly encouraging research into better domain adaptation techniques and methods that generalize to real-world complexities.
Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Artist-Created Models: While superior to procedurally generated or low-quality CAD models, artist-created models still involve human interpretation and might not perfectly capture every nuance of real-world object imperfections or manufacturing variations. Subtle biases could be introduced by artists' styles or assumptions.
- Diversity of Backgrounds/Lighting for Material Estimation: Although the material estimation dataset uses 3 random HDRIs from a pool of 108, the diversity of these environments could be further expanded. Real-world lighting can be far more complex and challenging than what is captured even by HDRIs.
- Generalization Beyond Household Objects: While a strength, the dataset's focus on household objects means generalization to industrial, outdoor, or highly specialized object categories might still be an open problem.
- Lack of Scene Context: The dataset focuses on individual objects. While catalog images provide some "in-context" views, the 3D models themselves are of isolated objects. Future work could involve reconstructing these objects within full 3D scenes to provide even richer contextual understanding.
- Computational Cost: Training and evaluating on such a large-scale, high-fidelity dataset can be computationally intensive, potentially posing a barrier for researchers with limited resources.
Overall, ABO is a meticulously designed dataset that addresses a critical need in 3D computer vision. Its multi-faceted nature and carefully constructed benchmarks provide a robust platform for advancing research in 3D object understanding, material properties, and robust retrieval.