Paper status: completed

ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

Published:09/10/2025
Original LinkPDF
Price: 0.100000
Price: 0.100000
7 readers
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ScoreHOI is a novel optimizer that leverages score-guided diffusion for physically plausible human-object interaction reconstruction from a single image. By incorporating diffusion priors and physical constraints, it improves reconstruction quality and contact plausibility, showi

Abstract

Joint reconstruction of human-object interaction marks a significant milestone in comprehending the intricate interrelations between humans and their surrounding environment. Nevertheless, previous optimization methods often struggle to achieve physically plausible reconstruction results due to the lack of prior knowledge about human-object interactions. In this paper, we introduce ScoreHOI, an effective diffusion-based optimizer that introduces diffusion priors for the precise recovery of human-object interactions. By harnessing the controllability within score-guided sampling, the diffusion model can reconstruct a conditional distribution of human and object pose given the image observation and object feature. During inference, the ScoreHOI effectively improves the reconstruction results by guiding the denoising process with specific physical constraints. Furthermore, we propose a contact-driven iterative refinement approach to enhance the contact plausibility and improve the reconstruction accuracy. Extensive evaluations on standard benchmarks demonstrate ScoreHOI's superior performance over state-of-the-art methods, highlighting its ability to achieve a precise and robust improvement in joint human-object interaction reconstruction.

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

ScoreHOI: Physically Plausible Reconstruction of Human-Object Interaction via Score-Guided Diffusion

The title clearly states the paper's central topic: reconstructing 3D human-object interactions (HOI) to be physically plausible. It highlights the core techniques used: a score-guided diffusion model named ScoreHOI.

1.2. Authors

The authors are Ao Li, Jinpeng Liu, Yixuan Zhu, and Yansong Tang. They are affiliated with Tsinghua Shenzhen International Graduate School and Tsinghua University, which are highly reputable institutions in computer science and engineering research.

1.3. Journal/Conference

The paper provides a publication date of September 9, 2025, and an arXiv link. This indicates it is a preprint submitted to a future conference. Given the topic and authors, it is likely intended for a top-tier computer vision conference such as CVPR, ICCV, or ECCV. Publishing in these venues signifies high-quality, impactful research.

1.4. Publication Year

The provided publication date is 2025-09-09.

1.5. Abstract

The abstract summarizes the paper's work on the joint reconstruction of human-object interactions. It identifies a key problem with previous methods: they often fail to produce physically plausible results due to a lack of prior knowledge. The proposed solution, ScoreHOI, is a diffusion-based optimizer that leverages diffusion priors for more accurate reconstruction. The method uses score-guided sampling to enforce physical constraints during the denoising process. A novel contact-driven iterative refinement approach is also introduced to improve contact plausibility. The abstract concludes by stating that ScoreHOI achieves state-of-the-art performance on standard benchmarks, demonstrating precise and robust improvements.

  • Original Source Link: https://arxiv.org/abs/2509.07920

  • PDF Link: https://arxiv.org/pdf/2509.07920v1.pdf

    The paper is available as a preprint on arXiv, which means it has not yet completed the peer-review process for an official conference or journal publication at the time of this analysis.

2. Executive Summary

2.1. Background & Motivation

The core problem this paper addresses is the 3D reconstruction of human-object interactions (HOI) from a single RGB image. This task is crucial for understanding how humans interact with their environment, with applications in robotics, virtual/augmented reality (VR/AR), and gaming.

However, reconstructing both a human and an object in 3D from a 2D image is an ill-posed problem due to depth ambiguity and occlusions. Previous methods fall into two main categories, each with significant limitations:

  1. Optimization-based methods: These approaches iteratively refine an initial estimate by minimizing a loss function that includes physical constraints (e.g., no penetration, contact points should touch). While they can improve physical plausibility, they are often computationally expensive (slow) and can produce results that deviate significantly from the input image, as they may over-optimize for physical correctness at the expense of visual accuracy.

  2. Regression-based methods: These methods use deep neural networks to directly predict the 3D human and object parameters in a single forward pass. They are fast but often lack robustness, especially in challenging scenarios with heavy occlusion or unusual poses. They struggle to incorporate complex interaction priors and often produce physically implausible results (e.g., floating objects, body parts penetrating objects).

    The key gap identified by the authors is the lack of a strong, learned prior for plausible human-object interactions. Existing methods struggle to balance image-based evidence with physical rules. The paper's innovative idea is to leverage the power of diffusion models as a new kind of optimizer. Diffusion models learn a rich prior distribution of data (in this case, plausible HOI configurations) and can be guided during their sampling (generation) process, offering a way to merge learned priors with explicit constraints.

2.2. Main Contributions / Findings

The paper's primary contributions are:

  1. ScoreHOI Framework: The proposal of a novel diffusion-based optimization framework, ScoreHOI, for refining coarse HOI reconstructions. It uses a pre-trained diffusion model as a powerful optimizer, endowed with rich prior knowledge of plausible human and object poses.

  2. Score-Guided Physical Refinement: The method introduces physical constraints (human-object contact, object-floor contact, and penetration avoidance) directly into the diffusion model's denoising process. This is achieved through score-guided sampling, which modifies the generation process at each step to steer the output towards physically plausible results without sacrificing the learned prior.

  3. Contact-Driven Iterative Refinement: A novel iterative strategy where the entire diffusion-based optimization is run multiple times. In each iteration, the contact regions between the human and object are re-estimated based on the refined poses from the previous step. This feedback loop progressively improves the accuracy of contact prediction and, consequently, the overall reconstruction quality.

  4. State-of-the-Art Performance: The paper demonstrates through extensive experiments that ScoreHOI significantly outperforms previous methods on standard benchmarks (BEHAVE and InterCap). Notably, it achieves a 9% improvement in contact F-Score on the BEHAVE dataset, highlighting its superior ability to model physical interactions accurately, while also being significantly more efficient than traditional optimization-based techniques.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Parametric Human Body Models (SMPL, SMPL-H)

To represent a 3D human body digitally, researchers often use parametric models. These models can generate a 3D human mesh (a collection of vertices and faces) from a small set of parameters.

  • SMPL (Skinned Multi-Person Linear Model): This is the most widely used parametric body model. It represents the human body using two main sets of parameters:

    • Shape parameters (βR10\beta \in \mathbb{R}^{10}): A vector of 10 numbers that control the identity-specific body shape, such as height, weight, and proportions. These are learned from a large scan dataset and represent the principal components of body shape variation.

    • Pose parameters (θR24×3\theta \in \mathbb{R}^{24 \times 3}): These parameters define the body's articulation. It consists of the global orientation of the body and the relative rotations of 23 joints (e.g., elbows, knees, neck) in a kinematic tree.

      A function M(θ,β)\mathcal{M}(\theta, \beta) takes these parameters and outputs a 3D mesh with a fixed topology (e.g., 6,890 vertices).

  • SMPL-H: An extension of SMPL that includes fully articulated hands. It combines the SMPL body model with the MANO hand model, allowing for the representation of complex hand gestures and interactions with objects. The paper uses this model to capture detailed hand-object contact.

3.1.2. Score-Based Diffusion Models (DDPM, DDIM)

Diffusion models are a class of powerful generative models that have achieved state-of-the-art results in tasks like image generation. They work by reversing a diffusion process.

  • Forward Diffusion Process: This process gradually adds a small amount of Gaussian noise to the data (e.g., an image or, in this paper, a vector of pose parameters) over a series of timesteps TT. Starting with clean data x0\mathbf{x}_0, we get a sequence of increasingly noisy samples x1,x2,,xT\mathbf{x}_1, \mathbf{x}_2, \dots, \mathbf{x}_T, where xT\mathbf{x}_T is pure Gaussian noise. This process is defined by a fixed schedule.

  • Reverse Denoising Process: The core of the model is a neural network (often a U-Net or Transformer) trained to reverse this process. At any timestep tt, the network takes the noisy data xt\mathbf{x}_t and predicts the noise ϵ\epsilon that was added to create it. By subtracting this predicted noise, it can take a step towards a less noisy sample xt1\mathbf{x}_{t-1}. Repeating this from t=Tt=T down to t=0t=0 generates a clean sample from random noise.

  • Score Function (xtlogp(xt)\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t)): The noise predicted by the model, ϵϕ(xt,t)\epsilon_\phi(\mathbf{x}_t, t), is directly related to the score, which is the gradient of the log-probability density of the noisy data. This connection allows us to "guide" the denoising process. If we want to generate a sample that also satisfies some condition yy (e.g., matches a text description or a physical constraint), we can modify the score using Bayes' theorem: xtlogp(xty)=xtlogp(xt)+xtlogp(yxt)\nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t | y) = \nabla_{\mathbf{x}_t} \log p(\mathbf{x}_t) + \nabla_{\mathbf{x}_t} \log p(y | \mathbf{x}_t). The first term is the original score from the diffusion model. The second term is a guidance term that pushes the sample towards satisfying condition yy. This is the key mechanism ScoreHOI uses to enforce physical plausibility.

  • DDIM (Denoising Diffusion Implicit Models): A faster sampling method for diffusion models. While traditional DDPMs require hundreds or thousands of steps to generate a sample, DDIM uses a non-Markovian process that allows for much larger jumps, reducing the number of sampling steps to as few as 10-50 while maintaining high quality. ScoreHOI uses DDIM for both noising (inversion) and denoising (sampling) to ensure efficiency.

3.2. Previous Works

The paper categorizes prior work on HOI reconstruction into two main types:

  1. Optimization-based Methods:

    • Examples: CHORE [59], VisTracker [61].
    • These methods typically start with an initial guess for the human and object poses and then use a gradient-based optimizer like Adam to iteratively minimize a complex objective function. This function usually includes:
      • A data term: How well the rendered 2D projection of the 3D meshes matches the input image (e.g., silhouettes, keypoints).
      • Physical constraint terms: Penalties for body parts penetrating the object, lack of contact where it should exist, or floating objects.
    • Limitation: These methods are very slow (often taking minutes per frame) and can get stuck in local minima, sometimes producing results that are physically correct but visually inconsistent with the image.
  2. Regression-based Methods:

    • Examples: PHOSA [68], CONTHO [37].
    • These methods use a deep neural network (e.g., a CNN-Transformer architecture) to directly predict the SMPL and object parameters from the input image in a single forward pass.
    • Some, like CONTHO, incorporate refinement modules with cross-attention to better model the human-object relationship.
    • Limitation: While fast, they are less robust to challenging scenes (occlusion, depth ambiguity) and often produce physically impossible results because the complex physics are difficult to encode in a single-shot regression model.

3.3. Technological Evolution

The field has evolved from focusing solely on single human mesh recovery (HMR) to the more complex problem of HOI reconstruction. Initially, methods were purely regression-based or optimization-based. More recent works like CONTHO have tried to combine the speed of regression with more sophisticated attention mechanisms for refinement.

This paper marks the next step in this evolution by introducing generative priors from diffusion models. Instead of relying on hand-crafted physical loss terms within a classic optimizer or a simple feed-forward network, ScoreHOI uses the diffusion model itself as a powerful, learned optimizer. This allows it to leverage a rich, data-driven prior of plausible interactions while still being steerable by explicit physical constraints, combining the strengths of both previous approaches.

3.4. Differentiation Analysis

ScoreHOI's core innovation lies in its method of optimization.

  • Versus Optimization-based Methods: ScoreHOI is significantly faster because it uses a fixed number of DDIM sampling steps, which is much more efficient than running an Adam optimizer until convergence. More importantly, its "optimizer" (the diffusion model) has a strong prior learned from data, which prevents it from diverging to solutions that are physically plausible but visually wrong.

  • Versus Regression-based Methods: ScoreHOI is more robust and produces more physically plausible results. Instead of a single-step refinement, it performs an iterative denoising process where physical guidance is applied at each step. This multi-step, guided generation is more powerful than a single forward pass.

  • Key Differentiator: The central idea is framing the refinement task as a conditional generation problem solved by a guided diffusion model. The contact-driven iterative refinement loop, which updates contact masks between optimization rounds, is also a unique contribution that creates a powerful feedback mechanism for improving accuracy.

4. Methodology

The core of ScoreHOI is an effective optimizer that refines an initial coarse estimation of human-object interaction parameters using a guided diffusion model. The overall inference pipeline is illustrated in Figure 2 of the paper.

Figure 2. The inference procedure of ScoreHOI. (a) Given the input image \(I\) the human and object segmented silhouette \(S _ { \\mathrm { h } } , S _ { \\mathrm { o } }\) and the object template \(P _ { \\mathrm { ~ o ~ } }\) , we initially extract the image feature \(\\mathcal { F }\) and estimate the SMPL and object parameters \(\\theta\) \(\\beta\) , \(R _ { \\mathrm { o } }\) and \(t _ { \\mathrm { o } }\) . (b) Employing a contact-driven iterative refinement strategy, we refine these parameters \(_ { \\textbf { \\em x } }\) through the execution of a DDIM inversion and guided sampling lo Due i s, hysl ctant uc pe \(L _ { \\mathrm { p t } }\) and contact \(L _ { \\mathrm { h o } } , L _ { \\mathrm { o f } }\) are actively supervised. Following each optimization iteration, the contact masks \(\\{ \\mathbf { M } _ { i } \\} _ { i \\in \\{ \\mathrm { h , o , f } \\} }\) are updated to enhance the precision of the guidance. 该图像是示意图,展示了ScoreHOI的推理过程。图中分为(a) 和(b) 两部分,(a) 为“考虑可用性的回归器”,展示了如何从视觉输入 II 和点云 PoP_o 中提取特征并通过回归器进行SMPL拟合,估计人类和物体的参数;(b) 为“接触驱动的迭代精炼”,阐述了如何进行DDIM反演和指导采样以优化模型参数,并通过接触预测器更新接触掩码以提高指导的精确度。

4.1. Principles

The main principle is to treat the refinement of human-object pose parameters as a conditional sampling problem. Given a coarse initial prediction, the method first "inverts" it into a noisy latent representation using DDIM inversion. Then, it performs a guided DDIM sampling (denoising) process to generate a refined, physically plausible output. The guidance comes from specific physical objectives (contact, penetration) that are formulated as gradients and injected into the score-based sampling loop. This entire process is iterated to further refine contact accuracy.

4.2. Core Methodology In-depth

4.2.1. Step 1: Affordance-Aware Regressor

The process begins with an initial, coarse prediction.

  • Input: A cropped RGB image II, human and object segmentation masks (ShS_h, SoS_o), and a coarse object template point cloud PoP_o.
  • Goal: To obtain an initial estimate of the human and object parameters. The full parameter vector to be optimized is x={θ,β,Ro,to}\pmb{x} = \{\theta, \beta, R_o, t_o\}, which includes:
    • Human pose θR52×6\theta \in \mathbb{R}^{52 \times 6} (using 6D rotation representation for stability, covering body and hands).
    • Human shape βR10\beta \in \mathbb{R}^{10}.
    • Object rotation RoR6R_o \in \mathbb{R}^{6} (6D representation).
    • Object translation toR3t_o \in \mathbb{R}^{3}.
  • Method:
    1. An image backbone (e.g., ResNet50) extracts a visual feature map F\mathcal{F} from the input image.
    2. To improve generalization to unseen objects, the model uses an affordance-aware network (pre-trained PointNeXt on a large 3D object dataset). This network provides prior knowledge about how an object can be used (its "affordances"), based on its geometry, which is more general than just a class label.
    3. Two separate prediction heads take the image feature F\mathcal{F} and estimate the initial human parameters (θ0,β0\theta^0, \beta^0) and object parameters (Ro0,to0R_o^0, t_o^0). This initial full parameter vector is denoted as xinit\pmb{x}^{init}.

4.2.2. Step 2: Optimization with Physical Guidance

This is the core refinement step, where the initial estimate xinit\pmb{x}^{init} is improved using the diffusion model.

  1. DDIM Inversion: The initial prediction xinit\pmb{x}^{init} (which is a clean data point, x0\pmb{x}_0) is deterministically noised up to an intermediate timestep τ\tau using the DDIM inversion process. This yields a noisy latent representation xτ\pmb{x}_\tau. The inversion process essentially "finds" the noise that would have led to xinit\pmb{x}^{init} if we had started from pure noise.

  2. Guided DDIM Sampling: Starting from xτ\pmb{x}_\tau, a guided denoising process is performed for τ\tau steps to get back to a clean prediction x^0\hat{\pmb{x}}_0. At each denoising step tt (from τ\tau down to 1), the model predicts the noise required to move from xt\pmb{x}_t to xt1\pmb{x}_{t-1}. This is where physical guidance is injected.

  3. Score Guidance Mechanism: The standard denoising process aims to predict the original noise ϵ\epsilon. The predicted noise is related to the score of the data distribution. To incorporate physical constraints P\mathcal{P}, the score is modified according to Bayes' rule: $ \nabla_{\pmb{x}_t} \log p(\pmb{x}t | \pmb{c}, \mathcal{P}) = \nabla{\pmb{x}_t} \log p(\pmb{x}t | \pmb{c}) + \nabla{\pmb{x}_t} \log p(\mathcal{P} | \pmb{c}, \pmb{x}_t) $

    • c\pmb{c} represents the conditions (image and geometry features).

    • xtlogp(xtc)\nabla_{\pmb{x}_t} \log p(\pmb{x}_t | \pmb{c}) is the score predicted by the conditional diffusion model.

    • xtlogp(Pc,xt)\nabla_{\pmb{x}_t} \log p(\mathcal{P} | \pmb{c}, \pmb{x}_t) is the guidance term that pushes the result towards satisfying the physical constraints P\mathcal{P}.

      Directly computing this guidance term on the noisy data xt\pmb{x}_t is difficult. The paper uses a crucial approximation: the gradient is computed on the predicted clean data x^0(xt)\hat{\pmb{x}}_0(\pmb{x}_t) and then used to guide the current step. $ \nabla_{\pmb{x}_t} \log p(\mathcal{P} | \pmb{c}, \pmb{x}t) \simeq \nabla{\pmb{x}_t} \log p(\mathcal{P} | \pmb{c}, \hat{\pmb{x}}_0(\pmb{x}_t)) $ The one-step denoised prediction x^0(xt)\hat{\pmb{x}}_0(\pmb{x}_t) is calculated as: $ \hat{\pmb{x}_0}(\pmb{x}_t) = \frac{1}{\sqrt{\alpha_t}}(\pmb{x}t - \sqrt{1-\alpha_t}\epsilon\phi(\pmb{x}_t, t)) $ where ϵϕ\epsilon_\phi is the noise predicted by the diffusion model and αt\alpha_t is a noise schedule parameter.

  4. Physical Objectives (LPL_{\mathcal{P}}): The guidance term is derived from a loss function LPL_{\mathcal{P}} that measures physical implausibility. The gradient of this loss, xtLP\nabla_{\pmb{x}_t} L_{\mathcal{P}}, is used for guidance. $ L_{\mathcal{P}} = \lambda_{\mathrm{ho}}L_{\mathrm{ho}} + \lambda_{\mathrm{of}}L_{\mathrm{of}} + \lambda_{\mathrm{pt}}L_{\mathrm{pt}} $ The individual constraints are defined on the predicted 3D meshes for the human (VhV_h) and object (VoV_o), which are derived from x^0(xt)\hat{\pmb{x}}_0(\pmb{x}_t):

    • Human-Object Contact (LhoL_{\mathrm{ho}}): This loss minimizes the distance between predicted contact regions on the human and object. $ L_{\mathrm{ho}} = ||(\mathbf{M}{\mathrm{h}} + \mathbf{M}{\mathrm{o}}) \odot |V_{\mathrm{h}} - V_{\mathrm{o}}| ||_2 $ where Mh\mathbf{M}_{\mathrm{h}} and Mo\mathbf{M}_{\mathrm{o}} are binary masks indicating the vertices on the human and object that are supposed to be in contact.
    • Object-Floor Contact (LofL_{\mathrm{of}}): This loss encourages predicted object vertices that should be on the floor to have a height of zero. $ L_{\mathrm{of}} = ||\mathbf{M}{\mathrm{f}} \odot |V{\mathrm{o}}| ||_1 $ where Mf\mathbf{M}_{\mathrm{f}} is a mask for object vertices that should touch the ground.
    • Penetration Avoidance (LptL_{\mathrm{pt}}): This loss penalizes human vertices that are inside the object volume. $ L_{\mathrm{pt}} = - \mathbb{E}[|\Phi_0^-(V_{\mathrm{h}})|] $ where Φ0(Vh)\Phi_0^-(V_h) is the Signed Distance Function (SDF) of the object, which gives a negative value for points inside the object. The loss is designed to increase when penetration occurs.
  5. Modified Noise Prediction: The guidance gradient is added to the original noise prediction to get a modified noise ϵϕ\epsilon'_\phi. $ \epsilon'\phi = \epsilon\phi(\pmb{x}t, t, \pmb{c}) + \rho \sqrt{1-\alpha_t} \nabla{\pmb{x}t} L{\mathcal{P}} $

    • ρ\rho is a guidance scale hyperparameter.
    • This modified noise ϵϕ\epsilon'_\phi is then used in the standard DDIM update step to compute xtΔt\pmb{x}_{t-\Delta t}, effectively steering the denoising trajectory towards physically plausible solutions.

4.2.3. Step 3: Contact-Driven Iterative Refinement

A single pass of the guided diffusion optimization might not be sufficient, especially if the initial contact prediction is poor. To address this, ScoreHOI employs an outer loop of iterative refinement.

The full process is detailed in Algorithm 1.

Algorithm 1 Contact-Driven Iterative Refinement
1: Input: Initial parameters x_0^0, diffusion model ε_θ, image feature F, time steps t, and condition c
2: Output: The optimized parameters x_0^N
3: for n = 0 to N-1 do
4:   F_h^n, F_o^n <- sample(x_0^n, F)
5:   {M_i}_{i∈{h,o,f}} <- Contact(x_0^n, F_h^n, F_o^n)
6:   x_0^(n+1) <- DDIMLoop(x_0^n, t, c, ε_θ, {M_i}_{i∈{h,o,f}})
7: end for
8: return x_0^N
  • The algorithm iterates NN times (e.g., N=10N=10).
  • In each iteration nn, it starts with the refined parameters from the previous step, x0n\pmb{x}_0^n.
  • A contact predictor module re-estimates the contact masks (Mh,Mo,Mf\mathbf{M}_h, \mathbf{M}_o, \mathbf{M}_f) based on the current refined parameters x0n\pmb{x}_0^n.
  • The entire DDIM inversion and guided sampling loop (DDIMLoop) is then executed using these updated contact masks.
  • This creates a feedback loop: better pose estimates lead to better contact predictions, which in turn lead to better guidance for the next round of pose refinement.

4.2.4. Diffusion Model Architecture

The diffusion model itself is a Transformer-based architecture. A key component is the IG-Adapter (Image and Geometry Adapter), shown in Figure 3.

Figure 3. The overview of IG-Adapter. We introduce an IGAdapter designed to integrate the image feature guidance \(c _ { \\mathrm { I } }\) and the geometry feature guidance \(c _ { \\mathrm { G } }\) into the diffusion model. The incorporation of observational and geometric awareness enhances the controllability of the model during the inference process. 该图像是示意图,展示了IG-Adapter的结构。图中展示了如何将图像特征cIc_I与几何特征cGc_G通过线性层和交叉注意力机制整合到扩散模型中,增强模型在推理过程中的可控性。箭头指向输入和输出,分别标记为xtxt-1,强调特征融合对重建过程的重要性。

  • Goal: To effectively condition the diffusion process on both visual and geometric information.
  • Conditions:
    1. Image Feature Condition (cIc_I): Derived from the global average pooling of the image feature map F\mathcal{F}. It provides visual context.
    2. Geometry Feature Condition (cGc_G): Extracted from the object point cloud using the pre-trained affordance-aware network (PointNeXt). It provides geometric prior knowledge.
  • Architecture: The IG-Adapter consists of additional cross-attention layers that are inserted into the main diffusion model. The noisy parameter vector xt\pmb{x}_t attends to both the image condition cIc_I and the geometry condition cGc_G. This allows the model to be aware of both the visual appearance in the image and the intrinsic geometric properties of the object during denoising.
  • Training Objective: The diffusion model is trained to predict the noise ϵ\epsilon added to the clean data x0\pmb{x}_0, conditioned on the timestep tt and the conditions cIc_I and cGc_G. $ L_{\mathrm{DM}} = \mathbb{E}_{\pmb{x}_0, \epsilon, t, \pmb{c}_I, \pmb{c}G} ||\epsilon - \epsilon\theta(\pmb{x}_t, t, \pmb{c}_I, \pmb{c}_G)||^2 $

5. Experimental Setup

5.1. Datasets

The experiments are conducted on three standard datasets for human-object interaction.

  • BEHAVE [2]: A large-scale dataset featuring 8 subjects interacting with 20 different objects in natural indoor and outdoor environments. It provides multi-view RGB-D videos with 3D ground truth annotations for human pose, object pose, and contact regions. It is a primary benchmark for this task.
  • InterCap [23]: This dataset focuses on close-range interactions, with 10 subjects interacting with 10 objects. It includes ground truth from multi-view RGB-D cameras and is particularly good for evaluating hand-object and foot-object contact.
  • IMHD [72]: A dataset with 15 subjects in 10 interaction scenarios, captured by 32 cameras. The authors use only the training set of IMHD during the diffusion model training phase to augment its generative capabilities with more diverse interaction data, without using it for evaluation.

5.2. Evaluation Metrics

The performance of the model is evaluated using metrics for both reconstruction accuracy and contact quality.

5.2.1. Chamfer Distance (CD)

  • Conceptual Definition: Chamfer Distance measures the dissimilarity between two point clouds. It calculates the average closest point distance from each point in one set to the other set. A lower CD indicates a better match between the predicted and ground truth meshes. It is calculated separately for the human and object.
  • Mathematical Formula: For two point clouds S1S_1 and S2S_2: $ \mathrm{CD}(S_1, S_2) = \frac{1}{|S_1|} \sum_{x \in S_1} \min_{y \in S_2} ||x - y||2^2 + \frac{1}{|S_2|} \sum{y \in S_2} \min_{x \in S_1} ||y - x||_2^2 $
  • Symbol Explanation:
    • S1S_1: The predicted point cloud (mesh vertices).
    • S2S_2: The ground truth point cloud (mesh vertices).
    • x, y: Points in the respective point clouds.
    • S|S|: The number of points in the point cloud. The paper reports CDhuman\mathbf{CD_{human}} and CDobject\mathbf{CD_{object}} in centimeters (cm).

5.2.2. Contact Precision, Recall, and F-Score

  • Conceptual Definition: These metrics evaluate how well the model identifies the correct contact regions on the human body. A human vertex is classified as "in contact" if it is within 5cm of the object mesh. This binary classification result is then compared to the ground truth contact map.
    • Precision: Of all the vertices the model predicted to be in contact, what fraction actually were? (Measures correctness).
    • Recall: Of all the vertices that were actually in contact, what fraction did the model correctly identify? (Measures completeness).
    • F-Score: The harmonic mean of precision and recall, providing a single balanced measure of contact quality.
  • Mathematical Formula: $ \text{Precision} = \frac{\text{TP}}{\text{TP} + \text{FP}} $ $ \text{Recall} = \frac{\text{TP}}{\text{TP} + \text{FN}} $ $ \text{F-Score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $
  • Symbol Explanation:
    • TP (True Positives): Number of vertices correctly predicted as being in contact.
    • FP (False Positives): Number of vertices incorrectly predicted as being in contact.
    • FN (False Negatives): Number of vertices incorrectly predicted as not being in contact. The paper uses the notations ContactprecContact_p^rec, ContactrrecContact_r^rec, and ContactFSrecContact_F-S^rec for these metrics.

5.3. Baselines

The proposed ScoreHOI method is compared against several state-of-the-art baselines:

  • PHOSA [68]: A regression-based method that reasons about interactions using predefined contact pairs.

  • CHORE [59]: A strong optimization-based method that reconstructs human, object, and contact from a single image.

  • CONTHO [37]: A state-of-the-art regression-based method that uses a contact-aware transformer for refinement.

  • VisTracker [61]: An optimization-based method for tracking HOI from video.

    These baselines are representative of the main competing paradigms in the field.

6. Results & Analysis

6.1. Core Results Analysis

The main quantitative results are presented in Table 1, comparing ScoreHOI with previous state-of-the-art methods on the BEHAVE and InterCap datasets.

The following are the results from Table 1 of the original paper:

Datasets Methods CDhuman CDobject Contactprec Contactrrec ContactF-Srec
BEHAVE PHOSA [68] 12.17 26.62 0.393 0.266 0.317
CHORE [59] 5.58 10.66 0.587 0.472 0.523
CONTHO [37] 4.99 8.42 0.628 0.496 0.554
Ours 4.85 7.86 0.634 0.586 0.609
InterCap PHOSA [68] 11.20 20.57 0.228 0.159 0.187
CHORE [59] 7.01 12.81 0.339 0.253 0.290
CONTHO [37] 5.96 9.50 0.661 0.432 0.522
Ours 5.56 8.75 0.627 0.590 0.578

Analysis:

  • Superior Performance: ScoreHOI achieves the best performance across most metrics on both datasets. It obtains the lowest (best) Chamfer Distance for both human and object reconstruction on BEHAVE and InterCap, indicating higher geometric accuracy.

  • Significant Improvement in Contact: The most impressive result is the significant improvement in contact metrics. On the BEHAVE dataset, ScoreHOI achieves a contact F-Score of 0.609, which is a 9.9% relative improvement over the previous best (CONTHO's 0.554). This is driven by a large gain in contact recall (0.586 vs. 0.496), demonstrating that the physical guidance effectively encourages plausible contacts without sacrificing precision.

  • Qualitative Results: Figure 4 shows qualitative comparisons, where ScoreHOI produces visually more plausible interactions, correctly handling contact and avoiding penetration, especially in challenging side views where other methods fail.

    该图像是多个输入图像与人-物体交互重建结果的对比示意图,展示了前视图和侧视图的不同展示效果,对比了 CHORE、CONTHO 和我们的 ScoreHOI 方法在重建过程中的效果。 该图像是多个输入图像与人-物体交互重建结果的对比示意图,展示了前视图和侧视图的不同展示效果,对比了 CHORE、CONTHO 和我们的 ScoreHOI 方法在重建过程中的效果。

6.2. Efficiency Comparison

The following are the results from Table 2 of the original paper:

Methods CDhuman CDobject FPS↑
CHORE [59] 5.58 10.66 0.0035
VisTracker [61] 5.24 7.89 0.0359
Ours-Faster 4.87 7.95 2.0080
Ours 4.85 7.86 0.2895

Analysis:

  • ScoreHOI is substantially more efficient than traditional optimization-based methods. The standard version runs at 0.2895 Frames Per Second (FPS), which is ~80 times faster than CHORE.
  • A faster version, Ours-Faster (with N=2N=2 iterations), achieves 2.0 FPS with only a minimal drop in performance. This demonstrates a flexible trade-off between speed and accuracy, making the method practical for a wider range of applications.

6.3. Ablation Studies / Parameter Analysis

Ablation studies were conducted on the BEHAVE dataset to validate the contribution of each component.

The following are the results from Table 3 of the original paper:

Methods CDhuman CDobject Contactprec Contactrrec ContactF-Srec
* Module
w/o diffusion 5.03 8.48 0.612 0.523 0.588
w/o CDIR 4.93 7.98 0.628 0.545 0.577
* Condition
No condition 4.94 8.23 0.626 0.549 0.585
w/o cG 4.87 7.99 0.628 0.559 0.591
w/o cI 4.88 8.03 0.631 0.566 0.597
* Guidance
No guidance 4.93 8.01 0.624 0.524 0.570
w/o Lho 4.87 7.95 0.632 0.525 0.574
w/o Lpt 4.87 7.93 0.619 0.567 0.592
w/o Lof 4.89 7.95 0.631 0.577 0.602
Full model 4.85 7.86 0.634 0.586 0.609

Analysis of Ablations:

  • Optimizer Modules: Removing the diffusion optimizer (w/o diffusion) or the contact-driven iterative refinement (w/o CDIR) both degrade performance, especially the F-Score, confirming their importance. CDIR is particularly crucial for boosting the final contact quality.
  • Conditions: Removing either the image condition (cIc_I) or the geometry condition (cGc_G) leads to worse results. This shows that conditioning the diffusion model on both visual and geometric information is beneficial.
  • Physical Guidance: Removing the physical guidance (No guidance) causes a significant drop in the F-Score, primarily due to lower recall. Removing individual loss terms also hurts performance:
    • Without contact loss (w/o L_ho), recall drops sharply, as the model no longer explicitly tries to make parts touch.

    • Without penetration loss (w/o L_pt), precision drops, as more invalid contacts (penetrations) are produced. This is also shown qualitatively in Figure 5, where removing the losses leads to visible artifacts like hands passing through objects.

      Figure 5. Qualitative results for ablation study. Upper row: the ablation study of \(L _ { \\mathrm { { h o } } }\) . The inclusion of contact guidance between the \(L _ { \\mathrm { h o } }\) . The absence of a penetration penalty results in a notable rise in the occurrence of unreasonable interactions. 该图像是插图,展示了物体与人类交互的重建效果。在上半部分的“正面视图”和下半部分的“侧面视图”中,包含多个输入图像及其对应的重建模型。每个视图分别展示了在不使用接触损失(LhoL_{ho}LptL_{pt})和完整模型的情况下,如何影响交互效果。图中可见不同条件下的小人和物体的相对位置,突显了接触引导在提高交互重建质量方面的重要性。

      Analysis of Hyper-parameters (Table 4): This table explores the effects of the number of iterative refinements (NN), the intermediate noise level (ττ), and the DDIM sampling step size (ΔtΔt). The results show a trade-off: more iterations (NN) and a higher noise level (ττ) generally improve contact fidelity, while more DDIM steps (ΔtΔt) improve overall reconstruction quality. The chosen baseline (N=10,τ=50,Δt=2N=10, τ=50, Δt=2) represents a good balance between performance and efficiency.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper introduces ScoreHOI, a novel and highly effective framework for the physically plausible 3D reconstruction of human-object interactions from a single image. By ingeniously framing the refinement task as a guided conditional generation problem, the method leverages the strong priors of a diffusion model as a powerful optimizer. The core contributions—score-guided physical sampling and a contact-driven iterative refinement loop—work in concert to produce reconstructions that are not only geometrically accurate but also significantly more physically plausible than those from prior art. The method achieves state-of-the-art results on major benchmarks while being orders of magnitude faster than traditional optimization-based techniques.

7.2. Limitations & Future Work

The authors acknowledge one primary limitation:

  • Dependence on Object Templates: The current method relies on a known canonical pose for the objects in the training set. This makes it difficult to optimize parameters for novel objects where such a template or canonical pose is not available.

    As a direction for future work, the authors plan to address this by developing a template-free approach capable of handling unseen objects without predefined canonical poses.

7.3. Personal Insights & Critique

ScoreHOI represents a significant conceptual advance in the field of 3D reconstruction and, more broadly, for solving inverse problems in computer vision.

Positive Insights:

  • Diffusion Models as Optimizers: The paper provides a compelling demonstration of using diffusion models not just for generation from noise, but as powerful, learned optimizers for refinement tasks. The ability to inject arbitrary (differentiable) constraints via score guidance is an elegant and powerful paradigm that could be applied to many other domains, such as medical image reconstruction, robotic motion planning, or computational design.
  • Balancing Priors and Evidence: The method strikes an excellent balance between a strong, learned data prior (from the diffusion model) and explicit, hard-coded rules (physical constraints). This is a long-standing challenge in AI, and ScoreHOI offers a very promising solution framework.
  • Iterative Refinement Loop: The contact-driven iterative refinement is a clever feedback mechanism. It turns a static optimization into a dynamic process where the problem definition (the contact targets) is improved along with the solution, leading to progressively better results.

Potential Issues and Critique:

  • Complexity and Hyperparameters: The system has many moving parts and hyperparameters (ρρ for guidance scale, NN for iterations, ττ for noise level, ΔtΔt for DDIM steps, and λλ weights for physical losses). Tuning these for optimal performance may be complex and dataset-dependent.

  • Reliance on Initial Regressor: The quality of the final output is likely still dependent on the quality of the initial coarse prediction. While the diffusion optimizer is powerful, a very poor starting point might be difficult to recover from, potentially leading the optimization to a wrong but locally plausible mode of the distribution.

  • Generalization to "Interaction" itself: While the affordance-aware network helps with object geometry, the model's understanding of "interaction" is learned from the training data (BEHAVE, InterCap, IMHD). Its ability to generalize to truly novel types of interactions not seen in training might be limited.

    Overall, ScoreHOI is an impressive piece of work that pushes the boundaries of 3D human-object understanding. It introduces a powerful new paradigm that successfully marries the generative strength of diffusion models with the logical rigor of physical constraints, setting a new standard for the field.

Similar papers

Recommended via semantic vector search.

No similar papers found yet.