
Thinking in 360°: Humanoid Visual Search in the Wild

Published: 11/25/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The study introduces humanoid visual search (HVS), in which agents actively rotate their heads to search for objects or paths within immersive 360° images. The new benchmark H*Bench emphasizes advanced visual-spatial reasoning in challenging in-the-wild scenes. Experiments reveal low success rates (~30%) even for top proprietary models, though post-training more than triples the success rate of the open-source Qwen2.5-VL; path search remains harder than object search due to its demand for spatial commonsense.

Abstract

Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

Thinking in 360°: Humanoid Visual Search in the Wild

1.2. Authors

Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li

Their affiliations include NYU, NVIDIA, TU Darmstadt, UC Berkeley, and Stanford University. The authors represent a collaborative effort from both academia and industry in the fields of artificial intelligence, robotics, and computer vision.

1.3. Journal/Conference

This paper is published as a preprint on arXiv. While not yet peer-reviewed and published in a formal journal or conference proceedings at the time of this analysis, arXiv is a highly influential platform for rapid dissemination of research in computer science, physics, mathematics, and related fields. Papers on arXiv often represent cutting-edge work that is in the process of peer review or has been accepted to top-tier venues.

1.4. Publication Year

2025

1.5. Abstract

Humans effectively use a combination of head (cephalomotor) and eye (oculomotor) movements for visual search in a 360° environment. Current visual search methods, however, are typically confined to static images, overlooking physical embodiment and real-world interaction. This paper proposes humanoid visual search (HVS), where a humanoid agent actively rotates its head to locate objects or paths within an immersive 360° panoramic image. To study this in complex, visually-crowded real-world scenarios, the authors introduce H*Bench, a new benchmark focusing on challenging in-the-wild scenes such as transportation hubs, large retail spaces, urban streets, and public institutions, moving beyond traditional household settings. Experiments reveal that even top-tier proprietary models achieve only approximately 30% success in object and path search. The authors then enhance the open-source Qwen2.5-VL model using post-training techniques, which boosts its success rate more than threefold for both object search (HOS, from 14.83% to 47.38%) and path search (HPS, from 6.44% to 24.94%). The lower success ceiling for path search indicates its greater difficulty, attributed to the need for sophisticated spatial commonsense. The results suggest a promising direction for developing Multimodal Large Language Model (MLLM) agents that can integrate into human daily life, while also quantifying the substantial challenges that remain.

The paper is available as a preprint on arXiv: https://arxiv.org/abs/2511.20351v1

https://arxiv.org/pdf/2511.20351v1.pdf

2. Executive Summary

2.1. Background & Motivation

2.1.1. Core Problem

The core problem the paper addresses is the limitation of existing visual search methods in accurately simulating human-like visual exploration within dynamic, 3D environments. Current state-of-the-art computational methods, primarily based on Multimodal Large Language Models (MLLMs), typically operate on a single, static 2D image. This approach suffers from two fundamental gaps compared to biological visual search:

  1. Non-interactive: Models cannot change their perspective or acquire information beyond their initial field of view, limiting their ability to explore.
  2. Disembodied: Models lack physical embodiment, meaning they cannot couple visual reasoning with actions in the physical world. The search often becomes an abstract perceptual exercise rather than a goal-directed behavior.

2.1.2. Importance and Challenges

This problem is crucial because developing embodied visual agents that can actively search for information in visually crowded scenes has significant potential in various real-world applications, including:

  • Humanoid robots: Enabling robots to efficiently find objects or navigate in complex environments.

  • Assistive technology: Developing intelligent systems to help humans with visual impairments or in challenging search tasks.

  • Augmented reality: Creating more intuitive and interactive AR experiences.

    The main challenges in prior research are:

  • Limited perceptual realism: Existing embodied AI platforms often lack the visual fidelity of real-world scenes.

  • Restriction to household scenes: Most benchmarks are confined to simpler, controlled household environments, failing to capture the structural (multi-level layouts), semantic (dense compositional cues), and volumetric (cluttered 3D space) complexities of in-the-wild human-made environments.

  • Hardware constraints: Developing and testing embodied agents in real-world hardware or highly realistic 3D simulators is expensive, difficult to scale, and hard to reproduce.

2.1.3. Innovative Idea

The paper's innovative idea is to prototype humanoid visual search (HVS). This approach allows humanoid agents to couple deliberate reasoning with active head turns for visual search in complex environments. A key enabler is a scalable paradigm where a single 360° panorama closes the perception-action cycle, effectively serving as a lightweight, hardware-free simulator. This bypasses the constraints of real-world hardware and expensive 3D simulators, making it tractable to study embodied visual search in diverse, challenging in-the-wild scenes.

2.2. Main Contributions / Findings

2.2.1. Primary Contributions

The paper makes three primary contributions:

  1. Introduces Humanoid Visual Search (HVS): A novel task that enables human-like active spatial reasoning in 360° environments, bridging the gap between passive visual reasoning and active embodied interaction. This includes two core forms: humanoid object search (HOS) and humanoid path search (HPS).
  2. Proposes a Scalable Framework and H*Bench: A new benchmark, H*Bench, is introduced, which leverages real-world 360° panoramas as lightweight simulators. This creates a hardware-free platform for studying embodied reasoning in in-the-wild environments (e.g., transportation hubs, large retail spaces, public institutions). It features dense annotations for embodied task questions and ground-truth actions.
  3. Conducts Thorough Evaluations and Demonstrates Post-Training Effectiveness: The paper conducts comprehensive evaluations showing that post-training techniques (Supervised Fine-Tuning and Reinforcement Learning) can significantly improve the performance of MLLMs in HVS. It also highlights major unresolved challenges and promising avenues for future research.

2.2.2. Key Conclusions / Findings

The paper reached several key conclusions:

  • Significant Performance Gap in MLLMs: Even top-tier proprietary models like GPT-4o and Gemini2.5-Pro falter, achieving only approximately 30% success in object and path search on H*Bench, indicating the inherent difficulty of the task for existing models.
  • Effectiveness of Post-Training: Post-training techniques (specifically, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)) can substantially enhance the performance of open-source MLLMs like Qwen2.5-VL. HVS-3B (their fine-tuned model) increased its success rate from 14.83% to 47.38% for object search and from 6.44% to 24.94% for path search.
  • Path Search is Inherently More Difficult: The consistently lower success ceiling for path search compared to object search reveals its greater inherent difficulty. This is attributed to the demand for sophisticated spatial commonsense, physical commonsense, and socio-spatial commonsense, which are often implicit and procedural.
  • Limitations of Post-Training for Higher-Order Reasoning: While post-training improves low-level perceptual-motor abilities (visual grounding, exploration), it struggles to impart higher-level reasoning capabilities required for path search. RL, while beneficial for simpler tasks, can sometimes paradoxically degrade performance on more complex path search scenarios, potentially due to reward hacking.
  • Active vs. Passive Visual Search: The active visual search paradigm (rotating a narrow field-of-view) is superior to passive analysis of a complete panorama, mimicking human efficiency and avoiding panoramic distortions.
  • Embodied vs. Disembodied Benchmarks: Capabilities learned from passive Internet data (on 2D benchmarks like V*Bench) do not transfer well to embodied active interaction in 3D (H*Bench), highlighting the unique challenges of embodied AI.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To fully understand this paper, a novice reader should be familiar with the following fundamental concepts:

3.1.1. Multimodal Large Language Models (MLLMs)

Multimodal Large Language Models (MLLMs) are advanced artificial intelligence models that can understand and process information from multiple types, or modalities, of data simultaneously. Traditionally, Large Language Models (LLMs) (like GPT-3 or GPT-4) primarily deal with text. MLLMs extend this capability by integrating other modalities such as images, audio, or video. This means an MLLM can, for example, take an image as input and answer questions about its content in natural language, or generate a description of the image. They achieve this by aligning the feature spaces of different modalities (e.g., using visual encoders to extract features from images and language encoders for text) and feeding these combined features into a powerful LLM backbone for reasoning and generation. Their ability to process and reason across different data types makes them a promising pathway toward Artificial General Intelligence (AGI).

3.1.2. Visual Search

Visual search is the cognitive process by which humans and artificial agents scan a visual environment to locate specific targets or information among distractors. It is a fundamental aspect of perception and attention. In computational terms, it involves an agent processing visual input to identify a target object, a specific feature, or a navigable path based on a given query or goal. Unlike general object detection, visual search often implies an active, goal-directed process that might involve exploration and decision-making over time, especially in complex or crowded scenes.

3.1.3. Embodied AI

Embodied AI refers to artificial intelligence systems that are situated within a physical (or highly realistic simulated physical) body and interact with a 3D environment. Unlike purely computational AI that operates on abstract data, embodied AI agents perceive their surroundings through sensors (like cameras), process that information, and take physical actions (like moving, grasping, or rotating) that affect their environment and subsequent perceptions. This field focuses on developing agents that can exhibit intelligent behavior in the real world, grounding their knowledge in physical interactions and commonsense reasoning.

3.1.4. 360° Panoramic Images

A 360° panoramic image is a wide-angle photograph that captures a complete view of a scene in all directions. Imagine standing at a single point and taking pictures all around you, then stitching them together into one continuous image. These images provide an immersive representation of a 3D environment from a fixed viewpoint. In the context of this paper, they are used as lightweight simulators, allowing an agent to "look around" by projecting different narrow field-of-view (FoV) perspective images from the panorama, simulating head movements without needing a full 3D model.

3.1.5. Supervised Fine-Tuning (SFT)

Supervised Fine-Tuning (SFT) is a common technique in machine learning, particularly with Large Language Models (LLMs) and MLLMs. It involves taking a pre-trained model (a model that has already learned general patterns from a vast amount of data) and further training it on a smaller, task-specific, labeled dataset. The goal of SFT is to adapt the general knowledge of the pre-trained model to excel at a particular downstream task, like humanoid visual search in this case. The training process uses supervised learning, meaning the model learns from input-output pairs (e.g., visual observation + instruction → correct action + chain-of-thought rationale).

3.1.6. Reinforcement Learning (RL)

Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make sequential decisions by interacting with an environment. The agent performs actions, and in response, the environment transitions to a new state and provides a reward signal. The agent's goal is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. Unlike supervised learning which learns from explicit input-output pairs, RL learns from trial and error, often suited for tasks requiring long-horizon planning and strategy. In this paper, RL is used as a post-training step to further refine the MLLM agent's policy and improve its instruction-following and reasoning capabilities.

3.1.7. Chain-of-Thought (CoT)

Chain-of-Thought (CoT) is a prompting technique used with Large Language Models (LLMs) to encourage them to generate a series of intermediate reasoning steps before providing a final answer. Instead of directly asking for an answer, the model is prompted to "think step by step." This process makes the model's reasoning explicit, often leading to more accurate and robust solutions for complex tasks, especially those requiring multi-step logical reasoning or problem-solving. In this paper, CoT rationales are used during SFT to instill structured reasoning abilities in the MLLM agent.
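As a toy illustration of what a CoT-style turn could look like in the HVS setting (a hypothetical format for intuition, not the paper's actual prompt template), consider the following Python snippet:

```python
# Hypothetical illustration: one chain-of-thought turn an HVS agent might emit
# before acting. Field names and values are invented for this example.
example_turn = {
    "observation": "perspective view at azimuth=40 deg, pitch=0 deg",
    "instruction": "Find the ticket vending machine.",
    "thought": (
        "I see a row of turnstiles and an overhead sign pointing left toward "
        "'Tickets'. The machine is likely outside my current field of view, "
        "so I should rotate left before submitting."
    ),
    "action": {"type": "rotate", "delta_phi": -60, "delta_gamma": 0},
}

if __name__ == "__main__":
    print(example_turn["thought"])
    print(example_turn["action"])
```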

3.1.8. Fovea/Foveation

In human vision, the fovea is a small pit in the center of the retina responsible for sharp central vision, providing the highest visual acuity. When we look directly at an object, we are foveating it, meaning we are aligning our eyes so that the image of the object falls onto our fovea. This allows us to perceive fine details. In the context of humanoid object search (HOS), foveation refers to the agent's action of rotating its "head" (or camera) to bring the target object into the central, high-resolution region of its field of view, mimicking human visual precision.

3.2. Previous Works

The paper contextualizes its work by discussing several related areas:

3.2.1. Visual Search

Early visual search methods relied on bottom-up visual saliency (identifying visually prominent features), top-down contextual guidance (using prior knowledge to direct search), or a combination thereof [40, 56, 65]. These often struggled to generalize due to limited contextual understanding. Recent advancements, like V* [58], leverage MLLMs with rich world knowledge (e.g., object co-occurrence) to improve performance. However, these works predominantly focus on search within static 2D images. This disembodied approach neglects the active, dynamic nature of human visual search, which coordinates both eye (oculomotor) and head (cephalomotor) movements to explore the 3D world, where the head explores unseen regions and the eyes exploit visible content [4, 30, 50].

3.2.2. Visual Navigation

Visual navigation and vision-language navigation (VLN) aim to develop agents that can move through an environment to reach a specified goal [2, 72]. These tasks typically require either a full 3D simulator or real physical hardware, which are difficult to build, scale, and reproduce. Consequently, most efforts have been confined to household scenes where 3D data is more accessible [7-10, 37, 45, 71]. This leaves in-the-wild challenges largely unexplored. This paper is motivated by the observation that human navigation involves intermittent reasoning at critical decision points, allowing them to focus on these points using 360° panoramas to bypass the need for full 3D simulation or physical hardware.

3.2.3. Multimodal LLMs (MLLMs)

MLLMs are at the forefront of Artificial General Intelligence (AGI), integrating various data modalities (text, images, etc.). Key foundational models include Flamingo [1], BLIP [33, 34], and LLaVA [35], which focus on aligning visual encoder features with LLMs. More recent MLLMs like GPT-4o [41] and Gemini 2.5 [46] have set new benchmarks through increased model capacity and novel training recipes, notably using Reinforcement Learning (RL)-based post-training to align outputs with human preferences and improve instruction-following [43, 63]. RL can also foster stronger reasoning for complex, multi-step tasks [20, 22, 27, 28]. This paper grounds MLLMs in the physical world to assess and improve their active and embodied visual search capabilities.

3.2.4. Multimodal LLMs with Tools

Inspired by humans using external tools, LLM agents have demonstrated superior performance in long-horizon tasks by leveraging external toolkits (e.g., web browsing, code execution) and multi-turn reinforcement learning [17, 18, 24, 42]. This concept extends to multimodal settings, where MLLMs generate symbolic tool calls (e.g., OCR, marking, cropping, zoom in) to overcome limitations in semantic grounding and visual perception [36, 44, 49, 67, 70]. However, these tool operations typically occur on a disembodied 2D canvas, involving computational manipulations of a static image file. This paper distinguishes itself by coupling tool use with physical world actions, specifically active head rotation, to construct a visual chain of thought, bridging the gap between passive and active embodied reasoning.

3.2.5. Multimodal LLMs for Embodied Reasoning

A growing body of research aims to ground MLLMs in embodied reasoning to bridge the gap between symbolic linguistic representations and physical world perception [12, 23, 54, 64, 68]. Cosmos-Reason1 [3] enables MLLMs to perceive the physical world via video and generate physically grounded responses. Gemini Robotics-ER [52] extends Gemini's multimodal reasoning to the physical world with enhanced spatiotemporal understanding. This paper specifically focuses on active visual search with interleaved multimodal reasoning, an area that remains largely unexplored in this context.

3.3. Technological Evolution

The field has evolved from:

  1. Early Visual Search (pre-MLLMs): Focused on saliency maps and top-down context within static 2D images. Limited by lack of world knowledge and generalization.

  2. MLLMs for 2D Visual Search (V*, etc.): Leveraged MLLMs' rich world knowledge to improve static 2D search, often using computational tools like zooming and cropping on a fixed canvas. Still disembodied.

  3. Visual Navigation in Simulators: Developed agents for navigation in 3D environments, but often restricted to simple household scenes due to the complexity and cost of 3D simulators or real hardware.

  4. MLLMs with Tool Use and Embodied Reasoning (Current): Began integrating MLLMs with external tools (still mostly 2D computational) and exploring grounding MLLMs in embodied contexts (often passively observing or generating high-level plans).

    This paper's work fits into the current wave by pushing MLLMs beyond passive observation and 2D tool use into active, embodied interaction in 3D environments, specifically focusing on the critical visual search component necessary for open-world navigation and manipulation. It uniquely achieves this without full 3D simulation by cleverly using 360° panoramas.

3.4. Differentiation Analysis

Compared to previous approaches, this paper's core innovations and differentiations are:

  • Active & Embodied Interaction: Unlike prior visual search methods confined to static 2D images or MLLMs using disembodied 2D tools, this work introduces humanoid visual search (HVS) where agents actively rotate their "head" (camera view within a 360° panorama) to change their perspective and gather new information. This tightly couples visual reasoning with physical-like actions.
  • Scalable, Hardware-Free Embodied Simulation: Instead of relying on expensive 3D simulators or real hardware (common in visual navigation), the paper proposes a lightweight framework using 360° panoramic images. This allows for a closed-loop perception-action cycle in in-the-wild scenes without the traditional overhead, enabling scalable research.
  • Focus on In-the-Wild Complexity: While embodied AI platforms often use simplistic household scenes, H*Bench explicitly moves to visually-crowded, structurally, semantically, and volumetrically complex real-world environments (e.g., subway stations, shopping malls). This necessitates more advanced visual-spatial reasoning.
  • Task-Driven Embodiment: The search is explicitly driven by embodied tasks: humanoid object search (HOS) (for manipulation) and humanoid path search (HPS) (for locomotion), making the search goal-directed and physically relevant, unlike abstract perceptual exercises.
  • Human-like Eye-Head Coordination: The proposed model prototypes human-like eye-head coordination where head rotations explore new regions, aligning with neuroscience findings.

4. Methodology

4.1. Principles

The core idea of humanoid visual search (HVS) is to enable Multimodal Large Language Models (MLLMs) to perform active visual exploration and reasoning in 360° immersive environments, mimicking how humans coordinate head and eye movements. This is driven by the observation that human spatial intelligence involves critical decision points where observation, reasoning, and ambiguity resolution occur before action. The method abstracts full-body motion to the atomic action of head rotation, allowing the study of core cognitive processes of embodied visual search in a tractable yet realistic manner. The system functions as a tool-augmented MLLM where head rotations are treated as tools that continuously construct a visual chain of thought.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

The environment is modeled as a single 360° panoramic image. From this panorama, narrow field-of-view (FoV) perspective images are sampled, representing the agent's observations. Each observation $o_{\phi, \gamma}$ is defined by its azimuth $\phi$ and polar angle $\gamma$.

The objective of HVS is to identify the optimal direction $(\phi^*, \gamma^*)$ that maximizes the probability of task success $r_s$ given a natural language instruction $x$ and the visual observation $o_{\phi, \gamma}$. This can be formally expressed as:

$ (\phi^*, \gamma^*) = \arg \max_{\phi, \gamma} P (r_s \mid o_{\phi, \gamma}, x) $

  • $(\phi^*, \gamma^*)$: The optimal final viewing direction (azimuth and polar angle) that the agent needs to find.

  • $\arg \max_{\phi, \gamma}$: The operator that finds the values of $\phi$ and $\gamma$ that maximize the subsequent expression.

  • $P(r_s \mid o_{\phi, \gamma}, x)$: The probability of achieving task success $r_s$ given the current visual observation $o_{\phi, \gamma}$ (a perspective image from the panorama) and the language instruction $x$.

    The paper defines two core embodied search tasks:

  • Humanoid Object Search (HOS): The agent's goal is to locate a target object and foveate it, meaning the final viewing direction $(\phi^*, \gamma^*)$ must bring the target object into the central foveal region of the perspective view. This is a prerequisite for potential manipulation tasks.

  • Humanoid Path Search (HPS): The agent needs to identify a navigable path to a target location. The goal is to find a final viewing direction $\phi^*$ (only azimuth is considered, assuming a planar ground) that is aligned with the optimal path direction. This serves as a high-level planning step before actual locomotion.
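
The panorama-as-simulator idea is easy to prototype. Below is a minimal sketch (not the authors' code) that renders a perspective observation $o_{\phi,\gamma}$ from an equirectangular panorama with NumPy, using nearest-neighbour sampling and a pinhole camera model; the function name, default FoV, and output resolution are illustrative assumptions.

```python
import numpy as np

def perspective_from_panorama(pano, phi_deg, gamma_deg, fov_deg=90.0, out_hw=(480, 640)):
    """Render a narrow-FoV perspective view o_{phi,gamma} from an equirectangular
    panorama by nearest-neighbour sampling. pano: (H, W, 3) array."""
    H, W = pano.shape[:2]
    h, w = out_hw
    f = (w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)  # pinhole focal length in pixels

    # Ray directions in the camera frame (z forward, x right, y up).
    xs = (np.arange(w) - w / 2.0 + 0.5) / f
    ys = (h / 2.0 - np.arange(h) - 0.5) / f
    x, y = np.meshgrid(xs, ys)
    z = np.ones_like(x)
    d = np.stack([x, y, z], axis=-1)
    d /= np.linalg.norm(d, axis=-1, keepdims=True)

    # Rotate rays by pitch (gamma, positive looks up) then yaw (phi, positive turns right).
    g, p = np.radians(gamma_deg), np.radians(phi_deg)
    rot_x = np.array([[1, 0, 0],
                      [0,  np.cos(g), np.sin(g)],
                      [0, -np.sin(g), np.cos(g)]])
    rot_y = np.array([[ np.cos(p), 0, np.sin(p)],
                      [0, 1, 0],
                      [-np.sin(p), 0, np.cos(p)]])
    d = d @ rot_x.T @ rot_y.T

    # Convert rays to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(d[..., 0], d[..., 2])           # [-pi, pi]
    lat = np.arcsin(np.clip(d[..., 1], -1, 1))       # [-pi/2, pi/2]
    u = ((lon / (2 * np.pi) + 0.5) * W).astype(int) % W
    v = ((0.5 - lat / np.pi) * H).astype(int).clip(0, H - 1)
    return pano[v, u]

# Usage with a dummy panorama: look 90 deg to the right, slightly upward.
view = perspective_from_panorama(np.zeros((1024, 2048, 3), dtype=np.uint8),
                                 phi_deg=90, gamma_deg=10)
```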

4.2.2. Humanoid Visual Search with MLLMs

The humanoid visual search task is framed as a multimodal reasoning task where MLLM tool use is coupled with head rotation. An MLLM agent's behavior is governed by a policy $\pi_{\theta}(y_t, a_t \mid o_t, x, \mathcal{H}_t)$.

  • $\pi_{\theta}$: The agent's policy, parameterized by $\theta$.

  • $y_t$: The textual chain of thought (CoT) generated by the agent at time step $t$.

  • $a_t$: The action generated by the agent at time step $t$.

  • $o_t = o_{\phi_t, \gamma_t}$: The current visual observation (a perspective image) at time $t$, defined by its azimuth $\phi_t$ and polar angle $\gamma_t$.

  • $x$: The initial natural language instruction for the task.

  • $\mathcal{H}_t = \{ (o_i, y_i, a_i) \}_{i=1}^{t-1}$: The history of past observations, chain-of-thought rationales, and actions up to time $t-1$.

    At each time step $t$, the agent generates a textual chain of thought $y_t$ to reason about the current situation and then produces an action $a_t$. The process allows for a sequence of rotation actions, culminating in a submission action. The action space consists of two primitives (a minimal state-update sketch follows the list):

  1. Rotate ($a_t^{rot} = (\Delta\phi, \Delta\gamma)$): This action adjusts the agent's viewing direction.

    • $\Delta\phi$: The change in azimuth (yaw angle). Positive values indicate rotation to the right, negative values to the left. Yaw is circular, meaning rotations wrap around 360 degrees.
    • $\Delta\gamma$: The change in polar angle (pitch angle). Positive values indicate looking up, negative values indicate looking down.
    • The viewing direction is updated as $\phi_{t+1} = \phi_t + \Delta\phi$ and $\gamma_{t+1} = \gamma_t + \Delta\gamma$.
  2. Submit ($a_t^{sub}$): This action commits the current viewing direction $(\hat{\phi}, \hat{\gamma})$ as the agent's final estimate and terminates the episode.
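
The two action primitives translate into a very small state-update rule. The following sketch uses an assumed pitch clamp and dictionary-style actions (both illustrative, not the paper's interface) to show how yaw wraps around 360° while pitch is bounded:

```python
def apply_action(phi, gamma, action, pitch_limit=80.0):
    """Minimal state update for the two HVS primitives. Angles in degrees.
    The pitch clamp is an illustrative assumption; the paper's exact bounds
    are not restated here."""
    if action["type"] == "rotate":
        phi = (phi + action["delta_phi"]) % 360.0                              # yaw is circular
        gamma = max(-pitch_limit, min(pitch_limit, gamma + action["delta_gamma"]))
        return phi, gamma, False                                               # episode continues
    elif action["type"] == "submit":
        return phi, gamma, True                                                # commit (phi_hat, gamma_hat)
    raise ValueError(f"unknown action type: {action['type']}")

# Example: start facing phi=0, gamma=0, turn 90 deg right and look up 15 deg, then submit.
phi, gamma, done = apply_action(0.0, 0.0, {"type": "rotate", "delta_phi": 90, "delta_gamma": 15})
phi, gamma, done = apply_action(phi, gamma, {"type": "submit"})
```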

    The image images/2.jpg (Figure 2 from the original paper) visually illustrates this two-stage post-training pipeline and the closed-loop perception-action cycle.

    Figure 2. This image illustrates the two stages of multi-turn reinforcement learning in a 360° panoramic environment. The left side represents the expert trajectory annotation phase, which involves pre-training a multimodal large language model. The right side shows the task execution process, including how to identify targets and acquire a new view, involving the history $H_t = [o_t, a_t, o_{t-1}, y_t]$.

4.2.3. MLLM Post-Training

MLLMs, initially trained on static, disembodied Internet data, often lack the spatial commonsense and active 3D planning capabilities necessary for humanoid visual search. To address this, the paper adapts MLLMs through a two-stage post-training pipeline:

4.2.3.1. Stage 1: Supervised Fine-Tuning (SFT)

The first stage involves Supervised Fine-Tuning (SFT). This step aims to instill basic task-oriented reasoning and tool-use abilities in the model. The SFT is performed on a curated multi-turn dataset (described in Section 4.3.2) that contains visual observations, verified Chain-of-Thought (CoT) rationales, and corresponding ground-truth actions. This process teaches the model to generate structured action plans from multimodal inputs, thereby establishing a strong behavioral prior.

The SFT objective function is the expected negative log-likelihood (also known as cross-entropy loss) over the dataset $\mathcal{D}^{SFT}$. This dataset consists of task inputs $x$ and labeled trajectories $\mathcal{H}_T$:

$ \min_{\theta} \ \mathbb{E}_{(x, \mathcal{H}_T) \sim \mathcal{D}^{SFT}} \left[ - \sum_{i=0}^{T-1} \log \pi_{\theta} (y_i, a_i \mid o_i, x, \mathcal{H}_i) \right] $

  • $\min_{\theta}$: The optimization objective is to find the model parameters $\theta$ that minimize the expression.
  • $\mathbb{E}_{(x, \mathcal{H}_T) \sim \mathcal{D}^{SFT}}$: Expectation taken over all task inputs $x$ and their corresponding labeled trajectories $\mathcal{H}_T$ sampled from the SFT dataset $\mathcal{D}^{SFT}$.
  • $\mathcal{H}_T$: A complete ground-truth trajectory, which is a sequence of observations, rationales, and actions from the start of an episode ($i=0$) to the end ($T-1$).
  • $- \sum_{i=0}^{T-1} \log \pi_{\theta} (y_i, a_i \mid o_i, x, \mathcal{H}_i)$: The negative log-likelihood of a given trajectory. It sums the negative logarithm of the probability assigned by the policy $\pi_{\theta}$ to the correct action $a_i$ and chain of thought $y_i$ at each step $i$, given the current observation $o_i$, instruction $x$, and historical context $\mathcal{H}_i$. Minimizing this value maximizes the likelihood of the model producing the expert-provided CoT and action sequence.
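
In practice, this objective reduces to a standard token-level cross-entropy in which only the CoT and action tokens are supervised. A minimal PyTorch sketch, assuming the usual causal-LM convention of masking non-target positions with -100 (an assumption for illustration, not a detail stated in the paper):

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels):
    """Negative log-likelihood over the supervised tokens of one multi-turn
    trajectory. logits: (seq_len, vocab); labels: (seq_len,) token ids with
    -100 at positions that are not CoT/action targets (observations, prompts)."""
    # Shift so position t predicts token t+1, as in standard causal-LM training.
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

# Toy example with random logits: only the last three tokens are supervised.
vocab, seq = 32, 8
logits = torch.randn(seq, vocab)
labels = torch.tensor([-100, -100, -100, -100, -100, 7, 3, 9])
loss = sft_loss(logits, labels)   # the quantity minimized over D^SFT above
```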

4.2.3.2. Stage 2: Multi-Turn Reinforcement Learning (RL)

Following SFT, the policy is further refined using Group Relative Policy Optimization (GRPO) [48]. This RL stage is crucial for encouraging long-horizon reasoning and developing robust, generalizable search strategies beyond what imitation learning (from SFT) alone can achieve [14].

For each task, the agent samples $G$ times to get a group of output trajectories (denoted $\omega_1, \omega_2, \ldots, \omega_G$). Each $\omega_i$ represents a complete output sequence for a single rollout, consisting of all generated tokens for chain of thought and actions: $\{ y_0, a_0, y_1, a_1, \dotsc, y_{T-1}, a_{T-1} \}$. The GRPO algorithm then calculates an advantage for each token $\omega_{i,t}$ within these sequences to update the model parameters.

The GRPO objective function is given as:

$ \mathcal{I}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(s_o, x, y) \sim \mathcal{D}^{RL},\ \{\omega_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\Omega \mid s_o, x)} \Bigg[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|\omega_i|} \sum_{t=1}^{|\omega_i|} \Big\{ \min \Big[ \frac{\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})} \hat{A}_{i,t},\ \mathrm{clip}\Big(\frac{\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}, 1-\epsilon, 1+\epsilon\Big) \hat{A}_{i,t} \Big] - \beta\, \mathbb{KL}\big(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}\big) \Big\} \Bigg] $

  • $\mathcal{I}_{\mathrm{GRPO}}(\theta)$: The GRPO objective function to be maximized with respect to the current policy parameters $\theta$.

  • $\mathbb{E}_{(s_o, x, y) \sim \mathcal{D}^{RL},\ \{\omega_i\}_{i=1}^G \sim \pi_{\theta_{\mathrm{old}}}(\Omega \mid s_o, x)}$: Expectation taken over initial states $s_o$, instructions $x$, and CoT $y$ from the RL dataset $\mathcal{D}^{RL}$, and over a group of $G$ trajectories $\{\omega_i\}_{i=1}^G$ sampled using the old policy $\pi_{\theta_{\mathrm{old}}}$.

  • $G$: The number of trajectories sampled in a group.

  • $\frac{1}{G} \sum_{i=1}^G \frac{1}{|\omega_i|} \sum_{t=1}^{|\omega_i|}$: Averages the objective over the $G$ trajectories and over all tokens $t$ within each trajectory $\omega_i$.

  • $\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})$: The probability assigned by the current policy $\pi_{\theta}$ to the token $\omega_{i,t}$ at step $t$ in trajectory $i$, given the initial state $s_o$, instruction $x$, and preceding tokens $\omega_{i,<t}$.

  • $\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})$: The probability assigned by the old policy (the policy before the current update) to the same token. The ratio of these two probabilities is a core component of Proximal Policy Optimization (PPO)-like algorithms and keeps updates stable.

  • $\hat{A}_{i,t}$: The relative advantage for the token $\omega_{i,t}$ in trajectory $i$. This term indicates how much better (or worse) generating this token was compared to the group average.

  • $\mathrm{clip}(\cdot, 1 - \epsilon, 1 + \epsilon)$: A clipping function, typically used in PPO to limit the policy ratio, preventing excessively large policy updates that could destabilize training. $\epsilon$ is a hyperparameter (e.g., 0.2).

  • $\beta\, \mathbb{KL}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$: A KL divergence penalty, where $\mathbb{KL}$ measures the difference between the current policy $\pi_{\theta}$ and a reference policy $\pi_{\mathrm{ref}}$ (often the SFT model or a previous version of the RL policy). This term discourages the RL policy from drifting too far from the SFT base, preserving instruction-following capabilities and preventing catastrophic forgetting. $\beta$ is a coefficient controlling the strength of this penalty.

    The relative advantage $\hat{A}_{i,t}$ is calculated as:

$ \hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(r)}{\mathrm{std}(r)} $

  • $r_i$: The total reward obtained for trajectory $i$.

  • $\mathrm{mean}(r)$: The average reward across all $G$ trajectories in the current group.

  • $\mathrm{std}(r)$: The standard deviation of rewards across all $G$ trajectories.

    This relative advantage normalizes the reward, indicating whether a particular trajectory's reward is above or below the average for the current batch, allowing the RL algorithm to learn from relative performance within a group.
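
A compact PyTorch sketch of the two GRPO ingredients described above: group-relative advantage normalization and the clipped token-level surrogate. The KL penalty is omitted and $\epsilon = 0.2$ is used for brevity; this is an illustrative reconstruction, not the authors' training code.

```python
import torch

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each trajectory's scalar reward by the
    group mean and std (the A_hat formula above). rewards: (G,) tensor, one per rollout."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

def grpo_token_loss(logp_new, logp_old, advantage, eps=0.2):
    """Clipped surrogate for the tokens of one rollout, which all share that rollout's
    advantage. The KL penalty against a reference policy is omitted for brevity."""
    ratio = torch.exp(logp_new - logp_old)                 # pi_theta / pi_theta_old
    unclipped = ratio * advantage
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantage
    return -torch.minimum(unclipped, clipped).mean()       # maximize objective = minimize negative

# Toy example: a group of G=4 rollouts with rule-based rewards, then the loss of one rollout.
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
adv = grpo_advantages(rewards)
logp_new = torch.tensor([-1.2, -0.8, -2.0])                # current-policy log-probs of 3 tokens
logp_old = torch.tensor([-1.0, -0.9, -2.1])                # behaviour-policy log-probs
loss = grpo_token_loss(logp_new, logp_old, adv[0])
```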

5. Experimental Setup

5.1. Datasets

The paper introduces H*Bench, a new benchmark designed to systematically evaluate humanoid visual search in rich, dense, and visually cluttered real-world environments.

5.1.1. Dataset Overview

  • Scale: Approximately 3,000 annotated task instances. By initializing the agent with four distinct starting orientations per task, a total of 12,000 search episodes are generated.

  • Resolution: High-resolution panoramic videos (up to $7680 \times 3840$ pixels).

  • Source: Sourced from both self-collected footage across global metropolitan areas (New York, Paris, Amsterdam, Frankfurt) and open platforms (YouTube and the 360+X dataset [11]).

  • Geographical Coverage: Broad, spanning 13 cities across 12 countries and 4 continents (see images/8.jpg, Figure I). This diversity contributes to a wide range of architectural styles, languages/scripts on signage, and environmental conditions.

  • Scene Diversity: Systematically organized into 6 major scene categories and 18 fine-grained scene types, covering a wide spectrum of challenging scenarios (see images/3.jpg, Figure 3 Top). These include:

    • Transportation hubs (airports, subway stations)
    • Large-scale retail spaces (supermarkets, shopping malls)
    • Public institutions (libraries, museums)
    • Urban streets
  • Task Type Diversity: H*Bench supports both Humanoid Object Search (HOS) and Humanoid Path Search (HPS). The diversity of HOS target objects is summarized in images/3.jpg, Figure 3 Bottom Left.

    Figure I. H*Bench aggregates panoramic videos from diverse global locations, featuring visually cluttered environments. Each location is shown with prompts indicating possible activities, such as shopping in a supermarket or purchasing tickets, highlighting the challenges of visual search in real-world scenarios.

    Figure 3. A data visualization chart displaying the distribution of scene categories and object instances. The upper part shows the distribution of the six scene categories, with large-scale retail spaces being predominant. The lower part lists the specific classes of object instances (e.g., signs, furniture, and electronics) and provides an analysis of the difficulty levels for humanoid object and path search.

5.1.2. Benchmark Construction

5.1.2.1. Task Annotation

  • Interface: Annotators use a perspective-view interface that renders narrow field-of-view (FoV) images from the panorama at known viewing angles $(\phi, \gamma)$.
  • Process: Annotators freely rotate the virtual camera to inspect the scene, identify a suitable embodied search task, write a natural-language instruction, and mark the target.
  • Target Marking: For HOS, the target is marked by drawing a tight bounding box that specifies its optimal direction. This bounding box is then back-projected onto the panorama, and its center yields the optimal target direction $(\phi^*, \gamma^*)$ (see the back-projection sketch after this list).
  • HPS Annotation: For HPS, only the azimuth $\phi^*$ is retained as the optimal direction, assuming the environment can be approximated by a planar ground geometry.
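
The back-projection step can be sketched as follows: the bounding-box center in the rendered view is lifted to a ray under a pinhole model, rotated from the camera frame into the panorama frame, and converted to angles. This is a minimal reconstruction under assumed conventions (positive $\gamma$ looks up), not the annotation tool's implementation.

```python
import numpy as np

def bbox_center_to_direction(u, v, cam_phi_deg, cam_gamma_deg, fov_deg=90.0, wh=(640, 480)):
    """Back-project the pixel center (u, v) of an annotated bounding box, drawn in a
    perspective view rendered at (cam_phi, cam_gamma), to the optimal panoramic
    direction (phi*, gamma*). A pinhole camera and planar image are assumed."""
    w, h = wh
    f = (w / 2.0) / np.tan(np.radians(fov_deg) / 2.0)
    ray = np.array([(u - w / 2.0) / f, (h / 2.0 - v) / f, 1.0])
    ray /= np.linalg.norm(ray)

    g, p = np.radians(cam_gamma_deg), np.radians(cam_phi_deg)
    rot_x = np.array([[1, 0, 0], [0, np.cos(g), np.sin(g)], [0, -np.sin(g), np.cos(g)]])
    rot_y = np.array([[np.cos(p), 0, np.sin(p)], [0, 1, 0], [-np.sin(p), 0, np.cos(p)]])
    world = rot_y @ rot_x @ ray                                    # camera frame -> panorama frame

    phi_star = np.degrees(np.arctan2(world[0], world[2]))          # azimuth
    gamma_star = np.degrees(np.arcsin(np.clip(world[1], -1, 1)))   # polar angle (elevation)
    return phi_star, gamma_star

# Example: a box centered slightly right of image center while the camera faces phi=30 deg.
print(bbox_center_to_direction(400, 240, cam_phi_deg=30, cam_gamma_deg=0))
```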

5.1.2.2. Cold-Start Data Curation (for SFT)

  • To create high-quality multi-turn trajectories for Supervised Fine-Tuning (SFT), a subset of annotated task instances is augmented with structured Chain-of-Thought (CoT) rationales.
  • GPT-4o Prompting: A strong MLLM (GPT-4o [41]) is prompted to produce a concise, observation-grounded rationale for each annotation step (given task instruction, current observation, and human-provided optimal action).
  • Human-in-the-Loop: Annotators review and refine the generated rationales to eliminate hallucinations, ensure grounding in visible scene evidence, and enforce stylistic consistency.
  • Dataset: The resulting SFT dataset consists of 2,000 multi-turn trajectories containing visual observations, verified CoT rationales, and actions.
  • Effort: Six annotators dedicated 250 hours to embodied question annotation and CoT refinement.
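
For intuition, one cold-start SFT record might look like the following; the field names and file paths are hypothetical illustrations, not the released dataset schema.

```python
# Hypothetical shape of one cold-start SFT record: a multi-turn trajectory of
# observations, human-verified CoT rationales, and ground-truth actions.
sft_record = {
    "task": "HOS",
    "instruction": "You want to buy instant noodles. Find them and look directly at them.",
    "turns": [
        {
            "observation": "frames/episode_0412/step_0.jpg",
            "rationale": "The aisle sign ahead lists snacks to the right; noodles "
                         "are likely further right, outside the current view.",
            "action": {"type": "rotate", "delta_phi": 45, "delta_gamma": 0},
        },
        {
            "observation": "frames/episode_0412/step_1.jpg",
            "rationale": "Instant noodle packages are now visible near the image center.",
            "action": {"type": "submit"},
        },
    ],
}
```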

5.1.2.3. Difficulty Taxonomy

  • HOS Difficulty: Defined by the initial visibility of the target object.

    • Visibility ratio $d$: Fraction of the object area visible in the initial viewpoint compared to the complete object area (a small computation sketch follows the figure captions below).

    • Categorization: Easy (high visibility), Medium (partially visible), Hard (low/invisible).

    • Visualizations: See images/9.jpg (Figure III).

      Figure III. Visualizations of HOS task instances: an illustration of HOS examples showing different scenarios of searching for specific objects in a shopping environment. Each scenario includes the target region, initial observation images, and descriptions of the difficulty levels. The formula $d = \frac{\mathrm{area}(P \cap T)}{\mathrm{area}(P)}$ is used in the figure to indicate task difficulty.

  • HPS Difficulty: Depends on two factors:

    • Whether the scene contains textual cues (e.g., signs).

    • Whether the visual or textual cues align with the actual path direction.

    • These factors jointly define four difficulty levels.

    • Visualizations: See images/14.jpg (Figure IV), images/20.jpg (Figure V), images/25.jpg (Figure VI), images/27.jpg (Figure VII).

      Figure IV. A 360° panoramic view and a humanoid agent in an urban setting, with the Cue and Motion directions highlighted by textual annotations, showing how cue and motion-direction information are combined in visual search tasks.

      Figure V. A schematic showing the misalignment of motion and cue directions under textual instruction. The left side shows a panoramic view, while the right side features two close-ups indicating the direction of motion and the direction of the cue, depicting the challenges humanoid agents face during visual search in complex environments.

      Figure VI. Visualizations of hard-level HPS task instances: the alignment of cue and motion directions without textual instruction. The panoramic view on the left displays the environment layout, while the right side shows examples of motion and cue directions, each highlighted in a different color.

      Figure VII. Visualizations of extreme-level HPS task instances: examples of an agent making path choices in complex environments across multiple scenarios. Each task illustrates the mismatch between the agent's movement direction and the visual cues, probing the difficulty of decision-making in action.
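
To make the HOS visibility ratio concrete, the sketch below follows the verbal definition above (visible fraction of the object) with axis-aligned boxes in panorama pixel coordinates; ignoring yaw wrap-around and the exact polygon geometry is a simplification for illustration.

```python
def visibility_ratio(obj_box, view_box):
    """Fraction of the target object that falls inside the initial field of view.
    Boxes are (left, top, right, bottom) in panorama pixel coordinates; axis-aligned
    boxes and no yaw wrap-around are simplifying assumptions."""
    ix = max(0.0, min(obj_box[2], view_box[2]) - max(obj_box[0], view_box[0]))
    iy = max(0.0, min(obj_box[3], view_box[3]) - max(obj_box[1], view_box[1]))
    obj_area = (obj_box[2] - obj_box[0]) * (obj_box[3] - obj_box[1])
    return (ix * iy) / obj_area if obj_area > 0 else 0.0

# Example: an object half inside the initial view -> d = 0.5 (partially visible).
d = visibility_ratio(obj_box=(100, 100, 200, 200), view_box=(150, 0, 600, 480))
```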

5.1.3. Train-Test Split

  • H*Bench (Evaluation): 1,000 instances (600 HOS and 400 HPS) were reserved as the H*Bench evaluation set, resulting in 4,000 evaluation episodes (due to 4 starting orientations per instance).
  • SFT Split: 250 instances from the HOS pool and 250 instances from the HPS pool were randomly sampled to construct the SFT dataset.
  • RL Training Split: All leftover instances from the initial ~3,000, after allocating for H*Bench and SFT, were used exclusively for RL training.

5.2. Evaluation Metrics

The primary evaluation metric is Success Rate (%).

5.2.1. Conceptual Definition

A trial (episode) is considered a success if the agent's final submitted viewing direction $(\hat{\phi}, \hat{\gamma})$ falls within a predefined tolerance region around the ground-truth optimal direction $(\phi^*, \gamma^*)$. This metric directly quantifies the agent's ability to accurately locate and orient toward the target object or path.

5.2.2. Mathematical Formula

The tolerance region is defined as: $ [\phi^* - \tau_\phi, \phi^* + \tau_\phi] \times [\gamma^* - \tau_\gamma, \gamma^* + \tau_\gamma] $. A submission $(\hat{\phi}, \hat{\gamma})$ is successful if $\hat{\phi}$ lies within $[\phi^* - \tau_\phi, \phi^* + \tau_\phi]$ AND $\hat{\gamma}$ lies within $[\gamma^* - \tau_\gamma, \gamma^* + \tau_\gamma]$. The tolerance parameters are computed as $ \tau_\phi = \max\left(\frac{w_\phi}{2}, \tau_{min,\phi}\right) $ and $ \tau_\gamma = \max\left(\frac{w_\gamma}{2}, \tau_{min,\gamma}\right) $, where $\tau_{min,\phi}$ and $\tau_{min,\gamma}$ denote constant minimum tolerances. (The paper overloads the symbols $\tau_\phi$ and $\tau_\gamma$ for both the final tolerance and the minimum constant; the subscripted $\tau_{min}$ notation is used here for clarity.)

5.2.3. Symbol Explanation

  • $(\hat{\phi}, \hat{\gamma})$: The azimuth and polar angle of the agent's final submitted viewing direction.
  • $(\phi^*, \gamma^*)$: The azimuth and polar angle of the ground-truth optimal direction (center of the annotated bounding box).
  • $\tau_\phi$: The tolerance angle for azimuth (yaw) deviation.
  • $\tau_\gamma$: The tolerance angle for polar angle (pitch) deviation.
  • $w_\phi$: The angular width of the annotated bounding box.
  • $w_\gamma$: The angular height of the annotated bounding box.
  • $\max(\cdot, \cdot)$: A function that returns the larger of its two inputs, ensuring a minimum tolerance even for very small bounding boxes.

Specific Tolerances Used:

  • For HOS tasks: $\tau_\phi = 30^\circ$, $\tau_\gamma = 20^\circ$. These values are chosen to mimic human foveation (the area of sharpest vision). Both $\hat{\phi}$ and $\hat{\gamma}$ are evaluated.
  • For HPS tasks: $\tau_\phi = 10^\circ$. Only $\hat{\phi}$ is assessed, as path search primarily requires a precise motion direction on a planar ground (a minimal success check is sketched below).
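
A minimal success check implementing these tolerances might look as follows; taking azimuth differences on the circle is an assumption consistent with circular yaw, and the helper name is illustrative.

```python
def is_success(phi_hat, gamma_hat, phi_star, gamma_star, w_phi, w_gamma,
               task="HOS", tau_min_phi=30.0, tau_min_gamma=20.0):
    """Success check following Sec. 5.2: the submitted direction must fall within the
    tolerance box around the ground truth. For HPS only azimuth is scored, with a
    fixed 10-degree tolerance."""
    tau_phi = max(w_phi / 2.0, tau_min_phi)
    tau_gamma = max(w_gamma / 2.0, tau_min_gamma)

    d_phi = abs((phi_hat - phi_star + 180.0) % 360.0 - 180.0)   # circular yaw distance
    if task == "HPS":
        return d_phi <= 10.0
    return d_phi <= tau_phi and abs(gamma_hat - gamma_star) <= tau_gamma

# Example: HOS submission 25 deg off in yaw and 10 deg off in pitch with a small box -> success.
print(is_success(100.0, 5.0, 125.0, -5.0, w_phi=12.0, w_gamma=8.0))
```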

5.3. Baselines

The paper evaluates several Multimodal Large Language Models (MLLMs) from both open-source and proprietary categories, comparing them against their own fine-tuned models.

5.3.1. Open-Weight Multi-Image Models

These models are generally accessible for research and development.

  • InternVL3.5-4B [13]
  • InternVL3.5-8B [13]
  • Qwen2.5-VL-3B-Instruct [61] (This is the base model for their own fine-tuned HVS-3B model)
  • Qwen2.5-VL-7B-Instruct [61]
  • Gemma-3-4B-it [19]
  • Gemma-3-12B-it [19]
  • Kimi-VL-A3B-Instruct [53]

5.3.2. Proprietary Models

These are state-of-the-art models developed by major AI companies, known for their high performance.

  • GPT-4o [41] (OpenAI)
  • Gemini2.5-Pro [46] (Google DeepMind)

5.3.3. Fine-Tuned Models (Ours)

These represent the models developed and evaluated by the authors.

  • HVS-3B (w/ SFT only): This is the Qwen2.5-VL-3B-Instruct model after undergoing only the Supervised Fine-Tuning (SFT) stage.

  • HVS-3B: This is the Qwen2.5-VL-3B-Instruct model after completing both the SFT and Reinforcement Learning (RL) stages.

    These baselines were chosen to represent a broad spectrum of current MLLM capabilities, from smaller open-source models to powerful proprietary ones, allowing for a comprehensive assessment of the challenges and potential of humanoid visual search.

6. Results & Analysis

6.1. Core Results Analysis

The main experimental results, as presented in Table 1, reveal a significant performance gap and the effectiveness of the proposed post-training framework.

The following are the results from Table 1 of the original paper:

Humanoid Object Search Humanoid Path Search
Method Overall Easy Medium Hard Overall Easy Medium Hard Extreme
Open-Weight Multi Image Models
InternVL3.5-4B [13] 3.08 7.32 2.84 1.49 4.81 6.00 5.70 4.67 0.46
InternVL3.5-8B [13] 6.38 9.76 9.10 4.79 7.25 10.00 7.68 5.14 4.17
Qwen2.5-VL-3B-Instruct [61] 14.83 27.97 13.07 10.01 6.44 7.00 8.77 4.91 3.24
Qwen2.5-VL-7B-Instruct [61] 11.38 23.42 9.10 7.02 6.31 9.00 5.92 5.84 1.85
Gemma-3-4B-it [19] 17.13 32.85 26.14 10.13 14.44 17.20 14.47 14.72 7.41
Gemma-3-12B-it [19] 10.21 24.72 17.33 3.88 14.50 16.80 14.25 14.49 9.72
Kimi-VL-A3B-Instruct [53] 4.92 12.85 0.57 2.36 4.32 8.79 3.32 2.21 4.17
Proprietary Models
GPT-4o [41] 19.75 18.17 17.35 20.92 23.69 26.80 22.59 26.17 13.89
Gemini2.5-Pro [46] 31.96 33.58 23.78 32.13 33.00 41.60 29.39 35.75 15.28
Fine-Tuned Models (Ours)
HVS-3B (w/ SFT only) 40.83 53.82 23.86 37.73 23.00 28.00 23.03 21.26 14.81
HVS-3B 47.38 60.49 24.43 44.87 24.94 34.80 20.18 25.00 12.04

6.1.1. Performance Gap Between Models

  • Proprietary Models vs. Open-Weight Models: There's a substantial performance gap. Gemini2.5-Pro stands out as the strongest baseline, achieving 31.96% overall for HOS and 33.00% for HPS. GPT-4o also performs reasonably well (19.75% HOS, 23.69% HPS). In contrast, most open-weight models, even larger ones, achieve significantly lower success rates, often in single digits or low teens for HPS. Gemma-3-4B-it is the best among open-weight models (17.13% HOS, 14.44% HPS).
  • Model Size Anomaly: Interestingly, larger model sizes do not consistently guarantee better performance. For the Gemma-3 and Qwen2.5-VL series, the smaller 4B/3B models often surpass their 12B/7B counterparts in HOS and perform comparably in HPS. This suggests that model scale alone is not sufficient for these embodied tasks without task-specific training.

6.1.2. Effectiveness of Post-Training

The authors' post-training framework (Supervised Fine-Tuning + Reinforcement Learning) demonstrates significant improvements over the base Qwen2.5-VL-3B-Instruct model.

  • SFT Contribution: Supervised Fine-Tuning (SFT) alone contributes the majority of performance gains. For HOS, HVS-3B (w/ SFT only) jumps from 14.83% to 40.83% (an increase of 26.00%). For HPS, it improves from 6.44% to 23.00% (an increase of 16.56%). This indicates SFT is crucial for establishing fundamental task-oriented capabilities and tool-use.
  • RL Contribution: Subsequent Reinforcement Learning (RL) provides additional, albeit more modest, gains. HVS-3B (with both SFT and RL) further improves HOS from 40.83% to 47.38% (an increase of 6.55%) and HPS from 23.00% to 24.94% (an increase of 1.94%). This suggests RL acts as a refinement step for optimization.

6.1.3. Task-Dependent Efficacy

  • Object Search Superiority: For the relatively simpler object search, HVS-3B (47.38%) outperforms the state-of-the-art proprietary model Gemini2.5-Pro (31.96%). This indicates that post-training can be highly effective in improving visual grounding and exploration for HOS.
  • Path Search Challenges: For the more complex path search, HVS-3B (24.94%) still falls short of Gemini2.5-Pro (33.00%). This gap suggests that post-training has limitations in enhancing higher-order spatial reasoning capabilities required for HPS. The lower ceiling for HPS is attributed to the demand for sophisticated spatial commonsense.

6.1.4. Error Analysis

The paper provides an in-depth error analysis, highlighting challenges in MLLMs:

  • HOS Errors:

    1. Limited visual grounding capabilities: The agent struggles to reliably identify targets in cluttered environments (e.g., failing to distinguish a specific panda-patterned product, as shown in images/28.jpg and images/30.jpg).

    2. Perception-action gap: The agent might detect a target but fail to perform precise fine-grained foveation (e.g., not rotating enough to center the target, as shown in images/32.jpg and images/34.jpg).

      Figure VIII. Qualitative Examples of Limited Visual Grounding Capabilities in HOS: the interior shelves of a large retail space, lined on both sides with various canned foods, reflecting a crowded shopping environment and the complexity of visual search in real-world scenarios.

      Figure IX. Qualitative Examples of Perception-Action Gap in HOS: a panoramic view of a retail space showing a crowded store interior with a complex layout of customers and shelves, suitable for evaluating visual search in real-world environments.

  • HPS Errors:

    1. Vision-action mismatch: The model perceives visual cues (e.g., signs) but fails to translate them into correct physical actions (e.g., seeing an arrow but rotating the wrong way, as shown in images/5.jpg Left and images/37.jpg).

    2. Lack of physical commonsense: Actions violate 3D constraints (e.g., attempting to pass through walls, misjudging vertical connections, ignoring drop-offs, as shown in images/42.jpg).

    3. Lack of socio-spatial commonsense: The model misses implicit rules and norms of built environments (e.g., ignoring functions of stairs, police tape, crosswalks, or attempting to use an emergency exit as a routine path, as shown in images/5.jpg Left and images/45.jpg).

      Figure 5. An illustration of the interaction between human instructions and the MLLM when locating target positions. The left side shows the MLLM's failure to recognize airline signage, while the right side highlights its difficulty in choosing the route to the airport due to a lack of socio-spatial commonsense.

      Figure X. Qualitative Examples of Vision-Action Mismatch in HPS: an indoor scene showing a spacious public area with large windows and modern lighting, possibly within a transportation hub, retail space, or public institution.

      Figure XI. Qualitative Examples of Lack of Physical Commonsense in HPS: a subway station scene with a dense crowd and the platform structure, reflecting the complexity of urban transportation environments.

      Figure XII. Qualitative Examples of Lack of Socio-Spatial Commonsense in HPS: a crowded subway passage featuring turnstiles and the surrounding environment, highlighting the challenges of object and path search in the real world.

These findings collectively suggest that MLLMs can form linguistically grounded spatial models for passive world description, but struggle to develop physically grounded ones for embodied world interaction.

6.2. Ablation Studies / Parameter Analysis

6.2.1. On the Role and Limits of Post-Training

6.2.1.1. Effectiveness of SFT and RL

As previously noted in the main results, SFT is the primary driver of performance gains, establishing fundamental task capabilities. RL provides additional, albeit smaller, refinement. The paper observes that post-training specifically improves:

  1. Precise control over rotation angles: Allows for fine-tuned foveation of targets.
  2. Use of large-angle turns to explore new areas: Essential for efficient exploration in 360° environments.
  3. Capacity to act on directional signs: Better interpretation and translation of visual cues into actions. Case studies (Figures XIII-XV in Appendix) illustrate these improvements. The authors also found that applying RL directly without prior SFT degrades the model's instruction-following capability, underscoring the importance of SFT for establishing a strong behavioral prior.

6.2.1.2. Task-Dependent Efficacy

The benefits of post-training are task-dependent: significant gains for object search but more modest for path search. This suggests that post-training effectively enhances visual grounding and exploration (crucial for HOS), but struggles to impart physical, spatial, and social commonsense required for HPS.

6.2.1.3. Negative Impact of RL on Complex Tasks

For HPS, RL surprisingly reduces performance on medium difficulty (from 23.03% to 20.18%) and extreme difficulty (from 14.81% to 12.04%). These scenarios are characterized by a misalignment between visual cues and the optimal path, posing a significant challenge. The authors hypothesize this degradation may stem from reward hacking, where the model learns to exploit the reward signal rather than genuinely improving its reasoning capability. This highlights a key challenge in RL: designing reward functions that consistently align with true task objectives across all difficulty levels, especially for complex, implicit commonsense reasoning.

6.2.1.4. Key Takeaway

The disparate impact of SFT+RL on object versus path search leads to a crucial conclusion: post-training can improve visual grounding and exploration for object search, but struggles to impart the physical, spatial, and social commonsense needed for path search, as these are often implicit, situational, and procedural.

6.2.2. Ablation Study: Cross-Task Transfer

6.2.2.1. In-Task Superiority with an Exception

As shown in images/4.jpg (Figure 4), models generally perform best when trained on the specific task they are evaluated on. However, there's one exception: a model trained solely on object search achieves 37.8% on the easy HPS split, outperforming both the baseline (7.0%) and the dedicated in-task HPS model (33.8%). This is hypothesized to occur because easy HPS tasks often reduce to simple object searches where clear visual cues directly define the path. The powerful object-finding skills acquired during HOS training transfer effectively to these specific HPS scenarios.

Figure 4. Success rates (%) on HOS and HPS tasks across difficulty levels; the x-axis shows the difficulty level and the y-axis the success rate, with multiple curves illustrating the performance differences of the models across tasks.

6.2.2.2. Cross-Task Generalization

The study observes a clear bidirectional synergy between HOS and HPS:

  • Training on object search boosts path search performance from 6.4% to 20.7%.
  • Training on path search elevates object search from 14.8% to 29.5%.

This synergy occurs because skills like active exploration and path reasoning acquired from HPS training directly benefit HOS, while the visual grounding honed in HOS reciprocally aids HPS.

6.2.2.3. Mixed-Data Training

Training on a mixed object and path search dataset yields the best overall performance. However, this comes with a challenge: performance gains are unevenly distributed, meaning improvements on certain splits might lead to reduced performance on others. Balancing this trade-off is critical for developing generalist humanoid agents.

6.2.3. Ablation Study: Reward Shaping

The authors ablate different reward functions for path search to understand their impact.

The following are the results from Table 2 of the original paper:

Humanoid Path Search (success rate, %):

| Method | Overall | Easy | Medium | Hard | Extreme |
| --- | --- | --- | --- | --- | --- |
| SFT (baseline) | 23.44 | 26.00 | 24.56 | 24.77 | 12.50 |
| GRPO on HPS: form+corr | 22.38 | 33.80 | 17.32 | 21.73 | 7.87 |
| GRPO on HPS: form+corr+dist | 21.37 | 34.40 | 15.13 | 20.09 | 6.94 |
| GRPO on HPS: form+dist | 21.31 | 29.80 | 17.54 | 20.56 | 11.11 |

The paper tests three types of reward shaping for HPS:

  1. format + correctness (form+corr): This reward combines a component for correct output format and another for successful task completion.

  2. format + correctness + distance-to-goal (form+corr+dist): Adds a term that rewards the agent for getting closer to the target direction.

  3. format + distance-to-goal (form+dist): Combines correct format with a distance-to-goal reward, without an explicit correctness reward.

All reward shaping variants improve performance only on the easy split, often degrading performance on the harder levels (medium, hard, extreme). This underscores the inherent difficulty of path search and indicates the need for more advanced learning algorithms and reward designs that go beyond simple distance metrics.

6.2.3.1. Reward Functions

The rule-based reward function used to calculate the reward for a trajectory is:

$ r = r_{\mathrm{corr}} + r_{\mathrm{form}} $

where

$ r_{\mathrm{corr}} = \begin{cases} 0.5, & \text{if the submitted answer satisfies the completion condition}, \\ 0, & \text{otherwise}. \end{cases} $

  • $r_{\mathrm{corr}}$: The correctness reward. It is 0.5 if the final submitted action successfully completes the task (e.g., the target is within tolerance), and 0 otherwise.

$ r_{\mathrm{form}} = \begin{cases} 0.5, & \text{if the response follows the required think/answer tag format}, \\ 0, & \text{otherwise}. \end{cases} $

  • $r_{\mathrm{form}}$: The format reward. It is 0.5 if the agent's output adheres to the expected <think>...</think><answer>...</answer> format, and 0 otherwise. This encourages instruction-following in terms of output structure.

    Additionally, a distance-to-goal reward ($r_{\mathrm{dist}}$) is added specifically for HPS. It is calculated from the distance of the final viewing direction to the target bounding box:

$ r_{\mathrm{dist}} = \frac{(\pi - d(\phi_T, \phi^*)) + (\pi - d(\gamma_T, \gamma^*))}{2\pi} $

  • $r_{\mathrm{dist}}$: The distance-to-goal reward, which encourages the agent to end up closer to the target direction. Subtracting each angular distance from $\pi$ and dividing by $2\pi$ makes the reward larger as the final viewing direction approaches the target.

  • $\phi_T$: The agent's azimuth at the final step $T$.

  • $\gamma_T$: The agent's polar angle at the final step $T$.

  • $\phi^*$: The ground-truth optimal azimuth.

  • $\gamma^*$: The ground-truth optimal polar angle.

  • $d(\alpha, \alpha^*)$: The angular distance to the target bounding box for a given angle $\alpha$ and target angle $\alpha^*$, calculated by:

$ d(\alpha, \alpha^*) = |\alpha - (\alpha^* - \tau_\alpha)| + |\alpha - (\alpha^* + \tau_\alpha)| $

  • When $\alpha$ lies within the tolerance region $[\alpha^* - \tau_\alpha, \alpha^* + \tau_\alpha]$, this function stays at a constant minimum value of $2\tau_\alpha$, effectively indicating that the target is "hit"; outside this region, it increases.

  • $\alpha$: The current angle (either $\phi_T$ or $\gamma_T$).

  • $\alpha^*$: The target angle (either $\phi^*$ or $\gamma^*$).

  • $\tau_\alpha$: The tolerance for the angle (see the code sketch below for how these reward terms combine).
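The following is a minimal Python sketch of how these rule-based terms could be composed; the tolerance values, the simple tag check, and all function names are illustrative assumptions rather than the paper's implementation.

```python
import math

TAU_AZIMUTH = math.radians(15.0)  # hypothetical tolerances; the paper's exact
TAU_POLAR = math.radians(15.0)    # values are not specified here

def bbox_distance(angle: float, target: float, tol: float) -> float:
    """d(alpha, alpha*): constant (2*tol) inside the tolerance region, growing outside."""
    return abs(angle - (target - tol)) + abs(angle - (target + tol))

def format_reward(response: str) -> float:
    """r_form: 0.5 if the response contains the think/answer tag structure."""
    tags = ("<think>", "</think>", "<answer>", "</answer>")
    return 0.5 if all(t in response for t in tags) else 0.0

def correctness_reward(task_completed: bool) -> float:
    """r_corr: 0.5 if the submitted answer satisfies the completion condition."""
    return 0.5 if task_completed else 0.0

def distance_to_goal_reward(phi_T: float, gamma_T: float,
                            phi_star: float, gamma_star: float) -> float:
    """r_dist: larger as the final azimuth/polar angles approach the target box."""
    d_phi = bbox_distance(phi_T, phi_star, TAU_AZIMUTH)
    d_gamma = bbox_distance(gamma_T, gamma_star, TAU_POLAR)
    return ((math.pi - d_phi) + (math.pi - d_gamma)) / (2.0 * math.pi)

def trajectory_reward(response: str, task_completed: bool,
                      phi_T: float = 0.0, gamma_T: float = 0.0,
                      phi_star: float = 0.0, gamma_star: float = 0.0,
                      use_corr: bool = True, use_dist: bool = False) -> float:
    """Compose the rule-based reward from the individual terms."""
    r = format_reward(response)
    if use_corr:
        r += correctness_reward(task_completed)
    if use_dist:
        r += distance_to_goal_reward(phi_T, gamma_T, phi_star, gamma_star)
    return r
```

Under this composition, form+corr corresponds to `use_corr=True, use_dist=False`; form+corr+dist sets both to `True`; and form+dist uses `use_corr=False, use_dist=True`, matching the variants ablated in Table 2.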

6.2.4. Ablation Study: Training Rollout and Context Length

As shown in images/6.jpg (Figure 6):

  • Training Rollout (Left): Models trained with shorter GRPO rollouts (e.g., up to 5 turns) can achieve satisfactory performance through test-time scaling and match the performance of models trained with longer rollouts (10 turns). This indicates training efficiency can be achieved without sacrificing final performance.

  • Context Length (Right): A short context length of 2 rounds (meaning the model considers the current observation and the immediately preceding dialogue turn) is sufficient for HVS. Longer context lengths do not significantly improve performance, suggesting that HVS tasks, while multi-turn, do not require remembering an extensive dialogue history; a minimal sketch of such context truncation is shown below.
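As a rough illustration of this 2-round window, the dialogue history can be truncated with a simple sliding window before each inference call; the message layout assumed below (one user observation and one assistant reply per round) is hypothetical.

```python
def truncate_context(messages: list[dict], system_prompt: dict, rounds: int = 2) -> list[dict]:
    """Keep the system prompt plus only the last `rounds` observation/response pairs."""
    tail = messages[-2 * rounds:] if rounds > 0 else []
    return [system_prompt] + tail

# With rounds=2, only the two most recent (user, assistant) pairs enter the prompt.
```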

Figure 6. Left: Cumulative success rate by step (for HOS and HPS) before and after RL; t indicates the maximum turn limit in RL training. Right: Impact of test-time context length on success rate for both tasks.

6.2.5. Ablation Study: Active vs. Passive Visual Search

The paper compares the active visual search paradigm (where an agent with a perspective view rotates to gather information) against passive analysis of the complete panorama. As shown in images/7.jpg Left (Figure 7 Left):

  • The active paradigm is superior.
  • Reasons for Superiority:
    1. It mirrors efficient, human-like search strategies that coordinate head and eye movements.
    2. It avoids panoramic distortions that can conflict with MLLM training priors (which are often based on standard perspective images).
  • Empirical Validation: Using Gemma-3-4B-it, the passive approach leads to degraded performance. This aligns the work with research on active vision [26, 62]. A sketch of the perspective-view rendering that underlies the active paradigm is given below.
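For context, the active paradigm relies on rendering a standard perspective (pinhole) view from the 360° panorama at the agent's current head direction. The sketch below shows one common way to do this for an equirectangular panorama using NumPy and OpenCV; the projection conventions, field of view, and output resolution are assumptions and may differ from the paper's implementation.

```python
import numpy as np
import cv2  # used only for the final remapping step


def pano_to_perspective(pano, yaw_deg, pitch_deg, fov_deg=90.0, out_hw=(480, 640)):
    """Render a pinhole (perspective) view from an equirectangular 360° panorama.

    pano: H x W x 3 equirectangular image; yaw_deg/pitch_deg give the head direction.
    """
    H, W = pano.shape[:2]
    h, w = out_hw
    f = (w / 2.0) / np.tan(np.deg2rad(fov_deg) / 2.0)  # focal length in pixels

    # Ray directions of the virtual pinhole camera, centered at the principal point.
    u, v = np.meshgrid(np.arange(w) - w / 2.0, np.arange(h) - h / 2.0)
    dirs = np.stack([u, v, np.full_like(u, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (about the x-axis), then yaw (about the y-axis).
    pitch, yaw = np.deg2rad(pitch_deg), np.deg2rad(yaw_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch), np.cos(pitch)]])
    Ry = np.array([[np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T

    # Convert ray directions to longitude/latitude, then to panorama pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])              # [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))         # [-pi/2, pi/2]
    map_x = (((lon / (2 * np.pi) + 0.5) * W) % W).astype(np.float32)
    map_y = ((lat / np.pi + 0.5) * H).astype(np.float32)

    return cv2.remap(pano, map_x, map_y, cv2.INTER_LINEAR)
```

Because each rendered view is an ordinary perspective image, it matches the priors of MLLMs trained on standard photos, which is part of why the active paradigm avoids the panoramic-distortion issue noted above.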

6.2.6. Ablation Study: Embodied vs. Disembodied Bench

This critical comparison, illustrated in images/7.jpg Right (Figure 7 Right), highlights the distinct challenges of embodied AI.

  • Performance on V*Bench (Disembodied 2D): Traditional 2D methods like Mini-o3 [29] and Chain-of-Focus [67] achieve near-saturation performance (88.2% and 88.0%, respectively) on the disembodied V*Bench, a benchmark for visual search within a static 2D image. This indicates that visual search within a static 2D image is largely "solved" for MLLMs.

  • Performance on H*Bench (Embodied 3D): However, the performance of these same methods plummets dramatically on H*Bench, with success rates dropping to a mere 2.5% and 11.6%.

  • Contrast and Conclusion: This stark contrast demonstrates that capabilities learned from passive Internet data do not transfer effectively to embodied active interaction in 3D. The HVS-3B model itself achieves only 38.4% success on H*Bench, indicating that HVS remains a wide-open research problem.

  • Unified Model Potential: Notably, HVS-3B maintains a satisfactory 65.5% success rate on V*Bench. This suggests that the model learns 3D embodied search without substantially compromising its 2D visual search ability, indicating a promising path toward a unified model capable of operating in both physical and digital realms.

Figure 7. Left: Comparison of active and passive visual search (success rates under HOS and HPS). Right: Comparison of different visual search paradigms, including results on H*Bench and V*Bench.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper introduces humanoid visual search (HVS), a novel task that enables MLLM agents to perform active spatial reasoning in 360° immersive environments, mimicking human cephalomotor and oculomotor control. By leveraging real-world 360° panoramas as lightweight, hardware-free simulators, the authors present H*Bench, a systematic benchmark featuring diverse and challenging in-the-wild scenes.

The experimental results highlight that even top-tier proprietary MLLMs struggle with HVS, achieving only approximately 30% success. However, the proposed post-training pipeline (Supervised Fine-Tuning followed by Reinforcement Learning) significantly enhances an open-source model (Qwen2.5-VL-3B), improving object search success by over threefold (to 47.38%) and path search success (to 24.94%). A critical finding is the inherent difficulty of path search, which demands sophisticated spatial commonsense and reveals fundamental limitations in MLLMs' higher-level reasoning capabilities. While post-training effectively boosts low-level perceptual-motor skills like visual grounding and exploration, it struggles with implicit, situational, and procedural commonsense required for complex path search, with RL even showing detrimental effects in some challenging scenarios due to potential reward hacking. The work also demonstrates the superiority of active visual search over passive methods and underscores the significant gap between disembodied 2D visual search and embodied 3D interaction.

7.2. Limitations & Future Work

The authors acknowledge several limitations and propose future research directions:

  • Reward Function Design: The current rule-based reward functions (especially for RL) are insufficient for complex tasks like path search, often leading to reward hacking and performance degradation in nuanced scenarios. Future work should focus on designing more robust and aligned reward functions.
  • Vision Tokenizers: More efficient vision tokenizers are needed to better process and represent visual information for MLLMs, particularly in high-resolution, complex 360° scenes.
  • Pre-training for Spatial World Knowledge: Current MLLMs are pre-trained on disembodied Internet data. Developing pre-training methods that instill action-oriented spatial world knowledge will be crucial for improving embodied reasoning.
  • Balancing Performance Across Task Difficulties: Achieving consistent performance gains across all difficulty levels, especially for path search, remains a challenge. Future efforts should aim to balance these trade-offs.
  • Scaling Embodied Search Data: Scaling up the collection of diverse and densely annotated embodied search data is essential to fully unlock visual-spatial reasoning capabilities in the wild.

7.3. Personal Insights & Critique

This paper presents a highly valuable contribution to the field of embodied AI and MLLMs.

  • Innovation in Simulation: The most significant innovation, in my opinion, is the ingenious use of 360° panoramas as a lightweight, hardware-free simulator. This approach effectively bridges the gap between disembodied 2D tasks and expensive 3D simulations or real-world robotics. It allows for scalable data collection and experimentation in complex in-the-wild environments, which is a major bottleneck for embodied AI research. This method could potentially be transferred to other embodied tasks that require a wide field of view but can abstract away full 3D physics, such as high-level planning for drone navigation or detailed inspection tasks.
  • Quantifying the Commonsense Gap: The paper rigorously quantifies the commonsense reasoning gap in MLLMs. The stark performance drop from 2D benchmarks to H*Bench, and the difficulty disparity between object search and path search, clearly illustrate that MLLMs, despite their linguistic prowess, profoundly lack the physical, spatial, and social commonsense required for real-world interaction. This is a crucial finding that should guide future research, emphasizing that simply scaling models or fine-tuning on current datasets might not be enough.
  • Critique on Reward Hacking: The observation of reward hacking in RL for complex HPS tasks is a critical insight. It underscores a fundamental challenge in RL: aligning sparse or imperfect reward signals with true task objectives, especially when those objectives involve implicit human-like commonsense. This suggests that future reward design might need to move beyond simple rule-based metrics to incorporate more nuanced human feedback (e.g., RLHF-V methods [63]) or curiosity-driven exploration [51] that explicitly rewards learning useful commonsense.
  • Potential for Unified Models: The finding that HVS-3B maintains strong performance on V*Bench while also learning 3D embodied search is highly promising. It suggests that specialized embodied training doesn't necessarily come at the cost of general 2D visual understanding, paving the way for unified models that can operate seamlessly in both digital and physical realms.
  • Future Directions: Future work could explore hybrid architectures that explicitly integrate 3D geometric reasoning modules with MLLMs, or novel pre-training strategies that expose models to diverse embodied experiences and interaction data rather than just passive observation. Furthermore, instead of just head rotation, the framework could be extended to include whole-body movements or manipulation actions to create more comprehensive embodied agents.
