Thinking in 360°: Humanoid Visual Search in the Wild
TL;DR Summary
The study introduces humanoid visual search (HVS), where agents use head movements to explore immersive 360° images. The new benchmark H*Bench emphasizes advanced visual-spatial reasoning in challenging in-the-wild scenes. Experiments reveal low success rates (~30%) for top proprietary models, though post-training more than triples the success rate of the open-source Qwen2.5-VL on both object and path search.
Abstract
Humans rely on the synergistic control of head (cephalomotor) and eye (oculomotor) to efficiently search for visual information in 360°. However, prior approaches to visual search are limited to a static image, neglecting the physical embodiment and its interaction with the 3D world. How can we develop embodied visual search agents as efficient as humans while bypassing the constraints imposed by real-world hardware? To this end, we propose humanoid visual search where a humanoid agent actively rotates its head to search for objects or paths in an immersive world represented by a 360° panoramic image. To study visual search in visually-crowded real-world scenarios, we build H* Bench, a new benchmark that moves beyond household scenes to challenging in-the-wild scenes that necessitate advanced visual-spatial reasoning capabilities, such as transportation hubs, large-scale retail spaces, urban streets, and public institutions. Our experiments first reveal that even top-tier proprietary models falter, achieving only ~30% success in object and path search. We then use post-training techniques to enhance the open-source Qwen2.5-VL, increasing its success rate by over threefold for both object search (14.83% to 47.38%) and path search (6.44% to 24.94%). Notably, the lower ceiling of path search reveals its inherent difficulty, which we attribute to the demand for sophisticated spatial commonsense. Our results not only show a promising path forward but also quantify the immense challenge that remains in building MLLM agents that can be seamlessly integrated into everyday human life.
Mind Map
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Thinking in 360°: Humanoid Visual Search in the Wild
1.2. Authors
Heyang Yu, Yinan Han, Xiangyu Zhang, Baiqiao Yin, Bowen Chang, Xiangyu Han, Xinhao Liu, Jing Zhang, Marco Pavone, Chen Feng, Saining Xie, Yiming Li
Their affiliations include NYU, NVIDIA, TU Darmstadt, UC Berkeley, and Stanford University. The authors represent a collaborative effort from both academia and industry in the fields of artificial intelligence, robotics, and computer vision.
1.3. Journal/Conference
This paper is published as a preprint on arXiv. While not yet peer-reviewed and published in a formal journal or conference proceedings at the time of this analysis, arXiv is a highly influential platform for rapid dissemination of research in computer science, physics, mathematics, and related fields. Papers on arXiv often represent cutting-edge work that is in the process of peer review or has been accepted to top-tier venues.
1.4. Publication Year
2025
1.5. Abstract
Humans effectively use a combination of head (cephalomotor) and eye (oculomotor) movements for visual search in a 360° environment. Current visual search methods, however, are typically confined to static images, overlooking the physical embodiment and real-world interaction. This paper proposes humanoid visual search (HVS), where a humanoid agent actively rotates its head to locate objects or paths within an immersive 360° panoramic image. To study this in complex, visually-crowded real-world scenarios, the authors introduce H*Bench, a new benchmark focusing on challenging in-the-wild scenes like transportation hubs, large retail spaces, urban streets, and public institutions, moving beyond traditional household settings. Experiments reveal that even top-tier proprietary models achieve only approximately 30% success in object and path search. The authors then enhance the open-source Qwen2.5-VL model using post-training techniques, which boosts its success rate more than threefold for both object search (HOS, from 14.83% to 47.38%) and path search (HPS, from 6.44% to 24.94%). The lower success ceiling for path search indicates its greater difficulty, attributed to the need for sophisticated spatial commonsense. The results suggest a promising direction for developing Multimodal Large Language Model (MLLM) agents that can integrate into human daily life, while also quantifying the substantial challenges that remain.
1.6. Original Source Link
https://arxiv.org/abs/2511.20351v1 The paper is available as a preprint on arXiv.
1.7. PDF Link
https://arxiv.org/pdf/2511.20351v1.pdf
2. Executive Summary
2.1. Background & Motivation
2.1.1. Core Problem
The core problem the paper addresses is the limitation of existing visual search methods in accurately simulating human-like visual exploration within dynamic, 3D environments. Current state-of-the-art computational methods, primarily based on Multimodal Large Language Models (MLLMs), typically operate on a single, static 2D image. This approach suffers from two fundamental gaps compared to biological visual search:
- Non-interactive: Models cannot change their perspective or acquire information beyond their initial field of view, limiting their ability to explore.
- Disembodied: Models lack physical embodiment, meaning they cannot couple visual reasoning with actions in the physical world. The search often becomes an abstract perceptual exercise rather than a goal-directed behavior.
2.1.2. Importance and Challenges
This problem is crucial because developing embodied visual agents that can actively search for information in visually crowded scenes has significant potential in various real-world applications, including:
- Humanoid robots: Enabling robots to efficiently find objects or navigate in complex environments.
- Assistive technology: Developing intelligent systems to help humans with visual impairments or in challenging search tasks.
- Augmented reality: Creating more intuitive and interactive AR experiences.

The main challenges in prior research are:
- Limited perceptual realism: Existing embodied AI platforms often lack the visual fidelity of real-world scenes.
- Restriction to household scenes: Most benchmarks are confined to simpler, controlled household environments, failing to capture the structural (multi-level layouts), semantic (dense compositional cues), and volumetric (cluttered 3D space) complexities of in-the-wild human-made environments.
- Hardware constraints: Developing and testing embodied agents on real-world hardware or in highly realistic 3D simulators is expensive, difficult to scale, and hard to reproduce.
2.1.3. Innovative Idea
The paper's innovative idea is to prototype humanoid visual search (HVS). This approach allows humanoid agents to couple deliberate reasoning with active head turns for visual search in complex environments. A key enabler is a scalable paradigm where a single 360° panorama closes the perception-action cycle, effectively serving as a lightweight, hardware-free simulator. This bypasses the constraints of real-world hardware and expensive 3D simulators, making it tractable to study embodied visual search in diverse, challenging in-the-wild scenes.
2.2. Main Contributions / Findings
2.2.1. Primary Contributions
The paper makes three primary contributions:
- Introduces Humanoid Visual Search (HVS): A novel task that enables human-like active spatial reasoning in 360° environments, bridging the gap between passive visual reasoning and active embodied interaction. This includes two core forms: humanoid object search (HOS) and humanoid path search (HPS).
- Proposes a Scalable Framework and H*Bench: A new benchmark, H*Bench, is introduced, which leverages real-world 360° panoramas as lightweight simulators. This creates a hardware-free platform for studying embodied reasoning in in-the-wild environments (e.g., transportation hubs, large retail spaces, public institutions). It features dense annotations for embodied task questions and ground-truth actions.
- Conducts Thorough Evaluations and Demonstrates Post-Training Effectiveness: The paper conducts comprehensive evaluations showing that post-training techniques (Supervised Fine-Tuning and Reinforcement Learning) can significantly improve the performance of MLLMs in HVS. It also highlights major unresolved challenges and promising avenues for future research.
2.2.2. Key Conclusions / Findings
The paper reached several key conclusions:
- Significant Performance Gap in MLLMs: Even top-tier proprietary models like GPT-4o and Gemini 2.5 Pro falter, achieving only approximately 30% success in object and path search on H*Bench, indicating the inherent difficulty of the task for existing models.
- Effectiveness of Post-Training: Post-training techniques (specifically, Supervised Fine-Tuning (SFT) followed by Reinforcement Learning (RL)) can substantially enhance the performance of open-source MLLMs like Qwen2.5-VL. HVS-3B (the authors' fine-tuned model) increased its success rate from 14.83% to 47.38% for object search and from 6.44% to 24.94% for path search.
- Path Search is Inherently More Difficult: The consistently lower success ceiling for path search compared to object search reveals its greater inherent difficulty. This is attributed to the demand for sophisticated spatial commonsense, physical commonsense, and socio-spatial commonsense, which are often implicit and procedural.
- Limitations of Post-Training for Higher-Order Reasoning: While post-training improves low-level perceptual-motor abilities (visual grounding, exploration), it struggles to impart the higher-level reasoning capabilities required for path search. RL, while beneficial for simpler tasks, can sometimes paradoxically degrade performance on more complex path search scenarios, potentially due to reward hacking.
- Active vs. Passive Visual Search: The active visual search paradigm (rotating a narrow field of view) is superior to passive analysis of a complete panorama, mimicking human efficiency and avoiding panoramic distortions.
- Embodied vs. Disembodied Benchmarks: Capabilities learned from passive Internet data (on 2D benchmarks like V*Bench) do not transfer well to embodied active interaction in 3D (H*Bench), highlighting the unique challenges of embodied AI.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully understand this paper, a novice reader should be familiar with the following fundamental concepts:
3.1.1. Multimodal Large Language Models (MLLMs)
Multimodal Large Language Models (MLLMs) are advanced artificial intelligence models that can understand and process information from multiple types, or modalities, of data simultaneously. Traditionally, Large Language Models (LLMs) (like GPT-3 or GPT-4) primarily deal with text. MLLMs extend this capability by integrating other modalities such as images, audio, or video. This means an MLLM can, for example, take an image as input and answer questions about its content in natural language, or generate a description of the image. They achieve this by aligning the feature spaces of different modalities (e.g., using visual encoders to extract features from images and language encoders for text) and feeding these combined features into a powerful LLM backbone for reasoning and generation. Their ability to process and reason across different data types makes them a promising pathway toward Artificial General Intelligence (AGI).
3.1.2. Visual Search
Visual search is the cognitive process by which humans and artificial agents scan a visual environment to locate specific targets or information among distractors. It's a fundamental aspect of perception and attention. In computational terms, it involves an agent processing visual input to identify a target object, a specific feature, or a navigable path based on a given query or goal. Unlike general object detection, visual search often implies an active, goal-directed process that might involve exploration and decision-making over time, especially in complex or crowded scenes.
3.1.3. Embodied AI
Embodied AI refers to artificial intelligence systems that are situated within a physical (or highly realistic simulated physical) body and interact with a 3D environment. Unlike purely computational AI that operates on abstract data, embodied AI agents perceive their surroundings through sensors (like cameras), process that information, and take physical actions (like moving, grasping, or rotating) that affect their environment and subsequent perceptions. This field focuses on developing agents that can exhibit intelligent behavior in the real world, grounding their knowledge in physical interactions and commonsense reasoning.
3.1.4. 360° Panoramic Images
A 360° panoramic image is a wide-angle photograph that captures a complete view of a scene in all directions. Imagine standing at a single point and taking pictures all around you, then stitching them together into one continuous image. These images provide an immersive representation of a 3D environment from a fixed viewpoint. In the context of this paper, they are used as lightweight simulators, allowing an agent to "look around" by projecting different narrow field-of-view (FoV) perspective images from the panorama, simulating head movements without needing a full 3D model.
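To make the "lightweight simulator" idea concrete, below is a minimal sketch of how a narrow field-of-view perspective view might be rendered from an equirectangular panorama at a chosen azimuth and polar angle. The function name, default FoV, and sign conventions are illustrative assumptions, not the paper's actual renderer.

```python
import numpy as np

def render_perspective(pano, yaw_deg, pitch_deg, fov_deg=90.0, out_hw=(480, 640)):
    """Render a narrow-FoV perspective view from an equirectangular panorama
    (H, W, 3). Illustrative sketch; the paper's exact projection may differ."""
    H, W, _ = pano.shape
    out_h, out_w = out_hw
    f = 0.5 * out_w / np.tan(np.radians(fov_deg) / 2.0)   # pinhole focal length

    # Pixel grid in the virtual camera frame (x right, y down, z forward).
    xs = np.arange(out_w) - (out_w - 1) / 2.0
    ys = np.arange(out_h) - (out_h - 1) / 2.0
    x, y = np.meshgrid(xs, ys)
    dirs = np.stack([x, y, np.full_like(x, f)], axis=-1)
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Rotate rays by pitch (about the x-axis) then yaw (about the y-axis).
    yaw, pitch = np.radians(yaw_deg), np.radians(pitch_deg)
    Rx = np.array([[1, 0, 0],
                   [0, np.cos(pitch), -np.sin(pitch)],
                   [0, np.sin(pitch),  np.cos(pitch)]])
    Ry = np.array([[ np.cos(yaw), 0, np.sin(yaw)],
                   [0, 1, 0],
                   [-np.sin(yaw), 0, np.cos(yaw)]])
    dirs = dirs @ (Ry @ Rx).T

    # Ray direction -> spherical angles -> equirectangular pixel coordinates.
    lon = np.arctan2(dirs[..., 0], dirs[..., 2])           # azimuth in [-pi, pi]
    lat = np.arcsin(np.clip(dirs[..., 1], -1.0, 1.0))      # vertical angle
    u = ((lon / (2 * np.pi) + 0.5) * (W - 1)).astype(int)
    v = ((lat / np.pi + 0.5) * (H - 1)).astype(int)
    return pano[v, u]                                      # nearest-neighbor sample
```

Repeatedly calling such a renderer at different angles is what lets a single panorama stand in for a physical environment: each "head turn" simply re-projects a new perspective view.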
3.1.5. Supervised Fine-Tuning (SFT)
Supervised Fine-Tuning (SFT) is a common technique in machine learning, particularly with Large Language Models (LLMs) and MLLMs. It involves taking a pre-trained model (a model that has already learned general patterns from a vast amount of data) and further training it on a smaller, task-specific, labeled dataset. The goal of SFT is to adapt the general knowledge of the pre-trained model to excel at a particular downstream task, like humanoid visual search in this case. The training process uses supervised learning, meaning the model learns from input-output pairs (e.g., visual observation + instruction → correct action + chain-of-thought rationale).
3.1.6. Reinforcement Learning (RL)
Reinforcement Learning (RL) is a paradigm of machine learning where an agent learns to make sequential decisions by interacting with an environment. The agent performs actions, and in response, the environment transitions to a new state and provides a reward signal. The agent's goal is to learn a policy (a mapping from states to actions) that maximizes the cumulative reward over time. Unlike supervised learning which learns from explicit input-output pairs, RL learns from trial and error, often suited for tasks requiring long-horizon planning and strategy. In this paper, RL is used as a post-training step to further refine the MLLM agent's policy and improve its instruction-following and reasoning capabilities.
3.1.7. Chain-of-Thought (CoT)
Chain-of-Thought (CoT) is a prompting technique used with Large Language Models (LLMs) to encourage them to generate a series of intermediate reasoning steps before providing a final answer. Instead of directly asking for an answer, the model is prompted to "think step by step." This process makes the model's reasoning explicit, often leading to more accurate and robust solutions for complex tasks, especially those requiring multi-step logical reasoning or problem-solving. In this paper, CoT rationales are used during SFT to instill structured reasoning abilities in the MLLM agent.
3.1.8. Fovea/Foveation
In human vision, the fovea is a small pit in the center of the retina responsible for sharp central vision, providing the highest visual acuity. When we look directly at an object, we are foveating it, meaning we are aligning our eyes so that the image of the object falls onto our fovea. This allows us to perceive fine details. In the context of humanoid object search (HOS), foveation refers to the agent's action of rotating its "head" (or camera) to bring the target object into the central, high-resolution region of its field of view, mimicking human visual precision.
3.2. Previous Works
The paper contextualizes its work by discussing several related areas:
3.2.1. Visual Search
Early visual search methods relied on bottom-up visual saliency (identifying visually prominent features) or top-down contextual guidance (using prior knowledge to direct search), or a combination thereof [40, 56, 65]. These often struggled with generalization due to limited contextual understanding. Recent advancements, like [58], leverage MLLMs with rich world knowledge (e.g., object co-occurrence) to improve performance. However, these works predominantly focus on search within static 2D images. This disembodied approach neglects the active, dynamic nature of human visual search, which coordinates both eye (oculomotor) and head (cephalomotor) movements to explore the 3D world, where the head explores unseen regions and eyes exploit visible content [4, 30, 50].
3.2.2. Visual Navigation
Visual navigation and vision-language navigation (VLN) aim to develop agents that can move through an environment to reach a specified goal [2, 72]. These tasks typically require either a full 3D simulator or real physical hardware, which are difficult to build, scale, and reproduce. Consequently, most efforts have been confined to household scenes where 3D data is more accessible [7-10, 37, 45, 71]. This leaves in-the-wild challenges largely unexplored. This paper is motivated by the observation that human navigation involves intermittent reasoning at critical decision points, allowing them to focus on these points using 360° panoramas to bypass the need for full 3D simulation or physical hardware.
3.2.3. Multimodal LLMs (MLLMs)
MLLMs are at the forefront of Artificial General Intelligence (AGI), integrating various data modalities (text, images, etc.). Key foundational models include Flamingo [1], BLIP [33, 34], and LLaVA [35], which focus on aligning visual encoder features with LLMs. More recent MLLMs like GPT-4o [41] and Gemini 2.5 [46] have set new benchmarks through increased model capacity and novel training recipes, notably using Reinforcement Learning (RL)-based post-training to align outputs with human preferences and improve instruction-following [43, 63]. RL can also foster stronger reasoning for complex, multi-step tasks [20, 22, 27, 28]. This paper grounds MLLMs in the physical world to assess and improve their active and embodied visual search capabilities.
3.2.4. Multimodal LLMs with Tools
Inspired by humans using external tools, LLM agents have demonstrated superior performance in long-horizon tasks by leveraging external toolkits (e.g., web browsing, code execution) and multi-turn reinforcement learning [17, 18, 24, 42]. This concept extends to multimodal settings, where MLLMs generate symbolic tool calls (e.g., OCR, marking, cropping, zoom in) to overcome limitations in semantic grounding and visual perception [36, 44, 49, 67, 70]. However, these tool operations typically occur on a disembodied 2D canvas, involving computational manipulations of a static image file. This paper distinguishes itself by coupling tool use with physical world actions, specifically active head rotation, to construct a visual chain of thought, bridging the gap between passive and active embodied reasoning.
3.2.5. Multimodal LLMs for Embodied Reasoning
A growing body of research aims to ground MLLMs in embodied reasoning to bridge the gap between symbolic linguistic representations and physical world perception [12, 23, 54, 64, 68]. Cosmos-Reason1 [3] enables MLLMs to perceive the physical world via video and generate physically grounded responses. Gemini Robotics-ER [52] extends Gemini's multimodal reasoning to the physical world with enhanced spatiotemporal understanding. This paper specifically focuses on active visual search with interleaved multimodal reasoning, an area that remains largely unexplored in this context.
3.3. Technological Evolution
The field has evolved from:
- Early Visual Search (pre-MLLMs): Focused on saliency maps and top-down context within static 2D images. Limited by lack of world knowledge and generalization.
- MLLMs for 2D Visual Search: Leveraged MLLMs' rich world knowledge to improve static 2D search, often using computational tools like zooming and cropping on a fixed canvas. Still disembodied.
- Visual Navigation in Simulators: Developed agents for navigation in 3D environments, but often restricted to simple household scenes due to the complexity and cost of 3D simulators or real hardware.
- MLLMs with Tool Use and Embodied Reasoning (Current): Began integrating MLLMs with external tools (still mostly 2D computational) and exploring grounding MLLMs in embodied contexts (often passively observing or generating high-level plans).

This paper's work fits into the current wave by pushing MLLMs beyond passive observation and 2D tool use into active, embodied interaction in 3D environments, specifically focusing on the critical visual search component necessary for open-world navigation and manipulation. It uniquely achieves this without full 3D simulation by cleverly using 360° panoramas.
3.4. Differentiation Analysis
Compared to previous approaches, this paper's core innovations and differentiations are:
- Active & Embodied Interaction: Unlike prior visual search methods confined to static 2D images or MLLMs using disembodied 2D tools, this work introduces humanoid visual search (HVS), where agents actively rotate their "head" (camera view within a 360° panorama) to change their perspective and gather new information. This tightly couples visual reasoning with physical-like actions.
- Scalable, Hardware-Free Embodied Simulation: Instead of relying on expensive 3D simulators or real hardware (common in visual navigation), the paper proposes a lightweight framework using 360° panoramic images. This allows for a closed-loop perception-action cycle in in-the-wild scenes without the traditional overhead, enabling scalable research.
- Focus on In-the-Wild Complexity: While embodied AI platforms often use simplistic household scenes, H*Bench explicitly moves to visually crowded, structurally, semantically, and volumetrically complex real-world environments (e.g., subway stations, shopping malls). This necessitates more advanced visual-spatial reasoning.
- Task-Driven Embodiment: The search is explicitly driven by embodied tasks: humanoid object search (HOS) (for manipulation) and humanoid path search (HPS) (for locomotion), making the search goal-directed and physically relevant, unlike abstract perceptual exercises.
- Human-like Eye-Head Coordination: The proposed model prototypes human-like eye-head coordination, where head rotations explore new regions, aligning with neuroscience findings.
4. Methodology
4.1. Principles
The core idea of humanoid visual search (HVS) is to enable Multimodal Large Language Models (MLLMs) to perform active visual exploration and reasoning in 360° immersive environments, mimicking how humans coordinate head and eye movements. This is driven by the observation that human spatial intelligence involves critical decision points where observation, reasoning, and ambiguity resolution occur before action. The method abstracts full-body motion to the atomic action of head rotation, allowing the study of core cognitive processes of embodied visual search in a tractable yet realistic manner. The system functions as a tool-augmented MLLM where head rotations are treated as tools that continuously construct a visual chain of thought.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Problem Formulation
The environment is modeled as a single 360° panoramic image. From this panorama, narrow Field-of-View (FoV) perspective images are sampled, representing the agent's observations. Each observation $o_{\phi, \gamma}$ is defined by its azimuth $\phi$ and polar angle $\gamma$.
The objective of HVS is to identify the optimal direction that maximizes the probability of task success $r_s$ given a natural language instruction $x$ and the visual observation $o_{\phi, \gamma}$. This can be formally expressed as:
$ (\phi^*, \gamma^*) = \arg \max_{\phi, \gamma} P (r_s \mid o_{\phi, \gamma}, x) $
- $(\phi^*, \gamma^*)$: The optimal final viewing direction (azimuth and polar angle) that the agent needs to find.
- $\arg \max_{\phi, \gamma}$: The operator that finds the values of $\phi$ and $\gamma$ that maximize the subsequent expression.
- $P(r_s \mid o_{\phi, \gamma}, x)$: The probability of achieving task success $r_s$ given the current visual observation $o_{\phi, \gamma}$ (a perspective image from the panorama) and the language instruction $x$.
The paper defines two core embodied search tasks:
- Humanoid Object Search (HOS): The agent's goal is to locate a target object and foveate it, meaning the final viewing direction must bring the target object into the central foveal region of the perspective view. This is a prerequisite for potential manipulation tasks.
- Humanoid Path Search (HPS): The agent needs to identify a navigable path to a target location. The goal is to find a final viewing direction (only the azimuth is considered, assuming a planar ground) that is aligned with the optimal path direction. This serves as a high-level planning step before actual locomotion.
4.2.2. Humanoid Visual Search with MLLMs
The humanoid visual search task is framed as a multimodal reasoning task where MLLM tool use is coupled with head rotation. An MLLM agent's behavior is governed by a policy $\pi_{\theta}(y_t, a_t \mid o_t, x, \mathcal{H}_t)$.
- $\pi_{\theta}$: The agent's policy, parameterized by $\theta$.
- $y_t$: The textual chain of thought (CoT) generated by the agent at time step $t$.
- $a_t$: The action generated by the agent at time step $t$.
- $o_t$: The current visual observation (a perspective image) at time $t$, defined by its azimuth $\phi_t$ and polar angle $\gamma_t$.
- $x$: The initial natural language instruction for the task.
- $\mathcal{H}_t$: The history of past observations, chain-of-thought rationales, and actions up to time $t-1$.

At each time step $t$, the agent generates a textual chain of thought ($y_t$) to reason about the current situation and then produces an action ($a_t$). The process allows for a sequence of rotation actions, culminating in a submission action. The action space consists of two primitives:
- Rotate ($\Delta\phi, \Delta\gamma$): This action adjusts the agent's viewing direction.
  - $\Delta\phi$: The change in azimuth (yaw angle). Positive values indicate rotation to the right, negative values to the left. Yaw is circular, meaning rotations wrap around 360 degrees.
  - $\Delta\gamma$: The change in polar angle (pitch angle). Positive values indicate looking up, negative values indicate looking down.
  - The viewing direction is updated as $\phi_{t+1} = \phi_t + \Delta\phi$ and $\gamma_{t+1} = \gamma_t + \Delta\gamma$.
- Submit: This action commits the current viewing direction as the agent's final estimate and terminates the episode.
The image images/2.jpg (Figure 2 from the original paper) visually illustrates this two-stage post-training pipeline and the closed-loop perception-action cycle.
Figure 2. This image illustrates the two stages of multi-turn reinforcement learning in a 360° panoramic environment. The left side represents the expert trajectory annotation phase, which involves pre-training a multimodal large language model. The right side shows the task execution process, including how the agent identifies targets and acquires new views.
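The Rotate/Submit primitives described above can be summarized as simple state updates. Here is a minimal sketch of that action space; the pitch clamp range is an assumed detail, since the summary above only states that yaw wraps around 360°.

```python
from dataclasses import dataclass

@dataclass
class ViewState:
    yaw: float    # azimuth phi in degrees, circular
    pitch: float  # polar angle gamma in degrees

def rotate(state: ViewState, d_yaw: float, d_pitch: float,
           pitch_limits=(-90.0, 90.0)) -> ViewState:
    """Apply a Rotate(d_phi, d_gamma) action: yaw wraps around 360 degrees,
    pitch is clamped (the limits here are an assumption, not from the paper)."""
    new_yaw = (state.yaw + d_yaw) % 360.0
    new_pitch = min(max(state.pitch + d_pitch, pitch_limits[0]), pitch_limits[1])
    return ViewState(new_yaw, new_pitch)

def submit(state: ViewState) -> tuple:
    """Submit commits the current direction (phi_T, gamma_T) and ends the episode."""
    return state.yaw, state.pitch
```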
4.2.3. MLLM Post-Training
MLLMs, initially trained on static, disembodied Internet data, often lack the spatial commonsense and active 3D planning capabilities necessary for humanoid visual search. To address this, the paper adapts MLLMs through a two-stage post-training pipeline:
4.2.3.1. Stage 1: Supervised Fine-Tuning (SFT)
The first stage involves Supervised Fine-Tuning (SFT). This step aims to instill basic task-oriented reasoning and tool-use abilities in the model. The SFT is performed on a curated multi-turn dataset (described in Section 4.3.2) that contains visual observations, verified Chain-of-Thought (CoT) rationales, and corresponding ground-truth actions. This process teaches the model to generate structured action plans from multimodal inputs, thereby establishing a strong behavioral prior.
The SFT objective function is the expected negative log-likelihood (also known as cross-entropy loss) over the dataset $\mathcal{D}^{SFT}$. This dataset consists of task inputs $x$ and labeled trajectories $\mathcal{H}_T$:
$ \min_{\theta} \ \mathbb{E}_{(x, \mathcal{H}_T) \sim \mathcal{D}^{SFT}} \left[ - \sum_{i=0}^{T-1} \log \pi_{\theta} (y_i, a_i \mid o_i, x, \mathcal{H}_i) \right] $
- $\min_{\theta}$: The optimization objective is to find the model parameters $\theta$ that minimize the expression.
- $\mathbb{E}_{(x, \mathcal{H}_T) \sim \mathcal{D}^{SFT}}$: Expectation taken over all task inputs and their corresponding labeled trajectories sampled from the SFT dataset $\mathcal{D}^{SFT}$.
- $\mathcal{H}_T$: A complete ground-truth trajectory, which is a sequence of observations, rationales, and actions from the start of an episode ($i = 0$) to the end ($i = T-1$).
- $- \sum_{i=0}^{T-1} \log \pi_{\theta} (y_i, a_i \mid o_i, x, \mathcal{H}_i)$: The negative log-likelihood for a given trajectory. It sums the negative logarithm of the probability assigned by the policy to the correct chain of thought $y_i$ and action $a_i$ at each step of the trajectory, given the current observation $o_i$, instruction $x$, and historical context $\mathcal{H}_i$. Minimizing this value maximizes the likelihood of the model producing the expert-provided CoT and action sequence.
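As an illustration of this objective, the sketch below accumulates the per-step negative log-likelihood of the expert CoT and action tokens with teacher forcing. The `policy` interface (returning per-token logits over the target tokens) is a hypothetical stand-in, not the paper's code.

```python
import torch
import torch.nn.functional as F

def sft_loss(policy, trajectory, instruction):
    """Negative log-likelihood of the expert CoT y_i and action a_i at every step,
    conditioned on observation o_i, instruction x, and history H_i.
    `policy` is assumed to return logits of shape (num_target_tokens, vocab)."""
    total_nll = 0.0
    history = []
    for obs, cot_tokens, action_tokens in trajectory:
        target = torch.cat([cot_tokens, action_tokens])            # y_i then a_i
        logits = policy(observation=obs, instruction=instruction,
                        history=history, targets=target)
        total_nll = total_nll + F.cross_entropy(logits, target, reduction="sum")
        history.append((obs, cot_tokens, action_tokens))           # grow H_{i+1}
    return total_nll  # minimize this, summed over the trajectory
```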
4.2.3.2. Stage 2: Multi-Turn Reinforcement Learning (RL)
Following SFT, the policy is further refined using Group Relative Policy Optimization (GRPO) [48]. This RL stage is crucial for encouraging long-horizon reasoning and developing robust, generalizable search strategies beyond what imitation learning (from SFT) alone can achieve [14].
For each task, the agent samples $G$ times to obtain a group of output trajectories $\{\omega_i\}_{i=1}^{G}$. Each $\omega_i$ represents a complete output sequence for a single rollout, consisting of all generated tokens for chains of thought and actions. The GRPO algorithm then calculates an advantage for each token within these sequences to update the model parameters.
The GRPO objective function is given as:
$ \mathcal{J}_{\mathrm{GRPO}}(\theta) = \mathbb{E}_{(s_o, x, y) \sim \mathcal{D}^{RL},\ \{\omega_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(\Omega \mid s_o, x)} \left[ \frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\omega_i|} \sum_{t=1}^{|\omega_i|} \left\{ \min \left[ \frac{\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})} \hat{A}_{i,t},\ \mathrm{clip}\!\left(\frac{\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})}, 1 - \epsilon, 1 + \epsilon\right) \hat{A}_{i,t} \right] - \beta\, \mathbb{KL}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}}) \right\} \right] $
- $\mathcal{J}_{\mathrm{GRPO}}(\theta)$: The GRPO objective function to be maximized with respect to the current policy parameters $\theta$.
- $\mathbb{E}[\cdot]$: Expectation taken over initial states ($s_o$), instructions ($x$), and CoT ($y$) from the RL dataset $\mathcal{D}^{RL}$, and over a group of trajectories $\{\omega_i\}_{i=1}^{G}$ sampled using the old policy $\pi_{\theta_{\mathrm{old}}}$.
- $G$: The number of trajectories sampled in a group.
- $\frac{1}{G} \sum_{i=1}^{G} \frac{1}{|\omega_i|} \sum_{t=1}^{|\omega_i|}$: Averages the objective over the $G$ trajectories and over all tokens within each trajectory $\omega_i$.
- $\pi_{\theta}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})$: The probability assigned by the current policy to the token at step $t$ in trajectory $\omega_i$, given the initial state $s_o$, instruction $x$, and preceding tokens $\omega_{i,<t}$.
- $\pi_{\theta_{\mathrm{old}}}(\omega_{i,t} \mid s_o, x, \omega_{i,<t})$: The probability assigned by the old policy (the policy before the current update) to the same token. The ratio of these two probabilities is a core component of Proximal Policy Optimization (PPO)-like algorithms to ensure stable updates.
- $\hat{A}_{i,t}$: The relative advantage for the token at step $t$ in trajectory $\omega_i$. This term indicates how much better (or worse) generating this token was compared to the average.
- $\mathrm{clip}(\cdot, 1 - \epsilon, 1 + \epsilon)$: A clipping function, typically used in PPO to limit the policy ratio, preventing excessively large policy updates that could destabilize training. $\epsilon$ is a hyperparameter (e.g., 0.2).
- $\beta\, \mathbb{KL}(\pi_{\theta} \,\|\, \pi_{\mathrm{ref}})$: A KL divergence penalty term, which measures the difference between the current policy and a reference policy $\pi_{\mathrm{ref}}$ (often the SFT model or a previous version of the RL policy). This term encourages the RL policy not to deviate too far from the SFT base, preserving instruction-following capabilities and preventing catastrophic forgetting. $\beta$ is a coefficient controlling the strength of this penalty.

The relative advantage $\hat{A}_{i,t}$ is calculated as:
$ \hat{A}_{i,t} = \frac{r_i - \mathrm{mean}(r)}{\mathrm{std}(r)} $
- $r_i$: The total reward obtained for trajectory $\omega_i$.
- $\mathrm{mean}(r)$: The average reward across all trajectories in the current group.
- $\mathrm{std}(r)$: The standard deviation of rewards across all trajectories.

This relative advantage normalizes the reward, indicating whether a particular trajectory's reward is above or below the average for the current batch, allowing the RL algorithm to learn from relative performance within a group.
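For concreteness, here is a small sketch of the group-relative advantage computation described by the formula above. The epsilon guard against a zero-variance group is an added assumption, not part of the formula.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: normalize each trajectory's reward by the mean and
    std of its group of G rollouts; the resulting scalar is broadcast to every
    token of that trajectory. `eps` avoids division by zero (an assumption)."""
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

# Example: a group of G = 4 rollouts for the same task instance.
print(group_relative_advantages([1.0, 0.5, 0.0, 0.5]))
```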
5. Experimental Setup
5.1. Datasets
The paper introduces H*Bench, a new benchmark designed to systematically evaluate humanoid visual search in rich, dense, and visually cluttered real-world environments.
5.1.1. Dataset Overview
- Scale: Approximately 3,000 annotated task instances. By initializing the agent with four distinct starting orientations per task, a total of 12,000 search episodes are generated.
- Resolution: High-resolution panoramic videos.
- Source: Sourced from both self-collected footage across global metropolitan areas (New York, Paris, Amsterdam, Frankfurt) and open platforms (YouTube and the 360+X dataset [11]).
- Geographical Coverage: Broad, spanning 13 cities across 12 countries and 4 continents (see images/8.jpg, Figure I). This diversity contributes to a wide range of architectural styles, languages/scripts on signage, and environmental conditions.
- Scene Diversity: Systematically organized into 6 major scene categories and 18 fine-grained scene types, covering a wide spectrum of challenging scenarios (see images/3.jpg, Figure 3 Top). These include:
  - Transportation hubs (airports, subway stations)
  - Large-scale retail spaces (supermarkets, shopping malls)
  - Public institutions (libraries, museums)
  - Urban streets
- Task Type Diversity: H*Bench supports both Humanoid Object Search (HOS) and Humanoid Path Search (HPS). The diversity of HOS target objects is summarized in images/3.jpg, Figure 3 Bottom Left.
Figure I. This image illustrates H*Bench, showcasing panoramic videos from diverse global locations with visually cluttered environments. Each location features prompts indicating possible activities, such as shopping in a supermarket or purchasing tickets, highlighting the challenges of visual search in real-world scenarios.
Figure 3. This image is a data visualization chart that displays the distribution of different scene categories and object instances. The upper part shows the distribution of six scene categories, with large-scale retail spaces being predominant. The lower part lists the specific classifications of object instances, including signs, furniture, and electronics, while also providing an analysis of the difficulty levels for humanoid tasks and path search.
5.1.2. Benchmark Construction
5.1.2.1. Task Annotation
- Interface: Annotators use a perspective-view interface that renders narrow-field-of-view (FoV) images from the panorama at known viewing angles $(\phi, \gamma)$.
- Process: Annotators freely rotate the virtual camera to inspect the scene, identify a suitable embodied search task, write a natural-language instruction, and mark the target.
- Target Marking: For HOS, the target is marked by drawing a tight bounding box that specifies its optimal direction. This bounding box is then back-projected onto the panorama, and its center yields the optimal target direction $(\phi^*, \gamma^*)$.
- HPS Annotation: For HPS, only the azimuth $\phi^*$ is retained as the optimal direction, assuming the environment can be approximated by a planar ground geometry.
5.1.2.2. Cold-Start Data Curation (for SFT)
- To create high-quality multi-turn trajectories for Supervised Fine-Tuning (SFT), a subset of annotated task instances is augmented with structured Chain-of-Thought (CoT) rationales.
- GPT-4o Prompting: A strong MLLM (GPT-4o [41]) is prompted to produce a concise, observation-grounded rationale for each annotation step (given the task instruction, current observation, and human-provided optimal action).
- Human-in-the-Loop: Annotators review and refine the generated rationales to eliminate hallucinations, ensure grounding in visible scene evidence, and enforce stylistic consistency.
- Dataset: The resulting SFT dataset consists of 2,000 multi-turn trajectories containing visual observations, verified CoT rationales, and actions.
- Effort: Six annotators dedicated 250 hours to embodied question annotation and CoT refinement.
5.1.2.3. Difficulty Taxonomy
- HOS Difficulty: Defined by the initial visibility of the target object.
  - Visibility ratio: Fraction of the object area visible in the initial viewpoint compared to the complete object area.
  - Categorization: Easy (high visibility), Medium (partially visible), Hard (low/invisible).
  - Visualizations: See images/9.jpg (Figure III).

Figure III. This image is an illustration of HOS task examples, showcasing different scenarios of searching for specific objects in a shopping environment. Each scenario includes the target area of the object, initial observation images, and descriptions of varying difficulty levels. A formula shown in the image indicates the difficulty of object detection.

- HPS Difficulty: Depends on two factors:
  - Whether the scene contains textual cues (e.g., signs).
  - Whether the visual or textual cues align with the actual path direction.
  - These factors jointly define four difficulty levels.
  - Visualizations: See images/14.jpg (Figure IV), images/20.jpg (Figure V), images/25.jpg (Figure VI), images/27.jpg (Figure VII).

Figure IV. This image is an illustration showcasing a 360° panoramic view and a robotic figure within an urban setting. It highlights the directions of Cue and Motion with textual annotations, demonstrating how to effectively combine cues and motion direction information in visual search tasks.
Figure V. This image is a schematic diagram illustrating the misalignment of motion and cue directions in response to textual instructions. The left side shows a panoramic view, while the right side features two close-up images indicating the direction of motion and the direction of the cue. This diagram aims to depict the challenges faced by humanoid agents during visual search in complex environments.
Figure VI. This image is a chart illustrating the alignment of cue and motion directions without textual instruction. The panoramic view on the left displays the layout of the environment, while the right side provides examples of direction of motion and direction of cue. Different colors highlight each direction, emphasizing the complexity of visual search in a multidimensional environment.
Figure VII. This image is a visualization of extreme-level HPS task instances, showcasing examples of a robot making path choices in complex environments across multiple scenarios. Each task illustrates the mismatch between the robot's movement direction and cues, aiming to explore the difficulty of decision-making in action.
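To illustrate the HOS side of this taxonomy, the snippet below maps a visibility ratio to a difficulty label. The 0.5 and 0.1 cutoffs are hypothetical placeholders; the benchmark's exact thresholds are not stated in this summary.

```python
def hos_difficulty(visibility_ratio: float) -> str:
    """Map the initial visibility ratio of the target to a difficulty level.
    The cutoffs below are illustrative assumptions, not the benchmark's values."""
    if visibility_ratio >= 0.5:
        return "Easy"     # target largely visible in the first view
    if visibility_ratio > 0.1:
        return "Medium"   # target partially visible
    return "Hard"         # target barely visible or fully outside the view
```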
5.1.3. Train-Test Split
- H*Bench (Evaluation): 1,000 instances (600 HOS and 400 HPS) were reserved as the H*Bench test set for evaluation, resulting in 4,000 evaluation episodes (due to 4 starting orientations per instance).
- SFT Split: 250 instances from the HOS pool and 250 instances from the HPS pool were randomly sampled to construct the SFT dataset.
- RL Training Split: All leftover instances from the initial 3,000, after allocating instances for H*Bench and SFT, were exclusively used for RL training.
5.2. Evaluation Metrics
The primary evaluation metric is Success Rate (%).
5.2.1. Conceptual Definition
A trial (episode) is considered a success if the agent's final submitted viewing direction $(\phi_T, \gamma_T)$ falls within a predefined tolerance region around the ground-truth optimal direction $(\phi^*, \gamma^*)$. This metric directly quantifies the agent's ability to accurately locate and orient towards the target object or path.
5.2.2. Mathematical Formula
The tolerance region is defined as:
$ [\phi^* - \tau_\phi, \phi^* + \tau_\phi] \times [\gamma^* - \tau_\gamma, \gamma^* + \tau_\gamma] $
A submission is successful if $\phi_T$ falls within $[\phi^* - \tau_\phi, \phi^* + \tau_\phi]$ AND $\gamma_T$ falls within $[\gamma^* - \tau_\gamma, \gamma^* + \tau_\gamma]$.
The tolerance parameters $\tau_\phi$ and $\tau_\gamma$ are calculated as:
$ \tau_\phi = \max\left(\frac{w_\phi}{2}, \tau_{\min,\phi}\right), \qquad \tau_\gamma = \max\left(\frac{w_\gamma}{2}, \tau_{\min,\gamma}\right) $
where $\tau_{\min,\phi}$ and $\tau_{\min,\gamma}$ are constant minimum tolerance values.
5.2.3. Symbol Explanation
- $(\phi_T, \gamma_T)$: The azimuth and polar angle of the agent's final submitted viewing direction.
- $(\phi^*, \gamma^*)$: The azimuth and polar angle of the ground-truth optimal direction (center of the annotated bounding box).
- $\tau_\phi$: The tolerance angle for azimuth (yaw) deviation.
- $\tau_\gamma$: The tolerance angle for polar angle (pitch) deviation.
- $w_\phi$: The angular width of the annotated bounding box.
- $w_\gamma$: The angular height of the annotated bounding box.
- $\max(\cdot, \cdot)$: A function that returns the larger of the two input values, ensuring a minimum tolerance even for very small bounding boxes.

Specific Tolerances Used:
- For HOS tasks: both tolerances are applied, with minimum values chosen to mimic human foveation (the area of sharpest vision). Both $\phi_T$ and $\gamma_T$ are evaluated.
- For HPS tasks: only $\phi_T$ is assessed, as path search primarily requires a precise motion direction on a planar ground.
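A minimal sketch of this success check is given below. The 5° minimum tolerances are placeholder values (the paper's exact constants are not reproduced in this summary), and the circular handling of yaw follows the earlier statement that yaw wraps around 360°.

```python
def angular_diff(a: float, b: float) -> float:
    """Smallest absolute difference between two angles in degrees (yaw is circular)."""
    return abs((a - b + 180.0) % 360.0 - 180.0)

def is_success(phi_T, gamma_T, phi_star, gamma_star,
               box_w, box_h, tau_min_phi=5.0, tau_min_gamma=5.0,
               check_pitch=True):
    """Check whether the submitted direction lies inside the tolerance region
    around the ground truth. tau = max(half box extent, minimum tolerance).
    The 5-degree minimums are placeholders, not the paper's values."""
    tau_phi = max(box_w / 2.0, tau_min_phi)
    tau_gamma = max(box_h / 2.0, tau_min_gamma)
    ok_phi = angular_diff(phi_T, phi_star) <= tau_phi
    ok_gamma = (not check_pitch) or abs(gamma_T - gamma_star) <= tau_gamma
    return ok_phi and ok_gamma   # HPS would use check_pitch=False (azimuth only)
```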
5.3. Baselines
The paper evaluates several Multimodal Large Language Models (MLLMs) from both open-source and proprietary categories, comparing them against their own fine-tuned models.
5.3.1. Open-Weight Multi-Image Models
These models are generally accessible for research and development.
- InternVL3.5-4B [13]
- InternVL3.5-8B [13]
- Qwen2.5-VL-3B-Instruct [61] (the base model for the authors' fine-tuned HVS-3B model)
- Qwen2.5-VL-7B-Instruct [61]
- Gemma-3-4B-it [19]
- Gemma-3-12B-it [19]
- Kimi-VL-A3B-Instruct [53]
5.3.2. Proprietary Models
These are state-of-the-art models developed by major AI companies, known for their high performance.
- GPT-4o [41] (OpenAI)
- Gemini 2.5 Pro [46] (Google DeepMind)
5.3.3. Fine-Tuned Models (Ours)
These represent the models developed and evaluated by the authors.
- HVS-3B (w/ SFT only): The Qwen2.5-VL-3B-Instruct model after undergoing only the Supervised Fine-Tuning (SFT) stage.
- HVS-3B: The Qwen2.5-VL-3B-Instruct model after completing both the SFT and Reinforcement Learning (RL) stages.

These baselines were chosen to represent a broad spectrum of current MLLM capabilities, from smaller open-source models to powerful proprietary ones, allowing for a comprehensive assessment of the challenges and potential of humanoid visual search.
6. Results & Analysis
6.1. Core Results Analysis
The main experimental results, as presented in Table 1, reveal a significant performance gap and the effectiveness of the proposed post-training framework.
The following are the results from Table 1 of the original paper:
| | Humanoid Object Search | | | | Humanoid Path Search | | | | |
| Method | Overall | Easy | Medium | Hard | Overall | Easy | Medium | Hard | Extreme |
| Open-Weight Multi Image Models | |||||||||
| InternVL3.5-4B [13] | 3.08 | 7.32 | 2.84 | 1.49 | 4.81 | 6.00 | 5.70 | 4.67 | 0.46 |
| InternVL3.5-8B [13] | 6.38 | 9.76 | 9.10 | 4.79 | 7.25 | 10.00 | 7.68 | 5.14 | 4.17 |
| Qwen2.5-VL-3B-Instruct [61] | 14.83 | 27.97 | 13.07 | 10.01 | 6.44 | 7.00 | 8.77 | 4.91 | 3.24 |
| Qwen2.5-VL-7B-Instruct [61] | 11.38 | 23.42 | 9.10 | 7.02 | 6.31 | 9.00 | 5.92 | 5.84 | 1.85 |
| Gemma-3-4B-it [19] | 17.13 | 32.85 | 26.14 | 10.13 | 14.44 | 17.20 | 14.47 | 14.72 | 7.41 |
| Gemma-3-12B-it [19] | 10.21 | 24.72 | 17.33 | 3.88 | 14.50 | 16.80 | 14.25 | 14.49 | 9.72 |
| Kimi-VL-A3B-Instruct [53] | 4.92 | 12.85 | 0.57 | 2.36 | 4.32 | 8.79 | 3.32 | 2.21 | 4.17 |
| Proprietary Models | |||||||||
| GPT-4o [41] | 19.75 | 18.17 | 17.35 | 20.92 | 23.69 | 26.80 | 22.59 | 26.17 | 13.89 |
| Gemini2.5-Pro [46] | 31.96 | 33.58 | 23.78 | 32.13 | 33.00 | 41.60 | 29.39 | 35.75 | 15.28 |
| Fine-Tuned Models (Ours) | |||||||||
| HVS-3B (w/ SFT only) | 40.83 | 53.82 | 23.86 | 37.73 | 23.00 | 28.00 | 23.03 | 21.26 | 14.81 |
| HVS-3B | 47.38 | 60.49 | 24.43 | 44.87 | 24.94 | 34.80 | 20.18 | 25.00 | 12.04 |
6.1.1. Performance Gap Between Models
- Proprietary Models vs. Open-Weight Models: There is a substantial performance gap. Gemini 2.5 Pro stands out as the strongest baseline, achieving 31.96% overall for HOS and 33.00% for HPS. GPT-4o also performs reasonably well (19.75% HOS, 23.69% HPS). In contrast, most open-weight models, even larger ones, achieve significantly lower success rates, often in single digits or low teens for HPS. Gemma-3-4B-it is the best among open-weight models (17.13% HOS, 14.44% HPS).
- Model Size Anomaly: Interestingly, larger model sizes do not consistently guarantee better performance. For the Gemma-3 and Qwen2.5-VL series, the smaller models often surpass their larger counterparts in HOS, and perform comparably in HPS. This suggests that model scale alone is not sufficient for these embodied tasks without specific training.
6.1.2. Effectiveness of Post-Training
The authors' post-training framework (Supervised Fine-Tuning + Reinforcement Learning) demonstrates significant improvements over the base Qwen2.5-VL-3B-Instruct model.
- SFT Contribution: Supervised Fine-Tuning (SFT) alone contributes the majority of performance gains. For HOS, HVS-3B (w/ SFT only) jumps from 14.83% to 40.83% (an increase of 26.00 percentage points). For HPS, it improves from 6.44% to 23.00% (an increase of 16.56 percentage points). This indicates SFT is crucial for establishing fundamental task-oriented capabilities and tool use.
- RL Contribution: Subsequent Reinforcement Learning (RL) provides additional, albeit more modest, gains. HVS-3B (with both SFT and RL) further improves HOS from 40.83% to 47.38% (an increase of 6.55 percentage points) and HPS from 23.00% to 24.94% (an increase of 1.94 percentage points). This suggests RL acts as a refinement step for optimization.
6.1.3. Task-Dependent Efficacy
- Object Search Superiority: For the relatively simpler object search, HVS-3B (47.38%) outperforms the state-of-the-art proprietary model Gemini 2.5 Pro (31.96%). This indicates that post-training can be highly effective in improving visual grounding and exploration for HOS.
- Path Search Challenges: For the more complex path search, HVS-3B (24.94%) still falls short of Gemini 2.5 Pro (33.00%). This gap suggests that post-training has limitations in enhancing the higher-order spatial reasoning capabilities required for HPS. The lower ceiling for HPS is attributed to the demand for sophisticated spatial commonsense.
6.1.4. Error Analysis
The paper provides an in-depth error analysis, highlighting challenges in MLLMs:
- HOS Errors:
  - Limited visual grounding capabilities: The agent struggles to reliably identify targets in cluttered environments (e.g., failing to distinguish a specific panda-patterned product, as shown in images/28.jpg and images/30.jpg).
  - Perception-action gap: The agent might detect a target but fail to perform precise fine-grained foveation (e.g., not rotating enough to center the target, as shown in images/32.jpg and images/34.jpg).

Figure VIII. Qualitative Examples of Limited Visual Grounding Capabilities in HOS. This image is an illustration showing the interior shelves of a large retail space. The shelves on both sides are lined with various canned foods, reflecting a crowded shopping environment, highlighting the complexity of visual search in real-world scenarios.
Figure IX. Qualitative Examples of Perception-Action Gap in HOS. This image is a panoramic view of a retail space, illustrating a crowded store interior with a complex layout of customers and shelves. This scene is suitable for evaluating the performance of visual search algorithms in real-world environments.

- HPS Errors:
  - Vision-action mismatch: The model perceives visual cues (e.g., signs) but fails to translate them into correct physical actions (e.g., seeing an arrow but rotating the wrong way, as shown in images/5.jpg Left and images/37.jpg).
  - Lack of physical commonsense: Actions violate 3D constraints (e.g., attempting to pass through walls, misjudging vertical connections, ignoring drop-offs, as shown in images/42.jpg).
  - Lack of socio-spatial commonsense: The model misses implicit rules and norms of built environments (e.g., ignoring the functions of stairs, police tape, or crosswalks, or attempting to use an emergency exit as a routine path, as shown in images/5.jpg Left and images/45.jpg).
Figure 5. This image is an illustrative diagram showing the interaction between human instructions and the multimodal large language model (MLLM) in locating target positions. The left side describes the MLLM's deficiency in recognizing airline signs, while the right side highlights its challenges in choosing the route to the airport due to a lack of socio-spatial commonsense.
Figure X. Qualitative Examples of Vision-Action Mismatch in HPS. This image is an indoor scene that showcases a spacious public area with large windows and modern lighting fixtures. This setting may be located in a transportation hub, retail space, or public institution, reflecting the paper's focus on applying visual search in real-world scenarios.
Figure XI. Qualitative Examples of Lack of Physical Commonsense in HPS. This image is a scene of a subway station, showcasing a crowded crowd and the platform structure, reflecting the complexity of urban transportation environments. The design of the structures and the variation in lighting clearly illustrate the interaction of people with their surroundings.
Figure XII. Qualitative Examples of Lack of Socio-Spatial Commonsense in HPS. This image is an illustration depicting a crowded subway passage scene, featuring turnstiles and the surrounding environment. This scene showcases the application of humanoid visual search in complex environments, highlighting the challenges of object and path search in the real world.
These findings collectively suggest that MLLMs can form linguistically grounded spatial models for passive world description, but struggle to develop physically grounded ones for embodied world interaction.
6.2. Ablation Studies / Parameter Analysis
6.2.1. On the Role and Limits of Post-Training
6.2.1.1. Effectiveness of SFT and RL
As previously noted in the main results, SFT is the primary driver of performance gains, establishing fundamental task capabilities. RL provides additional, albeit smaller, refinement. The paper observes that post-training specifically improves:
- Precise control over rotation angles: Allows for fine-grained foveation of targets.
- Use of large-angle turns to explore new areas: Essential for efficient exploration in 360° environments.
- Capacity to act on directional signs: Better interpretation and translation of visual cues into actions.

Case studies (Figures XIII-XV in the Appendix) illustrate these improvements. The authors also found that applying RL directly without prior SFT degrades the model's instruction-following capability, underscoring the importance of SFT for establishing a strong behavioral prior.
6.2.1.2. Task-Dependent Efficacy
The benefits of post-training are task-dependent: significant gains for object search but more modest for path search. This suggests that post-training effectively enhances visual grounding and exploration (crucial for HOS), but struggles to impart physical, spatial, and social commonsense required for HPS.
6.2.1.3. Negative Impact of RL on Complex Tasks
For HPS, RL surprisingly reduces performance on medium difficulty (from 23.03% to 20.18%) and extreme difficulty (from 14.81% to 12.04%). These scenarios are characterized by a misalignment between visual cues and the optimal path, posing a significant challenge. The authors hypothesize this degradation may stem from reward hacking, where the model learns to exploit the reward signal rather than genuinely improving its reasoning capability. This highlights a key challenge in RL: designing reward functions that consistently align with true task objectives across all difficulty levels, especially for complex, implicit commonsense reasoning.
6.2.1.4. Key Takeaway
The disparate impact of post-training on object versus path search leads to a crucial conclusion: post-training can improve visual grounding and exploration for object search, but struggles to impart physical, spatial, and social commonsense for path search, as these are often implicit, situational, and procedural.
6.2.2. Dissecting Object and Path Search
6.2.2.1. In-Task Superiority with an Exception
As shown in images/4.jpg (Figure 4), models generally perform best when trained on the specific task they are evaluated on. However, there's one exception: a model trained solely on object search achieves 37.8% on the easy HPS split, outperforming both the baseline (7.0%) and the dedicated in-task HPS model (33.8%). This is hypothesized to occur because easy HPS tasks often reduce to simple object searches where clear visual cues directly define the path. The powerful object-finding skills acquired during HOS training transfer effectively to these specific HPS scenarios.
Figure 4. This image is a comparison chart showing the success rates (%) of HOS and HPS tasks across different difficulty levels. The x-axis represents the difficulty levels, while the y-axis shows the success rates, featuring multiple curves to illustrate the performance differences of the models across tasks.
6.2.2.2. Cross-Task Generalization
The study observes a clear bidirectional synergy between HOS and HPS:
- Training on object search boosts path search performance from 6.4% to 20.7%.
- Training on path search elevates object search from 14.8% to 29.5%.

This synergy occurs because skills like active exploration and path reasoning acquired from HPS learning directly benefit HOS, while visual grounding honed in HOS reciprocally aids HPS.
6.2.2.3. Mixed-Data Training
Training on a mixed object and path search dataset yields the best overall performance. However, this comes with a challenge: performance gains are unevenly distributed, meaning improvements on certain splits might lead to reduced performance on others. Balancing this trade-off is critical for developing generalist humanoid agents.
6.2.3. Ablation Study: Reward Shaping
The authors ablate different reward functions for path search to understand their impact.
The following are the results from Table 2 of the original paper:
| | Humanoid Path Search | | | | |
| Method | Overall | Easy | Medium | Hard | Extreme |
| GRPO on HPS | | | | | |
| sft (baseline) | 23.44 | 26.00 | 24.56 | 24.77 | 12.50 |
| form+corr | 22.38 | 33.80 | 17.32 | 21.73 | 7.87 |
| form+corr+dist | 21.37 | 34.40 | 15.13 | 20.09 | 6.94 |
| form+dist | 21.31 | 29.80 | 17.54 | 20.56 | 11.11 |
The paper tests three types of reward shaping for HPS:
- format + correctness (form+corr): This reward combines a component for correct output format and another for successful task completion.
- format + correctness + distance-to-goal (form+corr+dist): Adds a term that rewards the agent for getting closer to the target direction.
- format + distance-to-goal (form+dist): Combines correct format with a distance-to-goal reward, without an explicit correctness reward.

All reward shaping variants only improve performance on the easy split, often degrading performance on harder levels (medium, hard, extreme). This underscores the inherent difficulty of path search and indicates the need for more advanced learning algorithms and sophisticated reward design beyond simple distance metrics.
6.2.3.1. Reward Functions
The rule-based reward function used to calculate the reward for a trajectory is: $ r = r_{corr} + r_{form} $
Where: $ r_{\mathrm{corr}} = \left{ \begin{array}{l l} 0.5, & \mathrm{if \ the \ submitted \ answer \ satisfies \ the \ completion \ condition}, \ 0, & \mathrm{otherwise}, \end{array} \right. $
-
: The
correctness reward. It is0.5if the final submitted action successfully completes the task (e.g., the target is within tolerance), and0otherwise.$ r_{\mathrm{form}} = \left{ \begin{array}{l l} 0.5, & \mathrm{if \ the \ response \ is \ in \
<\text{/think}> \ <\text{answer}><\text{/answer}> \ format}, \ 0, & \mathrm{otherwise}. \end{array} \right. $ -
: The
format reward. It is0.5if the agent's output adheres to the expected format, and0otherwise. This encouragesinstruction-followingin terms of output structure.Additionally, a
distance-to-goal reward() is added specifically forHPS. This reward is calculated based on the distance of the final direction to the target bounding box.
$ r_{dist} = \frac{\pi - d(\phi_T, \phi^) + \pi - d(\gamma_T, \gamma^)}{2\pi} $
-
: The
distance-to-goal reward. This term encourages the agent to get closer to the target direction. The term normalizes the reward, making it higher when the distance is smaller. -
: The agent's
azimuthat the final step . -
: The agent's
polar angleat the final step . -
: The ground-truth optimal
azimuth. -
: The ground-truth optimal
polar angle. -
: A function representing the distance to the bounding box for a given angle and target angle .
The distance to bounding box is calculated by: $ d(\alpha, \alpha^) = |\alpha - (\alpha^ - \tau_\alpha)| + |\alpha - (\alpha^* + \tau_\alpha)| $
- $d(\alpha, \alpha^*)$: a distance-like value. When the angle $\alpha$ lies within the tolerance region $[\alpha^* - \tau_\alpha,\ \alpha^* + \tau_\alpha]$, it stays at its constant minimum value ($2\tau_\alpha$), effectively indicating that the target is "hit". Outside this region, it increases linearly with the distance from the region.
- $\alpha$: the current angle (either $\phi_T$ or $\gamma_T$).
- $\alpha^*$: the target angle (either $\phi^*$ or $\gamma^*$).
- $\tau_\alpha$: the tolerance for the angle.
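To make the reward computation concrete, here is a minimal Python sketch of the rule-based reward described above. The function names, the default 30° tolerances, and the `use_corr`/`use_dist` flags are illustrative assumptions for this analysis, not the authors' exact implementation.

```python
import math
import re

def bbox_distance(alpha, alpha_star, tol):
    """d(alpha, alpha*): constant (2*tol) inside the tolerance region, growing outside."""
    return abs(alpha - (alpha_star - tol)) + abs(alpha - (alpha_star + tol))

def hps_reward(response, success, phi_T, gamma_T, phi_star, gamma_star,
               tol_phi=math.radians(30), tol_gamma=math.radians(30),
               use_corr=True, use_dist=False):
    """Rule-based trajectory reward: r = r_form (+ r_corr) (+ r_dist).

    `success`, the tolerance defaults, and the variant flags are assumptions.
    """
    # Format reward: 0.5 if the response follows <think>...</think><answer>...</answer>.
    pattern = r"<think>.*</think>\s*<answer>.*</answer>"
    r_form = 0.5 if re.fullmatch(pattern, response.strip(), flags=re.S) else 0.0

    # Correctness reward: 0.5 if the submitted answer satisfies the completion condition.
    r_corr = 0.5 if (use_corr and success) else 0.0

    # Distance-to-goal reward: higher when the final direction is closer to the target box.
    r_dist = 0.0
    if use_dist:
        d_phi = bbox_distance(phi_T, phi_star, tol_phi)
        d_gamma = bbox_distance(gamma_T, gamma_star, tol_gamma)
        r_dist = ((math.pi - d_phi) + (math.pi - d_gamma)) / (2 * math.pi)

    return r_form + r_corr + r_dist
```

Under this sketch, the three shaping variants from Table 2 correspond to flag settings: `use_corr=True, use_dist=False` is form+corr, `use_corr=True, use_dist=True` is form+corr+dist, and `use_corr=False, use_dist=True` is form+dist.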
6.2.4. Ablation Study: Training Rollout and Context Length
As shown in images/6.jpg (Figure 6):
- Training Rollout (Left): Models trained with shorter GRPO rollouts (e.g., up to 5 turns) can achieve satisfactory performance through test-time scaling and match models trained with longer rollouts (10 turns). Training efficiency can therefore be gained without sacrificing final performance.
- Context Length (Right): A short context length of 2 rounds (the current observation plus the immediately preceding dialogue turn) is sufficient for HVS. Longer context lengths do not significantly improve performance, suggesting that HVS tasks, while multi-turn, do not require an extensive dialogue history (see the sketch after Figure 6).
Figure 6. Left: cumulative success rate on the HOS and HPS tasks under varying maximum turn limits during inference. Right: the impact of test-time context length on success rates for both tasks.
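The short-context finding can be realized by simply truncating the multi-turn history before each model call. Below is a minimal sketch of such truncation, assuming a chat-style message list; the helper name and the 2-round window are illustrative assumptions rather than the authors' code.

```python
def truncate_context(messages, system_prompt, rounds=2):
    """Keep the system prompt plus only the last `rounds` user/assistant exchanges.

    `messages` is a chronological list of {"role": ..., "content": ...} dicts
    (excluding the system prompt). One round = one user turn + one assistant turn.
    """
    keep = rounds * 2  # each round contributes a user and an assistant message
    return [{"role": "system", "content": system_prompt}] + list(messages[-keep:])
```

For HVS, the latest user message would carry the current perspective observation, so a 2-round window retains roughly the current observation plus the immediately preceding turn, matching the setting found sufficient in the ablation.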
6.2.5. Ablation Study: Active vs. Passive Visual Search
The paper compares the active visual search paradigm (where the agent with a perspective view rotates to gather information) against the passive analysis of a complete panorama. As shown in images/7.jpg Left (Figure 7 Left):
- The active paradigm is superior (a minimal view-rendering sketch follows this list).
- Reasons for superiority:
  - It mirrors efficient, human-like search strategies that coordinate head and eye movements.
  - It avoids panoramic distortions that can conflict with MLLM training priors (which are largely based on standard perspective images).
- Empirical validation: Using Gemma-3-4B-it, the passive approach leads to degraded performance. This aligns the work with research on active vision [26, 62].
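To illustrate what the active paradigm involves computationally, the sketch below samples a perspective (pinhole) view from an equirectangular panorama at a given head direction. It is a rough nearest-neighbor implementation with assumed sign conventions and a hypothetical function name, not the authors' rendering pipeline.

```python
import numpy as np

def render_view(pano, yaw, pitch, fov_deg=90.0, out_hw=(448, 448)):
    """Sample a perspective view from an equirectangular panorama.

    pano: (H, W, 3) array covering 360° x 180°.
    yaw, pitch: head direction in radians (azimuth; elevation, up positive).
    Conventions and defaults here are illustrative assumptions.
    """
    H_out, W_out = out_hw
    f = 0.5 * W_out / np.tan(0.5 * np.radians(fov_deg))  # focal length in pixels

    # Ray directions in the camera frame: x right, y up, z forward.
    xs = (np.arange(W_out) - (W_out - 1) / 2.0) / f
    ys = ((H_out - 1) / 2.0 - np.arange(H_out)) / f
    x, y = np.meshgrid(xs, ys)
    z = np.ones_like(x)

    # Rotate rays by pitch (about x-axis), then yaw (about y-axis).
    cp, sp, cy, sy = np.cos(pitch), np.sin(pitch), np.cos(yaw), np.sin(yaw)
    y2 = y * cp + z * sp
    z2 = -y * sp + z * cp
    x3 = x * cy + z2 * sy
    z3 = -x * sy + z2 * cy

    # Ray direction -> spherical angles -> equirectangular pixel coordinates.
    lon = np.arctan2(x3, z3)                                        # [-pi, pi]
    lat = np.arcsin(np.clip(y2 / np.sqrt(x3**2 + y2**2 + z3**2), -1, 1))
    Hp, Wp = pano.shape[:2]
    u = ((lon / (2 * np.pi) + 0.5) * Wp).astype(int) % Wp
    v = np.clip(((0.5 - lat / np.pi) * Hp).astype(int), 0, Hp - 1)
    return pano[v, u]
```

The active agent would call something like this at each turn with its current (yaw, pitch), so the MLLM always sees an undistorted perspective crop rather than the full, distorted panorama.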
6.2.6. Ablation Study: Embodied vs. Disembodied Bench
This critical comparison, illustrated in images/7.jpg Right (Figure 7 Right), highlights the distinct challenges of embodied AI.
- Performance on V* Bench (Disembodied 2D): Traditional 2D methods like Mini-o3 [29] and Chain-of-Focus [67] achieve near-saturation performance (88.2% and 88.0%, respectively) on the disembodied V* Bench (a benchmark for visual search within a static 2D image). This indicates that visual search within a static 2D image is largely "solved" for MLLMs.
- Performance on H* Bench (Embodied 3D): However, the performance of these same methods plummets dramatically on H* Bench, with success rates dropping to a mere 2.5% and 11.6%.
- Contrast and Conclusion: This stark contrast demonstrates that capabilities learned from passive Internet data do not transfer effectively to embodied active interaction in 3D. The HVS-3B model itself achieves only 38.4% success on H* Bench, indicating that HVS remains a wide-open research problem.
- Unified Model Potential: Notably, HVS-3B maintains a satisfactory 65.5% success rate on V* Bench. This suggests the model learns 3D embodied search without substantially compromising its 2D visual search ability, indicating a promising path toward a unified model capable of operating in both physical and digital realms.
Figure 7. Left: success rates of active vs. passive visual search in the HOS and HPS settings. Right: success rates of different methods, comparing H* Bench and V* Bench.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces humanoid visual search (HVS), a novel task that enables MLLM agents to perform active spatial reasoning in 360° immersive environments, mimicking human cephalomotor and oculomotor control. By leveraging real-world 360° panoramas as lightweight, hardware-free simulators, the authors present H* Bench, a systematic benchmark featuring diverse and challenging in-the-wild scenes.
The experimental results highlight that even top-tier proprietary MLLMs struggle with HVS, achieving only approximately 30% success. However, the proposed post-training pipeline (Supervised Fine-Tuning followed by Reinforcement Learning) significantly enhances an open-source model (Qwen2.5-VL-3B), improving object search success by over threefold (to 47.38%) and path search success (to 24.94%). A critical finding is the inherent difficulty of path search, which demands sophisticated spatial commonsense and reveals fundamental limitations in MLLMs' higher-level reasoning capabilities. While post-training effectively boosts low-level perceptual-motor skills like visual grounding and exploration, it struggles with implicit, situational, and procedural commonsense required for complex path search, with RL even showing detrimental effects in some challenging scenarios due to potential reward hacking. The work also demonstrates the superiority of active visual search over passive methods and underscores the significant gap between disembodied 2D visual search and embodied 3D interaction.
7.2. Limitations & Future Work
The authors acknowledge several limitations and propose future research directions:
- Reward Function Design: The current rule-based reward functions (especially for RL) are insufficient for complex tasks like path search, often leading to reward hacking and performance degradation in nuanced scenarios. Future work should focus on designing more robust and better-aligned reward functions.
- Vision Tokenizers: More efficient vision tokenizers are needed to better process and represent visual information for MLLMs, particularly in high-resolution, complex 360° scenes.
- Pre-training for Spatial World Knowledge: Current MLLMs are pre-trained on disembodied Internet data. Developing pre-training methods that instill action-oriented spatial world knowledge will be crucial for improving embodied reasoning.
- Balancing Performance Across Task Difficulties: Achieving consistent performance gains across all difficulty levels, especially for path search, remains a challenge. Future efforts should aim to balance these trade-offs.
- Scaling Embodied Search Data: Scaling up the collection of diverse and densely annotated embodied search data is essential to fully unlock visual-spatial reasoning capabilities in the wild.
7.3. Personal Insights & Critique
This paper presents a highly valuable contribution to the field of embodied AI and MLLMs.
- Innovation in Simulation: The most significant innovation, in my opinion, is the ingenious use of 360° panoramas as a lightweight, hardware-free simulator. This approach effectively bridges the gap between disembodied 2D tasks and expensive 3D simulations or real-world robotics. It allows for scalable data collection and experimentation in complex in-the-wild environments, which is a major bottleneck for embodied AI research. The method could potentially transfer to other embodied tasks that require a wide field of view but can abstract away full 3D physics, such as high-level planning for drone navigation or detailed inspection tasks.
- Quantifying the Commonsense Gap: The paper rigorously quantifies the commonsense reasoning gap in MLLMs. The stark performance drop from 2D benchmarks to H* Bench, and the difficulty disparity between object search and path search, clearly illustrate that MLLMs, despite their linguistic prowess, profoundly lack the physical, spatial, and social commonsense required for real-world interaction. This is a crucial finding that should guide future research, emphasizing that simply scaling models or fine-tuning on current datasets might not be enough.
- Critique on Reward Hacking: The observation of reward hacking in RL for complex HPS tasks is a critical insight. It underscores a fundamental challenge in RL: aligning sparse or imperfect reward signals with true task objectives, especially when those objectives involve implicit, human-like commonsense. This suggests that future reward design may need to move beyond simple rule-based metrics to incorporate more nuanced human feedback (e.g., RLHF-V methods [63]) or curiosity-driven exploration [51] that explicitly rewards learning useful commonsense.
- Potential for Unified Models: The finding that HVS-3B maintains strong performance on V* Bench while also learning 3D embodied search is highly promising. It suggests that specialized embodied training doesn't necessarily come at the cost of general 2D visual understanding, paving the way for unified models that can operate seamlessly in both digital and physical realms.
- Future Directions: Future work could explore hybrid architectures that explicitly integrate 3D geometric reasoning modules with MLLMs, or novel pre-training strategies that expose models to diverse embodied experiences and interaction data rather than just passive observation. Furthermore, instead of just head rotation, the framework could be extended to include whole-body movements or manipulation actions to create more comprehensive embodied agents.