Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
TL;DR Summary
The `Hi Robot` system uses vision-language models in a hierarchical structure to handle complex instructions and feedback, reasoning about the most appropriate next step in task execution; it is demonstrated across multiple robotic platforms on tasks such as cleaning tables, making sandwiches, and grocery shopping.
Abstract
Generalist robots that can perform a range of different tasks in open-world settings must be able to not only reason about the steps needed to accomplish their goals, but also process complex instructions, prompts, and even feedback during task execution. Intricate instructions (e.g., "Could you make me a vegetarian sandwich?" or "I don't like that one") require not just the ability to physically perform the individual steps, but the ability to situate complex commands and feedback in the physical world. In this work, we describe a system that uses vision-language models in a hierarchical structure, first reasoning over complex prompts and user feedback to deduce the most appropriate next step to fulfill the task, and then performing that step with low-level actions. In contrast to direct instruction following methods that can fulfill simple commands ("pick up the cup"), our system can reason through complex prompts and incorporate situated feedback during task execution ("that's not trash"). We evaluate our system across three robotic platforms, including single-arm, dual-arm, and dual-arm mobile robots, demonstrating its ability to handle tasks such as cleaning messy tables, making sandwiches, and grocery shopping. Videos are available at https://www.pi.website/research/hirobot
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models
1.2. Authors
Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, Adrian Li-Bell, Danny Driess, Lachy Groom, Sergey Levine, Chelsea Finn.
Affiliations:
- Physical Intelligence
- Stanford University
- University of California, Berkeley
1.3. Journal/Conference
This paper is published as a preprint on arXiv. Comment on Venue Reputation: arXiv is a popular open-access archive for preprints of scientific papers in fields like physics, mathematics, computer science, quantitative biology, quantitative finance, statistics, electrical engineering and systems science, and economics. It allows researchers to disseminate their work quickly before, or in parallel with, peer review and formal publication. While not a peer-reviewed journal or conference itself, publication on arXiv indicates active research in the field and provides early access to new findings for the scientific community.
1.4. Publication Year
Published at (UTC): 2025-02-26T18:58:41.000Z (released as an arXiv preprint in February 2025).
1.5. Abstract
Generalist robots operating in open-world settings need to perform a variety of tasks, requiring not only reasoning about task steps but also processing complex instructions, prompts, and real-time feedback. Such intricate commands ("Could you make me a vegetarian sandwich?", "I don't like that one") demand the ability to physically execute steps and to contextualize complex language within the physical environment. This work introduces Hi Robot, a system that employs vision-language models (VLMs) in a hierarchical architecture. This system first reasons over complex prompts and user feedback to determine the most appropriate next high-level step, and then executes that step using low-level actions. Unlike direct instruction-following methods limited to simple commands ("pick up the cup"), Hi Robot can process complex prompts and integrate situated feedback during execution ("that's not trash"). The system is evaluated across three robotic platforms (single-arm, dual-arm, and dual-arm mobile robots) on tasks such as cleaning messy tables, making sandwiches, and grocery shopping, demonstrating its advanced capabilities.
1.6. Original Source Link
https://arxiv.org/abs/2502.19417 Publication Status: Preprint on arXiv.
2. Executive Summary
2.1. Background & Motivation
The core problem addressed by this paper is enabling generalist robots to operate flexibly and effectively in open-ended, human-centric environments. Current robotic systems, particularly those relying on standard language-conditioned imitation learning, are typically limited to simple, atomic instructions (e.g., "pick up the coke can"). However, real-world human-robot interaction involves much more complex, dynamic, and often ambiguous natural language.
Why this problem is important: Achieving true general-purpose robotics in human environments requires robots to go beyond executing predefined scripts. They need to interpret nuanced commands (e.g., "Could you make me a vegetarian sandwich? I'd prefer it without tomatoes."), adapt to new situations, incorporate real-time corrections and feedback (e.g., "leave it alone," "that's not trash"), and handle unfamiliar challenges. This flexibility is crucial for intuitive and steerable human-robot interaction, unlocking capabilities for users to guide robots through novel tasks and correct them on the fly.
Specific challenges or gaps in prior research:
- Complexity of Instructions: Prior work largely focuses on System 1-level behaviors (automatic, simple commands), neglecting the System 2-level reasoning (deliberative, complex, long-horizon tasks, interpreting feedback) required for intricate prompts.
- Situational Context: Many existing systems struggle to integrate complex contextual information, such as visual observations and real-time user interjections, into their reasoning process. Language-only systems, for instance, cannot correctly ground feedback like "that's not trash" without visual context.
- Scalability of Data: Training models for complex, open-ended instruction following with real human-robot interaction data is labor-intensive and not scalable.
Paper's entry point or innovative idea: The paper proposes a hierarchical reasoning system that leverages the power of vision-language models (VLMs) for both high-level deliberation and low-level action execution. This architecture aims to bridge the gap between complex human language and robotic physical capabilities by breaking down intricate instructions into manageable, atomic commands. A key innovation is the use of synthetic data generation to create diverse examples of complex prompts and human interjections, overcoming the data scarcity challenge for high-level reasoning.
2.2. Main Contributions / Findings
The paper's primary contributions are:
- Hi Robot Framework: Introduction of a novel hierarchical interactive robot learning system (Hi Robot) that utilizes vision-language models (VLMs) for both high-level reasoning (interpreting complex prompts and feedback) and low-level task execution (generating precise robot actions). This framework effectively implements a "System 1" (low-level, reactive) and "System 2" (high-level, deliberative) cognitive architecture for robots.
- Synthetic Data Generation Scheme: A new method for synthetically generating diverse and situated interaction data. This involves prompting a state-of-the-art VLM with robot observations and atomic commands to generate plausible high-level user prompts and robot responses, significantly expanding the training data for the high-level policy.
- Enhanced Open-Ended Instruction Following: The system demonstrates the ability to process much more complex, open-ended prompts and seamlessly incorporate real-time situated feedback during task execution, a significant advance over prior end-to-end instruction following systems.
- Empirical Validation Across Diverse Platforms and Tasks: Hi Robot is evaluated on various challenging tasks (cleaning tables, making sandwiches, grocery shopping) across different robotic platforms (single-arm, dual-arm, and mobile dual-arm robots). The results show strong generalization to novel task variations and combinations of skills.
- Superior Performance: Experiments show that Hi Robot significantly surpasses multiple prior approaches, including an API-based VLM (e.g., GPT-4o) and flat vision-language-action (VLA) policies, in both Instruction Accuracy (IA) and Task Progress (TP), demonstrating stronger alignment with human intent and higher task success rates.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Vision-Language Models (VLMs): VLMs are a class of artificial intelligence models that combine capabilities from computer vision and natural language processing. They are trained on vast datasets of images and corresponding text (e.g., image captions, descriptions). This training allows them to understand and generate content that involves both visual and linguistic information. For instance, a VLM can answer questions about an image, generate a description for an image, or follow text instructions based on visual input. In the context of this paper, VLMs are crucial for interpreting complex human prompts that refer to objects and actions in the physical world observed by the robot.
  - Autoregressive Decoder-Only Transformer Model: Many VLMs are based on the Transformer architecture, particularly using an autoregressive decoder-only structure. A Transformer is a neural network architecture that uses self-attention mechanisms to process sequences of data. An autoregressive model predicts future tokens in a sequence based on past tokens (e.g., predicting the next word in a sentence). A decoder-only Transformer primarily focuses on generating output sequences, often conditioned on an input sequence (like an image-language prefix). The core of a Transformer is the attention mechanism, which allows the model to weigh the importance of different parts of the input sequence when processing each element:
    $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $
    Where:
    - $Q$ (Query), $K$ (Key), $V$ (Value) are matrices representing the input sequence. They are derived from the same input sequence, transformed by different learned weight matrices.
    - $QK^T$ calculates the dot-product similarity between queries and keys, indicating how much each element should focus on others.
    - $\sqrt{d_k}$ is a scaling factor, where $d_k$ is the dimension of the key vectors, used to prevent large dot products from pushing the softmax function into regions with tiny gradients.
    - $\mathrm{softmax}$ normalizes the scores, turning them into probabilities.
    - The result is multiplied by $V$ to get a weighted sum of the value vectors, representing the attended output.
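As a concrete illustration of the attention formula above, here is a minimal NumPy sketch of scaled dot-product attention. It is illustrative only; the models in the paper use full multi-head Transformer implementations, and the shapes below are arbitrary.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single sequence."""
    d_k = K.shape[-1]                                # dimension of the key vectors
    scores = Q @ K.T / np.sqrt(d_k)                  # (seq_len, seq_len) similarity matrix
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # weighted sum of the value vectors

# Toy usage: 4 tokens with 8-dimensional queries/keys/values.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
print(scaled_dot_product_attention(Q, K, V).shape)   # (4, 8)
```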
- Vision-Language-Action (VLA) Models: VLA models are an extension of VLMs specifically designed for robotic control. They take visual observations (images) and language commands as input and directly output robotic actions. These models leverage the VLM pre-training to enable strong generalization and instruction following in physical tasks. In this paper, the low-level policy is a VLA model.
  - Flow-Matching: A technique used in some VLA models (like $\pi_0$, referenced in the paper) to produce continuous action outputs. Instead of discretizing actions into tokens, flow-matching trains a neural network to learn a vector field that transports a simple noise distribution to the complex target data distribution (e.g., robot action distributions). This allows the model to generate continuous, precise actions.
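For intuition, the following is a minimal PyTorch-style sketch of a flow-matching training step with linear interpolation paths. It is a hedged illustration of the general technique, not the paper's $\pi_0$ implementation; the network, dimensions, and conditioning (omitted here) are assumptions.

```python
import torch
import torch.nn as nn

action_dim, horizon = 14, 10                      # assumed sizes for illustration
net = nn.Sequential(                              # predicts a velocity field v(x_t, t)
    nn.Linear(action_dim * horizon + 1, 256), nn.ReLU(),
    nn.Linear(256, action_dim * horizon),
)

def flow_matching_loss(actions):
    """actions: (batch, horizon * action_dim) expert action chunks."""
    batch = actions.shape[0]
    t = torch.rand(batch, 1)                      # random interpolation time in [0, 1]
    noise = torch.randn_like(actions)             # sample from the simple base distribution
    x_t = (1 - t) * noise + t * actions           # interpolate between noise and data
    target_velocity = actions - noise             # velocity that transports noise to data
    pred = net(torch.cat([x_t, t], dim=-1))
    return ((pred - target_velocity) ** 2).mean() # regress the velocity field

loss = flow_matching_loss(torch.randn(32, action_dim * horizon))
loss.backward()
```

At inference, the learned velocity field is integrated from noise to produce a continuous action chunk, which is why no action tokenization is needed.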
- Hierarchical Control: A control architecture where a complex task is broken down into a series of simpler sub-tasks. A high-level policy makes decisions about which sub-task to execute and when, often operating at a slower rate and with a broader view of the goal. A low-level policy then executes these sub-tasks, operating at a faster rate and focusing on immediate actions. This paper uses a hierarchical structure where a high-level VLM generates low-level language commands, which are then executed by a low-level VLA policy.
- Imitation Learning: A machine learning paradigm where an agent learns a policy by observing demonstrations from an expert. In robotics, this often involves teleoperating a robot to perform tasks and recording the observations and actions. The learned policy then tries to imitate the expert's behavior. Hi Robot uses teleoperated robot demonstrations for training.
- Teleoperation: The control of a robot or machine from a distance by a human operator. In the context of data collection for robotic learning, teleoperation is a common method to gather expert demonstrations by having a human manually guide the robot through tasks.
- System 1 and System 2 Cognitive Processes (Kahneman, 2011): A psychological theory distinguishing between two modes of thought.
  - System 1: Fast, intuitive, automatic, and emotional thinking. It operates quickly with little or no effort and no sense of voluntary control (e.g., recognizing faces, driving on an empty road).
  - System 2: Slower, more deliberative, logical, and effortful thinking. It allocates attention to effortful mental activities that demand it (e.g., solving complex math problems, deciding on a vacation route). The paper draws an analogy between these systems and its hierarchical robotic control, where the low-level policy is akin to System 1 and the high-level policy to System 2.
3.2. Previous Works
The paper categorizes related work into VLMs for robotic control, using LLMs/VLMs for high-level reasoning, and robotic language interaction/feedback.
-
Directly Training VLMs for Robotic Control:
- Methods fine-tune VLMs to output robotic controls based on images and language commands (e.g., Brohan et al., 2023a (RT-2), Wen et al., 2024 (TinyVLA), Kim et al., 2024 (OpenVLA), Black et al., 2024 ($\pi_0$), Liu et al., 2024c, Li et al., 2024, O'Neill et al., 2024 (Open X-Embodiment), Zawalski et al., 2024, Zheng et al., 2025, Pertsch et al., 2025 (FAST)).
- Limitation: While showing impressive generalization, these methods are typically trained for simple, atomic commands (e.g., "put the cup on the plate") and do not handle the complex prompts and feedback that Hi Robot addresses.
-
Using LLMs/VLMs out-of-the-box with Pre-defined Robot Skills:
- Earlier methods (LLMs with learned/hand-designed skills): Huang et al., 2022 (Language Models as Zero-Shot Planners), Brohan et al., 2023b (SayCan), Liang et al., 2023 (Code as Policies), Shah et al., 2024 (Bumble), Singh et al., 2023 (ProgPrompt), Wang et al., 2024 (LLM^3), Li et al., 2025a.
  - Limitation: These systems have limited ability to incorporate complex context, such as image observations, into the reasoning process. SayCan, for example, uses an LLM to select among predefined skills based on language, but its grounding in the visual world is indirect, relying on skill success estimators rather than direct VLM reasoning over images.
- More recent methods (VLMs for skill parameterization): Huang et al., 2023 (VoxPoser), Liu et al., 2024a (Moka), Nasiriany et al., 2024 (Pivot), Chen et al., 2024, Liu et al., 2024b (OK-Robot), Stone et al., 2023, Qiu et al., 2024, Zhi et al., 2024.
  - Limitation: These approaches can process more complex commands and situate them with visual observations, but they have shown limited physical dexterity and limited ability to incorporate real-time language interaction with humans.
-
Robotic Language Interaction and Feedback:
- Model-based systems: Parse language instructions and feedback and ground them via symbolic representations of the scene (e.g., Swadzba et al., 2009, Matuszek et al., 2013, Namasivayam et al., 2023, Patki et al., 2019).
- Learning-based methods (hierarchical architecture, processing feedback directly): Liu et al., 2023 (OLAF), Xiao et al., 2024 (Robi Butler), Shi et al., 2024 (YAY Robot), Belkhale et al., 2024 (RT-H), Singh et al., 2024 (LGR2), McCallum et al., Driess et al., 2023 (PaLM-E), Dai et al., 2024 (RACER), Hu et al., 2023, Li et al., 2025b (HAMSTER).
  - Differentiation from OLAF (Liu et al., 2023): OLAF uses an LLM to modify robot trajectories. Hi Robot can incorporate situated corrections based on robot observations, respond in real-time, and follow complex prompts for dexterous manipulation.
  - Differentiation from YAY Robot (Shi et al., 2024): YAY Robot handles situated real-time corrections but is limited to one prompt and corrections seen in human-written data. Hi Robot leverages VLMs and a new data generation scheme for diverse prompts and open-ended corrections.
  - Differentiation from RACER (Dai et al., 2024): RACER incorporates situated corrections but relies on a physics simulator for recovery behaviors. Hi Robot uses real robot demonstrations without intentional perturbations or corrections and is applicable to open-ended prompts.
3.3. Technological Evolution
The field has evolved from basic robot control (e.g., pick-and-place) to language-conditioned imitation learning for atomic commands. The advent of large language models (LLMs) brought zero-shot planning capabilities, allowing robots to break down tasks. More recently, vision-language models (VLMs) have enabled richer multimodal reasoning, grounding language in visual observations for robot control (e.g., RT-1, RT-2, PaLM-E).
Hi Robot fits into this evolution by pushing the boundaries of VLM-based control. It addresses the gap where existing VLM-action models are strong at System 1 (atomic skills) but weak at System 2 (deliberative planning and feedback integration for complex, open-ended tasks). It also improves upon prior LLM/VLM-based high-level planners by integrating VLM capabilities at both hierarchical levels and introducing a scalable synthetic data generation method for complex interactions. The paper aims to combine the physical dexterity of VLA models with the high-level reasoning capabilities of VLMs to achieve more flexible and steerable human-robot interaction.
3.4. Differentiation Analysis
Compared to related work, the core differences and innovations of Hi Robot's approach are:
- Hierarchical VLM Architecture: Hi Robot uniquely employs VLMs for both the high-level reasoning and the low-level action execution. The high-level VLM interprets complex instructions and feedback, producing atomic language commands. The low-level VLA (also a VLM fine-tuned for actions) then translates these commands into robot movements. This contrasts with systems that use LLMs (without direct visual grounding at the high level) or separate, non-VLM-based low-level controllers.
- Synthetic Data for High-Level Policy: A novel scheme to generate synthetic user prompts and robot responses for training the high-level VLM. This addresses the challenge of acquiring diverse interaction data for complex, open-ended scenarios, going beyond what is typically available in human-labeled datasets or specific feedback types. This synthetic data allows the high-level policy to generalize to a broader range of complex and compositional language.
- Situated Real-time Feedback Integration: Unlike many prior methods, Hi Robot can dynamically incorporate situated (visually grounded) feedback and corrections during task execution. The high-level policy observes current images and user utterances, enabling it to correctly interpret and respond to commands like "that's not trash" in real-time.
- Open-Ended, Complex Instruction Following: The system is designed and demonstrated to handle much more intricate and variable prompts than typical VLA systems, which are usually limited to simple, atomic commands. This includes multi-stage instructions, novel task variations, and adaptive responses to user constraints (e.g., "vegetarian sandwich, no tomatoes").
- Physical Dexterity and Generalization: By building upon a state-of-the-art VLA model ($\pi_0$) for low-level control, Hi Robot maintains high physical dexterity and can generalize across diverse robotic platforms (single-arm, dual-arm, mobile bimanual) and objects.
4. Methodology
4.1. Principles
The core idea behind Hi Robot is to decompose the complex problem of open-ended instruction following into a hierarchical structure, analogous to Daniel Kahneman's "System 1" and "System 2" cognitive processes.
- System 2 (High-Level Policy): This layer embodies deliberate, high-level reasoning. It is responsible for interpreting complex, open-ended user prompts ($\ell_t$) and real-time feedback, considering the current visual observations. Its goal is to deduce the most appropriate next abstract step, an intermediate language command ($\hat{\ell}_t$), for the robot to execute. This policy operates at a lower frequency, as high-level reasoning does not need constant updates. This VLM-based high-level policy leverages vast semantic and visual knowledge acquired from web-scale pre-training.
- System 1 (Low-Level Policy): This layer represents the automatic, reactive physical execution. It takes the simpler, atomic language command ($\hat{\ell}_t$) generated by the high-level policy, along with current visual observations and the robot's state ($q_t$), and directly translates it into a sequence of specific robot actions ($A_t$). This policy operates at a higher frequency for smooth, real-time control. This VLM-based low-level policy (a VLA model) is specifically fine-tuned for producing robotic actions.

The two systems communicate via language, allowing the high-level VLM to break down complex human intent into "bite-sized" instructions that the low-level VLA can reliably execute. A novel synthetic data generation scheme is introduced to train the high-level policy to handle diverse and complex interactions, grounding its reasoning in the robot's actual capabilities.
4.2. Core Methodology In-depth (Layer by Layer)
The Hi Robot system is a hierarchical policy decomposed into a low-level and a high-level inference process.
4.2.1. Overall System Architecture
The overall architecture is depicted in Figure 2.
Figure 2: Overview of hierarchical VLA. The policy consists of a high-level and a low-level policy. The high-level policy processes open-ended instructions and images from base and wristmounted cameras to generate low-level language commands. The low-level policy uses these commands, images, and robot states to produce actions and optionally verbal responses.
4.2.2. Policy Definition
A learned policy controls a robot by processing observation inputs, denoted as $o_t$, and producing one or more actions. Here, $A_t = [a_t, a_{t+1}, \ldots, a_{t+H-1}]$ represents an action chunk consisting of the next $H$ actions to execute.
The system's observation at time $t$ consists of:
- Images from multiple cameras: $I_t^1, \ldots, I_t^n$
- The robot's configuration (joint and gripper positions): $q_t$
- A language prompt: $\ell_t$
Thus, the full observation is $o_t = [I_t^1, \ldots, I_t^n, q_t, \ell_t]$. The policy aims to learn the distribution $p(A_t \mid o_t)$.
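To make the notation concrete, here is a minimal sketch of the observation and action-chunk structures described above. The field names and shapes are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class Observation:
    """o_t = [I_t^1, ..., I_t^n, q_t, l_t]"""
    images: List[np.ndarray]   # n camera images I_t^1..I_t^n, e.g., each (224, 224, 3)
    state: np.ndarray          # robot configuration q_t (joint and gripper positions)
    prompt: str                # language prompt l_t

@dataclass
class ActionChunk:
    """A_t = [a_t, ..., a_{t+H-1}]: the next H actions to execute."""
    actions: np.ndarray        # shape (H, action_dim)
```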
4.2.3. Vision-Language-Action (VLA) Models as Low-Level Policies
The paper builds on Vision-Language-Action (VLA) models, specifically $\pi_0$ (Black et al., 2024), which serves as the foundation for the low-level policy.
A VLA model is produced by fine-tuning a Vision-Language Model (VLM) such that the actions $A_t$ are represented by tokens in the suffix of the model's output sequence.
The $\pi_0$ model specifically:
- Handles multiple images $I_t^1, \ldots, I_t^n$ and continuous state observations $q_t$.
- Outputs continuous action chunk distributions via flow-matching, rather than tokenizing actions.
While standard VLA models can follow various language prompts, they are typically limited to simple and atomic commands.
4.2.4. Hierarchical Inference
The Hi Robot approach decomposes the overall policy into two distinct inference processes:
- High-Level Policy ($\pi^{\mathrm{hi}}$):
  - Input: Image observations ($I_t^1, \ldots, I_t^n$) and an open-ended language prompt ($\ell_t$).
  - Output: An intermediate language command ($\hat{\ell}_t$).
  - Function: This policy, implemented as a VLM, interprets the overall task prompt and accompanying context (images and user interactions), translating it into a suitable, simpler command that the low-level policy can understand and execute.
  - Formal Representation: $\hat{\ell}_t \sim \pi^{\mathrm{hi}}(\hat{\ell}_t \mid I_t^1, \ldots, I_t^n, \ell_t)$
- Low-Level Policy ($\pi^{\mathrm{lo}}$):
  - Input: Image observations ($I_t^1, \ldots, I_t^n$), the robot's configuration ($q_t$), and the intermediate language command ($\hat{\ell}_t$) provided by the high-level policy.
  - Output: An action chunk ($A_t$).
  - Function: This policy, implemented as a VLA, takes the simpler command from the high-level policy and converts it into specific physical actions for the robot.
  - Formal Representation: $A_t \sim \pi^{\mathrm{lo}}(A_t \mid I_t^1, \ldots, I_t^n, q_t, \hat{\ell}_t)$
4.2.5. Inference Frequency
The two policies run at different rates:
- Low-level process: Produces action chunks ($A_t$) at a high frequency (real-time control rates achieved with action chunking; see Appendix B.3).
- High-level process: Is invoked less often. In the implementation, it reruns inference and recomputes $\hat{\ell}_t$ either:
- When one second has elapsed.
- Upon receiving a new interaction with the user. This strategy provides reactive behavior to user feedback while maintaining simplicity.
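A minimal sketch of how the two inference processes could be interleaved at runtime, following the description above. All function names, the queue-based user-input mechanism, and the prompt-appending behavior are illustrative assumptions, not the paper's implementation.

```python
import time

def run_hierarchical_policy(high_level, low_level, robot, user_inputs, prompt):
    """high_level(images, prompt) -> command; low_level(obs, command) -> action chunk."""
    command = None
    last_replan = 0.0
    while True:
        obs = robot.get_observation()              # camera images plus robot state
        # Re-run the high-level VLM every second, or whenever the user interjects.
        interjection = user_inputs.pop() if user_inputs else None
        if command is None or interjection or time.time() - last_replan > 1.0:
            if interjection:
                prompt = prompt + " " + interjection
            command = high_level(obs.images, prompt)
            last_replan = time.time()
        # The low-level VLA runs at a high rate, executing one action chunk per call.
        action_chunk = low_level(obs, command)
        robot.execute(action_chunk)
```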
4.2.6. Incorporating User Interaction
The system is designed to handle dynamic user interventions, which can be text commands or spoken language (transcribed via Whisper large-v2).
- Triggering High-Level Inference: Any user intervention immediately triggers the high-level inference to recompute $\hat{\ell}_t$.
- Robot Verbal Utterances: The high-level policy has the option to include a verbal utterance in its command $\hat{\ell}_t$. These are confirmations or clarifications from the robot. If a verbal utterance is included, a text-to-speech system (Cartesia API) plays it to the user, and the utterance is then removed from $\hat{\ell}_t$ before it is passed to the low-level policy.
- Contextual Grounding: The high-level policy's responses are contextual because it observes the current image observations in addition to the prompt. This enables it to correctly ground feedback like "that's not trash," which is not possible with language-only systems.
- Task Resumption: After an interjection (e.g., "leave it alone") is fulfilled, the user can signal the robot to switch back to the previous command and continue the main task.
4.2.7. Data Collection and Training Hi Robot
Training Hi Robot involves both human-labeled and synthetically generated data, as illustrated in Figure 3.
Figure 3: Data collection and generation for training the high-level policy. We first collect teleoperated robot demonstrations and segment them into short skills (e.g., pick up KitKat). Using this labeled data, we prompt a vision-language model (VLM) to generate synthetic user instructions (e.g., "Can you get me something sweet?") and robot responses. The resulting dataset is used to train the high-level policy, which maps image observations and user commands to verbal responses and skill labels.
-
Robot Demonstration Data Collection:
- Data is collected via teleoperation, where a human remotely controls the robot to perform tasks.
- Trajectories include coarse language annotations of the overall goal (e.g., "make a sandwich").
-
Segmentation into Short Skills ($\hat{\ell}$):
- Full demonstration episodes are segmented into short skills ($\hat{\ell}$), typically lasting 1-3 seconds (e.g., "pick up one piece of lettuce").
- Basic movement primitives (e.g., "move the right arm to the left") are heuristically extracted from raw robot actions.
- This yields a dataset containing tuples of observations and skill labels that describe robot skills.
-
Synthetic Data Generation:
- A large Vision-Language Model (VLM) is used to produce synthetic user prompts and robot utterances.
- The VLM is prompted with both the visual context (robot camera images) and the skill label from the segmented data (e.g., "pick up the lettuce").
- It generates plausible user interactions that might have led to that skill, including user prompts (e.g., "Can you add some lettuce for me?") and the robot's verbal responses/clarifications.
- Prompt Construction for Contextual Grounding (Appendix A.2):
  - The prompt given to the VLM includes a detailed description of the task (e.g., bussing a table, making a sandwich).
  - The model is instructed to ground responses in visual observations and prior context, leveraging its world knowledge (e.g., inferring "lactose intolerant" for sandwich making or "something sweet" for grocery shopping).
  - To maintain consistency in multi-step tasks, the VLM is conditioned on prior skill labels within an episode.
  - The generation step samples a synthetic user prompt and robot response for each labeled skill, producing a richer dataset for training the high-level policy.
- Scenario and Response Categorization (Appendix A.1): To ensure quality and diversity, structured scenario classification (e.g., negative task, situated correction, specific constraint) and response categorization (e.g., simple confirmation, clarification, error handling) are incorporated into the prompt design for the generating VLM.
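A minimal sketch of the kind of synthetic data generation described above: a large VLM is prompted with the robot's camera image and the human-annotated skill label, and asked to invent a plausible user prompt plus robot response. The prompt wording, the `query_vlm` helper, and the JSON output format are assumptions for illustration only.

```python
import json

def synthesize_interaction(query_vlm, image, skill_label, prior_skills):
    """Ask a large VLM to imagine the user prompt that could have led to `skill_label`."""
    instruction = (
        "You are labeling data for a household robot. Given the camera image, the skill the "
        f"robot executed ('{skill_label}'), and the skills executed so far ({prior_skills}), "
        "invent a plausible user prompt that could have led to this skill, plus a short robot "
        "response. Ground both in what is visible in the image. "
        'Reply as JSON: {"user_prompt": "...", "robot_response": "..."}'
    )
    reply = query_vlm(image=image, text=instruction)   # hypothetical VLM call
    record = json.loads(reply)
    # Training tuple for the high-level policy: (image, user prompt) -> (response, skill label)
    return {
        "image": image,
        "user_prompt": record["user_prompt"],
        "robot_response": record["robot_response"],
        "skill_label": skill_label,
    }
```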
-
Training Hi Robot Policies:
- High-Level Policy: Trained on the combined human-labeled and synthetic dataset using a cross-entropy loss for next-token prediction.
- Low-Level Policy: Trained on the robot demonstration data using a flow-matching objective, following Black et al. (2024), for continuous action output.
4.2.8. Model Architecture and Implementation (Section 4.4)
- Base VLM: Both low-level and high-level policies start from the same base
VLM,PaliGemma-3B(Beyer et al., 2024).PaliGemma-3Bis an open-source, 3-billion-parameter VLM known for its balance of performance and computational efficiency. - Low-Level Policy: It is the (Black et al., 2024). This involves fine-tuning
PaliGemma-3Bwith an additionalflow matching "action expert"to produce continuous actions. The full model is unfrozen for fine-tuning. - High-Level Policy: It is fine-tuned from
PaliGemma-3Bon the image-language tuples (derived from ) to predict commands. - Optimizer and Hyperparameters (Appendix C.2):
AdamW optimizer(Loshchilov & Hutter, 2017) with , , and no weight decay.- Gradient norms clipped to a maximum magnitude of 1.
Exponential Moving Average (EMA)of network weights with a decay factor of 0.999.- Learning rate warmed up over the first 1,000 steps, then held constant at .
- Batch size of 512.
- Training Duration and Resources (Appendix C.3):
- High-level policy training: Approximately 2 hours on GPUs.
- Low-level policy training: Similar pipeline, but time varies with dataset size and task complexity.
- Perception and Language Processing (Appendix B.1):
- Speech-to-text:
Whisper large-v2(Radford et al., 2023) locally. - Text-to-speech:
Cartetia API.
- Speech-to-text:
- Inference Hardware (Appendix B.2): One to two
NVIDIA GeForce RTX 4090consumer-grade GPUs. - Real-Time Inference Latency (Appendix B.3):
- Low-Level Policy Per-Step Inference Times (on
RTX 4090):- Image encoding: 14 ms
- Observation processing: 32 ms
- Action prediction (x10 actions): 27 ms
- Total (on-board): 73 ms
- Total (off-board + WiFi): 86 ms
- High-Level Policy (Single Decoding Step):
- RTX 4090: 47 ms (prefill) + 13.2 ms (decode)
- H100: 17.3 ms (prefill) + 5.7 ms (decode) These measurements confirm real-time feasibility at control rates, achieving with action chunking.
- Low-Level Policy Per-Step Inference Times (on
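Using the figures reported above, a quick back-of-the-envelope check of why action chunking makes the per-chunk inference cost compatible with real-time control. This is pure arithmetic on the stated numbers, not additional measurement.

```python
# Reported low-level inference cost per chunk (on-board RTX 4090), in milliseconds.
per_chunk_ms = 14 + 32 + 27          # image encoding + observation processing + action prediction
actions_per_chunk = 10               # the action prediction step emits 10 actions at once

amortized_ms_per_action = per_chunk_ms / actions_per_chunk
print(per_chunk_ms, "ms per chunk")                # 73 ms
print(amortized_ms_per_action, "ms per action")    # 7.3 ms per action when amortized over the chunk
```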
5. Experimental Setup
The experiments aim to evaluate Hi Robot's ability to follow complex prompts and feedback, compare it to baselines, and assess the importance of synthetic data and hierarchical structure.
5.1. Datasets
The training data for the low-level policy consists of teleoperated robot demonstrations segmented into short skills across the three problem domains. The high-level policy is trained on a combination of this human-labeled data and the synthetically generated interaction data.
The three complex problem domains used for evaluation are:
-
Table Bussing:
- Task: Cleaning up a table, placing dishes and utensils into a bussing bin, and trash items into the trash.
- Training Data: Full table cleaning episodes.
- Physical Challenges: Nuanced grasping (e.g., plate by edge), singulating different objects, object manipulation using other objects (e.g., tilting a plate to dump trash).
- Evaluation Prompts: Substantively alter the goal, requiring high-level reasoning (e.g., "can you clean up only the trash, but not dishes?", "can you clean up only the dishes, but not trash?", "bus all the yellowish things"). This tests object recognition (reusable plastic cups are dishes, paper cups are trash) and understanding what not to do.
- Contextual Feedback: "this is not trash", "leave the rest", "leave it alone."
-
Sandwich Making:
- Task: Making a sandwich using up to six ingredients and bread.
- Physical Challenges: Manipulating deformable and delicate ingredients, requiring careful grasping and precise placement.
- Training Data: Examples of different types of sandwiches, with segment labels (e.g., "pick up one slice of bread").
- Evaluation Prompts: Complex prompts (e.g., "hi robot, can you make me a sandwich with cheese, roast beef, and lettuce?", "can you make me a vegetarian sandwich? I'm allergic to pickles") and live corrections (e.g., "that's all, no more").
-
Grocery Shopping:
-
Task: Picking up requested items from a grocery shelf, placing them into a basket, and then placing the basket on a nearby table.
-
Robot Platform: Requires controlling a
bimanual mobile manipulator(Mobile ARX). -
Evaluation Prompts: Nuanced semantics involving variable numbers of objects (e.g., "hey robot, can you get me some chips? I'm preparing for a movie night", "can you get me something sweet?", "can you grab me something to drink?", "hey robot, can you get me some Twix and Skittles?") and interjections (e.g., "I also want some Kitkat").
The following figure (Figure 4 from the original paper) shows examples of tasks performed by the robot.
-
Figure 4: The image is an illustration showing the execution process of a robot in different tasks, including table cleaning, sandwich making, and grocery shopping. Each task demonstrates how the robot responds to user voice commands with feedback and actions, highlighting the system's capability to handle complex instructions.
5.2. Evaluation Metrics
Two complementary metrics are used, measured by a human evaluator blind to the method being run, with 20 trials per task per method.
-
Instruction Accuracy (IA):
- Conceptual Definition: This metric quantifies how well the high-level policy's predicted instruction aligns with the human user's intent. It assesses the system's multimodal understanding, requiring it to correctly interpret both the user's command and the current state of the environment as observed through images.
- Mathematical Formula: The paper does not provide a specific mathematical formula for Instruction Accuracy. Instead, it describes a human evaluation process.
- Symbol Explanation: For each predicted low-level instruction from the high-level model, a human evaluator determines if it is consistent with both the user's command and the current observation.
$
\mathrm{IA} = \frac{\text{Number of correct high-level predictions}}{\text{Total number of high-level predictions}}
$
- A correct prediction means the predicted instruction is consistent with both the user's command and the current visual observation.
- A prediction refers to each instance where the high-level policy generates an intermediate language command.
- A
-
Task Progress (TP):
- Conceptual Definition: This metric provides a granular view of task completion for complex, long-horizon tasks. It quantifies how closely the robot moves towards or achieves the intended goal.
- Mathematical Formula: The paper does not provide a specific mathematical formula for Task Progress.
- Symbol Explanation: Task progress is computed as the proportion of objects that are successfully placed in their correct locations or configurations.
$
\mathrm{TP} = \frac{\text{Number of objects successfully placed/configured correctly}}{\text{Total number of objects involved in the task goal}}
$
- An
object successfully placed/configured correctlymeans the object has reached its intended final state or location as per the task goal (e.g., a dish in the bussing bin, a sandwich ingredient on the bread, a requested item in the basket). Total number of objects involvedrefers to all items the robot is instructed to handle for the specific task.
- An
5.3. Baselines
The paper compares Hi Robot against several alternative approaches and ablations:
-
Expert Human High-Level (Oracle):
- Description: This acts as an oracle baseline where an expert human manually provides the low-level language commands to the low-level policy. The human operator is assumed to provide the most optimal commands to achieve the task.
- Purpose: This baseline helps to isolate the performance limitations. If the low-level policy performs well with human input, it suggests that failures in other methods are primarily due to reasoning (high-level policy) rather than the physical capabilities (low-level policy).
-
GPT-4o High-Level Model:
- Description: This method uses the same hierarchical decomposition as
Hi Robotbut replacesHi Robot's trained high-level VLM with theGPT-4o API-based model.GPT-4ois a much larger VLM but is not fine-tuned on the specific real and synthetic robot interaction datasets. - Prompt Engineering: To align
GPT-4owith the robot's affordances (what the robot can actually do), a carefully engineered prompt is used. This prompt includes task-relevant instructions (derived from ranking common skill labels in the human-annotated dataset) and asksGPT-4oto choose among them. - Purpose: This baseline evaluates the performance of a powerful, general-purpose, off-the-shelf VLM as a high-level planner without specialized fine-tuning for robotic interaction data. It's an advanced version of
SayCan(Brohan et al., 2023b).
- Description: This method uses the same hierarchical decomposition as
-
Flat VLA (Baseline):
- Description: This comparison directly uses the low-level policy, which is a state-of-the-art
VLAmodel, but without any high-level reasoning component or synthetic data. The complex user prompt is directly fed to this singleVLApolicy. - Purpose: This represents a strong, non-hierarchical, end-to-end instruction-following baseline to show the benefits of
Hi Robot's hierarchical structure and specialized training.
- Description: This comparison directly uses the low-level policy, which is a state-of-the-art
-
Flat VLA with Synthetic Data (Ablation):
- Description: This ablation uses the low-level policy by itself (flat architecture) but includes the synthetic data () in its training. This means the single
VLApolicy is trained to directly process complex prompts from the synthetic dataset. - Purpose: This baseline allows evaluation of the benefit of the
hierarchical structureindependent of the effect of thesynthetic data. It helps determine if the hierarchy itself is beneficial, or if simply training a flat policy on more diverse data is sufficient.
- Description: This ablation uses the low-level policy by itself (flat architecture) but includes the synthetic data () in its training. This means the single
-
Hi Robot without Synthetic Data (Ablation):
- Description: This ablation corresponds to the full
Hi Robotmethod but without the synthetically generated training data () for the high-level policy. The high-level policy is trained only on human-labeled data (). - Purpose: This evaluates the importance of
including diverse synthetically-generated promptsin training the high-level reasoning. It can be seen as an advancedVLM-based version of YAY Robot(Shi et al., 2024), which also uses a high-level model to predict language commands for a low-level model but relies on human-written data.
- Description: This ablation corresponds to the full
5.4. Robot System Details (Appendix B.4)
The experiments are conducted on three different robot configurations:
-
UR5e:
- Configuration: A 6-DoF (Degrees of Freedom) robotic arm with a parallel jaw gripper.
- Cameras: Two cameras: one wrist-mounted and one over-the-shoulder.
- Action Space: 7-dimensional configuration and action space.
-
Bimanual ARX:
- Configuration: Two 6-DoF ARX arms.
- Cameras: Three cameras: two wrist-mounted and one base camera.
- Action Space: Combined 14-dimensional configuration and action space, enabling dexterous bimanual manipulation.
-
Mobile ARX:
- Configuration: Built on the
Mobile ALOHA(Fu et al., 2024) platform, integrating two 6-DoF ARX robotic arms mounted on a mobile base. - Cameras: Two wrist-mounted cameras and a base camera.
- Action Space: 14-dimensional configuration space for the arms, plus the nonholonomic base introduces two additional action dimensions, resulting in a 16-dimensional action space for navigation and manipulation.
- Configuration: Built on the
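As a compact summary of the three platforms' interfaces described above, a small configuration sketch. The camera counts and action-space dimensions are taken from the text; the dictionary layout itself is illustrative.

```python
# Camera counts and action-space dimensionality for each evaluation platform.
ROBOT_PLATFORMS = {
    "UR5e":         {"cameras": 2, "action_dim": 7},   # 6-DoF arm + parallel jaw gripper
    "Bimanual ARX": {"cameras": 3, "action_dim": 14},  # two 6-DoF arms, wrist + base cameras
    "Mobile ARX":   {"cameras": 3, "action_dim": 16},  # bimanual arms (14) + nonholonomic base (2)
}
```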
6. Results & Analysis
6.1. Core Results Analysis
The paper presents quantitative and qualitative results, highlighting Hi Robot's superior performance in open-ended instruction following and adaptation to feedback.
The following are the results from Figure 5 of the original paper:
Figure 5: The image is a diagram that illustrates the instruction accuracy and task progress of the robot across different tasks, including table bussing, sandwich making, and grocery shopping. It compares the performance of Flat VLA, GPT-4 High-Level model, and Expert Human High-Level (Oracle). The metrics reflect the execution effectiveness for each task.
Summary of Findings:
-
Hi RobotExcels at Open-Ended Instruction Following:- Across all tasks (Table Bussing, Sandwich Making, Grocery Shopping),
Hi Robotconsistently demonstrates substantially higherInstruction Accuracy (IA)andTask Progress (TP)compared to both theGPT-4ohigh-level baseline and theFlat VLAbaseline. - For instance, on average,
Hi Robotachieves over 40% higher instruction accuracy thanGPT-4o. Hi Robotsuccessfully identifies, picks up, and places the correct items, even under complex constraints like "I'm allergic to pickles" (omitting ingredients) or handling only certain objects.- In contrast, the
GPT-4obaseline often struggles with maintaining context, issuing irrelevant or "nonsensical commands" (e.g., "pick up bermuda triangle"), or incorrectly labeling objects (e.g., everything as "plate" or "spoon"), which severely disrupts long-horizon tasks.
- Across all tasks (Table Bussing, Sandwich Making, Grocery Shopping),
-
Strong Situated Reasoning and Adaptation to Feedback:
Hi Roboteffectively updates its low-level commands in response to real-time user feedback and modifications, such as "leave the rest" or "I also want a KitKat." This demonstrates its ability to performsituated reasoningby grounding language corrections in current visual observations.GPT-4ooften fails to maintain a coherent internal state during mid-task interactions, leading to errors like attempting to pick up new objects when the gripper is already occupied or prematurely switching tasks.- The
Flat VLAbaseline, lacking a high-level reasoning component, does not react to real-time feedback at all.
-
Effectiveness Across Diverse Tasks, Robots, and User Constraints:
Hi Robotexhibits robust performance across different robotic platforms (single-arm, dual-arm, and mobile bimanual) and a variety of distinct objects (e.g., fragile cheese slices, tall bottles).- It successfully respects dynamic user constraints, such as "bus only yellowish items" or "don't add tomatoes."
- The
Flat VLAandGPT-4obaselines, when faced with mid-episode prompt changes, frequently revert to default behaviors (e.g., picking up every object, including all sandwich ingredients), indicating a lack of flexible adaptation.
-
Expert Human Guidance Reveals Low-Level Policy Strengths:
-
The
Expert Human High-Level (Oracle)baseline shows that with ideal high-level instructions, the low-level policy executes nearly flawlessly. This indicates that the physical capabilities and low-level control of the are robust. -
This result underscores that the primary bottleneck for complex instruction following is not the robot's ability to perform atomic actions, but rather the
high-level reasoningandlanguage understandingcomponent.Hi Roboteffectively bridges this gap by providing an automated high-level VLM that can generate commands aligning with user prompts and real-time observations, offering a scalable alternative to human intervention.The following are the results from Figure 6 of the original paper:
-
Figure 6: The image is an illustration showing the low-level command prediction process of the system after receiving user instructions and image observations. The left side lists the user prompts and corresponding image observations, while the right side compares the low-level command outputs of other methods with those of the proposed 'Hi Robot' system. This system is capable of handling complex requests and providing appropriate responses, demonstrating its advantages in task execution.
Qualitative Analysis (Figure 6): Figure 6 illustrates the qualitative differences in low-level command generation for different methods.
- The
GPT-4ohigh-level model, despite being a powerful VLM, often producessemantically correct language commandsbased on visual input (e.g., identifying a plate), but these commands mightignore user constraints(e.g., "bus only yellowish things" leads to "pick up plate," which might not be yellowish). This highlightsGPT-4o's lack of fine-tuning for robotic affordances and dynamic task adherence. - The
Flat VLAbaseline demonstrates that its low-level policy oftenaligns well with image observationsfor atomic actions (e.g., "pick up bowl"), but it alsoignores user constraintsbecause it lacks the high-level reasoning to interpret and enforce them (e.g., it might pick up all dishes even if asked for only trash). It directly acts on the initial instruction without re-evaluating context or feedback. Hi Robotshows superior performance by generating commands that areboth visually grounded and adhere to user-specified constraints. This illustrates the successful integration of complex reasoning with low-level execution, enabled by its hierarchical structure and synthetic data training.
6.2. Ablation Studies
Two key ablation studies were conducted to isolate the contributions of synthetic data and the hierarchical structure.
6.2.1. (A) Synthetic Data is Critical for Open-Ended Instruction Following
The following are the results from Figure 7 of the original paper:
Figure 7: Ablation on synthetic data. Synthetic data is essential for handling open-ended instructions: the model trained without it struggles with user-driven deviations, failing to integrate clarifications and constraints, whereas Hi Robot adapts seamlessly by leveraging diverse, compositional language prompts (on average, +46% Instruction Accuracy and +39% Task Progress).
Analysis:
- Comparison:
Hi Robot(trained on human-labeled + synthetic data) is compared toHi Robot without synthetic data(trained solely on human-labeled data). - Results: The model trained without synthetic data shows significantly lower performance across all tasks. On average, the inclusion of synthetic data leads to an approximate 46% increase in Instruction Accuracy (IA) and a 39% increase in Task Progress (TP).
- Conclusion: Synthetic interactions are crucial for boosting language flexibility. Without this diverse, compositionally rich synthetic data, the model struggles to integrate clarifications (e.g., "this is not trash") or adhere to constraints (e.g., avoiding forbidden items like pickles).
Hi Robot's ability to smoothly adapt to such feedback is directly attributable to the broader coverage of compositional language provided by the synthetic data, enabling robust high-level reasoning. This confirms that acquiring sufficient, diverse data for complex interactions is a key bottleneck that synthetic generation effectively addresses.
6.2.2. (B) Hierarchical Structure Outperforms a Flat Policy
The following are the results from Figure 8 of the original paper:
Figure 8: Hierarchical policy vs. flat policy. The hierarchical approach outperforms the flat variant trained on the same data, as it effectively integrates user feedback and partial instructions, whereas the flat model struggles with mid-task clarifications and nuanced task variations (on average, +19% Instruction Accuracy and +34% Task Progress).
Analysis:
- Comparison:
Hi Robot(hierarchical) is compared toFlat VLA with synthetic data(a flat policy trained on the same combined human-labeled and synthetic data). - Results: The hierarchical approach (
Hi Robot) significantly outperforms the flat variant. On average,Hi Robotachieves a 19% higher Instruction Accuracy (IA) and a 34% higher Task Progress (TP). - Conclusion: Separating high-level reasoning from low-level control is highly beneficial for
multi-step coherenceandadapting to dynamic user inputs. The flat model often reverts to generic behaviors (e.g., clearing all items) or fails to handle partial instructions (e.g., "bus only the yellowish things") effectively. The hierarchicalHi Robot, by contrast, re-checks the prompt at each high-level step and responds coherently to mid-task updates, demonstrating that the explicit deliberation of the high-level VLM is essential for robust performance in complex, interactive scenarios, even when the underlying data is rich.
6.3. Failure Cases (Appendix C.4)
The authors acknowledge several failure modes:
-
High-level Failures:
- Difficulty with instructions requiring
long-context reasoning, as the current system lacks persistent memory of past actions or broader episodic context beyond what's immediately provided.
- Difficulty with instructions requiring
-
Low-level Failures:
Temporarily ignoring instructions: E.g., grabbing cheese when the robot is close to it despite a user'slactose intoleranceinstruction. This is attributed totraining bias toward proximal objects.Error accumulation and out-of-distribution (OOD) recovery: E.g., dropped objects. The system struggles to recover from situations not covered in its training data or when errors compound.
Mitigations for Future Work (Appendix C.4): The authors suggest several directions to address these limitations:
- Stronger instruction-following model.
- Long-context model (to address memory limitations).
- Adversarial data generation for edge cases (to improve robustness).
- Diverse data collection, including
failure recoveryandannotation.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces Hi Robot, a novel hierarchical robot learning system designed for open-ended instruction following and seamless integration of user feedback. The core innovation lies in its architecture, which employs vision-language models (VLMs) for both high-level deliberative reasoning (akin to "System 2") and low-level reactive action execution (akin to "System 1"). The high-level VLM interprets complex prompts and real-time user feedback, generating atomic language commands. These commands are then executed by a low-level Vision-Language-Action (VLA) model, which produces physical robot actions. A crucial contribution is the synthetic data generation scheme, which leverages a large VLM to create diverse and situated interaction examples, overcoming data scarcity for training the high-level policy.
Evaluations across diverse robotic platforms (single-arm, dual-arm, and mobile bimanual) and complex tasks (table bussing, sandwich making, grocery shopping) demonstrate that Hi Robot significantly outperforms flat VLA policies and API-based VLM baselines like GPT-4o in both Instruction Accuracy and Task Progress. Ablation studies confirm the critical importance of both synthetic data for language flexibility and the hierarchical structure for multi-step coherence and adaptation to dynamic user inputs.
7.2. Limitations & Future Work
The authors highlight several limitations and propose future research directions:
Limitations:
- Long-Context Reasoning: The current system lacks persistent memory, which limits its ability to handle instructions requiring reasoning over extended past interactions or a broader task context.
- Reliance on Prompt Engineering: The training process for the high-level model, particularly the synthetic data generation, still relies on some
prompt engineeringto elicit desired behaviors from the largeVLM(). - Decoupled Models: The high-level and low-level models are trained separately and are not inherently aware of each other's specific capabilities or limitations, beyond what is implicitly learned from the training examples. This might lead to the high-level policy issuing commands that the low-level policy struggles with or vice-versa.
- Training Bias (Low-Level): The low-level policy can sometimes
temporarily ignore instructionsin favor of strong biases from its training data, such asproximal objects(e.g., grabbing cheese despite a "lactose intolerant" instruction if cheese is very close). - Error Accumulation and OOD Recovery: The system struggles with
out-of-distribution (OOD)scenarios andfailure recovery(e.g., dropped objects), leading to error accumulation.
Future Work:
- Unified Model for System 1/2: A natural next step is to combine both high-level and low-level systems into a single model, drawing the "System 1" vs. "System 2" distinction purely at inference time through dynamic processing or attention mechanisms.
- Intricate Interleaving of Processing: Develop more sophisticated strategies for interleaving high-level and low-level processing, moving beyond the current fixed-frequency or event-triggered approach. This could involve adaptive systems that process inputs and language asynchronously at multiple levels of abstraction.
- Coupling High-Level and Low-Level Policies: Enhance the high-level policy's awareness of the low-level policy's capabilities and success rate. This could involve feedback loops or shared representations that allow the high-level policy to generate more executable and robust commands.
- Dynamic Reasoning: Create robotic
vision-language-action modelsthat can dynamically reason not only about inputs and feedback but also about their own capabilities to produce suitable situated responses in complex open-world settings. - Addressing Specific Failure Modes: Implement stronger instruction-following models, incorporate
long-context models(with memory), useadversarial data generationfor edge cases, and collect more diverse data specifically includingfailure recoveryscenarios.
7.3. Personal Insights & Critique
This paper presents a highly relevant and compelling approach to a critical challenge in robotics: enabling robots to interact with humans through complex, natural language in dynamic environments.
Insights:
- Power of VLM Generalization: The work beautifully demonstrates how
VLMs, with their vast pre-trained knowledge, can be effectively leveraged for both high-level semantic reasoning and low-level action control in robotics. This hints at a powerful paradigm shift where general-purpose foundation models become central to embodied AI. - Synthetic Data as a Scalability Enabler: The synthetic data generation scheme is particularly clever. Real-world human-robot interaction data for complex, multi-modal feedback scenarios is extremely expensive and difficult to collect. Using a strong
VLM(likeGPT-4oif available, orPaliGemmahere) to "imagine" plausible human-robot dialogues based on observed robot actions is a highly scalable solution to this data bottleneck. It allows for the creation of diverse, compositional language examples that would be almost impossible to gather manually. - Hierarchy's Enduring Value: The ablation studies clearly reaffirm the value of hierarchical control. Even with rich training data, a flat policy struggles with coherence and dynamic adaptation. This suggests that for genuinely complex, long-horizon tasks, explicit decomposition into deliberative planning and reactive execution remains a robust design principle, allowing each layer to focus on its respective strengths. The System 1/System 2 analogy is well-suited and intuitively explains why this architecture is effective.
- Situational Awareness: The emphasis on
situated feedbackis crucial. A simple instruction like "that's not trash" is meaningless without the visual context of what "that" refers to and what "trash" implies in the current scene.Hi Robot's ability to ground such feedback in observations is a key step towards truly natural human-robot interaction.
Critique / Areas for Improvement:
-
Dependency on VLM for Synthetic Data: While powerful, the quality and biases of the synthetically generated data are directly dependent on the capabilities and biases of the
VLMused for generation (). If has blind spots or generates unrealistic scenarios, these could propagate into the high-level policy. More sophisticated validation or adversarial generation strategies could mitigate this. -
Simple High-Level Re-inference Strategy: The strategy of re-running high-level inference every second or on new user input is pragmatic but simplistic. A more intelligent system might use
confidence scores,detection of sub-task completion, orgoal progress monitoringto decide when to deliberate, potentially improving efficiency and coherence. -
Lack of Explicit Failure Recovery Mechanisms: The paper identifies error accumulation and OOD recovery as limitations. While the high-level policy can react to feedback, explicit
self-reflectionorfailure detectionmodules that trigger specialized recovery plans (beyond just re-planning the next step) could enhance robustness significantly. The current system relies on human feedback to correct certain failures, which is not ideal for full autonomy. -
Black-Box Nature of VLM Decisions: As with all large neural models, the exact reasoning process of the
VLMs(both high-level and low-level) remains somewhat opaque. While the language commands provide an interpretable interface, the internal decision-making for complex scenarios could be hard to debug. -
Scalability to More Abstract Goals: While
Hi Robothandles complex prompts, the "atomic commands" for the low-level are still quite specific. Scaling to even more abstract, open-ended goals (e.g., "prepare for a party" which involves many sub-tasks not explicitly demonstrated) would require more advancedlong-context reasoningandmemorycapabilities, as the authors themselves point out.Transferability: The core hierarchical architecture and the synthetic data generation method are highly transferable.
-
The hierarchical VLM approach could be applied to other embodied AI domains like
virtual agentsorlarge-scale simulation environmentswhere complex, human-like instruction following is desired. -
The synthetic data generation technique could be adapted to accelerate data collection for any task requiring diverse language interaction and situated grounding, not just robotics, but also for
chatbot developmentwith environmental context orinstruction tuningfor multimodal models.Overall,
Hi Robotrepresents a significant step towards more intelligent, flexible, and human-friendly robotic systems, effectively merging the power of modernVLMswith a robust hierarchical control paradigm.
Similar papers
Recommended via semantic vector search.