
ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

Published: 12/12/2023
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ThinkBot addresses Embodied Instruction Following by using Chain of Thought reasoning to fill in missing action steps in human commands, enhancing coherence. It leverages a Large Language Model for instruction completion and a multimodal Transformer for accurate object localization.

Abstract

Embodied Instruction Following (EIF) requires agents to complete human instructions by interacting with objects in complicated surrounding environments. Conventional methods directly use the sparse human instruction to generate action plans for agents, which usually fail to achieve human goals because of incoherence in the action descriptions. In contrast, we propose ThinkBot, which reasons over the thought chain in human instruction to recover the missing action descriptions, so that the agent can successfully complete human goals by following the coherent instruction. Specifically, we first design an instruction completer based on large language models to recover the missing actions with interacted objects between consecutive human instructions, where the perceived surrounding environments and the completed sub-goals are considered for instruction completion. Based on the partially observed scene semantic maps, we present an object localizer to infer the position of interacted objects for agents to achieve complex human goals. Extensive experiments in the simulated environment show that our ThinkBot outperforms the state-of-the-art EIF methods by a sizable margin in both success rate and execution efficiency.

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

ThinkBot: Embodied Instruction Following with Thought Chain Reasoning

1.2. Authors

  • Guanxing Lu (Tsinghua Shenzhen International Graduate School, Tsinghua University)
  • Ziwei Wang (Carnegie Mellon University) - Corresponding Author
  • Changliu Liu (Carnegie Mellon University)
  • Jiwen Lu (Department of Automation, Tsinghua University)
  • Yansong Tang (Tsinghua Shenzhen International Graduate School, Tsinghua University)

1.3. Journal/Conference

Published (UTC): 2023-12-12. Venue: arXiv (preprint). Context: while currently an arXiv preprint, the paper targets embodied AI and robotics and cites top-tier venues such as CVPR, ICCV, and CoRL, aligning it with high-standard computer vision and robotics research.

1.4. Publication Year

2023

1.5. Abstract

This paper addresses the challenge of Embodied Instruction Following (EIF), where robots must follow human commands to interact with objects in complex environments. The authors identify a critical flaw in existing methods: human instructions are often "sparse" and "incoherent" (e.g., telling a robot to "take the mug" without mentioning "open the fridge" first). To solve this, they propose ThinkBot. This system uses a Large Language Model (LLM) to reason about the "thought chain"—filling in missing steps in the instructions—and a specialized Object Localizer to find the exact positions of objects mentioned in these recovered steps. Experiments on the ALFRED benchmark show ThinkBot significantly outperforms state-of-the-art methods in success rate and efficiency.

2. Executive Summary

2.1. Background & Motivation

The Core Problem: In the field of Embodied AI, robots are expected to act as household assistants. However, humans often give sparse instructions. For example, a user might say, "Heat up the apple," but physically, the robot needs to: (1) Find the apple, (2) Pick it up, (3) Go to the microwave, (4) Open the microwave, (5) Put the apple in, etc. Conventional methods often fail because they try to map the high-level command directly to actions without realizing the intermediate steps (like "opening the microwave") are missing. This "incoherence" leads to execution failures.

The following figure (Figure 1 from the original paper) illustrates this problem perfectly. The top row shows a standard method (Prompter) failing because it doesn't know it needs to open the fridge. The bottom row shows ThinkBot successfully reasoning that it must first "Go to fridge" and "Open fridge."

Figure 1. Comparison between conventional EIF methods (Prompter [11]) and our ThinkBot. Existing methods directly leverage sparse human instruction to generate action sequences, which usually get stuck due to the incoherence of instruction. Our ThinkBot recovers missing action descriptions by reasoning the thought chain in sparse human instruction, and can successfully complete challenging tasks.

Why It Matters: If robots cannot infer implied steps, they cannot function in the real world where humans are naturally imprecise. Bridging the gap between "what is said" and "what must be done" is a fundamental hurdle for autonomous agents.

2.2. Main Contributions / Findings

  1. ThinkBot Agent: A novel framework that uses "Thought Chain Reasoning" to explicitly predict missing sub-goals (actions and objects) from sparse human instructions.
  2. Instruction Completer: An LLM-based module designed with specific prompts to act as a "gap-filler," taking perceived environment data and incomplete instructions to generate a coherent plan.
  3. Multimodal Object Localizer: A technical module that predicts exactly where an object is on a map (e.g., where the "fridge" is) by aligning the text of the recovered instruction with visual map data, enhanced by learning object correlations (e.g., apples are likely found near fridges or tables).
  4. Superior Performance: On the challenging ALFRED benchmark, ThinkBot achieves higher success rates than previous state-of-the-art methods, particularly in unseen environments, proving its ability to generalize.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

To understand ThinkBot, you need to grasp these concepts:

  • Embodied Instruction Following (EIF): A task where a virtual robot (agent) is placed in a simulated 3D home and must complete a task (like "clean the lamp") based on natural language instructions. It involves both Navigation (moving around) and Interaction (picking up, toggling, slicing objects).
  • Large Language Models (LLMs): AI models (like GPT-3 or GPT-4) trained on vast amounts of text. They have "commonsense" knowledge. For example, an LLM knows that to get a cold drink, you usually need to open a fridge. ThinkBot exploits this commonsense to fill in missing instructions.
  • Chain of Thought (CoT): A prompting technique where the AI is encouraged to "show its work" or reason step-by-step before giving a final answer. Instead of just outputting "Action A," it outputs "Since X is true, I need to do Y, so the action is A."
  • Semantic Map: A top-down 2D grid representation of the environment where each cell is labeled with what object is there (e.g., "floor," "table," "obstacle").
  • Transformer & Attention Mechanism: A deep learning architecture that handles sequences of data. The "Attention" mechanism allows the model to focus on specific parts of the input. For instance, when looking for a "spoon," the model should pay more "attention" to map areas labeled "kitchen counter" rather than "bathroom."

3.2. Previous Works

  • End-to-End Methods (e.g., E.T., MOCA): These treat the problem as a translation task, taking instructions and pixels as input and directly outputting motor actions (forward, turn left). They struggle to generalize because they don't "plan" explicitly.
  • Modular Methods (e.g., FILM, Prompter): These break the problem into explicit modules: "Map Builder," "Planner," and "Controller."
    • Prompter: A key baseline for this paper. It uses an LLM to help identify which object to interact with but doesn't fully reason through the sequence of missing steps required to get there. ThinkBot improves upon Prompter by recovering the entire missing chain of actions.

3.3. Differentiation Analysis

ThinkBot differentiates itself by addressing the "Instruction Incoherence" problem. While previous works like Prompter used LLMs to guess target objects, ThinkBot uses LLMs to rewrite the instruction itself, inserting the missing logical steps (sub-goals) and then precisely localizing the objects associated with those hidden steps.

4. Methodology

4.1. Principles

The core philosophy of ThinkBot is that execution requires coherent planning. Instead of blindly following a user's sparse command, the agent should "think" like a human:

  1. Reason: "The user wants a mug. Mugs are usually in cabinets. I don't see a mug. I should find a cabinet, open it, and then look."

  2. Locate: "Where is the cabinet on my visual map?"

  3. Act: Move to that location.

    The following figure (Figure 2 from the original paper) outlines this overall pipeline. It shows the flow from Human Instruction -> Instruction Completer (LLM) -> Recovered Instruction -> Object Localizer -> Action.

    Figure 2. Schematic overview of how ThinkBot performs instruction completion and object localization from human instructions. It depicts the task, the observed scene, and the current frame; the thought-chain reasoning and sub-goal generation steps; and the role of the object localizer across the recovered sub-steps.

4.2. Core Methodology In-depth (Layer by Layer)

The system operates in a loop. At each time step $t$, the agent generates a high-level sub-goal $A_t$ based on the current instruction $I_t$. A plan step is defined as a tuple $A_t = (a_t, o_t, p_t)$, where (a minimal code sketch follows the list):

  • $a_t$: The primitive action (e.g., "Open").
  • $o_t$: The target object (e.g., "Fridge").
  • $p_t$: The coordinates (position) of the object.
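To make the notation concrete, here is a minimal Python sketch of the per-step plan tuple; the class and field names are illustrative, not taken from the authors' code.

```python
# A minimal sketch of the per-step plan structure A_t = (a_t, o_t, p_t).
# The class and field names are illustrative, not from the paper's code.
from dataclasses import dataclass
from typing import Tuple

@dataclass
class SubGoal:
    action: str                 # primitive action a_t, e.g. "OpenObject"
    target_object: str          # interacted object o_t, e.g. "Fridge"
    position: Tuple[int, int]   # map coordinates p_t of the object

plan_step = SubGoal(action="OpenObject", target_object="Fridge", position=(12, 48))
```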

4.2.1. Instruction Completer (The "Brain")

This module uses a Large Language Model (GPT-3.5) to fill in the gaps.

  • Input: The module receives a prompt containing:

    1. System Message: Defines the world rules (e.g., "You are a robot assistant," "Available actions are: Go to, Open, Pickup").
    2. Agent Message: Contains the Sparse Human Instruction (e.g., "Get chilled mug"), the Task Progress (what has been done so far), and Observed Objects (what the robot currently sees in the room).
  • Process (Thought Chain Reasoning): The LLM is prompted to reason step-by-step. It identifies that "chilled mug" implies finding a "fridge," "opening" it, and then "taking" the mug.

  • Output: It outputs a Recovered Sub-goal (e.g., Action: OpenObject, Object: Fridge).

    The following figure (Figure 3 from the original paper) details the prompt structure. Note the "Thought Chain" section in the output where the model explicitly writes down its reasoning logic.

    Figure 3. The thought chain and recovered sub-goals when ThinkBot follows a human instruction. The left side shows the system message (role explanation and response format); the right side shows the agent message, linking the human instruction to the observed objects. In the response, the thought chain for the current instruction "take a mug" yields the recovered sub-goals "Go to fridge" and "Open fridge".
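To illustrate this prompt structure, here is a hedged sketch of how such an Instruction Completer call might be assembled with the OpenAI chat API; the action vocabulary and message wording are assumptions based on Figure 3, not the authors' released prompts.

```python
# Illustrative sketch of assembling the Instruction Completer's prompt.
# The exact wording and action list are assumptions, not the paper's code.
from openai import OpenAI

client = OpenAI()

system_msg = (
    "You are a robot assistant in a household environment. "
    "Available actions: GotoLocation, OpenObject, PickupObject, "
    "CloseObject, PutObject, ToggleObject, SliceObject. "
    "Reason step by step, then output the recovered sub-goals."
)

agent_msg = (
    "Human instruction: Get a chilled mug.\n"
    "Task progress: none.\n"
    "Observed objects: Fridge, CounterTop, Cabinet, Sink."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": system_msg},
        {"role": "user", "content": agent_msg},
    ],
)
# Expected style of output: a thought chain followed by sub-goals such as
# (GotoLocation, Fridge) and (OpenObject, Fridge).
print(response.choices[0].message.content)
```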

4.2.2. Multimodal Object Localizer (The "Eyes")

Once the LLM says "Open the Fridge," the agent needs to know where the fridge is. This module predicts the position coordinates $p_t$.

The architecture involves three main parts: Encoders, Graph Learning, and Alignment.

Step 1: Feature Extraction (Encoders)

  • Instruction Features ($\mathbf{X}_s$): A pre-trained BERT model processes the text of the recovered instruction (e.g., "Fridge") to create a vector representation.
  • Map Features ($\mathbf{X}_t'$): A convolutional neural network (CNN) processes the current top-down semantic map (which shows obstacles and explored areas) to create a visual feature map $\mathbf{X}_t'$. (A sketch of both encoders follows.)
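As a rough illustration of the two encoders, the following sketch pairs a HuggingFace BERT text encoder with a small CNN over the semantic map; the channel count, map size, and layer widths are placeholder assumptions, not the paper's configuration.

```python
# Minimal sketch of the two encoders: BERT for the recovered instruction,
# a small CNN for the top-down semantic map. Sizes are illustrative.
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Instruction features X_s: encode the recovered instruction text.
tokens = tokenizer("Open the fridge", return_tensors="pt")
X_s = bert(**tokens).last_hidden_state           # (1, seq_len, 768)

# Map features X_t': a CNN over a C-channel semantic map (assumed here to
# have one channel per object category plus obstacle/explored channels).
map_cnn = nn.Sequential(
    nn.Conv2d(in_channels=24, out_channels=64, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(64, 768, kernel_size=3, padding=1),
)
semantic_map = torch.zeros(1, 24, 64, 64)        # dummy partially observed map
X_t_prime = map_cnn(semantic_map)                # (1, 768, 64, 64)
```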

Step 2: Object Correlation Graph Learning

This is a critical innovation. The robot learns relationships between objects (e.g., "sinks are often near stoves") to improve localization even when the map is incomplete. The authors construct a graph $\mathcal{G} = (\mathbf{V}, \mathbf{E})$ whose nodes $\mathbf{V}$ are object categories. The edge matrix $\mathbf{E}$, representing correlations, is learned dynamically from the map features:

$$\mathbf{E}_t = f(\mathbf{X}_t' \mathbf{W}_e)$$

  • $\mathbf{E}_t$: The learned correlation matrix (edges).
  • $\mathbf{X}_t'$: The initial map features.
  • $\mathbf{W}_e$: A learnable weight matrix.
  • $f(\cdot)$: An activation function.

These correlations are fused back into the map features using a graph convolutional layer. The enhanced map features $\mathbf{X}_t$ are calculated as:

$$\mathbf{X}_t = \mathbf{X}_t' + \mathbf{E}_t \mathbf{X}_t' \mathbf{W}_a$$

  • $\mathbf{X}_t$: The final map features enriched with object-relationship knowledge.
  • $\mathbf{X}_t'$: The original map features.
  • $\mathbf{E}_t \mathbf{X}_t'$: Message passing, propagating information across the graph.
  • $\mathbf{W}_a$: A learnable aggregation weight matrix. (A sketch of this layer follows.)
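A minimal PyTorch sketch of this correlation step, assuming per-category node features from the map encoder and taking $f(\cdot)$ to be a sigmoid; module and variable names are illustrative.

```python
# Sketch of the learned object-correlation step (Eqs. 2-3), assuming map
# features have been pooled to one feature vector per object category.
import torch
import torch.nn as nn

class CorrelationGraph(nn.Module):
    def __init__(self, num_categories: int, dim: int):
        super().__init__()
        self.W_e = nn.Linear(dim, num_categories, bias=False)  # learns edges E_t
        self.W_a = nn.Linear(dim, dim, bias=False)             # aggregation weights

    def forward(self, X_t_prime: torch.Tensor) -> torch.Tensor:
        # X_t_prime: (num_categories, dim) node features from the map encoder.
        E_t = torch.sigmoid(self.W_e(X_t_prime))      # E_t = f(X_t' W_e), (N, N)
        X_t = X_t_prime + E_t @ self.W_a(X_t_prime)   # X_t = X_t' + E_t X_t' W_a
        return X_t

graph = CorrelationGraph(num_categories=24, dim=768)
X_t = graph(torch.randn(24, 768))                     # enriched map features
```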

Step 3: Map-Instruction Alignment (Multimodal Transformer)

The model must now find the specific area of the map $\mathbf{X}_t$ that matches the instruction $\mathbf{X}_s$. This is done with a cross-attention mechanism. First, the projections are computed:

$$\mathbf{Q}_s = \mathbf{X}_s \mathbf{W}_q, \quad \mathbf{K}_t = \mathbf{X}_t \mathbf{W}_k, \quad \mathbf{V}_t = \mathbf{X}_t \mathbf{W}_v$$

(Note: the paper's text states that "we take the features of the semantic maps as the query," yet its Eq. 4 derives the query $\mathbf{Q}_s$ from the instruction features $\mathbf{X}_s$ and the key/value from the map features $\mathbf{X}_t$. We follow the formulas exactly as written and simply flag this textual ambiguity.)

The aligned representation $\mathbf{H}_t^s$ is computed via standard scaled dot-product attention:

$$\mathbf{H}_t^s = \operatorname{Softmax}\left(\frac{\mathbf{Q}_s \mathbf{K}_t^T}{\sqrt{d}}\right) \mathbf{V}_t$$

  • $\mathbf{H}_t^s$: The multimodal features fusing map and text.
  • $d$: The dimensionality of the features (a scaling factor that prevents vanishing gradients).
  • $\operatorname{Softmax}$: Converts scores into probabilities.

Finally, the feature map $\mathbf{H}_t^s$ is passed to a decoder that outputs a probability heatmap; the pixel with the highest value is selected as the target coordinates $p_t$. (A sketch of this step follows.)
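The following sketch implements the cross-attention and heatmap read-out with the formulas exactly as written (query from the instruction features); the flattened 30x30 map, the single-linear decoder, and all shapes are illustrative assumptions.

```python
# Sketch of the map-instruction cross-attention (Eqs. 4-5) plus the argmax
# over the decoded heatmap. Shapes and the decoder head are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 768
W_q, W_k, W_v = (nn.Linear(d, d, bias=False) for _ in range(3))

X_s = torch.randn(1, 8, d)        # instruction features (batch, tokens, d)
X_t = torch.randn(1, 900, d)      # map features flattened from a 30x30 grid

Q_s, K_t, V_t = W_q(X_s), W_k(X_t), W_v(X_t)
attn = F.softmax(Q_s @ K_t.transpose(-2, -1) / d ** 0.5, dim=-1)
H = attn @ V_t                    # aligned multimodal features H_t^s

# A decoder head (here a single linear layer, purely as a placeholder) maps
# the fused features to a per-cell heatmap; the peak gives position p_t.
heatmap = nn.Linear(d, 900)(H.mean(dim=1))        # (1, 900)
p_t = divmod(int(heatmap.argmax()), 30)           # (row, col) on the 30x30 grid
print(p_t)
```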

The following figure (Figure 4 from the original paper) visualizes this localizer pipeline:

Figure 4. The overall pipeline of the multimodal object localizer, which uses recovered instruction and observed semantic map to predict object positions for interaction. The object correlation graph is also learned to strengthen the map features.

Training: The Localizer is trained using Binary Cross-Entropy Loss to minimize the difference between the predicted heatmap and the ground-truth location of objects derived from expert demonstrations.
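A hedged sketch of that objective, assuming the decoder emits per-cell logits and the expert demonstration supplies a binary target mask; names and shapes are illustrative.

```python
# Sketch of the localizer's training objective: binary cross-entropy between
# the predicted heatmap and the ground-truth location mask from expert data.
import torch
import torch.nn.functional as F

pred_logits = torch.randn(1, 900, requires_grad=True)  # decoder output (pre-sigmoid)
gt_mask = torch.zeros(1, 900)                          # expert target position(s)
gt_mask[0, 412] = 1.0                                  # cell containing the object

loss = F.binary_cross_entropy_with_logits(pred_logits, gt_mask)
loss.backward()
```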

5. Experimental Setup

5.1. Datasets

  • Name: ALFRED (Action Learning From Realistic Environments and Directives).
  • Source: Based on the AI2-THOR simulator.
  • Scale: 25,743 trajectory-instruction pairs.
  • Complexity: 7 task types (e.g., "Pick & Place," "Stack & Create," "Clean & Place") in 120 scenes.
  • Splits:
    • Train: Used for learning.
    • Test Seen: Scenes that appeared in training (tests memory/interpolation).
    • Test Unseen: New rooms/scenes never seen before (tests generalization).
  • Example: A task might be "Place sliced lettuce into a bin." The instructions provided to the agent might be: "Take a knife. Cut the lettuce in the fridge." (Note the missing steps about handling the bin or opening the fridge).

5.2. Evaluation Metrics

The paper uses standard ALFRED metrics.

  1. Success Rate (SR):

    • Definition: The percentage of episodes where the agent successfully completes the entire task goal.
    • Formula: $SR = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}(\text{success}_i)$
    • Meaning: $\mathbb{I}$ is an indicator function (1 if success, 0 if fail). $N$ is the total number of episodes.
  2. Goal-Condition Success Rate (GC):

    • Definition: The percentage of specific sub-goals completed (e.g., if the task has 5 steps and the agent does 3, GC is 0.6).
    • Formula: $GC = \frac{1}{N} \sum_{i=1}^{N} \frac{\text{completed\_subgoals}_i}{\text{total\_subgoals}_i}$
  3. Path-Length-Weighted SR (PLWSR):

    • Definition: Penalizes successful agents that take inefficient, long paths.
    • Formula: $PLWSR = SR \times \frac{L_{\text{expert}}}{\max(L_{\text{expert}}, L_{\text{agent}})}$
    • Meaning: $L_{\text{expert}}$ is the shortest path length and $L_{\text{agent}}$ is the agent's path length. If the agent takes a huge detour, the score drops.
  4. Path-Length-Weighted GC (PLWGC):

    • Definition: Same as PLWSR but applied to the Goal-Condition score.
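For reference, the three headline metrics can be computed as follows; the episode dictionary fields are illustrative, not from the ALFRED toolkit.

```python
# Minimal reference implementations of the metrics described above.
def success_rate(episodes):
    return sum(ep["success"] for ep in episodes) / len(episodes)

def goal_condition(episodes):
    return sum(ep["completed_subgoals"] / ep["total_subgoals"]
               for ep in episodes) / len(episodes)

def plw_success_rate(episodes):
    # Path-length-weighted SR: weight each success by the expert/agent ratio.
    return sum(ep["success"] * ep["expert_len"] / max(ep["expert_len"], ep["agent_len"])
               for ep in episodes) / len(episodes)

episodes = [
    {"success": 1, "completed_subgoals": 3, "total_subgoals": 5,
     "expert_len": 40, "agent_len": 55},
    {"success": 0, "completed_subgoals": 1, "total_subgoals": 4,
     "expert_len": 30, "agent_len": 80},
]
print(success_rate(episodes), goal_condition(episodes), plw_success_rate(episodes))
```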

5.3. Baselines

The authors compare ThinkBot against two categories of methods:

  • End-to-End: Seq2seq, MOCA, E.T., LWIT. (Generally perform worse due to lack of planning).
  • Modular:
    • FILM: A strong modular baseline.
    • Prompter: The direct predecessor. It uses an LLM but only for object identification, not full chain-of-thought instruction recovery.
    • Prompter+: A stronger version of Prompter created by the authors, adding better memory and object detection to make the comparison fairer.

6. Results & Analysis

6.1. Core Results Analysis

The experiments show that ThinkBot establishes a new state-of-the-art (SOTA).

The following are the results from Table 1 of the original paper. This table compares ThinkBot against all major baselines on the "Test Seen" and "Test Unseen" splits.

| Method | PLWGC (Seen) | GC (Seen) | PLWSR (Seen) | SR (Seen) | PLWGC (Unseen) | GC (Unseen) | PLWSR (Unseen) | SR (Unseen) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Seq2seq | 6.27 | 9.42 | 2.02 | 3.98 | 4.26 | 7.03 | 0.08 | 3.90 |
| MOCA | 22.05 | 28.29 | 15.10 | 22.05 | 9.99 | 14.28 | 2.72 | 5.30 |
| E.T. | 34.93 | 45.44 | 27.78 | 38.42 | 11.46 | 18.56 | 4.10 | 8.57 |
| LWIT | 23.10 | 40.53 | 43.10 | 30.92 | 16.34 | 20.91 | 5.60 | 9.42 |
| HITUT | 17.41 | 29.97 | 11.10 | 21.27 | 11.51 | 20.31 | 5.86 | 13.87 |
| ABP | 4.92 | 51.13 | 3.88 | 44.55 | 2.22 | 24.76 | 1.08 | 15.43 |
| LLM-Planner | - | 26.77 | - | 18.20 | - | 23.37 | - | 16.42 |
| FILM | 15.59 | 39.55 | 11.27 | 28.83 | 15.13 | 38.52 | 11.32 | 27.80 |
| LGS-RPA | 28.97 | 48.66 | 21.28 | 40.05 | 22.76 | 45.24 | 22.76 | 35.41 |
| Prompter | 30.72 | 63.43 | 25.81 | 53.23 | 26.22 | 58.76 | 20.76 | 45.72 |
| CPEM | 27.49 | 59.40 | 22.61 | 50.62 | 27.00 | 61.10 | 22.61 | 49.84 |
| Prompter+ | 36.35 | 70.20 | 31.12 | 60.86 | 30.09 | 65.71 | 26.22 | 55.46 |
| ThinkBot (Ours) | 37.01 | 71.64 | 32.02 | 62.69 | 30.73 | 67.75 | 26.93 | 57.82 |

Dashes mark values missing from the source table (LLM-Planner's path-length-weighted scores); the column placement of that row is inferred from the surviving values.

Analysis:

  • Unseen Generalization: In the crucial "Test Unseen" split (new environments), ThinkBot achieves a Success Rate (SR) of 57.82%. This beats the previous best (Prompter+) of 55.46% and CPEM's 49.84%. This confirms that reasoning about missing steps helps the agent adapt to new situations better than just memorizing mapping patterns.
  • Efficiency: The PLWSR (Path-Length Weighted SR) is also highest (26.93%), meaning ThinkBot doesn't just succeed; it does so efficiently, taking direct paths rather than wandering aimlessly.

6.2. Ablation Studies

To prove that their specific components (Instruction Completer and Object Localizer) are necessary, the authors removed them one by one. They tested on "Valid Unseen" (normal difficulty) and "Hard Valid Unseen" (specifically difficult cases where objects are hidden inside closed containers, requiring interaction inference).

The following are the results from Table 2 of the original paper:

| Method | PLWGC (VU) | GC (VU) | PLWSR (VU) | SR (VU) | PLWGC (Hard VU) | GC (Hard VU) | PLWSR (Hard VU) | SR (Hard VU) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Random | 26.18 | 67.64 | 23.80 | 59.68 | 0.32 | 5.41 | 0 | 0 |
| Prompter+ | 29.36 | 72.00 | 26.82 | 64.43 | 0.48 | 5.41 | 0 | 0 |
| Groundtruth Location | 39.71 | 72.75 | 37.01 | 67.97 | 0.79 | 5.41 | 0 | 0 |
| w/o Instruction Completer | 29.09 | 72.38 | 26.43 | 64.92 | 0.48 | 5.41 | 0 | 0 |
| w/o Object Localizer | 30.24 | 74.37 | 27.87 | 66.99 | 9.29 | 22.41 | 8.11 | 16.22 |
| w/o Object Correlation Graph | 30.41 | 73.89 | 28.14 | 67.36 | 11.31 | 29.46 | 9.74 | 21.62 |
| ThinkBot (Ours) | 31.11 | 75.30 | 28.73 | 67.72 | 11.95 | 30.86 | 10.26 | 22.97 |

(VU = Valid Unseen; Hard VU = Hard Valid Unseen.)

Key Takeaways:

  1. Impact of Instruction Completer: In "Hard Valid Unseen" (where objects are hidden), removing the Instruction Completer drops the Success Rate (SR) to 0%. This is a stunning finding. It proves that without inferring the hidden "Open" action, the agent cannot possibly succeed in these hard tasks.
  2. Impact of Object Localizer: Removing the localizer (using Prompter's search policy instead) drops SR from 22.97% to 16.22% in hard cases. This confirms that even if you know what to do, you need the specialized localizer to find where to do it.
  3. Impact of Correlation Graph: Removing the object correlation graph reduces SR from 22.97% to 21.62%. While a smaller drop, it shows that knowing "apples are near fridges" helps.

6.3. Qualitative Analysis

The authors provide visual proof of the system's reasoning.

Action Sequence: The following figure (Figure 5 from the original paper) compares action sequences.

  • Top (Prompter+): Instructed to "Cut lettuce in fridge." It wanders around, failing to interact.

  • Bottom (ThinkBot): Correctly infers: Walk to fridge -> Open fridge -> Take lettuce -> Close fridge.

    Figure 5. Visualization of the agent action sequence acquired by Prompter+ (top) and our ThinkBot (bottom).

Localization: The following figure (Figure 6 from the original paper) shows the Object Localizer in action. It successfully highlights the pixels on the semantic map corresponding to "Pencil" (top) and multiple "Tomatoes" (bottom), aligning well with the Ground Truth (GT).

Figure 6. The visualization of the predicted and ground-truth positions of interacted objects, where the partially observed semantic maps are also depicted.

7. Conclusion & Reflections

7.1. Conclusion Summary

ThinkBot successfully tackles the problem of sparse and incoherent instructions in embodied agents. By leveraging the reasoning power of LLMs (via Chain of Thought) to recover missing steps and employing a specialized Multimodal Object Localizer to map these steps to the physical world, ThinkBot achieves state-of-the-art performance on the ALFRED benchmark. It proves that "thinking before acting"—specifically, filling in the logical gaps left by humans—is crucial for robust robotic assistants.

7.2. Limitations & Future Work

  • Hallucination: The authors acknowledge that providing too many candidate objects to the LLM can cause hallucinations (predicting objects that don't exist). They mitigated this by filtering candidates by room type, but it remains a risk.
  • Dependency on Pre-trained Models: The system relies heavily on the quality of the underlying LLM (GPT-3.5) and visual encoders (BERT, ResNet). If these models fail, ThinkBot fails.
  • 2D Maps: The system relies on 2D top-down semantic maps. Navigating complex 3D vertical spaces (e.g., high shelves vs. low shelves) might be limited by this 2D representation.

7.3. Personal Insights & Critique

  • The "Hard" Case Revelation: The most impressive result is the ablation study on "Hard Valid Unseen" tasks. The fact that the baseline scored 0% while ThinkBot scored 22.97% highlights that "Instruction Completion" isn't just an optimization; it's a necessity for complex interaction tasks. Without it, agents are fundamentally broken when facing hidden objects.
  • Bridge between NLP and Vision: ThinkBot is a classic example of "Neuro-Symbolic" reasoning. It uses the symbolic/logical reasoning of LLMs to guide the neural/pixel-based processing of the vision system. This hybrid approach is likely the future of robotics.
  • Generalization Potential: The method of "Learning Object Correlation Graphs" (Eq 2 & 3) is transferable. It could be applied to other domains like autonomous driving (e.g., "pedestrians are usually near crosswalks") to improve perception reliability.
