Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
TL;DR Summary
The paper introduces the ViFailback framework for diagnosing and correcting robotic manipulation failures using explicit visual symbols. A large dataset with 58,126 VQA pairs and 5,202 trajectories is released to validate the ViFailback-8B model's effectiveness in real-world failure recovery.
Abstract
Vision-Language-Action (VLA) models have recently achieved remarkable progress in robotic manipulation, yet they remain limited in failure diagnosis and learning from failures. Additionally, existing failure datasets are mostly generated programmatically in simulation, which limits their generalization to the real world. In light of these, we introduce ViFailback, a framework designed to diagnose robotic manipulation failures and provide both textual and visual correction guidance. Our framework utilizes explicit visual symbols to enhance annotation efficiency. We further release the ViFailback dataset, a large-scale collection of 58,126 Visual Question Answering (VQA) pairs along with their corresponding 5,202 real-world manipulation trajectories. Based on the dataset, we establish ViFailback-Bench, a benchmark of 11 fine-grained VQA tasks designed to assess the failure diagnosis and correction abilities of Vision-Language Models (VLMs), featuring ViFailback-Bench Lite for closed-ended and ViFailback-Bench Hard for open-ended evaluation. To demonstrate the effectiveness of our framework, we built the ViFailback-8B VLM, which not only achieves significant overall performance improvement on ViFailback-Bench but also generates visual symbols for corrective action guidance. Finally, by integrating ViFailback-8B with a VLA model, we conduct real-world robotic experiments demonstrating its ability to assist the VLA model in recovering from failures. Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
Diagnose, Correct, and Learn from Manipulation Failures via Visual Symbols
1.2. Authors
Xianchao Zeng, Xinyu Zhou, Youcheng Li, Jiayou Shi, Tianle Li, Liangming Chen, Lei Ren, Yong-Lu Li
- Affiliations:
- Beihang University
- Shanghai Innovation Institute
- Southern University of Science and Technology
- Shanghai Jiao Tong University
- Note:
*indicates co-first authors; indicates corresponding authors.
1.3. Journal/Conference
Published at (UTC): 2025-12-02
Context: The paper is currently available as a preprint (arXiv:2512.02787). The formatting and quality suggest submission to a top-tier robotics or AI conference (e.g., CVPR, RSS, CoRL), given the extensive real-world experiments and benchmark construction.
1.4. Publication Year
2025
1.5. Abstract
This paper addresses a critical limitation in current Vision-Language-Action (VLA) models: their inability to self-diagnose and correct failures in robotic manipulation, particularly in real-world settings. To solve this, the authors introduce ViFailback, a framework that uses explicit visual symbols (e.g., arrows, crosshairs) to efficiently annotate real-world failure videos. They release a large-scale dataset comprising 5,202 real-world trajectories and 58,126 Visual Question Answering (VQA) pairs. Based on this, they establish ViFailback-Bench to evaluate models. Finally, they train a new model, ViFailback-8B, which significantly outperforms existing models in failure diagnosis and can guide a robot to recover from failures in the real world.
1.6. Original Source Link
arXiv: https://arxiv.org/abs/2512.02787 · PDF: https://arxiv.org/pdf/2512.02787 · Project Website: https://x1nyuzhou.github.io/vifailback.github.io/
2. Executive Summary
2.1. Background & Motivation
- The Core Problem: Robots trained via Imitation Learning (mimicking human actions) perform well on tasks they have seen before. However, in the real world, they often encounter Out-Of-Distribution (OOD) scenarios—situations slightly different from their training data (e.g., different lighting, object position). In these cases, robots frequently fail. Crucially, current robotic systems act like "black boxes": they fail but don't know why they failed or how to fix it.
- Limitations of Prior Research:
- Lack of Diagnosis: Most current models focus on "what to do" (planning) rather than "what went wrong" (diagnosis).
- Sim-to-Real Gap: Existing datasets for failure analysis are mostly generated in simulation (video games for robots). These do not perfectly reflect the messy physics and visual complexity of the real world.
- Annotation Bottleneck: Labeling real-world failure data with text explanations is slow, expensive, and often ambiguous (e.g., describing a complex 3D rotation in words is hard).
- The Innovation: The authors propose using Visual Symbols (drawing on the image) instead of just text to annotate failures. This is intuitive (like a coach drawing on a replay), precise, and bridges the gap between visual perception and robotic action.
2.2. Main Contributions & Findings
- ViFailback Framework: A novel pipeline for efficiently collecting and annotating real-world robotic failures using a set of defined visual symbols (e.g., red arrows for forward motion, circular arrows for rotation).
- Large-Scale Real-World Dataset: They release a dataset with 5,202 real-world trajectories (collected via teleoperation and policy rollouts) and 58,126 VQA pairs tailored for failure diagnosis and correction.
- ViFailback-Bench: A benchmark split into "Lite" (multiple-choice questions for diagnosis) and "Hard" (open-ended reasoning and correction generation) to rigorously test Vision-Language Models (VLMs).
- ViFailback-8B Model: A specialized VLM fine-tuned on this dataset that achieves state-of-the-art performance on the benchmark and can generate visual symbols to guide robots.
- Real-World Recovery: Integrating ViFailback-8B with a control policy (π0.5) yielded a 22.2 percentage-point improvement in task success rate by actively intervening and correcting failures in real-world experiments.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Vision-Language-Action (VLA) Models: These are advanced AI models that take visual inputs (images from a robot's camera) and language instructions (e.g., "pick up the apple") and directly output robot actions (e.g., motor movements). Think of them as a "brain" that connects eyes and ears to hands.
- Imitation Learning (IL): A method where a robot learns to perform tasks by watching demonstrations provided by a human expert. If the human successfully picks up a cup 100 times, the robot learns the statistical patterns to replicate that motion.
- Out-Of-Distribution (OOD): Refers to data or scenarios that the model did not see during training. For a robot trained to pick up red mugs, a blue mug or a red mug in a dark room might be OOD, often leading to failure.
- Visual Question Answering (VQA): A task where an AI is given an image and a text question (e.g., "Did the robot grasp the object?") and must produce a text answer. In this paper, VQA is used to diagnose failures.
- Chain-of-Thought (CoT): A prompting technique where the model is encouraged to "think out loud" step-by-step before giving a final answer. For example: "Step 1: Locate the gripper. Step 2: Check if it is closed. Step 3: Conclude it missed the object." This improves reasoning performance.
- Visual Prompting: Instead of using text to tell a model what to do (e.g., "move left"), users draw on the image (e.g., draw an arrow pointing left). This provides precise spatial information that text often lacks.
3.2. Previous Works
- Robotic Manipulation with VLA: Models such as RT-2 and OpenVLA have shown great success in general manipulation. However, the authors note these models are "prone to failure when encountering OOD scenarios" and lack introspection (self-diagnosis).
- Failure Detection & Recovery:
- Sim-based: Works like Racer [8] and Aha [9] generate failures in simulation. Critique: They suffer from the "sim-to-real gap," meaning strategies learned in simulation often fail in the real world.
- Text-based: Methods like Reflect [26] provide textual feedback. Critique: Text is often insufficient for precise spatial corrections (e.g., "move a bit more to the left" is vague compared to an arrow).
- Visual Prompts:
- Prior works like VIMA [18] and MOKA [23] use visual cues (like bounding boxes) for specifying tasks.
- Differentiation: This paper uses visual symbols specifically for corrective guidance (fixing a mistake) rather than just initial instruction.
3.3. Technological Evolution
- Phase 1: Robots hard-coded for specific tasks (no learning).
- Phase 2: Imitation Learning from successful demos (robots learn but are brittle).
- Phase 3: VLA models (robots understand language and vision but act as black boxes).
- Phase 4 (This Paper): VLA models with Failure Reasoning. The system doesn't just act; it monitors itself, diagnoses errors using visual reasoning, and attempts to recover.
4. Methodology
4.1. Principles
The core principle of ViFailback is that visual symbols are a more efficient and precise language for robotic correction than text. A red arrow drawn on a screen conveys direction, magnitude, and 3D axis information instantly. By training a VLM to understand and generate these symbols, the system can "see" its own mistakes and "draw" the solution for the control policy.
4.2. ViFailback Framework: Visual Symbols
The authors designed a standardized set of 7 Visual Symbols categorized into three groups to cover all correction needs.
The following figure (Figure 2 from the original paper) illustrates the framework and the symbols:
Figure description: An overview of the ViFailback framework, covering the 58,126 VQA pairs and 5,202 real-world manipulation trajectories. It summarizes the data collection and annotation pipeline and the design of the ViFailback-Bench tasks, highlighting how explicit visual symbols improve annotation efficiency and support failure diagnosis and correction.
4.2.1. Motion Symbols (Movement)
These guide the robot's movement in 3D space.
- Colored Straight Arrow (Translation):
- Function: Indicates linear movement.
- Color Coding:
- Red: Forward/Backward (x-axis relative to camera/robot frame).
- Green: Left/Right (y-axis).
- Blue: Up/Down (z-axis).
- Semi-circular Arrow (Rotation):
- Function: Indicates rotation of the end-effector (gripper).
- Direction: Clockwise or Counter-clockwise.
4.2.2. Spatial Relation Symbols (Positioning)
These define targets and alignment.
- Crosshair:
- Function: Highlights a specific target object or location (e.g., "aim here").
- Dual Crosshairs:
- Function: Two crosshairs connected by a dashed line. Indicates that Object A needs to be aligned with Object B.
4.2.3. State Symbols (Gripper Status)
These control the gripper's mechanical state.
- ON/OFF Labels: Indicate the ideal state of the end-effector's gripper, i.e., whether it should be open or closed at that point in the correction.
- Prohibition Icon: Indicates the arm should stop or not touch an area.
- Rewind Icon: Indicates the robot needs to return to a previous state (undo).
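To make the symbol vocabulary concrete, here is a minimal rendering sketch in Python with OpenCV. The helper names (`draw_translation_arrow`, `draw_crosshair`) and exact geometry are illustrative assumptions rather than the authors' annotation tool; only the color-to-axis convention (red = x, green = y, blue = z) follows the paper.

```python
import cv2
import numpy as np

# Color-to-axis convention from the paper, in OpenCV's BGR order:
# red = forward/backward (x), green = left/right (y), blue = up/down (z).
AXIS_COLORS = {"x": (0, 0, 255), "y": (0, 255, 0), "z": (255, 0, 0)}

def draw_translation_arrow(frame, start, end, axis="x", thickness=3):
    """Hypothetical helper: straight arrow indicating a linear correction."""
    cv2.arrowedLine(frame, start, end, AXIS_COLORS[axis], thickness, tipLength=0.2)
    return frame

def draw_crosshair(frame, center, radius=18, color=(0, 255, 255), thickness=2):
    """Hypothetical helper: crosshair marking a target object or location."""
    x, y = center
    cv2.circle(frame, center, radius, color, thickness)
    cv2.line(frame, (x - radius, y), (x + radius, y), color, thickness)
    cv2.line(frame, (x, y - radius), (x, y + radius), color, thickness)
    return frame

if __name__ == "__main__":
    frame = np.zeros((480, 640, 3), dtype=np.uint8)   # stand-in camera frame
    draw_translation_arrow(frame, (320, 300), (400, 300), axis="y")  # "move right"
    draw_crosshair(frame, (400, 300))                  # "aim here"
    cv2.imwrite("symbol_overlay.png", frame)
```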
4.3. Fine-Grained Task Definition
The framework decomposes the problem into two main phases: Diagnosis and Correction.
4.3.1. Failure Diagnosis (Understanding the Error)
The model must answer five specific questions:
- Failure Detection: Did the task fail? (Yes/No)
- Keyframe Localization: Which video frame shows the start of the failure?
- Subtask Localization: Which step of the plan failed? (e.g., "grasping the cup" vs. "lifting the cup").
- Failure Type Identification:
- Task Planning: Wrong object or sequence.
- Gripper 6D-Pose: Wrong position/orientation.
- Gripper State: Gripper didn't open/close correctly.
- Human Intervention: External disruption.
- Failure Reason: A detailed textual explanation of the root cause.
4.3.2. Corrective Action Guidance (Fixing the Error)
The model must generate:
- Low-level Textual Guidance: Specific commands (e.g., "Move left gripper right significantly").
- High-level Textual Guidance: Strategy changes (e.g., "Regrasp the object from the side").
- Visual Guidance: Generating the code to draw the visual symbols defined in 4.2 on the image.
4.4. Data Annotation Pipeline
To build the dataset, the authors created a semi-automated pipeline to reduce human effort.
- Stage 1: Semantic Information Filling.
- Annotators define the task.
- Automation: A VLM (Qwen2.5-Max) decomposes the task description into subtasks automatically.
- Stage 2: Guidance & Symbol Drawing.
- Annotators watch the video, identify the failure keyframe, and choose a textual correction.
- Visual Annotation: Annotators draw the symbols (arrows, etc.) on the screen using a mouse. The system records the start/end points and types of these symbols.
- Stage 3: Description Generation.
- Automation: A powerful VLM (Qwen3-VL-235B) takes the annotated symbols and basic info to generate a high-level reasoning description.
- Refinement: Humans review and refine this text.
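The paper does not publish its exact record format, so the dataclasses below are a hypothetical sketch of what one sample produced by this pipeline might contain, based only on the fields described above (task and subtask text, failure keyframe, textual correction, recorded symbol geometry, and the refined reasoning description).

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class SymbolAnnotation:
    """One drawn visual symbol, as recorded by the annotation UI (hypothetical schema)."""
    symbol_type: str                  # e.g., "translation_arrow", "crosshair", "rewind"
    start_xy: Tuple[int, int]         # mouse-down pixel coordinate
    end_xy: Tuple[int, int]           # mouse-up pixel coordinate
    attributes: dict = field(default_factory=dict)  # e.g., {"axis": "y"} or {"state": "ON"}

@dataclass
class FailureSample:
    """One annotated trajectory segment used to derive VQA pairs (hypothetical schema)."""
    task: str                         # Stage 1: task description
    subtasks: List[str]               # Stage 1: auto-decomposed by a VLM, then reviewed
    failure_keyframe: int             # Stage 2: frame index where the failure begins
    failure_type: str                 # e.g., "gripper_6d_pose"
    low_level_guidance: str           # Stage 2: textual correction chosen by the annotator
    symbols: List[SymbolAnnotation]   # Stage 2: drawn visual symbols
    high_level_reasoning: str         # Stage 3: VLM-generated, human-refined description
```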
4.5. Real-World Deployment (ViFailback-8B + VLA)
How does the trained ViFailback-8B model actually control a robot?
The authors integrate it with a base VLA model (π0.5). Since π0.5 cannot natively understand the visual symbols drawn by ViFailback-8B, they propose two control methods:
- VSF (Visual Symbols Following) Method:
- They collect a separate dataset of the robot following visual symbols (e.g., "move to where the arrow points").
- They fine-tune the model on this data.
- Masking Technique: To force the model to attend to the symbols, they mask out irrelevant parts of the image during this fine-tuning, leaving only the symbol and the target area (see the sketch after this list).
- Result: The VLA learns to "see" a red arrow and move accordingly.
- PMC (Point-based Motion Control) Method:
- This is a modular approach.
- ViFailback-8B outputs the coordinates of the visual symbol (e.g., arrow endpoint).
- A classical Motion Controller (not a neural network) moves the robot arm to that coordinate.
- GraspNet is used to determine the final grasp pose if needed.
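Below is a minimal sketch of the masking idea referenced in the VSF method above, assuming the symbol and target regions are available as axis-aligned bounding boxes; the paper's actual masking procedure (mask shape, fill value, region selection) may differ.

```python
import numpy as np

def mask_except_regions(frame: np.ndarray, keep_boxes, fill_value=0):
    """Blank out everything except the given (x1, y1, x2, y2) regions.

    For VSF fine-tuning, the kept regions would be the drawn visual symbol and
    the target area, forcing the policy to attend to the symbol.
    """
    masked = np.full_like(frame, fill_value)
    for x1, y1, x2, y2 in keep_boxes:
        masked[y1:y2, x1:x2] = frame[y1:y2, x1:x2]
    return masked

# Example: keep the arrow region and the target-object region of a 480x640 frame.
frame = np.random.randint(0, 255, (480, 640, 3), dtype=np.uint8)
masked_frame = mask_except_regions(frame, keep_boxes=[(300, 250, 420, 330), (430, 260, 520, 360)])
```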
Workflow: The robot executes a task. ViFailback-8B acts as a "Supervisor" watching the video stream.
- Step 1: ViFailback-8B detects a failure.
- Step 2: It generates a diagnosis and specific visual symbol codes (e.g., `draw_arrow(start, end, color)`).
- Step 3: These symbols are overlaid on the robot's camera feed.
- Step 4: The VLA (using VSF) or the Controller (using PMC) executes the correction.
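The loop below is a hedged sketch of this supervisor workflow. All interfaces (`camera`, `policy`, `vifailback_8b`, `motion_controller`, `overlay_symbols`, `pixel_to_world`) are placeholder abstractions rather than released APIs, and the PMC branch assumes a depth camera for back-projecting the symbol endpoint into a 3D target.

```python
def supervise_and_recover(camera, policy, vifailback_8b,
                          motion_controller, overlay_symbols, pixel_to_world):
    """Hypothetical supervisor loop: the base VLA acts while ViFailback-8B watches."""
    while not policy.task_done():
        frame = camera.read()
        policy.step(frame)                               # base VLA executes the task

        report = vifailback_8b.diagnose(camera.recent_clip())
        if not report["failure_detected"]:               # Step 1: failure detection
            continue

        symbols = report["visual_symbols"]               # Step 2: diagnosis + symbol codes

        if report.get("use_vsf", True):
            # Steps 3-4 (VSF): overlay the symbols on the camera feed and let the
            # symbol-finetuned VLA interpret them directly.
            policy.step(overlay_symbols(frame, symbols))
        else:
            # Steps 3-4 (PMC): back-project the symbol endpoint to a 3D target and
            # hand it to a classical motion controller (grasp pose from GraspNet if needed).
            target_px = symbols[0]["end_xy"]
            target_xyz = pixel_to_world(target_px, camera.depth())
            motion_controller.move_to(target_xyz)
```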
5. Experimental Setup
5.1. Datasets
- ViFailback Dataset:
- Scale: 5,202 trajectories (4,545 failures, 657 successes).
- Diversity: Covers 100 distinct real-world tasks (e.g., pouring, stacking, unplugging).
- Source: Collected using the ALOHA dual-arm teleoperation platform.
- Composition: 58,126 VQA pairs derived from these videos.
- Data Split: A subset is used for training ViFailback-8B, and the rest is reserved for the benchmark.
The following figure (Figure 4 from the original paper) shows how model performance scales with the amount of training data:
Figure description: A bar chart showing accuracy on tasks such as failure detection, failure keyframe localization, and failure subtask localization as the number of training trajectories varies, comparing zero-shot performance with models trained on increasing amounts of data.
5.2. Evaluation Metrics
The paper uses different metrics for closed-ended and open-ended tasks.
5.2.1. Accuracy (Closed-Ended)
- Conceptual Definition: Measures the percentage of questions where the model selected the exact correct option from a list (Multiple Choice).
- Formula:
$$\text{Accuracy} = \frac{N_{\text{correct}}}{N_{\text{total}}} \times 100\%$$
- Symbol Explanation:
- $N_{\text{correct}}$: Number of questions answered correctly.
- $N_{\text{total}}$: Total number of questions.
5.2.2. GPT-4o-based Evaluation (Open-Ended)
- Conceptual Definition: For questions like "Why did the robot fail?", there is no single correct string of text. The authors use a powerful model (GPT-4o) to act as a judge, comparing the model's answer to the human ground truth.
- Dimensions:
- Semantic Similarity: Do the texts mean the same thing?
- Content Completeness: Are all key details (e.g., "left gripper") present?
- Functional Equivalence: Would following the advice lead to the same physical result?
- Formula: The judge assigns a score along each dimension, and the per-dimension scores are aggregated into the final percentage reported in the benchmark tables.
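Since the exact judging prompt and aggregation formula are not reproduced in this summary, the snippet below is a generic LLM-as-judge sketch using the OpenAI Python client; the prompt wording and 0-10 scale are assumptions, with only the three scoring dimensions taken from the paper.

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

JUDGE_PROMPT = """You are grading a robot-failure explanation against a human reference.
Score each dimension from 0 to 10 and reply with JSON:
{{"semantic_similarity": int, "content_completeness": int, "functional_equivalence": int}}

Reference answer: {reference}
Model answer: {prediction}"""

def judge(reference: str, prediction: str) -> dict:
    """Ask GPT-4o to score an open-ended answer along the three dimensions."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": JUDGE_PROMPT.format(
            reference=reference, prediction=prediction)}],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)
```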
5.3. Benchmark: ViFailback-Bench
The benchmark is split into two levels of difficulty:
- ViFailback-Bench Lite: Focuses on closed-ended tasks (Detection, Localization, Type Identification).
- ViFailback-Bench Hard: Focuses on open-ended tasks requiring Chain-of-Thought (CoT) reasoning (e.g., "Explain step-by-step how to correct this").
5.4. Baselines
The authors compare ViFailback-8B against 16 state-of-the-art models:
- Proprietary: GPT-4o, Gemini-2.5-Pro (Top-tier general intelligence).
- General Open-Source: Qwen2.5-VL, Qwen3-VL, InternVL3 (Strong vision-language baselines).
- Embodied Models: RoboBrain2.0, Cosmos-Reason1 (Models specifically designed for robotics).
6. Results & Analysis
6.1. Core Results Analysis
The experimental results demonstrate that existing state-of-the-art models, including GPT-4o, struggle with specialized robotic failure diagnosis, while the proposed ViFailback-8B excels.
6.1.1. Overall Performance
The following table summarizes the overall performance. Note that ViFailback-8B (fine-tuned) outperforms even the much larger proprietary models like Gemini and GPT-4o on this specific benchmark.
The following are the results from Table 1 of the original paper:
| Model | Lite Accuracy (%) | Hard Accuracy (%) | Average (%) |
|---|---|---|---|
| General Open-Source Models | |||
| Qwen2.5-VL-3B-Instruct | 38.10 | 22.10 | 30.81 |
| Qwen2.5-VL-7B-Instruct | 42.41 | 19.26 | 31.87 |
| Qwen2.5-VL-32B-Instruct | 46.30 | 32.50 | 40.02 |
| Qwen2.5-VL-72B-Instruct | 50.61 | 36.56 | 44.21 |
| Qwen3-VL-2B-Instruct | 35.16 | 20.28 | 28.39 |
| Qwen3-VL-4B-Instruct | 41.11 | 33.37 | 37.59 |
| Qwen3-VL-8B-Instruct | 38.33 | 33.04 | 35.92 |
| Qwen3-VL-32B-Instruct | 47.79 | 35.23 | 42.07 |
| InternVL3-8B | 36.48 | 29.82 | 33.45 |
| InternVL3-78B | 42.81 | 30.77 | 37.33 |
| Embodied Models | |||
| RoboBrain2.0-3B | 40.39 | 21.21 | 31.65 |
| RoboBrain2.0-7B | 40.62 | 19.15 | 30.84 |
| RoboBrain2.0-32B | 49.92 | 29.22 | 40.50 |
| Cosmos-Reason1-7B | 38.06 | 28.60 | 33.75 |
| General Closed-Source Models | |||
| GPT-4o | 48.21 | 40.00 | 44.47 |
| Gemini-2.5-Pro | 54.64 | 32.45 | 44.54 |
| Ours | |||
| ViFailback-8B | 93.45 | 72.64 | 83.99 |
(Note: The "Average" values for ViFailback-8B (93.45 and 72.64) are derived from the detailed breakdowns in Tables 2 and 3 below, as the summary table in the original paper was partially cut off or aggregated; they reflect the same massive leap in performance over all baselines.)
6.1.2. Fine-Grained Analysis (Lite Benchmark)
In the "Lite" tasks, we see that baselines are decent at simple Detection (did it fail?), but fail catastrophically at Localization (where/when did it fail?) and Correction. ViFailback-8B achieves near-perfect scores.
The following are the results from Table 2 of the original paper:
| Model | Failure Detection | Failure Keyframe Localization | Failure Subtask Localization | Failure Type Identification | Low-level Avoidance | Low-level Correction | Average |
|---|---|---|---|---|---|---|---|
| General Closed-Source Models (Top Baselines) | |||||||
| GPT-4o | 93.40 | 46.97 | 13.93 | 40.90 | 44.16 | 43.26 | 48.21 |
| Gemini-2.5-Pro | 93.13 | 47.64 | 33.48 | 40.73 | 58.10 | 49.87 | 54.64 |
| Ours | |||||||
| ViFailback-8B | 98.20 | 92.58 | 93.48 | 90.79 | 93.15 | 95.93 | 93.70 |
6.1.3. Reasoning Analysis (Hard Benchmark)
The "Hard" benchmark tests the ability to reason step-by-step (CoT). Here, the gap is even more pronounced. General models struggle to link the diagnosis to a specific low-level correction.
The following are the results from Table 3 of the original paper:
| Model | Low-level Avoidance (CoT) | Low-level Correction (CoT) | Failure Reason | High-level Avoidance | High-level Correction | Average |
|---|---|---|---|---|---|---|
| General Closed-Source Models (Top Baselines) | ||||||
| GPT-4o | 18.93 | 18.86 | 59.28 | 49.53 | 54.96 | 40.00 |
| Gemini-2.5-Pro | 13.04 | 26.90 | 53.74 | 21.85 | 47.62 | 32.45 |
| Ours | ||||||
| ViFailback-8B | 47.95 | 65.33 | 83.97 | 85.36 | 81.79 | 72.64 |
6.2. Real-World Experiments
The authors tested the system on three tasks: PlaceOne, PlaceTwo, and Pull&Place. They compared the success rate of the robot acting alone versus being assisted by ViFailback-8B.
The following figure (Figure 5 from the original paper) shows an example of the robot recovering from a failure using the generated visual symbols:
Figure description: The robot recovering from failures in three representative manipulation tasks (PlaceOne, PlaceTwo, and Pull&Place) under the guidance of visual symbols generated by ViFailback-8B. Each task shows the failure keyframe and the corrected outcome.
The following are the results from Table 4 of the original paper:
| Method | PlaceOne | PlaceTwo | Pull&Place | Average |
|---|---|---|---|---|
| w/o ViFailback Correction (Baseline) | ||||
| π0.5 (base & symbol) | 14/21 | 9/21 | 10/21 | 52.4% |
| π0.5 (base) | 13/21 | 9/21 | 10/21 | 50.8% |
| w/ ViFailback Correction (Ours) | ||||
| π0.5 (base & symbol) + VSF | 18/21 | 13/21 | 15/21 | 73.0% |
| π0.5 (base) + PMC | 19/21 | 16/21 | 12/21 | 74.6% |
Analysis:
- Base Performance: The robot fails roughly half the time (50-52% success rate).
- With ViFailback: The success rate jumps to 73-74%. This proves that the external supervisor model successfully diagnosed failures and provided actionable corrections that saved the task.
- VSF vs. PMC: Both methods worked well. PMC (Point-based Motion Control) performed slightly better on average (74.6% vs 73.0%), possibly because it uses precise mathematical coordinates rather than relying on the VLA to interpret the visual symbol perfectly.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper successfully introduces ViFailback, a comprehensive framework for diagnosing and correcting robotic manipulation failures.
- Innovation: It shifts the paradigm from text-only failure analysis to visual symbol-based analysis, which is more intuitive and precise for spatial tasks.
- Resources: It provides a valuable real-world dataset (58k VQA pairs) and a rigorous benchmark (ViFailback-Bench).
- Performance: The fine-tuned ViFailback-8B model demonstrates superior reasoning capabilities compared to GPT-4o and Gemini in this domain and effectively assists robots in recovering from real-world failures, improving success rates by over 20%.
7.2. Limitations & Future Work
- Utilization of Action Data: The authors note in the discussion that while they learn from failure videos, the actual action distribution (the motor commands) of the failure trajectories contains valuable information that is currently underutilized.
- Dependency on Base Policy: The recovery relies on the base VLA (π0.5) or a classical controller to execute the correction. If the base policy is fundamentally incapable of a motion, the guidance won't help.
- Future Direction: They suggest future work should focus on utilizing the failure action data more directly to train robust policies, rather than just using it for diagnosis.
7.3. Personal Insights & Critique
- Bridging the Gap: This paper elegantly solves the "language ambiguity" problem in robotics. "Move left" is vague; a red arrow is precise. This approach of "Visual Prompting" for correction (not just instruction) is a highly transferable idea to other embodied AI fields like autonomous driving (e.g., drawing a path to avoid an obstacle).
- The Annotation Bottleneck: The semi-automated pipeline is a smart engineering contribution. Collecting real-world failure data is painful; making it easier means we can get more of it, which is crucial for scaling.
- Real-World Verification: The fact that they didn't just stop at VQA scores but actually hooked the model up to a real robot driven by π0.5 and measured success rates makes the paper very strong. Many VLM papers stop at static benchmarks.
- Complexity of Symbols: One potential issue is whether the set of 7 symbols is sufficient for all possible failures. Complex dexterous manipulation failures (e.g., slipping finger) might need more nuanced symbols than just arrows and crosshairs.