
I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Published: 09/19/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

The I-FailSense framework is presented for detecting failures in robotic manipulation, focusing on semantic misalignment errors. It builds datasets for detecting these failures and post-trains a VLM with classification heads at multiple layers, showing superior detection performance compared to state-of-the-art VLMs of similar and larger size.

Abstract

Language-conditioned robotic manipulation in open-world settings requires not only accurate task execution but also the ability to detect failures for robust deployment in real-world environments. Although recent advances in vision-language models (VLMs) have significantly improved the spatial reasoning and task-planning capabilities of robots, they remain limited in their ability to recognize their own failures. In particular, a critical yet underexplored challenge lies in detecting semantic misalignment errors, where the robot executes a task that is semantically meaningful but inconsistent with the given instruction. To address this, we propose a method for building datasets targeting Semantic Misalignment Failures detection, from existing language-conditioned manipulation datasets. We also present I-FailSense, an open-source VLM framework with grounded arbitration designed specifically for failure detection. Our approach relies on post-training a base VLM, followed by training lightweight classification heads, called FS blocks, attached to different internal layers of the VLM and whose predictions are aggregated using an ensembling mechanism. Experiments show that I-FailSense outperforms state-of-the-art VLMs, both comparable in size and larger, in detecting semantic misalignment errors. Notably, despite being trained only on semantic misalignment detection, I-FailSense generalizes to broader robotic failure categories and effectively transfers to other simulation environments and real-world with zero-shot or minimal post-training. The datasets and models are publicly released on HuggingFace (Webpage: https://clemgris.github.io/I-FailSense/).


In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

I-FailSense: Towards General Robotic Failure Detection with Vision-Language Models

Central topic: The paper addresses automatic failure detection in language-conditioned robotic manipulation, with a particular focus on detecting semantic misalignment errors (the robot executes a semantically meaningful task that mismatches the instruction). It proposes a VLM-based framework, I-FailSense, that post-trains a base model and adds lightweight classification heads at multiple internal layers to robustly classify success/failure in robot trajectories.

1.2. Authors

  • Clémence Grislain (equal contribution)

  • Hamed Rahimi (equal contribution)

  • Olivier Sigaud

  • Mohamed Chetouani

    Affiliation: Institut des Systèmes Intelligents et de Robotique (ISIR), Sorbonne Université, Paris, France. Emails: {grislain, rahimi, sigaud, chetouani}@isir.upmc.fr

Research background: ISIR is a leading French robotics and AI lab. The author team’s expertise spans vision-language modeling, human-robot interaction, robotic manipulation, and multimodal representation learning.

1.3. Journal/Conference

Publication venue: arXiv (preprint).
Comment: arXiv is a widely used open-access repository for disseminating research prior to or alongside peer review. While influential for rapid sharing, it is not itself a peer-reviewed venue. The work may later appear at a conference like ICRA, CoRL, RSS, or in a robotics/ML journal.

1.4. Publication Year

2025 (published at UTC: 2025-09-19T15:19:38.000Z).

1.5. Abstract

The paper argues that open-world, language-conditioned robotic manipulation requires accurate task execution and robust failure detection. Existing vision-language models (VLMs) have strong spatial reasoning and planning capabilities but are limited in recognizing their own failures, especially semantic misalignment errors (task executed is meaningful but inconsistent with the instruction). The authors propose:

  • A method to construct Semantic Misalignment Failures (SMF) datasets from existing language-conditioned manipulation datasets.

  • I-FailSense: an open-source VLM framework with grounded arbitration designed for failure detection.

    Approach: Two-stage post-training of a base VLM (PaliGemma2-mix-3B), then training lightweight binary classification heads (FS blocks) attached to multiple internal layers. Predictions are aggregated via a weighted ensembling/voting mechanism.

Results: I-FailSense outperforms state-of-the-art VLMs of similar and larger size in detecting semantic misalignment. It generalizes to broader failure categories (control errors), transfers across simulation environments (RLBench vs CALVIN), and to real-world with zero-shot or minimal post-training. Datasets and models are publicly released (HuggingFace; website provided).

2. Executive Summary

2.1. Background & Motivation

  • Core problem: Automatic detection of failures in language-conditioned robotic manipulation, particularly semantic misalignment errors where the robot’s executed behavior is meaningful but mismatches the instruction (e.g., “rotate left” executed as “rotate right”).
  • Importance: Failure detection underpins robust deployment by enabling reward shaping, policy evaluation, error recovery, and reliable human-robot interaction. Foundation models (e.g., VLA—vision-language-action) benefit from autonomous feedback loops but still need self-detection of failures without external supervision.
  • Challenge in prior work: Most failure detection focuses on control errors (failed grasp, dropped object, etc.). Semantic misalignment instead requires temporal-spatial grounding of the instruction in object-level interactions and the motion trajectory, which is harder than relying on visual cues alone.
  • Entry point/innovation: Build SMF datasets by re-pairing expert trajectories with mismatched but same-category instructions, and augment VLMs with dedicated, grounded classification heads (FS blocks) attached to intermediate language representations to improve discriminative failure detection, aggregated via ensembling/voting.

2.2. Main Contributions / Findings

  • Dataset construction pipeline (SMF): From existing language-conditioned expert demonstrations, construct balanced success/failure pairs by pairing trajectories with original instructions (positive) or different instructions from the same task type (negative).
  • I-FailSense framework:
    1. Stage 1: Parameter-efficient post-training (LoRA) of a base VLM (PaliGemma2-mix-3B), tuning the projector MLP and KQV attention projections, with a frozen vision encoder (SigLIP).
    2. Stage 2: Train lightweight binary classification heads (FS blocks) attached to multiple internal layers; aggregate their predictions with a weighted voting mechanism alongside the VLM’s own prediction.
  • Key results:
    • On SMF-CALVIN (semantic misalignment, simulated): up to 90% accuracy; better precision-recall balance than zero-shot VLMs (e.g., Qwen2.5-VL).
    • On AHA (mostly control errors, simulated): 89% failure detection rate, surpassing AHA models trained on AHA (7B and 13B) by roughly +19 points.
    • Real-world SMF-DROID: up to 74% accuracy with minimal fine-tuning of FS blocks; sim-to-real transfer benefits from Stage 1 LoRA representations; dual PoV (egocentric + exocentric) helps in real-world settings.
  • Conclusion: Attaching grounded classification heads to different levels of VLM internal representations yields robust and generalizable failure detection. Training on semantic misalignment confers transferable understanding that aids detection of other error types.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-language models (VLMs): Neural models jointly processing images (or video) and text to perform tasks like visual question answering, captioning, and multimodal reasoning. They typically integrate a visual encoder (e.g., SigLIP) with a language model (LLM) via a projector that maps visual features into token-like embeddings for the LLM.
  • Large language models (LLMs): Transformer-based models trained on massive text corpora to perform generative and reasoning tasks. In multimodal settings, LLMs consume both textual tokens and projected visual tokens.
  • Vision-language-action (VLA) models: Extend VLMs to generate actions for robotics (e.g., Octo, OpenVLA), aligning multimodal perception with control outputs.
  • Partially Observable Markov Decision Process (POMDP): Formalism for decision making under partial observability. Here, manipulation tasks are modeled with states, actions, transitions, and an observation function providing visual inputs. The paper defines the manipulation setting as a goal-conditioned POMDP and frames failure detection as binary classification over trajectories given a language goal.
  • Semantic misalignment errors: Failures where the robot’s behavior is reasonable but does not satisfy the instruction (wrong object or wrong operation/parameter). Distinct from control errors (low-level inability to execute).
  • LoRA (Low-Rank Adaptation): A parameter-efficient fine-tuning scheme for LLMs that injects low-rank trainable matrices into attention (and/or MLP) layers, enabling task adaptation without full model retraining. A typical LoRA formulation for a weight matrix $W \in \mathbb{R}^{d \times k}$ adds a low-rank update $\Delta W = BA$, where $A \in \mathbb{R}^{r \times k}$ and $B \in \mathbb{R}^{d \times r}$ with small rank $r$, yielding $W' = W + \Delta W$. Symbols: $W$ original weight, $W'$ adapted weight, $A$ down-projection, $B$ up-projection, $r$ rank. (A minimal code sketch appears after this list.)
  • SigLIP (Sigmoid Loss for Language-Image Pretraining): A vision-language pretraining scheme that replaces softmax cross-entropy with a sigmoid loss to learn image-text alignment, often improving transfer and simplifying training; used as the frozen visual encoder.
  • Ensembling/voting: Aggregating predictions from multiple classifiers to improve robustness and accuracy. Weighted voting assigns different influence to each predictor.
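To make the LoRA update above concrete, here is a minimal PyTorch sketch of a low-rank-adapted linear layer. This is illustrative only; the paper applies LoRA to the projector MLP and KQV projections via standard tooling, and the class and parameter names below are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Minimal LoRA wrapper: output = x W^T + scale * x (B A)^T, with W frozen."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # original weight W stays frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) * 0.01)   # down-projection A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))          # up-projection B in R^{d x r}, zero-init so dW = 0 at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus trainable low-rank correction, equivalent to using W' = W + B A
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

# Usage (hypothetical): wrap an attention projection, e.g. q_proj = LoRALinear(q_proj, r=8)
```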

3.2. Previous Works

  • AHA (Duan et al., 2024): A VLM for failure analysis in RLBench that frames detection and explanation as generative reasoning, producing free-form explanations and integrating with error recovery. While powerful for explanations, its architecture focuses less on discriminative binary classification accuracy.

  • SAFE (Gu et al., 2025): Uses features from the final layer of a VLA to predict control errors (multitask failure detection), without exploiting multi-level internal representations or specific semantic misalignment emphasis.

  • Visual Instruction Tuning (Liu et al., NeurIPS 2023): Method for adapting VLMs with instruction-following style data; relevant to Stage 1 fine-tuning paradigms.

  • VL/VLA foundations: PaLI-Gemma 2 (Steiner et al., 2024) as base VLM; Octo, OpenVLA for generalist robot policies; RLBench (James et al., 2020) and CALVIN (Mees et al., 2022) as simulation benchmarks; DROID (Khazatsky et al., 2024) as real-world dataset.

    Additionally, the core Transformer attention mechanism is relevant background (not reprinted in the paper): $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right) V $, where $Q$ is the query matrix, $K$ the key matrix, $V$ the value matrix, and $d_k$ the dimensionality of the keys. Attention computes weights from query-key similarity, normalizes them with a softmax, and aggregates the values; this is central to VLM cross-modal fusion, with projected visual tokens attended to inside the LLM. (A minimal sketch follows.)
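Since this formula underlies the VLM's cross-modal fusion, a minimal PyTorch sketch of scaled dot-product attention is given below (standard background, not code from the paper):

```python
import math
import torch

def scaled_dot_product_attention(Q: torch.Tensor, K: torch.Tensor, V: torch.Tensor) -> torch.Tensor:
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V.

    Shapes: Q (..., n_q, d_k), K (..., n_k, d_k), V (..., n_k, d_v).
    """
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)   # query-key similarity, scaled
    weights = torch.softmax(scores, dim=-1)             # normalize over keys
    return weights @ V                                   # aggregate values
```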

3.3. Technological Evolution

  • Early robotics failure detection focused on control errors and heuristic detectors.
  • Rise of VLMs enabled zero-shot/few-shot success detection and explanation-based reasoning (e.g., AHA).
  • Persistent gaps: fine-grained grounding and temporal reasoning limitations in SOTA VLMs; underexplored semantic misalignment detection.
  • This paper advances discriminative failure classification by exploiting intermediate VLM representations via FS blocks, grounded arbitration across multiple abstraction levels, and a dataset pipeline specifically targeting semantic misalignment.

3.4. Differentiation Analysis

  • Compared to prior VLM-based methods:
    • I-FailSense adds multiple lightweight classification heads at different internal LLM layers—explicitly leveraging multi-level language-space representations rather than only final outputs.
    • Uses a weighted voting mechanism combining FS blocks and the VLM’s own prediction to increase robustness and accuracy.
    • Trains on a carefully constructed SMF dataset where negatives are semantically challenging (same task type, subtly mismatched instructions).
  • Relative to AHA:
    • I-FailSense targets high-accuracy binary discrimination (success vs failure) rather than generative explanations, and empirically achieves superior accuracy even on control-error-heavy data when trained on SMF only.

4. Methodology

4.1. Principles

Core idea: Detect failures in language-conditioned robotic trajectories by grounding the decision in the internal multimodal representations of a post-trained VLM. The method:

  1. Fine-tunes the base VLM with LoRA to adapt to the failure detection task while keeping most weights frozen.

  2. Attaches lightweight binary classification heads (FS blocks) at several internal LLM layers, each receiving different-level features.

  3. Aggregates all head predictions plus the VLM’s own prediction via a weighted voting mechanism to form the final success/failure decision.

    This explicitly ties failure detection to language-space reasoning and multi-level abstraction, which is critical for semantic misalignment (requiring alignment of instruction semantics with spatial-temporal trajectory cues).

4.2. Problem Setup and Formalization

The paper models language-conditioned manipulation as a goal-conditioned POMDP: $ \mathcal{M} = (S, A, \mathcal{T}, \rho_{0}, \Omega, O, G) $ Symbols:

  • $S$: state space; $\rho_{0}$: initial state distribution.

  • $A$: action space (e.g., joint configurations or $n$-DoF commands).

  • $\mathcal{T}$: transition function governing environment dynamics.

  • $\Omega$: observation space; $O(s) \in \Omega$ is the observation function driven by visual sensors.

  • $G \subseteq \mathcal{V}^{*}$: goal/instruction space (finite sequences over the vocabulary $\mathcal{V}$).

    For $N$ camera points of view (PoV), an observation is $O(s) \in \mathbb{R}^{3 \times H \times W \times N}$, where $H$ and $W$ are the image height and width. A trajectory of length $T$ is $\tau = (o_{0}, \dots, o_{T})$. The task is binary classification of $\tau$ into success/failure with respect to the instruction $g$.

The input trajectory is aggregated into a single image-like tensor: $ \tau \in \mathbb{R}^{3 \times (H \cdot N) \times (W \cdot T)} $, where $H$ is the height, $W$ the width, $N$ the number of PoVs stacked along the height axis, $T$ the number of timesteps stacked along the width axis, and 3 the color channels.
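A minimal NumPy sketch of this aggregation, assuming frames are stored as an array of shape (T, N, 3, H, W); the exact frame sampling and any resizing used by the authors are not specified, so this is purely illustrative:

```python
import numpy as np

def aggregate_trajectory(frames: np.ndarray) -> np.ndarray:
    """Stack N PoVs along height and T timesteps along width.

    frames: (T, N, 3, H, W)  ->  (3, H*N, W*T)
    """
    T, N, C, H, W = frames.shape
    # (T, N, C, H, W) -> (C, N, H, T, W) so PoVs tile vertically and timesteps horizontally
    tiled = frames.transpose(2, 1, 3, 0, 4)
    return tiled.reshape(C, N * H, T * W)

# Example: 8 timesteps, 2 PoVs, 224x224 RGB frames -> a single (3, 448, 1792) "image"
tau = aggregate_trajectory(np.zeros((8, 2, 3, 224, 224), dtype=np.uint8))
```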

4.3. Stage 1: Supervised Fine-Tuning with LoRA

  • Base VLM: PaliGemma2-mix-3B, chosen for a good size-performance trade-off.

  • Vision encoder: SigLIP (frozen).

  • Projector MLP: fine-tuned to bridge visual encoder outputs to the language model inputs.

  • Language model: LoRA adapters inserted into key-query-value (KQV) projection layers of attention blocks to adapt with few parameters.

    Input formatting: A fixed prompt template $P(\tau, g)$ that fuses the aggregated trajectory tensor and the textual instruction $g$ for the VLM.

Training objective: Minimize cross-entropy over the supervision tokens (success or fail), using the VLM's generative head. The paper's training loss is: $ \mathcal{L}_{\mathrm{CE}}(\boldsymbol{\theta}) = -\frac{1}{M} \sum_{i=1}^{M} \log p_{\boldsymbol{\theta}} \left( t_{i} \mid \boldsymbol{x}, t_{< i} \right) $ Symbols:

  • $M$: sequence length.

  • $t_{i}$: target token at position $i$ (e.g., the success/fail supervision token).

  • $t_{< i}$: all previous tokens in the sequence.

  • $\boldsymbol{x}$: input context (the prompt $P(\tau, g)$).

  • $p_{\boldsymbol{\theta}}(\cdot)$: model probability distribution with parameters $\boldsymbol{\theta}$.

    Interpretation: The VLM is trained to predict the correct supervision token given the prompt, effectively learning to output success/fail textually.
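A minimal PyTorch sketch of this supervision-token cross-entropy, assuming the common causal-LM convention where positions outside the supervision token are masked with -100; the actual prompt template and supervision tokens are not public, so this is an assumption-laden illustration rather than the authors' code:

```python
import torch
import torch.nn.functional as F

def supervision_token_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Cross-entropy over supervision tokens only.

    logits: (B, M, V) next-token logits from the VLM's generative head
    labels: (B, M) target token ids, with ignored positions set to -100
    """
    # Shift so that position i predicts token i+1 (standard causal LM convention)
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # only the success/fail supervision token contributes to the loss
    )
```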

The following figure (Figure 2 from the original paper) shows the system architecture:

Fig. 2 (caption translated): Schematic of the I-FailSense framework, showing the structure and data flow between the vision-language model (VLM) and the language model (LLM). FS blocks detect failures, and a voting mechanism outputs the final task success or failure decision.

4.4. Stage 2: Grounded Arbitration via FS Blocks

Rationale: Different LLM layers encode different abstraction levels of language-space reasoning. Attaching classification heads at multiple depths can capture complementary signals for failure detection.

  • FS blocks: Lightweight binary classification heads attached to $K$ internal layers of the post-trained VLM language model, indexed by layer depths $\{i_{k}\}_{k=1}^{K}$.

  • For each head $k$:

    • Extract features at the chosen layer: $ f_{k} = \mathrm{VLM}_{\theta'}\big(P(\tau, g)\big)[i_{k}] $, where $\mathrm{VLM}_{\theta'}$ denotes the post-trained VLM with parameters $\theta'$ kept fixed in Stage 2, $[i_{k}]$ selects that layer's output features, and $P(\tau, g)$ is the prompt.
    • Apply hybrid pooling (MLP + multi-head attention, MHA) to aggregate the features, then residual MLP blocks with batch normalization, and finally a linear layer to produce a binary probability: $ p_{k} = FS_{\phi_{k}}(f_{k}) $, where $FS_{\phi_{k}}$ is the FS block with parameters $\phi_{k}$ trained in Stage 2 and $p_{k}$ is the probability over success/failure.
    • Training loss: Binary cross-entropy between $p_{k}$ and the true label $y$; only $\phi_{k}$ is updated.
  • VLM output: The post-trained VLM also produces a textual (free-form) output that can be converted to a binary decision $y_{vlm}$ (e.g., via token decoding and success/fail mapping).

  • Weighted voting for final prediction: $ \hat{y} = \mathbb{1}\left[ \sum_{k=1}^{K} \omega_{k} y_{k} + \omega_{vlm} y_{vlm} > 0.5 \left( \sum_{k=1}^{K} \omega_{k} + \omega_{vlm} \right) \right] $ Symbols:

  • $y_{k} = \operatorname{argmax}(p_{k})$: binary prediction from FS block $k$ (0 fail, 1 success).

  • $y_{vlm}$: binary prediction from the VLM's free-form output.

  • $\omega_{k}$: weight of FS block $k$.

  • $\omega_{vlm}$: weight of the VLM prediction.

  • $\mathbb{1}[\cdot]$: indicator function mapping the inequality to 0/1.

    Implementation details:

  • $K = 3$ FS blocks.

  • Equal weights for the FS blocks, $\omega_{k} = 1$.

  • Double weight for the VLM, $\omega_{vlm} = 2$, to break ties. (A minimal sketch of the FS blocks and the voting rule appears at the end of this subsection.)

    The following figure (Figure 1 from the original paper) illustrates the overall failure detection concept across semantic misalignment and control errors, and sim-to-real transfer:

    Fig. 1: Overview of I-FailSense, which classifies a robot's observation trajectories conditioned on language instructions into failure or success. Trained on semantic misalignment failure detection, I-FailSense excels at identifying these challenging errors, zero-shot generalizes to detecting control errors and errors in new simulation environments, and detects errors in real-world observations with minimal post-training.
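A minimal PyTorch sketch of an FS block and the weighted vote described above; the pooling layout, hidden sizes, and layer indices are not fully specified in this analysis, so these are assumptions for illustration rather than the authors' architecture:

```python
import torch
import torch.nn as nn

class ResidualMLP(nn.Module):
    """MLP block with batch normalization and a residual connection."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.BatchNorm1d(dim), nn.GELU())

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.net(x)

class FSBlock(nn.Module):
    """Lightweight binary head over the hidden states of one internal LLM layer."""
    def __init__(self, d_model: int, hidden: int = 256, n_heads: int = 4, n_res: int = 2):
        super().__init__()
        self.query = nn.Parameter(torch.randn(1, 1, d_model) * 0.02)    # learned pooling query
        self.attn_pool = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.proj = nn.Linear(d_model, hidden)
        self.res_blocks = nn.Sequential(*[ResidualMLP(hidden) for _ in range(n_res)])
        self.head = nn.Linear(hidden, 2)                                 # success / failure logits

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (B, L, d_model) hidden states from layer i_k of the post-trained (frozen) VLM
        q = self.query.expand(h.size(0), -1, -1)
        pooled, _ = self.attn_pool(q, h, h)                              # attention pooling to one vector
        z = self.res_blocks(self.proj(pooled.squeeze(1)))
        return self.head(z)                                              # logits; softmax gives p_k

def weighted_vote(fs_votes, vlm_vote, w_fs=1.0, w_vlm=2.0) -> int:
    """hat{y} = 1[ sum_k w_k y_k + w_vlm y_vlm > 0.5 (sum_k w_k + w_vlm) ]."""
    total = w_fs * sum(fs_votes) + w_vlm * vlm_vote
    return int(total > 0.5 * (w_fs * len(fs_votes) + w_vlm))

# Example with K = 3: FS blocks vote [1, 0, 0] and the VLM votes 1 -> 1 + 2 = 3 > 2.5 -> success (1)
```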

5. Experimental Setup

5.1. Datasets

  • SMF-CALVIN (semantic misalignment, simulated):

    • Based on CALVIN Task D benchmark (34 tasks), grouped into lifting, rotating, pushing, opening/closing, placing, and lighting.

    • Each task has ~150 expert demonstrations and 11 textual instructions.

    • Construction:

      • Positive examples: an expert trajectory paired with its original instruction, $(\tau, g^{*}, y=1)$.
      • Negative examples: the same trajectory paired with a different instruction $g \in G \setminus \{g^{*}\}$ from the same task category, $(\tau, g, y=0)$, creating subtle semantic misalignment (e.g., "rotate left" vs. "rotate right"). (A construction sketch appears after the dataset descriptions below.)
    • Example (from Figure 3): Top: instruction matches trajectory (e.g., lifting blue block). Bottom: semantic misalignment (rotating the pink cube right instead of instructed left).

      The following figure (Figure 3 from the original paper) shows example data:

      Fig. 3: Example data in $\mathcal{D}_{\mathrm{SMF-CALVIN}}$. Top: a positive example where the observation trajectory correctly matches the paired instruction. Bottom: a negative example illustrating semantic misalignment, where the robot rotates the correct object (the pink cube) right instead of the instructed left.

  • AHA (failure detection, simulated, RLBench):

    • 79 tasks; seven error categories: six control errors (incomplete grasp, inadequate grip retention, misaligned keyframe, incorrect rotation, missing rotation, wrong action sequence) and one semantic misalignment (wrong target object).

    • Authors reconstructed a test set of 400 negative trajectory–instruction pairs following AHA’s pipeline (dataset itself not publicly released).

    • Trajectories include egocentric and exocentric PoVs.

    • Examples (from Figure 4): Top: knife slips from gripper; Bottom: failure to grasp computer lid.

      The following figure (Figure 4 from the original paper) shows examples:

      Fig. 4: Example data in $\mathcal{D}_{\mathrm{AHA}}$: two negative examples from the AHA dataset (exocentric PoV) demonstrating control failures. Top: the knife slips through the robot's gripper; bottom: the robot fails to grasp the computer lid.

  • SMF-DROID (semantic misalignment, real-world):

    • Based on DROID with egocentric (gripper) and two exocentric cameras. Positive examples are original demonstrations; negatives are re-paired instructions from the same category to induce semantic misalignment.

    • Size: 6K training, 276 test, balanced positives/negatives.

    • Real-world setting introduces more distractors and unstable viewpoints; combining egocentric and exocentric PoVs proves beneficial.

      The following figure (Figure 5 from the original paper) shows example data:

      Fig. 5 (caption translated): Example success and failure cases of a robot executing a task. The top four frames show the robot correctly performing the instructed operation on a green target object; the bottom frames show the robot failing to execute the task correctly. The contrast highlights the importance of semantic misalignment failures in robotic manipulation.

Why these datasets? SMF-CALVIN enables controlled evaluation on subtle semantic misalignments in simulation; AHA tests transfer to OOD environment and predominantly control errors; SMF-DROID assesses sim-to-real generalization to real-world semantic misalignments.
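A minimal sketch of the re-pairing construction used to build SMF-style datasets (field names and the sampling policy are assumptions for illustration, not the authors' exact pipeline):

```python
import random

def build_smf_pairs(demos, seed=0):
    """Build balanced success/failure pairs from expert demonstrations.

    demos: list of dicts with keys 'trajectory', 'instruction', 'task_category' (assumed schema).
    Returns (trajectory, instruction, label) triples: label 1 = success, 0 = failure.
    """
    rng = random.Random(seed)
    # Group all instructions by task category (e.g., rotating, lifting, pushing)
    by_category = {}
    for d in demos:
        by_category.setdefault(d["task_category"], set()).add(d["instruction"])

    pairs = []
    for d in demos:
        g_star = d["instruction"]
        # Positive: trajectory paired with its original instruction
        pairs.append((d["trajectory"], g_star, 1))
        # Negative: same trajectory re-paired with a different instruction from the same category
        candidates = [g for g in by_category[d["task_category"]] if g != g_star]
        if candidates:
            pairs.append((d["trajectory"], rng.choice(candidates), 0))
    return pairs
```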

5.2. Evaluation Metrics

For datasets with both positive and negative examples (SMF-CALVIN, SMF-DROID):

  • Accuracy:

    • Concept: Overall correctness across both classes.
    • Formula: $ \mathrm{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $
    • Symbols: TP true positives (correct successes), TN true negatives (correct failures), FP false positives (misclassified failures as success), FN false negatives (misclassified successes as failure).
  • Precision:

    • Concept: Of predicted successes, how many are correct (avoids falsely claiming success when misaligned).
    • Formula: $ \mathrm{Precision} = \frac{TP}{TP + FP} $
    • Symbols: TP, FP as above.
  • Recall:

    • Concept: Of actual successes, how many were identified (sensitivity to success cases).
    • Formula: $ \mathrm{Recall} = \frac{TP}{TP + FN} $
    • Symbols: TP, FN as above.
  • F1 score:

    • Concept: Harmonic mean of precision and recall; balances both aspects.

    • Formula: $ \mathrm{F1} = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} $

    • Symbols: Defined above.

      For AHA (constructed test set with only negative examples):

  • Accuracy:

    • Here, interpretable as the failure detection rate, i.e., the fraction of negative examples correctly classified as failure.
    • Since the test set contains only negative examples (TP = FN = 0), the accuracy formula above reduces to TN / (TN + FP). (A short metrics helper is sketched below.)
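A short Python helper computing these metrics from binary labels and predictions (standard definitions, sketch only; "success" is treated as the positive class):

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall, and F1 with success (1) as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / max(tp + tn + fp + fn, 1)
    precision = tp / max(tp + fp, 1)
    recall = tp / max(tp + fn, 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-12)
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```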

5.3. Baselines

  • Zero-shot VLMs:
    • GPT-4o (large multimodal model).
    • PaliGemma2-mix-3B (same base model as I-FailSense but zero-shot).
    • Qwen2.5-VL-7B (strong open-source multimodal baseline).
  • AHA models (trained on AHA):
    • LLaMA-2-based VLMs (7B, 13B) fine-tuned on AHA’s failure detection tasks with a frozen vision encoder/tokenizer and tuned projector/transformer layers.

      Rationale: These baselines represent state-of-the-art general-purpose and task-specific multimodal models across sizes, allowing comparison against zero-shot capabilities and specialized training on control-error-heavy tasks (AHA).

6. Results & Analysis

6.1. Core Results Analysis

  • Q1 (Semantic misalignment, SMF-CALVIN):

    • Zero-shot GPT-4o and PaliGemma2-mix perform near random; Qwen2.5-VL improves recall but suffers from low precision (frequent false positives on subtle misalignments).
    • I-FailSense Stage 1 (LoRA only) significantly boosts performance; Stage 2 (FS blocks + voting) further improves both accuracy and F1, reaching up to 90.64% accuracy (1 PoV) and 88.18% (2 PoV), with F1 up to 0.9132.
    • Insight: Multi-level internal representations and ensembling reduce the precision–recall gap, better handling subtle mismatches.
  • Q2 (Transfer to control errors and OOD, AHA):

    • Zero-shot models perform poorly; AHA fine-tuned models reach ~69–70% accuracy.

    • I-FailSense, trained only on SMF-CALVIN, achieves 89% failure detection rate on AHA (mostly control errors). This suggests semantic misalignment training teaches the model to check if observed motion matches the instruction—a skill that also flags control failures where meaningful motion is absent or incorrect.

      The following figure (Figure 6 from the original paper) shows the comparative failure detection rates:

      Fig. 6: Failure detection rate of I-FailSense trained on $\mathcal{D}_{\mathrm{SMF-CALVIN}}$ and evaluated on $\mathcal{D}_{\mathrm{AHA}}$, compared to AHA baselines (7B, 13B) trained on $\mathcal{D}_{\mathrm{AHA}}$ and zero-shot VLMs. All models use both egocentric and exocentric PoV; results from prior work are marked with *. I-FailSense reaches an 89.0% failure detection rate, markedly higher than the other models.

  • Q3 (OOD generalization across simulation environments):

    • The good AHA performance indicates transfer from CALVIN (fixed objects like cubes, sliders, lights) to RLBench (varied objects like knives, bowls, drawers) with different camera setups.
  • Q4 (Real-world generalization, SMF-DROID):

    • Direct transfer of fully simulated-trained I-FailSense increases accuracy but hurts F1 due to recall drop.
    • Training FS blocks on both simulated and real-world data reduces recall drop and increases precision; best performance arises from training FS blocks solely on real-world SMF-DROID, indicating simulated patterns can sometimes hinder real-world failure detection.
    • Dual PoV (egocentric + exocentric) helps in real-world settings with more distractors and unstable views, unlike simulation where single exocentric PoV can be sufficient.

6.2. Data Presentation (Tables)

The following are the results from Table I of the original paper:

| Method | #PoV | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| GPT-4o (zs) | 1 | 0.5764 | 0.8125 | 0.2453 | 0.3768 |
| GPT-4o (zs) | 2 | 0.6305 | 0.7719 | 0.4151 | 0.5399 |
| PaliGemma2-mix 3B (zs) | 1 | 0.4729 | 0.4961 | 0.5943 | 0.5408 |
| PaliGemma2-mix 3B (zs) | 2 | 0.5271 | 0.5329 | 0.7642 | 0.6279 |
| Qwen2.5-VL 7B (zs) | 1 | 0.6798 | 0.6323 | 0.9245 | 0.7510 |
| Qwen2.5-VL 7B (zs) | 2 | 0.6946 | 0.6507 | 0.8962 | 0.7540 |
| I-FailSense (LoRA only) | 1 | 0.8571 | 0.8667 | 0.8585 | 0.8626 |
| I-FailSense (LoRA only) | 2 | 0.8227 | 0.8365 | 0.8208 | 0.8286 |
| I-FailSense (ours) | 1 | 0.9064 | 0.8850 | 0.9434 | 0.9132 |
| I-FailSense (ours) | 2 | 0.8818 | 0.8596 | 0.9245 | 0.8909 |

The following are the results from Table II of the original paper:

| Method | #PoV | Accuracy | Precision | Recall | F1 |
| --- | --- | --- | --- | --- | --- |
| GPT-4o (zs) | 1 | 0.5471 | 0.8421 | 0.1159 | 0.2038 |
| GPT-4o (zs) | 2 | 0.5797 | 0.8056 | 0.2101 | 0.3333 |
| PaliGemma2-mix 3B (zs) | 1 | 0.5399 | 0.5304 | 0.6957 | 0.6019 |
| PaliGemma2-mix 3B (zs) | 2 | 0.5145 | 0.5099 | 0.7464 | 0.6059 |
| Qwen2.5-VL 7B (zs) | 1 | 0.6884 | 0.8250 | 0.4783 | 0.6055 |
| Qwen2.5-VL 7B (zs) | 2 | 0.7536 | 0.8646 | 0.6014 | 0.7094 |
| I-FailSense (SMF-CALVIN) | 1 | 0.5580 | 0.6250 | 0.2899 | 0.3960 |
| I-FailSense (SMF-CALVIN) | 2 | 0.6196 | 0.8000 | 0.3188 | 0.4560 |
| I-FailSense (SMF-CALVIN + SMF-DROID) | 1 | 0.6594 | 0.7075 | 0.5435 | 0.6148 |
| I-FailSense (SMF-CALVIN + SMF-DROID) | 2 | 0.6848 | 0.7802 | 0.5145 | 0.6201 |
| I-FailSense (SMF-DROID) | 1 | 0.7100 | 0.7500 | 0.6304 | 0.6850 |
| I-FailSense (SMF-DROID) | 2 | 0.7428 | 0.7680 | 0.6957 | 0.7300 |

6.3. Ablation Studies / Parameter Analysis

  • Stage-wise ablation (SMF-CALVIN):

    • Zero-shot base PaliGemma2-mix-3B: 47–52% accuracy; poor due to domain gap (simulation vs real images) and need for temporal grounding over aggregated trajectories.
    • Stage 1 (LoRA only): Improves to 82–85% accuracy, showing LoRA adaptation of projector and KQV layers is critical.
    • Stage 2 (FS blocks + voting): Further boosts to 88–90% accuracy and F1 (0.89–0.91). Multi-layer representations add complementary signals, and ensembling strengthens decisions.
  • PoV analysis:

    • Simulation (SMF-CALVIN): Single exocentric PoV sometimes outperforms dual PoV (slight differences).
    • Real-world (SMF-DROID): Dual PoV clearly helps due to viewpoint instability and distractors.
  • Transfer analysis:

    • Training FS blocks solely on real-world data yields best real-world performance, indicating that simulation-only patterns may not optimally capture real-world error cues. Nonetheless, Stage 1 LoRA training on simulated data provides useful foundational representations that enable effective minimal fine-tuning on real-world.

7. Conclusion & Reflections

7.1. Conclusion Summary

  • The paper introduces I-FailSense, a two-stage VLM framework for robust robotic failure detection, especially semantic misalignment errors.
  • Stage 1 leverages parameter-efficient LoRA to adapt the base VLM; Stage 2 adds FS blocks at multiple internal layers and aggregates their predictions via weighted voting.
  • A new dataset construction pipeline for semantic misalignment failures is proposed, leveraging re-pairing within the same task category.
  • Empirical results demonstrate superior performance over zero-shot and AHA baselines:
    • 90% accuracy on SMF-CALVIN (semantic misalignment, simulation).
    • 89% failure detection on AHA (mostly control errors, OOD).
    • Up to 74% accuracy on real-world SMF-DROID with minimal FS block fine-tuning.
  • The method shows that focusing on semantic misalignment and exploiting multi-level VLM representations generalizes to broader failure categories and environments.

7.2. Limitations & Future Work

  • Limitations:

    • Trajectory flattening into a single image may discard fine-grained temporal dynamics; temporal modeling (e.g., transformers over frames) could improve motion reasoning.
    • Negative examples are synthetically constructed by instruction mismatch; while effective, real-world misalignments can be more varied and nuanced.
    • Fixed voting weights ($\omega_{vlm} = 2$, $\omega_{k} = 1$) are heuristic; adaptive weighting or meta-learning could yield further gains.
    • Evaluation on AHA uses a reconstructed test set (since original dataset is not public), which, while faithful to the pipeline, is not identical to the official benchmark.
    • The architecture details of FS blocks (e.g., exact MHA/MLP configurations, hyperparameters) are described at a high level; reproducibility may benefit from more granular specifications.
  • Future work suggested by authors:

    • Extend from detection to recovery: enable robots not only to recognize failures but to adapt and correct during execution.
    • Broaden failure mode coverage and explanatory capabilities while preserving discriminative accuracy.
    • Strengthen sim-to-real transfer via domain adaptation or real-world pretraining.

7.3. Personal Insights & Critique

  • Insights:
    • Multi-layer grounded arbitration is a strong design choice: different LLM layers capture complementary semantics, and leveraging them improves discriminative failure detection—particularly important for semantic misalignment.
    • Training on semantic misalignment appears to teach a general “goal satisfaction” checker that transfers to control errors, a valuable principle for designing failure-aware agents.
  • Potential applications:
    • Integrate with VLA policies as a reliable success detector for reward shaping and autonomous evaluation frameworks (e.g., AutoEval).
    • Use FS blocks as plug-in monitors for robotic pipelines in manufacturing or service robots where instruction comprehension is critical.
  • Areas for improvement:
    • Investigate sequence models over video frames instead of spatial flattening for richer temporal reasoning (e.g., spatiotemporal attention).

    • Learn dynamic voting weights conditioned on the input (confidence-aware arbitration).

    • Expand datasets with real-world semantic misalignment curated by human-in-the-loop labeling to capture nuanced, non-synthetic errors.

    • Combine discriminative FS blocks with generative explanation modules (AHA-style) to trigger recovery policies with both accurate flags and informative reasons.

      Overall, I-FailSense demonstrates a compelling architecture and dataset strategy that moves the needle on failure detection in language-conditioned robotic manipulation, particularly for the challenging and underexplored class of semantic misalignment errors.
