
SAFE: Multitask Failure Detection for Vision-Language-Action Models

Published: 06/12/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SAFE introduces a multitask failure detector for vision-language-action models that leverages internal VLA features to predict task failure. It demonstrates superior performance across diverse environments, balances accuracy and timeliness, and supports zero-shot generalization to unseen tasks.

Abstract

While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

SAFE: Multitask Failure Detection for Vision-Language-Action Models

1.2. Authors

  • Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Florian Shkurti (University of Toronto, UofT Robotics Institute; Vector Institute)

  • Haruki Nishimura, Masha Itkina (Toyota Research Institute)

    The authors draw from robotics, robot learning, uncertainty quantification, and applied machine learning. UofT and TRI have extensive track records in robotic manipulation and safety research.

1.3. Journal/Conference

arXiv preprint. The work is not yet a peer-reviewed publication; it positions itself within the robotic learning and uncertainty quantification community, referencing and evaluating against recent CoRL and RSS works.

1.4. Publication Year

2025 (arXiv posting: 2025-06-11T16:59:13.000Z)

1.5. Abstract

Vision-language-action (VLA) models are generalist robotic policies capable of following natural language instructions to perform diverse manipulation tasks. Out of the box, VLAs achieve strong performance on seen tasks but suffer a large drop on unseen tasks (from roughly 80–90% to 30–60% success). To safely deploy VLAs without per-task fine-tuning, the paper introduces multitask failure detection: train a single detector on seen tasks and evaluate it zero-shot on unseen tasks. The proposed method, SAFE (ScAlable Failure Estimation), probes VLA internal features and regresses a scalar failure score over time. Using functional conformal prediction (CP), SAFE sets time-varying thresholds that provide distribution-free guarantees under mild assumptions. SAFE is compatible with multiple VLA architectures (OpenVLA, π0, π0-FAST) and shows superior accuracy-time trade-offs over baselines (token uncertainty, sample consistency, OOD detectors, and action-consistency methods) across simulation (LIBERO-10, SimplerEnv) and real robots (Franka, WidowX).

2. Executive Summary

2.1. Background & Motivation

  • Problem: VLAs often fail on unseen tasks and novel environments when deployed directly. Without safe monitoring, failures can lead to unsafe robot behaviors. Existing failure detectors are typically task-specific and do not generalize across tasks, making them impractical for generalist VLAs.
  • Why important: Generalist policies are meant to encounter new instructions and settings. It is infeasible to train and calibrate detectors per task. A task-generic detector must be accurate, timely, efficient, and compatible with various VLA architectures to support real-time robotic control.
  • Entry point/innovation: The authors analyze latent features inside VLA models and discover a “failure zone”—a feature-space region that captures high-level, task-generic separation between success and failure. This motivates probing internal features to learn a single scalar failure score and threshold it via conformal prediction to provide calibrated, time-varying alerts.

2.2. Main Contributions / Findings

  • Contribution 1: Introduce multitask failure detection for generalist VLAs—train on a subset of tasks (seen) and evaluate zero-shot on held-out (unseen) tasks without per-task fine-tuning or additional rollouts.
  • Contribution 2: Empirical analysis shows VLA internal features distinctly separate successful vs. failed rollouts across tasks. Failure trajectories converge in a shared “failure zone” under t-SNE visualization.
  • Contribution 3: SAFE—an efficient detector that:
    • Probes last-layer VLA features.
    • Learns a scalar failure score with a simple MLP or LSTM over time.
    • Uses functional conformal prediction to set time-varying thresholds with distribution-free guarantees under exchangeability.
  • Contribution 4: Extensive evaluation across models (OpenVLA, π0, π0-FAST) and settings (LIBERO-10, SimplerEnv, real Franka, real WidowX) shows SAFE achieves state-of-the-art failure detection (ROC-AUC) and best accuracy–timeliness trade-offs, outperforming strong baselines (including OOD detectors and action-consistency).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language-Action (VLA) Models: Neural policies that condition on images (RGB), language instructions, and robot states to output actions. Typically built on transformer architectures pretrained as vision-language models (VLMs), then augmented with action heads (e.g., discrete action tokenization, diffusion, flow matching). VLAs aim for generalist manipulation, handling diverse tasks without per-task retraining.
  • Transformer fundamentals: A transformer uses multi-head self-attention to integrate information across tokens. While the paper does not derive attention, the core formula is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$ are queries, $K$ keys, $V$ values, and $d_k$ is the key dimensionality (a minimal sketch follows this list). VLAs leverage transformer blocks to encode multimodal inputs and intermediate latent features.
  • Conformal Prediction (CP): A distribution-free framework that provides calibrated prediction sets or bands under exchangeability assumptions. Functional CP constructs time-varying bands for sequences, ensuring that, with probability $1-\alpha$, a new successful rollout’s score remains within the band at all timesteps. Exceeding the band signals an anomaly/failure with a controlled false positive rate.
  • Uncertainty Quantification (UQ) in LLMs/VLMs: Methods estimate confidence (e.g., token-level entropy/probability) or consistency (variance across sampled outputs). They can be adapted to actions from VLAs as proxies for failure likelihood but require multiple samples or tend to be weak indicators.
  • Out-of-Distribution (OOD) detection: Methods model distributions of embeddings for “normal” data and flag deviations. Applied to robotics: learn distributions from successful rollouts and score new embeddings by distance or learned density. Strong for detecting anomalies but may misclassify unseen-but-successful tasks as failures.
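To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention (background illustration only; this code is not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy example: 4 tokens with d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8)
```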

3.2. Previous Works

  • VLAs: OpenVLA; RT-series (RT-1, RT-2); Octo; π0 (flow matching); FAST (efficient action tokenization). These achieve good performance on seen tasks but reduced success on unseen tasks.
  • Failure detection in robotics:
    • Task-specific detectors trained per task—unsuitable for generalist VLAs.
    • FAIL-Detect [8]: systematic evaluation of detectors (OOD, smoothness, consistency), reporting LogpZO (flow-matching-based density) as strongest—but for single-task policies; suffers in multitask generalization per this paper.
    • STAC [18]: action-consistency over overlapping chunks, requiring many samples per timestep; strong performance but heavy inference overhead for large VLAs.
    • VLM-based detectors [19, 20]: query large models for success/failure judgments; expensive and impractical for real-time control.
  • UQ for LLMs/VLMs: token probability/entropy; semantic consistency; latent-space classifiers (e.g., hallucination detection by probing internal activations), showing simple probes can be effective and efficient.

3.3. Technological Evolution

  • From specialist, per-task policies and detectors → generalist VLAs that can follow diverse instructions.
  • From task-specific failure detectors → desire for task-generic, multitask detectors compatible with VLAs.
  • From heavy sampling or large-model queries → lightweight feature-probing and time-calibrated thresholds.

3.4. Differentiation Analysis

  • SAFE’s novelty lies in:
    • Probing last-layer VLA features to exploit emergent task-generic success/failure structure.
    • A unified detector trained on success/failure across tasks, evaluated zero-shot on unseen tasks.
    • Functional CP setting time-varying thresholds with theoretical calibration for false positives on successful rollouts.
  • Compared to baselines:
    • UQ (token/sample consistency) is weaker or expensive (needs multiple samples).
    • OOD detectors model “normal” success distribution but may overfit and misclassify unseen successes; SAFE directly learns failure discrimination.
    • STAC yields strong performance but sampling overhead is impractical for large VLAs; SAFE operates with single forward passes and minimal overhead.

4. Methodology

4.1. Principles

Core idea: VLA internal features encode high-level, abstract indicators of task success versus failure across tasks. By probing these latent features over time and learning a scalar failure score $s_t$, we can detect failures early. A principled, time-varying threshold $\delta_t$ from functional conformal prediction yields calibrated alerts with controlled false positive rates on successful rollouts.

The following figure (Figure 2 from the original paper) shows the system architecture:

Figure 2: The proposed failure detector, SAFE, has three major components: (1) SAFE extracts the latent feature from the last layer of a VLA model; (2) SAFE sequentially processes the latent feature and predicts a failure score, using an MLP or an LSTM backbone; and (3) SAFE determines a time-varying threshold using functional conformal prediction (CP) on a hold-out calibration set. If the predicted score exceeds the threshold during testing, SAFE confidently detects a failure.

4.2. Data Flow and Feature Extraction

  • Inputs at timestep $t$: observation $\mathbf{o}_t$ (images, instruction, robot state).

  • VLA outputs: a chunk of actions $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1}]$ for horizon $H$. The first $H' \leq H$ actions are executed, then the policy replans at $t+H'$.

  • Internal embedding: $\mathbf{e}_t$ denotes the latent feature vector at time $t$ from the last layer of the VLA, before decoding into tokens (OpenVLA, π0-FAST) or a velocity field (π0).

  • Token decoding (some VLAs): $\mathbf{W}_t = [\mathbf{w}_t^1, \dots, \mathbf{w}_t^m]$ are action tokens; SAFE can also use token-level uncertainty baselines (for comparison), but its core relies on $\mathbf{e}_t$.

    The authors ablate multiple aggregation strategies to obtain a single vector $\mathbf{e}_t$ from multi-token or multi-step internal tensors $E$:

  • First: $\mathbf{e} = E_1$

  • Last: $\mathbf{e} = E_n$

  • Mean: $\mathbf{e} = \frac{1}{n}\sum_{i=1}^n E_i$

  • First&Last: $\mathbf{e} = \mathrm{concat}(E_1, E_n)$

    For π0 (flow matching), features exist across horizon and diffusion steps: $E \in \mathbb{R}^{H \times k \times d}$, aggregated along the $H$ and $k$ dimensions via $\mathrm{agg}_{\mathrm{hori}}$ and $\mathrm{agg}_{\mathrm{diff}}$ to produce $\mathbf{e} \in \mathbb{R}^{d'}$ (sketched below).
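As a concrete reference for these aggregation strategies, a minimal NumPy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def aggregate(E: np.ndarray, mode: str) -> np.ndarray:
    """Collapse per-token features E of shape (n, d) into a single vector e,
    mirroring the First/Last/Mean/First&Last strategies described above."""
    if mode == "first":
        return E[0]
    if mode == "last":
        return E[-1]
    if mode == "mean":
        return E.mean(axis=0)
    if mode == "first_last":
        return np.concatenate([E[0], E[-1]])
    raise ValueError(f"unknown mode: {mode}")

# For pi0 (flow matching), features span horizon and diffusion steps, shape (H, k, d).
# Aggregate the diffusion axis first, then the horizon axis (here both by mean):
E_pi0 = np.zeros((8, 10, 512))             # dummy (H, k, d) tensor
e = aggregate(E_pi0.mean(axis=1), "mean")  # agg_diff = mean, then agg_hori = mean
```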

4.3. SAFE Score Models

SAFE learns a scalar failure score $s_t$ from the feature history $\mathbf{e}_{0:t} = \{\mathbf{e}_1, \ldots, \mathbf{e}_t\}$.

4.3.1. MLP-based Accumulation

At each timestep $t$, an MLP $g(\cdot)$ maps $\mathbf{e}_t$ to a scalar, and a sigmoid $\sigma(\cdot)$ normalizes it. The cumulative failure score up to $t$ is: $ \bar{f}_{\mathrm{MLP}}(\mathbf{e}_{0:t}) = \sum_{\tau = 1}^{t} \sigma\big(g(\mathbf{e}_{\tau})\big) $

  • Symbols:
    • $\mathbf{e}_{\tau}$: VLA latent feature at time $\tau$.

    • $g(\cdot)$: MLP mapping features to a scalar logit.

    • $\sigma(\cdot)$: sigmoid activation, ensuring $0 < \sigma(\cdot) < 1$.

    • $\bar{f}_{\mathrm{MLP}}$: cumulative failure score.

      Range property noted by the authors: $0 < s_t < t$ (a sum of $t$ sigmoids, each in $(0,1)$).

Training loss (applied across timesteps), pushing scores up for failed rollouts ($y_i=1$) and down for successful ones ($y_i=0$): $ L_{\mathrm{MLP}} = \sum_i \left[ y_i \sum_t (t - s_t) + (1 - y_i) \sum_t s_t \right] $

  • Symbols:
    • $i$: indexes rollouts in the training set $\mathcal{D}_{\mathrm{train}}$.

    • $y_i \in \{0,1\}$: rollout-level label (1 = failure; 0 = success).

    • $s_t$: failure score at time $t$ (here, the cumulative form).

      Intuition: for failures, encourage $s_t$ to be high (closer to $t$); for successes, encourage $s_t$ to be low (see the sketch below).
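A minimal PyTorch sketch of the MLP-based cumulative scorer and this loss (layer sizes and names are our assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class SAFEMLPScorer(nn.Module):
    """Cumulative failure score s_t = sum_{tau <= t} sigmoid(g(e_tau))."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (T, feat_dim) features of one rollout -> (T,) cumulative scores
        per_step = torch.sigmoid(self.g(e)).squeeze(-1)  # each term in (0, 1)
        return torch.cumsum(per_step, dim=0)             # hence 0 < s_t < t

def mlp_loss(s: torch.Tensor, y: float) -> torch.Tensor:
    # y = 1 (failure): push s_t toward t; y = 0 (success): push s_t toward 0
    t = torch.arange(1, len(s) + 1, dtype=s.dtype)
    return (y * (t - s) + (1 - y) * s).sum()
```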

4.3.2. LSTM-based Sequential Scoring

An LSTM processes $\mathbf{e}_{0:t}$ sequentially; a linear head plus sigmoid produces $s_t$: $ f_{\mathrm{LSTM}}(\mathbf{e}_{0:t}) = \sigma\big(\mathrm{LSTM}(\mathbf{e}_{0:t})\big) $

  • Symbols:

    • $\mathrm{LSTM}(\cdot)$: recurrent model capturing the temporal evolution of features.

    • $\sigma(\cdot)$: sigmoid, yielding $0 \leq s_t \leq 1$.

      The training loss is binary cross-entropy (BCE) over timesteps: $ L_{\mathrm{LSTM}} = -\sum_i \sum_t \left[ y_i \log(s_t) + (1 - y_i) \log(1 - s_t) \right] $

  • Symbols as above; BCE encourages $s_t$ near 1 for failed rollouts and near 0 for successful rollouts. A sketch of this variant follows.
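A matching PyTorch sketch of the LSTM variant with the BCE objective (hidden size again an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFELSTMScorer(nn.Module):
    """Sequential failure score s_t = sigmoid(linear(LSTM(e_{0:t})))."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, T, feat_dim) -> (B, T) per-timestep failure scores in (0, 1)
        h, _ = self.lstm(e)
        return torch.sigmoid(self.head(h)).squeeze(-1)

def lstm_loss(scores: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # y: (B,) float rollout labels, broadcast to every timestep
    target = y.unsqueeze(1).expand_as(scores)
    return F.binary_cross_entropy(scores, target, reduction="sum")
```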

    The following figure (Figure 1 from the original paper) motivates SAFE by showing latent-space separation:

    Figure 1: The internal features of a VLA capture high-level information about task success and failure. When the VLA is failing, the features, even those from different tasks, fall into the same "failure zone". This motivates SAFE, an efficient multitask failure detector that is based on VLA internal features and can generalize to unseen tasks. Plot (a) visualizes the latent features of $\pi_0$-FAST on LIBERO-10 [56] using t-SNE [57]. For successful rollouts, features are colored in blue. For failed rollouts, features follow a blue-to-red gradient based on timestep progression, with red corresponding to later timesteps that often coincide with failure. Plot (b) visualizes the same set of t-SNE features, colored by task ID. In (c), we show two example rollouts over time and mark their corresponding projected features in (a) and (b).

4.4. Threshold Selection via Functional Conformal Prediction

SAFE raises a failure flag when $s_t$ exceeds a time-varying threshold $\delta_t$. The threshold is constructed from a CP prediction band calibrated on successful rollouts in a held-out calibration set $\mathcal{D}_{\mathrm{eval\text{-}seen}}$.

  • One-sided time-varying CP band $C_{\alpha} = \{[\mathrm{lower}_t, \mathrm{upper}_t]: t=1,\ldots,T\}$ with:
    • $\mathrm{lower}_t = -\infty$
    • $\mathrm{upper}_t = \mu_t + h_t$
  • Symbols:
    • $\alpha \in (0,1)$: user-specified significance level, controlling conservativeness.

    • $\mu_t$: time-varying mean of scores from successful rollouts at time $t$.

    • $h_t$: bandwidth chosen to achieve coverage $1-\alpha$ under exchangeability.

    • $T$: rollout duration.

      Guarantee (under mild assumptions): for a new successful rollout, $s_t < \mu_t + h_t$ for all $t = 1,\dots,T$ with probability $1-\alpha$. Conversely, $s_t > \mu_t + h_t$ indicates failure with nominal confidence $1-\alpha$. SAFE uses $\delta_t = \mathrm{upper}_t$ as the threshold. A sketch of this construction follows.
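The paper calibrates this band with functional CP; the sketch below shows one common way to realize such a one-sided band (per-timestep mean plus a conformally calibrated multiple of a per-timestep spread; this particular recipe is our assumption, not necessarily the paper's exact construction):

```python
import numpy as np

def one_sided_cp_band(cal_scores: np.ndarray, alpha: float = 0.1):
    """cal_scores: (N, T) failure-score curves of N successful calibration rollouts.
    Returns (mu_t, delta_t), where delta_t = mu_t + h_t is the upper band."""
    mu = cal_scores.mean(axis=0)
    sigma = cal_scores.std(axis=0) + 1e-8        # modulation: band width varies over t
    r = ((cal_scores - mu) / sigma).max(axis=1)  # sup_t deviation per calibration rollout
    n = len(r)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(r)[k - 1]                        # conformal quantile of sup-deviations
    return mu, mu + q * sigma

# At test time, raise a failure flag the first time s_t exceeds delta_t.
```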

5. Experimental Setup

5.1. Datasets

  • LIBERO-10: A challenging simulation suite of 10 long-horizon manipulation tasks with diverse objects, layouts, and instructions. VLAs (OpenVLA, π0, π0-FAST) are evaluated using authors’ released checkpoints. 3/10 tasks randomly designated as unseen; remaining 7 are seen.

  • SimplerEnv: High-fidelity simulation environments replicating real datasets (RT-series, BridgeData V2). Evaluates π0 reproductions (denoted $\pi_0^*$) on Google Robot and WidowX embodiments. Each embodiment uses 4 tasks (3 seen, 1 unseen); 100 rollouts per task.

  • Real-world Franka: Deploy π0-FAST-DROID on a Franka Emika Panda. 13 tasks; for each, 30 successful and 30 failed rollouts. 3/13 tasks as unseen. Fixed rollout lengths per task.

  • Real-world WidowX: Deploy OpenVLA pretrained on “Open-X Magic Soup++” on a WidowX arm. 8 lifting and pick-and-place tasks, totaling 532 rollouts (244 successes, 288 failures). 2/8 tasks unseen.

    Examples of task instructions (from real setups) are given in Tables 3 and 4 below.

5.2. Evaluation Metrics

  • ROC-AUC (Area Under the Receiver Operating Characteristic curve):
    • Concept: Measures the ability of a scalar score to separate positive (failures) from negative (successes) across all thresholds. Higher AUC indicates better overall discrimination.
    • Formula (conceptually, the area under the curve): $ \mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(x)) \, dx $ Equivalently, for discrete points, AUC is the area under the ROC curve obtained by plotting $\mathrm{TPR}$ vs. $\mathrm{FPR}$ across thresholds.
    • Symbols:
      • $\mathrm{TPR}$: true positive rate (sensitivity; recall of failures).
      • $\mathrm{FPR}$: false positive rate ($1 - \mathrm{TNR}$).
  • TPR (True Positive Rate): $ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
    • TP: number of correctly detected failed rollouts.
    • FN: number of missed failed rollouts.
  • TNR (True Negative Rate, specificity): $ \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} $
    • TN: number of successful rollouts with no failure flag raised.
    • FP: number of successful rollouts falsely flagged as failures at any time.
  • Balanced Accuracy: $ \mathrm{Bal\text{-}Acc} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} $
  • Detection Time ($T_{\mathrm{det}}$):
    • Concept: for each rollout with a ground-truth failure, the time index at which $s_t$ first exceeds $\delta_t$ (the CP band's upper bound). Lower values indicate earlier detection.
    • Formalization: $ T_{\mathrm{det}} = \min\{ t \in \{1,\dots,T\} \mid s_t > \delta_t \} $, averaged across failed rollouts in evaluation.

SAFE is also evaluated using the maximum-so-far score $\bar{s}_t = \max_{\tau=1,\dots,t} s_\tau$, computing ROC-AUC on $\bar{s}_T$ (the max over the rollout), reflecting that any threshold crossing during a successful rollout counts as a false positive. A sketch of these metrics follows.
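For concreteness, a sketch of how these rollout-level metrics can be computed from per-timestep scores and a CP threshold (our illustration, using scikit-learn's ROC-AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(scores, labels, delta):
    """scores: list of (T_i,) per-timestep failure-score arrays, one per rollout.
    labels: (N,) array, 1 = failed rollout, 0 = successful. delta: (T_max,) threshold."""
    # ROC-AUC on the max-so-far score over the whole rollout
    auc = roc_auc_score(labels, np.array([s.max() for s in scores]))

    flagged = np.array([np.any(s > delta[: len(s)]) for s in scores])
    tpr = flagged[labels == 1].mean()     # failures correctly flagged
    tnr = (~flagged[labels == 0]).mean()  # successes never flagged
    bal_acc = (tpr + tnr) / 2

    # average detection time over failed rollouts that were flagged
    det = [np.argmax(s > delta[: len(s)])  # index of first threshold crossing
           for s, y, f in zip(scores, labels, flagged) if y == 1 and f]
    t_det = float(np.mean(det)) if det else float("nan")
    return auc, bal_acc, t_det
```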

5.3. Baselines

  • Token Uncertainty (single forward pass; for tokenized-action VLAs):
    • Max prob.: $\max_i(-\log p_i)$
    • Avg prob.: $-\frac{1}{m}\sum_i \log p_i$
    • Max entropy: $\max_i H_i$
    • Avg entropy: $\frac{1}{m}\sum_i H_i$
    • Symbols: $p_i$ is the probability of token $\mathbf{w}_t^i$; $H_i$ is the entropy for token $i$; $m$ tokens per action.
  • Sample Consistency (requires multiple action samples per timestep):
    • Action total variance: $\mathrm{trace}(\mathrm{cov}(\mathcal{A}_t))$ over sampled action sequences $\mathcal{A}_t = \{\mathbf{A}_t^k\}_{k=1}^K$.
    • Translational, rotational, and gripper variances are computed similarly on the corresponding components.
    • Cluster entropy: entropy over cluster labels from agglomerative clustering of sampled actions.
  • Embedding Distance (feature-space, OOD-inspired; see the sketch after this list): $ s_t = d(\mathbf{e}_t, E_{\mathrm{succ}}) - d(\mathbf{e}_t, E_{\mathrm{fail}}) $ where $E_{\mathrm{succ}}$ and $E_{\mathrm{fail}}$ are sets of embeddings from successful and failed training rollouts, respectively; $d(\cdot,\cdot)$ is the Mahalanobis distance or a $k$-NN averaged Euclidean/cosine distance; a PCA-KMeans distance is also evaluated.
  • Learned OOD Detectors:
    • RND and LogpZO adapted to embeddings: train $f_{\mathrm{succ}}^{\mathrm{OOD}}(\cdot)$ and $f_{\mathrm{fail}}^{\mathrm{OOD}}(\cdot)$ on $E_{\mathrm{succ}}$ and $E_{\mathrm{fail}}$, respectively; $ s_t = f_{\mathrm{succ}}^{\mathrm{OOD}}(\mathbf{e}_t) - f_{\mathrm{fail}}^{\mathrm{OOD}}(\mathbf{e}_t) $
  • Action Consistency:
    • STAC [18]: statistical distance between overlapping action chunks across timesteps (requires many samples, e.g., 256; tested here with 10 in simulation).
    • STAC-Single: a single-sample variant for real-time use, using one action per timestep.
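As one concrete baseline instance, a sketch of the k-NN embedding-distance score (our illustration):

```python
import numpy as np

def knn_distance_score(e_t, E_succ, E_fail, k=10):
    """s_t = d(e_t, E_succ) - d(e_t, E_fail), with d the mean Euclidean
    distance to the k nearest stored embeddings; higher s_t suggests failure."""
    def knn_dist(e, E):
        d = np.linalg.norm(E - e, axis=1)  # distance to every stored embedding
        return np.sort(d)[:k].mean()
    return knn_dist(e_t, E_succ) - knn_dist(e_t, E_fail)
```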

6. Results & Analysis

6.1. Core Results Analysis

SAFE consistently achieves top or near-top ROC-AUC across simulation and real robots, and the best accuracy–timeliness trade-offs via CP. Token uncertainty baselines are weak (consistent with LLM UQ literature). Sample consistency and STAC perform well when multiple samples are available but are computationally heavy for VLAs. Embedding-distance baselines are strong, validating that VLA latents encode success/failure; however, SAFE’s learned probe extracts more abstract signals and generalizes better to unseen tasks, mitigating overfitting observed in learned OOD detectors like LogpZO for multitask settings.

Trade-off plots (Bal-Acc vs. $T_{\mathrm{det}}$ across values of $\alpha$) show SAFE detects failures earlier without sacrificing accuracy, often aligning with human-judged failure moments and sometimes preceding them, enabling preemptive intervention.

Qualitative examples across simulation and real robots illustrate that SAFE scores rise upon missed grasps, unstable motions, oscillations, slips, or freezing, crossing CP bands appropriately.

The following are the results from Table 1 of the original paper:

| Category | Method | OpenVLA LIBERO (Seen) | OpenVLA LIBERO (Unseen) | π0-FAST LIBERO (Seen) | π0-FAST LIBERO (Unseen) | π0 LIBERO (Seen) | π0 LIBERO (Unseen) | π0* SimplerEnv (Seen) | π0* SimplerEnv (Unseen) | Average (Seen) | Average (Unseen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Token Unc. | Max prob. | 50.25 | 53.83 | 61.32 | 69.44 | - | - | - | - | 55.79 | 61.64 |
| Token Unc. | Avg prob. | 44.05 | 51.58 | 52.46 | 58.04 | - | - | - | - | 48.26 | 54.81 |
| Token Unc. | Max entropy | 52.94 | 53.09 | 46.69 | 62.96 | - | - | - | - | 49.81 | 58.03 |
| Token Unc. | Avg entropy | 45.27 | 50.03 | 50.93 | 58.63 | - | - | - | - | 48.10 | 54.33 |
| Embed. Distr. | Mahalanobis dist. | 62.03 | 58.85 | 93.56 | 83.79 | 77.12 | 74.31 | 88.42 | 52.84 | 80.28 | 67.45 |
| Embed. Distr. | Euclidean dist. k-NN | 66.00 | 55.23 | 92.04 | 84.12 | 75.64 | 70.73 | 89.73 | 68.41 | 80.85 | 69.62 |
| Embed. Distr. | Cosine dist. k-NN | 67.09 | 69.45 | 92.09 | 84.64 | 75.76 | 70.31 | 90.19 | 71.32 | 81.28 | 73.93 |
| Embed. Distr. | PCA-KMeans [9] | 57.18 | 55.10 | 68.46 | 57.12 | 64.92 | 60.35 | 66.88 | 61.19 | 64.36 | 58.44 |
| Embed. Distr. | RND [39] | 52.57 | 46.88 | 88.67 | 81.57 | 71.92 | 69.44 | 85.07 | 65.89 | 74.56 | 65.95 |
| Embed. Distr. | LogpZO [8] | 61.57 | 52.91 | 91.52 | 83.07 | 76.80 | 73.23 | 88.79 | 74.66 | 79.67 | 70.97 |
| Sample Consist. | Action total var. | 62.76 | 65.43 | 76.95 | 74.50 | 77.20 | 75.18 | 68.41 | 67.94 | 71.33 | 70.76 |
| Sample Consist. | Trans. total var. | 55.33 | 58.99 | 78.21 | 80.03 | 49.38 | 54.71 | 63.27 | 55.90 | 61.55 | 62.41 |
| Sample Consist. | Rot. total var. | 47.85 | 55.30 | 80.87 | 77.29 | 52.94 | 61.06 | 58.07 | 62.10 | 59.93 | 63.94 |
| Sample Consist. | Gripper total var. | 61.84 | 64.48 | 76.82 | 74.42 | 77.19 | 75.19 | 69.16 | 69.29 | 71.25 | 70.84 |
| Sample Consist. | Cluster entropy | 50.16 | 51.44 | 80.22 | 80.53 | 76.19 | 72.12 | 68.25 | 73.66 | 68.71 | 69.44 |
| Action Consist. | STAC [18] | - | - | 83.07 | 85.31 | 46.55 | 47.91 | 60.74 | 62.21 | 63.45 | 65.14 |
| Action Consist. | STAC-Single | - | - | 85.46 | 81.16 | 68.46 | 69.39 | 68.71 | 70.40 | 74.21 | 73.65 |
| SAFE (Ours) | SAFE-LSTM | 70.24 | 72.47 | 92.98 | 84.48 | 76.98 | 71.09 | 88.85 | 80.11 | 82.26 | 77.04 |
| SAFE (Ours) | SAFE-MLP | 72.68 | 73.47 | 90.06 | 80.44 | 73.50 | 73.27 | 89.50 | 84.82 | 81.43 | 78.00 |

The following figure (Figure 4 from the original paper) illustrates the accuracy–timeliness trade-off using CP across benchmarks:

Figure 4: In all simulation experiments, the proposed SAFE-LSTM and SAFE-MLP perform better than or on par with the best baselines. The plots show the variation of balanced accuracy (bal-acc) with respect to average detection time (T-det) on $\mathcal{D}_{\mathrm{eval\text{-}unseen}}$, under different significance levels $\alpha$ used for functional CP. Good failure detection methods should detect policy failures both accurately (high bal-acc) and proactively (low T-det), and thus place curves towards the top left in each plot. Note that baselines in gray require multiple action samples.

6.2. Ablations and Additional Analyses

  • Visual latent-space analyses (t-SNE) across datasets show successful vs. failed rollouts separate in VLA feature spaces, often revealing a common “failure zone.” Even when visual separation is less clear (e.g., real Franka tasks), SAFE still outperforms baselines.

  • Qualitative detections with CP bands:

    • The following figure (Figure 5 from the original paper) shows failures detected by SAFE-LSTM aligning with actual rollouts:

      Figure 5: Failures detected by SAFE-LSTM align well with the actual robot failures, as shown in the corresponding camera observations from simulation experiments. The blue-shaded areas show the functional CP band $C_{\alpha}$. Once failure scores exceed $C_{\alpha}$, a failure flag is raised. In (a), the $\pi_0$-FAST policy misses the insertion, and its actions become unstable after that. In (b) and (c), OpenVLA and $\pi_0^*$ miss the grasp but still proceed to the placing action, causing a failure detection. Note that these tasks are not seen when training SAFE-LSTM.

  • Real-world performance:

    • The following figure (Figure 6 from the original paper) shows SAFE-MLP leading on ROC-AUC and qualitative examples on Franka and WidowX:

      Figure 6: SAFE-MLP achieves the best failure detection performance in real-world experiments with both $\pi_0$-FAST Franka and OpenVLA WidowX. Plot (a) presents quantitative results, while (b)-(e) show qualitative examples from SAFE-MLP on the real robot. ROC-AUC values are averaged over five random seeds with different task splits.

  • Number of training tasks: Increasing training-task diversity generally improves unseen-task performance; SAFE retains strong performance even with fewer training tasks.

  • Foundation-model features vs. VLA features: Probing last-layer VLA features substantially outperforms DINOv2 or CLIP features, indicating VLAs’ latent spaces carry task-execution semantics (success/failure) more directly.

    The following are the results from Table 6 of the original paper:

| Method | 1 Task (Seen) | 1 Task (Unseen) | 3 Tasks (Seen) | 3 Tasks (Unseen) | 5 Tasks (Seen) | 5 Tasks (Unseen) | 7 Tasks (Seen) | 7 Tasks (Unseen) |
|---|---|---|---|---|---|---|---|---|
| Mahalanobis | 40.21 | 52.75 | 58.00 | 52.31 | 57.68 | 50.78 | 62.03 | 58.85 |
| Euclid. k-NN | 49.74 | 63.76 | 61.66 | 67.02 | 59.14 | 67.11 | 66.00 | 55.23 |
| Cosine k-NN | 53.27 | 60.76 | 65.39 | 65.64 | 67.46 | 70.57 | 67.09 | 69.45 |
| PCA-KMeans | 60.39 | 40.58 | 61.18 | 52.87 | 61.50 | 53.06 | 57.18 | 55.10 |
| RND | 29.29 | 50.32 | 54.46 | 47.39 | 56.71 | 49.15 | 52.57 | 46.88 |
| LogpZO | 61.75 | 56.17 | 52.89 | 50.49 | 65.99 | 56.60 | 61.57 | 52.91 |
| SAFE-LSTM | 50.88 | 52.25 | 68.85 | 63.31 | 70.70 | 66.31 | 70.24 | 72.47 |
| SAFE-MLP | 54.34 | 63.76 | 67.86 | 67.03 | 69.32 | 68.17 | 72.68 | 73.47 |

The following are the results from Table 7 of the original paper:

| Features | LSTM (Seen) | LSTM (Unseen) | MLP (Seen) | MLP (Unseen) |
|---|---|---|---|---|
| DINOv2 | 76.93 | 56.96 | 76.20 | 59.46 |
| CLIP | 76.77 | 52.71 | 77.88 | 59.77 |
| DINOv2+CLIP | 77.09 | 59.65 | 76.36 | 58.43 |
| VLA (Ours) | 77.27 | 58.70 | 86.76 | 64.16 |

The following are the results from Table 8 of the original paper:

| Method | OpenVLA LIBERO (Seen) | OpenVLA LIBERO (Unseen) | π0-FAST LIBERO (Seen) | π0-FAST LIBERO (Unseen) | π0 LIBERO (Seen) | π0 LIBERO (Unseen) | π0* SimplerEnv (Seen) | π0* SimplerEnv (Unseen) | π0-FAST Real Franka (Seen) | π0-FAST Real Franka (Unseen) |
|---|---|---|---|---|---|---|---|---|---|---|
| Max prob. | 50.25±2.51 | 53.83±6.32 | 61.32±9.57 | 69.44±13.61 | - | - | - | - | 53.74±3.46 | 48.59±3.00 |
| Avg prob. | 44.05±1.26 | 51.58±1.82 | 52.46±3.44 | 58.04±5.64 | - | - | - | - | 51.60±3.12 | 47.30±4.32 |
| Max entropy | 52.94±4.36 | 53.09±7.68 | 46.69±13.33 | 62.96±19.62 | - | - | - | - | 59.23±3.06 | 53.50±3.15 |
| Avg entropy | 45.27±1.78 | 50.03±3.18 | 50.93±1.22 | 58.63±3.47 | - | - | - | - | 50.67±3.96 | 46.08±4.79 |
| Mahalanobis dist. | 62.03±5.11 | 58.85±4.16 | 93.56±2.32 | 83.79±7.18 | 77.12±8.57 | 74.31±12.64 | 88.42±2.82 | 52.84±31.97 | 75.54±4.07 | 53.93±5.06 |
| Euclidean dist. k-NN | 66.00±2.33 | 55.23±10.05 | 92.04±2.39 | 84.12±6.47 | 75.64±6.20 | 70.73±16.69 | 89.73±3.08 | 68.41±9.22 | 80.35±5.36 | 60.27±4.79 |
| Cosine dist. k-NN | 67.09±2.74 | 69.45±6.14 | 92.09±1.70 | 84.64±4.90 | 75.76±6.16 | 70.31±16.84 | 90.19±4.05 | 71.32±12.02 | 80.23±5.12 | 59.51±5.76 |
| PCA-KMeans [9] | 57.18±2.04 | 55.10±1.16 | 68.46±4.92 | 57.12±10.44 | 64.92±8.90 | 60.35±19.93 | 66.88±5.10 | 61.19±14.76 | 51.91±4.20 | 49.86±6.19 |
| RND [39] | 52.57±4.56 | 46.88±4.92 | 88.67±3.05 | 81.57±8.67 | 71.92±7.02 | 69.44±19.39 | 85.07±4.04 | 65.89±6.52 | 62.00±5.44 | 45.83±5.10 |
| LogpZO [8] | 61.57±3.62 | 52.91±5.79 | 91.52±2.39 | 83.07±7.17 | 76.80±9.12 | 73.23±11.64 | 88.79±4.92 | 74.66±14.96 | 64.43±7.82 | 52.24±3.68 |
| Action total var. | 62.76±1.66 | 65.43±2.50 | 76.95±7.22 | 74.50±12.19 | 77.20±5.65 | 75.18±5.08 | 68.41±10.81 | 67.94±15.97 | - | - |
| Trans. total var. | 55.33±2.06 | 58.99±5.13 | 78.21±4.09 | 80.03±9.11 | 49.38±9.95 | 54.71±7.57 | 63.27±7.17 | 55.90±19.19 | - | - |
| Rot. total var. | 47.85±2.88 | 55.30±4.38 | 80.87±5.85 | 77.29±8.71 | 52.94±7.56 | 61.06±10.60 | 58.07±10.41 | 62.10±9.39 | - | - |
| Gripper total var. | 61.84±2.67 | 64.48±3.05 | 76.82±7.10 | 74.42±12.13 | 77.19±5.66 | 75.19±5.08 | 69.16±9.50 | 69.29±14.77 | - | - |
| Cluster entropy | 50.16±2.36 | 51.44±1.01 | 80.22±7.37 | 80.53±8.65 | 76.19±4.31 | 72.12±1.04 | 68.25±9.03 | 73.66±16.03 | - | - |
| STAC [18] | - | - | 83.07±4.61 | 85.31±6.71 | 46.55±8.90 | 47.91±20.94 | 60.74±13.89 | 62.21±16.72 | - | - |
| STAC-Single | - | - | 85.46±6.55 | 81.16±8.63 | 68.46±5.10 | 69.39±8.22 | 68.71±7.06 | 70.40±8.76 | 45.24±3.68 | 38.01±9.81 |
| SAFE-LSTM | 70.24±1.49 | 72.47±5.55 | 92.98±2.62 | 84.48±7.29 | 76.98±5.34 | 71.09±6.94 | 88.85±6.30 | 80.11±10.49 | 77.27±5.82 | 58.70±4.37 |
| SAFE-MLP | 72.68±2.38 | 73.47±5.39 | 90.06±2.82 | 80.44±5.72 | 73.50±7.43 | 73.27±11.85 | 89.50±4.49 | 84.82±8.12 | 86.76±2.64 | 64.16±5.88 |

6.3. Additional Tables (Datasets and Splits)

The following are the results from Table 2 of the original paper:

| Embodiment | Task ID | Environment Name | π0* Success Rate (%) |
|---|---|---|---|
| Google Robot | 1 | google_robot_move_near_v0 | 77 |
| Google Robot | 2 | google_robot_open_drawer | 50 |
| Google Robot | 3 | google_robot_close_drawer | 80 |
| Google Robot | 4 | google_robot_place_apple_in_closed_top_drawer | 40 |
| WidowX | 1 | widowx_carrot_on_plate | 44 |
| WidowX | 2 | widowx_put_eggplant_in_basket | 88 |
| WidowX | 3 | widowx_spoon_on_towel | 79 |
| WidowX | 4 | widowx_stack_cube | 43 |

The following are the results from Table 3 of the original paper:

| Task | Instruction | Rollout Length T |
|---|---|---|
| 1 | close the door | 300 |
| 2 | close the drawer | 200 |
| 3 | pick up the ball and place it in the bowl | 400 |
| 4 | pick up the knife and put it on the plate | 350 |
| 5 | pick up the lid and place it on the pot | 400 |
| 6 | pick up the lid from the pot and place it on the table | 400 |
| 7 | pick up the marker and place it in the cup | 400 |
| 8 | place the green block on the yellow block | 350 |
| 9 | place the pink cup to the right of the blue cup | 300 |
| 10 | press the button | 200 |
| 11 | put both the carrot and the ball in the bowl | 500 |
| 12 | put the cup to the upright position | 500 |
| 13 | unfold the cloth | 500 |

The following are the results from Table 4 of the original paper:

| Task | Instruction |
|---|---|
| 1 | Lift AAA Battery |
| 2 | Lift Eggplant |
| 3 | Lift Red Bottle |
| 4 | Lift Blue Cup |
| 5 | Put Blue Cup on Plate |
| 6 | Put the Red Bottle into Pot |
| 7 | Put the Carrot on Plate |
| 8 | Put the Red Block into the Pot |

The following are the results from Table 5 of the original paper:

| Benchmark | Tasks (Seen) | Tasks (Unseen) | Tasks (Total) | Rollouts (Train) | Rollouts (Eval Seen) | Rollouts (Eval Unseen) | Rollouts (Total) |
|---|---|---|---|---|---|---|---|
| LIBERO | 7 | 3 | 10 | 210 | 140 | 150 | 500 |
| π0 SimplerEnv, Google Robot | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| π0 SimplerEnv, WidowX | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| Octo SimplerEnv | 9 | 3 | 12 | 594 | 306 | 300 | 1200 |
| Real Franka | 10 | 3 | 13 | 450 | 150 | 180 | 780 |
| Real WidowX | 6 | 2 | 8 | 250 | 133 | 149 | 532 |

7. Conclusion & Reflections

7.1. Conclusion Summary

SAFE tackles multitask failure detection for generalist VLAs by probing last-layer internal features and learning a scalar failure score over time, with thresholds calibrated via functional conformal prediction. Across diverse models and environments, SAFE achieves state-of-the-art ROC-AUC and the best accuracy–timeliness trade-offs, outperforming representative baselines (token UQ, sample consistency, OOD detectors, and STAC). Analyses show VLAs encode task-generic success/failure signals in their latent spaces; SAFE leverages this structure efficiently and robustly.

7.2. Limitations & Future Work

  • Scope: Focused on manipulation tasks; generalization across embodiments, sim2real gaps, or action-less video conditioning remains open.
  • Feature depth: Uses last-layer features only; fusing multi-layer information or discovering optimal intermediate layers may further improve detection.
  • CP under shift: Functional CP bands calibrated on seen tasks may face distribution shift on unseen tasks; exploring adaptive/online CP could enhance calibration.
  • Recovery/improvement: SAFE provides timely detection; integrating with recovery policies or activation steering to proactively correct VLA behaviors is a promising direction, but closed-loop manipulation complicates direct transfer from LLM steering.

7.3. Personal Insights & Critique

  • Strengths:
    • Elegant insight that VLA latents carry generic failure semantics, demonstrated visually and empirically.
    • Practicality: small MLP/LSTM probes with negligible runtime overhead; compatible with multiple VLAs without retraining them.
    • Principled thresholding: CP-based bands offer interpretable, calibrated alerts and tunable conservativeness.
  • Considerations:
    • While SAFE generalizes across tasks, it still requires training on a diverse set of successes and failures. The real-world cost of collecting failures can be non-trivial; leveraging synthetic/augmented failures or semi-supervised methods might reduce data burden.
    • CP coverage guarantees hinge on exchangeability; performance under significant domain shift may deviate. The paper acknowledges deviations in TNR curves; future work could incorporate adaptive CP to correct calibration drift.
    • Feature aggregation choices can impact performance; automated feature-selection across layers or attention-based pooling might improve robustness.
  • Transferability:
    • The latent-probing paradigm is broadly applicable: beyond VLAs, any large multimodal policy may encode execution status signals. SAFE-like probes could monitor failures in mobile manipulation, navigation, or multimodal human-robot interaction.

    • Combining SAFE with policy improvement (e.g., activation steering guided by separation vectors) is compelling but requires careful study of causality between latents and physical outcomes in embodied agents.

      In summary, SAFE is a timely, well-motivated, and empirically strong contribution to safe deployment of generalist VLAs. Its simplicity and effectiveness make it a valuable building block for practical robot monitoring and a foundation for future recovery and improvement strategies.
