
SAFE: Multitask Failure Detection for Vision-Language-Action Models

Published: 06/12/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

SAFE introduces a multitask failure detector for vision-language-action models that leverages internal VLA features to predict task failure. It demonstrates superior performance across diverse environments, balances accuracy and timeliness, and supports zero-shot generalization to unseen tasks.

Abstract

While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, $\pi_0$, and $\pi_0$-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/

Mind Map

In-depth Reading

English Analysis

1. Bibliographic Information

1.1. Title

SAFE: Multitask Failure Detection for Vision-Language-Action Models

1.2. Authors

  • Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Florian Shkurti (University of Toronto, UofT Robotics Institute; Vector Institute)

  • Haruki Nishimura, Masha Itkina (Toyota Research Institute)

    The authors draw from robotics, robot learning, uncertainty quantification, and applied machine learning. UofT and TRI have extensive track records in robotic manipulation and safety research.

1.3. Journal/Conference

arXiv preprint. The work is not yet a peer-reviewed publication; it positions itself within the robotic learning and uncertainty quantification community, referencing and evaluating against recent CoRL and RSS works.

1.4. Publication Year

2025 (arXiv posting: 2025-06-11T16:59:13.000Z)

1.5. Abstract

Vision-language-action (VLA) models are generalist robotic policies capable of following natural language instructions to perform diverse manipulation tasks. Out of the box, VLAs achieve strong performance on seen tasks but suffer a large drop on unseen tasks (from roughly 80–90% to 30–60% success). To safely deploy VLAs without per-task fine-tuning, the paper introduces multitask failure detection: train a single detector on seen tasks and evaluate it zero-shot on unseen tasks. The proposed method, SAFE (ScAlable Failure Estimation), probes VLA internal features and regresses a scalar failure score over time. Using functional conformal prediction (CP), SAFE sets time-varying thresholds that provide distribution-free guarantees under mild assumptions. SAFE is compatible with multiple VLA architectures (OpenVLA, π0, π0-FAST) and shows superior accuracy-time trade-offs over baselines (token uncertainty, sample consistency, OOD detectors, and action-consistency methods) across simulation (LIBERO-10, SimplerEnv) and real robots (Franka, WidowX).

2. Executive Summary

2.1. Background & Motivation

  • Problem: VLAs often fail on unseen tasks and novel environments when deployed directly. Without safe monitoring, failures can lead to unsafe robot behaviors. Existing failure detectors are typically task-specific and do not generalize across tasks, making them impractical for generalist VLAs.
  • Why important: Generalist policies are meant to encounter new instructions and settings. It is infeasible to train and calibrate detectors per task. A task-generic detector must be accurate, timely, efficient, and compatible with various VLA architectures to support real-time robotic control.
  • Entry point/innovation: The authors analyze latent features inside VLA models and discover a “failure zone”—a feature-space region that captures high-level, task-generic separation between success and failure. This motivates probing internal features to learn a single scalar failure score and threshold it via conformal prediction to provide calibrated, time-varying alerts.

2.2. Main Contributions / Findings

  • Contribution 1: Introduce multitask failure detection for generalist VLAs—train on a subset of tasks (seen) and evaluate zero-shot on held-out (unseen) tasks without per-task fine-tuning or additional rollouts.
  • Contribution 2: Empirical analysis shows VLA internal features distinctly separate successful vs. failed rollouts across tasks. Failure trajectories converge in a shared “failure zone” under t-SNE visualization.
  • Contribution 3: SAFE—an efficient detector that:
    • Probes last-layer VLA features.
    • Learns a scalar failure score with a simple MLP or LSTM over time.
    • Uses functional conformal prediction to set time-varying thresholds with distribution-free guarantees under exchangeability.
  • Contribution 4: Extensive evaluation across models (OpenVLA, π0, π0-FAST) and settings (LIBERO-10, SimplerEnv, real Franka, real WidowX) shows SAFE achieves state-of-the-art failure detection (ROC-AUC) and best accuracy–timeliness trade-offs, outperforming strong baselines (including OOD detectors and action-consistency).

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language-Action (VLA) Models: Neural policies that condition on images (RGB), language instructions, and robot states to output actions. Typically built on transformer architectures pretrained as vision-language models (VLMs), then augmented with action heads (e.g., discrete action tokenization, diffusion, flow matching). VLAs aim for generalist manipulation, handling diverse tasks without per-task retraining.
  • Transformer fundamentals: A transformer uses multi-head self-attention to integrate information across tokens. While the paper does not derive attention, the core formula is $\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$, where $Q$ are queries, $K$ keys, $V$ values, and $d_k$ is the key dimensionality (a minimal sketch follows this list). VLAs leverage transformer blocks to encode multimodal inputs and intermediate latent features.
  • Conformal Prediction (CP): A distribution-free framework that provides calibrated prediction sets or bands under exchangeability assumptions. Functional CP constructs time-varying bands for sequences, ensuring that, with probability $1-\alpha$, a new successful rollout’s score remains within the band at all timesteps. Exceeding the band signals an anomaly/failure with a controlled false positive rate.
  • Uncertainty Quantification (UQ) in LLMs/VLMs: Methods estimate confidence (e.g., token-level entropy/probability) or consistency (variance across sampled outputs). They can be adapted to actions from VLAs as proxies for failure likelihood but require multiple samples or tend to be weak indicators.
  • Out-of-Distribution (OOD) detection: Methods model distributions of embeddings for “normal” data and flag deviations. Applied to robotics: learn distributions from successful rollouts and score new embeddings by distance or learned density. Strong for detecting anomalies but may misclassify unseen-but-successful tasks as failures.
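To make the attention formula above concrete, here is a minimal NumPy sketch of scaled dot-product attention (background illustration only; this code is not from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.swapaxes(-2, -1) / np.sqrt(d_k)
    return softmax(scores, axis=-1) @ V

# Toy example: 4 tokens with d_k = 8.
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = attention(Q, K, V)  # shape (4, 8)
```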

3.2. Previous Works

  • VLAs: OpenVLA; RT-series (RT-1, RT-2); Octo; π0 (flow matching); FAST (efficient action tokenization). These achieve good performance on seen tasks but reduced success on unseen tasks.
  • Failure detection in robotics:
    • Task-specific detectors trained per task—unsuitable for generalist VLAs.
    • FAIL-Detect [8]: systematic evaluation of detectors (OOD, smoothness, consistency), reporting LogpZO (flow-matching-based density) as strongest—but for single-task policies; suffers in multitask generalization per this paper.
    • STAC [18]: action-consistency over overlapping chunks, requiring many samples per timestep; strong performance but heavy inference overhead for large VLAs.
    • VLM-based detectors [19, 20]: query large models for success/failure judgments; expensive and impractical for real-time control.
  • UQ for LLMs/VLMs: token probability/entropy; semantic consistency; latent-space classifiers (e.g., hallucination detection by probing internal activations), showing simple probes can be effective and efficient.

3.3. Technological Evolution

  • From specialist, per-task policies and detectors → generalist VLAs that can follow diverse instructions.
  • From task-specific failure detectors → desire for task-generic, multitask detectors compatible with VLAs.
  • From heavy sampling or large-model queries → lightweight feature-probing and time-calibrated thresholds.

3.4. Differentiation Analysis

  • SAFE’s novelty lies in:
    • Probing last-layer VLA features to exploit emergent task-generic success/failure structure.
    • A unified detector trained on success/failure across tasks, evaluated zero-shot on unseen tasks.
    • Functional CP setting time-varying thresholds with theoretical calibration for false positives on successful rollouts.
  • Compared to baselines:
    • UQ (token/sample consistency) is weaker or expensive (needs multiple samples).
    • OOD detectors model “normal” success distribution but may overfit and misclassify unseen successes; SAFE directly learns failure discrimination.
    • STAC yields strong performance but sampling overhead is impractical for large VLAs; SAFE operates with single forward passes and minimal overhead.

4. Methodology

4.1. Principles

Core idea: VLA internal features encode high-level, abstract indicators of task success versus failure across tasks. By probing these latent features over time and learning a scalar failure score $s_t$, we can detect failures early. A principled, time-varying threshold $\delta_t$ from functional conformal prediction yields calibrated alerts with controlled false positive rates on successful rollouts.

The following figure (Figure 2 from the original paper) shows the system architecture:

Figure 2: The proposed failure detector, SAFE, has three major components: (1) SAFE extracts the latent feature from the last layer of a VLA model; (2) SAFE sequentially processes the latent feature and predicts a failure score, using an MLP or an LSTM backbone; and (3) SAFE determines a time-varying threshold using functional conformal prediction (CP) on a hold-out calibration set. If the predicted score exceeds the threshold during testing, SAFE confidently detects a failure.

4.2. Data Flow and Feature Extraction

  • Inputs at timestep $t$: observation $\mathbf{o}_t$ (images, instruction, robot state).

  • VLA outputs: a chunk of actions $\mathbf{A}_t = [\mathbf{a}_t, \mathbf{a}_{t+1}, \ldots, \mathbf{a}_{t+H-1}]$ for horizon $H$. The first $H' \leq H$ actions are executed, then the policy replans at $t+H'$.

  • Internal embedding: $\mathbf{e}_t$ denotes the latent feature vector at time $t$ from the last layer of the VLA, before decoding into tokens (OpenVLA, π0-FAST) or a velocity field (π0).

  • Token decoding (some VLAs): $\mathbf{W}_t = [\mathbf{w}_t^1, \dots, \mathbf{w}_t^m]$ are action tokens; SAFE can also use token-level uncertainty baselines (for comparison), but its core relies on $\mathbf{e}_t$.

    The authors ablate multiple aggregation strategies to obtain a single vector $\mathbf{e}_t$ from multi-token or multi-step internal tensors $E$:

  • First: $\mathbf{e} = E_1$

  • Last: $\mathbf{e} = E_n$

  • Mean: $\mathbf{e} = \frac{1}{n}\sum_{i=1}^n E_i$

  • First&Last: $\mathbf{e} = \mathrm{concat}(E_1, E_n)$

    For π0 (flow matching), features exist across horizon and diffusion steps: $E \in \mathbb{R}^{H \times k \times d}$, aggregated along the $H$ and $k$ dimensions via $\mathrm{agg}_{\mathrm{hori}}$ and $\mathrm{agg}_{\mathrm{diff}}$ to produce $\mathbf{e} \in \mathbb{R}^{d'}$ (sketched below).
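As a concrete reference for these aggregation strategies, a minimal NumPy sketch (function and variable names are ours, not from the paper's code):

```python
import numpy as np

def aggregate(E: np.ndarray, mode: str) -> np.ndarray:
    """Collapse per-token features E of shape (n, d) into a single vector e,
    mirroring the First/Last/Mean/First&Last strategies described above."""
    if mode == "first":
        return E[0]
    if mode == "last":
        return E[-1]
    if mode == "mean":
        return E.mean(axis=0)
    if mode == "first_last":
        return np.concatenate([E[0], E[-1]])
    raise ValueError(f"unknown mode: {mode}")

# For pi0 (flow matching), features span horizon and diffusion steps, shape (H, k, d).
# Aggregate the diffusion axis first, then the horizon axis (here both by mean):
E_pi0 = np.zeros((8, 10, 512))             # dummy (H, k, d) tensor
e = aggregate(E_pi0.mean(axis=1), "mean")  # agg_diff = mean, then agg_hori = mean
```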

4.3. SAFE Score Models

SAFE learns a scalar failure score $s_t$ from the feature history $\mathbf{e}_{0:t} = \{\mathbf{e}_1, \ldots, \mathbf{e}_t\}$.

4.3.1. MLP-based Accumulation

At each timestep $t$, an MLP $g(\cdot)$ maps $\mathbf{e}_t$ to a scalar, and a sigmoid $\sigma(\cdot)$ normalizes it. The cumulative failure score up to $t$ is: $ \bar{f}_{\mathrm{MLP}}(\mathbf{e}_{0:t}) = \sum_{\tau = 1}^{t} \sigma\big(g(\mathbf{e}_{\tau})\big) $

  • Symbols:
    • $\mathbf{e}_{\tau}$: VLA latent feature at time $\tau$.

    • $g(\cdot)$: MLP mapping features to a scalar logit.

    • $\sigma(\cdot)$: sigmoid activation, ensuring $0 < \sigma(\cdot) < 1$.

    • $\bar{f}_{\mathrm{MLP}}$: cumulative failure score.

      Range property noted by the authors: $0 < s_t < t$ (a sum of $t$ sigmoids, each in $(0,1)$).

Training loss (applied across timesteps), pushing scores up for failed rollouts ($y_i=1$) and down for successful ones ($y_i=0$): $ L_{\mathrm{MLP}} = \sum_i \left[ y_i \sum_t (t - s_t) + (1 - y_i) \sum_t s_t \right] $

  • Symbols:
    • $i$: indexes rollouts in the training set $\mathcal{D}_{\mathrm{train}}$.

    • $y_i \in \{0,1\}$: rollout-level label (1 = failure; 0 = success).

    • $s_t$: failure score at time $t$ (here, the cumulative form).

      Intuition: for failures, encourage $s_t$ to be high (closer to $t$); for successes, encourage $s_t$ to be low (see the sketch below).
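A minimal PyTorch sketch of the MLP-based cumulative scorer and this loss (layer sizes and names are our assumptions, not the paper's code):

```python
import torch
import torch.nn as nn

class SAFEMLPScorer(nn.Module):
    """Cumulative failure score s_t = sum_{tau <= t} sigmoid(g(e_tau))."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.g = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (T, feat_dim) features of one rollout -> (T,) cumulative scores
        per_step = torch.sigmoid(self.g(e)).squeeze(-1)  # each term in (0, 1)
        return torch.cumsum(per_step, dim=0)             # hence 0 < s_t < t

def mlp_loss(s: torch.Tensor, y: float) -> torch.Tensor:
    # y = 1 (failure): push s_t toward t; y = 0 (success): push s_t toward 0
    t = torch.arange(1, len(s) + 1, dtype=s.dtype)
    return (y * (t - s) + (1 - y) * s).sum()
```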

4.3.2. LSTM-based Sequential Scoring

An LSTM processes $\mathbf{e}_{0:t}$ sequentially; a linear head plus sigmoid produces $s_t$: $ f_{\mathrm{LSTM}}(\mathbf{e}_{0:t}) = \sigma\big(\mathrm{LSTM}(\mathbf{e}_{0:t})\big) $

  • Symbols:

    • $\mathrm{LSTM}(\cdot)$: recurrent model capturing the temporal evolution of features.

    • $\sigma(\cdot)$: sigmoid, yielding $0 \leq s_t \leq 1$.

      The training loss is binary cross-entropy (BCE) over timesteps: $ L_{\mathrm{LSTM}} = -\sum_i \sum_t \left[ y_i \log(s_t) + (1 - y_i) \log(1 - s_t) \right] $

  • Symbols as above; BCE encourages $s_t$ near 1 for failed rollouts and near 0 for successful rollouts. A sketch of this variant follows.
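A matching PyTorch sketch of the LSTM variant with the BCE objective (hidden size again an assumption):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SAFELSTMScorer(nn.Module):
    """Sequential failure score s_t = sigmoid(linear(LSTM(e_{0:t})))."""
    def __init__(self, feat_dim: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, e: torch.Tensor) -> torch.Tensor:
        # e: (B, T, feat_dim) -> (B, T) per-timestep failure scores in (0, 1)
        h, _ = self.lstm(e)
        return torch.sigmoid(self.head(h)).squeeze(-1)

def lstm_loss(scores: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    # y: (B,) float rollout labels, broadcast to every timestep
    target = y.unsqueeze(1).expand_as(scores)
    return F.binary_cross_entropy(scores, target, reduction="sum")
```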

    The following figure (Figure 1 from the original paper) motivates SAFE by showing latent-space separation:

    Figure 1: The internal features of a VLA capture high-level information about task success and failure. When the VLA is failing, the features, even those from different tasks, fall into the same "failure zone". This motivates SAFE, an efficient multitask failure detector that is based on VLA internal features and can generalize to unseen tasks. Plot (a) visualizes the latent features of $\pi_0$-FAST on LIBERO-10 [56] using t-SNE [57]. For successful rollouts, features are colored in blue. For failed rollouts, features follow a blue-to-red gradient based on timestep progression, with red corresponding to later timesteps that often coincide with failure. Plot (b) visualizes the same set of t-SNE features, colored by task ID. In (c), we show two example rollouts over time and mark their corresponding projected features in (a) and (b).

4.4. Threshold Selection via Functional Conformal Prediction

SAFE raises a failure flag when $s_t$ exceeds a time-varying threshold $\delta_t$. The threshold is constructed from a CP prediction band calibrated on successful rollouts in a held-out calibration set $\mathcal{D}_{\mathrm{eval\text{-}seen}}$.

  • One-sided time-varying CP band $C_{\alpha} = \{[\mathrm{lower}_t, \mathrm{upper}_t]: t=1,\ldots,T\}$ with:
    • $\mathrm{lower}_t = -\infty$
    • $\mathrm{upper}_t = \mu_t + h_t$
  • Symbols:
    • $\alpha \in (0,1)$: user-specified significance level, controlling conservativeness.

    • $\mu_t$: time-varying mean of scores from successful rollouts at time $t$.

    • $h_t$: bandwidth chosen to achieve coverage $1-\alpha$ under exchangeability.

    • $T$: rollout duration.

      Guarantee (under mild assumptions): for a new successful rollout, $s_t < \mu_t + h_t$ for all $t = 1,\dots,T$ with probability $1-\alpha$. Conversely, $s_t > \mu_t + h_t$ indicates failure with nominal confidence $1-\alpha$. SAFE uses $\delta_t = \mathrm{upper}_t$ as the threshold. A sketch of this construction follows.
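The paper calibrates this band with functional CP; the sketch below shows one common way to realize such a one-sided band (per-timestep mean plus a conformally calibrated multiple of a per-timestep spread; this particular recipe is our assumption, not necessarily the paper's exact construction):

```python
import numpy as np

def one_sided_cp_band(cal_scores: np.ndarray, alpha: float = 0.1):
    """cal_scores: (N, T) failure-score curves of N successful calibration rollouts.
    Returns (mu_t, delta_t), where delta_t = mu_t + h_t is the upper band."""
    mu = cal_scores.mean(axis=0)
    sigma = cal_scores.std(axis=0) + 1e-8        # modulation: band width varies over t
    r = ((cal_scores - mu) / sigma).max(axis=1)  # sup_t deviation per calibration rollout
    n = len(r)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(r)[k - 1]                        # conformal quantile of sup-deviations
    return mu, mu + q * sigma

# At test time, raise a failure flag the first time s_t exceeds delta_t.
```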

5. Experimental Setup

5.1. Datasets

  • LIBERO-10: A challenging simulation suite of 10 long-horizon manipulation tasks with diverse objects, layouts, and instructions. VLAs (OpenVLA, π0, π0-FAST) are evaluated using authors’ released checkpoints. 3/10 tasks randomly designated as unseen; remaining 7 are seen.

  • SimplerEnv: High-fidelity simulation environments replicating real datasets (RT-series, BridgeData V2). Evaluates π0 reproductions (denoted $\pi_0^*$) on Google Robot and WidowX embodiments. Each embodiment uses 4 tasks (3 seen, 1 unseen); 100 rollouts per task.

  • Real-world Franka: Deploy π0-FAST-DROID on a Franka Emika Panda. 13 tasks; for each, 30 successful and 30 failed rollouts. 3/13 tasks as unseen. Fixed rollout lengths per task.

  • Real-world WidowX: Deploy OpenVLA pretrained on “Open-X Magic Soup++” on a WidowX arm. 8 lifting and pick-and-place tasks, totaling 532 rollouts (244 successes, 288 failures). 2/8 tasks unseen.

    Examples of task instructions (from real setups) are given in Tables 3 and 4 below.

5.2. Evaluation Metrics

  • ROC-AUC (Area Under the Receiver Operating Characteristic curve):
    • Concept: Measures the ability of a scalar score to separate positive (failures) from negative (successes) across all thresholds. Higher AUC indicates better overall discrimination.
    • Formula (conceptually, the area under the curve): $ \mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(x)) \, dx $ Equivalently, for discrete points, AUC is the area under the ROC curve obtained by plotting $\mathrm{TPR}$ vs. $\mathrm{FPR}$ across thresholds.
    • Symbols:
      • $\mathrm{TPR}$: true positive rate (sensitivity; recall of failures).
      • $\mathrm{FPR}$: false positive rate ($1 - \mathrm{TNR}$).
  • TPR (True Positive Rate): $ \mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}} $
    • TP: number of correctly detected failed rollouts.
    • FN: number of missed failed rollouts.
  • TNR (True Negative Rate, specificity): $ \mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}} $
    • TN: number of successful rollouts with no failure flag raised.
    • FP: number of successful rollouts falsely flagged as failures at any time.
  • Balanced Accuracy: $ \mathrm{Bal\text{-}Acc} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} $
  • Detection Time ($T_{\mathrm{det}}$):
    • Concept: for each rollout with a ground-truth failure, the time index at which $s_t$ first exceeds $\delta_t$ (the CP band's upper bound). Lower values indicate earlier detection.
    • Formalization: $ T_{\mathrm{det}} = \min\{ t \in \{1,\dots,T\} \mid s_t > \delta_t \} $, averaged across failed rollouts in evaluation.

SAFE is also evaluated using the maximum-so-far score $\bar{s}_t = \max_{\tau=1,\dots,t} s_\tau$, computing ROC-AUC on $\bar{s}_T$ (the max over the rollout), reflecting that any threshold crossing during a successful rollout counts as a false positive. A sketch of these metrics follows.
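For concreteness, a sketch of how these rollout-level metrics can be computed from per-timestep scores and a CP threshold (our illustration, using scikit-learn's ROC-AUC):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(scores, labels, delta):
    """scores: list of (T_i,) per-timestep failure-score arrays, one per rollout.
    labels: (N,) array, 1 = failed rollout, 0 = successful. delta: (T_max,) threshold."""
    # ROC-AUC on the max-so-far score over the whole rollout
    auc = roc_auc_score(labels, np.array([s.max() for s in scores]))

    flagged = np.array([np.any(s > delta[: len(s)]) for s in scores])
    tpr = flagged[labels == 1].mean()     # failures correctly flagged
    tnr = (~flagged[labels == 0]).mean()  # successes never flagged
    bal_acc = (tpr + tnr) / 2

    # average detection time over failed rollouts that were flagged
    det = [np.argmax(s > delta[: len(s)])  # index of first threshold crossing
           for s, y, f in zip(scores, labels, flagged) if y == 1 and f]
    t_det = float(np.mean(det)) if det else float("nan")
    return auc, bal_acc, t_det
```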

5.3. Baselines

  • Token Uncertainty (single forward pass; for tokenized-action VLAs):
    • Max prob.: $\max_i(-\log p_i)$
    • Avg prob.: $-\frac{1}{m}\sum_i \log p_i$
    • Max entropy: $\max_i H_i$
    • Avg entropy: $\frac{1}{m}\sum_i H_i$
    • Symbols: $p_i$ is the probability of token $\mathbf{w}_t^i$; $H_i$ is the entropy for token $i$; $m$ tokens per action.
  • Sample Consistency (requires multiple action samples per timestep):
    • Action total variance: $\mathrm{trace}(\mathrm{cov}(\mathcal{A}_t))$ over sampled action sequences $\mathcal{A}_t = \{\mathbf{A}_t^k\}_{k=1}^K$.
    • Translational, rotational, and gripper variances are computed similarly on the corresponding components.
    • Cluster entropy: entropy over cluster labels from agglomerative clustering of sampled actions.
  • Embedding Distance (feature-space, OOD-inspired; see the sketch after this list): $ s_t = d(\mathbf{e}_t, E_{\mathrm{succ}}) - d(\mathbf{e}_t, E_{\mathrm{fail}}) $ where $E_{\mathrm{succ}}$ and $E_{\mathrm{fail}}$ are sets of embeddings from successful and failed training rollouts, respectively; $d(\cdot,\cdot)$ is the Mahalanobis distance or a $k$-NN averaged Euclidean/cosine distance; a PCA-KMeans distance is also evaluated.
  • Learned OOD Detectors:
    • RND and LogpZO adapted to embeddings: train $f_{\mathrm{succ}}^{\mathrm{OOD}}(\cdot)$ and $f_{\mathrm{fail}}^{\mathrm{OOD}}(\cdot)$ on $E_{\mathrm{succ}}$ and $E_{\mathrm{fail}}$, respectively; $ s_t = f_{\mathrm{succ}}^{\mathrm{OOD}}(\mathbf{e}_t) - f_{\mathrm{fail}}^{\mathrm{OOD}}(\mathbf{e}_t) $
  • Action Consistency:
    • STAC [18]: statistical distance between overlapping action chunks across timesteps (requires many samples, e.g., 256; tested here with 10 in simulation).
    • STAC-Single: a single-sample variant for real-time use, using one action per timestep.
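As one concrete baseline instance, a sketch of the k-NN embedding-distance score (our illustration):

```python
import numpy as np

def knn_distance_score(e_t, E_succ, E_fail, k=10):
    """s_t = d(e_t, E_succ) - d(e_t, E_fail), with d the mean Euclidean
    distance to the k nearest stored embeddings; higher s_t suggests failure."""
    def knn_dist(e, E):
        d = np.linalg.norm(E - e, axis=1)  # distance to every stored embedding
        return np.sort(d)[:k].mean()
    return knn_dist(e_t, E_succ) - knn_dist(e_t, E_fail)
```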

6. Results & Analysis

6.1. Core Results Analysis

SAFE consistently achieves top or near-top ROC-AUC across simulation and real robots, and the best accuracy–timeliness trade-offs via CP. Token uncertainty baselines are weak (consistent with LLM UQ literature). Sample consistency and STAC perform well when multiple samples are available but are computationally heavy for VLAs. Embedding-distance baselines are strong, validating that VLA latents encode success/failure; however, SAFE’s learned probe extracts more abstract signals and generalizes better to unseen tasks, mitigating overfitting observed in learned OOD detectors like LogpZO for multitask settings.

Trade-off plots (Bal-Acc vs. $T_{\mathrm{det}}$ across values of $\alpha$) show SAFE detects failures earlier without sacrificing accuracy, often aligning with human-judged failure moments and sometimes preceding them, enabling preemptive intervention.

Qualitative examples across simulation and real robots illustrate that SAFE scores rise upon missed grasps, unstable motions, oscillations, slips, or freezing, crossing CP bands appropriately.

The following are the results from Table 1 of the original paper:

| Category | Method | OpenVLA LIBERO (Seen) | OpenVLA LIBERO (Unseen) | π0-FAST LIBERO (Seen) | π0-FAST LIBERO (Unseen) | π0 LIBERO (Seen) | π0 LIBERO (Unseen) | π0* SimplerEnv (Seen) | π0* SimplerEnv (Unseen) | Average (Seen) | Average (Unseen) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Token Unc. | Max prob. | 50.25 | 53.83 | 61.32 | 69.44 | - | - | - | - | 55.79 | 61.64 |
| Token Unc. | Avg prob. | 44.05 | 51.58 | 52.46 | 58.04 | - | - | - | - | 48.26 | 54.81 |
| Token Unc. | Max entropy | 52.94 | 53.09 | 46.69 | 62.96 | - | - | - | - | 49.81 | 58.03 |
| Token Unc. | Avg entropy | 45.27 | 50.03 | 50.93 | 58.63 | - | - | - | - | 48.10 | 54.33 |
| Embed. Distr. | Mahalanobis dist. | 62.03 | 58.85 | 93.56 | 83.79 | 77.12 | 74.31 | 88.42 | 52.84 | 80.28 | 67.45 |
| Embed. Distr. | Euclidean dist. k-NN | 66.00 | 55.23 | 92.04 | 84.12 | 75.64 | 70.73 | 89.73 | 68.41 | 80.85 | 69.62 |
| Embed. Distr. | Cosine dist. k-NN | 67.09 | 69.45 | 92.09 | 84.64 | 75.76 | 70.31 | 90.19 | 71.32 | 81.28 | 73.93 |
| Embed. Distr. | PCA-KMeans [9] | 57.18 | 55.10 | 68.46 | 57.12 | 64.92 | 60.35 | 66.88 | 61.19 | 64.36 | 58.44 |
| Embed. Distr. | RND [39] | 52.57 | 46.88 | 88.67 | 81.57 | 71.92 | 69.44 | 85.07 | 65.89 | 74.56 | 65.95 |
| Embed. Distr. | LogpZO [8] | 61.57 | 52.91 | 91.52 | 83.07 | 76.80 | 73.23 | 88.79 | 74.66 | 79.67 | 70.97 |
| Sample Consist. | Action total var. | 62.76 | 65.43 | 76.95 | 74.50 | 77.20 | 75.18 | 68.41 | 67.94 | 71.33 | 70.76 |
| Sample Consist. | Trans. total var. | 55.33 | 58.99 | 78.21 | 80.03 | 49.38 | 54.71 | 63.27 | 55.90 | 61.55 | 62.41 |
| Sample Consist. | Rot. total var. | 47.85 | 55.30 | 80.87 | 77.29 | 52.94 | 61.06 | 58.07 | 62.10 | 59.93 | 63.94 |
| Sample Consist. | Gripper total var. | 61.84 | 64.48 | 76.82 | 74.42 | 77.19 | 75.19 | 69.16 | 69.29 | 71.25 | 70.84 |
| Sample Consist. | Cluster entropy | 50.16 | 51.44 | 80.22 | 80.53 | 76.19 | 72.12 | 68.25 | 73.66 | 68.71 | 69.44 |
| Action Consist. | STAC [18] | - | - | 83.07 | 85.31 | 46.55 | 47.91 | 60.74 | 62.21 | 63.45 | 65.14 |
| Action Consist. | STAC-Single | - | - | 85.46 | 81.16 | 68.46 | 69.39 | 68.71 | 70.40 | 74.21 | 73.65 |
| SAFE (Ours) | SAFE-LSTM | 70.24 | 72.47 | 92.98 | 84.48 | 76.98 | 71.09 | 88.85 | 80.11 | 82.26 | 77.04 |
| SAFE (Ours) | SAFE-MLP | 72.68 | 73.47 | 90.06 | 80.44 | 73.50 | 73.27 | 89.50 | 84.82 | 81.43 | 78.00 |

The following figure (Figure 4 from the original paper) illustrates the accuracy–timeliness trade-off using CP across benchmarks:

Figure 4: In all simulation experiments, the proposed SAFE-LSTM and SAFE-MLP perform better than or on par with the best baselines. The plots show the variation of balanced accuracy (bal-acc) with respect to average detection time (T-det) on $\mathcal{D}_{\mathrm{eval\text{-}unseen}}$, under different significance levels $\alpha$ used for functional CP. Good failure detection methods should detect policy failures both accurately (high bal-acc) and proactively (low T-det), and thus place curves towards the top left in each plot. Note that baselines in gray require multiple action samples.

6.2. Ablations and Additional Analyses

  • Visual latent-space analyses (t-SNE) across datasets show successful vs. failed rollouts separate in VLA feature spaces, often revealing a common “failure zone.” Even when visual separation is less clear (e.g., real Franka tasks), SAFE still outperforms baselines.

  • Qualitative detections with CP bands:

    • The following figure (Figure 5 from the original paper) shows failures detected by SAFE-LSTM aligning with actual rollouts:

      Figure 5: Failures detected by SAFE-LSTM align well with the actual robot failures, as shown in the corresponding camera observations from simulation experiments. The blue-shaded areas show the functional CP band $C_{\alpha}$. Once failure scores exceed $C_{\alpha}$, a failure flag is raised. In (a), the $\pi_0$-FAST policy misses the insertion, and its actions become unstable after that. In (b) and (c), OpenVLA and $\pi_0^*$ miss the grasp but still proceed to the placing action, causing a failure detection. Note that these tasks are not seen when training SAFE-LSTM.

  • Real-world performance:

    • The following figure (Figure 6 from the original paper) shows SAFE-MLP leading on ROC-AUC and qualitative examples on Franka and WidowX:

      Figure 6: SAFE-MLP achieves the best failure detection performance in real-world experiments with both $\pi_0$-FAST Franka and OpenVLA WidowX. Plot (a) presents quantitative results, while (b)-(e) show qualitative examples from SAFE-MLP on the real robot. ROC-AUC values are averaged over five random seeds with different task splits.

  • Number of training tasks: Increasing training-task diversity generally improves unseen-task performance; SAFE retains strong performance even with fewer training tasks.

  • Foundation-model features vs. VLA features: Probing last-layer VLA features substantially outperforms DINOv2 or CLIP features, indicating VLAs’ latent spaces carry task-execution semantics (success/failure) more directly.

    The following are the results from Table 6 of the original paper:

| Method | 1 Task (Seen) | 1 Task (Unseen) | 3 Tasks (Seen) | 3 Tasks (Unseen) | 5 Tasks (Seen) | 5 Tasks (Unseen) | 7 Tasks (Seen) | 7 Tasks (Unseen) |
|---|---|---|---|---|---|---|---|---|
| Mahalanobis | 40.21 | 52.75 | 58.00 | 52.31 | 57.68 | 50.78 | 62.03 | 58.85 |
| Euclid. k-NN | 49.74 | 63.76 | 61.66 | 67.02 | 59.14 | 67.11 | 66.00 | 55.23 |
| Cosine k-NN | 53.27 | 60.76 | 65.39 | 65.64 | 67.46 | 70.57 | 67.09 | 69.45 |
| PCA-KMeans | 60.39 | 40.58 | 61.18 | 52.87 | 61.50 | 53.06 | 57.18 | 55.10 |
| RND | 29.29 | 50.32 | 54.46 | 47.39 | 56.71 | 49.15 | 52.57 | 46.88 |
| LogpZO | 61.75 | 56.17 | 52.89 | 50.49 | 65.99 | 56.60 | 61.57 | 52.91 |
| SAFE-LSTM | 50.88 | 52.25 | 68.85 | 63.31 | 70.70 | 66.31 | 70.24 | 72.47 |
| SAFE-MLP | 54.34 | 63.76 | 67.86 | 67.03 | 69.32 | 68.17 | 72.68 | 73.47 |

The following are the results from Table 7 of the original paper:

| Features | LSTM (Seen) | LSTM (Unseen) | MLP (Seen) | MLP (Unseen) |
|---|---|---|---|---|
| DINOv2 | 76.93 | 56.96 | 76.20 | 59.46 |
| CLIP | 76.77 | 52.71 | 77.88 | 59.77 |
| DINOv2+CLIP | 77.09 | 59.65 | 76.36 | 58.43 |
| VLA (Ours) | 77.27 | 58.70 | 86.76 | 64.16 |

The following are the results from Table 8 of the original paper:

| Method | OpenVLA LIBERO (Seen) | OpenVLA LIBERO (Unseen) | π0-FAST LIBERO (Seen) | π0-FAST LIBERO (Unseen) | π0 LIBERO (Seen) | π0 LIBERO (Unseen) | π0* SimplerEnv (Seen) | π0* SimplerEnv (Unseen) | π0-FAST Real Franka (Seen) | π0-FAST Real Franka (Unseen) |
|---|---|---|---|---|---|---|---|---|---|---|
| Max prob. | 50.25±2.51 | 53.83±6.32 | 61.32±9.57 | 69.44±13.61 | - | - | - | - | 53.74±3.46 | 48.59±3.00 |
| Avg prob. | 44.05±1.26 | 51.58±1.82 | 52.46±3.44 | 58.04±5.64 | - | - | - | - | 51.60±3.12 | 47.30±4.32 |
| Max entropy | 52.94±4.36 | 53.09±7.68 | 46.69±13.33 | 62.96±19.62 | - | - | - | - | 59.23±3.06 | 53.50±3.15 |
| Avg entropy | 45.27±1.78 | 50.03±3.18 | 50.93±1.22 | 58.63±3.47 | - | - | - | - | 50.67±3.96 | 46.08±4.79 |
| Mahalanobis dist. | 62.03±5.11 | 58.85±4.16 | 93.56±2.32 | 83.79±7.18 | 77.12±8.57 | 74.31±12.64 | 88.42±2.82 | 52.84±31.97 | 75.54±4.07 | 53.93±5.06 |
| Euclidean dist. k-NN | 66.00±2.33 | 55.23±10.05 | 92.04±2.39 | 84.12±6.47 | 75.64±6.20 | 70.73±16.69 | 89.73±3.08 | 68.41±9.22 | 80.35±5.36 | 60.27±4.79 |
| Cosine dist. k-NN | 67.09±2.74 | 69.45±6.14 | 92.09±1.70 | 84.64±4.90 | 75.76±6.16 | 70.31±16.84 | 90.19±4.05 | 71.32±12.02 | 80.23±5.12 | 59.51±5.76 |
| PCA-KMeans [9] | 57.18±2.04 | 55.10±1.16 | 68.46±4.92 | 57.12±10.44 | 64.92±8.90 | 60.35±19.93 | 66.88±5.10 | 61.19±14.76 | 51.91±4.20 | 49.86±6.19 |
| RND [39] | 52.57±4.56 | 46.88±4.92 | 88.67±3.05 | 81.57±8.67 | 71.92±7.02 | 69.44±19.39 | 85.07±4.04 | 65.89±6.52 | 62.00±5.44 | 45.83±5.10 |
| LogpZO [8] | 61.57±3.62 | 52.91±5.79 | 91.52±2.39 | 83.07±7.17 | 76.80±9.12 | 73.23±11.64 | 88.79±4.92 | 74.66±14.96 | 64.43±7.82 | 52.24±3.68 |
| Action total var. | 62.76±1.66 | 65.43±2.50 | 76.95±7.22 | 74.50±12.19 | 77.20±5.65 | 75.18±5.08 | 68.41±10.81 | 67.94±15.97 | - | - |
| Trans. total var. | 55.33±2.06 | 58.99±5.13 | 78.21±4.09 | 80.03±9.11 | 49.38±9.95 | 54.71±7.57 | 63.27±7.17 | 55.90±19.19 | - | - |
| Rot. total var. | 47.85±2.88 | 55.30±4.38 | 80.87±5.85 | 77.29±8.71 | 52.94±7.56 | 61.06±10.60 | 58.07±10.41 | 62.10±9.39 | - | - |
| Gripper total var. | 61.84±2.67 | 64.48±3.05 | 76.82±7.10 | 74.42±12.13 | 77.19±5.66 | 75.19±5.08 | 69.16±9.50 | 69.29±14.77 | - | - |
| Cluster entropy | 50.16±2.36 | 51.44±1.01 | 80.22±7.37 | 80.53±8.65 | 76.19±4.31 | 72.12±1.04 | 68.25±9.03 | 73.66±16.03 | - | - |
| STAC [18] | - | - | 83.07±4.61 | 85.31±6.71 | 46.55±8.90 | 47.91±20.94 | 60.74±13.89 | 62.21±16.72 | - | - |
| STAC-Single | - | - | 85.46±6.55 | 81.16±8.63 | 68.46±5.10 | 69.39±8.22 | 68.71±7.06 | 70.40±8.76 | 45.24±3.68 | 38.01±9.81 |
| SAFE-LSTM | 70.24±1.49 | 72.47±5.55 | 92.98±2.62 | 84.48±7.29 | 76.98±5.34 | 71.09±6.94 | 88.85±6.30 | 80.11±10.49 | 77.27±5.82 | 58.70±4.37 |
| SAFE-MLP | 72.68±2.38 | 73.47±5.39 | 90.06±2.82 | 80.44±5.72 | 73.50±7.43 | 73.27±11.85 | 89.50±4.49 | 84.82±8.12 | 86.76±2.64 | 64.16±5.88 |

6.3. Additional Tables (Datasets and Splits)

The following are the results from Table 2 of the original paper:

| Embodiment | Task ID | Environment Name | π0* Success Rate (%) |
|---|---|---|---|
| Google Robot | 1 | google_robot_move_near_v0 | 77 |
| Google Robot | 2 | google_robot_open_drawer | 50 |
| Google Robot | 3 | google_robot_close_drawer | 80 |
| Google Robot | 4 | google_robot_place_apple_in_closed_top_drawer | 40 |
| WidowX | 1 | widowx_carrot_on_plate | 44 |
| WidowX | 2 | widowx_put_eggplant_in_basket | 88 |
| WidowX | 3 | widowx_spoon_on_towel | 79 |
| WidowX | 4 | widowx_stack_cube | 43 |

The following are the results from Table 3 of the original paper:

| Task | Instruction | Rollout Length T |
|---|---|---|
| 1 | close the door | 300 |
| 2 | close the drawer | 200 |
| 3 | pick up the ball and place it in the bowl | 400 |
| 4 | pick up the knife and put it on the plate | 350 |
| 5 | pick up the lid and place it on the pot | 400 |
| 6 | pick up the lid from the pot and place it on the table | 400 |
| 7 | pick up the marker and place it in the cup | 400 |
| 8 | place the green block on the yellow block | 350 |
| 9 | place the pink cup to the right of the blue cup | 300 |
| 10 | press the button | 200 |
| 11 | put both the carrot and the ball in the bowl | 500 |
| 12 | put the cup to the upright position | 500 |
| 13 | unfold the cloth | 500 |

The following are the results from Table 4 of the original paper:

| Task | Instruction |
|---|---|
| 1 | Lift AAA Battery |
| 2 | Lift Eggplant |
| 3 | Lift Red Bottle |
| 4 | Lift Blue Cup |
| 5 | Put Blue Cup on Plate |
| 6 | Put the Red Bottle into Pot |
| 7 | Put the Carrot on Plate |
| 8 | Put the Red Block into the Pot |

The following are the results from Table 5 of the original paper:

| Benchmark | Tasks (Seen) | Tasks (Unseen) | Tasks (Total) | Rollouts (Train) | Rollouts (Eval Seen) | Rollouts (Eval Unseen) | Rollouts (Total) |
|---|---|---|---|---|---|---|---|
| LIBERO | 7 | 3 | 10 | 210 | 140 | 150 | 500 |
| π0 SimplerEnv, Google Robot | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| π0 SimplerEnv, WidowX | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| Octo SimplerEnv | 9 | 3 | 12 | 594 | 306 | 300 | 1200 |
| Real Franka | 10 | 3 | 13 | 450 | 150 | 180 | 780 |
| Real WidowX | 6 | 2 | 8 | 250 | 133 | 149 | 532 |

7. Conclusion & Reflections

7.1. Conclusion Summary

SAFE tackles multitask failure detection for generalist VLAs by probing last-layer internal features and learning a scalar failure score over time, with thresholds calibrated via functional conformal prediction. Across diverse models and environments, SAFE achieves state-of-the-art ROC-AUC and the best accuracy–timeliness trade-offs, outperforming representative baselines (token UQ, sample consistency, OOD detectors, and STAC). Analyses show VLAs encode task-generic success/failure signals in their latent spaces; SAFE leverages this structure efficiently and robustly.

7.2. Limitations & Future Work

  • Scope: Focused on manipulation tasks; generalization across embodiments, sim2real gaps, or action-less video conditioning remains open.
  • Feature depth: Uses last-layer features only; fusing multi-layer information or discovering optimal intermediate layers may further improve detection.
  • CP under shift: Functional CP bands calibrated on seen tasks may face distribution shift on unseen tasks; exploring adaptive/online CP could enhance calibration.
  • Recovery/improvement: SAFE provides timely detection; integrating with recovery policies or activation steering to proactively correct VLA behaviors is a promising direction, but closed-loop manipulation complicates direct transfer from LLM steering.

7.3. Personal Insights & Critique

  • Strengths:
    • Elegant insight that VLA latents carry generic failure semantics, demonstrated visually and empirically.
    • Practicality: small MLP/LSTM probes with negligible runtime overhead; compatible with multiple VLAs without retraining them.
    • Principled thresholding: CP-based bands offer interpretable, calibrated alerts and tunable conservativeness.
  • Considerations:
    • While SAFE generalizes across tasks, it still requires training on a diverse set of successes and failures. The real-world cost of collecting failures can be non-trivial; leveraging synthetic/augmented failures or semi-supervised methods might reduce data burden.
    • CP coverage guarantees hinge on exchangeability; performance under significant domain shift may deviate. The paper acknowledges deviations in TNR curves; future work could incorporate adaptive CP to correct calibration drift.
    • Feature aggregation choices can impact performance; automated feature-selection across layers or attention-based pooling might improve robustness.
  • Transferability:
    • The latent-probing paradigm is broadly applicable: beyond VLAs, any large multimodal policy may encode execution status signals. SAFE-like probes could monitor failures in mobile manipulation, navigation, or multimodal human-robot interaction.

    • Combining SAFE with policy improvement (e.g., activation steering guided by separation vectors) is compelling but requires careful study of causality between latents and physical outcomes in embodied agents.

      In summary, SAFE is a timely, well-motivated, and empirically strong contribution to safe deployment of generalist VLAs. Its simplicity and effectiveness make it a valuable building block for practical robot monitoring and a foundation for future recovery and improvement strategies.
