SAFE: Multitask Failure Detection for Vision-Language-Action Models
TL;DR Summary
SAFE introduces a multitask failure detection system for vision-language-action models, leveraging internal features to predict task failure. It demonstrates superior performance across various environments, balancing accuracy and timeliness, and supports zero-shot generalization to unseen tasks.
Abstract
While vision-language-action models (VLAs) have shown promising robotic behaviors across a diverse set of manipulation tasks, they achieve limited success rates when deployed on novel tasks out of the box. To allow these policies to safely interact with their environments, we need a failure detector that gives a timely alert such that the robot can stop, backtrack, or ask for help. However, existing failure detectors are trained and tested only on one or a few specific tasks, while generalist VLAs require the detector to generalize and detect failures also in unseen tasks and novel environments. In this paper, we introduce the multitask failure detection problem and propose SAFE, a failure detector for generalist robot policies such as VLAs. We analyze the VLA feature space and find that VLAs have sufficient high-level knowledge about task success and failure, which is generic across different tasks. Based on this insight, we design SAFE to learn from VLA internal features and predict a single scalar indicating the likelihood of task failure. SAFE is trained on both successful and failed rollouts and is evaluated on unseen tasks. SAFE is compatible with different policy architectures. We test it on OpenVLA, π0, and π0-FAST in both simulated and real-world environments extensively. We compare SAFE with diverse baselines and show that SAFE achieves state-of-the-art failure detection performance and the best trade-off between accuracy and detection time using conformal prediction. More qualitative results and code can be found at the project webpage: https://vla-safe.github.io/
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
SAFE: Multitask Failure Detection for Vision-Language-Action Models
1.2. Authors
- Qiao Gu, Yuanliang Ju, Shengxiang Sun, Igor Gilitschenski, Florian Shkurti (University of Toronto, UofT Robotics Institute; Vector Institute)
- Haruki Nishimura, Masha Itkina (Toyota Research Institute)
The authors draw from robotics, robot learning, uncertainty quantification, and applied machine learning. UofT and TRI have extensive track records in robotic manipulation and safety research.
1.3. Journal/Conference
arXiv preprint. The work is not yet a peer-reviewed publication; it positions itself within the robotic learning and uncertainty quantification community, referencing and evaluating against recent CoRL and RSS works.
1.4. Publication Year
2025 (arXiv posting: 2025-06-11)
1.5. Abstract
Vision-language-action (VLA) models are generalist robotic policies capable of following natural language instructions to perform diverse manipulation tasks. Out-of-the-box, VLAs achieve strong performance on seen tasks but suffer a large drop on unseen tasks (from roughly 80–90% to 30–60% success). To safely deploy VLAs without per-task fine-tuning, the paper introduces multitask failure detection: train a single detector using seen tasks and evaluate zero-shot on unseen tasks. The proposed method, SAFE (ScAlable Failure Estimation), probes VLA internal features and regresses a scalar failure score over time. Using functional conformal prediction (CP), SAFE sets time-varying thresholds that provide distribution-free guarantees under mild assumptions. SAFE is compatible with multiple VLA architectures (OpenVLA, π0, π0-FAST) and shows superior accuracy-time trade-offs over baselines (token uncertainty, sample consistency, OOD detectors, and action-consistency methods) across simulation (LIBERO-10, SimplerEnv) and real robots (Franka, WidowX).
1.6. Original Source Link
- arXiv abstract: https://arxiv.org/abs/2506.09937
- PDF: https://arxiv.org/pdf/2506.09937v2.pdf
- Project page (code and qualitative videos): https://vla-safe.github.io/
- Publication status: arXiv preprint (not yet peer-reviewed).
2. Executive Summary
2.1. Background & Motivation
- Problem: VLAs often fail on unseen tasks and novel environments when deployed directly. Without safe monitoring, failures can lead to unsafe robot behaviors. Existing failure detectors are typically task-specific and do not generalize across tasks, making them impractical for generalist VLAs.
- Why important: Generalist policies are meant to encounter new instructions and settings. It is infeasible to train and calibrate detectors per task. A task-generic detector must be accurate, timely, efficient, and compatible with various VLA architectures to support real-time robotic control.
- Entry point/innovation: The authors analyze latent features inside VLA models and discover a “failure zone”—a feature-space region that captures high-level, task-generic separation between success and failure. This motivates probing internal features to learn a single scalar failure score and threshold it via conformal prediction to provide calibrated, time-varying alerts.
2.2. Main Contributions / Findings
- Contribution 1: Introduce multitask failure detection for generalist VLAs—train on a subset of tasks (seen) and evaluate zero-shot on held-out (unseen) tasks without per-task fine-tuning or additional rollouts.
- Contribution 2: Empirical analysis shows VLA internal features distinctly separate successful vs. failed rollouts across tasks. Failure trajectories converge in a shared “failure zone” under t-SNE visualization.
- Contribution 3: SAFE—an efficient detector that:
- Probes last-layer VLA features.
- Learns a scalar failure score with a simple MLP or LSTM over time.
- Uses functional conformal prediction to set time-varying thresholds with distribution-free guarantees under exchangeability.
- Contribution 4: Extensive evaluation across models (OpenVLA, π0, π0-FAST) and settings (LIBERO-10, SimplerEnv, real Franka, real WidowX) shows SAFE achieves state-of-the-art failure detection (ROC-AUC) and best accuracy–timeliness trade-offs, outperforming strong baselines (including OOD detectors and action-consistency).
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Vision-Language-Action (VLA) Models: Neural policies that condition on images (RGB), language instructions, and robot states to output actions. Typically built on transformer architectures pretrained as vision-language models (VLMs), then augmented with action heads (e.g., discrete action tokenization, diffusion, flow matching). VLAs aim for generalist manipulation, handling diverse tasks without per-task retraining.
- Transformer fundamentals: A transformer uses multi-head self-attention to integrate information across tokens. While the paper does not derive attention, the core formula is: $ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $ where $Q$, $K$, $V$ are the queries, keys, and values, and $d_k$ is the key dimensionality. VLAs leverage transformer blocks to encode multimodal inputs and intermediate latent features.
- Conformal Prediction (CP): A distribution-free framework that provides calibrated prediction sets or bands under exchangeability assumptions. Functional CP constructs time-varying bands for sequences, ensuring that, with probability $1 - \alpha$, a new successful rollout's score remains within the band at all timesteps. Exceeding the band signals an anomaly/failure with a controlled false positive rate.
- Uncertainty Quantification (UQ) in LLMs/VLMs: Methods estimate confidence (e.g., token-level entropy/probability) or consistency (variance across sampled outputs). They can be adapted to actions from VLAs as proxies for failure likelihood but require multiple samples or tend to be weak indicators.
- Out-of-Distribution (OOD) detection: Methods model distributions of embeddings for “normal” data and flag deviations. Applied to robotics: learn distributions from successful rollouts and score new embeddings by distance or learned density. Strong for detecting anomalies but may misclassify unseen-but-successful tasks as failures.
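To make the attention formula in the transformer bullet concrete, here is a minimal NumPy sketch of single-head scaled dot-product attention; shapes and values are illustrative, not taken from any particular VLA:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)  # (n_q, n_k); rows sum to 1
    return weights @ V

# Toy example: 3 query tokens attending over 4 key/value tokens of dim 8.
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(3, 8)), rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
out = attention(Q, K, V)
print(out.shape)  # (3, 8)
```

Each output row is a convex combination of the value vectors, weighted by query-key similarity.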
3.2. Previous Works
- VLAs: OpenVLA; RT-series (RT-1, RT-2); Octo; π0 (flow matching); FAST (efficient action tokenization). These achieve good performance on seen tasks but reduced success on unseen tasks.
- Failure detection in robotics:
- Task-specific detectors trained per task—unsuitable for generalist VLAs.
- FAIL-Detect [8]: systematic evaluation of detectors (OOD, smoothness, consistency), reporting LogpZO (flow-matching-based density) as strongest—but for single-task policies; suffers in multitask generalization per this paper.
- STAC [18]: action-consistency over overlapping chunks, requiring many samples per timestep; strong performance but heavy inference overhead for large VLAs.
- VLM-based detectors [19, 20]: query large models for success/failure judgments; expensive and impractical for real-time control.
- UQ for LLMs/VLMs: token probability/entropy; semantic consistency; latent-space classifiers (e.g., hallucination detection by probing internal activations), showing simple probes can be effective and efficient.
3.3. Technological Evolution
- From specialist, per-task policies and detectors → generalist VLAs that can follow diverse instructions.
- From task-specific failure detectors → desire for task-generic, multitask detectors compatible with VLAs.
- From heavy sampling or large-model queries → lightweight feature-probing and time-calibrated thresholds.
3.4. Differentiation Analysis
- SAFE’s novelty lies in:
- Probing last-layer VLA features to exploit emergent task-generic success/failure structure.
- A unified detector trained on success/failure across tasks, evaluated zero-shot on unseen tasks.
- Functional CP setting time-varying thresholds with theoretical calibration for false positives on successful rollouts.
- Compared to baselines:
- UQ (token/sample consistency) is weaker or expensive (needs multiple samples).
- OOD detectors model “normal” success distribution but may overfit and misclassify unseen successes; SAFE directly learns failure discrimination.
- STAC yields strong performance but sampling overhead is impractical for large VLAs; SAFE operates with single forward passes and minimal overhead.
4. Methodology
4.1. Principles
Core idea: VLA internal features encode high-level, abstract indicators of task success versus failure across tasks. By probing these latent features over time and learning a scalar failure score $s_t$, we can detect failures early. A principled, time-varying threshold from functional conformal prediction yields calibrated alerts with controlled false positive rates on successful rollouts.
The following figure (Figure 2 from the original paper) shows the system architecture:
This figure is a schematic of the three main components of the SAFE failure detector: first, latent features are extracted from the VLA model; second, an MLP- or LSTM-based backend learns a failure-score predictor; finally, functional conformal prediction calibrates the failure-detection threshold, which is used to detect failures at test time.
4.2. Data Flow and Feature Extraction
- Inputs at timestep $t$: observation $\mathbf{o}_t$ (images, language instruction, robot state).
- VLA outputs: a chunk of actions $\mathbf{a}_{t:t+H}$ for horizon $H$. The first few actions are executed before the policy replans.
- Internal embedding: $\mathbf{e}_t$ denotes the latent feature vector at time $t$ from the last layer of the VLA, taken before decoding to action tokens (OpenVLA, π0-FAST) or to the velocity field (π0).
- Token decoding (some VLAs): the decoded outputs are action tokens; SAFE can also use token-level uncertainty baselines for comparison, but its core relies on $\mathbf{e}_t$.
The authors ablate multiple aggregation strategies to obtain a single vector $\mathbf{e}_t$ from the multi-token internal tensor $[\tilde{\mathbf{e}}_t^1, \dots, \tilde{\mathbf{e}}_t^K]$:
- First: $\mathbf{e}_t = \tilde{\mathbf{e}}_t^1$
- Last: $\mathbf{e}_t = \tilde{\mathbf{e}}_t^K$
- Mean: $\mathbf{e}_t = \frac{1}{K} \sum_{k} \tilde{\mathbf{e}}_t^k$
- First&Last: $\mathbf{e}_t = [\tilde{\mathbf{e}}_t^1; \tilde{\mathbf{e}}_t^K]$ (concatenation)
For π0 (flow matching), features exist across both the action horizon and the flow-integration steps; they are aggregated along these two dimensions to produce a single $\mathbf{e}_t$.
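A small NumPy sketch of the four aggregation strategies; the assumption here is that the per-timestep features arrive as a `(K, d)` array of `K` token features, which is an illustrative layout rather than the exact tensor shapes used in the paper:

```python
import numpy as np

def aggregate(token_feats: np.ndarray, mode: str = "mean") -> np.ndarray:
    """Collapse per-token last-layer features of shape (K, d) into one vector e_t."""
    if mode == "first":
        return token_feats[0]
    if mode == "last":
        return token_feats[-1]
    if mode == "mean":
        return token_feats.mean(axis=0)
    if mode == "first_last":  # concatenate first and last token features
        return np.concatenate([token_feats[0], token_feats[-1]])
    raise ValueError(f"unknown mode: {mode}")

feats = np.arange(12, dtype=float).reshape(4, 3)  # K=4 tokens, d=3
print(aggregate(feats, "mean"))        # [4.5 5.5 6.5]
print(aggregate(feats, "first_last"))  # [ 0.  1.  2.  9. 10. 11.]
```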
4.3. SAFE Score Models
SAFE learns a scalar failure score $s_t$ from the feature history $\mathbf{e}_{0:t}$.
4.3.1. MLP-based Accumulation
At each timestep $\tau$, an MLP maps $\mathbf{e}_\tau$ to a scalar logit, then a sigmoid normalizes it. The cumulative failure score up to $t$ is: $ \bar{f}_{\mathrm{MLP}}(\mathbf{e}_{0:t}) = \sum_{\tau = 1}^{t} \sigma\big(g(\mathbf{e}_{\tau})\big) $
- Symbols:
- $\mathbf{e}_\tau$: VLA latent feature at time $\tau$.
- $g$: MLP mapping features to a scalar logit.
- $\sigma$: sigmoid activation, ensuring $\sigma(g(\mathbf{e}_\tau)) \in (0, 1)$.
- $s_t = \bar{f}_{\mathrm{MLP}}(\mathbf{e}_{0:t})$: cumulative failure score.
Range property noted by the authors: $s_t \in (0, t)$, since it sums $t$ sigmoids, each in $(0, 1)$.
- Training loss (applied across timesteps), pushing scores up for failed rollouts ($y_i = 1$) and down for successful ones ($y_i = 0$): $ L_{\mathrm{MLP}} = \sum_i \left[ y_i \sum_t (t - s_t) + (1 - y_i) \sum_t s_t \right] $
- Symbols:
- $i$: indexes rollouts in the training set $\mathcal{D}$.
- $y_i$: rollout-level label (1 = failure; 0 = success).
- $s_t$: failure score at time $t$ (here, the cumulative form).
Intuition: For failures, encourage $s_t$ to be high (closer to its upper bound $t$); for successes, encourage $s_t$ to be low (closer to 0).
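A minimal NumPy sketch of the cumulative score and its loss, under the definitions above; the single linear layer (`W`, `b`) stands in for the MLP $g$, and all shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cumulative_scores(feats, W, b):
    """Per-step logit g(e_tau) = W e_tau + b (linear stand-in for the MLP),
    then cumulative sum of sigmoids, so that s_t lies in (0, t)."""
    logits = feats @ W + b                    # (T,)
    return np.cumsum(sigmoid(logits))

def mlp_loss(scores, y):
    """y = 1 (failure): push each s_t toward its upper bound t;
    y = 0 (success): push each s_t toward 0."""
    t = np.arange(1, len(scores) + 1)
    return float(np.sum(t - scores)) if y == 1 else float(np.sum(scores))

rng = np.random.default_rng(0)
feats = rng.normal(size=(5, 16))              # T=5 steps, d=16 features
W, b = rng.normal(size=16), 0.0
s = cumulative_scores(feats, W, b)            # monotonically increasing, s_t < t
```

Note that because each summand is positive, the cumulative score only rises, which biases alarms toward later timesteps unless the threshold also grows with time.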
4.3.2. LSTM-based Sequential Scoring
An LSTM processes $\mathbf{e}_{0:t}$ sequentially; a linear head plus sigmoid produces $s_t$: $ f_{\mathrm{LSTM}}(\mathbf{e}_{0:t}) = \sigma\big(\mathrm{LSTM}(\mathbf{e}_{0:t})\big) $
- Symbols:
- $\mathrm{LSTM}$: recurrent model capturing the temporal evolution of features.
- $\sigma$: sigmoid, yielding $s_t \in (0, 1)$.
Training loss is binary cross-entropy (BCE) over timesteps: $ L_{\mathrm{LSTM}} = -\sum_i \sum_t \left[ y_i \log(s_t) + (1 - y_i) \log(1 - s_t) \right] $
- Symbols as above; BCE encourages $s_t$ near 1 for failed rollouts and near 0 for successful rollouts.
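A sketch of the sequential scorer and the BCE objective; for brevity, a one-layer Elman RNN stands in for the LSTM, which is a simplification of the paper's model, and all weights are illustrative:

```python
import numpy as np

def recurrent_scores(feats, Wx, Wh, w_out, b_out):
    """Minimal recurrent scorer (Elman RNN standing in for the LSTM):
    h_t = tanh(Wx e_t + Wh h_{t-1});  s_t = sigmoid(w_out . h_t + b_out)."""
    h = np.zeros(Wh.shape[0])
    scores = []
    for e in feats:
        h = np.tanh(Wx @ e + Wh @ h)
        scores.append(1.0 / (1.0 + np.exp(-(w_out @ h + b_out))))
    return np.array(scores)

def bce_loss(scores, y, eps=1e-8):
    """Per-timestep BCE for one rollout with label y (1 = failure, 0 = success)."""
    s = np.clip(scores, eps, 1 - eps)
    return float(-np.sum(y * np.log(s) + (1 - y) * np.log(1 - s)))

rng = np.random.default_rng(1)
feats = rng.normal(size=(6, 8))           # T=6 steps, d=8 features
Wx = rng.normal(size=(4, 8)) * 0.1        # hidden size 4
Wh = rng.normal(size=(4, 4)) * 0.1
w_out, b_out = rng.normal(size=4), 0.0
s = recurrent_scores(feats, Wx, Wh, w_out, b_out)  # each s_t in (0, 1)
```

Unlike the cumulative MLP score, $s_t$ here is bounded in $(0, 1)$ at every step, so the same fixed-range threshold semantics apply throughout the rollout.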
The following figure (Figure 1 from the original paper) motivates SAFE by showing latent-space separation:
This figure shows t-SNE visualizations of policy latent features. In (a), features are colored by task outcome: successful rollouts in blue, failed rollouts on a blue-to-red gradient. In (b), features are colored by task ID. Panel (c) shows example successful and failed rollouts, highlighting where failures occur within specific tasks.
4.4. Threshold Selection via Functional Conformal Prediction
SAFE raises a failure flag when $s_t$ exceeds a time-varying threshold $\delta_t$. The threshold is constructed from a CP prediction band calibrated on successful rollouts in a held-out calibration set $\mathcal{D}_{\mathrm{calib}}$.
- One-sided time-varying CP band: $\delta_t = \mu_t + \gamma$, with:
- Symbols:
- $\alpha$: user-specified significance level, controlling conservativeness.
- $\mu_t$: time-varying mean of scores from successful calibration rollouts at time $t$.
- $\gamma$: bandwidth chosen to achieve $1 - \alpha$ coverage under exchangeability.
- $T$: rollout duration.
Guarantee (under mild assumptions): For a new successful rollout, $s_t \leq \delta_t$ for all $t \in \{1, \dots, T\}$ with probability at least $1 - \alpha$. Conversely, $s_t > \delta_t$ indicates failure with nominal confidence $1 - \alpha$. SAFE uses the band's upper bound $\delta_t$ as the threshold.
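The band construction can be sketched as follows. This is a simplified split-conformal variant, calibrating a constant bandwidth gamma from per-rollout maximum deviations above the mean curve; it illustrates the idea but is not necessarily the paper's exact functional CP recipe:

```python
import numpy as np

def calibrate_threshold(calib_scores, alpha=0.1):
    """One-sided functional CP band from successful calibration rollouts.

    calib_scores: (n, T) score curves s_t from n successful rollouts.
    Returns delta_t = mu_t + gamma, where gamma is the (1 - alpha) conformal
    quantile of each rollout's maximum deviation above the mean curve.
    """
    mu = calib_scores.mean(axis=0)              # mu_t
    devs = (calib_scores - mu).max(axis=1)      # per-rollout max_t (s_t - mu_t)
    n = len(devs)
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    gamma = np.sort(devs)[k - 1]                # conformal quantile of deviations
    return mu + gamma                           # delta_t

def first_detection(scores, delta):
    """First timestep where s_t exceeds the band (T_det), or None."""
    over = np.nonzero(scores > delta)[0]
    return int(over[0]) if over.size else None

rng = np.random.default_rng(2)
calib = np.clip(0.2 + 0.05 * rng.normal(size=(50, 30)), 0, 1)  # success curves
delta = calibrate_threshold(calib, alpha=0.1)
fail_curve = np.linspace(0.2, 0.9, 30)        # score ramps up as a rollout fails
```

Raising alpha lowers the band and yields earlier (but more false-positive-prone) alarms, which is exactly the accuracy-timeliness trade-off the paper sweeps.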
5. Experimental Setup
5.1. Datasets
- LIBERO-10: A challenging simulation suite of 10 long-horizon manipulation tasks with diverse objects, layouts, and instructions. VLAs (OpenVLA, π0, π0-FAST) are evaluated using the authors' released checkpoints. 3/10 tasks are randomly designated as unseen; the remaining 7 are seen.
- SimplerEnv: High-fidelity simulation environments replicating real datasets (RT-series, BridgeData V2). Evaluates open-source π0 reproductions on Google Robot and WidowX embodiments. Each embodiment uses 4 tasks (3 seen, 1 unseen); 100 rollouts per task.
- Real-world Franka: Deploys π0-FAST-DROID on a Franka Emika Panda. 13 tasks; for each, 30 successful and 30 failed rollouts. 3/13 tasks are unseen. Rollout lengths are fixed per task.
- Real-world WidowX: Deploys OpenVLA pretrained on "Open-X Magic Soup++" on a WidowX arm. 8 lifting and pick-and-place tasks, totaling 532 rollouts (244 successes, 288 failures). 2/8 tasks are unseen.
Examples of task instructions (from real setups) are given in Tables 3 and 4 below.
5.2. Evaluation Metrics
- ROC-AUC (Area Under the Receiver Operating Characteristic curve):
- Concept: Measures the ability of a scalar score to separate positive (failures) from negative (successes) across all thresholds. Higher AUC indicates better overall discrimination.
- Formula (conceptual, as area under the curve): $ \mathrm{AUC} = \int_0^1 \mathrm{TPR}(\mathrm{FPR}^{-1}(x)) \, dx $ Alternatively, for discrete points, AUC is the area under the ROC obtained by plotting $\mathrm{TPR}$ vs. $\mathrm{FPR}$ across thresholds.
- Symbols:
- $\mathrm{TPR}$: true positive rate (sensitivity, recall of failures).
- $\mathrm{FPR}$: false positive rate, $\mathrm{FPR} = \mathrm{FP} / (\mathrm{FP} + \mathrm{TN}) = 1 - \mathrm{TNR}$.
- TPR (True Positive Rate):
$
\mathrm{TPR} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}
$
- TP: number of correctly detected failed rollouts.
- FN: number of missed failed rollouts.
- TNR (True Negative Rate, specificity):
$
\mathrm{TNR} = \frac{\mathrm{TN}}{\mathrm{TN} + \mathrm{FP}}
$
- TN: number of successful rollouts with no failure flag raised.
- FP: number of successful rollouts falsely flagged as failures at any time.
- Balanced Accuracy: $ \mathrm{Bal\text{-}Acc} = \frac{\mathrm{TPR} + \mathrm{TNR}}{2} $
- Detection Time ($T_{\mathrm{det}}$):
- Concept: For each rollout with a ground-truth failure, the first time index at which $s_t$ exceeds $\delta_t$ (the CP band's upper bound). Lower values indicate earlier detection.
- Formalization: $ T_{\mathrm{det}} = \min\{ t \in \{1, \dots, T\} \mid s_t > \delta_t \} $ Averaged across failed rollouts in evaluation.
SAFE also evaluates using the maximum-so-far score $\max_{\tau \leq t} s_\tau$ to compute ROC-AUC, reflecting that any threshold crossing during a successful rollout counts as a false positive.
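The rollout-level metrics above can be computed as follows; `roc_auc` uses the rank (Mann-Whitney) formulation, and all inputs are illustrative:

```python
import numpy as np

def roc_auc(scores, labels):
    """ROC-AUC via the rank (Mann-Whitney U) formulation.

    scores: per-rollout failure scores (e.g., max-so-far); labels: 1 = failed.
    """
    scores, labels = np.asarray(scores, float), np.asarray(labels)
    pos, neg = scores[labels == 1], scores[labels == 0]
    # Fraction of (failure, success) pairs ranked correctly; ties count half.
    wins = (pos[:, None] > neg[None, :]).sum() + 0.5 * (pos[:, None] == neg[None, :]).sum()
    return wins / (len(pos) * len(neg))

def balanced_accuracy(flags, labels):
    """flags: 1 if a failure alarm was raised at any time during the rollout."""
    flags, labels = np.asarray(flags), np.asarray(labels)
    tpr = flags[labels == 1].mean()         # failures correctly flagged
    tnr = 1.0 - flags[labels == 0].mean()   # successes never flagged
    return 0.5 * (tpr + tnr)

print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))    # 1.0
print(balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0]))  # 0.75
```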
5.3. Baselines
- Token Uncertainty (single forward pass; for tokenized-action VLAs). Let $p_k$ be the probability and $H_k$ the entropy of generated action token $k$, with $K$ tokens per action; probabilities are negated so that higher scores indicate likelier failure:
- Max prob.: $s_t = -\max_k p_k$
- Avg prob.: $s_t = -\frac{1}{K} \sum_k p_k$
- Max entropy: $s_t = \max_k H_k$
- Avg entropy: $s_t = \frac{1}{K} \sum_k H_k$
- Sample Consistency (requires multiple action samples per timestep):
- Action total variance: total variance over the action sequences sampled at the same timestep.
- Translational, rotational, and gripper variances: computed similarly on the corresponding action components.
- Cluster entropy: entropy over cluster labels from agglomerative clustering of the sampled actions.
- Embedding Distance (feature-space, OOD-inspired): $ s_t = d(\mathbf{e}_t, E_{\mathrm{succ}}) - d(\mathbf{e}_t, E_{\mathrm{fail}}) $ where $E_{\mathrm{succ}}$ and $E_{\mathrm{fail}}$ are sets of embeddings from successful and failed training rollouts, respectively. $d$ is the Mahalanobis distance or a $k$-NN averaged Euclidean/cosine distance; a PCA-KMeans distance is also evaluated.
- Learned OOD Detectors:
- RND and LogpZO adapted to embeddings: train $f_{\mathrm{succ}}^{\mathrm{OOD}}$ on $E_{\mathrm{succ}}$ and $f_{\mathrm{fail}}^{\mathrm{OOD}}$ on $E_{\mathrm{fail}}$, respectively; $ s_t = f_{\mathrm{succ}}^{\mathrm{OOD}}(\mathbf{e}_t) - f_{\mathrm{fail}}^{\mathrm{OOD}}(\mathbf{e}_t) $
- Action Consistency:
- STAC: statistical distance on overlapping action chunks across timesteps (requires many samples, e.g., 256; here tested with 10 in simulation).
- STAC-Single: single-sample variant for real-time, using one action per timestep.
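The embedding-distance baseline above can be sketched with a k-NN distance; bank contents, dimensions, and cluster locations are illustrative:

```python
import numpy as np

def knn_dist(e, bank, k=3):
    """Mean Euclidean distance from embedding e to its k nearest
    neighbors in a bank of stored embeddings (one per row)."""
    d = np.linalg.norm(bank - e, axis=1)
    return np.sort(d)[:k].mean()

def embed_dist_score(e, E_succ, E_fail, k=3):
    """Failure score s_t: positive when e_t is closer to the failure bank."""
    return knn_dist(e, E_succ, k) - knn_dist(e, E_fail, k)

rng = np.random.default_rng(3)
E_succ = rng.normal(loc=0.0, size=(100, 16))   # embeddings from successful rollouts
E_fail = rng.normal(loc=3.0, size=(100, 16))   # embeddings from failed rollouts
e_new = np.full(16, 3.0)                       # a query near the failure cluster
print(embed_dist_score(e_new, E_succ, E_fail) > 0)  # True
```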
6. Results & Analysis
6.1. Core Results Analysis
SAFE consistently achieves top or near-top ROC-AUC across simulation and real robots, and the best accuracy–timeliness trade-offs via CP. Token uncertainty baselines are weak (consistent with LLM UQ literature). Sample consistency and STAC perform well when multiple samples are available but are computationally heavy for VLAs. Embedding-distance baselines are strong, validating that VLA latents encode success/failure; however, SAFE’s learned probe extracts more abstract signals and generalizes better to unseen tasks, mitigating overfitting observed in learned OOD detectors like LogpZO for multitask settings.
Trade-off plots (Bal-Acc vs. $T_{\mathrm{det}}$ across values of $\alpha$) show SAFE detects failures earlier without sacrificing accuracy, often aligning with human-judged failure moments and sometimes flagging before they occur, enabling preemptive intervention.
Qualitative examples across simulation and real robots illustrate that SAFE scores rise upon missed grasps, unstable motions, oscillations, slips, or freezing, crossing CP bands appropriately.
The following are the results from Table 1 of the original paper:
| VLA Model Benchmark Eval Task Split | | OpenVLA LIBERO | | π0-FAST LIBERO | | π0 LIBERO | | π0 SimplerEnv | | Average | |
| | | Seen | Unseen | Seen | Unseen | Seen | Unseen | Seen | Unseen | Seen | Unseen |
| Token Unc. | Max prob. | 50.25 | 53.83 | 61.32 | 69.44 | - | - | - | - | 55.79 | 61.64 |
| Avg prob. | 44.05 | 51.58 | 52.46 | 58.04 | - | - | - | - | 48.26 | 54.81 | |
| Max entropy | 52.94 | 53.09 | 46.69 | 62.96 | - | - | - | - | 49.81 | 58.03 | |
| Avg entropy | 45.27 | 50.03 | 50.93 | 58.63 | - | - | - | - | 48.10 | 54.33 | |
| Embed. Distr. | Mahalanobis dist. | 62.03 | 58.85 | 93.56 | 83.79 | 77.12 | 74.31 | 88.42 | 52.84 | 80.28 | 67.45 |
| Euclidean dist. k-NN | 66.00 | 55.23 | 92.04 | 84.12 | 75.64 | 70.73 | 89.73 | 68.41 | 80.85 | 69.62 | |
| Cosine dist. k-NN | 67.09 | 69.45 | 92.09 | 84.64 | 75.76 | 70.31 | 90.19 | 71.32 | 81.28 | 73.93 | |
| PCA-KMeans [9] | 57.18 | 55.10 | 68.46 | 57.12 | 64.92 | 60.35 | 66.88 | 61.19 | 64.36 | 58.44 | |
| RND [39] | 52.57 | 46.88 | 88.67 | 81.57 | 71.92 | 69.44 | 85.07 | 65.89 | 74.56 | 65.95 | |
| LogpZO [8] | 61.57 | 52.91 | 91.52 | 83.07 | 76.80 | 73.23 | 88.79 | 74.66 | 79.67 | 70.97 | |
| Sample Consist. | Action total var. | 62.76 | 65.43 | 76.95 | 74.50 | 77.20 | 75.18 | 68.41 | 67.94 | 71.33 | 70.76 |
| Trans. total var. | 55.33 | 58.99 | 78.21 | 80.03 | 49.38 | 54.71 | 63.27 | 55.90 | 61.55 | 62.41 | |
| Rot. total var. | 47.85 | 55.30 | 80.87 | 77.29 | 52.94 | 61.06 | 58.07 | 62.10 | 59.93 | 63.94 | |
| Gripper total var. | 61.84 | 64.48 | 76.82 | 74.42 | 77.19 | 75.19 | 69.16 | 69.29 | 71.25 | 70.84 | |
| Cluster entropy | 50.16 | 51.44 | 80.22 | 80.53 | 76.19 | 72.12 | 68.25 | 73.66 | 68.71 | 69.44 | |
| Action Consist. | STAC [18] | - | - | 83.07 | 85.31 | 46.55 | 47.91 | 60.74 | 62.21 | 63.45 | 65.14 |
| STAC-Single | - | - | 85.46 | 81.16 | 68.46 | 69.39 | 68.71 | 70.40 | 74.21 | 73.65 | |
| SAFE (Ours) | SAFE-LSTM | 70.24 | 72.47 | 92.98 | 84.48 | 76.98 | 71.09 | 88.85 | 80.11 | 82.26 | 77.04 |
| SAFE-MLP | 72.68 | 73.47 | 90.06 | 80.44 | 73.50 | 73.27 | 89.50 | 84.82 | 81.43 | 78.00 | |
The following figure (Figure 4 from the original paper) illustrates the accuracy–timeliness trade-off using CP across benchmarks:
This chart plots balanced accuracy (Bal-Acc) against average detection time ($T_{\mathrm{det}}$) for different methods on OpenVLA LIBERO, π0 LIBERO, π0-FAST LIBERO, and SimplerEnv. The curves show that the proposed SAFE-LSTM and SAFE-MLP match or outperform the best baselines.
6.2. Ablations and Additional Analyses
- Visual latent-space analyses (t-SNE) across datasets show successful vs. failed rollouts separate in VLA feature spaces, often revealing a common "failure zone." Even when visual separation is less clear (e.g., real Franka tasks), SAFE still outperforms baselines.
- Qualitative detections with CP bands: The following figure (Figure 5 from the original paper) shows failures detected by SAFE-LSTM aligning with actual rollouts. It contains three subplots, one per policy (π0-FAST, OpenVLA, and π0): the top of each shows camera observations during manipulation, and the bottom shows the failure-score curve over time. The blue region is the confidence band for successful rollouts; once the failure score leaves this band, a failure alarm is triggered.
- Real-world performance: The following figure (Figure 6 from the original paper) shows SAFE-MLP leading on ROC-AUC, with qualitative examples for π0-FAST on Franka and OpenVLA on WidowX. Panel (a) compares ROC-AUC on seen and unseen tasks; panels (b)-(e) show qualitative successful and failed rollouts with score evolution over timesteps.
- Number of training tasks: Increasing training-task diversity generally improves unseen-task performance; SAFE retains strong performance even with fewer training tasks.
- Foundation-model features vs. VLA features: Probing last-layer VLA features substantially outperforms DINOv2 or CLIP features, indicating VLAs' latent spaces carry task-execution semantics (success/failure) more directly.
The following are the results from Table 6 of the original paper:
| # Training Tasks | 1 Seen | 1 Unseen | 3 Seen | 3 Unseen | 5 Seen | 5 Unseen | 7 Seen | 7 Unseen |
|---|---|---|---|---|---|---|---|---|
| Mahalanobis | 40.21 | 52.75 | 58.00 | 52.31 | 57.68 | 50.78 | 62.03 | 58.85 |
| Euclid. k-NN | 49.74 | 63.76 | 61.66 | 67.02 | 59.14 | 67.11 | 66.00 | 55.23 |
| Cosine k-NN | 53.27 | 60.76 | 65.39 | 65.64 | 67.46 | 70.57 | 67.09 | 69.45 |
| PCA-KMeans | 60.39 | 40.58 | 61.18 | 52.87 | 61.50 | 53.06 | 57.18 | 55.10 |
| RND | 29.29 | 50.32 | 54.46 | 47.39 | 56.71 | 49.15 | 52.57 | 46.88 |
| LogpZO | 61.75 | 56.17 | 52.89 | 50.49 | 65.99 | 56.60 | 61.57 | 52.91 |
| SAFE-LSTM | 50.88 | 52.25 | 68.85 | 63.31 | 70.70 | 66.31 | 70.24 | 72.47 |
| SAFE-MLP | 54.34 | 63.76 | 67.86 | 67.03 | 69.32 | 68.17 | 72.68 | 73.47 |
The following are the results from Table 7 of the original paper:
| Method Eval Task Split | LSTM Seen | LSTM Unseen | MLP Seen | MLP Unseen |
|---|---|---|---|---|
| DINOv2 | 76.93 | 56.96 | 76.20 | 59.46 |
| CLIP | 76.77 | 52.71 | 77.88 | 59.77 |
| DINOv2+CLIP | 77.09 | 59.65 | 76.36 | 58.43 |
| VLA (Ours) | 77.27 | 58.70 | 86.76 | 64.16 |
The following are the results from Table 8 of the original paper:
| VLA Model Benchmark Eval Task Split | OpenVLA LIBERO | | π0-FAST LIBERO | | π0 LIBERO | | π0 SimplerEnv | | π0-FAST Real Franka | |
| | Seen | Unseen | Seen | Unseen | Seen | Unseen | Seen | Unseen | Seen | Unseen |
| Max prob. | 50.25±2.51 | 53.83±6.32 | 61.32±9.57 | 69.44±13.61 | - | - | - | - | 53.74±3.46 | 48.59±3.00 |
| Avg prob. | 44.05±1.26 | 51.58±1.82 | 52.46±3.44 | 58.04±5.64 | - | - | - | - | 51.60±3.12 | 47.30±4.32 |
| Max entropy | 52.94±4.36 | 53.09±7.68 | 46.69±13.33 | 62.96±19.62 | - | - | - | - | 59.23±3.06 | 53.50±3.15 |
| Avg entropy | 45.27±1.78 | 50.03±3.18 | 50.93±1.22 | 58.63±3.47 | - | - | - | - | 50.67±3.96 | 46.08±4.79 |
| Mahalanobis dist. | 62.03±5.11 | 58.85±4.16 | 93.56±2.32 | 83.79±7.18 | 77.12±8.57 | 74.31±12.64 | 88.42±2.82 | 52.84±31.97 | 75.54±4.07 | 53.93±5.06 |
| Euclidean dist. k-NN | 66.00±2.33 | 55.23±10.05 | 92.04±2.39 | 84.12±6.47 | 75.64±6.20 | 70.73±16.69 | 89.73±3.08 | 68.41±9.22 | 80.35±5.36 | 60.27±4.79 |
| Cosine dist. k-NN | 67.09±2.74 | 69.45±6.14 | 92.09±1.70 | 84.64±4.90 | 75.76±6.16 | 70.31±16.84 | 90.19±4.05 | 71.32±12.02 | 80.23±5.12 | 59.51±5.76 |
| PCA-KMeans [9] | 57.18±2.04 | 55.10±1.16 | 68.46±4.92 | 57.12±10.44 | 64.92±8.90 | 60.35±19.93 | 66.88±5.10 | 61.19±14.76 | 51.91±4.20 | 49.86±6.19 |
| RND [39] | 52.57±4.56 | 46.88±4.92 | 88.67±3.05 | 81.57±8.67 | 71.92±7.02 | 69.44±19.39 | 85.07±4.04 | 65.89±6.52 | 62.00±5.44 | 45.83±5.10 |
| LogpZO [8] | 61.57±3.62 | 52.91±5.79 | 91.52±2.39 | 83.07±7.17 | 76.80±9.12 | 73.23±11.64 | 88.79±4.92 | 74.66±14.96 | 64.43±7.82 | 52.24±3.68 |
| Action total var. | 62.76±1.66 | 65.43±2.50 | 76.95±7.22 | 74.50±12.19 | 77.20±5.65 | 75.18±5.08 | 68.41±10.81 | 67.94±15.97 | - | - |
| Trans. total var. | 55.33±2.06 | 58.99±5.13 | 78.21±4.09 | 80.03±9.11 | 49.38±9.95 | 54.71±7.57 | 63.27±7.17 | 55.90±19.19 | - | - |
| Rot. total var. | 47.85±2.88 | 55.30±4.38 | 80.87±5.85 | 77.29±8.71 | 52.94±7.56 | 61.06±10.60 | 58.07±10.41 | 62.10±9.39 | - | - |
| Gripper total var. | 61.84±2.67 | 64.48±3.05 | 76.82±7.10 | 74.42±12.13 | 77.19±5.66 | 75.19±5.08 | 69.16±9.50 | 69.29±14.77 | - | - |
| Cluster entropy | 50.16±2.36 | 51.44±1.01 | 80.22±7.37 | 80.53±8.65 | 76.19±4.31 | 72.12±1.04 | 68.25±9.03 | 73.66±16.03 | - | - |
| STAC [18] | - | - | 83.07±4.61 | 85.31±6.71 | 46.55±8.90 | 47.91±20.94 | 60.74±13.89 | 62.21±16.72 | - | - |
| STAC-Single | - | - | 85.46±6.55 | 81.16±8.63 | 68.46±5.10 | 69.39±8.22 | 68.71±7.06 | 70.40±8.76 | 45.24±3.68 | 38.01±9.81 |
| SAFE-LSTM | 70.24±1.49 | 72.47±5.55 | 92.98±2.62 | 84.48±7.29 | 76.98±5.34 | 71.09±6.94 | 88.85±6.30 | 80.11±10.49 | 77.27±5.82 | 58.70±4.37 |
| SAFE-MLP | 72.68±2.38 | 73.47±5.39 | 90.06±2.82 | 80.44±5.72 | 73.50±7.43 | 73.27±11.85 | 89.50±4.49 | 84.82±8.12 | 86.76±2.64 | 64.16±5.88 |
6.3. Additional Tables (Datasets and Splits)
The following are the results from Table 2 of the original paper:
| Embodiment | Task ID | Environment Name | π0 Success Rate (%) |
|---|---|---|---|
| Google Robot | 1 | google_robot_move_near_v0 | 77 |
| Google Robot | 2 | google_robot_open_drawer | 50 |
| Google Robot | 3 | google_robot_close_drawer | 80 |
| Google Robot | 4 | google_robot_place_apple_in_closed_top_drawer | 40 |
| WidowX | 1 | widowx_carrot_on_plate | 44 |
| WidowX | 2 | widowx_put_eggplant_in_basket | 88 |
| WidowX | 3 | widowx_spoon_on_towel | 79 |
| WidowX | 4 | widowx_stack_cube | 43 |
The following are the results from Table 3 of the original paper:
| Task | Instruction | Rollout Length T |
|---|---|---|
| 1 | close the door | 300 |
| 2 | close the drawer | 200 |
| 3 | pick up the ball and place it in the bowl | 400 |
| 4 | pick up the knife and put it on the plate | 350 |
| 5 | pick up the lid and place it on the pot | 400 |
| 6 | pick up the lid from the pot and place it on the table | 400 |
| 7 | pick up the marker and place it in the cup | 400 |
| 8 | place the green block on the yellow block | 350 |
| 9 | place the pink cup to the right of the blue cup | 300 |
| 10 | press the button | 200 |
| 11 | put both the carrot and the ball in the bowl | 500 |
| 12 | put the cup to the upright position | 500 |
| 13 | unfold the cloth | 500 |
The following are the results from Table 4 of the original paper:
| Task | Instruction |
|---|---|
| 1 | Lift AAA Battery |
| 2 | Lift Eggplant |
| 3 | Lift Red Bottle |
| 4 | Lift Blue Cup |
| 5 | Put Blue Cup on Plate |
| 6 | Put the Red Bottle into Pot |
| 7 | Put the Carrot on Plate |
| 8 | Put the Red Block into the Pot |
The following are the results from Table 5 of the original paper:
| Benchmark | Seen Tasks | Unseen Tasks | Total Tasks | Train Rollouts | Eval Seen | Eval Unseen | Total Rollouts |
|---|---|---|---|---|---|---|---|
| LIBERO | 7 | 3 | 10 | 210 | 140 | 150 | 500 |
| π0 SimplerEnv, Google Robot | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| π0 SimplerEnv, WidowX | 2 | 2 | 4 | 198 | 102 | 100 | 400 |
| Octo SimplerEnv | 9 | 3 | 12 | 594 | 306 | 300 | 1200 |
| Real Franka | 10 | 3 | 13 | 450 | 150 | 180 | 780 |
| Real WidowX | 6 | 2 | 8 | 250 | 133 | 149 | 532 |
7. Conclusion & Reflections
7.1. Conclusion Summary
SAFE tackles multitask failure detection for generalist VLAs by probing last-layer internal features and learning a scalar failure score over time, with thresholds calibrated via functional conformal prediction. Across diverse models and environments, SAFE achieves state-of-the-art ROC-AUC and the best accuracy–timeliness trade-offs, outperforming representative baselines (token UQ, sample consistency, OOD detectors, and STAC). Analyses show VLAs encode task-generic success/failure signals in their latent spaces; SAFE leverages this structure efficiently and robustly.
7.2. Limitations & Future Work
- Scope: Focused on manipulation tasks; generalization across embodiments, sim2real gaps, or action-less video conditioning remains open.
- Feature depth: Uses last-layer features only; fusing multi-layer information or discovering optimal intermediate layers may further improve detection.
- CP under shift: Functional CP bands calibrated on seen tasks may face distribution shift on unseen tasks; exploring adaptive/online CP could enhance calibration.
- Recovery/improvement: SAFE provides timely detection; integrating with recovery policies or activation steering to proactively correct VLA behaviors is a promising direction, but closed-loop manipulation complicates direct transfer from LLM steering.
7.3. Personal Insights & Critique
- Strengths:
- Elegant insight that VLA latents carry generic failure semantics, demonstrated visually and empirically.
- Practicality: small MLP/LSTM probes with negligible runtime overhead; compatible with multiple VLAs without retraining them.
- Principled thresholding: CP-based bands offer interpretable, calibrated alerts and tunable conservativeness.
- Considerations:
- While SAFE generalizes across tasks, it still requires training on a diverse set of successes and failures. The real-world cost of collecting failures can be non-trivial; leveraging synthetic/augmented failures or semi-supervised methods might reduce data burden.
- CP coverage guarantees hinge on exchangeability; performance under significant domain shift may deviate. The paper acknowledges deviations in TNR curves; future work could incorporate adaptive CP to correct calibration drift.
- Feature aggregation choices can impact performance; automated feature-selection across layers or attention-based pooling might improve robustness.
- Transferability:
- The latent-probing paradigm is broadly applicable: beyond VLAs, any large multimodal policy may encode execution-status signals. SAFE-like probes could monitor failures in mobile manipulation, navigation, or multimodal human-robot interaction.
- Combining SAFE with policy improvement (e.g., activation steering guided by separation vectors) is compelling but requires careful study of the causality between latents and physical outcomes in embodied agents.
In summary, SAFE is a timely, well-motivated, and empirically strong contribution to safe deployment of generalist VLAs. Its simplicity and effectiveness make it a valuable building block for practical robot monitoring and a foundation for future recovery and improvement strategies.