RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
TL;DR Summary
The RICL framework injects in-context adaptability into pre-trained Vision-Language-Action (VLA) models via a dedicated fine-tuning recipe, enabling end users to improve performance on new tasks with only 10-20 demonstrations and no parameter updates at deployment.
Abstract
Multi-task "vision-language-action" (VLA) models have recently demonstrated increasing promise as generalist foundation models for robotics, achieving non-trivial performance out of the box on new tasks in new environments. However, for such models to be truly useful, an end user must have easy means to teach them to improve. For language and vision models, the emergent ability to perform in-context learning (ICL) has proven to be a versatile and highly useful interface to easily teach new tasks with no parameter finetuning. Unfortunately, VLAs pre-trained with imitation learning objectives do not naturally acquire ICL abilities. In this paper, we demonstrate that, with the right finetuning recipe and a small robot demonstration dataset, it is possible to inject in-context adaptability post hoc into such a VLA. After retraining for in-context learning (RICL), our system permits an end user to provide a small number (10-20) of demonstrations for a new task. RICL then fetches the most relevant portions of those demonstrations into the VLA context to exploit ICL, performing the new task and boosting task performance. We apply RICL to inject ICL into the π₀-FAST VLA, and show that it permits large in-context improvements for a variety of new manipulation tasks with only 20 demonstrations per task, without any parameter updates. When parameter updates on the target task demonstrations are possible, RICL finetuning further boosts performance. We release code and model weights for RICL-π₀-FAST alongside the paper to enable, for the first time, a simple in-context learning interface for new manipulation tasks. Website: https://ricl-vla.github.io.
In-depth Reading
1. Bibliographic Information
1.1. Title
RICL: Adding In-Context Adaptability to Pre-Trained Vision-Language-Action Models
The title clearly states the paper's core contribution: a method named RICL designed to bestow in-context adaptability—the ability to learn from examples on the fly—onto existing, pre-trained robotic models that understand vision, language, and can perform actions (Vision-Language-Action or VLA models).
1.2. Authors
The authors are Kaustubh Sridhar, Souradeep Dutta, Dinesh Jayaraman, and Insup Lee. Their affiliations are with the University of Pennsylvania and the University of British Columbia, both of which are prominent institutions with strong research programs in robotics and artificial intelligence.
1.3. Journal/Conference
The paper is available as a preprint on arXiv, a popular repository for academic papers before they undergo peer review. The provided publication date of August 4, 2025, suggests it has been submitted for publication at a future conference. Given the topic and the authors' related work (e.g., REGENT at ICLR 2025), it is likely targeting a top-tier machine learning or robotics conference such as the International Conference on Learning Representations (ICLR), the Conference on Robot Learning (CoRL), or Neural Information Processing Systems (NeurIPS), which are highly influential venues in the field.
1.4. Publication Year
The metadata indicates a publication year of 2025, with the first version appearing on arXiv on August 4, 2025.
1.5. Abstract
The abstract introduces the problem that large-scale Vision-Language-Action (VLA) models, despite their promise as generalist robot brains, lack the in-context learning (ICL) ability seen in Large Language Models (LLMs). This means they cannot easily improve on new tasks without costly parameter updates. The authors propose RICL (Retraining for In-Context Learning), a finetuning method to inject this ICL capability into a pre-trained VLA. After RICL training, a user can provide a small number of demonstrations (10-20) for a new task. The system then automatically retrieves relevant parts of these demonstrations to use as context, enabling the VLA to perform the new task with significantly improved performance, all without any further training. The authors demonstrate this by applying RICL to the π₀-FAST VLA, showing large performance gains on new manipulation tasks. They also find that if further training is permissible, finetuning the RICL-enabled model is far more effective than standard finetuning of the base VLA. The authors release their code and models to facilitate further research.
1.6. Original Source Link
- Original Source Link: https://arxiv.org/abs/2508.02062
- PDF Link: https://arxiv.org/pdf/2508.02062v1.pdf
- Publication Status: This is a preprint on arXiv and has not yet been officially published in a peer-reviewed venue.
2. Executive Summary
2.1. Background & Motivation
In recent years, the field of robotics has seen the rise of "foundation models" for robotics, known as Vision-Language-Action (VLA) models. These are large neural networks trained on vast datasets of robot interactions, enabling them to perform a variety of tasks based on visual input and natural language commands (e.g., "pick up the red apple"). While these models show impressive generalization out of the box, they have a critical weakness: improving them is difficult. If a VLA fails at a new task, the standard solution is to collect more demonstration data for that specific task and perform finetuning, a process that updates the model's millions of parameters and requires significant computational resources and expertise.
This contrasts sharply with the user experience of modern Large Language Models (LLMs) like GPT-4. A key reason for their widespread adoption is their emergent ability of in-context learning (ICL). An LLM can learn a new task (e.g., translating English to pirate-speak) simply by being shown a few examples in the input prompt, with no need to retrain the model itself.
The core problem this paper addresses is the absence of ICL in robotic VLAs. The authors argue that for VLAs to be truly useful general-purpose tools, end-users need an equally simple way to teach them new skills. The central research question is: How can we inject in-context learning abilities into a pre-trained VLA? The innovative idea is to develop a specific training recipe that doesn't build a new model from scratch, but rather "unlocks" the ICL potential within an existing, powerful VLA.
2.2. Main Contributions / Findings
The paper presents several key contributions to solve this problem:
- A Novel Finetuning Recipe (RICL): The authors propose RICL (Retraining for In-Context Learning), a method to post-train an existing VLA. This recipe reframes the learning objective to explicitly teach the model how to use contextual examples to solve a given task.
- Parameter-Free Adaptation via RAG+ICL: The primary finding is that after RICL training, a VLA gains the ability to adapt to new tasks without any parameter updates. An end user can collect a small set of 10-20 demonstrations for a novel task. The RICL-VLA then uses a Retrieval-Augmented Generation (RAG) mechanism to automatically find the most relevant moments from these demonstrations and feeds them into its context. This allows the model to perform the new task with a significant performance boost, purely through ICL.
- Superior Performance with Finetuning: The paper shows that when further training on the new task's data is an option, finetuning the RICL-VLA is substantially more effective than applying standard finetuning to the original VLA. Across all evaluation tasks, the finetuned RICL-VLA achieved an aggregate success rate of 61.67%, nearly double the 31.67% achieved by the finetuned original VLA.
- Practical Demonstration and Open-Sourcing: The authors successfully apply RICL to a state-of-the-art model, π₀-FAST-DROID. They demonstrate its effectiveness on a range of challenging manipulation tasks involving unseen objects, novel motions, and new environments. To enable others to build upon this work, they release their code and the trained RICL-π₀-FAST-DROID model weights.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Vision-Language-Action (VLA) Models: These are sophisticated neural networks designed to control robots. They take in multiple types of data (modalities) as input:
  - Vision: Images from the robot's cameras (e.g., from its wrist, head).
  - Language: A natural language command specifying the task (e.g., "put the cup in the sink").
  - Action: In some architectures, the model's own previous actions are fed back in.
  The model's output is a sequence of actions for the robot to execute, such as joint velocities or gripper commands. VLAs are typically trained via imitation learning, where they learn to mimic actions from a large dataset of human-teleoperated demonstrations.
- In-Context Learning (ICL): A remarkable capability of large transformer-based models (especially LLMs). It allows the model to learn a new task at inference time by conditioning on a few input-output examples provided in the prompt (the "context"). For instance, you could provide the prompt: "Translate English to French. sea otter → loutre de mer. peppermint →". The model will likely answer "menthe poivrée" by recognizing the pattern from the example, without its internal weights being updated. VLAs trained with standard imitation learning objectives do not naturally exhibit this ability.
- Retrieval-Augmented Generation (RAG): A technique to enhance the performance of generative models like LLMs. Instead of relying solely on the knowledge stored in its parameters, a RAG system first retrieves relevant information from an external knowledge base (like a database of documents or, in this paper, robot demonstrations) based on the user's query. This retrieved information is then added to the model's context, providing it with up-to-date and specific information to generate a more accurate response.
- Imitation Learning (IL): A paradigm in machine learning for teaching an agent how to perform a task by showing it examples. The simplest form is Behavior Cloning (BC), where the model is trained as a supervised learning problem to map states (what the robot sees) to actions (what the expert did in that state). Most large VLAs are trained using variants of BC on massive datasets.
3.2. Previous Works
- Generalist VLA Models: The paper builds upon a recent wave of powerful VLAs. The specific model used is π₀-FAST [1], which is an evolution of π₀ [2].
  - π₀-FAST: This is an autoregressive VLA, meaning it predicts actions one step at a time. It is based on Google's PaliGemma vision-language model, which combines a SigLIP image encoder with the Gemma LLM. Its key innovation is the FAST tokenizer, which efficiently converts continuous robot actions into discrete tokens that the LLM can process alongside text tokens. The model used in the paper, π₀-FAST-DROID, is a version of π₀-FAST that was further finetuned on the large-scale DROID dataset [25].
- ICL for Generalist Agents: The authors acknowledge three prior works that trained agents with ICL capabilities, all in simulated environments. The most influential for this paper is REGENT [12].
  - REGENT: This method trains a generalist agent from scratch to perform ICL. For a new task, it is given a few expert demonstrations. During execution, it retrieves state-action pairs from these demonstrations that are similar to the current state and places them in its context. The model then predicts an action based on both the current state and the retrieved examples. RICL adopts this retrieval-based ICL strategy but applies it as a post-training step to an existing VLA, rather than training from scratch.
- ICL in Robotics using LLMs/VLMs: Other researchers have leveraged the native ICL abilities of off-the-shelf LLMs and Vision-Language Models (VLMs) for robotics.
  - Methods like [19, 20, 21] use LLMs to generate action sequences (represented as code or keypoints) based on in-context examples. However, the paper notes two major drawbacks: these methods often operate open-loop (they generate a full plan and execute it without feedback, making them brittle to errors) and, if using only LLMs, they lose crucial visual information needed for precise, reactive control. RICL avoids these issues by integrating ICL directly into a closed-loop VLA that continuously processes images.
3.3. Technological Evolution
The field of robot learning has progressed from training separate policies for each individual task to developing multi-task policies capable of handling a fixed set of tasks. The latest evolution is the creation of large, pre-trained generalist VLAs like RT-X, Octo, and π₀, which are trained on massive, diverse datasets and can generalize to a certain extent to new tasks and environments "zero-shot".
However, this created a new challenge: how to efficiently adapt or teach these giant models when they fail. RICL positions itself as the solution to this "last-mile" problem. It represents a shift from simply scaling up pre-training data to developing methods for efficient post-training adaptation. It bridges the gap between the raw power of VLAs and the user-friendly adaptability of LLMs.
3.4. Differentiation Analysis
- RICL vs. REGENT: While REGENT introduced the core idea of retrieval-augmented ICL for agents, it required training a model from the ground up, which is computationally expensive and doesn't leverage existing powerful foundation models. RICL's key innovation is its post-hoc recipe, making it a practical tool to upgrade existing VLAs like π₀-FAST.
- RICL vs. LLM-based Robotics: Unlike methods that use general-purpose LLMs as a high-level planner, RICL integrates the ICL mechanism directly into a dedicated, end-to-end VLA. This allows the RICL-VLA to remain a closed-loop, reactive policy that continuously uses visual feedback for fine-grained control, overcoming the brittleness of open-loop execution common in LLM-based approaches.
- RICL vs. Standard Finetuning: Standard finetuning updates a model's weights to specialize it for a new task, effectively making it "forget" how to generalize. RICL instead teaches the model how to learn from context. This allows a single RICL-VLA model to adapt to many different tasks on the fly, without needing to store separate finetuned weights for each one.
4. Methodology
4.1. Principles
The core principle of RICL is to change how a VLA is trained. Instead of just learning to map a single state to an action (state → action), RICL teaches the model to answer the question: "Given these examples of how to do a task, what action should I take in my current state?"
This is achieved by post-training (or finetuning) a pre-trained VLA on a specially constructed "priming" dataset. Each data sample in this priming phase is structured to mimic an in-context learning problem at deployment. The model learns to look at the provided examples (the "context"), compare them to its current situation (the "query"), and produce an action that is intelligently informed by those examples. This primes the model to effectively use its context window for adaptation.
4.2. Core Methodology In-depth (Layer by Layer)
The RICL methodology can be broken down into its architecture, how it processes data during training and deployment, and its learning objective.
4.2.1. Architecture and Data Flow
The paper applies RICL to the π₀-FAST VLA, resulting in a RICL-π₀-FAST model. The overall architecture is shown in Figure 2 from the paper.
Figure 2 (from the paper): a diagram of the RICL-π₀-FAST architecture. It shows the underlying large language model, the retrieved demonstrations, and the DINO-based weighted action interpolation layer; the left side depicts the input image embeddings, and the right side the query flow and outputs.
- Inputs: The model takes two sets of inputs:
  - The Query: This represents the robot's current situation at time t. It consists of the current camera images (top, side, wrist), the natural language task instruction (e.g., "push the lever on the toaster"), and the robot's proprioceptive state (e.g., joint angles).
  - The Retrieved Context: This is a set of "neighbor" examples retrieved from a small demonstration dataset specific to the target task. Each neighbor consists of images, a proprioceptive state, and the corresponding action chunk that the expert performed.
- Input Processing:
  - Frozen Image Encoder: The image encoder of the base VLA (in this case, π₀-FAST's SigLIP encoder) is kept frozen. This is crucial as it preserves the powerful visual representations learned during large-scale pre-training. It processes all images (from both the query and the retrieved context) into feature embeddings.
  - LLM Finetuning: The outputs from the image encoder, along with the text instruction and state information, are tokenized and fed as a long sequence into the LLM component of the VLA (in this case, Gemma). Only the LLM's parameters are updated during the RICL training process. The input sequence is ordered with the most relevant neighbor first, followed by less relevant ones, and finally the query: [neighbor 1 (closest), neighbor 2, ..., neighbor 4, query]. (A sketch of this packing appears after this list.)
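To make the data flow concrete, here is a minimal, hypothetical Python sketch of how retrieved neighbors and the query might be packed into one input sequence for the LLM. The names (`Snippet`, `build_context_sequence`) are invented for illustration; the real π₀-FAST tokenization and separator handling are more involved.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Snippet:
    """One (observation, action) example. Images are assumed to be pre-encoded by the
    frozen SigLIP encoder; actions are assumed to be FAST-tokenized already."""
    image_embeddings: List[List[float]]  # one embedding vector per camera view
    state_tokens: List[int]              # tokenized proprioceptive state
    action_tokens: List[int]             # tokenized action chunk (empty for the query)

def build_context_sequence(neighbors: List[Snippet], query: Snippet) -> list:
    """Pack [neighbor 1, ..., neighbor 4, query] into one flat input sequence,
    most relevant neighbor first and the query last, as described above."""
    sequence: list = []
    for nb in neighbors:                 # ordered most -> least relevant
        sequence += nb.image_embeddings  # continuous image features
        sequence += nb.state_tokens      # discrete state tokens
        sequence += nb.action_tokens     # neighbor actions stay visible in context
    sequence += query.image_embeddings
    sequence += query.state_tokens       # the model predicts the query's action tokens next
    return sequence
```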
4.2.2. Retrieval Mechanism (RAG)
At each timestep during deployment, the RICL-VLA performs a retrieval step to build its context:
- Embed Query Image: The top-view image from the current query is passed through an off-the-shelf, pre-trained DINO-v2 image encoder to get a feature embedding. DINO-v2 is chosen for its strong semantic feature extraction capabilities.
- Find Nearest Neighbors: This query embedding is compared against a pre-computed set of DINO-v2 embeddings for all top-view images in the task-specific demonstration buffer.
- Select Context: The demonstration states with the smallest L2 distances (i.e., the most visually similar states) are selected, along with their action chunks, to form the context; the paper retrieves four neighbors. (A minimal retrieval sketch follows this list.)
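The retrieval step is easy to prototype. The sketch below is an illustration under stated assumptions, not the authors' code: it loads the publicly available DINO-v2 ViT-S/14 backbone from torch.hub (the model size is a choice made for this example) and performs plain L2 nearest-neighbor search, with k = 4 following the number of neighbors used during priming.

```python
import torch

# Frozen DINO-v2 backbone, used purely as a retrieval key (ViT-S/14 chosen for the sketch).
dino = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").eval()

@torch.no_grad()
def embed_top_view(images: torch.Tensor) -> torch.Tensor:
    """images: (B, 3, H, W) floats in [0, 1], with H and W multiples of 14.
    Returns (B, 384) global DINO-v2 features after ImageNet normalization."""
    mean = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
    std = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)
    return dino((images - mean) / std)

def retrieve_neighbors(query_emb: torch.Tensor, buffer_embs: torch.Tensor, k: int = 4):
    """Return indices and L2 distances of the k demo states closest to the query.
    buffer_embs holds pre-computed embeddings of every top-view frame in the demo buffer."""
    dists = torch.cdist(query_emb[None], buffer_embs)[0]   # (N,) distances to all buffer frames
    nearest = torch.topk(dists, k, largest=False)
    return nearest.indices, nearest.values                 # neighbor ids, distances d_1..d_k
```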
4.2.3. Action Prediction with Interpolation
This is the most innovative part of the RICL architecture. The final action is not just the output of the LLM. Instead, it's a weighted blend of the LLM's prediction and the action from the single closest retrieved neighbor. This is handled by the action interpolation layer.
The final probability distribution over the output action tokens is calculated as:

$$P(a_t) \;=\; w \cdot \mathrm{onehot}\!\left(a_t^{\mathrm{NN}}\right) \;+\; (1 - w) \cdot \mathrm{softmax}\!\left(\mathrm{LLM}_\theta(\text{retrieved context},\ \text{query})\right), \qquad w = e^{-\beta d_1}$$

Let's break down this formula symbol by symbol:
- P(a_t): This is the final output of the RICL-VLA model. It represents the probability distribution over all possible action tokens. The action executed by the robot is sampled from this distribution.
- w: This is the interpolation weight. It determines how much to trust the retrieved action versus the LLM's own prediction.
  - d_1: The L2 distance between the DINO-v2 embedding of the current query's top image and that of the closest neighbor. A small d_1 means the current situation is very similar to a demonstrated one.
  - β: A hyperparameter (set to 10 in the paper) that controls the sharpness of the interpolation. A larger β makes the weight change more rapidly with distance.
  - When d_1 is very small (high similarity), w approaches 1. When d_1 is large (low similarity), w approaches 0.
- onehot(a_t^NN): The action from the single closest neighbor, represented as a one-hot vector. This is a "hard" target, directly suggesting an action.
- (1 − w): This is the complementary weight for the LLM's output. If the weight for the retrieved action is high, this weight is low, and vice-versa.
- softmax(LLM_θ(retrieved context, query)): This is the standard output of the VLA's LLM component (LLM_θ). It takes the entire sequence of retrieved examples and the query as input and produces a probability distribution over action tokens, which is then passed through a softmax function.
In essence, this formula creates a dynamic policy: if the robot is in a state very similar to one it has seen in a demonstration, it will heavily favor copying the demonstrated action. If it's in a novel state far from any demonstration, it will rely on the generalized knowledge of its internal LLM, which has seen the context but is making its own inference.
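A compact way to see the interpolation is as a few lines of code. The sketch below is a hypothetical PyTorch rendering of the blend described above; the exponential weight w = exp(−β·d₁) is an assumption consistent with the text, and the function and argument names are invented.

```python
import torch
import torch.nn.functional as F

def interpolated_action_distribution(
    llm_logits: torch.Tensor,       # (vocab,) logits from the VLA's LLM for the next action token
    nearest_action_token: int,      # FAST token of the closest neighbor's action
    d1: float,                      # L2 distance between query and nearest-neighbor DINO-v2 embeddings
    beta: float = 10.0,             # sharpness hyperparameter (10 in the paper)
) -> torch.Tensor:
    """Blend the retrieved action with the LLM prediction: w -> 1 as d1 -> 0, w -> 0 as d1 grows.
    The exponential form of w is an assumption for this sketch."""
    w = torch.exp(torch.tensor(-beta * d1))
    one_hot = F.one_hot(torch.tensor(nearest_action_token), llm_logits.numel()).float()
    return w * one_hot + (1.0 - w) * F.softmax(llm_logits, dim=-1)
```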
4.2.4. Training and Finetuning
- Priming Phase: To create the initial RICL-VLA, the model is finetuned on a "priming" dataset of 20 diverse pick-and-place tasks. During this training, each state in a demonstration is treated as a "query", and four neighbors are retrieved from the other demonstrations of the same task. The model's parameters (i.e., the LLM weights) are updated to minimize the cross-entropy loss between the predicted action distribution and the true ground-truth action for the query. Crucially, the loss is only calculated on the query's action, not the actions in the context (a sketch of this masked loss appears after this list).
- Deployment (RAG + ICL): After priming, the RICL-VLA can be deployed on a completely new task. A small set of demonstrations (e.g., 20) for this new task is provided as the retrieval buffer. The model then performs the task using retrieval and in-context learning, with no further gradient updates.
- Further Finetuning (Optional): For maximum performance, the RICL-VLA can be further finetuned on the new task's 20 demonstrations. The training process is identical to the priming phase: the model retrieves from the same 20 demonstrations it is being trained on, and the objective remains to predict the query action. This is more effective than standard finetuning because the model can focus its capacity on learning how to interpolate between the provided examples, rather than having to memorize the entire task from scratch.
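As a rough illustration of the "loss on the query action only" idea, here is a minimal, assumption-based sketch of the masked cross-entropy; the tensor names and shapes are hypothetical and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def priming_loss(action_logits: torch.Tensor,
                 target_action_tokens: torch.Tensor,
                 query_action_mask: torch.Tensor) -> torch.Tensor:
    """Cross-entropy restricted to the query's action tokens.

    action_logits:        (T, vocab) logits at every position of the packed
                          [neighbors..., query] sequence.
    target_action_tokens: (T,) ground-truth token ids (ignored outside the mask).
    query_action_mask:    (T,) bool, True only at the query's action positions,
                          so no loss is applied to the in-context neighbor actions.
    """
    logits = action_logits[query_action_mask]           # (M, vocab)
    targets = target_action_tokens[query_action_mask]   # (M,)
    return F.cross_entropy(logits, targets)
```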
5. Experimental Setup
5.1. Datasets
- Priming Dataset: To teach the base VLA the skill of in-context learning, the authors collected a custom dataset.
  - Source & Scale: 400 total demonstrations collected using a Franka DROID robot platform. This consisted of 20 demonstrations for each of 20 different pick-and-place tasks (e.g., "move the apple to the right", "put the cloth in the bowl").
  - Characteristics: The initial positions of objects were randomized to ensure variety. The tasks involved a range of common objects like fruits, cups, and boxes (see Figure 6).
  - Purpose: This dataset's role is not to teach the specific tasks, but to teach the model the general skill of using contextual demonstrations.
- Evaluation Datasets: To test the RICL-VLA's adaptation capabilities, the authors designed a suite of challenging new tasks.
  - Source & Scale: For each evaluation task, a small set of 20 demonstrations was collected to serve as the retrieval buffer for RICL and the finetuning data for baseline comparisons.
  - Characteristics: These tasks were designed to be "out-of-distribution" for the original π₀-FAST-DROID model. They included:
    - Unseen Objects: pokeball, bagel, idli plate, squeegee. These objects were not in the DROID pre-training data or the RICL priming data.
    - Novel Motions: Dragging a squeegee, pushing a toaster lever, grasping a uniquely shaped idli plate.
    - New Scenes: Moving tasks from a tabletop to a kitchen sink area with different lighting and camera angles.
    - Long-Tail Tasks: Using variants of objects seen infrequently in the DROID dataset, like a specific brand of toaster or shelf, which tests generalization to rare object types.
  - Example Task: (pokeball) "pick up the pokeball and put it in the tray": This task included an unseen Pokeball object along with distractor objects on the table to test for correct language grounding.
5.2. Evaluation Metrics
The paper uses two primary metrics to evaluate performance across 10 randomized test rollouts for each task.
- Success Rate:
  - Conceptual Definition: This is the most straightforward metric. It measures the percentage of test trials in which the robot successfully completes the entire task from start to finish as specified by the language command. A higher success rate indicates better overall task completion ability.
  - Mathematical Formula:
    $$\text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\%$$
  - Symbol Explanation:
    - Number of Successful Trials: The count of rollouts where the task goal was fully achieved.
    - Total Number of Trials: The total number of evaluation attempts (10 in this paper).
- Checkpoint Completion Rate:
  - Conceptual Definition: Tasks are often composed of multiple sub-steps or "checkpoints" (e.g., 1. reach object, 2. grasp object, 3. move to target, 4. release object). This metric measures the percentage of trials that successfully complete each intermediate checkpoint. It provides a more fine-grained view of a policy's capabilities, helping to identify exactly where it fails (e.g., a model might be good at reaching but bad at grasping).
  - Mathematical Formula: For a given checkpoint $i$:
    $$\text{Checkpoint Completion Rate}_i = \frac{\text{Number of Trials Reaching Checkpoint } i}{\text{Total Number of Trials}} \times 100\%$$
  - Symbol Explanation:
    - Number of Trials Reaching Checkpoint i: The count of rollouts that successfully completed the i-th stage of the task.
    - Total Number of Trials: The total number of evaluation attempts.

A small helper computing both metrics from rollout logs is sketched below.
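The sketch below mirrors the two formulas above; the record format (a list of booleans and a list of per-rollout checkpoint counts) is an invented convention for illustration.

```python
from typing import List

def success_rate(successes: List[bool]) -> float:
    """Percentage of rollouts that completed the full task."""
    return 100.0 * sum(successes) / len(successes)

def checkpoint_completion_rates(checkpoints_reached: List[int], num_checkpoints: int) -> List[float]:
    """checkpoints_reached[j] = number of checkpoints rollout j completed (0..num_checkpoints).
    Returns, for each checkpoint i, the percentage of rollouts that reached it."""
    n = len(checkpoints_reached)
    return [100.0 * sum(c >= i for c in checkpoints_reached) / n
            for i in range(1, num_checkpoints + 1)]
```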
5.3. Baselines
The proposed RICL-π₀-FAST-DROID model is compared against several key baselines to demonstrate its effectiveness.
- π₀-FAST-DROID (Vanilla): The original, off-the-shelf VLA with no modifications. This baseline measures the zero-shot performance of a state-of-the-art model on the new tasks.
- Retrieve and play: A simple non-learning baseline. At every step, it finds the single nearest neighbor in the demonstration data and simply executes (plays back) the corresponding action. This tests whether the performance gains come from intelligent inference or just simple matching.
- Diffusion Policy: A representative, modern imitation learning method trained from scratch only on the 20 task-specific demonstrations. This shows what a standard, task-specific policy can achieve with the same small amount of data.
- π₀-FAST-DROID-finetuned: The original π₀-FAST-DROID model after undergoing standard "vanilla" finetuning on the 20 new task demonstrations. This is a very strong baseline, representing the current standard practice for adapting a VLA.
- RICL-π₀-FAST-DROID-finetuned: The RICL-primed model after it is also finetuned on the 20 new task demonstrations using the RICL objective. This is compared against the vanilla-finetuned baseline to show the benefits of the RICL training paradigm.
6. Results & Analysis
The paper presents compelling quantitative and qualitative results that validate the effectiveness of the RICL method.
6.1. Core Results Analysis
The main quantitative results are summarized in Figure 4, which shows stacked bar plots of success rates across various tasks.
Figure 4 (from the paper): stacked bar charts comparing success rates across tasks and methods. Success rates are shown for six tasks, including "pick up the pokeball and put it in the tray" and "move the idli plate to the right"; the bars under each task show the performance of the different strategies (RICL, finetune, R&P, etc.). The line plot at the bottom right shows how each method's success rate on the idli plate task varies with the number of demonstrations.
Here's a breakdown of the key takeaways from the chart:
- RICL (ICL-only) vs. Base VLA (π₀): Across all tasks, the base π₀-FAST-DROID model (labeled π₀) performs very poorly, with an aggregate full task success rate of just 2.5%. It often fails to even start the task correctly (large gray bars). In contrast, RICL-π₀-FAST-DROID using only RAG and ICL (labeled RICL) achieves a dramatically higher aggregate success rate of 31.25%. This demonstrates that RICL successfully injects in-context learning capabilities, enabling significant, parameter-free adaptation. For instance, on the pokeball task, RICL achieves a high success rate while the base model completely fails.
- RICL-finetuned vs. π₀-finetuned: When parameter updates are allowed, the benefits of RICL become even more pronounced. The RICL-finetuned model achieves an aggregate success rate of 61.67%, which is nearly double the 31.67% achieved by the standard π₀-finetuned model. This shows that finetuning a model already primed for ICL is far more data-efficient and effective than finetuning a standard VLA. The RICL architecture allows the model to better leverage the few available demonstrations.
- RICL (ICL-only) vs. π₀-finetuned: A striking result is that the RICL model using only ICL (31.25% success) performs on par with the base model that has been fully finetuned on the task data (31.67% success). This is a major practical advantage: RICL provides the performance benefits of finetuning but with the ease of a simple, parameter-free, drag-and-drop interface (just providing demonstration files).
- Qualitative Observations: The paper complements these numbers with qualitative examples (Figure 1).
  - Language Grounding: On the pokeball task, the base model gets confused by the distractor duck, whereas the RICL model correctly identifies the unseen "pokeball" from the language command by using the demonstrations as context.
  - Adaptation: On tasks like squeegee and toaster, the base model aimlessly wanders, unable to figure out the novel grasp or motion. The RICL model, by referencing the demonstrations, successfully infers and executes these new behaviors.
  - Emergent Solutions: Interestingly, the authors note that in the idli plate task, RICL solved the task by pushing the plate, a strategy not present in any of the demonstrations (which all used a grasp-and-lift motion). This suggests the model is not just mimicking but may be capable of interpolating its latent knowledge in novel ways to achieve a goal.
6.2. Ablation Studies / Parameter Analysis
The paper includes an important ablation study on the number of demonstrations provided for retrieval/finetuning, shown in the bottom-right plot of Figure 4 for the idli plate task.
- Effect of Demonstration Count: The plot shows that performance generally improves for all methods as the number of demonstrations increases from 5 to 20.
- Minimum Requirement: With only 5 demonstrations, the RICL model's performance is much lower and it begins to exhibit failure modes similar to the base model (e.g., getting distracted by other objects). This indicates that a certain minimum number of examples (around 10-20) is needed to provide a sufficiently rich context for effective ICL.
- Consistent Superiority of RICL-Finetuning: Across all data counts (5, 10, 15, 20), the RICL-finetuned model consistently and significantly outperforms the π₀-finetuned model, reinforcing that the RICL objective is a fundamentally better way to learn from few-shot data.
6.3. Additional Analyses
- No Loss of Capabilities: A crucial sanity check was performed to ensure that the RICL post-training didn't damage the model's original abilities. When tested on tasks from its original training set with irrelevant demonstrations in its context, the RICL model performed just as well as the original π₀-FAST-DROID (80% success rate), demonstrating that it retained its pre-trained knowledge.
- Reactivity and Robustness: The qualitative visualizations in Figures 5 and 9 show the finetuned RICL model reacting in real-time to human perturbations (e.g., moving the target object during an attempt). This confirms that the resulting policy is a robust, closed-loop system, not a brittle open-loop planner.
Figure 5 (from the paper): image sequences of robot rollouts for three tasks: picking up the pokeball and placing it in the tray, moving the plate to the right, and opening the door of the bottom shelf. Each task is shown as a series of frames of the robot interacting with the objects.
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper introduces RICL, a novel and practical method for injecting in-context learning (ICL) capabilities into pre-trained Vision-Language-Action (VLA) models. The authors successfully demonstrate that by post-training a VLA with the RICL recipe, it gains the ability to adapt to entirely new manipulation tasks by referencing a small number of demonstrations provided in its context, all without requiring any parameter updates. This approach significantly boosts performance on tasks with unseen objects and novel motions. Furthermore, the paper shows that if finetuning is performed, the RICL framework provides a much more effective and data-efficient learning objective than standard finetuning, nearly doubling the success rate of the base VLA. By open-sourcing their RICL-π₀-FAST-DROID model, the authors provide the community with the first VLA that has a simple, user-friendly ICL interface for learning new robot skills.
7.2. Limitations & Future Work
The authors transparently discuss the limitations of their current work:
- Bounded by Base Model Capabilities: RICL can help a VLA adapt to variations of tasks it was pre-trained on (like pick-and-place), but it cannot teach it fundamentally new skills that are too far outside its pre-trained knowledge base (e.g., playing tennis). The adaptability is constrained by the latent skills already present in the base VLA.
- Need for Teleoperated Demonstrations: While the number of demonstrations is small (10-20), they still need to be collected via teleoperation for each new task. This is not a fully scalable solution for teaching a vast number of skills. The authors suggest future work could explore using passive human videos, which are much easier to obtain.
- Generalization to Novel Motions: The model still struggles with motions that are significantly different from the pick-and-place motions in the RICL priming dataset. A more diverse set of motions in the priming phase could address this.
- Scaling: The authors propose that scaling up RICL, by using more priming tasks/demonstrations and larger models, is a promising direction for further improving performance and generalization.
7.3. Personal Insights & Critique
- Pragmatic Innovation: The most impressive aspect of RICL is its pragmatism. Instead of proposing a new, monolithic architecture that needs to be trained from scratch, RICL is a clever "patch" that upgrades existing, powerful foundation models. This focus on post-hoc adaptation makes the research immediately applicable and valuable to the robotics community.
- Elegant Simplicity: The core mechanism (retrieval based on DINO-v2 embeddings and a simple weighted interpolation layer) is elegant and effective. It avoids over-complicating the architecture and shows that significant gains can be achieved by properly structuring the learning problem.
- Potential for Broader Impact: The principle behind RICL, explicitly training a model to perform in-context learning, could be highly impactful beyond robotics. Many specialized domains (e.g., medical image analysis, financial modeling) use models trained on narrow datasets that lack adaptability. A RICL-like approach could be used to give these specialized models a user-friendly way to adapt to new data distributions or sub-tasks without full retraining.
- Areas for Further Scrutiny:
  - Retrieval Bottleneck: The retrieval is based solely on the top-down camera view. For tasks where the critical information is in the robot's gripper (wrist camera) or involves complex 3D relationships, this simple retrieval key might be insufficient. Exploring multi-view or proprioception-based retrieval could be a valuable next step.
  - Nature of "Latent Elicitation": The observation of the model generating a novel pushing strategy is fascinating. Is this true "reasoning," or a happy accident of interpolating between different behaviors in the VLA's vast latent space? This phenomenon warrants deeper investigation, as it touches on the core question of whether these models are truly understanding or just performing complex pattern matching.
  - Priming Data Diversity: The paper shows RICL works for pick-and-place. A key question for future work is how the diversity of the "priming" dataset affects the generality of the ICL skill. To make a truly generalist ICL-VLA, one might need a priming dataset covering a much wider range of interaction types (pushing, pulling, cutting, etc.). The "curriculum" for teaching ICL itself is an open and interesting research problem.