REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
TL;DR Summary
REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting ongoing challenges in generalization and robustness for current VLA models.
Abstract
Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real-world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the π_{0}, π_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real-world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm
In-depth Reading
1. Bibliographic Information
1.1. Title
REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation
1.2. Authors
Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc, Mederic Fourmy, Evangelos Kazakos, Cees G. M. Snoek, Josef Sivic, and Vladimir Petrik. The authors are affiliated with the Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague and the University of Amsterdam.
1.3. Journal/Conference
The paper was published on arXiv on December 22, 2025. While the specific venue is not named, the reference style and content suggest it targets high-tier robotics conferences such as ICRA (International Conference on Robotics and Automation) or CoRL (Conference on Robot Learning).
1.4. Publication Year
2025 (submitted to arXiv on 2025-12-22 per the submission metadata and the arXiv identifier 2512.19562).
1.5. Abstract
Vision-Language-Action (VLA) models allow robots to perform tasks from natural language instructions. However, evaluating how these models generalize to new environments is expensive and difficult in the real world. This paper introduces REALM, a simulation environment and benchmark specifically designed to evaluate VLA generalization. REALM features high-fidelity visuals and aligned robot control to ensure a strong correlation between simulated and real-world performance. The benchmark includes 15 perturbation factors, 7 manipulation skills, and over 3,500 objects. The authors evaluate state-of-the-art models (π_{0}, π_{0}-FAST, and GR00T N1.5), demonstrating that generalization remains an open challenge and that simulation can serve as a reliable proxy for real-world testing.
1.6. Original Source Link
- PDF Link: https://arxiv.org/pdf/2512.19562v1.pdf
- Project Page: https://martin-sedlacek.com/realm
- Status: Preprint (arXiv).
2. Executive Summary
2.1. Background & Motivation
The field of robotics is currently moving toward Vision-Language-Action (VLA) models—large neural networks trained on vast amounts of data to translate visual inputs and text instructions directly into robot movements (actions). While these models show great promise, testing them is a bottleneck.
- Real-world testing is slow, expensive, and nearly impossible to reproduce exactly across different labs.
- Simulation testing is fast and scalable, but often suffers from the "sim-to-real gap": a policy that works in a simplified simulation might fail in the real world because the visuals look different or the physics/control of the robot doesn't match reality.

The core problem REALM aims to solve is the lack of a validated simulation benchmark that specifically probes the generalization capabilities of VLA models—how well they handle changes in lighting, object types, and instruction phrasing.
2.2. Main Contributions / Findings
- REALM Environment: A high-fidelity simulation environment (built on NVIDIA IsaacSim) featuring 15 types of perturbations (visual, semantic, and behavioral) and a library of 3,500+ objects.
- Real-to-Sim Validation: The authors conducted nearly 800 parallel rollouts in both the real world and simulation. They found a high Pearson correlation and low Mean Maximum Rank Violation (MMRV), showing that REALM performance accurately predicts real-world performance.
- Benchmark Evaluation: Testing leading models like π_{0} and GR00T N1.5 revealed that even the most advanced VLAs struggle with behavioral generalization (e.g., handling new object shapes) and certain semantic nuances, despite being pre-trained on massive datasets.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
- Vision-Language-Action (VLA) Models: These are robotic "foundation models." They take a camera image and a text string (e.g., "pick up the red mug") and output control signals (e.g., joint velocities). They are often built by fine-tuning Large Language Models (LLMs) or Vision-Language Models (VLMs).
- Generalization: The ability of a model to perform correctly on data it hasn't seen during training. In robotics, this means working in a new kitchen, with a different mug, or under different lighting.
- Perturbations: Systematic changes made to an environment to test robustness.
- Visual: Changing lighting or camera angles.
- Semantic: Changing how the instruction is worded.
- Behavioral: Changing physical properties like the mass or shape of the object.
- Sim-to-Real Gap: The discrepancy between a simulated environment and the physical world. If the gap is too large, simulation results become meaningless for real-world deployment.
3.2. Previous Works
The paper builds on several key developments:
- DROID [45]: A massive real-world dataset for robot manipulation. REALM uses the same robot embodiment (Franka Panda) and many tasks derived from DROID to ensure compatibility.
- SIMPLER [20]: A previous benchmark that attempted to align simulation with real-world policies. REALM expands on this by offering more viewpoints, a much larger object set, and more complex perturbations.
- π_{0} [9]: A state-of-the-art VLA model using "flow matching" (a generative technique similar to diffusion) to predict robot actions.
3.3. Technological Evolution
Early robotics relied on hand-coded task logic. This moved to Imitation Learning (IL), where robots mimic humans. The current era uses Large Behavior Models (LBMs) and VLAs, which leverage Internet-scale data. REALM fits into this timeline as the "testing ground" for these new, data-hungry models.
4. Methodology
4.1. Principles
The core intuition of REALM is that for a simulation to be a valid proxy for reality, it must satisfy two conditions:
- Visual Fidelity: The rendered images must be "realistic" enough that a model trained on real images doesn't perceive a massive "distribution shift."
- Control Alignment: When the model says "move 5cm left," the simulated robot must move exactly as the real physical robot would, accounting for motor latency, friction, and inertia.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. System Identification (Control Alignment)
To minimize the control gap, the authors re-implemented the controller used in the DROID platform within the simulation. They identified 14 free parameters that represent physical characteristics of the robot joints. Specifically, they focused on:
- Joint friction: Mechanical resistance to motion in a joint.
- Joint armature: The reflected inertia of the joint motors (crucial for stability).
To optimize these, they collected real-world trajectories $\pmb{q}_{i,t}^{real}$, where $t$ is the timestep, $i$ is the episode index, and $\pmb{q}$ is the 7D joint angle vector. The following loss function was minimized to align the simulation to the real robot:
$ \mathcal{L}(\theta_{friction}, \theta_{armature}) = \sum_{i=1}^{N} \sum_{t=1}^{T} \| \pmb{q}_{i, t}^{real} - \pmb{q}_{i, t}^{sim} \|_{2}^{2} $
- $\pmb{q}_{i,t}^{real}$: The actual joint positions recorded from the physical robot.
- $\pmb{q}_{i,t}^{sim}$: The joint positions resulting from the same action commands in simulation.
- $\| \cdot \|_{2}^{2}$: The squared L2 norm (Euclidean distance), measuring the error between the two trajectories.
The authors used the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm, a derivative-free optimization method, to find the parameters that minimize this loss.
The following figure (Figure 4 from the original paper) visualizes the result of this alignment:
Figure 4 (from the original paper): Trajectory replays under the default simulation control (left) versus the aligned simulation control (right). The yellow trajectory is the ground truth from the real robot; the blue trajectory comes from simulation. The system identification makes the simulated trajectory tracking much closer to the real robot.
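To make the optimization concrete, here is a minimal Python sketch of the system-identification loop, assuming the off-the-shelf pycma package and a placeholder `rollout_sim` hook that replays recorded action commands in the simulator; the parameter layout (7 friction + 7 armature values), initial guesses, and option values are illustrative assumptions, not the authors' implementation.

```python
# Sketch: fit per-joint friction and armature by minimizing the squared
# joint-position error between real and simulated trajectory replays.
import numpy as np
import cma  # pycma: pip install cma

def rollout_sim(params, action_sequences):
    """Hypothetical hook: replay the recorded action commands in simulation
    with the given friction/armature parameters and return joint trajectories
    of shape (N_episodes, T, 7)."""
    raise NotImplementedError

def trajectory_loss(params, action_sequences, q_real):
    """Sum of squared joint-position errors between real and simulated replays."""
    q_sim = rollout_sim(params, action_sequences)  # (N, T, 7)
    return float(np.sum((np.asarray(q_real) - q_sim) ** 2))

def identify_parameters(action_sequences, q_real):
    # 14 free parameters: per-joint friction and armature of the 7-DoF arm.
    x0 = np.concatenate([np.full(7, 0.1), np.full(7, 0.01)])
    es = cma.CMAEvolutionStrategy(x0, 0.05, {"bounds": [0.0, None], "maxiter": 200})
    while not es.stop():
        candidates = es.ask()  # sample candidate parameter vectors
        losses = [trajectory_loss(np.asarray(c), action_sequences, q_real)
                  for c in candidates]
        es.tell(candidates, losses)  # update the search distribution
    return es.result.xbest  # best friction/armature estimate found
```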
4.2.2. Tiered Task Progression
Instead of a simple "Success/Failure" binary, the authors use a Tiered Progression metric. This breaks a skill down into ordered stages. For example, for the Put skill, the stages are:
- Reach: Move the gripper to the object.
- Grasp: Close the gripper on the object.
- Lift: Raise the object.
- Move Close: Move the object toward the target container.
- IsInside: Successfully place the object in the container.
Each stage is weighted equally. If a robot completes 3 out of 5 stages, its score is 0.6. This provides a much more granular signal for model evaluation than a simple 0 or 1, as sketched in the example below.
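As a concrete illustration, here is a minimal Python sketch of how such a tiered score could be computed. The stage names follow the Put rubric above; treating a failed stage as blocking credit for later stages is an assumption based on the stages being described as ordered, and the `stage_results` dict is a hypothetical output of per-stage checkers.

```python
# Minimal sketch of the tiered progression score.
PUT_STAGES = ["reach", "grasp", "lift", "move_close", "is_inside"]

def tiered_progression(stage_results, stages=PUT_STAGES):
    """Equal-weight progression: fraction of consecutive completed stages."""
    completed = 0
    for stage in stages:
        if stage_results.get(stage, False):
            completed += 1
        else:
            break  # stages are ordered; stop at the first failed stage
    return completed / len(stages)

# Example: the policy reached, grasped, and lifted the object but never placed it.
score = tiered_progression({"reach": True, "grasp": True, "lift": True,
                            "move_close": False, "is_inside": False})
print(score)  # 0.6, i.e. 3 of 5 stages completed
```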
4.2.3. Perturbation Implementation
REALM implements 15 perturbations across three categories:
- Visual (V): e.g., V-LIGHT (changing scene illumination color and intensity).
- Semantic (S): e.g., S-AFF (referencing human needs like "I am thirsty" instead of "pick up the bottle").
- Behavioral (B): e.g., B-HOBJ (changing the mass of the object, which requires different torque/control).

The following figure (Figure 3 from the original paper) illustrates several of these perturbations:
Figure 3 (from the original paper): Nine of the 15 perturbations, grouped into visual, semantic, and behavioral categories. Visual perturbations change the pixel-space observation but do not require the policy to adapt its behavior; semantic perturbations require understanding different aspects of human language; behavioral perturbations require changing how the robot moves to solve the task. Some perturbations fall into multiple categories.
5. Experimental Setup
5.1. Datasets and Tasks
The benchmark uses two primary task sets:
- REALM-base: 8 tasks involving picking, placing, and stacking common objects (blocks, bowls, bottles).
- REALM-articulated: 2 tasks involving opening and closing cabinet drawers.
The objects are drawn from a pool of over 3,500 models, ensuring the models encounter shapes they likely didn't see in their specific fine-tuning sets. The following figure (Figure 2 from the original paper) shows examples of these tasks:
Figure 2 (from the original paper): Examples of tasks in the REALM benchmark, including rotating a cup, picking up a bottle, and putting a block into a bowl. These tasks are used to evaluate the manipulation capability and generalization of vision-language-action models.
5.2. Evaluation Metrics
The paper uses three primary metrics to validate the simulation and evaluate models:
1. Pearson Correlation Coefficient ($r$):
- Conceptual Definition: Measures the linear relationship between the task progression in reality and simulation. A value of 1.0 means they move in perfect lockstep.
- Mathematical Formula: $ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}} $
- Symbol Explanation: $x_i$ is the real-world progression, $y_i$ is the simulated progression, and $\bar{x}$, $\bar{y}$ are their respective means.
2. Mean Maximum Rank Violation (MMRV):
- Conceptual Definition: Quantifies how often the simulation incorrectly "ranks" one policy as better than another compared to reality. Lower is better.
- Computation: For each policy, take the largest real-world performance gap among policy pairs whose ordering the simulation reverses, then average this quantity over all policies.
3. Root Mean Squared Deviation (RMSD):
- Conceptual Definition: Quantifies the magnitude of the effect a perturbation has on task progression.
- Mathematical Formula: $ \mathrm{RMSD}(p) = \sqrt{\frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} (r_{m, t}^{p} - r_{m, t}^{default})^{2} } $
- Symbol Explanation: $p$ is the perturbation; $M$ is the number of models; $T$ is the number of tasks; $r_{m,t}^{p}$ is the progression of model $m$ on task $t$ under perturbation $p$; $r_{m,t}^{default}$ is the baseline progression. A small NumPy sketch of these metrics is given below.
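The following is a small NumPy sketch of the three metrics, assuming 1-D arrays of task-progression scores (one entry per policy or task); the MMRV routine follows the conceptual description above and may differ in detail from the paper's exact definition.

```python
import numpy as np

def pearson_r(real, sim):
    """Pearson correlation between real and simulated progression scores."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    return float(np.corrcoef(real, sim)[0, 1])

def mmrv(real, sim):
    """Mean maximum rank violation: for each policy, the largest real-world
    gap among pairs whose ordering the simulation reverses, averaged over
    all policies. Lower is better."""
    real, sim = np.asarray(real, float), np.asarray(sim, float)
    worst = np.zeros(len(real))
    for i in range(len(real)):
        for j in range(len(real)):
            if (sim[i] < sim[j]) != (real[i] < real[j]):
                worst[i] = max(worst[i], abs(real[i] - real[j]))
    return float(worst.mean())

def rmsd(progress_perturbed, progress_default):
    """Root mean squared deviation between progression under a perturbation
    and under the default setting, averaged over models and tasks (shape (M, T))."""
    diff = np.asarray(progress_perturbed, float) - np.asarray(progress_default, float)
    return float(np.sqrt(np.mean(diff ** 2)))
```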
5.3. Baselines
The authors evaluated three SOTA VLA models:
- π_{0}: A vision-language-action flow model.
- π_{0}-FAST: An optimized version of π_{0} with efficient action tokenization.
- GR00T N1.5: A foundation model designed for generalist humanoid robots, here adapted for the Franka arm.
6. Results & Analysis
6.1. Core Results Analysis
The most striking finding is that π_{0}-FAST is currently the most robust model, but even it is highly sensitive to certain changes.
- Visual robustness: Models are somewhat robust to lighting but struggle with camera viewpoint shifts (V-VIEW).
- Semantic weakness: Despite being built on LLMs, the models often fail when instructions are phrased differently (S-INT, S-AFF). This suggests that the "robotics fine-tuning" process might be degrading the model's original language understanding.
- Behavioral challenge: Changing an object's pose (VB-POSE) or using a completely new object (VSB-NOBJ) caused the largest performance drops.
6.2. Tables of Results
The following are the results from Table I of the original paper, comparing REALM to existing benchmarks:
| Benchmark | Perturbations (V/S/B) | HV/AC/MV | Diversity (S/C/O) |
|---|---|---|---|
| GemBench [17] | 2 / 1 / 2 | X / X / √ | 7 / 1 / 50+ |
| VLABench [18] | 1 / 7 / 2 | X / X / √ | 10 / 20 / 2,000+ |
| COLOSSEUM [19] | 5 / 0 / 2 | X / X / √ | 10 / 1 / 20+ |
| SIMPLER [20] | 5 / 0 / 2 | √ / √ / X | 5 / 3 / 10+ |
| REALM (ours) | 6 / 8 / 7 | √ / √ / √ | 7 / 10 / 3,500+ |
(Note: HV=High-fidelity Visuals, AC=Aligned Control, MV=Multiple Views. S/C/O=Skills/Scenes/Objects)
The following are the results from Table III of the original paper, defining the task progression rubric:
| Set | Skill | Tiered task progression |
|---|---|---|
| Base | Put | Reach → Grasp → Lift → Move Close → IsInside |
| Pick | Reach → Grasp → Lift | |
| Stack | Reach → Grasp → Lift → Move Close → IsOnTop | |
| Push / Rotate | Reach → Touch → IsToggledOn / Reach → Grasp → Rotate 45° | |
| Articulated | Open / Close | Reach → Touch & Move → Open 50% → 75% → 95% / Closed 50% → Closed |
6.3. Simulation Validation Results
As seen in Figure 6, the simulation shows a strong correlation with real-world performance across tasks and perturbations, meaning that if a model improves in REALM, it is highly likely to improve in the real world.
Figure 6 (from the original paper): Validation of REALM's simulated versus real-world task progression. The four panels show the overall results, the default setting, the object-pose perturbation, and the camera-pose perturbation. Each point compares simulated and real performance on a task; the strong correlation (Pearson correlation coefficients with p-values below 0.001) indicates that REALM is an effective proxy for real-world performance.
7. Conclusion & Reflections
7.1. Conclusion Summary
REALM successfully demonstrates that high-fidelity simulation with carefully aligned control can serve as a rigorous, scalable, and reproducible benchmark for the next generation of robot models. The evaluation of current VLAs (π_{0}, GR00T N1.5) shows that while they have "learned" to move, their ability to generalize to new semantic instructions and physical object properties is still weak.
7.2. Limitations & Future Work
- Embodiment: Currently, REALM only supports the Franka Panda robot (DROID platform). Extending this to humanoids or other mobile manipulators is a key next step.
- Low Performance: Some models had such low baseline success rates that testing perturbations became "uninformative" (if a model fails at 0 intensity, it also fails at 10 intensity).
- Future Directions: The authors plan to add more skills and support more robot types.
7.3. Personal Insights & Critique
This paper provides a much-needed "reality check" for the VLA community. While many videos show VLAs performing impressive tasks, REALM provides a systematic way to break them.
Innovation: The use of CMA-ES for system identification to match real-world joint trajectories is a technically sound way to bridge the sim-to-real gap.
Critique: One potential issue is the reliance on IsaacSim's specific rendering; if a VLA model is trained on data from a different sensor (e.g., lower resolution or different lens distortion), the visual gap might reappear. However, the "Attention Map" similarity study (Figure 7) showing 0.85 similarity between real and sim attention is a very clever way to prove that the model "sees" the same important features in both worlds.