REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

Published: 12/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

REALM introduces a high-fidelity simulation environment for evaluating the generalization of Vision-Language-Action models, featuring 15 perturbation factors and over 3,500 objects. Validation shows strong alignment between simulated and real-world performance, highlighting that generalization and robustness remain open challenges.

Abstract

Vision-Language-Action (VLA) models empower robots to understand and execute tasks described by natural language instructions. However, a key challenge lies in their ability to generalize beyond the specific environments and conditions they were trained on, which is presently difficult and expensive to evaluate in the real-world. To address this gap, we present REALM, a new simulation environment and benchmark designed to evaluate the generalization capabilities of VLA models, with a specific emphasis on establishing a strong correlation between simulated and real-world performance through high-fidelity visuals and aligned robot control. Our environment offers a suite of 15 perturbation factors, 7 manipulation skills, and more than 3,500 objects. Finally, we establish two task sets that form our benchmark and evaluate the π_{0}, π_{0}-FAST, and GR00T N1.5 VLA models, showing that generalization and robustness remain an open challenge. More broadly, we also show that simulation gives us a valuable proxy for the real-world and allows us to systematically probe for and quantify the weaknesses and failure modes of VLAs. Project page: https://martin-sedlacek.com/realm

In-depth Reading

1. Bibliographic Information

1.1. Title

REALM: A Real-to-Sim Validated Benchmark for Generalization in Robotic Manipulation

1.2. Authors

Martin Sedlacek, Pavlo Yefanov, Georgy Ponimatkin, Jai Bardhan, Simon Pilc, Mederic Fourmy, Evangelos Kazakos, Cees G. M. Snoek, Josef Sivic, and Vladimir Petrik. The authors are affiliated with the Czech Institute of Informatics, Robotics and Cybernetics at the Czech Technical University in Prague and the University of Amsterdam.

1.3. Journal/Conference

The paper was published on arXiv on December 22, 2025. While the specific conference is not explicitly named in the header, the reference style and content suggest it targets top-tier robotics venues such as ICRA (International Conference on Robotics and Automation) or CoRL (Conference on Robot Learning).

1.4. Publication Year

2025 (the metadata lists a publication timestamp of 2025-12-22T16:44:23 UTC).

1.5. Abstract

Vision-Language-Action (VLA) models allow robots to perform tasks from natural language instructions. However, evaluating how these models generalize to new environments is expensive and difficult in the real world. This paper introduces REALM, a simulation environment and benchmark specifically designed to evaluate VLA generalization. REALM features high-fidelity visuals and aligned robot control to ensure a strong correlation between simulated and real-world performance. The benchmark includes 15 perturbation factors, 7 manipulation skills, and over 3,500 objects. The authors evaluate state-of-the-art models ($\pi_0$, $\pi_0$-FAST, and GR00T N1.5), demonstrating that generalization remains an open challenge and that simulation can serve as a reliable proxy for real-world testing.

2. Executive Summary

2.1. Background & Motivation

The field of robotics is currently moving toward Vision-Language-Action (VLA) models—large neural networks trained on vast amounts of data to translate visual inputs and text instructions directly into robot movements (actions). While these models show great promise, testing them is a bottleneck.

  • Real-world testing is slow, expensive, and nearly impossible to reproduce exactly across different labs.

  • Simulation testing is fast and scalable, but often suffers from the "sim-to-real gap": a policy that works in a simplified simulation might fail in the real world because the visuals look different or the physics/control of the robot doesn't match reality.

    The core problem REALM aims to solve is the lack of a validated simulation benchmark that specifically probes the generalization capabilities of VLA models—how well they handle changes in lighting, object types, and instruction phrasing.

2.2. Main Contributions / Findings

  1. REALM Environment: A high-fidelity simulation environment (built on NVIDIA IsaacSim) featuring 15 types of perturbations (visual, semantic, and behavioral) and a library of 3,500+ objects.
  2. Real-to-Sim Validation: The authors conducted nearly 800 parallel rollouts in both the real world and simulation. They found a high Pearson correlation ($r$) and a low Mean Maximum Rank Violation (MMRV), indicating that REALM performance accurately predicts real-world performance.
  3. Benchmark Evaluation: Testing leading models like $\pi_0$ and GR00T revealed that even the most advanced VLAs struggle with behavioral generalization (e.g., handling new object shapes) and certain semantic nuances, despite being pre-trained on massive datasets.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Vision-Language-Action (VLA) Models: These are robotic "foundation models." They take a camera image and a text string (e.g., "pick up the red mug") and output control signals (e.g., joint velocities). They are often built by fine-tuning Large Language Models (LLMs) or Vision-Language Models (VLMs).
  • Generalization: The ability of a model to perform correctly on data it hasn't seen during training. In robotics, this means working in a new kitchen, with a different mug, or under different lighting.
  • Perturbations: Systematic changes made to an environment to test robustness.
    • Visual: Changing lighting or camera angles.
    • Semantic: Changing how the instruction is worded.
    • Behavioral: Changing physical properties like the mass or shape of the object.
  • Sim-to-Real Gap: The discrepancy between a simulated environment and the physical world. If the gap is too large, simulation results become meaningless for real-world deployment.

3.2. Previous Works

The paper builds on several key developments:

  • DROID [45]: A massive real-world dataset for robot manipulation. REALM uses the same robot embodiment (Franka Panda) and many tasks derived from DROID to ensure compatibility.
  • SIMPLER [20]: A previous benchmark that attempted to align simulation with real-world policies. REALM expands on this by offering more viewpoints, a much larger object set, and more complex perturbations.
  • $\pi_0$ [9]: A state-of-the-art VLA model using "flow matching" (a generative technique similar to diffusion) to predict robot actions.

3.3. Technological Evolution

Early robotics relied on hand-coded task logic. This moved to Imitation Learning (IL), where robots mimic humans. The current era uses Large Behavior Models (LBMs) and VLAs, which leverage Internet-scale data. REALM fits into this timeline as the "testing ground" for these new, data-hungry models.

4. Methodology

4.1. Principles

The core intuition of REALM is that for a simulation to be a valid proxy for reality, it must satisfy two conditions:

  1. Visual Fidelity: The rendered images must be "realistic" enough that a model trained on real images doesn't perceive a massive "distribution shift."
  2. Control Alignment: When the model says "move 5cm left," the simulated robot must move exactly as the real physical robot would, accounting for motor latency, friction, and inertia.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. System Identification (Control Alignment)

To minimize the control gap, the authors re-implemented the controller used in the DROID platform within the simulation. They identified 14 free parameters that represent physical characteristics of the robot joints. Specifically, they focused on:

  • $\theta_{friction}$: Mechanical resistance to motion in a joint.

  • $\theta_{armature}$: The reflected inertia of the joint motors (crucial for stability).

    To optimize these, they collected $N$ real-world trajectories $\mathcal{D} = \{ \{ (\pmb{q}_{i,t}^{real}, \pmb{q}_{i,t}^{sim}) ; \pmb{q}_{i,t} \in \mathbb{R}^{7} \}_{t=1}^{T} \}_{i=1}^{N}$, where $t$ is the timestep, $i$ is the episode index, and $\pmb{q}$ is the 7D joint angle vector. The following loss function was minimized to align the simulation to the real robot:

$ \mathcal{L}(\theta_{friction}, \theta_{armature}) = \sum_{i=1}^{N} \sum_{t=1}^{T} \| \pmb{q}_{i,t}^{real} - \pmb{q}_{i,t}^{sim} \|_{2}^{2} $

  • $\pmb{q}_{i,t}^{real}$: The actual joint positions recorded from the physical robot.

  • $\pmb{q}_{i,t}^{sim}$: The joint positions resulting from the same action commands in simulation.

  • $\|\cdot\|_{2}^{2}$: The squared L2 norm (Euclidean distance), measuring the error between the two trajectories.

    The authors used the CMA-ES (Covariance Matrix Adaptation Evolution Strategy) algorithm, a derivative-free optimization method, to find the parameters that minimize this loss.
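
Below is a minimal sketch of what such a system-identification loop could look like, assuming the `cma` Python package and a hypothetical `rollout_sim` helper that replays the recorded action commands in simulation with candidate friction and armature values; the function names and data layout are illustrative, not REALM's actual code.

```python
import numpy as np
import cma  # pip install cma

# real_traj: (N, T, 7) joint positions recorded on the physical robot
# actions:   (N, T, 7) action commands that produced those trajectories
# rollout_sim(actions_i, friction, armature) -> (T, 7) simulated joint positions
# (rollout_sim is a hypothetical simulator wrapper, not a real API)

def trajectory_loss(params, real_traj, actions, rollout_sim):
    """Squared L2 error between real and simulated joint trajectories."""
    friction, armature = params[:7], params[7:]      # 14 free parameters total
    loss = 0.0
    for q_real, a in zip(real_traj, actions):
        q_sim = rollout_sim(a, friction, armature)   # replay same commands in sim
        loss += np.sum((q_real - q_sim) ** 2)
    return loss

def identify_parameters(real_traj, actions, rollout_sim):
    x0 = np.full(14, 0.1)                            # illustrative initial guess
    es = cma.CMAEvolutionStrategy(x0, 0.05)          # derivative-free CMA-ES search
    es.optimize(lambda p: trajectory_loss(p, real_traj, actions, rollout_sim))
    return es.result.xbest                           # parameters minimizing the error
```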

The following figure (Figure 4 from the original paper) visualizes the result of this alignment:

Fig. 4: Visualization of the aligned control. We show a trajectory replay in simulation using the default controller (left) and our aligned control (right). Yellow trajectory represents the ground truth from a real robot and blue is from simulation. Our system identification results in significantly more realistic trajectory following.

4.2.2. Tiered Task Progression

Instead of a simple "Success/Failure" binary, the authors use a Tiered Progression metric. This breaks a skill down into ordered stages. For example, for the Put skill, the stages are:

  1. Reach: Move the gripper to the object.

  2. Grasp: Close the gripper on the object.

  3. Lift: Raise the object.

  4. Move Close: Move the object toward the target container.

  5. IsInside: Successfully place the object in the container.

    Each stage is weighted equally. If a robot completes 3 out of 5 stages, its score is 0.6. This provides a much more granular signal for model evaluation than a simple 0 or 1.
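
As a minimal sketch of this scoring scheme (assuming progress stops at the first failed stage, since each stage presupposes the previous one), the tiered score for the Put skill could be computed as follows; the stage names mirror the rubric above and the helper is illustrative:

```python
# Stages for the "Put" skill, in order, following the rubric described above.
PUT_STAGES = ["reach", "grasp", "lift", "move_close", "is_inside"]

def tiered_progression(stage_results: dict, stages=PUT_STAGES) -> float:
    """Equally weighted score: fraction of consecutive stages completed.

    stage_results maps stage name -> bool. Counting stops at the first
    failed stage, since later stages presuppose the earlier ones.
    """
    completed = 0
    for stage in stages:
        if stage_results.get(stage, False):
            completed += 1
        else:
            break
    return completed / len(stages)

# Example: reached, grasped, and lifted, but never moved close to the container.
score = tiered_progression({"reach": True, "grasp": True, "lift": True})
print(score)  # 0.6
```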

4.2.3. Perturbation Implementation

REALM implements 15 perturbations across three categories:

  • Visual (V): e.g., V-LIGHT (changing scene illumination color and intensity).

  • Semantic (S): e.g., S-AFF (referencing human needs like "I am thirsty" instead of "pick up the bottle").

  • Behavioral (B): e.g., B-HOBJ (changing the mass of the object, which requires different torque/control).

    The following figure (Figure 3 from the original paper) illustrates several of these perturbations:

    Fig. 3: Visualization of 9 out of 15 perturbations from three categories. Visual perturbations alter the observations in pixel space, but do not require the policy to adjust its behavior or understand a different phrasing of an instruction. Semantic perturbations require understanding different aspects of human language that can be present in robot commands. Behavioral perturbations require changes to robot movement when solving a task compared to the default setting. Some perturbations can encompass multiple categories - e.g. changing the manipulated object (bottom middle) is both visual and behavioral.
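
As an illustration of how such a perturbation catalogue could be organized in code (a hypothetical data structure, not REALM's actual API), each perturbation can carry a name and one or more category flags, reflecting that some perturbations span multiple categories:

```python
from dataclasses import dataclass
from enum import Flag, auto

class Category(Flag):
    """Perturbation categories; a perturbation may belong to several at once."""
    VISUAL = auto()
    SEMANTIC = auto()
    BEHAVIORAL = auto()

@dataclass
class Perturbation:
    name: str
    categories: Category
    description: str

# A few of the 15 REALM perturbations, tagged by category (descriptions paraphrased).
PERTURBATIONS = [
    Perturbation("V-LIGHT", Category.VISUAL,
                 "change scene illumination color and intensity"),
    Perturbation("S-AFF", Category.SEMANTIC,
                 "reference a human need ('I am thirsty') instead of the object"),
    Perturbation("B-HOBJ", Category.BEHAVIORAL,
                 "change the mass of the manipulated object"),
    Perturbation("VSB-NOBJ",
                 Category.VISUAL | Category.SEMANTIC | Category.BEHAVIORAL,
                 "swap in a completely new object, affecting all three categories"),
]
```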

5. Experimental Setup

5.1. Datasets and Tasks

The benchmark uses two primary task sets:

  1. REALM-base: 8 tasks involving picking, placing, and stacking common objects (blocks, bowls, bottles).

  2. REALM-articulated: 2 tasks involving opening and closing cabinet drawers.

    The objects are drawn from a pool of over 3,500 models, ensuring the models encounter shapes they likely didn't see in their specific fine-tuning sets. The following figure (Figure 2 from the original paper) shows examples of these tasks:

Fig. 2: Visualization of tasks from REALM-base and REALM-articulated. The full benchmark consists of 8 base tasks (6 shown) and 2 articulated tasks (both shown), which are used for the large-scale evaluation in Section V.

5.2. Evaluation Metrics

The paper uses three primary metrics to validate the simulation and evaluate models:

1. Pearson Correlation Coefficient ($r$):

  • Conceptual Definition: Measures the linear relationship between the task progression in reality and simulation. A value of 1.0 means they move in perfect lockstep.
  • Mathematical Formula: $ r = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}} $
  • Symbol Explanation: $x_i$ is the real-world progression, $y_i$ is the simulated progression, and $\bar{x}, \bar{y}$ are their respective means.

2. Mean Maximum Rank Violation (MMRV):

  • Conceptual Definition: Quantifies how often the simulation incorrectly "ranks" one policy as better than another compared to reality. Lower is better.
  • Fallback Formula: $ \mathrm{MMRV} = \frac{1}{|P|} \sum_{(i,j) \in P} \max(0, \text{score}_{real}(j) - \text{score}_{real}(i)) $, where $P$ is the set of all pairs for which $\text{score}_{sim}(i) > \text{score}_{sim}(j)$.

3. Root Mean Squared Deviation (RMSD):

  • Conceptual Definition: Quantifies the magnitude of the effect a perturbation has on task progression.
  • Mathematical Formula: $ \mathrm{RMSD}(p) = \sqrt{\frac{1}{MT} \sum_{m=1}^{M} \sum_{t=1}^{T} (r_{m, t}^{p} - r_{m, t}^{default})^{2} } $
  • Symbol Explanation: $p$ is the perturbation; $M$ is the number of models; $T$ is the number of tasks; $r_{m,t}^{p}$ is the progression under the perturbation; $r_{m,t}^{default}$ is the baseline progression.
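
A minimal sketch of these three metrics is given below, assuming per-policy (or per-setting) progression scores stored in NumPy arrays. The MMRV implementation follows the simplified fallback formula above, which may differ from the paper's exact definition, and the numbers in the usage example are made up for illustration.

```python
import numpy as np
from itertools import combinations
from scipy.stats import pearsonr

def mmrv(sim_scores, real_scores):
    """Mean Maximum Rank Violation (fallback formula): for every pair, oriented
    so the simulation ranks i at least as high as j, accumulate how much the
    real-world scores violate that ranking. Lower is better."""
    violations = []
    for i, j in combinations(range(len(sim_scores)), 2):
        if sim_scores[i] < sim_scores[j]:   # orient pair by the sim ranking
            i, j = j, i
        violations.append(max(0.0, real_scores[j] - real_scores[i]))
    return float(np.mean(violations))

def rmsd(progression_perturbed, progression_default):
    """Root mean squared deviation of task progression caused by a perturbation.
    Both inputs are (M models x T tasks) arrays."""
    diff = np.asarray(progression_perturbed) - np.asarray(progression_default)
    return float(np.sqrt(np.mean(diff ** 2)))

# Usage example with made-up numbers: real vs. simulated progression per setting.
real = np.array([0.62, 0.48, 0.30, 0.75])
sim  = np.array([0.58, 0.50, 0.28, 0.80])
r, p = pearsonr(real, sim)                  # linear real-to-sim correlation
print(r, p, mmrv(sim, real))
```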

5.3. Baselines

The authors evaluated three SOTA VLA models:

  1. $\pi_0$: A vision-language-action flow model.
  2. $\pi_0$-FAST: An optimized version of $\pi_0$ with efficient action tokenization.
  3. GR00T N1.5: A foundation model designed for generalist humanoid robots, here adapted for the Franka arm.

6. Results & Analysis

6.1. Core Results Analysis

The most striking finding is that $\pi_0$-FAST is currently the most robust model, but even it is highly sensitive to certain changes.

  • Visual robustness: Models are somewhat robust to lighting but struggle with camera viewpoint shifts (V-VIEW).
  • Semantic weakness: Despite being built on LLMs, the models often fail when instructions are phrased differently (S-INT, S-AFF). This suggests that the "robotics fine-tuning" process might be degrading the model's original language understanding.
  • Behavioral challenge: Changing an object's pose (VB-POSE) or using a completely new object (VSB-NOBJ) caused the largest performance drops.

6.2. Tables of Results

The following are the results from Table I of the original paper, comparing REALM to existing benchmarks:

| Benchmark | Perturbations (V/S/B) | HV / AC / MV | Diversity (S/C/O) |
|---|---|---|---|
| GemBench [17] | 2 / 1 / 2 | X / X / √ | 7 / 1 / 50+ |
| VLABench [18] | 1 / 7 / 2 | X / X / √ | 10 / 20 / 2,000+ |
| COLOSSEUM [19] | 5 / 0 / 2 | X / X / √ | 10 / 1 / 20+ |
| SIMPLER [20] | 5 / 0 / 2 | √ / √ / X | 5 / 3 / 10+ |
| REALM (ours) | 6 / 8 / 7 | √ / √ / √ | 7 / 10 / 3,500+ |

(Note: HV=High-fidelity Visuals, AC=Aligned Control, MV=Multiple Views. S/C/O=Skills/Scenes/Objects)

The following are the results from Table III of the original paper, defining the task progression rubric:

| Set | Skill | Tiered task progression |
|---|---|---|
| Base | Put | Reach → Grasp → Lift → Move Close → IsInside |
| Base | Pick | Reach → Grasp → Lift |
| Base | Stack | Reach → Grasp → Lift → Move Close → IsOnTop |
| Base | Push | Reach → Touch → IsToggledOn |
| Base | Rotate | Reach → Grasp → Rotate 45° |
| Articulated | Open / Close | Reach → Touch & Move → Open 50% → 75% → 95% / Closed 50% → Closed |

6.3. Simulation Validation Results

As seen in Figure 6, the simulation shows a strong correlation with real-world performance ($r > 0.8$ in many cases), meaning that if a model improves in REALM, it is highly likely to improve in the real world.

Fig. 6: Sim-to-real validation of REALM. Task progression is shown in the real world (x-axis) and simulation (y-axis). The plots show a strong Pearson correlation ($r$), with data points close to the identity line (gray dashed) and a low Mean Maximum Rank Violation, both overall and under individual perturbations (default, object-pose, and camera-pose settings are shown). A p-value of $p < 0.001$ for the correlation between real and simulated rollouts in all settings indicates that REALM is a strong proxy for real-world performance.

7. Conclusion & Reflections

7.1. Conclusion Summary

REALM successfully demonstrates that high-fidelity simulation with carefully aligned control can serve as a rigorous, scalable, and reproducible benchmark for the next generation of robot models. The evaluation of current VLAs ($\pi_0$, GR00T) shows that while they have "learned" to move, their ability to generalize to new semantic instructions and physical object properties is still weak.

7.2. Limitations & Future Work

  • Embodiment: Currently, REALM only supports the Franka Panda robot (DROID platform). Extending this to humanoids or other mobile manipulators is a key next step.
  • Low Performance: Some models had such low baseline success rates that testing perturbations became "uninformative" (if it fails at 0 intensity, it also fails at 10 intensity).
  • Future Directions: The authors plan to add more skills and support more robot types.

7.3. Personal Insights & Critique

This paper provides a much-needed "reality check" for the VLA community. While many videos show VLAs performing impressive tasks, REALM provides a systematic way to break them.

  • Innovation: The use of CMA-ES for system identification to match real-world joint trajectories is a technically sound way to bridge the sim-to-real gap.

  • Critique: One potential issue is the reliance on IsaacSim's specific rendering; if a VLA model is trained on data from a different sensor (e.g., lower resolution or different lens distortion), the visual gap might reappear. However, the attention-map similarity study (Figure 7), showing 0.85 similarity between real and simulated attention, is a clever way to demonstrate that the model "sees" the same important features in both worlds.
