Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
TL;DR Summary
The Physical Autoregressive Model (PAR) addresses robotic data scarcity by leveraging world knowledge from video pretraining, eliminating action pretraining. It models combined frame-action "physical tokens" continuously, achieving 100% success on PushCube and matching action-pretrained baselines on the other tasks.
Abstract
The scarcity of manipulation data has motivated the use of pretrained large models from other modalities in robotics. In this work, we build upon autoregressive video generation models to propose a Physical Autoregressive Model (PAR), where physical tokens combine frames and actions to represent the joint evolution of the robot and its environment. PAR leverages the world knowledge embedded in video pretraining to understand physical dynamics without requiring action pretraining, enabling accurate video prediction and consistent action trajectories. It also adopts a DiT-based de-tokenizer to model frames and actions as continuous tokens, mitigating quantization errors and facilitating mutual enhancement. Furthermore, we incorporate a causal mask with inverse kinematics, parallel training, and the KV-cache mechanism to further improve performance and efficiency. Experiments on the ManiSkill benchmark show that PAR achieves a 100% success rate on the PushCube task, matches the performance of action-pretrained baselines on other tasks, and accurately predicts future videos with tightly aligned action trajectories. These findings underscore a promising direction for robotic manipulation by transferring world knowledge from autoregressive video pretraining. The project page is here: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/
In-depth Reading
English Analysis
1. Bibliographic Information
- Title: Physical Autoregressive Model for Robotic Manipulation without Action Pretraining
- Authors: Zijian Song, Sihan Qin, Tianshui Chen, Liang Lin, Guangrun Wang.
- Affiliations: The authors are affiliated with Sun Yat-sen University, Guangdong Key Laboratory of Big Data Analysis and Processing, x-Era AI Lab, and Guangdong University of Technology. These institutions are prominent in computer science and AI research in China.
- Journal/Conference: The paper is available on arXiv, a preprint server for academic articles, which means it has not yet undergone formal peer review for publication in a specific journal or conference. The arXiv ID 2508.09822 follows the standard YYMM.NNNNN format and corresponds to an August 2025 submission.
- Publication Year: 2025. The citations within the paper reference works from 2024 and 2025, consistent with the August 2025 arXiv submission.
- Abstract: The paper addresses the scarcity of robotic manipulation data by proposing a Physical Autoregressive Model (PAR). Instead of relying on action pretraining, PAR transfers "world knowledge" from pretrained autoregressive video generation models. It represents the robot-environment interaction using "physical tokens" that combine visual frames and actions. To improve accuracy, it models these tokens as continuous signals using a Diffusion Transformer (DiT)-based de-tokenizer, which mitigates quantization errors. The model is further enhanced with a specialized causal mask for implicit inverse kinematics, parallel training, and KV-caching for efficiency. On the ManiSkill benchmark, PAR achieves a 100% success rate on the PushCube task and performs competitively with action-pretrained models on other tasks, demonstrating accurate video prediction and coherent action generation.
- Original Source Link:
- ArXiv Link: https://arxiv.org/pdf/2508.09822?
- PDF Link: http://arxiv.org/pdf/2508.09822v4
- Project Page: https://hcplab-sysu.github.io/PhysicalAutoregressiveModel/
2. Executive Summary
- Background & Motivation (Why):
- Core Problem: Building generalist robots is hampered by the high cost and difficulty of collecting large-scale human demonstration data for training action policies. This data scarcity makes "action pretraining" on par with language or vision pretraining impractical.
- Existing Gaps: A common strategy is to transfer knowledge from models pretrained in other modalities, such as Large Language Models (LLMs), to create Vision-Language-Action (VLA) models. However, there is a fundamental modality gap between abstract, symbolic language and the continuous, physical nature of robot actions, often leading to poor alignment and control.
- Fresh Angle: This paper proposes a more natural knowledge transfer source: pretrained video generation models. The core intuition is that these models, trained to predict future frames, inherently learn a rich, implicit model of physical dynamics—a "world model." By building on this foundation, a robot policy can learn manipulation skills without needing any prior training on action data.
- Main Contributions / Findings (What):
- Physical Autoregressive Model (PAR): The authors introduce a novel framework that treats robotic manipulation as a joint video-and-action prediction problem. It combines visual frames and action commands into unified "physical tokens" and models their evolution autoregressively, step-by-step.
- Zero Action Pretraining: PAR's primary innovation is its ability to leverage the physical understanding embedded in a pretrained video model (NOVA). This allows it to achieve strong performance without any pretraining on large-scale robotics datasets, directly addressing the data scarcity problem.
- Continuous Token Representation: Unlike many models that discretize continuous signals like actions (which introduces errors), PAR uses a Diffusion Transformer (DiT) as a "de-tokenizer." This allows it to model frames and actions as continuous distributions, preserving signal fidelity and enabling smoother, more accurate trajectory generation.
- Architectural Enhancements: The model incorporates a unique causal attention mask that allows action prediction to be conditioned on the current visual state, functioning as an implicit form of inverse kinematics. It also uses standard efficiency techniques like parallel training and KV-caching.
- Strong Empirical Performance: Experiments on the ManiSkill benchmark show PAR achieves a perfect 100% success rate on the PushCube task and performs on par with state-of-the-art models that do require extensive action pretraining.
3. Prerequisite Knowledge & Related Work
- Foundational Concepts:
- Autoregressive Models: These are generative models that build a sequence of data one step at a time. At each step, the model predicts the next element based on all the elements generated so far. This is the core principle behind models like GPT in language, where the next word is predicted based on the preceding text. PAR applies this concept to a sequence of "physical tokens" (frame + action).
- Diffusion Models: A class of generative models that learn to create data by reversing a noise-adding process. They start with pure noise and iteratively refine it into a coherent sample (like an image or, in this case, an action vector). They are known for generating high-quality samples and can model complex, arbitrary data distributions.
- Diffusion Loss: This is the training objective for a diffusion model. Instead of using a complex objective, it simplifies to a mean squared error loss between the true noise added to a sample and the noise predicted by the model. The paper uses this loss to train its "de-tokenizer" to predict continuous action and frame tokens.
- Transformer Architecture: A neural network architecture based on the self-attention mechanism. It excels at processing sequential data by weighing the importance of different elements in the input sequence when producing an output. This is the backbone of PAR, processing the sequence of physical tokens.
- Inverse Kinematics (IK): A fundamental concept in robotics. It is the process of calculating the required joint angles and positions of a robot arm to place its end-effector (e.g., a gripper) at a specific target position and orientation in space. PAR achieves this implicitly through its causal masking strategy.
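To make the Diffusion Loss concept above concrete, here is a minimal PyTorch sketch of the noise-prediction objective. It is an illustration only, not the paper's code: the tiny `eps_net` network, the token and condition dimensions, and the linear noise schedule are all assumptions.

```python
import torch
import torch.nn as nn

# Minimal sketch of a denoising (diffusion) loss on continuous tokens.
# The network learns to predict the noise added to a clean token x,
# conditioned on a context vector z (a toy MLP stands in for a DiT here).
token_dim, cond_dim, num_steps = 7, 16, 1000
eps_net = nn.Sequential(nn.Linear(token_dim + cond_dim + 1, 64), nn.SiLU(),
                        nn.Linear(64, token_dim))

def diffusion_loss(x, z):
    t = torch.randint(0, num_steps, (x.shape[0], 1))             # random timestep per sample
    alpha_bar = 1.0 - t.float() / num_steps                      # toy noise schedule (assumption)
    eps = torch.randn_like(x)                                    # Gaussian noise
    x_t = alpha_bar.sqrt() * x + (1.0 - alpha_bar).sqrt() * eps  # corrupt x to noise level t
    eps_pred = eps_net(torch.cat([x_t, z, t.float() / num_steps], dim=-1))
    return ((eps - eps_pred) ** 2).mean()                        # MSE between true and predicted noise

x = torch.randn(8, token_dim)   # e.g., a batch of ground-truth action tokens
z = torch.randn(8, cond_dim)    # conditioning vectors (in PAR, from the causal Transformer)
diffusion_loss(x, z).backward()
```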
- Previous Works & Technological Evolution:
- Vision-Language-Action (VLA) Models: The paper positions itself as an alternative to the dominant VLA trend. VLAs (e.g., RT-2, OpenVLA) append an action decoder to a pretrained Large Language Model (LLM). While powerful, their reliance on text can create a disconnect with the physical world. PAR argues that video is a more grounded modality for this transfer.
- Video-Action Joint Prediction: Prior works like UVA and VPP have explored jointly learning from video and action data. However, PAR differentiates itself with its autoregressive formulation. This allows it to reason over a variable-length history and generate a step-by-step plan, whereas other models might make a single prediction based on a fixed-length context.
- Continuous Signal Tokenization: Discretizing actions into a fixed vocabulary (like words) is a common technique but can cause "quantization error," leading to jerky or imprecise movements. To avoid this, some models use simple MLPs to decode actions, but these are often deterministic and struggle with generative modeling. PAR builds on recent work that uses diffusion/denoising to model continuous signals as distributions, applying it for the first time in a joint vision-action autoregressive framework.
4. Methodology (Core Technology & Implementation)
The core of the paper is the Physical Autoregressive Model (PAR), which formulates robotic control as a sequence generation task over combined vision and action tokens.
The figure is a schematic diagram of the workflow of the proposed Physical Autoregressive Model (PAR). From top to bottom it shows the environment evolution, the robot executing actions and updating the environment (the execute and update processes), and the encoding and decoding of images (Image) and the robot's proprioceptive state (Proprio) into physical tokens (Physical Token). Through an autoregression mechanism over the sequence of physical tokens, the model learns the joint evolution of the environment and the actions.
Figure 1 provides a high-level illustration of the PAR concept. The model operates in an autoregressive loop (bottom red arrow) that runs in sync with the environment's evolution (top blue arrow). At each step, the model predicts a Physical Token (red square). This token is decoded into a predicted image and an action. The action is executed in the environment, which updates its state and provides a new observation (image and proprioception). This new observation is encoded and added to the context for the next prediction step.
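The loop in Figure 1 can be summarized in code. The sketch below is hypothetical: `DummyEnv`, `DummyPAR`, and all method names are placeholders for the interfaces implied by the figure (encode the observation, predict the next physical token, decode it, execute the action), not the authors' actual API.

```python
import torch

class DummyEnv:
    """Stand-in environment that returns random RGB observations (illustrative only)."""
    def reset(self):
        return torch.rand(3, 224, 224)
    def step(self, action_chunk):
        return torch.rand(3, 224, 224), False   # next observation, done flag

class DummyPAR:
    """Stand-in model exposing the interfaces implied by Figure 1 (not the authors' API)."""
    def encode_text(self, instruction):
        return torch.randn(1, 16)
    def encode_observation(self, obs):
        return torch.randn(1, 16)
    def predict_next(self, context):
        z = torch.randn(1, 16)                   # embedding of the next physical token
        frame_pred = torch.rand(3, 224, 224)     # de-tokenized prediction of the next frame
        action_chunk = torch.randn(8, 7)         # de-tokenized chunk of low-level actions
        return z, frame_pred, action_chunk

def rollout(env, model, instruction, max_steps=50):
    context = [model.encode_text(instruction)]        # the text instruction conditions the rollout
    obs = env.reset()
    context.append(model.encode_observation(obs))     # O_0, paired with a Begin-Of-Action token
    for _ in range(max_steps):
        _, frame_pred, action_chunk = model.predict_next(context)
        obs, done = env.step(action_chunk)             # execute; the environment evolves
        context.append(model.encode_observation(obs))  # the new observation extends the context
        if done:
            break

rollout(DummyEnv(), DummyPAR(), "Push the cube to the goal", max_steps=3)
```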
- Principles: The central idea is to treat the state of the world (video frame) and the robot's intervention (action) as an inseparable unit, a Physical Token. By predicting a sequence of these tokens autoregressively, the model learns the coupled dynamics of perception and action. This process is kickstarted by inheriting a strong prior about physical world dynamics from a pretrained video generation model.
- Steps & Procedures:
The figure is a model architecture diagram showing the overall architecture of the Physical Autoregressive Model (PAR). From left to right, text embeddings, frame embeddings, and action embeddings are fed in; a causal Transformer processes the physical tokens and, building on video-pretrained weights, frame diffusion and action diffusion respectively predict the video frames and action sequences during manipulation. The inset on the right shows the details of the Transformer block, including pointwise feed-forward, multi-head cross-attention, and multi-head self-attention, reflecting the model's stacked layers and attention mechanisms.

Figure 2 shows the detailed model architecture. The overall pipeline is as follows:
- Input Tokenization:
- Text Instruction: A language instruction (e.g., "Push the cube to the goal") is encoded by a frozen text encoder (Phi-2) into text tokens.
- Visual Frames: Each video frame is encoded by a frozen 3D Variational Autoencoder (VAE) into a latent representation, which is then projected and flattened into a sequence of frame tokens.
- Actions: An action "chunk" (a sequence of L low-level actions) is encoded by a simple MLP into a sequence of action tokens.
- Physical Autoregression:
- A Physical Token P_n is formed by concatenating the tokens for the n-th frame O_n and the n-th action chunk A_n: P_n = [O_n; A_n]. The very first observation O_0 is paired with a special Begin-Of-Action (BOA) token.
- A causal Transformer processes the sequence of text tokens followed by the history of physical tokens (P_0, ..., P_{n-1}) to predict an embedding Z_n for the next physical token P_n. The autoregressive process is defined by the factorization
  $$p(P_1, \dots, P_N \mid T, P_0) = \prod_{n=1}^{N} p(P_n \mid T, P_0, \dots, P_{n-1}),$$
  where T denotes the text tokens. This formula states that the probability of the entire trajectory is the product of the conditional probabilities of each physical token given all preceding tokens.
- De-Tokenization (Decoding):
- The predicted embedding Z_n does not directly map to an output. Instead, it serves as a conditional input to two separate diffusion-based de-tokenizers: one for the frame and one for the action.
- These de-tokenizers are trained with Diffusion Loss. For a ground-truth token x and its conditional embedding z, the loss is
  $$\mathcal{L}(x, z) = \mathbb{E}_{\varepsilon, t}\left[\left\lVert \varepsilon - \varepsilon_\theta(x_t \mid t, z) \right\rVert^2\right].$$
  - Symbol Explanation:
    - $x$: The ground-truth continuous token (e.g., an action vector).
    - $z$: The conditional context vector from the Transformer.
    - $t$: A random timestep from the diffusion process.
    - $\varepsilon$: A random noise vector sampled from a standard normal distribution.
    - $x_t$: The ground-truth token x corrupted with noise at timestep t.
    - $\varepsilon_\theta$: The neural network (the DiT-based de-tokenizer) that predicts the original noise from the noisy token $x_t$, conditioned on t and z.
- At inference, the de-tokenizer starts with pure noise and iteratively denoises it over several steps, conditioned on Z_n, to generate the final frame and action tokens. The action decoder is a lightweight DiT model named Action-DiT.
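As an illustration of this iterative denoising, the following is a minimal DDPM-style conditional sampling loop. It is a generic sketch, not the actual Action-DiT: the toy MLP denoiser, the dimensions, and the beta schedule are assumptions.

```python
import torch
import torch.nn as nn

# Toy conditional denoiser standing in for the action de-tokenizer (illustrative only).
action_dim, cond_dim, num_steps = 7, 16, 100
denoiser = nn.Sequential(nn.Linear(action_dim + cond_dim + 1, 64), nn.SiLU(),
                         nn.Linear(64, action_dim))
betas = torch.linspace(1e-4, 0.02, num_steps)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample_action(z):
    """Start from pure noise and iteratively denoise, conditioned on the embedding z."""
    x = torch.randn(1, action_dim)
    for t in reversed(range(num_steps)):
        t_in = torch.full((1, 1), t / num_steps)
        eps_pred = denoiser(torch.cat([x, z, t_in], dim=-1))
        # DDPM posterior mean; fresh noise is added at every step except the last one.
        x = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)
    return x

z = torch.randn(1, cond_dim)   # conditional embedding Z_n from the causal Transformer
action_tokens = sample_action(z)
```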
- Key Details:
- Causal Mask:
The figure is a chart showing the causal mask matrix over text, image frames (Frame), and actions (Action). Dark squares indicate content the model is allowed to attend to, forming a stepwise triangular structure: information at the current timestep can only depend on text, frames, and actions from the current or earlier timesteps, reflecting the causal ordering and restricted information flow. Labels include Frame 0, BOA, Frame 1, Action 1, Frame 2, Action 2, highlighting the temporal dependencies in the physical autoregressive model.
Figure 3 visualizes the custom attention mask used in the Transformer. This mask is crucial for controlling information flow.
- Dark squares indicate that the token in the row can attend to the token in the column.
- Temporal Causality: The overall upper-triangular structure ensures that predictions for a given timestep n can only depend on information from past timesteps (0 to n-1).
- Implicit Inverse Kinematics: A key detail is that Action 1 tokens can attend to Frame 1 tokens. Since the model is trained to predict Frame 1 and Action 1 together, the representation of Frame 1 implicitly contains information about the future state of the world. By attending to it, the action decoder can plan a more accurate action, effectively performing a soft version of inverse kinematics.
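A minimal sketch of how such a block-wise mask could be built (an illustration of the masking idea described above, not the authors' exact implementation; the block sizes are arbitrary):

```python
import torch

def build_physical_causal_mask(num_text, tokens_per_frame, tokens_per_action, num_steps):
    """Boolean attention mask (True = may attend). Text tokens are visible to everything;
    within physical step n, the frame-n and action-n blocks can attend to all tokens up to
    and including step n, so the action can condition on the co-predicted frame."""
    step_ids = [0] * num_text                       # text gets the earliest "step id"
    for n in range(num_steps):
        step_ids += [n + 1] * (tokens_per_frame + tokens_per_action)
    step_ids = torch.tensor(step_ids)
    # A query token at step i may attend to any key token at step j <= i.
    return step_ids[:, None] >= step_ids[None, :]

mask = build_physical_causal_mask(num_text=4, tokens_per_frame=3, tokens_per_action=2, num_steps=2)
print(mask.int())   # block-wise triangular pattern over (text, frame/action) blocks
```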
- Loss Function: The total training loss is the sum of the diffusion losses for the frames (L_obs) and the actions (L_act), averaged over the sequence length:
  $$\mathcal{L} = \frac{1}{N}\sum_{n=1}^{N}\left(\mathcal{L}_{\text{obs}}^{(n)} + \mathcal{L}_{\text{act}}^{(n)}\right).$$
- Efficiency: The model uses teacher forcing during training for parallelization and a KV-cache during inference. KV-caching stores the intermediate key and value matrices in the attention layers, so they don't need to be recomputed for the entire history at each new token generation step, dramatically speeding up autoregressive inference.
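The KV-cache idea can be sketched in a few lines. This is a generic single-head illustration, not PAR-specific: keys and values from earlier steps are stored so each new autoregressive step only computes projections and attention for the newly generated tokens.

```python
import torch

def attend_with_cache(q_new, k_new, v_new, cache):
    """Append the new keys/values to the cache, then attend from the new queries only."""
    if cache["k"] is None:
        cache["k"], cache["v"] = k_new, v_new
    else:
        cache["k"] = torch.cat([cache["k"], k_new], dim=1)   # (batch, total_len, dim)
        cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    scores = q_new @ cache["k"].transpose(1, 2) / cache["k"].shape[-1] ** 0.5
    return scores.softmax(dim=-1) @ cache["v"]               # attention over the full history

cache = {"k": None, "v": None}
for step in range(3):                        # each autoregressive step adds one new token
    q = k = v = torch.randn(1, 1, 32)        # projections of the newly generated token only
    out = attend_with_cache(q, k, v, cache)  # past keys/values are reused, not recomputed
```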
Note: This table is a transcription of the original data in Table 1 of the paper.
| Module | Parameters |
| --- | --- |
| Action-Tokenizer | 0.6M |
| Action-DeTokenizer | 21.1M |
| Frame-DeTokenizer | 32.8M |
| Causal-Transformer | 608.7M |
| Total | 663.2M |
This table shows that the newly introduced action-specific modules (Action-Tokenizer and Action-DeTokenizer) add only about 21.7M parameters. The vast majority of parameters (608.7M) come from the pretrained NOVA video model, highlighting the knowledge transfer aspect.
5. Experimental Setup
- Datasets: The experiments are conducted on the ManiSkill benchmark, a popular and challenging simulator for robotic manipulation. They focus on three specific single-task scenarios: PushCube, PickCube, and StackCube. For each task, 1,000 human demonstrations were rendered for finetuning.
- Evaluation Metrics:
- Success Rate: This is the primary metric used.
- Conceptual Definition: It measures the percentage of trials in which the robot successfully completes the specified task. For example, in PushCube, success is achieved if the cube is pushed to the target location. It is a direct and intuitive measure of task-level performance.
- Mathematical Formula:
  $$\text{Success Rate} = \frac{\text{Number of Successful Rollouts}}{\text{Total Number of Rollouts}} \times 100\%$$
- Symbol Explanation:
  - Number of Successful Rollouts: The count of test episodes where the task goal was met.
  - Total Number of Rollouts: The total number of test episodes run. In this paper, it is 125 (5 random seeds across 25 initial states).
- Baselines: PAR is compared against a strong set of existing methods:
- ACT (Action Chunk Transformer): A Transformer-based behavioral cloning model.
- BC-T (Behavior Cloning Transformer): Another Transformer-based imitation learning approach.
- DP (Diffusion Policy): A state-of-the-art method that uses a diffusion model to generate actions.
- ICRT (In-Context Robot Transformer): A model pretrained on action trajectories.
- RDT (Robotics Diffusion Transformer): A large, 1.3B-parameter model pretrained on a massive robotics dataset. This is a very strong baseline that, unlike PAR, uses extensive action pretraining.
6. Results & Analysis
- Core Results:
Note: This table is a transcription of the original data in Table 2 of the paper.
| Method | PushCube | PickCube | StackCube | Avg. |
| --- | --- | --- | --- | --- |
| ACT [2023] | 76% | 20% | 30% | 42% |
| BC-T [2021] | 98% | 4% | 14% | 39% |
| DP [2023] | 88% | 40% | 80% | 69% |
| ICRT [2025] | 77% | 78% | 30% | 62% |
| RDT [2024] | 100% | 77% | 74% | 84% |
| PAR (Ours) | 100% | 73% | 48% | 74% |

- Analysis: PAR's performance is impressive. Without any action pretraining, it achieves an average success rate of 74%, nearly matching the strongest baseline RDT on two of the three tasks and surpassing every other baseline on average.
  - On PushCube, it achieves a perfect 100% success rate.
  - On PickCube, it is only 4% behind the action-pretrained RDT model.
  - Its weakest task is StackCube (48%), a more complex, multi-stage task. Even so, it significantly outperforms ACT and ICRT.
- Key Takeaway: These results strongly support the paper's main hypothesis: transferring world knowledge from video pretraining is a highly effective alternative to action pretraining for robotic manipulation.
- Ablations / Parameter Sensitivity:
Note: This table is a transcription of the original data in Table 3 of the paper.
| Method | PushCube | PickCube | StackCube | Avg. |
| --- | --- | --- | --- | --- |
| PAR-NoAR | 29.6% | 4.0% | 0.0% | 11.2% |
| PAR-Discrete | 87.2% | 65.6% | 7.2% | 53.3% |
| PAR-Full | 100.0% | 72.8% | 48.0% | 73.6% |

This study validates the key design choices of PAR.
- PAR-NoAR: This version removes the autoregressive Transformer, connecting the encoders directly to the decoders. The performance plummets from 73.6% to 11.2%. This demonstrates that the sequential, step-by-step reasoning provided by the autoregressive framework is critical for success.
- PAR-Discrete: This version replaces the diffusion-based de-tokenizer with a simple MLP-based one, which performs deterministic regression instead of generative modeling. The average success rate drops from 73.6% to 53.3%. This confirms that modeling actions as continuous distributions with the generative de-tokenizer is superior, likely because it better captures the inherent uncertainty and multimodality in action policies.
- Visualizations:
The figure shows video frames from several robot manipulation tasks, comparing the predicted ("predict") and actually executed ("execute") processes for differently colored cubes in the PushCube, PickCube, and StackCube tasks. Each group contains consecutive frames, directly contrasting the model's predicted video sequence with the robot's actually executed one and illustrating the model's ability to accurately predict future physical states.
Figure 4 visualizes predicted video sequences against the actual executed trajectories. The alignment is remarkably high across all three tasks. For instance, in PickCube, the model correctly predicts the subtle rotation of the gripper needed to grasp the cube, and the robot executes this motion precisely. This shows the model has a fine-grained understanding of object interaction dynamics learned from video pretraining.
The figure is a chart showing the attention distributions of different attention heads over text, frame, and action sequences, with color intensity indicating attention strength. The magnified region in the red box corresponds to the robot manipulation scene, where attention concentrates on the robot arm and the target object, reflecting the model's focus on key physical interaction regions.
Figure 5 shows attention maps. The top row (token-level) reveals that different attention heads specialize, with some focusing on past frames and others on past actions. The bottom row (pixel-level) shows that when predicting an action, the model's attention is spatially focused on the most relevant objects: the cube to be manipulated, the target area, and the robot's own arm.
The figure illustrates the predicted and actually executed motions of the robot arm while grasping a red cube. The top row ("Predict") shows the model's consecutive predictions of the motion and object displacement, and the bottom row ("Execute") shows the robot's corresponding real actions and outcome, with red boxes magnifying the grasping details for comparison between prediction and execution.
Figure 6 presents a failure case analysis for the PickCube task. The model predicts a plausible trajectory, but during execution, the robot arm fails to place the cube correctly, missing the target along the depth axis. The authors hypothesize this is due to the difficulty of inferring precise 3D depth from a single RGB camera view, a known challenge in vision-based robotics.
7. Conclusion & Reflections
- Conclusion Summary: The paper successfully introduces the Physical Autoregressive Model (PAR), demonstrating that transferring world knowledge from pretrained video models is a powerful and viable strategy for robotic manipulation, eliminating the need for costly action pretraining. By unifying frames and actions into physical tokens and modeling them as continuous signals within an autoregressive framework, PAR achieves state-of-the-art or highly competitive performance on the ManiSkill benchmark. The strong alignment between predicted videos and executed actions validates the effectiveness of this knowledge transfer.
- Limitations & Future Work:
- Depth Ambiguity: The authors acknowledge that performance can be limited by the lack of explicit 3D information from single-view RGB images, as shown in the failure case. They suggest incorporating depth estimation modules as a direction for future work.
- Training Efficiency: While the model is effective, it requires full-parameter finetuning of a large video model. The authors propose exploring parameter-efficient finetuning techniques like LoRA in the future to reduce training costs.
- Personal Insights & Critique:
- Novelty and Impact: The central idea of leveraging video pretraining instead of language or action pretraining is highly compelling. It frames robotic manipulation not as a language-grounding problem, but as a physics prediction problem, which feels more fundamentally aligned with the task. This could shift the focus of foundation models for robotics.
- Strengths: The paper is methodologically sound, with strong justifications for each design choice (autoregression, continuous tokenization, causal masking) that are backed by a thorough ablation study. The results are very strong, especially considering the "no action pretraining" constraint.
- Open Questions & Potential Weaknesses:
- Scalability and Generalization: The model is finetuned on a per-task basis. A key challenge for foundation models is generalization across many diverse tasks. It remains to be seen how PAR would perform in a multi-task setting or on tasks unseen during finetuning.
- Inference Speed: Diffusion-based decoding is inherently iterative and slower than single-pass regression models. While KV-caching helps, the real-time applicability of such a model on physical hardware could be a concern.
- Dependency on Pretraining Data: The model's "world knowledge" is entirely dependent on the data used to pretrain the underlying video model (NOVA). Any biases, artifacts, or limitations in that dataset (e.g., lack of object diversity, non-realistic physics) could negatively impact the downstream manipulation policy.