
ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

Published: 07/23/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

ThinkAct proposes a dual-system framework that connects high-level reasoning and low-level action execution through reinforced visual latent planning, enabling few-shot adaptation, long-horizon planning, and self-correction in complex environments.

Abstract

Vision-language-action (VLA) reasoning tasks require agents to interpret multimodal instructions, perform long-horizon planning, and act adaptively in dynamic environments. Existing approaches typically train VLA models in an end-to-end fashion, directly mapping inputs to actions without explicit reasoning, which hinders their ability to plan over multiple steps or adapt to complex task variations. In this paper, we propose ThinkAct, a dual-system framework that bridges high-level reasoning with low-level action execution via reinforced visual latent planning. ThinkAct trains a multimodal LLM to generate embodied reasoning plans guided by reinforcing action-aligned visual rewards based on goal completion and trajectory consistency. These reasoning plans are compressed into a visual plan latent that conditions a downstream action model for robust action execution on target environments. Extensive experiments on embodied reasoning and robot manipulation benchmarks demonstrate that ThinkAct enables few-shot adaptation, long-horizon planning, and self-correction behaviors in complex embodied AI tasks.


In-depth Reading


1. Bibliographic Information

1.1. Title

ThinkAct: Vision-Language-Action Reasoning via Reinforced Visual Latent Planning

The title clearly outlines the paper's core concepts. ThinkAct suggests a dual-process system: first thinking (reasoning), then acting. Vision-Language-Action (VLA) Reasoning defines the problem domain—agents that must understand multimodal inputs (vision, language) to perform physical actions. Reinforced Visual Latent Planning specifies the core technical approach: using reinforcement learning (RL) to guide a planning process that operates in a compressed, latent visual space.

1.2. Authors

  • Authors: Chi-Pin Huang, Yueh-Hua Wu, Min-Hung Chen, Yu-Chiang Frank Wang, Fu-En Yang.
  • Affiliations: The authors are affiliated with NVIDIA and National Taiwan University. NVIDIA is a leading company in AI and robotics research, lending significant credibility and resources to the work. The collaboration with a prominent academic institution like National Taiwan University indicates a strong blend of industrial and academic research.

1.3. Journal/Conference

The paper was submitted to arXiv, an open-access repository of electronic preprints. The arXiv metadata lists a submission timestamp of 2025-07-22, indicating that this is a preprint version, likely intended for a top-tier AI or robotics conference in 2025 such as CVPR, ICRA, or CoRL. As a preprint, it has not yet completed the formal peer-review process.

1.4. Publication Year

2025 (as per the arXiv submission metadata).

1.5. Abstract

The abstract introduces the problem that existing Vision-Language-Action (VLA) models, trained end-to-end, struggle with long-horizon planning and adaptation due to a lack of explicit reasoning. To address this, the paper proposes ThinkAct, a dual-system framework. The core idea is to train a multimodal large language model (MLLM) to generate "embodied reasoning plans." This training is guided by reinforcement learning using novel action-aligned visual rewards based on goal completion and trajectory consistency. These plans are then compressed into a visual plan latent. This latent representation conditions a separate, downstream action model, enabling robust execution. The abstract concludes by highlighting the key results from experiments on robotics benchmarks: ThinkAct demonstrates superior few-shot adaptation, long-horizon planning, and emergent self-correction behaviors.

  • Original Source Link: https://arxiv.org/abs/2507.16815v2
  • PDF Link: https://arxiv.org/pdf/2507.16815v2.pdf
  • Publication Status: This is a preprint available on arXiv. It has not yet been officially published in a peer-reviewed journal or conference proceeding.

2. Executive Summary

2.1. Background & Motivation

  • Core Problem: The primary challenge in embodied AI is creating agents that can perform complex, multi-step tasks in dynamic environments based on natural language and visual inputs. Current VLA models often function as "black boxes," directly mapping sensory inputs to motor outputs. This end-to-end approach lacks an explicit reasoning or planning stage, making it difficult for them to handle tasks that require long-term strategy, adapt to unexpected changes, or generalize from a small number of examples.

  • Importance and Gaps: As we move towards more general-purpose robots, the ability to reason about a task before acting is crucial. Previous attempts to instill reasoning, such as using Chain-of-Thought (CoT) prompting, have relied on large-scale, human-annotated datasets of reasoning steps. This process is expensive, time-consuming, and may not cover the vast range of possible scenarios a robot might encounter. While reinforcement learning has been used to train LLMs to reason, the reward signals are often abstract (e.g., accuracy on a question-answering task) and not directly connected to the physical world, limiting their utility for embodied agents.

  • Innovative Idea: ThinkAct's central innovation is to ground the abstract process of "thinking" in the physical reality of "acting." It trains an MLLM to generate plans using reinforcement learning, but the reward signal is not based on textual correctness. Instead, it is derived from action-aligned visual feedback—how well the generated plan (represented as a visual trajectory) matches a successful real-world action in terms of its goal and path. This novel reward mechanism allows the model to learn effective reasoning patterns without expensive, explicit textual annotations. The resulting plan is then used to guide a separate, fast-acting policy, creating a practical and effective dual-system architecture.

2.2. Main Contributions / Findings

  • Primary Contributions:

    1. ThinkAct Framework: A novel dual-system VLA framework that decouples high-level, slow reasoning (thinking) from low-level, fast action execution (acting) via a visual plan latent.
    2. Action-Aligned Visual Rewards: A new reinforcement learning reward function for training embodied reasoning. The reward is based on visual goal completion and trajectory alignment, effectively grounding the MLLM's planning in physically plausible actions.
    3. Reinforced Visual Latent Planning: A method where an MLLM is fine-tuned with RL to generate high-level plans, which are then compressed into a latent vector to guide a downstream action model, enhancing its performance and adaptability.
  • Key Findings:

    1. Superior Performance: ThinkAct significantly outperforms existing VLA models on complex robot manipulation and embodied reasoning benchmarks, demonstrating its effectiveness.
    2. Long-Horizon Planning: The explicit reasoning stage enables the model to successfully break down and execute long-sequence tasks that challenge end-to-end models.
    3. Few-Shot Adaptation: The high-level reasoning provides a strong prior that allows the action model to adapt to new tasks and environments with very few demonstrations (e.g., 5 or 10).
    4. Emergent Self-Correction: ThinkAct exhibits the ability to detect execution failures, reason about the failure, and generate a new plan to recover and complete the task—a critical capability for robust real-world deployment.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Multimodal Large Language Models (MLLMs): These are advanced AI models that can process and understand information from multiple modalities, typically text and images (or video). An MLLM usually consists of a pre-trained vision encoder (like ViT) that converts an image into numerical representations (embeddings) and a large language model (LLM) that processes these visual embeddings alongside text embeddings. This allows the model to perform tasks like describing an image, answering questions about it, or even reasoning about its content.
  • Vision-Language-Action (VLA) Models: VLAs are a specialized type of MLLM designed for embodied AI tasks like robotics. They take visual input (e.g., a camera feed) and a language instruction (e.g., "pick up the red block") and output a sequence of actions (e.g., motor commands for a robot arm) to accomplish the task. Early VLAs were often trained end-to-end, learning a direct mapping from perception to action.
  • Reinforcement Learning (RL): RL is a machine learning paradigm where an "agent" learns to make decisions by interacting with an "environment." The agent receives a "reward" signal for each action it takes. The goal is to learn a "policy" (a strategy for choosing actions) that maximizes the cumulative reward over time. It's a trial-and-error learning process well-suited for tasks where explicit supervision is unavailable.
  • Chain-of-Thought (CoT) Prompting: CoT is a technique used to improve the reasoning abilities of LLMs. Instead of asking for a direct answer, the model is prompted to generate a series of intermediate, step-by-step reasoning steps that lead to the final answer. This mimics a human-like thought process and has been shown to significantly improve performance on tasks requiring logic, math, or complex planning.
  • Diffusion Policy: This is a modern approach to imitation learning (learning from demonstrations) based on diffusion models. A diffusion model works in two steps: a "forward process" where noise is gradually added to data (e.g., an action sequence) until it becomes pure noise, and a "reverse process" where a neural network learns to denoise it step-by-step to recover the original data. In a Diffusion Policy, the model learns to reverse this process, generating a realistic action sequence by "denoising" a random vector, conditioned on the current state (visual observation and instruction). The paper uses a DiT-based policy, which refers to a Diffusion Transformer, a powerful architecture for diffusion models.
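
    As a minimal illustration of the reverse (denoising) process behind a diffusion policy, the sketch below runs a toy DDPM-style sampler over an action chunk. The noise schedule, horizon, action dimension, and the stub `dummy_eps_model` are all invented for illustration; they are not the paper's DiT policy.

```python
import numpy as np

def sample_action_chunk(eps_model, cond, horizon=8, action_dim=7, T=50, seed=0):
    """Toy DDPM-style reverse process: start from Gaussian noise and iteratively
    denoise it into an action chunk, conditioned on `cond` (observation/plan features).
    `eps_model(a, t, cond)` predicts the noise that was added at diffusion step t."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, T)              # noise schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    a = rng.standard_normal((horizon, action_dim))  # start from pure noise
    for t in reversed(range(T)):
        eps_hat = eps_model(a, t, cond)             # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bars[t])
        mean = (a - coef * eps_hat) / np.sqrt(alphas[t])
        noise = rng.standard_normal(a.shape) if t > 0 else 0.0
        a = mean + np.sqrt(betas[t]) * noise
    return a                                        # denoised action sequence

# Stand-in for a trained noise predictor; a real model would be a Transformer (DiT)
# conditioned on the observation, instruction, and plan latent.
def dummy_eps_model(a, t, cond):
    return 0.1 * a

actions = sample_action_chunk(dummy_eps_model, cond=None)
print(actions.shape)  # (8, 7): an 8-step chunk of 7-dimensional actions
```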

3.2. Previous Works

  • End-to-End VLA Models: Works like RT-1, RT-2, and OpenVLA represent this paradigm. They are typically large transformer models trained on massive datasets of robot demonstrations (e.g., the Open X-Embodiment dataset), and they directly output low-level actions from high-level inputs. While powerful for the short-horizon skills they have been trained on, they lack transparency and struggle with long-horizon planning and out-of-distribution generalization.
  • Supervised Reasoning VLA Models: To address the lack of explicit reasoning, models like ECoT and RAD were developed. These methods first use a powerful off-the-shelf LLM (like GPT-4) to generate textual Chain-of-Thought reasoning traces for robot tasks. Then, a VLA model is trained via supervised fine-tuning to mimic these reasoning steps before predicting an action. The major drawback is the reliance on expensive and potentially limited "canned" reasoning data.
  • RL for LLM Reasoning & GRPO: Recent works have applied RL to fine-tune LLMs for better reasoning, avoiding the need for static CoT datasets. The reward is typically based on the final answer's correctness (e.g., in math or QA tasks). A key algorithm used in ThinkAct is Group Relative Policy Optimization (GRPO). GRPO is an RL algorithm designed to be more stable than traditional policy gradient methods for LLMs. Instead of comparing a generated response's reward to a learned baseline, it samples a group of $M$ responses from the policy and scores each response relative to the others in that group: the advantage of a response is its reward normalized by the mean and standard deviation of the group's rewards. This reduces the variance of the reward signal and makes learning more efficient, since the objective simply increases the likelihood of better-than-average responses within each sampled group.

3.3. Technological Evolution

The field has progressed from task-specific robot controllers to more generalist models:

  1. Early Imitation Learning: Simple models learning policies from demonstrations for a single task.
  2. Large-Scale End-to-End Models: Rise of transformers and large datasets like Open X-Embodiment led to generalist policies like RT-1 and OpenVLA that could perform many short-horizon skills.
  3. Reasoning via Supervised Learning: Researchers recognized the need for planning and introduced CoT. Models like ECoT used supervised learning on LLM-generated reasoning traces.
  4. Reasoning via Reinforcement Learning: The current paper, ThinkAct, represents the next step. It moves away from supervised reasoning traces and uses RL to allow the model to discover effective reasoning strategies. Crucially, it grounds this discovery process in physical plausibility through its novel visual reward function.

3.4. Differentiation Analysis

Compared to previous works, ThinkAct's innovations are:

  • Reward Function: Unlike RL-for-reasoning works that use task-agnostic rewards like QA accuracy, ThinkAct's action-aligned visual reward is specifically designed for embodied tasks. It directly links the quality of a thought process (the plan) to the quality of the intended physical outcome (the trajectory). This grounding is the key differentiator.
  • No Reliance on Textual CoT Data: Unlike ECoT or RAD, ThinkAct does not require a pre-curated dataset of textual reasoning steps. It learns to reason by exploring and receiving feedback from its visual reward function, making it more scalable and adaptable.
  • Dual-System Architecture: The explicit separation of a "thinker" (reasoning MLLM) and an "actor" (action policy) is a deliberate design choice. It allows for "slow thinking" (complex, deliberative planning) to happen less frequently, guiding a "fast control" loop for real-time execution. This is more practical for real-world robotics than a single monolithic model that must reason at every single timestep.

4. Methodology

4.1. Principles

The core principle of ThinkAct is to create an embodied agent that "thinks before it acts." This is achieved through a dual-system architecture that separates high-level planning from low-level motor control.

  1. The Thinker (Reasoning MLLM $\mathcal{F}_{\theta}$): A multimodal LLM responsible for long-horizon planning. It observes the world and a given instruction, then "thinks" by generating an internal reasoning monologue and a high-level plan. This thinking process is optimized using reinforcement learning with a novel reward signal that reflects physical plausibility.

  2. The Actor (Action Model $\pi_{\phi}$): A fast and efficient policy (a Diffusion Transformer) that executes the low-level actions. It doesn't perform complex reasoning itself but is guided by the plan generated by the Thinker.

  3. The Bridge (Visual Plan Latent $c_t$): A compact vector representation of the Thinker's plan. It serves as the communication channel, allowing the high-level intent from the MLLM to condition and guide the low-level action model.

    The overall architecture is depicted in the following figure from the paper (Figure 2):

    (Figure 2 of the paper, which depicts the overall ThinkAct architecture, is not reproduced in this analysis.)

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Problem Formulation

At any timestep $t$, the agent receives a visual observation $o_t$ (an image) and a textual instruction $l$. The goal is to generate a sequence of executable actions $a_t$.

ThinkAct decomposes this problem into two stages:

  1. The reasoning MLLM $\mathcal{F}_{\theta}$ generates a visual plan latent $c_t$ from the inputs: $c_t = \mathcal{F}_{\theta}(o_t, l)$.
  2. The action model $\pi_{\phi}$ uses this latent plan $c_t$, along with the current observation and instruction, to predict a sequence of $N$ actions: $[a_t, ..., a_{t+N-1}] = \pi_{\phi}(c_t, o_t, l)$.
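
To make this two-stage decomposition concrete, here is a minimal, hypothetical Python sketch of the interface it implies. The class and method names (ReasoningMLLM.plan, ActionModel.predict) and all dimensions are placeholders invented for illustration; they are not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class PlanLatent:
    vec: List[float]            # compressed representation of the MLLM's reasoning plan

class ReasoningMLLM:            # F_theta: slow, deliberative planner
    def plan(self, o_t, instruction: str) -> PlanLatent:
        return PlanLatent(vec=[0.0] * 16)   # placeholder for the real visual plan latent

class ActionModel:              # pi_phi: fast, low-level controller
    def predict(self, c_t: PlanLatent, o_t, instruction: str, N: int = 8) -> List[List[float]]:
        return [[0.0] * 7 for _ in range(N)]  # N placeholder actions (e.g., 7-DoF commands)

thinker, actor = ReasoningMLLM(), ActionModel()
c_t = thinker.plan(o_t=None, instruction="put the spoon on the towel")
actions = actor.predict(c_t, o_t=None, instruction="put the spoon on the towel")
print(len(actions))  # a_t ... a_{t+N-1}, here N = 8
```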

4.2.2. Reinforced Visual Latent Planning for Embodied Reasoning

This section details how the reasoning MLLM $\mathcal{F}_{\theta}$ is trained to produce effective plans.

1. Plan Generation Given an observation $o_t$ and instruction $l$, the MLLM $\mathcal{F}_{\theta}$ autoregressively generates a response. This response contains two parts: a textual reasoning trace (the "think" part) and a visual plan. The visual plan is represented as a text string of 2D keypoints $\tau = [p_k]_{k=1}^K$, where each $p_k$ is an (x, y) coordinate on the image. This trajectory represents the intended path of the robot's end-effector. For example, $p_1$ is the starting point and $p_K$ is the target endpoint.

2. Reward Shaping from Action-Aligned Visual Feedback To train the MLLM with reinforcement learning, a reward function $r$ is needed to score the quality of a generated plan $\tau$. ThinkAct designs a reward based on comparing the generated trajectory $\tau$ to a ground-truth trajectory $\hat{\tau}$ (obtained from demonstrations using an off-the-shelf object detector). This reward has two main visual components:

  • Goal Reward ($r_{\mathrm{goal}}$): This reward encourages the MLLM to correctly identify the start and end points of the manipulation task. It measures the proximity of the predicted start/end points ($p_1, p_K$) to the ground-truth start/end points ($\hat{p}_1, \hat{p}_K$). The formula is: $ r_{\mathrm{goal}} = \frac{1}{2} \left( f(p_1, \hat{p}_1) + f(p_K, \hat{p}_K) \right) , \quad \mathrm{where} ~ f(p, p') = \max(0, 1 - \| p - p' \|_2^2) . $

    • Symbol Explanation:
      • $p_1, p_K$: The predicted start and end points of the trajectory.
      • $\hat{p}_1, \hat{p}_K$: The ground-truth start and end points.
      • $\| p - p' \|_2^2$: The squared Euclidean distance between two points. The function $f(p, p')$ returns a reward of 1 if the points are identical and decreases quadratically to 0 as they move apart.
  • Trajectory Reward ($r_{\mathrm{traj}}$): This reward ensures the entire path of the planned trajectory is physically plausible and similar to demonstrated motions. It uses the Dynamic Time Warping (DTW) distance to measure the similarity between the predicted trajectory $\tau$ and the ground-truth trajectory $\hat{\tau}$. DTW is an algorithm that finds the optimal alignment between two time series, making it robust to slight variations in speed. The formula is: $ r_{\mathrm{traj}} = \max(0, 1 - d(\tau, \hat{\tau})). $

    • Symbol Explanation:
      • $\tau, \hat{\tau}$: The predicted and ground-truth trajectories, respectively.
      • $d(\tau, \hat{\tau})$: The Dynamic Time Warping (DTW) distance between the two trajectories. A smaller DTW distance means the trajectories are more similar, resulting in a higher reward.
  • Overall Reward ($r$): The final reward is a weighted combination of the visual rewards and a standard format correctness reward ($r_{\mathrm{format}}$) that encourages the model to produce output in the correct format. $ r = 0.9 r_{\mathrm{visual}} + 0.1 r_{\mathrm{format}} , \quad \mathrm{where} \quad r_{\mathrm{visual}} = \omega_{\mathrm{goal}} r_{\mathrm{goal}} + \omega_{\mathrm{traj}} r_{\mathrm{traj}} . $

    • Symbol Explanation:
      • $\omega_{\mathrm{goal}}$ and $\omega_{\mathrm{traj}}$ are weighting coefficients, both set to 0.5, giving equal importance to achieving the correct goal and following a plausible path.
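
Because the reward terms are fully specified by the formulas above, they can be sketched in a few lines of Python. The snippet below is an illustrative reimplementation rather than the authors' code: it assumes keypoints normalized to [0, 1] (so both rewards stay in [0, 1]), uses a plain, unnormalized DTW distance for the trajectory term, and treats the format reward as already computed.

```python
import numpy as np

def goal_reward(traj, traj_gt):
    """r_goal: proximity of predicted start/end points to the ground truth.
    Assumes 2D keypoints normalized to [0, 1]."""
    f = lambda p, q: max(0.0, 1.0 - float(np.sum((np.asarray(p) - np.asarray(q)) ** 2)))
    return 0.5 * (f(traj[0], traj_gt[0]) + f(traj[-1], traj_gt[-1]))

def dtw_distance(a, b):
    """Plain dynamic-time-warping distance between two 2D point sequences."""
    a, b = np.asarray(a, float), np.asarray(b, float)
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def trajectory_reward(traj, traj_gt):
    """r_traj = max(0, 1 - DTW(tau, tau_hat))."""
    return max(0.0, 1.0 - dtw_distance(traj, traj_gt))

def total_reward(traj, traj_gt, r_format=1.0, w_goal=0.5, w_traj=0.5):
    """r = 0.9 * r_visual + 0.1 * r_format, with r_visual = w_goal*r_goal + w_traj*r_traj."""
    r_visual = w_goal * goal_reward(traj, traj_gt) + w_traj * trajectory_reward(traj, traj_gt)
    return 0.9 * r_visual + 0.1 * r_format

pred = [(0.20, 0.30), (0.35, 0.40), (0.55, 0.52), (0.70, 0.60)]   # predicted keypoints
gt   = [(0.22, 0.28), (0.40, 0.45), (0.58, 0.55), (0.72, 0.61)]   # detector-derived ground truth
print(round(total_reward(pred, gt), 3))
```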

3. Reinforced Fine-Tuning with GRPO The MLLM $\mathcal{F}_{\theta}$ is fine-tuned using the Group Relative Policy Optimization (GRPO) algorithm. The process is as follows:

  1. Sample Responses: For a given input $(o_t, l)$, sample a group of $M$ different responses $\{z_1, ..., z_M\}$ from the current policy $\mathcal{F}_{\theta_{\mathrm{old}}}$.
  2. Evaluate: Calculate the reward $r_i$ for each response $z_i$ using the reward function defined above.
  3. Compute Advantage: For each response, calculate a normalized advantage $A_i$. This measures how much better or worse a response is compared to the average of the group. $ A_i = \frac{r_i - \mathrm{mean}(\{r_1, \dots, r_M\})}{\mathrm{std}(\{r_1, \dots, r_M\})}. $
  4. Optimize: Update the model parameters $\theta$ by maximizing the GRPO objective function: $ \mathcal{I}_{\mathrm{GRPO}}(\theta) = \frac{1}{M} \sum_{i=1}^M \left( \frac{\mathcal{F}_{\theta}(z_i \mid o_t, l)}{\mathcal{F}_{\theta_{\mathrm{old}}}(z_i \mid o_t, l)} A_i - \beta \, D_{KL}\big(\mathcal{F}_{\theta}(z_i \mid o_t, l) \,\|\, \mathcal{F}_{\theta_{\mathrm{old}}}(z_i \mid o_t, l)\big) \right). $
    • Symbol Explanation:
      • $\frac{\mathcal{F}_{\theta}(\cdot)}{\mathcal{F}_{\theta_{\mathrm{old}}}(\cdot)}$: The policy ratio, which measures how much more likely the new policy is to generate response $z_i$ compared to the old policy. The objective increases the probability of responses with positive advantage ($A_i > 0$).
      • $D_{KL}(\cdot \parallel \cdot)$: The KL divergence, a regularization term weighted by $\beta$. It prevents the updated policy $\mathcal{F}_{\theta}$ from deviating too much from the previous policy $\mathcal{F}_{\theta_{\mathrm{old}}}$, ensuring training stability.
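
The four steps above can be condensed into the following numerical sketch. It is illustrative only: the sequence log-probabilities would come from the MLLM's decoder, the KL term is a crude per-sample estimate, and practical GRPO implementations work with token-level ratios and clipping, which are omitted here.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each reward by the group mean and std."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

def grpo_objective(logp_new, logp_old, rewards, beta=0.04):
    """Simplified GRPO surrogate for one prompt: mean over the group of
    (policy ratio * advantage) minus a beta-weighted KL penalty toward the old policy.
    logp_new / logp_old are per-response sequence log-probabilities."""
    logp_new, logp_old = np.asarray(logp_new), np.asarray(logp_old)
    A = grpo_advantages(rewards)
    ratio = np.exp(logp_new - logp_old)      # F_theta(z_i) / F_theta_old(z_i)
    kl = logp_new - logp_old                 # crude per-sample KL estimate
    return float(np.mean(ratio * A - beta * kl))

# Example: a group of M = 4 sampled plans scored by the action-aligned visual reward.
rewards  = [0.82, 0.55, 0.91, 0.40]
logp_old = [-35.2, -40.1, -33.8, -42.5]
logp_new = [-34.9, -40.3, -33.1, -42.9]
print(grpo_objective(logp_new, logp_old, rewards))
```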

4.2.3. Reasoning-Enhanced Action Adaptation

Once the reasoning MLLM $\mathcal{F}_{\theta}$ is trained, it is frozen. The next stage is to train the action model $\pi_{\phi}$ to execute actions based on the plans from $\mathcal{F}_{\theta}$.

  1. Connecting Plan to Action: The MLLM generates the high-level plan, which is compressed into a visual plan latent $c_t$. This latent vector is fed into the action model $\pi_{\phi}$ to provide guidance. A latent projector (specifically a Q-Former) is used to map $c_t$ into the action model's input space.
  2. Training via Imitation Learning: The action model $\pi_{\phi}$, along with its state encoder and the latent projector, is trained using imitation learning on a dataset of expert demonstrations. The model learns to predict the demonstrated action $a_i$ given the state $(o_i, l)$ and the guiding plan $c_t$. The loss function is a standard supervised loss (e.g., Mean Squared Error): $ \mathcal{L}_{\mathrm{IL}}(\phi) = \mathbb{E}_{(o_i, l, a_i)} \left[ \ell(\pi_{\phi}(c_t, o_i, l), a_i) \right]. $
    • Symbol Explanation:
      • $\ell$: A loss function comparing the predicted action $\pi_{\phi}(\cdot)$ to the ground-truth action $a_i$.
      • $\mathbb{E}_{(o_i, l, a_i)}$: The expectation over the dataset of demonstrations.

    Asynchronous Operation: A key feature is that the reasoning and action loops run asynchronously. The MLLM generates one plan $c_t$, which is then used by the action model for the next $N$ timesteps. This "slow thinking, fast control" paradigm is computationally efficient and practical for real-world robotics.
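
As a rough sketch of this adaptation stage (assuming PyTorch), the toy example below conditions a small MLP policy on a plan latent and an observation feature and trains it with an MSE imitation loss. The projector and policy are simple stand-ins for the paper's Q-Former and DiT-based diffusion policy, the language instruction encoding is omitted, and every dimension is made up for illustration.

```python
import torch
import torch.nn as nn

latent_dim, obs_dim, act_dim, horizon = 256, 512, 7, 8   # illustrative sizes only

# Projector maps the frozen MLLM's plan latent c_t into the policy's conditioning space.
projector = nn.Sequential(nn.Linear(latent_dim, 128), nn.GELU(), nn.Linear(128, 128))
# Toy policy predicting an action chunk of `horizon` steps from (conditioning, observation).
policy = nn.Sequential(nn.Linear(128 + obs_dim, 256), nn.GELU(),
                       nn.Linear(256, horizon * act_dim))
optim = torch.optim.AdamW(list(projector.parameters()) + list(policy.parameters()), lr=1e-4)

def il_step(c_t, obs_feat, demo_actions):
    """One imitation-learning update: L_IL = || pi_phi(c_t, o_i, l) - a_i ||^2."""
    cond = projector(c_t)                                # plan latent -> conditioning vector
    pred = policy(torch.cat([cond, obs_feat], dim=-1))   # predicted action chunk
    loss = nn.functional.mse_loss(pred, demo_actions.flatten(1))
    optim.zero_grad()
    loss.backward()
    optim.step()
    return loss.item()

# Dummy batch standing in for (o_i, l, a_i) demonstrations; c_t comes from the frozen MLLM.
B = 16
loss = il_step(torch.randn(B, latent_dim), torch.randn(B, obs_dim),
               torch.randn(B, horizon, act_dim))
print(loss)
```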

4.2.4. Learning Strategy and Inference

The overall training is a multi-stage process:

  1. Cold-Start: The MLLM $\mathcal{F}_{\theta}$ is first supervised fine-tuned (SFT) on a mix of trajectory and QA datasets to learn the basic input/output formats. The action model $\pi_{\phi}$ is pre-trained on the large-scale Open X-Embodiment dataset to gain generalist motor skills.

  2. Reinforced Fine-Tuning: The SFT-initialized MLLM is then fine-tuned with the action-aligned rewards using GRPO.

  3. Action Adaptation: The pre-trained action model $\pi_{\phi}$ is fine-tuned on target environment data, conditioned on the plans generated by the now-frozen reasoning MLLM.

    At inference time, the agent observes $o_t$ and $l$, the MLLM $\mathcal{F}_{\theta}$ generates a plan latent $c_t$, and the action model $\pi_{\phi}$ takes $c_t$ as guidance to produce a sequence of actions.
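
The resulting "slow thinking, fast control" loop can be summarized by the control-flow sketch below. The env, thinker, and actor objects and their methods are hypothetical placeholders (not the paper's API); the only load-bearing detail is that the expensive MLLM call happens once every N steps while the policy acts at every step.

```python
def run_episode(env, thinker, actor, instruction, N=8, max_steps=200):
    """'Slow thinking, fast control': the reasoning MLLM (thinker) replans once every
    N timesteps, while the action model (actor) executes an action at every timestep
    using the most recent visual plan latent. `env.step` is assumed to return
    (observation, done); all objects here are hypothetical placeholders."""
    obs = env.reset()
    c_t, action_chunk = None, []
    for step in range(max_steps):
        if step % N == 0:                                  # infrequent, expensive reasoning call
            c_t = thinker.plan(obs, instruction)           # visual plan latent c_t
            action_chunk = actor.predict(c_t, obs, instruction, N=N)  # next N actions
        obs, done = env.step(action_chunk[step % N])       # fast low-level control
        if done:
            return True
    return False
```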

5. Experimental Setup

5.1. Datasets

5.1.1. Training Datasets

  • For SFT Cold-Start & RL:
    • Open X-Embodiment (OXE): A large-scale robotics dataset containing demonstrations from many different robots. The paper uses subsets (fractal20220817_data, bridge) to extract 2D gripper trajectories.
    • Something-Something v2: A dataset of human videos performing actions. Used to extract human hand trajectories to provide more diverse manipulation priors.
    • RoboVQA: A dataset with question-answer pairs about long-horizon robotic manipulation videos.
    • EgoPlan-T: A dataset for planning in egocentric daily tasks.
    • Video-R1-CoT: A video reasoning dataset with CoT annotations, used for the SFT cold-start.
    • Reflect (RoboFail): A dataset focused on robot manipulation failures, used to improve failure detection reasoning.
    • LLaVA-Video-178K: A general video instruction-following dataset.

5.1.2. Evaluation Benchmarks

  • Robot Manipulation:

    • SimplerEnv: A simulation benchmark designed to test robustness to visual variations (color, lighting, camera pose). It includes Visual Matching and Variant Aggregation settings.
    • LIBERO: A benchmark for lifelong robot learning, with suites designed to test generalization to new spatial layouts (LIBERO-Spatial), objects (LIBERO-Object), goals (LIBERO-Goal), and long-horizon sequences (LIBERO-Long).
  • Embodied Reasoning:

    • EgoPlan-Bench2: A multiple-choice QA benchmark for evaluating multi-step planning in egocentric videos.
    • RoboVQA: A free-form QA benchmark for long-horizon reasoning in robotic manipulation videos.
    • OpenEQA: A free-form QA benchmark for zero-shot embodied understanding in diverse real-world scenes.

5.2. Evaluation Metrics

  • Task Success Rate:

    1. Conceptual Definition: This metric measures the overall effectiveness of the agent in accomplishing its task. It is calculated as the percentage of experimental trials in which the agent successfully completes the specified goal.
    2. Mathematical Formula: $ \text{Success Rate} = \frac{\text{Number of Successful Trials}}{\text{Total Number of Trials}} \times 100\% $
    3. Symbol Explanation: A trial is deemed successful if the final state of the environment meets the pre-defined success criteria for the task.
  • Accuracy:

    1. Conceptual Definition: Used for multiple-choice question-answering tasks (like EgoPlan-Bench2), this metric measures the proportion of questions for which the model selected the correct option.
    2. Mathematical Formula: $ \text{Accuracy} = \frac{\text{Number of Correct Answers}}{\text{Total Number of Questions}} \times 100\% $
    3. Symbol Explanation: Straightforward ratio of correct predictions to total predictions.
  • BLEU (Bilingual Evaluation Understudy):

    1. Conceptual Definition: Used for free-form QA tasks (like RoboVQA), BLEU measures the quality of a machine-generated text by comparing it to one or more high-quality human reference texts. It quantifies the correspondence of n-grams (contiguous sequences of n words) between the generated text and the reference texts. A higher BLEU score indicates greater similarity.
    2. Mathematical Formula: The formula for BLEU is: $ \text{BLEU} = \text{BP} \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right) $
    3. Symbol Explanation:
      • $\text{BP}$: The Brevity Penalty, which penalizes generated texts that are too short compared to the reference length. It is calculated as $ \text{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases} $, where $c$ is the length of the candidate text and $r$ is the effective reference length.
      • $p_n$: The modified n-gram precision. It is the count of n-grams in the candidate text that are also found in any reference text (clipped by their reference counts), divided by the total count of n-grams in the candidate text.
      • $w_n$: The weight for each n-gram precision $p_n$, typically uniform ($1/N$).
      • $N$: The maximum n-gram size to consider, commonly $N = 4$. (A minimal computation sketch is given after this list.)
  • LLM-based Scoring:

    1. Conceptual Definition: For complex, open-ended QA tasks (like OpenEQA), traditional metrics like BLEU can be insufficient. LLM-based scoring uses a powerful, proprietary LLM (e.g., GPT-4) as an impartial judge. The judge is given the original question, the model's generated answer, and a ground-truth answer, and is prompted to provide a score (e.g., on a scale of 1-10) reflecting the quality, correctness, and relevance of the generated answer. This metric aligns better with human judgment.
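
For concreteness, here is a minimal single-reference BLEU computation following the formula above. It is a teaching sketch without smoothing; actual evaluations typically rely on an established implementation (e.g., the sacrebleu package).

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu(candidate, reference, max_n=4):
    """Single-reference BLEU: clipped n-gram precisions with uniform weights 1/N,
    multiplied by the brevity penalty. No smoothing, for clarity only."""
    cand, ref = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        c_counts, r_counts = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
        clipped = sum(min(c, r_counts[g]) for g, c in c_counts.items())
        total = max(1, sum(c_counts.values()))
        if clipped == 0:
            return 0.0   # any zero precision drives unsmoothed BLEU to 0
        log_p += (1.0 / max_n) * math.log(clipped / total)
    c, r = len(cand), len(ref)
    bp = 1.0 if c > r else math.exp(1.0 - r / c)   # brevity penalty
    return bp * math.exp(log_p)

print(round(bleu("pick up the red block on the table",
                 "pick up the red block from the table"), 3))
```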

5.3. Baselines

The paper compares ThinkAct against a comprehensive set of baselines:

  • End-to-End VLA Models: Octo-Base, RT1-X, OpenVLA, TraceVLA.

  • Reasoning-based VLA Models: CoT-VLA, Magma.

  • ThinkAct's Action Model (Ablation): DiT-Policy (the action model used in ThinkAct, but without the reasoning guidance).

  • General MLLMs (for reasoning tasks): GPT-4V, LLaVA-Video, InternVL, NVILA, Qwen2.5-VL.

    These baselines are representative because they cover the main competing paradigms: direct action prediction, supervised reasoning, and general-purpose multimodal reasoning.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Robot Manipulation Results

The following are the results from Table 1 of the original paper, evaluating performance on the SimplerEnv and LIBERO benchmarks.

Success rates are given in %; "-" indicates no reported result for that method.

| Benchmark | Task / Split | Octo-Base | RT1-X | OpenVLA | DiT-Policy | TraceVLA | CoT-VLA | Magma | ThinkAct (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Simpler-Google (Visual Matching) | Open/Close Drawer | 1.0 | 22.5 | 49.5 | 44.9 | 57.0 | - | 56.0 | 50.0 |
| Simpler-Google (Visual Matching) | Move Near | 3.0 | 55.0 | 47.1 | 58.9 | 53.7 | - | 65.4 | 72.4 |
| Simpler-Google (Visual Matching) | Pick Coke Can | 1.3 | 52.8 | 15.3 | 64.3 | 28.0 | - | 83.7 | 92.0 |
| Simpler-Google (Visual Matching) | Overall | 1.8 | 43.4 | 37.3 | 56.0 | 46.2 | - | 68.4 | 71.5 |
| Simpler-Google (Variant Aggregation) | Open/Close Drawer | 22.0 | 56.0 | 22.5 | 35.5 | 31.0 | - | 53.4 | 47.6 |
| Simpler-Google (Variant Aggregation) | Move Near | 4.2 | 34.2 | 54.0 | 52.8 | 56.4 | - | 65.7 | 63.8 |
| Simpler-Google (Variant Aggregation) | Pick Coke Can | 17.0 | 54.0 | 52.8 | 56.4 | 60.0 | - | 68.8 | 84.0 |
| Simpler-Google (Variant Aggregation) | Overall | 14.4 | 48.1 | 43.1 | 48.2 | 49.1 | - | 62.6 | 65.1 |
| Simpler-Bridge (Visual Matching) | Put Carrot on Plate | 8.3 | 4.2 | 4.2 | 29.4 | - | 31.0 | - | 37.5 |
| Simpler-Bridge (Visual Matching) | Stack Blocks | 0.0 | 0.0 | 0.0 | 0.0 | - | 12.7 | - | 8.7 |
| Simpler-Bridge (Visual Matching) | Put Spoon on Towel | 12.5 | 0.0 | 8.3 | 34.5 | - | 37.5 | - | 58.3 |
| Simpler-Bridge (Visual Matching) | Put Eggplant in Basket | 43.1 | 0.0 | 45.8 | 65.5 | - | 60.5 | - | 70.8 |
| Simpler-Bridge (Visual Matching) | Overall | 16.0 | 1.1 | 14.6 | 32.4 | - | 35.4 | - | 43.8 |
| LIBERO | Spatial | 78.9 | - | 84.7 | 82.6 | 84.6 | 87.5 | - | 88.3 |
| LIBERO | Object | 85.7 | - | 88.4 | 84.7 | 85.2 | 91.6 | - | 91.4 |
| LIBERO | Goal | 84.6 | - | 79.2 | 82.1 | 75.1 | 87.6 | - | 87.1 |
| LIBERO | Long | 51.1 | - | 53.7 | 57.6 | 54.1 | 69.0 | - | 70.9 |
| LIBERO | Overall | 75.1 | - | 76.5 | 76.8 | 74.8 | 83.9 | - | 84.4 |

Analysis:

  • Superiority over Baselines: ThinkAct consistently achieves the highest overall success rates across all three SimplerEnv setups and on the LIBERO benchmark. On SimplerEnv, it scores 71.5%, 65.1%, and 43.8%, outperforming the next best methods. On LIBERO, its overall score of 84.4% surpasses all competitors, including the strong CoT-VLA baseline.
  • Value of Reasoning: The most critical comparison is with DiT-Policy, which is ThinkAct's action model without the reasoning guidance. ThinkAct outperforms DiT-Policy by a large margin (e.g., 71.5% vs. 56.0% on Simpler-Google-VM). This directly proves that the reasoning-guided visual plan latent significantly enhances the performance of the action policy.
  • Long-Horizon Planning: The strong performance on LIBERO, especially the Long and Spatial subtasks, validates the claim that ThinkAct's explicit planning mechanism helps in solving complex, multi-step tasks.

6.1.2. Embodied Reasoning Results

The following are the results from Table 2 of the original paper, evaluating performance on reasoning benchmarks.

| Benchmark | Split / Metric | GPT-4V | LLaVA-Video | InternVL2.5 | InternVL3 | NVILA | Qwen2.5-VL | Qwen2.5-VL* | Magma | ThinkAct (Ours) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| EgoPlan-Bench2 | Daily life | 36.7 | 38.0 | 36.2 | 38.5 | 35.8 | 31.4 | 47.9 | 32.1 | 50.1 |
| EgoPlan-Bench2 | Work | 27.7 | 29.9 | 28.7 | 32.9 | 28.7 | 26.7 | 46.3 | 25.7 | 49.8 |
| EgoPlan-Bench2 | Recreation | 33.9 | 39.0 | 34.4 | 36.1 | 37.2 | 29.5 | 44.3 | 34.4 | 44.8 |
| EgoPlan-Bench2 | Hobbies | 32.5 | 37.4 | 35.4 | 37.2 | 35.4 | 28.6 | 44.2 | 29.3 | 45.2 |
| EgoPlan-Bench2 | Overall | 32.6 | 35.5 | 33.5 | 36.2 | 33.7 | 29.1 | 45.7 | 29.8 | 48.2 |
| RoboVQA | BLEU-1 | 32.2 | 35.4 | 40.5 | 44.3 | 42.7 | 47.8 | 65.3 | 38.6 | 69.1 |
| RoboVQA | BLEU-2 | 26.5 | 32.1 | 33.3 | 36.5 | 39.7 | 41.2 | 57.3 | 31.5 | 61.8 |
| RoboVQA | BLEU-3 | 24.7 | 30.0 | 29.6 | 31.6 | 37.6 | 36.2 | 52.2 | 28.1 | 56.0 |
| RoboVQA | BLEU-4 | 23.9 | 29.0 | 27.5 | 28.9 | 36.1 | 33.7 | 48.0 | 26.7 | 52.4 |
| RoboVQA | Overall | 26.8 | 31.6 | 32.7 | 35.3 | 39.0 | 39.7 | 55.7 | 31.2 | 59.8 |
| OpenEQA | Obj. State | 63.2 | 69.1 | 70.2 | 68.9 | 66.1 | 63.2 | 62.4 | 59.9 | 70.0 |
| OpenEQA | Obj. Recog. | 43.4 | 42.6 | 47.2 | 49.1 | 49.5 | 46.2 | 45.2 | 43.8 | 47.2 |
| OpenEQA | Func. Reason. | 57.4 | 50.3 | 56.2 | 54.6 | 51.0 | 51.2 | 52.3 | 50.0 | 53.2 |
| OpenEQA | Spatial | 33.6 | 46.2 | 44.1 | 43.3 | 43.1 | 41.2 | 42.8 | 39.3 | 47.6 |
| OpenEQA | Attri. Recog. | 57.2 | 64.1 | 64.9 | 74.4 | 69.3 | 63.0 | 65.0 | 58.3 | 71.1 |
| OpenEQA | World Know. | 50.7 | 60.5 | 56.5 | 53.1 | 59.4 | 54.3 | 54.2 | 53.3 | 58.6 |
| OpenEQA | Obj. Loc. | 42.0 | 38.2 | 41.9 | 45.0 | 39.9 | 36.5 | 41.9 | 38.9 | 45.9 |
| OpenEQA | Overall | 49.6 | 53.0 | 54.4 | 55.5 | 54.0 | 50.8 | 52.0 | 49.1 | 56.2 |

Analysis:

  • Enhanced Reasoning Capability: ThinkAct achieves state-of-the-art results on all three reasoning benchmarks. The significant improvement over the base Qwen2.5-VL model (and even its fine-tuned version, Qwen2.5-VL*) demonstrates that the proposed reinforced fine-tuning with action-aligned rewards genuinely enhances the MLLM's ability to reason about embodied scenarios.

  • Qualitative Evidence: The paper provides qualitative examples (Figure 3 and Figure 4) that show the model's explicit reasoning process. For instance, in Figure 3, the model verbalizes its plan: "First, identify the book... Use the robot's arm... Move the book smoothly... Place it in the compartment." This confirms that the model is performing structured, multi-step planning. Figure 4 below shows a direct comparison of the reasoning process with and without RL, where the RL-tuned model correctly infers future steps while the SFT model fails.

    Figure A7 (appendix): Qualitative comparison of the reasoning process and the derived answer for ThinkAct with and without RL on an embodied reasoning task from the EgoPlan-Bench2 benchmark (choosing the appropriate next action to complete a "prepare the baking powder" cooking task). The left side shows the model without RL, the right side the model with RL; red denotes incorrect reasoning and answers, green denotes correct ones.

6.2. Ablation Studies / Parameter Analysis

The authors perform a crucial ablation study to validate the components of their proposed reward function. The following are the results from Table 3 of the original paper:

| Method | SimplerEnv | EgoPlan | RoboVQA |
| --- | --- | --- | --- |
| ThinkAct (Ours) | 60.1 | 48.2 | 59.8 |
| Ours w/o $r_{\mathrm{traj}}$ | 59.2 | 47.9 | 58.5 |
| Ours w/o $r_{\mathrm{goal}}$ | 59.1 | 47.6 | 58.9 |
| Ours w/o $r_{\mathrm{traj}}$, $r_{\mathrm{goal}}$ | 56.9 | 47.2 | 58.3 |
| SFT cold-start | 56.4 | 46.4 | 57.9 |

Analysis:

  • Both Rewards are Crucial: Removing either the trajectory reward (r_traj) or the goal reward (r_goal) leads to a drop in performance across all benchmarks. This indicates that both encouraging a plausible path (r_traj) and identifying the correct start/end points (r_goal) are essential for learning effective plans.
  • Visual Rewards are Key: When both visual rewards are removed (w/o r_traj, r_goal), leaving only rewards from general QA datasets, the performance drops significantly, nearing the SFT baseline. This is the strongest evidence that the action-aligned visual rewards are the primary driver of the performance gains.
  • RL is Necessary: The SFT cold-start model performs the worst, confirming that supervised learning alone is insufficient and the reinforcement learning stage is vital for eliciting advanced reasoning capabilities.

6.3. Analysis of Emergent Capabilities

6.3.1. Reasoning Enhances Few-Shot Adaptation

The paper investigates if the learned reasoning capabilities help the agent adapt to new tasks more quickly. The results from Figure 5 show the success rate on LIBERO tasks after fine-tuning on only 10 demonstrations.

(Figure 5 of the paper, which reports these few-shot success rates on the LIBERO suites, is not reproduced in this analysis.)

Analysis: ThinkAct consistently outperforms the baseline models, especially on adapting to new goals (LIBERO-Goal, +7.3% over Magma) and new environments (LIBERO-Spatial, +9.5%). This suggests that the high-level reasoning provides a structured and generalizable understanding of tasks, which serves as a powerful prior, allowing the action model to learn the specifics of a new task from very few examples.

6.3.2. Reasoning Elicits Self-Correction

A standout result is ThinkAct's emergent ability to recover from errors. Figure 8 in the paper demonstrates this qualitatively.

Figure (from the paper): the left frames show the robot failing to grasp the target object ("Fail to pick up target object!!"), while the right frames show the robot replanning and executing ("Replan & Execute") to successfully complete the task, illustrating the self-correction strategy for handling execution failures.

Analysis: In this scenario, the robot accidentally drops the object it is supposed to place in a basket. The reasoning MLLM, by observing the video context, identifies the failure. It then generates a new reasoning trace ("Let's reconsider how to complete the task") and produces a revised plan to go back, regrasp the dropped object, and complete the original instruction. This is a significant step towards creating truly robust and autonomous agents that can operate reliably in unpredictable real-world environments.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully presents ThinkAct, a dual-system framework that effectively integrates high-level reasoning with low-level action execution for VLA tasks. By introducing a novel action-aligned visual reward function for reinforcement learning, ThinkAct trains an MLLM to generate physically grounded, multi-step plans. These plans, encoded as a visual latent, guide a downstream action model, leading to state-of-the-art performance on challenging robotics benchmarks. The work's key contributions are the demonstration of enhanced long-horizon planning, remarkable few-shot adaptation, and the emergence of critical self-correction capabilities, paving a promising path toward more intelligent and adaptable embodied AI.

7.2. Limitations & Future Work

The authors acknowledge a primary limitation: ThinkAct, being built upon a pre-trained MLLM, inherits its potential flaws, most notably hallucinations. The reasoning MLLM might generate plans based on incorrect assumptions about the visual scene (e.g., misidentifying objects or their properties), which could lead to execution failures. While the paper's action grounding mechanism helps mitigate this to some extent, it does not eliminate the problem. Future work could focus on developing more robust grounding techniques or methods for hallucination suppression specifically for embodied reasoning to improve reliability for real-world deployment.

7.3. Personal Insights & Critique

  • Strengths:

    • The action-aligned visual reward is the paper's most significant and elegant contribution. It provides a scalable and effective way to ground the abstract reasoning of LLMs in the physical world without requiring a full simulation loop for every reasoning step or expensive human annotation of thought processes.
    • The dual-system "Think-Act" architecture is pragmatically designed. It leverages the strengths of large, slow models for complex planning while relying on a smaller, faster model for real-time control, which is a practical blueprint for real-world robotic systems.
    • The demonstration of emergent self-correction is highly compelling. It shows that with the right training paradigm, models can develop sophisticated, human-like recovery behaviors, which is a crucial step towards building truly autonomous systems.
  • Potential Issues and Areas for Improvement:

    • Dependency on Ground-Truth Trajectories: The RL training process relies on trajectories extracted from demonstrations using an "off-the-shelf detector." The quality and robustness of this detector could be a potential bottleneck. If the detector fails or provides noisy data, the reward signal would be corrupted, hindering the learning of the reasoning model.
    • Fixed Replanning Frequency: The model replans every $N$ steps, where $N$ is a fixed hyperparameter. A more intelligent, adaptive system might decide when to replan based on task complexity or the detection of an anomaly. A fixed frequency could be inefficient (replanning too often for simple tasks) or ineffective (not replanning soon enough when an error occurs).
    • Computational Overhead: While the asynchronous design is more efficient than reasoning at every step, running a large MLLM like Qwen2.5-VL 7B for planning still represents a significant computational cost, which could be a barrier for deployment on resource-constrained robotic hardware.
    • Generalization of Visual Rewards: The visual rewards are based on 2D trajectory matching. This might not be sufficient for tasks where 3D spatial relationships, forces, or object dynamics are critical. Extending the reward function to incorporate more sophisticated physical common sense could be a valuable future direction.
