TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

Published: 10/08/2025

TL;DR Summary

TrackVLA++ is a novel Vision-Language-Action model that enhances embodied visual tracking by introducing a spatial reasoning mechanism and Target Identification Memory. It effectively addresses tracking failures under severe occlusions and achieves state-of-the-art performance.

Abstract

Embodied Visual Tracking (EVT) is a fundamental ability that underpins practical applications, such as companion robots, guidance robots and service assistants, where continuously following moving targets is essential. Recent advances have enabled language-guided tracking in complex and unstructured scenes. However, existing approaches lack explicit spatial reasoning and effective temporal memory, causing failures under severe occlusions or in the presence of similar-looking distractors. To address these challenges, we present TrackVLA++, a novel Vision-Language-Action (VLA) model that enhances embodied visual tracking with two key modules, a spatial reasoning mechanism and a Target Identification Memory (TIM). The reasoning module introduces a Chain-of-Thought paradigm, termed Polar-CoT, which infers the target's relative position and encodes it as a compact polar-coordinate token for action prediction. Guided by these spatial priors, the TIM employs a gated update strategy to preserve long-horizon target memory, ensuring spatiotemporal consistency and mitigating target loss during extended occlusions. Extensive experiments show that TrackVLA++ achieves state-of-the-art performance on public benchmarks across both egocentric and multi-camera settings. On the challenging EVT-Bench DT split, TrackVLA++ surpasses the previous leading approach by 5.1 and 12, respectively. Furthermore, TrackVLA++ exhibits strong zero-shot generalization, enabling robust real-world tracking in dynamic and occluded scenarios.


1. Bibliographic Information

1.1. Title

TrackVLA++: Unleashing Reasoning and Memory Capabilities in VLA Models for Embodied Visual Tracking

1.2. Authors

The authors of the paper are Jiahang Liu, Yunpeng Qi, Jiazhao Zhang, Minghan Li, Shaoan Wang, Kui Wu, Hanjing Ye, Hong Zhang, Zhibo Chen, Fangwei Zhong, Zhizheng Zhang, and He Wang.

The affiliations of the authors span several prestigious institutions and companies:

  • Peking University

  • Galbot

  • University of Science and Technology of China (USTC)

  • Beijing Academy of Artificial Intelligence (BAAI)

  • Beihang University

  • Southern University of Science and Technology (SUSTech)

  • Beijing Normal University

This collaboration between academic and industry researchers indicates a strong combination of theoretical innovation and practical application focus.

1.3. Journal/Conference

The paper is currently available as a preprint on arXiv, a repository for scientific articles. The arXiv ID 2510.07134v1 and the provided publication date 2025-10-08 suggest it has been submitted to a top-tier conference in robotics or computer vision for the year 2025. Venues like Robotics: Science and Systems (RSS), the Conference on Computer Vision and Pattern Recognition (CVPR), or the International Conference on Robotics and Automation (ICRA) are likely targets, as these are highly reputable and influential platforms for publishing work in embodied AI and robotics.

1.4. Publication Year

The paper was submitted to arXiv in 2025.

1.5. Abstract

The abstract introduces Embodied Visual Tracking (EVT) as a crucial capability for real-world robotic applications like companion and service robots. It points out that existing language-guided tracking methods fall short in situations with severe occlusions or visually similar distractors due to a lack of explicit spatial reasoning and effective temporal memory. To overcome these issues, the paper proposes TrackVLA++, a novel Vision-Language-Action (VLA) model. TrackVLA++ features two main innovations: a spatial reasoning mechanism called Polar-CoT and a Target Identification Memory (TIM). Polar-CoT uses a Chain-of-Thought approach to infer the target's relative position in polar coordinates, which guides the action prediction. The TIM module employs a gated update strategy, guided by the reasoning output, to maintain a long-term memory of the target, preventing target loss during occlusions. The authors demonstrate through extensive experiments that TrackVLA++ achieves state-of-the-art results on public benchmarks and shows robust zero-shot generalization to real-world scenarios.

2. Executive Summary

2.1. Background & Motivation

2.1.1. What is the core problem the paper aims to solve?

The core problem is Embodied Visual Tracking (EVT), where a mobile agent (like a robot) must physically navigate its environment to continuously follow a specific moving target. The target is often described using natural language (e.g., "follow the person in the red shirt").

2.1.2. Why is this problem important in the current field? What specific challenges or gaps exist in prior research?

EVT is a fundamental skill for human-robot interaction and service robotics. Imagine a robot assistant in a hospital that needs to follow a specific doctor, or a companion robot that follows its owner through a crowded mall. The success of these applications hinges on robust tracking.

However, prior research, even state-of-the-art Vision-Language-Action (VLA) models like TrackVLA, faces significant challenges in complex, real-world environments. The paper identifies two primary gaps:

  1. Lack of Explicit Spatial Reasoning: Existing models often perform tracking in a reactive, "black-box" manner. They don't explicitly reason about where the target is relative to the agent. This makes them brittle and prone to failure when the visual input is ambiguous.
  2. Ineffective Long-Horizon Memory: When a target is temporarily occluded (e.g., walks behind a pillar) or when there are visually similar distractors (e.g., multiple people wearing the same uniform), the models can easily lose track or switch to the wrong target. They lack a robust mechanism to remember what the correct target looks like over extended periods of time.

2.1.3. What is the paper's entry point or innovative idea?

The paper's innovative idea is to explicitly integrate reasoning and memory into the VLA framework to make tracking more intelligent and robust. Instead of just mapping vision to action, the model is designed to first think about the target's location and then use that thought process to inform its memory and subsequent actions. This is achieved through two novel modules:

  • Polar Chain-of-Thought (Polar-CoT): A lightweight reasoning mechanism that forces the model to first predict the target's relative position (angle and distance) before deciding how to move.
  • Target Identification Memory (TIM): A gated memory system that uses the confidence of the Polar-CoT reasoning to decide whether to update its memory of the target's appearance. This prevents the memory from being corrupted by incorrect observations during occlusion or distraction.

2.2. Main Contributions / Findings

2.2.1. What are the paper's primary contributions?

The paper lists three main contributions:

  1. A novel Polar-CoT mechanism that equips the VLA model with explicit spatial reasoning capabilities. This improves performance while remaining computationally efficient, which is crucial for real-time tracking.
  2. The Target Identification Memory (TIM) module, which provides robust long-horizon target identification. Its reasoning-guided update strategy makes the model resilient to severe occlusions and distractors.
  3. State-of-the-art performance: The paper demonstrates through extensive experiments that TrackVLA++ sets a new state-of-the-art on multiple simulation benchmarks and generalizes effectively to real-world scenarios.

2.2.2. What key conclusions or findings did the paper reach?

The key finding is that by explicitly modeling spatial reasoning and long-term memory, an embodied agent can achieve significantly more robust and reliable visual tracking. The proposed Polar-CoT and TIM modules directly address the core weaknesses of previous methods, leading to substantial performance gains, particularly in challenging scenarios involving occlusions and distractors. This work suggests that future advancements in embodied AI should focus not just on scaling up end-to-end models but also on incorporating structured reasoning and memory components.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Embodied AI

Embodied AI refers to artificial intelligence systems that are not just disembodied algorithms processing data, but are physically situated within an environment. These systems, often robots, can perceive the world through sensors (like cameras) and act upon it using actuators (like wheels or legs). The "embodiment" is crucial: the agent's physical form and its interaction with the world are integral to its learning and decision-making process. The EVT task is a classic example of embodied AI, as it requires an agent to move and navigate in a physical space.

3.1.2. Vision-Language Models (VLMs)

Vision-Language Models (VLMs) are a class of deep learning models designed to understand and process information from both images (vision) and text (language) simultaneously. They learn to associate visual concepts with their textual descriptions. A famous example is CLIP (Contrastive Language-Image Pre-Training), which learns to match images with their corresponding captions. This ability allows VLMs to perform tasks like zero-shot image classification (classifying images without being explicitly trained on those classes) and image captioning.

3.1.3. Vision-Language-Action (VLA) Models

Vision-Language-Action (VLA) Models are an evolution of VLMs, specifically for embodied AI. They extend the capabilities of VLMs by adding an "action" component. A VLA model takes visual input (e.g., from a robot's camera) and a language instruction (e.g., "pick up the red apple") and outputs a sequence of actions (e.g., motor commands) for the robot to execute. This creates an end-to-end connection from high-level human commands to low-level physical actions, forming the foundation of modern robotics and embodied agents. TrackVLA++ is a VLA model.

3.1.4. Chain-of-Thought (CoT) Reasoning

Chain-of-Thought (CoT) Reasoning is a technique used to improve the reasoning abilities of large language models (LLMs). Instead of directly outputting a final answer to a complex question, the model is prompted to generate a series of intermediate, step-by-step reasoning steps that lead to the solution. For example, when asked a math word problem, a CoT-prompted model would first write out the steps to solve the problem before giving the final numerical answer. This process has been shown to significantly improve performance on tasks requiring logical deduction and planning. TrackVLA++ adapts this idea for a physical task with its Polar-CoT mechanism.

3.2. Previous Works

3.2.1. Decoupled vs. End-to-End Paradigms

Early approaches to EVT used a decoupled paradigm. This involved two separate modules:

  1. A perception module (often a visual foundation model like GroundingDINO) to detect and identify the target in the current camera view.
  2. A planning module (often trained with reinforcement learning) that takes the target's location from the perception module and decides how the robot should move.

The major drawback of this approach is that errors can propagate from the perception module to the planning module, and the information flow between them can become a bottleneck.

Recent works have shifted towards end-to-end VLA models that learn a single, unified policy from perception to action.

  • TrackVLA: The direct predecessor to TrackVLA++. It introduced a unified VLA framework that takes visual and language inputs and directly outputs a tracking trajectory. It demonstrated strong performance but, as noted, lacked explicit reasoning and memory.
  • LOVON: Another VLA model that uses a hierarchical approach. A high-level LLM acts as a planner, breaking down a complex instruction into simpler sub-tasks, which are then executed by a low-level motion controller. This introduces some planning structure but still lacks the fine-grained, step-by-step spatial reasoning of TrackVLA++.

3.2.2. Chain-of-Thought in Embodied AI

The idea of CoT has been explored in other embodied tasks, primarily in robotic manipulation. These works typically have the model generate intermediate representations such as:

  • High-level textual plans (e.g., "first, open the drawer; second, grab the cup").
  • Object coordinates or bounding boxes to specify where to look or grasp.
  • Subgoal images representing intermediate states.

The paper argues that these methods, while effective for static manipulation tasks, are too computationally intensive and slow for highly dynamic tasks like EVT, where real-time responsiveness is critical.

3.3. Technological Evolution

The field of EVT has evolved as follows:

  1. Classical Methods: Early works relied on traditional computer vision techniques for tracking (e.g., visual servoing) combined with classical motion planning.
  2. Decoupled Deep Learning: With the rise of deep learning, methods started using powerful visual foundation models for perception and reinforcement learning for planning, but kept the two stages separate.
  3. End-to-End VLA Models: The recent trend is to use large-scale VLA models (Uni-NaVid, TrackVLA) that learn a direct mapping from multimodal inputs (vision, language) to actions, leveraging the power of pre-trained models.
  4. VLAs with Reasoning and Memory: TrackVLA++ represents the next step in this evolution, augmenting the end-to-end VLA paradigm with explicit modules for spatial reasoning and long-term memory to handle more complex and realistic scenarios.

3.4. Differentiation Analysis

TrackVLA++ differentiates itself from prior work in two key ways:

  1. Nature of Reasoning: Unlike CoT methods in manipulation that produce verbose or computationally expensive intermediates (text plans, bounding boxes), TrackVLA++ introduces Polar-CoT, a highly efficient, agent-centric reasoning mechanism. It encodes the target's position into a single, compact token representing a polar coordinate sector. This is ideal for dynamic navigation as it's fast and directly useful for action prediction. It also elegantly handles multi-camera setups by providing a unified spatial representation, avoiding the complexities of merging multiple bounding box predictions.
  2. Reasoning-Gated Memory: While other models might use some form of memory (e.g., recurrent neural networks), the TIM module in TrackVLA++ is unique because its update mechanism is explicitly controlled by the output of the reasoning module. The memory is only updated when the model is confident in its reasoning about the target's location. This "confidence gating" is a robust defense against memory corruption from distractors or temporary target loss, a significant innovation for long-horizon tracking.

4. Methodology

4.1. Principles

The core principle of TrackVLA++ is to enhance a standard Vision-Language-Action (VLA) model with a structured, two-stage process that mimics intelligent tracking: first reason, then act. This is implemented through a feedback loop where an explicit spatial reasoning step guides both a long-term memory module and the final action generation. This design moves away from a purely reactive "black-box" model towards one that maintains and utilizes an internal state representing its understanding of the target's position and appearance over time.

4.2. Core Methodology In-depth

The overall architecture of TrackVLA++, shown in Figure 2, is an end-to-end VLA model that processes a stream of visual observations and a language instruction to predict a tracking trajectory.

Fig. 2: The pipeline of TrackVLA++. Given a video stream and a language instruction, TrackVLA++ encodes them (text encoder, visual encoder, grid pooling) into tokens, performs Polar-CoT reasoning to produce confidence, distance, and angle logits together with a target identification token, and predicts a tracking trajectory for long-horizon tracking.

4.2.1. Task Formulation

The Embodied Visual Tracking (EVT) task is formally defined as follows: At each time step $T$, the agent receives a language instruction $\mathcal{L}$ describing the target, along with a history of RGB observations $\{ \mathcal{O}_t^n \mid t = 1, \dots, T, \; n = 1, \dots, N \}$ from $N$ cameras. The goal is to predict a continuous tracking trajectory $\mathcal{W}_T = \{ w_1, w_2, \dots \}$, where each waypoint $w_i = (x, y, \theta) \in \mathbb{R}^3$ represents a target displacement $(x, y)$ and heading change $\theta$ in the agent's egocentric coordinate system. The task is successful if the agent maintains a specified following distance from the target.
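To make this interface concrete, below is a minimal sketch of the inputs and outputs described above. The Python names (EVTStep, observations, waypoints) are illustrative assumptions, not the paper's code.

```python
from dataclasses import dataclass, field
from typing import List, Tuple
import numpy as np

@dataclass
class EVTStep:
    """Hypothetical container for one EVT control step."""
    instruction: str                              # language description L of the target
    observations: List[np.ndarray]                # N egocentric RGB frames at the current timestep
    # Predicted trajectory W_T: each waypoint (x, y, theta) is a displacement and heading
    # change expressed in the agent's egocentric coordinate frame.
    waypoints: List[Tuple[float, float, float]] = field(default_factory=list)

step = EVTStep(
    instruction="follow the person in the red shirt",
    observations=[np.zeros((480, 640, 3), dtype=np.uint8)],   # single-camera setting, N = 1
    waypoints=[(0.3, 0.0, 0.0), (0.6, 0.05, 0.1), (0.9, 0.1, 0.2)],
)
print(len(step.waypoints), "waypoints predicted")
```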

4.2.2. Observation Encoding

The model first processes the incoming video stream to extract meaningful visual features.

  • Dual-Encoder Backbone: The model uses two powerful pre-trained visual encoders, SigLIP and DINOv2, to extract features from each image frame. The features from both are concatenated to create a rich visual representation.

  • Grid Pooling for Efficiency: To handle long video sequences without prohibitive computational cost, a grid pooling strategy is employed. This creates two different resolutions of features:

    • $V^{\mathrm{fine}} \in \mathbb{R}^{64 \times C}$: High-resolution features from the most recent frame, capturing fine-grained details needed for precise localization.
    • $V^{\mathrm{coarse}} \in \mathbb{R}^{4 \times C}$: Low-resolution features from historical frames, summarizing the past context in a compact form.
  • Dual-Memory Architecture: The model uses two types of memory:

    • Short-Term Memory: A sliding window of the last $k = 32$ frames forms the current visual feature sequence: $V_T^{\mathrm{track}} = \{ V_{T-k}^{\mathrm{coarse}}, ..., V_{T-1}^{\mathrm{coarse}}, V_T^{\mathrm{fine}} \}$.
    • Long-Term Memory: The proposed Target Identification Memory (TIM), denoted $M_T^{\mathrm{TIM}}$, stores a persistent representation of the target's appearance.
  • Projection to LLM Space: The short-term visual features $V_T^{\mathrm{track}}$ and the long-term memory features $M_T^{\mathrm{TIM}}$ are projected into the latent space of the Large Language Model (LLM) using a 2-layer MLP projector $\mathcal{P}(\cdot)$.

    $E_T^V = \mathcal{P}(V_T^{\mathrm{track}}), \quad E_T^M = \mathcal{P}(M_T^{\mathrm{TIM}})$

  • $E_T^V$: The embedded tokens for the current visual sequence.

  • $E_T^M$: The embedded tokens for the long-term target memory. A minimal sketch of this encoding pipeline follows the list.
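The sketch below approximates the dual-resolution grid pooling and the 2-layer MLP projection described above. Feature dimensions, the pooling layout, and the helper names (grid_pool, projector) are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

C, D_llm = 1024, 4096   # assumed visual feature dim (SigLIP + DINOv2 concat) and LLM hidden size

def grid_pool(patch_feats: torch.Tensor, out_tokens: int) -> torch.Tensor:
    """Pool a (P, C) grid of patch features down to (out_tokens, C) via average pooling."""
    side = int(patch_feats.shape[0] ** 0.5)             # assume a square patch grid, e.g. 24x24
    grid = patch_feats.T.reshape(1, C, side, side)       # (1, C, H, W)
    out_side = int(out_tokens ** 0.5)                    # 64 -> 8x8, 4 -> 2x2
    pooled = F.adaptive_avg_pool2d(grid, out_side)       # (1, C, out_side, out_side)
    return pooled.flatten(2).squeeze(0).T                # (out_tokens, C)

projector = nn.Sequential(nn.Linear(C, D_llm), nn.GELU(), nn.Linear(D_llm, D_llm))  # 2-layer MLP P(.)

# Short-term memory: k historical frames at coarse resolution + the current frame at fine resolution.
k = 32
history = [grid_pool(torch.randn(576, C), 4) for _ in range(k)]   # V^coarse, 4 tokens per frame
current = grid_pool(torch.randn(576, C), 64)                      # V^fine, 64 tokens
V_track = torch.cat(history + [current], dim=0)                   # (k*4 + 64, C)
E_V = projector(V_track)                                          # tokens in the LLM latent space
```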

4.2.3. Polar-CoT Reasoning Forwarding

This is the first key module, which introduces explicit spatial reasoning.

  • Concept: Instead of predicting bounding boxes, Polar-CoT discretizes the space around the agent into a polar grid. The agent's perceivable area is divided into sectors, each defined by a quantized range of angles $\theta$ and distances $d$. Each unique $(\theta, d)$ combination is mapped to a special token in the LLM's vocabulary.

  • Reasoning Process: The projected memory tokens $E_T^M$, visual tokens $E_T^V$, and the language instruction tokens $E^L$ are concatenated and fed into the LLM. The LLM's task is to predict a single reasoning token, $E_T^{\mathrm{CoT}}$, which corresponds to the polar coordinate sector where it believes the target is located.

  • Handling Occlusion: The vocabulary is extended with a special <invalid> token. The LLM predicts this token when it infers that the target is occluded or not visible.

  • Mathematical Formulation: The reasoning process is formulated as a next-token prediction task for the LLM:

    $E_T^{\mathrm{CoT}} = \mathrm{LLM}(\mathrm{Concat}[E_T^M, E_T^V, E^L])$
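To illustrate the discretization behind Polar-CoT, here is a hedged sketch of how a relative target position could be quantized into a single reasoning token. The bin counts, the maximum range, and the function name polar_cot_token are assumptions for illustration; the paper does not specify these exact values in the text summarized here.

```python
import math

N_ANGLE_BINS, N_DIST_BINS = 12, 4           # assumed polar-grid resolution
MAX_DIST = 8.0                              # assumed perceivable range in metres
INVALID_TOKEN = N_ANGLE_BINS * N_DIST_BINS  # extra <invalid> reasoning token for the occluded case

def polar_cot_token(x: float, y: float, visible: bool) -> int:
    """Map an egocentric target position (x, y) to a discrete Polar-CoT reasoning token id."""
    if not visible:
        return INVALID_TOKEN
    angle = math.atan2(y, x)                                   # (-pi, pi], 0 = straight ahead
    dist = min(math.hypot(x, y), MAX_DIST - 1e-6)
    a_bin = min(int((angle + math.pi) / (2 * math.pi) * N_ANGLE_BINS), N_ANGLE_BINS - 1)
    d_bin = int(dist / MAX_DIST * N_DIST_BINS)
    return a_bin * N_DIST_BINS + d_bin                         # one compact reasoning token

print(polar_cot_token(2.0, 0.5, visible=True))    # target roughly ahead, about 2 m away
print(polar_cot_token(0.0, 0.0, visible=False))   # occluded -> <invalid> token id
```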

4.2.4. Reasoning Feedback Memory Update (TIM)

The second key module is the Target Identification Memory (TIM), which maintains a stable representation of the target's visual identity. Its update mechanism is crucially guided by the confidence of the Polar-CoT reasoning.

  • Gated Update: At each timestep $T$, the TIM state $M_T^{\mathrm{TIM}}$ is updated from its previous state $M_{T-1}^{\mathrm{TIM}}$ by taking a weighted average with a new candidate feature $f_{T-1}$. The candidate feature $f_{T-1}$ is the visual feature from the high-resolution observation $V_{T-1}^{\mathrm{fine}}$ corresponding to the location predicted by the previous reasoning step $E_{T-1}^{\mathrm{CoT}}$.

    $M_T^{\mathrm{TIM}} = (1 - w_T) \cdot M_{T-1}^{\mathrm{TIM}} + w_T \cdot f_{T-1}$

  • Confidence-based Weight: The weight $w_T$ determines how much the new observation influences the memory. It is designed to be high when the model is confident and low when it is uncertain. This weight is calculated from the confidence score $C_{T-1}$ of the previous reasoning step, normalized against the historical average confidence $\overline{C_{T-2}}$:

    $w_T = \frac{C_{T-1}}{\overline{C_{T-2}} + C_{T-1}}, \quad \text{with} \quad \overline{C_{T-2}} = \frac{1}{T-2} \sum_{i=1}^{T-2} C_i$

  • Confidence Score Calculation: The confidence score $C_{T-1}$ quantifies the certainty of the reasoning token prediction. It is calculated using the normalized entropy of the probability distribution (logits $\mathbf{P}$) over the reasoning vocabulary. Entropy measures uncertainty; a sharp, confident distribution has low entropy, while a flat, uncertain one has high entropy.

    $C_{T-1} = 1 - \frac{H(\mathrm{softmax}(\mathbf{P}))}{\log K}$

    • $H(p) = -\sum_i p_i \log p_i$ is the Shannon entropy function.
    • $\mathbf{P}$ is the vector of logits (pre-softmax outputs) for the reasoning vocabulary.
    • $K$ is the size of the reasoning vocabulary (the number of polar sectors plus the <invalid> token).
    • The term $\log K$ normalizes the entropy to the range [0, 1].
    • Intuition: If the model is highly confident, the probability distribution will be peaky (near one-hot), the entropy will be close to 0, and $C_{T-1}$ will be close to 1. If the model is uncertain, the distribution will be near uniform, the entropy will approach its maximum ($\log K$), and $C_{T-1}$ will be close to 0.
  • Memory Freezing: If the predicted reasoning token is <invalid>, its confidence score $C_t$ is forced to 0. This makes the update weight $w_{t+1}$ zero, effectively freezing the memory and preserving the last known good representation of the target until it is confidently re-detected (see the sketch after this list).
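The confidence computation and gated update above can be sketched as follows. This is a simplified illustration with assumed tensor shapes and helper names (reasoning_confidence, tim_update), not the authors' code.

```python
import math
import torch
import torch.nn.functional as F

def reasoning_confidence(logits: torch.Tensor, pred_id: int, invalid_id: int) -> float:
    """C_{T-1} = 1 - H(softmax(P)) / log K, forced to 0 when the <invalid> token was predicted."""
    if pred_id == invalid_id:
        return 0.0                                             # freezes the memory update below
    probs = F.softmax(logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum()
    return float(1.0 - entropy / math.log(logits.numel()))     # numel() = K

def tim_update(M_prev: torch.Tensor, f_cand: torch.Tensor,
               C_prev: float, conf_history: list) -> torch.Tensor:
    """M_T = (1 - w_T) * M_{T-1} + w_T * f_{T-1}, with w_T normalized by historical confidence."""
    mean_hist = sum(conf_history) / max(len(conf_history), 1)  # running average of past confidences
    w = C_prev / (mean_hist + C_prev + 1e-8)                   # 0 when C_prev == 0 (memory frozen)
    return (1.0 - w) * M_prev + w * f_cand

# Example: a confident (peaky) prediction pulls the memory toward the new candidate feature.
logits = torch.tensor([8.0] + [0.0] * 48)                      # K = 49 reasoning tokens (assumed)
C = reasoning_confidence(logits, pred_id=0, invalid_id=48)
M = tim_update(torch.zeros(4, 1024), torch.ones(4, 1024), C, conf_history=[0.6, 0.7])
```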

4.2.5. Action Forwarding

After the reasoning and memory update steps, the model generates the final action.

  • Input Sequence: The LLM receives an augmented input sequence that now includes the newly generated reasoning token $E_T^{\mathrm{CoT}}$. This provides the model with the explicit spatial context it just reasoned about.

  • Action Token Prediction: The LLM predicts an action token $E_T^{\mathrm{pred}}$.

    $E_T^{\mathrm{pred}} = \mathrm{LLM}(\mathrm{Concat}[E_T^M, E_T^V, E^L, E_T^{\mathrm{CoT}}])$

  • Trajectory Decoding: This action token is then decoded by a simple MLP-based ActionHead into a sequence of future waypoints $\mathcal{W}_T$.

    $\mathcal{W}_T = \mathrm{ActionHead}(E_T^{\mathrm{pred}})$
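A minimal sketch of such an MLP action head is shown below. The hidden size, the number of waypoints, and the layer layout are assumptions consistent with the description above, not the paper's implementation.

```python
import torch
import torch.nn as nn

class ActionHead(nn.Module):
    """Decode the predicted action token into M future waypoints (x, y, theta); sizes are assumed."""
    def __init__(self, d_llm: int = 4096, hidden: int = 1024, num_waypoints: int = 8):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.mlp = nn.Sequential(
            nn.Linear(d_llm, hidden), nn.GELU(),
            nn.Linear(hidden, num_waypoints * 3),      # 3 values per waypoint: x, y, theta
        )

    def forward(self, action_token: torch.Tensor) -> torch.Tensor:
        return self.mlp(action_token).view(-1, self.num_waypoints, 3)

waypoints = ActionHead()(torch.randn(1, 4096))          # -> (1, 8, 3) egocentric waypoints
```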

4.2.6. Training Objective

The model is trained end-to-end using a composite loss function that supervises trajectory planning, reasoning, and general language understanding:

$\mathcal{L} = \mathcal{L}_{\mathrm{traj}} + \alpha \mathcal{L}_{\mathrm{reason}} + \beta \mathcal{L}_{\mathrm{text}}$

  • $\alpha$ and $\beta$ are hyperparameters balancing the loss components, set to 0.2 and 0.5, respectively.

  • Trajectory Loss ($\mathcal{L}_{\mathrm{traj}}$): This is the Mean Squared Error (MSE) between the predicted waypoints $\hat{w}_i$ and the ground truth expert waypoints $w_i^{\mathrm{gt}}$.

    $\mathcal{L}_{\mathrm{traj}} = \sum_{i=1}^{M} \mathrm{MSE}(\hat{w}_i, w_i^{\mathrm{gt}})$

    • $M$ is the number of predicted waypoints.
  • Reasoning Loss ($\mathcal{L}_{\mathrm{reason}}$): This is a standard cross-entropy loss that trains the model to predict the correct reasoning token $E_T^{\mathrm{CoT}}$ (i.e., the correct polar sector for the target).

    $\mathcal{L}_{\mathrm{reason}} = - \log \mathbf{P}(E_T^{\mathrm{CoT}} \mid \mathrm{Concat}[E_T^M, E_T^V, E^L])$

  • Text Prediction Loss ($\mathcal{L}_{\mathrm{text}}$): This is a vanilla language modeling loss applied during question-answering data training, which helps maintain the LLM's general language capabilities. A combined sketch of the full objective follows this list.
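Below is the combined sketch referenced above: a hedged Python rendering of the composite objective with the coefficients reported in the paper (alpha = 0.2, beta = 0.5). The function signature and reduction choices are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(pred_wp, gt_wp, reason_logits, reason_target,
                   text_logits=None, text_target=None, alpha=0.2, beta=0.5):
    # Trajectory loss: MSE per waypoint, summed over the M predicted waypoints.
    loss_traj = F.mse_loss(pred_wp, gt_wp, reduction="none").mean(dim=-1).sum()
    # Reasoning loss: cross-entropy on the Polar-CoT reasoning token.
    loss_reason = F.cross_entropy(reason_logits, reason_target)
    loss = loss_traj + alpha * loss_reason
    if text_logits is not None:                         # language-modeling term on QA batches only
        loss = loss + beta * F.cross_entropy(text_logits, text_target)
    return loss

# Placeholder tensors: 8 waypoints of (x, y, theta), 49 reasoning tokens (assumed vocabulary size).
loss = composite_loss(torch.randn(8, 3), torch.randn(8, 3),
                      torch.randn(1, 49), torch.tensor([3]))
```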

5. Experimental Setup

5.1. Datasets

The model is trained on a combination of tracking-specific data and general question-answering (QA) data.

  • Polar-CoT Tracking Data: A large-scale dataset of 1 million multi-view tracking samples was generated using the EVT-Bench training split in the Habitat 3.0 simulator.
    • Data Content: Each sample includes a history of multi-view RGB images, a language description of the target, the ground truth expert trajectory, and the novel Polar-CoT annotations.
    • Annotation Generation: Polar-CoT annotations were created by recording the target's ground truth relative angle and distance. The <invalid> flag was assigned if the target's semantic mask was smaller than 2,500 pixels, indicating it was too distant or occluded.
    • Data Augmentation: The authors introduced randomization in camera parameters (position, height, field-of-view) and camera views to improve generalization.
  • QA Data: To enhance the model's general visual recognition and language understanding, it was co-trained on 1 million QA samples, mixed in a 1:1 ratio with the tracking data. This included:
    • 294K person identification samples from SYNTH-PEDES.
    • 205K image-based QA samples.
    • 501K video-based QA samples from various public datasets.

5.2. Evaluation Metrics

The paper uses standard metrics for evaluating embodied tracking performance.

  • Success Rate (SR):
    1. Conceptual Definition: This metric measures the percentage of episodes in which the agent successfully completes the task. For tracking, success is typically defined as ending the episode within a certain distance threshold (e.g., 1-3 meters) of the target and with the correct orientation. It is the primary metric for overall task completion.
    2. Mathematical Formula: $ \mathrm{SR} = \frac{\text{Number of successful episodes}}{\text{Total number of episodes}} $
    3. Symbol Explanation: N/A.
  • Tracking Rate (TR):
    1. Conceptual Definition: This metric measures the fraction of timesteps during an episode where the agent is successfully tracking the target (i.e., staying within the predefined distance). It provides a more fine-grained measure of tracking quality than SR, as an agent could ultimately fail but still have a high TR for a large portion of the episode.
    2. Mathematical Formula: $ \mathrm{TR} = \frac{1}{N_{\text{episodes}}} \sum_{i=1}^{N_{\text{episodes}}} \frac{\text{Timesteps tracking successfully}_i}{\text{Total timesteps}_i} $
    3. Symbol Explanation: NepisodesN_{\text{episodes}} is the total number of evaluation episodes.
  • Collision Rate (CR):
    1. Conceptual Definition: This metric measures the percentage of episodes that are terminated because the agent collided with an obstacle. It evaluates the agent's ability to navigate safely. A lower CR is better.
    2. Mathematical Formula: $ \mathrm{CR} = \frac{\text{Number of episodes with collisions}}{\text{Total number of episodes}} $
    3. Symbol Explanation: N/A.
  • Episode Length (EL):
    1. Conceptual Definition: This metric measures the average number of steps an agent survives or successfully tracks before the episode terminates (either by success, failure, or timeout). In tracking tasks with a maximum episode length, a higher EL indicates more persistent tracking.
    2. Mathematical Formula: $ \mathrm{EL} = \frac{1}{N_{\text{episodes}}} \sum_{i=1}^{N_{\text{episodes}}} \text{Length of episode}_i $
    3. Symbol Explanation: N/A.
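As a concrete illustration of these definitions, here is a small sketch that computes all four metrics from per-episode logs. The record field names (success, collided, tracked_steps, total_steps) are assumptions, not the benchmark's actual API.

```python
from typing import Dict, List

def evaluate(episodes: List[dict]) -> Dict[str, float]:
    n = len(episodes)
    return {
        "SR": sum(e["success"] for e in episodes) / n,                            # Success Rate
        "TR": sum(e["tracked_steps"] / e["total_steps"] for e in episodes) / n,   # Tracking Rate
        "CR": sum(e["collided"] for e in episodes) / n,                           # Collision Rate
        "EL": sum(e["total_steps"] for e in episodes) / n,                        # Episode Length
    }

print(evaluate([
    {"success": True,  "collided": False, "tracked_steps": 480, "total_steps": 500},
    {"success": False, "collided": True,  "tracked_steps": 120, "total_steps": 300},
]))
```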

5.3. Baselines

The paper compares TrackVLA++ against a comprehensive set of baselines representing different approaches:

  • Classical/Decoupled Methods: IBVS, PoliFormer, EVT, AD-VAT, TS. These methods typically separate perception and planning.

  • End-to-End VLA Models: Uni-NaVid and TrackVLA (the direct predecessor).

  • Foundation Models: NavFoM (the base model upon which TrackVLA++ is built), which is a powerful navigation foundation model.

  • VLM/LLM-based Recognition Models: RexSeek, LISA++, SoM + GPT-4o. These are used for the visual recognition benchmark to compare fine-grained identification capabilities.

These baselines are representative as they cover the spectrum from older, decoupled systems to the latest large-scale VLA and foundation models, providing a robust context for evaluating the contributions of TrackVLA++.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. Performance on EVT-Bench

The main simulation results are presented on the EVT-Bench benchmark, which includes three challenging splits: Single-Target Tracking (STT), Distracted Tracking (DT), and Ambiguity Tracking (AT).

The following are the results from Table I of the original paper:

| Methods | STT SR↑ | STT TR↑ | STT CR↓ | DT SR↑ | DT TR↑ | DT CR↓ | AT SR↑ | AT TR↑ | AT CR↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IBVS† [51] | 42.9 | 56.2 | 3.75 | 10.6 | 28.4 | 6.14 | 15.2 | 39.5 | 4.90 |
| PoliFormer† [35] | 4.67 | 15.5 | 40.1 | 2.62 | 13.2 | 44.5 | 3.04 | 15.4 | 41.5 |
| EVT [6] | 24.4 | 39.1 | 42.5 | 3.23 | 11.2 | 47.9 | 17.4 | 21.1 | 45.6 |
| EVT [6] | 32.5 | 49.9 | 40.5 | 15.7 | 35.7 | 53.3 | 18.3 | 21.0 | 44.9 |
| Uni-NaVid [10] | 25.7 | 39.5 | 41.9 | 11.3 | 27.4 | 43.5 | 8.26 | 28.6 | 43.7 |
| TrackVLA [12] | 85.1 | 78.6 | 1.65 | 57.6 | 63.2 | 5.80 | 50.2 | 63.7 | 17.1 |
| NavFoM [44] (single view) | 85.0 | 80.5 | - | 61.4 | 68.2 | - | - | - | - |
| Ours (single view) | 86.0 | 81.0 | 2.10 | 66.5 | 68.8 | 4.71 | 51.2 | 63.4 | 15.9 |
| NavFoM [44] (four views) | 88.4 | 80.7 | - | 62.0 | 67.9 | - | - | - | - |
| Ours (four views) | 90.9 | 82.7 | 1.50 | 74.0 | 73.7 | 3.51 | 55.9 | 63.8 | 15.1 |

(STT = Single-Target Tracking, DT = Distracted Tracking, AT = Ambiguity Tracking.)

Analysis:

  • TrackVLA++ establishes a new state-of-the-art across all tasks and settings (single-view and four-view).
  • The most impressive gain is on the Distracted Tracking (DT) split. In the four-view setting, TrackVLA++ achieves a Success Rate (SR) of 74.0%, a 12-point absolute improvement over the strong NavFoM baseline (62.0%); in the single-view setting it improves over NavFoM by 5.1 points (66.5% vs. 61.4%). This directly validates the core hypothesis of the paper: the explicit reasoning and memory mechanisms are highly effective at handling visually similar distractors.
  • The model also shows significant improvements in Collision Rate (CR), indicating that better reasoning about the target's location also leads to safer navigation.

6.1.2. Zero-shot Performance on Gym-UnrealCV

To test generalization to unseen environments, the model was evaluated on the Gym-UnrealCV benchmark in a zero-shot manner.

The following are the results from Table II of the original paper:

| Methods | Single Target EL↑ | Single Target SR↑ | Distractor EL↑ | Distractor SR↑ | Unseen Objects EL↑ | Unseen Objects SR↑ |
| --- | --- | --- | --- | --- | --- | --- |
| DiMP [55] | 367 | 0.58 | 309 | 0.27 | - | - |
| SARL [33] | 394 | 0.57 | 240 | 0.14 | - | - |
| AD-VAT [3] | 416 | 0.62 | 220 | 0.12 | - | - |
| AD-VAT+ [56] | 454 | 0.76 | 224 | 0.12 | - | - |
| TS [36] | 474 | 0.86 | 371 | 0.48 | - | - |
| EVT [6] | 490 | 0.95 | 459 | 0.81 | 480 | 0.96 |
| TrackVLA [12] | 500 | 1.00 | 474 | 0.91 | 500 | 1.00 |
| Ours† | 500 | 1.00 | 484 | 0.92 | 500 | 1.00 |

Analysis:

  • TrackVLA++ achieves perfect scores on the Single Target and Unseen Objects tasks, demonstrating flawless generalization.
  • Crucially, in the most challenging Distractor task, TrackVLA++ outperforms the previous best model, TrackVLA, achieving a higher SR (0.92 vs. 0.91) and a longer EL (484 vs. 474). This further confirms its superior ability to handle distractors.

6.1.3. Performance on Visual Recognition

This experiment isolates and evaluates the model's fine-grained recognition ability.

The following are the results from Table III of the original paper:

| Methods | ACC (%) ↑ | FPS ↑ |
| --- | --- | --- |
| RexSeek [53] | 54.3 | 1.1 |
| LISA++ [54] | 78.2 | 0.6 |
| SoM [49] + GPT-4o [50] | 82.4 | 0.1 |
| TrackVLA [12] | 80.7 | 10 |
| NavFoM [44] | 84.0 | 5.1 |
| Ours† w/o Polar-CoT | 83.0 | 5.2 |
| Ours† | 87.5 | 4.8 |

Analysis:

  • TrackVLA++ achieves a state-of-the-art accuracy of 87.5%, outperforming powerful baselines including those using GPT-4o. This shows that the internal reasoning process helps the model make more accurate identifications.
  • The comparison between Ours† (with Polar-CoT) and Ours† w/o Polar-CoT is very telling: adding the Polar-CoT reasoning module improves recognition accuracy from 83.0% to 87.5% with only a minor drop in inference speed (5.2 FPS to 4.8 FPS). This demonstrates that the reasoning module enhances perception effectively and efficiently.
  • The model maintains a near real-time inference speed of 4.8 FPS, making it practical for real-world deployment, unlike the much slower GPT-based methods (0.1 FPS).

6.1.4. Real-World Results

The paper demonstrates the model's sim-to-real transfer capabilities on a quadruped robot in three challenging scenarios.

Fig. 5: Visualizations of the Real-World Experiments. TrackVLA++ is evaluated on three different tasks: Obstacle, Winding Path, and Distractor, with a comparison of success rates between TrackVLA and TrackVLA++ highlighting the improved performance.

Analysis:

  • Obstacle: The target is occluded by large obstacles. TrackVLA++ outperforms TrackVLA by 14% in success rate, highlighting the effectiveness of the TIM module in maintaining target identity during occlusion.
  • Winding Path: The target follows a complex path. TrackVLA++ shows a 7% improvement, indicating better tracking fidelity.
  • Distractor: A human distractor attempts to confuse the robot. TrackVLA++ achieves a 17% higher success rate, showcasing its enhanced robustness to interference, a direct result of the Polar-CoT and TIM modules.

6.2. Ablation Studies / Parameter Analysis

An ablation study was conducted on the challenging DT split of EVT-Bench to isolate the contributions of the new modules.

The following are the results from Table IV of the original paper:

| Methods | SR ↑ | TR ↑ | CR ↓ |
| --- | --- | --- | --- |
| TrackVLA [12] | 57.6 | 63.2 | 5.80 |
| NavFoM (four views) | 62.0 | 67.9 | - |
| TrackVLA++ (Ours) | 74.0 | 73.7 | 3.51 |
| w/o Polar-CoT & TIM | 65.2 | 64.8 | 8.17 |
| w/o TIM | 71.2 | 69.8 | 4.74 |
| w/ TIM (16 tokens) | 74.2 (+0.2) | 73.4 (-0.3) | 3.27 (-0.24) |

All results are on the Distracted Tracking (DT) split of EVT-Bench.

Analysis:

  • w/o Polar-CoT & TIM vs. w/o TIM: This comparison shows the impact of adding only the Polar-CoT reasoning module. The SR jumps from 65.2% to 71.2%, a +6.0% improvement. This confirms that explicit spatial reasoning is highly beneficial.
  • w/o TIM vs. Full Model: This shows the impact of adding the TIM module on top of Polar-CoT. The SR increases further from 71.2% to 74.0%, a +2.8% improvement. This demonstrates that the reasoning-gated memory provides a significant, complementary benefit.
  • w TIM (16 tokens): The authors tested increasing the size of the TIM memory from 4 tokens to 16. The performance gain was negligible (SR +0.2%). This is a very interesting result, suggesting that a compact memory representation is sufficient and that the model's strength comes from the intelligent gating mechanism rather than the sheer capacity of the memory. This highlights the design's efficiency.

7. Conclusion & Reflections

7.1. Conclusion Summary

The paper successfully introduces TrackVLA++, a VLA model that significantly advances the state-of-the-art in Embodied Visual Tracking. By integrating two novel components, Polar-CoT for efficient spatial reasoning and TIM for robust, confidence-gated long-term memory, the model effectively addresses the critical challenges of severe occlusion and visually similar distractors. The extensive experimental results on both simulation benchmarks and real-world robots provide strong evidence for the model's superior performance, data efficiency, and generalization capabilities. TrackVLA++ demonstrates that incorporating structured reasoning and memory is a promising direction for building more intelligent and reliable embodied agents.

7.2. Limitations & Future Work

The paper does not explicitly state its limitations, but we can infer some potential areas for future work:

  • Reasoning Granularity: The Polar-CoT uses a discretized grid. While efficient, this might lack the precision needed for tasks requiring very fine-grained spatial understanding or manipulation of the target. Future work could explore adaptive or continuous spatial representations.
  • Interpretability: While Polar-CoT provides more insight into the model's internal state than a pure black-box model, the reasoning process within the LLM itself is still not fully interpretable.
  • Computational Cost: The real-world deployment relies on a remote server with a high-end GPU. For full autonomy, the model would need to be optimized or compressed to run on-board the robot, which remains a significant challenge.
  • Predictive Reasoning: The current model reasons about the target's current position. A more advanced system could reason about the target's intentions or predict its future trajectory, enabling more proactive and smoother tracking behavior.

7.3. Personal Insights & Critique

  • Strengths:

    • The Polar-CoT mechanism is an elegant and highly effective solution for injecting spatial priors into a VLA model for a dynamic task. Its agent-centric nature and computational efficiency are major advantages over more common but cumbersome methods like bounding box prediction, especially in multi-camera settings.
    • The confidence-gated update for the TIM is a simple yet powerful concept. It directly tackles the problem of memory corruption, a common failure point in long-horizon tasks. Linking the memory update to the reasoning module's confidence creates a virtuous cycle of reasoning and remembering.
    • The paper provides a very strong and comprehensive evaluation, including multiple benchmarks, zero-shot tests, real-world deployments, and a thorough ablation study. This leaves little doubt about the effectiveness of the proposed components.
  • Potential Areas for Improvement/Critique:

    • Failure Case Analysis: The paper focuses on showcasing success, but a detailed analysis of the remaining failure cases would be highly valuable. Understanding why and when TrackVLA++ still fails could provide crucial insights for the next generation of models. For example, does it fail due to perception errors (misidentifying the target despite the memory), reasoning errors (incorrectly localizing a correctly identified target), or planning errors (choosing a poor path)?
    • Real-World Quantitative Evaluation: The real-world experiments are presented as demonstrations with overall success rates. While compelling, a more rigorous quantitative analysis using standard robotics metrics (e.g., Average Tracking Error, trajectory similarity) would further strengthen the claims of real-world robustness.
    • Complexity of Target Descriptions: The experiments seem to focus on tracking targets based on visual appearance ("person in red shirt"). It would be interesting to see how the model performs with more complex, relational, or action-based descriptions (e.g., "follow the person who just opened the door"). This would test the limits of the VLM's language understanding in the context of a dynamic task.
