VGGT: Visual Geometry Grounded Transformer
TL;DR Summary
VGGT is a feed-forward neural network that directly infers 3D scene attributes from one or multiple views, achieving state-of-the-art results in various 3D tasks while enhancing downstream task performance.
Abstract
We present VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from one, a few, or hundreds of its views. This approach is a step forward in 3D computer vision, where models have typically been constrained to and specialized for single tasks. It is also simple and efficient, reconstructing images in under one second, and still outperforming alternatives that require post-processing with visual geometry optimization techniques. The network achieves state-of-the-art results in multiple 3D tasks, including camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. We also show that using pretrained VGGT as a feature backbone significantly enhances downstream tasks, such as non-rigid point tracking and feed-forward novel view synthesis. Code and models are publicly available at https://github.com/facebookresearch/vggt.
In-depth Reading
English Analysis
1. Bibliographic Information
1.1. Title
The title is VGGT: Visual Geometry Grounded Transformer. The central topic of the paper is a novel feed-forward neural network, VGGT, designed to infer a comprehensive set of 3D scene attributes directly from images.
1.2. Authors
The authors of this paper are:
- Jianyuan Wang
- Minghao Chen
- Nikita Karaev
- Andrea Vedaldi
- Christian Rupprecht
- David Novotny

Their affiliations are:
- Visual Geometry Group, University of Oxford (Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht)
- Meta AI (Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny)
This indicates a strong collaboration between a renowned academic institution for computer vision (Visual Geometry Group at Oxford) and a leading industry research lab (Meta AI), suggesting expertise in both fundamental research and large-scale practical applications of deep learning and computer vision.
1.3. Journal/Conference
The paper was first released (UTC) on 2025-03-14T17:59:47. The specific journal or conference is not stated in the provided text; given the 2025 date, it is either a very recent preprint or a publication accepted at a top-tier computer vision venue (e.g., CVPR, ICCV, ECCV) or journal. These venues are highly reputable and influential in computer vision and machine learning.
1.4. Publication Year
2025.
1.5. Abstract
The paper introduces VGGT, a feed-forward neural network that directly infers all key 3D attributes of a scene, including camera parameters, point maps, depth maps, and 3D point tracks, from a variable number of input views (one to hundreds). This approach represents a significant advancement in 3D computer vision by moving away from models specialized for single tasks, offering a unified solution. VGGT is characterized by its simplicity and efficiency, capable of reconstructing images in under one second, yet it outperforms traditional optimization-based alternatives that require computationally expensive post-processing. The network achieves state-of-the-art results across multiple 3D tasks: camera parameter estimation, multi-view depth estimation, dense point cloud reconstruction, and 3D point tracking. Furthermore, the paper demonstrates that pretrained VGGT as a feature backbone significantly enhances downstream tasks such as non-rigid point tracking and feed-forward novel view synthesis. The code and models are publicly available.
1.6. Original Source Link
The official source is a preprint on arXiv: https://arxiv.org/abs/2503.11651.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper aims to solve is the efficient and comprehensive estimation of diverse 3D attributes of a scene from multiple images using a single, unified feed-forward neural network.
This problem is important because:
- Traditional 3D reconstruction methods, such as Structure-from-Motion (SfM) and Bundle Adjustment (BA), rely heavily on iterative optimization. While powerful, these methods are computationally expensive and complex, increasing both runtime and pipeline complexity.
- Machine learning in 3D computer vision has often been applied to specialized sub-tasks (e.g., feature matching, monocular depth prediction) or integrated into traditional pipelines (e.g., VGGSfM, which combines ML and geometry via differentiable BA). A truly unified, neural-first approach that eschews geometry post-processing has been lacking.
- Prior neural networks for 3D tasks, such as DUSt3R and MASt3R, typically handle only pairs of images and still require costly post-processing (e.g., fusing pairwise reconstructions, global alignment) to reconstruct larger scenes or achieve optimal results. Other large 3D neural networks such as DepthAnything, MoGe, and LRM focus on single 3D tasks rather than a comprehensive suite of attributes.

The paper's entry point and innovative idea is to ask whether a powerful neural network can directly infer all key 3D attributes (camera parameters, point maps, depth maps, and 3D point tracks) from a variable number of input views (from one to hundreds) in a single feed-forward pass, without relying on iterative visual geometry optimization. This aims to simplify the pipeline, increase efficiency, and potentially improve generalization.
2.2. Main Contributions / Findings
The paper makes several primary contributions:
- Introduction of VGGT: A large feed-forward transformer that, given one to hundreds of images of a scene, predicts all its key 3D attributes (camera intrinsics and extrinsics, point maps, depth maps, and 3D point tracks) in seconds. This unified approach moves away from specialized, single-task models.
- State-of-the-Art Feed-Forward Performance: VGGT's predictions are directly usable and highly competitive, often outperforming state-of-the-art methods that rely on slow post-processing optimization techniques (e.g., Bundle Adjustment, global alignment). This highlights the effectiveness of a purely neural approach to 3D reconstruction.
- Enhanced Performance with Optional Optimization: When combined with Bundle Adjustment (BA) post-processing, VGGT achieves even stronger state-of-the-art results across all evaluated 3D tasks, often substantially improving quality. Importantly, VGGT's direct predictions provide excellent initializations for BA, making the combined process significantly faster than traditional BA pipelines.
- Versatile Feature Backbone: The features learned by pretrained VGGT significantly enhance downstream tasks, with demonstrated improvements in non-rigid point tracking and feed-forward novel view synthesis. This suggests VGGT can serve as a powerful foundation model for a wide range of computer vision applications.
- Public Availability: The code and models are released publicly, facilitating further research and benefiting the computer vision community.

The key conclusion is that a large transformer, trained on diverse 3D-annotated data, can learn to directly infer a comprehensive set of interrelated 3D scene properties in a fast, feed-forward manner while achieving state-of-the-art results. This addresses the computational complexity and task specialization of traditional 3D computer vision, offering a more general and efficient solution.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the VGGT paper, a beginner should understand the following foundational concepts:
- 3D Attributes of a Scene: Properties that describe a scene in three dimensions.
  - Camera Parameters: Define how a 3D scene is projected onto a 2D image.
    - Intrinsics: Internal camera properties, such as focal length (how much the lens magnifies the scene), principal point (where the optical axis intersects the image plane), and distortion coefficients (how the lens deforms the image).
    - Extrinsics: External camera properties, specifying the camera's position and orientation (pose) in a 3D world coordinate system. This includes a rotation (how the camera is oriented) and a translation (where the camera is located).
  - Depth Map: For each pixel in a 2D image, a depth map stores the distance from the camera to the corresponding 3D point in the scene. It is often visualized as a grayscale image where pixel intensity represents depth.
  - Point Map: A point map assigns to each pixel in a 2D image the 3D coordinates (X, Y, Z) of the scene point that projects onto that pixel. Unlike a depth map, it directly provides 3D coordinates, typically in a common reference frame (e.g., the first camera's frame).
  - 3D Point Tracks: Sequences of 2D pixel locations across multiple images (or video frames) that correspond to the same physical 3D point in the scene. Tracking a point means finding its corresponding location in every other relevant image.
  - Point Cloud: A set of data points in a three-dimensional coordinate system, representing the external surface of an object or environment. Dense point cloud reconstruction aims to generate a highly detailed point cloud from multiple images.
- Neural Networks: Computational models inspired by the structure and function of biological neural networks.
  - Feed-forward Neural Network: A network in which connections between nodes do not form cycles. Information flows in one direction only, from the input layer through any hidden layers to the output layer, with no loops or recurrent connections.
  - Transformer: A neural network architecture introduced in 2017, initially for natural language processing (NLP) tasks. It revolutionized sequence processing by relying entirely on attention mechanisms instead of recurrent neural networks (RNNs) or convolutional neural networks (CNNs).
    - Attention Mechanism: A core component of Transformers. It allows the model to weigh the importance of different parts of the input when processing each element; instead of processing tokens sequentially, attention considers all tokens simultaneously and focuses on the most relevant ones.
    - Self-Attention: An attention mechanism in which a sequence attends to itself, letting the model capture relationships between tokens within the same sequence. In a sentence, for example, it can determine which words are most relevant to understanding a particular word.
    - Multi-Head Attention: An extension of self-attention that runs several attention mechanisms in parallel and concatenates their outputs, allowing the model to jointly attend to information from different representation subspaces at different positions.
    - Tokens: Discrete units into which input data (images or text) is broken for a Transformer. For images, these are typically small patches that are flattened and embedded into vector representations.
- Computer Vision Tasks:
  - Structure-from-Motion (SfM): Reconstructing the 3D structure of a scene and the camera poses (positions and orientations) from a series of overlapping 2D images. It typically yields a sparse point cloud.
  - Multi-view Stereo (MVS): Builds upon SfM by using known camera parameters and multiple images to generate a dense 3D reconstruction (a dense point cloud or depth maps) of the scene.
  - Bundle Adjustment (BA): A non-linear optimization technique used in SfM and MVS to jointly refine the estimated 3D structure and camera parameters. It minimizes the reprojection error (the difference between observed image points and the projections of the corresponding 3D points) across all images, and it is iterative and computationally intensive.
  - Image Matching: Finding corresponding points (or features) between two or more images that depict the same scene or object from different viewpoints; a prerequisite for SfM and MVS.
  - Tracking-Any-Point: Given a query point in one image (or frame), track its 2D location across all other images/frames in a sequence, even under challenging conditions such as occlusions or dynamic motion.
  - Novel View Synthesis: Generating new images of a scene from arbitrary, unseen camera viewpoints, given a set of input images.
3.2. Previous Works
The paper contextualizes VGGT against several important prior works, which can be broadly categorized into traditional visual geometry methods, deep learning-enhanced geometry pipelines, and end-to-end neural 3D models.
- Traditional Structure-from-Motion (SfM):
  - COLMAP [94]: A widely used, open-source framework that embodies the traditional SfM pipeline, consisting of stages such as image matching, triangulation (estimating 3D points from 2D correspondences), and bundle adjustment. While robust, it is iterative and can be slow.
  - Relevance to VGGT: VGGT aims to bypass the complexity and computational cost of such multi-stage, optimization-heavy pipelines by directly inferring all 3D attributes in a feed-forward manner.
- Deep Learning-Enhanced SfM:
  - VGGSfM [125]: Integrates machine learning and visual geometry end-to-end via differentiable Bundle Adjustment, and showed competitive performance against traditional SfM on challenging phototourism scenarios.
  - Relevance to VGGT: VGGSfM still relies heavily on Bundle Adjustment (even if differentiable). VGGT goes a step further by removing the need for such iterative post-optimization almost entirely in its primary mode, although combining it with BA can further boost its already strong performance. VGGSfM's camera parametrization is also adopted by VGGT.
- Deep Learning for Multi-view Stereo (MVS):
  - DUSt3R [129] and MASt3R [62]: Recent methods that directly estimate aligned dense point clouds from a pair of views without requiring camera parameters as input. For multiple images, however, they rely on post-processing (fusing pairwise reconstructions, global alignment) to obtain a coherent 3D scene.
  - Relevance to VGGT: VGGT is explicitly designed to handle multiple images (from a few to hundreds) in a single feed-forward pass, directly predicting globally consistent 3D attributes. This is a substantial departure from DUSt3R/MASt3R, which are constrained to pairwise processing followed by costly post-optimization; VGGT claims to outperform them by a large margin.
  - Concurrent works [111, 127, 141, 156]: These explore replacing DUSt3R's test-time optimization with neural networks but often achieve suboptimal or merely comparable performance to DUSt3R. VGGT claims significant performance advantages over these as well.
- Deep Learning for Tracking-Any-Point:
  - PIPs [44], TAP-Vid [23], TAPIR [24], CoTracker [55, 56, 57], DOT [60], TAPTR [63], LocoTrack [13]: Specialized methods for point tracking across video sequences, often handling dynamic motion and occlusions. CoTracker is particularly notable for exploiting correlations between points.
  - Relevance to VGGT: VGGT integrates a tracking head that uses features learned by its backbone. Coupled with an existing point tracker (specifically CoTracker2), these features yield state-of-the-art tracking performance, even though VGGT is not specialized solely for this task.
- Large Vision Models / Transformers:
  - DINO [10, 78]: A self-supervised vision transformer that learns robust visual features; VGGT uses DINO to patchify input images into tokens.
  - DPT [87]: Vision Transformers for dense prediction, often used for monocular depth estimation; VGGT uses a DPT layer for its dense prediction heads (depth, point maps, tracking features).
  - GPTs [1, 29, 148], CLIP [86], Stable Diffusion [34]: General-purpose large models that serve as versatile backbones.
  - DepthAnything [142], MoGe [128], LRM [49]: Recent large 3D neural networks that typically focus on a single 3D task (e.g., monocular depth or novel view synthesis).
  - Relevance to VGGT: VGGT is built in the same mold as these large, versatile transformer models, aiming to be a unified backbone for multiple 3D computer vision tasks rather than a single one.
Attention Mechanism (A Core Prerequisite)
Since VGGT is a Transformer, understanding the attention mechanism is crucial. The original Transformer introduced Scaled Dot-Product Attention, defined as:
$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$
Where:
- $Q$ (Query): A matrix representing the queries, derived from the input tokens. Each row is a query vector.
- $K$ (Key): A matrix representing the keys, derived from the input tokens. Each row is a key vector.
- $V$ (Value): A matrix representing the values, derived from the input tokens. Each row is a value vector.
- $QK^\top$: The dot products of queries and keys, measuring the similarity between each query and all keys.
- $\sqrt{d_k}$: The square root of the key dimension $d_k$. This scaling factor prevents the dot products from becoming too large, which would push the softmax into regions with very small gradients and make training difficult.
- $\mathrm{softmax}$: Converts a vector of scores into a probability distribution (values between 0 and 1 that sum to 1), determining how much attention to pay to each value.
- The entire operation produces a weighted sum of the value vectors, where the weights are the attention scores.
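For intuition, here is a minimal NumPy sketch of scaled dot-product attention (illustrative only; VGGT's actual implementation uses standard transformer attention layers):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Minimal scaled dot-product attention.

    Q: (n_queries, d_k), K: (n_keys, d_k), V: (n_keys, d_v).
    Returns the attended values of shape (n_queries, d_v).
    """
    d_k = Q.shape[-1]
    # Similarity between every query and every key, scaled by sqrt(d_k).
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key dimension turns scores into attention weights.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    # Weighted sum of the value vectors.
    return weights @ V

# Toy usage: 4 tokens with 8-dimensional keys/queries and 16-dimensional values.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((4, 8))
V = rng.standard_normal((4, 16))
print(scaled_dot_product_attention(Q, K, V).shape)  # (4, 16)
```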
3.3. Technological Evolution
The field of 3D computer vision has seen a progression from purely geometric, optimization-based methods to increasingly neural network-driven approaches:
- Traditional Geometry (Pre-2010s): Methods like COLMAP for SfM and MVS relied on handcrafted features, geometric constraints, and iterative optimization (e.g., Bundle Adjustment). These were robust but computationally expensive and often struggled with challenging image conditions (e.g., low texture).
- Hybrid Approaches (2010s-Early 2020s): Deep learning began to improve individual components of SfM pipelines (e.g., keypoint detection with SuperPoint [21], feature matching with SuperGlue [92]). Later, methods like VGGSfM [125] integrated ML and geometry more tightly via differentiable Bundle Adjustment, pushing performance further while retaining a geometric core.
- Neural-First, Pairwise Approaches (Early 2020s): Models like DUSt3R [129] began to directly predict 3D scene information (e.g., point maps) from image pairs using neural networks, reducing the initial reliance on explicit camera geometry. However, reconstructing full scenes from many images still required complex post-processing steps.
- Unified, Multi-view Neural-First (Current Work - VGGT): VGGT represents the latest step in this evolution, moving to a single feed-forward transformer that processes many views directly and outputs all key 3D attributes (camera parameters, depth maps, point maps, tracks) without iterative optimization in its core inference. It aims for both superior performance and efficiency, even outperforming methods that use post-optimization.
3.4. Differentiation Analysis
Compared to the main methods in related work, VGGT offers several core differences and innovations:
- From Optimization-Dependent to Feed-Forward:
  - Traditional SfM (e.g., COLMAP) and MVS: Heavily rely on iterative geometric optimization (Bundle Adjustment), which is slow and complex.
  - VGGSfM: While deep learning-enhanced, it still uses differentiable Bundle Adjustment, an optimization process.
  - DUSt3R/MASt3R: Produce pairwise 3D reconstructions and require costly global alignment or fusion steps for multi-view consistency.
  - VGGT's innovation: Directly infers all 3D attributes in a single feed-forward pass, often outperforming these optimization-based alternatives in accuracy and dramatically in speed, making 3D reconstruction usable in near-real-time applications.
- From Single-Task Specialization to Unified Multi-Task Prediction:
  - Dedicated 3D models (e.g., DepthAnything for monocular depth, LRM for single-image to 3D, CoTracker for tracking) focus on specialized individual tasks.
  - VGGT's innovation: Uses a shared backbone to simultaneously predict camera parameters, depth maps, point maps, and 3D point tracks. The paper shows that this multi-task learning enhances overall accuracy, suggesting benefits from learning the inherent interrelationships between these 3D quantities.
- Multi-view Handling:
  - DUSt3R/MASt3R: Primarily designed for two-view input, requiring external processes for multi-view consistency.
  - VGGT's innovation: Inherently supports one, a few, or hundreds of input views directly within its transformer architecture, generating a globally consistent 3D reconstruction in a single pass.
- Architectural Simplicity and Generalization:
  - VGGT is based on a fairly standard large transformer [119] with minimal explicit 3D inductive biases (apart from Alternating-Attention), trained on a diverse collection of 3D-annotated datasets. This contrasts with the highly specialized architectures of some prior works.
  - The use of Alternating-Attention (frame-wise and global) is a key architectural choice for efficiently integrating information across frames.
  - VGGT shows strong generalization to unseen datasets (e.g., RealEstate10K) and challenging in-the-wild scenes, demonstrating its robustness.
4. Methodology
4.1. Principles
The core idea behind VGGT is to leverage the power of a large Transformer architecture to directly learn the complex mappings from raw image pixels to various 3D scene attributes. Instead of relying on traditional, iterative visual geometry optimization techniques (like Bundle Adjustment) or breaking down the problem into specialized sub-tasks, VGGT aims to infer camera parameters, depth maps, point maps, and 3D point tracks in a single, efficient feed-forward pass. The underlying intuition is that with sufficient data and model capacity, a Transformer can implicitly learn the geometric principles governing 3D scenes and camera projections, making explicit optimization unnecessary for primary inference. The model is designed with minimal 3D inductive biases, allowing it to learn directly from a vast amount of 3D-annotated data, similar to how large language models learn from text. The Alternating-Attention mechanism is introduced to balance information integration within individual images and across multiple images efficiently.
4.2. Core Methodology In-depth (Layer by Layer)
The VGGT model is a large transformer that takes a set of images as input and produces various 3D quantities as output. The overall architecture is shown in the following figure (Image 2 from the original paper).
Figure 2. Schematic diagram illustrating the structure of the VGGT model, including input, global and frame attention mechanisms, and the camera head processing, ultimately outputting 3D attributes such as camera parameters, depth maps, point maps, and tracks.
4.2.1. Problem Definition and Notation
The input to the VGGT model is a sequence of $N$ RGB images $(I_i)_{i=1}^{N}$, where each image $I_i \in \mathbb{R}^{3 \times H \times W}$ (3 color channels, height $H$, width $W$). The transformer $f$ maps this sequence to a corresponding set of 3D annotations, one per frame:
$$f\big((I_i)_{i=1}^{N}\big) = \big(\mathbf{g}_i, D_i, P_i, T_i\big)_{i=1}^{N}$$
Here's a breakdown of each output component:
- Camera Parameters ($\mathbf{g}_i$): For each image $I_i$, the model predicts its camera parameters $\mathbf{g}_i$. Following the parametrization of VGGSfM [125], $\mathbf{g}_i$ is the concatenation of:
  - A rotation quaternion: Represents the 3D orientation of the camera relative to a reference frame. A quaternion encodes 3D rotations compactly and avoids the gimbal lock of Euler angles.
  - A translation vector: Represents the 3D position of the camera in the reference frame.
  - A field of view: Represents the camera's focal length in the horizontal and vertical directions. The paper assumes the camera's principal point (the projection of the optical center onto the image plane) is at the image center, a common simplification in SfM frameworks.
- Depth Map ($D_i$): For each pixel location $\mathbf{y}$ in image $I_i$, the depth map associates it with its depth value $D_i(\mathbf{y})$, the distance from the $i$-th camera to the 3D point observed at pixel $\mathbf{y}$.
- Point Map ($P_i$): For each pixel $\mathbf{y}$, the point map associates it with its corresponding 3D scene point $P_i(\mathbf{y}) \in \mathbb{R}^3$. Importantly, similar to DUSt3R [129], these point maps are viewpoint invariant: all 3D points are defined in the coordinate system of the first camera $\mathbf{g}_1$, which serves as the global world reference frame.
- Tracking Features ($T_i$): The transformer outputs a grid of $C$-dimensional features for each image. These are not tracks themselves but dense features consumed by a separate tracking module.
  - Tracking Module ($\mathcal{T}$): Given a query point $\mathbf{y}_q$ in a query image $I_q$, the tracking module uses the dense tracking features $(T_i)_{i=1}^{N}$ to output a track $(\mathbf{y}_i)_{i=1}^{N}$, a sequence of 2D points in all images corresponding to the same 3D point as $\mathbf{y}_q$. The two networks, $f$ (the VGGT transformer) and $\mathcal{T}$ (the tracking module), are trained jointly end-to-end.
Order of Predictions: The input image sequence order is arbitrary, except that the first image is always chosen as the reference frame for defining the world coordinate system. The network architecture is designed to be permutation equivariant for all frames except the first. This means if you reorder frames 2 through N, the outputs for those frames will also be reordered identically, maintaining their content.
Over-complete Predictions: The paper notes that the predicted quantities are not all independent. For instance, camera parameters can be inferred from the invariant point map (e.g., by solving the PnP problem). Similarly, depth maps can be deduced from point maps and camera parameters. However, VGGT is explicitly tasked with predicting all these quantities, and the authors demonstrate that this multi-task learning approach leads to substantial performance gains, even with redundancies. Interestingly, during inference, combining independently estimated depth maps and camera parameters to derive 3D points is often more accurate than directly using the point map head output.
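For intuition on why depth maps plus camera parameters determine 3D points, here is a small sketch (a hypothetical helper, not code from the paper's repository) of unprojecting a depth map into a point map, assuming a pinhole model with the principal point at the image center:

```python
import numpy as np

def depth_to_point_map(depth, K, R, t):
    """Unproject a depth map into a point map in the world (first-camera) frame.

    depth: (H, W) depths along the camera z-axis.
    K:     (3, 3) camera intrinsics.
    R, t:  world-to-camera rotation (3, 3) and translation (3,), i.e. x_cam = R @ x_world + t.
    Returns a (H, W, 3) point map in world coordinates.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))                       # pixel grid
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T          # camera-frame rays with z = 1
    cam_pts = rays * depth[..., None]        # scale rays by depth -> camera-frame 3D points
    world_pts = (cam_pts - t) @ R            # invert x_cam = R x_world + t  ->  x_world = R^T (x_cam - t)
    return world_pts

# Toy usage with an identity camera at the origin.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
pts = depth_to_point_map(np.full((64, 64), 2.0), K, np.eye(3), np.zeros(3))
print(pts.shape)  # (64, 64, 3)
```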
4.2.2. Feature Backbone
The VGGT model is implemented as a large Transformer [119] with minimal 3D inductive biases.
- Image Tokenization: Each input image $I_i$ is first patchified into a set of $K$ tokens $\mathbf{t}_i^I \in \mathbb{R}^{K \times C}$ (where $K$ is the number of patches and $C$ is the token dimension). This is done with DINO [78], a self-supervised Vision Transformer; DINOv2 is the default patchification method, chosen for its effectiveness and training stability.
- Transformer Processing: The combined set of image tokens from all input frames is then processed by the main network, which alternates between two types of self-attention layers.
- Alternating-Attention (AA): A slight adjustment of the standard Transformer design that makes the network attend within each frame and globally across all frames in an alternating fashion:
  - Frame-wise Self-Attention: Each token attends only to the tokens within its own frame, integrating information specific to each image.
  - Global Self-Attention: Each token attends to all tokens across all frames, integrating information across images and establishing global consistency.
  This alternating strategy balances local (within-frame) and global (cross-frame) information processing. By default, the model stacks $L = 24$ blocks of alternating global and frame-wise attention. The architecture uses only self-attention layers, no cross-attention.
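To make the alternating scheme concrete, here is a minimal PyTorch-style sketch (an illustration of the idea, not the authors' implementation; tensor shapes, module names, and the omission of MLP sub-layers are simplifying assumptions):

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One block: frame-wise self-attention followed by global self-attention."""

    def __init__(self, dim: int = 1024, heads: int = 16):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N_frames, K_tokens, C)
        B, N, K, C = tokens.shape

        # Frame-wise attention: each frame's tokens attend only within that frame.
        x = tokens.reshape(B * N, K, C)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: all tokens of all frames attend to each other.
        x = x.reshape(B, N * K, C)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, N, K, C)

# Toy usage: 2 scenes, 4 frames each, 196 patch tokens of dimension 1024.
blk = AlternatingAttentionBlock()
out = blk(torch.randn(2, 4, 196, 1024))
print(out.shape)  # torch.Size([2, 4, 196, 1024])
```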
4.2.3. Prediction Heads
After the feature backbone processes the tokens, specialized prediction heads output the desired 3D quantities.
- Augmented Tokens: For each input image $I_i$, its image tokens $\mathbf{t}_i^I$ are augmented with two additional types of learnable tokens:
  - A camera token: Specifically designed to encode information relevant to the camera parameters of frame $i$.
  - Four register tokens: Inspired by [19], these act as "memory slots" that stabilize training and improve performance; their outputs are discarded after processing.
- Special Tokens for the First Frame: The camera token and register tokens of the first frame are initialized with a different set of learnable tokens than those used for all other frames ($i > 1$). This lets the model explicitly distinguish the first frame, which defines the world coordinate system, from the rest.
- Output Tokens: The concatenated tokens are processed by the Alternating-Attention transformer, producing refined output camera and image tokens; the output register tokens are discarded.
- Coordinate Frame: As established in the problem definition, all camera, point, and depth predictions are made relative to the coordinate frame of the first camera $\mathbf{g}_1$. The extrinsics predicted for the first camera are therefore implicitly the identity (identity rotation quaternion and zero translation vector).
4.2.3.1. Camera Prediction Head
The refined camera tokens are used to predict the camera parameters $\hat{\mathbf{g}}_i$. This is done by passing them through four additional self-attention layers followed by a linear layer; this sequence forms the camera head, which regresses the camera intrinsics and extrinsics.
4.2.3.2. Dense Prediction Head
The output image tokens are responsible for all dense predictions: the depth maps $D_i$, the point maps $P_i$, and the tracking features $T_i$.
- Feature Map Conversion: First, the output image tokens are converted into dense feature maps using a DPT layer [87]. DPT (Dense Prediction Transformer) is an architecture well suited to transforming Transformer tokens back into dense, image-like predictions.
- Dense Output Generation: Each dense feature map is then passed through a 3x3 convolutional layer to produce:
  - The depth map $D_i$.
  - The point map $P_i$.
  - The dense tracking features $T_i$, which serve as input to the tracking module.
- Uncertainty Maps: The DPT head also outputs uncertainty maps $\Sigma_i^D$ and $\Sigma_i^P$ for the depth and point maps, respectively. These maps contain positive values indicating the model's confidence in its predictions and are incorporated into the loss functions during training.
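As a rough illustration of this head structure (channel counts and layer names are assumptions, not the paper's code), the dense heads can be thought of as small convolutional projections on top of the DPT feature maps:

```python
import torch
import torch.nn as nn

class DenseHeads(nn.Module):
    """Toy dense heads: 3x3 convs over DPT feature maps -> depth, point map, tracking features."""

    def __init__(self, feat_dim: int = 256, track_dim: int = 128):
        super().__init__()
        self.depth_head = nn.Conv2d(feat_dim, 2, kernel_size=3, padding=1)  # depth + uncertainty
        self.point_head = nn.Conv2d(feat_dim, 4, kernel_size=3, padding=1)  # xyz + uncertainty
        self.track_head = nn.Conv2d(feat_dim, track_dim, kernel_size=3, padding=1)

    def forward(self, feats: torch.Tensor):
        # feats: (B, feat_dim, H, W) dense feature map from the DPT layer.
        d = self.depth_head(feats)
        p = self.point_head(feats)
        depth, depth_conf = d[:, :1], torch.exp(d[:, 1:])   # exp keeps the uncertainty positive
        points, point_conf = p[:, :3], torch.exp(p[:, 3:])
        return depth, depth_conf, points, point_conf, self.track_head(feats)

heads = DenseHeads()
outs = heads(torch.randn(2, 256, 37, 37))
print([o.shape for o in outs])
```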
4.2.3.3. Tracking Module
To implement the tracking module $\mathcal{T}$, the paper utilizes the CoTracker2 architecture [57].
- Input: The dense tracking features $T_i$ output by the DPT head are fed into CoTracker2.
- Query and Correspondence: Given a query point $\mathbf{y}_q$ in a query image $I_q$ (during training $q = 1$, but any image can serve as the query), the tracking head predicts the set of 2D points in all images that correspond to the same 3D point as $\mathbf{y}_q$.
- Mechanism:
  - The feature map of the query image is bilinearly sampled at $\mathbf{y}_q$ to extract its feature vector.
  - This feature is then correlated with all other feature maps $T_i$ to generate a set of correlation maps, which highlight potential correspondence locations.
  - These correlation maps are further processed by self-attention layers within CoTracker2 to refine and predict the final 2D points $\hat{\mathbf{y}}_i$, all in correspondence with $\mathbf{y}_q$.
- Flexibility: As in VGGSfM [125], this tracker does not assume any temporal ordering of the input frames, making it applicable to any set of input images, not just videos.
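A schematic sketch of the query-feature correlation step (simplified; the real CoTracker2 head adds iterative refinement and attention over the correlation maps, and the shapes here are assumptions):

```python
import torch
import torch.nn.functional as F

def correlation_maps(track_feats, query_xy, query_idx=0):
    """Correlate a query point's feature with the dense features of every frame.

    track_feats: (N, C, H, W) dense tracking features, one per frame.
    query_xy:    (2,) query point (x, y) in pixel coordinates of the query frame.
    Returns:     (N, H, W) correlation maps; peaks indicate likely correspondences.
    """
    N, C, H, W = track_feats.shape
    # Normalize pixel coords to [-1, 1] for grid_sample, then bilinearly sample
    # the query frame's feature map at the query location.
    gx = 2.0 * float(query_xy[0]) / (W - 1) - 1.0
    gy = 2.0 * float(query_xy[1]) / (H - 1) - 1.0
    grid = torch.tensor([[[[gx, gy]]]], dtype=track_feats.dtype)           # (1, 1, 1, 2)
    q = F.grid_sample(track_feats[query_idx:query_idx + 1], grid,
                      align_corners=True).view(1, C, 1, 1)                 # query feature
    # Cosine-style correlation of the query feature with every frame's features.
    corr = (F.normalize(track_feats, dim=1) * F.normalize(q, dim=1)).sum(dim=1)
    return corr  # (N, H, W)

corr = correlation_maps(torch.randn(4, 128, 37, 37), torch.tensor([18.0, 20.0]))
print(corr.shape)  # torch.Size([4, 37, 37])
```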
4.2.4. Training
The VGGT model $f$ (and, with it, the tracking module $\mathcal{T}$) is trained end-to-end using a multi-task loss.
4.2.4.1. Training Losses
The overall multi-task loss is a sum of individual loss terms:
$$\mathcal{L} = \mathcal{L}_{\mathrm{camera}} + \mathcal{L}_{\mathrm{depth}} + \mathcal{L}_{\mathrm{pmap}} + \lambda \, \mathcal{L}_{\mathrm{track}}$$
- The camera ($\mathcal{L}_{\mathrm{camera}}$), depth ($\mathcal{L}_{\mathrm{depth}}$), and point-map ($\mathcal{L}_{\mathrm{pmap}}$) losses have similar ranges and are weighted equally (implicit factor of 1).
- The tracking loss ($\mathcal{L}_{\mathrm{track}}$) is down-weighted by the factor $\lambda = 0.05$.

Each loss term is defined as follows:
- Camera Loss ($\mathcal{L}_{\mathrm{camera}}$): Supervises the predicted camera parameters $\hat{\mathbf{g}}_i$ against the ground truth $\mathbf{g}_i$:
  $$\mathcal{L}_{\mathrm{camera}} = \sum_{i=1}^{N} \left\| \hat{\mathbf{g}}_i - \mathbf{g}_i \right\|_\epsilon$$
  - $\hat{\mathbf{g}}_i$: The predicted camera parameters for image $I_i$.
  - $\mathbf{g}_i$: The ground-truth camera parameters for image $I_i$.
  - $\|\cdot\|_\epsilon$: The Huber loss (smooth L1 loss), quadratic for small errors and linear for large ones, making it less sensitive to outliers than a squared error and smoother than an absolute error. It is defined as
    $$|x|_\epsilon = \begin{cases} \frac{1}{2}x^2 & \text{if } |x| \le \epsilon \\ \epsilon\left(|x| - \frac{1}{2}\epsilon\right) & \text{if } |x| > \epsilon \end{cases}$$
    where $\epsilon$ is a hyperparameter.
- Depth Loss ($\mathcal{L}_{\mathrm{depth}}$): Follows DUSt3R [129] and implements an aleatoric-uncertainty loss [59, 75], which weighs the discrepancy between the predicted depth $\hat{D}_i$ and the ground-truth depth $D_i$ with the predicted uncertainty map $\Sigma_i^D$, adds a gradient term, and includes an uncertainty regularization term:
  $$\mathcal{L}_{\mathrm{depth}} = \sum_{i=1}^{N} \left\| \Sigma_i^D \odot \big(\hat{D}_i - D_i\big) \right\| + \left\| \Sigma_i^D \odot \big(\nabla \hat{D}_i - \nabla D_i\big) \right\| - \alpha \log \Sigma_i^D$$
  - $\hat{D}_i$: The predicted depth map for image $I_i$; $D_i$: the ground-truth depth map.
  - $\Sigma_i^D$: The predicted depth uncertainty map; in this aleatoric formulation it acts as a per-pixel confidence weight on the error.
  - $\odot$: The channel-broadcast element-wise (Hadamard) product, applying the uncertainty map to the depth difference.
  - $\|\cdot\|$: The L1 norm (sum of absolute values) of the error term.
  - $\nabla$: The gradient operator, computing spatial derivatives of the depth maps; this term penalizes differences in depth gradients, promoting consistent local structure.
  - $-\alpha \log \Sigma_i^D$: A regularization term weighted by the hyperparameter $\alpha$; it balances the confidence weighting so the model is rewarded for being confident where its predictions are accurate, while the first two terms penalize confidently wrong predictions.
- Point Map Loss ($\mathcal{L}_{\mathrm{pmap}}$): Defined analogously to the depth loss, but using the point-map uncertainty $\Sigma_i^P$ and applied to 3D point coordinates:
  $$\mathcal{L}_{\mathrm{pmap}} = \sum_{i=1}^{N} \left\| \Sigma_i^P \odot \big(\hat{P}_i - P_i\big) \right\| + \left\| \Sigma_i^P \odot \big(\nabla \hat{P}_i - \nabla P_i\big) \right\| - \alpha \log \Sigma_i^P$$
  - $\hat{P}_i$: The predicted point map for image $I_i$; $P_i$: the ground-truth point map.
  - $\Sigma_i^P$: The predicted point-map uncertainty map.
  - The remaining symbols are as defined for $\mathcal{L}_{\mathrm{depth}}$.
- Tracking Loss ($\mathcal{L}_{\mathrm{track}}$):
  $$\mathcal{L}_{\mathrm{track}} = \sum_{j=1}^{M} \sum_{i=1}^{N} \left\| \mathbf{y}_{j,i} - \hat{\mathbf{y}}_{j,i} \right\|_2$$
  - $M$: The number of ground-truth query points in the query image.
  - $\mathbf{y}_{j,i}$: The ground-truth 2D correspondence of query point $j$ in image $I_i$.
  - $\hat{\mathbf{y}}_{j,i}$: The predicted 2D correspondence of query point $j$ in image $I_i$, obtained from the tracking module.
  - $\|\cdot\|_2$: The Euclidean (L2) distance between the predicted and ground-truth 2D points.
  - Additionally, following CoTracker2 [57], a visibility loss (binary cross-entropy) is applied to estimate whether a point is visible in a given frame.
4.2.4.2. Ground Truth Coordinate Normalization
To address the inherent scale and global reference frame ambiguity in 3D reconstruction (where scaling or shifting the scene doesn't change the images), the training data is normalized:
- All quantities are first expressed in the coordinate frame of the first camera $\mathbf{g}_1$.
- The average Euclidean distance of all 3D points in the point map to the origin is computed and used as a scale factor.
- The camera translations, the point map, and the depth map are then scaled by this factor. Crucially, unlike DUSt3R [129], this normalization is applied only to the ground-truth data, not to the network's predictions; the network must learn to output predictions in this normalized coordinate system.
4.2.4.3. Implementation Details
- Architecture: Alternating-Attention blocks (one frame-wise and one global self-attention layer per block). Each attention layer has a feature dimension of 1024 and 16 heads, similar to ViT-L in DINOv2 [78]. QKNorm [48] and LayerScale [115] (initialized at 0.01) are used for training stability. DINOv2 is used for image tokenization with positional embedding. Tokens from specific blocks (the 4th, 11th, 17th, and 23rd) are fed into DPT [87] for upsampling.
- Parameters: Approximately 1.2 billion parameters in total.
- Optimization: AdamW optimizer for 160K iterations.
- Learning Rate: Cosine learning-rate schedule with a peak of 0.0002 and an 8K-iteration warmup.
- Batching: 2-24 frames are randomly sampled from a random training scene per batch (up to 224 in previous versions, 48 total in Appendix B). Scenes with fewer than 24 frames are excluded.
- Image Preprocessing: Input frames, depth maps, and point maps are resized to a maximum dimension of 518 pixels; the aspect ratio is randomized between 0.33 and 1.0.
- Augmentations: Random color jittering, Gaussian blur, and grayscale augmentation are applied independently to frames.
- Ground Truth Tracks: Built by unprojecting depth maps to 3D, reprojecting the points into target frames, and retaining correspondences whose reprojected depths match the target depth maps. Frames with low similarity or no valid correspondences are excluded.
- Training Infrastructure: 64 A100 GPUs over nine days.
- Stability: Gradient norm clipping with a threshold of 1.0, bfloat16 precision, and gradient checkpointing are used for memory and computational efficiency.
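The training recipe above maps to a fairly standard PyTorch setup; here is a hedged sketch (the model stand-in, loop structure, and any hyperparameters beyond those listed above are assumptions):

```python
import math
import torch

model = torch.nn.Linear(10, 10)  # stand-in for the VGGT network
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

total_iters, warmup_iters = 160_000, 8_000

def lr_lambda(step: int) -> float:
    # Linear warmup for 8K iterations, then cosine decay over the remaining 152K.
    if step < warmup_iters:
        return step / max(1, warmup_iters)
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for step in range(3):  # training-loop skeleton
    optimizer.zero_grad()
    loss = model(torch.randn(4, 10)).pow(2).mean()  # placeholder for the multi-task loss
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # gradient norm clipping
    optimizer.step()
    scheduler.step()
```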
4.2.4.4. Training Data
The model is trained on a large and diverse collection of datasets:
- Co3Dv2 [88]
- BlendedMVS [146]
- DL3DV [69]
- MegaDepth [64]
- Kubric [41]
- WildRGB-D [135]
- ScanNet [18]
- HyperSim [89]
- Mapillary [71]
- Habitat [107]
- Replica [104]
- MVS-Synth [50]
- PointOdyssey [159]
- Virtual KITTI [7]
- Aria Synthetic Environments [82]
- Aria Digital Twin [82]
- A synthetic dataset of artist-created assets similar to Objaverse [20].

These datasets cover indoor and outdoor environments and both synthetic and real-world scenarios, with 3D annotations derived from sensor capture, synthetic engines, or SfM techniques. The combined size and diversity are comparable to MASt3R [30].
The paper also presents some qualitative examples of VGGT's performance, as shown in the following figure (Image 1 from the original paper) which illustrates the inference of 3D attributes from multiple viewpoints.
Figure 1. The image is a schematic of the VGGT method, illustrating the inference of 3D attributes of a scene from multiple viewpoints, including camera parameters, point clouds, and depth maps. The left side shows input views, while the right side displays the reconstructed results, highlighting accurate depth information and point cloud structure.
Additionally, Figure 3 and Figure 4 (from the original paper) provide further qualitative examples, showcasing the method's ability to reconstruct scenes from varying numbers of views and generalize to complex in-the-wild scenes.
Figure 3. Illustration showing the performance and efficiency of the VGGT method for single view, two views, and multi-view (32 views) input. The leftmost column displays the input images, while the right column shows the results generated by VGGT, with processing times listed at the bottom, highlighting VGGT's speed and precision advantages in reconstruction tasks.
Figure 4. Schematic representation of multiple 3D scenes, showcasing the reconstruction of scenes including the Colosseum using VGGT technology. The various views in the image demonstrate different depth information and point clouds, reflecting the effectiveness of this method in 3D reconstruction.
5. Experimental Setup
5.1. Datasets
The VGGT model is evaluated across multiple 3D tasks using various datasets, demonstrating its versatility and generalization capabilities.

- Camera Pose Estimation:
  - CO3Dv2 [88]: A large-scale dataset for 3D category reconstruction of common objects, containing object-centric image collections with ground-truth camera poses and 3D shapes.
  - RealEstate10K [161]: A large-scale dataset of real-estate videos, often used for novel view synthesis and camera pose estimation, featuring diverse indoor and outdoor scenes.
  - Image Matching Challenge (IMC) [54]: A benchmark focused on phototourism data, often challenging for SfM due to wide baselines and diverse scene structures.
- Multi-view Depth Estimation:
  - DTU [51]: A widely used multi-view stereo benchmark of industrial objects scanned from multiple viewpoints, providing high-quality ground-truth depth maps and camera parameters.
- Point Map Estimation:
  - ETH3D [97]: A multi-view stereo benchmark offering high-resolution images and multi-camera videos for indoor and outdoor scenes, with precise ground-truth 3D models.
- Image Matching:
  - ScanNet-1500 [18, 92]: A dataset of indoor scenes with 3D reconstructions and camera poses, commonly used for feature matching and camera pose estimation; "-1500" refers to the standard evaluation split of 1,500 image pairs.
- Finetuning for Downstream Tasks:
  - Feed-forward Novel View Synthesis:
    - GSO (Google Scanned Objects) [28]: A high-quality dataset of 3D-scanned household items, suitable for novel view synthesis.
    - An internal dataset similar to Objaverse [20]: a large collection of annotated 3D objects, providing diverse 3D content for training.
  - Dynamic Point Tracking:
    - TAP-Vid benchmarks [23]: A set of benchmarks for tracking any point in video, including Kinetics, RGB-S (RGB-Stacking, synthetic robotic manipulation videos), and DAVIS (a high-quality video segmentation dataset). These feature rapid dynamic motion and varied data sources.
    - Kubric [41]: A scalable synthetic-data generator, used for fine-tuning the tracking module.

The datasets are chosen to cover a wide range of 3D computer vision tasks, scene types (indoor, outdoor, objects, real estate), and data characteristics (real, synthetic, static, dynamic), ensuring a comprehensive evaluation of VGGT's performance and generalization.
5.2. Evaluation Metrics
For each task, standard and well-defined metrics are used:

- Camera Pose Estimation:
  - AUC@30 (Area Under the Curve at 30 degrees): As used in [124], this metric combines Relative Rotation Accuracy (RRA) and Relative Translation Accuracy (RTA).
    - Conceptual Definition: RRA measures the angular error in rotation between the predicted and ground-truth camera poses of an image pair; RTA measures the angular error in translation. For a set of thresholds, accuracy scores are computed based on whether these errors fall below the threshold. AUC is then the area under the accuracy-threshold curve of the minimum of RRA and RTA. It provides a comprehensive measure of relative pose accuracy; higher values are better.
    - Mathematical Formula (general AUC definition): The paper uses AUC@30 as a standard metric without restating a formula; in general, the AUC combining RRA and RTA is
      $$\mathrm{AUC}@\tau_{\max} = \frac{1}{\tau_{\max}} \int_{0}^{\tau_{\max}} \min\big(\mathrm{RRA}@\tau,\ \mathrm{RTA}@\tau\big)\, d\tau$$
      or, for discrete thresholds,
      $$\mathrm{AUC}@\tau_{\max} \approx \frac{\Delta\tau}{\tau_{\max}} \sum_{\tau} \min\big(\mathrm{RRA}@\tau,\ \mathrm{RTA}@\tau\big).$$
    - Symbol Explanation:
      - $\mathrm{RRA}@\tau$: The proportion of image pairs whose Relative Rotation Error is less than or equal to $\tau$.
      - $\mathrm{RTA}@\tau$: The proportion of image pairs whose Relative Translation Error is less than or equal to $\tau$.
      - $\tau$: An angular error threshold (in degrees).
      - $\tau_{\max}$: The maximum threshold for which the AUC is computed (e.g., 30 degrees for AUC@30).
      - $\min(\cdot,\cdot)$: Takes the minimum of the two accuracies at each threshold, so that both rotation and translation must be accurate.
      - $\Delta\tau$: The step size between discrete thresholds.
  - Runtime: Measured in seconds, indicating the computational efficiency of the method.
- Multi-view Depth / Point Map Estimation:
  - Accuracy ($\downarrow$): Measures how close the predicted 3D points/depths are to the ground truth; typically the smallest Euclidean distance from a predicted point to any ground-truth point. Lower values are better.
  - Completeness ($\downarrow$): Measures how well the predictions cover the ground truth; typically the smallest Euclidean distance from a ground-truth point to any predicted point. Lower values are better.
  - Overall ($\downarrow$): The average of Accuracy and Completeness, corresponding to the Chamfer distance; a balanced measure of how well two point clouds match. Lower values are better.
    - Mathematical Formula (Chamfer distance): For two point sets $A$ (predicted) and $B$ (ground truth):
      $$d_{\mathrm{CD}}(A, B) = \frac{1}{|A|} \sum_{a \in A} \min_{b \in B} \|a - b\|_2 + \frac{1}{|B|} \sum_{b \in B} \min_{a \in A} \|b - a\|_2$$
    - Symbol Explanation:
      - $A, B$: The two point clouds being compared (predicted and ground truth).
      - $|A|, |B|$: The number of points in $A$ and $B$, respectively.
      - $a \in A$, $b \in B$: Points in the respective clouds.
      - $\min_{b \in B} \|a - b\|_2$: The minimum Euclidean distance from a predicted point to any ground-truth point; averaged over $A$, this term measures accuracy (how well $A$ matches $B$).
      - $\min_{a \in A} \|b - a\|_2$: The minimum Euclidean distance from a ground-truth point to any predicted point; averaged over $B$, this term measures completeness (how well $A$ covers $B$).
      - The sum of the two averaged terms gives the Chamfer distance.
- Image Matching:
  - AUC@5, @10, @20 ($\uparrow$):
    - Conceptual Definition: For image matching, AUC refers to the area under the relative pose accuracy curve. The matches are used to estimate an essential matrix, which is decomposed into a relative camera pose; accuracy is measured by how close this pose is to the ground truth under a given threshold. AUC@5 is the area under the curve for relative pose errors up to 5 degrees, AUC@10 up to 10 degrees, and so on. Higher values indicate better matching performance.
    - Mathematical Formula: Analogous to the camera pose estimation AUC above, but applied to the pose derived from two-view matches; the exact definition may vary slightly across benchmarks.
- Dynamic Point Tracking:
  - Occlusion Accuracy (OA, $\uparrow$): The accuracy of predicting whether a point is visible or occluded in a given frame, reflecting the model's ability to handle occlusions and re-appearances. Higher values are better.
  - $\delta_{\mathrm{avg}}^{\mathrm{vis}}$ ($\uparrow$): The mean proportion of visible points tracked within a set of pixel thresholds (1, 2, 4, 8, and 16 pixels), assessing tracking precision for points that are actually visible. Higher values are better.
  - Average Jaccard (AJ, $\uparrow$): A combined metric measuring both tracking accuracy and occlusion prediction accuracy. It is related to the Jaccard index (intersection over union) and gives a holistic evaluation of tracking under dynamic motion and occlusion. Higher values are better.
- Feed-forward Novel View Synthesis:
  - PSNR (Peak Signal-to-Noise Ratio, $\uparrow$):
    - Conceptual Definition: A widely used metric quantifying reconstruction quality (for lossy codecs and image generation models) by comparing a synthesized image to the ground truth. Higher PSNR indicates a higher-quality (less noisy) image.
    - Mathematical Formula:
      $$\mathrm{PSNR} = 10 \cdot \log_{10}\!\left(\frac{\mathrm{MAX}_I^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \big(I(i,j) - K(i,j)\big)^2$$
    - Symbol Explanation:
      - $\mathrm{MAX}_I$: The maximum possible pixel value (e.g., 255 for 8-bit images, or 1 for normalized pixels).
      - $\mathrm{MSE}$: The mean squared error between the original image $I$ and the synthesized image $K$, both of size $m \times n$.
  - SSIM (Structural Similarity Index Measure, $\uparrow$):
    - Conceptual Definition: A perception-based metric that models image degradation as perceived changes in structural information, luminance, and contrast; it is often considered a better proxy for perceived quality than PSNR. Values range from -1 to 1, with 1 indicating perfect similarity; higher is better.
    - Mathematical Formula (combining the luminance, contrast, and structure comparisons, with their exponents set to 1):
      $$\mathrm{SSIM}(x, y) = \frac{(2\mu_x \mu_y + c_1)(2\sigma_{xy} + c_2)}{(\mu_x^2 + \mu_y^2 + c_1)(\sigma_x^2 + \sigma_y^2 + c_2)}$$
    - Symbol Explanation:
      - $x, y$: The two image patches being compared.
      - $\mu_x, \mu_y$: The means of $x$ and $y$.
      - $\sigma_x, \sigma_y$: The standard deviations of $x$ and $y$; $\sigma_{xy}$: their covariance.
      - $c_1, c_2$: Small constants preventing division by zero.
  - LPIPS (Learned Perceptual Image Patch Similarity, $\downarrow$):
    - Conceptual Definition: Uses a deep neural network (pretrained on image classification) to measure perceptual similarity between two images; it often correlates better with human judgment than PSNR or SSIM. Lower values indicate higher perceptual similarity.
    - Mathematical Formula: LPIPS is not a simple analytical formula; it is a distance computed by passing the images through a pretrained CNN (e.g., AlexNet or VGG), extracting features at several layers, and averaging weighted L2 distances between them:
      $$\mathrm{LPIPS}(x, y) = \sum_{l} \frac{1}{H_l W_l} \sum_{h, w} \big\| w_l \odot \big(\hat{\phi}_l(x)_{hw} - \hat{\phi}_l(y)_{hw}\big) \big\|_2^2$$
    - Symbol Explanation:
      - $x, y$: The two input images (patches).
      - $\hat{\phi}_l(\cdot)$: Features extracted from layer $l$ of the pretrained network.
      - $w_l$: Learned per-channel weights for layer $l$; $\odot$: element-wise product; $\|\cdot\|_2^2$: squared L2 norm.
      - $H_l, W_l$: Height and width of the feature map at layer $l$.
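To make the pose AUC concrete, here is a small sketch of how AUC@30 can be computed from per-pair angular errors (an illustration of the general definition above, not the benchmarks' reference implementation):

```python
import numpy as np

def pose_auc_at(rot_err_deg, trans_err_deg, max_threshold=30, step=1):
    """AUC of min(RRA, RTA) over thresholds 1..max_threshold degrees.

    rot_err_deg, trans_err_deg: arrays of per-image-pair angular errors in degrees.
    """
    rot_err_deg = np.asarray(rot_err_deg)
    trans_err_deg = np.asarray(trans_err_deg)
    # A pair counts as correct at threshold tau only if BOTH errors are <= tau,
    # which is equivalent to taking min(RRA, RTA) at each threshold.
    worst = np.maximum(rot_err_deg, trans_err_deg)
    thresholds = np.arange(step, max_threshold + step, step)
    accuracies = [(worst <= t).mean() for t in thresholds]
    return float(np.mean(accuracies))  # area under the accuracy-threshold curve

print(pose_auc_at([2.0, 10.0, 40.0], [1.0, 25.0, 5.0]))  # ~0.39 for this toy example
```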
5.3. Baselines
The VGGT model is rigorously compared against state-of-the-art baselines for each task:

- Camera Pose Estimation:
  - COLMAP+SPSG [92]: Traditional SfM with SuperPoint and SuperGlue for feature matching.
  - PixSfM [66]: Pixel-perfect SfM with featuremetric refinement.
  - PoseDiff [124]: A diffusion-aided bundle adjustment method.
  - DUSt3R [129]: Pairwise 3D reconstruction requiring global alignment.
  - MASt3R [62]: An evolution of DUSt3R, also pairwise.
  - VGGSfM v2 [125]: Deep learning-enhanced SfM with differentiable BA.
  - MV-DUSt3R [111]‡, CUT3R [127]‡, FLARE [156]‡, Fast3R [141]‡: Concurrent works exploring neural replacements for DUSt3R's optimization.
  - On IMC, additional PixSfM variants (SIFT+NN, LoFTR, SP+SG), DFSfM (LoFTR), and the older COLMAP (SIFT+NN) are also compared.
- Multi-view Depth Estimation:
  - Gipuma [40]: A traditional MVS method.
  - MVSNet [144]: An early learning-based MVS method.
  - CIDER [139], PatchmatchNet [121], GeoMVSNet [157]: Modern learning-based MVS methods, typically assuming known ground-truth cameras.
  - MASt3R [62]: Evaluated here with ground-truth cameras for fair comparison.
  - DUSt3R [129]: Compared without ground-truth cameras.
- Point Map Estimation:
  - DUSt3R [129] and MASt3R [62]: Pairwise methods requiring global alignment.
- Image Matching:
  - SuperGlue [92]: A graph neural network for feature matching.
  - LoFTR [105]: Detector-free local feature matching with transformers.
  - DKM [32]: Dense kernelized feature matching.
  - CasMTR [9]: Transformer-based image matching.
  - Roma [33]: Robust dense feature matching.
- Feed-forward Novel View Synthesis:
  - LGM [110], GS-LRM [154], LVSM [53]: Recent novel view synthesis models, many of which assume known camera parameters.
- Dynamic Point Tracking:
  - TAPTR [63], LocoTrack [13], BootsTAPIR [26], CoTracker [56]: State-of-the-art specialized point trackers. CoTracker is particularly relevant because its architecture is adapted to consume VGGT's features.

These baselines span traditional, hybrid, and purely deep learning approaches, covering both specialized and general 3D tasks, and thus provide a strong foundation for evaluating VGGT's performance.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. Camera Pose Estimation
The paper evaluates VGGT on CO3Dv2 and RealEstate10K, and also on IMC.
The following are the results from Table 1 of the original paper:
| Methods | Re10K (unseen) AUC@30 ↑ | CO3Dv2 AUC@30 ↑ | Time |
| --- | --- | --- | --- |
| Colmap+SPSG [92] | 45.2 | 25.3 | ~ 15s |
| PixSfM [66] | 49.4 | 30.1 | > 20s |
| PoseDiff [124] | 48.0 | 66.5 | ~ 7s |
| DUSt3R [129] | 67.7 | 76.7 | ~ 7s |
| MASt3R [62] | 76.4 | 81.8 | ~ 9s |
| VGGSfM v2 [125] | 78.9 | 83.4 | ~ 10s |
| MV-DUSt3R [111] ‡ | 71.3 | 69.5 | ~ 0.6s |
| CUT3R [127] | 75.3 | 82.8 | ~ 0.6s |
| FLARE [156] ‡ | 78.8 | 83.3 | ~ 0.5s |
| Fast3R [141] ‡ | 72.7 | 82.5 | ~ 0.2s |
| Ours (Feed-Forward) | 85.3 | 88.2 | ~ 0.2s |
| Ours (with BA) | 93.5 | 91.8 | ~ 1.8s |
Analysis:
- VGGT (Feed-Forward) consistently achieves the highest AUC@30 scores on both RealEstate10K (85.3) and CO3Dv2 (88.2) among all tested methods. This is remarkable because it outperforms alternatives such as VGGSfM v2, MASt3R, and DUSt3R, which require computationally expensive post-optimization (typically 7-10 seconds).
- VGGT achieves this while operating in a purely feed-forward manner, requiring only about 0.2 seconds per reconstruction, making it significantly faster than all other competitive baselines.
- The performance advantage is particularly pronounced on RealEstate10K, which none of the methods were trained on, suggesting superior generalization.
- The table also shows that VGGT's performance can be further enhanced with Bundle Adjustment (BA) post-processing: Ours (with BA) reaches an even higher AUC@30 of 93.5 on RealEstate10K and 91.8 on CO3Dv2. Despite involving BA, the process remains relatively fast (around 1.8 seconds) because VGGT provides accurate initializations, obviating the extensive triangulation and iterative refinement typically required by BA.

The following are the results from Table 10 of the original paper, showing evaluation on the Image Matching Challenge (IMC):

| Method | Test-time Opt. | AUC@3 | AUC@5 | AUC@10 | Runtime |
| --- | --- | --- | --- | --- | --- |
| COLMAP (SIFT+NN) [94] | ✓ | 23.58 | 32.66 | 44.79 | >10s |
| PixSfM (SIFT + NN) [66] | ✓ | 25.54 | 34.80 | 46.73 | >20s |
| PixSfM (LoFTR) [66] | ✓ | 44.06 | 56.16 | 69.61 | >20s |
| PixSfM (SP + SG) [66] | ✓ | 45.19 | 57.22 | 70.47 | >20s |
| DFSfM (LoFTR) [47] | ✓ | 46.55 | 58.74 | 72.19 | >10s |
| DUSt3R [129] | ✓ | 13.46 | 21.24 | 35.62 | ~ 7s |
| MASt3R [62] | ✓ | 30.25 | 46.79 | 57.42 | ~ 9s |
| VGGSfM [125] | ✓ | 45.23 | 58.89 | 73.92 | ~ 6s |
| VGGSfMv2 [125] | ✓ | 59.32 | 67.78 | 76.82 | ~ 10s |
| VGGT (ours) | X | 39.23 | 52.74 | 71.26 | 0.2s |
| VGGT + BA (ours) | ✓ | 66.37 | 75.16 | 84.91 | 1.8s |
Analysis on IMC:
- VGGT (ours) in its feed-forward mode (without BA) achieves an AUC@10 of 71.26, competitive with, and in several cases surpassing, SfM methods that employ test-time optimization such as the PixSfM variants, and it significantly outperforms DUSt3R and MASt3R. Its speed (0.2s) is vastly superior to all optimization-based methods.
- When combined with BA, VGGT + BA achieves an AUC@10 of 84.91, setting a new state of the art on IMC and significantly outperforming VGGSfMv2 (76.82), which previously ranked first. This shows the robustness of VGGT's initial predictions.
6.1.2. Multi-view Depth Estimation
The paper evaluates VGGT on the DTU dataset.
The following are the results from Table 2 of the original paper:
| Known GT camera | Method | Acc.↓ | Comp.↓ | Overall↓ |
| --- | --- | --- | --- | --- |
| ✓ | Gipuma [40] | 0.283 | 0.873 | 0.578 |
| ✓ | MVSNet [144] | 0.396 | 0.527 | 0.462 |
| ✓ | CIDER [139] | 0.417 | 0.437 | 0.427 |
| ✓ | PatchmatchNet [121] | 0.427 | 0.377 | 0.417 |
| ✓ | MASt3R [62] | 0.403 | 0.344 | 0.374 |
| ✓ | GeoMVSNet [157] | 0.331 | 0.259 | 0.295 |
| X | DUSt3R [129] | 2.677 | 0.805 | 1.741 |
| X | Ours | 0.389 | 0.374 | 0.382 |
Analysis:
- VGGT (Ours) substantially outperforms DUSt3R (the only other method that does not know the ground-truth cameras at test time) on DTU, reducing the Overall score (Chamfer distance) from 1.741 to 0.382, a significant improvement.
- More remarkably, VGGT achieves results (Overall: 0.382) comparable to, and in some cases better than, methods that operate with known ground-truth cameras (CIDER: 0.427, PatchmatchNet: 0.417, MVSNet: 0.462, Gipuma: 0.578). MASt3R (0.374) and GeoMVSNet (0.295) slightly outperform VGGT but crucially rely on ground-truth cameras.
- This strong performance without prior knowledge of camera parameters highlights VGGT's ability to natively reason about multi-view triangulation through its multi-image training scheme, eliminating the need for ad hoc alignment procedures like those in DUSt3R.
6.1.3. Point Map Estimation
The paper compares VGGT's point cloud accuracy against DUSt3R and MASt3R on the ETH3D dataset.
The following are the results from Table 3 of the original paper:
| Methods | Acc.↓ | Comp.↓ | Overall↓ | Time |
| --- | --- | --- | --- | --- |
| DUSt3R | 1.167 | 0.842 | 1.005 | ~ 7s |
| MASt3R | 0.968 | 0.684 | 0.826 | ~ 9s |
| Ours (Point) | 0.901 | 0.518 | 0.709 | ~ 0.2s |
| Ours (Depth + Cam) | 0.873 | 0.482 | 0.677 | ~ 0.2s |
Analysis:
- VGGT (Ours (Point) and Ours (Depth + Cam)) significantly outperforms both DUSt3R and MASt3R across all metrics (Accuracy, Completeness, Overall). For instance, Ours (Depth + Cam) achieves an Overall score of 0.677, compared to MASt3R's 0.826 and DUSt3R's 1.005.
- This performance is achieved in a simple feed-forward regime (0.2 seconds), whereas DUSt3R and MASt3R require expensive global alignment optimization (around 7-9 seconds).
- An interesting finding is that constructing point clouds from VGGT's independently estimated depth maps and camera parameters (Ours (Depth + Cam)) yields higher accuracy (Overall: 0.677) than directly using the output of the dedicated point map head (Ours (Point): 0.709). This suggests that decomposing a complex task into simpler, related subproblems can be beneficial, even with joint supervision during training (a minimal sketch of this unprojection step follows below).
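The Ours (Depth + Cam) variant builds its point cloud by unprojecting each predicted depth map with the predicted intrinsics and camera pose. Below is a minimal pinhole-camera sketch of that unprojection; the array names (`depth`, `K`, `cam_to_world`) are illustrative and do not correspond to VGGT's actual output format.

```python
import numpy as np

def depth_to_world_points(depth, K, cam_to_world):
    """Unproject a depth map into a world-frame point cloud (pinhole model).

    depth        : (H, W) depth along the camera z-axis.
    K            : (3, 3) intrinsic matrix.
    cam_to_world : (4, 4) camera-to-world extrinsic matrix.
    Returns (H*W, 3) world-space points.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3)  # homogeneous pixels
    rays = pix @ np.linalg.inv(K).T                                  # camera-frame rays
    pts_cam = rays * depth.reshape(-1, 1)                            # scale by depth
    pts_hom = np.concatenate([pts_cam, np.ones((pts_cam.shape[0], 1))], axis=1)
    return (pts_hom @ cam_to_world.T)[:, :3]

# Toy usage: a flat plane 2 m in front of an identity camera.
K = np.array([[500.0, 0, 32], [0, 500.0, 24], [0, 0, 1]])
depth = np.full((48, 64), 2.0)
print(depth_to_world_points(depth, K, np.eye(4)).shape)  # (3072, 3)
```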
6.1.4. Image Matching
VGGT's tracking head is evaluated on two-view matching using ScanNet-1500.
The following are the results from Table 4 of the original paper:
| Method | AUC@5↑ | AUC@10 ↑ | AUC@20 ↑ |
| SuperGlue [92] | 16.2 | 33.8 | 51.8 |
| LoFTR [105] | 22.1 | 40.8 | 57.6 |
| DKM [32] | 29.4 | 50.7 | 68.3 |
| CasMTR [9] | 27.1 | 47.0 | 64.4 |
| Roma [33] | 31.8 | 53.4 | 70.9 |
| Ours | 33.9 | 55.2 | 73.4 |
Analysis:
- Despite not being specialized for two-view matching (it is a general point tracker), VGGT (Ours) achieves the highest AUC scores across all thresholds (AUC@5: 33.9, AUC@10: 55.2, AUC@20: 73.4).
- It outperforms leading state-of-the-art two-view matching methods such as Roma (31.8, 53.4, and 70.9, respectively), DKM, LoFTR, and SuperGlue. This demonstrates the high quality and generalization capability of the features learned by VGGT's backbone, even for tasks it was not explicitly designed for (the two-view pose recovery behind these AUC numbers is sketched below).
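On ScanNet-1500, two-view matching is typically scored by estimating the relative camera pose from the predicted correspondences and reporting pose AUC. The sketch below shows the usual essential-matrix recovery step with OpenCV, assuming matched keypoints and intrinsics are already available; it illustrates the evaluation idea rather than the benchmark's exact protocol.

```python
import cv2
import numpy as np

def relative_pose_from_matches(pts0, pts1, K0, K1):
    """Recover relative rotation R and translation direction t from 2D matches.

    pts0, pts1 : (N, 2) float arrays of matched pixel coordinates (N >= 5).
    K0, K1     : (3, 3) camera intrinsic matrices.
    """
    def normalize(pts, K):
        # Convert pixel coordinates to normalized camera coordinates.
        pts_h = np.concatenate([pts, np.ones((len(pts), 1))], axis=1)
        return (pts_h @ np.linalg.inv(K).T)[:, :2]

    n0 = normalize(np.asarray(pts0, dtype=np.float64), K0)
    n1 = normalize(np.asarray(pts1, dtype=np.float64), K1)
    # With normalized coordinates an identity intrinsic matrix is passed, and the
    # RANSAC threshold is expressed in normalized image units.
    E, _ = cv2.findEssentialMat(n0, n1, np.eye(3), cv2.RANSAC, 0.999, 1e-3)
    _, R, t, _ = cv2.recoverPose(E, n0, n1, np.eye(3))
    return R, t.ravel()
```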
6.1.5. Finetuning for Downstream Tasks
Feed-forward Novel View Synthesis
VGGT is finetuned for feed-forward novel view synthesis on the GSO dataset.
The following are the results from Table 7 of the original paper:
| Method | Known Input Cam | Size | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| LGM [110] | ✓ | 256 | 21.44 | 0.832 | 0.122 |
| GS-LRM [154] | ✓ | 256 | 29.59 | 0.944 | 0.051 |
| LVSM [53] | ✓ | 256 | 31.71 | 0.957 | 0.027 |
| Ours-NVS* | X | 224 | 30.41 | 0.949 | 0.033 |
Analysis:
- VGGT's finetuned version (Ours-NVS*) achieves competitive results (PSNR: 30.41, SSIM: 0.949, LPIPS: 0.033) on the GSO dataset, even when compared to methods such as LVSM (PSNR: 31.71, SSIM: 0.957, LPIPS: 0.027), which require known camera parameters for the input images and use a larger training dataset.
- The * indicates that Ours-NVS uses a small training set (only 20% of the size of Objaverse). Despite this, and without requiring input camera parameters, its performance is very close to LVSM's. This highlights the strength of VGGT's learned features as a backbone for novel view synthesis.
- The authors expect even better results with a larger training dataset. The following figure (Figure 6 from the original paper) provides qualitative examples of novel view synthesis.
Figure 6. Qualitative Examples of Novel View Synthesis. The top row shows the input images, the middle row displays the ground truth images from target viewpoints, and the bottom row presents our synthesized images.
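For reference, the PSNR numbers in Table 7 follow the standard definition, sketched below for images scaled to [0, 1]; SSIM and LPIPS would come from external libraries (e.g., scikit-image and the lpips package), which are assumed dependencies here rather than part of the paper's evaluation code.

```python
import numpy as np

def psnr(pred, target, max_val=1.0):
    """Peak signal-to-noise ratio between two images with values in [0, max_val]."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(max_val ** 2 / mse)

# Toy usage: a slightly noisy copy of an image scores a finite PSNR.
rng = np.random.default_rng(0)
img = rng.random((224, 224, 3))
noisy = np.clip(img + rng.normal(scale=0.01, size=img.shape), 0, 1)
print(round(psnr(noisy, img), 1))  # roughly 40 dB for sigma = 0.01
```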
Dynamic Point Tracking
VGGT's pretrained features are used to enhance CoTracker2 for dynamic point tracking on TAP-Vid benchmarks.
The following are the results from Table 8 of the original paper:
| Method | Kinetics δvis | Kinetics AJ | RGB-S δvis | RGB-S AJ | DAVIS δvis | DAVIS AJ |
| TAPTR [63] | 64.4 | 49.0 | 76.2 | 60.8 | 76.1 | 63.0 |
| LocoTrack [13] | 66.8 | 52.9 | 83.2 | 69.7 | 75.3 | 62.9 |
| BootsTAPIR [26] | 68.4 | 54.6 | 83.0 | 70.8 | 73.6 | 61.4 |
| CoTracker [56] | 64.3 | 49.6 | 78.9 | 67.4 | 76.1 | 61.8 |
| CoTracker + Ours | 69.0 | 57.2 | 84.0 | 72.1 | 77.5 | 64.7 |
Analysis:
- Integrating pretrained VGGT as a feature backbone into CoTracker (CoTracker + Ours) significantly enhances its performance across all TAP-Vid benchmarks (Kinetics, RGB-S, DAVIS).
- For example, on Kinetics, CoTracker + Ours achieves a δvis of 69.0 and an AJ of 57.2, outperforming the standalone CoTracker (δvis 64.3, AJ 49.6) and other state-of-the-art trackers such as BootsTAPIR, LocoTrack, and TAPTR.
- This strong performance on videos with rapid dynamic motions demonstrates the robustness and generalization capability of VGGT's learned features, even for dynamic scenes, a scenario it was not explicitly designed or trained for in its primary 3D reconstruction task. The following figure (Figure 5 from the original paper) visualizes the rigid and dynamic point tracking results.
Figure 5. Visualization of Rigid and Dynamic Point Tracking. Top: keypoint tracks output by VGGT's tracking module. Bottom: tracking results from CoTracker [56], which processes sequential inputs, illustrating how the two methods behave in different scenarios.
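For context on the Table 8 metrics: δvis measures the fraction of visible points whose predicted positions fall within a set of pixel thresholds (averaged over thresholds), while AJ (Average Jaccard) additionally accounts for occlusion prediction. The sketch below implements only the δvis part under the assumption of aligned prediction and ground-truth arrays; it is a simplified rendering of the TAP-Vid metric, not the official evaluation code.

```python
import numpy as np

def delta_avg_visible(pred_xy, gt_xy, gt_visible, thresholds=(1, 2, 4, 8, 16)):
    """Average position accuracy over pixel thresholds, visible points only.

    pred_xy, gt_xy : (T, N, 2) predicted and ground-truth track positions.
    gt_visible     : (T, N) boolean visibility flags from the ground truth.
    """
    dist = np.linalg.norm(pred_xy - gt_xy, axis=-1)           # (T, N) pixel errors
    vis = gt_visible.astype(bool)
    fractions = [(dist[vis] < t).mean() for t in thresholds]  # per-threshold accuracy
    return float(np.mean(fractions))

# Toy usage with near-perfect tracks on 5 frames and 3 points.
gt = np.zeros((5, 3, 2))
pred = gt + 0.5
print(delta_avg_visible(pred, gt, np.ones((5, 3))))  # 1.0: all errors below 1 px
```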
6.2. Ablation Studies / Parameter Analysis
6.2.1. Feature Backbone
The paper ablates the effectiveness of its proposed Alternating-Attention (AA) design.
The following are the results from Table 5 of the original paper:
| ETH3D Dataset | Acc.↓ | Comp.↓ | Overall↓ |
| Cross-Attention | 1.287 | 0.835 | 1.061 |
| Global Self-Attention Only | 1.032 | 0.621 | 0.827 |
| Alternating-Attention | 0.901 | 0.518 | 0.709 |
Analysis:
- The Alternating-Attention architecture (Overall: 0.709) clearly outperforms both alternative attention designs: Global Self-Attention Only (Overall: 0.827) and Cross-Attention (Overall: 1.061).
- Cross-Attention performs the worst, despite being designed to fuse information across frames. This suggests that the cost and complexity of cross-attention do not translate to better performance in this setup, and that self-attention variants are more effective.
- The Global Self-Attention Only variant performs better than Cross-Attention but still falls short of Alternating-Attention. This validates that explicitly alternating between frame-wise and global attention provides the best balance for integrating within-frame and cross-frame information, leading to a better joint understanding of scene geometry and camera parameters (a schematic of this alternating structure is sketched after this list).
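To make the ablated designs concrete, here is a schematic PyTorch sketch of one Alternating-Attention step: self-attention applied within each frame's tokens, followed by self-attention applied globally over all frames' tokens. It only illustrates the token reshaping that the ablation compares; layer sizes, normalization, and MLP blocks are simplified and do not mirror the actual VGGT backbone.

```python
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    """One frame-wise + one global self-attention step over multi-frame tokens."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, tokens):
        # tokens: (B, F, N, C) = batch, frames, tokens per frame, channels.
        B, F, N, C = tokens.shape

        # Frame-wise attention: each frame attends only to its own tokens.
        x = tokens.reshape(B * F, N, C)
        h = self.norm1(x)
        x = x + self.frame_attn(h, h, h)[0]

        # Global attention: all frames' tokens attend to each other.
        x = x.reshape(B, F * N, C)
        h = self.norm2(x)
        x = x + self.global_attn(h, h, h)[0]
        return x.reshape(B, F, N, C)

# Toy usage: 2 frames of 16 tokens each.
block = AlternatingAttentionBlock()
print(block(torch.randn(1, 2, 16, 64)).shape)  # torch.Size([1, 2, 16, 64])
```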
6.2.2. Multi-task Learning
The paper verifies the benefit of training VGGT to simultaneously learn multiple 3D quantities.
The following are the results from Table 6 of the original paper:
| w. Lcamera | w. Ldepth | w. Ltrack | Acc.↓ | Comp.↓ | Overall↓ |
| × | ✓ | ✓ | 1.042 | 0.627 | 0.834 |
| ✓ | × | ✓ | 0.920 | 0.534 | 0.727 |
| ✓ | ✓ | × | 0.976 | 0.603 | 0.790 |
| ✓ | ✓ | ✓ | 0.901 | 0.518 | 0.709 |
Analysis:
- Training with all three loss components (Lcamera, Ldepth, Ltrack) results in the best point map estimation accuracy (Overall: 0.709).
- Removing any single loss term leads to a noticeable decrease in performance:
  - Training without Lcamera (camera loss) significantly degrades performance (Overall: 0.834). This indicates that explicit supervision for camera parameter estimation is crucial for the model's understanding of overall scene geometry.
  - Training without Ltrack (tracking loss) also leads to a substantial drop (Overall: 0.790). This suggests that learning to track points helps the model build a more robust and coherent 3D representation.
  - Training without Ldepth (depth loss) results in a smaller but still noticeable decrease (Overall: 0.727). This implies that while depth maps and point maps are closely related, separate supervision for depth still offers a marginal improvement.
- The results confirm the benefit of multi-task learning and joint supervision, even for outputs that are geometrically related. Learning these interrelated quantities together forces the model to learn a more comprehensive and accurate underlying 3D representation (a toy sketch of such a weighted loss combination follows below).
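The ablation can be read as toggling terms of a multi-task objective that is, conceptually, a weighted sum of the camera, depth, and tracking losses. The toy sketch below expresses that combination with placeholder scalar losses and illustrative unit weights; the actual loss definitions and weights follow the paper's formulation, not this snippet.

```python
import torch

def total_loss(losses, weights=None):
    """Weighted sum of per-task losses; dropping a term mimics one ablation row.

    losses  : dict mapping task name -> scalar loss tensor.
    weights : optional dict of per-task weights (illustrative defaults of 1.0).
    """
    weights = weights or {}
    return sum(weights.get(name, 1.0) * value for name, value in losses.items())

# Toy usage: the full model keeps all three terms; an ablation simply omits one.
losses = {
    "camera": torch.tensor(0.7),
    "depth": torch.tensor(1.2),
    "track": torch.tensor(0.3),
}
print(total_loss(losses))                                              # all terms
print(total_loss({k: v for k, v in losses.items() if k != "track"}))   # w/o Ltrack
```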
6.2.3. Runtime and Memory
The paper provides an evaluation of inference runtime and peak GPU memory usage for the feature backbone with varying numbers of input frames.
The following are the results from Table 9 of the original paper:
| Input Frames | 1 | 2 | 4 | 8 | 10 | 20 | 50 | 100 | 200 |
| Time (s) | 0.04 | 0.05 | 0.07 | 0.11 | 0.14 | 0.31 | 1.04 | 3.12 | 8.75 |
| Mem. (GB) | 1.88 | 2.07 | 2.45 | 3.23 | 3.63 | 5.58 | 11.41 | 21.15 | 40.63 |
Analysis:
- The runtime and memory usage scale with the number of input frames. For a small number of frames (e.g., 10), VGGT is very fast (0.14 seconds), making it suitable for real-time applications. Even for 100 frames, it completes in just over 3 seconds.
- The GPU memory usage also increases with the number of frames, reaching 40.63 GB for 200 frames. This high memory usage for large frame counts is characteristic of Transformer models with global self-attention, whose attention mechanism has quadratic complexity in the sequence length (number of tokens).
- The camera head and DPT heads are lightweight; the camera head accounts for about 5% of runtime and 2% of memory, while a DPT head adds 0.03 seconds and 0.2 GB per frame.
- The authors note that, for extremely large numbers of frames or limited GPU memory, users can run the heads frame by frame, since the DPT heads make independent predictions per frame even though inter-frame relationships are handled in the backbone (see the sketch after this list). They also suggest LLM deployment techniques such as Tensor Parallelism for acceleration.
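The frame-by-frame option mentioned in the last bullet can be made concrete: because the per-frame heads act independently on each frame's tokens, one can run the memory-heavy backbone once and then loop a head over small chunks of frames to cap peak memory. The sketch below illustrates the pattern with a generic PyTorch head; `backbone_tokens` and the linear `head` are placeholders, not VGGT's actual interfaces.

```python
import torch

@torch.no_grad()
def run_head_per_frame(head, backbone_tokens, chunk_size=4):
    """Apply a per-frame prediction head in chunks to limit peak GPU memory.

    head            : module mapping (b, N, C) frame tokens -> per-frame output.
    backbone_tokens : (F, N, C) tokens for F frames, produced once by the backbone.
    """
    outputs = []
    for start in range(0, backbone_tokens.shape[0], chunk_size):
        chunk = backbone_tokens[start:start + chunk_size]   # (b, N, C)
        outputs.append(head(chunk).cpu())                   # offload results to CPU
    return torch.cat(outputs, dim=0)

# Toy usage with a linear "head" standing in for a DPT-style decoder.
tokens = torch.randn(20, 196, 64)          # 20 frames, 196 tokens, 64 channels
head = torch.nn.Linear(64, 1)              # per-token scalar, e.g. depth logits
print(run_head_per_frame(head, tokens).shape)  # torch.Size([20, 196, 1])
```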
6.3. Single-view Reconstruction Examples
The paper also presents qualitative examples of single-view reconstruction, as shown in the following figure (Image 7 from the original paper).
Figure 7. Diagram showing objects from different angles on the left, with their outlines on the right. This diagram aims to highlight how observing objects from different perspectives alters their shapes and integrates into feature extraction in 3D vision tasks.
Analysis:
- While VGGT was not explicitly trained for single-view reconstruction (it handles variable inputs, including a single view), it still demonstrates surprisingly good results. This highlights the strong implicit 3D understanding the model has learned from its multi-view training.
7. Conclusion & Reflections
7.1. Conclusion Summary
The paper successfully introduces Visual Geometry Grounded Transformer (VGGT), a novel feed-forward neural network that marks a significant paradigm shift in 3D computer vision. VGGT can directly estimate all key 3D scene properties—including camera parameters, multi-view depth maps, dense point cloud reconstructions, and 3D point tracks—from a variable number of input views (from one to hundreds). It achieves state-of-the-art results across these diverse 3D tasks. A primary contribution is its ability to operate in a fast and efficient feed-forward manner, often outperforming traditional visual geometry-based methods that rely on computationally intensive iterative optimization and post-processing. Furthermore, VGGT's learned features prove to be a powerful backbone for enhancing downstream tasks such as non-rigid point tracking and feed-forward novel view synthesis. The simplicity, efficiency, and versatility of VGGT make it highly suitable for real-time applications and establish a new foundation for future research in fast and reliable 3D reconstruction.
7.2. Limitations & Future Work
The authors acknowledge several limitations of VGGT:
- Unsupported Camera Models: The current model does not support specialized camera types such as fisheye or panoramic images.
- Extreme Input Rotations: Reconstruction performance may degrade under conditions involving extreme rotations of the input cameras.
- Non-Rigid Motion: While the model can handle minor non-rigid motions, it fails in scenarios with substantial non-rigid deformation, where objects in the scene undergo significant shape changes.

However, the authors emphasize that an important advantage of their approach is its flexibility and ease of adaptation. They suggest that these limitations can be addressed straightforwardly by fine-tuning the model on targeted datasets with minimal architectural modifications. This adaptability is seen as a distinguishing factor from existing approaches that might require extensive re-engineering for specialized scenarios.
Future research directions implied or directly stated include:
- Targeted Fine-tuning: Adapting VGGT to specific challenging scenarios (e.g., fisheye cameras, extreme rotations, complex non-rigid motion) by fine-tuning on relevant datasets.
- Large-scale Unsupervised Training: Exploring the use of differentiable Bundle Adjustment (which showed promising preliminary results) for large-scale unsupervised training when explicit 3D annotations are scarce, as it can serve as an effective supervision signal.
- Deployment Optimization: Applying LLM deployment techniques such as Tensor Parallelism to further accelerate inference on multiple GPUs and manage memory more efficiently when processing very large numbers of input frames.
- Exploring Decomposed Predictions: Further investigating why combining independently estimated depth maps and camera parameters yields more accurate 3D points than directly using the dedicated point map head, which could inform future model designs.
7.3. Personal Insights & Critique
VGGT represents a significant leap towards a truly neural-first approach for comprehensive 3D reconstruction. The paper's core strength lies in demonstrating that a large transformer, with appropriately designed attention mechanisms (Alternating-Attention) and multi-task training, can implicitly learn complex geometric reasoning, often outperforming traditional methods that explicitly optimize geometry. This is a powerful validation of the scaling laws observed in large language models and vision transformers applied to the 3D domain.
Inspirations & Transferability:
- Foundation Model for 3D: VGGT's success as a feature backbone for downstream tasks like novel view synthesis and point tracking highlights its potential as a foundation model for the entire 3D computer vision ecosystem. Similar to CLIP or DINO for 2D, VGGT could provide general-purpose 3D features that are highly transferable. This opens up possibilities for rapid development of new 3D applications by simply fine-tuning or adding lightweight heads.
- Efficiency for Real-time Applications: The feed-forward nature and sub-second inference times for typical scenarios are game-changers for real-time 3D perception in robotics, augmented reality (AR), virtual reality (VR), and autonomous driving, where low latency is critical.
- Multi-task Learning Benefits: The ablation studies strongly reinforce the notion that learning interrelated tasks jointly improves performance on each individual task. This "rising tide lifts all boats" effect suggests that AI models capable of a more holistic understanding of a domain (like 3D geometry) are inherently more robust and accurate.

Potential Issues, Unverified Assumptions, or Areas for Improvement:
- Implicit Geometry vs. Explicit Guarantee: While VGGT implicitly learns geometry, traditional optimization-based methods provide explicit guarantees of geometric consistency (e.g., minimizing reprojection error). The trade-off is between efficiency and generalization (neural) versus interpretable, provable geometric correctness (traditional). For highly safety-critical applications, combining VGGT's speed with a lightweight, certifiable geometric refinement could be ideal.
- Scalability to Extreme Views: While VGGT handles "hundreds of views," the memory usage for 200 frames (40.63 GB) on an H100 GPU is substantial. For truly massive datasets (e.g., thousands of images in a city-scale reconstruction), further architectural innovations or efficient LLM deployment strategies would be crucial. The quadratic complexity of global self-attention remains a theoretical bottleneck, even with FlashAttention.
- Data Dependence: The success of VGGT relies heavily on a "large and diverse collection of datasets with 3D annotations." The quality and coverage of this training data are paramount. The model's generalization, while impressive, might still be limited by biases in the training datasets. Curating even larger, more diverse 3D datasets, with more focus on extreme conditions or non-rigid motion, will be key for future improvements.
- Black-Box Nature: Like many deep learning models, VGGT operates as a black box. Understanding how it implicitly learns geometric principles, compared to the explicit steps in SfM, is an open research question that could lead to more robust and interpretable 3D AI.
- Principal Point Assumption: The assumption that the principal point lies at the image center simplifies the camera model. While common, relaxing this assumption could improve applicability to a wider range of uncalibrated cameras, though it would add complexity to the learning task.

In conclusion, VGGT represents a compelling demonstration of the power of large transformers to unify and accelerate complex 3D computer vision tasks. It shifts the paradigm from geometrically-driven optimization to data-driven neural inference, offering a blueprint for future foundation models in the 3D domain.